Assign 1: Unix Filters and Regular Expressions

Due: Thursday, Feb 17 at 11:59 p.m.

Goals for Assignment 1

After the assignment, you should

Further customize your environment.
Understand pipes and how to use them effectively.
Know how to use filter commands, e.g., sort, uniq, cut, paste, grep, etc.
Know how to analyze data files using the above tools.

Objective: Set Up

Create an assign1 subdirectory within your cs397/assignments directory.

Copy all the files from /csci/courses/cs397/handouts/assign1 into your assign1 directory.

All of your code and output files should be in the assign1 directory.
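The setup steps above can be sketched as follows. This is a minimal illustration that runs inside a throwaway sandbox directory instead of your real home directory or the course handout directory; the `sandbox`, `handout_dir`, and `target_dir` names are assumptions made for the example.

```shell
# Build a stand-in for the handout directory inside a temporary sandbox.
sandbox=$(mktemp -d)
handout_dir="$sandbox/handouts/assign1"
mkdir -p "$handout_dir"
touch "$handout_dir/heroes.txt" "$handout_dir/villains.txt"

# Create the assignment directory; -p also creates missing parent directories.
target_dir="$sandbox/cs397/assignments/assign1"
mkdir -p "$target_dir"

# Copy every file from the handout directory into the assignment directory.
cp "$handout_dir"/* "$target_dir"/

# List the result to confirm the copy worked.
ls "$target_dir"
```

For the real assignment, the same two commands (mkdir and cp) would use the actual paths named above.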

Objective: Customizing Your Environment

If your terminal has been acting "weird" since you changed your prompt in the last assignment (it's hard to describe, but lines wrap strangely, and the arrow keys and tab behave strangely), comment out the most recent prompt definition in your .bashrc file and uncomment the original one so that the original prompt is used.

If you don't already have one, create a bin directory in your home directory.

When you start a new terminal, your PATH is set up such that it prepends your bin directory to the PATH (if the bin directory exists). Confirm this by running echo $PATH and seeing your bin directory as the first element in the PATH. If not, let me know so we can figure out what is going wrong.
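As a rough sketch of what this kind of PATH setup usually looks like, the snippet below prepends a directory to PATH only if the directory exists, then prints the first PATH element. The `prepend_dir` function and the temporary `demo_home` directory are hypothetical and exist only for this illustration; the startup files on the course machines may do this differently.

```shell
# prepend_dir DIR: put DIR at the front of PATH, but only if DIR exists.
prepend_dir() {
    if [ -d "$1" ]; then
        PATH="$1:$PATH"
    fi
}

# Use a temporary stand-in for a home directory that contains bin/.
demo_home=$(mktemp -d)
mkdir "$demo_home/bin"
prepend_dir "$demo_home/bin"

# PATH is a colon-separated list, so field 1 is the first element.
first_entry=$(echo "$PATH" | cut -d: -f1)
echo "$first_entry"
```

The last two lines show one way to read only the first element instead of scanning the whole PATH by eye.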

Copy your submit397.sh file from your assign0 directory into your bin directory. Make the script executable by you (if it's not already). Type submit397.sh and confirm that the script can be executed (if you implemented the script correctly, you should see an error message about a missing argument). Now, go to some directory (that is not your bin directory) and try to run your script (just using submit397.sh, not the path).
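The idea behind this step (make a script executable, then run it by name from anywhere) can be sketched with a stand-in script. Everything here is hypothetical: `hello397.sh` is not your submit script, and `demo_bin` is a temporary directory standing in for your bin directory.

```shell
# Create a tiny stand-in script in a temporary "bin" directory.
demo_bin=$(mktemp -d)
cat > "$demo_bin/hello397.sh" <<'EOF'
#!/bin/bash
# Complain and exit if the required argument is missing.
if [ $# -lt 1 ]; then
    echo "usage: hello397.sh <name>" >&2
    exit 1
fi
echo "hello, $1"
EOF

# Give the owner (u) execute permission (x).
chmod u+x "$demo_bin/hello397.sh"

# Put the directory on PATH, move somewhere else, and run by name only.
PATH="$demo_bin:$PATH"
cd /
found_at=$(command -v hello397.sh)   # where the shell finds the command
greeting=$(hello397.sh world)
echo "$found_at"
echo "$greeting"
```

Running the stand-in with no argument prints its usage message, which mirrors the "missing argument" error you should expect from your own script.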

Update ~/.bash_aliases to have a new alias called cd397 that (as the name suggests) cds to the cs397 directory.

Update ~/.bash_aliases to have a new alias called cdTurnin that (as the name suggests) cds to your turnin directory.
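For reference, an alias definition has the general shape shown below. The alias name `cdtmp` and its target are hypothetical, chosen so that the example does not depend on your directory layout.

```shell
# Hypothetical alias; in practice a line like this would live in ~/.bash_aliases.
alias cdtmp='cd /tmp'

# Asking for the alias by name prints its definition without running it.
alias_def=$(alias cdtmp)
echo "$alias_def"
```

A new alias is typically picked up when you open a new terminal or re-read the file with `source ~/.bash_aliases`, assuming your .bashrc loads that file.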

Objective: UNIX Practice (20 pts)

As before, do the following operations, then put them into a script called characters.sh.

Display the contents of ~/.bash_aliases
Display villains.txt and heroes.txt parallel to each other. You should see matchups between the heroes and villains.
Rerun the previous command but put those matchups into a file called matchups.txt
Display the unique villains from villains.txt.
Run a command to show how many unique villains there are.

Finally, run your script and save its output in the file characters.out
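The commands involved here can be tried on small stand-in files first. The sketch below uses two made-up files, `colors.txt` and `fruits.txt`, in a temporary directory; none of these names come from the handout.

```shell
# Work in a temporary directory with two small made-up files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'red\ngreen\nred\nblue\n' > colors.txt
printf 'apple\nlime\ncherry\nberry\n' > fruits.txt

# paste joins corresponding lines of its inputs, separated by a tab.
paste colors.txt fruits.txt

# Same command, with standard output redirected into a file.
paste colors.txt fruits.txt > pairs.txt

# uniq only collapses adjacent duplicates, so sort first.
sort colors.txt | uniq

# Count the distinct lines instead of listing them.
unique_count=$(sort colors.txt | uniq | wc -l)
echo "$unique_count"
```

Saving a script's output works the same way as the `>` redirection above, for example `bash characters.sh > characters.out`.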

Objective: grep Family and Regular Expressions Practice (40 pts)

For the following questions, try the commands out first in a terminal (probably in intermediate steps), and then just show me the final correct command/answer in the script file. Use the most appropriate member of the grep family (or the -E option) to solve each problem. Don't use other commands besides grep. Record the command for each question in a file called regex.sh. Before the command, label what you're answering. For example, your script would look something like

echo "How many unique villains are there?"
cmnd_here

I worded these in such a way to encourage use of options or regular expression tricks.

Unless otherwise stated, you can assume you're just looking for lowercase letters.

I want as little output as possible to answer the question. For an extreme example, if I ask you if computer is a word, I don't want you to grep for "er" and then require a person to read through the output to find if computer is in that list.

How many words in /usr/share/dict/words contain "gre" somewhere in the word?
How many words in /usr/share/dict/words contain "grep" somewhere in the word?
Is "Google" a word, according to /usr/share/dict/words? (Reminder: get as little output as possible to answer this question. I don't want a person to interpret the output.)
How many words in /usr/share/dict/words contain either the sequence "yes" or the sequence "no"?
How many words in /usr/share/dict/words contain at least 3 vowels in a row?
How many words in /usr/share/dict/words contain no upper or lowercase vowels?
How many words in /usr/share/dict/words contain at least 4 o's (need not be consecutive)?
How many words in /usr/share/dict/words begin and end with a vowel, but have no vowels in between?
How many words in /usr/share/dict/words begin and end with the same vowel (a, e, i, o, or u)?
How many words in /usr/share/dict/words begin and end with the same 3-letter sequence?
We need to run this on the file from the old CS server, rather than the new one. I copied it over. How many words in /csci/courses/cs397/handouts/assign1/words contain 3 copies of the same 3-character sequence (not necessarily consecutively)?
Display the words in /usr/share/dict/words that have 3 consecutive double-letter pairs (like "bookkeeper" has oo, kk, ee)
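The following sketch shows a few grep options and regular expression features that are relevant to questions like these. It deliberately uses a tiny made-up word list and different patterns, so it is an illustration of technique, not a set of answers. The backreference (`\1`) inside an extended regular expression is assumed to be supported, as it is in GNU and BSD grep.

```shell
# A tiny made-up word list, one word per line.
words=$(mktemp)
printf 'banana\nBanana\ncherry\nmurmur\nsky\nbookkeeper\n' > "$words"

# -c prints the number of matching lines instead of the lines themselves.
count_an=$(grep -c 'an' "$words")

# -x requires the pattern to match the whole line, so substrings don't count.
exact=$(grep -cx 'sky' "$words")

# -E enables extended regular expressions, where | means "or".
either=$(grep -cE 'rr|kk' "$words")

# -v inverts the match and -i ignores case: lines with no b or B.
no_b=$(grep -civ 'b' "$words")

# \1 refers back to whatever the first parenthesized group matched:
# here, lines that start and end with the same three characters.
repeated=$(grep -E '^(.{3}).*\1$' "$words")

echo "$count_an $exact $either $no_b $repeated"
```

Because `-c` prints a single number, it is one way to answer a "how many" question without making a person count lines of output.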

To help you verify your answers, here are the answers for some of the questions.

Run the program and save its output in a file called regex.out

Objective: Analyzing Student Information

Analyzing Names (50 pts)

For this objective, since there is the potential for so much output in each of the intermediate steps, figure out the solution for each (numbered) problem. I want the least amount of output that shows me the answer. (For example, if I ask you for a number of something, don't show me all of them and make me count them up.)

Then, after you have figured out the solutions to all of the problems, put them in a labeled script file (as above, where you echo the question before the result gets displayed) called names_analysis.sh. Run the program and save its output in names_analysis.out

You probably should review the commands cut, sort, uniq, grep to solve these problems. There are multiple ways to solve the problems. Some will require more than 2 commands piped together.
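As a refresher on how these commands combine, here is a minimal sketch of a frequency count on made-up data. The sample file and its contents are hypothetical and are not drawn from the names files.

```shell
# Made-up data: one item per line, with repeats.
sample=$(mktemp)
printf 'pear\napple\npear\nfig\napple\npear\n' > "$sample"

# sort groups identical lines together, uniq -c counts each group,
# sort -rn orders the counts numerically from largest to smallest,
# and head keeps only the top of the list.
sort "$sample" | uniq -c | sort -rn | head -n 2

# Keep just the single most frequent line.
top_line=$(sort "$sample" | uniq -c | sort -rn | head -n 1)
echo "$top_line"

# uniq -u prints only the lines that occur exactly once.
only_once=$(sort "$sample" | uniq -u)
echo "$only_once"
```

Each stage does one small job, which is why some problems need more than two commands piped together.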

firstnames.txt contains the first names and lastnames.txt contains the last names of all currently enrolled W&L undergraduates.

In a separate text file called names_analysis.txt, write a short report that makes it clear what your answers are to each of these problems. This is in response to previous students' results where it wasn't clear what their answers were. This reflection may make you reconsider your solutions, to make the results more clear.

How many names are listed in the file? (should be the same number in both the first and last names files)
How many students have last names that contain spaces?
What are the 5 most common last names at W&L? The final result/output should have the last names sorted in decreasing order of frequency.
Example (not accurate) output:
```
  10 Sprenkle
   8 Watson
   5 Levy
   5 Matthews
   4 Khan
```
How many unique first names (i.e., only one person has that name) are there in the W&L undergraduate class?
What are the 5 most popular first names at W&L?
How common is your first name at W&L, i.e., how many students at W&L have the same name as you and where does it rank in popularity?
Example output:
```
69:      5 Sara
```
From the above output, I know there are 5 Saras and it is the 69th most popular name. (Note that this is not the answer that Sarah should get. Also, it doesn't account for ties. Don't worry about ties.)
Pose a question that you'd like to answer with this data and answer it. Explain the question and your result in the analysis document.
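One way an output line of the form `rank: count name` could be produced is sketched below on made-up data. This assumes the rank is simply the line number in the sorted frequency list, which matches the note above about not handling ties; the sample names are hypothetical.

```shell
# Made-up data: one name per line, with repeats.
sample=$(mktemp)
printf 'ann\nbob\nann\ncy\nbob\nann\n' > "$sample"

# Build the frequency list, most common first, then let grep -n
# prefix the matching line with its line number in that list.
# -w matches whole words only, so "bob" would not match "bobby".
rank_line=$(sort "$sample" | uniq -c | sort -rn | grep -n -w 'bob')
echo "$rank_line"
```

The number before the colon is the position in the sorted list, and the number after it is the count from `uniq -c`.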

Analyzing Majors Data (70 pts)

Follow the general process from the previous part. Answer the following questions, and show your work by creating a script called majors_analysis.sh, labeling the output. Save the output from your script in majors_analysis.out and analyze the data in majors_analysis.txt. Your analysis is just meant to be a reflection on what you did in answering the questions. It's not long. (It will include answers to two questions at the end too.)

Your data file is majors.txt. You'll have to figure out what the contents of the file are (there are no headings).
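A common first step with an unlabeled data file is to look at a few lines and then pull out one column at a time. The sketch below does this on a made-up comma-separated file; the delimiter, the column order, and the values are all assumptions for illustration, and the real majors.txt may be laid out differently.

```shell
# Made-up file: name, year, degree, first major, second major (may be blank).
data=$(mktemp)
printf 'Ng,2025,BS,MATH,\nLee,2024,BA,ART,MUS\nRoy,2025,BA,MUS,\n' > "$data"

# Peek at the first few lines to guess the delimiter and the columns.
head -n 2 "$data"

# cut -d sets the delimiter and -f picks the field;
# sort | uniq -c then counts how often each value appears.
years=$(cut -d, -f2 "$data" | sort | uniq -c)
echo "$years"
```

If the real file uses tabs between fields, `cut` needs no `-d` option at all, because tab is its default delimiter.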

How many students are in each class year?
How many students are expected to graduate this year?
What is the most popular degree being pursued?
How many students are still undecided?
How many students are pursuing a second major?
What are the 5 most popular first majors (may include undecided)?
How many students are pursuing CSCI as their first major and where does CSCI rank in popularity for first majors?
This one will probably require you to create some files from the data. (Using Unix commands; do not do this manually. Should not include "blank" second majors.) What are the 10 most popular majors? Where does CSCI rank among all majors?
The History department offers several different concentrations in their major, as indicated by the last two letters in the major name. What are the various concentration/majors that the history department offers? (Just the last two letters.)
Pose one question that you'd like to answer about this data and answer it. Discuss in the analysis document.
Finally, you could have solved these problems (for each objective) using a Python (or other high-level programming language) program. What are the benefits/limitations/tradeoffs of using the Unix commands? Discuss in the analysis document.
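For the question above that needs intermediate files, the general technique of stacking two columns into one list and dropping blank entries can be sketched as follows. The comma-separated layout and the data are made up for the example and are assumptions, not the format of majors.txt.

```shell
# Made-up file: name, year, degree, first major, second major (may be blank).
data=$(mktemp)
all=$(mktemp)
printf 'Ng,2025,BS,MATH,\nLee,2024,BA,ART,MUS\nRoy,2025,BA,MUS,\n' > "$data"

# Write the first-major column to a new file.
cut -d, -f4 "$data" > "$all"

# Append the second-major column, skipping empty lines (^$ matches a blank line).
cut -d, -f5 "$data" | grep -v '^$' >> "$all"

# The combined file can now be counted like any single column.
top_major=$(sort "$all" | uniq -c | sort -rn | head -n 1)
echo "$top_major"
```

Note that `>` overwrites the file while `>>` appends to it, so the order of the two cut commands matters.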

Finishing up: What to turn in for this assignment

Use your script to submit your assignment to the turnin directory.

Grading (180 pts)

See the above breakdown. Graded on: executing appropriate commands, evidenced by scripts; conciseness in generated output; clarity of analysis

CSCI 397: Tools for the Software Life Cycle and Beyond