Assign 1: Unix Filters and Regular Expressions
Due: Thursday, Feb 17 at 11:59 p.m.
Goals for Assignment 1
After the assignment, you should
- Further customize your environment.
- Understand pipes and how to use them effectively.
- Know how to use filter commands, e.g., sort, uniq, cut, paste, grep, etc.
- Know how to analyze data files using the above tools.
Objective: Set Up
Create an assign1 subdirectory within your cs397/assignments directory.
Copy all the files from /csci/courses/cs397/handouts/assign1 into your assign1 directory.
All of your code and output files should be in the assign1 directory.
Objective: Customizing Your Environment
If your terminal is acting "weird" after the changes you made to
your prompt in the last assignment (it's hard to describe, but lines
start wrapping strangely, and the arrow keys and tab behave
strangely), comment out the last prompt definition and uncomment the
original one in your .bashrc file so that the original prompt is the one used.
If you don't already have one, create a bin directory in your home directory.
When you start a new terminal, your PATH is set up so that your bin
directory is prepended to it (if the bin directory exists). Confirm
this by running echo $PATH and seeing your bin directory as the first
element in the PATH. If not, let me know so we can figure out what is
going wrong.
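Because PATH is a colon-separated list, the first entry can be isolated with cut rather than read off by eye. The following is a minimal sketch; first_path_entry is a hypothetical helper name, and it assumes your bin directory appears in PATH as $HOME/bin.

```shell
# Print the first colon-separated entry of a PATH-style string.
# first_path_entry is a hypothetical helper, not part of the assignment.
first_path_entry() {
    printf '%s\n' "$1" | cut -d: -f1
}

first_path_entry "$PATH"

# Assumption: the bin directory is written as $HOME/bin in your PATH.
if [ "$(first_path_entry "$PATH")" = "$HOME/bin" ]; then
    echo "bin directory is first in PATH"
else
    echo "bin directory is NOT first in PATH"
fi
```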
Copy your submit397.sh file from your assign0 directory into your bin directory.
Make the script executable by you (if it's not already). Type
submit397.sh and confirm that the script can be executed (if
you implemented the script correctly, you should see an error message
about a missing argument). Now, go to some directory (that is not your
bin directory) and try to run your script (just using
submit397.sh, not the path).
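The two ideas in this step (owner-only execute permission, and running a script by name because its directory is on the PATH) can be sketched with a throwaway script. Everything here is hypothetical: demo.sh and the temporary directory stand in for your real script and your bin directory.

```shell
# demo.sh is a hypothetical throwaway script created only for this illustration.
workdir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from demo"\n' > "$workdir/demo.sh"

chmod u+x "$workdir/demo.sh"   # add execute permission for the owner only
ls -l "$workdir/demo.sh"       # the owner permissions now include an x

# Once the directory is on the PATH, the script runs by name from anywhere.
PATH="$workdir:$PATH"
cd /
demo_output=$(demo.sh)
echo "$demo_output"
```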
Update ~/.bash_aliases to have a new alias called cd397 that (as the name suggests) cds to the cs397 directory.
Update ~/.bash_aliases to have a new alias called cdTurnin that (as the name suggests) cds to your turnin directory.
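As a reminder of the syntax, an alias definition has no spaces around the equals sign and usually quotes the command. This sketch uses a hypothetical alias name, cdtmp, and /tmp as the target so that it does not depend on your directory layout.

```shell
# cdtmp is a hypothetical alias name used only to show the syntax.
alias cdtmp='cd /tmp'

# With no "=value", alias prints the current definition so you can check it.
alias cdtmp
```

A line added to ~/.bash_aliases only takes effect in new terminals, or in the current one after running source ~/.bash_aliases.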
Objective: UNIX Practice (20 pts)
As before, do the following operations, then put them into a script
called characters.sh
- Display the contents of ~/.bash_aliases
- Display villains.txt and heroes.txt parallel to each other. You should see matchups between the heroes and villains.
- Rerun the previous command but put those matchups into a file called matchups.txt
- Display the unique villains from villains.txt.
- Run a command to show how many unique villains there are.
Finally, run your script and save its output in the file characters.out
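The filters involved can be tried on small made-up files first. The sketch below is not the assignment's data: colors.txt, fruits.txt, and pairs.txt are hypothetical files created in a temporary directory.

```shell
workdir=$(mktemp -d)
printf 'red\nblue\nred\ngreen\n' > "$workdir/colors.txt"
printf 'apple\nberry\ncherry\nlime\n' > "$workdir/fruits.txt"

# paste joins corresponding lines of its inputs, separated by a tab.
paste "$workdir/colors.txt" "$workdir/fruits.txt"

# Redirect the same output into a file instead of the terminal.
paste "$workdir/colors.txt" "$workdir/fruits.txt" > "$workdir/pairs.txt"

# uniq only collapses adjacent duplicate lines, so sort first.
sort "$workdir/colors.txt" | uniq

# wc -l counts lines; tr strips the padding that some wc versions add.
distinct_count=$(sort "$workdir/colors.txt" | uniq | wc -l | tr -d ' ')
echo "$distinct_count"
```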
Objective: grep Family and Regular Expressions Practice (40 pts)
For the following questions, try the commands out first in
a terminal (probably in intermediate steps), and then just show me the
final correct command/answer in the script file. Use the most
appropriate member of the grep family (or the -E option) to solve each
problem. Don't use other commands besides grep. Record the command for
each question in a file called regex.sh. Before the command, label what
you're answering. For example, your script would look something like
echo "How many unique villains are there?"
cmnd_here
I worded these in such a way as to encourage use of options or regular expression tricks.
Unless otherwise stated, you can assume you're just looking for lowercase letters.
I want as little output as possible to answer the question. For an
extreme example, if I ask you if computer is a word, I don't want you
to grep for "er" and then require a person to read through the output
to find if computer is in that list.
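The difference between "too much output" and "just the answer" can be seen on a tiny word list. This is only a sketch: sample_words.txt is a hypothetical four-word file, not the system dictionary.

```shell
workdir=$(mktemp -d)
printf 'computer\ncomputers\nwater\nmouse\n' > "$workdir/sample_words.txt"

# Too much output: every line containing "er" is printed and must be read by a person.
grep 'er' "$workdir/sample_words.txt"

# -c prints only the number of matching lines.
er_count=$(grep -c 'er' "$workdir/sample_words.txt")
echo "$er_count"

# -x requires the whole line to match, so only the exact word is printed (or nothing).
exact_match=$(grep -x 'computer' "$workdir/sample_words.txt")
echo "$exact_match"
```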
- How many words in /usr/share/dict/words contain "gre" somewhere in the word?
- How many words in /usr/share/dict/words contain "grep" somewhere in the word?
- Is "Google" a word, according to /usr/share/dict/words? (Reminder: get as little output as possible to answer this question. I don't want a person to interpret the output.)
- How many words in /usr/share/dict/words contain either the sequence "yes" or the sequence "no"?
- How many words in /usr/share/dict/words contain at least 3 vowels in a row?
- How many words in /usr/share/dict/words contain no upper or lowercase vowels?
- How many words in /usr/share/dict/words contain at least 4 o's (need not be consecutive)?
- How many words in /usr/share/dict/words begin and end with a vowel, but have no vowels in between?
- How many words in /usr/share/dict/words begin and end with the same vowel (a, e, i, o, or u)?
- How many words in /usr/share/dict/words begin and end with the same 3-letter sequence?
- We need to run this on the file from the old CS server, rather than the new one. I copied it over. How many words in /csci/courses/cs397/handouts/assign1/words contain 3 copies of the same 3-character sequence (not necessarily consecutively)?
- Display the words in /usr/share/dict/words that have 3 consecutive double-letter pairs (like "bookkeeper" has oo, kk, ee)
To help you verify your answers, here are the answers for some of the questions.
Run the program and save its output in a file called regex.out
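As a refresher on the regular expression features these questions lean on (alternation, anchors with bracket expressions, and back-references), here is a minimal sketch on a hypothetical six-line file. The patterns are deliberately different from the assignment questions. Back-references with -E are supported by GNU grep and most BSD greps but are not required by POSIX, so treat that part as an assumption about your system.

```shell
workdir=$(mktemp -d)
printf 'cat\ndog\ncatalog\nletter\nbird\nhot dog\n' > "$workdir/sample.txt"

# Alternation: count lines containing "cat" or "dog".
either_count=$(grep -E -c 'cat|dog' "$workdir/sample.txt")
echo "$either_count"

# Anchors and a bracket expression: lines starting with c or d and ending with g.
anchored_count=$(grep -E -c '^[cd].*g$' "$workdir/sample.txt")
echo "$anchored_count"

# Back-reference: \1 must repeat whatever the first group matched,
# so this finds a character immediately followed by itself.
doubled=$(grep -E '(.)\1' "$workdir/sample.txt")
echo "$doubled"
```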
Objective: Analyzing Student Information
Analyzing Names (50 pts)
For this objective, since there is the potential for so much output in each of the intermediate steps, figure out the solution for each (numbered) problem. I want the least amount of output that shows me the answer. (For example, if I ask you for a number of something, don't show me all of them and make me count them up.)
Then, after you have figured out the solutions to all of the
problems, put them in a labeled script file (as above, where you echo
the question before the result gets displayed) called
names_analysis.sh. Run the program and save its output in
names_analysis.out
You probably should review the commands cut, sort, uniq, and grep to
solve these problems. There are multiple ways to solve the problems.
Some will require more than 2 commands piped together.
firstnames.txt contains the first names and lastnames.txt contains the
last names of all currently enrolled W&L undergraduates.
In a separate text file called names_analysis.txt, write a short
report that makes it clear what your answers are to each of these
problems. This is in response to previous students' results where it
wasn't clear what their answers were. This reflection may make you
reconsider your solutions, to make the results more clear.
- How many names are listed in the file? (should be the same number in both the first and last names files)
- How many students have last names that contain spaces?
- What are the 5 most common last names at W&L?
The final result/output should have the last names sorted in
decreasing order of frequency.
Example (not accurate) output:
10 Sprenkle
8 Watson
5 Levy
5 Matthews
4 Khan
- How many unique first names (i.e., only one person has that name) are there in the W&L undergraduate class?
- What are the 5 most popular first names at W&L?
- How common is your first name at W&L, i.e., how many students at W&L have the same name as you and where does it rank in popularity?
Example output:
69: 5 Sara
From the above output, I know there are 5 Saras and it is the 69th most popular name. (Note that this is not the answer that Sarah should get. Also, it doesn't account for ties. Don't worry about ties.)
- Pose a question that you'd like to answer with this data and answer it. Explain the question and your result in the analysis document.
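Output in the "count name" shape shown in the examples is typically produced by a sort, uniq -c, sort -rn pipeline. The sketch below shows that idiom on a hypothetical seven-line file (sample_names.txt), not on the real name files, and is one of several possible approaches.

```shell
workdir=$(mktemp -d)
printf 'Ana\nBo\nAna\nCy\nBo\nAna\nDee\n' > "$workdir/sample_names.txt"

# sort groups equal lines, uniq -c prefixes each distinct line with its count,
# sort -rn orders by that count (largest first), head keeps the top entries.
top_two=$(sort "$workdir/sample_names.txt" | uniq -c | sort -rn | head -n 2)
echo "$top_two"

# grep -n prefixes each match with its line number in the ranked list,
# which doubles as the rank; -w matches "Bo" as a whole word only.
bo_rank=$(sort "$workdir/sample_names.txt" | uniq -c | sort -rn | grep -n -w 'Bo')
echo "$bo_rank"
```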
Analyzing Majors Data (70 pts)
Follow the general process from the previous part. Answer the following questions, and show your work by creating a script called
majors_analysis.sh, labeling the output. Save the output from your script in majors_analysis.out and analyze
the data in majors_analysis.txt. Your analysis is just meant to be a reflection on what you did in answering the
questions. It's not long. (It will include answers to two questions at the end too.)
Your data file is majors.txt
You'll have to figure out what the contents of the file are (there are no headings).
- How many students are in each class year?
- How many students are expected to graduate this year?
- What is the most popular degree being pursued?
- How many students are still undecided?
- How many students are pursuing a second major?
- What are the 5 most popular first majors (may include undecided)?
- How many students are pursuing CSCI as their first major and where does CSCI rank in popularity for first majors?
- This one will probably require you to create some files from the data. (Using Unix commands; do not do this manually. Should not include "blank" second majors.) What are the 10 most popular majors? Where does CSCI rank among all majors?
- The History department offers several different concentrations in their major, as indicated by the last two letters in the major name. What are the various concentrations/majors that the history department offers? (Just the last two letters.)
- Pose one question that you'd like to answer about this data and answer it. Discuss in the analysis document.
- Finally, you could have solved these problems (for each objective) using a Python (or other high-level programming language) program. What are the benefits/limitations/tradeoffs of using the Unix commands? Discuss in the analysis document.
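For delimited data, cut can pull out one column at a time, and columns can be written to intermediate files and recombined. The sketch below assumes a made-up comma-separated layout (year, degree, first major, optional second major) in a hypothetical file sample_majors.csv. The real majors.txt may use a different delimiter and column order, which you still need to work out yourself.

```shell
workdir=$(mktemp -d)
# Hypothetical layout: year,degree,first major,second major (possibly empty).
printf '2025,BS,CSCI,MATH\n2026,BA,HIST,\n2025,BS,MATH,CSCI\n2027,BA,CSCI,\n' \
    > "$workdir/sample_majors.csv"

# -d sets the delimiter and -f picks the field; count rows per year.
cut -d, -f1 "$workdir/sample_majors.csv" | sort | uniq -c

# Write each major column to its own file, dropping empty second majors.
cut -d, -f3 "$workdir/sample_majors.csv" > "$workdir/first.txt"
cut -d, -f4 "$workdir/sample_majors.csv" | grep -v '^$' > "$workdir/second.txt"

# Combine both columns into one list and rank it by frequency.
top_major=$(cat "$workdir/first.txt" "$workdir/second.txt" \
    | sort | uniq -c | sort -rn | head -n 1)
echo "$top_major"
```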
Finishing up: What to turn in for this assignment
Use your script to submit your assignment to the turnin directory.
Grading (180 pts)
- See the above breakdown. Graded on: executing appropriate commands, evidenced by scripts; conciseness in generated output; clarity of analysis