Turning it in. In general, homework assignments in this class will involve both short answers and coding. You’ll turn in your short answers in PDF form via Canvas. But your code will be ‘turned in’ by putting it in the ~/ling334/hw1/ folder, which you’ll create in problem 0, and running the command chmod -R g+rX ~/ling334/hw1/ on the SSCC when you’ve finished.
Setting up homework directories. First, create a ling334 directory in your SSCC home directory and some subdirectories:
mkdir ~/ling334
mkdir ~/ling334/hw1
mkdir ~/ling334/hw2
mkdir ~/ling334/hw3
mkdir ~/ling334/hw4
mkdir ~/ling334/project
This is where you’ll save the homework code that I will grade. Next, so that I have access, set the permissions as follows:
chmod g=x $HOME
chmod -R g=rX ~/ling334
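If you’d like to double-check the result (optional, not part of the assignment), ls -ld lists the permission bits of a directory itself; after the two chmod commands above, the group permissions on your ling334 directories should read r-x:
ls -ld ~/ling334 ~/ling334/hw1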
Editing files. Using nano (or your Linux editor of choice, like emacs or vi), create a file called aboutme.txt in your new ling334/hw1/ directory. In this file, answer the following questions about yourself:
Counting words. In class, we used egrep to find lines of a file that contain a string matching a particular regular expression. This is useful when you want to see the context a match occurs in, but it isn’t ideal for counting occurrences, because each line is counted only once, even if it contains multiple matches. So for this problem, we’ll make use of the -o flag in egrep to return only the matched string:
egrep -o <pattern> <filename>
This will return each match on a separate line. As in class, we can pipe the output of egrep to wc to count the number of matches.
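For example, a pipeline like the following would print the number of matches (wc -l counts lines, and -o puts each match on its own line). The pattern the is just an illustration here, and a naive one at that, since it would also match the inside longer words like other; <filename> stands for whatever file you’re searching:
egrep -o 'the' <filename> | wc -l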
Using this method, you’ll count the number of times each word in a list occurs in the Brown corpus. The full Brown corpus can be found in one file at $NLTK_DATA/corpora/brown/brown_all.txt. ($NLTK_DATA is a variable that was defined when you followed the class’s nltk setup instructions, and it is available only from the terminal, not from Python.) Make sure your regular expressions:
You’ll save your regular expressions and your results in a file called problem2.txt in your ling334/hw1 directory. There should be one line in this file for each word in the list below. Each line should contain (a) the word, (b) a space, (c) your regular expression enclosed in slashes, (d) a space, and (e) the count. Make sure your formatting follows this template exactly. For example, two (incorrect) lines could look as follows:
cactus /[ca]ctus/ 131
app /ap+/ 112351
Here’s the list of words:
they
the
nomenclature
Pennsylvania
himself
could
would
multiplicity
almost
decentralizing
necessarily
was
have
polyethylene
temperature
with
development
spectrometer
that
sockdologizing
Finally, some short-answer questions to include in your PDF write-up: What general correlation do you see in the counts? What words are examples that break this pattern? Test a few and note whether they do indeed break the pattern. Speculate about why this correlation might emerge in a language system.
Say what? In English, the verb say can appear with a full sentence (as in ‘say it ain’t so’) or an embedded question (as in ‘say what you did’ or ‘say who you met’). Use egrep as in the previous problem to answer the following questions about say. Again, save each regular expression and the number of hits it gets in a file called problem3.txt. (Answer the other questions in your PDF.)
Basic Python. Create a Python script called problem4.py. In this program, initialize a variable:
sentence = "isn't python so much fun?"
Now, make the script print out each word from sentence, including any punctuation following it, on a separate line. (Hint: use the string’s split() method, a for loop, and print.) Test your script by running python problem4.py to make sure it works.
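To illustrate how those pieces fit together, here is a small sketch using a different example string (not the sentence you’re assigned); the variable names are just for illustration:
text = "a short example sentence, with punctuation."
words = text.split()      # split() with no argument splits on whitespace
for word in words:
    print(word)           # prints each word, punctuation still attached, on its own line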
Regular expressions in Python. Create problem5.py in ling334/hw1/. Have the script go through each line of the Brown corpus (the single file mentioned above), use a single regular expression to check whether the line contains at least 3 words that have only their first letter capitalized (don’t include words in all caps), and write the lines that do to a file called browncaps.txt in ling334/hw1/. You must use Python’s regular expression module re to do this. Write the matching lines to the file one per line, with no blank lines in between. To check that your script works, run it from the terminal and then look at browncaps.txt to inspect its output.
Important note: The first part of the path to the Brown corpus given above with the dollar sign ($NLTK_DATA) is a shell variable that Python doesn’t have access to. Instead, you’ll need to construct the path to the Brown corpus like this:
import nltk
brown_filename = nltk.data.path[0] + "/corpora/brown/brown_all.txt"
Now, brown_filename is a string that you can use as the path to the corpus when calling open to read the file.
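Putting the pieces of problem 5 together, a rough skeleton might look like the sketch below. The regular expression shown is only a placeholder, not an answer: you still need to write a single pattern that detects at least 3 capitalized words in a line. The sketch also assumes you run the script from your ling334/hw1/ directory, so that browncaps.txt ends up in the right place.
import re
import nltk

brown_filename = nltk.data.path[0] + "/corpora/brown/brown_all.txt"
pattern = re.compile(r"PLACEHOLDER")  # replace with your single regular expression

with open(brown_filename) as infile, open("browncaps.txt", "w") as outfile:
    for line in infile:
        if pattern.search(line):      # keep the line only if the pattern matches
            outfile.write(line)       # corpus lines already end in a newline, so no blank lines are added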
Remember to run the command chmod -R g+rX ~/ling334/hw1/ on the SSCC when you’ve made your final edits, as indicated at the top of this page!