DATA 419 — NLP — Fall 2025

Homework #1 — Regex's

Possible experience: +40XP

Due: Sun Sep 14th, midnight

Overview

For this homework, you will be using regular expressions to answer questions about (1) Emily Dickinson's poetry, and (2) your own corpus.

Part One: Emily Dickinson

First, download the complete works of Emily Dickinson from gutenberg.org, a repository of public domain literature. Save it as a plain-text file called exactly "emily.txt". The easiest way to do this is probably to copy and paste the text from your browser into your favorite text editor (vim, Emacs, Sublime, Notepad++, Spyder, NetBeans, Eclipse, or literally anything else that can read and write plain text) and save the file.

(Note: please do not give the file any other name. Do not use a capital E, or use an extension other than .txt, or put in Emily's full name, or be creative or clever in any way. Simply call the file emily.txt, period. Thanks.)

Take out all the junk in the file. Junk includes: the preface/preamble to each of the three "Serieses," the "Index of first lines" at the end of the file, and anything else that looks like something other than Emily Dickinson's actual poetry. (We'll say that the titles to the poems, since Emily wrote those herself and they are sort of "part of" the poem, are not junk.) Save the junk-less file.

You should now have a single emily.txt file that has all of Emily Dickinson's poems (and their titles), and only Emily Dickinson's poems (and their titles). You are now ready to begin thinking.
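
Once the file is cleaned up, your program needs to get it into memory. Here is a minimal sketch of one way to do that, assuming the file is saved as UTF-8 and sits in the same directory as your program. The function name load_lines is just an illustration, not something the assignment requires.

```python
from pathlib import Path


def load_lines(path="emily.txt"):
    """Return the non-blank lines of a text file, stripped of edge whitespace.

    Assumption: the file is UTF-8 encoded. If yours is not, change the
    encoding argument to match however your editor saved it.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    lines = load_lines()
    print(f"Read {len(lines)} non-blank lines.")
```

Working with a list of lines makes the "how many lines..." questions straightforward; for the "how many times..." questions you could just as well search the whole text as one string.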

Practice problems

Armed with your knowledge of Python and regular expressions, answer the following practice questions about Emily Dickinson's life work. (Make sure each of your answers matches the answer in bananas at the end of the question):

Graded problems

Choose any ten of the following items, and use Python regex's to discover the answers. You should do this in a single Python program literally and exactly called "emily.py".

(You can do more than ten items if you want, but this part of the assignment will be graded as +2XP per answer, and capped at 20XP total.)

  1. How many times does Emily use the word "storm" in all her poems?
  2. How many times does she use the word "live" (or lives, or lived)?
  3. How many times does she use the word "die" (or dies, or died)?
  4. How many times does she use the word "housewife" (or "housewives")?
  5. What line of poetry has the word "oh" twice?
  6. How many lines have the word "enough" twice?
  7. What day of the week does she refer to the most often, and how many times?
  8. What month of the year does Emily refer to the most often, and how many times?
  9. How many lines have "sleep", "sleeps", or "asleep" as their last word?
  10. What line contains both the word "bee" and the word "be"?
  11. How many lines are exactly three words long? (Note that a word like "I'm" is one word, not two.)
  12. How many lines have "lip" or "lips" as their last word?
  13. What line of poetry has the word "love" twice?
  14. How many times does Emily use a word that ends with "ath" but not with "eath"? (For instance, "death" would not count, but "hath" would.)
  15. In how many lines does she use the same word (at least) twice in a row?
  16. How many lines are only two words long? (Note that a word like "I'm" is one word, not two.)
  17. In all her poems, Emily used the same word three times in a row only once. What was that line of poetry?
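
Most of these items boil down to a handful of regex techniques: word boundaries, optional endings, anchoring to the end of a line, and backreferences. The sketch below tries each one on a few made-up lines (not Emily's), so it shows the shape of the approach rather than any actual answer. The sample text and variable names are invented for illustration, and treating a "word" as a run of letters and apostrophes is one reasonable assumption among several.

```python
import re

# Made-up lines, standing in for lines of the real corpus.
sample = [
    "She walks and walked and will walk again",
    "The sidewalk was wet with rain.",
    "It rains, it rains",
    "I'm quite quite sure",
]

# \b keeps "sidewalk" from counting; (?:s|ed)? allows the optional endings.
walk_forms = re.compile(r"\bwalk(?:s|ed)?\b", re.IGNORECASE)
n_walk = sum(len(walk_forms.findall(line)) for line in sample)

# \W*$ lets trailing punctuation sit between the last word and end of line.
rain_final = re.compile(r"\brains?\W*$", re.IGNORECASE)
n_rain_final = sum(1 for line in sample if rain_final.search(line))

# \1 is a backreference: the same word again, right after some whitespace.
doubled_word = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
doubled = [line for line in sample if doubled_word.search(line)]

# Counting letters-and-apostrophes runs makes "I'm" one word, not two.
word = re.compile(r"[A-Za-z']+")
n_four_words = sum(1 for line in sample if len(word.findall(line)) == 4)

print(n_walk)        # 3
print(n_rain_final)  # 2
print(doubled)       # ["I'm quite quite sure"]
print(n_four_words)  # 2
```

Notice the difference between counting occurrences (findall, which can find several matches on one line) and counting lines (search, which only asks whether a line matches at all). Read each question carefully to see which one it is asking for.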

Part Two: your corpus

For part two, first think of some interesting things you might ask about your corpus from homework #0, that can be answered by regular expressions.

Then, write some Python code to read in your corpus, carry out those regular expressions, and give you the answers. Have a little fun with this. Pose some little interesting questions to yourself — "I wonder whether my corpus has the word X or the word Y more often?" "I wonder what word most commonly precedes word Z in my corpus?" "I wonder how often word Q appears more than once on a line?" etc. Try to predict their answers, and then see how close you are.
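
To make a couple of those example questions concrete, here is a small sketch run on a toy three-sentence "corpus". The helper names count_word and words_before are hypothetical, and the toy text stands in for whatever your own corpus contains; you would read your corpus from a file instead.

```python
import re
from collections import Counter

# Toy stand-in for a real corpus.
corpus = (
    "The old dog slept. The young dog barked at the cat. "
    "A cat ignored the old dog."
)


def count_word(word, text):
    """Count whole-word, case-insensitive occurrences of word in text."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text, re.IGNORECASE))


def words_before(target, text):
    """Tally the words that appear immediately before target."""
    pattern = rf"\b([A-Za-z']+)\s+{re.escape(target)}\b"
    found = re.findall(pattern, text, re.IGNORECASE)
    return Counter(w.lower() for w in found)


# "Does my corpus have the word X or the word Y more often?"
print(count_word("dog", corpus), count_word("cat", corpus))  # 3 2

# "What word most commonly precedes word Z?"
print(words_before("dog", corpus).most_common(1))  # [('old', 2)]
```

re.escape is there so that a search word containing regex metacharacters (a period, say) is matched literally rather than being interpreted as part of the pattern.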

Finally, write me a couple of paragraphs of grammatically-correct, well-written, mistake-free English text that describe what kinds of things you thought about answering with regular expressions, and what you learned about your corpus from this exercise.

What to turn in

You should write a single Python program for the Emily Dickinson portion of the assignment (called exactly emily.py). It should open the corpus, perform the necessary operations to answer the ten questions you chose, and then print the numbered answer to each one of them in a completely clear and obvious way. For instance, here's the kind of output I'm looking for:

1. Emily uses the word "storm" 4 times.
4. Emily uses the word "housewife/housewives" 3 times.
7. Emily refers to "Wednesday" the most, a total of 99 times.
13. The line is: "Stephen I love thee, oh Stephen I surely love thee."
17. The line is: "This program sucks sucks SUCKS!!"
...

(Note: those are not the actual correct answers. This sample output is only to show you the form of what your program should print.)
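
One way to keep your output in that form is to route every answer through a small formatting helper. This is only a sketch: the function names are made up, and the values printed are the same placeholder (not correct!) values from the sample output above.

```python
def report_count(number, word, count):
    """Format a numbered 'how many times' answer."""
    return f'{number}. Emily uses the word "{word}" {count} times.'


def report_line(number, line):
    """Format a numbered 'what line' answer."""
    return f'{number}. The line is: "{line}"'


if __name__ == "__main__":
    # Placeholder values only; your program should compute the real ones.
    print(report_count(1, "storm", 4))
    print(report_count(4, "housewife/housewives", 3))
    print(report_line(17, "This program sucks sucks SUCKS!!"))
```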

For the second part of the assignment, just write an interesting couple of paragraphs about what kinds of things you looked for, what regex's you used, and what you found interesting about your corpus through this exercise.

Tips

For Part One: the reason I gave you the first set of practice questions, with answers, is so you can verify that your understanding of regular expressions is correct and that your code produces the right answers. I strongly encourage you to do the practice questions before attempting the graded problems, for which no answers are provided.

For Part Two: the main factor in my judging your part two is "interestingness." I want to see evidence that you took the mission seriously, formulated some interesting questions to pose to your corpus, did the necessary trial and error, and found something worthwhile. One aspect that could possibly improve your submission's "interestingness" is to use some regex's that are more complicated than simply looking for single words. Take a little time and do some stuff that really shows off the power of regex's.

Turning it in

To turn in this assignment, send me an email with subject line "DATA 419 Homework #1 turnin", with your emily.py Python file as an attachment (double-check that it's actually attached!). In the body of the email, include the couple of paragraphs narrating your regex journey through your corpus. (If you want to send those paragraphs in a PDF or Word doc, that's okay, but just in the body of the email is fine too.)

Extra credit point

For +1XP, include in the body of your submission email the single English adjective that most accurately describes what you think of Emily Dickinson's poetry.

Getting help

I'm happy to answer questions and help you with tricky regex's! Come to office hours, or send me an email with subject line "DATA 419 Homework #1 help!!"