DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Sun Sep 14th, midnight
For this homework, you will be using regular expressions to answer questions about (1) Emily Dickinson's poetry, and (2) your own corpus.
First, download the complete works of Emily Dickinson from gutenberg.org, a repository of public domain literature. Save it as a plain-text file called exactly "emily.txt". The easiest way to do this is probably to copy-and-paste the text from your browser into your favorite text editor (vim, Emacs, Sublime, Notepad++, Spyder, NetBeans, Eclipse, or literally anything else that can read and write plain text) and save the file.
(Note: please do not give the file any other name. Do not use a capital E, or use an extension other than .txt, or put in Emily's full name, or be creative or clever in any way. Simply call the file emily.txt, period. Thanks.)
Take out all the junk in the file. Junk includes: the preface/preamble to each of the three "Serieses," the "Index of first lines" at the end of the file, and anything else that looks like something other than Emily Dickinson's actual poetry. (We'll say that the titles to the poems, since Emily wrote those herself and they are sort of "part of" the poem, are not junk.) Save the junk-less file.
You should now have a single emily.txt file that has all of Emily Dickinson's poems (and their titles), and only Emily Dickinson's poems (and their titles). You are now ready to begin thinking.
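One way to double-check your clean-up is to scan the text for leftover markers. The sketch below is only an illustration: the marker list is a guess at what junk might look like (adjust it after looking at your own download), and the sample text is made up so the example runs without the real file.

```python
import re

# Hypothetical junk markers; these are assumptions, not an official list.
JUNK_PATTERNS = [r"project gutenberg", r"index of first lines", r"\bpreface\b"]


def find_junk(text):
    """Return (line_number, line) pairs for lines matching any junk marker."""
    junk = re.compile("|".join(JUNK_PATTERNS), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if junk.search(line)
    ]


# Made-up stand-in text; for the real check you would read emily.txt instead,
# e.g. text = open("emily.txt", encoding="utf-8").read()
sample = "PREFACE\nA made-up line of verse\nEnd of the Project Gutenberg EBook\n"

for number, line in find_junk(sample):
    print(f"line {number} still looks like junk: {line!r}")
```

An empty result does not prove the file is clean; it only means none of the markers you listed are still present.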
Armed with your knowledge of Python and regular expressions, answer the following practice questions about Emily Dickinson's life work. (Make sure your answer matches the answer in bananas at the end of each question):
Choose any ten of the following items, and use Python regex's to discover the answers. You should do this in a single Python program literally and exactly called "emily.py".
(You can do more than ten items if you want, but this part of the assignment will be graded as +2XP per answer, and capped at 20XP total.)
For part two, first think of some interesting things you might ask about your corpus from homework #0, that can be answered by regular expressions.
Then, write some Python code to read in your corpus, carry out those regular expressions, and give you the answers. Have a little fun with this. Pose some little interesting questions to yourself — "I wonder whether my corpus has the word X or the word Y more often?" "I wonder what word most commonly precedes word Z in my corpus?" "I wonder how often word Q appears more than once on a line?" etc. Try to predict their answers, and then see how close you are.
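The three sample questions above could be sketched roughly as follows. This is one possible approach, not the required one: the function names and the tiny stand-in corpus are made up for illustration, and you would read in your own corpus file instead.

```python
import re
from collections import Counter

# Hypothetical stand-in corpus so the example runs on its own; in practice,
# something like: text = open("my_corpus.txt", encoding="utf-8").read()
text = (
    "The cat sat on the mat.\n"
    "A dog chased the cat, and the cat ran.\n"
    "The dog slept.\n"
)


def count_word(word, text):
    """Count whole-word, case-insensitive occurrences of `word`."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))


def words_preceding(word, text):
    """Tally the words that appear immediately before `word`."""
    # The lookahead checks for `word` without consuming it.
    pattern = rf"\b(\w+)\W+(?={re.escape(word)}\b)"
    found = re.findall(pattern, text, flags=re.IGNORECASE)
    return Counter(w.lower() for w in found)


def lines_with_repeat(word, text):
    """Return the lines on which `word` appears more than once."""
    w = re.escape(word)
    pattern = re.compile(rf"\b{w}\b.*\b{w}\b", re.IGNORECASE)
    return [line for line in text.splitlines() if pattern.search(line)]


print("cat vs. dog:", count_word("cat", text), "vs.", count_word("dog", text))
print("most common word before 'cat':", words_preceding("cat", text).most_common(1))
print("lines repeating 'cat':", lines_with_repeat("cat", text))
```

Note that `words_preceding` lets the gap between the two words span a line break, which may or may not be what you want for your corpus.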
Finally, write me a couple paragraphs of grammatically-correct, well-written, mistake-free English text that describe what kinds of things you thought about answering with regular expressions, and what you learned about your corpus from this exercise.
You should write a single Python program for the Emily Dickinson portion of the assignment (called exactly emily.py). It should open the corpus, perform the necessary operations to answer the ten questions you chose, and then print the numbered answer to each one of them in a completely clear and obvious way. For instance, here's the kind of output I'm looking for:
1. Emily uses the word "storm" 4 times.
4. Emily uses the word "housewife/housewives" 3 times.
7. Emily refers to "Wednesday" the most, a total of 99 times.
13. The line is: "Stephen I love thee, oh Stephen I surely love thee."
17. The line is: "This program sucks sucks SUCKS!!"
...
(Note: those are not the actual correct answers. This sample output is only to show you the form of what your program should print.)
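A minimal sketch of how emily.py might be organized to produce output in that form is below. Everything specific in it is a placeholder: the question numbers and wording are borrowed from the sample output above (not from the real questions), the helper names are invented, and a made-up snippet stands in for the corpus whenever emily.txt is not in the current directory.

```python
import re
from pathlib import Path

# Made-up stand-in text, used only when emily.txt is not present.
SAMPLE = (
    "The storm came, and the Storm went.\n"
    "A Housewife swept; two housewives sang.\n"
    "Storms passed.\n"
)


def count_matches(pattern, text):
    """Count case-insensitive matches of `pattern` in `text`."""
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def build_report(text):
    """Return one clearly numbered answer line per question attempted."""
    report = []

    # Placeholder question 1: whole word "storm" only ("storms" is excluded).
    n = count_matches(r"\bstorm\b", text)
    report.append(f'1. Emily uses the word "storm" {n} times.')

    # Placeholder question 4: singular or plural form.
    n = count_matches(r"\bhousewi(?:fe|ves)\b", text)
    report.append(f'4. Emily uses the word "housewife/housewives" {n} times.')

    return report


if __name__ == "__main__":
    corpus = Path("emily.txt")
    text = corpus.read_text(encoding="utf-8") if corpus.exists() else SAMPLE
    print("\n".join(build_report(text)))
```

Whether a pattern should be case-insensitive, or should include plurals, depends on how each question is worded, so treat those choices here as assumptions.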
For the second part of the assignment, just write an interesting couple of paragraphs about what kinds of things you looked for, what regex's you used, and what you found interesting about your corpus through this exercise.
For Part One: the reason I gave you the first set of practice questions, with answers, is so you could verify that your understanding of regular expressions is correct and that your code is giving you the correct answers. I strongly encourage you to do the practice questions first before attempting the graded problems, for which there are no answers to check against.
For Part Two: the main factor in my judging your part two is "interestingness." I want to see evidence that you took the mission seriously, formulated some interesting questions to pose to your corpus, did the necessary trial and error, and found something worthwhile. One aspect that could possibly improve your submission's "interestingness" is to use some regex's that are more complicated than simply looking for single words. Take a little time and do some stuff that really shows off the power of regex's.
To turn in this assignment, send me an email with subject line "DATA 470 Homework #1 turnin", with your emily.py Python file as an attachment (double-check that it's actually attached!). In the body of the email, include the couple of paragraphs narrating your regex journey through your corpus. (If you want to send those paragraphs in a PDF or Word doc, that's okay, but just in the body of the email is fine too.)
For +1XP, include in the body of your submission email the single English adjective that most accurately describes what you think of Emily Dickinson's poetry.
I'm happy to answer questions and help you with tricky regex's! Come to office hours, or send me email with subject line "DATA 470 Homework #1 help!!"