DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Fri Dec 5th, midnight
In this assignment, you'll download and configure a HuggingFace (🤗) text generation model, and fine-tune it on your corpus. It will then be able to generate language in a way that should kick butt over your simple N-gram generator from homework #2. You'll also experiment with different decoder settings and observe how that can radically change what the model produces.
You will install the HuggingFace transformers package in your Python environment, and write a program to download and use the distilgpt2 model for causal language modeling. (Name trivia: the acronym BERT, as you know from class, refers to a non-causal, encoder-only architecture. The standard BERT model is very large, however, so researchers created a smaller version of it called "DistilBERT." This is what I always use when I run a BERT-style model, to preserve my machine from exploding. The name "DistilBERT" is pretty cute: it's a play on the popular geeky "Dilbert" comic strip, and the "distilled" part means "the essence of BERT is preserved, even though the volume has been reduced." The cleverness of the name, however, does not carry over to the smaller-than-GPT2 model it inspired: "DistilGPT2". Try to laugh anyway.)
Then, you will write a Python program to fine-tune this model on your corpus. After some amount of processing time, the result will be a GPT-style model that can generate text in the style of your corpus. You'll then write a second Python program to perform text generation in response to user inputs. As in homework #2, it will generate text stochastically (randomly), but this time you'll ask the user for a prompt, which the model will follow with a continuation. If it works well, this should be a lot of fun.
Make your fine-tuning program accept a command-line argument, which is the name of the corpus file. (Then you can try it on different corpora, for example your team's corpora from homework #4, to see how the style and content change.)
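A minimal sketch of that, using sys.argv (argparse works just as well; the file name finetune.py and the usage message are only examples):

import sys

if len(sys.argv) != 2:
    sys.exit("usage: python finetune.py CORPUS_FILE")
corpus_path = sys.argv[1]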
I think you'll want to import all of these:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
Load the pretrained model, and its tokenizer, from HuggingFace via:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
Read your entire corpus into a string, and tokenize it. For simplicity, I recommend dividing it evenly into non-overlapping, fixed-length blocks (you might start with a block size of 128), discarding any remainder. You can create a TensorDataset with two arguments, the inputs and the targets. The inputs are the tokens in an excerpt from your corpus, and the targets are the same, except shifted forward by one token, so that you can train to predict token t_i from tokens t_(i-block_size) through t_(i-1).
For example, if you used a block size of 5, then the first row of inputs would be the input_ids for tokens 0 through 4, and the first row of targets would be the input_ids for tokens 1 through 5. The second row of inputs would be the input_ids for tokens 5 through 9, and the second row of targets would be the input_ids for tokens 6 through 10. Etc.
(Another possibility is to use a sliding window across your tokens, rather than non-overlapping blocks. Come see me if you'd like to explore this option.)
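To make the non-overlapping-block approach concrete, here's one way it might look in code. This is just a sketch: corpus_path comes from my command-line sketch above, and names like num_blocks are my own.

block_size = 128

with open(corpus_path, encoding="utf-8") as f:
    text = f.read()

# Tokenize the whole corpus into one long list of token ids. (The tokenizer may
# warn that the sequence is longer than the model's maximum; that's fine here,
# since we cut it into block_size-length pieces ourselves.)
token_ids = tokenizer(text)["input_ids"]

# Keep only as many tokens as fill whole blocks; the "+ 1" leaves room for the
# one-token shift between inputs and targets. Any remainder is discarded.
num_blocks = (len(token_ids) - 1) // block_size
ids = torch.tensor(token_ids[: num_blocks * block_size + 1])

# Row i of inputs is tokens i*block_size through (i+1)*block_size - 1;
# row i of targets is the same window shifted forward by one token.
inputs = torch.stack([ids[i * block_size : (i + 1) * block_size] for i in range(num_blocks)])
targets = torch.stack([ids[i * block_size + 1 : (i + 1) * block_size + 1] for i in range(num_blocks)])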
Once you have your inputs and targets tensors, you can make a new TensorDataset, and give them as arguments. Then, create a new DataLoader object, giving the TensorDataset to it, and a smallish batch_size.
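Continuing the sketch (the batch size of 8 is just a starting guess):

dataset = TensorDataset(inputs, targets)
loader = DataLoader(dataset, batch_size=8, shuffle=True)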
Your program will load the pre-trained (but not fine-tuned on your corpus, of course) model from 🤗 using AutoModelForCausalLM.
Finally, you're ready to fine-tune. Create an AdamW optimizer from the torch.optim package, and give it your model's trainable parameters. Set the model to "training mode," and for each of your epochs (you decide how many to train for) run the standard training loop: zero the gradients, run a forward pass on a batch, compute the loss against that batch's targets, backpropagate, and take an optimizer step.
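Here's a sketch of what that might look like. The learning rate, epoch count, and device handling are just reasonable starting points, loader is the DataLoader from the sketch above, and since our targets tensor is already shifted, the loss is computed directly with F.cross_entropy rather than through the model's built-in labels mechanism.

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()   # "training mode"

num_epochs = 3   # you decide
for epoch in range(num_epochs):
    for batch_inputs, batch_targets in loader:
        batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)

        optimizer.zero_grad()
        logits = model(input_ids=batch_inputs).logits          # (batch, block_size, vocab)
        # Position t of logits predicts position t of our (already shifted) targets.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               batch_targets.reshape(-1))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last batch loss {loss.item():.3f}")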
You now have a fine-tuned model. You should save both it and the tokenizer by calling .save_pretrained() on them. This way you don't have to retune from scratch every time you want to generate text.
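For example (the directory name is just a placeholder; use whatever you like):

model.save_pretrained("distilgpt2-finetuned")
tokenizer.save_pretrained("distilgpt2-finetuned")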
In a second Python file, load your saved tokenizer and model, and put the model in "evaluation mode." Then, repeatedly prompt for input, tokenize what the user types, and pass those tokenized input_ids to the model.
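A sketch of that second program's setup (the directory name must match whatever you used with save_pretrained, and generate_continuation is a helper of my own naming, defined in the next sketch; in your actual file its definition would go above this loop):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2-finetuned")
model = AutoModelForCausalLM.from_pretrained("distilgpt2-finetuned")
model.eval()   # "evaluation mode"

while True:
    prompt = input("Prompt (blank to quit)> ")
    if not prompt:
        break
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    print(generate_continuation(input_ids))   # helper defined in the next sketch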
Then generate a continuation sequence from the model, one token at a time, for a fixed number of tokens. For each token you need to generate, sample from the logits the model returns, keeping only the top-k logits (for a parameter k you can tune) and applying a configurable "temperature."
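Here's one way that sampling loop might look, continuing the sketch above. The name generate_continuation and the defaults for k, temperature, and max_new_tokens are all my own choices.

def generate_continuation(input_ids, max_new_tokens=50, k=50, temperature=1.0):
    with torch.no_grad():                 # no gradients needed at generation time
        for _ in range(max_new_tokens):
            # Logits for the next token, given everything generated so far.
            logits = model(input_ids=input_ids).logits[0, -1, :]
            logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
            top_values, top_indices = torch.topk(logits, k)
            probs = F.softmax(top_values, dim=-1)         # renormalize over just the top k
            choice = torch.multinomial(probs, num_samples=1)
            next_id = top_indices[choice].view(1, 1)
            input_ids = torch.cat([input_ids, next_id], dim=1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

Playing with k and the temperature is exactly the "experiment with decoder settings" part of the assignment: temperatures below 1 make the output more conservative and repetitive, while temperatures above 1 make it wilder.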
What I'm looking for in this homework is:
For this assignment, send an email with subject line "DATA 419 Homework #5 turnin," and with your Python files and PDF write-up as attachments.
Come to office hours, or send me an email with subject line "DATA 419 Homework #5 help!!"