DATA 419 — NLP — Spring 2023

Homework #2 — Ngram models

Possible experience: +40XP

Due: Sun Feb 5th, midnight (extended from Fri Feb 3rd)

Overview

In this assignment, you'll build an N-gram model from your corpus for some value of N, and then use that model to generate random sentences. As you increase N, you should see the generated sentences get more reminiscent of the corpus, but less flexible and "creative."

Review example

For example, let's say this is our (one-sentence) corpus:

oh you bad bad bad boy you hurt me bad

When padded with start-of-sentence and end-of-sentence tokens (a single "<s>" and "</s>" for unigram and bigram counting, and two "<s>" tokens for trigram counting), this corpus has:

Eight different unigrams:

  • <s> — (1)
  • oh — (1)
  • you — (2)
  • bad — (4)
  • boy — (1)
  • hurt — (1)
  • me — (1)
  • </s> — (1)

Ten different bigrams:

  • <s> oh — (1)
  • oh you — (1)
  • you bad — (1)
  • bad bad — (2)
  • bad boy — (1)
  • boy you — (1)
  • you hurt — (1)
  • hurt me — (1)
  • me bad — (1)
  • bad </s> — (1)

Eleven different trigrams:

  • <s> <s> oh — (1)
  • <s> oh you — (1)
  • oh you bad — (1)
  • you bad bad — (1)
  • bad bad bad — (1)
  • bad bad boy — (1)
  • bad boy you — (1)
  • boy you hurt — (1)
  • you hurt me — (1)
  • hurt me bad — (1)
  • me bad </s> — (1)

In parentheses following each N-gram is the number of times it appeared in the corpus. This will turn out to be a crucial quantity for the model.

Building an N-gram model

Training an N-gram model is nothing more than counting. For a unigram model, you go through the text, and you count up how many times each word occurs. That's it. That's your model.

For a bigram model, you go through the text, and you count up how many times each pair of words appears in sequence. That's it. Same thing for a trigram model, a 4-gram model, etc.
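Training really is just counting. As a sketch (the helper name and padding scheme are mine, following the example above), a single function can count N-grams for any N:

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-token sequence in a list of tokens.

    tokens: list of words, already padded with sentence markers
    (n-1 copies of "<s>" at the front, one "</s>" at the end).
    """
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# The one-sentence corpus from the example, padded for a bigram model:
tokens = "<s> oh you bad bad bad boy you hurt me bad </s>".split()

bigrams = count_ngrams(tokens, 2)
print(bigrams[("bad", "bad")])   # → 2, the only bigram that appears twice
```

Running `count_ngrams(tokens, 1)` instead gives you the unigram counts from the example, and so on for any N.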

Simulating: running the model "backwards"

Now that you have a bunch of statistics on how often the writer uses each word (or each pair of words, or each triple of words, etc.) all you have to do to "simulate" the writer is run the model backwards. Put another way, instead of counting words, you'll be generating words according to those counts.

Unigram model

Consider the unigram model case. You have dutifully gathered statistics on how often each word is used (in isolation) by the writer. So what if you wanted to generate some "sample sentences" by this writer based on this unigram model? After a start-of-sentence token, you just randomly pick words for the sentences, one at a time, according to those frequencies.

For the author above (after starting a sentence with "<s>") we need to randomly choose one of the words "oh", "you", "bad", "boy", "hurt", or "me", to be the first word of the sentence. After all, according to our data, these are the only words this author ever uses. (Think about that fact for a moment and consider its advantages and limitations.)

We're not going to choose uniformly randomly, though, because according to our data on this author, he uses the word "you" twice as often, and "bad" four times as often, as he uses any of the other words. If you do the math (the counts, excluding "<s>", sum to 11), we're going to be choosing:

  • oh — with probability 1/11
  • you — with probability 2/11
  • bad — with probability 4/11
  • boy — with probability 1/11
  • hurt — with probability 1/11
  • me — with probability 1/11
  • </s> — with probability 1/11

(Note that there's no point in choosing the start-of-sentence token "<s>" since once you've begun the sentence, you should never have another start-of-sentence without first generating an end-of-sentence.)

A unigram model is nothing more than that. All words are chosen independently: none of the choices have anything to do with any previous (or later) choices.

An example generated "sentence" might be:

<s> hurt boy me boy you you bad bad bad you bad oh oh </s>

Not particularly compelling, but hey we're just getting started.
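One way to sample words in proportion to their counts is Python's random.choices, which accepts per-item weights. A sketch, reusing the example corpus (the function name is mine):

```python
import random
from collections import Counter

tokens = "<s> oh you bad bad bad boy you hurt me bad </s>".split()

# Unigram counts, excluding "<s>" since we never generate another start token.
counts = Counter(t for t in tokens if t != "<s>")

def generate_unigram_sentence(counts, max_words=100):
    words = ["<s>"]
    while len(words) < max_words:
        # random.choices picks proportionally to the weights, so "bad"
        # (count 4) is four times as likely as "oh" (count 1).
        word = random.choices(list(counts), weights=list(counts.values()))[0]
        words.append(word)
        if word == "</s>":
            break
    return " ".join(words)

print(generate_unigram_sentence(counts))
```

Each run prints a different random "sentence," always drawn from the author's vocabulary with the frequencies above.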

Bigram model

Same thing, except now each time we generate a new word, we take into account what the previous word was when we choose.

Our first word, as always, is "<s>". What should the next word be? According to our corpus, the next word will always be "oh". After all, there is only one utterance in the corpus, and it has an oh after the <s>, so as far as our model is concerned, this is the only possible sentence starter. For similar reasons, the word following oh will always be you.

Now, for the first time, it's mildly interesting. The word you could be followed by either of two words: "bad" or "hurt". Also, each of those two choices is equally likely, since the corpus has one "you bad" bigram and one "you hurt" bigram.

Say we choose "bad". Now it's even trickier: what should follow "bad"? By my calculations, there should be a 50% chance that the word following bad is chosen to be another "bad", a 25% chance it is "boy", and a 25% chance that this is the end of the sentence. Look at the bigram counts, above, and see whether you agree with me.

A sentence generated from the bigram model might be:

<s> oh you bad boy you bad bad bad bad </s>

It still doesn't make any "sense," of course, but it's more grammatically correct, as you can see. We never have malformed sequences like "hurt oh" or "boy me" because those bigrams never appeared in the original.

(Important: study this bigram-generated sentence and verify that it is possible to generate with the bigram model. Then, study the unigram-generated sentence and verify that it is not possible to generate with the bigram model.)
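A sketch of this conditional sampling for the example corpus (the names are mine): group the bigram counts by first word, then repeatedly sample a successor of the last word generated.

```python
import random
from collections import Counter, defaultdict

tokens = "<s> oh you bad bad bad boy you hurt me bad </s>".split()

# Group bigram counts by their first word: successors["you"] maps each
# word that can follow "you" to how often it did.
successors = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    successors[prev][cur] += 1

def generate_bigram_sentence(successors, max_words=100):
    words = ["<s>"]
    while words[-1] != "</s>" and len(words) < max_words:
        options = successors[words[-1]]
        word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(word)
    return " ".join(words)

print(generate_bigram_sentence(successors))
```

Every adjacent word pair in the output is guaranteed to be a bigram that occurred in the corpus, which is exactly why "hurt oh" can never appear.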

Trigrams and beyond

You get the pattern. For a trigram model, you consider the previous two words in generating each word. For a 4-gram, the previous three words. Etc.

Note that as N gets higher, the data gets more sparse. Even in a large corpus, there won't be all that many appearances of a given sequence of eleven back-to-back words. (This is reminiscent of the Data Science phenomenon of "overfitting.")
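The generalization to any N ≥ 2 can be sketched by keying the counts on an (N−1)-word context tuple (helper names are mine; shown here with the example corpus and N = 3):

```python
import random
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of possible next words."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def generate(model, n, max_words=100):
    """Generate a sentence for n >= 2, conditioning on the last n-1 words."""
    words = ["<s>"] * (n - 1)
    while words[-1] != "</s>" and len(words) < max_words:
        # Every context we reach was seen in the corpus, so options is
        # never empty here.
        options = model[tuple(words[-(n - 1):])]
        words.append(random.choices(list(options), weights=list(options.values()))[0])
    return " ".join(words)

n = 3
tokens = (["<s>"] * (n - 1)) + "oh you bad bad bad boy you hurt me bad".split() + ["</s>"]
print(generate(build_model(tokens, n), n))
```

With this tiny corpus the trigram model is forced down "oh you bad bad" every time, and only branches at the context ("bad", "bad") — a concrete taste of how larger N means less flexibility.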

Requirements

You will write a Python program that reads information from your corpus, normalizes and tokenizes it, and builds an N-gram model for some value of N where 1 ≤ N ≤ 3. It will then spit a generated sentence to the screen based on that model. (Note that each time your program runs, it will spit out a different "sentence" since you will use randomness to determine which words are chosen.)
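For the normalization step, a minimal sketch of tokenization plus case-folding (the regex scheme here is one simple choice of mine, not a required one):

```python
import re

def normalize(text):
    """Lower-case the text and split it into word tokens.

    A minimal scheme: case-fold, then treat runs of letters, digits,
    and apostrophes as tokens, discarding all other punctuation.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

print(normalize("Oh, you BAD bad bad boy -- you hurt me bad!"))
# → ['oh', 'you', 'bad', 'bad', 'bad', 'boy', 'you', 'hurt', 'me', 'bad']
```

Depending on your corpus you may want to handle hyphens, numbers, or sentence boundaries more carefully.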

The complete submission for this assignment will thus consist of two files, one called yourumwuserid_ngram.py and one called corpus.txt. For maximum credit (see below) your program should use a trigram model when it generates the 100-word "sentence."

Rubric

Feature                                                                Points
----------------------------------------------------------------------------
A Python program that gives no errors when run                            5XP
Reading the corpus                                                        5XP
Normalizing the corpus appropriately (I think you'll agree that this
at least includes reasonable tokenization and case-folding)               5XP
Generating a random sentence from a unigram model                        10XP
Generating a random sentence from a bigram model                          5XP
Generating a random sentence from a trigram model                         5XP
Using command-line arguments to accept a corpus filename and an
integer N, with defaults if the user does not provide them, and with
sensible error messages if the filename does not exist or the N value
is non-numeric or out of range                                            5XP
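A sketch of the command-line handling described in the last rubric item, using plain sys.argv (the default values are my own placeholders — pick whatever defaults you like):

```python
import os
import sys

# Assumed defaults; substitute your own.
DEFAULT_CORPUS = "corpus.txt"
DEFAULT_N = 3

def parse_args(argv):
    """Return (corpus_filename, n) from argv, exiting with a message on bad input."""
    corpus = argv[1] if len(argv) > 1 else DEFAULT_CORPUS
    n_text = argv[2] if len(argv) > 2 else str(DEFAULT_N)
    if not os.path.isfile(corpus):
        sys.exit(f"Error: corpus file '{corpus}' does not exist.")
    try:
        n = int(n_text)
    except ValueError:
        sys.exit(f"Error: N must be an integer, not '{n_text}'.")
    if not 1 <= n <= 3:
        sys.exit(f"Error: N must be between 1 and 3, not {n}.")
    return corpus, n

# Intended usage: python yourumwuserid_ngram.py [corpus_filename] [N]
```

Python's argparse module can do the same job with less ceremony, if you prefer it.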

Turning it in

For this assignment, send an email with subject line "DATA 419 Homework #2 turnin," and with a zip file called "yourumwuserid_homework2.zip" as an attachment. (Please replace "yourumwuserid" with your actual UMW user ID. For instance, "jsmith19_homework2.zip" would be an appropriate name.) Please do not deviate from this naming convention in any way. Please do not get clever, or creative, or include capital letters, or change the file extension, or omit the underscore, or replace the underscore with a hyphen, or anything else counter to the above instructions. Please name your files exactly as above.

(If you don't know how to create a zip file, google "how to create a zip file" for help.)

This zip file should contain at least two files: your main Python program (called exactly "yourumwuserid_ngram.py") and your corpus (called whatever you wish). If your Python program contains more than just the main file, include those as well. If your corpus contains more than just one file, include the additional file(s) as well.

In the body of the email, tell me how to run your program (i.e., what order the command-line arguments go in, and what values they should have). If you didn't complete the last item of the rubric, say "just run yourumwuserid_ngram.py, Stephen."

Hints, if you need them

If you're stuck on how to approach this and would like some advice and a road map, see these helpful tips from Stephen.

Getting help

Come to office hours, or send me email with subject line "DATA 419 Homework #2 help!!"