DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Sat Sep 27th, midnight (extended from Wed Sep 24th)
In this assignment, you'll build a bigram model from a corpus and then use it to do two things: (1) predict the probability of a sentence, and (2) generate random sentences from the model.
Let's say our (one-sentence) corpus is a single short sentence built entirely from the words "oh", "you", "bad", "boy", "hurt", and "me". Counting a single end-of-sentence token (but not a start-of-sentence token), the corpus has eleven tokens in all, and it contains seven different unigrams, nine different bigrams, and nine different trigrams, each appearing in the corpus some small number of times.
Training an N-gram model really is nothing more than counting. For a unigram model, you go through the text, and you count up how many times each word occurs. That's it. That's your model.
For a bigram model, you go through the text, and you count up how many times each pair of words appears in sequence. That's it. (Same thing for a trigram model, a 4-gram model, etc.)
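If it helps to see that in code, here's one minimal sketch of the counting step using collections.Counter. It assumes the corpus has already been split into sentences and tokenized (that part is up to you), and the names count_ngrams and sentences are just placeholders, not a required design:

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive tokens across a list of tokenized
    sentences. Each sentence is a list of tokens that already includes
    whatever "<s>"/"</s>" markers you decided to add."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy usage:
#   sentences = [["<s>", "oh", "you", "bad", "boy", "</s>"]]
#   count_ngrams(sentences, 1)   # unigram counts (keyed by 1-tuples)
#   count_ngrams(sentences, 2)   # bigram counts  (keyed by 2-tuples)
```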
Now that you have a bunch of statistics on how often the author of your corpus uses each N-gram, you can prompt the user for a sample sentence, and predict the "probability of that sentence." (Or, if you prefer, the model's perplexity on that sentence.)
For a unigram model, this is simply the product of the individual word probabilities. For example, one sample sentence, six tokens long, has probability 4⁄11 × 1⁄11 × 4⁄11 × 4⁄11 × 1⁄11 × 1⁄11 = 64⁄1771561 ≈ 3.613 × 10⁻⁵, while another, four tokens long, has probability 1⁄11 × 2⁄11 × 4⁄11 × 1⁄11 = 8⁄14641 ≈ 5.464 × 10⁻⁴.
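As a sketch (assuming a plain word-to-count dictionary, which is not the only reasonable data structure), that calculation might look like this:

```python
def unigram_sentence_prob(tokens, unigram_counts):
    """Multiply together each token's relative frequency in the corpus.
    `unigram_counts` maps a word to its corpus count; the denominator is
    the total number of tokens seen in training. Any word the model has
    never seen drives the whole product to zero."""
    total = sum(unigram_counts.values())
    prob = 1.0
    for w in tokens:
        prob *= unigram_counts.get(w, 0) / total
    return prob
```

For long sentences these products get tiny fast, so you may prefer to sum log probabilities and only convert back at the end.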
For a bigram model, you'll multiply the probability of the sentence starter (the first word following "<s>") by each of the individual bigram probabilities. For example, one sample sentence has probability 0⁄1 × 1⁄4 × 0⁄1 × 2⁄4 × 1⁄4 = 0⁄64 = 0.0, since it contains word pairs the author never used, while another has probability 1⁄1 × 1⁄1 × 1⁄2 × 1⁄4 = 1⁄8 = 1.25 × 10⁻¹.
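Here's one way that could look, assuming bigram counts are stored under (w1, w2) tuples and that the sentence has already had "<s>" and "</s>" attached; again, a sketch rather than the required design:

```python
def bigram_sentence_prob(tokens, bigram_counts, unigram_counts):
    """Probability of the sentence starter times every following bigram
    probability: count(w1, w2) / count(w1) for each adjacent pair.
    Assumes `tokens` already includes the "<s>" and "</s>" markers and that
    `unigram_counts` has a count for "<s>" (the number of sentences)."""
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        context = unigram_counts.get(w1, 0)
        if context == 0:
            return 0.0  # we never even saw the context word
        prob *= bigram_counts.get((w1, w2), 0) / context
    return prob
```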
All you have to do to "simulate" output from the author of the corpus is run the model backwards. Put another way, instead of counting words, you'll be generating words according to those counts.
Consider the unigram model case. After a start-of-sentence token, you just randomly pick words for the sentence, one at a time, according to those frequencies.
For our corpus above (after starting a sentence with "<s>") you need to randomly choose one of the words "oh", "you", "bad", "boy", "hurt", or "me", to be the first word of the sentence. After all, according to our data, these are the only words this author ever uses.
You're not going to choose uniformly at random, though, because according to our data on this author, they use the word "you" twice as often, and "bad" four times as often, as they use any of the other words. If you do the math, you'll end up choosing each word with probability proportional to how many times it appears in the corpus.
(Note that there's no point in choosing the start-of-sentence token "<s>". That's because once you've begun the sentence, you should never have another start-of-sentence without first generating an end-of-sentence.)
An example generated "sentence" is just a random jumble of these words, heavy on "bad", ending whenever the end-of-sentence token happens to come up.
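One convenient way to do that weighted drawing in Python is random.choices, which accepts a weights argument. The sketch below assumes the word-to-count dictionary from before and adds a length cap so an unlucky run can't go on forever:

```python
import random

def generate_unigram_sentence(unigram_counts, max_len=25):
    """Draw words one at a time, each with probability proportional to its
    corpus count, stopping when "</s>" is drawn (or at the length cap)."""
    # Per the note below, never offer "<s>" as a choice mid-sentence.
    words = [w for w in unigram_counts if w != "<s>"]
    weights = [unigram_counts[w] for w in words]
    out = []
    while len(out) < max_len:
        w = random.choices(words, weights=weights)[0]
        if w == "</s>":
            break
        out.append(w)
    return " ".join(out)
```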
Same thing with the bigram model, except now each time we generate a new word, we take into account what the previous word was when we choose.
Our first word, as always, is "<s>". What should the next word be? According to our corpus, the next word will always be "oh". After all, there is only one sentence in the corpus, and it has an oh after the <s>, so as far as our model is concerned, this is the only possible sentence starter. For similar reasons, the word following oh will always be you.
Now, for the first time, it's mildly interesting. The word "you" could be followed by either of two words: "bad" or "hurt". Also, each of those two choices is equally likely, since the corpus has one "you bad" bigram and one "you hurt" bigram.
Say we choose "bad". Okay, now what should follow "bad"? There should be a 50% chance that the word following bad is chosen to be another "bad", a 25% chance it is "boy", and a 25% chance that this is the end of the sentence.
A sentence generated from the bigram model comes out looking a bit more like something the author might actually have written.
It still doesn't make any "sense," of course, but it's more grammatically correct. We never have malformed sequences like "hurt oh" or "boy me" because those bigrams never appeared in the original.
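A sketch of that bigram-flavored generator, again assuming (w1, w2)-tuple counts and offered purely as one possible shape for the code:

```python
import random
from collections import defaultdict

def generate_bigram_sentence(bigram_counts, max_len=25):
    """Start at "<s>" and repeatedly sample the next word from the words
    that followed the current word in the corpus, weighted by how often
    each one followed it. Stop when "</s>" is drawn (or at the length cap)."""
    # Group the bigram counts by their first word: w1 -> [(w2, count), ...]
    followers = defaultdict(list)
    for (w1, w2), c in bigram_counts.items():
        followers[w1].append((w2, c))

    current, out = "<s>", []
    while len(out) < max_len:
        options = followers.get(current)
        if not options:  # dead end; shouldn't happen if every sentence ends with "</s>"
            break
        words, weights = zip(*options)
        current = random.choices(words, weights=weights)[0]
        if current == "</s>":
            break
        out.append(current)
    return " ".join(out)
```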
You will write a Python program that reads in a corpus file, normalizes and tokenizes it, and builds both a unigram and a bigram model from it. Then it will repeatedly prompt the user for a sample sentence and compute that sentence's probability under both models. If the user types the one-word input "random" at this prompt, your program will instead generate two random sentences (one from each model) and print them to the screen. It should continue in this fashion until the user types the word "done" at the prompt.
Note that whatever normalization and tokenization you perform on the corpus, you should also perform on the user's input.
Your program should take one command-line argument, which is the name of the corpus file.
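Putting the pieces together, the overall shape of the program might look something like the skeleton below. It leans on the functions sketched earlier, plus two placeholders (build_models and normalize_and_tokenize) that stand in for whatever normalization, tokenization, and counting you actually implement:

```python
import sys

def normalize_and_tokenize(text):
    # Placeholder normalization: lowercase and split on whitespace.
    # Whatever you really do here must match what you did to the corpus.
    return text.lower().split()

def main():
    corpus_path = sys.argv[1]   # the one command-line argument
    with open(corpus_path, encoding="utf-8") as f:
        raw = f.read()

    print("Building models...")
    # build_models() is a placeholder for your own code that splits the
    # corpus into sentences, adds "<s>"/"</s>", and counts the N-grams.
    unigram_counts, bigram_counts = build_models(raw)
    print("...done!")

    while True:
        line = input('Enter a sentence (or "random", or "done"): ').strip()
        if line == "done":
            break
        if line == "random":
            print("Unigram sentence:", generate_unigram_sentence(unigram_counts))
            print("Bigram sentence:", generate_bigram_sentence(bigram_counts))
        else:
            words = normalize_and_tokenize(line)
            print("Unigram prob:", unigram_sentence_prob(words, unigram_counts))
            print("Bigram prob:", bigram_sentence_prob(
                ["<s>"] + words + ["</s>"], bigram_counts, unigram_counts))

if __name__ == "__main__":
    main()
```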
Here's what an interactive session with your program should look like (the text following each prompt is what the user types):
$ python dtargaryen2_ngram.py instagram_corpus.txt
Building models...
...done!
Enter a sentence (or "random", or "done"): Don't believe me, just watch!
Unigram prob: 6.832e-25
Bigram prob: 1.242e-6
Enter a sentence (or "random", or "done"): Good see it you can is
Unigram prob: 2.832e-23
Bigram prob: 4.242e-99
Enter a sentence (or "random", or "done"): random
Unigram sentence: the you and case to in
Bigram sentence: you can tell it is true
Enter a sentence (or "random", or "done"): random
Unigram sentence: of it and and purple the
Bigram sentence: they are going to the headphones
Enter a sentence (or "random", or "done"): done
$
+5XP: make it work for trigrams. Note that as the N of your N-grams gets higher, the data gets more sparse. Even in a large corpus, there won't be all that many appearances of a given sequence of eleven back-to-back words. (This is reminiscent of the Data Science phenomenon called "overfitting.") For this reason, you may want to implement some version of stupid backoff, described in the book (p.49), unless you're also doing the second extra credit.
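If you go the backoff route, the scoring function can be a small recursion. The sketch below assumes a dictionary counts_by_order where counts_by_order[n] maps n-word tuples to counts (so counts_by_order[3] holds trigram counts), with the commonly used 0.4 back-off penalty:

```python
def stupid_backoff_score(ngram, counts_by_order, alpha=0.4):
    """Stupid-backoff score (not a true probability) for the last word of
    `ngram` given the words before it: use the relative frequency if the
    full n-gram was seen, otherwise back off to a shorter context and
    multiply by a penalty."""
    ngram = tuple(ngram)
    if len(ngram) == 1:
        total = sum(counts_by_order[1].values())
        return counts_by_order[1].get(ngram, 0) / total
    count = counts_by_order[len(ngram)].get(ngram, 0)
    if count > 0:
        return count / counts_by_order[len(ngram) - 1][ngram[:-1]]
    # Unseen n-gram: back off to the shorter context, with a penalty.
    return alpha * stupid_backoff_score(ngram[1:], counts_by_order, alpha)
```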
+5XP: implement add-k smoothing. Incorporate a way to modify the value of k, and experiment with its effect.
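As a sketch of the formula (using the same count dictionaries as above, with k exposed as a parameter so you can experiment with it):

```python
def addk_bigram_prob(w1, w2, bigram_counts, unigram_counts, k=1.0):
    """Add-k smoothed bigram probability:
        P(w2 | w1) = (count(w1, w2) + k) / (count(w1) + k * V)
    where V is the vocabulary size. With any k > 0, an unseen bigram gets a
    small nonzero probability instead of zeroing out the whole sentence."""
    vocab_size = len(unigram_counts)
    numerator = bigram_counts.get((w1, w2), 0) + k
    denominator = unigram_counts.get(w1, 0) + k * vocab_size
    return numerator / denominator
```

k = 1 is plain add-one (Laplace) smoothing; fractional values like 0.1 or 0.01 move less probability mass onto unseen bigrams.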
For this assignment, send an email with subject line "DATA 419 Homework #2 turnin," and with your "yourumwuserid_ngram.py" file as an attachment. (Please replace "yourumwuserid" with your actual UMW user ID. For instance, "jsmith19_ngram.py" would be an appropriate name. Please do not deviate from this naming convention in any way. Please do not get clever, or creative, or include capital letters, or change the file extension, or omit the underscore, or replace the underscore with a hyphen, or anything else counter to the above instructions. Please name your files exactly as above. Thanks.)
If you're doing either of the extra credits, please clearly mention that fact in the body of the email.
If you're stuck on how to approach this and would like some advice and a road map, see these helpful tips from Stephen.
Come to office hours, or send me an email with subject line "DATA 419 Homework #2 help!!"