DATA 419 — NLP — Fall 2025

Homework #2 — N-gram models

Possible experience: +40XP

Due: Sat Sep 27th, midnight (extended from Wed Sep 24th)

Overview

In this assignment, you'll build a bigram model from your corpus and then use it to do two things: (1) predict the probability of a sentence, and (2) generate random sentences.

Review example

Let's say this is our (one-sentence) corpus:

oh you bad bad bad boy you hurt me bad

When considering single end-of-sentence (but not start-of-sentence) tokens, this corpus has:

Seven different unigrams:

  • oh — (1)
  • you — (2)
  • bad — (4)
  • boy — (1)
  • hurt — (1)
  • me — (1)
  • </s> — (1)

Nine different bigrams:

  • oh you — (1)
  • you bad — (1)
  • bad bad — (2)
  • bad boy — (1)
  • boy you — (1)
  • you hurt — (1)
  • hurt me — (1)
  • me bad — (1)
  • bad </s> — (1)

Nine different trigrams:

  • oh you bad — (1)
  • you bad bad — (1)
  • bad bad bad — (1)
  • bad bad boy — (1)
  • bad boy you — (1)
  • boy you hurt — (1)
  • you hurt me — (1)
  • hurt me bad — (1)
  • me bad </s> — (1)

In parentheses following each N-gram is the number of times it appeared in the corpus.

Building an N-gram model

Training an N-gram model is really nothing more than counting. For a unigram model, you go through the text, and you count up how many times each word occurs. That's it. That's your model.

For a bigram model, you go through the text, and you count up how many times each pair of words appears in sequence. That's it. (Same thing for a trigram model, a 4-gram model, etc.)
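For instance, here's a minimal counting sketch in Python using collections.Counter; the function and variable names are just illustrative, not part of the requirements:

from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in a token list."""
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

# The one-sentence review corpus, with an end-of-sentence token appended:
tokens = "oh you bad bad bad boy you hurt me bad".split() + ["</s>"]

unigrams = count_ngrams(tokens, 1)   # e.g. unigrams[("bad",)] == 4
bigrams  = count_ngrams(tokens, 2)   # e.g. bigrams[("bad", "bad")] == 2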

Part 1: computing sentence probability

Now that you have a bunch of statistics on how often the author of your corpus uses each N-gram, you can prompt the user for a sample sentence, and predict the "probability of that sentence." (Or, if you prefer, the model's perplexity on that sentence.)

Unigram example

For a unigram model, this is simply the product of the individual word probabilities. For example, this sample sentence

bad boy bad bad boy </s>

has probability 4/11 × 1/11 × 4/11 × 4/11 × 1/11 × 1/11 = 64/1771561 ≈ 3.613 × 10^-5, and this one:

oh you bad </s>

has probability 1/11 × 2/11 × 4/11 × 1/11 = 8/14641 ≈ 5.464 × 10^-4.
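As a sketch of that arithmetic in code, assuming a unigrams Counter like the one built in the earlier sketch (keyed by one-word tuples):

def unigram_sentence_prob(sentence_tokens, unigrams):
    """Multiply together each token's relative frequency under the unigram counts."""
    total = sum(unigrams.values())           # 11 for the review corpus
    prob = 1.0
    for word in sentence_tokens:
        prob *= unigrams[(word,)] / total    # a Counter returns 0 for unseen words
    return prob

# With the counts from the earlier sketch:
# unigram_sentence_prob("bad boy bad bad boy </s>".split(), unigrams) ~ 3.613e-05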

Bigram example

For a bigram model, you'll multiply the probability of the sentence starter by the probabilities of each of the individual bigrams. For example,

bad boy bad bad boy </s>

has probability 0/1 × 1/4 × 0/1 × 2/4 × 1/4 × 0/1 = 0/64 = 0.0, and this one:

oh you bad </s>

has probability 1/1 × 1/1 × 1/2 × 1/4 = 1/8 = 1.25 × 10^-1.
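Here's one way to sketch that computation. It assumes the bigram counts were built with a "<s>" token prepended to each sentence (so the sentence-starter probability is just another bigram probability), which is slightly different from the counts in the review example above:

from collections import Counter

def bigram_sentence_prob(sentence_tokens, bigrams):
    """Multiply together P(w | prev) = count(prev w) / count(prev as a bigram context)."""
    # How often each word appears as the first half of a bigram (its context count).
    context_counts = Counter()
    for (first, _second), count in bigrams.items():
        context_counts[first] += count

    prob = 1.0
    prev = "<s>"
    for word in sentence_tokens:
        if context_counts[prev] == 0:
            return 0.0                        # never-seen context: call it zero
        prob *= bigrams[(prev, word)] / context_counts[prev]
        prev = word
    return prob

# With <s>-prefixed counts, "oh you bad </s>" -> 1/1 * 1/1 * 1/2 * 1/4 = 0.125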

Part 2: generating sample sentences

All you have to do to "simulate" output from the author of the corpus is run the model backwards. Put another way, instead of counting words, you'll be generating words according to those counts.

Unigram model

Consider the unigram model case. After a start-of-sentence token, you just randomly pick words for the sentence, one at a time, according to those frequencies.

For our corpus above (after starting a sentence with "<s>") you need to randomly choose one of the words "oh", "you", "bad", "boy", "hurt", or "me", to be the first word of the sentence. After all, according to our data, these are the only words this author ever uses.

You're not going to choose uniformly randomly, though, because according to our data on this author, they use the word "you" twice as often, and "bad" four times as often, as they use any of the other words. If you do the math, you're going to be choosing:

  • oh — 1/11 of the time
  • you — 2/11 of the time
  • bad — 4/11 of the time
  • boy — 1/11 of the time
  • hurt — 1/11 of the time
  • me — 1/11 of the time
  • </s> — 1/11 of the time (which ends the sentence right there)

(Note that there's no point in choosing the start-of-sentence token "<s>". That's because once you've begun the sentence, you should never have another start-of-sentence without first generating an end-of-sentence.)

An example generated "sentence" might be:

<s> hurt boy me boy you you bad bad bad you bad oh oh </s>
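A minimal sketch of that weighted sampling, using random.choices with the unigram counts as weights (stopping once "</s>" is drawn, with a length cap just in case it never is):

import random

def generate_unigram_sentence(unigrams, max_len=20):
    """Draw words one at a time, weighted by their unigram counts, until </s>."""
    words   = [w for (w,) in unigrams]            # "<s>" is not among the candidates
    weights = [unigrams[(w,)] for w in words]
    sentence = ["<s>"]
    while len(sentence) < max_len and sentence[-1] != "</s>":
        sentence.append(random.choices(words, weights=weights)[0])
    return " ".join(sentence)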

Bigram model

Same thing with the bigram model, except now each time we generate a new word, we take into account what the previous word was when we choose.

Our first word, as always, is "<s>". What should the next word be? According to our corpus, the next word will always be "oh". After all, there is only one sentence in the corpus, and it has an oh after the <s>, so as far as our model is concerned, this is the only possible sentence starter. For similar reasons, the word following oh will always be you.

Now, for the first time, it's mildly interesting. The word you could be followed by either of two words: "bad" or "hurt". Also, each of those two choices is equally likely, since the corpus has one "you bad" bigram and one "you hurt" bigram.

Say we choose "bad". Okay, now what should follow "bad"? There should be a 50% chance that the word following bad is chosen to be another "bad", a 25% chance it is "boy", and a 25% chance that this is the end of the sentence.

A sentence generated from the bigram model might be:

<s> oh you bad boy you bad bad bad bad </s>

It still doesn't make any "sense," of course, but it's more grammatically correct. We never have malformed sequences like "hurt oh" or "boy me" because those bigrams never appeared in the original.
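Here's a hedged sketch of that conditional sampling; like the bigram-probability sketch above, it assumes the bigram counts include "<s>"-prefixed bigrams, so there is always something that can follow the start token:

import random
from collections import defaultdict

def generate_bigram_sentence(bigrams, max_len=20):
    """Starting from <s>, repeatedly sample the next word given the previous one."""
    # Group counts by first word, e.g. followers["bad"] == {"bad": 2, "boy": 1, "</s>": 1}
    followers = defaultdict(dict)
    for (first, second), count in bigrams.items():
        followers[first][second] = count

    sentence = ["<s>"]
    while len(sentence) < max_len and sentence[-1] != "</s>":
        options = followers[sentence[-1]]
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        sentence.append(next_word)
    return " ".join(sentence)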

Requirements

You will write a Python program that reads information from a corpus, normalizes and tokenizes it, and builds both a unigram and a bigram model from it. Then it will repeatedly prompt the user for a sample sentence to compute the probability of (with both models). If the user types the one-word sentence "random" to this prompt, your program will instead generate two random sentences (one from each model) and print them to the screen. It should continue in this fashion until the user types the word "done" at the prompt.

Note that whatever normalization and tokenization process you perform on the corpus, you should also perform on the user's input.
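For example, if your normalization happens to be "lowercase and strip punctuation" (your choices may differ), a shared helper might look roughly like this:

import re

def normalize_and_tokenize(text):
    """Lowercase, drop everything except letters, digits, and apostrophes, split on whitespace.
    This is just one reasonable normalization, not a required one."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return text.split()

# Applied to both the corpus and the user's input:
# normalize_and_tokenize("Don't believe me, just watch!") -> ["don't", "believe", "me", "just", "watch"]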

Your program should take one command-line argument, which is the name of the corpus file.
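Putting the requirements together, the overall skeleton might look something like this (function names and messages are illustrative; only the prompt/response behavior described above is required):

import sys

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python yourumwuserid_ngram.py corpus_file")

    print("Building models...")
    # ... read sys.argv[1], normalize/tokenize it, build unigram and bigram counts ...
    print("...done!")

    while True:
        line = input('Enter a sentence (or "random", or "done"): ').strip()
        if line == "done":
            break
        elif line == "random":
            pass   # ... print one random sentence from each model ...
        else:
            pass   # ... normalize/tokenize line, print both probabilities ...

if __name__ == "__main__":
    main()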

Here's what an interactive session with your program should look like (text in bold indicates what the user types):

$ python dtargaryen2_ngram.py instagram_corpus.txt
Building models...
...done!
Enter a sentence (or "random", or "done"): Don't believe me, just watch!
Unigram prob: 6.832e-25
Bigram prob: 1.242e-6
Enter a sentence (or "random", or "done"): Good see it you can is
Unigram prob: 2.832e-23
Bigram prob: 4.242e-99
Enter a sentence (or "random", or "done"): random
Unigram sentence: the you and case to in
Bigram sentence: you can tell it is true
Enter a sentence (or "random", or "done"): random
Unigram sentence: of it and and purple the
Bigram sentence: they are going to the headphones
Enter a sentence (or "random", or "done"): done
$

Extra credit

+5XP: make it work for trigrams. Note that as the N of your N-grams gets higher, the data gets more sparse. Even in a large corpus, there won't be all that many appearances of a given sequence of eleven back-to-back words. (This is reminiscent of the Data Science phenomenon called "overfitting.") For this reason, you may want to implement some version of stupid backoff, described in the book (p.49), unless you're also doing the second extra credit.
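If you do implement a version of stupid backoff, the core scoring idea is roughly this (a sketch, not the book's exact formulation; the 0.4 back-off weight is just a conventional choice, and the counts are assumed to be Counters keyed by tuples):

def stupid_backoff_score(w1, w2, w3, trigrams, bigrams, unigrams, alpha=0.4):
    """Score (not a true probability) for w3 following w1 w2, backing off when counts are zero."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[(w2,)]
    return alpha * alpha * unigrams[(w3,)] / sum(unigrams.values())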

+5XP: implement add-k smoothing. Incorporate a way to modify the value of k, and experiment with its effect.
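For the add-k version, the smoothed bigram estimate is (count(prev word) + k) / (count(prev) + k·V), where V is the vocabulary size. A sketch, again assuming tuple-keyed Counters:

def add_k_bigram_prob(prev, word, bigrams, unigrams, vocab_size, k=0.5):
    """Add-k smoothed P(word | prev); try several values of k and compare the results."""
    return (bigrams[(prev, word)] + k) / (unigrams[(prev,)] + k * vocab_size)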

Turning it in

For this assignment, send an email with subject line "DATA 419 Homework #2 turnin," and with your "yourumwuserid_ngram.py" file as an attachment. (Please replace "yourumwuserid" with your actual UMW user ID. For instance, "jsmith19_ngram.py" would be an appropriate name. Please do not deviate from this naming convention in any way. Please do not get clever, or creative, or include capital letters, or change the file extension, or omit the underscore, or replace the underscore with a hyphen, or anything else counter to the above instructions. Please name your files exactly as above. Thanks.)

If you're doing either of the extra credits, please clearly mention that fact in the body of the email.

Hints, if you need them

If you're stuck on how to approach this and would like some advice and a road map, see these helpful tips from Stephen.

Getting help

Come to office hours, or send me an email with subject line "DATA 419 Homework #2 help!!"