Homework #2 roadmap

  1. Set up.
    1. Create your new yourumwuserid_ngram.py file in vim, Spyder, or whatever editor you're using. Put it in the same directory/folder your corpus is in (or copy your corpus to it, if your corpus is small enough).
  2. Read and clean the corpus.
    1. In your Python file, call open() to open the corpus file, and then either read the entire corpus into one big string with .read(), or else read each line individually and create a list of strings using .readlines(), your choice.
    2. Remove all undesirable characters from this string using the .replace() method. (This includes newlines ('\n'), punctuation marks, etc.)
    3. Turn the entire string into all lower-case.
    4. Now turn that huge string into a list of individual words by using the .split() method. (See Hints.) For each individual word, strip off any leading and trailing whitespace.
    5. (At this point, my list of words still had some "empty string" entries. If yours does too, remove all those empty strings from the list.)
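The cleaning steps above could be sketched roughly as follows. This is only one possible arrangement, not the required solution: the function names read_corpus and clean_words and the file name "corpus.txt" are made up for illustration, and the sketch assumes that every character in string.punctuation counts as "undesirable" (so apostrophes and hyphens disappear too, which may not suit your corpus).

```python
import string

def read_corpus(path):
    """Read the whole corpus file into one big string."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def clean_words(text):
    """Turn raw corpus text into a list of lower-case words."""
    # Replace newlines with spaces so words on adjacent lines stay separate.
    text = text.replace("\n", " ")
    # Remove punctuation marks one character at a time with .replace().
    # Assumption: everything in string.punctuation is unwanted.
    for mark in string.punctuation:
        text = text.replace(mark, "")
    text = text.lower()
    # Split on single spaces, strip each word, and drop empty strings.
    words = [w.strip() for w in text.split(" ")]
    return [w for w in words if w != ""]

# Example on an in-memory string; a real run would use something like
# clean_words(read_corpus("corpus.txt")), where the file name is hypothetical.
print(clean_words("I think,\nI can!"))  # prints ['i', 'think', 'i', 'can']
```

Note that removing all punctuation also throws away the periods that might mark sentence boundaries, so if your "sentences" depend on them you would want to split into sentences before this step.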
  3. Build a unigram model.
    1. Create a dictionary called unigrams. This dictionary will have key/value pairs whose keys are (unique) words, and whose values are integers corresponding to the number of times that word appeared in the corpus. (Or you can use the Counter class for this.)
    2. Go through the list of words, in order. For each word, if you haven't seen it before, enter it in the dictionary with a value of 1. If you have already seen it before, increment its value in the dictionary.
    3. Don't forget to add a <s> to the dictionary at the start of every "sentence," and a </s> at the end of every "sentence." (Exactly what counts as a "sentence" depends, of course, on your corpus.)
    4. Sanity check: if your corpus is "i think i can", then your unigrams dictionary should look like this:
              {'can': 1, 'i': 2, '<s>': 1, '</s>': 1, 'think': 1}
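One way the counting loop might look is sketched below. It assumes the corpus has already been divided into "sentences," each a list of words; the function name build_unigrams and that list-of-lists layout are choices made for this sketch, not something the assignment requires.

```python
def build_unigrams(sentences):
    """Count words across sentences, adding <s> and </s> around each one.

    `sentences` is assumed to be a list of word lists, one per "sentence".
    """
    unigrams = {}
    for sentence in sentences:
        # Wrap each sentence in the start and end tokens before counting.
        for word in ["<s>"] + sentence + ["</s>"]:
            if word not in unigrams:
                unigrams[word] = 1   # first time we've seen this word
            else:
                unigrams[word] += 1  # seen before: increment its count
    return unigrams

unigrams = build_unigrams([["i", "think", "i", "can"]])
print(unigrams)
```

The keys may print in a different order than in the sanity check, but the words and counts should be the same.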
          
  4. Run your unigram model.
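    1. Import the Python "random" module at the top of your source file. Use the random.choice() function to repeatedly generate randomly chosen words from your unigrams dictionary, where the probability of each word being chosen is proportional to its word count (value in the dictionary). (Note that random.choice() by itself picks uniformly from a sequence, so the sequence you give it has to reflect the counts, for example by containing each word once per occurrence.) Keep doing this until you generate a "</s>" token.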
    2. Again use the .join() technique to turn that list into a string of words. Print it out.
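Here is a minimal sketch of the generation loop, under a few assumptions of mine: the function name generate_unigram_sentence is made up, the start token <s> is left out of the pool so it never shows up mid-sentence, and the final </s> is not included in the printed string. Since random.choice() picks uniformly, the sketch builds a "pool" list in which each word appears once per occurrence.

```python
import random

def generate_unigram_sentence(unigrams):
    """Draw words with probability proportional to their counts until </s>."""
    # Build a pool in which each word appears once per occurrence, so that
    # random.choice() (which picks uniformly) favors frequent words.
    pool = []
    for word, count in unigrams.items():
        if word != "<s>":  # assumption: never emit the start token
            pool.extend([word] * count)
    words = []
    while True:
        word = random.choice(pool)
        if word == "</s>":  # end of sentence: stop generating
            break
        words.append(word)
    # Join the generated words into one space-separated string.
    return " ".join(words)

unigrams = {"can": 1, "i": 2, "<s>": 1, "</s>": 1, "think": 1}
print(generate_unigram_sentence(unigrams))
```

For a large corpus the pool list gets long; random.choices() with its weights argument is a standard-library alternative that avoids building it.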
  5. Keep the unigram model around (you'll need it), but now write some code to also build a bigram model.
    1. Create a dictionary called bigrams. This dictionary will have key/value pairs whose keys are (unique) words, and whose values are each dictionaries of words and counts. Each key of bigrams is the first word in a bigram. Its value will be a dictionary whose keys are the second word in a bigram, and whose values are the number of times that bigram appeared in the corpus.
    2. Go through the list of word pairs, in order. (The only way I know how to do this is with the "for i in range(len(...))" pattern instead of the "for word in ..." pattern, using "i" and "i+1" inside the loop as indices into the list to retrieve the first and second words of each pair.) Warning: make sure your loop doesn't go one-too-far! (If there are 1000 words, there are only 999 pairs of words.)
    3. If you haven't yet seen a bigram whose first word is the first word of this pair, enter a new dictionary into bigrams under the key of that first word. This new dictionary should have just one key/value pair at the moment; namely, the second word in the bigram and the value 1.

      If you do already have an entry in bigrams for this first word in the pair, update that dictionary to create/increment the value for the second word in the pair, just like you did for each word in the unigram model.
    4. Don't forget to add bigrams with <s> and </s>, as appropriate.
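    5. Sanity check: if your corpus is "i think i can", then your bigrams dictionary should look like this:
              {'i': {'can': 1, 'think': 1}, 'think': {'i': 1},
               '<s>': {'i': 1}, 'can': {'</s>': 1} }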
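The pair-counting loop might be sketched like this. It assumes the list of words already has the <s> and </s> tokens inserted; the function name build_bigrams is made up for this sketch. If you put several sentences in one list, the pair (</s>, <s>) gets counted too, so you may prefer to process one sentence at a time.

```python
def build_bigrams(words):
    """Map each first word to a dictionary of {second word: count}."""
    bigrams = {}
    # Stop one short of the end: n words give only n - 1 pairs.
    for i in range(len(words) - 1):
        first, second = words[i], words[i + 1]
        if first not in bigrams:
            # Never seen this first word: start its inner dictionary.
            bigrams[first] = {second: 1}
        elif second not in bigrams[first]:
            # Seen the first word, but not followed by this second word.
            bigrams[first][second] = 1
        else:
            bigrams[first][second] += 1
    return bigrams

# Assumption: the start and end tokens are already in the word list.
words = ["<s>", "i", "think", "i", "can", "</s>"]
bigrams = build_bigrams(words)
print(bigrams)
```

For the "i think i can" corpus this should contain the same words and counts as the sanity check, possibly printed in a different order.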
          
  6. Run your bigram model.
    1. Generate a starting token <s>. This is your starting "word."
    2. Now, continually do the following: look up the most recent word generated in the bigrams dictionary. Get its value (which is a dictionary of the words that follow that word and their counts) and use it to choose the next word in just the same way you chose words from unigrams, with probability proportional to count.
    3. Continue to do this until you generate a </s> token, at which point end your sentence.
    4. Again use the .join() technique to turn that list into a string of words. Print it out.
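The bigram generation loop could be sketched as follows. As before this is just one possible shape: the function name generate_bigram_sentence is made up, the <s> and </s> tokens are left out of the printed string, and the sketch assumes every word that can be generated (other than </s>) has an entry in bigrams.

```python
import random

def generate_bigram_sentence(bigrams):
    """Walk from <s> to </s>, choosing each next word by bigram count."""
    words = []
    current = "<s>"  # the starting "word"
    while True:
        # Dictionary of words that follow `current`, with their counts.
        followers = bigrams[current]
        # Repeat each follower once per occurrence so that the uniform
        # random.choice() ends up proportional to the counts.
        pool = []
        for word, count in followers.items():
            pool.extend([word] * count)
        current = random.choice(pool)
        if current == "</s>":  # end of sentence: stop generating
            break
        words.append(current)
    return " ".join(words)

bigrams = {"i": {"can": 1, "think": 1}, "think": {"i": 1},
           "<s>": {"i": 1}, "can": {"</s>": 1}}
print(generate_bigram_sentence(bigrams))
```

With the tiny "i think i can" model, every generated sentence starts with "i" and ends with "can", with zero or more "think i" repetitions in between.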