DATA 470 - NLP

DATA 419 — NLP — Spring 2023

Homework #3 — Naïve Bayes text classification

Possible experience: +40XP

Due: Fri Feb 24th, midnight

Overview

In this assignment, you'll divide your corpus into multiple parts, create a Naïve Bayes text classifier that can tell those parts apart, and use it to classify new sentences.

Dividing your corpus

The first step is to divide your corpus into multiple components, each of which has some meaning. The way you do this will vary widely depending on your corpus, and there is no one "right answer." Remember, the goal of the assignment is to auto-predict which of the components new sentences are most reminiscent of. So try to divide your corpus in a way that is meaningful.

Here are some examples to give you the idea:

If your corpus contains lyrics from various hip hop artists, you could divide them by composer, and train your program to classify new lyrics as likely to be from Lil Wayne, Jay-Z, or Notorious B.I.G.
If your corpus contains the screenplays from all six seasons of The Sopranos, you could divide them by season, and train your program to classify new scenes as likely to be from season 1, 2, 3, 4, 5, or 6.
If your corpus is the complete works of William Shakespeare, you could divide them by genre into tragedies, comedies, and histories, and train your program to classify new scenes as one of those three genres.
If your corpus is articles from the Free-Lance Star, you could divide them by section into news, feature, and sports articles, and train your program to classify new articles as one of those three categories. Or, you could divide them by year or by decade, and train your program to predict when articles were written.
If your corpus is all lyrics from Bob Dylan, you could divide them by album, and train your program to classify new lyrics as likely to be from Highway 61 Revisited, Desire, or Blonde on Blonde.
If your corpus is poems by T.S. Eliot, you could divide them by era, and train your program to classify new verses as likely to be from "the early Eliot" era or "the late Eliot" era.
If your corpus is a single fictional work, you could divide it based on a key plot development. For example, if your corpus is the text of Graham Greene's The End of the Affair, you could divide your corpus into the chapters before, and after, Maurice and Sarah are discovered. If it's the text of Frank Herbert's Dune, it could be divided into the chapters before, and after, Paul and Jessica join the Fremen.

Spend some careful time thinking about this, then move to the next section of the instructions.

. . .

Still here? Spend some more careful time thinking about this, then move to the next section of the instructions.

. . .

If you're still here, then a deep default you can use in a pinch is to divide your text into "the first half of the corpus, and the second half of the corpus, in the order in which the text appears in your corpus file." Not only is this pretty lame, but your results aren't likely to be very meaningful. Try to do anything else if at all possible.

Program behavior

When your program runs, it should first open each of your divided-corpora and read them into whatever data structures you use to store word counts. (I recommend a simple dictionary for each; see below.)

Then, your program should repeatedly prompt the user for phrases/sentences. For each phrase, it should use the Naïve Bayes approach to predict which divided-corpus that phrase is most likely to have come from, and print both the name of that divided-corpus and its estimated probability (or percentage). Give the user a way to end the program gracefully (perhaps by entering "q" to quit).
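The prompting behavior described above might be sketched as follows. This is only one possible shape, not a required design: run_prompt_loop is a made-up name, and classify is a hypothetical function (not defined here) that is assumed to take a phrase and return a (corpus name, probability) pair. The read and write parameters exist only so the loop can be tried without a keyboard.

```python
def run_prompt_loop(classify, read=input, write=print):
    """Prompt for phrases until the user enters "q".

    `classify` is a hypothetical callable: phrase -> (corpus_name, probability).
    """
    while True:
        try:
            phrase = read("Enter a phrase (q to quit): ").strip()
        except EOFError:
            # Treat end-of-input (e.g. Ctrl-D) the same as "q".
            break
        if phrase.lower() == "q":
            break
        if not phrase:
            continue  # ignore empty input and prompt again
        name, prob = classify(phrase)
        write(f"Prediction: {name} {prob * 100:.2f}%")
        write()  # blank line between predictions
    write("Bye!")
```

To try it before your classifier exists, you could pass a stand-in such as run_prompt_loop(lambda phrase: ("Blueprints", 0.5)).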

Example

For example, I used the complete text of my book A Quick, Steep Climb Up Linear Algebra for my first divided-corpus, and the complete text of my book Blueprints for my second. The first of these is about linear algebra (used in CPSC 284) and the second is about object-oriented programming (used in CPSC 240). Here's some sample output when it runs (whatever follows each "Enter a phrase" prompt is what the user types):

Loading corpora (be patient)...

Enter a phrase (q to quit): object-oriented design
Prediction: Blueprints 100.00%

Enter a phrase (q to quit): matrix multiplication
Prediction: Quick Steep Climb 100.00%

Enter a phrase (q to quit): this is a tricky one
Prediction: Blueprints 88.53%

Enter a phrase (q to quit): catch an exception
Prediction: Blueprints 97.01%

Enter a phrase (q to quit): the norm of a vector
Prediction: Quick Steep Climb 98.16%

Enter a phrase (q to quit): q
Bye!

Advice

Unless you have good reason not to, do the following things:

Store your 2+ divided-corpora in different files, with different filenames. (You can, of course, also leave your original undivided corpus where it is for future use.)
Also store an abbreviated version of each divided-corpus (perhaps 1-5% of the full size) so that while you're programming and debugging, it doesn't take forever every time you run your program. You can switch back to use the full divided-corpora when you're in the final stages before submission.
Use a dictionary for each divided-corpus, to count how many times each word appears in that corpus.
Use the spaCy tokenizer, just as you did for homework 2.
Call .strip() and .lower() on each token before you put it in your dictionary, so that you ignore whitespace and capitalization.
Since you have each divided-corpus all in its own file, there really isn't a meaningful "prior." So ignore it: treat every divided-corpus as equally likely before you look at the phrase.
Use Laplace smoothing.
Ignore any word of the user's phrase that doesn't appear in either corpus, of course.
Instead of multiplying probabilities, add the logs of the probabilities. Then, convert your final log-probabilities back to regular probabilities before printing, normalizing them so that they sum to 1 across your divided-corpora.
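To make the advice above concrete, here is a minimal sketch of the counting and scoring steps on a toy example. Several things in it are assumptions rather than requirements: the function names count_words and classify are made up, plain str.split() stands in for the spaCy tokenizer so the sketch runs without any installs, and the two tiny "corpora" are invented strings. The smoothed estimate for each word is (count + 1) / (total words in that corpus + vocabulary size), where the vocabulary is every word seen in any divided-corpus.

```python
import math
from collections import Counter


def count_words(text):
    """Count case-folded tokens. str.split() stands in for spaCy here."""
    return Counter(tok.lower() for tok in text.split())


def classify(phrase, counts_by_corpus):
    """Return (best_corpus_name, probability), with no prior (all equal)."""
    # Shared vocabulary: every word seen in any divided-corpus.
    vocab = set()
    for counts in counts_by_corpus.values():
        vocab.update(counts)

    # Ignore words of the phrase that appear in none of the corpora.
    words = [w for w in phrase.lower().split() if w in vocab]

    log_scores = {}
    for name, counts in counts_by_corpus.items():
        total = sum(counts.values())
        # Laplace (add-one) smoothing, summing logs instead of multiplying.
        log_scores[name] = sum(
            math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
            for w in words
        )

    # Back to regular probabilities that sum to 1. Subtracting the largest
    # log-score first keeps exp() from underflowing on long phrases.
    top = max(log_scores.values())
    unnormalized = {n: math.exp(s - top) for n, s in log_scores.items()}
    norm = sum(unnormalized.values())
    best = max(unnormalized, key=unnormalized.get)
    return best, unnormalized[best] / norm


# Toy data, invented for illustration only.
counts = {
    "algebra": count_words("the matrix times the vector gives a vector"),
    "programming": count_words("the object throws an exception in the method"),
}
name, prob = classify("the norm of a vector", counts)
print(f"Prediction: {name} {prob * 100:.2f}%")
```

Note that "norm" and "of" appear in neither toy corpus, so they are dropped before scoring. In your real program the counts would come from your divided-corpus files rather than from inline strings.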

Requirements

Rubric

Feature	Points
Dividing your corpus intelligently into 2+ subsets	5XP
A Python program that gives no errors when run	5XP
Normalizing the corpora appropriately (I think you'll agree that this at least includes reasonable tokenization and case-folding)	5XP
Building dictionaries (or other data structure(s)) to hold the word counts of your divided corpora	10XP
Prompting repeatedly for user-generated sentences/phrases	5XP
Prediction using the Naïve Bayes algorithm	10XP

Turning it in

The complete submission for this assignment will consist of a tarball (.tar.gz) or zip file (.zip) with the following contents:

Your Python program, in a file called yourumwuserid_naive.py. (Please replace "yourumwuserid" with your actual UMW user ID. For instance, "jsmith19_naive.py" would be an appropriate name. Please do not deviate from this naming convention in any way. Please do not get clever, or creative, or include capital letters, or change the file extension, or omit the underscore, or replace the underscore with a hyphen, or anything else counter to the above instructions. Please name your files exactly as above. Thanks.)
Each of your divided-corpora in its own file, whose names should be hardcoded in your .py file so that when I run it, it finds its corpora in the current directory.

To turn it in, send me an email with subject line "DATA 470 Homework #3 turnin," and with your tarball or zip file included as an attachment. In the body of the email, describe what criteria you used to divide your corpus into multiple divided-corpora (i.e., what each of your divided-corpora represents). Also type three (or more) interesting phrases/sentences, and their predicted probabilities, that you discovered during testing: one that was strongly reminiscent of one divided-corpus, one that was strongly reminiscent of a different divided-corpus, and one that was very close to the line.

(If you don't know how to create a tarball or zip file, google "how to create a tarball" or "how to create a zip file" for help.)
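If you'd rather stay in Python, the standard library's zipfile module can build the archive too. The sketch below is just one way to do it: make_submission_zip and the default name submission.zip are made up for illustration, and the filenames you pass in are whatever yours actually are.

```python
import zipfile
from pathlib import Path


def make_submission_zip(program, corpus_files, zip_name="submission.zip"):
    """Bundle the program and its corpus files into one flat zip file."""
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in [program, *corpus_files]:
            # arcname drops any directory prefix, so every file sits at the
            # top level of the archive.
            zf.write(path, arcname=Path(path).name)
    return zip_name
```

For instance, make_submission_zip("jsmith19_naive.py", ["first.txt", "second.txt"]) would produce submission.zip in the current directory, assuming those three files exist there (the corpus filenames here are placeholders).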

Getting help

Come to office hours, or send me email with subject line "DATA 470 Homework #3 help!!"