DATA 419 — NLP — Spring 2023

Homework #3 — Naïve Bayes text classification

Possible experience: +40XP

Due: Fri Feb 24th, midnight

Overview

In this assignment, you'll divide your corpus into multiple parts, create a Naïve Bayes text classifier that can tell those parts apart, and use it to classify new sentences.

Dividing your corpus

The first step is to divide your corpus into multiple components, each of which has some meaning. How you do this will vary widely depending on your corpus, and there is no single "right answer." Remember, the goal of the assignment is to auto-predict which of the components new sentences are most reminiscent of. So try to divide your corpus in a way that is meaningful.

Here are some examples to give you the idea:

Spend some careful time thinking about this, then move to the next section of the instructions.

. . .

Still here? Spend some more careful time thinking about this, then move to the next section of the instructions.

. . .

If you're still here, then a last-resort default you can use in a pinch is to divide your text into "the first half of the corpus, and the second half of the corpus, in the order in which the text appears in your corpus file." Not only is this pretty lame, but your results aren't likely to be very meaningful. Try to do anything else if at all possible.

Program behavior

When your program runs, it should first open each of your divided-corpora and read it into whatever data structure you use to store word counts. (I recommend a simple dictionary for each; see below.)
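As a rough illustration of the loading step, here is a minimal sketch. The function names `tokenize` and `load_counts` are hypothetical, and the tokenization rule (lowercase everything, keep runs of letters and digits, allow internal hyphens and apostrophes) is just one assumed choice; your corpus may call for something different.

```python
import re
from collections import Counter

def tokenize(text):
    """Case-fold the text and split it into word tokens.

    Assumed rule: a token is a run of letters/digits, optionally joined
    by internal hyphens or apostrophes ("object-oriented", "isn't").
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def load_counts(filename):
    """Read one divided-corpus file and return a word -> count mapping."""
    with open(filename, encoding="utf-8") as f:
        return Counter(tokenize(f.read()))
```

A `Counter` is a dictionary subclass, so it fits the "simple dictionary for each" recommendation while returning 0 for words it has never seen.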

Then, your program should repeatedly prompt the user for phrases/sentences. For each phrase, it should use the Naïve Bayes approach to predict which divided-corpus that phrase is most likely to have come from, and print both the name of that divided-corpus and its estimated probability (or percentage). Give the user a way to end the program gracefully (perhaps by entering "q" to quit).
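To make the prediction step concrete, here is one possible sketch, not the required implementation. It assumes equal prior probabilities for every divided-corpus, add-one (Laplace) smoothing over the combined vocabulary, and log-probabilities to avoid underflow on long phrases. The name `predict` and the `counts_by_corpus` argument (a dictionary mapping each corpus name to its word counts) are hypothetical.

```python
import math

def predict(tokens, counts_by_corpus):
    """Return (best corpus name, its posterior probability) for a token list."""
    # Vocabulary = every distinct word seen in any divided-corpus.
    vocab = set()
    for counts in counts_by_corpus.values():
        vocab.update(counts)
    v = len(vocab)

    log_scores = {}
    for name, counts in counts_by_corpus.items():
        n = sum(counts.values())  # total tokens in this divided-corpus
        # Assumption: uniform prior, P(corpus) = 1 / number of corpora.
        score = math.log(1 / len(counts_by_corpus))
        for tok in tokens:
            # Add-one smoothing so unseen words do not zero out the product.
            score += math.log((counts.get(tok, 0) + 1) / (n + v))
        log_scores[name] = score

    # Convert log scores to a normalized probability for the winner.
    # Subtracting the maximum first keeps exp() from underflowing.
    top = max(log_scores.values())
    denom = sum(math.exp(s - top) for s in log_scores.values())
    best = max(log_scores, key=log_scores.get)
    return best, math.exp(log_scores[best] - top) / denom
```

The returned probability can be printed as a percentage with a format such as `f"{prob:.2%}"`.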

Example

For example, I used the complete text of my book A Quick, Steep Climb Up Linear Algebra for my first divided-corpus, and the complete text of my book Blueprints for my second. The first of these is about linear algebra (used in CPSC 284) and the second is about object-oriented programming (used in CPSC 240). Here's some sample output when it runs (the text after each "Enter a phrase" prompt is what the user types):

Loading corpora (be patient)...

Enter a phrase (q to quit): object-oriented design
Prediction: Blueprints 100.00%

Enter a phrase (q to quit): matrix multiplication
Prediction: Quick Steep Climb 100.00%

Enter a phrase (q to quit): this is a tricky one
Prediction: Blueprints 88.53%

Enter a phrase (q to quit): catch an exception
Prediction: Blueprints 97.01%

Enter a phrase (q to quit): the norm of a vector
Prediction: Quick Steep Climb 98.16%

Enter a phrase (q to quit): q
Bye!
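
An interaction like the one above could be driven by a loop along these lines. This is only a sketch: `run_prompt_loop` is a hypothetical name, and `classify` stands in for whatever function you write that takes a phrase and returns a (corpus name, probability) pair. The `read` and `write` parameters default to the built-in `input` and `print` and exist only so the loop can be exercised without a keyboard.

```python
def run_prompt_loop(classify, read=input, write=print):
    """Prompt for phrases until the user enters "q", printing each prediction."""
    while True:
        try:
            phrase = read("Enter a phrase (q to quit): ").strip()
        except EOFError:
            break  # treat end-of-input (e.g. Ctrl-D) the same as quitting
        if phrase.lower() == "q":
            break
        if not phrase:
            continue  # nothing typed; just ask again
        name, prob = classify(phrase)
        # The trailing newline leaves a blank line before the next prompt.
        write(f"Prediction: {name} {prob:.2%}\n")
    write("Bye!")
```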

Advice

Unless you have good reason not to, do the following things:

Requirements

Rubric

Feature                                                 Points
Dividing your corpus intelligently into 2+ subsets      5XP
A Python program that gives no errors when run          5XP
Normalizing the corpora appropriately (I think you'll   5XP
  agree that this at least includes reasonable
  tokenization and case-folding)
Building dictionaries (or other data structure(s)) to   10XP
  hold the word counts of your divided corpora
Prompting repeatedly for user-generated                 5XP
  sentences/phrases
Prediction using the Naïve Bayes algorithm              10XP

Turning it in

The complete submission for this assignment will consist of a tarball (.tar.gz) or zip file (.zip) with the following contents:

  1. Your Python program, in a file called yourumwuserid_naive.py. (Please replace "yourumwuserid" with your actual UMW user ID. For instance, "jsmith19_naive.py" would be an appropriate name. Please do not deviate from this naming convention in any way. Please do not get clever, or creative, or include capital letters, or change the file extension, or omit the underscore, or replace the underscore with a hyphen, or anything else counter to the above instructions. Please name your files exactly as above. Thanks.)
  2. Each of your divided-corpora in its own file, whose names should be hardcoded in your .py file so that when I run it, it finds its corpora in the current directory.

To turn it in, send me an email with subject line "DATA 470 Homework #3 turnin", and with your tarball or zip file included as an attachment. In the body of the email, describe what criteria you used to divide your corpus into multiple divided-corpora (i.e., what each of your divided-corpora represents). Also type three (or more) interesting phrases/sentences, and their predicted probabilities, that you discovered during testing: one that was strongly reminiscent of one divided-corpus, one that was strongly reminiscent of a different divided-corpus, and one that was very close to the line.

(If you don't know how to create a tarball or zip file, google "how to create a tarball" or "how to create a zip file" for help.)
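
Since you're already writing Python, one option is to build the tarball with the standard library's `tarfile` module. The sketch below is only an illustration: `make_tarball` is a hypothetical helper, and the file names in the commented-out call are made up.

```python
import os
import tarfile

def make_tarball(archive_name, filenames):
    """Bundle the given files into a gzip-compressed tarball (.tar.gz)."""
    with tarfile.open(archive_name, "w:gz") as tar:
        for name in filenames:
            # Store each file under its bare name, so everything unpacks
            # side by side into a single directory.
            tar.add(name, arcname=os.path.basename(name))

# Hypothetical usage (substitute your own file names):
# make_tarball("jsmith19_hw3.tar.gz",
#              ["jsmith19_naive.py", "corpus_a.txt", "corpus_b.txt"])
```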

Getting help

Come to office hours, or send me an email with subject line "DATA 470 Homework #3 help!!"