DATA 419 — NLP — Spring 2023
Possible experience: +40XP
Due: Fri Feb 24th, midnight
In this assignment, you'll divide your corpus into multiple parts, create a Naïve Bayes text classifier that can detect the difference, and use it to classify future sentences.
The first step is to divide your corpus into multiple components, each of which has some meaning. The way you do this will vary wildly depending on your corpus, and there is no one "right answer." Remember, the goal of the assignment is to auto-predict which of the components new sentences are most reminiscent of. So try to divide your corpus in a way that is meaningful.
Here are some examples to give you the idea:
Spend some careful time thinking about this, then move to the next section of the instructions.
. . .
Still here? Spend some more careful time thinking about this, then move to the next section of the instructions.
. . .
If you're still here, then a cheap default you can use in a pinch is to divide your text into "the first half of the corpus, and the second half of the corpus, in the order in which the text appears in your corpus file." Not only is this pretty lame, but your results aren't likely to be very meaningful. Try to do anything else if at all possible.
When your program runs, it should first open your divided-corpora (two or more) and read them into whatever data structures you use to store word counts. (I recommend a simple dictionary for each; see below.)
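As a rough illustration of the loading step, here is a minimal sketch of reading one divided-corpus into a dictionary of word counts. The function names (`tokenize`, `count_words`, `load_counts`) and the tokenization regex are assumptions made for this example, not a required design; `Counter` is just a dictionary subclass, so a plain `dict` would work equally well.

```python
import re
from collections import Counter


def tokenize(text):
    """Case-fold the text and split it into word tokens.

    Assumption: a token is a run of letters/digits, optionally joined by
    internal apostrophes or hyphens (so "isn't" and "object-oriented"
    stay whole). Your corpus may call for different rules.
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def count_words(text):
    """Return a dictionary-like Counter mapping each token to its count."""
    return Counter(tokenize(text))


def load_counts(filename):
    """Read one divided-corpus file and return its word counts."""
    with open(filename, encoding="utf-8") as f:
        return count_words(f.read())
```

You would call `load_counts` once per divided-corpus file and keep the results in, say, a dictionary keyed by the divided-corpus name.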
Then, your program should repeatedly prompt the user for phrases/sentences. For each phrase, it should use the Naïve Bayes approach to predict which divided-corpus that phrase is most likely to have come from, and print both the name of that divided-corpus and its estimated probability (or percentage). Give the user a way to end the program gracefully (perhaps by entering "q" to quit).
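The prompt-and-predict step could be sketched roughly as follows. This is one possible approach under stated assumptions, not the required one: it assumes every divided-corpus is equally likely a priori (a uniform prior), uses add-one (Laplace) smoothing so unseen words don't zero out a probability, and sums log-probabilities to avoid underflow on long phrases. The names `predict` and `prompt_loop` are hypothetical, and the phrase is tokenized here with a bare `lower().split()` for brevity; in your program it should be normalized the same way as your corpora.

```python
import math


def predict(tokens, counts_by_corpus):
    """Return (name, probability) for the most likely divided-corpus.

    counts_by_corpus maps each divided-corpus name to a dict of word
    counts. Assumes every divided-corpus is non-empty.
    """
    vocab = set()
    for counts in counts_by_corpus.values():
        vocab.update(counts)
    vocab_size = len(vocab)

    log_scores = {}
    for name, counts in counts_by_corpus.items():
        total = sum(counts.values())
        # Uniform prior (an assumption of this sketch).
        score = math.log(1 / len(counts_by_corpus))
        for tok in tokens:
            # Add-one smoothing: unseen words get a small nonzero probability.
            score += math.log((counts.get(tok, 0) + 1) / (total + vocab_size))
        log_scores[name] = score

    # Convert log scores to normalized probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def prompt_loop(counts_by_corpus, read=input, write=print):
    """Prompt repeatedly for phrases until the user enters 'q'."""
    while True:
        phrase = read("Enter a phrase (q to quit): ").strip()
        if phrase.lower() == "q":
            write("Bye!")
            break
        name, prob = predict(phrase.lower().split(), counts_by_corpus)
        write(f"Prediction: {name} {prob:.2%}")
```

The `read` and `write` parameters default to `input` and `print`; they are there only so the loop can be exercised without a keyboard while testing.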
For example, I used the complete text of my book A Quick, Steep Climb Up Linear Algebra for my first divided-corpus, and the complete text of my book Blueprints for my second. The first of these is about linear algebra (used in CPSC 284) and the second is about object-oriented programming (used in CPSC 240). Here's some sample output when it runs (text in blue is what the user types):
    Loading corpora (be patient)...
    Enter a phrase (q to quit): object-oriented design
    Prediction: Blueprints 100.00%
    Enter a phrase (q to quit): matrix multiplication
    Prediction: Quick Steep Climb 100.00%
    Enter a phrase (q to quit): this is a tricky one
    Prediction: Blueprints 88.53%
    Enter a phrase (q to quit): catch an exception
    Prediction: Blueprints 97.01%
    Enter a phrase (q to quit): the norm of a vector
    Prediction: Quick Steep Climb 98.16%
    Enter a phrase (q to quit): q
    Bye!
Unless you have good reason not to, do the following things:
Feature | Points |
---|---|
Dividing your corpus intelligently into 2+ subsets | 5XP |
A Python program that gives no errors when run | 5XP |
Normalizing the corpora appropriately (I think you'll agree that this at least includes reasonable tokenization and case-folding) | 5XP |
Building dictionaries (or other data structure(s)) to hold the word counts of your divided corpora | 10XP |
Prompting repeatedly for user-generated sentences/phrases | 5XP |
Prediction using the Naïve Bayes algorithm | 10XP |
The complete submission for this assignment will consist of a tarball (.tar.gz) or zip file (.zip) with the following contents:
To turn it in, send me an email with subject line "DATA 470 Homework #3 turnin," and with your tarball or zip file included as an attachment. In the body of the email, describe what criteria you used to divide your corpus into multiple divided-corpora (i.e., what each of your divided-corpora represents). Also include three (or more) interesting phrases/sentences, and their predicted probabilities, that you discovered during testing: one that was strongly reminiscent of one divided-corpus, one that was strongly reminiscent of a different divided-corpus, and one that was very close to the line.
(If you don't know how to create a tarball or zip file, google "how to create a tarball" or "how to create a zip file" for help.)
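If you would rather stay in Python, the standard library's `tarfile` module can build the tarball too. This is only a sketch: the helper name `make_tarball` and the file names in the usage comment are hypothetical placeholders, so substitute whatever files your submission actually contains.

```python
import tarfile
from pathlib import Path


def make_tarball(archive_name, paths):
    """Write the given files into a gzip-compressed tarball."""
    with tarfile.open(archive_name, "w:gz") as tar:
        for p in paths:
            # Store each file under its bare name, without directory prefixes.
            tar.add(p, arcname=Path(p).name)


# Hypothetical usage (file names are placeholders):
# make_tarball("hw3.tar.gz", ["classifier.py", "corpus_a.txt", "corpus_b.txt"])
```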
Come to office hours, or send me an email with subject line "DATA 470 Homework #3 help!!"