DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Wed Nov 5th, midnight
In this team assignment, you'll join up with two or three partners (forming a group of three or four) and share your corpus with everyone. Then, you'll create a feed-forward neural network for multiclass classification whose inputs are TF-IDF-weighted centroids of pre-trained word embeddings. (Say that five times fast.) You'll train this network to predict which of the three/four corpora a particular test sentence is most likely to have come from.
(For example: if teammate A's corpus is Virgil's Aeneid, teammate B's is the complete Beatles lyrics, and teammate C's is transcripts of Call of Duty chats, your model will hopefully assign the highest probability to corpus A for the test sentence "I came from the shores of Troy," to corpus B for the test sentence "I love you baby but you make me blue", and to corpus C for the test sentence "watch that window, they're camping.")
The only mandatory teamwork in this assignment is that you choose partners and share your corpus with the others. You can all work separately after that. However, if you'd like to jointly work on the code and analysis together, you may do so. In that case, you'll submit just one submission for the whole team (see "Turning it in," below) and you'll all receive the same grade for it.
Using corpora of widely different sizes can give bad text classification results. For this reason, your team should trim all your corpora down to the size of the smallest corpus. (Obviously, corpus owners should each save their original, pre-trimmed corpus in a separate file before they do this.)
(The word "size" in the previous paragraph really refers to "number of sentences," which will be the "number of documents" you train your neural net on. However, if each sub-corpus merely has the same number of lines, paragraphs, or kilobytes (whatever's easiest for you to calculate), that should be a useful enough substitute.)
(Also, the sizes of the sub-corpora don't have to be exactly the same. Don't stress out about that. "Approximately the same" is good enough.)
You can do this resizing of corpora from the very get-go, or as part of your Python code, whichever approach your team deems easiest:
$ shuf -n 123 full_corpus.txt > smaller_corpus.txt
This would generate a smaller, randomly-shuffled version of 123 lines from your full_corpus.txt, and put them in a new, smaller smaller_corpus.txt file. (Obviously, change the number 123 to whatever the actual number of lines your team wants to go with per corpus. Btw, a handy way to discover how many lines are in a file is to use the command:
$ wc -l full_corpus.txt
where "wc" stands for "word count.")
Download a set of pre-trained GloVe word embeddings. I recommend the 50-dimensional ones, since they're the smallest download. I also recommend the ones that were trained on Wikipedia & Gigaword, just because I'm familiar with those. You can accomplish this by installing gensim ("pip install gensim") then doing this:
import gensim.downloader
embeds = gensim.downloader.load("glove-wiki-gigaword-50")
The variable embeds will then be a dict-like object that will give you the 50-dimensional embedding for a particular word:
print(embeds['king'])
array([ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 ,
-0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 ,
-0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 ,
-0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 ,
0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 ,
1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 ,
0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 ,
-1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 ,
-0.64426 , -0.51042 ], dtype=float32)
Calling the above gensim load() method is a good way to acquire the embeddings, but calling it every time you run your program is time-consuming. Much, much faster is to call load() just once (perhaps in a separate .py file, or simply interactively at the Python command line) and then run:
from gensim.models import KeyedVectors
embeds.save("glove_embeddings.data")
This will create a large glove_embeddings.data file in your current directory, plus an even larger glove_embeddings.data.vectors.npy file. They will total around 100 MBytes. However, the code you run repeatedly can then simply have this in it:
from gensim.models import KeyedVectors
embeds = KeyedVectors.load("glove_embeddings.data")
which loads the embeddings back into Python thousands of times faster.
Before moving on, play around with the embeddings a bit. It's fun! Among other things, experiment with the methods .similarity() and .most_similar(), like this:
>>> print(embeds.similarity('leader','president'))
0.761
>>> print(embeds.similarity('king','president'))
0.527
>>> print(embeds.similarity('king','dictator'))
0.537
>>> print(embeds.similarity('king','pineapple'))
0.0912
>>> print(embeds.most_similar('pineapple'))
[('mango', 0.910),
('coconut', 0.864),
('cranberry', 0.822),
('guava', 0.818),
('juice', 0.807),
('avocado', 0.807),
('lemon', 0.800),
('pear', 0.799),
('papaya', 0.798),
('tomato', 0.794)]
>>> print(embeds.most_similar('university'))
[('college', 0.874),
('harvard', 0.871),
('yale', 0.856),
('graduate', 0.855),
('institute', 0.848),
('professor', 0.841),
('school', 0.826),
('faculty', 0.825),
('graduated', 0.814),
('academy', 0.810)]
(I could play around with this for hours.)
Okay, back to work.
Write code to prepare your data. I suggest a separate Python file for this, called something like "prep.py". Load each corpus, and break each into "documents" of the appropriate type. (I recommend using the spaCy sentence segmenter if you're using "English sentences" as your documents. If you have a corpus of poetry or lyrics or chats or something non-sentence-y, you might want to choose a different way of divvying up that corpus into "documents.")
Shuffle the documents, and if you haven't already done so, discard any extras from the larger corpora so that all three are roughly the same size, as described above. Finally, split each corpus into a training set and a test set (of about 80%/20%) and label each sentence with the name of its corpus.
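Here's a minimal sketch of one way to do all of that, assuming plain-text corpora and spaCy's en_core_web_sm model; the file names, label names, and trim size are purely placeholders of mine:
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # any spaCy pipeline with sentence segmentation

corpora = {"Shake": "shakespeare.txt",   # placeholder labels and file names
           "Bronte": "bronte.txt",
           "Bible": "bible.txt"}

data = {"train": [], "test": []}
for label, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Segment into sentences ("documents"). For very large files you may need to
    # raise nlp.max_length or process the text in chunks; and if your corpus isn't
    # sentence-y, swap in whatever "document" unit makes sense for it.
    sents = [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    random.shuffle(sents)
    sents = sents[:2000]           # trim to (roughly) the size of the smallest corpus
    cut = int(0.8 * len(sents))    # 80/20 train/test split
    data["train"] += [(s, label) for s in sents[:cut]]
    data["test"] += [(s, label) for s in sents[cut:]]

random.shuffle(data["train"])
random.shuffle(data["test"])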
At this point, you should have some sort of data structure with all of your training and test data. Mine was a dict of two lists: one for training and one for testing. Each of these lists was a list of 2-tuples, each of which was a sentence and a label. Here's a sample of mine, which has Shakespeare, Charlotte Brontë, and the Bible as its three corpora:
{ 'train':
[ ("What ever have been thought on in this state, That could be brought to bodily act ere", "Shake"),
("With every minute you do change a mind, And call him noble that was now your hate, Him vile that was your garland.", "Shake"),
...
("Chancing to take up a newspaper of your county the other day, my eye fell upon your name.", "Bronte"),
("“My sight was always too weak to endure a blaze, Frances,” and we had now reached the wicket.", "Bronte"),
...
("God saw that it was good.", "Bible"),
("The fruits which your soul lusted after have been lost to you.", "Bible"),
... ],
'test':
[ ("O that this too too solid flesh would melt, and resolve itself into a dew", "Shake"),
...
("They are disputing about Victor, of whom Hunsden affirms that his mother is making a milksop.", "Bronte"),
...
("I am the Way and the Truth and the Life: no one comes to the Father except through me.", "Bible"),
... ],
}
(You don't have to encode yours in exactly this dict-of-lists-of-tuples structure. I'm not even sure this is a great way to structure it; it's just the first thing I thought of.)
Now write the code to convert each of these sentences (both training and test) into a TF-IDF centroid vector, based on the GloVe word embeddings. Recall from class that the "centroid" vector is simply the (normalized) sum of all the embeddings: it will point somewhere in 50-dimensional space that represents sort of "the direction of the average word in the sentence." You can simply add vectors with "+"; don't forget to normalize them before you do this (typing "x / x.norm()" will produce a normalized version of the vector x).
As you add each vector into the sum, you'll want to discount it by its IDF value, of course, which is simply N (the total number of documents) divided by the number of documents that have at least one occurrence of that word. (It's a good and efficient idea to compute all the IDF values for all words in the corpus all at once, near the beginning of your pipeline.)
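Here's a rough sketch of one way to do both steps, assuming the data dict of (sentence, label) pairs from the prep sketch above and the embeds object from gensim; the helper names and the lowercased whitespace tokenization are my own simplifications:
import torch

def idf_table(documents):
    # documents: list of token lists. IDF(w) = N / (number of documents containing w).
    N = len(documents)
    df = {}
    for doc in documents:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: N / count for w, count in df.items()}

def centroid(tokens, idf, embeds, dim=50):
    total = torch.zeros(dim)
    for w in tokens:
        if w in embeds and w in idf:     # silently skip out-of-vocabulary words
            vec = torch.tensor(embeds[w])
            vec = vec / vec.norm()       # normalize each embedding first
            total = total + idf[w] * vec # weight it by the word's IDF
    return total

# Example usage:
train_tokens = [s.lower().split() for s, _ in data["train"]]
idf = idf_table(train_tokens)
train_vecs = [(centroid(toks, idf, embeds), label)
              for toks, (_, label) in zip(train_tokens, data["train"])]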
Your input should now have tensors instead of strings. Mine looked like:
{ 'train':
[ (tensor([1.1923, 0.1290, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
2.0151, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Shake'),
...
(tensor([0.0998, 0.0391, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.0958, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Bronte'),
...
(tensor([0.0954, 0.0281, 0.0166, -0.0011, 0.0902, 0.0430, -0.1358, 0.0694,
0.0193, 0.0669, -0.0382, 0.0214, -0.0948, -0.0417, 0.0806, 0.0398,
0.0383, -0.1093, 0.0494, -0.0987, 0.0326, 0.1061, -0.0714, -0.0637,
-0.0058, -0.4800, -0.0905, -0.1047, 0.0213, -0.0615, 0.7290, 0.1197,
-0.1673, -0.0990, -0.0009, -0.0488, -0.0807, -0.0106, -0.0423, -0.1374,
-0.0363, 0.0175, 0.0249, 0.0225, 0.0233, 0.0017, -0.0797, 0.0537,
-0.0386, -0.0530]),
'Bible'),
... ],
'test':
[ (tensor([0.7334, -0.9032, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.4058, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Shake'),
...
(tensor([1.0096, -2.3294, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.0958, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Bronte'),
...
(tensor([0.0954, 0.0281, 0.0166, -0.0011, 0.0902, 0.0430, -0.1358, 0.0694,
0.0193, 0.0669, -0.0382, 0.0214, -0.0948, -0.0417, 0.0806, 0.0398,
0.0383, -0.1093, 0.0494, -0.0987, 0.0326, 0.1061, -0.0714, -0.0637,
-0.0058, -0.4800, -0.0905, -0.1047, 0.0213, -0.0615, 0.7290, 0.1197,
-0.1673, -0.0990, -0.0009, -0.0488, -0.0807, -0.0106, -0.0423, -0.1374,
-0.0363, 0.0175, 0.0249, 0.0225, 0.0233, 0.0017, -0.0797, 0.0537,
-0.0386, -0.0530]),
'Bible')]
}
When you've reached this point, have your prep.py file write out the data to a file so that you can re-load it whenever you want to (including in the next step). The function call:
torch.save(variable_name, "variable_name.pt")
is a good way to do this, for each Python variable you want to save and then retrieve later.
You now have a labeled data set of "documents," with a tensor for each document. Now, actually build your classifier. I recommend a new file called "train.py" which will, first thing, load the variables you saved to files in your prep.py file, using the pattern:
variable_name = torch.load("variable_name.pt")
for each one.
Then, put the data in the form you will need for training: a single N × 51 matrix of inputs (conventionally called "X_train") and a single N × 3 (or N × 4, for teams of four) matrix of one-hot encoded outputs (conventionally called "y_train"). I wrote the number 51, assuming you're using the smallest, 50-dimensional GloVe embeddings (tweak that number if you're using a larger embedding size). Tip: you may want to use functions like torch.cat(), torch.stack(), torch.nn.functional.one_hot(), etc., to automate all this. (Or you can do it yourself; it's not hard code to write.)
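For instance, assuming the train_vecs list of (tensor, label) pairs from the previous sketch, something along these lines would work; the commented-out bias-column line is only relevant if your class's convention folds the bias into the weight matrix:
import torch
import torch.nn.functional as F

labels = sorted({label for _, label in train_vecs})    # e.g. ['Bible', 'Bronte', 'Shake']
label_to_idx = {label: i for i, label in enumerate(labels)}

X_train = torch.stack([vec for vec, _ in train_vecs])  # shape: N x 50
# If you're folding the bias into the weight matrix, append a column of ones
# (giving N x 51 for 50-dimensional embeddings):
# X_train = torch.cat([X_train, torch.ones(len(X_train), 1)], dim=1)

y_idx = torch.tensor([label_to_idx[label] for _, label in train_vecs])
y_train = F.one_hot(y_idx, num_classes=len(labels)).float()   # shape: N x 3 (or N x 4)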
Now create a simple feed-forward network with an input layer of the correct size, one hidden layer of a dozen or so units, and a softmax output layer with one unit per sub-corpus.
This will involve creating two weight matrices (called, perhaps, W_1 and W_2, to mimic what we did in class) of the appropriate shapes. (Remember: each of these matrices is essentially a function
fᵢ: ℝⁿ → ℝᵐ,
where the matrix is of dimensions m × n.) Each of these matrices will need gradient computation enabled, since they'll be what you're training. (Put another way, you'll be computing their gradient in order to tweak their values the proper amount each iteration.) Make the contents of these weight matrices initially random.
Now write a loop to iterate through your training set multiple times and adjust the weights each time, based on taking the partial derivatives of the loss function. (For the loss function, you should use the (multi-class) categorical cross-entropy. The function torch.nn.functional.cross_entropy() is an easy way to compute this, though be warned: it expects raw logits as its argument, not softmaxed results!)
Set variables for η ("eta" — the learning rate) and for the number of iterations, so you can play with them. Also print out the loss each time through the loop so you can (hopefully) see it going down as you train.
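Here's a rough sketch of all of that under my own assumptions (no separate bias vectors, a ReLU hidden layer, and full-batch gradient descent); adapt it to whatever conventions you used in class:
import torch
import torch.nn.functional as F

n_inputs = X_train.shape[1]       # 50 (or 51 with a bias column)
n_hidden = 12                     # "a dozen or so" hidden units
n_classes = y_train.shape[1]      # one output unit per sub-corpus

# Each matrix maps R^n -> R^m, so its shape is (m, n). Start them small and random,
# with gradient computation enabled.
W_1 = (0.1 * torch.randn(n_hidden, n_inputs)).requires_grad_()
W_2 = (0.1 * torch.randn(n_classes, n_hidden)).requires_grad_()

eta = 0.01          # learning rate -- play with this
num_iters = 500     # and this

for i in range(num_iters):
    hidden = torch.relu(X_train @ W_1.T)    # hidden layer (ReLU is my choice)
    logits = hidden @ W_2.T                 # raw logits -- no softmax here!
    loss = F.cross_entropy(logits, y_train.argmax(dim=1))
    print(f"iteration {i}: loss = {loss.item():.4f}")
    loss.backward()
    with torch.no_grad():                   # gradient-descent step on each matrix
        W_1 -= eta * W_1.grad
        W_2 -= eta * W_2.grad
        W_1.grad.zero_()
        W_2.grad.zero_()

torch.save(W_1, "W_1.pt")
torch.save(W_2, "W_2.pt")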
At the bottom of the file, use torch.save() to save your trained weight matrices.
Now evaluate your classifier on the held-out test set. I recommend creating a third Python file called "eval.py" for this. In it, load the test set back in, encode it into X_test and y_test variables just like you did with the training set, compute your y_hat vector (choosing the most probable class for each example), and print out the accuracy, precision, recall, and F1-score for each corpus.
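Here's a sketch of that evaluation, assuming X_test and y_test were built the same way as the training matrices, the labels list from the earlier sketch is available, and the same no-bias ReLU architecture as above:
import torch

W_1 = torch.load("W_1.pt")
W_2 = torch.load("W_2.pt")

logits = torch.relu(X_test @ W_1.T) @ W_2.T
y_hat = logits.argmax(dim=1)              # most probable class per example
y_true = y_test.argmax(dim=1)

accuracy = (y_hat == y_true).float().mean().item()
print(f"overall accuracy: {accuracy:.3f}")

for idx, name in enumerate(labels):
    tp = ((y_hat == idx) & (y_true == idx)).sum().item()
    fp = ((y_hat == idx) & (y_true != idx)).sum().item()
    fn = ((y_hat != idx) & (y_true == idx)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")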
Finally, write another little Python program (called, perhaps, "interact.py") with a little loop that reads user input and prints out class probabilities. Take some time to experiment with this and see what your classifier predicts well and what it doesn't.
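Here's a sketch of what interact.py might look like, reusing the centroid() helper, the idf table, the embeds object, and the labels list from the earlier sketches (all of those names are mine, not requirements):
import torch

W_1 = torch.load("W_1.pt")
W_2 = torch.load("W_2.pt")

while True:
    line = input("Enter a sentence (or 'quit'): ").strip()
    if line.lower() == "quit":
        break
    x = centroid(line.lower().split(), idf, embeds)   # same TF-IDF centroid as in training
    # (if you appended a bias column during training, append a 1.0 to x here too)
    logits = torch.relu(x @ W_1.T) @ W_2.T
    probs = torch.softmax(logits, dim=0)   # a softmax is fine here, since we just want
                                           # probabilities to display, not logits for a loss
    for name, p in zip(labels, probs):
        print(f"  {name}: {p.item():.3f}")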
The complete submission for this assignment will consist of a tarball (.tar.gz) or zip file (.zip) with the following contents:
(DO NOT include the GloVe embeddings file in your tarball. This is too big for email. Instead, just tell me what GloVe embedding size you used, so I can download it if necessary.)
To turn it in, send me an email with subject line "DATA 470 Homework #4 turnin," and with your tarball or zip file included as an attachment. If you are turning in work for a team, rather than just you individually, please Cc: each person on the team whose work this represents. (Only one of you on the team actually needs to send this submission email.)
(If you don't know how to create a tarball or zip file, google "how to create a tarball" or "how to create a zip file" for help.)
Come to office hours, or send me email with subject line "DATA 470 Homework #4 help!!"