DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Wed Nov 5th, midnight
In this team assignment, you'll join up with two or three partners (forming a group of three or four) and share your corpus with everyone. Then, you'll create a feed-forward neural network for multiclass classification whose inputs are TF-IDF-weighted centroids of pre-trained word embeddings. (Say that five times fast.) You'll train this network to predict which of the three/four corpora a particular test sentence is most likely to have come from.
(For example: if teammate A's corpus is Virgil's Aeneid, teammate B's is the complete Beatles lyrics, and teammate C's is transcripts of Call of Duty chats, your model will hopefully assign the highest probability to corpus A for the test sentence "I came from the shores of Troy," to corpus B for the test sentence "I love you baby but you make me blue", and to corpus C for the test sentence "watch that window, they're camping.")
The only mandatory teamwork in this assignment is that you choose partners and share your corpus with the others. You can all work separately after that. However, if you'd like to jointly work on the code and analysis together, you may do so. In that case, you'll submit just one submission for the whole team (see "Turning it in," below) and you'll all receive the same grade for it.
Using corpora of widely different sizes can give bad text classification results. For this reason, your team should trim all your corpora down to the size of the smallest corpus. (Obviously, corpus owners should each save their original, pre-trimmed corpus in a separate file before they do this.)
(The word "size" in the previous paragraph really refers to "number of sentences," which will be the "number of documents" you train your neural net on. However, if each sub-corpus merely has the same number of lines, paragraphs, or kilobytes (whatever's easiest for you to calculate), that should be a useful enough substitute.)
(Also, the sizes of the sub-corpora don't have to be exactly the same. Don't stress out about that. "Approximately the same" is good enough.)
You can do this resizing of corpora from the very get-go, or as part of your Python code, whichever approach your team deems easiest:
$ shuf -n 123 full_corpus.txt > smaller_corpus.txt
This would generate a smaller, randomly-shuffled version of 123 lines from your full_corpus.txt, and put them in a new, smaller smaller_corpus.txt file. (Obviously, change the number 123 to whatever the actual number of lines your team wants to go with per corpus. Btw, a handy way to discover how many lines are in a file is to use the command:
$ wc -l full_corpus.txt
where "wc" stands for "word count.")
Download a set of pre-trained GloVe word embeddings. I recommend the 50-dimensional ones, since they're the smallest download. I also recommend the ones that were trained on Wikipedia & Gigaword, just because I'm familiar with those. You can accomplish this by installing gensim ("pip install gensim") then doing this:
import gensim.downloader
embeds = gensim.downloader.load("glove-wiki-gigaword-50")
The variable embeds will then be a dict-like object that will give you the 50-dimensional embedding for a particular word:
print(embeds['king'])
array([ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 ,
-0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 ,
-0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 ,
-0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 ,
0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 ,
1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 ,
0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 ,
-1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 ,
-0.64426 , -0.51042 ], dtype=float32)
Calling the above gensim load() method is a good way to acquire the embeddings, but calling it every time you run your program is time-consuming. Much, much faster is to call load() just once (perhaps in a separate .py file, or simply interactively at the Python command line) and then run:
from gensim.models import KeyedVectors
embeds.save("glove_embeddings.data")
This will create a large glove_embeddings.data file in your current directory, plus an even larger glove_embeddings.data.vectors.npy file. They will total around 100 MBytes. However, the code you run repeatedly can then simply have this in it:
from gensim.models import KeyedVectors
embeds = KeyedVectors.load("glove_embeddings.data")
which loads the embeddings back into Python thousands of times faster.
Before moving on, play around with the embeddings a bit. It's fun! Among other things, experiment with the methods .similarity() and .most_similar(), like this:
>>> print(embeds.similarity('leader','president'))
0.761
>>> print(embeds.similarity('king','president'))
0.527
>>> print(embeds.similarity('king','dictator'))
0.537
>>> print(embeds.similarity('king','pineapple'))
0.0912
>>> print(embeds.most_similar('pineapple'))
[('mango', 0.910),
('coconut', 0.864),
('cranberry', 0.822),
('guava', 0.818),
('juice', 0.807),
('avocado', 0.807),
('lemon', 0.800),
('pear', 0.799),
('papaya', 0.798),
('tomato', 0.794)]
>>> print(embeds.most_similar('university'))
[('college', 0.874),
('harvard', 0.871),
('yale', 0.856),
('graduate', 0.855),
('institute', 0.848),
('professor', 0.841),
('school', 0.826),
('faculty', 0.825),
('graduated', 0.814),
('academy', 0.810)]
(I could play around with this for hours.)
Okay, back to work.
Write code to prepare your data. I suggest a separate Python file for this, called something like "prep.py". Load each corpus, and break each into "documents" of the appropriate type. (I recommend using the spaCy sentence segmenter if you're using "English sentences" as your documents. If you have a corpus of poetry or lyrics or chats or something non-sentence-y, you might want to choose a different way of divvying up that corpus into "documents.")
Shuffle the documents, and if you haven't already done so, discard any extras from the larger corpora so that all three are roughly the same size, as described above. Finally, split each corpus into a training set and a test set (of about 80%/20%) and label each sentence with the name of its corpus.
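Here's a minimal sketch of one way to do all of that, assuming plain-text corpora and spaCy's en_core_web_sm model; the file names, label names, and trim size are purely placeholders of mine:
import random
import spacy

nlp = spacy.load("en_core_web_sm")   # any spaCy pipeline with sentence segmentation

corpora = {"Shake": "shakespeare.txt",   # placeholder labels and file names
           "Bronte": "bronte.txt",
           "Bible": "bible.txt"}

data = {"train": [], "test": []}
for label, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Segment into sentences ("documents"). For very large files you may need to
    # raise nlp.max_length or process the text in chunks; and if your corpus isn't
    # sentence-y, swap in whatever "document" unit makes sense for it.
    sents = [s.text.strip() for s in nlp(text).sents if s.text.strip()]
    random.shuffle(sents)
    sents = sents[:2000]           # trim to (roughly) the size of the smallest corpus
    cut = int(0.8 * len(sents))    # 80/20 train/test split
    data["train"] += [(s, label) for s in sents[:cut]]
    data["test"] += [(s, label) for s in sents[cut:]]

random.shuffle(data["train"])
random.shuffle(data["test"])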
At this point, you should have some sort of data structure with all of your training and test data. Mine was a dict of two lists: one for training and one for testing. Each of these lists was a list of 2-tuples, each of which was a sentence and a label. Here's a sample of mine, which has Shakespeare, Charlotte Brontë, and the Bible as its three corpora:
{ 'train':
[ ("What ever have been thought on in this state, That could be brought to bodily act ere", "Shake"),
("With every minute you do change a mind, And call him noble that was now your hate, Him vile that was your garland.", "Shake"),
...
("Chancing to take up a newspaper of your county the other day, my eye fell upon your name.", "Bronte"),
("“My sight was always too weak to endure a blaze, Frances,” and we had now reached the wicket.", "Bronte"),
...
("God saw that it was good.", "Bible"),
("The fruits which your soul lusted after have been lost to you.", "Bible"),
... ],
'test':
[ ("O that this too too solid flesh would melt, and resolve itself into a dew", "Shake"),
...
("They are disputing about Victor, of whom Hunsden affirms that his mother is making a milksop.", "Bronte"),
...
("I am the Way and the Truth and the Life: no one comes to the Father except through me.", "Bible"),
... ],
}
(You don't have to encode yours in exactly this dict-of-lists-of-tuples structure. I'm not even sure this is a great way to structure it; it's just the first thing I thought of.)
Now write the code to convert each of these sentences (both training and test) into a TF-IDF centroid vector, based on the GloVe word embeddings. Recall from class that the "centroid" vector is simply the (normalized) sum of all the embeddings: it will point somewhere in 50-dimensional space that represents sort of "the direction of the average word in the sentence." You can simply add vectors with "+"; don't forget to normalize them before you do this (typing "x / x.norm()" will produce a normalized version of the vector x).
As you add each vector into the sum, you'll want to discount it by its IDF value, of course, which is simply N (the total number of documents) divided by the number of documents that have at least one occurrence of that word. (It's a good and efficient idea to compute all the IDF values for all words in the corpus all at once, near the beginning of your pipeline.)
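Here's a rough sketch of one way to do both steps, assuming the data dict of (sentence, label) pairs from the prep sketch above and the embeds object from gensim; the helper names and the lowercased whitespace tokenization are my own simplifications:
import torch

def idf_table(documents):
    # documents: list of token lists. IDF(w) = N / (number of documents containing w).
    N = len(documents)
    df = {}
    for doc in documents:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: N / count for w, count in df.items()}

def centroid(tokens, idf, embeds, dim=50):
    total = torch.zeros(dim)
    for w in tokens:
        if w in embeds and w in idf:     # silently skip out-of-vocabulary words
            vec = torch.tensor(embeds[w])
            vec = vec / vec.norm()       # normalize each embedding first
            total = total + idf[w] * vec # weight it by the word's IDF
    return total

# Example usage:
train_tokens = [s.lower().split() for s, _ in data["train"]]
idf = idf_table(train_tokens)
train_vecs = [(centroid(toks, idf, embeds), label)
              for toks, (_, label) in zip(train_tokens, data["train"])]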
Your input should now have tensors instead of strings. Mine looked like:
{ 'train':
[ (tensor([1.1923, 0.1290, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
2.0151, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Shake'),
...
(tensor([0.0998, 0.0391, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.0958, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Bronte'),
...
(tensor([0.0954, 0.0281, 0.0166, -0.0011, 0.0902, 0.0430, -0.1358, 0.0694,
0.0193, 0.0669, -0.0382, 0.0214, -0.0948, -0.0417, 0.0806, 0.0398,
0.0383, -0.1093, 0.0494, -0.0987, 0.0326, 0.1061, -0.0714, -0.0637,
-0.0058, -0.4800, -0.0905, -0.1047, 0.0213, -0.0615, 0.7290, 0.1197,
-0.1673, -0.0990, -0.0009, -0.0488, -0.0807, -0.0106, -0.0423, -0.1374,
-0.0363, 0.0175, 0.0249, 0.0225, 0.0233, 0.0017, -0.0797, 0.0537,
-0.0386, -0.0530]),
'Bible'),
... ],
'test':
[ (tensor([0.7334, -0.9032, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.4058, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Shake'),
...
(tensor([1.0096, -2.3294, 0.0343, -0.1310, 0.0101, -0.0495, -0.1563, 0.0586,
-0.0958, -0.0210, -0.1204, -0.1669, 0.0008, 0.0781, 0.1624, -0.0513,
0.0032, 0.0523, -0.3067, -0.0754, 0.0621, 0.1450, 0.2593, -0.0138,
0.0462, -0.1348, -0.1253, 0.0871, 0.1080, -0.0627, 0.7014, 0.0454,
-0.0091, 0.0512, 0.0451, 0.0608, 0.0948, -0.0421, -0.0309, 0.0809,
-0.1350, -0.0382, 0.0265, 0.0456, -0.0781, 0.0455, -0.0523, -0.0774,
-0.1357, -0.1421]),
'Bronte'),
...
(tensor([0.0954, 0.0281, 0.0166, -0.0011, 0.0902, 0.0430, -0.1358, 0.0694,
0.0193, 0.0669, -0.0382, 0.0214, -0.0948, -0.0417, 0.0806, 0.0398,
0.0383, -0.1093, 0.0494, -0.0987, 0.0326, 0.1061, -0.0714, -0.0637,
-0.0058, -0.4800, -0.0905, -0.1047, 0.0213, -0.0615, 0.7290, 0.1197,
-0.1673, -0.0990, -0.0009, -0.0488, -0.0807, -0.0106, -0.0423, -0.1374,
-0.0363, 0.0175, 0.0249, 0.0225, 0.0233, 0.0017, -0.0797, 0.0537,
-0.0386, -0.0530]),
'Bible')]
}
When you've reached this point, have your prep.py file write out the data to a file so that you can re-load it whenever you want to (including in the next step). The function call:
torch.save(variable_name, "variable_name.pt")
is a good way to do this, for each Python variable you want to save and then retrieve later.
You now have a labeled data set of "documents," with a tensor for each document. Now, actually build your classifier. I recommend a new file called "train.py" which will, first thing, load the variables you saved to files in your prep.py file, using the pattern:
variable_name = torch.load("variable_name.pt")
for each one.
Then, put the data in the form you will need for training: a single N × 51 matrix of inputs (conventionally called "X_train") and a single N × 3 (or N × 4, for teams of four) matrix of one-hot encoded outputs (conventionally called "y_train"). I wrote the number 51, assuming you're using the smallest, 50-dimensional GloVe embeddings (tweak that number if you're using a larger embedding size). Tip: you may want to use functions like torch.cat(), torch.stack(), torch.nn.functional.one_hot(), etc., to automate all this. (Or you can do it yourself; it's not hard code to write.)
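For instance, assuming the train_vecs list of (tensor, label) pairs from the previous sketch, something along these lines would work; the commented-out bias-column line is only relevant if your class's convention folds the bias into the weight matrix:
import torch
import torch.nn.functional as F

labels = sorted({label for _, label in train_vecs})    # e.g. ['Bible', 'Bronte', 'Shake']
label_to_idx = {label: i for i, label in enumerate(labels)}

X_train = torch.stack([vec for vec, _ in train_vecs])  # shape: N x 50
# If you're folding the bias into the weight matrix, append a column of ones
# (giving N x 51 for 50-dimensional embeddings):
# X_train = torch.cat([X_train, torch.ones(len(X_train), 1)], dim=1)

y_idx = torch.tensor([label_to_idx[label] for _, label in train_vecs])
y_train = F.one_hot(y_idx, num_classes=len(labels)).float()   # shape: N x 3 (or N x 4)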
Now create a simple feed-forward network with an input layer of the correct size, one hidden layer of a dozen or so units, and a softmax output layer with one unit per sub-corpus.
This will involve creating two weight matrices (called, perhaps, W_1 and W_2, to mimic what we did in class) of the appropriate shapes. (Remember: each of these matrices is essentially a function
fᵢ: ℝⁿ → ℝᵐ,
where the matrix is of dimensions m × n.) Each of these matrices will need gradient computation enabled, since they'll be what you're training. (Put another way, you'll be computing their gradient in order to tweak their values the proper amount each iteration.) Make the contents of these weight matrices initially random.
Now write a loop to iterate through your training set multiple times and adjust the weights each time, based on taking the partial derivatives of the loss function. (For the loss function, you should use the (multi-class) categorical cross-entropy. The function torch.nn.functional.cross_entropy() is an easy way to compute this, though be warned: it expects raw logits as its argument, not softmaxed results!)
Set variables for η ("eta" — the learning rate) and for the number of iterations, so you can play with them. Also print out the loss each time through the loop so you can (hopefully) see it going down as you train.
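Here's a rough sketch of all of that under my own assumptions (no separate bias vectors, a ReLU hidden layer, and full-batch gradient descent); adapt it to whatever conventions you used in class:
import torch
import torch.nn.functional as F

n_inputs = X_train.shape[1]       # 50 (or 51 with a bias column)
n_hidden = 12                     # "a dozen or so" hidden units
n_classes = y_train.shape[1]      # one output unit per sub-corpus

# Each matrix maps R^n -> R^m, so its shape is (m, n). Start them small and random,
# with gradient computation enabled.
W_1 = (0.1 * torch.randn(n_hidden, n_inputs)).requires_grad_()
W_2 = (0.1 * torch.randn(n_classes, n_hidden)).requires_grad_()

eta = 0.01          # learning rate -- play with this
num_iters = 500     # and this

for i in range(num_iters):
    hidden = torch.relu(X_train @ W_1.T)    # hidden layer (ReLU is my choice)
    logits = hidden @ W_2.T                 # raw logits -- no softmax here!
    loss = F.cross_entropy(logits, y_train.argmax(dim=1))
    print(f"iteration {i}: loss = {loss.item():.4f}")
    loss.backward()
    with torch.no_grad():                   # gradient-descent step on each matrix
        W_1 -= eta * W_1.grad
        W_2 -= eta * W_2.grad
        W_1.grad.zero_()
        W_2.grad.zero_()

torch.save(W_1, "W_1.pt")
torch.save(W_2, "W_2.pt")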
At the bottom of the file, use torch.save() to save your trained weight matrices.
Now evaluate your classifier on the held-out test set. I recommend creating a third Python file called "eval.py" for this. In it, load the test set back in, encode it into X_test and y_test variables just like you did with the training set, compute your y_hat vector (choosing the most probable class for each example), and print out the accuracy, precision, recall, and F1-score for each corpus.
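Here's a sketch of that evaluation, assuming X_test and y_test were built the same way as the training matrices, the labels list from the earlier sketch is available, and the same no-bias ReLU architecture as above:
import torch

W_1 = torch.load("W_1.pt")
W_2 = torch.load("W_2.pt")

logits = torch.relu(X_test @ W_1.T) @ W_2.T
y_hat = logits.argmax(dim=1)              # most probable class per example
y_true = y_test.argmax(dim=1)

accuracy = (y_hat == y_true).float().mean().item()
print(f"overall accuracy: {accuracy:.3f}")

for idx, name in enumerate(labels):
    tp = ((y_hat == idx) & (y_true == idx)).sum().item()
    fp = ((y_hat == idx) & (y_true != idx)).sum().item()
    fn = ((y_hat != idx) & (y_true == idx)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")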
Finally, write another little Python program (called, perhaps, "interact.py") with a little loop that reads user input and prints out class probabilities. Take some time to experiment with this and see what your classifier predicts well and what it doesn't.
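Here's a sketch of what interact.py might look like, reusing the centroid() helper, the idf table, the embeds object, and the labels list from the earlier sketches (all of those names are mine, not requirements):
import torch

W_1 = torch.load("W_1.pt")
W_2 = torch.load("W_2.pt")

while True:
    line = input("Enter a sentence (or 'quit'): ").strip()
    if line.lower() == "quit":
        break
    x = centroid(line.lower().split(), idf, embeds)   # same TF-IDF centroid as in training
    # (if you appended a bias column during training, append a 1.0 to x here too)
    logits = torch.relu(x @ W_1.T) @ W_2.T
    probs = torch.softmax(logits, dim=0)   # a softmax is fine here, since we just want
                                           # probabilities to display, not logits for a loss
    for name, p in zip(labels, probs):
        print(f"  {name}: {p.item():.3f}")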
The complete submission for this assignment will consist of a tarball (.tar.gz) or zip file (.zip) with the following contents:
(DO NOT include the GloVe embeddings file in your tarball. This is too big for email. Instead, just tell me what GloVe embedding size you used, so I can download it if necessary.)
To turn it in, send me an email with subject line "DATA 470 Homework #4 turnin," and with your tarball or zip file included as an attachment. If you are turning in work for a team, rather than just you individually, please Cc: each person on the team whose work this represents. (Only one of you on the team actually needs to send this submission email.)
(If you don't know how to create a tarball or zip file, google "how to create a tarball" or "how to create a zip file" for help.)
Come to office hours, or send me email with subject line "DATA 470 Homework #4 help!!"