DATA 419 — NLP — Fall 2025
Possible experience: +40XP
Due: Fri Dec 5th, midnight
In this assignment, you'll download and configure a HuggingFace (🤗) text generation model, and fine-tune it on your corpus. It will then be able to generate language in a way that should kick butt over your simple N-gram generator from homework #2. You'll also experiment with different decoder settings and observe how that can radically change what the model produces.
You will install the HuggingFace transformers package in your Python environment, and write a program to download and use the distilgpt2 model for causal language modeling. (Name trivia: the acronym BERT, as you know from class, refers to a non-causal, encoder-only architecture. The standard BERT model is very large, however, so researchers created a smaller version of it called "DistilBERT." This is what I always use when I run a BERT-style model, to preserve my machine from exploding. The name "DistilBERT" is pretty cute: it's a play on the popular geeky "Dilbert" comic strip, and the "distilled" part means "the essence of BERT is preserved, even though the volume has been reduced." The cleverness of the name, however, does not carry over to the smaller-than-GPT2 model it inspired: "DistilGPT2". Try to laugh anyway.)
Then, you will write a Python program to fine-tune this model on your corpus. After some amount of processing time, the result will be a GPT-style model that can generate text in the style of your corpus. You'll then write a second Python program to perform text generation in response to user inputs. As in homework #2, it will generate text stochastically (randomly), but this time you'll ask the user for a prompt, which the model will follow with a continuation. If it works well, this should be a lot of fun.
Make your fine-tuning program accept a command-line argument, which is the name of the corpus file. (Then you can try it on different corpora, for example your team's corpora from homework #4, to see how the style and content change.)
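A minimal sketch of that, using sys.argv (argparse works just as well; the file name finetune.py and the usage message are only examples):

import sys

if len(sys.argv) != 2:
    sys.exit("usage: python finetune.py CORPUS_FILE")
corpus_path = sys.argv[1]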
I think you'll want to import all of these:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset, DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
Load the pretrained model, and its tokenizer, from HuggingFace via:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
Read your entire corpus into a string, and tokenize it. For simplicity, I recommend dividing it evenly into non-overlapping, fixed-length blocks (you might start with a block size of 128), discarding any remainder. You can create a TensorDataset with two arguments, the inputs and the targets. The inputs are the tokens in an excerpt from your corpus, and the targets are the same, except shifted forward by one token, so that you can train to predict token t_i from tokens t_(i-block_size) through t_(i-1).
For example, if you used a block size of 5, then the first row of inputs would be the input_ids for tokens 0 through 4, and the first row of targets would be the input_ids for tokens 1 through 5. The second row of inputs would be the input_ids for tokens 5 through 9, and the second row of targets would be the input_ids for tokens 6 through 10. Etc.
(Another possibility is to use a sliding window across your tokens, rather than non-overlapping blocks. Come see me if you'd like to explore this option.)
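To make the non-overlapping-block approach concrete, here's one way it might look in code. This is just a sketch: corpus_path comes from my command-line sketch above, and names like num_blocks are my own.

block_size = 128

with open(corpus_path, encoding="utf-8") as f:
    text = f.read()

# Tokenize the whole corpus into one long list of token ids. (The tokenizer may
# warn that the sequence is longer than the model's maximum; that's fine here,
# since we cut it into block_size-length pieces ourselves.)
token_ids = tokenizer(text)["input_ids"]

# Keep only as many tokens as fill whole blocks; the "+ 1" leaves room for the
# one-token shift between inputs and targets. Any remainder is discarded.
num_blocks = (len(token_ids) - 1) // block_size
ids = torch.tensor(token_ids[: num_blocks * block_size + 1])

# Row i of inputs is tokens i*block_size through (i+1)*block_size - 1;
# row i of targets is the same window shifted forward by one token.
inputs = torch.stack([ids[i * block_size : (i + 1) * block_size] for i in range(num_blocks)])
targets = torch.stack([ids[i * block_size + 1 : (i + 1) * block_size + 1] for i in range(num_blocks)])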
Once you have your inputs and targets tensors, you can make a new TensorDataset, and give them as arguments. Then, create a new DataLoader object, giving the TensorDataset to it, and a smallish batch_size.
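Continuing the sketch (the batch size of 8 is just a starting guess):

dataset = TensorDataset(inputs, targets)
loader = DataLoader(dataset, batch_size=8, shuffle=True)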
Your program will load the pre-trained (but not fine-tuned on your corpus, of course) model from 🤗 using AutoModelForCausalLM.
Finally, you're ready to fine-tune. Create an AdamW optimizer from the torch.optim package, and give it your model's trainable parameters. Set the model to "training mode," and for each of your epochs (you decide how many to train for) run the standard training loop: zero the gradients, run a forward pass on a batch, compute the loss against that batch's targets, backpropagate, and take an optimizer step.
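Here's a sketch of what that might look like. The learning rate, epoch count, and device handling are just reasonable starting points, loader is the DataLoader from the sketch above, and since our targets tensor is already shifted, the loss is computed directly with F.cross_entropy rather than through the model's built-in labels mechanism.

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()   # "training mode"

num_epochs = 3   # you decide
for epoch in range(num_epochs):
    for batch_inputs, batch_targets in loader:
        batch_inputs, batch_targets = batch_inputs.to(device), batch_targets.to(device)

        optimizer.zero_grad()
        logits = model(input_ids=batch_inputs).logits          # (batch, block_size, vocab)
        # Position t of logits predicts position t of our (already shifted) targets.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               batch_targets.reshape(-1))
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last batch loss {loss.item():.3f}")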
You now have a fine-tuned model. You should save both it and the tokenizer by calling .save_pretrained() on them. This way you don't have to retune from scratch every time you want to generate text.
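For example (the directory name is just a placeholder; use whatever you like):

model.save_pretrained("distilgpt2-finetuned")
tokenizer.save_pretrained("distilgpt2-finetuned")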
In a second Python file, load your saved tokenizer and model, and put the model in "evaluation mode." Then, repeatedly prompt for input, tokenize what the user types, and pass those tokenized input_ids to the model.
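A sketch of that second program's setup (the directory name must match whatever you used with save_pretrained, and generate_continuation is a helper of my own naming, defined in the next sketch; in your actual file its definition would go above this loop):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2-finetuned")
model = AutoModelForCausalLM.from_pretrained("distilgpt2-finetuned")
model.eval()   # "evaluation mode"

while True:
    prompt = input("Prompt (blank to quit)> ")
    if not prompt:
        break
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    print(generate_continuation(input_ids))   # helper defined in the next sketch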
Then generate a continuation sequence from the model, one token at a time, for a fixed number of tokens. For each token you need to generate, sample from the logits the model returns, keeping only the top-k logits (for a parameter k you can tune) and applying a configurable "temperature."
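Here's one way that sampling loop might look, continuing the sketch above. The name generate_continuation and the defaults for k, temperature, and max_new_tokens are all my own choices.

def generate_continuation(input_ids, max_new_tokens=50, k=50, temperature=1.0):
    with torch.no_grad():                 # no gradients needed at generation time
        for _ in range(max_new_tokens):
            # Logits for the next token, given everything generated so far.
            logits = model(input_ids=input_ids).logits[0, -1, :]
            logits = logits / temperature                 # <1 sharpens, >1 flattens the distribution
            top_values, top_indices = torch.topk(logits, k)
            probs = F.softmax(top_values, dim=-1)         # renormalize over just the top k
            choice = torch.multinomial(probs, num_samples=1)
            next_id = top_indices[choice].view(1, 1)
            input_ids = torch.cat([input_ids, next_id], dim=1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

Playing with k and the temperature is exactly the "experiment with decoder settings" part of the assignment: temperatures below 1 make the output more conservative and repetitive, while temperatures above 1 make it wilder.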
What I'm looking for in this homework is:
For this assignment, send an email with subject line "DATA 419 Homework #5 turnin," and with your Python files and PDF write-up as attachments.
Come to office hours, or send me an email with subject line "DATA 419 Homework #5 help!!"