DATA 470 — NLP — Fall 2025
Homework #0 — Corpora
Possible experience: +10XP
Due: Sun Sep 7th, midnight
Overview
You'll need to find and assemble a corpus of text to use in future
assignments in this course. This will involve some reflection about what kinds
of things you like to read or listen to, and/or what information you're
interested in. Then you'll need to scour the internet looking for textual data
you can do something with, make some choices, and do whatever fiddly, hassle-y
stuff is required to get your corpus into organized, plain-text form.
Hey.
Please give yourself enough time to do this assignment properly. Do not wait
until the last minute to start thinking about it.
At any point in your exploration where you could choose something lame that
would be less work, or else choose something cool that would be more work,
choose the latter, unless the "more work" really does look super daunting.
Requirements
First, let me state that your corpus must not be on the same topic as
anyone else's in the class. This is a hard requirement.
By hook or by crook, you must put together your corpus in a single
plain-text file with absolutely no extraneous formatting, HTML/CSS code,
JSON, hypertext, embedded JavaScript, or other adornments.
A natural question is "how big does my corpus have to be?" That's a tough
one to answer, because of the wide variety of topics and their availability.
It'd be easy to just say "your corpus must have between
10 million and 100 million words," but neither The Weekly Ringer nor
Kendrick Lamar has ever produced close to this many. And I don't want to
prevent you from choosing a topic you really like just because the available
data isn't massive. So my answer is going to be: "your corpus needs to be as
sizeable as possible, given the limitations of your topic choice." (If that
seems vague, aim for at least 50,000 words. Feel free to bounce topic ideas off
me and ask questions.)
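If your cleaned text ends up spread across several files, one way to meet the single-file requirement is to concatenate them. The following is only a minimal sketch: the function name merge_text_files, the directory name, and the output file name are all hypothetical, and it assumes your cleaned pieces are UTF-8 .txt files sitting in one directory.

```python
from pathlib import Path


def merge_text_files(source_dir: str, output_path: str) -> int:
    """Concatenate every .txt file in source_dir into one file.

    Returns the number of files merged. The names here are illustrative
    assumptions, not something the assignment prescribes.
    """
    out = Path(output_path).resolve()
    # Sort for a reproducible order, and skip the output file itself in
    # case it lives in the same directory as the pieces.
    sources = sorted(
        p for p in Path(source_dir).glob("*.txt") if p.resolve() != out
    )
    with out.open("w", encoding="utf-8") as merged:
        for src in sources:
            # errors="replace" keeps going if a file has stray bad bytes.
            text = src.read_text(encoding="utf-8", errors="replace").strip()
            # A blank line separates one source document from the next.
            merged.write(text + "\n\n")
    return len(sources)


if __name__ == "__main__":
    # Hypothetical layout: cleaned pieces in ./cleaned, result in corpus.txt
    count = merge_text_files("cleaned", "corpus.txt")
    print(f"Merged {count} files")
```

Sorting the file names keeps the merged corpus identical from run to run, which helps if you later rebuild it after changing a cleaning step.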
Starter ideas
These are some random ideas of where you might start poking around, to get
your creative juices flowing. You do not have to ultimately choose something
from this list.
Public-Domain Literature
- Project Gutenberg – Thousands of free books in plain text; great for
classic literature analysis.
- Internet Archive (text collections) – Out-of-copyright novels,
magazines, and historical works.
Conversational Text
- OpenSubtitles – Movie and TV subtitles for informal, dialogue-heavy
language.
- Tatoeba – Multilingual example sentences for language learning and
translation.
- Parliamentary transcripts (Hansard, Congressional Record) – Real-world
political speech.
Fan Communities
- Archive of Our Own (AO3) – Fanfiction with rich, creative
storytelling.
- FanFiction.net – Massive archive of hobbyist writing in many
genres.
News & Magazines
- NewsAPI.org – Structured access to current news articles.
- GDELT Project – Worldwide news coverage for geopolitical and media
analysis.
Topic-Specific Text
- Wikipedia dumps – Comprehensive coverage of nearly any topic.
- Specialized wikis – E.g., Stormlight Archive Wiki, Marvel Wiki,
Battlestar Galactica Wiki, for domain-specific vocabulary.
- ArXiv – Research papers in science, math, and engineering.
- PubMed abstracts – Biomedical and health research summaries.
Niche & Specialized Writing
- Cooking & Recipes – Procedural, instruction-heavy text from recipe
blogs or RecipeNLG.
- Patents (USPTO) – Formal, structured technical descriptions.
- Court opinions (Caselaw Access Project) – Legal reasoning and
argumentation.
- Medical transcription samples (MTSamples) – Professional/clinical
language.
Spoken Language Transcripts
- TED Talk transcripts – Polished, presentation-style speech.
- Podcast transcripts – Conversational, topic-driven dialogue.
- Interview archives – Rich Q&A with real speakers (e.g., Paris
Review).
Historical Text
- Chronicling America – Digitized U.S. newspapers from 1789–1963.
- Letters & diaries collections – Personal and historical
correspondence.
- Public-domain translations of various "holy books."
Instructional / Procedural
- wikiHow dump – How-to guides in step-by-step format.
- GameFAQs – Game walkthroughs with informal commentary.
- iFixit manuals – Technical repair instructions.
Governmental
- U.S. congressional bills, resolutions, amendments, member profiles, the
Congressional Record (transcripts of floor speeches), etc.
- Europarl corpus – Parallel text in multiple European languages.
- United Nations documents.
Humorous / Creative
- Jokes datasets – Short, structured humor content.
- Public-domain song lyrics – Poetic and rhythmic text.
- Poetry Foundation archives – Public-domain poems.
User-Generated Reviews
- Yelp Open Dataset – Restaurant and business reviews with ratings.
- Amazon product reviews – Consumer opinions with metadata.
- TripAdvisor – Travel stories and location reviews.
Educational & Academic
- Released exam questions – SAT, AP, GRE, LSAT prompts for formal test
language.
- Simple English Wikipedia – Accessible, easy-to-read writing.
- MOOC transcripts – Educational lectures in text form.
Social Media APIs
- Pushshift (Reddit) – Millions of forum-like discussions on any
imaginable subject.
- YouTube Data API – Video titles, descriptions, tags, and comment
threads for topic-based language.
- Mastodon API – Decentralized microblogging platform with diverse
communities.
- Tumblr API – Mixed media posts with rich informal and fandom-driven
text.
- TikTok API – Video captions, hashtags, and comment threads (varies by
access).
- Instagram Graph API – Captions, hashtags, and comments for image-based
posts (requires account setup).
Some important tips
- A text editor like vim, Notepad++, or Spyder is a good tool for this, in
conjunction with other browsing and editing operations. Not so good is a
word processor (like MS Word or Google Docs) that natively stores formatting
information. (If you're absolutely wed to your favorite word processor, you
can use it if you save your documents in plain-text mode, with a .txt
extension.)
- Btw, note that saving a web page from your browser will very likely
save a .html file on your disk with embedded formatting. If you do
this, you'll have to use a technique like regular expressions to strip out all the HTML
tags (this is not too hard, but it must be done).
- Keep around both the raw and cleaned versions so you can revisit
preprocessing choices.
- Make sure your data is legally usable (public domain, Creative Commons,
or allowed by terms of service).
- Choose something you'll enjoy working with all semester.
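As a rough illustration of the HTML-stripping tip above, here is a minimal sketch using only Python's standard library. The function name strip_html is hypothetical, and the regular expressions are an assumption about what is "good enough" for typical saved web pages. They are not a full HTML parser, so check the output by eye and adjust for your own sources.

```python
import html
import re


def strip_html(raw: str) -> str:
    """Return roughly the visible text of an HTML string.

    A regex-based sketch, not a real HTML parser: it can be fooled by
    unusual markup, so spot-check the results.
    """
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw)
    # Drop HTML comments.
    text = re.sub(r"(?s)<!--.*?-->", " ", text)
    # Replace every remaining tag with a space.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Turn entities such as &amp; into the characters they stand for.
    text = html.unescape(text)
    # Collapse runs of spaces/tabs, then tidy whitespace around newlines.
    text = re.sub(r"[^\S\n]+", " ", text)
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(strip_html("<p>Hello &amp; <b>welcome</b></p>"))
    # prints: Hello & welcome
```

Because each tag is replaced by a space, you may see a stray space before punctuation that directly followed a closing tag. Whether that matters depends on what you plan to do with the text later.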
Turning it in
To turn in this assignment, send me an email with subject line "DATA 470
Homework #0 turnin". In the body of the email, describe your corpus
to me: tell me the topic, why it interests you, what sources you used, and how
large it is (in MB). Also in the body of the email, paste a couple of
representative passages from the corpus. (These don't have to be especially
curated or pretty — I just want to get a feel for what the text looks
like. Each excerpt you paste should have between, say, 100 and 1000 words.)
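To report the size, something like the following sketch would give you both the size in MB and a rough word count. The function name corpus_stats and the file name corpus.txt are hypothetical. It assumes the file is UTF-8, counts 1 MB as 1,000,000 bytes, and treats a "word" as any whitespace-separated chunk, which is a crude approximation.

```python
from pathlib import Path


def corpus_stats(path: str) -> tuple[float, int]:
    """Return (size in MB, whitespace-separated word count) for a text file."""
    # Reads the whole file into memory, which is fine for a corpus of a
    # few hundred MB or less.
    data = Path(path).read_bytes()
    # Assumption: 1 MB = 1,000,000 bytes (not 1,048,576).
    size_mb = len(data) / 1_000_000
    # str.split() with no arguments splits on any run of whitespace.
    word_count = len(data.decode("utf-8", errors="replace").split())
    return size_mb, word_count


if __name__ == "__main__":
    # Hypothetical file name; substitute your own corpus file.
    size_mb, words = corpus_stats("corpus.txt")
    print(f"{size_mb:.2f} MB, about {words:,} words")
```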
Getting help
Come to office hours, or send me an email with subject line "DATA 470
Homework #0 help!!"