DATA 470 — NLP — Fall 2025

Homework #0 — Corpora

Possible experience: +10XP

Due: Sun Sep 7th, midnight

Overview

You'll need to find and assemble a corpus of text to use in future assignments in this course. This will involve some reflection about what kinds of things you like to read or listen to, and/or what information you're interested in. Then you'll need to scour the internet looking for textual data you can do something with, make some choices, and do whatever fiddly, hassle-y stuff is required to get your corpus into a single, organized plain-text file.

Hey.

Please give yourself enough time to do this assignment properly. Do not wait until the last minute to start thinking about it.

At any point in your exploration where you could choose something lame that would be less work, or else choose something cool that would be more work, choose the latter, unless the "more work" really does look super daunting.

Requirements

First, let me state that your corpus must not be on the same topic as anyone else's in the class. This is a hard requirement.

By hook or by crook, you must put together your corpus in a single plain-text file with absolutely no extraneous formatting, HTML/CSS code, JSON, hypertext, embedded JavaScript, or other adornments.
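
If your raw material arrives as web pages, here is one rough way to strip the markup, offered as a sketch and not as the required method. It uses only Python's standard library; the names TextExtractor and html_to_text are made up for illustration, and the list of "block" tags is an assumption you may need to adjust for your source.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Assumption: these tags separate chunks of text, so we add a break.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = "<p>Hello &amp; <b>good</b>bye.</p><p>Second paragraph.</p>"
    print(html_to_text(page))
```

This collapses all whitespace, paragraph breaks included, into single spaces. If you want to keep paragraph boundaries in your corpus, you'd change the last line of html_to_text accordingly. Real-world pages also tend to carry navigation menus and footers that a tag stripper will happily keep, so eyeball the output.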

A natural question is "how big does my corpus have to be?" That's a tough one to answer, because of the wide variety of topics and their availability. It'd be easy to just say "your corpus must have between 10 million and 100 million words," but neither The Weekly Ringer nor Kendrick Lamar has ever produced close to this many. And I don't want to prevent you from choosing a topic you really like just because the available data isn't massive. So my answer is going to be: "your corpus needs to be as sizeable as possible, given the limitations of your topic choice." (If that seems vague, aim for at least 50,000 words. Feel free to bounce topic ideas off me and ask questions.)
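
To see where you stand against that 50,000-word guideline, a quick count along these lines may help. This is a sketch under a stated assumption: a "word" here is a run of letters, digits, or underscores, optionally with internal apostrophes, which is only one of several reasonable definitions. The function names and the file name corpus.txt are hypothetical.

```python
import re

# Assumption: a word is a run of word characters, allowing internal
# apostrophes so that "don't" counts as one word.
WORD_PATTERN = re.compile(r"\w+(?:'\w+)*")


def count_words(text):
    """Count words in a string using WORD_PATTERN."""
    return len(WORD_PATTERN.findall(text))


def count_words_in_file(path):
    """Count words in a UTF-8 text file, reading it one line at a time."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += count_words(line)
    return total


if __name__ == "__main__":
    import sys

    # Usage: python count_words.py corpus.txt
    print(count_words_in_file(sys.argv[1]))
```

Different tokenizers will give somewhat different totals, so treat the number as a ballpark figure.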

Starter ideas

These are some random ideas of where you might start poking around, to get your creative juices flowing. You do not have to ultimately choose something from this list.

Public-Domain Literature

  1. Project Gutenberg – Thousands of free books in plain text; great for classic literature analysis.
  2. Internet Archive (text collections) – Out-of-copyright novels, magazines, and historical works.

Conversational Text

  1. OpenSubtitles – Movie and TV subtitles for informal, dialogue-heavy language.
  2. Tatoeba – Multilingual example sentences for language learning and translation.
  3. Parliamentary transcripts (Hansard, Congressional Record) – Real-world political speech.

Fan Communities

  1. Archive of Our Own (AO3) – Fanfiction with rich, creative storytelling.
  2. FanFiction.net – Massive archive of hobbyist writing in many genres.

News & Magazines

  1. NewsAPI.org – Structured access to current news articles.
  2. GDELT Project – Worldwide news coverage for geopolitical and media analysis.

Topic-Specific Text

  1. Wikipedia dumps – Comprehensive coverage of nearly any topic.
  2. Specialized wikis – E.g., Stormlight Archive Wiki, Marvel Wiki, Battlestar Galactica Wiki, for domain-specific vocabulary.
  3. arXiv – Research papers in science, math, and engineering.
  4. PubMed abstracts – Biomedical and health research summaries.

Niche & Specialized Writing

  1. Cooking & Recipes – Procedural, instruction-heavy text from recipe blogs or RecipeNLG.
  2. Patents (USPTO) – Formal, structured technical descriptions.
  3. Court opinions (Caselaw Access Project) – Legal reasoning and argumentation.
  4. Medical transcription samples (MTSamples) – Professional/clinical language.

Spoken Language Transcripts

  1. TED Talk transcripts – Polished, presentation-style speech.
  2. Podcast transcripts – Conversational, topic-driven dialogue.
  3. Interview archives – Rich Q&A with real speakers (e.g., Paris Review).

Historical Text

  1. Chronicling America – Digitized U.S. newspapers from 1789–1963.
  2. Letters & diaries collections – Personal and historical correspondence.
  3. Public-domain translations of various "holy books."

Instructional / Procedural

  1. wikiHow dump – How-to guides in step-by-step format.
  2. GameFAQs – Game walkthroughs with informal commentary.
  3. iFixit manuals – Technical repair instructions.

Governmental

  1. U.S. congressional bills, resolutions, amendments, member profiles, the Congressional Record (transcripts of floor speeches), etc.
  2. Europarl corpus – Parallel text in multiple European languages.
  3. United Nations documents.

Humorous / Creative

  1. Jokes datasets – Short, structured humor content.
  2. Public-domain song lyrics – Poetic and rhythmic text.
  3. Poetry Foundation archives – Public-domain poems.

User-Generated Reviews

  1. Yelp Open Dataset – Restaurant and business reviews with ratings.
  2. Amazon product reviews – Consumer opinions with metadata.
  3. TripAdvisor – Travel stories and location reviews.

Educational & Academic

  1. Released exam questions – SAT, AP, GRE, LSAT prompts for formal test language.
  2. Simple English Wikipedia – Accessible, easy-to-read writing.
  3. MOOC transcripts – Educational lectures in text form.

Social Media APIs

  1. Pushshift (Reddit) – Millions of forum-like discussions on any imaginable subject.
  2. YouTube Data API – Video titles, descriptions, tags, and comment threads for topic-based language.
  3. Mastodon API – Decentralized microblogging platform with diverse communities.
  4. Tumblr API – Mixed media posts with rich informal and fandom-driven text.
  5. TikTok API – Video captions, hashtags, and comment threads (varies by access).
  6. Instagram Graph API – Captions, hashtags, and comments for image-based posts (requires account setup).

Some important tips

Turning it in

To turn in this assignment, send me an email with subject line "DATA 470 Homework #0 turnin". In the body of the email, describe your corpus to me: tell me the topic, why it interests you, what sources you used, and how large it is (in MB). Also in the body of the email, paste a couple of representative passages from the corpus. (These don't have to be especially curated or pretty — I just want to get a feel for what the text looks like. Each excerpt you paste should have between, say, 100 and 1000 words.)
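
If it helps, here is one possible way to get the numbers and excerpts for that email. It's a sketch, not a required procedure: the function names and the file name corpus.txt are hypothetical, and I'm assuming 1 MB means 1,000,000 bytes and that words are separated by whitespace.

```python
import os
import random


def size_in_mb(path):
    """File size in megabytes, assuming 1 MB = 1,000,000 bytes."""
    return os.path.getsize(path) / 1_000_000


def random_excerpt(text, n_words=300, seed=None):
    """Return n_words consecutive whitespace-separated words from text.

    If the text has n_words or fewer words, the whole text is returned.
    """
    words = text.split()
    if len(words) <= n_words:
        return " ".join(words)
    rng = random.Random(seed)
    start = rng.randrange(len(words) - n_words + 1)
    return " ".join(words[start:start + n_words])


if __name__ == "__main__":
    path = "corpus.txt"  # hypothetical file name
    print(f"{size_in_mb(path):.2f} MB")
    with open(path, encoding="utf-8") as f:
        corpus = f.read()
    print(random_excerpt(corpus, n_words=300))
```

A random window will usually start and end mid-sentence, which is fine for this purpose. Note that the excerpt loses its original line breaks because the words are rejoined with single spaces.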

Getting help

Come to office hours, or send me an email with subject line "DATA 470 Homework #0 help!!"