DATA 419 — Data Mining (Machine Learning) — Fall 2022

Homework #1 — Data prep

Possible experience: +40XP

Due: Thu Sept 1st, midnight

Overview

For the first half of this course, you'll need two rectangular data sets that can be loaded into Pandas DataFrames. To concoct these, you'll first need to scour the internet looking for data on topics of personal interest. You'll then need to do whatever fiddly, hassle-y stuff is required to get each of your two data sets into a .csv file that Pandas can read.

Requirements

Your two data sets should be on different topics, and must not be the same as anyone else's in the class.

Each data set must have:

Import code

In addition to turning in your two .csv files, you'll also turn in one .py file which, when executed in the same directory as your .csv files, reads both files into Pandas DataFrames and prints out the first five rows of each. Depending on how much additional processing your data sets need, this .py file could be as simple as back-to-back pd.read_csv() and print() calls. Or, it might involve doing some data transformations to build additional columns, recode data, convert to dates/times, etc.
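A minimal sketch of what the processing in such a .py file might look like. Everything specific here is an assumption for illustration: the column names, the sample rows, and the file name "movies.csv" are hypothetical, and an in-memory string stands in for the file so the sketch runs anywhere. In a real turn-in, the file name would be passed to pd.read_csv() directly.

```python
import io

import pandas as pd

# Stand-in for the contents of a hypothetical file such as "movies.csv";
# a real script would call pd.read_csv("movies.csv") instead.
SAMPLE_CSV = """title,release_date,gross_millions
Alpha,2019-05-03,120.5
Beta,2020-11-20,87.0
Gamma,2021-07-09,240.1
"""


def load_movies(source):
    """Read the CSV, then do some light cleanup before previewing it."""
    df = pd.read_csv(source)
    # Convert the text column to real dates/times.
    df["release_date"] = pd.to_datetime(df["release_date"])
    # Build an additional column from an existing one.
    df["release_year"] = df["release_date"].dt.year
    # Recode a numeric column as a yes/no flag.
    df["blockbuster"] = df["gross_millions"] > 100
    return df


movies = load_movies(io.StringIO(SAMPLE_CSV))
print(movies.head())  # head() shows the first five rows by default
```

A second data set would get the same treatment: one more read, whatever cleanup it needs, and one more print().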

Where do I start?

Google.

What am I looking for?

Anywhere you can find data in a somewhat structured format. You might get lucky and find something already in .csv (or .xls) format that you can touch up. Web pages that contain tables (as many Wikipedia pages do) are also good, and can be snarfed into Pandas via pd.read_html().
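As a rough illustration of the pd.read_html() route, the sketch below parses a small made-up HTML table held in a string. The table contents and the "cities" name are hypothetical. With a real page you would pass its URL instead. Note that pd.read_html() needs an HTML parser such as lxml installed, and that it returns a list of DataFrames, one per table it finds.

```python
import io

import pandas as pd

# A made-up page fragment standing in for a real web page with a table.
HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>116000</td></tr>
  <tr><td>Shelbyville</td><td>87000</td></tr>
  <tr><td>Capital City</td><td>1200000</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> on the page.
tables = pd.read_html(io.StringIO(HTML))
cities = tables[0]
print(cities.head())

# Convert the table to CSV text; passing a file name to to_csv()
# would write a .csv file instead.
csv_text = cities.to_csv(index=False)
print(csv_text)
```

On a page with several tables, expect to poke through the list to find the one you want, and to do some cleanup on it afterward.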

Turning it in

To turn in this assignment, send me an email with subject line "DATA 419 Homework #1 turnin", with three attachments:

  1. Your first data set as a .csv file.
  2. Your second data set as a .csv file.
  3. Your .py file, which I can run to load and print the first few rows of each data set.

Important: When I run your .py file, on my machine and my filesystem, it must work! Beware of hardcoded path names in your code: they specify the exact location of the .csv files on your machine, and that location won't exist on mine.
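To make the hardcoded-path warning concrete, here is a small sketch that checks whether a path is absolute under either POSIX or Windows rules. The paths and the is_hardcoded() helper are hypothetical, invented for illustration. A bare file name like "movies.csv" is resolved relative to the directory the script is run from, which is what makes it portable.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical paths, for illustration only.
portable = "movies.csv"
mac_or_linux_only = "/Users/alice/data419/movies.csv"
windows_only = r"C:\Users\alice\data419\movies.csv"


def is_hardcoded(path_text):
    """Return True if the path is absolute under POSIX or Windows rules."""
    return (PurePosixPath(path_text).is_absolute()
            or PureWindowsPath(path_text).is_absolute())


for path_text in (portable, mac_or_linux_only, windows_only):
    label = "hardcoded" if is_hardcoded(path_text) else "portable"
    print(f"{label}: {path_text}")
```

Only the first path would work when the script is run from some other machine's directory that holds the .csv files.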

Getting help

I'm happy to answer questions and help get your data wrangled! Come to office hours, or send me an email with subject line "DATA 419 Homework #1 help!!".