DATA 419 — Data Mining (Machine Learning) — Fall 2022
Possible experience: +40XP
Due: Thu Sept 1st, midnight
For the first half of this course, you'll need two rectangular data sets that can be loaded into Pandas DataFrames. To concoct these, you'll first need to scour the internet looking for data on topics of personal interest. You'll then need to do whatever fiddly, hassle-y stuff is required to get each of your two data sets into a .csv file that Pandas can read.
Your two data sets should be on different topics, and must not be the same as anyone else's in the class.
Each data set must have:
In addition to turning in your two .csv files, you'll also turn in one .py file which, when executed in the same directory as your .csv files, reads both files into Pandas DataFrames and prints out the first five rows of each. Depending on how much additional processing your data sets need, this .py file could be as simple as back-to-back pd.read_csv() and print() calls. Or, it might involve doing some data transformations to build additional columns, recode data, convert to dates/times, etc.
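As a rough sketch of the "additional processing" end of that spectrum, the snippet below reads a data set, converts a column to dates, recodes a column, and builds a new column before printing the first five rows. The data and column names (joined, plan, monthly_fee) are made up for illustration, and the io.StringIO object only stands in for a real file so the sketch runs on its own; in an actual turn-in you would pass your .csv file name to pd.read_csv() instead.

```python
import io

import pandas as pd

# Toy stand-in for a real .csv file (made-up data, hypothetical columns).
# In a real script this would be something like pd.read_csv("my_data.csv").
toy_csv = io.StringIO(
    "name,joined,plan,monthly_fee\n"
    "Ann,2021-03-15,B,10.0\n"
    "Bob,2022-01-02,P,25.0\n"
    "Cy,2020-11-30,B,10.0\n"
)

df = pd.read_csv(toy_csv)

# Convert a text column to real dates/times.
df["joined"] = pd.to_datetime(df["joined"])

# Recode terse category codes into readable labels.
df["plan"] = df["plan"].map({"B": "basic", "P": "premium"})

# Build an additional column from an existing one.
df["annual_fee"] = df["monthly_fee"] * 12

# head() returns the first five rows (or fewer, if the data is shorter).
print(df.head())
```

For the second data set, the same read-then-print pattern would simply be repeated with the other file name.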
Google.
Anywhere you can find data in a somewhat structured format. You might get lucky and find something already in .csv (or .xls) format that you can touch up. Web pages that contain tables (as many Wikipedia pages do) are also good, and can be snarfed into Pandas via pd.read_html().
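Here is a minimal sketch of the pd.read_html() route. It assumes an HTML parser such as lxml is installed alongside Pandas, and it parses a tiny made-up table from a string so that it runs without a network connection; with a real page you would pass the page's URL instead. The thing to notice is that pd.read_html() returns a list of DataFrames, one per table found, so you have to pick out the one you want.

```python
import io

import pandas as pd

# A tiny made-up HTML table, standing in for a real web page.
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(html))
df = tables[0]

# index=False leaves out the row numbers, giving a clean .csv.
# Passing a file name as the first argument would write it to disk instead.
csv_text = df.to_csv(index=False)

print(df.head())
print(csv_text)
```

Real pages often contain several tables, so printing len(tables) and peeking at each one is usually the quickest way to find the right index.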
To turn in this assignment, send me an email with subject line "DATA 419 Homework #1 turnin", with three attachments:
Important: When I run your .py file, on my machine and my filesystem, it must work! Beware of hardcoded path names in your code that give the exact location of the .csv files on your machine, since those locations won't exist on mine.
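To make the hardcoded-path pitfall concrete, here is a small sketch. The helper looks_portable() and the file names are hypothetical, invented just for this illustration. It flags a path as risky if it is absolute under either Windows or Unix-style rules. Since the .py file is run from the same directory as the .csv files, a bare file name is all pd.read_csv() needs.

```python
from pathlib import PurePosixPath, PureWindowsPath


def looks_portable(path_str):
    """Return True if path_str is relative under both Windows and POSIX rules.

    Hypothetical helper for illustration: an absolute path names a spot on
    one particular machine, so it will likely break on anyone else's.
    """
    is_absolute = (
        PureWindowsPath(path_str).is_absolute()
        or PurePosixPath(path_str).is_absolute()
    )
    return not is_absolute


# Fragile: these only exist on the machine where they were written.
print(looks_portable("C:/Users/me/Desktop/hw1/movies.csv"))  # False
print(looks_portable("/home/me/hw1/movies.csv"))             # False

# Portable: resolved against the directory the script is run from.
print(looks_portable("movies.csv"))                          # True
```

In the script itself this boils down to writing pd.read_csv("movies.csv") rather than pd.read_csv() with a full path from your own machine.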
I'm happy to answer questions and help get your data wrangled! Come to office hours, or send me an email with subject line "DATA 419 Homework #1 help!!"