For our last trick...

Choose your favorite target categorical variable from your favorite of your two data sets. Ideally, you'll pick one that you had some previous success on earlier in the semester, but not total previous success.
Based on everything you've experimented with this semester, decide:
- whether to treat your target variable "as is," or whether to group some of the labels together (for example, if your target variable is "state," do you really want to predict one of 50 different labels, or does it make sense to group them into "northeast," "southeast," "midwest," and "pacific"?)
- which features to include and exclude
- for your categorical feature(s), whether to one-hot or ordinal encode them
- for your numeric feature(s), whether to standardize them or normalize them or leave them as is
- whether to use PCA, and if so, how many dimensions to reduce to.
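To make the choices above concrete, here is a minimal sketch of how they might fit together in sklearn. Everything in it is an assumption for illustration only: the toy DataFrame, the column names, the `region_of` mapping, and the particular choice of one-hot encoding, standardizing, and reducing to two PCA dimensions. Your own decisions may well differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; replace with your own data set.
df = pd.DataFrame({
    "state": ["VA", "NY", "CA", "OH", "GA", "WA"],
    "income": [52.0, 61.5, 70.2, 48.3, 45.9, 66.0],
    "age": [34, 45, 29, 51, 38, 42],
    "housing": ["rent", "own", "rent", "own", "own", "rent"],
})

# Group a many-valued target into coarser labels (illustrative mapping only).
region_of = {"VA": "southeast", "GA": "southeast", "NY": "northeast",
             "OH": "midwest", "CA": "pacific", "WA": "pacific"}
y = df["state"].map(region_of)
X = df.drop(columns="state")

# One-hot encode the categorical feature and standardize the numeric ones.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["housing"]),
        ("num", StandardScaler(), ["income", "age"]),
    ],
    sparse_threshold=0,  # always return a dense array so PCA can consume it
)

# Optionally reduce the encoded features to two dimensions with PCA.
pipeline = Pipeline([("preprocess", preprocess), ("pca", PCA(n_components=2))])
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)
```

Swapping `OneHotEncoder` for `OrdinalEncoder`, or `StandardScaler` for `MinMaxScaler`, only changes one line each, which makes it easy to compare the alternatives.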
Spend some unrushed time creating a Decision Tree classifier for this target. Use sklearn's DecisionTreeClassifier class, and experiment at some length with the parameters. Produce a tree with between two and five levels, which strikes a good balance between predictive performance and understandability. Make sure the Decision Tree diagram is appropriately colored and labeled and understandable, and save this to include in your scientific report.
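As a starting point, a sketch along these lines fits a shallow tree and saves a colored, labeled diagram. It uses sklearn's built-in iris data purely as a stand-in for your data set, and the parameter values (`max_depth=3`, `min_samples_leaf=5`) and the output file name are assumptions, not recommendations. Note that `max_depth` counts splits below the root, so check how that lines up with the number of levels you want.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data; substitute your own features and target.
iris = load_iris()
X, y = iris.data, iris.target

# A shallow tree trades a little accuracy for readability.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
scores = cross_val_score(tree, X, y, cv=5)
print(f"Mean 5-fold CV accuracy: {scores.mean():.3f}")

# Fit on all the data, then draw the tree with colors and readable labels.
tree.fit(X, y)
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(
    tree,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,    # color each node by its majority class
    rounded=True,
    ax=ax,
)
fig.savefig("decision_tree.png", dpi=150, bbox_inches="tight")
```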
Spend some unrushed time creating a Random Forest classifier for this target. Use sklearn's RandomForestClassifier class, and experiment at some length with the parameters. Try to get the best cross-validation performance you can. Copy your accuracy metrics, confusion matrix, and list of feature importances (in decreasing order of importance) for your scientific report.
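One possible way to gather those three pieces of output is sketched below. It uses sklearn's built-in wine data as a stand-in, and the parameter values are arbitrary starting points that you would tune yourself.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Stand-in data; substitute your own features and target.
wine = load_wine()
X, y = wine.data, wine.target

forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Out-of-fold predictions give one cross-validated prediction per row.
y_pred = cross_val_predict(forest, X, y, cv=5)
accuracy = accuracy_score(y, y_pred)
cm = confusion_matrix(y, y_pred)
print(f"Cross-validated accuracy: {accuracy:.3f}")
print("Confusion matrix (rows = actual, columns = predicted):")
print(cm)

# Feature importances come from a forest fit on all the data.
forest.fit(X, y)
ranked = sorted(
    zip(wine.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print("Feature importances, most important first:")
for name, importance in ranked:
    print(f"  {name:30s} {importance:.3f}")
```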
Finally, spend some unrushed time creating an ensemble classifier using sklearn's VotingClassifier class. Include at least three different classifier algorithms as components of the ensemble, and at least a couple of different parameter settings from each. Experiment until you can get the best predictive accuracy you can. In your scientific report, include the number and types of classifiers that contributed to your ensemble, as well as the accuracy metrics and final confusion matrix.
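Here is a minimal sketch of what such an ensemble could look like. The choice of k-nearest neighbors, logistic regression, and random forest as the three algorithms, the two parameter settings for each, the hard-voting scheme, and the wine stand-in data are all assumptions for illustration. Your ensemble can use whichever algorithms have worked well for you.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; substitute your own features and target.
X, y = load_wine(return_X_y=True)

# Three algorithms, two parameter settings each. The distance-based and
# linear models are wrapped in pipelines so their inputs get standardized.
estimators = [
    ("knn_3", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))),
    ("knn_9", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))),
    ("logit_c1", make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))),
    ("logit_c01", make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000))),
    ("rf_shallow", RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)),
    ("rf_deep", RandomForestClassifier(n_estimators=300, random_state=0)),
]

# Hard voting: each component casts one vote and the majority label wins.
ensemble = VotingClassifier(estimators=estimators, voting="hard")

y_pred = cross_val_predict(ensemble, X, y, cv=5)
accuracy = accuracy_score(y, y_pred)
cm = confusion_matrix(y, y_pred)
print(f"Ensemble cross-validated accuracy: {accuracy:.3f}")
print(cm)
```

Setting `voting="soft"` instead averages predicted probabilities, which requires every component to support `predict_proba`.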
Write a concise yet complete data scientific report, with no grammatical or spelling errors, and with coherent and reasonably-sized paragraphs, documenting all of the above steps. In it:
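1. briefly discuss and justify each choice you made in step 2 (target grouping, feature selection, encoding, scaling, and PCA)
2. describe the results of your experimentation in steps 3-5 (the Decision Tree, the Random Forest, and the ensemble)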
3. referring to your Decision Tree as a figure in the text, narrate how it seems to "work" (i.e., sum up what it's using in its predictive algorithm and speculate as to why)
4. state your overall conclusions about this data set, the algorithms you used, and what you learned from it all.

What I'm looking for

I'm looking for evidence that you learned something this semester. I'm looking for you to take a step back from the details and communicate what this data set has taught you. And I'm looking for the qualities recommended by this article.

Turning it in

To turn in this assignment, send me an email with subject line "DATA 419 Homework #6 turnin". It should have the following attachments:

A single file, called homework6.py, which loads, cleans, and performs ML on your data set. When I run this program, it should print out a brief description of what the dataset and target represent (just like in homework #3), followed by the scores and confusion matrices produced by your decision tree, random forest, and ensemble. I should not have to guess what any of your output means. It should all be perfectly clear.
Your actual data set. If this is too large to attach to an email, upload it somewhere and provide me with a link I can use to download it.
Your scientific report, as a PDF file, with at least one figure (your decision tree) in it.

Important: When I run your .py file, on my machine and my filesystem, it must work! Do not hardcode the absolute path to the .csv files on your machine, because that path won't exist on mine.
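One common way to avoid a hardcoded location is to build the path relative to the script itself. The sketch below assumes the .csv file sits in the same directory as homework6.py; the helper name `data_path` and the file name `mydata.csv` are hypothetical.

```python
from pathlib import Path


def data_path(filename, script_file):
    """Return the path of `filename` in the same directory as the script."""
    return Path(script_file).resolve().parent / filename


# Typical use inside homework6.py (mydata.csv is a placeholder name):
#     import pandas as pd
#     df = pd.read_csv(data_path("mydata.csv", __file__))
```

Because the path is computed from `__file__`, the program finds the data no matter which directory it is run from.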

Getting help

I'm happy to answer questions and help! Come to office hours, or send me an email with subject line "DATA 419 Homework #6 help!!"