DATA 419 — Data Mining (Machine Learning) — Fall 2022
Homework #6 — Trees, ensembles, and putting it all
together
Possible experience: +60XP
Due: Fri Dec 2nd, midnight
For our last trick...
- Choose your favorite categorical target variable from whichever of your
two data sets you prefer.
Ideally, you'll pick one that you had some success with earlier in the
semester, but not total success.
- Based on everything you've experimented with this semester, decide:
  - whether to treat your target variable "as is," or whether to group some of
    the labels together (for example, if your target variable is "state," do
    you really want to predict one of 50 different labels, or does it make
    sense to group them into "northeast," "southeast," "midwest," and
    "pacific"?)
  - which features to include and exclude
  - for your categorical feature(s), whether to one-hot or ordinal encode them
  - for your numeric feature(s), whether to standardize them, normalize them,
    or leave them as is
  - whether to use PCA, and if so, how many dimensions to reduce to
- Spend some unrushed time creating a Decision Tree classifier for
this target. Use sklearn's DecisionTreeClassifier class, and
experiment at some length with the parameters. Produce a tree with between two
and five levels, which strikes a good balance between predictive performance
and understandability. Make sure the Decision Tree diagram is appropriately
colored and labeled and understandable, and save this to include in your
scientific report.
- Spend some unrushed time creating a Random Forest classifier for
this target. Use sklearn's RandomForestClassifier class, and
experiment at some length with the parameters. Try to get the best
cross-validation performance you can. Copy your accuracy metrics, confusion
matrix, and list of feature importances (in decreasing order of importance) for
your scientific report.
- Finally, spend some unrushed time creating an ensemble classifier
using sklearn's VotingClassifier class. Include at least three
different classifier algorithms as components of the ensemble, and at least a
couple of different parameter settings from each. Experiment until you get
the best predictive accuracy you can. In your scientific report, include the
number and types of classifiers that contributed to your ensemble, as well as
the accuracy metrics and final confusion matrix.
- Write a concise yet complete scientific report, with no grammatical or
spelling errors and with coherent, reasonably sized paragraphs, documenting
all of the above steps. In it:
  - briefly discuss and justify each choice you made in step 2 (the target,
    feature, encoding, scaling, and PCA decisions)
  - describe the results of your experimentation in steps 3-5 (the decision
    tree, random forest, and ensemble)
  - referring to your Decision Tree as a figure in the text, narrate how it
    seems to "work" (i.e., sum up what it's using in its predictive
    algorithm and speculate as to why)
  - state your overall conclusions about this data set, the algorithms you
    used, and what you learned from it all.
What I'm looking for
I'm looking for evidence that you learned something this semester. I'm
looking for you to take a step back from the details and communicate what this
data set has taught you. And I'm looking for the qualities recommended by this
article.
Turning it in
To turn in this assignment, send me an email with subject line "DATA 419
Homework #6 turnin". It should have the following attachments:
- A single file, called homework6.py, which loads, cleans, and
performs ML on your data set. When I run this program, it should print out a
brief description of what the dataset and target represent (just like in
homework #3), followed by the scores and confusion matrices produced by your
decision tree, random forest, and ensemble. I should not have to guess what any
of your output means. It should all be perfectly clear.
- Your actual data set. If this is too large to attach to an email, upload
it somewhere and provide me with a link I can use to download it.
- Your scientific report, as a PDF file, with at least one
figure (your decision tree) in it.
Important: When I run your .py file, on my machine and my
filesystem, it must work! Do not hardcode the exact location of the
.csv files on your machine, since that path won't exist on mine.
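
One way to avoid a hardcoded location is to build the path relative to the script itself. The sketch below assumes the .csv file sits in the same directory as homework6.py; the helper name data_path and the file name mydata.csv are hypothetical.

```python
from pathlib import Path


def data_path(filename, script_file):
    """Return the path to `filename` in the same directory as `script_file`."""
    return Path(script_file).resolve().parent / filename


# Inside homework6.py this would typically be called as (file name hypothetical):
#     df = pd.read_csv(data_path("mydata.csv", __file__))
```

Because the path is derived from wherever the script lives, it does not depend on the directory the program is launched from.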
Getting help
I'm happy to answer questions and help! Come to office hours, or send me an
email with subject line "DATA 419 Homework #6 help!!".