
Overview
You may work on any machine learning project of your choice. It could resemble the regression, classification, or clustering problems that we will cover in class, or it could apply machine learning to text (such as sentiment analysis or text generation), to images (such as image recognition or image generation), or to another area of machine learning altogether.
You must complete this project individually.
Requirements
The first requirement for the project is that you have created a workflow plan for each section of the ML workflow process, and have code that implements that plan. There is a template up on Moodle for you to copy. You may make any changes you want to this template.
Your workflow should have some form of data pre-processing, some kind of exploratory data analysis, training and testing of machine learning algorithms (including some parameter tuning), and evaluating the performance of your models.
The second requirement for the project is that it use one of the following datasets. These datasets cover a variety of topics and have enough records to build substantial models. If you wish, you may choose a dataset that is not on this list (both the UCI repo and data.world are good places to start), but you’ll need to check with your instructor first.
- Cardiotocography
- Human Activity Recognition Using Smartphones
- Letter Recognition
- Turkiye Student Evaluation
- Wine Quality
- Bank Marketing Data Set
- Contraceptive Method Choice Data Set
- Crimes in Chicago
- Default of Credit Card Clients Data Set
- Online Shoppers Purchasing Intention Dataset
- Synthetic Financial Datasets for Fraud Detection (fake data set, but should still provide useful practice)
- Rice Variety Recognition
- Mice Protein Expression
Project Check-ins
There are nine progress checks during the semester, each worth 4 points (see the schedule). To receive full credit for a progress check, you must check in with your instructor before leaving class on the day of the check.
PC0
Fork the project site and deploy it via GitPages.
slides
PC1
Identify three datasets that you’re interested in. If you include a dataset not on the list above, you need to check with your instructor. Include links to each dataset.
slides
PC2
Select one dataset and do some background research. Make sure your data is accessible on a git repo.
Depending on your project, this might be as simple as locating a git repo that has your data set, or it might require you to upload the dataset to your personal git repo. Additionally, this could involve combining multiple data sources, or gathering data yourself. If your project requires you to gather your own data, please discuss this with your instructor before beginning, so that we can make sure you have a realistic plan to gather enough data.
PC3
Perform some exploratory data analysis. Discuss the results, and what this tells you about your data, as well as your expectations for your results. This should include some data visualizations that go beyond a basic chart.
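As a starting point before any plotting, the sketch below shows the kind of numeric summary an exploratory analysis might begin with: class balance, per-column descriptive statistics, and a correlation between two features. The records, column names, and helper functions are all hypothetical and not taken from any dataset on the list; it uses only the Python standard library, whereas a real project would more likely rely on a data analysis library.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy records: (feature_a, feature_b, label).
# These values are made up for illustration only.
rows = [
    (5.1, 3.5, "low"),
    (4.9, 3.0, "low"),
    (6.2, 2.9, "high"),
    (5.9, 3.2, "high"),
    (6.7, 3.1, "high"),
    (5.0, 3.4, "low"),
]


def class_balance(records):
    """Count how many records fall under each label (last field of each record)."""
    return Counter(label for *_, label in records)


def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


feature_a = [r[0] for r in rows]
feature_b = [r[1] for r in rows]

balance = class_balance(rows)
print("class balance:", dict(balance))
print("feature_a summary:", summarize(feature_a))
print("correlation(a, b):", pearson(feature_a, feature_b))
```

Numbers like these are useful for deciding which visualizations are worth making, for example whether an imbalanced label needs special attention or whether two features are nearly redundant.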
PC4
Time to do some research. What is the goal of the model you plan to build? Has something similar been done with this type of data? Write up the background information, making sure to link your sources.
PC5
Prepare your data set for training. This might include handling missing or categorical values, basic conversions, or more extensive conversions. Any choices that you make should be explained. Additionally, spend time making good visualizations.
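Two of the steps mentioned above, handling missing values and converting categorical values, can be sketched as follows. This is a minimal illustration under stated assumptions: the records, the column names `age` and `job`, the `"unknown"` placeholder, and both helper functions are hypothetical, and mean imputation and one-hot encoding are only one possible choice for each step.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "job": "admin"},
    {"age": None, "job": "technician"},
    {"age": 50, "job": "admin"},
    {"age": 42, "job": None},
]


def impute_mean(rows, column):
    """Replace missing numeric values in `column` with the mean of observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column, missing_label="unknown"):
    """Replace a categorical column with one 0/1 indicator column per category.

    Missing categories are first mapped to `missing_label`, so they get
    their own indicator column instead of being silently dropped.
    """
    filled = [
        {**r, column: missing_label if r[column] is None else r[column]}
        for r in rows
    ]
    categories = sorted({r[column] for r in filled})
    encoded = []
    for r in filled:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded


prepared = one_hot(impute_mean(records, "age"), "job")
for row in prepared:
    print(row)
```

In a real project, the fill value and the category list should be computed from the training split only and then reused on the test split, so that no information from the test data leaks into preprocessing. Whichever strategy you pick, the explanation of why you picked it is part of this check-in.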
PC6
Plan out your modeling workflow: which models would you like to use, and how will you evaluate them? Start building your presentation.
PC7
Train models using machine learning algorithms. Although you do not need to use every algorithm that we’ve covered, you should use those that make sense for your data set. You should explain your choices, and perform parameter tuning in order to optimize their performance. Continue building your presentation.
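To make the idea of parameter tuning concrete, here is a minimal sketch that tries several values of one parameter and keeps the one that scores best on held-out validation data. It assumes a hand-written k-nearest-neighbors classifier and tiny made-up data (including one deliberately mislabeled training point); the function names and values are hypothetical, and in practice you would normally use a library's model and search utilities rather than writing your own.

```python
from collections import Counter


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training examples."""
    by_distance = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation examples that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def tune_k(train, validation, candidates):
    """Score every candidate k on the validation set and return the best one."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # ties go to the earliest candidate
    return best_k, scores


# Hypothetical data: two clusters plus one mislabeled ("noisy") training point.
train = [
    ((0.0, 0.0), 0), ((0.5, 0.2), 0), ((0.1, 0.6), 0),
    ((5.0, 5.0), 1), ((5.3, 4.8), 1), ((4.7, 5.2), 1),
    ((0.3, 0.35), 1),  # noisy label sitting inside the class-0 cluster
]
validation = [
    ((0.3, 0.3), 0), ((4.9, 5.1), 1), ((0.4, 0.1), 0), ((5.2, 5.2), 1),
]

best_k, scores = tune_k(train, validation, candidates=[1, 3, 5])
print("validation accuracy per k:", scores)
print("selected k:", best_k)
```

The parameter is chosen on validation data, not on the test set. The test set should be used only once, at the end, to report how the tuned model performs on data it has never influenced.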
PC8
Discussion of results. Compare the performance of your models, and discuss the results. Do you consider your results successful? Explain why or why not, and discuss how further improvements might be made.
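When comparing models, a single number rarely tells the whole story. The sketch below computes accuracy, precision, and recall for two hypothetical classifiers and a trivial "always predict the majority class" baseline. The labels and predictions are invented for illustration, and the choice of these three metrics is an assumption; the right metrics depend on your problem.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    """Compute accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground truth and predictions (1 = positive class).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 0, 0],
    "model_b": [1, 1, 0, 1, 0, 0, 1, 1],
    "baseline": [0] * len(y_true),  # always predicts the majority class
}

results = {name: report(y_true, y_pred) for name, y_pred in predictions.items()}
for name, metrics in results.items():
    print(name, metrics)
```

In this made-up example the two models reach the same accuracy but trade precision against recall differently, which is exactly the kind of difference worth discussing. Including a simple baseline also helps answer whether your results count as successful.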
Continue working on your final presentation. You want a presentation that is digestible to a non-technical audience and is interesting in some way (visualizations, insights, models, data).
PC9
Evaluation of your plan. Discuss how closely you were able to follow your workflow. What unexpected issues did you encounter? What adjustments did you need to make? How would this affect planning a future machine learning project? Finalize your presentation, and submit your link along with a screenshot to Moodle.
Final Submission
The final submission evaluates the overall quality of your project and presentation, based on the materials submitted prior to the final exam. Submissions are due the night before finals.
A detailed grading rubric is TBD.
Small group presentations
(12 pts): During the presentation sessions, you will take part in three rounds of small-group presentations, each time presenting one-on-one to the author of another project. Grading is based on the quality of your peer review during these rounds.
The peer review form is available here
