Overview

You may work on any machine learning project you choose. It could resemble the regression, classification, or clustering problems that we will cover in class, involve text (such as sentiment analysis or text generation), involve images (such as image recognition or image generation), or address another area of machine learning altogether.

You must complete this project individually.

Requirements

The first requirement for the project is that you create a workflow plan for each section of the ML workflow process and write code that implements that plan. Create a Colab workbook and include this plan at the beginning of your code notebook.

Your plan should include the following:

What will be your goal in working on this project?
What will be your source for data for the project? If this requires you to gather data, how will you go about this?
What data cleaning, conversion, or preparation will you need to do to prepare your data set?
What kinds of exploratory data analysis will you do?
Which machine learning algorithms will you use to train models? Why are you choosing these algorithms?
How will you attempt to optimize your models?
How will you analyze the accuracy of your models?

Your workflow should have some form of data pre-processing, some kind of exploratory data analysis, training and testing of machine learning algorithms (including some parameter tuning), and evaluating the performance of your models. It is very important that your plan is complete and very detailed, in order to show a thorough understanding of the process of training a model using machine learning, in case you don’t have time to fully complete your project.

The second requirement for the project is that it use one of the following datasets. These datasets cover a variety of topics and have enough records to build substantial models. If you wish, you may choose a dataset that is not on this list (the UCI repo and data.world are both good places to start), but you’ll need to check with your instructor first. Look for data with more than 5k records and both numeric and categorical features.

Cardiotocography
Human Activity Recognition Using Smartphones
Letter Recognition
Turkiye Student Evaluation
Bank Marketing Data Set
Contraceptive Method Choice Data Set
Crimes in Chicago
Default of Credit Card Clients Data Set
Online Shoppers Purchasing Intention Dataset
Synthetic Financial Datasets for Fraud Detection (fake data set, but should still provide useful practice)
Rice Variety Recognition
Mice Protein Expression

Project Check-ins

There are nine progress checks during the semester, each worth 4 points (see the schedule). To receive full credit for a progress check, you must submit work before the next lecture’s afternoon session.

PC0

Read over the project, taking a look at deadlines and deliverables. Then start to look at datasets you’re interested in. You can certainly use the ones given, but for this check-in you are required to find an outside data source and post it on the Forum under “Places to find additional data sources”.

PC1

Identify three datasets that you’re interested in. For each dataset, include a link, a two-sentence description, and your reasoning for selecting it. Post on the Forum under “Three datasets of interest”.

PC2

Create a Colab workbook and copy-paste the workflow plan above into a text box. Identify the dataset you’d like to use for your project, and fill out the project goal (first question). You may have to do some background research on your topic. Then fill out the next three questions.

Next, make sure your data is accessible on a git repo somewhere so you can get started on the next PC. Depending on your project, this might be as simple as locating a git repo that has your data set, or it might require you to upload the dataset to your personal git repo.
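As a minimal sketch of what "accessible on a git repo" can look like in practice, the snippet below reads a CSV file into pandas. The URL in the comment is a hypothetical placeholder, not a real location, and the column names are made up; an in-memory sample stands in for the file so the sketch runs without network access.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Hypothetical raw-file URL (an assumption); replace it with the location of your data.
# DATA_URL = "https://raw.githubusercontent.com/<user>/<repo>/main/data.csv"
# df = load_dataset(DATA_URL)

# Offline stand-in with made-up columns so the sketch runs anywhere.
sample_csv = io.StringIO("age,job,balance\n30,admin,1200\n45,technician,\n")
df = load_dataset(sample_csv)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # inferred type of each column
```

Checking the shape and column types right after loading is a quick way to confirm that the file parsed the way you expected before moving on to the next check-in.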

Submit the link to your Colab workbook to the Moodle submission.

PC3

Perform some exploratory data analysis using your workbook. Discuss the results, and what this tells you about your data, as well as your expectations for your results. This should include some data visualizations that go beyond a basic chart.
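One possible starting point for the analysis is sketched below using pandas only. The data frame is toy data with made-up columns (`age`, `balance`, `job`, `subscribed`), assumed purely for illustration; your own dataset and questions will differ. The resulting tables can be passed to whichever plotting library you prefer.

```python
import pandas as pd

# Toy data standing in for a project dataset (all values are made up).
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "balance": [1200.0, 560.0, None, 3100.0, 150.0, 980.0],
    "job": ["admin", "technician", "admin", "retired", "technician", "admin"],
    "subscribed": [0, 0, 1, 1, 0, 1],
})

summary = df.describe()                    # count, mean, std, quartiles for numeric columns
missing = df.isna().sum()                  # number of missing values per column
class_balance = df["subscribed"].value_counts(normalize=True)  # share of each target class
rate_by_job = df.groupby("job")["subscribed"].mean()           # target rate per category
correlations = df.select_dtypes("number").corr()               # pairwise numeric correlations

print(summary)
print(missing)
print(class_balance)
print(rate_by_job)
print(correlations)
```

Tables like `rate_by_job` and `correlations` are natural inputs for grouped bar charts or heatmaps, which is one way to go beyond a basic chart.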

Submit a picture of a plot you made along with a quick reflection to the Moodle submission.

PC4

Time to do some more background research. By now, hopefully you have identified the goal of the model you plan to build. Your task is to see if something similar has been done with this type of data. Do a brief write-up of background information (one paragraph), making sure to link sources.

You should also continue to build out your workflow plan, adding more details as you learn new material in class.

PC5

Prepare your data set for training. This might include handling missing or categorical values, basic conversions, or more extensive conversions. Any choices that you make should be explained. Additionally, you may want to consider creating more visualizations using this pre-processed (or what we call “clean”) data.
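A hedged sketch of one way to handle missing and categorical values is shown below, using scikit-learn's `ColumnTransformer`. The column names and the choices made here (median imputation, most-frequent imputation, scaling, one-hot encoding) are assumptions for illustration only; the right choices depend on your data, and each one should be explained in your workbook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with missing values in numeric and categorical columns (made up).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51],
    "balance": [1200.0, 560.0, 3100.0, np.nan],
    "job": ["admin", "technician", np.nan, "admin"],
})
numeric_cols = ["age", "balance"]
categorical_cols = ["job"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most common value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X_clean = preprocess.fit_transform(df)
print(X_clean.shape)  # rows x (numeric columns + one-hot columns)
```

Keeping the steps in a single transformer makes it easier to fit them on the training data only and then apply the same transformation to the test data.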

PC6

Finish laying out your project workflow and begin building your presentation. Think about how to make the insights from your analysis, your model goal, or the background information engaging for a general audience. Use this checkpoint to explore what makes a story compelling and experiment with ways to bring that story to life in your presentation.

PC7

Train models using machine learning algorithms. Although you should not use every algorithm that we’ve covered, you should use those that make sense for your data set. You should explain your choices, and perform parameter tuning in order to optimize their performance.
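Parameter tuning can be sketched as a cross-validated grid search, as below. The synthetic data, the decision tree, and the parameter grid are all assumptions chosen to keep the example small; they are not a recommendation for any particular dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned project dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate settings to try; the values here are illustrative.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)  # best combination found by cross-validation
print(search.best_score_)   # mean cross-validated accuracy of that combination
test_accuracy = search.score(X_test, y_test)  # accuracy on held-out data
print(test_accuracy)
```

The test split is set aside before tuning and scored only once at the end, so the reported accuracy is not inflated by the search itself.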

Add a slide to your presentation describing the algorithms and tuning you did, making sure to make it relatable for a general audience.

PC8

Discussion of results. Compare the performance of your models, and discuss the results. Do you consider your results successful? Explain why or why not, and discuss how further improvements might be made.
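One way to organize the comparison is to score every model on the same held-out split, including a trivial baseline, as in the sketch below. The synthetic data, the three models, and the two metrics are assumptions for illustration; choose metrics that fit your own goal.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned project dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Fit each model on the training split and score it on the same test split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions, zero_division=0),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

Including a baseline gives the discussion a reference point: a model is only worth calling successful if it clearly beats always predicting the most common class.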

Continue working on your final presentation. You want a presentation that is digestible to a non-technical audience and is interesting (visualizations, insights, models, data).

PC9

Evaluation of your plan. Discuss how closely you were able to follow your workflow. What unexpected issues did you encounter? What adjustments did you need to make? How would this affect planning a future machine learning project? Finalize your presentation, and submit your link along with a screenshot to Moodle.

Final Submission

The final submission evaluates the overall quality of your project and presentation, based on the materials submitted prior to the final exam. Submissions are due by the last day of class.

A detailed grading rubric

Presentations

(12 pts): During presentations, you will participate in three rounds of small-group presentations, presenting one-on-one with another project. Grading is based on the quality of your peer review during these rounds.

The peer review forms are available on Moodle.