Data Preparation & Cleaning for ML

3h 7m 23s
English
Paid

Data Preparation & Cleaning for ML is an 18-lesson, 3-hour 7-minute self-paced course by Andrew Jones.

Course facts

Lessons: 18
Duration: 3 hours 7 minutes
Level: All levels
Language: English
Instructor: Andrew Jones
Price: Premium

Have you ever heard the expression "data preparation and cleaning"? It is perhaps the most important part of the entire machine learning process. Real-world data is often "messy": it can contain errors, missing values, duplicates, and outliers, all of which can bias a model or degrade its performance. That is why data must be cleaned and made ready for analysis before it is used.

Understanding Data Preparation and Cleaning

Simply put, data preparation and cleaning are the practical answer to the principle of "garbage in, garbage out." Identifying and correcting errors, removing corrupted and duplicate records, filling in missing values, and handling outliers are all essential preparation steps. The process can be labor-intensive, but data quality largely determines a project's success. Even the most advanced machine learning algorithms will not produce reliable results when trained on "dirty" data.
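
As a rough illustration of the steps named above, here is a minimal sketch in plain Python (standard library only, so each step stays visible). The records, the choice of median imputation, and the 1.5 * IQR outlier rule are assumptions made for this example; they are not taken from the course, which works with Pandas and its own datasets.

```python
from statistics import median, quantiles

# Hypothetical toy records: (customer_id, age). None marks a missing value.
raw = [(1, 34), (2, None), (2, None), (3, 29), (4, 31), (5, 240), (6, 38)]

# Step 1: remove exact duplicate records while keeping the original order.
deduped = list(dict.fromkeys(raw))

# Step 2: fill missing ages with the median of the observed ages.
# The median is used here because it is less sensitive to extreme values.
observed = [age for _, age in deduped if age is not None]
fill_value = median(observed)
filled = [(cid, fill_value if age is None else age) for cid, age in deduped]

# Step 3: flag outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = quantiles([age for _, age in filled], n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(cid, age) for cid, age in filled if age < lower or age > upper]

print("records after deduplication:", len(deduped))
print("flagged as outliers:", outliers)
```

Whether a flagged record should be corrected, capped, or dropped depends on the dataset; the sketch only identifies candidates.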

Importance of Quality Data

High-quality data is the foundation of an effective machine learning model. Without it, even the best algorithms cannot perform optimally. Therefore, investing time and effort in data cleaning is indispensable for your project's success.

Course Overview

To ensure you feel confident in your ML projects, this mini-course will cover everything you need to know about data preparation.

What You Will Learn

  • Start with a checklist of 8 key steps to keep in mind when launching any project.
  • Delve into theory, including missing values, outliers, feature selection, and more.
  • Move on to practice, where for each segment you'll complete tasks in Python, working with real data.

Who teaches Data Preparation & Cleaning for ML? Andrew Jones

Andrew Jones is a data engineer and ML educator focused on the unsexy but critical foundation of machine-learning work: the data preparation and cleaning craft that determines whether downstream models are even worth training.

His CourseFlix listing carries Data Preparation & Cleaning for ML — a structured treatment of the patterns and tooling for taking raw data through to ML-ready feature sets, covering missing-value handling, outlier detection, encoding strategies for categorical variables, scaling, and the validation patterns that catch data-quality issues before they corrupt model training.

Material is paid and aimed at engineers and analysts entering production ML work. For broader content, see CourseFlix's Machine learning category page.

What lessons are included in Data Preparation & Cleaning for ML?

All Course Lessons (18)
#   Lesson Title                              Duration  Access
1   Introduction                              01:02     Demo
2   ML Prep Checklist                         07:18
3   Theory Missing Values                     08:48
4   Missing Values with Pandas                12:43
5   Missing Values with SimpleImputer         11:06
6   Missing Values with KNNImputer            11:50
7   Theory Categorical Variables              08:19
8   Categorical Variables One-Hot-Encoding    10:51
9   Theory Outliers                           08:56
10  Outliers hands-on                         13:35
11  Theory Feature Scaling                    09:20
12  Feature Scaling hands-on                  08:19
13  Theory Feature Selection                  12:05
14  Practical Correlation Matrix              04:27
15  Practical Univariate Testing              17:54
16  Practical RFECV                           13:49
17  Theory Model Validation                   08:54
18  Practical Model Validation                18:07

Books

Read Book Data Preparation & Cleaning for ML

#   Title                                     Type
1   Note to students from Andrew Jones        PDF

Frequently asked questions

What are the prerequisites for this course?
The course is designed for individuals with a basic understanding of machine learning principles. Familiarity with Python and libraries such as Pandas is beneficial since some lessons involve hands-on exercises using these tools. There are no specific advanced prerequisites, making it accessible to those who are at an intermediate level in data science or machine learning.
What will participants build or accomplish by the end of the course?
Participants will develop a solid understanding of the data preparation and cleaning techniques that machine learning projects depend on. They will learn to apply a checklist of 8 key steps, handle missing values using Pandas, SimpleImputer, and KNNImputer, encode categorical variables, and manage outliers and feature scaling. Practical exercises include feature selection with correlation matrices, univariate testing, and RFECV, followed by model validation.
Who is the target audience for this course?
This course is aimed at data scientists, machine learning engineers, and analysts who need to improve their skills in data preparation and cleaning. It is suitable for those looking to enhance their ability to preprocess data for machine learning models to ensure optimal performance.
How does the depth of this course compare to similar courses?
The course is a mini-course focusing specifically on data preparation and cleaning, offering a concentrated overview of essential techniques. While other courses may cover data preparation as part of a broader curriculum, this course provides a focused exploration of the subject, including theory and practical applications of key data cleaning and readiness techniques.
What specific tools or platforms are covered in the course?
The course covers data preparation techniques using Python libraries such as Pandas. It includes lessons on using SimpleImputer and KNNImputer for handling missing values and demonstrates practical applications of one-hot encoding for categorical variables. The course also delves into feature selection using correlation matrices and RFECV.
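For readers unfamiliar with the tools named above, the following is a minimal sketch of how SimpleImputer, KNNImputer, and one-hot encoding are typically called in scikit-learn. The feature values, the column meanings, and the parameter choices (mean strategy, 2 neighbours) are assumptions made for illustration and are not drawn from the course material.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical numeric features: age, income, tenure. np.nan marks a gap.
X = np.array([
    [25.0, 50_000.0, 1.0],
    [np.nan, 52_000.0, 2.0],
    [40.0, np.nan, 8.0],
    [42.0, 81_000.0, 9.0],
    [38.0, 78_000.0, 7.0],
])

# SimpleImputer: replace each missing value with the mean of its column.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNNImputer: replace each missing value with the average of that feature
# over the 2 nearest rows, measured on the features both rows have.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# One-hot encoding: one binary column per category, in sorted order.
colors = np.array([["red"], ["green"], ["red"], ["blue"]])
encoder = OneHotEncoder(handle_unknown="ignore")
colors_encoded = encoder.fit_transform(colors).toarray()

print(X_mean[1, 0], X_mean[2, 1])
print(encoder.categories_[0], colors_encoded[0])
```

Because KNNImputer is distance-based, features on very different scales (such as age and income here) are usually scaled before imputation; that step is left out to keep the sketch short.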
What topics are not covered in this course?
The course does not cover advanced machine learning algorithms or the deployment of machine learning models. It focuses on the data preparation and cleaning phase, which precedes model training and deployment. Participants looking for courses on algorithm development or deployment strategies will need to explore other offerings.
What is the carry-over value of this course to other areas or careers?
The skills acquired in this course are crucial for any role involving data science or machine learning. Mastery of data preparation and cleaning is fundamental to building reliable models, enhancing data quality, and improving model performance. These skills are transferable across various projects and industries, making them valuable for career advancement in data-related fields.