MLJourney
Day 4
Week 1

Data Cleaning

Overview

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

A commonly cited estimate is that data scientists spend up to 80% of their time cleaning and preparing data, which makes data cleaning a crucial skill.

Key Concepts
  • Handling missing values
  • Dealing with outliers
  • Fixing data types
  • Removing duplicates
  • Feature scaling and normalization
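
As a rough illustration of the concepts above, the sketch below first diagnoses a small table (missing values, duplicate rows) and then applies two common scaling methods. It assumes pandas is installed; the DataFrame and its column names (age, income) are hypothetical and not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 32.0],
    "income": [40000.0, 52000.0, 61000.0, 1000000.0, 52000.0],
})

# Diagnose before changing anything.
missing_per_column = df.isna().sum()          # count of missing values per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row

income = df["income"]

# Min-max normalization: rescales values into the range [0, 1].
income_minmax = (income - income.min()) / (income.max() - income.min())

# Z-score standardization: zero mean, unit (population) standard deviation.
income_zscore = (income - income.mean()) / income.std(ddof=0)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Note that min-max scaling is sensitive to extreme values: the single very large income here compresses every other value toward 0, which is one reason outliers are usually examined before scaling.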

Practice Exercise

Exercise: Clean a Messy Dataset

Using the 'Dirty Data' dataset provided:

  1. Identify and handle missing values appropriately
  2. Convert data types to their proper format
  3. Detect and handle outliers
  4. Remove duplicate entries
  5. Create a clean version of the dataset ready for analysis
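
One possible way to approach the steps above is sketched here with pandas. The 'Dirty Data' file's actual columns are not described in this lesson, so the table below (id, age, salary, signup_date) is a hypothetical stand-in, and the choices made (median imputation, the 1.5 × IQR rule, dropping rather than capping outliers) are assumptions, not the required solution. The order also differs slightly from the list: types are converted first so that unparseable values surface as missing, and duplicates are removed before statistics are computed.

```python
import pandas as pd

# Hypothetical stand-in for the 'Dirty Data' dataset: everything arrives as text.
raw = pd.DataFrame({
    "id": ["1", "2", "3", "4", "4", "5", "6"],
    "age": ["34", "29", None, "41", "41", "38", "36"],
    "salary": ["52000", "48000", "51000", "55000", "55000", "950000", "50000"],
    "signup_date": ["2023-01-05", "2023-02-11", "2023-02-20", "2023-03-02",
                    "2023-03-02", "2023-03-15", "2023-04-01"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Convert data types; values that cannot be parsed become NaN/NaT.
    out["id"] = out["id"].astype(int)
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    out["salary"] = pd.to_numeric(out["salary"], errors="coerce")
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")

    # Remove exact duplicate rows so they do not skew the statistics below.
    out = out.drop_duplicates()

    # Handle missing values: fill missing ages with the median (an assumption;
    # dropping the rows or using another statistic may suit other data better).
    out["age"] = out["age"].fillna(out["age"].median())

    # Detect outliers with the 1.5 * IQR rule and drop them.
    q1, q3 = out["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out[in_range]

    # Return a clean copy with a fresh index, ready for analysis.
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whether an outlier should be dropped, capped, or kept depends on whether it is an error or a genuine extreme value, so that step in particular is worth revisiting against the real dataset.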

Resources

Kaggle Data Cleaning

Main resource for today

Data Cleaning with Python

Real Python tutorial

Handling Missing Data

Towards Data Science article

Outlier Detection

Methods for detecting and handling outliers
