Course Description
We know that data is often messy and comes in a variety of forms. As part of the overall data mining and machine learning process, we must take the time to preprocess our data. This means we must ensure that it is structured and cleansed, and that any problems it may have are addressed. Preprocessing the data includes gaining a better understanding of the data through descriptive statistics and data visualization techniques. It also includes ensuring that missing data and outliers are handled appropriately.
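As a rough illustration of the steps described above, here is a minimal sketch using only Python's standard library. The sample readings, the function names, the choice of median imputation, and the 1.5 * IQR outlier rule are all assumptions made for this example and are not taken from the course material. Note that `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, 15.0, None, 14.0, 13.0, 120.0, None, 16.0]


def impute_missing(values):
    """Replace None with the median of the observed values (one common choice)."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def summarize(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


filled = impute_missing(raw)
print(summarize(filled))
print(iqr_outliers(filled))
```

Whether a flagged value should be dropped, capped, or kept depends on the data set, so a sketch like this only identifies candidates for review.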
What am I going to get from this course?
- Understand what data preprocessing is and why it is needed as part of an overall data science and machine learning methodology
- Review and understand data quality issues and how to address them
- Apply specific Python functions to assist in cleansing and transforming your data
- Summarize your data using descriptive statistics and data visualization
Prerequisites and Target Audience
What will students need to know or do before starting this course?
Programming Knowledge in Python
- Lists, variables, loops, etc.
Basic Statistics Knowledge
- Inferential and Descriptive Statistics
Python installed on your computer
- I use the Spyder IDE and the Anaconda distribution.
- I have Python 3.6.1 on my machine, so version 3.6 or later should work.
Who should take this course? Who should not?
Individuals with basic Python & statistics knowledge can take this course.
Curriculum
Module 1: Introduction to Data Preprocessing
Lecture 1
What is data preprocessing?
Lecture 2
What is dirty data?
Lecture 3
Structuring Data
Lecture 4
Overview of Data Cleansing
Lecture 6
Data Quality Challenges
Lecture 7
Raw Files and File Formats
Lecture 8
Structured Data
Lecture 9
Finding Data Sets
Lecture 10
Loading Data into Python
Lecture 11
Loading Data into Python Part 2
Module 3: Summarizing Data with Statistics
Lecture 12
Review of Basic Statistics
Lecture 13
Summarizing Data with Python
Module 4: Data Visualization
Lecture 14
Introduction to Data Visualization
Lecture 16
Creating a Histogram
Lecture 20
Missing Data Part 1
Lecture 21
Missing Data Part 2
Lecture 22
Outlier Detection Part 1
Lecture 23
High-Dimensional Data
Lecture 24
Outlier Detection Part 2
Module 6: Feature Scaling
Lecture 25
Introduction to Feature Scaling
Lecture 26
Final Thoughts