Course Description
R is an extraordinarily powerful language with a vast community of great resources, but where should you start when all you want to do is get your data into a usable format? How do you know when your data is ready? What pitfalls should you watch for so that you don’t perform an analysis on bad data?
This course teaches you, from start to finish, how to get your data into R efficiently and polish it until it is as good as it can be. After this step, you or your team can focus on statistical modeling, visualization, reporting, sharing, or any other post-processing task you wish to perform. Confidence, reliability, and reproducibility in data acquisition and preparation are the keys to maximizing your data’s value.
This course uses a variety of real-world data sets that contain real-world data quality, formatting, and other issues. It will ensure that you understand not just the R syntax to perform a task, but also the sources of quality issues, how to recognize hidden data problems, and the benefits and adverse effects of the most common data manipulations. This course will give you real experience in the art and science of data preparation that you can carry forward to your next real project with confidence.
The capstone project utilizes open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not given - the course will have provided the methods and insights needed to prepare this data for future statistical analysis! The capstone project is reviewed by the instructor and feedback is individually provided to each student in the course along with a full project solution.
What am I going to get from this course?
- Understand the R syntax to perform a task
- Identify sources of quality issues
- Recognize hidden data problems
- Understand benefits/detriments of the most common data manipulations
- Prepare a real-world dataset for future statistical analysis and use the capstone project as a portfolio piece
Curriculum
Module 1: Introduction
05:30
Lecture 1
Introduction to the Course
05:30
Course Objectives, Audience and Instructor Information.
Download the entire set of Course Slides as a PDF for taking notes as you work through this course.
Module 2: Data Sources
18:08
Lecture 3
Importance of Metadata
02:11
An overview and understanding of why metadata is important.
Lecture 4
Collection Bias
02:05
Understanding collection bias and why it is critical to keep in mind during data collection and analysis.
Lecture 5
Public Data Sources
03:24
Using public data sources including best practices.
Lecture 6
Private Data
10:28
Defining and understanding private data.
Module 3: Obtaining Data
16:33
Lecture 7
Database Connections
02:40
Connecting to and querying data directly from databases in R
Obtaining data from various file types and formats
Interacting with Hadoop data stores in R
Lecture 10
Mini-Project 1
In this project we are going to obtain the data used in the mini-projects.
Complete this project before the quiz!
Questions related to Mini-Project 1.
Module 4: Cleaning Data
31:05
Dealing with HTML encoding in fields
Dealing with JSON formatted data
Excel-specific data cleaning issues and tips.
Lecture 14
Whitespace/Languages
02:05
Handling whitespace and multi-language issues in R
Lecture 15
Units and Conversions
01:48
Handling unit conversions
Lecture 16
Data Type Issues
01:55
Lecture 17
Categorical Creep
03:47
Recognizing and solving categorical "creep" or spread
Lecture 18
Minor Corrections
01:36
Best Practices for minor corrections
Lecture 19
Completeness
03:50
Overview of detecting and handling completeness issues during data cleaning
Lecture 20
Accuracy
01:31
Notes on accuracy considerations while cleaning data
Module 5: Shaping Data
20:14
Lecture 21
Long vs. Wide Formats
03:09
Understanding and converting between these commonly referenced data shapes
Lecture 22
Combined Data
02:22
Separating combined data in a single field
Lecture 23
Column & Row Names
03:31
Capturing data contained in column and row names
Lecture 24
Internally Structured Data
03:53
Flattening data with embedded structured data
Lecture 25
Internal Lists
04:22
Handling lists inside fields
Lecture 26
Naming Columns
00:58
Quick best-practices and considerations when naming columns
Lecture 27
OLAP Cubes
01:59
Using OLAP cube data in R
Lecture 28
Mini-Project 2
In this project we are going to prepare the data from Mini-Project 1 for analysis.
Complete this project before the quiz!
Questions related to Mini-Project 2
Module 6: Features/Variables
27:45
Lecture 29
Introduction
01:14
Introducing Feature/Variable Selection
Lecture 30
Elimination - Variance
05:06
Eliminating features with zero or near-zero variance
Lecture 31
Elimination - Correlation
05:39
Eliminating features using correlation
Lecture 32
Feature Creation
03:35
Finding and creating features
Lecture 33
Examining Distributions
02:26
Examining variable distributions - continuous data
Lecture 34
Finding Rare Events
02:32
Finding rare events in data that may signal an issue
Lecture 35
Normalization
03:52
Normalizing and rescaling data
Lecture 36
Advanced Preprocessing
02:28
Handling less-common data preprocessing scenarios such as baseline removal.
Comments on selecting features/variables
Lecture 38
Mini-Project 3
In this project we are going to refine the dataset by feature manipulation
Complete this project before the quiz!
Questions related to Mini-Project 3
Module 7: Exporting & Saving
05:07
Lecture 39
Exporting & Saving Prepared Data
05:07
Tips, tricks and notes about exporting and saving your prepared data
Module 8: Data Pipeline
06:25
Lecture 40
Working with R in a Data Pipeline
06:25
Considerations when wrangling data as part of a data pipeline.
Module 9: Conclusion & Capstone
03:52
Lecture 41
Next Steps and Additional Resources
03:52
Lecture 42
Capstone Project
Instructions for the Capstone Project
The capstone project utilizes open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not given - you will be able to use the methods you learned in the class to prepare this data for the project's future statistical analysis.