An industry-recognized certification lets you add this credential to your resume once you complete all courses.

Dr. Connie Brett, Instructor - Data Wrangling in R

Dr. Connie Brett is a successful Data Scientist, Entrepreneur, and Educator who has spent the past 15+ years implementing and coaching analytics teams across the entire SDLC. She brings a unique perspective to the problems faced in every phase of planning, developing, and using online products and solutions, which helps her teach you how to use analytics tools in the most effective way. With an M.S. and Ph.D. in Computational Chemistry from The Ohio State University, she worked in the quagmire of data problems, preparation, and analysis long before the terms "Data Science" and "Big Data" were coined. She has been published in peer-reviewed journals and recently filed for a US Patent on a Data Visualization Framework.

Real-world data preparation for further analysis using R

  • Learn from start to finish how to get your data into R efficiently and polish it up so that it is as good as it can be.
  • The instructor is the founder of the Analytics Incubation Center at Cisco and has 15 years of analytics development experience.
  • Capstone project reviewed by the instructor.

Duration: 2h 15m

Course Description

R is an extraordinarily powerful language with a vast community of great resources, but where should you start when all you want to do is get your data into a usable format? How do you know when your data is ready? What pitfalls should you watch for so that you don’t perform an analysis on bad data? This course will teach you from start to finish how to get your data into R efficiently and polish it up so that it is as good as it can be. After this step, you or your team can focus on statistical modeling, visualization, reporting, sharing, or any other post-processing task you wish to perform. Confidence, reliability, and reproducibility in your data acquisition and preparation are key to maximizing your data’s value.

This course uses a variety of real-world data sets that contain real-world data quality, formatting, and other issues. It will ensure that you understand not just the R syntax to perform a task, but also the sources of quality issues, how to recognize hidden data problems, and the benefits and adverse effects of the most common data manipulations. This course will give you real experience in the art and science of data preparation that you can carry forward to your next real project with confidence.

The capstone project utilizes open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not; the course will have provided the methods and insights needed to prepare this data for future statistical analysis. The capstone project is reviewed by the instructor, and each student receives individual feedback along with a full project solution.

What am I going to get from this course?

  • Understand the R syntax to perform a task
  • Identify sources of quality issues
  • Recognize hidden data problems
  • Understand benefits/detriments of the most common data manipulations
  • Prepare a real-world dataset for future statistical analysis and utilize the capstone project as a portfolio piece.

Prerequisites and Target Audience

What will students need to know or do before starting this course?

  • RStudio installed (optional, but strongly suggested)
  • R installed
  • Basic R programming knowledge

Who should take this course? Who should not?

  • Students do not need to be R experts to take this course, but should have a basic knowledge of how to use R.
  • This course is for people who use data and R and want to better understand how to prepare data for analysis correctly and efficiently.


Module 1: Introduction

Lecture 1 Introduction to the Course

Course Objectives, Audience and Instructor Information.

Lecture 2 Course Slides

Download the entire set of Course Slides as a PDF so you can take notes as you go through the course.

Module 2: Data Sources

Lecture 3 Importance of Metadata

An overview and understanding of why metadata is important.

Lecture 4 Collection Bias

Understanding Collection Bias and why it is critical to keep in mind during data collection and analysis.

Lecture 5 Public Data Sources

Using public data sources including best practices.

Lecture 6 Private Data

Defining and understanding private data.

Module 3: Obtaining Data

Lecture 7 Database Connections

Connecting to and querying data directly from databases in R

Lecture 8 Files

Obtaining data from various file types and formats

Lecture 9 Hadoop

Interacting with Hadoop data stores in R

Lecture 10 Mini-Project 1

In this project we are going to obtain the data used in the mini-projects. Complete this project before the quiz!

Quiz 1 Mini-Project 1

Questions related to Mini-Project 1.

Module 4: Cleaning Data

Lecture 11 HTML

Dealing with HTML encoding in fields

Lecture 12 JSON

Dealing with JSON formatted data

Lecture 13 Excel

Excel-specific data cleaning issues and tips.

Lecture 14 Whitespace/Languages

Handling whitespace and multi-language issues in R

Lecture 15 Units and Conversions

Handling unit conversions

Lecture 16 Data Type Issues

Common data type issues

Lecture 17 Categorical Creep

Recognizing and solving categorical "creep" or spread

Lecture 18 Minor Corrections

Best Practices for minor corrections

Lecture 19 Completeness

Overview of detecting and handling of completeness issues during data cleaning

Lecture 20 Accuracy

Notes on accuracy considerations while cleaning data

Module 5: Shaping Data

Lecture 21 Long vs. Wide Formats

Understanding and converting between these commonly referenced data shapes

Lecture 22 Combined Data

Separating combined data in a single field

Lecture 23 Column & Row Names

Capturing data contained in column and row names

Lecture 24 Internally Structured Data

Flattening data with embedded structured data

Lecture 25 Internal Lists

Handling lists inside fields

Lecture 26 Naming Columns

Quick best-practices and considerations when naming columns

Lecture 27 OLAP Cubes

Using OLAP cube data in R

Lecture 28 Mini-Project 2

In this project we are going to prepare the data from Mini-Project 1 for analysis. Complete this project before the quiz!

Quiz 2 Mini-Project 2

Questions related to Mini-Project 2

Module 6: Features/Variables

Lecture 29 Introduction

Introducing Feature/Variable Selection

Lecture 30 Elimination - Variance

Eliminating features with zero or near-zero variance

Lecture 31 Elimination - Correlation

Eliminating features using correlation

Lecture 32 Feature Creation

Finding and creating features

Lecture 33 Examining Distributions

Examining variable distributions - continuous data

Lecture 34 Finding Rare Events

Finding rare events in data that may signal an issue

Lecture 35 Normalization

Normalizing and rescaling data

Lecture 36 Advanced Preprocessing

Handling less-common data preprocessing scenarios such as baseline removal.

Lecture 37 Wrap-Up

Comments on selecting features/variables

Lecture 38 Mini-Project 3

In this project we are going to refine the dataset by feature manipulation. Complete this project before the quiz!

Quiz 3 Mini-Project 3

Questions related to Mini-Project 3

Module 7: Exporting & Saving

Lecture 39 Exporting & Saving Prepared Data

Tips, tricks and notes about exporting and saving your prepared data

Module 8: Data Pipeline

Lecture 40 Working with R in a Data Pipeline

Considerations when Data Wrangling as part of a data pipeline.

Module 9: Conclusion & Capstone

Lecture 41 Next Steps and Additional Resources

Course Wrap-up

Lecture 42 Capstone Project

Instructions for the Capstone Project. The capstone project utilizes open agricultural industry data in preparation for a future statistical analysis of the products and brands of the companies. Like a real project, the project goals and background are provided but the step-by-step data preparation is not; you will be able to use the methods you learned in the class to prepare this data for the project's future statistical analysis.


6 Reviews

Xiao X

December, 2016

Weldon C

July, 2017

Very comprehensive course. Learned a tremendous amount about R.

Chris B

May, 2017

It was a great experience to take this course from the founder of the Analytics Incubation Center at Cisco. There is no better way to learn than from such an instructor. The instructor lectured very well on analysis with R and how to get data into R efficiently. The confusion about how and where to start getting data into a usable format with R is well explained in the course. It is equally important to know whether your data is ready, and which pitfalls to watch for so that you do not perform an analysis on bad data. The instructor was open to ideas and encouraged contribution and collaboration. The course material is very informative.

Jason C

May, 2017

This course expertly teaches how to get data into R efficiently. Learning this way, I could concentrate on the statistical modeling, reporting, visualization, and sharing. The course built my confidence in data acquisition and preparation, which are the main tasks in maximizing data value.

Jason S

May, 2017

As I am new to the R programming language, this course was of immense help in learning the basics of how to use R and in better understanding how to prepare data for analysis correctly and efficiently.

Victor G

July, 2017

Great experience and overall interesting course. The instructor was remarkably distinct in presentation and you can clearly see the effort that went into producing this course. This is one of the strongest courses on this subject I have taken.