Course Description
Big data remains relatively simple because it scales processing power across several computers, whereas big analytics is more challenging because each dimension must be analyzed differently. In this course, you will learn a framework for generating easy-to-understand algorithms. This will enable you to scale advanced analytics work to high-dimensional data sets.
What am I going to get from this course?
- Large scale data preparation
- Automated data processing
- Large scale reporting tasks
- Mass reporting easily automated
- Large scale statistical modelling
- Predictive modelling with high dimension datasets
- Master big analytics
Curriculum
Module 1: Key concepts
04:49
Lecture 1
Scaling Advanced Analytics
04:49
This introductory section explains the purpose of the course, defines the target audience, and presents the key concepts that will be used.
Module 2: Data-driven programming
07:22
Lecture 2
How Data Driven Programming Works
01:22
This section explains the programming methodology that makes it possible to tackle complex data sets and scale analytics tasks across a high volume of variables. Upon completion of this section, the student should be able to use efficient inception methods to prepare large-scale analytics.
Design a SAS script that can process an Excel list.
Non-technical staff should be able to use the Excel file to select the variables they need for the analysis.
The list should allow numerical and categorical variables to be selected separately.
The SAS script will produce descriptive statistics for each type of variable.
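The idea behind this exercise can be sketched outside SAS. The following is a minimal Python illustration, not the course's SAS script: a small list of variables (standing in for the Excel file) decides which columns are analysed and how. The variable names, the `selected` flag, and the `describe` helper are all hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical stand-in for the Excel list: one row per variable, with a type
# and a flag that a non-technical user could edit. All names are invented.
variable_list = [
    {"name": "age", "type": "num", "selected": True},
    {"name": "income", "type": "num", "selected": False},
    {"name": "region", "type": "cat", "selected": True},
]

# Hypothetical data set stored as column name -> list of values.
data = {
    "age": [23, 35, 41, 58],
    "income": [1800, 2400, 3100, 5200],
    "region": ["north", "south", "north", "east"],
}


def describe(variable_list, data):
    """Return descriptive statistics for every selected variable."""
    report = {}
    for row in variable_list:
        if not row["selected"]:
            continue  # the list, not the code, decides what is analysed
        values = data[row["name"]]
        if row["type"] == "num":
            report[row["name"]] = {
                "n": len(values),
                "mean": statistics.mean(values),
                "min": min(values),
                "max": max(values),
            }
        else:
            # Categorical variables get a frequency table instead.
            report[row["name"]] = dict(Counter(values))
    return report


report = describe(variable_list, data)
for name, stats in report.items():
    print(name, stats)
```

Changing which variables are analysed only requires editing the list, which is the property the exercise asks for.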
Lecture 3
Macro inception type
03:34
We will explain how a single macro, a macro vector, and a macro array help at the initialisation stage.
Lecture 4
Inception table
02:26
Tables can also be used for inception. This video explains the main methods for table inception.
Module 3: Data preparation algorithms
13:22
Lecture 5
Outliers Removal
04:47
This module combines the inception methods introduced earlier with loops to apply the methodology to data preparation. Upon completing this part, the student should be able to automate data preparation steps for statistical modeling with massive data sets. This lecture demonstrates data-driven programming by tailoring an outlier removal algorithm.
The Excel file varlabels.xlsx contains variable labels.
Process this file to automate the allocation of labels to each variable.
Quiz 3
Outliers for Left Skewed Variables
The script studied in section 3 relates to outlier removal for left-skewed variables.
Use the market dataset instead of the airline dataset.
Adapt the algorithm to deal with right-skewed variables as well.
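One way to think about this quiz is sketched below in Python rather than SAS. It is an illustration under stated assumptions, not the script from the lecture: skew direction is guessed by comparing the mean with the median, and outliers are trimmed with a 1.5 × IQR fence on the long-tail side only. The function names, the fence rule, and the sample data are hypothetical.

```python
import statistics


def skew_direction(values):
    """Crude skew check: mean above the median suggests a right tail,
    mean below the median suggests a left tail."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none"


def remove_outliers(values, k=1.5):
    """Drop points beyond k * IQR, on the side of the long tail only."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    direction = skew_direction(values)
    if direction == "right":
        return [v for v in values if v <= q3 + k * iqr]
    if direction == "left":
        return [v for v in values if v >= q1 - k * iqr]
    return list(values)


# Hypothetical variables; the same function is applied to every column.
data = {
    "right_skewed": [1, 2, 2, 3, 3, 3, 4, 4, 50],
    "left_skewed": [-50, 6, 6, 7, 7, 7, 8, 8, 9],
}
cleaned = {name: remove_outliers(values) for name, values in data.items()}
for name, values in cleaned.items():
    print(name, values)
```

Each variable is cleaned independently here, so the resulting columns can differ in length; a row-wise version would need to drop whole observations instead.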
Binning transforms a numerical variable into a categorical one and is often required to run learning algorithms. The following video shows an algorithm that does this sequentially for any number of variables. This is one of the most difficult parts, so you may leave this video for the end.
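As a rough illustration of sequential binning, the Python sketch below assigns each value to an equal-frequency bin and repeats this for every variable in a loop. It is not the algorithm shown in the video, which is written in SAS; the choice of quantile cut points, the function name `quantile_bins`, and the sample data are assumptions made for this example.

```python
import statistics
from bisect import bisect_right


def quantile_bins(values, n_bins=4):
    """Return a bin index (0 .. n_bins - 1) for each value,
    using quantile cut points computed from the values themselves."""
    cuts = statistics.quantiles(values, n=n_bins)
    return [bisect_right(cuts, v) for v in values]


# Hypothetical numerical variables, binned one after another.
data = {
    "tenure": [1, 2, 3, 4, 5, 6, 7, 8],
    "spend": [10, 200, 35, 80, 15, 500, 60, 120],
}
binned = {name: quantile_bins(values) for name, values in data.items()}
for name, bins in binned.items():
    print(name, bins)
```

Because the cut points are derived per variable, the loop works unchanged for any number of variables.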
Lecture 7
Distinct and Missing values
02:32
Variables with too many levels or too many missing values cause stability issues. A simple approach is used here to tackle these issues.
Lecture 8
Balanced Distribution
02:13
Categorical predictors with a balanced distribution lead to more stable statistical models. The lecture explains the approach taken to detect these distributions automatically.
Module 4: Dimension reduction
07:37
Lecture 9
Bivariate Dimension Reduction
04:14
Redundant information is sometimes caused by similar variables. This module uses data-driven methods to apply dimension reduction techniques to massive datasets. The following lecture explains the algorithms used to detect bivariate relationships.
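A simple way to picture bivariate redundancy detection is to compute a correlation for every pair of variables and flag the pairs above a threshold. The Python sketch below does this with Pearson correlation; the lecture's own algorithm is in SAS and may use a different measure. The 0.9 threshold, the function names, and the sample data are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def redundant_pairs(data, threshold=0.9):
    """List variable pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical data: the two height columns carry the same information.
data = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "score": [3, 9, 2, 7],
}
pairs = redundant_pairs(data)
print(pairs)
```

One variable from each flagged pair can then be dropped before modeling.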
Lecture 10
Multivariate Dimension Reduction
03:23
The method for detecting multivariate relationships is explained, and a simple script shows how to use PROC VARCLUS.
Module 5: Regression adjustment algorithms
07:43
Lecture 11
Exceptional Data Points
04:30
With a vast number of variables, adapting the data modeling process can be time-consuming. The examples shown will enable the student to adapt and tailor regression algorithms to enhance modeling performance and to adapt modeling policies. This lecture explains how to remove exceptional data points during the regression process.
Quiz 4
Clustering for Regression
The purpose of this exercise is to select a set of variables and cluster them. The best variable within each cluster will then be selected using a sequence of logistic regressions for each cluster.
Lecture 12
ODS Output as Inception
03:13
This lecture shows how to use 'ods output' in combination with data-driven programming to automatically remove variables that contribute to multicollinearity. The purpose is to enable data scientists to use these programming concepts to develop and tailor their own modeling algorithms easily.