Course Description
Gartner, IBM, Accenture and many others have asserted that 80% or more of the world’s information is unstructured – and inherently hard to analyze. What does that mean? And what is required to extract insight from unstructured data?
Unstructured data is infinitely variable in quality and format because it is produced by humans, who can be fastidious, unpredictable, ill-informed, or even cynical, but who are always unique and never standard in any way. Recent advances in natural language processing make it possible to include unstructured content in data analysis.
Companies that are serious about growth and value are committed to data. The exponential growth of Big Data has posed major challenges in data governance and data analysis. Good data governance is pivotal for business growth.
Therefore, it is of paramount importance to slice and dice Big Data in ways that address data governance and data analysis issues. To support high-quality business decision making, it is important to fully harness the potential of Big Data by implementing proper Data Migration, Data Ingestion, Data Management, Data Analysis, Data Visualization, and Data Virtualization tools.
What am I going to get from this course?
On completion of this course, students will possess an in-depth understanding that will help them to:
- Architect Big Data Infrastructure
- Use Big Data (Hadoop) tools and techniques
- Design and Deploy MapReduce processing
- Perform Big Data Migration from Oracle to HIVE
- Plan & Implement Big Data ETL
- Manage Big Data
- Analyze and Visualize Big Data
These skills will be applied to projects that are either storage-driven or application-driven. These projects may serve the following end goals:
- Big Data Ingestion (ETL)
- Big Data Management (Apache HIVE Data Warehouse)
- Big Data Visualization and Analytics (Tableau / 3-D Dashboard)
- Big Data Migration
- Big Data Integration
Curriculum
Module 1: Introduction to Big Data
Lecture 1
Business Value of Big Data
This class will focus on:
(1) Why Big Data is a big leap forward from the Business Intelligence world of the past, and
(2) Various ways to slice and dice Big Data to extract maximum value from it.
Lecture 2
Rapid Growth of Big Data, Big Data Definition, and Big Data Projects
This class will focus on:
(1) Understanding of the primary drivers for the growth of Big Data and why the Health Care industry is most involved in Big Data analytics,
(2) Understanding of what Big Data is, the hidden value in it, and how new architectures, algorithms, and techniques can be used to extract that hidden value, and
(3) Understanding of the broad characteristics of Big Data projects.
Module 2: Big Data Implementation
Lecture 3
Hadoop Ecosystem, Hadoop Infrastructure, and Hadoop JVM Framework
This class will focus on:
(1) How the Hadoop Ecosystem and Hadoop Infrastructure exploit the latest technologies to support efficient, distributed processing of massive amounts of data,
(2) How to harness the capability of Virtual Machines, which enables the use of a large number of inexpensive commodity servers, and
(3) How Hadoop Infrastructure capitalizes on compute, network, and storage technologies.
Lecture 4
Hadoop Distributed File System (HDFS) and associated tutorials
This class will focus on:
(1) How Hadoop Version 2 manages the cluster of Virtual Machines,
(2) How HDFS incorporates a fast, efficient, and fault-tolerant design, and
(3) File and directory manipulation commands that are used on HDFS.
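As a rough illustration of the fault-tolerant design mentioned in item (2), the sketch below models how a file is cut into fixed-size blocks and how each block is copied to several DataNodes. The 128 MiB block size and replication factor of 3 are the Hadoop 2 defaults; the node names and the round-robin placement are assumptions made for this example only (real HDFS placement is rack-aware).

```python
MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB   # Hadoop 2 default for dfs.blocksize
REPLICATION = 3          # default for dfs.replication


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.

    Simplification for illustration: plain round-robin, not the
    rack-aware policy that a real NameNode applies.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in nodes)
        for nodes in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MIB)
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    placement = place_replicas(len(blocks), nodes)
    print([size // MIB for size in blocks])
    print(placement)
    print(readable_after_failure(placement, "dn2"))
```

Because every block lives on more than one node, losing a single DataNode in this model leaves the whole file readable, which is the property that lets a cluster run on inexpensive commodity servers.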
Lecture 5
MapReduce Software, MapReduce Processing and associated tutorial
This class will focus on:
(1) How the components of the Hadoop Ecosystem are packaged in the Cloudera distribution bundle, which is designed to run on Virtual Machine clusters,
(2) How MapReduce splits the input dataset into independent chunks, which are processed by map tasks in a completely parallel manner, and
(3) Pseudocode for MapReduce Java classes such as Mapper and Reducer.
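The split, map, shuffle, and reduce flow described above can be sketched in a few lines. This is not the Hadoop Java API: it is a minimal single-machine Python analogue of a word count, and the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the `chunk_size` parameter are chosen for this example only. A thread pool stands in for the parallel map tasks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def mapper(chunk):
    """Map phase: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine the values collected for one key."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    """Split the input into independent chunks and run the three phases."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Chunks share no state, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(mapper, chunks))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "big hadoop", "data data"]))
```

The key design point is that `mapper` sees only its own chunk; all cross-chunk coordination happens in the shuffle step, which is what allows the map work to be spread across many machines.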
Module 3: Big Data Migration
Lecture 6
Apache SQOOP - Data Migration, SQOOP commands, and HIVE arguments
This class will focus on:
(1) Apache SQOOP as a powerful data exchange tool, and
(2) SQOOP command-line interface commands for migrating data from an Oracle RDBMS to Cloudera Hive.
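As an illustration of the kind of command covered here, the sketch below assembles a `sqoop import` invocation that reads one Oracle table and loads it into a Hive table. The flags shown are standard Sqoop 1 import and Hive arguments; the host, service name, user, paths, and table names are hypothetical placeholders, and the helper `build_sqoop_import` exists only for this example. The script prints the command rather than running it.

```python
import shlex


def build_sqoop_import(host, port, service, username, password_file,
                       table, hive_table, num_mappers=4):
    """Build the argument list for a Sqoop import from Oracle into Hive."""
    # Oracle thin-driver JDBC URL in the service-name form.
    jdbc_url = f"jdbc:oracle:thin:@//{host}:{port}/{service}"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # avoids a password on the command line
        "--table", table,                   # source table in Oracle
        "--hive-import",                    # load the imported data into Hive
        "--hive-table", hive_table,         # target Hive table
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cmd = build_sqoop_import(
        host="oracle.example.com", port=1521, service="ORCL",
        username="etl_user", password_file="/user/etl/.oracle_pw",
        table="SALES", hive_table="analytics.sales",
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it could later be passed to `subprocess.run` on a machine where Sqoop is installed without any shell quoting issues.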
Lecture 7
SQOOP Architecture and associated tutorial
This class will focus on:
(1) Salient features of Apache SQOOP, such as connectors for all major RDBMSs that load data into Apache HIVE, and
(2) SQOOP Architecture and how its components interact to facilitate data transfer between legacy Enterprise Data Warehouses / RDBMSs and HDFS / Apache HIVE.
Module 4: Big Data Ingestion / Big Data Management
Lecture 8
Tools & Techniques - Informatica BDM and HIVE
This class will focus on:
(1) Importance of Informatica BDM for Data Ingestion and HIVE for Data Management as an effective way to build a Big Data repository for data analytics, and
(2) HIVE architecture and how it supports the HIVE Web Interface and the HIVE Command Line Interface.
Lecture 9
High-Level Tasks to Set Up a Big Data Business Intelligence Application
This class will focus on:
(1) Sequence of tasks required to build a Big Data business intelligence application that will be instrumental in extracting business value from Big Data, and
(2) Technical architecture of a Big Data business intelligence application.
Module 5: Big Data Visualization
Lecture 10
Success Factors for Big Data Analytics and TABLEAU
This class will focus on:
(1) Implications of scale, velocity and scope of Big Data,
(2) Characteristics of a great Data Visualization tool such as TABLEAU, and
(3) Importance of Type 3 data that provides actionable insights.
Lecture 11
3-D Dashboards (Fast, Wide and Deep) and TABLEAU Architecture
This class will focus on:
(1) Fast, Wide and Deep (3-D) dashboards that are the result of (i) streaming analytics of clickstream data, (ii) analysis of real-time data, and (iii) machine learning, and
(2) TABLEAU Architecture which uses a proprietary technology that makes interactive data visualization an integral part of understanding data.
Module 6: Cloud Computing
Lecture 12
Cloud Computing versus Hadoop Processing and effective use of Cloud Computing in Big Data
This class will focus on:
(1) Difference between Cloud Computing and Hadoop Processing,
(2) Why Big Data is converging towards Cloud Computing, and
(3) Why IaaS is the preferred cloud type for Big Data applications.