Course Description
Learn the fundamentals of producing industrial-strength applications using the Hadoop ecosystem. In addition to the basics, we introduce advanced topics such as intelligent hashing, partition skew detection, Monte Carlo simulation, partition pruning, and push predicates. Coverage of emerging industry standards in data formats, messaging, and stream processing gives students guidance for future studies.
What am I going to get from this course?
- Understand core Hadoop components, how they work together, and real-world industry best practices.
- Produce industrial-strength MapReduce applications with the highest standards of quality and robustness.
- Use the Hadoop APIs for basic data science tasks such as Monte Carlo simulation and data preparation.
- Partition, reduce, sort, and join data using MapReduce to produce any result you could produce using SQL.
- Leverage the latest data storage formats to make data processing using MapReduce faster and easier than ever before.
- Use compression properly in large-scale environments.
- Collect data using Flume and Sqoop.
- Explore data using Hive, Pig, and Drill.
- Create truly reusable User Defined Functions (UDFs) which operate identically regardless of Hadoop distribution or version upgrades.
- Expose an API to enable Hadoop as a Service (HaaS).
- Understand future directions and trends in Big Data.
Prerequisites and Target Audience
What will students need to know or do before starting this course?
- Working knowledge of Java, whether from experience, equivalent courses, or Java certification.
- Ability to use the basics of the Unix command line.
Who should take this course? Who should not?
Students who want a deep dive into real-world usage of Hadoop and its related APIs and tools will benefit most from this course. Students must master all the relevant details of the Hadoop APIs and complete rigorous, challenging assignments in the context of a data aggregator case study.
Curriculum
Module 1: Hadoop Cluster Overview
11:57
We discuss the job execution framework YARN (Yet Another Resource Negotiator) and how its components interact to manage Hadoop job execution.
We also discuss the Hadoop Distributed File System (HDFS), including its design principles, proper usage, and best practices.
Quiz 1
Hadoop Cluster Overview
Verifies student understanding of the basic YARN and HDFS components of Hadoop and how they interact to provide job management and storage management for an application.
Module 2: Industrial Strength MapReduce
01:24:03
Lecture 3
Industrial Strength MapReduce Part 1
13:44
Covers how to use the Eclipse IDE and JUnit to produce and maintain an industrial-quality Hadoop MapReduce code base.
Quiz 2
Industrial Strength MapReduce Part 1
Lecture 4
Industrial Strength MapReduce Part 2
12:59
Continues coverage of using the Eclipse IDE and JUnit to produce and maintain an industrial-quality Hadoop MapReduce code base.
Quiz 3
Industrial Strength MapReduce Part 2
Lecture 5
Industrial Strength MapReduce Part 3
14:08
Further coverage of using the Eclipse IDE and JUnit to produce and maintain an industrial-quality Hadoop MapReduce code base.
Lecture 6
Industrial Strength MapReduce Part 4
13:04
Concludes coverage of using the Eclipse IDE and JUnit to produce and maintain an industrial-quality Hadoop MapReduce code base.
Lecture 7
Viewing Log Files and Understanding Counters
06:39
How to view and interpret Hadoop MapReduce log files and counters.
Lecture 8
Exercise 00 - Your first Hadoop Test Case
19:30
Create a test case from scratch using the MapReduce APIs and complying with code coverage KPIs.
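The course tests against the MapReduce APIs with JUnit; since running those tests needs the course's Hadoop dependencies, here is a minimal stand-in sketch of the underlying practice: factor the mapper's core logic into a plain Java method so it can be asserted on without a cluster. The `WordCountCore` class and word-count example are illustrative assumptions, not the exercise's actual code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical word-count mapper core, factored out of a Mapper.map()
// method so it can be unit tested without a cluster or MiniMRCluster.
public class WordCountCore {
    // Returns the (word, 1) pairs a mapper would emit for one input line.
    public static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));  // emit (word, 1)
            }
        }
        return out;
    }
}
```

Keeping the logic in a pure function like this is also what makes hitting code coverage KPIs practical, since every branch can be exercised directly.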
Lecture 9
Exercise 00 - Review
03:59
Compare and contrast your answer with the provided answer.
Module 3: Basic Data Science with the Hadoop APIs
34:15
Lecture 10
Writable and WritableComparable
06:09
Covers the fundamentals of how Hadoop serializes various Java data types, and how to use Eclipse and Maven to explore class and interface hierarchies in the Hadoop code base.
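Hadoop's `Writable` contract is built directly on `java.io.DataOutput` and `java.io.DataInput`, so the pattern can be sketched with only the JDK. The `PointWritable` class below is a hypothetical example, not from the course: it mirrors the `write`/`readFields` pair of `org.apache.hadoop.io.Writable` and the ordering a `WritableComparable` adds.

```java
import java.io.*;

// Sketch of the Writable/WritableComparable pattern using only java.io.
// A real implementation would declare "implements WritableComparable<PointWritable>".
public class PointWritable implements Comparable<PointWritable> {
    private int x;
    private int y;

    public PointWritable() {}                      // no-arg constructor required by Hadoop
    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    public void write(DataOutput out) throws IOException {    // serialize fields in order
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readFields(DataInput in) throws IOException { // deserialize in the SAME order
        x = in.readInt();
        y = in.readInt();
    }

    @Override
    public int compareTo(PointWritable o) {        // ordering, as WritableComparable requires
        int c = Integer.compare(x, o.x);
        return c != 0 ? c : Integer.compare(y, o.y);
    }

    public int getX() { return x; }
    public int getY() { return y; }
}
```

The round trip (write then readFields) must reproduce the object exactly; field order mismatches between the two methods are a classic source of corrupt intermediate data.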
Quiz 4
Writable and WritableComparable
Review the basic structure of WritableComparable as it is visible in Eclipse.
Lecture 11
Introduction to Monte Carlo Simulation
15:46
Shows how to implement a Monte Carlo simulation in Hadoop to verify logic and allow local-mode performance and load testing.
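As a minimal sketch of the technique (not the lecture's code): the classic Monte Carlo estimate of pi samples random points in the unit square and counts those inside the quarter circle. In a Hadoop setting each mapper would run a batch of trials and a reducer would aggregate the hit counts; here a single loop stands in for one mapper's batch.

```java
import java.util.Random;

// Monte Carlo estimate of pi: area of quarter circle / area of unit square = pi/4.
public class MonteCarloPi {
    public static double estimate(long trials, long seed) {
        Random rng = new Random(seed);   // fixed seed keeps the run reproducible/testable
        long hits = 0;
        for (long i = 0; i < trials; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) hits++;   // point falls inside the quarter circle
        }
        return 4.0 * hits / trials;   // scale the area ratio back up to pi
    }
}
```

Fixing the seed is what makes this usable for verification in local mode: the same input always yields the same estimate, so regressions are detectable.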
Quiz 5
Introduction to Monte Carlo Simulation
Review the role of Monte Carlo Simulation in producing robust Hadoop solutions.
Lecture 12
Introduction to Intelligent Hashing
07:57
Shows how Intelligent Hashing APIs from Google can make Hadoop jobs more efficient and fault tolerant.
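The lecture uses Google's hashing APIs; as a hedged stand-in that needs only the JDK, the sketch below uses `MessageDigest` to show the core idea: a stable content hash over a record's fields, which lets a job detect duplicate or unchanged records without comparing full payloads. The `RecordHash` name and field-separator scheme are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Stable content hash of a record's fields, usable as an enrichment column
// for deduplication or change detection in a MapReduce pipeline.
public class RecordHash {
    public static String sha256Hex(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String f : fields) {
                md.update(f.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0x1F);   // field separator so ("ab","c") != ("a","bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // SHA-256 is guaranteed on every JVM
        }
    }
}
```

The separator byte matters: without it, different field splits of the same concatenated text would collide by construction.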
Quiz 6
Introduction to Intelligent Hashing
Review how to create an Intelligent Hash and what trade-offs are involved in its use.
Lecture 13
Exercise 01 - Data Enrichment using Hadoop
00:28
Enrich the data with an Intelligent Hash.
Lecture 14
Exercise 01 - Review
03:55
Compare and contrast your solution with the provided solution.
Module 4: Partitioners, Reducers, and Sorting
40:36
Lecture 15
Partitioners, Reducers, and Sorting
15:22
Quiz 7
Partitioners, Reducers, and Sorting
Test your understanding of how partitioning, reducing, and sorting work.
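The partitioning step above can be illustrated with Hadoop's default rule: `HashPartitioner` assigns each record to reducer `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The class below is a plain-Java sketch of that rule, not the Hadoop class itself.

```java
// Sketch of Hadoop's default HashPartitioner logic. The bitmask clears the
// sign bit so keys with negative hashCode() values still map to a valid
// partition index. A custom Partitioner overrides this rule to control
// which keys are co-located on the same reducer.
public class HashPartitionerSketch {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the result depends only on the key, every record with the same key lands on the same reducer, which is what makes reduce-side grouping and sorting possible.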
Lecture 16
Exercise 02 - Monitoring Partition Skew
15:22
How to identify partition skew, potentially well before the job completes.
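A hedged sketch of the idea (hypothetical names, not the exercise's code): once you know how many records each partition would receive, skew can be flagged by comparing the largest partition against the mean. In practice those counts come from Hadoop counters or per-partition map output sizes, which is why skew can be spotted before the reduce phase finishes.

```java
// Flags partition skew when the largest partition exceeds the mean
// record count by a chosen factor (e.g. 2.0 = "twice the average").
public class SkewDetector {
    public static boolean isSkewed(int[] partitionCounts, double maxOverMean) {
        long total = 0;
        int max = 0;
        for (int c : partitionCounts) {
            total += c;
            if (c > max) max = c;
        }
        double mean = (double) total / partitionCounts.length;
        return max > mean * maxOverMean;
    }
}
```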
Lecture 17
Exercise 02 - Review
09:52
Compare and contrast your answer with the provided answer.
Module 5: Data Formats, Compression, and Splitting
29:12
Lecture 18
Data Input and Output
14:56
- Role of Custom Writable and WritableComparable
- File Compression and Splitting
- Custom InputFormats and OutputFormats
- Multiple Inputs
- Schema Evolution
- File formats: SequenceFile, Avro, and Parquet
Lecture 19
Exercise 03 - convert a file from text to Avro
09:26
How to create an Avro output from a text input.
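Converting text to Avro starts from a schema. The record below is a hypothetical example schema (field names and namespace are illustrative, not the exercise's data); the job would parse each text line into these fields and write them out with Avro's MapReduce output format.

```json
{
  "type": "record",
  "name": "Employee",
  "namespace": "com.example.course",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "dept", "type": ["null", "string"], "default": null}
  ]
}
```

The union type `["null", "string"]` with a default is the standard Avro idiom for an optional field, and defaults are what make schema evolution (a later lecture topic) safe.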
Lecture 20
Exercise 04 - convert a file from text to ParquetAvro
04:50
How to create a ParquetAvro file from text input.
Module 6: Joining Using MapReduce
55:58
Lecture 21
Importance of Enterprise Integration Patterns (EIPs)
16:13
How EIPs relate to proper MapReduce application design and implementation.
Lecture 22
Orchestration and Routing Job Flows
04:21
Best practices for handling precedence and inter-job dependencies in MapReduce applications.
Lecture 23
Partition Pruning, Push Predicates, and Joins
10:59
How to control the exact behavior of joins, and how to filter data in storage with all the functionality of a SQL WHERE clause.
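As a hedged sketch of partition pruning (the directory layout and class name are illustrative assumptions): with data laid out as `/data/activity/date=YYYY-MM-DD/`, the driver adds only the directories that can satisfy the query as job inputs, instead of scanning everything. Pushing the predicate then means filtering the surviving rows as early as possible, ideally inside the storage format rather than after the join.

```java
import java.util.ArrayList;
import java.util.List;

// Keep only partition directories whose date value falls inside [from, to].
// Each surviving directory would become an input path for the MapReduce job,
// so pruned partitions are never read at all.
public class PartitionPruner {
    public static List<String> prune(List<String> partitionDirs, String from, String to) {
        List<String> kept = new ArrayList<>();
        for (String dir : partitionDirs) {
            String date = dir.substring(dir.indexOf("date=") + 5);  // e.g. "2016-03-01"
            if (date.compareTo(from) >= 0 && date.compareTo(to) <= 0) {
                kept.add(dir);
            }
        }
        return kept;
    }
}
```

ISO dates sort lexicographically, which is why plain string comparison works for the range check here.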
Lecture 24
Exercise 05 - joining emp with activity logs
07:43
Create an inner join using MapReduce.
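A reduce-side inner join can be sketched in plain Java (tag prefixes and names below are illustrative, not the exercise's code): mappers tag each record with its source, "E" for emp and "A" for activity, and emit it under the join key; the reducer then pairs every emp record with every activity record sharing that key.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the reducer body of a reduce-side inner join.
// "values" holds all tagged records for ONE key, e.g. "E|Smith" or "A|login".
public class ReduceSideJoin {
    public static List<String> joinForKey(List<String> values) {
        List<String> emps = new ArrayList<>();
        List<String> acts = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("E|")) emps.add(v.substring(2));
            else if (v.startsWith("A|")) acts.add(v.substring(2));
        }
        List<String> joined = new ArrayList<>();
        for (String e : emps) {
            for (String a : acts) {
                joined.add(e + "," + a);   // one output row per matching pair
            }
        }
        return joined;   // empty when either side is missing: that is the inner join
    }
}
```

Returning nothing when either list is empty is exactly what distinguishes an inner join from a left or full outer join at the reducer.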
Lecture 25
Exercise 06 - partition pruning and push predicates
16:42
Add partition pruning and push predicates to the join job.
Module 7: Data Collection with Flume and Sqoop
33:18
Lecture 26
Flume Fundamentals
19:11
Learn the best practices and key use cases for Flume, and how to stream data to HDFS using Flume.
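A minimal single-agent Flume configuration illustrates the source-channel-sink wiring the lecture covers; the agent and component names below are illustrative, and a real deployment would swap the netcat source for the actual ingest source.

```properties
# Hypothetical Flume agent: a netcat source feeds a memory channel
# drained by an HDFS sink that rolls files into date-based directories.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Note that the time escapes in `hdfs.path` need a timestamp on each event, which is why `useLocalTimeStamp` is set here.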
Lecture 27
Sqoop Fundamentals
14:07
Use Sqoop to import a data set from a relational database to HDFS.
Module 8: Data Exploration with Hive, Pig, and Drill
48:49
Lecture 28
Hive Fundamentals
25:09
Learn how to import and query text data in Hive and the advantages of using ParquetAvro to simplify data management and improve performance.
Lecture 29
Pig Fundamentals
11:22
Learn some fundamentals of Pig and how to load, store, and transform data using PigLatin.
Lecture 30
Drill Fundamentals
12:18
Learn how Drill can scale out high performance end user queries using ANSI standard SQL.
Module 9: Integrating Hadoop into the Enterprise
36:53
Lecture 31
Reusable User Defined Functions (UDFs)
19:38
Learn how to create reusable User Defined Functions (UDFs) which work identically across multiple big data tools and across tool version upgrades.
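The reusable-UDF pattern the lecture describes can be sketched as follows (the `TitleCaseCore` class and its function are illustrative assumptions): keep the logic in a plain Java class with no Hive, Pig, or Drill imports, then write thin per-tool wrappers that only adapt types and delegate here. The core survives distribution changes and version upgrades because it depends on nothing but the JDK.

```java
// Tool-agnostic UDF core. A Hive GenericUDF, a Pig EvalFunc, and a Drill
// function would each be a few lines that convert their tool's types and
// call titleCase(); none of the real logic lives in those wrappers.
public class TitleCaseCore {
    public static String titleCase(String s) {
        if (s == null || s.isEmpty()) return s;
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfWord = true;
        for (char c : s.toCharArray()) {
            out.append(startOfWord ? Character.toUpperCase(c) : Character.toLowerCase(c));
            startOfWord = (c == ' ');
        }
        return out.toString();
    }
}
```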
Lecture 32
Exercise 07 - reusable Linear Regression function
17:15
Create a reusable function which integrates with multiple big data tools.
Lecture 33
Hadoop as a Service (HaaS)
Learn how to apply EIPs to create Hadoop as a Service (HaaS).
Lecture 34
Exercise 08 - develop basic Hadoop as a Service
Learn how to expose a Hadoop endpoint to future-proof and simplify your data processing architecture.
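A hedged sketch of the endpoint idea using only the JDK's built-in HTTP server (the `/jobs/wordcount` path, response JSON, and class names are illustrative assumptions): callers hit a stable REST endpoint while the Hadoop submission details stay hidden behind it. A real service would submit a job via the Hadoop `Job` API and return its tracking ID.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HaasEndpoint {
    // Response logic kept as a pure function so it is testable without the server.
    public static String handleSubmit(String dataset) {
        return "{\"job\":\"wordcount\",\"input\":\"" + dataset + "\",\"status\":\"SUBMITTED\"}";
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/jobs/wordcount", (HttpExchange ex) -> {
            byte[] body = handleSubmit("demo").getBytes(StandardCharsets.UTF_8);
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        server.start();   // try: curl http://localhost:8080/jobs/wordcount
    }
}
```

Keeping the cluster behind an endpoint like this is what lets the backing infrastructure change (versions, distributions, even cloud Hadoop) without breaking callers.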
Lecture 35
Job Scheduling with Oozie
Learn best practices for leveraging Oozie to trigger workflows.
Module 10: Future Trends in Hadoop
Lecture 36
Truly Scalable Messaging - LinkedIn’s Apache Kafka
Learn how Kafka enables robust EIP based designs to scale to big data and beyond.
Lecture 37
Unified Batch and Real-time - Google’s Apache Beam
Learn about Apache Beam, arguably the most significant open source contribution since Hadoop.
Lecture 38
Hadoop as a Service Cloud - Amazon Web Services and Google Cloud
Learn how Cloud providers expose APIs to allow pay-as-you-go Hadoop infrastructure.