Apache Hadoop is one of the most widely used frameworks for processing Big Data. Hadoop provides rich and deep analytics capabilities, and it is making inroads into the traditional business intelligence and analytics world. This course introduces analysts to the core components of the Hadoop ecosystem and its analytics tools.
Participants should have a programming background that includes databases / SQL and a basic knowledge of Linux (e.g., be able to navigate the Linux command line and edit files with vi / nano).
There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students. Participants will only need the following:
- SSH client – Linux and Mac already include an SSH client; PuTTY is recommended for Windows
- Browser to access the cluster – We recommend the Firefox browser with the FoxyProxy extension installed
- Introduction to Hadoop
  - Hadoop History, Concepts
  - Ecosystem
  - Distributions
  - High-Level Architecture
  - Hadoop Myths
  - Hadoop Challenges
  - Hardware / Software
- HDFS Overview
  - Concepts (Horizontal Scaling, Replication, Data Locality, Rack Awareness)
  - Architecture (NameNode, Secondary NameNode, DataNode)
  - Data Integrity
  - Future of HDFS (NameNode HA, Federation)
  - Lab Exercises
- MapReduce Overview
  - MapReduce Concepts
  - Daemons: JobTracker / TaskTracker
  - Phases: Driver, Mapper, Shuffle/Sort, Reducer
  - Thinking in MapReduce
  - Future of MapReduce (YARN)
  - Lab Exercises
- Pig
  - Pig Versus Java MapReduce
  - Pig Latin Language
  - User-Defined Functions
  - Understanding Pig Job Flow
  - Basic Data Analysis with Pig
  - Complex Data Analysis with Pig
  - Multi Datasets with Pig
  - Advanced Concepts
  - Lab Exercises
- Hive
  - Hive Concepts
  - Architecture
  - Data Types
  - Hive Data Management
  - Hive Versus SQL
  - Lab Exercises
- BI Tools for Hadoop
  - BI Tools and Hadoop
  - Overview of Current BI Tools Landscape
- Conclusion
  - Choosing the Best Tool for the Job