Apache Hadoop is one of the most widely used frameworks for processing Big Data. Hadoop provides rich and deep analytics capabilities, and it is making inroads into the traditional business intelligence and analytics world. This course introduces analysts to the core components of the Hadoop ecosystem and its analytics tools.

Audience: This course is designed especially for business analysts.
Course Duration: 3 days

Prerequisites: Participants should have a programming background with databases / SQL and basic knowledge of Linux (e.g., be able to navigate the Linux command line and edit files with vi / nano).

Hardware and Software Requirements:

There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students. Participants will only need the following:

  • SSH client – Linux and Mac already include an SSH client; PuTTY is recommended for Windows
  • Browser to access the cluster – We recommend the Firefox browser with the FoxyProxy extension installed

Course Outline:
  • Introduction to Hadoop
    • Hadoop History, Concepts
    • Ecosystem
    • Distributions
    • High-Level Architecture
    • Hadoop Myths
    • Hadoop Challenges
    • Hardware / Software


  • HDFS Overview
    • Concepts (Horizontal Scaling, Replication, Data Locality, Rack Awareness)
    • Architecture (NameNode, Secondary NameNode, DataNode)
    • Data Integrity
    • Future of HDFS (NameNode HA, Federation)
    • Lab Exercises


  • MapReduce Overview
    • MapReduce Concepts
    • Daemons: JobTracker / TaskTracker
    • Phases: Driver, Mapper, Shuffle/Sort, Reducer
    • Thinking in MapReduce
    • Future of MapReduce (YARN)
    • Lab Exercises


  • Pig
    • Pig Versus Java MapReduce
    • Pig Latin Language
    • User-Defined Functions
    • Understanding Pig Job Flow
    • Basic Data Analysis with Pig
    • Complex Data Analysis with Pig
    • Multi Datasets with Pig
    • Advanced Concepts
    • Lab Exercises


  • Hive
    • Hive Concepts
    • Architecture
    • Data Types
    • Hive Data Management
    • Hive Versus SQL
    • Lab Exercises


  • BI Tools for Hadoop
    • BI Tools and Hadoop
    • Overview of Current BI Tools Landscape


  • Conclusion
    • Choosing the Best Tool for the Job