Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. During this five-day course, attendees will learn about the business benefits and use cases for Hadoop and its ecosystem; how to plan cluster deployment and growth; and how to install, maintain, monitor, troubleshoot, and optimize Hadoop. They will also practice bulk data loading into the cluster, become familiar with various Hadoop distributions, and practice installing and managing Hadoop ecosystem tools. The course concludes with a discussion of securing the cluster with Kerberos.

Audience: This class is designed for Hadoop administrators and developers.
Course Duration: 5 days
Prerequisites:

Participants should be comfortable with basic Linux system administration and scripting skills. Knowledge of Hadoop and distributed computing is not required but will be introduced and explained in the course.

Hardware and Software Requirements:

There is no need to install Hadoop software on students’ machines. A working Hadoop cluster will be provided for students. Participants will only need the following:

  • SSH client – Linux and macOS already include an SSH client; PuTTY is recommended for Windows
  • Browser to access the cluster – we recommend the Firefox browser with the FoxyProxy extension installed
Course Outline:

Section One – Administration

  • Introduction
    • Hadoop History, Concepts
    • Ecosystem
    • Distributions
    • High-Level Architecture
    • Hadoop Myths
    • Hadoop Challenges (Hardware / Software)

 

  • Planning and Installation
    • Selecting Software, Hadoop Distributions
    • Sizing the Cluster, Planning for Growth
    • Selecting Hardware and Network
    • Rack Topology
    • Installation
    • Multi-Tenancy
    • Directory Structure, Logs
    • Benchmarking
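The cluster-sizing topic above comes down to simple arithmetic. A minimal sketch in Python (the 25% temporary-space overhead and the growth figures are illustrative assumptions, not fixed rules; replication factor 3 is the common HDFS default):

```python
def raw_storage_needed(data_tb, replication=3, temp_overhead=0.25,
                       growth_per_year=0.0, years=0):
    """Estimate raw cluster storage (TB) for a given logical data size.

    replication      -- HDFS replication factor (default 3)
    temp_overhead    -- fraction reserved for intermediate/temporary data
    growth_per_year, years -- simple compound growth projection
    """
    projected = data_tb * (1 + growth_per_year) ** years
    return projected * replication * (1 + temp_overhead)

# 100 TB of data today, replication 3, 25% scratch space:
print(raw_storage_needed(100))                              # 375.0 TB raw
# Same data growing 50% per year, planned two years out:
print(raw_storage_needed(100, years=2, growth_per_year=0.5))  # 843.75 TB raw
```

The point the course makes in practice: raw capacity is several multiples of the logical data size, so growth planning has to start from day one.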

 

  • HDFS Operations
    • Concepts (Horizontal Scaling, Replication, Data Locality, Rack Awareness)
    • Nodes and Daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health Monitoring
    • Command-Line and Browser-Based Administration
    • Adding Storage, Replacing Defective Drives
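The replication and horizontal-scaling concepts above follow directly from HDFS's block model: files are split into fixed-size blocks and each block is stored on several DataNodes. A small sketch of the arithmetic (128 MiB block size and replication 3 are common defaults, but both are configurable per cluster):

```python
import math

BLOCK_SIZE = 128 * 1024**2  # 128 MiB, a common HDFS default

def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_bytes / block_size))

def physical_bytes(file_bytes, replication=3):
    """Bytes actually consumed across the cluster, counting all replicas."""
    return file_bytes * replication

one_gib = 1024**3
print(block_count(one_gib))                 # 8 blocks
print(physical_bytes(one_gib) // 1024**3)   # 3 GiB spread across DataNodes
```

This is why a "1 GB" file shows up as 3 GB of consumed capacity in cluster reports, and why losing one drive rarely loses data: two more replicas of every block exist elsewhere.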

 

  • Data Ingestion
    • Note: This Section is Also Good for Developers
    • Flume for Logs and Other Data Ingestion into HDFS
    • Sqoop for Importing from SQL Databases to HDFS and Exporting Back to SQL
    • Hadoop Data Warehousing with Hive
    • Copying Data Between Clusters (DistCp)
    • Using S3 as Complementary to HDFS
    • Data Ingestion Best Practices and Architectures

 

  • MapReduce Operations and Administration
    • Parallel Computing Before MapReduce
      • Compare HPC versus Hadoop Administration
    • MapReduce Cluster Loads
    • Nodes and Daemons (JobTracker, TaskTracker)
    • MapReduce UI Walk Through
    • MapReduce Configuration
    • Job Config
    • Optimizing MapReduce
    • Fool-Proofing MR: What to Tell Your Programmers
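The MapReduce model covered above can be illustrated without a cluster. A pure-Python sketch of the mapper → shuffle/sort → reducer pipeline, using word count (the canonical example; the framework normally performs the shuffle step for you):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle/sort phase: group all values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate all values seen for one key
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
pairs = (kv for line in lines for kv in mapper(line))
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result["the"])  # 2
```

"Thinking in MapReduce" largely means recasting a problem into these three stages, since the framework only parallelizes work expressed that way.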

 

  • YARN: New Architecture and New Capabilities
    • Note: This Section is Also Good for Developers
    • YARN Design Goals and Implementation Architecture
    • New Actors: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job Scheduling Under YARN

 

  • Advanced Topics
    • Hardware Monitoring
    • Cluster Monitoring
    • Adding and Removing Servers, Upgrading Hadoop
    • Backup, Recovery and Business Continuity Planning
    • Oozie Job Workflows
    • Hadoop High Availability (HA)
    • Hadoop Federation
    • Securing Your Cluster with Kerberos

 

  • Cloudera Distribution (CDH5) Track
    • Note: This Section is Optional
    • Cloudera Manager: Installation and Use for Cluster Administration, Monitoring, and Routine Tasks
      • In This Track, All Exercises and Labs Are Performed Within the Cloudera Distribution Environment (CDH5)

Section Two – Development

  • Introduction to Hadoop
    • Note: This Section is a Brief Overview of the Administration Section; Can Be Omitted if Needed
    • Hadoop History, Concepts
    • Ecosystem
    • Distributions
    • High-Level Architecture
    • Hadoop Myths
    • Hadoop Challenges
    • Hardware / Software

 

  • HDFS
    • Concepts (Horizontal Scaling, Replication, Data Locality, Rack Awareness)
    • Architecture
    • NameNode
    • Secondary NameNode
    • DataNode
    • Communications / Heartbeats
    • Block Manager / Balancer
    • Health Check / Safe Mode
    • Read / Write Path
    • File System Abstractions
    • Data Integrity
    • Future of HDFS: NameNode HA, Federation
    • Lab Exercises

 

  • MapReduce
    • MapReduce Concepts
    • Daemons: JobTracker / TaskTracker
    • Phases: Driver, Mapper, Shuffle/Sort, Reducer
    • Counters
    • Distributed Cache
    • Combiners
    • MapReduce Configuration
    • MR Types and Formats
    • Sorting
    • Joins (Map Side and Reduce Side)
    • Job Schedulers
    • Unit Testing
    • Thinking in MapReduce
    • Future of MapReduce (YARN)
    • Lab Exercises
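The "Joins (Map Side and Reduce Side)" topic above is easiest to see in miniature. A sketch of a reduce-side join in plain Python: mappers tag each record with its source table, the shuffle groups records by join key, and the reducer combines the tagged groups (the tables and tag names here are invented for illustration):

```python
from collections import defaultdict

users  = {1: "alice", 2: "bob"}                  # table A: user_id -> name
orders = [(1, "book"), (1, "pen"), (2, "mug")]   # table B: (user_id, item)

def map_users():
    for uid, name in users.items():
        yield uid, ("U", name)                   # tag record with its source

def map_orders():
    for uid, item in orders:
        yield uid, ("O", item)

def reduce_join(grouped):
    # For each key, pair every user record with every order record
    for uid, records in grouped.items():
        names = [v for tag, v in records if tag == "U"]
        items = [v for tag, v in records if tag == "O"]
        for name in names:
            for item in items:
                yield name, item

grouped = defaultdict(list)                      # stands in for shuffle/sort
for k, v in list(map_users()) + list(map_orders()):
    grouped[k].append(v)

print(sorted(reduce_join(grouped)))
# [('alice', 'book'), ('alice', 'pen'), ('bob', 'mug')]
```

A map-side join avoids the shuffle entirely by loading the smaller table into every mapper (in Hadoop, typically via the distributed cache), which is why it is preferred when one side is small.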

 

  • Pig
    • Pig versus Java MapReduce
    • Pig Job Flow
    • Pig Latin Language
    • Lab Exercises

 

  • Hive
    • Hive Concepts
    • Architecture
    • Data Types
    • Hive versus SQL
    • Lab Exercises

 

  • HBase
    • Introduction
    • Concepts
    • Architecture
    • HBase versus RDBMS
    • Read Path / Write Path
    • Schema Design
    • Lab Exercises