This course will familiarize participants with Apache Spark, explaining how Spark fits into the big data ecosystem and how to use Spark for data analysis. Topics covered will include the Spark shell for interactive data analysis, Spark internals, the Spark APIs, Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.

Audience: This course is appropriate for developers and data analysts.
Course Duration: 3 Days
Prerequisites:
  • Familiarity with Java, Scala, or Python is recommended (labs are in Scala and Python)
  • A basic understanding of the Linux development environment (e.g., command-line navigation and editing files with vi or nano) is also required
Hardware and Software Requirements:

All participants will need a reasonably modern laptop that can connect to clusters running on cloud services (corporate laptops with overly restrictive firewalls are not recommended). Other necessities include:

  • SSH client
    • Windows: PuTTY or SecureCRT; macOS and Linux include SSH clients by default
  • Chrome browser with Markdown Preview Plus plugin
  • A programmer’s editor
    • Windows: Sublime Text, Notepad++, Programmer’s Notepad, TextPad
    • Mac: Sublime Text, TextWrangler
    • Linux: Sublime Text, gedit, Vim, Emacs
Course Outline:
  • Scala Primer
    • A Quick Introduction to Scala
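
A minimal sketch of the style of Scala used throughout the labs (the names and data here are illustrative only):

    // Immutable values, a case class, and collection transformations:
    // the same map/filter style used later with Spark RDDs.
    object ScalaPrimerSketch {
      case class Person(name: String, age: Int)

      def main(args: Array[String]): Unit = {
        val people = List(Person("Ann", 34), Person("Bob", 19), Person("Cho", 27))
        val adultsByAge = people.filter(_.age >= 21).sortBy(_.age).map(_.name)
        println(adultsByAge.mkString(", "))   // prints: Cho, Ann
      }
    }
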
  • Spark Basics
    • Background and History
    • Spark and Hadoop
    • Spark Concepts and Architecture
    • Spark Ecosystem (Core, Spark SQL, MLlib, Streaming)
  • RDDs
    • Running Spark in Local Mode
    • Spark Web UI
    • Spark Shell
    • Analyzing a Dataset (Part One)
    • Inspecting RDDs
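
A quick sketch of the kind of spark-shell session used in this module (local mode, where sc is the pre-created SparkContext; the dataset path is a placeholder):

    val lines = sc.textFile("data/sample.txt")     // placeholder lab dataset
    lines.count()                                  // action: materializes the RDD
    lines.take(5).foreach(println)                 // peek at a few records

    // Inspecting the RDD itself
    println(lines.toDebugString)                   // lineage / dependency chain
    println(lines.partitions.length)               // number of partitions
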
  • RDDs In-Depth
    • Partitions
    • RDD Operations / Transformations
    • RDD Types
    • Key-Value Pair RDDs
    • MapReduce on RDD
    • Caching and Persistence
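
A short sketch of the key-value / MapReduce pattern and caching covered here (assumes a spark-shell session; the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val words = sc.textFile("data/sample.txt")          // placeholder path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)

    val counts = words.map(w => (w, 1))                 // "map" phase
      .reduceByKey(_ + _)                               // "reduce" phase

    counts.persist(StorageLevel.MEMORY_ONLY)            // equivalent to cache()
    counts.sortBy(_._2, ascending = false).take(10).foreach(println)
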
  • Spark and Hadoop
    • Hadoop Introduction (HDFS / YARN)
    • Hadoop + Spark Architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
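
A sketch of reading HDFS data from a spark-shell session (the NameNode host, port, and path are placeholders for whatever the lab cluster uses; the exact --master flag for YARN varies slightly by Spark version):

    // Typically launched with something like: spark-shell --master yarn
    val logs = sc.textFile("hdfs://namenode:8020/data/access.log")
    val errors = logs.filter(_.contains("ERROR"))
    println(s"error lines: ${errors.count()}")
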
  • Spark API Programming
    • Introduction to Spark API / RDD API
    • Submitting the First Program to Spark
    • Debugging / Logging
    • Configuration Properties
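
A minimal standalone application of the sort submitted in this module (class name, configuration values, and paths are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object FirstSparkApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("FirstSparkApp")
          .set("spark.executor.memory", "1g")    // example configuration property

        val sc = new SparkContext(conf)
        try {
          val lines = sc.textFile(args(0))       // input path from the command line
          println(s"line count: ${lines.count()}")
        } finally {
          sc.stop()
        }
      }
    }

    // Submitted, for example, with:
    //   spark-submit --class FirstSparkApp --master local[4] first-app.jar data/sample.txt
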
  • Spark SQL
    • SQL Context
    • Defining Tables and Importing Datasets
    • Querying
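
A Spark 1.x-style SQLContext sketch matching this outline (newer Spark versions use SparkSession instead; the JSON path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.read.json("data/people.json")   // schema inferred from JSON
    people.registerTempTable("people")                       // define a queryable table

    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 21")
    adults.show()
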
  • Spark Streaming
    • Streaming Overview
    • Streaming Operations
    • Sliding Window Operations
    • Writing Spark Streaming Applications
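
A sketch of a windowed DStream word count in the style this module covers (assumes a spark-shell session; host and port are placeholders, and locally you can feed it with nc -lk 9999):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))            // 5-second batches
    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines.flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // 30s window, 10s slide

    counts.print()
    ssc.start()
    ssc.awaitTermination()
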
  • Spark MLlib
    • MLlib Introduction
    • MLlib Algorithms
    • Writing MLlib Applications
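
A small MLlib sketch, run from a spark-shell session: k-means over a few inlined 2-D points (the labs use real datasets; the data here is purely illustrative):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 8.5), Vectors.dense(8.8, 9.2)
    ))

    val model = KMeans.train(points, 2, 20)            // k = 2, 20 iterations
    model.clusterCenters.foreach(println)
    println(model.predict(Vectors.dense(0.1, 0.1)))    // cluster for a new point
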
  • Spark GraphX
    • GraphX Library Overview
    • GraphX APIs
    • Processing Graph Data Using Spark
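
A tiny GraphX sketch from a spark-shell session: build a graph from vertex and edge RDDs and run PageRank (the IDs, names, and tolerance value are illustrative):

    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    val graph = Graph(vertices, edges)
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")

    val ranks = graph.pageRank(0.001).vertices         // tolerance-based PageRank
    ranks.join(vertices).collect().foreach(println)
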
  • Spark Performance and Tuning
    • Broadcast Variables
    • Accumulators
    • Memory Management
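
A sketch of broadcast variables and accumulators as covered here (assumes a spark-shell session; the Spark 1.x accumulator API is shown, newer versions use sc.longAccumulator; the lookup table and codes are illustrative):

    val countryNames = sc.broadcast(Map("us" -> "United States", "in" -> "India"))
    val badRecords = sc.accumulator(0, "bad records")

    val codes = sc.parallelize(Seq("us", "in", "xx", "us"))
    val named = codes.map { code =>
      countryNames.value.get(code) match {
        case Some(name) => name
        case None       => badRecords += 1; "unknown"
      }
    }

    named.collect().foreach(println)
    println(s"bad records: ${badRecords.value}")
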
  • Bonus Lab: Running Spark in Cluster Mode
    • Inspecting Masters and Workers in UIs
    • Configurations
    • Distributed Processing of Large Data Sets
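
A sketch of pointing an application at a standalone cluster master rather than local mode (the master URL and memory setting are placeholders for the lab cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ClusterModeLab")
      .setMaster("spark://master-host:7077")     // standalone master URL (placeholder)
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)
    println(s"running on: ${sc.master}")
    sc.stop()

    // Or, equivalently, at submit time:
    //   spark-submit --master spark://master-host:7077 --class ClusterModeLab app.jar
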