Introduction to Apache Spark

Course Number:

N/A

This course will familiarize participants with Apache Spark, explaining how Spark fits into the big data ecosystem and how to use Spark for data analysis. Topics covered include the Spark shell for interactive data analysis, Spark internals, the Spark APIs, Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.
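
For a flavor of the interactive analysis covered early in the course, a minimal Spark shell (Scala) session might look like the sketch below; the input path and the "ERROR" filter are illustrative placeholders, not course materials:

    // The Spark shell pre-defines the SparkContext as `sc`.
    val lines  = sc.textFile("data/sample.log")             // hypothetical input file
    val errors = lines.filter(line => line.contains("ERROR"))
    errors.count()                                          // action: runs a job and returns a count
    errors.take(5).foreach(println)                         // inspect the first few matching lines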

Audience:

This course is appropriate for developers and data analysts.

Course Duration:

3 Days

Prerequisites:

  • Familiarity with Java, Scala, or Python is recommended (labs are in Scala and Python)
  • A basic understanding of the Linux development environment (e.g., command-line navigation and editing files with vi or nano) is also required

Course Outline:

  • Scala Primer
    • A Quick Introduction to Scala
  • Spark Basics
    • Background and History
    • Spark and Hadoop
    • Spark Concepts and Architecture
    • Spark Ecosystem (Core, Spark SQL, MLlib, Streaming)
  • RDDs
    • Running Spark in Local Mode
    • Spark Web UI
    • Spark Shell
    • Analyzing a Dataset (Part One)
    • Inspecting RDDs
  • RDDs In-Depth
    • Partitions
    • RDD Operations / Transformations
    • RDD Types
    • Key-Value Pair RDDs
    • MapReduce on RDDs (see the word-count sketch after this outline)
    • Caching and Persistence
  • Spark and Hadoop
    • Hadoop Introduction (HDFS / YARN)
    • Hadoop + Spark Architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
  • Spark API Programming
    • Introduction to Spark API / RDD API
    • Submitting the First Program to Spark
    • Debugging / Logging
    • Configuration Properties
  • Spark SQL
    • SQL Context
    • Defining Tables and Importing Datasets
    • Querying (see the Spark SQL sketch after this outline)
  • Spark Streaming
    • Streaming Overview
    • Streaming Operations
    • Sliding Window Operations (see the streaming sketch after this outline)
    • Writing Spark Streaming Applications
  • Spark MLlib
    • MLlib Introduction
    • MLlib Algorithms
    • Writing MLlib Applications
  • Spark GraphX
    • GraphX Library Overview
    • GraphX APIs
    • Processing Graph Data Using Spark
  • Spark Performance and Tuning
    • Broadcast Variables
    • Accumulators (see the broadcast/accumulator sketch after this outline)
    • Memory Management
  • Bonus Lab: Running Spark in Cluster Mode
    • Inspecting Masters and Workers in UIs
    • Configurations
    • Distributed Processing of Large Data Sets
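
As a taste of the "Key-Value Pair RDDs" and "MapReduce on RDDs" topics above, a classic word count built from pair-RDD transformations might look like this sketch (the input path is a placeholder, and `sc` is the shell's pre-defined SparkContext):

    // Split lines into words, pair each word with 1, then sum the counts per word.
    val counts = sc.textFile("data/sample.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // the "reduce" step of MapReduce, applied per key
    counts.take(10).foreach(println)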
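
For the Spark SQL module, a sketch using the SQLContext API might look like the following; the JSON file and its name/age fields are assumptions made for illustration:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val people = sqlContext.read.json("data/people.json")    // hypothetical dataset
    people.registerTempTable("people")                        // expose the DataFrame as a SQL table
    sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()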
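
For the sliding-window operations in the Spark Streaming module, counting words over a 30-second window that slides every 10 seconds might look like this sketch (the socket source on localhost:9999 stands in for a real stream):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))            // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)       // hypothetical text source
    val windowedCounts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()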
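
For the performance and tuning module, broadcast variables and accumulators can be illustrated together in a small sketch like the one below (the lookup table and input values are made up; the long accumulator shown is the API available since Spark 2.0):

    // Ship a small lookup table to every executor once, and count misses with an accumulator.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val misses = sc.longAccumulator("lookup misses")

    val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
      if (!lookup.value.contains(key)) misses.add(1)          // runs on the executors
      lookup.value.getOrElse(key, -1)
    }
    resolved.collect()                                        // action: triggers the job
    println(s"lookup misses = ${misses.value}")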
