Introduction to Spark 3 with Scala

Course Number:

NTX336

Audience:

Course Duration:

4 days

Prerequisites:

Working knowledge of at least one programming language. No Java experience is needed.

Course Objectives:
  • Understand the need for Spark in data processing
  • Understand the Spark architecture and how it distributes computations to cluster nodes
  • Be familiar with the basic installation, setup, and layout of Spark
  • Use the Spark shell for interactive and ad-hoc operations
  • Understand RDDs (Resilient Distributed Datasets), including data partitioning, pipelining, and computations
  • Understand and use RDD operations such as map() and filter()
  • Understand and use Spark SQL and the DataFrame/DataSet API
  • Understand DataSet/DataFrame capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
  • Be familiar with performance issues, and use the DataSet/DataFrame API and Spark SQL for efficient computations
  • Understand Spark’s data caching and use it to avoid recomputing data that is reused
  • Write/run standalone Spark programs with the Spark API
  • Use Spark Structured Streaming to process streaming (real-time) data
  • Ingest streaming data from Kafka, and process via Spark Structured Streaming
  • Understand performance implications and optimizations when using Spark
Course Outline:

Scala Ramp Up (Optional)

  • Scala Introduction, Variables, Data Types, Control Flow
  • The Scala Interpreter
  • Collections and their Standard Methods (e.g. map())
  • Functions, Methods, Function Literals
  • Class, Object, Trait, case Class
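
To give a feel for the ramp-up topics listed above, here is a minimal sketch in plain Scala (no Spark required). All names in it (Course, Describable, RampUp, evensDoubled) are hypothetical and chosen only for illustration; this is not taken from the course materials.

```scala
// A case class: an immutable record with generated equals and toString.
final case class Course(code: String, days: Int)

// A trait with one abstract member and one concrete member.
trait Describable {
  def label: String
  def describe: String = s"<$label>"
}

// A singleton object that mixes in the trait.
object RampUp extends Describable {
  val label: String = "ramp-up"

  // A method, and a function literal assigned to a val.
  def double(n: Int): Int = n * 2
  val isEven: Int => Boolean = n => n % 2 == 0

  // Standard collection methods: filter keeps matching elements,
  // map transforms each remaining element.
  def evensDoubled(xs: List[Int]): List[Int] = xs.filter(isEven).map(double)

  def main(args: Array[String]): Unit = {
    println(evensDoubled(List(1, 2, 3, 4))) // prints List(4, 8)
    println(Course("NTX336", 4))            // prints Course(NTX336,4)
    println(describe)                       // prints <ramp-up>
  }
}
```

The same map() and filter() style carries over to the Spark operations covered later in the outline.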

Introduction to Spark

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Acquiring and Installing Spark
  • The Spark Shell, SparkContext

RDDs and Spark Architecture

  • RDD Concepts, Lifecycle, Lazy Evaluation
  • RDD Partitioning and Transformations
  • Working with RDDs – Creating and Transforming (map, filter, etc.)
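
As a rough illustration of the RDD topics above, the sketch below creates an RDD, applies two lazy transformations, and then triggers the computation with an action. It assumes Spark 3 is on the classpath and uses a local master; the object, method, and application names are hypothetical, not part of the course materials.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Builds an RDD from a local range, then filters and maps it.
  def evenSquares(spark: SparkSession): Array[Int] = {
    val sc = spark.sparkContext

    // Create an RDD split into 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Transformations are lazy: nothing is computed on these two lines.
    val evens   = numbers.filter(_ % 2 == 0)
    val squares = evens.map(n => n * n)

    // collect() is an action: it runs the job and returns the results to the driver.
    squares.collect()
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores (an assumption for this sketch).
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      println(evenSquares(spark).mkString(", ")) // prints 4, 16, 36, 64, 100
    } finally {
      spark.stop()
    }
  }
}
```

Because filter() and map() are narrow transformations, Spark can pipeline them within each partition without a shuffle.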

Spark SQL, DataFrames, and DataSets

  • Overview
  • SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text …)
  • Introducing DataFrames and DataSets (Creation and Schema Inference)
  • Supported Data Formats (JSON, Text, CSV, Parquet)
  • Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
  • SQL-based Queries
  • Working with the DataSet (typed) API
  • Mapping and Splitting (flatMap(), explode(), and split())
  • DataSets vs. DataFrames vs. RDDs

Shuffling Transformations and Performance

  • Grouping, Reducing, Joining
  • Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
  • Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
  • The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)

Performance Tuning

  • Caching – Concepts, Storage Type, Guidelines
  • Minimizing Shuffling for Increased Performance
  • Using Broadcast Variables and Accumulators
  • General Performance Guidelines

Creating Standalone Applications

  • Core API, SparkSession.Builder
  • Configuring and Creating a SparkSession
  • Building and Running Applications – sbt/build.sbt and spark-submit
  • Application Lifecycle (Driver, Executors, and Tasks)
  • Cluster Managers (Standalone, YARN, Mesos)
  • Logging and Debugging

Spark Streaming

  • Introduction and Streaming Basics
  • Structured Streaming (Spark 2+)
    • Continuous Applications
    • Table Paradigm
    • Steps for Structured Streaming
    • Sources and Sinks
  • Consuming Kafka Data
    • Kafka Overview
    • Structured Streaming – “kafka” format
    • Processing the Stream
