Machine Learning with Apache Spark

Course Number:


This course teaches machine learning at scale with the popular Apache Spark framework.

No machine learning knowledge is assumed. For each machine learning concept, we first discuss its foundations, applicability and limitations. Then we explain its implementation and use, covering specific use cases. The course combines 50 percent lecture with 50 percent lab work.

Please note that this course does not include in-depth coverage of the mathematics and statistics behind machine learning.

This course is taught using Spark and Python.


This class is intended for data scientists and software engineers.
Course Duration:
3 Days


We assume no previous knowledge of machine learning. We teach popular machine learning algorithms from scratch.

Prior to taking this course, students should have:

  • A working knowledge of Apache Spark
  • A programming background
  • Familiarity with Python is a plus, but not required
Course Objectives:
  • Learn popular machine learning algorithms, their applicability and limitations
  • Practice the application of these methods in the Spark machine learning environment
  • Learn practical use cases and limitations of algorithms
Course Outline:
  • Section One – Machine Learning (ML) Overview
    • Machine Learning landscape
    • Machine Learning applications
    • Understanding ML algorithms and models (supervised and unsupervised)
  • Section Two – ML in Python and Spark
    • Spark ML Overview
    • Introduction to Jupyter notebooks
    • Lab – Working with Jupyter + Python + Spark
    • Lab – Spark ML utilities
  • Section Three – Machine Learning Concepts
    • Statistics Primer
    • Covariance, Correlation, Covariance Matrix
    • Errors, Residuals
    • Overfitting / Underfitting
    • Cross-validation, bootstrapping
    • Confusion Matrix
    • ROC curve, Area Under Curve (AUC)
    • Lab – Basic stats
  • Section Four – Feature Engineering (FE)
    • Preparing data for ML
    • Extracting features, enhancing data
    • Data cleanup
    • Visualizing Data
    • Lab – Data cleanup
    • Lab – Visualizing data
  • Section Five – Linear Regression
    • Simple Linear Regression
    • Multiple Linear Regression
    • Running linear regression
    • Evaluating linear regression model performance
    • Lab
    • Use case – House price estimates
  • Section Six – Logistic Regression
    • Understanding Logistic Regression
    • Calculating Logistic Regression
    • Evaluating model performance
    • Lab
    • Use case – Credit card applications, college admissions
  • Section Seven – Classification: SVM (Support Vector Machines)
    • SVM concepts and theory
    • SVM with kernel
    • Lab
    • Use case – Customer churn data
  • Section Eight – Classification: Decision Trees and Random Forests
    • Theory behind trees
    • Classification and Regression Trees (CART)
    • Random Forest concepts
    • Labs
    • Use case – Predicting loan defaults, estimating election contributions
  • Section Nine – Classification: Naive Bayes
    • Theory
    • Lab
    • Use case – Spam filtering
  • Section Ten – Clustering (K-Means)
    • Theory behind K-Means
    • Running K-Means algorithm
    • Estimating the performance
    • Lab
    • Use case – Grouping car data, grouping shopping data
  • Section Eleven – Principal Component Analysis (PCA)
    • Understanding PCA concepts
    • PCA applications
    • Running a PCA algorithm
    • Evaluating results
    • Lab
    • Use case – Analyzing retail shopping data
  • Section Twelve – Recommendation (Collaborative Filtering)
    • Recommender systems overview
    • Collaborative Filtering concepts
    • Lab
    • Use case – Movie recommendations, music recommendations
  • Section Thirteen – Final Workshop (time permitting)
    • Students will analyze a couple of datasets and run ML algorithms
      • This is done as a group exercise. Each group will present findings to the class
