Spark Fundamentals

Available since November 2, 2019
Category

IT Business Management Training

Duration

3 days

Course description

This high-octane Spark training course covers the theoretical and technical aspects of Spark programming. The course teaches developers Spark fundamentals, APIs, common programming idioms, and more. It is supplemented by hands-on labs that help attendees reinforce their theoretical knowledge of the material and quickly get up to speed on using Spark for data exploration.

Course Agenda:
- Elements of functional programming
- Spark Shell
- RDDs
- Parallel processing in Spark
- Spark SQL
- ETL with Spark
- MLlib Machine Learning Library
- Graph Processing with GraphX
- Spark Streaming

Target audience

Developers, Business Analysts, and IT Architects

Course requirements

Participants should have general programming knowledge as well as experience working in Unix-like environments (e.g., running shell commands).

Course Plan

Section 01

Chapter 1

  • Introduction to Functional Programming
  • What is Functional Programming (FP)?
  • Terminology: First-Class and Higher-Order Functions
  • Terminology: Lambda vs Closure
  • A Short List of Languages that Support FP
  • FP with Java
  • FP With JavaScript
  • Imperative Programming in JavaScript
  • The JavaScript map (FP) Example
  • The JavaScript reduce (FP) Example
  • Using reduce to Flatten an Array of Arrays (FP) Example
  • The JavaScript filter (FP) Example
  • Common Higher-Order Functions in Python
  • Common Higher-Order Functions in Scala
  • Elements of FP in R
  • Summary
Section 02

Chapter 2

  • Introduction to Apache Spark
  • What is Spark?
  • A Short History of Spark
  • Where to Get Spark?
  • The Spark Platform
  • Spark Logo
  • Common Spark Use Cases
  • Languages Supported by Spark
  • Running Spark on a Cluster
  • The Driver Process
  • Spark Applications
  • Spark Shell
  • The spark-submit Tool
  • The spark-submit Tool Configuration
  • The Executor and Worker Processes
  • The Spark Application Architecture
  • Interfaces with Data Storage Systems
  • Limitations of Hadoop's MapReduce
  • Spark vs MapReduce
  • Spark as an Alternative to Apache Tez
  • The Resilient Distributed Dataset (RDD)
  • Spark Streaming (Micro-batching)
  • Spark SQL
  • Example of Spark SQL
  • Spark Machine Learning Library
  • GraphX
  • Spark vs R
  • Summary
Section 03

Chapter 3

  • Hadoop Distributed File System Overview
  • Hadoop Distributed File System (HDFS)
  • HDFS High Availability
  • HDFS 'Fine Print'
  • Storing Raw Data in HDFS
  • Hadoop Security
  • HDFS Rack-awareness
  • Data Blocks
  • Data Block Replication Example
  • HDFS NameNode Directory Diagram
  • Accessing HDFS
  • Examples of HDFS Commands
  • Other Supported File Systems
  • WebHDFS
  • Examples of WebHDFS Calls
  • Client Interactions with HDFS for the Read Operation
  • Read Operation Sequence Diagram
  • Client Interactions with HDFS for the Write Operation
  • Communication inside HDFS
  • Summary
Section 04

Chapter 4

  • The Spark Shell
  • The Spark Shell
  • The Spark Shell UI
  • Spark Shell Options
  • Getting Help
  • The Spark Context (sc) and SQL Context (sqlContext)
  • The Shell Spark Context
  • Loading Files
  • Saving Files
  • Basic Spark ETL Operations
  • Summary
Section 05

Chapter 5

  • Spark RDDs
  • The Resilient Distributed Dataset (RDD)
  • Ways to Create an RDD
  • Custom RDDs
  • Supported Data Types
  • RDD Operations
  • RDDs are Immutable
  • Spark Actions
  • RDD Transformations
  • Other RDD Operations
  • Chaining RDD Operations
  • RDD Lineage
  • The Big Picture
  • What May Go Wrong
  • Checkpointing RDDs
  • Local Checkpointing
  • Parallelized Collections
  • More on parallelize() Method
  • The Pair RDD
  • Where do I use Pair RDDs?
  • Example of Creating a Pair RDD with Map
  • Example of Creating a Pair RDD with keyBy
  • Miscellaneous Pair RDD Operations
  • RDD Caching
  • RDD Persistence
  • The Tachyon Storage
  • Summary
Section 06

Chapter 6

  • Shared Variables in Spark
  • Shared Variables in Spark
  • Broadcast Variables
  • Creating and Using Broadcast Variables
  • Example of Using Broadcast Variables
  • Accumulators
  • Creating and Using Accumulators
  • Example of Using Accumulators
  • Custom Accumulators
  • Summary
Section 07

Chapter 7

  • Parallel Data Processing with Spark
  • Running Spark on a Cluster
  • Spark Stand-alone Option
  • The High-Level Execution Flow in Stand-alone Spark Cluster
  • Data Partitioning
  • Data Partitioning Diagram
  • Single Local File System RDD Partitioning
  • Multiple File RDD Partitioning
  • Special Cases for Small-sized Files
  • Parallel Data Processing of Partitions
  • Spark Application, Jobs, and Tasks
  • Stages and Shuffles
  • The 'Big Picture'
  • Summary
Section 08

Chapter 8

  • Introduction to Spark SQL
  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Hive Integration
  • Hive Interface
  • Integration with BI Tools
  • Spark SQL is No Longer an Experimental Developer API!
  • What is a DataFrame?
  • The SQLContext Object
  • The SQLContext API
  • Changes from Spark SQL 1.3 to 1.4
  • Example of Spark SQL (Scala Example)
  • Example of Working with a JSON File
  • Example of Working with a Parquet File
  • Using JDBC Sources
  • JDBC Connection Example
  • Performance & Scalability of Spark SQL
  • Summary
Section 09

Chapter 9

  • Graph Processing with GraphX
  • What is GraphX?
  • Supported Languages
  • Vertices and Edges
  • Graph Terminology
  • Example of Property Graph
  • The GraphX API
  • The GraphX Views
  • The Triplet View
  • Graph Algorithms
  • Graphs and RDDs
  • Constructing Graphs
  • Graph Operators
  • Example of Using GraphX Operators
  • GraphX Performance Optimization
  • The PageRank Algorithm
  • GraphX Support for PageRank
  • Summary
Section 10

Chapter 10

  • Machine Learning Algorithms
  • Supervised vs Unsupervised Machine Learning
  • Supervised Machine Learning Algorithms
  • Unsupervised Machine Learning Algorithms
  • Choose the Right Algorithm
  • Life-cycles of Machine Learning Development
  • Classifying with k-Nearest Neighbors (SL)
  • k-Nearest Neighbors Algorithm
  • k-Nearest Neighbors Algorithm
  • The Error
