Spark Fundamentals

Available since November 2, 2019

Course description

This Spark training course covers the theory and practice of Spark programming. It teaches developers Spark fundamentals, APIs, and common programming idioms. Hands-on labs reinforce the material and quickly get attendees up to speed on using Spark for data exploration.

Course Agenda:
- Elements of functional programming
- Spark Shell
- RDDs
- Parallel processing in Spark
- Spark SQL
- ETL with Spark
- MLlib Machine Learning Library
- Graph Processing with GraphX
- Spark Streaming

Target audience

Developers, Business Analysts, and IT Architects

Course requirements

Participants should have general programming knowledge and experience working in Unix-like environments (e.g., running shell commands).

Course Plan

Section 01, Chapter 1: Introduction to Functional Programming
- What is Functional Programming (FP)?
- Terminology: First-Class and Higher-Order Functions
- Terminology: Lambda vs Closure
- A Short List of Languages that Support FP
- FP with Java
- FP with JavaScript
- Imperative Programming in JavaScript
- The JavaScript map (FP) Example
- The JavaScript reduce (FP) Example
- Using reduce to Flatten an Array of Arrays (FP) Example
- The JavaScript filter (FP) Example
- Common Higher-Order Functions in Python
- Common Higher-Order Functions in Scala
- Elements of FP in R
- Summary

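As a taste of the idioms listed for Chapter 1, here is a minimal Python sketch of map, filter, and reduce, including the use of reduce to flatten a list of lists. The sample data and variable names are illustrative assumptions and are not taken from the course material.

```python
from functools import reduce

# Illustrative sample data (an assumption, not from the course).
numbers = [1, 2, 3, 4, 5]
nested = [[1, 2], [3], [4, 5]]

# map: apply a function to every element.
squares = list(map(lambda n: n * n, numbers))

# filter: keep only the elements that satisfy a predicate.
evens = list(filter(lambda n: n % 2 == 0, numbers))

# reduce: fold the list into a single value, starting from 0.
total = reduce(lambda acc, n: acc + n, numbers, 0)

# reduce can also flatten a list of lists by concatenating each
# inner list onto an initially empty accumulator.
flat = reduce(lambda acc, xs: acc + xs, nested, [])

print(squares, evens, total, flat)
```

Each call takes a function as an argument, which is what makes map, filter, and reduce higher-order functions.
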
Section 02, Chapter 2: Introduction to Apache Spark
- What is Spark?
- A Short History of Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Spark Streaming (Micro-batching)
- Spark SQL
- Example of Spark SQL
- Spark Machine Learning Library
- GraphX
- Spark vs R
- Summary

Section 03, Chapter 3: Hadoop Distributed File System Overview
- Hadoop Distributed File System (HDFS)
- HDFS High Availability
- HDFS 'Fine Print'
- Storing Raw Data in HDFS
- Hadoop Security
- HDFS Rack-awareness
- Data Blocks
- Data Block Replication Example
- HDFS Name Node Directory Diagram
- Accessing HDFS
- Examples of HDFS Commands
- Other Supported File Systems
- WebHDFS
- Examples of WebHDFS Calls
- Client Interactions with HDFS for the Read Operation
- Read Operation Sequence Diagram
- Client Interactions with HDFS for the Write Operation
- Communication inside HDFS
- Summary

Section 04, Chapter 4: The Spark Shell
- The Spark Shell
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- The Spark Context (sc) and SQL Context (sqlContext)
- The Shell Spark Context
- Loading Files
- Saving Files
- Basic Spark ETL Operations
- Summary

Section 05, Chapter 5: Spark RDDs
- The Resilient Distributed Dataset (RDD)
- Ways to Create an RDD
- Custom RDDs
- Supported Data Types
- RDD Operations
- RDDs are Immutable
- Spark Actions
- RDD Transformations
- Other RDD Operations
- Chaining RDD Operations
- RDD Lineage
- The Big Picture
- What May Go Wrong
- Checkpointing RDDs
- Local Checkpointing
- Parallelized Collections
- More on parallelize() Method
- The Pair RDD
- Where do I use Pair RDDs?
- Example of Creating a Pair RDD with Map
- Example of Creating a Pair RDD with keyBy
- Miscellaneous Pair RDD Operations
- RDD Caching
- RDD Persistence
- The Tachyon Storage
- Summary

Section 06, Chapter 6: Shared Variables in Spark
- Shared Variables in Spark
- Broadcast Variables
- Creating and Using Broadcast Variables
- Example of Using Broadcast Variables
- Accumulators
- Creating and Using Accumulators
- Example of Using Accumulators
- Custom Accumulators
- Summary

Section 07, Chapter 7: Parallel Data Processing with Spark
- Running Spark on a Cluster
- Spark Stand-alone Option
- The High-Level Execution Flow in Stand-alone Spark Cluster
- Data Partitioning
- Data Partitioning Diagram
- Single Local File System RDD Partitioning
- Multiple File RDD Partitioning
- Special Cases for Small-sized Files
- Parallel Data Processing of Partitions
- Spark Application, Jobs, and Tasks
- Stages and Shuffles
- The 'Big Picture'
- Summary

Section 08, Chapter 8: Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Hive Integration
- Hive Interface
- Integration with BI Tools
- Spark SQL is No Longer Experimental Developer API!
- What is a DataFrame?
- The SQLContext Object
- The SQLContext API
- Changes Between Spark SQL 1.3 to 1.4
- Example of Spark SQL (Scala Example)
- Example of Working with a JSON File
- Example of Working with a Parquet File
- Using JDBC Sources
- JDBC Connection Example
- Performance & Scalability of Spark SQL
- Summary

Section 09, Chapter 9: Graph Processing with GraphX
- What is GraphX?
- Supported Languages
- Vertices and Edges
- Graph Terminology
- Example of Property Graph
- The GraphX API
- The GraphX Views
- The Triplet View
- Graph Algorithms
- Graphs and RDDs
- Constructing Graphs
- Graph Operators
- Example of Using GraphX Operators
- GraphX Performance Optimization
- The PageRank Algorithm
- GraphX Support for PageRank
- Summary

Section 10, Chapter 10: Machine Learning Algorithms
- Supervised vs Unsupervised Machine Learning
- Supervised Machine Learning Algorithms
- Unsupervised Machine Learning Algorithms
- Choose the Right Algorithm
- Life-cycles of Machine Learning Development
- Classifying with k-Nearest Neighbors (SL)
- k-Nearest Neighbors Algorithm
- k-Nearest Neighbors Algorithm
- The Error

Total Price: Request Quotation

Skill level: Beginner

Language: English

Certificate: No

Max students: 10

Total Duration: 3 days
