Apache Spark for Big Data Processing

...
This is now a virtual classroom course. You can find more information about our virtual classroom here
Total price:
490.00600.00 final price
Variation Title Price Add To Cart Quantity
(virtual) 6,7,8,9/6/2023 10am-2pm EEST € 490
(virtual) 4,5,6,7/12/2023 10am-2pm EET € 490

The 16-hour course starts with an introduction to Apache Spark and the fundamental concepts and APIs that enable Big Data Processing. Real world datasets and a Spark Computing cluster will be available and fully working examples in all languages (Python / R / Java / Scala / SQL) along with exercises will be provided.

Consequently, participants are focusing on using Spark API for common real world Data Engineering, Data Integration and processing tasks from various sources (eg Relational Databases, Distributed Filesystems) within Spark Engine. Particular focus will be given to Observability, Monitoring and Performance assessment of the Tasks so as to build up understanding and further tune and optimize for optimal performance and increased cluster utilization.

Concepts such as scheduling jobs and tuning schedulers for optimal performance will be covered as well. Once building a solid understanding on Spark API on processing batch data, participants will further move to Spark Structured Streaming and build and optimize stream processing pipelines (both Stateful and Stateless) using Kafka and other real world input sources.

Who should attend

This 16-hour course can be attended by anyone interested in using Spark for Data Engineering with some programming skills. However, this course has been created having in mind:

  • Software Engineers,
  • Data Warehouse engineers, 
  • Data Scientists and
  • Data Engineers

with adequate programming skills willing to make the next step and understand how distributed data processing engines work in practice and how they can make best use of them to solve real world problems.

Prerequisites

Programming experience with one of Python / R / Java / Scala / SQL. Solid understanding of the each selected language’s structures, collections and input/output API.

Experience with data management with relational databases and understanding of SQL internals is a plus.

What will you learn

Distributed Data Processing Fundamentals

  • HDFS and Distributed Filesystems
  • Resource Managers 
  • Distributed Jobs Scheduling
  • File Formats 

Batch Processing with Spark

  • Introduction to Sparks’ fundamental APIs (DataFrames, Datasets, RDDs)
  • Connecting to Sources , writing Output
  • Working and integrating different types of data (Structured/ Unstructured)
  • Schema definition and Management , Partition Management
  • High Performance Aggregations and Joins among disparate datasets
  • Advanced RDD operations
  • Deploying / Redeploying / Restarting after failure and Monitoring Spark applications
  • Debugging and Tuning Spark applications

Stream Processing with Spark

  • Intro to Stream Processing with Structured Streaming
  • Structured Streaming Sources and Output
  • Event Time based Stream Processing
  • Stateful Stream Processing 
  • Monitoring and Optimizing Structured Streaming Applications
  • Highly available Streams 
  • Handling Errors , Restarting , Redeploying streams without losing data

Schedule

Next virtual sessions have been scheduled for:

  • 6, 7, 8 & 9/06/2023, 10:00 – 14:00 EEST
  • 4, 5, 6 & 7/12/2023, 10:00 – 14:00 EET