The 16-hour course starts with an introduction to Apache Spark and the fundamental concepts and APIs that enable big data processing. Real-world datasets and a Spark computing cluster will be available, and fully working examples in all supported languages (Python / R / Java / Scala / SQL), along with exercises, will be provided.
Participants will then focus on using the Spark API for common real-world data engineering, data integration, and processing tasks, ingesting data from various sources (e.g. relational databases, distributed filesystems) into the Spark engine. Particular attention will be given to observability, monitoring, and performance assessment of these tasks, so that participants build up the understanding needed to tune and optimize applications for better performance and increased cluster utilization.
Concepts such as job scheduling and scheduler tuning will be covered as well. After building a solid understanding of the Spark API for batch processing, participants will move on to Spark Structured Streaming, building and optimizing stream processing pipelines (both stateful and stateless) using Kafka and other real-world input sources.
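To give a flavor of the hands-on material, below is a minimal PySpark sketch of the kind of batch task covered early in the course: reading a table from a relational database over JDBC and persisting it to a distributed filesystem. The connection URL, table, credentials, and paths are hypothetical placeholders, and a suitable JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-to-parquet").getOrCreate()

# Read a table from a relational database (placeholder URL and credentials).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")  # hypothetical host
    .option("dbtable", "public.orders")                    # hypothetical table
    .option("user", "spark")
    .option("password", "secret")
    .load()
)

# Keep only completed orders and persist them to the distributed filesystem.
completed = orders.filter(orders.status == "COMPLETED")
completed.write.mode("overwrite").parquet("hdfs:///warehouse/orders_completed")
```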
Who should attend
This 16-hour course is open to anyone with some programming skills who is interested in using Spark for data engineering. However, it has been designed primarily with the following roles in mind:
- Software Engineers,
- Data Warehouse Engineers,
- Data Scientists and
- Data Engineers
with adequate programming skills who are willing to take the next step: understanding how distributed data processing engines work in practice and how to make the best use of them to solve real-world problems.
Prerequisites
Programming experience with one of Python / R / Java / Scala / SQL, and a solid understanding of the selected language's data structures, collections, and input/output APIs.
Experience with data management in relational databases and an understanding of SQL internals is a plus.
What will you learn
Distributed Data Processing Fundamentals
- HDFS and Distributed Filesystems
- Resource Managers
- Distributed Jobs Scheduling
- File Formats (see the sketch after this list)
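As a small taste of the file formats topic, the sketch below converts a row-oriented CSV dataset (at a hypothetical HDFS path) into the columnar Parquet format, which is compressed and far cheaper to scan; the trade-offs between such formats are discussed in depth during the course.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-formats").getOrCreate()

# Read raw CSV files (row-oriented; schema inferred here for brevity).
events = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("hdfs:///raw/events/*.csv")  # hypothetical input path
)

# Rewrite as Parquet: columnar, compressed, and much faster to query.
events.write.mode("overwrite").parquet("hdfs:///curated/events")
```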
Batch Processing with Spark
- Introduction to Spark's fundamental APIs (DataFrames, Datasets, RDDs)
- Connecting to Sources, Writing Output
- Working with and Integrating Different Types of Data (Structured / Unstructured)
- Schema Definition and Management, Partition Management
- High-Performance Aggregations and Joins among Disparate Datasets (see the sketch after this list)
- Advanced RDD operations
- Deploying, Redeploying, and Restarting After Failure; Monitoring Spark Applications
- Debugging and Tuning Spark applications
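As a preview of the aggregation, join, and tuning material, here is a minimal PySpark sketch, with hypothetical dataset paths and column names, that broadcasts a small dimension table to avoid shuffling the large fact table and then prints the physical plan for inspection:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-join").getOrCreate()

sales = spark.read.parquet("hdfs:///curated/sales")        # large fact table
products = spark.read.parquet("hdfs:///curated/products")  # small dimension

# Broadcasting the small table yields a broadcast hash join: the large
# table is not shuffled across the cluster.
revenue_per_category = (
    sales.join(F.broadcast(products), "product_id")
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
)

# Inspect the physical plan, the starting point for debugging and tuning.
revenue_per_category.explain()
```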
Stream Processing with Spark
- Intro to Stream Processing with Structured Streaming
- Structured Streaming Sources and Output
- Event-Time-Based Stream Processing
- Stateful Stream Processing
- Monitoring and Optimizing Structured Streaming Applications
- Highly available Streams
- Handling Errors, Restarting, and Redeploying Streams Without Losing Data (see the sketch after this list)
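As a preview of the streaming material, the sketch below runs a stateful, event-time windowed count over a Kafka topic, using a watermark to bound the state and a checkpoint location so the query can be restarted without losing data. The broker address, topic, and paths are hypothetical, and the Spark Kafka connector package is assumed to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic (hypothetical broker and topic names).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .select(
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp"),
    )
)

# Event-time windowed count; the watermark bounds how much state is kept.
counts = (
    events.withWatermark("timestamp", "10 minutes")
    .groupBy(F.window("timestamp", "5 minutes"))
    .count()
)

# Checkpointing lets the query restart after failure without data loss.
query = (
    counts.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```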
Schedule
The next virtual sessions are scheduled for:
- 6–9 June 2023, 10:00–14:00 EEST
- 4–7 December 2023, 10:00–14:00 EET