Apache Spark for Big Data Processing
The 16-hour course starts with an introduction to Apache Spark and the fundamental concepts and APIs that enable Big Data Processing. Real-world datasets and a Spark computing cluster will be available, and fully working examples in all languages (Python / R / Java / Scala / SQL) will be provided along with exercises.
Participants will then focus on using the Spark API for common real-world Data Engineering, Data Integration and processing tasks on data from various sources (e.g. Relational Databases, Distributed Filesystems) within the Spark Engine. Particular focus will be given to Observability, Monitoring and Performance assessment of the Tasks, so that participants build the understanding needed to tune them for better performance and higher cluster utilization.
Concepts such as scheduling jobs and tuning schedulers for optimal performance will be covered as well. After building a solid understanding of the Spark API for processing batch data, participants will move on to Spark Structured Streaming and build and optimize stream processing pipelines (both Stateful and Stateless) using Kafka and other real-world input sources.