Data is no longer static, and offloading it for reporting in big batches once a day is not enough. Nor is data stored only in databases: useful data for your business can be generated in real time by external sources (Facebook, real-time payment information delivered through a bank's webhooks, third-party services).
Would you like a deep dive into Big Data streaming architectures and techniques? Then you are in the right place. Learn how to take advantage of streaming data by developing streaming applications that ingest and analyse it in real time. Focus on building, monitoring and manipulating data pipelines with tools and frameworks such as Apache Spark.
Who should attend
IT professionals interested in crossing over into development territory in the Big Data domain.
Prerequisites
- SQL fluency
- Python & Java or Scala basics
- An overview of Batch Big Data Processing
- Some experience with queueing systems (optional) or Change Data Capture/Log Shipping engines found in modern relational databases
What you will learn
The course consists of the following chapters:
- Big Data Architecture Overview
- Get to know the Big Data landscape, with examples of real-world streaming big data problems and the three key sources of Big Data: people, organizations, and sensors.
- Understand the architectural components and programming models used for scalable streaming big data analysis.
- Summarize the features and value of core Hadoop stack components including the YARN resource and job management system, the HDFS file system and the MapReduce programming model.
- Big Data Tools & Practices
- Data Warehouse
- Streaming Applications
- Retrieve data from example databases and big data management systems in real time or near real time
- Identify when a big data problem needs data integration with systems that generate data in real time (e.g. web services, queueing systems)
- Execute simple big data integration and processing on Hadoop and Spark platforms
- Select a data model to suit the characteristics of your data
- Apply techniques to handle streaming data
- How to persist and organize streaming data for further offline processing
- Unified Tools & User Interfaces for
- Operations on streaming data (joining streams with each other or with reference data, managing state, sorting, filtering) using a modern framework like Apache Spark
- Monitoring operations on streaming data and long running flows
- How to handle high availability and errors so as not to fall behind in real-time data processing
- Integration with Other tools
- Recognize different streaming data elements in your own work and in everyday life problems
- Explain why your team needs to design a Big Data Infrastructure Plan and Information System Design
- Identify the frequent data operations required for various types of streaming data
- Build your own streaming data analytics pipelines with tools and frameworks such as Apache Spark
- Data Engineers Workflows
All course material will be taught on a reference Hadoop installation, and all participants will be required to run and/or develop examples using the tools that we will make available.
Schedule
(virtual) 4-7 December 2023, 10 am-2 pm (EET)