Big Data Platform Design Using Apache Tools
Level: Intermediate
Duration: 24h / 3 days
Date: Individually arranged
Price: Individually arranged
The Big Data Platform Design Using Apache Tools training is a practical, 3-day workshop in which participants learn modern methods of building scalable and efficient Big Data platforms. The program is built around popular open-source Apache tools such as Apache Hadoop, Spark, Kafka, NiFi, Flink, Iceberg, and Airflow. The course covers the theoretical foundations of platform architecture and, above all, provides practical skills in designing, implementing, and managing complex analytical systems. The training combines 80% practice with 20% theory, so participants quickly build the competencies needed to work with large data volumes in production environments.
What You Will Learn
- Design and implement data pipelines for batch and stream processing
- Understand the principles of building modern, scalable Big Data architecture using Apache tools
- Gain skills in configuring and managing systems like Hadoop, Kafka, NiFi, Spark, and Flink
- Master techniques for managing metadata, data lineage, and automating workflows
- Learn best deployment practices and methods for optimizing and monitoring Big Data platforms
Who is this training for?
- IT specialists, Big Data architects, and data engineers aiming to design modern, scalable Big Data platforms
- DevOps engineers and administrators responsible for deploying and managing Hadoop/Spark/Kafka infrastructure
- Data analysts and engineers who want to understand Apache architectures and tools for data processing and analysis
- Individuals planning to extend existing solutions or start new Big Data projects
Training Program
Day 1: Fundamentals of Big Data Architecture and Apache Tools
Module 1: Introduction to Big Data Architecture
- Basic concepts and layers of Big Data architecture: data, processing, management, analysis
- Architecture models: Data Lake, Lambda, Kappa, Data Lakehouse
- Design criteria: data type, scalability, batch vs. stream processing
- Overview of data processing methods: batch vs. stream
Module 2: Apache Hadoop and HDFS
- HDFS architecture: NameNode and DataNode roles
- Batch processing with MapReduce – basics and use cases
- Administration and monitoring of Hadoop clusters
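To make the MapReduce model from this module concrete, below is a minimal, locally runnable Python sketch of the map → shuffle → reduce flow for a word count; the sample input is invented, and on a real cluster the same mapper/reducer logic would typically be submitted to YARN (for example via Hadoop Streaming).

```python
# Local simulation of the MapReduce word-count pattern:
# map -> shuffle/sort by key -> reduce.
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, exactly what a streaming mapper would print.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def shuffle(pairs):
    # Group values by key; on a cluster Hadoop performs this step
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Sum the counts collected for each word.
    for word, counts in grouped:
        yield word, sum(counts)

if __name__ == "__main__":
    sample = ["big data on hadoop", "spark and hadoop", "big data pipelines"]
    for word, count in sorted(reduce_phase(shuffle(map_phase(sample)))):
        print(word, count)
```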
Module 3: Basics of Python Programming in the Context of Big Data
- Functional programming concepts and Python vs. Java comparison
- Python elements for data processing: DataFrames, lambdas, comprehensions, map, filter
- Practical exercises: simple data processing and integration with Big Data tools (e.g. PySpark)
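A small, self-contained sketch of the Python constructs listed in this module (lambdas, map, filter, comprehensions); the sample records are invented for illustration, and the same functional style carries over directly to PySpark transformations.

```python
# Functional-style processing of a few sample records using
# lambdas, map, filter and a list comprehension (standard library only).
records = [
    {"user": "alice", "bytes": 1200},
    {"user": "bob", "bytes": 0},
    {"user": "carol", "bytes": 5400},
]

# Keep only non-empty transfers, then convert bytes to kilobytes.
active = filter(lambda r: r["bytes"] > 0, records)
kilobytes = list(map(lambda r: (r["user"], r["bytes"] / 1024), active))

# The same pipeline expressed as a single comprehension.
kilobytes_v2 = [(r["user"], r["bytes"] / 1024) for r in records if r["bytes"] > 0]

print(kilobytes)
print(kilobytes_v2)
```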
Day 2: Data Processing and Integration Tools
Module 4: Streaming and Queues – Apache Kafka and Apache NiFi
- Apache Kafka architecture: producers, consumers, partitions, replication
- Apache NiFi: managing data flows and integrating sources and sinks
- Practical exercises: creating and monitoring data flows
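As a rough illustration of the Kafka concepts above, the sketch below sends one message and reads it back with the kafka-python client; the client library, broker address, and topic name are assumptions for a local test setup rather than the prescribed course environment.

```python
# Minimal Kafka produce/consume round trip using the kafka-python client.
# Broker address and topic name are placeholders for a local test broker.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"sensor-1", value=b'{"temp": 21.5}')
producer.flush()  # make sure the message actually reaches the broker

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read the topic from the beginning
    consumer_timeout_ms=5000,      # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```

In practice the producer and consumer would run as separate services; they are combined here only to keep the example self-contained.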
Module 5: Real-Time and Batch Data Analysis – Apache Spark and Flink
- Spark architecture: RDD, DataFrame, Spark SQL
- Flink: stream processing, time windows, state management
- Designing batch and streaming jobs, job optimization, and the Catalyst optimizer
- Integration with Apache Hadoop and application deployment
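A minimal PySpark sketch touching the three Spark APIs named in this module (RDD, DataFrame, Spark SQL); the input path and column names are placeholders, and a local Spark installation is assumed.

```python
# Minimal PySpark batch job using the DataFrame, Spark SQL and RDD APIs.
# The input path and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("platform-design-demo").getOrCreate()

# DataFrame API: read a CSV with a header row and aggregate it.
df = spark.read.option("header", True).csv("/data/events.csv")
daily = df.groupBy("event_date").agg(F.count("*").alias("events"))

# Spark SQL: the same aggregation expressed as a query on a temp view.
df.createOrReplaceTempView("events")
daily_sql = spark.sql(
    "SELECT event_date, COUNT(*) AS events FROM events GROUP BY event_date"
)

# RDD API: drop down a level for record-by-record custom logic.
first_dates = df.rdd.map(lambda row: row["event_date"]).take(5)

daily.show()
daily_sql.show()
print(first_dates)
spark.stop()
```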
Day 3 (Optional): Data Storage, Workflow Management, and Governance
Module 6: Data and Metadata Management
- Apache Iceberg: scalable table format, ACID support, query optimization
- Apache Atlas: metadata management, governance, data lineage
- Apache Druid: architecture, indexing, real-time and batch analytics
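Iceberg tables are normally created and queried through an engine such as Spark. The sketch below assumes a Spark session with an Iceberg catalog already registered under the name `demo` (a configuration detail that varies by platform); the catalog, schema, and table names are placeholders.

```python
# Creating and querying an Apache Iceberg table through Spark SQL.
# Assumes Spark was started with an Iceberg catalog registered as "demo"
# (e.g. spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog);
# catalog, schema and table names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.lake.events (
        event_id   BIGINT,
        event_date DATE,
        payload    STRING
    ) USING iceberg
    PARTITIONED BY (event_date)
""")

spark.sql("INSERT INTO demo.lake.events VALUES (1, DATE '2024-01-01', 'hello')")

# Iceberg exposes table metadata (snapshots, history) as queryable tables.
spark.sql("SELECT snapshot_id, committed_at FROM demo.lake.events.snapshots").show()
```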
Module 7: Automation and Orchestration – Apache Airflow and CI/CD
- Designing workflows and managing dependencies with Airflow
- Implementing data pipelines and automating processing
- Integration with CI/CD tools and production environments
- Defining DAGs and working with tasks in Python and Bash
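A minimal Airflow DAG illustrating one Bash task and one Python task with a simple dependency, assuming a recent Airflow 2.x installation; the DAG id, schedule, and task logic are placeholders.

```python
# Minimal Airflow DAG with one Bash task and one Python task.
# DAG id, schedule and the extract/transform logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    # Placeholder for the real transformation step of the pipeline.
    print("transforming staged data")

with DAG(
    dag_id="demo_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pulling raw data into staging'",
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
    )
    extract >> transform_task  # run the Bash step before the Python step
```

Dropped into the Airflow DAGs folder, this file is picked up by the scheduler, and the `>>` operator expresses the task dependency.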