PySpark Training

Level: Intermediate
Duration: 32h / 4 days
Date: Individually arranged
Price: Individually arranged

PySpark is the Python API for Apache Spark. It lets you write distributed data processing jobs in Python and run them on a cluster, and it exposes Spark's core operations, such as mapping, filtering, grouping, and aggregating data. PySpark is widely used in Big Data processing, data analysis, and machine learning.

What You Will Learn

Understand the application of Big Data in organizations
Learn fundamental concepts related to working with data in Apache Spark
Master Spark Core and Spark SQL
Apply Spark ML in practical scenarios

Who is this training for?

Developers with knowledge of Python
Individuals who want to learn one of the most popular tools for data processing
Data analysts with Python experience
Data scientists

Training Program

Module 1 – Apache Spark Architecture

Understanding Spark components and their roles
Positioning Apache Spark within the Big Data landscape

Module 2 – RDDs (Resilient Distributed Datasets)

Core concept for distributed data processing in Apache Spark

Module 3 – Differences Between Python Syntax and PySpark

Comparing RDDs and pandas DataFrames

Module 4 – Variables, Partitioning, and Core Spark Concepts

Deep dive into Spark’s foundational elements

Module 5 – Spark SQL

Working with DataFrames
Syntax, schemas, and aggregations

Module 6 – Spark ML (Machine Learning)

Introduction to machine learning capabilities in Spark

Module 7 – Prototyping

Developing and testing data processing workflows

Module 8 – Running and Managing Jobs on a Cluster

Best practices for job execution and cluster management

Module 9 – Testing Processes

Ensuring reliability and correctness of data pipelines

Module 10 – Optimization and Task Configuration

Techniques for improving performance and resource utilization

Module 11 – Spark Structured Streaming

Handling real-time data streams with Apache Spark

Module 12 – Q&A Session

Addressing participant questions and clarifications

Contact us

We will organize training tailored to your needs.

Przemysław Wołosz

Key Account Manager

+48 730 830 801

przemyslaw.wolosz@infoShareAcademy.com
