Level

Intermediate

Duration

32h / 4 days

Date

Individually arranged

Price

Individually arranged

PySpark Training

PySpark is the Python API for Apache Spark. It lets you write and run distributed jobs on a cluster in Python, and it exposes Spark's core data operations, such as mapping, filtering, grouping, and aggregation. PySpark is widely used in Big Data processing, data analysis, and machine learning.

What You Will Learn

  • Understand the application of Big Data in organizations
  • Learn fundamental concepts related to working with data in Apache Spark
  • Master Spark Core and Spark SQL
  • Apply Spark ML in practical scenarios

Who Is This Training For?

  • Developers with knowledge of Python
  • Individuals who want to learn one of the most popular tools for data processing
  • Data analysts with Python experience
  • Data scientists

Training Program

  Module 1 – Apache Spark Architecture

  • Understanding Spark components and their roles
  • Positioning Apache Spark within the Big Data landscape
  Module 2 – RDDs (Resilient Distributed Datasets)

  • Core concept for distributed data processing in Apache Spark
  Module 3 – Differences Between Python Syntax and PySpark

  • Comparing RDDs and pandas DataFrames
  Module 4 – Variables, Partitioning, and Core Spark Concepts

  • Deep dive into Spark’s foundational elements
  Module 5 – Spark SQL

  • Working with DataFrames
  • Syntax, schemas, and aggregations
  Module 6 – Spark ML (Machine Learning)

  • Introduction to machine learning capabilities in Spark
  Module 7 – Prototyping

  • Developing and testing data processing workflows
  Module 8 – Running and Managing Jobs on a Cluster

  • Best practices for job execution and cluster management
  Module 9 – Testing Processes

  • Ensuring reliability and correctness of data pipelines
  Module 10 – Optimization and Task Configuration

  • Techniques for improving performance and resource utilization
  Module 11 – Spark Structured Streaming

  • Handling real-time data streams with Apache Spark
  Module 12 – Q&A Session

  • Addressing participant questions and clarifications

Contact us

We will organize training tailored to your needs.

Przemysław Wołosz

Key Account Manager

przemyslaw.wolosz@infoShareAcademy.com

    The controller of your personal data is InfoShare Academy Sp. z o.o. with its registered office in Gdańsk, al. Grunwaldzka 427B, 80-309 Gdańsk, KRS: 0000531749, NIP: 5842742121. Personal data are processed in accordance with the information clause.