Spark Architectural Components

Spark Architecture Explained

Apache Spark is an open-source, distributed computing system designed for big data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark's architecture is based on two main abstractions: Resilient Distributed Datasets (RDDs) and Directed Acyclic Graphs (DAGs).
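
As a quick illustration of these two abstractions, the sketch below builds an RDD and a couple of transformations in PySpark; the local master URL and app name are assumptions chosen so it runs on a single machine. The lineage printed at the end is the DAG Spark would replay to rebuild lost partitions.

```python
from pyspark.sql import SparkSession

# Sketch of the two abstractions; the local master and app name are
# placeholder assumptions.
spark = SparkSession.builder.master("local[*]").appName("rdd-dag-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))        # an RDD split into partitions
doubled = numbers.map(lambda x: x * 2)        # transformations extend the DAG lazily
evens = doubled.filter(lambda x: x % 4 == 0)

# The lineage (the DAG of transformations) is what Spark replays to rebuild
# lost partitions; this is the basis of its fault tolerance.
print(evens.toDebugString())

spark.stop()
```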

Key Components of Spark Architecture

1. Driver Program

The Driver Program runs the main function of the application and creates the SparkContext object. The SparkContext coordinates the Spark application, which runs as an independent set of processes on the cluster. It connects to a cluster manager to acquire executors on cluster nodes, sends the application code to those executors, and then sends them the tasks to run.
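
The snippet below is a minimal sketch of what a driver program does; the app name and the local master URL are assumptions so it runs on a single machine.

```python
from pyspark.sql import SparkSession

# A driver program in miniature: build the SparkSession (which wraps the
# SparkContext) and submit work through it. The app name and local master
# are placeholder assumptions.
spark = (
    SparkSession.builder
    .appName("driver-demo")
    .master("local[*]")
    .getOrCreate()
)

sc = spark.sparkContext        # the SparkContext coordinating this application
print(sc.applicationId)        # assigned once the context registers with the cluster manager

spark.stop()                   # releases the executors acquired for this application
```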

2. Cluster Manager

The Cluster Manager allocates resources across applications. Spark can run on various cluster managers, such as Hadoop YARN, Apache Mesos, and its own standalone scheduler. The cluster manager launches executors on worker nodes on behalf of the driver.
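
Which cluster manager is used is controlled by the master URL, as sketched below; the host names are placeholders, and only the URL changes between cluster managers, not the application code.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: the master URL selects the cluster manager. The host names below
# are placeholders, not real endpoints.
conf = SparkConf().setAppName("cluster-manager-demo")

# conf.setMaster("yarn")                        # Hadoop YARN
# conf.setMaster("spark://master-host:7077")    # Spark standalone scheduler
# conf.setMaster("mesos://mesos-host:5050")     # Apache Mesos
conf.setMaster("local[*]")                      # no cluster manager: run locally

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.master)
spark.stop()
```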

3. Worker Nodes

Worker Nodes are the cluster nodes that run the application code. Each worker node hosts one or more executors, which are processes launched for an application. Executors run tasks and keep the application's data in memory or on disk.

4. Executors

Executors are responsible for executing tasks and storing data in memory or on disk. They register with the driver program at startup and provide a number of task slots (one per core) for running tasks concurrently. Executors read and write external data and, when dynamic allocation is enabled, can be added and removed while the application runs.
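
The sketch below shows the standard configuration properties that size executors; the specific values, app name, and local master are assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

# Sketch of sizing executors for an application. The values are placeholder
# assumptions; the property names are standard Spark configuration keys.
# Note: dynamic allocation also needs shuffle tracking or an external
# shuffle service to be enabled on a real cluster.
spark = (
    SparkSession.builder
    .appName("executor-config-demo")
    .master("local[*]")                                  # placeholder master
    .config("spark.executor.memory", "4g")               # heap per executor
    .config("spark.executor.cores", "2")                 # concurrent task slots per executor
    .config("spark.dynamicAllocation.enabled", "true")   # add/remove executors with load
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .getOrCreate()
)

spark.stop()
```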

Execution Modes

Spark supports three different execution modes:

    1. Cluster Mode: The driver process is launched on a worker node inside the cluster, and the cluster manager is in charge of all Spark application-related processes.
    2. Client Mode: The Spark driver remains on the client machine that submitted the application.
    3. Local Mode: The entire Spark application runs on a single machine, using threads in place of separate executor processes.
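
In practice the mode is usually chosen at submit time rather than in code. The sketch below assumes a hypothetical DEV_LOCAL environment variable as a fallback to local mode during development; the spark-submit commands in the comments show how the other two modes would be selected.

```python
import os
from pyspark.sql import SparkSession

# Sketch: the same application code can run in any of the three modes.
# The master and deploy mode are normally supplied at submit time, e.g.
#   spark-submit --master yarn --deploy-mode cluster app.py   # cluster mode
#   spark-submit --master yarn --deploy-mode client  app.py   # client mode
#   spark-submit --master "local[*]"                 app.py   # local mode
# Settings made in code override spark-submit flags, so the master is only
# set here when the (hypothetical) DEV_LOCAL variable is present.
builder = SparkSession.builder.appName("execution-mode-demo")
if os.environ.get("DEV_LOCAL"):
    builder = builder.master("local[*]")

spark = builder.getOrCreate()
print(spark.sparkContext.master)
spark.stop()
```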

Workflow of Spark Architecture

    1. Client Submission: The client submits the Spark user application code. The driver converts the user code, containing transformations and actions, into a logical directed acyclic graph (DAG).
    2. DAG to Physical Execution Plan: The DAG is converted into a physical execution plan with many stages, creating physical execution units called tasks.
    3. Resource Negotiation: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors in worker nodes on behalf of the driver.
    4. Task Execution: The driver sends tasks to the executors based on data placement. Executors execute the tasks on the partitioned RDDs and return the results to the SparkContext.
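
The sketch below walks through the same workflow on a small word-count job; the local master and sample data are assumptions. The transformations only extend the DAG, and the collect action is what makes the driver schedule stages and tasks on the executors.

```python
from pyspark.sql import SparkSession

# Sketch of the workflow above on a tiny word count; the local master and
# sample data are placeholder assumptions.
spark = SparkSession.builder.master("local[2]").appName("workflow-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "driver", "executor", "spark", "task"], numSlices=2)

# Transformations only build the logical DAG; nothing is scheduled yet.
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide dependency -> stage boundary

# The action triggers the driver to turn the DAG into stages and tasks,
# which the executors run against the partitioned data.
print(counts.collect())

spark.stop()
```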

Conclusion

Apache Spark's architecture is designed to handle large-scale data processing efficiently. Its components, such as the driver program, cluster manager, worker nodes, and executors, work together to execute tasks in a distributed environment. Spark's ability to run in different execution modes and its support for various cluster managers make it a versatile tool for big data processing.
