Spark Architecture Explained
Apache Spark is an open-source, distributed computing system
designed for big data processing. It provides an interface for programming
entire clusters with implicit data parallelism and fault tolerance. Spark's
architecture is based on two main abstractions: Resilient Distributed
Datasets (RDDs) and Directed Acyclic Graphs (DAGs).
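To make those two abstractions concrete, here is a minimal PySpark sketch (assuming pyspark is installed and run on a local machine): transformations on an RDD are only recorded in a DAG, and nothing executes until the final action is called.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("rdd-dag-sketch").setMaster("local[*]")
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(1, 11))          # create an RDD from a local range
squares = numbers.map(lambda x: x * x)          # transformation: recorded in the DAG, not run yet
evens = squares.filter(lambda x: x % 2 == 0)    # another transformation, still lazy

print(evens.collect())                          # action: the DAG is scheduled and executed
sc.stop()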
Key Components of Spark Architecture
1. Driver Program
The Driver Program runs the main function of the application and creates the SparkContext object. The SparkContext coordinates the Spark application, which runs as an independent set of processes on the cluster. It connects to a cluster manager to acquire executors on nodes in the cluster, sends the application code to the executors, and sends tasks to the executors to run.
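A hedged sketch of what a driver program looks like in PySpark: the main() function below is the code that runs as the driver process and creates the SparkContext. The input path input.txt is hypothetical, and in a real deployment the master would be supplied by spark-submit rather than hard-coded.

from pyspark import SparkConf, SparkContext

def main():
    # local[*] keeps this sketch runnable on one machine; on a cluster the
    # master is normally provided by spark-submit.
    conf = SparkConf().setAppName("word-count-driver").setMaster("local[*]")
    sc = SparkContext(conf=conf)   # created by the driver; coordinates the executors

    # The driver ships this transformation code to the executors as tasks.
    counts = (sc.textFile("input.txt")                 # hypothetical input path
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    print(counts.take(10))                             # action: results return to the driver
    sc.stop()

if __name__ == "__main__":
    main()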
2. Cluster Manager
The Cluster Manager allocates resources
across applications. Spark can run on various cluster managers such as Hadoop
YARN, Apache Mesos, and its own standalone scheduler. The cluster manager
launches executors on worker nodes on behalf of the driver.
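As an illustration, the choice of cluster manager is expressed through the master URL. The host names and ports below are placeholders, and in practice the master is usually passed to spark-submit rather than hard-coded in the application.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-manager-sketch")
         # Pick one master depending on the cluster manager in use:
         # .master("yarn")                       # Hadoop YARN
         # .master("spark://master-host:7077")   # Spark standalone scheduler
         # .master("mesos://mesos-host:5050")    # Apache Mesos
         .master("local[*]")                     # single-machine fallback for this sketch
         .getOrCreate())

spark.stop()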
3. Worker Nodes
Worker Nodes are the nodes that run the application code in the cluster. Each worker node hosts one or more executors, which are processes launched for an application. Executors run tasks and keep the application's data in memory or on disk.
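A small sketch of executors keeping data around, assuming a local PySpark session: persist() with the standard MEMORY_AND_DISK storage level asks executors to hold the RDD's partitions in memory and spill them to disk if they do not fit.

from pyspark import SparkContext, StorageLevel

sc = SparkContext(master="local[*]", appName="persist-sketch")

data = sc.parallelize(range(1_000_000))
cached = data.persist(StorageLevel.MEMORY_AND_DISK)   # partitions live on the executors

print(cached.count())   # first action materializes and caches the partitions
print(cached.sum())     # later actions reuse the cached partitions
sc.stop()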
4. Executors
Executors are responsible for executing tasks and storing data in memory or on disk. They register with the driver program at startup and provide a fixed number of task slots (determined by their core count), so several tasks can run in an executor concurrently. Executors read and write external data, and when dynamic allocation is enabled they can be added and removed while the application runs.
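A configuration sketch using standard Spark properties; the values are illustrative, and dynamic allocation additionally needs either shuffle tracking (as below) or an external shuffle service. The executor settings only matter on a real cluster; local[*] is used here just to keep the sketch runnable.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-config-sketch")
         .master("local[*]")                                   # placeholder master for this sketch
         .config("spark.executor.memory", "4g")                # memory per executor
         .config("spark.executor.cores", "2")                  # concurrent task slots per executor
         .config("spark.dynamicAllocation.enabled", "true")    # allow executors to be added/removed
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)
spark.stop()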
Execution Modes
Spark supports three different execution modes:
- Cluster Mode: The driver process is launched on a worker node inside the cluster, and the cluster manager is in charge of all Spark application-related processes.
- Client Mode: The Spark driver remains on the client machine that submitted the application.
- Local Mode: The entire Spark application runs on a single machine, using threads instead of separate executor processes (see the sketch below).
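A minimal local-mode sketch, assuming only a local Python environment with pyspark: "local[4]" requests four worker threads inside one JVM, while "local[*]" would use one thread per available core.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local-mode-sketch")
         .master("local[4]")          # 4 worker threads in one JVM; "local[*]" = one per core
         .getOrCreate())

df = spark.range(100)                 # small DataFrame processed entirely by local threads
print(df.count())
spark.stop()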
Workflow of Spark Architecture
- Client Submission: The client submits the Spark application code. The driver converts the user code, which contains transformations and actions, into a logical directed acyclic graph (DAG).
- DAG to Physical Execution Plan: The DAG is converted into a physical execution plan made up of stages, and each stage is broken into physical execution units called tasks.
- Resource Negotiation: The driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver.
- Task Execution: The driver sends tasks to the executors based on data placement. Executors execute the tasks on the partitioned RDDs and return the results to the SparkContext (an end-to-end sketch follows this list).
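The sketch below walks through that flow on a single machine, assuming pyspark is available: reduceByKey introduces a stage boundary, toDebugString() (which returns bytes in PySpark) prints the recorded lineage, and collect() is the action that triggers task execution and returns results to the driver.

from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="workflow-sketch")

pairs = (sc.parallelize(["a", "b", "a", "c", "b", "a"])
           .map(lambda k: (k, 1))              # narrow transformation: stays in the same stage
           .reduceByKey(lambda a, b: a + b))   # wide transformation: introduces a stage boundary

print(pairs.toDebugString().decode())          # the recorded lineage/DAG, for debugging
print(pairs.collect())                         # action: tasks run and results return to the driver
sc.stop()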
Conclusion
Apache Spark's architecture is
designed to handle large-scale data processing efficiently. Its components,
such as the driver program, cluster manager, worker nodes, and executors, work
together to execute tasks in a distributed environment. Spark's ability to run
in different execution modes and its support for various cluster managers make
it a versatile tool for big data processing.