Top 45 Apache Spark Interview Questions and Answers
1. What is Apache Spark?
Answer:
Apache Spark is an open-source, distributed computing framework for big data processing. It provides an in-memory computation engine and supports batch as well as stream processing. Spark was designed to overcome limitations of Hadoop MapReduce by providing a faster execution engine with better resource management.
2. How does Apache Spark differ from Hadoop MapReduce?
Answer:
- Speed – Spark keeps intermediate data in memory, while MapReduce writes it to disk between stages, making Spark much faster for iterative workloads.
- Ease of use – Spark offers high-level APIs in Scala, Java, Python, and R; MapReduce requires verbose low-level map/reduce code.
- Workloads – Spark handles batch, interactive, streaming, and machine-learning workloads in one engine; MapReduce is batch-only.
- Fault tolerance – Spark recomputes lost partitions from lineage; MapReduce re-runs failed tasks from data persisted on disk.
Spark Architecture & Execution Model
3. What are the key components of the Apache Spark ecosystem?
Answer:
Apache Spark consists of the following core components:
- Spark Core — Provides basic functionalities like task scheduling, memory management, and job execution.
- Storage & Cluster Manager — Allows integration with storage systems like HDFS, S3, and Google Cloud Storage, and supports cluster managers like YARN, Mesos, and Kubernetes.
- Set of Libraries:
  - Spark SQL: For structured data processing.
  - Spark Streaming: For real-time data streaming.
  - MLlib: For machine learning tasks.
  - GraphX: For graph-based computations.
4. What are RDDs in Spark?
Answer:
Resilient Distributed Datasets (RDDs) are the fundamental data structures in Spark. They have the following characteristics:
- Immutable: Once created, they cannot be changed.
- Distributed: Data is partitioned across different nodes.
- Fault-tolerant: In case of node failure, lost partitions can be recomputed from lineage information.
RDD operations are categorized into:
- Transformations (e.g., `map`, `filter`) – Create new RDDs.
- Actions (e.g., `collect`, `count`) – Return final results.
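For illustration, a minimal Scala sketch of this split, assuming a SparkContext `sc` (e.g. from spark-shell):

```scala
// Assumes a SparkContext `sc`, e.g. from spark-shell.
val numbers = sc.parallelize(1 to 10)      // source RDD
val doubled = numbers.map(_ * 2)           // transformation: returns a new RDD, nothing runs yet
val evens   = doubled.filter(_ % 4 == 0)   // transformation: another new RDD
println(evens.count())                     // action: triggers execution and returns a value
println(evens.collect().mkString(", "))    // action: brings the results back to the driver
```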
5. What is Lazy Evaluation in Spark?
Answer:
Lazy evaluation means that Spark does not execute transformations immediately. Instead, it builds a Directed Acyclic Graph (DAG) of transformations and executes them only when an action (e.g., `collect()`, `count()`) is triggered. This improves performance by letting Spark optimize the whole execution plan before running it.
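A small sketch of lazy evaluation, assuming a SparkContext `sc`; the file name is hypothetical:

```scala
// Assumes a SparkContext `sc`; "events.log" is a hypothetical input file.
val lines    = sc.textFile("events.log")           // no file is read yet
val errors   = lines.filter(_.contains("ERROR"))   // still lazy: only the DAG grows
val messages = errors.map(_.split(":").last)       // still lazy
val total    = messages.count()                    // action: the whole plan executes here
```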
6. What are Narrow and Wide Transformations in Spark?
Answer:
- Narrow transformations (e.g., `map`, `filter`) – Each output partition depends on a single input partition, so no data moves across the network.
- Wide transformations (e.g., `groupByKey`, `reduceByKey`, `join`) – Output partitions depend on multiple input partitions, forcing a shuffle and creating a new stage boundary.
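For example, a minimal sketch assuming a SparkContext `sc`:

```scala
// Assumes a SparkContext `sc`.
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val narrow = pairs.mapValues(_ + 1)     // narrow: each output partition reads one input partition
val wide   = pairs.reduceByKey(_ + _)   // wide: records with the same key must be shuffled together
wide.collect()
```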
Execution & Cluster Management
7. What are the execution modes in Apache Spark?
Answer:
Apache Spark supports three execution modes:
- Local Mode — Runs on a single JVM, useful for testing and debugging.
- Client Mode — The driver runs on the local machine, while executors run on the cluster.
- Cluster Mode — Both the driver and executors run within the cluster, suitable for production environments.
8. What is a Driver in Spark?
Answer:
The Driver is the central component of a Spark application. It is responsible for:
- Creating a SparkSession/SparkContext.
- Breaking the job into smaller stages and tasks.
- Communicating with the Cluster Manager.
- Scheduling and monitoring task execution.
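A minimal sketch of the session a driver typically creates; the application name and master URL here are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// The driver process begins by creating the SparkSession (the entry point since Spark 2.x).
val spark = SparkSession.builder()
  .appName("example-app")
  .master("local[*]")          // replaced by the cluster manager URL in a real deployment
  .getOrCreate()

val sc = spark.sparkContext    // the underlying SparkContext, if RDD APIs are needed
```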
9. What are Executors in Spark?
Answer:
Executors are JVM processes launched on the worker nodes of a Spark cluster, responsible for:
- Running tasks assigned by the Driver.
- Storing intermediate data in memory.
- Reporting task completion status to the Driver.
Each Spark application has its own dedicated set of Executors.
10. What are the different cluster managers supported by Spark?
Answer:
Apache Spark supports multiple cluster managers:
- Standalone — Simple built-in Spark cluster manager.
- YARN (Hadoop 2+) — Integrates Spark with Hadoop clusters.
- Apache Mesos — General-purpose cluster manager.
- Kubernetes — Containerized cluster manager.
Spark SQL & Data Storage
11. What are the types of tables in Spark?
Answer:
Spark supports two types of tables:
- Managed Tables — Data and metadata are managed by Spark, stored in Spark’s warehouse directory. If dropped, both data and metadata are deleted.
- Unmanaged (External) Tables — Data is stored externally, and Spark manages only the metadata. Dropping the table does not delete the underlying data.
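A small SQL sketch of the difference, assuming a SparkSession `spark`; the table names and path are hypothetical:

```scala
// Managed table: Spark owns both the data (in its warehouse directory) and the metadata.
spark.sql("CREATE TABLE sales_managed (id INT, amount DOUBLE) USING parquet")

// External (unmanaged) table: data stays at the supplied location; dropping the table keeps the files.
spark.sql("""
  CREATE TABLE sales_external (id INT, amount DOUBLE)
  USING parquet
  LOCATION '/data/external/sales'
""")
```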
12. What are Views in Spark SQL?
Answer:
Views allow users to create virtual tables on top of existing data. There are two types:
- Global Temporary Views — Available across all Spark sessions within the same application; accessed through the `global_temp` database.
- Temporary Views — Limited to the current Spark session.
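A brief sketch of both view types, assuming a SparkSession `spark`; the JSON path is hypothetical:

```scala
val people = spark.read.json("/tmp/people.json")        // hypothetical input path

people.createOrReplaceTempView("people")                // visible only in this session
spark.sql("SELECT name FROM people").show()

people.createOrReplaceGlobalTempView("people_global")   // visible to other sessions in the same app
spark.sql("SELECT name FROM global_temp.people_global").show()
```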
Spark Streaming & MLlib
13. What is Spark Streaming?
Answer:
Spark Streaming is a real-time data processing component in Spark that processes data streams in micro-batches. It supports integration with Kafka, Flume, Kinesis, and other streaming sources.
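A minimal micro-batch word count sketch using the DStream API, assuming a SparkSession `spark`; a socket source is used purely for illustration (Kafka or Kinesis sources follow the same pattern):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(spark.sparkContext, Seconds(5))   // 5-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```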
14. What is MLlib in Spark?
Answer:
MLlib is Spark’s machine learning library, providing scalable implementations of:
- Classification (e.g., Logistic Regression, Decision Trees)
- Clustering (e.g., K-Means)
- Recommendation Systems (e.g., ALS)
- Feature Transformation (e.g., PCA, TF-IDF)
Performance Optimization
15. How can you optimize Apache Spark performance?
Answer:
Key performance optimization techniques include:
- Use Broadcast Variables — Reduce data shuffling.
- Cache and Persist Data — Store intermediate results in memory.
- Use Partitioning — Optimize data distribution across nodes.
- Use Columnar Storage Formats (e.g., Parquet, ORC) — Improve I/O efficiency.
- Avoid Wide Transformations — Reduce expensive data shuffling.
Advanced Apache Spark Interview Questions
Spark Core & RDDs
16. What is the difference between RDD, DataFrame, and Dataset in Spark?
Answer:
- RDD – Low-level, unstructured API with no schema; gives fine-grained control but no automatic optimization.
- DataFrame – Distributed collection of rows with a schema; queries are optimized by the Catalyst optimizer and Tungsten engine.
- Dataset – Strongly typed API (Scala/Java) that combines the compile-time type safety of RDDs with DataFrame optimizations.
17. What are the benefits of using DataFrames over RDDs?
Answer:
- Performance — DataFrames use Catalyst Optimizer for query optimization.
- Memory Management — DataFrames use Tungsten Execution Engine for better memory optimization.
- Ease of Use — DataFrames support SQL-like operations.
- Code Simplicity — Less boilerplate code than RDDs.
18. What is a Broadcast Variable in Spark?
Answer:
A Broadcast Variable allows large read-only data to be cached on each worker node instead of sending it with every task.
Example Usage:
- Look-up tables
- Configurations
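A minimal look-up-table sketch, assuming a SparkContext `sc`; the map contents are made up for illustration:

```scala
val countryNames = Map("US" -> "United States", "IN" -> "India")
val namesBc      = sc.broadcast(countryNames)            // shipped to each executor once

val codes    = sc.parallelize(Seq("US", "IN", "US"))
val resolved = codes.map(code => namesBc.value.getOrElse(code, "Unknown"))
resolved.collect()
```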
19. What is an Accumulator in Spark?
Answer:
Accumulators are shared variables that tasks can only add to, while only the driver can read their value; they are used for aggregating values across the cluster efficiently.
They are mainly used for:
- Counting events
- Summing values across partitions
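A small sketch counting malformed records during parsing, assuming a SparkContext `sc`:

```scala
val badRecords = sc.longAccumulator("badRecords")

val raw    = sc.parallelize(Seq("1", "2", "oops", "4"))
val parsed = raw.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()                // the accumulator is updated while this action runs
println(badRecords.value)     // read the total on the driver after the action completes
```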
Spark Execution & DAG
20. What is a Directed Acyclic Graph (DAG) in Spark?
Answer:
A DAG (Directed Acyclic Graph) represents a series of transformations in Spark. It consists of:
- Stages (group of transformations)
- Tasks (smallest unit of execution)
Steps in DAG execution:
- Spark constructs a DAG based on transformations.
- DAG is divided into stages based on narrow/wide transformations.
- Tasks within a stage run in parallel across partitions; independent stages can also run concurrently.
21. What are Jobs, Stages, and Tasks in Spark?
Answer:
- Job: A set of transformations triggered by an action (e.g., `collect()`).
- Stage: A group of transformations that do not require shuffling.
- Task: The smallest execution unit that runs on a single partition.
Example Execution Flow:
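A minimal sketch of the flow, assuming a SparkContext `sc`; the input file name is hypothetical:

```scala
val counts = sc.textFile("words.txt")
  .flatMap(_.split(" "))
  .map((_, 1))                // narrow transformations: all belong to the first stage
  .reduceByKey(_ + _)         // wide transformation: the shuffle creates a second stage

counts.collect()              // one action => one job; each stage runs one task per partition
```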
22. What is Shuffling in Spark?
Answer:
Shuffling is the redistribution of data across nodes, which occurs in wide transformations like `groupByKey()`, `reduceByKey()`, and `join()`.
How to Optimize Shuffling?
- Use reduceByKey() instead of groupByKey() to minimize data movement.
- Use broadcast variables to avoid sending large data repeatedly.
- Tune the number of shuffle partitions via `spark.sql.shuffle.partitions`.
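A quick sketch of the first tip, assuming a SparkContext `sc`:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every individual value across the network before summing.
val viaGroup  = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values within each partition first, so far less data is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)
```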
Memory Management & Optimization
23. How does Spark handle Fault Tolerance?
Answer:
- Spark maintains lineage information in DAGs to recompute lost partitions.
- If a node fails, lost RDD partitions are recomputed using transformations.
- Checkpointing can be used to persist RDDs in HDFS to avoid recomputation.
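A minimal checkpointing sketch, assuming a SparkContext `sc`; the HDFS directory is a hypothetical path:

```scala
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val derived = sc.parallelize(1 to 1000).map(_ * 2)
derived.checkpoint()    // marks the RDD to be saved, truncating its lineage
derived.count()         // the checkpoint is actually written when an action runs
```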
24. What is the difference between Cache and Persist in Spark?
Answer:
- cache() – Stores the data with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames/Datasets).
- persist() – Does the same but lets you choose any storage level (memory, disk, serialized, replicated).
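A brief sketch, assuming a SparkSession `spark`; the data is generated purely for illustration:

```scala
import org.apache.spark.storage.StorageLevel

val df = spark.range(1000000L).toDF("id")
df.cache()                                          // default storage level
df.count()                                          // materializes the cached data

val derived = df.selectExpr("id * 2 AS doubled")
derived.persist(StorageLevel.MEMORY_AND_DISK_SER)   // persist lets you pick the level explicitly
derived.count()
```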
Spark SQL & Data Processing
25. What are the different ways to create a DataFrame in Spark?
Answer:
- From files – `spark.read.csv(...)`, `spark.read.json(...)`, `spark.read.parquet(...)`.
- From local collections – `Seq(...).toDF(...)` or `spark.createDataFrame(...)`.
- From RDDs – using case classes or an explicit schema.
- From tables – Hive tables or JDBC sources via `spark.read`.
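A short sketch of three of these options, assuming a SparkSession `spark`; the file path is hypothetical:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

// 1. From a file source
val fromCsv = spark.read.option("header", "true").csv("/tmp/people.csv")

// 2. From a local collection
val fromSeq = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

// 3. From an RDD of Rows plus an explicit schema
val schema  = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val rowRdd  = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))
val fromRdd = spark.createDataFrame(rowRdd, schema)
```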
26. What is the difference between repartition() and coalesce() in Spark?
Answer:
- repartition(n) – Performs a full shuffle and can increase or decrease the number of partitions; produces evenly sized partitions.
- coalesce(n) – Merges existing partitions without a full shuffle, so it is cheaper but is normally used only to reduce the partition count.
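A quick sketch, assuming a SparkSession `spark`:

```scala
val df = spark.range(1000000L).toDF("id")

val widened  = df.repartition(200)   // full shuffle; partition count can go up or down
val narrowed = df.coalesce(10)       // merges existing partitions, avoiding a full shuffle

println(widened.rdd.getNumPartitions)
println(narrowed.rdd.getNumPartitions)
```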
Streaming & Advanced Topics
27. What is Structured Streaming in Spark?
Answer:
Structured Streaming is a real-time stream processing engine built on top of Spark SQL. It processes streaming data incrementally using micro-batches.
Example:
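A minimal streaming word count sketch, assuming a SparkSession `spark`; a socket source and console sink are used purely for illustration:

```scala
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val counts = lines.groupBy("value").count()   // incremental aggregation over the stream

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```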
28. What is the difference between Spark Streaming and Structured Streaming?
Answer:
- Spark Streaming (DStreams) – the older RDD-based micro-batch API with processing-time semantics.
- Structured Streaming – the newer DataFrame/Dataset API built on Spark SQL; it benefits from the Catalyst optimizer and supports event-time processing, watermarking, and end-to-end exactly-once guarantees.
29. How do you debug Spark jobs?
Answer:
- Check the DAG execution plan: `df.explain()`
- Enable Spark event logs: `spark.conf.set("spark.eventLog.enabled", "true")`
- Use the Web UI: view job execution details at http://localhost:4040
30. How does Spark integrate with the cloud?
Answer:
- AWS: Uses S3, EMR (Elastic MapReduce).
- Azure: Uses Azure Data Lake, HDInsight.
- GCP: Uses Google Cloud Storage, Dataproc.
Performance Optimization & Troubleshooting
31. How can you reduce data shuffling in Spark?
Answer:
Shuffling is an expensive operation that involves redistributing data across partitions, which can slow down Spark jobs. To minimize shuffling:
- Prefer `reduceByKey()`/`aggregateByKey()` over `groupByKey()` so values are combined before they are shuffled.
- Broadcast small tables so joins do not shuffle the large side.
- Filter and project data as early as possible to shrink what gets shuffled.
- Partition data on the join or aggregation key so related records are already co-located.
32. What are the different persistence storage levels in Spark?
Answer:
Spark allows storing RDDs in memory, on disk, or both for performance optimization. Common storage levels include:
- MEMORY_ONLY – Deserialized objects in memory; partitions that do not fit are recomputed (default for RDD `cache()`).
- MEMORY_AND_DISK – Spills partitions that do not fit in memory to disk.
- MEMORY_ONLY_SER / MEMORY_AND_DISK_SER – Serialized storage; more compact but costs CPU to deserialize.
- DISK_ONLY – Stores partitions only on disk.
- OFF_HEAP – Stores serialized data in off-heap memory.
Replicated variants (e.g., MEMORY_ONLY_2) keep a second copy on another node.
33. What is Speculative Execution in Spark?
Answer:
Speculative execution is a performance optimization technique in Spark that detects slow-running tasks and launches duplicates on different nodes to complete them faster.
How to enable it?
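A minimal sketch of the relevant settings; speculation is a core setting, so in practice it is usually supplied at launch (e.g. `spark-submit --conf spark.speculation=true`), shown here on the session builder with the application name as a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")
  .config("spark.speculation.multiplier", "1.5")   // how much slower than the median a task must be
  .config("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before checking
  .getOrCreate()
```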
Advanced RDD & DataFrame Operations
34. How does Spark handle schema inference in DataFrames?
Answer:
- CSV files: Infer schema automatically if `inferSchema=true`
- JSON files: Automatically infer types based on values
- Manually defining schema: see the sketch below
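A short sketch of a manually defined schema, assuming a SparkSession `spark`; the column names and CSV path are hypothetical:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

val people = spark.read
  .option("header", "true")
  .schema(schema)              // skips inference and enforces the declared types
  .csv("/tmp/people.csv")
```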
35. How can you convert an RDD into a DataFrame?
Answer:
RDDs can be converted into DataFrames using case classes or schema definitions.
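A minimal case-class sketch, assuming a SparkSession `spark` (e.g. in spark-shell); the class and data are made up for illustration:

```scala
case class Person(id: Int, name: String)

import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person(1, "Alice"), Person(2, "Bob")))
val df  = rdd.toDF()          // the case class supplies the schema via reflection
df.printSchema()
```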
36. What is Window Function in Spark SQL?
Answer:
Window functions allow operations like ranking, running totals, and moving averages within a specified “window” of rows.
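A brief ranking and running-total sketch, assuming a SparkSession `spark`; the sales data is made up for illustration:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val sales = Seq(("east", "alice", 100), ("east", "bob", 150), ("west", "carol", 200))
  .toDF("region", "seller", "amount")

val byRegion = Window.partitionBy("region").orderBy(col("amount").desc)

sales
  .withColumn("rank", rank().over(byRegion))                  // ranking within each region
  .withColumn("running_total", sum("amount").over(byRegion))  // running total within each region
  .show()
```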
37. How can you optimize Spark SQL queries?
Answer:
- Store data in columnar formats (Parquet, ORC) and rely on partition pruning and predicate pushdown.
- Broadcast small dimension tables in joins.
- Cache tables that are reused across queries.
- Tune `spark.sql.shuffle.partitions` to match the data volume.
- Enable Adaptive Query Execution (`spark.sql.adaptive.enabled`) on Spark 3.x.
Streaming & Real-Time Processing
38. What is Checkpointing in Spark Streaming?
Answer:
Checkpointing periodically saves streaming metadata (configuration, offsets, in-flight batches) and, for stateful operations, the computed state to reliable storage such as HDFS, so the application can recover after a failure or restart.
39. What are the different output modes in Structured Streaming?
Answer:
- Append – Only new rows added since the last trigger are written to the sink (the default).
- Complete – The entire updated result table is written on every trigger; required for some aggregations.
- Update – Only the rows that changed since the last trigger are written.
40. How does Spark Streaming handle late data?
Answer:
Structured Streaming handles late-arriving data with watermarking: a watermark declares how late events are allowed to be, and events older than the watermark are dropped from stateful aggregations so state does not grow without bound.
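A minimal watermarking sketch; `events` is an assumed streaming DataFrame with a `timestamp` column from a hypothetical source:

```scala
import org.apache.spark.sql.functions._

val counts = events
  .withWatermark("timestamp", "10 minutes")          // events later than this are dropped
  .groupBy(window(col("timestamp"), "5 minutes"))    // 5-minute event-time windows
  .count()
```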
Graph Processing & Machine Learning
41. What is GraphX in Spark?
Answer:
GraphX is Spark’s API for graph processing and analytics. It includes:
- Graph abstraction (vertices & edges)
- Graph algorithms (PageRank, BFS, Shortest Path)
Example:
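A small sketch building a graph and running PageRank, assuming a SparkContext `sc`; the tiny follower graph is made up for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.001).vertices   // PageRank until convergence tolerance 0.001
ranks.collect().foreach(println)
```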
42. What is Spark MLlib?
Answer:
MLlib is Spark’s machine learning library that includes:
- Classification (Logistic Regression, Decision Trees)
- Clustering (K-Means, GMM)
- Feature Engineering (TF-IDF, PCA)
- Recommendation Systems (ALS)
Example:
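A minimal K-Means sketch with the DataFrame-based ML API, assuming a SparkSession `spark`; the points are made up for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val data = Seq((1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)).toDF("x", "y")

val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
val features  = assembler.transform(data)

val model = new KMeans().setK(2).setSeed(42L).fit(features)   // cluster into two groups
model.transform(features).show()                              // adds a `prediction` column
```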
Security & Deployment
43. How do you secure Spark applications?
Answer:
- Kerberos Authentication — Secure cluster access.
- Role-based access control (RBAC) — Manage user permissions.
- Data Encryption — Encrypt data at rest (HDFS, S3) and in transit (SSL/TLS).
44. How do you monitor Spark applications?
Answer:
- Spark Web UI – View DAGs, stages, tasks, and storage for a running application (port 4040 by default).
- Spark History Server – Inspect completed applications from persisted event logs.
- Metrics system – Export driver and executor metrics to sinks such as JMX, Graphite, or Prometheus.
- Cluster manager tooling – YARN ResourceManager UI, Kubernetes dashboards, and executor logs.
45. How would you handle a Spark job that keeps failing due to OutOfMemory errors?
Answer:
- Increase executor and driver memory (`spark.executor.memory`, `spark.driver.memory`) or lower `spark.executor.cores` so each task gets more memory.
- Increase the number of partitions so each task processes less data, and address skewed keys.
- Avoid `collect()` on large datasets; use `take()`, aggregations, or write results to storage instead.
- Persist with MEMORY_AND_DISK(_SER) rather than MEMORY_ONLY, and unpersist data that is no longer needed.
- Use broadcast joins only when the broadcast side genuinely fits in executor memory.