“Sparking Up Your Career: Top Apache Spark Interview Questions and Answers”

Top Apache Spark Interview Questions and Answers / Top Data Engineer Interview Questions and Answers

  1. Write a query to find the 2nd-highest salary using window functions (see the sketch after this list)
  2. Write a query to find employees whose salary is higher than their manager's (sketch after the list)
  3. Partitioning & Bucketing – Pros/Cons
  4. What are Spark UDFs?
  5. How does Kafka work?
  6. Accumulators vs Broadcast variables in Spark (sketch after the list)
  7. List all the PySpark DataFrame API operations you have used previously (not Spark SQL)
  8. Explain the end-to-end flow of a typical ETL job
  9. Which ETL tools have you worked with?
  10. What is Data Modeling & why is it important?
  11. Other common questions were also asked, such as the following:
  12. File formats in the Big Data ecosystem
  13. Explain one of your projects, with an example of an obstacle you faced and how you resolved it
  14. An example of handling a conflict with a teammate/manager
  15. NoSQL databases in Big data ecosystem
  16. Explain internal working of HBase & Cassandra
  17. Brief overview of Spark internals; SparkSession vs SparkContext
  18. Spark Driver vs Spark Executor
  19. Executor vs Executor core
  20. YARN client mode vs cluster mode
  21. What is RDD and what do you understand by partitions?
  22. What do you understand by Fault tolerance in Spark?
  23. Spark vs YARN fault tolerance
  24. Why is lazy evaluation important in Spark?
  25. Transformations vs actions
  26. map vs flatMap (sketch after the list)
  27. Spark map vs mapPartitions
  28. Wide vs Narrow transformations
  29. reduceByKey vs groupByKey (sketch after the list)
  30. What do you understand by Spark Lineage
  31. Spark Lineage vs Spark DAG
  32. Spark cache vs Spark persist
  33. What do you understand by AggregateByKey and CombineByKey?
  34. Briefly explain Spark accumulators
  35. What do you mean by Broadcast variables?
  36. Spark UDFs: why should one avoid UDFs? (sketch after the list)
  37. Why should one avoid RDDs, and what is the alternative?
  38. What are the benefits of a data frame?
  39. What do you understand by Vectorized UDF?
  40. Which is better, and when should you use RDDs, DataFrames, or Datasets?
  41. Why is the Spark Dataset type-safe?
  42. Explain Repartition and Coalesce (sketch after the list)
  43. How to read JSON in Spark? (sketch after the list)
  44. Explain Spark window functions and their usage (sketch after the list)
  45. Spark rank vs dense_rank
  46. Partitions vs Bucketing
  47. Explain the Catalyst optimizer
  48. Stateless vs Stateful transformations
  49. StructType and StructField
  50. Explain Apache Parquet
  51. What do you understand by CBO, Spark Cost Based Optimizer?
  52. Explain Broadcast variable and shared variable with examples
  53. Have you ever worked on Spark performance tuning and executor tuning?
  54. Explain a Spark join without a shuffle (sketch after the list)
  55. Explain paired RDDs
  56. Cache vs Persist in Spark UI
  57. Why should one avoid groupBy?
  58. How to decide the number of partitions in a data frame?
  59. What is a DAG? Explain in detail.
  60. Persistence vs Broadcast in Spark
  61. Partition pruning and predicate pushdown
  62. Fold vs reduce in Spark
  63. Explain the interlinking of Pyspark and Apache Arrow
  64. Explain about bucketing in Spark SQL
  65. Explain dynamic resource allocation in Spark
  66. Why are foldLeft and foldRight not supported in Spark?
  67. How to decide the number of executors and memory for any Spark job?
  68. Different types of cluster managers in Spark
  69. Can you explain how to minimize data transfers while working with Spark?
  70. What are the different levels of persistence in Spark?
  71. What is the function of filter()?
  72. Define partitions in Apache Spark.
  73. What is the difference between the reduce() and take() functions?
  74. Define YARN in Spark.
  75. Can we trigger automated clean-ups in Spark?
  76. What method other than “spark.cleaner.ttl” can trigger automated clean-ups in Spark?
  77. What is the role of Akka in Spark?
  78. Define SchemaRDD in Apache Spark
  79. What is a Spark Driver?
  80. Introduction to Databricks: how to set up an account
  81. How to read a CSV file in PySpark (sketch after the list)
  82. How to Rename columns in DataFrame using PySpark
  83. How to Add New Columns in DataFrame using PySpark
  84. How to filter a DataFrame using PySpark
  85. How to Sort a DataFrame in PySpark
  86. How to remove Duplicates in DataFrame using PySpark
  87. How to use GroupBy in DataFrame using PySpark
  88. How to write a DataFrame into CSV using PySpark
  89. How to merge two DataFrame using PySpark
  90. How to use when/otherwise in PySpark (sketch after the list)
  91. How to join two DataFrames in PySpark
  92. How to use Window Functions in PySpark
  93. Why use the repartition method in PySpark
  94. How to write DataFrame with Partitions using PartitionBy in PySpark
  95. How to create a UDF in PySpark (sketch after the list)
  96. How to cast columns in PySpark
  97. How to handle NULLs in PySpark (sketch after the list)
  98. Different read modes (PERMISSIVE, DROPMALFORMED, FAILFAST) when reading a file into a DataFrame in PySpark (sketch after the list)
  99. Spark Structured Streaming example with PySpark
  100. What are managed and external tables in PySpark/Databricks?
  101. dbutils commands in Databricks
  102. Get the latest file from DBFS using dbutils
  103. How to use insertInto in PySpark using Databricks
  104. Difference between collect and select in PySpark using Databricks
  105. Read single-line and multiline JSON in PySpark using Databricks
  106. What are the _SUCCESS, _committed, and _started files in Databricks?
  107. How to Read and Write XML in Databricks
  108. How to fill NA/NULL values in a DataFrame using PySpark in Databricks
  109. How to use Map Transformation in PySpark using Databricks
  110. What are cache and persist in PySpark and Spark SQL using Databricks?
  111. How to connect Blob Storage using SAS token using Databricks
  112. How to create Mount Point and connect Blob Storage using Access Keys
  113. How to create a schema dynamically
  114. How to detect the delimiter dynamically in CSV files
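
The sketches below illustrate a few of the questions above with minimal PySpark. All table names, column names, sample data, and file paths are invented for the demos, not taken from any particular employer's answer key.

For questions 1–2: a dense_rank window picks the 2nd-highest distinct salary, and a self-join against the manager's row finds employees who out-earn their manager. The employees table here is made up.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("interview-sketches").getOrCreate()

# Toy employees table; emp_id, name, salary, manager_id are assumed column names.
employees = spark.createDataFrame(
    [(1, "Amit", 90000, None), (2, "Bina", 70000, 1),
     (3, "Chen", 95000, 1), (4, "Dev", 80000, 2)],
    ["emp_id", "name", "salary", "manager_id"],
)

# Q1: dense_rank over salary descending; rank 2 = 2nd-highest distinct salary.
# (A window with a global orderBy pulls all rows into one partition - fine for a demo.)
w = Window.orderBy(F.col("salary").desc())
employees.withColumn("rnk", F.dense_rank().over(w)).filter("rnk = 2").show()

# Q2: self-join each employee to the manager's row, keep those paid more.
mgr = employees.select(F.col("emp_id").alias("mgr_id"),
                       F.col("salary").alias("mgr_salary"))
(employees.join(mgr, employees.manager_id == mgr.mgr_id)
          .filter(F.col("salary") > F.col("mgr_salary"))
          .select("name", "salary", "mgr_salary")
          .show())
```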
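For questions 6, 34–35, and 52: a sketch contrasting a broadcast variable (a read-only lookup shipped once per executor) with an accumulator (a write-only counter that only the driver reads back). The country-code lookup is invented.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Broadcast: shared read-only data, sent once per executor instead of per task.
country = sc.broadcast({"IN": "India", "US": "United States"})
# Accumulator: executors can only add(); the driver reads the aggregate.
misses = sc.accumulator(0)

def resolve(code):
    if code not in country.value:
        misses.add(1)   # caveat: updates inside a transformation can be
        return None     # re-applied if the task is retried; for exact counts,
    return country.value[code]  # update accumulators inside actions (foreach)

print(sc.parallelize(["IN", "US", "XX"]).map(resolve).collect())
print("misses:", misses.value)  # 1 after the collect() action runs
```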
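For question 26: map emits exactly one output element per input element, while flatMap flattens each output sequence into the result.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd = sc.parallelize(["a b", "c d e"])

print(rdd.map(lambda s: s.split()).collect())
# [['a', 'b'], ['c', 'd', 'e']]  <- one list per input string
print(rdd.flatMap(lambda s: s.split()).collect())
# ['a', 'b', 'c', 'd', 'e']      <- the lists are flattened
```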
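For question 29: both produce the same per-key sums, but reduceByKey pre-aggregates on each partition before the shuffle, so far less data crosses the network.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey: map-side combine first, then shuffle the partial sums.
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]

# groupByKey: ships every value across the network, then you aggregate.
print(pairs.groupByKey().mapValues(sum).collect())      # same result, more shuffle
```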
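For questions 36, 39, and 95: a plain Python UDF next to a vectorized (pandas) UDF. Plain UDFs run row-at-a-time with serialization overhead and are opaque to the Catalyst optimizer, which is the usual reason to avoid them; pandas UDFs process whole Arrow batches instead. The `shout` functions are invented examples.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",), ("arrow",)], ["word"])

# Plain Python UDF: one row per call, pickled back and forth.
@F.udf(returnType=T.StringType())
def shout(s):
    return s.upper() if s is not None else None

# Vectorized (pandas) UDF: one pandas Series per Arrow batch.
@F.pandas_udf(T.StringType())
def shout_vec(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.select(shout("word"), shout_vec("word")).show()
```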
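For questions 42 and 93: repartition triggers a full shuffle and can raise or lower the partition count; coalesce only merges existing partitions, so it avoids a full shuffle but can only reduce the count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

df8 = df.repartition(8)   # full shuffle; count can go up or down
df2 = df8.coalesce(2)     # merges partitions in place; decrease only
print(df8.rdd.getNumPartitions(), df2.rdd.getNumPartitions())  # 8 2
```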
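For questions 43 and 105: reading line-delimited JSON versus one multiline JSON document. The paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON Lines (one object per line) is what spark.read.json expects by default.
events = spark.read.json("/tmp/events.jsonl")            # hypothetical path

# A file whose whole content is a single JSON document/array needs multiLine.
events_ml = (spark.read.option("multiLine", True)
                       .json("/tmp/events.json"))        # hypothetical path
events.printSchema()
```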
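For questions 44–45: rank, dense_rank, and row_number over the same window, showing how each treats ties. The dept/salary data is invented.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("eng", "Amit", 90000), ("eng", "Bina", 90000), ("eng", "Chen", 70000)],
    ["dept", "name", "salary"],
)

w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
(df.withColumn("rank", F.rank().over(w))               # ties share a rank, gaps follow: 1, 1, 3
   .withColumn("dense_rank", F.dense_rank().over(w))   # ties share a rank, no gaps: 1, 1, 2
   .withColumn("row_number", F.row_number().over(w))   # always unique: 1, 2, 3
   .show())
```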
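For question 54: a broadcast hash join is the standard "join without a shuffle" answer. The small side is copied to every executor once, so the large DataFrame is joined in place.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
big = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(i, f"name_{i}") for i in range(100)],
                              ["key", "name"])

# The broadcast hint ships `small` to each executor; `big` is never shuffled.
joined = big.join(broadcast(small), "key")
joined.explain()   # plan shows BroadcastHashJoin rather than SortMergeJoin
```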
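For questions 81 and 98: reading a CSV with an explicit parse mode. The path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mode options: PERMISSIVE (default) nulls out bad fields, DROPMALFORMED
# silently drops malformed rows, FAILFAST raises on the first bad record.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .option("mode", "FAILFAST")
      .csv("/tmp/input.csv"))      # hypothetical path
```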
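For question 90: chained when conditions with otherwise as the final else branch. The salary bands are invented.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Amit", 90000), ("Bina", 65000), ("Chen", 40000)],
                           ["name", "salary"])

# Conditions are evaluated top-down; the first match wins.
df.withColumn(
    "band",
    F.when(F.col("salary") >= 80000, "senior")
     .when(F.col("salary") >= 60000, "mid")
     .otherwise("junior"),
).show()
```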
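For questions 97 and 108: the common null-handling tools, na.fill for per-column defaults, na.drop to discard incomplete rows, and isNotNull for explicit predicates.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Amit", None), (None, 50000)],
                           "name string, salary long")

df.na.fill({"name": "unknown", "salary": 0}).show()  # fill per-column defaults
df.na.drop(subset=["salary"]).show()                 # drop rows with null salary
df.filter(F.col("name").isNotNull()).show()          # keep rows with a name
```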