Sparking Up Your Career: Top Apache Spark Interview Questions and Answers
- Write a query to find the 2nd highest salary using window functions (see the sketch after this list)
- Write a query to find employees whose salary is higher than their manager's
- Partitioning & Bucketing – Pros/Cons
- What are Spark UDFs?
- How does Kafka work?
- Accumulators vs broadcast variables in Spark (see the sketch after this list)
- List all the PySpark DataFrame API operations you have used previously (not Spark SQL)
- Explain the end-to-end flow of a typical ETL job
- Which ETL tools have you worked with?
- What is data modeling and why is it important?
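A minimal PySpark sketch for the two query questions above, assuming a hypothetical employees table with columns emp_id, name, salary, and manager_id (all names and data invented for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("interview-sketches").getOrCreate()

# Hypothetical data: emp_id, name, salary, manager_id
employees = spark.createDataFrame(
    [(1, "Ann", 90000, None), (2, "Bob", 60000, 1),
     (3, "Cid", 95000, 1), (4, "Dee", 70000, 2)],
    ["emp_id", "name", "salary", "manager_id"],
)

# Q1: 2nd highest salary. dense_rank makes ties share a rank, so
# rnk == 2 is the second-highest *distinct* salary. (A global window
# like this pulls all rows to one partition; fine for a sketch.)
w = Window.orderBy(F.col("salary").desc())
(employees
 .withColumn("rnk", F.dense_rank().over(w))
 .filter(F.col("rnk") == 2)
 .select("salary")
 .show())

# Q2: employees earning more than their manager, via a self-join.
mgr = employees.select(F.col("emp_id").alias("mgr_id"),
                       F.col("salary").alias("mgr_salary"))
(employees
 .join(mgr, employees.manager_id == mgr.mgr_id)
 .filter(F.col("salary") > F.col("mgr_salary"))
 .select("name", "salary", "mgr_salary")
 .show())
```

dense_rank (rather than rank or row_number) is the usual choice here because ties share a rank, so the filter returns the second-highest distinct salary value.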
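And a quick sketch contrasting accumulators (write-only from executors, read on the driver) with broadcast variables (read-only, shipped once to each executor); the counter and lookup names are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-vars").getOrCreate()
sc = spark.sparkContext

# Accumulator: executors only add to it; the driver reads the total.
bad_rows = sc.accumulator(0)
# Broadcast: a read-only value shipped once to every executor.
countries = sc.broadcast({"IN": "India", "US": "United States"})

def to_name(code):
    if code not in countries.value:
        bad_rows.add(1)   # note: may double-count if a task is retried
        return None
    return countries.value[code]

names = sc.parallelize(["IN", "US", "XX"]).map(to_name).filter(lambda n: n)
print(names.collect())              # the action runs the job ...
print("bad rows:", bad_rows.value)  # ... so the accumulator is now populated
```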
Other common questions are also asked, such as:
- File formats in the big data ecosystem
- Explain one of your projects, including an obstacle you faced and how you resolved it
- An example of handling a conflict with a teammate or manager
- NoSQL databases in the big data ecosystem
- Explain the internal workings of HBase and Cassandra
- Brief overview of Spark internals: SparkSession vs SparkContext
- Spark Driver vs Spark Executor
- Executor vs Executor core
- YARN client mode vs cluster mode
- What is RDD and what do you understand by partitions?
- What do you understand by Fault tolerance in Spark?
- Spark vs YARN fault tolerance
- Why is lazy evaluation important in Spark?
- Transformations vs actions
- map vs flatMap
- map vs mapPartitions in Spark
- Wide vs Narrow transformations
- reduceByKey vs groupByKey (see the sketch after this list)
- What do you understand by Spark lineage?
- Spark lineage vs Spark DAG
- Spark cache vs Spark persist (see the sketch after this list)
- What do you understand by aggregateByKey and combineByKey?
- Briefly explain Spark accumulators
- What do you mean by broadcast variables?
- Spark UDFs: why should one avoid UDFs?
- Why should one avoid RDDs, and what is the alternative?
- What are the benefits of a DataFrame?
- What do you understand by vectorized UDFs? (see the sketch after this list)
- Which is better, and when should you use RDDs, DataFrames, or Datasets?
- Why is the Spark Dataset API type-safe?
- Explain repartition and coalesce (see the sketch after this list)
- How do you read JSON in Spark?
- Explain Spark window functions and their usage (see the sketch after this list)
- Spark rank vs dense_rank
- Partitions vs Bucketing
- Explain the Catalyst optimizer
- Stateless vs Stateful transformations
- StructType and StructField (see the schema sketch after this list)
- Explain Apache Parquet
- What do you understand by CBO, Spark's cost-based optimizer?
- Explain Broadcast variable and shared variable with examples
- Have you ever worked on Spark performance tuning and executor tuning?
- Explain how a Spark join can be performed without a shuffle
- Explain about Paired RDD
- Cache vs persist as seen in the Spark UI
- Why should one avoid groupBy?
- How to decide the number of partitions in a DataFrame?
- What is a DAG? Explain in detail.
- Persistence vs Broadcast in Spark
- Partition pruning and predicate pushdown
- Fold vs reduce in Spark
- Explain how PySpark and Apache Arrow work together
- Explain bucketing in Spark SQL
- Explain dynamic resource allocation in Spark
- Why are foldLeft and foldRight not supported in Spark?
- How to decide the number of executors and memory for any Spark job?
- Different types of cluster managers in Spark
- Can you explain how to minimize data transfers while working with Spark?
- What are the different levels of persistence in Spark?
- What is the function of filter()?
- Define partitions in Apache Spark.
- What is the difference between the reduce() and take() functions?
- Define YARN in the context of Spark.
- Can we trigger automated clean-ups in Spark?
- What methods other than spark.cleaner.ttl can trigger automated clean-ups in Spark?
- What is the role of Akka in Spark?
- Define SchemaRDD in Apache Spark
- What is a Spark Driver?
- Introduction to Databricks and how to set up an account
- How to read CSV file in PySpark
- How to rename columns in a DataFrame using PySpark
- How to add new columns to a DataFrame using PySpark
- How to filter a DataFrame using PySpark
- How to sort a DataFrame in PySpark
- How to remove duplicates from a DataFrame using PySpark
- How to use groupBy on a DataFrame using PySpark
- How to write a DataFrame to CSV
- How to merge two DataFrames using PySpark
- How to use when/otherwise in PySpark (see the sketch after this list)
- How to join two DataFrames in PySpark
- How to use window functions in PySpark
- Why use the repartition method in PySpark
- How to write a DataFrame with partitions using partitionBy in PySpark
- How to create a UDF in PySpark
- How to cast columns in PySpark
- How to handle NULLs in PySpark
- Different read modes when loading a file into a DataFrame in PySpark (see the schema sketch after this list)
- Spark Structured Streaming example with PySpark
- What are managed and external tables in PySpark (Databricks)?
- dbutils commands in Databricks
- Get the latest file from DBFS using dbutils
- How to use insertInto in PySpark using Databricks
- Difference between collect and select in PySpark using Databricks
- Read single-line and multiline JSON in PySpark using Databricks
- What are the _SUCCESS, _committed, and _started files in Databricks?
- How to read and write XML in Databricks
- How to fill NA/NULL values in a DataFrame using PySpark in Databricks
- How to use the map transformation in PySpark using Databricks
- What are cache and persist in PySpark and Spark SQL using Databricks?
- How to connect to Blob Storage using a SAS token in Databricks
- How to create a mount point and connect to Blob Storage using access keys
- How to create a schema dynamically
- How to detect the delimiter dynamically in CSV files
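A few minimal sketches for the hands-on topics above; all data, paths, and names are invented for illustration. First, reduceByKey vs groupByKey:

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# reduceByKey combines values map-side before the shuffle (preferred).
counts = pairs.reduceByKey(lambda x, y: x + y)

# groupByKey ships every value across the network, then aggregates.
counts_slow = pairs.groupByKey().mapValues(sum)

print(counts.collect())  # [('a', 2), ('b', 1)] in some order
```

This is also the answer to "why avoid groupBy for aggregations": the partial aggregation in reduceByKey shrinks the shuffle.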
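Window functions with rank, dense_rank, and row_number, over a hypothetical dept/salary DataFrame:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("sales", 100), ("sales", 100), ("sales", 90), ("hr", 80)],
    ["dept", "salary"],
)
w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
df.select(
    "dept", "salary",
    F.rank().over(w).alias("rank"),              # ties leave gaps:    1, 1, 3
    F.dense_rank().over(w).alias("dense_rank"),  # ties, no gaps:      1, 1, 2
    F.row_number().over(w).alias("row_number"),  # arbitrary tie-break: 1, 2, 3
).show()
```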
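repartition vs coalesce in one place:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

wider = df.repartition(200)    # full shuffle; can increase or decrease count
narrower = wider.coalesce(10)  # merges existing partitions, no shuffle;
                               # can only decrease the count and may leave skew
print(narrower.rdd.getNumPartitions())  # 10
```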
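Why plain UDFs are slow, and what a vectorized (pandas) UDF looks like; built-in functions should still be the first choice when one exists:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("x", F.col("id").cast("double"))

@F.udf(DoubleType())        # plain UDF: one Python call per row, costly serde
def plus_one(x):
    return x + 1.0

@pandas_udf(DoubleType())   # vectorized UDF: whole Arrow batches at a time
def plus_one_vec(x: pd.Series) -> pd.Series:
    return x + 1.0

df.select(plus_one("x"), plus_one_vec("x")).show()
```

The pandas UDF is where the PySpark/Apache Arrow interlinking question comes in: Arrow is the columnar format used to move batches between the JVM and the Python worker.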
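when/otherwise plus NULL handling, on an invented id/score DataFrame:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, 85), (3, 40)], ["id", "score"])

df = (df
      .withColumn("score", F.coalesce(F.col("score"), F.lit(0)))  # fill NULLs
      .withColumn("grade",
                  F.when(F.col("score") >= 80, "A")
                   .when(F.col("score") >= 50, "B")
                   .otherwise("F")))
df.show()
```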
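An explicit StructType/StructField schema together with the three CSV read modes; the file path is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType(), nullable=True),
    StructField("age", IntegerType(), nullable=True),
])

# mode=PERMISSIVE (default): malformed fields become NULL
# mode=DROPMALFORMED:        malformed rows are dropped
# mode=FAILFAST:             the first malformed row raises an error
df = (spark.read
      .schema(schema)
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .csv("/tmp/people.csv"))  # hypothetical path
```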
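Finally, cache vs persist: cache() is just persist() with the default storage level, while persist() lets you pick one explicitly:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)

df.cache()                          # DataFrame default level: MEMORY_AND_DISK
df.count()                          # an action materializes the cache
df.unpersist()

df.persist(StorageLevel.DISK_ONLY)  # explicit storage level
df.count()
df.unpersist()
```

Both show up under the Storage tab of the Spark UI once an action has materialized them, which is what the "cache vs persist in the Spark UI" question is getting at.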