“Sparking Up Your Career: Top Apache Spark Interview Questions and Answers”

Top Apache Spark Interview Questions and Answers / Top Data Engineer Interview Questions and Answers

  1. Write a query to find the 2nd-highest salary using window functions (see the sketch after this list)
  2. Write a query to find employees whose salary is higher than their manager's (sketch after the list)
  3. Partitioning & Bucketing – Pros/Cons
  4. What are Spark UDFs?
  5. How does Kafka work?
  6. Accumulators vs Broadcast variables in Spark (sketch after the list)
  7. List all the PySpark DataFrame API operations you have used previously (not Spark SQL)
  8. Explain the end-to-end flow of a typical ETL job
  9. Which ETL tools have you worked with?
  10. What is Data Modeling & why is it important?
  11. Other common questions were also asked, such as the following:
  12. File formats in the Big Data ecosystem
  13. Explain one of your projects, with an example of an obstacle you faced and how you resolved it
  14. An example of handling a conflict with a teammate/manager
  15. NoSQL databases in Big data ecosystem
  16. Explain internal working of HBase & Cassandra
  17. Brief overview of Spark internals; SparkSession vs SparkContext
  18. Spark Driver vs Spark Executor
  19. Executor vs Executor core
  20. YARN client mode vs cluster mode
  21. What is RDD and what do you understand by partitions?
  22. What do you understand by Fault tolerance in Spark?
  23. Spark vs YARN fault tolerance
  24. Why is lazy evaluation important in Spark?
  25. Transformations vs actions
  26. map vs flatMap (sketch after the list)
  27. Spark map vs mapPartitions
  28. Wide vs Narrow transformations
  29. reduceByKey vs groupByKey (sketch after the list)
  30. What do you understand by Spark Lineage
  31. Spark Lineage vs Spark DAG
  32. Spark cache vs Spark persist
  33. What do you understand by AggregateByKey and CombineByKey?
  34. Briefly explain Spark accumulators
  35. What do you mean by Broadcast variables?
  36. Spark UDFs: why should one avoid UDFs? (sketch after the list)
  37. Why should one avoid RDDs, and what is the alternative?
  38. What are the benefits of a data frame?
  39. What do you understand by Vectorized UDF?
  40. Which is better, and when should you use RDDs, DataFrames, or Datasets?
  41. Why is the Spark Dataset type-safe?
  42. Explain Repartition and Coalesce (sketch after the list)
  43. How to read JSON in Spark? (sketch after the list)
  44. Explain Spark window functions and their usage (sketch after the list)
  45. Spark rank vs dense_rank
  46. Partitions vs Bucketing
  47. Explain the Catalyst optimizer
  48. Stateless vs Stateful transformations
  49. StructType and StructField
  50. Explain Apache Parquet
  51. What do you understand by CBO, Spark Cost Based Optimizer?
  52. Explain Broadcast variable and shared variable with examples
  53. Have you ever worked on Spark performance tuning and executor tuning?
  54. Explain a Spark join without a shuffle (sketch after the list)
  55. Explain paired RDDs
  56. Cache vs Persist in Spark UI
  57. Why should one avoid groupBy?
  58. How to decide the number of partitions in a data frame?
  59. What is a DAG? Explain in detail.
  60. Persistence vs Broadcast in Spark
  61. Partition pruning and predicate pushdown
  62. Fold vs reduce in Spark
  63. Explain the interlinking of Pyspark and Apache Arrow
  64. Explain about bucketing in Spark SQL
  65. Explain dynamic resource allocation in Spark
  66. Why are foldLeft and foldRight not supported in Spark?
  67. How to decide the number of executors and memory for any Spark job?
  68. Different types of cluster managers in Spark
  69. Can you explain how to minimize data transfers while working with Spark?
  70. What are the different levels of persistence in Spark?
  71. What is the function of filter()?
  72. Define partitions in Apache Spark.
  73. What is the difference between the reduce() and take() functions?
  74. Define YARN in Spark.
  75. Can we trigger automated clean-ups in Spark?
  76. What method other than “spark.cleaner.ttl” can trigger automated clean-ups in Spark?
  77. What is the role of Akka in Spark?
  78. Define SchemaRDD in Apache Spark
  79. What is a Spark Driver?
  80. Introduction to Databricks: how to set up an account
  81. How to read a CSV file in PySpark (sketch after the list)
  82. How to Rename columns in DataFrame using PySpark
  83. How to Add New Columns in DataFrame using PySpark
  84. How to filter a DataFrame using PySpark
  85. How to Sort a DataFrame in PySpark
  86. How to remove Duplicates in DataFrame using PySpark
  87. How to use GroupBy in DataFrame using PySpark
  88. How to write a DataFrame into CSV using PySpark
  89. How to merge two DataFrame using PySpark
  90. How to use when/otherwise in PySpark (sketch after the list)
  91. How to join two DataFrames in PySpark
  92. How to use Window Functions in PySpark
  93. Why use the repartition method in PySpark
  94. How to write DataFrame with Partitions using PartitionBy in PySpark
  95. How to create a UDF in PySpark (sketch after the list)
  96. How to cast columns in PySpark
  97. How to handle NULLs in PySpark (sketch after the list)
  98. Different read modes (PERMISSIVE, DROPMALFORMED, FAILFAST) when reading a file into a DataFrame in PySpark (sketch after the list)
  99. Spark Structured Streaming example with PySpark
  100. What are managed and external tables in PySpark/Databricks?
  101. dbutils commands in Databricks
  102. Get the latest file from DBFS using dbutils
  103. How to use insertInto in PySpark using Databricks
  104. Difference between collect and select in PySpark using Databricks
  105. Read single-line and multiline JSON in PySpark using Databricks
  106. What are the _SUCCESS, _committed, and _started files in Databricks?
  107. How to Read and Write XML in Databricks
  108. How to fill NA/NULL values in a DataFrame using PySpark in Databricks
  109. How to use Map Transformation in PySpark using Databricks
  110. What are cache and persist in PySpark and Spark SQL using Databricks?
  111. How to connect Blob Storage using SAS token using Databricks
  112. How to create Mount Point and connect Blob Storage using Access Keys
  113. How to create a schema dynamically
  114. How to detect the delimiter dynamically in CSV files
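
The sketches below illustrate a few of the questions above with minimal PySpark. All table names, column names, sample data, and file paths are invented for the demos, not taken from any particular employer's answer key.

For questions 1–2: a dense_rank window picks the 2nd-highest distinct salary, and a self-join against the manager's row finds employees who out-earn their manager. The employees table here is made up.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("interview-sketches").getOrCreate()

# Toy employees table; emp_id, name, salary, manager_id are assumed column names.
employees = spark.createDataFrame(
    [(1, "Amit", 90000, None), (2, "Bina", 70000, 1),
     (3, "Chen", 95000, 1), (4, "Dev", 80000, 2)],
    ["emp_id", "name", "salary", "manager_id"],
)

# Q1: dense_rank over salary descending; rank 2 = 2nd-highest distinct salary.
# (A window with a global orderBy pulls all rows into one partition - fine for a demo.)
w = Window.orderBy(F.col("salary").desc())
employees.withColumn("rnk", F.dense_rank().over(w)).filter("rnk = 2").show()

# Q2: self-join each employee to the manager's row, keep those paid more.
mgr = employees.select(F.col("emp_id").alias("mgr_id"),
                       F.col("salary").alias("mgr_salary"))
(employees.join(mgr, employees.manager_id == mgr.mgr_id)
          .filter(F.col("salary") > F.col("mgr_salary"))
          .select("name", "salary", "mgr_salary")
          .show())
```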
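For questions 6, 34–35, and 52: a sketch contrasting a broadcast variable (a read-only lookup shipped once per executor) with an accumulator (a write-only counter that only the driver reads back). The country-code lookup is invented.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Broadcast: shared read-only data, sent once per executor instead of per task.
country = sc.broadcast({"IN": "India", "US": "United States"})
# Accumulator: executors can only add(); the driver reads the aggregate.
misses = sc.accumulator(0)

def resolve(code):
    if code not in country.value:
        misses.add(1)   # caveat: updates inside a transformation can be
        return None     # re-applied if the task is retried; for exact counts,
    return country.value[code]  # update accumulators inside actions (foreach)

print(sc.parallelize(["IN", "US", "XX"]).map(resolve).collect())
print("misses:", misses.value)  # 1 after the collect() action runs
```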
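For question 26: map emits exactly one output element per input element, while flatMap flattens each output sequence into the result.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd = sc.parallelize(["a b", "c d e"])

print(rdd.map(lambda s: s.split()).collect())
# [['a', 'b'], ['c', 'd', 'e']]  <- one list per input string
print(rdd.flatMap(lambda s: s.split()).collect())
# ['a', 'b', 'c', 'd', 'e']      <- the lists are flattened
```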
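For question 29: both produce the same per-key sums, but reduceByKey pre-aggregates on each partition before the shuffle, so far less data crosses the network.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey: map-side combine first, then shuffle the partial sums.
print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]

# groupByKey: ships every value across the network, then you aggregate.
print(pairs.groupByKey().mapValues(sum).collect())      # same result, more shuffle
```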
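For questions 36, 39, and 95: a plain Python UDF next to a vectorized (pandas) UDF. Plain UDFs run row-at-a-time with serialization overhead and are opaque to the Catalyst optimizer, which is the usual reason to avoid them; pandas UDFs process whole Arrow batches instead. The `shout` functions are invented examples.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",), ("arrow",)], ["word"])

# Plain Python UDF: one row per call, pickled back and forth.
@F.udf(returnType=T.StringType())
def shout(s):
    return s.upper() if s is not None else None

# Vectorized (pandas) UDF: one pandas Series per Arrow batch.
@F.pandas_udf(T.StringType())
def shout_vec(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.select(shout("word"), shout_vec("word")).show()
```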
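For questions 42 and 93: repartition triggers a full shuffle and can raise or lower the partition count; coalesce only merges existing partitions, so it avoids a full shuffle but can only reduce the count.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

df8 = df.repartition(8)   # full shuffle; count can go up or down
df2 = df8.coalesce(2)     # merges partitions in place; decrease only
print(df8.rdd.getNumPartitions(), df2.rdd.getNumPartitions())  # 8 2
```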
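For questions 43 and 105: reading line-delimited JSON versus one multiline JSON document. The paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON Lines (one object per line) is what spark.read.json expects by default.
events = spark.read.json("/tmp/events.jsonl")            # hypothetical path

# A file whose whole content is a single JSON document/array needs multiLine.
events_ml = (spark.read.option("multiLine", True)
                       .json("/tmp/events.json"))        # hypothetical path
events.printSchema()
```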
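For questions 44–45: rank, dense_rank, and row_number over the same window, showing how each treats ties. The dept/salary data is invented.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("eng", "Amit", 90000), ("eng", "Bina", 90000), ("eng", "Chen", 70000)],
    ["dept", "name", "salary"],
)

w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
(df.withColumn("rank", F.rank().over(w))               # ties share a rank, gaps follow: 1, 1, 3
   .withColumn("dense_rank", F.dense_rank().over(w))   # ties share a rank, no gaps: 1, 1, 2
   .withColumn("row_number", F.row_number().over(w))   # always unique: 1, 2, 3
   .show())
```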
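For question 54: a broadcast hash join is the standard "join without a shuffle" answer. The small side is copied to every executor once, so the large DataFrame is joined in place.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
big = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(i, f"name_{i}") for i in range(100)],
                              ["key", "name"])

# The broadcast hint ships `small` to each executor; `big` is never shuffled.
joined = big.join(broadcast(small), "key")
joined.explain()   # plan shows BroadcastHashJoin rather than SortMergeJoin
```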
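For questions 81 and 98: reading a CSV with an explicit parse mode. The path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# mode options: PERMISSIVE (default) nulls out bad fields, DROPMALFORMED
# silently drops malformed rows, FAILFAST raises on the first bad record.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .option("mode", "FAILFAST")
      .csv("/tmp/input.csv"))      # hypothetical path
```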
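For question 90: chained when conditions with otherwise as the final else branch. The salary bands are invented.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Amit", 90000), ("Bina", 65000), ("Chen", 40000)],
                           ["name", "salary"])

# Conditions are evaluated top-down; the first match wins.
df.withColumn(
    "band",
    F.when(F.col("salary") >= 80000, "senior")
     .when(F.col("salary") >= 60000, "mid")
     .otherwise("junior"),
).show()
```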
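For questions 97 and 108: the common null-handling tools, na.fill for per-column defaults, na.drop to discard incomplete rows, and isNotNull for explicit predicates.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Amit", None), (None, 50000)],
                           "name string, salary long")

df.na.fill({"name": "unknown", "salary": 0}).show()  # fill per-column defaults
df.na.drop(subset=["salary"]).show()                 # drop rows with null salary
df.filter(F.col("name").isNotNull()).show()          # keep rows with a name
```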