Spark

Convert your hive output to Json format

Using CONCAT and string manipulation functions: SELECT CONCAT('{ "column1": "', column1, '", "column2": "', column2, '", "column3": "', column3, '" }') AS json_data FROM your_table; Using a custom UDF (User-Defined Function): If the built-in...
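Hand-rolled CONCAT breaks as soon as a value contains a quote or other character that needs escaping. The same row-to-JSON idea can be sketched outside Hive in plain Python (assuming a row comes back as a dict; the column values below are invented for illustration):

```python
import json

# A row as it might come back from a Hive/Spark query (hypothetical values).
row = {"column1": "alice", "column2": 'say "hi"', "column3": "c3"}

# json.dumps escapes the embedded quotes, which the CONCAT approach does not.
json_data = json.dumps(row)
print(json_data)
```

The embedded double quote in column2 comes out as \" in the JSON string, where the CONCAT version would produce invalid JSON.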

examples of transformations in PySpark

Here are some examples of transformations in PySpark: Map map(func) – Applies the given function to each element of the dataset and returns a new dataset with the results. For example: from pyspark.sql.functions import...
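The element-wise behaviour of map can be sketched without a cluster, using a local list to stand in for a distributed dataset (the doubling lambda is an arbitrary example, not from the post):

```python
# RDD-style map, sketched locally: rdd.map(lambda x: x * 2) applies the
# function to every element and yields a new dataset with the results.
data = [1, 2, 3, 4]

doubled = list(map(lambda x: x * 2, data))
print(doubled)  # [2, 4, 6, 8]
```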

Transformations and Actions in Spark with example

In Apache Spark, transformation and action are two types of operations that can be performed on RDDs (Resilient Distributed Datasets) or DataFrames/Datasets. Transformations are operations that are performed on an RDD or DataFrame/Dataset that...
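The key distinction — transformations are lazy and only build a lineage, while actions force evaluation — can be sketched with a Python generator standing in for an RDD (a local analogy, not the Spark API):

```python
calls = []

def track(x):
    # Record each invocation so we can observe when work actually happens.
    calls.append(x)
    return x * 10

data = [1, 2, 3]

# "Transformation": building the generator runs nothing yet, just as
# rdd.map() only records the operation in the lineage.
pipeline = (track(x) for x in data)
assert calls == []  # no element has been processed so far

# "Action": collect() or count() force the computation; list() does here.
result = list(pipeline)
assert calls == [1, 2, 3]
print(result)  # [10, 20, 30]
```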

Hadoop vs Apache Storm vs Apache Beam

Hadoop, Apache Storm, Apache Beam, and Apache Spark are all open-source big data processing frameworks that are used to process large amounts of data. However, each of these frameworks has its own strengths and...

How do we trigger automated clean-ups in Spark?

To trigger automated clean-ups in Spark, you can use the spark.cleaner.ttl configuration property. This property specifies how long (in seconds) Spark retains metadata such as generated stages and tasks before it is periodically cleaned up. If...
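For example, as a spark-defaults.conf entry (the 3600-second value is an arbitrary illustration; newer Spark versions clean up via the ContextCleaner automatically, and this property may no longer be honoured there):

```properties
# spark-defaults.conf — retain metadata for one hour before periodic cleanup
spark.cleaner.ttl  3600
```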

Apache Spark vs Apache Flink vs Apache Storm

Apache Spark: It is a powerful, in-memory data processing engine that is designed to be fast, easy to use, and flexible. It can handle batch, interactive, and streaming workloads, and can also be used...

outputMode method in spark

The outputMode method specifies how the output of the streaming query should be written to the sink. There are three possible output modes: "append": This mode writes the output of the streaming query to...
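The difference between the three modes can be sketched with a toy model of a sink receiving a running word count trigger by trigger (pure Python, no Spark; the data and names are invented for illustration):

```python
# Running word-count state after each trigger (toy data, no Spark involved).
trigger_results = [
    {"spark": 1},               # state after trigger 1
    {"spark": 2, "flink": 1},   # state after trigger 2
]

append_sink, complete_sink, update_sink = [], [], []
previous = {}
for state in trigger_results:
    # "complete": the sink receives the entire result table every trigger.
    complete_sink.append(dict(state))
    # "append": only rows never emitted before reach the sink — which is
    # why append mode cannot be used when aggregations update old rows.
    append_sink.append({k: v for k, v in state.items() if k not in previous})
    # "update": only rows that changed since the last trigger are written.
    update_sink.append({k: v for k, v in state.items()
                        if previous.get(k) != v})
    previous = state

print(complete_sink)  # [{'spark': 1}, {'spark': 2, 'flink': 1}]
print(append_sink)    # [{'spark': 1}, {'flink': 1}]
print(update_sink)    # [{'spark': 1}, {'spark': 2, 'flink': 1}]
```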