Transformations and Actions in Spark with Examples

In Apache Spark, transformations and actions are the two types of operations that can be performed on RDDs (Resilient Distributed Datasets) or DataFrames/Datasets.

Transformations are operations on an RDD or DataFrame/Dataset that produce a new RDD or DataFrame/Dataset. Transformations are evaluated lazily: Spark records them in a lineage graph but does not execute them until an action is called. Some examples of transformations include:

  • map: applies a function to each element of an RDD or DataFrame/Dataset
  • filter: filters elements of an RDD or DataFrame/Dataset based on a given condition
  • groupBy: groups elements of an RDD or DataFrame/Dataset based on a given key
  • distinct: removes duplicate elements from an RDD or DataFrame/Dataset
  • flatMap: applies a function to each element of an RDD or DataFrame/Dataset and flattens the results

Actions are operations that return a value to the driver program or produce a side effect, such as writing data to disk. Unlike transformations, actions execute eagerly: calling one triggers the execution of every pending transformation in its lineage. Some examples of actions include:

  • count: returns the number of elements in an RDD or DataFrame/Dataset
  • first: returns the first element of an RDD or DataFrame/Dataset
  • reduce: aggregates elements of an RDD or DataFrame/Dataset using a given function
  • foreach: applies a function to each element of an RDD or DataFrame/Dataset
  • collect: returns all elements of an RDD or DataFrame/Dataset to the driver program

Narrow transformations

A narrow transformation is one in which each partition of the output RDD or DataFrame/Dataset depends on at most one partition of the input. Because every output partition can be computed from a single input partition, Spark can execute narrow transformations entirely locally, with no shuffling of data across the network. Examples include map, filter, flatMap, and union.

Because no shuffle is needed, Spark pipelines consecutive narrow transformations together into a single stage. Even joins can sometimes be made narrow: in a broadcast (map-side) join, the smaller dataset is copied to every executor and the join is performed locally on each partition, avoiding a full shuffle of the larger dataset.

Wide transformations

A wide transformation is one in which each partition of the output can depend on many partitions of the input. Computing it requires a shuffle: rows must be redistributed across the network so that all records sharing a key end up in the same partition. Shuffles are expensive, so wide transformations are the main performance boundary in a Spark job, and each one introduces a new stage in the physical execution plan.

Examples of wide transformations include groupByKey, reduceByKey, distinct, repartition, and shuffle-based joins such as the sort-merge join, where both DataFrames/Datasets are repartitioned on the join key across the network and then joined partition by partition.