examples of transformations in PySpark

Here are some examples of transformations in PySpark:

  • Map

map(func) – Applies the given function to each element of the dataset and returns a new dataset with the results. map() is an RDD transformation; on a DataFrame, the same element-wise effect is usually achieved with a UDF and withColumn(). For example:

from pyspark.sql.functions import udf

# Define a function to convert the "age" column to string
def to_string(age):
    return str(age)

# Create a UDF from the function
to_string_udf = udf(to_string)

# Map the UDF over the "age" column to convert it to string
df_string = df.withColumn("age", to_string_udf(df.age))
  • Filter

Transformation: filter(predicate) – Filters the dataset using the given predicate. For example:

# Filter the DataFrame to include only rows where the "age" column is greater than 30
df_filtered = df.filter(df.age > 30)

  • flatMap

Transformation: flatMap(func) – Similar to map(), but the function returns an iterable and the resulting elements are flattened into a single collection. flatMap() is only defined on RDDs, so a DataFrame must first be converted with .rdd. For example:

# Flat map over the rows, splitting each row's "words" string into individual words
rdd_split = df.rdd.flatMap(lambda row: row.words.split(" "))
  • Join

Transformation: join(other_df, on, how) – Joins the dataset with another DataFrame, using the given columns as the join key and the specified join type. For example:

# Inner join the DataFrame with another DataFrame on the "id" column
df_joined = df.join(other_df, on="id", how="inner")
  • sort

Transformation: sort(columns, ascending) – Sorts the dataset by the given columns in the specified order. For example:

# Sort the DataFrame by the "age" column in ascending order
df_sorted = df.sort("age", ascending=True)
  • union

Transformation: union(other_df) – Returns a new DataFrame containing the rows of this DataFrame followed by the rows of another. Note that union() matches columns by position, not by name, so both DataFrames must have the same schema. For example:

# Union the DataFrame with another DataFrame
df_union = df.union(other_df)
  • groupBy

Transformation: groupBy(columns) – Groups the dataset by the given columns and returns a GroupedData object, on which an aggregation such as count(), avg(), or agg() can then be applied. For example:

# Group the DataFrame by the "country" column
df_grouped = df.groupBy("country")
  • sample

Transformation: sample(withReplacement, fraction, seed) – Returns a sampled subset of the dataset, with or without replacement, using the specified fraction and seed. For example:

# Sample a fraction of the DataFrame without replacement
df_sampled = df.sample(False, 0.1, 42)
  • select

Transformation: select(columns) – Selects the specified columns from the dataset and returns a new DataFrame. For example:

# Select the "name" and "age" columns from the DataFrame
df_selected = df.select("name", "age")
  • withColumn

Transformation: withColumn(colName, col) – Returns a new DataFrame with a column of the given name added, or replaced if a column with that name already exists, computed from the given expression. For example:

# Add a new column to the DataFrame with the square of the "age" column
df_with_column = df.withColumn("age_squared", df.age**2)

These are just a few examples of the transformations that are available in PySpark.