examples of transformations in PySpark
Here are some examples of transformations in PySpark:
- Map
Transformation: map(func) – Applies the given function to each element of an RDD and returns a new RDD with the results. DataFrames do not expose map() directly; the equivalent on a DataFrame is to apply a UDF with withColumn() (or to go through df.rdd). For example:

```python
from pyspark.sql.functions import udf

# Define a function to convert the "age" column to a string
def to_string(age):
    return str(age)

# Create a UDF from the function
to_string_udf = udf(to_string)

# Apply the UDF to the "age" column to convert it to string
df_string = df.withColumn("age", to_string_udf(df.age))
```
- Filter
Transformation: filter(predicate) – Filters the dataset using the given predicate. For example:
```python
# Filter the DataFrame to include only rows where the "age" column is greater than 30
df_filtered = df.filter(df.age > 30)
```
- flatMap
Transformation: flatMap(func) – Similar to map(), but the function returns an iterable and the results are flattened into a single RDD. flatMap() is an RDD transformation, so on a DataFrame it must be called through df.rdd. For example:

```python
# Define a function that splits the "words" column of a row into individual words
def split_words(row):
    return row.words.split(" ")

# Flat map the function over the underlying RDD to get one word per element
rdd_split = df.rdd.flatMap(split_words)
```
- Join
Transformation: join(other_df, on, how) – Joins the dataset with another DataFrame, using the given columns as the join key and the specified join type. For example:
```python
# Inner join the DataFrame with another DataFrame on the "id" column
df_joined = df.join(other_df, on="id", how="inner")
```
- sort
Transformation: sort(columns, ascending) – Sorts the dataset by the given columns in the specified order. For example:
```python
# Sort the DataFrame by the "age" column in ascending order
df_sorted = df.sort("age", ascending=True)
```
- union
Transformation: union(other_df) – Returns a new DataFrame containing the rows of this dataset followed by the rows of another dataset. Both DataFrames must have the same number of columns, the columns are matched by position, and duplicate rows are kept. For example:

```python
# Union the DataFrame with another DataFrame that has the same schema
df_union = df.union(other_df)
```
- groupBy
Transformation: groupBy(columns) – Groups the dataset by the given columns and returns a GroupedData object. For example:
```python
# Group the DataFrame by the "country" column
df_grouped = df.groupBy("country")
```
- sample
Transformation: sample(withReplacement, fraction, seed) – Returns a sampled subset of the dataset, with or without replacement, using the specified fraction and seed. Note that fraction is an expected fraction of rows, not a guarantee of an exact count. For example:

```python
# Sample roughly 10% of the DataFrame without replacement, with a fixed seed
df_sampled = df.sample(False, 0.1, 42)
```
- select
Transformation: select(columns) – Selects the specified columns from the dataset and returns a new DataFrame. For example:
```python
# Select the "name" and "age" columns from the DataFrame
df_selected = df.select("name", "age")
```
- withColumn
Transformation: withColumn(colName, col) – Adds a new column to the DataFrame, using the specified name and data. For example:
```python
# Add a new column to the DataFrame with the square of the "age" column
df_with_column = df.withColumn("age_squared", df.age**2)
```
These are just a few examples of the transformations that are available in PySpark.