Dependencies:
  • ProphecySparkBasicsPython 0.0.1+
  • ProphecySparkBasicsScala 0.0.1+
Cluster requirements:
  • UC dedicated clusters 14.3+ supported
  • UC standard clusters 14.3+ supported
  • Livy clusters 3.0.1+ supported
Removes rows that have duplicate values in the specified columns.

Parameters

DataFrame: Input DataFrame.
Row to keep: Which row to retain from each group of duplicates.
  • Any (Default): Keeps any one row among the duplicates. Uses the underlying dropDuplicates construct.
  • First: Keeps the first occurrence of the duplicate row.
  • Last: Keeps the last occurrence of the duplicate row.
  • Unique Only: Keeps only rows that have no duplicates.
  • Distinct Rows: Keeps all distinct rows. This is equivalent to performing a df.distinct() operation.
Deduplicate columns: Columns to consider when removing duplicate rows (not required for Distinct Rows).
Order columns: Columns to sort the DataFrame on before de-duplicating when Row to keep is First or Last (see the sketch after this list). Sorting options:
  • Ascending: Sort values in ascending order.
  • Descending: Sort values in descending order.
  • Nulls first: Place null values at the beginning of the sorted result.
  • Nulls last: Place null values at the end of the sorted result.
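
For First and Last, each sorting option corresponds to a PySpark column ordering. A minimal sketch of that mapping, assuming a hypothetical order_dt sort column:

    from pyspark.sql.functions import col

    # Each Order columns option maps to a Column ordering expression:
    col("order_dt").asc()               # Ascending
    col("order_dt").desc()              # Descending
    col("order_dt").asc_nulls_first()   # Ascending, nulls first
    col("order_dt").desc_nulls_last()   # Descending, nulls last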

Examples

Rows to keep: Any

Example usage of Deduplicate
from pyspark.sql import DataFrame, SparkSession

def dedup(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.dropDuplicates(["tran_id"])
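
A quick usage sketch (the transaction rows below are hypothetical): with Row to keep set to Any, one arbitrary row survives per tran_id.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "debit"), (1, "credit"), (2, "debit")],
        ["tran_id", "tran_type"],
    )
    dedup(spark, df).show()
    # Two rows remain: one arbitrary row for tran_id 1 and the single row for tran_id 2.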

Rows to keep: First

Example usage of Deduplicate - First
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

def earliest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's orders from earliest to latest, then keep row 1.
    return in0\
        .withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            )
        )\
        .filter(col("row_number") == lit(1))\
        .drop("row_number")

Rows to keep: Last

Example usage of Deduplicate - Last
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, count, row_number
from pyspark.sql.window import Window

def latest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's orders from earliest to latest, count the orders
    # per customer, and keep the row whose number equals the count (the last one).
    return in0\
        .withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            )
        )\
        .withColumn(
            "count",
            count("*").over(Window.partitionBy("customer_id"))
        )\
        .filter(col("row_number") == col("count"))\
        .drop("row_number")\
        .drop("count")

Rows to keep: Unique Only

Example usage of Deduplicate - Unique
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, count, lit
from pyspark.sql.window import Window

def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Count rows per customer and keep only customers that appear exactly once.
    return in0\
        .withColumn(
            "count",
            count("*").over(Window.partitionBy("customer_id"))
        )\
        .filter(col("count") == lit(1))\
        .drop("count")

Rows to keep: Distinct Rows

Example usage of Deduplicate - Distinct
from pyspark.sql import DataFrame, SparkSession

def distinct_rows(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.distinct()
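
Because no column subset is given, distinct() compares entire rows, which makes it equivalent to calling dropDuplicates() with no arguments.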