Dependencies:
  • ProphecySparkBasicsPython 0.0.1+
  • ProphecySparkBasicsScala 0.0.1+
Cluster requirements:
  • UC dedicated clusters 14.3+ supported
  • UC standard clusters 14.3+ supported
  • Livy clusters 3.0.1+ supported
Removes rows that have duplicate values in the specified columns.

Parameters

DataFrame: Input DataFrame.
Row to keep: Which row to retain from each group of duplicates.
  • Any (Default): Keeps any one row among the duplicates. Uses the underlying dropDuplicates construct.
  • First: Keeps the first occurrence of the duplicate row.
  • Last: Keeps the last occurrence of the duplicate row.
  • Unique Only: Keeps only rows that have no duplicates.
  • Distinct Rows: Keeps all distinct rows. This is equivalent to performing a df.distinct() operation.
Deduplicate columns: Columns to consider when removing duplicate rows (not required for Distinct Rows).
Order columns: Columns to sort the DataFrame on before de-duplicating when Row to keep is First or Last (see the sketch after this list). Sorting options:
  • Ascending: Sort values in ascending order.
  • Descending: Sort values in descending order.
  • Nulls first: Place null values at the beginning of the sorted result.
  • Nulls last: Place null values at the end of the sorted result.
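
For First and Last, each sorting option corresponds to a PySpark column ordering. A minimal sketch of that mapping, assuming a hypothetical order_dt sort column:

    from pyspark.sql.functions import col

    # Each Order columns option maps to a Column ordering expression:
    col("order_dt").asc()               # Ascending
    col("order_dt").desc()              # Descending
    col("order_dt").asc_nulls_first()   # Ascending, nulls first
    col("order_dt").desc_nulls_last()   # Descending, nulls last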

Examples

Rows to keep: Any

Example usage of Deduplicate
from pyspark.sql import DataFrame, SparkSession

def dedup(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.dropDuplicates(["tran_id"])
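
A quick usage sketch (the transaction rows below are hypothetical): with Row to keep set to Any, one arbitrary row survives per tran_id.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "debit"), (1, "credit"), (2, "debit")],
        ["tran_id", "tran_type"],
    )
    dedup(spark, df).show()
    # Two rows remain: one arbitrary row for tran_id 1 and the single row for tran_id 2.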

Rows to keep: First

Example usage of Deduplicate - First
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, lit, row_number
from pyspark.sql.window import Window

def earliest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's orders from earliest to latest, then keep row 1.
    return in0\
        .withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            )
        )\
        .filter(col("row_number") == lit(1))\
        .drop("row_number")

Rows to keep: Last

Example usage of Deduplicate - Last
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, count, row_number
from pyspark.sql.window import Window

def latest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Number each customer's orders from earliest to latest, count the orders
    # per customer, and keep the row whose number equals the count (the last one).
    return in0\
        .withColumn(
            "row_number",
            row_number().over(
                Window.partitionBy("customer_id").orderBy(col("order_dt").asc())
            )
        )\
        .withColumn(
            "count",
            count("*").over(Window.partitionBy("customer_id"))
        )\
        .filter(col("row_number") == col("count"))\
        .drop("row_number")\
        .drop("count")

Rows to keep: Unique Only

Example usage of Deduplicate - Unique
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, count, lit
from pyspark.sql.window import Window

def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
    # Count rows per customer and keep only customers that appear exactly once.
    return in0\
        .withColumn(
            "count",
            count("*").over(Window.partitionBy("customer_id"))
        )\
        .filter(col("count") == lit(1))\
        .drop("count")

Rows to keep: Distinct Rows

Example usage of Deduplicate - Distinct
from pyspark.sql import DataFrame, SparkSession

def distinct_rows(spark: SparkSession, in0: DataFrame) -> DataFrame:
    return in0.distinct()
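
Because no column subset is given, distinct() compares entire rows, which makes it equivalent to calling dropDuplicates() with no arguments.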