Documentation Index
Fetch the complete documentation index at: https://docs.prophecy.ai/llms.txt
Use this file to discover all available pages before exploring further.
Removes rows with duplicate values of specified columns.
Parameters
| Parameter | Description |
|---|
| DataFrame | Input DataFrame |
| Row to keep | - Any (Default): Keeps any one row among duplicates. Uses underlying
dropDuplicates construct. - First: Keeps first occurrence of the duplicate row.
- Last: Keeps last occurrence of the duplicate row.
- Unique Only: Keeps rows that don’t have duplicates.
- Distinct Rows: Keeps all distinct rows. This is equivalent to performing a
df.distinct() operation.
|
| Deduplicate columns | Columns to consider while removing duplicate rows (not required for Distinct Rows) |
| Order columns | Columns to sort DataFrame on before de-duping in case of First and Last rows to keep. Sorting options: - Ascending: Sort values in ascending order.
- Descending: Sort values in descending order.
- Nulls first: Place null values at the beginning of the sorted result.
- Nulls last: Place null values at the end of the sorted result.
|
Examples
Rows to keep: Any
def dedup(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0.dropDuplicates(["tran_id"])
Rows to keep: First
def earliest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"row_number",
row_number()\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())
)\
.filter(col("row_number") == lit(1))\
.drop("row_number")
Rows to keep: Last
def latest_cust_order(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"row_number",
row_number()\
.over(Window\
.partitionBy("customer_id")\
.orderBy(col("order_dt").asc())
)\
.withColumn(
"count",
count("*")\
.over(Window\
.partitionBy("customer_id")
)\
.filter(col("row_number") == col("count"))\
.drop("row_number")\
.drop("count")
Rows to keep: Unique Only
def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn(
"count",
count("*")\
.over(Window\
.partitionBy("customer_id")
)\
.filter(col("count") == lit(1))\
.drop("count")
Rows to keep: Distinct Rows
def single_order_customers(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0.distinct()