Use the SchemaTransform gem to apply transformations to columns from the incoming DataFrame. This page describes the different transformations—or operations—that you can use in this gem.
Unlike Reformat which is a set operation where all the transforms are applied in parallel, transformations here are applied in order.
Reformat is a SQL select and is preferable when making many changes.
Operations
Add/Replace Expression
Add a new column or replace an existing one based on an expression.
| Parameter | Description |
|---|
| New Column | Output column name |
| Expression | SQL expression to generate values in the new column |
Drop Column
Remove a column from next stage of the Pipeline.
| Parameter | Description |
|---|
| Column to drop | Column to be dropped |
Rename Column
Rename an existing column downstream in the Pipeline.
| Parameter | Description |
|---|
| Old Column Name | Column to be renamed |
| New Column Name | Output column name |
Add If Missing
Provide a default value for a column if it is missing from the source.
| Parameter | Description |
|---|
| Source Column Name | Column that contains missing values |
| Default Value (if missing) | The value that will replace missing values |
Add Rule
Use a business rule in your Pipeline. Visit the Business rules engine page to learn about business rules.
| Parameter | Description |
|---|
| New Column | The column that the business rule will apply to |
| Rule | The business rule that contains the logic that will populate the new column values |
Spark Code
def transform(spark: SparkSession, in0: DataFrame) -> DataFrame:
return in0\
.withColumn("business_date", to_date(lit("2022-05-05"), "yyyy-MM-dd"))\
.withColumnRenamed("bonus_rate", "bonus")\
.drop("slug")
Advanced Import
The Advanced Import feature allows you to bulk import statements that are structured similarly to CSV/TSV files. This can be useful if you have your expressions/transformation logic in another format and just want to quickly configure a SchemaTransform gem based on existing logic.
Using Advanced Import
- Click the Advanced tab in the SchemaTransform gem.
- Enter the expressions into the text area.
The format of these expressions is op_type,target_name,target_expr, where op_type is the type of operation (see below); target_name is the desired new column name and target_expr is the Spark expression that will be used to generate the new column. Each op_type has a different number of extra columns that have to be provided, see below for more details.
For target_expr values that contain a comma , or span multiple lines, you must surround them by `` on either side. For example:addrep,customer_id,customer_id
addrep,full_name,``concat(first_name, ' ', last_name)``
Advanced Import Operation types
| Operation Type | Advanced Import name | :Arguments: | Example |
|---|
| Add/Replace | addrep | 2 | addrep,foo,CAST(NULL as int) |
| Drop | drop | 1 | drop bar |
| Rename | rename | 2 | rename,foo,bar |
| Add if missing | missing | 2 | missing,foo,current_timestamp() |
| Add rule | addrule | 1 | addrule, PromoCodeRule() |