SchemaTransform

Use the SchemaTransform gem to apply transformations to columns from the incoming DataFrame. This page describes the different transformations—or operations—that you can use in this gem.

Unlike Reformat which is a set operation where all the transforms are applied in parallel, transformations here are applied in order. Reformat is a SQL select and is preferable when making many changes.

Operations

Add/Replace Expression

Add a new column or replace an existing one based on an expression.

Parameter	Description
New Column	Output column name
Expression	SQL expression to generate values in the new column

Drop Column

Remove a column from next stage of the Pipeline.

Parameter	Description
Column to drop	Column to be dropped

Rename Column

Rename an existing column downstream in the Pipeline.

Parameter	Description
Old Column Name	Column to be renamed
New Column Name	Output column name

Add If Missing

Provide a default value for a column if it is missing from the source.

Parameter	Description
Source Column Name	Column that contains missing values
Default Value (if missing)	The value that will replace missing values

Add Rule

Use a business rule in your Pipeline. Visit the Business rules engine page to learn about business rules.

Parameter	Description
New Column	The column that the business rule will apply to
Rule	The business rule that contains the logic that will populate the new column values

Spark Code

def transform(spark: SparkSession, in0: DataFrame) -> DataFrame:
 return in0\
 .withColumn("business_date", to_date(lit("2022-05-05"), "yyyy-MM-dd"))\
 .withColumnRenamed("bonus_rate", "bonus")\
 .drop("slug")

Advanced Import

The Advanced Import feature allows you to bulk import statements that are structured similarly to CSV/TSV files. This can be useful if you have your expressions/transformation logic in another format and just want to quickly configure a SchemaTransform gem based on existing logic.

Using Advanced Import

Click the Advanced tab in the SchemaTransform gem.
Enter the expressions into the text area.

Format

The format of these expressions is op_type,target_name,target_expr, where op_type is the type of operation (see below); target_name is the desired new column name and target_expr is the Spark expression that will be used to generate the new column. Each op_type has a different number of extra columns that have to be provided, see below for more details.

For target_expr values that contain a comma , or span multiple lines, you must surround them by `` on either side. For example:

addrep,customer_id,customer_id
addrep,full_name,``concat(first_name, ' ', last_name)``

Advanced Import Operation types

Operation Type	Advanced Import name	:Arguments:	Example
Add/Replace	`addrep`	2	`addrep,foo,CAST(NULL as int)`
Drop	`drop`	1	`drop bar`
Rename	`rename`	2	`rename,foo,bar`
Add if missing	`missing`	2	`missing,foo,current_timestamp()`
Add rule	`addrule`	1	`addrule, PromoCodeRule()`

Overview

Sources and targets

Data processing

Advanced

Operations

Add/Replace Expression

Drop Column

Rename Column

Add If Missing

Add Rule

Spark Code

Advanced Import

Using Advanced Import

Format

Advanced Import Operation types

Overview

Sources and targets

Data processing

Advanced

​Operations

​Add/Replace Expression

​Drop Column

​Rename Column

​Add If Missing

​Add Rule

​Spark Code

​Advanced Import

​Using Advanced Import

​Format

​Advanced Import Operation types

Operations

Add/Replace Expression

Drop Column

Rename Column

Add If Missing

Add Rule

Spark Code

Advanced Import

Using Advanced Import

Format

Advanced Import Operation types