The SurrogateKeyGenerator gem adds a surrogate key column to an input dataset. A surrogate key is a unique, monotonically increasing integer that serves as the key column for datasets that do not have one. To keep the key increasing across pipeline runs, the gem stores the last used key in a file, known as the key file. For example, if the key file lists the last used key as 5000, the first row of the next run receives key 5001. When the pipeline run completes, the gem writes the last value used to the key file so that the next run continues from there.
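
The key file behavior described above can be sketched in plain Python. This is only an illustration, not the gem's implementation. It assumes the key file holds a single integer as text and that a missing file means the sequence starts at the initial value. The names next_key, save_last_key, and run_once are hypothetical.

```python
import tempfile
from pathlib import Path


def next_key(key_file: Path, initial_value: int = 1) -> int:
    """Return the first key for this run.

    Assumption: the key file contains only the last used key as text.
    If the file does not exist yet, start at initial_value.
    """
    if key_file.exists():
        return int(key_file.read_text().strip()) + 1
    return initial_value


def save_last_key(key_file: Path, last_key: int) -> None:
    """Persist the last key used so the next run can continue after it."""
    key_file.write_text(str(last_key))


def run_once(key_file: Path, row_count: int, initial_value: int = 1) -> list:
    """Simulate one pipeline run that assigns a key to each of row_count rows."""
    start = next_key(key_file, initial_value)
    keys = list(range(start, start + row_count))
    if keys:
        # Only update the key file when at least one key was handed out.
        save_last_key(key_file, keys[-1])
    return keys


if __name__ == "__main__":
    key_file = Path(tempfile.mkdtemp()) / "customer_sk.seq"
    key_file.write_text("5000")      # pretend a previous run ended at 5000
    print(run_once(key_file, 3))     # keys for the first run
    print(run_once(key_file, 2))     # keys for the following run
```

Writing the key file only after the keys have been assigned mirrors the description above, where the last value is stored once the run completes.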

Parameters

To configure the SurrogateKeyGenerator gem, add the gem to the canvas, link it to an upstream gem, and enter information for the following parameters:
Parameter                     Description
Surrogate Column Name         The name for the new surrogate key column.
Key File Path                 Path to key-tracking file.
Key Initial Value (optional)  The starting value for the surrogate key sequence.

Example

This example adds a new surrogate key column (customer_sk) to a customer dimension DataFrame. The example:
  • Creates a new column called customer_sk.
  • Looks for the key file at /mnt/keys/customer_sk.seq.
  • Starts key generation at 1.

Input schema

Column name  Type
customer_id  string
name         string
email        string

Gem configuration

Parameter              Example Value
Surrogate Column Name  customer_sk
Key File Path          /mnt/keys/customer_sk.seq
Key Initial Value      1

Output

customer_sk  customer_id  name        email
1            C123         Alice Wong  alice@example.com
2            C456         Omar Davis  omar@example.com
3            C789         Priya Shah  priya@example.com
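
To make the mapping from input rows to output rows concrete, the following sketch reproduces the output above in plain Python. A list of dictionaries stands in for the DataFrame, and add_surrogate_key is a hypothetical helper, not part of the gem. It assumes no earlier key has been stored, so numbering begins at the Key Initial Value of 1.

```python
from itertools import count


def add_surrogate_key(rows, column_name, start):
    """Return new rows with a sequential integer key placed in the first column.

    Hypothetical helper for illustration; rows are plain dictionaries.
    """
    keys = count(start)  # yields start, start + 1, start + 2, ...
    return [{column_name: next(keys), **row} for row in rows]


customers = [
    {"customer_id": "C123", "name": "Alice Wong", "email": "alice@example.com"},
    {"customer_id": "C456", "name": "Omar Davis", "email": "omar@example.com"},
    {"customer_id": "C789", "name": "Priya Shah", "email": "priya@example.com"},
]

keyed = add_surrogate_key(customers, "customer_sk", start=1)
for row in keyed:
    print(row)
```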