Choose from multiple write strategies including append, merge, and replace
Prophecy provides a set of write strategies that determine how your processed data is stored and how changes to that data are handled over time. This page describes each strategy so you can choose the best one for your use case. You configure the write strategy in the Write Options tab of a target table.
Replaces all existing data with new data on each run. This is the simplest approach and ensures your target table always reflects the latest state of your source data.
The following partitioning parameters are available for the Wipe and Replace Table write mode on BigQuery.
| Parameter | Description |
| --- | --- |
| Column Name | The name of the column used for partitioning the target table. |
| Data Type | The data type of the partition column. Supported types: timestamp, date, datetime, and int64. |
| Partition By granularity | Applicable only to the timestamp, date, and datetime data types. Defines the time-based partition granularity: hour, day, month, or year. |
| Partition Range | Applicable only to the int64 data type. Specify a numeric range for partitioning using start, end, and interval values (for example, start=0, end=1000, interval=10). You must define an interval value so that Prophecy knows at what intervals to create the partitions. |
Only BigQuery tables can be partitioned at the table level. To learn more about partitioning, jump to Partitioning.
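As a point of reference, partitioning settings like these generally map onto dbt's BigQuery `partition_by` config when the target is materialized as a table. The following is a minimal sketch under that assumption; the upstream model name `trips` and the column `trip_ts` are hypothetical placeholders, not Prophecy defaults.

```sql
-- Minimal sketch of a full "wipe and replace" rebuild with BigQuery partitioning.
-- The upstream model (trips) and partition column (trip_ts) are placeholders.
-- partition_by maps to Column Name, Data Type, and Partition By granularity above.
{{
  config(
    materialized = 'table',
    partition_by = {
      'field': 'trip_ts',
      'data_type': 'timestamp',
      'granularity': 'day'
    }
  )
}}

select * from {{ ref('trips') }}
```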
If a row with the same key exists, it is updated. Otherwise, a new row is inserted. You can also limit updates to specific columns, so only selected values are changed in matching rows.
| Parameter | Description |
| --- | --- |
| Unique Key | Column(s) used to match existing records in the target dataset. In many cases, the unique key is equivalent to your table's primary key, if applicable. |
| Use Predicate | Lets you add conditions that specify when to apply the merge. |
| Use a condition to filter data or incremental runs | Applies conditions that filter the incoming data before it is written to the table. |
| Merge Columns | Specifies which columns to update during the merge. If empty, the merge includes all columns. |
| Exclude Columns | Defines columns to exclude from the merge operation. |
| (Advanced) On Schema Change | Specifies how schema changes are handled during the merge. See the options below. |

- ignore: Newly added columns are not written to the model. This is the default option.
- fail: Triggers an error when the source and target schemas diverge.
- append_new_columns: Appends new columns to the existing table.
- sync_all_columns: Adds any new columns to the existing table and removes columns that are no longer present, including data type changes. This option uses the output of the previous gem.
When an incoming VEHICLE_ID matches an existing one, vehicle type information is updated while registration dates remain unchanged. The merge column is explicitly set to TYPE.
Incoming table

| VEHICLE_ID | TYPE |
| --- | --- |
| 101 | Tram |
| 102 | Bus |
| 104 | Bus |

Existing table

| VEHICLE_ID | TYPE | REGISTERED_AT |
| --- | --- | --- |
| 101 | Bus | 2023-12-01 |
| 102 | Train | 2023-12-02 |
| 103 | Bus | 2023-12-03 |

Updated table

| VEHICLE_ID | TYPE | REGISTERED_AT |
| --- | --- | --- |
| 101 | Tram | 2023-12-01 |
| 102 | Bus | 2023-12-02 |
| 103 | Bus | 2023-12-03 |
| 104 | Bus | null |
Notice that only the TYPE column was updated for vehicles 101 and 102 while the REGISTERED_AT values remained unchanged. Additionally, note that vehicle 103 remained unchanged since it didn’t match a vehicle in the incoming data. Finally, vehicle 104 was inserted, and no data was added to the REGISTERED_AT column.
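Under the hood, a merge like this typically corresponds to a dbt incremental model using the merge strategy. The sketch below is illustrative only, using the columns from the example above; the upstream model name `vehicles_staging` is a placeholder.

```sql
-- Illustrative incremental merge: VEHICLE_ID is the Unique Key, and
-- merge_update_columns restricts updates to TYPE (the Merge Column).
-- vehicles_staging is a placeholder for the upstream model.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'VEHICLE_ID',
    merge_update_columns = ['TYPE'],
    on_schema_change = 'ignore'
  )
}}

select VEHICLE_ID, TYPE from {{ ref('vehicles_staging') }}
```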
Replaces entire partitions in the target table. Only partitions containing updated data will be overwritten; other partitions will not be modified.
| Parameter | Description |
| --- | --- |
| Partition by | Defines the partitions of the target table. Databricks: each unique value of the partition column corresponds to a partition; you cannot change the granularity of the partitions. BigQuery: you must manually define the granularity of your partitions, since BigQuery does not automatically infer how to write the partitions. |
| (Advanced) On Schema Change | Specifies how schema changes are handled during the operation. See the options below. |

- ignore: Newly added columns are not written to the model. This is the default option.
- fail: Triggers an error when the source and target schemas diverge.
- append_new_columns: Appends new columns to the existing table.
- sync_all_columns: Adds any new columns to the existing table and removes columns that are no longer present, including data type changes. This option uses the output of the previous gem.
In this case, the partition column is the DATE column, where each partition corresponds to a single day. Since the incoming data includes records from 2024-01-15, all of the records from that date are dropped from the existing table and replaced with the new data.
Incoming table

| TRIP_ID | VEHICLE_ID | DATE |
| --- | --- | --- |
| 401 | 201 | 2024-01-15 |
| 402 | 202 | 2024-01-15 |

Existing table

| TRIP_ID | VEHICLE_ID | DATE |
| --- | --- | --- |
| 101 | 101 | 2024-01-14 |
| 301 | 201 | 2024-01-15 |
| 302 | 202 | 2024-01-15 |

Updated table

| TRIP_ID | VEHICLE_ID | DATE |
| --- | --- | --- |
| 101 | 101 | 2024-01-14 |
| 401 | 201 | 2024-01-15 |
| 402 | 202 | 2024-01-15 |
Notice that only the 2024-01-15 partition was replaced with new data, while the 2024-01-14 partition remained untouched.
The following partitioning parameters allow you to define the partition granularity for this operation.
| Parameter | Description |
| --- | --- |
| Column Name | The name of the column used for partitioning the target table. |
| Data Type | The data type of the partition column. Supported types: timestamp, date, datetime, and int64. |
| Partition By granularity | Applicable only to the timestamp, date, and datetime data types. Defines the time-based partition granularity: hour, day, month, or year. |
| Partition Range | Applicable only to the int64 data type. Specify a numeric range for partitioning using start, end, and interval values (for example, start=0, end=1000, interval=10). You must define an interval value so that Prophecy knows at what intervals to create the partitions. |
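In dbt terms, this mode generally corresponds to the insert_overwrite incremental strategy combined with a `partition_by` config. The following is a rough sketch based on the example above; `trips_staging` is a placeholder, and the dict-style `partition_by` shown here is BigQuery syntax (on Databricks, `partition_by` is a plain list of columns).

```sql
-- Illustrative insert_overwrite model: only partitions present in the incoming
-- data (here, day-level partitions of DATE) are dropped and rewritten.
-- trips_staging is a placeholder for the upstream model.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
      'field': 'DATE',
      'data_type': 'date',
      'granularity': 'day'
    }
  )
}}

select TRIP_ID, VEHICLE_ID, DATE from {{ ref('trips_staging') }}
```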
Tracks historical changes by adding new rows instead of updating existing ones. This lets you preserve a complete history, since every change generates a new entry rather than erasing old information. To be specific:
- New rows are added for incoming records with unique keys that don't exist in the target table.
- For records matching existing unique keys, a new row is added only when the incoming data differs from the existing record.
- New rows are assigned a start date and a null end date (indicating that the row is currently valid).
- If the unique key of a new record matches an existing record, the existing row is assigned an end date to mark when it stopped being valid.
This creates a complete timeline showing how data evolved over time.
| Parameter | Description |
| --- | --- |
| Unique Key | Column(s) used to match existing records in the target dataset. In many cases, the unique key is equivalent to your table's primary key, if applicable. |
| Invalidate deleted rows | When enabled, records in the target that correspond to rows deleted from the source are marked as no longer valid. |
| Determine new records by checking timestamp column | Detects changed records using the timestamp in the Updated at column that you define. |
| Determine new records by looking for differences in column values | Detects changed records based on changes to values in one or more specified columns. |
Each time a vehicle changes routes, a new row is added to preserve history. In this case, Determine new records by checking timestamp column is enabled, and ASSIGNED_AT is the timestamp column.
Incoming table

| VEHICLE_ID | ROUTE | ASSIGNED_AT |
| --- | --- | --- |
| 101 | Route B | 2024-01-15 |

Existing table

| VEHICLE_ID | ROUTE | ASSIGNED_AT | valid_from | valid_to |
| --- | --- | --- | --- | --- |
| 101 | Route A | 2024-01-01 | 2024-01-01 | NULL |

Updated table

| VEHICLE_ID | ROUTE | ASSIGNED_AT | valid_from | valid_to |
| --- | --- | --- | --- | --- |
| 101 | Route A | 2024-01-01 | 2024-01-01 | 2024-01-15 |
| 101 | Route B | 2024-01-15 | 2024-01-15 | NULL |
Notice that the original Route A record now has an end date (2024-01-15 taken from the ASSIGNED_AT column) and a new Route B record was added with a null end date, preserving the complete history of route assignments.
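This kind of history tracking is commonly implemented as a dbt snapshot with the timestamp strategy. Below is a minimal sketch based on the example above; the snapshot name, source model, and target schema are placeholders, and dbt's generated validity columns are named dbt_valid_from/dbt_valid_to rather than valid_from/valid_to.

```sql
-- Illustrative dbt snapshot: a new row is added when ASSIGNED_AT advances,
-- and the superseded row is given an end date. Names below are placeholders.
{% snapshot vehicle_routes_snapshot %}
{{
  config(
    target_schema = 'snapshots',
    unique_key = 'VEHICLE_ID',
    strategy = 'timestamp',
    updated_at = 'ASSIGNED_AT',
    invalidate_hard_deletes = True
  )
}}
select VEHICLE_ID, ROUTE, ASSIGNED_AT from {{ ref('vehicle_routes') }}
{% endsnapshot %}
```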
Updates records that match conditions defined in the Use Predicate parameter. Instead of updating individual rows, this strategy replaces the entire portion of the target table that matches the given predicate. The predicate defines which rows in the target table are fully replaced with the incoming data.
| Parameter | Description |
| --- | --- |
| Use Predicate | Lets you add conditions that specify when to apply the merge. |
| Use a condition to filter data or incremental runs | Applies conditions that filter the incoming data before it is written to the table. |
| (Advanced) On Schema Change | Specifies how schema changes are handled during the operation. See the options below. |

- ignore: Newly added columns are not written to the model. This is the default option.
- fail: Triggers an error when the source and target schemas diverge.
- append_new_columns: Appends new columns to the existing table.
- sync_all_columns: Adds any new columns to the existing table and removes columns that are no longer present, including data type changes. This option uses the output of the previous gem.
When any record from January 5th onwards needs updating, ALL records from that date forward are dropped and replaced with the incoming data. Records from before January 5th remain unchanged.

Predicate: `DATE >= '2024-01-05'`
Incoming table

| TRIP_ID | STATUS | DATE |
| --- | --- | --- |
| 201 | Closed | 2024-01-05 |
| 203 | Closed | 2024-01-10 |

Existing table

| TRIP_ID | VEHICLE_ID | DATE | STATUS |
| --- | --- | --- | --- |
| 201 | 101 | 2024-01-05 | Open |
| 202 | 102 | 2023-12-20 | Open |
| 203 | 103 | 2024-01-10 | Open |
| 204 | 104 | 2024-01-15 | Open |

Updated table

| TRIP_ID | VEHICLE_ID | DATE | STATUS |
| --- | --- | --- | --- |
| 201 | 101 | 2024-01-05 | Closed |
| 203 | 103 | 2024-01-10 | Closed |
| 202 | 102 | 2023-12-20 | Open |
Notice that ALL trips from January 5th onwards (201, 203, and 204) were dropped and replaced with only the incoming records (201 and 203), while trip 202 (from December 20th) remained unchanged.
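Conceptually, this operation behaves like deleting every target row that matches the predicate and inserting the incoming rows in their place. A plain-SQL equivalent of the example above, with placeholder table names `trips` and `trips_incoming`, not the exact code Prophecy generates:

```sql
-- Conceptual SQL equivalent of replacing all rows that match the predicate.
-- All target rows matching the predicate are removed...
delete from trips
where DATE >= '2024-01-05';

-- ...and the incoming rows are inserted in their place.
insert into trips
select * from trips_incoming;
```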
For rows in the target table whose keys match rows in the incoming dataset, this strategy deletes the matching rows from the existing table and replaces them with the incoming rows. Incoming rows with new keys are inserted normally. This ensures that updated records are fully replaced instead of partially updated.
This strategy helps when your unique key is not truly unique (multiple rows per key need to be fully refreshed).
| Parameter | Description |
| --- | --- |
| Unique Key | Column(s) used to match existing records in the target dataset. |
| Use Predicate | Lets you add conditions that specify when to apply the merge. |
| Use a condition to filter data or incremental runs | Applies conditions that filter the incoming data before it is written to the table. |
| (Advanced) On Schema Change | Specifies how schema changes are handled during the merge. See the options below. |

- ignore: Newly added columns are not written to the model. This is the default option.
- fail: Triggers an error when the source and target schemas diverge.
- append_new_columns: Appends new columns to the existing table.
- sync_all_columns: Adds any new columns to the existing table and removes columns that are no longer present, including data type changes. This option uses the output of the previous gem.
If a vehicle already exists in the fleet, its old record is deleted and replaced with the incoming row. New vehicles are simply inserted.
Incoming table

| VEHICLE_ID | TYPE | REGISTERED_AT |
| --- | --- | --- |
| 201 | Train | 2024-01-15 |
| 203 | Bus | 2024-01-16 |

Existing table

| VEHICLE_ID | TYPE | REGISTERED_AT |
| --- | --- | --- |
| 201 | Train | 2023-12-01 |
| 202 | Tram | 2023-12-02 |

Updated table

| VEHICLE_ID | TYPE | REGISTERED_AT |
| --- | --- | --- |
| 201 | Train | 2024-01-15 |
| 202 | Tram | 2023-12-02 |
| 203 | Bus | 2024-01-16 |
Notice that vehicle 201’s old record was completely replaced with the new data, vehicle 202 remained unchanged, and vehicle 203 was added as a new entry.
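On warehouses that support it, this mode roughly corresponds to dbt's delete+insert incremental strategy. A sketch under that assumption; `vehicles_staging` is a placeholder for the upstream model:

```sql
-- Illustrative delete+insert model: existing rows whose VEHICLE_ID appears in
-- the incoming data are deleted, then all incoming rows are inserted.
-- vehicles_staging is a placeholder for the upstream model.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'delete+insert',
    unique_key = 'VEHICLE_ID'
  )
}}

select VEHICLE_ID, TYPE, REGISTERED_AT from {{ ref('vehicles_staging') }}
```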
Prophecy simplifies data transformation by providing intuitive write mode options that abstract away the complexity of the underlying SQL operations. Behind the scenes, Prophecy generates dbt models that implement these write strategies using SQL warehouse-specific commands.

When you select a write mode in a Table gem, Prophecy automatically generates the appropriate dbt configuration and SQL logic. This means you can focus on your data transformation logic rather than learning dbt's materialization strategies or writing complex SQL merge statements.

To understand exactly what happens when Prophecy runs these write operations, switch to the Code view of your project and inspect the generated dbt model files. These files contain the SQL statements and dbt configuration (like materialized: 'incremental') that dbt uses to execute the write operation. To learn more about the specific configuration options available for each SQL warehouse, visit the dbt documentation links below.
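For orientation, an incremental model in Code view typically pairs a config block with an `is_incremental()` filter, along the lines of the sketch below. The model and column names are illustrative, not the exact code Prophecy emits.

```sql
-- Generic shape of an incremental model (names are illustrative only).
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'id'
  )
}}

select *
from {{ ref('source_model') }}
{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what's already in the target.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```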
Depending on the SQL warehouse you use to write tables, partitioning can have different behavior. Let’s examine the difference between partitioning in Google BigQuery and Databricks.
BigQuery

In Google BigQuery, partitioning is a table property. Partitioning is defined at the table schema level (time, integer range, or column value). Because it is part of the table architecture, BigQuery automatically optimizes physical storage according to the partitioning. Once a table is partitioned, every write to that table, full or incremental, respects the partitioning. That means even when you drop and create an entirely new table, Google BigQuery creates the table with partitions in an optimized way.

Databricks

In Databricks, partitioning is a write strategy. Databricks organizes data in folders by column values. Partitioning only makes sense when you're using the Wipe and Replace Partitions write mode, because it allows you to overwrite specific directories (partitions) without rewriting the whole table. For the Wipe and Replace Table option, the table is dropped and completely recreated. Partitioning doesn't add any runtime benefit here, so this is not an option for Databricks.
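In generated dbt configs, this difference typically shows up in the shape of `partition_by`. The sketch below contrasts the two forms; each config would live in its own model file, and the column name `event_date` is a placeholder.

```sql
-- BigQuery: partition_by is a dict that declares the data type and granularity,
-- because partitioning is part of the table definition itself.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = { 'field': 'event_date', 'data_type': 'date', 'granularity': 'day' }
  )
}}

-- Databricks: partition_by is a plain list of columns; each distinct value of
-- event_date becomes its own partition directory.
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = ['event_date']
  )
}}
```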
You see a schema mismatch error when writing to the target table
This happens when the incoming and existing schemas don't align. To resolve this, use the On Schema Change setting to control the behavior, or ensure that the schemas are compatible.
You see unexpected duplicate data in the target table
This happens mainly when using the Append Rows write mode. To resolve this, consider using one of the merge-based write modes instead of append.