Overview
The Regex gem enables pattern matching and text extraction using regular expressions. This gem provides four distinct output methods for processing text data:
Prerequisites
- Add
prophecy_basics package version 1.0.0 or higher to your project.
The Regex gem uses the following ports:
| Port | Description |
|---|
| in0 | The source table containing text data that needs to be processed with regex patterns. |
| out | The output table containing: - Original columns preserved
- New columns created based on the selected output method (Replace, Tokenize, Parse, or Match)
The output schema depends on the chosen method and configuration. |
Parameters
Configure the Regex gem using the following parameters.
Common configuration
These parameters are available for all regex operations:
| Parameter | Description |
|---|
| Select Column to Split | Choose the input column containing the text data you want to process with regex patterns. |
| Output Method | Select how the regex operation should handle matches: - Replace: Substitute matched text with replacement values.
- Tokenize: Split text into tokens or columns based on regex patterns.
- Parse: Extract specific groups from regex matches into separate columns.
- Match: Determine whether text matches the pattern.
|
| Regex | Enter your regular expression pattern. The field supports standard regex syntax with capture groups for extracting specific portions of matched text.
Note: Different SQL dialects may require specific escaping. For example, to match a literal dot, generic regex uses a\. while Databricks SQL requires a\\. |
| Case Insensitive Matching | Enable this option to perform pattern matching without regard to letter case. |
Replace configuration
The Replace method substitutes matched portions of text with specified replacement values. When using this method, the gem outputs an additional column with the replaced values.
| Parameter | Description |
|---|
| Replacement Text | Specify replacement text or use capture group references. |
| Copy Unmatched Text to Output | When enabled, non-matching text is preserved in the appended output column. |
Example
Use this method to standardize phone number formats from 555-123-4567 to (555) 123-4567.
-
Select Column to Split:
phone_number
-
Regex:
(\d{3})-(\d{3})-(\d{4})
-
Replacement text:
($1)$2-$3
This inserts capture groups 1, 2, and 3 into the replacement pattern to create the new formatted string. The result is written to a new output column, while the original value is preserved.
Input table
| id | phone_number |
|---|
| 1 | 555-332-1234 |
| 2 | 555-034-9876 |
Output table
| id | phone_number | phone_number_replaced |
|---|
| 1 | 555-332-1234 | (555)332-1234 |
| 2 | 555-034-9876 | (555)034-9876 |
Tokenize configuration
The Tokenize method splits text into tokens based on regex patterns and capture groups. Each capture group becomes a token. This method creates either new columns or rows depending on your configuration.
| Parameter | Description |
|---|
| Select Split Strategy | Choose how to split the data: - Split to columns: Breaks text into tokens and places each token into a new column in the same row. Requires a fixed number of columns.
- Split to rows: Breaks text into tokens and outputs each token as a new row in a single column. Rows are generated dynamically, making this option useful when the number of tokens varies.
|
| Allow Blank Tokens (Split to columns only) | If there are fewer tokens than the defined number of columns, allow empty strings to fill the extra columns. Otherwise, those columns are set to NULL. |
| Number of columns (Split to columns only) | Specify the number of output columns to create for tokenized data. |
| For Extra Columns (Split to columns only) | Define how to handle cases where there are more tokens than columns. - Drop Extra with Warning: Skip writing excess tokens and log a warning message to indicate this.
- Drop Extra without Warning: Skip writing excess tokens silently without generating warnings.
- Error: Stop processing and raise an error when the number of tokens exceeds the defined number of columns.
|
| Output Root Name | Base name for the new column(s) containing the tokens. |
Example
Use this method to parse email addresses into username and domain components.
- Select Column to Split:
email
- Regex:
([^@]+)@(.+)
- Select Split Strategy: Split to columns
- Number of columns: 2
- Output root name:
token
Input table
| id | email |
|---|
| 1 | support@example.com |
| 2 | sales@company.org |
Output table
| id | email | token_1 | token_2 |
|---|
| 1 | support@example.com | support | example.com |
| 2 | sales@company.org | sales | company.org |
Parse configuration
The Parse method extracts capture groups from regex matches and outputs each group as a separate column. Prophecy automatically generates one output column for every capture group in the regex.
| Parameter | Description |
|---|
| New Column Name | Specify the name for the new column. |
| Select Data Type | Choose the data type. |
| Regex Expression | View the capture group that will populate the column. If you edit this value, Prophecy will automatically revert it to the original value. |
Rows in the Parse Configuration table are determined by the number of capture groups in the
Regex field. You cannot add additional rows to or remove rows from this table.
Example
Use this method to parse phone numbers into area_code, exchange, and number columns.
-
Select Column to Split:
phone_number
-
Regex:
([0-9]{3})-([0-9]{3})-([0-9]{4})
-
Parse Configuration:
| New Column Name | Select Data Type | Regex Expression |
|---|
area_code | String | ([0-9]{3}) |
exchange | String | ([0-9]{3}) |
number | String | ([0-9]{4}) |
Input table
| id | phone_number |
|---|
| 1 | 555-332-1234 |
| 2 | 555-034-9876 |
Output table
| id | phone_number | area_code | exchange | number |
|---|
| 1 | 555-332-1234 | 555 | 332 | 1234 |
| 2 | 555-034-9876 | 555 | 034 | 9876 |
Match configuration
The Match method determines whether text matches the specified regex pattern. Adds a column with 1 for matches and 0 for non-matches.
| Parameter | Description |
|---|
| Column name for match status | Specify the name for the new column containing match results. |
| Error if not Matched | Enable to raise an error when no match is found. When disabled, non-matching rows will receive a 0 value. |
Example
Use this method to validate email addresses and create a binary match column.
- Select Column to Split:
email
- Regex:
^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
- Column name for match status:
is_valid_email
- Error if not matched:
Disabled
Input table
| id | email |
|---|
| 1 | support@example.com |
| 2 | sales.team |
| 3 | engineering@company.org |
Output table
| id | email | is_valid_email |
|---|
| 1 | support@example.com | 1 |
| 2 | sales.team | 0 |
| 3 | engineering@company.org | 1 |