The DataQualityCheck gem includes a variety of checks that are built on the open-source tool Deequ. Use this gem to make sure your data adheres to predefined constraints.
Requirements
Dependencies
- ProphecySparkDataQualityPython 0.0.1+
Cluster requirements
- Set the `SPARK_VERSION` environment variable to a Spark version 3.3+
- Install the PyDeequ library on the cluster
- Install the Maven Deequ library on the cluster (choose the version that matches your Spark version)
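One way to sanity-check the first cluster requirement is to read the variable and compare it with the 3.3 minimum. The helper below is a minimal sketch: `spark_version_ok` is a hypothetical name and is not part of the gem or of PyDeequ. On a real cluster the variable is normally set in the cluster configuration rather than in pipeline code.

```python
import os

def spark_version_ok(raw, minimum=(3, 3)):
    """Return True if a version string such as "3.3.2" is at least `minimum`."""
    parts = raw.strip().split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        # Missing or non-numeric version components count as a failed check.
        return False
    return major_minor >= minimum

# Read the variable named in the requirements; an unset variable becomes "".
current = os.environ.get("SPARK_VERSION", "")
if not spark_version_ok(current):
    print(f"SPARK_VERSION={current!r} does not look like Spark 3.3 or later")
```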
Input and Output
| DataFrame | Description |
|---|---|
| in0 | The DataFrame whose data quality is checked. |
| out0 | Passes through the in0 DataFrame unchanged. |
| out1 | Outputs a DataFrame with the verification results and failure messages (if applicable) that you can define per check. |
Data Quality Checks
| Check Type | Success Criteria |
|---|---|
| Completeness | Fraction of non-null values is greater than X. The default is 100% non-null. |
| Row count | Input DataFrame has at least X rows. |
| Distinct count | Number of distinct values in selected columns is equal to X. |
| Uniqueness | Values in selected columns are unique (occurring exactly once). |
| Data type | Selected columns have a certain data type. |
| Min-max length | Strings in selected columns have a minimum length of X and a maximum length of Y. |
| Total sum | Sum of values in selected columns is equal to X. |
| Mean value | Mean of values in selected columns is equal to X. |
| Standard deviation | Standard deviation of values in selected columns is equal to X. |
| Non-negative value | Fraction of non-negative values is at least X percent. |
| Positive value | Fraction of positive values is at least X percent. |
| Lookup | Fraction of values in selected columns that match lookup values is at least X percent. |
| Column to constant value greater than | Selected column values are greater than a constant value X. |
| Column to constant value greater than or equal to | Selected column values are greater than or equal to a constant value X. |
| Column to constant value less than | Selected column values are less than a constant value X. |
| Column to constant value less than or equal to | Selected column values are less than or equal to a constant value X. |
| Column to column greater than | All values in the left column are greater than all values in the right column. |
| Column to column greater than or equal to | All values in the left column are greater than or equal to all values in the right column. |
| Column to column less than | All values in the left column are less than all values in the right column. |
| Column to column less than or equal to | All values in the left column are less than or equal to all values in the right column. |
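To make a few of the success criteria in the table concrete, the sketch below computes them in plain Python over a small in-memory sample. It illustrates the definitions only and is not the gem's or Deequ's implementation. The sample rows, column names, and function names are all hypothetical, and the functions assume a non-empty input.

```python
from collections import Counter

# Hypothetical sample data standing in for the in0 DataFrame.
rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": None, "amount": 0.0},
    {"id": 3, "email": "c@example.com", "amount": -5.0},
    {"id": 3, "email": "d@example.com", "amount": 7.5},
]

def completeness(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    non_null = sum(1 for row in rows if row[column] is not None)
    return non_null / len(rows)

def is_unique(rows, column):
    """True when every value in `column` occurs exactly once."""
    counts = Counter(row[column] for row in rows)
    return all(n == 1 for n in counts.values())

def non_negative_fraction(rows, column):
    """Fraction of rows whose value in `column` is >= 0 (assumes no nulls)."""
    return sum(1 for row in rows if row[column] >= 0) / len(rows)

# Completeness: passes only if the fraction clears the threshold X.
print(completeness(rows, "email") > 0.7)            # 3 of 4 emails are non-null
# Uniqueness: fails because id 3 appears twice.
print(is_unique(rows, "id"))
# Non-negative value: compare the fraction with "at least X percent".
print(non_negative_fraction(rows, "amount") >= 0.5)
```

Each function returns the raw measurement, so the comparison with the threshold X stays visible at the call site.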
Post Actions
| Action | Description |
|---|---|
| Continue execution | Continue pipeline execution regardless of data quality success or failure. |
| Terminate execution | Stop pipeline execution after the DataQualityCheck gem runs, based on a maximum number of failed checks that you specify. Review gem phases to understand the order in which gems run. |
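The difference between the two post actions can be sketched as a small decision function. This is an illustration under stated assumptions, not the gem's code. `apply_post_action`, `DataQualityCheckFailed`, the `(check_name, status)` result shape, and the `"Success"` status string are all hypothetical. The sketch also assumes that execution stops only when the number of failed checks exceeds the configured maximum; the gem may define that boundary differently.

```python
class DataQualityCheckFailed(Exception):
    """Raised when the terminate action is triggered (hypothetical name)."""

def apply_post_action(results, action="continue", max_failed_checks=0):
    """Return the names of failed checks, or raise if the pipeline should stop.

    `results` is a list of (check_name, status) pairs. Any status other than
    "Success" counts as a failure in this sketch.
    """
    failed = [name for name, status in results if status != "Success"]
    # Assumption: terminate only when failures exceed the configured maximum.
    if action == "terminate" and len(failed) > max_failed_checks:
        raise DataQualityCheckFailed(
            f"{len(failed)} checks failed (maximum allowed: {max_failed_checks}): "
            + ", ".join(failed)
        )
    return failed

# Hypothetical verification results, similar in spirit to the out1 output.
example_results = [
    ("row_count", "Success"),
    ("completeness", "Failure"),
    ("uniqueness", "Failure"),
]

# "Continue execution" reports the failures without stopping anything.
print(apply_post_action(example_results, action="continue"))
```

With `action="terminate"` and `max_failed_checks=1`, the same input raises `DataQualityCheckFailed`, because two checks failed.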

