Requirements
Dependencies
- ProphecySparkDataQualityPython 0.0.1+
Cluster requirements
- Set the SPARK_VERSION environment variable to a Spark version 3.3+
- Install the PyDeequ library on the cluster
- Install the Maven Deequ library on the cluster (choose the version that matches your Spark version)
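As a rough way to confirm the first cluster requirement, the sketch below reads SPARK_VERSION from the environment and checks that it parses to at least 3.3. The helper name `spark_version_ok` is hypothetical and is not part of Prophecy or PyDeequ; only the variable name and the 3.3 minimum come from the requirements above.

```python
import os

MIN_SPARK = (3, 3)  # minimum version stated in the cluster requirements


def spark_version_ok(value):
    """Return True when `value` looks like a Spark version of at least 3.3.

    Hypothetical helper for illustration only.
    """
    if not value:
        return False
    parts = value.strip().split(".")
    try:
        major_minor = tuple(int(p) for p in parts[:2])
    except ValueError:
        return False  # non-numeric input such as "latest"
    if len(major_minor) < 2:
        major_minor += (0,)  # treat "3" as "3.0"
    return major_minor >= MIN_SPARK


current = os.environ.get("SPARK_VERSION")
print("SPARK_VERSION =", current, "| ok:", spark_version_ok(current))
```

Running this in a notebook attached to the cluster should show whether the variable is visible to the Python process before the gem executes.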
Input and Output
| DataFrame | Description |
|---|---|
| in0 | The DataFrame whose data quality will be checked. |
| out0 | Passes through the in0 DataFrame unchanged. |
| out1 | Outputs a DataFrame with the verification results and, where applicable, the failure messages that you define for each check. |
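The relationship between the three ports can be pictured with a small plain-Python sketch that uses lists of dictionaries in place of Spark DataFrames. The function name `data_quality_gem`, the shape of `checks`, and the result keys (`check`, `status`, `message`) are all assumptions made for illustration; the actual schema of out1 may differ.

```python
def data_quality_gem(in0, checks):
    """Return (out0, out1) for a list of row dicts and a dict of named checks.

    `checks` maps a check name to (predicate, failure_message). Both this
    shape and the result keys below are hypothetical.
    """
    out1 = []
    for name, (predicate, failure_message) in checks.items():
        passed = bool(predicate(in0))
        out1.append({
            "check": name,
            "status": "Success" if passed else "Failure",
            # The per-check failure message only appears when the check fails.
            "message": "" if passed else failure_message,
        })
    # out0 is the input, passed through unchanged.
    return in0, out1


in0 = [{"id": 1}, {"id": 2}, {"id": None}]
checks = {
    "row_count": (lambda rows: len(rows) >= 3, "fewer than 3 rows"),
    "id_complete": (
        lambda rows: all(r["id"] is not None for r in rows),
        "id contains nulls",
    ),
}
out0, out1 = data_quality_gem(in0, checks)
for result in out1:
    print(result)
```

The point of the sketch is that downstream gems can keep reading the original data from out0 while a separate branch inspects or stores the results from out1.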
Data Quality Checks
| Check Type | Success Criteria |
|---|---|
| Completeness | Fraction of non-null values is greater than X. The default is 100% non-null. |
| Row count | Input DataFrame has at least X rows. |
| Distinct count | Number of distinct values in selected columns is equal to X. |
| Uniqueness | Values in selected columns are unique (occurring exactly once). |
| Data type | Selected columns have a certain data type. |
| Min-max length | Strings in selected columns have a minimum length of X and a maximum length of Y. |
| Total sum | Sum of values in selected columns is equal to X. |
| Mean value | Mean of values in selected columns is equal to X. |
| Standard deviation | Standard deviation of values in selected columns is equal to X. |
| Non-negative value | Fraction of non-negative values is at least X percent. |
| Positive value | Fraction of positive values is at least X percent. |
| Lookup | Fraction of values in selected columns that match lookup values is at least X percent. |
| Column to constant value greater than | Selected column values are greater than a constant value X. |
| Column to constant value greater than or equal to | Selected column values are greater than or equal to a constant value X. |
| Column to constant value less than | Selected column values are less than a constant value X. |
| Column to constant value less than or equal to | Selected column values are less than or equal to a constant value X. |
| Column to column greater than | In each row, the value in the left column is greater than the value in the right column. |
| Column to column greater than or equal to | In each row, the value in the left column is greater than or equal to the value in the right column. |
| Column to column less than | In each row, the value in the left column is less than the value in the right column. |
| Column to column less than or equal to | In each row, the value in the left column is less than or equal to the value in the right column. |
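To make a few of the success criteria concrete, here is a minimal plain-Python sketch of the completeness, uniqueness, and min-max length checks over a list of row dictionaries. It only illustrates the criteria as described in the table; it is not how the gem or Deequ computes them, the function names are hypothetical, and the handling of nulls and empty input is an assumption.

```python
from collections import Counter


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 1.0  # assumption: an empty input counts as fully complete
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows, column):
    """True when every non-null value in `column` occurs exactly once."""
    counts = Counter(
        row[column] for row in rows if row.get(column) is not None
    )
    return all(n == 1 for n in counts.values())


def length_in_range(rows, column, min_len, max_len):
    """True when every non-null string in `column` has min_len <= length <= max_len."""
    return all(
        min_len <= len(row[column]) <= max_len
        for row in rows
        if row.get(column) is not None
    )


rows = [
    {"id": 1, "code": "AB"},
    {"id": 2, "code": "ABC"},
    {"id": 2, "code": None},
    {"id": 4, "code": "ABCD"},
]

print(completeness(rows, "code"))           # 3 of 4 values are non-null
print(is_unique(rows, "id"))                # id 2 appears twice
print(length_in_range(rows, "code", 2, 4))  # lengths are 2, 3, and 4
```

A completeness check with threshold X would then compare the returned fraction against X, while the uniqueness and length checks pass or fail outright.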
Post Actions
| Action | Description |
|---|---|
| Continue execution | Continue pipeline execution regardless of data quality success or failure. |
| Terminate execution | Stop pipeline execution after the DataQualityCheck gem runs, depending on a maximum number of failed checks that you set. Review gem phases to understand the order in which gems run. |
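The two post actions can be sketched as a small function that counts failed checks and either returns or raises. This is an illustration under stated assumptions: the names `apply_post_action`, `DataQualityError`, and `max_failed_checks` are hypothetical, and treating "more failures than the maximum" as the trigger is a guess, since the exact comparison the gem uses is not stated here.

```python
class DataQualityError(Exception):
    """Hypothetical error used to stop the pipeline in this sketch."""


def apply_post_action(results, action="continue", max_failed_checks=0):
    """Count failed checks and raise if the 'terminate' action applies.

    Assumption: termination triggers when the number of failed checks is
    greater than `max_failed_checks`.
    """
    failed = sum(1 for r in results if r["status"] == "Failure")
    if action == "terminate" and failed > max_failed_checks:
        raise DataQualityError(
            f"{failed} failed checks exceed the maximum of {max_failed_checks}"
        )
    return failed


results = [
    {"check": "row_count", "status": "Success"},
    {"check": "id_complete", "status": "Failure"},
    {"check": "code_length", "status": "Failure"},
]

# "Continue execution": failures are counted but never stop the run.
print(apply_post_action(results, action="continue"))

# "Terminate execution": two failures against a maximum of one stops the run.
try:
    apply_post_action(results, action="terminate", max_failed_checks=1)
except DataQualityError as exc:
    print("stopped:", exc)
```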

