Duplicate
A duplicate rule is a mechanism used to identify duplicate records, values, or entries within a dataset. It helps ensure data integrity by flagging or preventing the entry of duplicate data. Duplicate records can lead to inaccurate reporting, inefficiencies, and poor user experience, data testers can use this out of the box rule to quickly identify such records in early stages itself.
Use Cases
- Single Column Duplication: The rule checks for duplicate values within a single column. For example, it will look for multiple records that have the same value in the "email" or "phone number" column.
- Multiple Columns: Sometimes, a combination of columns is used to identify duplicates. In this case, the rule would check if the values in multiple columns (e.g., "first name" and "last name") match across different records.
- Entire Record: The duplicate rule can also check if an entire record, this helps identify if the ETL pipeline was mistakenly run more than once.
Components
There are five (5) different components in a Duplicate Rule:
- Overview: To provide rule related metadata.
- Source Dataset: To provide source connection details required to fetch the dataset.
- Checks: To select the columns on which duplicate rule needs to be applied
- Exception Report: To provide customized configuration for the exception report generated.
- Summary: To provide variables that are attached to rule execution instance event.
Overview
This component allows users to define rules and configure advanced options for parameterization. Below are the details for the fields included in the rule definition:
Field | Description | Purpose |
---|---|---|
Rule Name (Mandatory) | The unique name of the rule within a folder. It should be descriptive enough to help users quickly identify and understand the purpose of the rule. | Serves as the primary identifier for the rule. Ensures clarity and uniqueness within the folder structure. |
Rule Code (Optional) | An alphanumeric shorthand or technical identifier for the rule. This field is optional but highly useful for efficient management and organization of rules. | Helps users search, filter, or differentiate between similar rules. Provides a compact reference for integrations, automation, or workflows. |
Description (Optional) | A concise explanation of the rule. This field offers additional technical or business context to enhance understanding. | Clarifies the rule's intent or scope for users. Assists in onboarding new team members or providing documentation for audits. |
Purpose (Optional) | A short summary describing the purpose or intended outcome of the rule. This field provides context for why the rule was created. | Highlights the business or technical objective of the rule. Helps users understand its significance and expected impact. |
Criticality (Mandatory) | The Criticality field indicates the severity level of an issue. This is a mandatory field that helps categorize and prioritize errors based on their impact. The field provides the following options: Warning (Default): The issue should be reviewed but does not require immediate action. Critical: The issue needs to be resolved promptly; however, testing can continue despite its presence. Blocker: The issue must be resolved before testing can proceed, as it prevents further progress. | Ensures a clear understanding of the issue's severity, enabling teams to prioritize effectively and maintain workflow efficiency. |
Advance Settings
Field | Description | Purpose |
---|---|---|
Parameter (Optional) | This field specifies the default parameter file associated with the rule. | Enables parameterization of the rule by referencing keys in the parameter file in SQL, checks, etc. |
Source Dataset
This section allows users to configure the source settings to extract data for processing. The following fields and options are available:
- Connection Type: This field allows users to select the type of data source (application, database, file ...) they want to connect to. Selecting the appropriate connection type ensures that only relevant connections are displayed for further selection.
- Connection: This field displays a list of connections available for the selected connection type. Once selected a user can:
- Test the connection using the ✔️ Checkmark Icon.
- Edit the connection details using the ✏️ Edit Icon.
- Refresh the list of connections using the 🔄 Refresh Icon.
- Methods for reading data from the data source
- Default: This option allows users to browse and select the required Schema and Table/View from the chosen connection. It also allows user to add a where caluse in the query. It is an optional field. It is Ideal for users who prefer navigating schemas and tables visually without writing SQL.
- SQL: This option allows users to manually input custom SQL queries to retrieve specific data from the selected connection.It provides flexibility for advanced users to query data directly, offering precise control over the data extraction process.
- Preview Data: Clicking the Preview Data button allows users to view a sample of the data retrieved based on the selected table (Default) or SQL query.
Checks
This check is primarily used to check for duplicate values in a dataset by selecting a column or a group of columns. It allows only one check in which user selects one or more columns to be tested.
Exception Report
This component is for configuration of exception report display, storage and export formats. Below are the different properties available in this component.
Property | Description | Note |
---|---|---|
Export File Format | User can specify Excel or CSV to be the default export format | |
Store Failure & Errors Only | Enable this property to store only failures in the report | This property is enabled by default. |
Store Exported File Only | Enable this property to delete raw exception report and store only exported file | |
Show Download Exception URL | Enable this option to recieve exception report download link in email subscription | |
Store Only # Rows | Use this if you would like to store only specified number of records in the report |
Summary
Users can configure result override as well as attach custom variables to the Rule execution event.
Exit Code
The number of rows returned by the source dataset is the exit code. If it returns 0 rows exit code is 0 and if it returns 10 rows then exit code is 10.
Exit Code | Status | Note |
---|---|---|
== 0 | Success | No data returned by the source dataset, Rule execution is successful. |
> 0 | Failure (Warning/ Critical/ Blocker) | More than 0 rows returned by the dataset which is captured as a failure. |
< 0 | Error | A system error was encountered when executing the Rule. |
Scenarios
Following scenarios will help you understand how the Exit Code of a Duplicate Rule is calculated.
Scenario #1
In this example, there were no duplicates in the source dataset, hence ExitCode = 0. Since Exit Code = 0 the Duplicate rule was a success.
Scenario #2
In this example, since there are duplicates in the source dataset for the column select, ExitCode = 2993. Since Exit Code = 2993 the Duplicate rule has warnings.
How To: Create a Duplicate Rule
This video shows you how to create a duplicate rule against customer data.
Other Actions
Field Name | Description |
---|---|
Discard Rule | This option is available throughout the process until the Publish option is hit. During the initial creation, if discard is pressed, the entire rule will be discarded. Once the rule is published and then any changes are made, discard will roll back only the changes made during the last edit. |
Publish | Once the rule is configuration is complete, the publish option commits the rule created. Any changes made to the rule will come to effect only when the rule is published. |
Run | Once the Rule is published, the Run option will activate. This option will actually execute the rule and show the result. |
Recent Runs | Will highlight the box that displays the recent runs. |
Delete | Deletes the rule. This option is activated once the rule is published. |
More >Duplicate | Duplicates the rule |
More>Share | You can share the link to other users |
More>Deactivate/Activate | This option toggles between activation and deactivation of rules |
Subscriptions | This option is used to send the email notification to selects users on test Success, test Failures and Rule execution Errors |
Consideration
- User cannot add any out-of-box or custom checks in a Duplicate Rule.