Data quality rules
Documentation of data quality rules, with explanations and the technical constraints used to validate business partner data records.
Description
Transforming human-documented data requirements into executable data quality rules is mostly a manual IT effort, and changing requirements cause this effort again and again. Some checks, e.g. tax number validity (not just format!), require external services. Other checks, e.g. the validity of legal forms, require managed reference data (e.g. legal forms by country, plus abbreviations). Moreover, continuous data quality assurance (i.e. batch analyses) and real-time checks in workflows often use different rule sets.
Data requirements and related reference data are collected and updated collaboratively by the Data Sharing Community. Data quality rules are derived from these requirements automatically. All data quality rules are executed behind a single interface, in real time. Batch jobs and single-record checks use the same rule set and can be integrated via APIs.
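What such an API integration looks like depends on the concrete service; the following is a minimal sketch of a single-record check over a REST interface. The endpoint URL, payload fields, and response structure are illustrative assumptions, not the documented API.

```python
import requests

# Hypothetical endpoint; the real service URL and payload schema may differ.
VALIDATION_URL = "https://api.example.com/data-validation/v1/validate"

def validate_record(record: dict) -> list:
    """Send a single business partner record to the validation service
    and return the list of data quality rule violations."""
    response = requests.post(VALIDATION_URL, json={"businessPartner": record}, timeout=10)
    response.raise_for_status()
    # Assumed response shape: {"violations": [{"rule": ..., "criticality": ..., "message": ...}]}
    return response.json().get("violations", [])

violations = validate_record({
    "name": "Example SARL",
    "address": {"country": "FR"},
    "identifiers": [{"type": "SIREN", "value": "12345678"}],
})
for violation in violations:
    print(violation)
```

A batch job would call the same interface record by record (or via a bulk variant), which is what keeps batch analyses and real-time checks on one rule set.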
To prove that a data quality rule is correct in terms of content, we maintain one or more supporting documents per data quality rule that state the rule's source. This could be:
- a public authority source
- any other trustworthy web page
- a data standard of a specific community member
We manage the URL (if any), a screenshot of the relevant parts (if any), and the source's name (e.g. community member data standard, European Commission, National ...). See Identifier format invalid (SIREN (France)) for an exemplary rule that was specified and implemented based on information provided by the OECD.
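To give a flavor of what such a rule encodes: a SIREN is a nine-digit identifier, so the format part of this exemplary rule boils down to a simple pattern match. The sketch below, with a function name of our choosing, only illustrates that format check; the documented rule may cover further cases.

```python
import re

# A SIREN (France) is a nine-digit identifier; the format rule
# flags any value that does not consist of exactly nine digits.
SIREN_FORMAT = re.compile(r"^\d{9}$")

def siren_format_invalid(value: str) -> bool:
    """Return True if the value violates the SIREN format rule."""
    return SIREN_FORMAT.fullmatch(value) is None

assert siren_format_invalid("12345678")       # eight digits -> violation
assert not siren_format_invalid("123456789")  # nine digits -> format OK
```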
Types of Data Quality Rules
- Identifier Checks
- Identifier Qualification
- Business Partner Checks
- Address Checks
- Compliance and Risk Checks
- Bank Account Checks
Data Quality Rules Browser
The list below gives an overview of all data quality rules that are currently available.
The Data Quality Rules Browser provides the following views:
- RELEASED: all rules with status RELEASED, retrieved from the knowledge graph via a SPARQL query and listed with business rule, definition, criticality, and status.
- HYPERCARE: all rules with status HYPERCARE, listed with the same columns.
- Rule categories: for detailed information, please see Rule categories.
- Supporting documents: no supporting document available.
Metadata change management: Data quality rules management process
Data quality rules are part of a strict testing process. Each rule is tested daily against test cases in the knowledge graph (at least one passing and one failing test case per rule); an arbitrary number of additional test cases can be specified and is considered as well. This ensures reliable and correct data quality rule implementations. For each testing run, the test results are written back to the rule documentation, so that it is always visible how the rule behaves. The testing is performed automatically every day and can be triggered manually if required. Data quality rules follow a lifecycle that is specified by different statuses, as described below.
Activity | Rule status | Description |
---|---|---|
Step 1: Initial rule creation | IDEA | A new data quality rule is created using the data quality rule maintenance form. The rule's page name should follow the defined naming scheme. The initial status of the rule is "IDEA". The creator selects whether the rule documentation is to be reviewed by the data sharing community or by a dedicated approver, and provides the minimal required information for this step. |
Step 2: Rule implementation | PLANNED | The product owner is responsible for making the rule available in the data validation service; they coordinate the activities or take over the implementation themselves. If the rule is implementable using standard rule templates, the data analyst or product owner generates the rule by maintaining the required parameters. If no standard rule template is available, the rule can be implemented directly by providing the SPARQL query (see the sketch after this table). If the rule is generalizable, a new standard rule template may be created, from which the rule is then generated. Usually, specific functions are required that must be implemented by a software developer; in this case the functionality is requested following the internal software development process, and the requirements on the function are documented in the function library beforehand. Similarly, the product owner may request the development of the SPARQL query following the software development process. The implementation of functions and/or the development of SPARQL queries is to be done within two weeks. Part of the rule implementation is the provision of a violating and a passing test case in the rule documentation. Once the rule is implemented and test cases are available, the status is changed to PLANNED. |
Step 3: Rule testing and release | RELEASED | Rules in status PLANNED are considered in the next rule release. Rule releases are provided daily. All rules with status PLANNED are tested automatically using the test cases provided in the rule documentation. If the test cases pass, the test status of the rule is set to VERIFIED and the status of the rule changes to RELEASED, which ensures that the rule is used by the data validation service. The release date is set automatically when the rule status changes to RELEASED. If the rule does not pass the automated tests, the status is changed to DEACTIVATED. In addition to the test status, a test message is always provided, e.g. stating the reasons for a failing test. |
Step 4: Revising deactivated rules | DEACTIVATED | The product owner checks the dashboard daily for deactivated rules. The reasons for the failing test(s) are analyzed and the issues are resolved. After revision, the rule status is set back to PLANNED so that the rule is considered for the next daily rules release. |
Step 5: Inform the community | RELEASED | The community is informed about new rules in the weekly reports (if data validation reports are activated) and, additionally, in the regular community Inside newsletters. |
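For rules implemented directly as SPARQL, execution can be pictured as running an ASK query against a graph representation of the record: the query answers true if the record violates the rule. The following sketch uses rdflib and a made-up vocabulary; the production namespaces, rule queries, and execution environment will differ.

```python
from rdflib import Graph

# Made-up namespace and record data; the production vocabulary differs.
record = Graph()
record.parse(data="""
    @prefix ex: <http://example.com/> .
    ex:bp1 ex:identifierType "SIREN" ;
           ex:identifierValue "12345678" .
""", format="turtle")

# Rule as a SPARQL ASK query: does the record violate the SIREN format rule?
RULE = """
PREFIX ex: <http://example.com/>
ASK {
    ?bp ex:identifierType "SIREN" ;
        ex:identifierValue ?value .
    FILTER (!regex(str(?value), "^[0-9]{9}$"))
}
"""

violated = record.query(RULE).askAnswer
print("Rule violated:", violated)  # True: the value has only eight digits
```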
Minor changes
Changes to data quality rules that do not affect any subsequent or dependent functionality do not follow a specific change process. Non-critical changes are:
- Any changes to fields other than criticality (see the special case below), primary validation source, managed concepts, and country scope; changes to these four fields are considered major changes.
- Changes to data quality rules that only have the criticality INFO
- Changes to the rule description, violation message, and example are usually minor if they are corrections that do not change the described logic (e.g. a more intuitive example, fixed typos, a more comprehensive description, a more user-friendly violation message). If the logic of the rule changes, these fields are adapted in accordance with the major change process.
Major changes
Major changes are changes that (potentially) have an impact on subsequent or dependent functionality. Besides changes to the fields:
- criticality,
- primary validation source,
- managed concepts,
- country scope
major changes are triggered when the logic of the data quality rule has to change because
- errors in the logic have been identified (e.g. special cases that were not considered), or
- the outside world has changed and the logic thus requires a change (e.g. the format of an identifier has changed, or a new postal code system has been introduced in a country).
When such a case has been identified, or a change request has been triggered by a community member, the same process as for the creation of new rules is to be followed.
Testing approach
The daily testing job takes all data quality rules from this repository and tests them against the defined test cases. At least one positive and one negative test case have to be defined for each rule. Additional test cases can be defined if required (e.g. at the data analyst's discretion, to cover special cases). The test results are documented on the documentation page of each data quality rule.
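Conceptually, the daily job is little more than the following loop: execute each rule against each of its test cases and compare the outcome with the expected result. The field names (`check`, `test_cases`, `expect_violation`) are assumptions for illustration, not the actual implementation.

```python
def run_daily_tests(rules: list) -> None:
    """Test every rule against its test cases and record the result."""
    for rule in rules:
        # rule["check"] returns True when the record violates the rule.
        results = [
            rule["check"](case["record"]) == case["expect_violation"]
            for case in rule["test_cases"]  # at least one passing, one failing case
        ]
        if all(results):
            rule["test_status"] = "VERIFIED"
            rule["test_message"] = "All test cases returned the expected results."
        else:
            rule["test_status"] = "FAILED"
            rule["test_message"] = f"{results.count(False)} test case(s) did not return the expected result."
```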
Mandatory test cases - Test case repository
A test case is a complete record that comprises all attributes required for executing the rule. The test cases are available on the data quality rule pages in the testing section. Please see below some examples:
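As an illustration, a minimal pair of test cases for the SIREN format rule might look as follows (field names and values are hypothetical):

```python
test_cases = [
    # Passing test case: a well-formed nine-digit SIREN, no violation expected.
    {"record": {"identifierType": "SIREN", "identifierValue": "123456789"},
     "expect_violation": False},
    # Failing test case: eight digits only, the rule must report a violation.
    {"record": {"identifierType": "SIREN", "identifierValue": "12345678"},
     "expect_violation": True},
]
```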
Test result documentation
Property | Description |
---|---|
Test status | Documents whether the test was successful: VERIFIED (test successfully executed), FAILED (testing did not return the expected results), or UNKNOWN. |
Test message | A message provided by the test, e.g. the reasons for a failing test. |
Last test execution | The date when an automated test was last executed for this rule. |
Average execution time | How long the rule execution lasted on average. |
Contribute!
We are continuously defining and implementing additional rules. Please get in touch with us if you notice that a business rule is missing! Also, if you are interested in the business rules management architecture and its implementation, we would be happy to provide you with additional information or a showcase.
Some theoretical background
From a theoretical point of view, the data model concepts and the relations between them define our domain (in other words, the world as it is understood by CDQ). Within this world, everything would be possible if there were no rules. Business rules constrain this world by reducing the space of possible instantiations of the modeled domain. An example is a business rule that constrains the possible values of a country: it states that the only allowed values for a country are those countries defined in the ISO 3166-1 standard. These countries are documented as reference data in this portal. Without this rule, a country could have any other value, such as "Romulus". To take up the wording from above: the documented countries are knowledge about the CDL world (domain), and this knowledge is used to constrain the space of possible options for the name of a country.
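Expressed in code, such a constraint is simply membership in the documented reference data. A minimal sketch, assuming a deliberately truncated excerpt of the ISO 3166-1 reference list (the portal documents the full set):

```python
# Excerpt of the ISO 3166-1 reference data documented in the portal;
# the real reference list covers all countries of the standard.
ISO_3166_1_NAMES = {"France", "Germany", "Switzerland", "United States of America"}

def country_value_allowed(value: str) -> bool:
    """A country name is allowed only if it appears in the reference data."""
    return value in ISO_3166_1_NAMES

print(country_value_allowed("France"))   # True
print(country_value_allowed("Romulus"))  # False: not an ISO 3166-1 country
```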