Data Professional Introspective: Accelerating Enterprise Data Quality
By Melanie Mecca, CMMI Institute's Director of Data Management Products & Services
My recent columns have focused on actionable initiatives that can both deliver business value, providing a tangible achievement, and raise the profile of the data management organization (DMO). [For more on the DMO, see an earlier column, ‘Coming in From the Cold – a Starter Data Management Organization,’ which proposed a plug-and-play starter organization.]
In that light, let’s talk about data quality and how the DMO can help the organization get off the ground floor, AKA, out of the cycle of ‘Do what you’ve always done, get what you’ve always gotten.’
If you abstract the processes and practices of enterprise data management to the fewest number of high-level topics, you have:
- Data Architecture – what data stores and technologies are built or bought to store and distribute data
- Data Governance – managing and controlling data assets through enterprise participation and collective decision making
- Data Quality – the condition of the data itself, measured against characteristics (known as dimensions) that affect its usability and fitness for purpose
Among the ‘Big Three’ topics, the lines of business care most about data quality. From their point of view, where data is captured and stored, and how collective decisions are made about it, matter primarily in the service of having the right data for their needs – accessible, accurate, and timely. It is quite correct to state that, overall, enhancing data quality is the POINT of enterprise data management.
If this is so obvious, why do so many organizations struggle and stumble in getting past isolated project efforts to improve data quality? Because of the overarching issue we face almost everywhere, in every industry – organizations still frequently fail to recognize that their data assets are foundational and critical to their success, on a par with finance, human resources, and facilities. For example, many organizations have purchased data quality tools; they may have one or two projects that regularly monitor certain data stores, and some that regularly cleanse data. Some large organizations spend millions to cleanse data for financial reporting, but the quality rules applied, and any ensuing data changes, are rarely propagated back to the source systems. This creates an endless cycle of repetitive spending, a treadmill guaranteeing a perennially leaking budget.
Another challenge derives from the funding model employed in most organizations, where almost everything is a project, with a beginning, middle, and end. The lines of business are encouraged to make their funding requests specific to a designated business need, and as a result, there is not enough attention given to enterprise initiatives. The ostrich syndrome applies here, sometimes in the form of ‘if we just wait long enough, a technology will come along that we can buy to solve the problem.’
Convincing an organization to smash through the ceiling of ignorance can be very challenging, to the point where it makes you wonder why decision-makers are so reluctant to create a strategy, and so prone to constructing alternative views of reality. In virtually every organization, the lines of business are constantly complaining about data quality. Shouldn’t the organization make a top-down commitment to address data quality across the board? Shouldn’t it enlist governance to develop a high-level strategy for data quality improvements?
We can understand why the organization may be hesitant. There are many considerations requiring attention, such as determining fitness for purpose and quality rules for specific data sets, what events should trigger data profiling efforts, which data sets have priority, etc. In addition, useful approaches – such as engaging business data experts to learn about and apply data quality dimensions, develop quality rules, and set targets and thresholds – are often not utilized at all.
When a project-based approach is the prevailing paradigm, organizations experience quality improvements as localized achievements. This tends to result in repetitive, excessive effort and costs from re-inventing the wheel. It also prevents the organization from reaping the benefits of consistent techniques and methods, defect reporting, impact analysis, business validation, and other best practices.
But enough soap-boxing – what can the DMO do about it?
Let’s start with some common concepts, borrowed from the Data Management Maturity (DMM) Model, which contains four process areas in the Data Quality category, addressing data quality from the enterprise perspective:
- Data Quality Strategy – an enterprise-level plan to evolve an organization-wide data quality program, AKA ‘we’re taking it seriously and here’s our plan to do something about it’ – without a comprehensive and approved strategy, achievement of a ‘Quality Culture’ remains a pipe dream
- Data Profiling – planned discovery of potential defects and anomalies in data sets (in one or more data stores), AKA ‘you don’t know what you don’t know, so let’s find out’
- Data Quality Assessment – business-driven determination of fitness for purpose, involving data rules, a quality evaluation process, thresholds, targets, and quality dimensions, AKA ‘it’s our data, and we’re going to specify what would make it better’
- Data Cleansing – data correction in physical data stores to meet business criteria specified in quality rules, addressing all applicable quality dimensions, AKA ‘let’s fix this at the source, and prevent it from happening again.’
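To make those concepts a bit more tangible, here is a minimal sketch – in Python with pandas, purely for illustration – of what a tiny profiling-and-assessment pass might look like. The column names, quality rules, threshold, and target below are hypothetical examples of my own, not prescriptions from the DMM:

```python
import pandas as pd

# Hypothetical sample of a small customer data set (illustrative only).
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 104],
    "email":       ["a@example.com", None, "not-an-email", "d@example.com", "d@example.com"],
    "postal_code": ["30301", "02139", None, "9021", "90210"],
})

# Data Profiling: discover potential defects and anomalies.
profile = {
    "row_count": len(customers),
    "duplicate_ids": int(customers["customer_id"].duplicated().sum()),
    "null_rate_by_column": customers.isna().mean().round(2).to_dict(),
}

# Data Quality Assessment: apply business-defined quality rules and dimensions.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
rules = {
    # rule name: (quality dimension, share of rows passing the rule)
    "email_is_valid": ("Validity",
                       customers["email"].fillna("").str.match(EMAIL_PATTERN).mean()),
    "postal_code_present": ("Completeness",
                            customers["postal_code"].notna().mean()),
}

THRESHOLD = 0.90  # acceptable quality level
TARGET = 0.99     # aspirational quality level

print("Profile:", profile)
for rule, (dimension, score) in rules.items():
    status = ("meets target" if score >= TARGET
              else "acceptable" if score >= THRESHOLD
              else "below threshold")
    print(f"{rule} ({dimension}): {score:.0%} -> {status}")
```

Data Cleansing would then correct the defects at their source, guided by the same rules, so the cycle of repetitive downstream fixes is broken.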
If the organization develops a strategic quality plan, milestones, documented policies, and sound processes and practices, its data assets will realize improvements across the entire organization. By taking an organization-wide perspective first, keeping the overall goals and objectives in mind, and then implementing practical approaches and processes, specific data sets can be prioritized and addressed in small steps, eventually cementing continuous improvement in data quality organization-wide. However, the top-down approach – ‘hey, let’s start a data quality program!’ – can be a very hard sell, given the many competing priorities for data and IT spending.
The good news is that there is a fast-track path available, and it can be executed as a structured, controlled project (beginning, middle, and end) which also delivers reusable processes, practices, policy terms, and templates. And it is:
The Data Quality Pilot
For data quality in general, the axiom “Effort = Reward” is true. For this project, happily, the rewards exceed the effort expended. This is an excellent beginning initiative to convince the organization that data quality deserves its attention and focus.
A data quality pilot project is a powerful opportunity for rapid capability building that can create lasting value by trail-blazing work products corresponding to the activity steps involved. This initiative serves as a quick-start launch pad for developing data quality capabilities that may be lacking at the enterprise level. From the perspective of the DMM, the pilot enables rapid progress in implementing best practices – and improving capabilities – in three process areas: Data Profiling, Data Quality Assessment, and Data Cleansing. It also provides significant input to the Data Quality Strategy.
The concept of a pilot for data quality supports achievement of near-term objectives:
- Explores a small but significant data set to determine potential defects and anomalies
- Engages data stewards and business data experts to determine any known issues – positive event for data governance
- Engages the DMO in its role as coordinator and formalizer of data management processes – establishing its usefulness to the organization
- Develops a report that presents the results in summary and in detail, in a format that can be reused
- Produces precise, reusable requirements for cleansing the data set, resulting in better data for the business
It also furthers achievement of longer-term objectives:
- Selecting or acquiring a data quality toolset that will become a standard
- Establishing the role of a Data Quality Lead in the DMO
- Development of a reusable data profiling plan
- Creation of content for a data quality policy
- Input to the Enterprise Data Quality Strategy
- Development of standard processes, leading to greater efficiency
- Development of a standard reporting template
- Development of a prioritization method applying impact analysis
- Influencing business lines to prioritize data quality
Benefits for a volunteer business line that offers an important data set as the content of the pilot effort also include the increased knowledge its staff will gain about the data, and a stronger dedication to the concept and responsibilities of data ownership.
I’m using the term “profiling” in the outline below. However, the scope extends that concept to maximize reusable work products through the functional steps of the pilot, and the effort is intended to be rigorous, albeit for a very small data set. So one could also refer to the activities below as a ‘quality audit.’
How to Execute the Pilot
It is recommended that the DMO Director propose the project to the executive governance body and designate a Data Quality Lead to coordinate the efforts of governance participants and IT. We don’t have enough space in this column to address your internal sales pitch, but feel free to reach out to me if you would like to brainstorm.
The table below shows the activity steps for the project and the output for each step.
Step one is very important – defining scope. The ideal scenario includes:
- A business sponsor interested in quality improvements
- Interested data stewards and/or business data experts
- Agreement on a defined data set that is small – 25 to 50 attributes – but meaningful to the business (e.g., product master data, customer identification and addresses)
- A limited number of joins – if data within the set needs to be profiled across many data stores, complexity increases
- An available developer resource to conduct the profiling task
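If it helps to capture that scope in a form the profiling developer can work from directly, here is one minimal sketch (again in Python, purely illustrative – the data set, sources, attributes, join key, and known issues below are hypothetical placeholders, not recommendations):

```python
# Hypothetical scope definition for the pilot (illustrative only).
# Keeping the attribute list small (25-50 attributes) and the joins limited
# keeps the profiling effort manageable.
pilot_scope = {
    "data_set": "Customer identification and addresses",
    "business_sponsor": "Customer Operations",   # volunteer business line
    "source_systems": ["CRM", "Billing"],        # limited number of joins
    "join_key": "customer_id",                   # how the two sources are matched
    "attributes": [
        "customer_id", "first_name", "last_name",
        "email", "phone", "postal_code", "country_code",
    ],
    "known_issues": [
        "duplicate customer records created by a legacy migration",
        "free-text country values in older records",
    ],
}

# Quick sanity checks against the recommended scope guidance.
assert len(pilot_scope["attributes"]) <= 50, "Keep the pilot data set small"
assert len(pilot_scope["source_systems"]) <= 3, "Limit the number of joins"
```

The same structure can later feed directly into the Data Profiling Plan described below.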
Recommended activity steps, and the resulting work products that can serve as reusable artifacts, are listed below. It is important to note that none of the recommended work products are lengthy, and that they may be segmented, as in the example below, or combined as applicable to the organization.
Post-pilot, the DMO can compile them into a Data Quality Workbook folder and publish the location for business lines and project teams.
| Activity Step | Reusable Work Products (Sample Titles) |
| --- | --- |
| 1. Define a preliminary scope for the profiling pilot | 1. Data Profiling Scope Definition Guidelines – defines a business rationale for the proposed scope |
| 2. Convene a small working group with key business representatives – producers and consumers of the data within scope | 2. Data Quality Working Group Guidelines – what the DQWG does, who’s in charge, the tasks and milestones (benefits Governance) |
| 3. Finalize the pilot data set, then outline a scoping template that identifies the candidate sources for each attribute included | 3. Data Profiling Scope Definition Guidelines – principles for defining and prioritizing manageable subsets suitable for profiling; identifying the data store(s) where the data currently resides; summary of benefits |
| 4. Describe the business impacts – e.g., known issues, related issues, and the benefits that improved quality would deliver | 4. Data Profiling Plan Template – sections describing the data set snapshot, issues, expected results, timeline, etc. |
| 5. Gain governance group approval of the Data Profiling Plan | 5. Data Profiling Approval Process – prioritizing data sets, defining issues to be addressed in profiling, level of effort, roles and signoffs for approval |
| 6. Define quality rules, provisional targets (aspirational quality levels), and thresholds (acceptable quality levels) for data set attributes, employing data quality dimensions – define initial metrics for the profiling effort | 6. Quality Rules Definition Guidelines – best practices in defining quality rules<br>7. Data Quality Dimensions – set of quality factors to be employed<br>8. Data Quality Assessment Guidelines – how to define data quality thresholds, targets, and metrics |
| 7. Select, or purchase, a data quality toolset | 9. (Optional) Data Quality Selection Criteria – the features and parameters that the organization wants in a data quality tool that will become a standard |
| 8. Develop the data profiling plan for the designated data sets (either for single candidate sources or, as appropriate, for an aggregated data set); include quality checks for known issues, existing quality rules, the list of attributes, join requirements for multiple data sources (e.g., match on Provider ID, fuzzy-match terms, etc.), and specific business rules as needed | 10. Data Profiling Plan – documents the data set, preparation activities, tests and checks to be performed, resources, effort, and schedule |
| 9. Conduct the profiling effort, employing the selected toolset, and develop results reports | 11. Data Profiling Guidelines – recommended practices specific to the selected toolset, basic order of operations (e.g., out-of-the-box checks, business rules, known issues, etc.)<br>12. Data Profiling Report Template – format for summary and drill-down reports, conclusions against tests performed, metrics, recommendations |
| 10. Review results, estimate impacts (e.g., business impact of defects, technical impact of correcting in the source versus correcting at ingestion, etc.), and prioritize remediation activities with the working group | 13. Data Quality Impact Template – for each issue surfaced by the profiling effort, contains a description, the quality rule involved, and a categorization of business and technical impacts for the issue/defect |
| 11. Where possible, identify root causes for high-impact defects and request that the data store owner address them | 14. Action Items – submitted to the business sponsor and the relevant data stewards and business data experts |
| 12. Establish a quality baseline for the data set, including a set of practical metrics and supporting rationale | 15. Data Quality Metrics Template – consistent structure to capture and monitor metrics, demonstrating the importance of data quality to the business |
| 13. Present results to the senior governance body for approval to execute improvements, and document the quality rules for potential use with this data set wherever it is stored | 16. Quality Rules Repository – these rules can be contained within the metadata repository and linked to terms in the business glossary |
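To picture what the quality baseline in step 12 – and the Data Quality Metrics Template in work product 15 – might contain, here is a minimal sketch (Python, illustrative only; the rule names, dimensions, and scores are hypothetical). The same structure can be re-measured after cleansing to demonstrate improvement:

```python
from dataclasses import dataclass

@dataclass
class QualityMetric:
    """One row of a (hypothetical) Data Quality Metrics Template."""
    rule: str          # quality rule being measured
    dimension: str     # data quality dimension (e.g., Completeness, Validity)
    threshold: float   # acceptable quality level
    target: float      # aspirational quality level
    baseline: float    # score measured during the pilot profiling run

    def gap_to_target(self) -> float:
        return max(0.0, self.target - self.baseline)

# Hypothetical baseline captured at the end of the pilot (illustrative only).
baseline = [
    QualityMetric("postal_code_present", "Completeness", 0.90, 0.99, 0.86),
    QualityMetric("email_is_valid",      "Validity",     0.90, 0.99, 0.93),
    QualityMetric("no_duplicate_ids",    "Uniqueness",   0.98, 1.00, 0.97),
]

# Order remediation candidates by the largest gap to target.
for m in sorted(baseline, key=QualityMetric.gap_to_target, reverse=True):
    flag = "BELOW THRESHOLD" if m.baseline < m.threshold else "ok"
    print(f"{m.rule:22s} {m.dimension:12s} baseline={m.baseline:.0%} "
          f"target={m.target:.0%} gap={m.gap_to_target():.0%} [{flag}]")
```

Sorting by the gap to target is only a simple proxy for the impact analysis in step 10; a real prioritization would weigh business and technical impact as well.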
The outline above illustrates the thinking and activities recommended for the pilot, as well as the work products developed during the pilot, which can be reused for any data set.
The data quality pilot provides the opportunity to fully exercise best practices against a manageable data set. The knowledge gained from this effort will inform the organization’s Data Quality Strategy and serve as a substantive learning project for data stewards and their IT partners.
Next column, we’ll address the Data Quality Strategy – what it should include, how it should be developed, and who should be involved.