Zapping Data Anomalies
29 May 2007
Improving Quality of Data Content by Preventing, Identifying and Monitoring Data Anomalies
The Zapping Process
Data warehouse professionals can protect data warehouses from anomalies through a methodical approach to detecting, handling and protecting the data warehouse from anomalies. This paper, Volume 1 in a series, describes the framework for improving warehouse data quality by “zapping” – preventing or at least monitoring and handling – data anomalies.

With the high rate of data warehouse failures – between 20 percent and 75 percent – and the associated downtime and lost productivity, organizations need to take a proactive approach to address data quality issues. You can avoid the pitfalls that cause data warehouse failures through a “zapping” process that can help ensure the success, effectiveness and efficiency of your data warehouse.
Data can be very volatile. An application can function effectively even if the data behind it is bad. It is only when you try to reuse that data for another purpose that you discover how inaccurate, outdated or irrelevant it is – and by then, it's usually too late. By definition, data warehouses “reuse” data from source systems, and therein lies the vast potential for data anomalies.
What is a Data Anomaly?
A data anomaly is any data that is unsuitable for the intended use because it doesn't conform to data warehouse documentation or requirements. The value of the data warehouse begins to diminish as anomalies accumulate. If enough anomalies are allowed to build up, they begin to erode user confidence in the data. Without confidence in the data, its value in supporting business decisions diminishes.
Part of the anomaly identification process is knowing what to look for. Here are some typical situations you my discover:
- Non-numeric data going into a numeric field
- Data size mismatches between source and data warehouse
- Unexpected code values
- Out-of-range values
- Erroneous defaulting of fields
- Unmatched data
- Missing data
- Incomplete data
- Erroneous data aggregations
How Do Data Anomalies Arise?
Data anomalies can exist when the data warehouse is first built. They can also enter the data warehouse system over time, as changes are made to upstream source systems, as changes occur in the functional business or as changes are made to the data warehouse. The anomaly can occur within a particular column of data, such as an incorrect “customer name” or within a table, such as “customer master,” which, by its definition, should not contain duplicate rows.
If a proactive plan to actively zap anomalies is not implemented, it can be several months before the level of anomalies rises to the point they become noticeable. By then, the costs of identifying and eradicating the anomalies can be huge – in terms of actually fixing the problem and in terms of poor decisions made as a result of bad information.
As an example, a major health care company was unpleasantly surprised when one of its vendors changed its patient-gender coding scheme, causing all data for a particular warehouse feed to default to “female.” It took more than six months before the volume of data became so large it stood out as a problem. This resulted in hundreds of hours spent to identify the extent of the problem and clean up the data – not to mention the impact on decisions over the six-month period. A methodical approach to anomaly handling, such as the zapping process, can minimize these impacts and costs.
Why is Zapping Anomalies So Important?
The value of the data warehouse begins to diminish as anomalies accumulate. According to an April 2007 press release from Gartner, more than 25 percent of critical data in the world's top companies is flawed. The Data Warehousing Institute (TDWI) 2002 "Data Quality and the Bottom Line…" report states that data quality problems cost US businesses more than $600 billion a year.
The unmonitored flow of data into the warehouse makes it easier for anomalies to accumulate undetected. Fixing incorrectly loaded data can be hugely expensive, both in terms of the impact to data analysis and in terms of the potential cost to clean up the data. The sooner anomalies are detected and handled, the better.
Overview of the Zapping Process
The zapping process contains three iterative steps to systematically handle anomalies. These steps are used in conjunction with four key strategies to make the process practical. The steps include:
- Identifying and recognizing anomalies
- Monitoring and handling anomalies
- Repeating the process
Having a methodology such as zapping is a start. But practical strategies for identifying, monitoring or handling potential anomalies are needed to successfully implement the anomaly zapping processes.
The Key Strategies
The strategies trace the documentation and flow of data from source systems to the target warehouse. This provides a logical approach to identifying opportunities for handling anomalies. These strategies include:
- Use specifications and related documentation
- Monitor data sources
- Handle data extraction/transformation/load (ETL) loopholes
- Monitor the target data
The three iterative steps to the zapping process, combined with the key strategies, provide a comprehensive framework for limiting anomalies in the data warehouse.
How EDS Can Help You
Whether you are designing a new data warehouse or supporting an existing data warehouse, the zapping process can help you avoid the pitfalls that cause the majority of data warehouse failures. EDS' proactive attention to data quality in the zapping process will help ensure the success and effectiveness of your data warehouse.
Learn more in the Zapping Data Anomalies White Paper PDF, 246K.