Does Big Data Need Bigger Data Quality and Data Management?

By Virginia Prevosto and Peter Marotta

"Well, in our country," said Alice, still panting a little, "you'd generally get to somewhere else — if you ran very fast for a long time, as we've been doing."

"A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"

— Lewis Carroll, Through the Looking-Glass, 1871

Running faster won't get you to the right place if you don't know where you're going. Even if you do know your destination, you need the right road markers to help you on the way. Big Data and, more important, the analytics that Big Data fuels are the technology du jour. But creating better, faster, more robust means of accessing and analyzing large data sets can lead to disaster if your data management and data quality processes don't keep pace.

Traditionally, data management and data quality principles and processes have included:

  • clearly defining the problem and questions to be answered before identifying the data needed
  • defining data quality benchmarks to ensure the data will be fit for its intended use
  • identifying key data quality attributes to be measured (for example, validity, accuracy, timeliness, reasonableness, completeness)
  • managing data quality as close to the source as possible
  • capturing data that flows from the underlying business processes
  • creating and retaining data artifacts and documentation

Will those methods continue to work with Big Data, or are new principles and processes needed?

A 2011 McKinsey Global Institute study outlines a number of Big Data techniques and technologies. Those focusing on data and data management (and not the analytical items) include metadata, data element classification, data acquisition, and data fusion and integration. What are the data management and data quality implications of these techniques and technologies?

Metadata
Metadata is important to any data management activity. It becomes even more important when dealing with large, complex, and often multisourced data sets. Metadata intended for use across the enterprise must be clear, easily interpreted, and defined at a basic, element-by-element level so that any consumer of the data can apply it. For Big Data quality and management (Big DQ and DM), minimum metadata requirements need to be established and, ultimately, metadata standards as well.
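
To make the idea concrete, here is a minimal sketch of an element-level metadata record in Python; the field names are illustrative assumptions rather than a prescribed standard.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ElementMetadata:
        """Illustrative element-level metadata record (field names are assumptions)."""
        name: str                              # business name of the data element
        definition: str                        # plain-language definition usable across the enterprise
        source_system: str                     # system of record the element comes from
        data_type: str                         # e.g., "string", "integer", "date"
        allowed_values: Optional[list] = None  # domain of valid values, if enumerable
        steward: Optional[str] = None          # person or group accountable for the element

    # A record any consumer of the data set should be able to interpret.
    policy_state = ElementMetadata(
        name="policy_state",
        definition="Two-letter U.S. state code of the insured location",
        source_system="policy_admin",
        data_type="string",
        allowed_values=["AL", "AK", "AZ"],  # truncated for brevity
    )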

Data Element Classification
To foster cross-enterprise use of data, taxonomies — classification or categorical structures — need to be defined for broad groupings such as demographic data, financial data, geographic/geospatial data, property characteristics, and personally identifiable information.
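
One lightweight way to express such a taxonomy is a mapping from data elements to categories. The sketch below uses the categories listed above; the element names and assignments are illustrative only.

    from enum import Enum

    class DataCategory(Enum):
        DEMOGRAPHIC = "demographic data"
        FINANCIAL = "financial data"
        GEOGRAPHIC = "geographic/geospatial data"
        PROPERTY = "property characteristics"
        PII = "personally identifiable information"

    # Hypothetical element-to-category assignments; a real taxonomy would be
    # maintained as governed reference data, not hard-coded.
    ELEMENT_TAXONOMY = {
        "date_of_birth": {DataCategory.DEMOGRAPHIC, DataCategory.PII},
        "annual_premium": {DataCategory.FINANCIAL},
        "latitude": {DataCategory.GEOGRAPHIC},
        "roof_type": {DataCategory.PROPERTY},
    }

    def elements_in(category: DataCategory) -> list:
        """Return all elements tagged with a given category."""
        return [name for name, cats in ELEMENT_TAXONOMY.items() if category in cats]

    print(elements_in(DataCategory.PII))  # ['date_of_birth']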

Data Acquisition
When acquiring data, it is critical that the data be organized so it is readily accessible. Data and data exchange standards are key tools in the acquisition process: the common vocabulary and grammar that standards supply make it easier to map data across sources. For Big DQ and DM, movement toward data and data exchange standards will mean less friction in the acquisition process.
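
A minimal sketch of that mapping step, assuming two hypothetical source feeds and an equally hypothetical standard vocabulary:

    # Hypothetical per-source field names mapped into a common vocabulary.
    SOURCE_TO_STANDARD = {
        "carrier_a": {"POL_NO": "policy_number", "EFF_DT": "effective_date", "ST": "state"},
        "carrier_b": {"PolicyId": "policy_number", "Effective": "effective_date", "State": "state"},
    }

    def to_standard(source: str, record: dict) -> dict:
        """Rename a source record's fields into the shared vocabulary.

        Unmapped fields keep their original names so nothing is silently dropped.
        """
        mapping = SOURCE_TO_STANDARD[source]
        return {mapping.get(name, name): value for name, value in record.items()}

    # Records from different feeds now line up on the same field names.
    print(to_standard("carrier_a", {"POL_NO": "A-100", "EFF_DT": "2013-01-01", "ST": "NJ"}))
    print(to_standard("carrier_b", {"PolicyId": "B-200", "Effective": "2013-02-15", "State": "NY"}))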

Data Fusion and Integration
Integrating data across multiple sources may be a large part of a Big Data effort. Data standards greatly facilitate the mapping of data.

Other data management tools that aid integration are master data management (MDM), entity resolution, and identity management (IdM). MDM creates a single, consistent view of data that is shared across an organization. Entity resolution is a process or tool that resolves who is who and who knows who (the "who" being a person, a business, an address, and so forth). And IdM focuses on identifying individuals within a data source or across data sources.
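
As a toy illustration of the entity resolution idea (not of any particular MDM or IdM product), the sketch below matches records on a normalized name and address. A production tool would rely on probabilistic or fuzzy matching, reference data, and survivorship rules.

    import re
    from itertools import combinations

    def normalize(text: str) -> str:
        """Crude normalization: lowercase, strip punctuation, collapse whitespace."""
        return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

    def same_entity(a: dict, b: dict) -> bool:
        """Illustrative rule: same normalized name and address implies same entity."""
        return (normalize(a["name"]) == normalize(b["name"])
                and normalize(a["address"]) == normalize(b["address"]))

    records = [
        {"id": 1, "name": "Acme Insurance, Inc.", "address": "12 Main St."},
        {"id": 2, "name": "ACME INSURANCE INC",   "address": "12 Main St"},
        {"id": 3, "name": "Beta Mutual",          "address": "9 Oak Ave"},
    ]

    # Pairs of records resolved to the same real-world entity.
    matches = [(a["id"], b["id"]) for a, b in combinations(records, 2) if same_entity(a, b)]
    print(matches)  # [(1, 2)]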

For Big DQ and DM, greater use must be made of data management and integration tools such as MDM, entity resolution, IdM, and data standards.

The greatest impact of Big Data is on data quality. Data quality has traditionally been measured against the data's intended use; for Big Data projects, quality may have to be assessed beyond that original use to address how the data can be repurposed. To do so, data quality attributes — validity, accuracy, timeliness, reasonableness, completeness, and so forth — must be clearly defined, measured, recorded, and made available to end users.

Artifacts relating to each data element, including business rules and value mappings, must also be recorded. If data is mapped or cleansed, care must be taken not to lose the original values. Data element profiles must be created. The profiles should record the completeness of every record.
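
A minimal profiling sketch follows, assuming records held as simple Python dictionaries; the measured attributes mirror those named above (completeness, validity), while the field names and the business rule are illustrative.

    def profile_element(records: list, element: str, is_valid=lambda v: True) -> dict:
        """Profile one data element: completeness and validity rates, plus distinct values."""
        values = [r.get(element) for r in records]
        populated = [v for v in values if v not in (None, "")]
        valid = [v for v in populated if is_valid(v)]
        n = len(values) or 1
        return {
            "element": element,
            "completeness": len(populated) / n,  # share of records with a value
            "validity": len(valid) / n,          # share of records passing the business rule
            "distinct_values": len(set(populated)),
        }

    records = [
        {"state": "NJ", "year_built": 1987},
        {"state": "ny", "year_built": None},
        {"state": "",   "year_built": 2090},
    ]

    # Illustrative business rule: construction year must be plausible.
    print(profile_element(records, "year_built", is_valid=lambda y: 1800 <= y <= 2025))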

Because data may migrate across systems, controls and reconciliation criteria need to be created and recorded to ensure that data sets accurately reflect the data at the point of acquisition and that no data was lost or duplicated in the process.
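
One simple form of such a control is a record count plus a hash total computed at the point of acquisition and re-checked after each migration. The sketch below is a minimal illustration; the key field and hashing choice are assumptions, and a production control would typically add column-level totals.

    import hashlib

    def control_totals(records: list, key_field: str) -> dict:
        """Record count plus an order-independent hash total over one key field.

        Comparing totals captured at acquisition with totals computed after a
        migration flags lost or duplicated records; it is a simple control, not
        a full integrity check.
        """
        digest = 0
        for r in records:
            digest ^= int(hashlib.sha256(str(r[key_field]).encode()).hexdigest(), 16)
        return {"row_count": len(records), "key_hash": digest}

    source = [{"policy": "A-100"}, {"policy": "B-200"}, {"policy": "C-300"}]
    target = [{"policy": "A-100"}, {"policy": "B-200"}]  # one record lost in transit

    assert control_totals(source, "policy") != control_totals(target, "policy")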

Special care must be given to unstructured and semi-structured data, for which data quality attributes and artifacts may not be easily or readily defined. If structured data is created from unstructured or semi-structured data, the creation process itself must be documented and the previously noted data quality processes applied to the result.
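
As a small illustration, the sketch below (a minimal, assumed example, not a prescribed method) derives a structured field from free text and records how it was derived, so the creation process is documented and the original text is retained.

    import re

    CLAIM_AMOUNT_PATTERN = re.compile(r"\$\s?([\d,]+(?:\.\d{2})?)")

    def extract_claim_amount(note: str) -> dict:
        """Derive a structured 'claim_amount' from a free-text adjuster note.

        The output keeps the raw text and names the extraction rule, so the
        derivation can be audited and the original value is never lost.
        """
        match = CLAIM_AMOUNT_PATTERN.search(note)
        amount = float(match.group(1).replace(",", "")) if match else None
        return {
            "claim_amount": amount,
            "source_text": note,                      # original, unaltered value
            "derivation": "CLAIM_AMOUNT_PATTERN v1",  # documents the creation process
        }

    print(extract_claim_amount("Roof damage, estimated repair $12,450.00 per contractor."))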

For Big DQ and DM, you must create data quality metadata that includes data quality attributes, measures, business rules, mappings, cleansing routines, data element profiles, and controls.

In conclusion, data management and data quality principles for Big Data are the same as they have always been for traditional data. But priorities may change, and certain processes, such as metadata management, data integration, data standardization, and data quality measurement, must be given increased emphasis.

One major exception involves the time-tested practice of first clearly defining the problem. In the world of Big Data, where data may be used in ways not originally intended, data elements need to be defined, organized, and created in a way that maximizes potential use and does not hinder future utility.

Virginia R. Prevosto, FCAS, MAAA, is vice president of ISO's Information Services Department. Peter A. Marotta, AIDM, FIDM, is enterprise data administrator of Enterprise Data Management at Verisk Analytics.