
Committing data warehouse suicide


By Mervyn Mooi, Director at Knowledge Integration Dynamics.
Johannesburg, 9 Dec 2013


It has been suggested that loading data first and modelling it later is a viable alternative to modelling first and loading later when implementing data warehouses, specifically because it reduces costs and helps build the political will to implement at all. I don't believe it is truly viable in most instances.

The idea stems from the wildfire spread of big data through the corporate environment and the consumer consciousness as a means of solving issues: from figuring out what customers want from products and services, to where consumers can find the best deal, get a personal response to an issue with a supplier, or make their voices count on an issue.

In the big data world, the approach is to load all the data immediately and model it later. Very basically, the data is thrown into a container or harness (eg, a database or reference library); users then begin reading and modelling it as it shows its value, keeping what is pertinent and discarding the rest.
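To make the contrast concrete, here is a minimal sketch of that load-first, model-later (schema-on-read) idea, assuming simple JSON records and an in-memory store; the field names and record shapes are purely illustrative, not taken from any particular product:

```python
import json

# "Load first": accept whatever arrives, with no upfront model or validation.
raw_store = []

def load(record_text: str) -> None:
    raw_store.append(record_text)

# "Model later": a schema is imposed only at read time, once the data's value
# (and shape) becomes apparent; records that do not fit are simply skipped.
def read_as_customers(store):
    for text in store:
        try:
            record = json.loads(text)
            yield {"id": int(record["customer_id"]), "spend": float(record["spend"])}
        except (json.JSONDecodeError, KeyError, ValueError, TypeError):
            continue  # discard what is not pertinent (or not parseable)

load('{"customer_id": 1, "spend": "120.50"}')
load('{"sensor": "A7", "reading": 3.2}')  # a different category of data entirely
load('not even JSON')                     # noise that slipped into the store

print(list(read_as_customers(raw_store)))   # -> [{'id': 1, 'spend': 120.5}]
```

The speed comes from deferring every decision about structure and quality until someone actually reads the data, which is exactly the trade-off the rest of this article questions.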

Fast and furious

At first glance, this makes sense. So, if load first and model later works for big data, then why not in the data warehousing environment? The approach to load then model is not new. It has been around for decades. In the big data world, it also means business people can gain access to the information produced rapidly. The traditional data warehousing approach of modelling then loading is obviously slower. It may also limit flexibility.

So why should users not load then model, as in the big data world? One of the misconceptions about big data is that it is characterised by large volumes of data, but that is not necessarily true. To keep it simple, big data is more accurately defined as data of many different categories.

Data warehouses, on the other hand, typically deal with large volumes of data in an integrated manner. One of their core strengths is the ability to sift through reams of data and deliver reports tailored to business demands.

Discombobulating data

To enable consistent, accurate and legible reporting, data needs to be qualified and organised (ie, integrated, validated, verified, quality checked, mastered). The big data approach of loading first and modelling later lends itself to disorganisation of the data being processed. That may be all well and good for small data sets in smaller organisations, where it can be easily managed, but it is not a situation to foster in larger organisations handling larger data sets.

Additionally, data warehousing includes data integration and quality as standard services or components, which are not really possible without a predefined model, design or specification based on user and client requirements. While big data's load-first, model-later approach is rapid and seems to promote flexibility, it discounts some pertinent data requirements, such as solid integration and quality. Good decisions cannot be based on bad data. Traditional data warehousing, in fact, promotes another crucial aspect of managing data: agility, or adaptability. A predefined model requires that every data entity defined and processed conforms to an almost generic design, and it is precisely that conformance which promotes agility.
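By contrast, here is a minimal sketch of the model-first, load-later discipline argued for above, again with purely illustrative field names and rules; the point is simply that records are validated and quality-checked against a predefined design before they are allowed into the warehouse:

```python
from dataclasses import dataclass

# Predefined model: the design is agreed with users and clients up front.
@dataclass
class Customer:
    customer_id: int
    country: str
    spend: float

ALLOWED_COUNTRIES = {"ZA", "NG", "KE"}  # illustrative reference (master) data

def validate(raw: dict) -> Customer:
    """Quality-check a raw record against the model before it may be loaded."""
    customer = Customer(
        customer_id=int(raw["customer_id"]),
        country=str(raw["country"]).upper(),
        spend=float(raw["spend"]),
    )
    if customer.country not in ALLOWED_COUNTRIES:
        raise ValueError(f"unknown country code: {customer.country}")
    if customer.spend < 0:
        raise ValueError("spend cannot be negative")
    return customer

warehouse = []
for raw in [{"customer_id": "1", "country": "za", "spend": "120.50"},
            {"customer_id": "2", "country": "??", "spend": "-5"}]:
    try:
        warehouse.append(validate(raw))   # load only what conforms to the model
    except (KeyError, ValueError) as err:
        print(f"rejected: {err}")          # bad data never reaches the reports

print(warehouse)
```

Because every entity must conform to the same agreed design, downstream reports and integrations can rely on the data without re-checking it, which is the adaptability the model-first approach buys.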

Companies do not need more chaos in their data environments; they need more order, and they certainly need better quality data, the lack of which is one of the most common gripes in the local industry. So modelling first and loading later is essential.
