Friday, 14 December 2007

Principles of Database, Datawarehouse and Repository Development

As the main repository of an organisation’s historical data, the data warehouse is evolving into the memory of a corporation[1]. Through storage of a wide variety of data sources in an integrated format, the data warehouse is becoming both the storeroom of past events, and the central predictive engine.

The data warehouse contains the unrefined substance that when fed through a decision support system can provide management with up-to-date analytics and corporate predictors. The data analyst can use this technology to perform complex queries and analysis of information without adversely impacting operational systems.

Through both the rise in computational speed and power and the growth of data storage, data warehouses have pressed other technologies into greater levels of development[2]. Technologies such as data mining have grown symbiotically with data warehouses due to the benefits that they can provide. To understand data warehousing it is necessary to have knowledge of both data warehouse technologies and the associated analytical analysis methods used to access, report and present on the data.

The data warehouse architecture illustrates the entire organisational process from a variety of points of view[3]. These include the data, processes and infrastructure of the organisation and can mirror the structure, function and interrelationships of each constituent element of the organisation.

Data warehousing and Data analysis
The infrastructure or technology viewpoint reflects the choice of hardware and software products as they are implemented by the distinct components which derive the overall system. The data perspective characteristically epitomizes the source and target data formation and can assist the members of the organisation to comprehend the data assets and functional relationships which make up the organisation’s operations. The process viewpoint is principally focused on the communication of the progression of data from the originating source database through to the procedure to load the data into data warehouse and finally to analyse and extract data from the warehouse.

To be able to explore the effect of the rise of data warehouses on business and society, one first needs to characterise what the concept encompasses. William H. Inmon[4], known as one of the fathers of data warehousing, has stated that data warehouses are required to be subject orientated. In this, the data and thus database is organised in a manner such that all data relations relating to the same event or object are associated correspondingly in a manner that is concurrently time variant, non-volatile and integrated.

Whereas operational systems are necessarily optimised for ease of use and the rapidity of response, the data warehouse is optimised for reporting and analysis. Online transaction processing (OLTP), crucial to the operational system, is of less importance to the data warehouse. Rather, on line analytical processing (OLAP) and the necessity to access unusual data patterns results in heavily denormalised or dimension based models that may not be required to achieve acceptable query response times[5].

By time variant, is meant that changes to the objects and tables within the database are tracked and evidenced. This process allows for the statistical analysis of the data over time to produce reports on time variant trends. Heteroscadastic (including ARIMA, GARCH and ARCH) time series analysis of data is one of the newer avenues of research.

A data warehouse requires that once data is committed to the database, it is stored as read only. In this it is used for future reporting and any point in time data becomes an individual snapshot of the database over time. This process allows both for historical analysis and also future predictions.

To be effective, and data warehouse needs to contain data from as many if not all of an organisations operational applications and individual databases[6]. Further, to be of any effective use, the data in the warehouse must be contained in a consistent manner. A failure to either constrain data consistently or to provide an adequate sample of the organisation’s data leads to the GIGO issue[7]. That is, garbage in, garbage out.

Research into data warehousing has expanded into what has been termed the Corporate Information Factory (or "CIF"). A CIF[8] is an organisational data structure which encompasses ERP, eCommerce, customer relationship management (CRM) and many other formerly separate reporting structures. In some cases, a CIF has been known to encompass data marts, exploration warehouses, ODS, both nearline and secondary storage, and project warehouses.

However, the volumes of data, expense and lack of enterprise support leave CIF implementations is an idea for the future. There are still a number of difficulties associated with data warehousing which make their implementation less widespread then maybe expected in the future. Of particular concern, the process of extracting, cleaning and processing data is both time-consuming and difficult. The failure to implement and adhere to corporate wide naming standards amongst many organisations as only exasperated this problem.

Widespread incompatibility between many database products has slowed down the broadening of data warehousing across organisations. Technologies such as OLAP have aided in the development of Cross-application data warehouses[9], but issues of table structure, normalisation and the type of data stored within individual databases remains an issue.
Another issue that has hampered the implementation of data warehousing is security. In a world that is increasingly becoming reliant on the Internet and the Web, security could develop into a serious issue. With links into the data warehouse from the Internet, an organisations key informational assets are at risk from both discovery and compromise.

There are a number of clear advantages to the adoption of data warehouse technologies[10]. It is easy to see that organisations which adopt these new technologies successfully will gain a clear competitive advantage. These technologies enhance end user access to data and reports in a manner that allows for greater creativity and more informed evaluations.

The ability to create trend reports and use statistical methods to accurately forecast probabilistic events based on the past occurrences and experience provides for more focused corporate activity[11]. For instance, marketing information from a particular sales push can be compared across different regions to evaluate the impact of differing advertising initiatives.
Data warehousing technology can also significantly increase the effectiveness of several commercial business applications. In particular, customer relationship management (CRM) benefits greatly from this technology. The ability to both gain an overall view and to be able to drill down to specific areas and individuals provides a significant advantage to many organisations. CRM has been one of the principal applications to make early use of data warehousing.

Data Mining
Data mining is a relatively new development. The use of statistical methods to routinely investigate large capacities of data for patterns by using techniques including classification, decision trees[12], association rule mining, and clustering, has resulted in a new field of computational mathematics[13]. Data mining is a complex subject in itself and has associations with numerous core disciplines together with computer science and appends significance to influential computational techniques from statistics, information retrieval, machine learning and pattern recognition[14].

Too many people, the key issue introduced through the use of data mining is not commercial or technological in nature. It is a social issue. The protection of individual privacy is a concern that has increased exponentially with the rise in data mining. Data mining increases the possibility of an individual’s privacy being violated in some manner.

The processes used to analyse ordinary commercial transactions may be used to compile significant quantities of information about individuals from their purchasing behaviour and lifestyle preferences[15]. In particular, where data is compiled across multiple organisations that need to protect the privacy of individuals identity is compounded.

There are five primary stages to data mining. Initially it is essential that the data is loaded into the data warehouse system. This phase includes the extraction of data from the source databases and any necessary transformations required to reformat the data or cleanse the data such that it is maintained consistently.

The next phase involves storing and managing the data. In this, the data needs to be normalised and formatted such that it fulfils the requirements necessary to maintaining a multidimensional database system. This data must be accessible to the organisation’s business analysts. The process of providing access to the data is the third phase of data mining. No data warehouse project may be considered successful if the business is unable to access the data.
The final two phases of data mining call for analysis and presentation. Using methods such as OLAP, ROLAP and MOLAP[16] to access the data, the business analyst will load the required information from the data base into specialised application software[17]. These products are then able to present the data contained in the data warehouse in a more usable format such as the graph or table.

The Types of Analysis
The data maintained in a data warehouse is of little use if it cannot be accessed and analysed. As a consequence, a number of analysis techniques have developed. Although not exclusive, the following provides a brief summary of the analysis techniques available for use against data warehouses.

Decision tree methodology has been around for a long time as a probabilistic tool. This technique uses tree shaped structures to represent a variety of decisions and possible outcomes. This process generates rules for the classification of data sets. There are a number of specific decision tree methods including Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID) which have been derived from probability theory[18]. These methods provide a set of rules which may be applied to the data set in order to predict or forecast a given outcome from the data.

The CART methodology segments a dataset by creating 2-way splits[19], whereas the CHAID method segments using chi square tests to create multi-directional splits. CART methodologies are more common than CHAID methods as they typically require less data preparation. However, CHAID methods can provide a greater level of statistical precision[20].

Another method is the nearest neighbour method. This technique classifies each record in the dataset using an arrangement of the classes of the “k”[21] records which are the most comparable to it in a historical dataset. This technique is also known as the k-nearest neighbour technique[22].

Rule induction is common in organisations with a strong programming all logic background or focus. This technique uses a variety of predetermined statistical tests to extract data using “if-then-else” rules[23]. Analysis programs such as those provided by SAS and the open source product “R” make extensive use of this method.

A further common method is “data visualisation”. In this involves the generation of crafts and reports which allow for the visual interpretation of compound associations in multidimensional data by the analyst. Graphics tools are used to illustrate relationships between the data in a manner that provides more straightforward reporting than many of the other methods.
With the advances in both computational power and the development of new mathematical techniques, a couple of advanced analysis methodologies have developed including artificial neural networks and genetic algorithms. Artificial neural networks used non linear predictive models which “learn” using training algorithms to provide intricate statistical reports on the data. These techniques are so named as they resemble biological neural networks in their structure.

Genetic algorithms which use optimisation techniques and algorithmic evolution are also developing. These processes use a combination of “genetic combination”, mutation and probabilistic methods to simulate natural selection. These methods commonly integrate stochastic techniques, such as Monte Carlo simulations to provide estimates and forecasts from the data contained in the data warehouse[24].

Due to the exponential growth in the volume of data and the continuing development of new techniques for data analysis, organisations are continuing to bring about changes in the methods they use to both store data and analyse it. Data warehouse architectures which are now primarily used to express the overall configuration of a Business Intelligence system have integrated decision support systems (DSS), management information systems (MIS), into their fold giving businesses access to more information and predictive capability than ever before.
This rapid increase in data presents many problems for organisations, but at the same time provides opportunities for those who know how to use these new technologies and techniques. Those businesses and organisations which most effectively make use of this new technology are likely to gain significant competitive advantages. As a result, the implementation of data warehousing techniques and technologies is only likely to continue.

Coupled with the advances in analytical technologies such as neural network analysis and genetic algorithms, organisations now have greater access to the data than ever before. This increasing level of access has become the new organisational paradigm.


  1. Agresti, A. (1990), “Categorical Data Analysis”, New York: John Wiley & Sons.
  2. Amado, Carlos Armando (Miami, FL, US) (1997) “Method and apparatus for applying if-then-else rules to data sets in a relational data base and generating from the results of application of said rules a database of diagnostics linked to said data sets to aid executive analysis of financial data” US Patent Office – Application No. 400355 - G06F 015/18
  3. Berson, A & Smith, A (1997) “Data Warehousing, Data Mining and OLAP” McGraw Hill USA
  4. Beynon-Davis, P., 2004. Database systems, 3rd edn, Palgrave McMillan.
  5. Codd, E & Codd, S (1993) “Providing OLAP to User-Analysts: An IT Mandate”, Comshare
  6. Collett, D. (1991), “Modelling Binary Data”, London: Chapman & Hall.
  7. Date, C.J., 2004. An introduction to database systems, 8th edn, Addison Wesley.\
  8. Fair Issac (2003) “A Discussion of Data Analysis, Prediction and Decision Techniques”
  9. Fair Isaac White Paper, May 2003; http://
  10. Frank M. (1994) “A Drill Down Analysis of Multi-dimensional Databases”, DBMS, July.
  11. Hoffer, J., Prescott, M., McFadden, F., 2007. Modern database management, 8th edn, Prentice Hall.
  12. Inmon, W. H. (1995) “What is a Data Warehouse” Prisim Solutions Inc,
  13. Inmon, W. H. (1996) “What is a Data Mart”, Informatiques Magazine. Avril
  14. Inmon, W.H.(1996-2) “User Reaction to the Data Warehouse.” DMR (December 1996).
  15. Inmon, W. H. (2003) “Building the Data Warehouse” Wiley Computer Publishing, USA
  16. Keith, Steven; Kaser, Owen & Lemire, Daniel (2005) “Analyzing Large Collections of Electronic Text Using OLAP”, UNBSJ CSAS, TR-05-001, 2005.
  17. Kimball, Ralph & Ross, Margy (2002) “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling” 2nd Ed. Wiley Computer Publishing, USA
  18. Kimball R. (1997) “A Dimensional Modelling Manifesto” DBMS Online,
  19. Kroenke, D., (2003) “Database processing: Fundamentals, design and implementation”, 10th edn, Prentice Hall.
  20. Lehn, R., Lambert, V. & Nachouki, M.-P. (1997) "Data warehousing tool's architecture: from multidimensional analysis to data mining," dexa, p. 636, 8th International Workshop on Database and Expert Systems Applications (DEXA '97).
  21. Ma, Yao (1998) “Data Warehousing, OLAP and Data Mining; An Integrated Strategy for use at FAA”, M.Eng Thesis, MIT 2nd edn, McGraw-Hill.
  22. Hegland, M (2001) “Data mining techniques” Acta Numerica (2001), Volume 10: Pp 313-355 Cambridge University Press
  23. Mento, B & Rapple, B (2003) “Data Mining and Data Warehousing” Association of Research Libraries, US
  24. Nguyen, Tho Manh & Tjoa, A Min (2006) “Zero-Latency Data Warehousing (ZLDWH): the State-of-the-art and experimental implementation approaches” Institute of Software Technology and Interactive Systems, Vienna University of Technology Favoritenstra├če 9-11/188-3 (2. Stock), 1040 Vienna, Austria
  25. Pokorny J. (1998) “Conceptual Modelling in OLAP”, Proceeding of ECIS’98, Aix-en-Provence, Pp 273-288.
  26. Pratt, P. & Adamski, J., (2005). “The concepts of database management”, 5th edn, ITP.
  27. Pulleyblank, W. R. (2002) “Mathematical sciences in the nineties” Systems Journal, IBM Journal of Research and Development, “Mathematical Sciences at 40” IBM, Vol. 47, No. 1, 2003
  28. Service, R & Maddux, H, (1999) “Building competitive advantage through IS: the organizational Information Quotient”, Journal of Information Science25(1) Pp 51-65.
  29. Sullivan, D. (2001) “Document Warehousing and Text Mining: Techniques for Improving Business Operations”, Marketing, and Sales, USA
  30. Thalhammer, T; Schrefl, M. & Mohania, M (2001) “Active Data Warehouses: Complementing OLAP with Analysis Rules”, Data & Knowledge Engineering, Elsevier Science Ltd., Vol. 39(3), Pp. 241–269.
  31. Vassiliadis, P; Quix, C; Vassiliou, Y & Jarke, M (2001) “DATA WAREHOUSE PROCESS MANAGEMENT” Informatik, De
  32. Whitehouse, Peter R. (2006) “Case Studies in Database Design and Implementation” UQ Australia

[1] Ma, 1998
[2] Vassiliadis et al; 2001
[3] Immon, 1996-2
[4] Immon, 1995 (The founder of Prism Solutions in 1991)
[5] Pokorny, 1998
[6] Nguyen & Tjoa, 2006
[7] Immon, 1996
[8] Immon, 2003
[9] Service & Maddux; 1999
[10] Immon, 1995
[11] Lehn, 1997
[12] Tree Modeling Methods, P81 in Fair Issac, 2003
[13] Pulleyblank, 2002; Hegland 2001
[14] Lehn et al, 1997
[15] Kroenke, 2003; Mento, 2003; ,Keith et al 2005.
[16] Lehn et al 1997
[17] There are a number of specialist statistical packages designed to provide data mining services. Vendors such as SAS and SPSS, as well as open source products such as R provide this type of application software product.
[18] Berson & Smith, 1997
[19] Thalhammer, 2001
[20] Thalhammer, 2001
[21] Where “k” is an arbitrary number of records chosen by the analyst.
[22] Agresti 1990
[23] Fair Issac, 2003; Collett, 1991 & Amado 1997
[24] Hegland, 2001

No comments: