Thursday, 1 August 2013

Principles of Database Development

This is a little something based on a paper of mine from last decade.

Introduction

As the main repository of an organisation’s historical data, the data warehouse is evolving into the memory of a corporation[1]. By storing a wide variety of data sources in an integrated format, the data warehouse is becoming both the storeroom of past events and the central predictive engine.

The data warehouse contains the raw material that, when fed through a decision support system, can provide management with up-to-date analytics and corporate predictors. The data analyst can use this technology to perform complex queries and analysis of information without adversely impacting operational systems.

Riding both the rise in computational speed and power and the growth of data storage, data warehouses have pushed other technologies to greater levels of development. Technologies such as data mining have grown symbiotically alongside them.


Section 1 - Review of Research as at 2007 (when I did this)

To explore the effect of the rise of data warehouses on business and society, one first needs to characterise what the concept encompasses. William H. Inmon [2], known as one of the fathers of data warehousing, has stated that a data warehouse must be subject oriented. That is, the data, and thus the database, is organised so that all records relating to the same event or object are associated with one another, in a structure that is at the same time time variant, nonvolatile and integrated.

Whereas operational systems are necessarily optimised for ease of use and rapidity of response, the data warehouse is optimised for reporting and analysis. Online transaction processing (OLTP), crucial to the operational system, is of less importance to the data warehouse. Rather, online analytical processing (OLAP) and the need to probe unusual data patterns lead to heavily denormalised or dimension-based models, which may be required to achieve acceptable query response times.
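
To make the contrast concrete, the sketch below builds a toy dimensional (star schema) model in SQLite and runs an OLAP-style roll-up over it. All table and column names (sales_fact, dim_date, dim_store) are invented for illustration, not taken from any particular product.

    import sqlite3

    # A toy star schema: one fact table plus two dimension tables.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE sales_fact (
            date_key  INTEGER REFERENCES dim_date(date_key),
            store_key INTEGER REFERENCES dim_store(store_key),
            amount    REAL
        );
    """)
    con.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                    [(1, 2007, 1), (2, 2007, 2)])
    con.executemany("INSERT INTO dim_store VALUES (?,?)",
                    [(1, "North"), (2, "South")])
    con.executemany("INSERT INTO sales_fact VALUES (?,?,?)",
                    [(1, 1, 100.0), (1, 2, 80.0), (2, 1, 120.0), (2, 2, 90.0)])

    # An OLAP-style roll-up: total sales by region and month.
    for row in con.execute("""
        SELECT s.region, d.year, d.month, SUM(f.amount)
        FROM sales_fact f
        JOIN dim_date d  ON d.date_key  = f.date_key
        JOIN dim_store s ON s.store_key = f.store_key
        GROUP BY s.region, d.year, d.month
    """):
        print(row)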

Time variant means that changes to the objects and tables within the database are tracked and recorded. This allows statistical analysis of the data over time, producing reports on time-variant trends. Time series analysis of warehouse data, including ARIMA models and heteroscedastic models such as ARCH and GARCH, is one of the newer avenues of research.
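
As a minimal illustration of trend analysis over warehoused data, the sketch below fits an ARIMA model, assuming the statsmodels Python library is available; the monthly_sales figures are purely illustrative.

    # A minimal time-series sketch, assuming statsmodels is installed.
    from statsmodels.tsa.arima.model import ARIMA

    # Two years of invented monthly sales figures pulled from a warehouse.
    monthly_sales = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
                     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140]

    model = ARIMA(monthly_sales, order=(1, 1, 1))   # AR(1), first difference, MA(1)
    fitted = model.fit()
    print(fitted.forecast(steps=6))                 # forecast six periods ahead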

A data warehouse requires that once data is committed to the database, it is stored as read only. It is kept for future reporting, and the data at any point in time becomes an individual snapshot of the database over time. This nonvolatile property allows both historical analysis and future prediction.
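
The sketch below illustrates the read-only, snapshot idea: rows are only ever appended with a load date, never updated in place, so the state at any past point in time can be reconstructed. The table and values are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE customer_history (
        customer_id INTEGER, segment TEXT, loaded_at TEXT)""")

    # Each load appends a new version rather than overwriting the old one.
    con.executemany("INSERT INTO customer_history VALUES (?,?,?)", [
        (42, "retail",  "2007-01-01"),
        (42, "premium", "2007-06-01"),   # the customer later changed segment
    ])

    # Point-in-time snapshot: the latest row on or before a given date.
    as_of = "2007-03-31"
    row = con.execute("""
        SELECT customer_id, segment FROM customer_history
        WHERE customer_id = 42 AND loaded_at <= ?
        ORDER BY loaded_at DESC LIMIT 1""", (as_of,)).fetchone()
    print(row)   # (42, 'retail') -- the state as it was in March 2007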

To be effective, a data warehouse needs to contain data from as many of an organisation’s operational applications and individual databases as possible, if not all of them. Further, to be of any effective use, the data in the warehouse must be stored in a consistent manner. A failure either to constrain data consistently or to provide an adequate sample of the organisation’s data leads to the GIGO issue. That is, garbage in, garbage out.

Research into data warehousing has expanded into what has been termed the Corporate Information Factory (or "CIF"). A CIF is an organisational data structure which encompasses ERP, eCommerce, customer relationship management (CRM) and many other formerly separate reporting structures. In some cases, a CIF has been known to encompass data marts, exploration warehouses, operational data stores (ODS), both nearline and secondary storage, and project warehouses.

However, the volumes of data involved, the expense and the lack of enterprise support leave CIF implementations as an idea for the future. There are still a number of difficulties associated with data warehousing which make its implementation less widespread than might be expected. Of particular concern, the process of extracting, cleaning and processing data is both time-consuming and difficult. The failure of many organisations to implement and adhere to corporate-wide naming standards has only exacerbated this problem.

Widespread incompatibility between many database products has slowed the spread of data warehousing across organisations. Technologies such as OLAP have aided the development of cross-application data warehouses, but differences in table structure, normalisation and the types of data stored within individual databases remain an issue.

Another issue that has hampered the implementation of data warehousing is security. In a world that is increasingly reliant on the Internet and the Web, security can develop into a serious issue. With links into the data warehouse from the Internet, an organisation’s key informational assets are at risk of both discovery and compromise.


Section 2 - Implications of Research for Current Practice

There are a number of clear advantages to the adoption of data warehouse technologies. It is easy to see that organisations which adopt these new technologies successfully will gain a clear competitive advantage. These technologies enhance end user access to data and reports in a manner that allows for greater creativity and more informed evaluations.

The ability to create trend reports and to use statistical methods to forecast probabilistic events based on past occurrences and experience provides for more focused corporate activity. For instance, marketing information from a particular sales push can be compared across different regions to evaluate the impact of differing advertising initiatives.

Data warehousing technology can also significantly increase the effectiveness of several commercial business applications. In particular, customer relationship management (CRM) benefits greatly from this technology. The ability to both gain an overall view and to be able to drill down to specific areas and individuals provides a significant advantage to many organisations. CRM has been one of the principal applications to make early use of data warehousing.

Data Mining

Data mining is a relatively new development. The use of statistical methods to routinely investigate large volumes of data for patterns, using techniques including classification, decision trees, association rule mining and clustering, has produced a new field of computational mathematics. Data mining is a complex subject in itself: it is connected with numerous core disciplines, including computer science, and draws on powerful computational techniques from statistics, information retrieval, machine learning and pattern recognition[3].
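
As a toy example of one of the techniques named above, association rule mining, the sketch below counts item pairs that co-occur in invented shopping baskets and reports those above a minimum support threshold.

    from itertools import combinations
    from collections import Counter

    # Invented transaction baskets, purely for illustration.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"beer", "bread"},
        {"butter", "milk"},
    ]

    # Count how often each pair of items appears together in a basket.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    min_support = 2   # a pair must appear in at least two baskets
    for pair, count in pair_counts.items():
        if count >= min_support:
            print(pair, count / len(baskets))   # the pair and its support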

To many people, the key issue introduced through the use of data mining is not commercial or technological in nature. It is a social issue. The protection of individual privacy is a concern that has grown alongside the rise of data mining, which increases the possibility of an individual’s privacy being violated in some manner.

The processes used to analyse ordinary commercial transactions may be used to compile significant quantities of information about individuals from their purchasing behaviour and lifestyle preferences. The problem is compounded where data is compiled across multiple organisations, each of which needs to protect the privacy of individuals’ identities.

Data mining consists of five major elements:

  • Extract, transform, and load (ETL) transaction data onto the data warehouse system (a minimal sketch follows this list).
  • Store and manage the data in a multidimensional database system.
  • Provide data access to business analysts and information technology professionals.
  • Analyze the data using application software.
  • Present the data in a useful format, such as a graph or table.
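
Here is a minimal sketch of the first element, the ETL step. It also illustrates the naming-consistency problem raised in Section 1: two source systems code the same region differently, and the transform conforms them before loading. All names and rows are invented for illustration.

    import sqlite3

    # Extract: rows as they might arrive from two operational systems,
    # each with its own (inconsistent) region naming.
    source_rows = [
        {"region": "N", "amount": "100.00"},    # system A's code
        {"region": "north", "amount": "80.0"},  # system B's code
    ]

    # Transform: conform the naming and types before loading -- the
    # consistency requirement discussed above.
    CONFORMED = {"N": "North", "north": "North", "S": "South", "south": "South"}
    clean_rows = [(CONFORMED[r["region"]], float(r["amount"])) for r in source_rows]

    # Load: append the cleaned rows into the warehouse staging table.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales_staging (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_staging VALUES (?,?)", clean_rows)
    print(con.execute(
        "SELECT region, SUM(amount) FROM sales_staging GROUP BY region").fetchall())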

Different levels of analysis are available:

  • Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
  • Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
  • Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating two-way splits, while CHAID creates multi-way splits using chi-square tests. CART typically requires less data preparation than CHAID.
  • Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique; a pure-Python sketch follows this list.
  • Rule induction: The extraction of useful if-then rules from data based on statistical significance.
  • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
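
The sketch below is a pure-Python illustration of the nearest neighbor method from the list above: it classifies a new record by majority vote among the k most similar historical records. The two-feature training data is invented.

    from collections import Counter
    import math

    # Invented historical records: (features, class label).
    historical = [
        ((1.0, 1.0), "low risk"),
        ((1.2, 0.8), "low risk"),
        ((4.0, 4.2), "high risk"),
        ((4.5, 3.9), "high risk"),
    ]

    def classify(record, k=3):
        # Rank historical records by Euclidean distance to the new record,
        # then take a majority vote among the k nearest.
        ranked = sorted(historical,
                        key=lambda item: math.dist(record, item[0]))
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]

    print(classify((1.1, 0.9)))   # expected: 'low risk'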

Conclusion

We expect to see massive growth in data, not just now but into the foreseeable future.

When I first wrote this as an internal document in 2007, data was increasing as a source of growth. Now, it is more and more becoming an end in itself.

The fact that more data improves even poor AI algorithms makes the collection and storage of data essential.


Bibliography

Beynon-Davis, P., 2004. Database Systems, 3rd edn, Palgrave Macmillan.

Date, C.J., 2004. An Introduction to Database Systems, 8th edn, Addison-Wesley.

Hoffer, J., Prescott, M. & McFadden, F., 2007. Modern Database Management, 8th edn, Prentice Hall.

IBM, 2005. IBM Informix Database Design and Implementation Guide, IBM Informix Dynamic Server Enterprise and Workgroup Edition, v10.0.

Inmon, W.H., 2003. Building the Data Warehouse, Wiley Computer Publishing, USA.

Keith, S., Kaser, O. & Lemire, D., 2005. "Analyzing Large Collections of Electronic Text Using OLAP", UNBSJ CSAS, TR-05-001.

Kimball, R. & Ross, M., 2002. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edn, Wiley Computer Publishing, USA.

Kroenke, D., 2003. Database Processing: Fundamentals, Design and Implementation, 10th edn, Prentice Hall.

Mannino, M.V., 2004. Database Design, Application Development & Administration, 2nd edn, McGraw-Hill.

Mento, B. & Rapple, B., 2003. Data Mining and Data Warehousing, Association of Research Libraries, USA.

Pratt, P. & Adamski, J., 2005. The Concepts of Database Management, 5th edn, ITP.

Sullivan, D., 2001. Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales, Wiley Computer Publishing, USA.

Whitehouse, P.R., 2006. Case Studies in Database Design and Implementation, UQ, Australia.


[1] https://en.wikipedia.org/wiki/Data_warehouse

[2] The founder of Prism Solutions in 1991

[3] http://aima.cs.berkeley.edu/
