Text data mining, or simply text mining, involves the discovery of novel and previously unknown information using computer systems to analyse and extract data from a variety of text sources. Text mining allows the researcher to link extracted information in order to create or test hypotheses. Text mining differs from the broader field of data mining or knowledge discovery in databases (Fayyad & Uthurusamy, 1999) in that its data sources are textual collections and documents. It is concerned with the derivation of patterns that may be found in unstructured textual data rather than formalised database records.
There are many similarities between data mining and text mining (Hastie, Tibshirani & Friedman, 2001) and in fact text mining has developed through much of the seminal work on its counterpart. In particular it maintains a strong reliance on pre-processing routines, pattern discovery algorithms, and presentation layer elements. Visualisation tools and data mining algorithms are commonly used in text mining with many software programs integrating both data and text functions simultaneously (Berry & Linoff, 1997).
One of the primary differences between the generalised field of data mining and text mining comes from the presumption that data sets used in data mining exercises will be stored in a structured format. Pre-processing operations in text mining are therefore generally focused on transforming unstructured textual data into a format that is more readily interrogated. Additionally, text mining relies heavily on the field of computational linguistics (Fayyad & Stolorz, 1996).
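As a rough illustration of this kind of pre-processing, the sketch below turns a few raw strings into a term-document matrix, one simple structured format. The function names `tokenize` and `term_document_matrix` and the sample sentences are assumptions made for illustration; real systems typically add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows); each row holds the per-term counts of one document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of every term seen in any document.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical sample documents.
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in records.",
]
vocabulary, rows = term_document_matrix(docs)
for doc, row in zip(docs, rows):
    print(row, doc)
```

Each row can then be queried or compared like any other structured record, which is what makes the later mining steps possible.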
There is a strong relation between information retrieval, text mining and Web data mining. The special properties of text that derive from grammatical syntax, together with the growing repositories of textual data (such as the Internet/Web), have driven interest in this emerging field. In particular, advances in computational linguistics have further fuelled advances in text mining, leading to the development of new techniques and algorithms (Hastie, Tibshirani & Friedman, 2001).
Text Data Mining vs. Information Retrieval / Information Access
Although only one of many factors, a driving force behind the growth of text mining has been the Web (Hastie, Tibshirani & Friedman, 2001). The growth of Internet commerce has created large repositories of documents, customer information, records and other information. On top of this, advances in scientific research, academic publications and professional journals provide increasing amounts of unstructured content. With millions of new abstracts being published every year, knowledge discovery is becoming increasingly reliant on text mining operations.
Relating Text Mining and Computational Linguistics
Text mining extrapolates data mining to text collections, using a series of processes that are analogous to the processes used in mining numerical data. In particular, the field of corpus-based computational linguistics has numerous overlaps with it (Hearst, 1997). Computational linguistics uses empirical methods to compute a wide range of statistics from a large range of official documents, often in disparate collections. This process was developed in order to discover data patterns that could provide novel results.
These patterns may then be used in the creation of algorithms that are designed to provide solutions to ongoing problems within natural language processing (Armstrong, 1994). Some of the main issues in this field include part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation.
Church & Liberman (1991) proposed that there is great interest in the field of computational linguistics in word patterns and distributions. In particular they note that word combinations resembling “prices, prescription, and patent” may be expected to be grouped with the medicinal meaning of “drug”. It is further noted that “abuse, paraphernalia, and illicit” correlate with the use of the word “drug” in the sense of an illicit substance.
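The idea of choosing a word sense from its neighbouring words can be sketched as follows. This is a deliberately simplified illustration, not Church & Liberman's statistical method: the cue lists are hard-coded from the examples quoted above, and the names `SENSE_CUES` and `guess_sense` are hypothetical.

```python
import re

# Cue words taken from the examples quoted in the text; a real system would
# learn such associations from corpus statistics rather than hard-code them.
SENSE_CUES = {
    "medicinal": {"prices", "prescription", "patent"},
    "illicit": {"abuse", "paraphernalia", "illicit"},
}


def guess_sense(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word appears at all.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(tokens & words) for sense, words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("The patent on the prescription drug expires soon."))
print(guess_sense("Police seized drug paraphernalia."))
```

Counting overlapping cue words is the crudest possible scoring rule; it is used here only to show how co-occurring words can separate the two senses of “drug”.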
Text categorization is the process used to condense the particular content of a document into a set of pre-defined labels. It has been asserted (Fayyad & Uthurusamy, 1999) that text categorization should be considered text data mining. Fayyad & Uthurusamy (1999) regard the classification of astronomical phenomena as data mining, although this is predominantly the analysis of textual data.
Hearst (1997), however, believes that this process “does not lead to discovery of new information,…” but “rather, it produces a compact summary of something that is already known”. In Hearst's view, this process is thus not generally a component of text data mining.
Hearst (1997) does however note that “there are two recent areas of inquiry that make use of text categorization and do seem to fit within the conceptual framework of discovery of trends and patterns within textual data for more general purpose usage”.
Hearst identifies these as:
- A body of work associated with Reuters newswire that utilises text category labels to find “unexpected patterns among text articles” where “the main approach is to compare distributions of category assignments within subsets of the document collection”.
- The DARPA Topic Detection and Tracking initiative which includes the task called On-line New Event Detection “where the input is a stream of news stories in chronological order, and whose output is a yes/no decision for each story, made at the time the story arrives, indicating whether the story is the first reference to a newly occurring event”.
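The input and output of the new event detection task described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the approach used in the DARPA initiative: stories are compared by Jaccard word overlap, and the function names, the `threshold` value of 0.3 and the sample headlines are all hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_new_events(stories, threshold=0.3):
    """Return one yes/no decision per story, made in arrival order.

    A story counts as a new event when it is less similar than `threshold`
    to every story that arrived before it (an assumed decision rule).
    """
    seen = []
    decisions = []
    for story in stories:
        current = tokens(story)
        is_new = all(jaccard(current, earlier) < threshold for earlier in seen)
        decisions.append(is_new)
        seen.append(current)
    return decisions


# Hypothetical stream of headlines in chronological order.
stream = [
    "Earthquake strikes coastal city",
    "Coastal city earthquake damage rises",
    "Central bank raises interest rates",
]
print(detect_new_events(stream))
```

Because each decision uses only the stories seen so far, the sketch respects the on-line constraint that the answer is given when the story arrives.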
One way to view text data mining is as a progression of exploratory data analysis (Tukey, 1977) that leads to the discovery of previously unknown information. It can also be used to answer questions for which no answer is currently available.
It may also be held that the ordinary practice of reading textbooks, journal articles and other papers assists the discovery process by uncovering new information, since this is an essential component of research. The goal of text mining, however, is to use text for discovery in a more substantial way.
Text is generally considered to be unstructured (Cherkassky, 1998). However, nearly all documents demonstrate a rich amount of semantic and syntactical structure that may be used to form a framework for structuring data. Typographical elements such as punctuation, capitalisation, white space and carriage returns can provide a rich source of information to the text miner (Berry & Linoff, 1997).
The use of these elements can aid the researcher in identifying paragraphs, titles, dates, etc. These in turn may be used to impose structure on the data. This returns us to the field of computational linguistics, which attempts to give meaning to groups of words or phrases and to layout.
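A small sketch shows how such typographical cues might be turned into structure. The heuristics here are assumptions chosen for illustration: blank lines separate paragraphs, a single-line opening block without a closing full stop is treated as a title, and dates follow a "3 March 2007" layout. The names `extract_structure` and `DATE_PATTERN` are hypothetical.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches dates written like "3 March 2007" (an assumed date layout).
DATE_PATTERN = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def extract_structure(text):
    """Split text into a title, paragraphs and dates using layout cues only."""
    # Blank lines (white space and carriage returns) separate blocks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    first = blocks[0] if blocks else ""
    # Heuristic: a one-line opening block with no closing full stop is a title.
    has_title = bool(first) and "\n" not in first and not first.endswith(".")
    return {
        "title": first if has_title else None,
        "paragraphs": blocks[1:] if has_title else blocks,
        "dates": DATE_PATTERN.findall(text),
    }


sample = (
    "Quarterly Report\n\n"
    "Sales rose on 3 March 2007.\nCosts fell.\n\n"
    "The next review is on 12 June 2007."
)
print(extract_structure(sample))
```

Heuristics like these are brittle on real documents, but they show how much structure is recoverable from layout alone before any linguistic analysis is applied.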
Characters, Words, Terms and Concepts
At the most basic level, text mining systems take input from raw documents in order to create output in the form of patterns, trends and other useful formats. The result is that text mining often becomes an iterative process, a loop of queries, searches and refinements that lead to further sets of queries, searches and refinements (Feldman & Sanger, 2007). With each of these iterative phases, the output should move closer to the desired result.
In text mining, the general model of classic data mining is roughly followed (Feldman & Sanger, 2007):
1. Pre-processing tasks,
   a. Document Fetching/Crawling Techniques,
   c. Feature/Term Extraction,
2. Core mining operations,
   b. Frequent and Near Frequent Sets,
   d. Isolating Interesting Patterns,
   e. Analysing document collections over time,
3. Presentation and browsing functionality, and
   a. Pattern Identification,
   b. Trend Analysis,
   c. Browsing Functionality
      - Simple Filters,
      - Query Interpreter,
      - Search Interpreter,
      - Visualization Tools,
4. Refinement (post-processing) techniques.
Core mining operations include pattern discovery, trend analysis and incremental knowledge discovery algorithms, and form the backbone of the text mining process. Together, pre-processing and core mining are the most critical areas for any text mining system. If these stages are not correctly implemented, the data that is produced and visualised will have little value (Feldman & Sanger, 2007). In fact, the production of incorrect data could even result in negative consequences.
When analysing data, common patterns include distributions, concept sets and associations, and may include comparisons. The goal of this process is to figuratively uncover any “nuggets” from undiscovered relationships.
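One simple form of association can be sketched by counting which pairs of terms appear together in several documents. The function name `frequent_pairs`, the `min_support` cut-off and the sample term lists are assumptions made for illustration; this is not a specific algorithm from the cited sources.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(documents, min_support=2):
    """Return term pairs that co-occur in at least `min_support` documents.

    Each document is given as a list of already extracted terms.
    """
    pair_counts = Counter()
    for terms in documents:
        # Sorting makes ("drug", "patent") and ("patent", "drug") the same pair;
        # set() ensures a pair is counted once per document.
        for pair in combinations(sorted(set(terms)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical documents, reduced to their extracted terms.
docs = [
    ["drug", "patent", "prices"],
    ["drug", "patent"],
    ["drug", "abuse"],
]
print(frequent_pairs(docs))
```

Pairs that survive the support threshold are candidate associations; whether any of them is an interesting "nugget" still has to be judged by the analyst or by a further filtering step.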
Presentation layer components include GUI and pattern browsing functionality, and may include access to character and language editors and optimisers. This stage includes the creation of concept clusters and also the formulation of annotated profiles for specific concepts or patterns.
Refinement (also called post-processing) techniques include methods that filter redundant information and cluster closely related data. This stage may include suppression, ordering, pruning, generalisation and clustering approaches aimed at discovery optimisation.
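Three of these refinement steps (suppression of redundant results, pruning and ordering) can be sketched on a list of discovered patterns. The function name `refine`, the `min_support` threshold and the sample patterns are hypothetical, and a production system would apply far richer criteria.

```python
def refine(patterns, min_support=2):
    """Suppress duplicate patterns, prune weak ones and order the rest.

    `patterns` is a list of (terms, support) tuples, where `terms` is any
    iterable of strings and `support` is a count.
    """
    best = {}
    for terms, support in patterns:
        key = frozenset(terms)
        # Suppression: the same term set in a different order is redundant,
        # so only its highest support is kept.
        best[key] = max(support, best.get(key, 0))
    # Pruning: drop patterns whose support falls below the threshold.
    kept = [(sorted(key), support) for key, support in best.items() if support >= min_support]
    # Ordering: strongest patterns first, ties broken alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical raw output from a core mining step.
raw = [
    (("drug", "patent"), 3),
    (("patent", "drug"), 3),
    (("drug", "abuse"), 1),
    (("mining", "text"), 5),
]
print(refine(raw))
```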
Summary and Future
Although text is difficult to process, processing it can be extremely rewarding. Without even looking to the future, vast repositories of valuable information can already be found. The difficulty is in finding the proverbial needle in a haystack.
Computational linguistic tools are currently available, but they have a long way to go, and sophisticated language analysis needs to be developed further. The accumulation of statistical techniques that compute meaning and apply it to sections of text is proving promising. However, there is a great amount of research that needs to be completed before the true value of text mining will come to the forefront (Hastie, Tibshirani & Friedman, 2001).
This leaves us with a future that is still some way off but which continues to entice us with its potential. The growing volumes of textual documents, research, news and records are already beyond the capabilities of any individual to search. If we are to continue to move forward at the current rate of technological advancement, accessing this information is crucial. Text mining may provide the solution.