The contents of this presentation are extracted from the following sources:
1. "Pre-Crime Data Mining," in Investigative Data Mining for Security and Criminal Detection, Jesus Mena, Butterworth-Heinemann, 2003
2. "Web Page Extension of Data Warehouses," in John Wang (Ed.), Idea Group Publishing, 2006
3. "Text Mining: Clustering Concepts," in Investigative Data Mining for Security and Criminal Detection, Jesus Mena, Butterworth-Heinemann, 2003
4. Effective Databases for Text & Document Management, Shirley A. Becker (Ed.), Idea Group Publishing, 2003
Topics
1. Information Mining, Text Mining and Knowledge Management
2. Text Mining Technologies & Applications
3. Text Mining Methods
4. Text Mining for Security and Criminal Detection
5. Text Mining Tools
6. Web Mining: An Overview
7. Web Mining and Information Retrieval
8. Web Page Extension of Data Warehouses
9. Future Trends
10. Conclusion
11. Key Terms
Information Mining
The shortest path from complex data to knowledge discovery is information mining rather than data mining, a term that reflects the rich variety of forms that the information required for business intelligence can take. Information mining implies using powerful and sophisticated tools to do the following:
1. Uncover associations, patterns, and trends
2. Detect deviations
3. Group and classify information
4. Develop predictive models
From a technical perspective, the real keys to successful information mining are its algorithms: complex mathematical processes that compare and correlate data. Algorithms enable an information-mining application to determine who the best customers for the business are or what they like to buy. They can also determine at what time of day or in what combinations customers buy, or how an organization can optimize inventory, pricing, and merchandising in order to retain these customers and cause them to buy more, at increased profit margins. A large volume of information is stored in non-numeric forms: documents, images, and video files.
Text mining draws on several related fields:
1. Machine Learning and Data Mining (Hand et al., 2001; Mitchell, 1997; Witten & Frank, 1999): provides techniques for data analysis with varying knowledge representations and large amounts of data.
2. Statistics and Statistical Learning (Hastie et al., 2001): contributes to data analysis in general, in the context of text mining (Duda et al., 2000).
3. Information Retrieval (van Rijsbergen, 1979): provides techniques for text manipulation and retrieval mechanisms.
4. Natural Language Processing (Manning & Schutze, 2001): provides techniques for analyzing natural language.
Some aspects of text mining involve the development of models for reasoning about new text documents, based on words, phrases, linguistics, and grammatical properties of the text, as well as extracting information and knowledge from large amounts of text documents.
Internet Searching (Web Mining) - It has been around for quite a few years; Yahoo, AltaVista, and Excite are three of the earliest examples. Search engines (and discovery services) operate by indexing the content of a particular Web site and allowing users to search those indexes. Although useful, the first generations of these tools were often wrong because they did not correctly index the content they retrieved. Advances in text mining applied to Internet searching have produced online text mining, representing the new generation of Internet search tools. With these products, users can obtain more relevant information while processing a smaller number of links, pages, and indexes.
Text Analysis (Text Mining) - It has been around longer than Internet searching. Indeed, scientists have been trying to make computers understand natural languages for decades; text analysis is an integral part of these efforts. The automatic analysis of text information can be used for several different general purposes:
1. To provide an overview of the contents of a large document collection; for example, finding significant clusters of documents in a customer-feedback collection could indicate where a company's products and services need improvement.
2. To identify hidden structures between groups of objects; this may help organize an intranet site so that related documents are all connected by hyperlinks.
3. Automated help desk: some companies use text mining to respond to customer inquiries. Customers' letters and e-mails are processed by a text-mining application.
Text Document Categorization is used when a set of predefined content categories, such as arts, business, computers, games, health, recreation, science, and sport, is provided, as well as a set of documents labeled with those categories. The task is to classify previously unseen text documents by assigning each document one or more of the predefined categories. This usually is performed by representing documents as word-vectors and using documents that already have been assigned the categories to generate a model for assigning content categories to new documents (Jackson & Moulinier, 2002; Sebastiani, 2002). The categories can be organized into an ontology (e.g., the MeSH ontology for medical subject headings or the DMoz hierarchy of Web documents).
Text-Mining Methods
Document Clustering (Steinbach et al., 2000) is based on an arbitrary data-clustering algorithm adapted for text data by representing each document as a word vector. The similarity of two documents is commonly measured by the cosine similarity between the word vectors representing the documents. The same similarity measure also is used in document categorization for finding a set of the most similar documents.
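As a sketch, the cosine similarity between two word vectors can be computed as follows. The whitespace tokenizer and the two example documents are illustrative assumptions, not part of the cited method:

```python
from collections import Counter
import math

def word_vector(text):
    """Represent a document as a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = word_vector("crime detection with data mining")
doc2 = word_vector("data mining for criminal detection")
sim = cosine_similarity(doc1, doc2)  # shared terms: data, mining, detection
```

A value of 1 means identical direction in the word space; 0 means no shared vocabulary.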
Visualization of text data is a method used to obtain early measures of data quality, content, and distribution (Fayyad et al., 2001). For instance, by applying document visualization, it is possible to get an overview of a Web site's content or a document collection. One form of text visualization is based on document clustering (Grobelnik & Mladenic, 2002): first the documents are represented as word vectors, and then the K-means clustering algorithm is run on the set of word vectors. The obtained clusters are represented as nodes in a graph, where each node is described by the set of most characteristic words in the cluster. Similar nodes, as measured by the cosine similarity of their word vectors, are connected by an edge in the graph. When such a graph is drawn, it provides a visual representation of the document set.
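A minimal sketch of this cluster-graph construction follows, with a toy K-means over toy dense word vectors. The four-word vocabulary, the example vectors, and the 0.5 edge threshold are illustrative assumptions:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=20):
    """Toy K-means assigning each vector to the most cosine-similar center."""
    random.seed(0)
    centers = random.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda i: cosine(v, centers[i]))
            clusters[best].append(v)
        for i, c in enumerate(clusters):
            if c:  # recompute center as the mean of the cluster's vectors
                centers[i] = [sum(col) / len(c) for col in zip(*c)]
    return centers, clusters

# Toy word vectors over the vocabulary (crime, police, web, page)
docs = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [1, 0, 2, 3]]
centers, clusters = kmeans(docs, 2)

# Connect cluster nodes by an edge when their centers are similar enough
edges = [(i, j) for i in range(2) for j in range(i + 1, 2)
         if cosine(centers[i], centers[j]) > 0.5]
```

Each cluster node would then be labeled with its most characteristic words (the largest components of its center vector) before drawing the graph.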
Text Summarization often is applied as a second stage of document retrieval in order to help the user get an idea of the content of the retrieved documents. Research in information retrieval has a long tradition of addressing the problem of text summarization; the first reported attempts, in the 1950s and 1960s, exploited properties such as the frequency of words in the text. When dealing with text, especially in different natural languages, properties of the language can be a valuable source of information. This brought into text summarization, in the late 1970s, methods from research in natural language processing. Since humans are good at making summaries, we can use examples of human-generated summaries to learn something about the underlying process by applying machine-learning and data-mining methods, a popular problem in the 1990s.
The explosion of the amount of data generated from government and corporate databases, e-mails, Internet survey forms, phone and cellular records, and other communications has led to the need for new pattern-recognition technologies, including the need to extract concepts and keywords from unstructured data via text mining tools using unique clustering techniques. Based on a field of AI known as natural language processing (NLP), text mining tools can capture critical features of a document's content based on the analysis of its linguistic characteristics. One of the obvious applications for text mining is monitoring multiple online and wireless communication channels for the use of selected keywords, such as anthrax or the names of individual or groups of suspects. Patterns in digital textual files provide clues to the identity and features of criminals, which investigators can uncover via the use of this evolving genre of special text mining tools.
Text mining has typically been used by corporations to organize and index internal documents, but the same technology can be used to organize criminal cases by police departments to institutionalize the knowledge of criminal activities by perpetrators and organized gangs and groups. More importantly, criminal investigators and counter-intelligence analysts can sort, organize, and analyze gigabytes of text during the course of their investigations and inquiries using the same technology and tools. Most of today's crimes are electronic in nature, requiring the coordination and communication of perpetrators via networks and databases, which leave textual trails that investigators can track and analyze. There is an assortment of tools and techniques for discovering key information concepts from narrative text residing in multiple databases in many formats and multiple languages.
Text mining tools and applications focus on discovering relationships in unstructured text and can be applied to the problem of searching for and locating keywords, such as names or terms used in e-mails, wireless phone calls, faxes, instant messages, chat rooms, and other methods of human communication. This is unlike traditional data mining, which deals with databases that follow a rigid structure of tables containing records representing specific instances of entities, based on relationships between values in set columns.
Probably one of the most powerful tools for investigative data miners, in terms of detecting, identifying, and classifying patterns of digital and physical evidence, is the neural network, a technology that has been around for 20 years. Although neural networks were proposed in the late 1950s, it wasn't until the mid-1980s that software became sufficiently sophisticated and computers became powerful enough for actual applications to be developed. During the 1990s, commercial neural network tools and applications developed by such firms as Nestor, NeuralWare, and HNC became reliable enough to enable their widespread use in the financial, marketing, retailing, medical, and manufacturing market sectors. Ironically, one of the first and most successful applications was in the area of credit card fraud detection.
Today, however, neural networks are being applied to an increasing number of real-world problems of considerable complexity. Neural networks are good pattern-recognition engines and robust classifiers with the ability to generalize in making decisions about imprecise and incomplete data. Unlike other traditional statistical methods, like regression, they are able to work with a relatively small training sample in constructing predictive models; this makes them ideal in criminal detection situations because, for example, only a tiny percentage of most transactions are fraudulent.
A key concept about working with neural networks is that they must be trained, just as a child or a pet must, because this type of software is really about remembering observations. If provided an adequate sample of fraud or other criminal observations, it will eventually be able to spot new instances or situations of similar crimes. Training involves exposing a set of examples of the transaction patterns to a neural-network algorithm; often thousands of sessions are recycled until the neural network learns the pattern. As a neural network is trained, it gradually becomes skilled at recognizing the patterns of criminal behavior and features of perpetrators; this is actually done through the adjustment of mathematical formulas that are continuously changing, gradually converging on a set of weights that can be used to detect new criminal behavior or other criminals.
Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship. The network attempts to mirror the way the human brain works in recognizing patterns by arithmetically combining all the variables associated with a given data point.
In this way, it is possible to develop nonlinear predictive models that learn by studying combinations of variables and how different combinations of variables affect different data sets.
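The weighted, layer-by-layer combination described above can be sketched as a tiny feedforward pass. The feature names, weights, and biases here are made-up illustrative values, not a trained model:

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each unit arithmetically combines every input through its weights."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Hypothetical transaction features: amount (scaled), hour of day (scaled),
# overseas flag
x = [0.9, 0.1, 1.0]
hidden = layer(x, [[0.5, -0.2, 0.8], [-0.4, 0.9, 0.3]], [0.0, 0.1])
output = layer(hidden, [[1.2, -0.7]], [-0.3])
fraud_score = output[0]  # a value in (0, 1)
```

Training would adjust the weight and bias values until the output reliably separates fraudulent from legitimate patterns.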
Probably the most important and pivotal technology for profiling terrorists and criminals via data mining is the use of machine-learning algorithms. Machine-learning algorithms are commonly used to segment a database, automating the manual process of searching for and discovering key features and intervals. For example, they can be used to answer such questions as when fraud is most likely to take place or what the characteristics of a drug smuggler are. Machine-learning software can segment a database into statistically significant clusters based on a desired output, such as the identifiable characteristics of suspected criminals or terrorists. Like neural networks, they can be used to find the needles in the digital haystacks. However, unlike neural networks, they can generate graphical decision trees or IF/THEN rules, which an analyst can understand and use to gain important insight into the attributes of crimes and criminals.
Machine-learning algorithms, such as CART, CHAID, and C5.0, operate somewhat differently, but the solution is basically the same: They segment and classify the data based on a desired output, such as identifying a potential perpetrator. They operate through a process similar to the game of 20 questions, interrogating a data set in order to discover what attributes are the most important for identifying a potential customer, perpetrator, or piece of fruit. Let's say we have a banana, an apple, and an orange. Which data attribute carries the most information in classifying that fruit? Is it weight, shape, or color? Weight is of little help since 7.8 ounces isn't going to discriminate very much. How about shape? Well, if it is round, we can rule out a banana. However, color is really the best attribute and carries the most information for identifying fruit. The same process takes place in the identification of perpetrators, except in this case an analysis might incorporate hundreds, if not thousands, of data attributes.
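The "20 questions" intuition above is exactly what decision-tree learners such as CART and C5.0 quantify with information gain. A minimal sketch over the fruit example (the toy attribute values are illustrative assumptions):

```python
import math
from collections import Counter

def entropy(labels):
    """Uncertainty, in bits, of a set of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """How much knowing `attr` reduces uncertainty about the label."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return base - remainder

# Toy fruit data: weight class, shape, color
fruits = [
    {"weight": "medium", "shape": "long",  "color": "yellow"},
    {"weight": "medium", "shape": "round", "color": "red"},
    {"weight": "medium", "shape": "round", "color": "orange"},
]
labels = ["banana", "apple", "orange"]
gains = {a: information_gain(fruits, labels, a)
         for a in ("weight", "shape", "color")}
# Color separates all three fruits, shape rules out only the banana,
# and weight discriminates nothing.
```

A tree learner would split first on the attribute with the highest gain, then repeat the question on each branch.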
Their output can be either in the form of IF/THEN rules or a graphical decision tree, with each branch representing a distinct cluster in a database. They can automate the process of stratification so that known clues can be used to "score" individuals as interactions occur in various databases over time, and predictive rules can "fire" in real time for detecting potential suspects. The rules or "signatures" could be hosted in centralized servers, so that as transactions occur in commercial and government databases, real-time alerts would be broadcast to law enforcement agencies and other point-of-contact users; a scenario might play out as follows:
3. Under different conditions, the alert is elevated. RULE 3: IF social security number issued <= 89121 days ago, AND 2 overseas trips during last 3 months, AND license type = Truck, THEN target with 63% probability. Recommended action: ask for additional information about destination and report findings back to this system.
In a networked environment such as ours, a new entity has evolved: intelligent agent software. Over the past few years, agents have emerged as a new software paradigm; they are in part distributed systems, autonomous programs, and artificial life. The concept of agents is an outgrowth of years of research in the fields of AI and robotics. They represent concepts of reasoning, knowledge representation, and autonomous learning. Agents are automated programs and provide tools for integration across multiple applications and databases, running across open and closed networks. They are a means of retrieving, filtering, managing, monitoring, analyzing, and disseminating information over the Internet, intranets, and other proprietary networks.
Agents represent a new generation of computing systems and are one of the more recent developments in the field of AI. Agents are specific applications with predefined goals, which can run autonomously; for example, an Internet-based agent can retrieve documents based on user-defined criteria. They can also monitor an environment and issue alerts or go into action based on how they are programmed. In the course of investigative data mining projects, for example, agents can serve the function of software detectives, monitoring, shadowing, recognizing, and retrieving information for analysis and case development or real-time alerts.
The market for text mining products can be divided into two groups: text mining tools and kits for constructing applications with embedded text mining facilities. They vary from inexpensive desktop packages to enterprise-wide systems costing thousands of dollars. The following is a partial list of text analysis software. Keep in mind, this is not an all-encompassing list; however, analysts need to know what these tools do and how they may want to apply them to solve profiling and forensic needs.
Clairvoyance: http://www.claritech.com/
ClearForest: http://www.clearforest.com/
Copernic: http://www.copernic.com/company/index.html
DolphinSearch: http://www.dolphinsearch.com/
dtSearch: http://www.dtsearch.com/
HNC Software: http://www.hnc.com/
IBM: http://www3.ibm.com/software/data/iminer/fortext/index.html
iCrossReader: http://www.insight.com.ru
Klarity: http://www.klarity.com.au
Kwalitan: http://www.gamma.rug.nl
Leximancer: http://www.leximancer.com
Lextek: http://www.languageidentifier.com
Semio: http://www.semio.com/
Temis: http://www.temis-group.com
Text Analyst: http://www.megaputer.com
TripleHop: http://www.triplehop.com/ie
Quenza: http://www.xanalys.com/quenza.html
Products
SelectResponse, Agentware, Cambio, RetrievalWare, DOCSFulcrum, SearchServer, Intelligent Miner for Text, Cross-Reader, Dataset, Text Analyst, SemioMap, KeyView, Intranet Spider
Overview
Data mining researchers have a fertile area in which to develop different systems, using the Internet as a knowledge base or personalizing Web information. The combination of the Internet and data mining typically has been referred to as Web mining, defined by Kosala and Blockeel (2000) as "a converging research area from several research communities, such as DataBase (DB), Information Retrieval (IR) and Artificial Intelligent (AI), especially from machine learning and Natural Language Processing (NLP)".
Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services; it has traditionally been divided into three distinct areas, based on which part of the Web is mined: Web content, Web structure, and Web usage. Brief descriptions of these categories are summarized below.
1. Web Content Mining: Web content consists of several types of data, such as textual, image, audio, video, and metadata, as well as hyperlinks. Web content mining describes the process of information discovery from millions of sources across the World Wide Web. From an information-retrieval point of view, Web sites are collections of unstructured hypertext documents (Turney, 2002); from a DB point of view, they are collections of semi-structured documents (Jeh & Widom, 2004).
2. Web Structure Mining: This approach is interested in the structure of the hyperlinks within the Web itself: the inter-document structure. Web structure mining is inspired by the study of social networks and citation analysis (Chakrabarti, 2002). Some algorithms have been proposed to model the Web topology, such as PageRank (Brin & Page, 1998) from Google, and other approaches add content information to the link structure (Getoor, 2003).
3. Web Usage Mining: Web usage mining focuses on techniques that could predict user behavior while the user interacts with the Web. A first approach maps the usage data of the Web server into relational tables for a later analysis. A second approach uses the log data directly by using special preprocessing techniques (Borges & Levene, 2004).
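The PageRank algorithm mentioned under Web structure mining can be sketched as a simple power iteration over a link graph. The three-page toy graph and the 0.85 damping factor (the value used by Brin & Page) are illustrative:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a (1 - damping) share; the rest flows along links.
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:  # dangling page: spread its rank evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Toy Web: A links to B and C, B links to C, C links back to A
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

Here C ends up ranked above B because it receives links from both A and B, while B receives only half of A's outgoing rank.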
Information Retrieval
On the Web, there are no standards or style rules; the contents are created by a set of very heterogeneous people in an autonomous way. In this sense, the Web can be seen as a huge amount of online unstructured information. Due to this inherent chaos, the necessity of developing systems that aid us in the processes of searching and efficient accessing of information has emerged.
When we want to find information on the Web, we usually access it through search services, such as Google (http://www.google.com) or AllTheWeb (http://www.alltheweb.com), which return a ranked list of Web pages in response to our request. A recent study (Gonzalo, 2004) showed that this method of finding information works well when we want to retrieve home pages, Web sites related to corporations, institutions, or specific events, or to find quality portals. However, when we want to explore several pages, relating information from several sources, this approach has some deficiencies: the ranked lists are not conceptually ordered, and information in different sources is not related. The Google model has the following features: crawling the Web, the application of a simple Boolean search, the PageRank algorithm, and an efficient implementation. This model directs us to a Web page; once the page is reached, we are left with the local server's search tools. Nowadays, these tools are very simple, and the search results are poor.
Another way to find information is to use Web directories organized by categories, such as Yahoo (http://www.yahoo.com) or the Open Directory Project (http://www.dmoz.org). However, the manual nature of this categorization makes the directories' maintenance too arduous if machine processes do not assist it.
Present and future research tends toward the visualization and organization of results, information extraction over the retrieved pages, and the development of efficient local-server search tools. Next, we summarize some of the technologies that can be explored in Web content mining and give a brief description of their main features.
The main goal of these methods is to find the nearest category in a hierarchy of pre-classified categories for a specific Web page's content. Some relevant works in this approach can be found in Chakrabarti (2003) and Kwon and Lee (2003).
Clustering involves dividing a set of n documents into a specific number of clusters k, so that some documents are similar to other documents in the same cluster and different from those in other clusters. Some examples in this context are Carey, et al. (2003) and Liu, et al. (2002).
In general, Web mining systems can be decomposed into different stages that can be grouped in four main phases: resource access, the task of capturing intended Web documents; information preprocessing, the automatic selection of specific information from the captured resources; generalization, where machine learning or data-mining processes discover general patterns in individual Web pages or across multiple sites; and finally, the analysis phase, or validation and interpretation of the mined patterns. We think that by improving each of the phases, the final system behavior also can be improved.
In this presentation, we focus our efforts on Web-page representation, which can be associated with the information-preprocessing phase in a general Web-mining system. Several hypertext representations have been introduced in the literature for the different Web mining categories, and the choice depends on the intended use and application. Here, we restrict our analysis to Web-content mining; in addition, hyperlinks and multimedia data are not considered. The main reason to select only the tagged text is to look for special features emerging from the HTML tags, with the aim of developing Web-content mining systems with greater scope and better performance as local server search tools. In this case, the representation of Web pages is similar to the representation of any text.
A model of text must build a machine representation of world knowledge and, therefore, must involve a natural language grammar. Since we restrict our scope to statistical analyses for Web-page classification, we need to find suitable representations for hypertext that will suffice for our learning applications.
We carry out a comparison between different representations using the vector space model (Salton et al., 1975), where documents are tokenized using simple rules, such as whitespace delimiters in English, and tokens are stemmed to canonical form (e.g., "reading" to "read"). Each canonical token represents an axis in the Euclidean space. This representation ignores the sequence in which words occur and is based on statistics about single, independent words. This independence assumption between the words that co-appear in a text or appear as multi-word terms is, strictly, an error, but it reduces the complexity of our problem without loss of efficiency. The different representations are obtained using different functions to assign the value of each component in the vector representation. We used a subset of the BankSearch Dataset as the Web document collection (Sinka & Corne, 2002).
First, we obtained five representations using well-known functions from the IR environment. All of these are based only on the term frequency in the Web page that we want to represent and on the term frequency in the pages of the collection. Below, we summarize the different evaluated representations with a brief explanation of each.
Information Representations
1. Binary: This is the most straightforward model, also called the set-of-words model. The relevance or weight of a feature is a binary value {0,1}, depending on whether the feature appears in the document or not.
2. Term Frequency (TF): Each term is assumed to have an importance proportional to the number of times it occurs in the text (Luhn, 1957). The weight of a term t in a document d is given by W(d;t)=TF(d;t); where TF(d;t) is the term frequency of the term t in d.
3. Inverse Document Frequency (IDF): The importance of each term is assumed to be inversely proportional to the number of documents that contain the term. The IDF factor of a term t is given by IDF(t) = log(N / df(t)), where N is the number of documents in the collection and df(t) is the number of documents that contain the term t.
4. TF-IDF: Salton (1988) proposed combining TF and IDF to weight terms. The TF-IDF weight of a term t in a document d is given by W(d,t) = TF(d,t) x IDF(t).
5. WIDF: An extension of IDF that incorporates the term frequency over the collection of documents. The WIDF weight is given by W(d,t) = TF(d,t) / Σi TF(i,t), where the sum runs over the documents i in the collection.
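The first four frequency-based weightings can be sketched together in a few lines. The three tiny example documents are illustrative assumptions:

```python
import math
from collections import Counter

def weights(docs):
    """Compute Binary, TF, IDF, and TF-IDF weights for each term/document."""
    N = len(docs)
    tfs = [Counter(d.lower().split()) for d in docs]
    df = Counter(t for tf in tfs for t in tf)          # document frequency
    idf = {t: math.log(N / df[t]) for t in df}         # IDF(t) = log(N / df(t))
    table = []
    for tf in tfs:
        # Binary weight is 1 for any present term; absent terms are omitted.
        table.append({t: {"binary": 1,
                          "tf": tf[t],
                          "tfidf": tf[t] * idf[t]} for t in tf})
    return table, idf

docs = ["crime data mining", "web data mining", "crime detection"]
table, idf = weights(docs)
```

Terms appearing in every document get IDF log(N/N) = 0, so TF-IDF suppresses them; rare terms get the largest weights.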
In addition to the five representations, we obtained two other representations that combine several criteria extracted from tagged text that can be treated differently from other parts of the Web page document. Both representations consider more elements than the term frequency alone to obtain a term's relevance in the Web page content. These two representations are the Analytical Combination of Criteria (ACC) and the Fuzzy Combination of Criteria (FCC).
The Analytical Combination of Criteria (ACC) Vs. Fuzzy Combination of Criteria (FCC)
The difference between them is the way they evaluate and combine the criteria. The first (Fresno & Ribeiro, 2004) uses a linear combination of the criteria, whereas the second (Ribeiro et al., 2002) combines them using a fuzzy system. A fuzzy reasoning system is a suitable framework to capture qualitative human expert knowledge and to handle the ambiguity inherent in the reasoning process, embodying knowledge and expertise in a set of linguistic expressions that manage words instead of numerical values. The fundamental idea is that often a criterion evaluates the importance of a word only when it appears combined with another criterion. Some Web-page representation methods that use HTML tags in different ways can be found in Molinari, et al. (2003), Pierre (2001), and Yang, et al. (2002).
Word frequency in the text: Luhn (1957) showed that a statistical analysis of the words in a document provides some clues to its contents. This is the most used heuristic in the text-representation field.
Word appearance in the title: whether the word appears in the title of the Web page, considering that in many cases the document title can be a summary of the content.
Word position along the text: in automatic text summarization, a well-known heuristic for extracting sentences that contain important information is to select those that appear at the beginning and at the end of the document (Edmundson, 1969).
Word appearance in emphasis tags: whether or not the word appears in emphasis tags. For this criterion, several HTML tags were selected because they capture the author's intention; the hypothesis is that if a word appears emphasized, it is because the author wants it to stand out.
The selected classes are very different from each other to display favorable conditions for learning and classification stages, and to show clearly the achievements of the different representation methods.
Representation Stage
The representation stage was carried out as follows. The corpus, the set of documents that generates the vocabulary, was created from 700 pages for each selected class. All the different stemmed words found in these documents generated the vocabulary, serving as axes in the Euclidean space. We fixed the maximum length of a stemmed word at 30 characters and the minimum length at three characters.
In order to calculate the values of the vector components for each document, we followed these steps: (a) we eliminated all punctuation marks except some special marks used in URLs, e-mail addresses, and multi-word terms; (b) the words in a stoplist used in IR were eliminated from the Web pages; (c) we obtained the stem of each term using the well-known Porter stemming algorithm; (d) we counted the number of times each term appeared on each Web page and the number of pages where the term was present; and (e) to calculate the ACC and FCC representations, we recorded the position of each term along the Web page and whether or not the feature appears in emphasis and title tags. In addition, another 300 pages for each class were represented in the same vocabulary to evaluate the system.
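Steps (a) through (d) above can be sketched as a small preprocessing pipeline. The tiny stoplist is illustrative, and the crude suffix stripper is only a stand-in for the real Porter algorithm:

```python
import re
from collections import Counter

# Tiny illustrative stoplist; a real IR stoplist has hundreds of entries.
STOPLIST = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def stem(word):
    """Crude suffix stripping standing in for Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

def preprocess(page_text):
    # (a) strip punctuation, (b) drop stoplist words, (c) stem,
    # (d) count term frequencies; the 3-30 character limits follow
    # the corpus setup described above.
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    stems = [stem(t) for t in tokens if t not in STOPLIST]
    stems = [s for s in stems if 3 <= len(s) <= 30]
    return Counter(stems)

tf = preprocess("Detecting crimes and criminals in the data")
```

The resulting term-frequency counter is what the weighting functions (TF, IDF, TF-IDF, and so on) would then operate on.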
Learning Stage
In the learning stage, the class descriptors (i.e., information common to a particular class but extrinsic to any single instance of that class) were obtained from a supervised learning process. Considering the central limit theorem, the word relevance (i.e., the value of each component in the vector representation) in the text content for each class is assumed to be distributed as a Gaussian with mean μ and variance σ². Then, the density function

f(r) = (1 / (σ √(2π))) exp(−(r − μ)² / (2σ²))

gives the probability that a word i, with relevance r, appears in a class (Fresno & Ribeiro, 2004). The mean and variance are obtained from the two selected sets of examples for each class by a maximum-likelihood estimation method.
Classification Stage
Once the learning stage is complete, a Bayesian classification process is carried out to test the performance of the obtained class descriptors. The optimal classification of a new page d is the class cj ∈ C for which the probability P(cj | d) is maximum, where C is the set of considered classes; P(cj | d) reflects the confidence that cj holds given the page d. Bayes' theorem states:

P(cj | d) = P(d | cj) P(cj) / P(d)

Considering the independence principle, assuming that all pages and classes have the same prior probability, applying logarithms (a non-decreasing monotonic function), and shifting the argument by one unit to avoid the discontinuity at x = 0, the most likely class is finally given by:

c* = argmax over cj ∈ C of Σi log(1 + fj(ri))

where fj is the class-j density from the learning stage and ri is the relevance of word i in d.
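A minimal sketch of this Gaussian-density Bayesian classifier follows. The two classes, the per-term (mean, variance) values, and the example page vector are illustrative assumptions, not learned from real data:

```python
import math

def gaussian(r, mean, var):
    """Density of term relevance r under a class's Gaussian model."""
    return (math.exp(-(r - mean) ** 2 / (2 * var))
            / math.sqrt(2 * math.pi * var))

def classify(page_vector, class_models):
    """Pick the class maximizing sum_i log(1 + f_class(r_i)),
    assuming equal priors and independent terms."""
    best, best_score = None, -math.inf
    for cls, model in class_models.items():
        score = sum(math.log(1 + gaussian(r, *model[term]))
                    for term, r in page_vector.items() if term in model)
        if score > best_score:
            best, best_score = cls, score
    return best

# Hypothetical per-class (mean, variance) of term relevance, as would be
# learned by maximum-likelihood estimation in the learning stage
models = {
    "banking": {"loan": (0.8, 0.05), "sport": (0.1, 0.05)},
    "sports":  {"loan": (0.1, 0.05), "sport": (0.8, 0.05)},
}
page = {"loan": 0.75, "sport": 0.05}
predicted = classify(page, models)
```

The log(1 + x) form mirrors the one-unit shift described above, keeping the score finite when a density is near zero.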
Data warehouses are constructed to provide valuable, current information for decision making. Typically this information is derived from the organization's functional databases; the data warehouse thus provides a consolidated, convenient source of data for the decision maker. However, the available organizational information may not be sufficient to reach a decision; information external to the organization is often also necessary for management to arrive at strategic decisions. Such external information may be available on the World Wide Web and, when added to the data warehouse, extends decision-making power.
Introduction
The Web can be considered as a large repository of data. This data is on the whole unstructured and must be gathered and extracted to be made into something valuable for the organizational decision maker. To gather this data and place it into the organization's data warehouse requires an understanding of the data warehouse metadata and the use of Web mining techniques.
Typically when conducting a search on the Web, a user initiates the search by using a search engine to find documents that refer to the desired subject. This requires the user to define the domain of interest as a keyword or a collection of keywords that can be processed by the search engine. The searcher may not know how to break the domain down, thus limiting the search to the domain name. However, even given the ability to break down the domain and conduct a search, the search results have two significant problems. One, Web searches return information about a very large number of documents. Two, much of the returned information may be marginally relevant or completely irrelevant to the domain. The decision maker may not have time to sift through results to find the meaningful information.
Background
Web search queries can be related to each other by the results returned (Wu & Crestani, 2004; Glance, 2000). This knowledge of common results to different queries can assist a new searcher in finding desired information. However, it assumes domain knowledge sufficient to develop a query with keywords, and does not provide corresponding organizational knowledge.
Some Web search engines find information by categorizing the pages in their indexes. One of the first to create a structure as part of their Web index was Yahoo! (http://www.yahoo.com). Yahoo! has developed a hierarchy of documents, which is designed to help users find information faster. This hierarchy acts as a taxonomy of the domain. Yahoo! helps by directing the searcher through the domain. Again, there is no organizational knowledge to put the Web pages into a local context, so the documents must be accessed and assimilated by the searcher.
Experts within a domain of knowledge are familiar with the facts and the organization of the domain. In the warehouse design process, the analyst extracts from the expert the domain organization. This organization is the foundation for the warehouse structure and specifically the dimensions that represent the characteristics of the domain.
6. Select Web Documents: The Web pages selected are those that may add knowledge to the data warehouse, either new knowledge or knowledge that extends what the warehouse already holds. Because the analyst knows that the city of interest to the data warehouse is Buffalo, New York, only the appropriate pages are considered.
7. Relevancy Review: The analyst reviews the selected pages to ensure they are relevant to the intent of the warehouse attribute. The meta-data of the relevant Web pages is extracted during this review; it includes the Web page URL, title, date retrieved, date created, and summary, and may come from the search engine results list. For the Buffalo home page this meta-data is shown in Figure 2.
8. Add Meta-Data: The meta-data for the page is added as an extension to the data warehouse. It is attached to the city dimension, creating a snowflake-like schema for the data warehouse.
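Steps 7 and 8 can be sketched as a simple extension structure. The meta-data fields follow those listed above; the class names, the sample URL, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class WebPageMetadata:
    # Fields named in the relevancy-review step; values supplied later are illustrative.
    url: str
    title: str
    date_retrieved: str
    date_created: str
    summary: str

@dataclass
class CityDimension:
    city: str
    state: str
    web_pages: list = field(default_factory=list)  # snowflake-like extension of the dimension

    def add_meta_data(self, page: WebPageMetadata):
        """Step 8: extend the city dimension with reviewed Web-page meta-data."""
        self.web_pages.append(page)
```

For example, the reviewed Buffalo home page would be attached to the city dimension with `CityDimension("Buffalo", "New York").add_meta_data(...)`, leaving the fact table untouched.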
Nowadays, the main shortcoming of the systems that aid us in searching for and accessing information is revealed when we want to explore several pages, relating information from several sources. Future work must find regularities in HTML vocabulary to improve the response of local server search tools, combining them with other aspects such as hyperlink regularities.
Future Trends
A number of researchers are working intensively in the area of text data mining, mainly guided by the need to develop new methods capable of handling interesting real-world problems. One such problem, recognized in the past few years, is reducing the amount of manual work needed for hand labeling the data. Most approaches to automatic document filtering, categorization, user profiling, information extraction, and text tagging require a set of labeled (pre-categorized) data describing the addressed concepts. Using unlabeled data and bootstrap learning are two directions yielding research results that enable an important reduction in the amount of hand labeling needed.
In document categorization using unlabeled data, we need a small number of labeled documents and a large pool of unlabeled documents (e.g., classify an article in one of the news groups, classify Web pages). The approach proposed by Nigam, et al. (2001) can be described as follows. First, model the labeled documents and use the trained model to assign probabilistically weighted categories to all unlabeled documents. Then, train a new model using all the documents and iterate until the model remains unchanged. It can be seen that the final result depends heavily on the quality of the categories assigned to the small set of hand-labeled data, but it is much easier to hand label a small set of examples with good quality than a large set of examples with medium quality.
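The loop described by Nigam et al. can be sketched with a weighted multinomial Naive Bayes model. This is a simplified illustration, not their implementation: it assumes equal class priors, Laplace smoothing, and a fixed number of iterations instead of a convergence test.

```python
from collections import Counter
import math

def train(docs_with_weights, classes, vocab):
    """Weighted multinomial Naive Bayes with Laplace smoothing.

    docs_with_weights: list of (tokens, {class: weight}) pairs; labeled
    documents carry weight 1.0 for their class, unlabeled ones carry
    the probabilistic class weights assigned by the previous model.
    """
    word_counts = {c: Counter() for c in classes}
    for tokens, weights in docs_with_weights:
        for c, w in weights.items():
            for t in tokens:
                word_counts[c][t] += w
    model = {}
    for c in classes:
        total = sum(word_counts[c].values())
        model[c] = {t: (word_counts[c][t] + 1) / (total + len(vocab)) for t in vocab}
    return model

def posteriors(tokens, model):
    """Normalized class probabilities for a document (equal priors assumed)."""
    logp = {c: sum(math.log(model[c][t]) for t in tokens if t in model[c]) for c in model}
    m = max(logp.values())
    expd = {c: math.exp(v - m) for c, v in logp.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}

def em_categorize(labeled, unlabeled, classes, iterations=5):
    """Train on the labeled documents, probabilistically label the unlabeled
    pool, retrain on everything, and iterate."""
    vocab = ({t for tokens, _ in labeled for t in tokens}
             | {t for tokens in unlabeled for t in tokens})
    model = train(labeled, classes, vocab)
    for _ in range(iterations):
        weighted = labeled + [(tokens, posteriors(tokens, model)) for tokens in unlabeled]
        model = train(weighted, classes, vocab)
    return model
```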
Bootstrap learning for Web page categorization is based on the fact that most Web pages have some hyperlinks pointing to them. Each Web page can therefore be described either by its own content or by the content of the hyperlinks that point to it. First, a small number of documents is labeled, each described using both representations. One model is constructed from each representation independently and used to label a large set of unlabeled documents. The few documents for which the prediction was most confident are added to the set of labeled documents, and the whole loop is repeated. In this way, we start with a small set of labeled documents and enlarge it through the iterations, hoping that the initial labels gave good coverage of the problem space.
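This bootstrap loop can be sketched as follows. The per-view "classifier" here is just seed-word overlap, a deliberately crude stand-in for a real learner trained on page content and on anchor text; all names and data are illustrative.

```python
def score(tokens, seed_words):
    """Toy confidence measure: fraction of tokens that match a class's seed words."""
    return sum(t in seed_words for t in tokens) / max(len(tokens), 1)

def bootstrap(labeled, unlabeled, rounds=3, per_round=1):
    """labeled:   {class: [(content_tokens, anchor_tokens), ...]}  (mutated in place)
    unlabeled: list of (content_tokens, anchor_tokens) pairs."""
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # "Train" one model per view (page content, anchor text) from the labeled set.
        content_seeds = {c: {t for content, _ in docs for t in content}
                         for c, docs in labeled.items()}
        anchor_seeds = {c: {t for _, anchors in docs for t in anchors}
                        for c, docs in labeled.items()}
        # Score every unlabeled page under both views.
        scored = []
        for page in pool:
            content, anchors = page
            best_c, best_s = None, -1.0
            for c in labeled:
                s = max(score(content, content_seeds[c]), score(anchors, anchor_seeds[c]))
                if s > best_s:
                    best_c, best_s = c, s
            scored.append((best_s, best_c, page))
        # Promote only the most confident predictions into the labeled set.
        scored.sort(key=lambda x: x[0], reverse=True)
        for _, c, page in scored[:per_round]:
            labeled[c].append(page)
            pool.remove(page)
    return labeled
```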
Some work has also been done in the direction of mining the extracted data (Ghani et al., 2000), where information extraction is used to automatically collect information about different companies from the Web, and data-mining methods are then applied to the extracted data. As Web documents are naturally organized in a graph structure through hyperlinks, there are also research efforts on using that graph structure to improve document categorization (Craven & Slattery, 2001), as well as Web search and visualization of the Web.
In the search engine of the future the linear index of Web pages will be replaced by an ontology. This ontology will be a semantic representation of the Web. Within the ontology the pages may be represented by keywords and will also have connections to other pages. These connections will be the relationships between the pages and may also be weighted. Investigation of an individual page's content, the inter-page hypertext links, the position of the page on its server, and search engine discovered relationships would create the ontology (Guha, McCool & Miller, 2003).
Matches will no longer be query keyword to index keyword, but a match of the data warehouse ontology to the search engine ontology. Rather than point-to-point matching, the query is the best fit of one multi-dimensional space upon another (Doan, Madhavan, Dhamankar, Domingos, & Halevy, 2003). The returned page locations are then more specific to the information need of the data warehouse.
Mining of text data, as described in this article, is a fairly wide area, including different methods used to provide computer support for handling text data. Evolving at the intersection of different research areas and existing in parallel with them, text mining draws its methodological inspiration from different fields, while its applications are closely connected to the areas of Web mining (Chakrabarti, 2002) and digital libraries. As many of the already developed approaches provide solutions of reasonable quality, text mining is gaining popularity in applications, and researchers are addressing more demanding problems and approaches that, for instance, go beyond the word-vector representation of text and combine with other areas, such as the semantic Web and knowledge management.
Conclusion
The use of the Web to extend data warehouse knowledge about a domain provides the decision maker with more information than may otherwise be available from the organizational data sources used to populate the data warehouse. The Web pages referenced in the warehouse are derived from the currently available data and knowledge of the data warehouse structure.
Through the Web search process, the data warehouse analyst sifts through the external, distributed Web to find relevant pages. This Web-generated knowledge is added to the data warehouse for decision-maker consideration.
Key Terms

Active Learning: Learning modules that support active learning select the best examples for class labeling and training without depending on a teacher's decision or random sampling.

Document Categorization: A process that assigns one or more of the predefined categories (labels) to a document.

Document Clustering: An unsupervised learning technique that partitions a given set of documents into distinct groups of similar documents based on similarity or distance measures.

EM Algorithm: An iterative method for estimating maximum likelihood in problems with incomplete (or unlabeled) data. The EM algorithm can be used for semi-supervised learning (see below), since it is a form of clustering algorithm that clusters the unlabeled data around the labeled data.
Fuzzy Relation: In fuzzy relations, degrees of association between objects are represented not as crisp relations but as membership grades, in the same manner as degrees of set membership are represented in a fuzzy set.

Information Extraction: A process of extracting data from text, commonly used to fill fields of a database based on text documents.

Semi-Supervised Clustering: A variant of unsupervised clustering that incorporates external knowledge: the clustering process is performed under various kinds of user constraints or domain knowledge.
Semi-Supervised Learning: A variant of supervised learning that uses both labeled and unlabeled data for training. Semi-supervised learning attempts to provide a more precisely learned classification model by augmenting labeled training examples with information exploited from unlabeled data.

Supervised Learning: A machine learning technique for inductively building a classification model (or function) over a given set of classes from a set of training (pre-labeled) examples.
Text Classification: The task of automatically assigning a set of text documents to a set of predefined classes. Recent text classification methods adopt supervised learning algorithms such as Naïve Bayes and support vector machines.
Text Summarization: A process of generating summary from a longer text, usually based on extracting keywords or sentences or generating sentences.
Topic Hierarchy (Taxonomy): Topic hierarchy in this chapter is a formal hierarchical structure for orderly classification of textual information. It hierarchically categorizes incoming documents according to topic in the sense that documents at a lower category have increasing specificity.
Topic Identification and Tracking: A process of identifying the appearance of new topics in a stream of data, such as news messages, and tracking the reappearance of a single topic in the stream of text data.

User Profiling: A process for automatic modeling of the user. In the context of Web data, it can be content-based, using the content of the items the user has accessed, or collaborative, using the ways other users access the same set of items. In the context of text mining, we speak of user profiling when using the content of text documents.

Visualization of Text Data: A process of visual representation of text data, where different methods for visualizing data can be used to place the data, usually in two or three dimensions, and draw a picture.