
Text Mining & Web Mining

Paul Chen 2006

The Contents of the Presentation are extracted from the following articles:

Pre-Crime Data Mining, in Investigative Data Mining for Security and Criminal Detection, by Jesus Mena, Butterworth-Heinemann, 2003
Web Page Extension of Data Warehouses, John Wang (ed.), Idea Group Publishing, 2006
Text Mining: Clustering Concepts, in Investigative Data Mining for Security and Criminal Detection, by Jesus Mena, Butterworth-Heinemann, 2003
Effective Databases for Text & Document Management, Shirley A. Becker (ed.), Idea Group Publishing, 2003

Topics

1. Information Mining, Text Mining and Knowledge Management
2. Text Mining Technologies & Applications
3. Text Mining Methods
4. Text Mining for Security and Criminal Detection
5. Text Mining Tools
6. Web Mining - An Overview
7. Web Mining and Information Retrieval
8. Web Page Extension of Data Warehouses
9. Future Trends
10. Conclusion
11. Key Terms

Topic 1: Information Mining, Text Mining and Knowledge Management


In the current and emerging competitive and highly dynamic business environment, only the most competitive companies will achieve sustained market success. In order to capitalize on business opportunities, these organizations will distinguish themselves by their capacity to leverage information about their marketplace, customers, and operations. A central part of this strategy for long-term sustained success will be an active information repository - an advanced data warehouse in which information from various applications or parts of the business is coalesced and understood.

Information Mining
The shortest path from complex data to knowledge discovery is information mining; the term is used instead of data mining to reflect the rich variety of forms that information required for business intelligence can take. Information mining implies using powerful and sophisticated tools to do the following:

- Uncover associations, patterns, and trends
- Detect deviations
- Group and classify information
- Develop predictive models

Information Mining
From a technical perspective, the real keys to successful information mining are its algorithms: complex mathematical processes that compare and correlate data. Algorithms enable an information mining application to determine who the best customers for the business are or what they like to buy. They can also determine at what time of day or in what combinations they buy, or how an organization can optimize inventory, pricing, and merchandising in order to retain these customers and cause them to buy more, at increased profit margins. A large volume of information is stored in non-numeric forms: documents, images, and video files.

Text Mining and Knowledge Management


Text mining is a subset of information mining technology that, in turn, is a component of a more general category, Knowledge Management (KM). Knowledge, in this case, refers to the collective expertise, experience, know-how, and wisdom of an organization. In the business world, knowledge is represented not only by the structured data found in traditional databases, but also by a wide variety of unstructured sources such as word-processing documents, memos and letters, e-mail messages, news feeds, Web pages, and so forth.

Text Mining and Knowledge Management


Unlike data mining, text mining works with information stored in an unstructured collection of text documents. Specifically, online text mining refers to the process of searching through unstructured data on the Internet and deriving meaning from it. Text mining goes beyond applying statistical models to data files; it uncovers relationships in a text collection and leverages the creativity of the knowledge worker to explore these relationships and discover new knowledge.

Research Fields Related To Text Mining


Text mining is an interdisciplinary area that involves at least the following key research fields:

1. Machine Learning and Data Mining (Hand et al., 2001; Mitchell, 1997; Witten & Frank, 1999): Provide techniques for data analysis with varying knowledge representations and large amounts of data.
2. Statistics and Statistical Learning (Hastie et al., 2001): Contribute to data analysis in general in the context of text mining (Duda et al., 2000).
3. Information Retrieval (van Rijsbergen, 1979): Provides techniques for text manipulation and retrieval mechanisms.

Research Fields Related To Text Mining

4. Natural Language Processing (Manning & Schutze, 2001): Provides techniques for analyzing natural language. Some aspects of text mining involve the development of models for reasoning about new text documents, based on words, phrases, linguistics, and grammatical properties of the text, as well as extracting information and knowledge from large amounts of text documents.

Topic 2: Text Mining Technologies & Applications


There are two key technologies that make online text mining possible:

Internet Searching (Web Mining) - Internet searching has been around for quite a few years; Yahoo, AltaVista, and Excite are three of the earliest search engines. Search engines (and discovery services) operate by indexing the content of a particular Web site and allowing users to search those indexes. Although useful, the first generation of these tools was often wrong because it did not correctly index the content it retrieved. Advances in text mining applied to Internet searching resulted in online text mining, representing the new generation of Internet search tools. With these products, users can obtain more relevant information by processing a smaller number of links, pages, and indexes.

Text Mining Technologies - Text Analysis

Text Analysis (Text Mining) - Text analysis has been around longer than Internet searching. Indeed, scientists have been trying to make computers understand natural languages for decades; text analysis is an integral part of these efforts. The automatic analysis of text information can be used for several different general purposes:
1. To provide an overview of the contents of a large document collection; for example, finding significant clusters of documents in a customer feedback collection could indicate where a company's products and services need improvement.

2. To identify hidden structures between groups of objects; this may help to organize an intranet site so that related documents are all connected by hyperlinks.

Text Mining Technologies - Text Analysis
3. To increase the efficiency and effectiveness of a search process in finding similar or related information; for example, to search articles from a news service and discover all unique documents that contain hints of possible trends or technologies that have not yet been mentioned in earlier articles.
4. To detect duplicate documents in a collection.

Text Mining Applications


1. E-mail management - A popular use of text analysis is message routing, in which the computer reads the message to decide who should deal with it (spam control is another good example).
2. Document management - By mining documents for meaning as they are put into a document repository, a company can establish a detailed index that allows relevant documents to be located at any time.
3. Automated help desk - Some companies use text mining to respond to customer inquiries; customers' letters and e-mails are processed by a text mining application.

Text Mining Applications


4. Market research - A market researcher can use online text mining to gather statistics on the occurrences of certain words, phrases, concepts, or themes on the World Wide Web. This information can be useful for establishing market demographics and demand curves.
5. Business intelligence gathering - This is the most advanced use of text mining (see next slide).

Business Intelligence Gathering - Semantic Networks and Other Techniques


A key element of building an advanced system for textual information analysis, summarization, and search is the development of a Semantic Network for the investigated text. A Semantic Network is a set of the most significant concepts (words and word combinations) derived from the analyzed texts, along with the semantic relationships between these concepts in the text. A semantic network provides a concise and very accurate summary of the analyzed text.
Other techniques are used as well. For example, Cambio uses absolute positioning, pattern recognition, and fixed and floating tags. The SemioMap software extracts all relevant phrases from the text collection; it builds a lexical network of co-occurrences by grouping related phrases and enhancing the most salient features of these groupings.

Topic 3: Text-Mining Methods

Text Document Categorization is used when a set of predefined content categories, such as arts, business, computers, games, health, recreation, science, and sport, is provided, as well as a set of documents labeled with those categories. The task is to classify previously unseen text documents by assigning each document one or more of the predefined categories. This usually is performed by representing documents as word-vectors and using documents that already have been assigned the categories to generate a model for assigning content categories to new documents (Jackson & Moulinier, 2002; Sebastiani, 2002). The categories can be organized into an ontology (e.g., the MeSH ontology for medical subject headings or the DMoz hierarchy of Web documents).
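As an illustration of the word-vector approach described above, the following sketch (one possible implementation, not taken from the cited works) uses the scikit-learn library to assign a predefined category to a previously unseen document; the category names and example texts are invented.

# Minimal document-categorization sketch: word vectors plus a trained model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "stock markets rallied as quarterly earnings beat forecasts",
    "the championship game went into overtime last night",
    "researchers sequenced the genome of a rare plant species",
]
train_labels = ["business", "sport", "science"]

# Represent each labeled document as a weighted word vector.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)

# Train a simple categorization model on the labeled examples.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Assign a predefined category to a previously unseen document.
new_doc = ["the striker scored twice in the final match"]
print(model.predict(vectorizer.transform(new_doc)))  # e.g. ['sport']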

Text-Mining Methods

Document Clustering (Steinbach et al., 2000) is based on an arbitrary data clustering algorithm adapted for text data by representing each document as a word vector. The similarity of two documents is commonly measured by the cosine similarity between the word vectors representing the documents. The same similarity measure is also used in document categorization for finding a set of the most similar documents.
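A minimal sketch of the cosine similarity measure mentioned above, assuming a simple bag-of-words representation built with Python's standard library (the two example sentences are invented):

import math
from collections import Counter

def word_vector(text):
    # Bag-of-words: map each token to its frequency in the document.
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    # cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

d1 = word_vector("text mining finds patterns in document collections")
d2 = word_vector("document clustering groups similar text documents")
print(cosine_similarity(d1, d2))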

Text-Mining Methods

Visualization of text data is a method used to obtain early measures of data quality, content, and distribution (Fayyad et al., 2001). For instance, by applying document visualization, it is possible to get an overview of a Web site's content or of a document collection. One form of text visualization is based on document clustering (Grobelnik & Mladenic, 2002): the documents are first represented as word vectors, and the K-means clustering algorithm is performed on the set of word vectors. The obtained clusters are then represented as nodes in a graph, where each node is described by the set of most characteristic words in the cluster. Similar nodes, as measured by the cosine similarity of their word vectors, are connected by an edge in the graph. When such a graph is drawn, it provides a visual representation of the document set.
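The sketch below illustrates this clustering-based visualization, assuming a recent scikit-learn for K-means; the documents, the number of clusters, and the similarity threshold for drawing an edge are all arbitrary choices made for the example.

from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "stock prices and quarterly earnings",
    "interest rates and market volatility",
    "the team won the championship final",
    "the coach praised the players after the match",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

# Describe each cluster node by its most characteristic words (heaviest centroid weights).
nodes = []
for centroid in km.cluster_centers_:
    top = centroid.argsort()[::-1][:3]
    nodes.append([terms[i] for i in top])

# Connect similar cluster nodes by an edge when their centroids are close enough.
sims = cosine_similarity(km.cluster_centers_)
edges = [(i, j) for i, j in combinations(range(len(nodes)), 2) if sims[i, j] > 0.1]

print(nodes)
print(edges)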

Text-Mining Methods
Text Summarization is often applied as a second stage of document retrieval in order to help the user get an idea of the content of the retrieved documents. Research in information retrieval has a long tradition of addressing the problem of text summarization; the first reported attempts, in the 1950s and 1960s, exploited properties such as the frequency of words in the text. When dealing with text, especially in different natural languages, properties of the language can be a valuable source of information; this brought methods from natural language processing research into the text summarization of the late 1970s. Since humans are good at making summaries, we can also use examples of human-generated summaries to learn something about the underlying process by applying machine learning and data-mining methods, a popular approach in the 1990s.

Text-Mining Methods-Text Summarization


There are several ways to provide a text summary (Mani & Maybury, 1999). The simplest but still very effective way is to provide keywords that help capture the main topics of the text, either for human understanding or for further processing, such as indexing and grouping of documents, books, pictures, and so forth. As text is usually composed of sentences, we can also talk about summarization by highlighting or extracting the most important sentences, a form of summarization frequently found in human-generated summaries. A more sophisticated way of summarizing is to generate new sentences based on the whole text, as used, for instance, by humans when writing book reviews.
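A minimal sketch of the sentence-extraction approach described above, scoring sentences by the frequency of the words they contain; the scoring scheme and the example text are simplifications chosen only for illustration.

import re
from collections import Counter

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    # Score each sentence by the summed frequency of its words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))
    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)

text = ("Text mining uncovers relationships in document collections. "
        "Summarization extracts the most important sentences. "
        "Frequent words often indicate the main topics of a text.")
print(summarize(text, n_sentences=2))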

Topic 4: Text Mining for Security and Criminal Detection

The explosion of the amount of data generated from government and corporate databases, e-mails, Internet survey forms, phone and cellular records, and other communications has led to the need for new pattern-recognition technologies, including the need to extract concepts and keywords from unstructured data via text mining tools using unique clustering techniques. Based on a field of AI known as natural language processing (NLP), text mining tools can capture critical features of a document's content based on the analysis of its linguistic characteristics. One of the obvious applications for text mining is monitoring multiple online and wireless communication channels for the use of selected keywords, such as anthrax or the names of individual or groups of suspects. Patterns in digital textual files provide clues to the identity and features of criminals, which investigators can uncover via the use of this evolving genre of special text mining tools.

Text Mining for Security and Criminal Detection

Text mining has typically been used by corporations to organize and index internal documents, but the same technology can be used to organize criminal cases by police departments to institutionalize the knowledge of criminal activities by perpetrators and organized gangs and groups. More importantly, criminal investigators and counter-intelligence analysts can sort, organize, and analyze gigabytes of text during the course of their investigations and inquiries using the same technology and tools. Most of today's crimes are electronic in nature, requiring the coordination and communication of perpetrators via networks and databases, which leave textual trails that investigators can track and analyze. There is an assortment of tools and techniques for discovering key information concepts from narrative text residing in multiple databases in many formats and multiple languages.

Text Mining for Security and Criminal Detection

Text mining tools and applications focus on discovering relationships in unstructured text and can be applied to the problem of searching for and locating keywords, such as names or terms used in e-mails, wireless phone calls, faxes, instant messages, chat rooms, and other methods of human communication. This is unlike traditional data mining, which deals with databases that follow a rigid structure of tables containing records representing specific instances of entities, based on relationships between values in set columns.

Text Mining --Neural Networks

Probably one of the most powerful tools for investigative data miners, in terms of detecting, identifying, and classifying patterns of digital and physical evidence, is the neural network, a technology that has been around for 20 years. Although neural networks were proposed in the late 1950s, it wasn't until the mid-1980s that software became sufficiently sophisticated and computers became powerful enough for actual applications to be developed. During the 1990s, commercial neural network tools and applications developed by such firms as Nestor, NeuralWare, and HNC became reliable enough to enable their widespread use in the financial, marketing, retailing, medical, and manufacturing sectors. Ironically, one of the first and most successful applications was the detection of credit card fraud.

Text Mining -- Neural Networks

Today, however, neural networks are being applied to an increasing number of real-world problems of considerable complexity. Neural networks are good pattern-recognition engines and robust classifiers with the ability to generalize in making decisions about imprecise and incomplete data. Unlike other traditional statistical methods, like regression, they are able to work with a relatively small training sample in constructing predictive models; this makes them ideal in criminal detection situations because, for example, only a tiny percentage of most transactions are fraudulent.

Text Mining -- Neural Networks

A key concept in working with neural networks is that they must be trained, just as a child or a pet must be, because this type of software is really about remembering observations. If provided an adequate sample of fraud or other criminal observations, it will eventually be able to spot new instances of similar crimes. Training involves exposing a set of examples of the transaction patterns to a neural-network algorithm; often thousands of sessions are recycled until the neural network learns the pattern. As a neural network is trained, it gradually becomes skilled at recognizing the patterns of criminal behavior and the features of perpetrators; this is actually done through the adjustment of mathematical formulas that are continuously changing, gradually converging on a set of weights that can be used to detect new criminal behavior or other criminals.

A Neural Net Can Be Trained to Detect Criminal Behavior

Example of Classification Using Neural Induction


Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value expressing the strength of the relationship. The network attempts to mirror the way the human brain works in recognizing patterns by arithmetically combining all the variables of a given data point.
In this way, it is possible to develop nonlinear predictive models that learn by studying combinations of variables and how different combinations of variables affect different data sets.
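A minimal sketch of the forward pass such a network performs, with one hidden layer and made-up weights, just to illustrate how weighted connections combine the input variables; this is not the training procedure itself.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden unit combines every input through a weighted connection.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in hidden_weights]
    # The output unit combines every hidden unit in the same way.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Three input variables for one data point (e.g., scaled transaction features).
inputs = [0.2, 0.9, 0.4]
hidden_weights = [[0.5, -1.2, 0.8], [1.1, 0.3, -0.7]]  # two hidden units
output_weights = [1.4, -0.6]

print(forward(inputs, hidden_weights, output_weights))  # score between 0 and 1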

Text Mining -- Machine Learning

Probably the most important and pivotal technology for profiling terrorists and criminals via data mining is the use of machine-learning algorithms. Machine-learning algorithms are commonly used to segment a database, automating the manual process of searching for and discovering key features and intervals. For example, they can be used to answer such questions as when fraud is most likely to take place or what the characteristics of a drug smuggler are. Machine-learning software can segment a database into statistically significant clusters based on a desired output, such as the identifiable characteristics of suspected criminals or terrorists. Like neural networks, they can be used to find the needles in the digital haystacks. However, unlike neural networks, they can generate graphical decision trees or IF/THEN rules, which an analyst can understand and use to gain important insight into the attributes of crimes and criminals.

Text Mining -- Machine Learning

Machine-learning algorithms, such as CART, CHAID, and C5.0, operate somewhat differently, but the solution is basically the same: They segment and classify the data based on a desired output, such as identifying a potential perpetrator. They operate through a process similar to the game of 20 questions, interrogating a data set in order to discover what attributes are the most important for identifying a potential customer, perpetrator, or piece of fruit. Let's say we have a banana, an apple, and an orange. Which data attribute carries the most information in classifying that fruit? Is it weight, shape, or color? Weight is of little help since 7.8 ounces isn't going to discriminate very much. How about shape? Well, if it is round, we can rule out a banana. However, color is really the best attribute and carries the most information for identifying fruit. The same process takes place in the identification of perpetrators, except in this case an analysis might incorporate hundreds, if not thousands, of data attributes.
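To make the fruit example concrete, the following sketch trains a small decision tree with scikit-learn on invented weight/shape/color attributes and prints the learned IF/THEN structure; the data, the encodings, and the feature names are all hypothetical.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [weight_oz, shape (0=long, 1=round), color (0=yellow, 1=red, 2=orange)]
X = [
    [7.8, 0, 0],  # banana
    [7.5, 1, 1],  # apple
    [7.9, 1, 2],  # orange
    [7.6, 0, 0],  # banana
    [7.7, 1, 1],  # apple
    [8.0, 1, 2],  # orange
]
y = ["banana", "apple", "orange", "banana", "apple", "orange"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The tree plays "20 questions": color splits the classes best here, weight barely helps.
print(export_text(tree, feature_names=["weight", "shape", "color"]))
print(tree.predict([[7.7, 1, 2]]))  # -> ['orange']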

Text Mining -- Machine Learning

Their output can be either in the form of IF/THEN rules or a graphical decision tree, with each branch representing a distinct cluster in a database. They can automate the process of stratification so that known clues can be used to "score" individuals as interactions occur in various databases over time, and predictive rules can "fire" in real time for detecting potential suspects. The rules or "signatures" could be hosted in centralized servers so that, as transactions occur in commercial and government databases, real-time alerts would be broadcast to law enforcement agencies and other point-of-contact users. A scenario might play out as follows:

Machine Learning - An Example


1. An event is observed (INS processes a passport), and a score is generated:
RULE 1: IF social security number issued <= 89121 days ago, THEN target 16% probability. Recommended action: OK, process through.
2. However, if the conditions are different, a low alert is calibrated:
RULE 2: IF social security number issued <= 89121 days ago, AND 2 overseas trips during last 3 months, THEN target 31% probability. Recommended action: Ask for additional ID, report on findings to this system.

Machine Learning - An Example

3. Under different conditions, the alert is elevated:
RULE 3: IF social security number issued <= 89121 days ago, AND 2 overseas trips during last 3 months, AND license type = Truck, THEN target 63% probability. Recommended action: Ask for additional information about destination, report on findings to this system.

Machine Learning - An Example

4. Finally, the conditions warrant an escalated alert and associated action:


RULE 4: IF social security number issued <= 89121 days ago, AND 2 overseas trips during last 3 months, AND license type = Truck, AND wire transfers <= 35, THEN target 71% probability. Recommended action: Detain for further investigation, report on findings to this system.
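A plain-Python sketch of how such a cascade of rules could be evaluated against an event record; the field names, thresholds, and probabilities simply mirror the example rules above (interpreting "2 overseas trips" as at least two) and are not taken from any real system.

def score_event(event):
    # Evaluate the rules from most to least specific; return the first that fires.
    recent_ssn = event["ssn_issued_days_ago"] <= 89121
    trips = event["overseas_trips_last_3_months"] >= 2
    truck = event["license_type"] == "Truck"
    wires = event["wire_transfers"] <= 35

    if recent_ssn and trips and truck and wires:
        return 0.71, "Detain for further investigation"
    if recent_ssn and trips and truck:
        return 0.63, "Ask for additional information about destination"
    if recent_ssn and trips:
        return 0.31, "Ask for additional ID"
    if recent_ssn:
        return 0.16, "OK, process through"
    return 0.0, "No rule fired"

event = {
    "ssn_issued_days_ago": 42000,
    "overseas_trips_last_3_months": 2,
    "license_type": "Truck",
    "wire_transfers": 12,
}
print(score_event(event))  # -> (0.71, 'Detain for further investigation')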

Text Mining --Intelligent Agents: Software Detectives

In a networked environment such as ours, a new entity has evolved: intelligent agent software. Over the past few years, agents have emerged as a new software paradigm; they are in part distributed systems, autonomous programs, and artificial life. The concept of agents is an outgrowth of years of research in the fields of AI and robotics. They represent concepts of reasoning, knowledge representation, and autonomous learning. Agents are automated programs and provide tools for integration across multiple applications and databases, running across open and closed networks. They are a means of retrieving, filtering, managing, monitoring, analyzing, and disseminating information over the Internet, intranets, and other proprietary networks.

What Can Agents Do?

Agents represent a new generation of computing systems and are one of the more recent developments in the field of AI. Agents are specific applications with predefined goals, which can run autonomously; for example, an Internet-based agent can retrieve documents based on user-defined criteria. They can also monitor an environment and issue alerts or go into action based on how they are programmed. In the course of investigative data mining projects, for example, agents can serve the function of software detectives, monitoring, shadowing, recognizing, and retrieving information for analysis and case development or real-time alerts.

Topic 5: Text Mining Tools

The market for text mining products can be divided into two groups: text mining tools and kits for constructing applications with embedded text mining facilities. They vary from inexpensive desktop packages to enterprise-wide systems costing thousands of dollars. The following is a partial list of text analysis software. Keep in mind, this is not an all-encompassing list; however, analysts need to know what these tools do and how they may want to apply them to solve profiling and forensic needs.

Text Mining Tools

Clairvoyance - http://www.claritech.com/
ClearForest - http://www.clearforest.com/
Copernic - http://www.copernic.com/company/index.html
DolphinSearch - http://www.dolphinsearch.com/
dtSearch - http://www.dtsearch.com/
HNC Software - http://www.hnc.com/

Text Mining Tools

IBM - http://www3.ibm.com/software/data/iminer/fortext/index.html
iCrossReader - http://www.insight.com.ru
Klarity - http://www.klarity.com.au
Kwalitan - http://www.gamma.rug.nl
Leximancer - http://www.leximancer.com

Text Mining Tools

Lextek - http://www.languageidentifier.com
Semio - http://www.semio.com/
Temis - http://www.temis-group.com
Text Analyst - http://www.megaputer.com
TripleHop - http://www.triplehop.com/ie
Quenza - http://www.xanalys.com/quenza.html

Text Mining Tools

Readware - http://www.readware.com
VantagePoint - http://www.thevantagepoint.com
VisualText - http://www.textanalysis.com/
Wordstat - http://www.simstat.com/wordstat.htm

Text Mining Products - A Summary

Company - Product(s)
Aptex Software, Inc. - SelectResponse
Autonomy - Agentware
Data Junction - Cambio
Excalibur Technologies Corp. - RetrievalWare
Fulcrum Technologies Inc. - DOCSFulcrum, SearchServer
IBM Corp. - Intelligent Miner for Text
InsightSoft-M - Cross-reader
Intercon System, Ltd. - Dataset
Megaputer Inc. - Text Analyst
Semio Corp. - SemioMap
Verity, Inc. - KeyView, Intranet Spider

Text Mining Products-An Example http://www.autonomy.com


Autonomy (Agentware) offers three kinds of products relating to online text mining:

Knowledge Server - Provides users with a fully automated and precise means of categorizing, cross-referencing, and presenting information.
Knowledge Update - Monitors specified Internet and intranet sites, news feeds, and internal repositories of documents, and creates a personalized report on their contents.
Knowledge Builder - A toolkit that allows companies to integrate Agentware capabilities into their own systems.

Topic 6: Web Mining - An Overview

Data mining researchers have a fertile area in which to develop different systems, using the Internet as a knowledge base or personalizing Web information. The combination of the Internet and data mining has typically been referred to as Web mining, defined by Kosala and Blockeel (2000) as "a converging research area from several research communities, such as Database (DB), Information Retrieval (IR) and Artificial Intelligence (AI), especially from machine learning and Natural Language Processing (NLP)".

Web Mining Categories

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. It has traditionally been divided into three distinct areas, based on which part of the Web is mined: Web content, Web structure, and Web usage. Brief descriptions of these categories are summarized below.

Web Mining-Web Content

1. Web Content Mining: Web content consists of several types of data, such as text, images, audio, video, and metadata, as well as hyperlinks. Web content mining describes the process of information discovery from millions of sources across the World Wide Web. From an information retrieval point of view, Web sites consist of collections of unstructured hypertext documents (Turney, 2002); from a DB point of view, Web sites consist of collections of semi-structured documents (Jeh & Widom, 2004).

Web Mining -Web Structure

2. Web Structure Mining: This approach is interested in the structure of the hyperlinks within the Web itself, that is, the inter-document structure. Web structure mining is inspired by the study of social networks and citation analysis (Chakrabarti, 2002). Several algorithms have been proposed to model the Web topology, such as PageRank (Brin & Page, 1998) from Google, as well as other approaches that add content information to the link structure (Getoor, 2003).

Web Mining - Web Usage

3. Web Usage Mining: Web usage mining focuses on techniques that could predict user behavior while the user interacts with the Web. A first approach maps the usage data of the Web server into relational tables for a later analysis. A second approach uses the log data directly by using special preprocessing techniques (Borges & Levene, 2004).

Topic 7: Web Mining and Information Retrieval

On the Web, there are no standards or style rules; the contents are created by a very heterogeneous set of people in an autonomous way. In this sense, the Web can be seen as a huge amount of unstructured online information. Due to this inherent chaos, the need for systems that aid us in searching for and efficiently accessing information has emerged.

Web Mining and Information Retrieval

When we want to find information on the Web, we usually access it through search services, such as Google (http://www.google.com) or AllTheWeb (http://www.alltheweb.com), which return a ranked list of Web pages in response to our request. A recent study (Gonzalo, 2004) showed that this method of finding information works well when we want to retrieve home pages, Web sites related to corporations, institutions, or specific events, or to find quality portals. However, when we want to explore several pages, relating information from several sources, this approach has some deficiencies: the ranked lists are not conceptually ordered, and information in different sources is not related. The Google model has the following features: crawling the Web, the application of a simple Boolean search, the PageRank algorithm, and an efficient implementation. This model directs us to a Web page; once the page is reached, we are left with the local server's search tools. Nowadays, these tools are very simple, and the search results are poor.

Web Mining and Information Retrieval

Another way to find information is to use Web directories organized by categories, such as Yahoo (http://www.yahoo.com) or the Open Directory Project (http://www.dmoz.org). However, the manual nature of this categorization makes the directories' maintenance too arduous if machine processes do not assist it.

Present and future research tends toward the visualization and organization of results, information extraction over the retrieved pages, and the development of efficient local server search tools. Next, we summarize some of the technologies that can be explored in Web content mining and give a brief description of their main features.

Text Categorization on the Web

The main goal of these methods is to find the nearest category, from a hierarchy of pre-classified categories, for a specific Web page's content. Some relevant works in this approach can be found in Chakrabarti (2003) and Kwon and Lee (2003).

Web Document Clustering

Clustering involves dividing a set of n documents into a specific number of clusters k, so that some documents are similar to other documents in the same cluster and different from those in other clusters. Some examples in this context are Carey, et al. (2003) and Liu, et al. (2002).

Web Mining Stages

In general, Web mining systems can be decomposed into different stages that can be grouped in four main phases: resource access, the task of capturing intended Web documents; information preprocessing, the automatic selection of specific information from the captured resources; generalization, where machine learning or data-mining processes discover general patterns in individual Web pages or across multiple sites; and finally, the analysis phase, or validation and interpretation of the mined patterns. We think that by improving each of the phases, the final system behavior also can be improved.

Web Mining Information Retrieval & Representation

In this presentation, we focus our efforts on Web page representation, which can be associated with the information-preprocessing phase of a general Web-mining system. Several hypertext representations have been introduced in the literature for the different Web mining categories, and they depend on the later use and application. Here, we restrict our analysis to Web-content mining; in addition, hyperlinks and multimedia data are not considered. The main reason to select only the tagged text is to look for special features emerging from the HTML tags, with the aim of developing Web-content mining systems with greater scope and better performance as local server search tools. In this case, the representation of Web pages is similar to the representation of any text.

Web Mining - Information Retrieval & Representation

A model of text must build a machine representation of the world knowledge and, therefore, must involve a natural language grammar. Since we restrict our scope to statistical analyses for Web-page classification, we need to find suitable representations for hypertext that will suffice for our learning
applications.

Web Mining - Information Retrieval & Representation

We carry out a comparison between different representations using the vector space model (Salton et al., 1975), where documents are tokenized using simple rules, such as whitespace delimiters in English, and tokens are stemmed to canonical form (e.g., reading to read). Each canonical token represents an axis in the Euclidean space. This representation ignores the sequence in which words occur and is based on statistics about single independent words. Assuming independence between the words that co-appear in a text or appear as multi-word terms introduces a certain error, but it reduces the complexity of the problem without loss of efficiency. The different representations are obtained using different functions to assign the value of each component in the vector representation. We used a subset of the BankSearch Dataset as the Web document collection (Sinka & Corne, 2002).
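A minimal sketch of this vector space representation, using whitespace tokenization and NLTK's implementation of the Porter stemmer (any other stemmer could be substituted); the example pages are invented.

from collections import Counter
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

def tokenize(text):
    # Whitespace tokenization, lowercasing, and stemming to canonical form.
    return [stemmer.stem(tok) for tok in text.lower().split()]

pages = [
    "reading online banking pages",
    "banks offer online reading material",
]

# Build the vocabulary: each canonical token is an axis of the Euclidean space.
vocabulary = sorted({tok for page in pages for tok in tokenize(page)})

# Represent each page as a vector of term frequencies over that vocabulary.
vectors = []
for page in pages:
    counts = Counter(tokenize(page))
    vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(vectors)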

Web Mining - Information Retrieval & Representation

First, we obtained five representations using well-known functions from the IR environment. All of these are based only on the term frequency in the Web page that we want to represent and on the term frequency in the pages of the collection. Below, we summarize the different evaluated representations with a brief explanation of each.

Information Representations

1. Binary: This is the most straightforward model, also called the set-of-words model. The relevance or weight of a feature is a binary value {0,1}, depending on whether or not the feature appears in the document.
2. Term Frequency (TF): Each term is assumed to have an importance proportional to the number of times it occurs in the text (Luhn, 1957). The weight of a term t in a document d is given by W(d,t) = TF(d,t), where TF(d,t) is the frequency of term t in d.

Information Representations

3. Inverse Document Frequency (IDF): The importance of each term is assumed to be inversely proportional to the number of documents that contain the term. The IDF factor of a term t is given by IDF(t) = log(N / df(t)), where N is the number of documents in the collection and df(t) is the number of documents that contain the term t.
4. TF-IDF: Salton (1988) proposed combining TF and IDF to weight terms. The TF-IDF weight of a term t in a document d is given by W(d,t) = TF(d,t) x IDF(t).

5. WIDF: This is an extension of IDF that incorporates the term frequency over the collection of documents. The WIDF weight is given by W(d,t) = TF(d,t) / Σ_i TF(i,t), where the sum runs over all documents i in the collection.
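The five weighting schemes can be written compactly as functions of the term-frequency counts; the sketch below assumes the per-document Counter representation from the earlier example and is only meant to mirror the formulas above.

import math
from collections import Counter

docs = [
    Counter({"bank": 3, "loan": 1}),
    Counter({"bank": 1, "music": 2}),
    Counter({"music": 4, "concert": 1}),
]
N = len(docs)

def df(t):
    return sum(1 for d in docs if t in d)

def binary(d, t):
    return 1 if t in d else 0

def tf(d, t):
    return d[t]

def idf(t):
    return math.log(N / df(t))

def tf_idf(d, t):
    return tf(d, t) * idf(t)

def widf(d, t):
    # Term frequency in d divided by the term's total frequency over the collection.
    return tf(d, t) / sum(tf(doc, t) for doc in docs)

d = docs[0]
print(binary(d, "bank"), tf(d, "bank"), idf("bank"), tf_idf(d, "bank"), widf(d, "bank"))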

Information Representations

In addition to these five representations, we obtained two other representations that combine several criteria extracted from tagged text, which can be treated differently from other parts of the Web page. Both representations consider more elements than the term frequency to obtain a term's relevance in the Web page content. These two representations are the Analytical Combination of Criteria (ACC) and the Fuzzy Combination of Criteria (FCC).

The Analytical Combination of Criteria (ACC) Vs. Fuzzy Combination of Criteria (FCC)

The difference between them is the way they evaluate and combine the criteria. The first (Fresno & Ribeiro, 2004) uses a linear combination of those criteria, whereas the second (Ribeiro et al., 2002) combines them by using a fuzzy system. A fuzzy reasoning system is a suitable framework for capturing qualitative human expert knowledge and resolving the ambiguity inherent in the reasoning process, embodying knowledge and expertise in a set of linguistic expressions that manage words instead of numerical values. The fundamental cue is that a criterion often evaluates the importance of a word only when it appears in combination with another criterion. Some Web page representation methods that use HTML tags in different ways can be found in Molinari et al. (2003), Pierre (2001), and Yang et al. (2002).

The Combined Criteria in ACC and FCC


The combined criteria in ACC and FCC are summarized below:

- Word Frequency in the Text: Luhn (1957) showed that a statistical analysis of the words in a document provides some clues about its contents. This is the most widely used heuristic in the text representation field.
- Word's Appearance in the Title: Whether the word appears in the title of the Web page, considering that in many cases the document title can be a summary of the content.

The Combined Criteria in ACC and FCC (Contd)

- Position Throughout the Text: In automatic text summarization, a well-known heuristic for extracting sentences that contain important information is to select those that appear at the beginning and at the end of the document (Edmundson, 1969).
- Word's Appearance in Emphasis Tags: Whether or not the word appears in emphasis tags. For this criterion, several HTML tags were selected because they capture the author's intention; the hypothesis is that if a word appears emphasized, it is because the author wants it to stand out.
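As an illustration only (the actual ACC weighting of Fresno & Ribeiro, 2004, is not reproduced here), the sketch below combines the four criteria linearly with made-up coefficients to score a term in a page.

def term_relevance(freq, in_title, position_score, emphasized,
                   weights=(0.4, 0.3, 0.2, 0.1)):
    # Hypothetical linear combination of the four criteria; the coefficients
    # and the [0, 1] scaling of each criterion are arbitrary choices.
    criteria = [
        freq,             # normalized frequency of the term in the page
        1.0 if in_title else 0.0,
        position_score,   # closer to 1.0 near the beginning/end of the text
        1.0 if emphasized else 0.0,
    ]
    return sum(w * c for w, c in zip(weights, criteria))

# A term that is fairly frequent, appears in the title, and is emphasized.
print(term_relevance(freq=0.6, in_title=True, position_score=0.8, emphasized=True))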

Information Representations - Quality Comparison


To compare the quality of the representations, a Web page binary classification system was implemented in three stages:

1. Representation
2. Learning
3. Classification

The selected classes are very different from each other, in order to provide favorable conditions for the learning and classification stages and to show clearly the achievements of the different representation methods.

Representation Stage

The representation stage was carried out as follows. The corpus, the set of documents that generates the vocabulary, was created from 700 pages for each selected class. All the different stemmed words found in these documents generated the vocabulary, as axes in the Euclidean space. We fixed the maximum length of a stemmed word at 30 characters and the minimum length at three characters.

Representation Stage

In order to calculate the values of the vector components for each document, we followed these steps: (a) we eliminated all punctuation marks except some special marks that are used in URLs, e-mail addresses, and multi-word terms; (b) the words in a stoplist used in IR were eliminated from the Web pages; (c) we obtained the stem of each term by using the well-known Porter stemming algorithm; (d) we counted the number of times that each term appeared on each Web page and the number of pages where the term was present; and (e) in order to calculate the ACC and FCC representations, we recorded the position of each term throughout the Web page and whether or not it appeared in emphasis and title tags. In addition, another 300 pages for each class were represented in the same vocabulary to evaluate the system.

Learning Stage
In the learning stage, the class descriptors (i.e., information common to a particular class, but extrinsic to an instance of that class) were obtained from a supervised learning process. Considering the central limit theorem, the word relevance (i.e., the value of each component in the vector representation) in the text content for each class is assumed to be distributed as a Gaussian function with parameters mean μ and variance σ². Then, the density function

p_i(r | c) = (1 / (σ_i,c · sqrt(2π))) · exp( -(r - μ_i,c)² / (2 σ_i,c²) )

gives the probability that a word i, with relevance r, appears in a class c (Fresno & Ribeiro, 2004). The mean and variance are obtained from the two selected sets of examples for each class by a maximum likelihood estimation method.

Classification Stage

Once the learning stage is achieved, a Bayesian classification process is carried out to test the performance of the obtained class descriptors. The optimal classification of a new page d is the class c_j ∈ C for which the probability P(c_j | d) is maximum, where C is the set of considered classes; P(c_j | d) reflects the confidence that c_j holds given the page d. Bayes' theorem states that P(c_j | d) = P(d | c_j) P(c_j) / P(d). Considering the hypothesis of the independence principle, assuming that all pages and classes have the same prior probability, applying logarithms (a non-decreasing monotonic function), and shifting the argument one unit to avoid the discontinuity at x = 0, the most likely class is finally given by

c* = argmax over c_j ∈ C of Σ_i log(1 + p_i(r_i | c_j))

where r_i is the relevance of word i in the representation of page d.
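A compact sketch of this learning-plus-classification scheme, estimating a Gaussian per word and per class and then choosing the class that maximizes the summed log(1 + density); the toy vectors are invented, and only a small variance floor is used in place of proper smoothing.

import math

def fit_class(vectors):
    # Maximum likelihood mean and variance for each vector component (word relevance).
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    variances = [max(sum((x - m) ** 2 for x in col) / n, 1e-6)
                 for col, m in zip(zip(*vectors), means)]
    return means, variances

def density(r, mean, var):
    return math.exp(-(r - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(page, descriptors):
    # Most likely class: argmax over classes of sum_i log(1 + p_i(r_i | c)).
    def score(desc):
        means, variances = desc
        return sum(math.log(1 + density(r, m, v))
                   for r, m, v in zip(page, means, variances))
    return max(descriptors, key=lambda c: score(descriptors[c]))

train = {
    "banking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "music":   [[0.1, 0.7, 0.9], [0.0, 0.8, 0.8]],
}
descriptors = {c: fit_class(v) for c, v in train.items()}
print(classify([0.85, 0.15, 0.05], descriptors))  # -> 'banking'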

Topic 8: Web Page Extension of Data Warehouses

Data warehouses are constructed to provide valuable and current information for decision making. Typically, this information is derived from the organization's functional databases. The data warehouse then provides a consolidated, convenient source of data for the decision maker. However, the available organizational information may not be sufficient to come to a decision. Information external to the organization is also often necessary for management to arrive at strategic decisions. Such external information may be available on the World Wide Web and, when added to the data warehouse, extends decision-making power.

Introduction

The Web can be considered as a large repository of data. This data is on the whole unstructured and must be gathered and extracted to be made into something valuable for the organizational decision maker. To gather this data and place it into the organization's data warehouse requires an understanding of the data warehouse metadata and the use of Web mining techniques.

Introduction

Typically when conducting a search on the Web, a user initiates the search by using a search engine to find documents that refer to the desired subject. This requires the user to define the domain of interest as a keyword or a collection of keywords that can be processed by the search engine. The searcher may not know how to break the domain down, thus limiting the search to the domain name. However, even given the ability to break down the domain and conduct a search, the search results have two significant problems. One, Web searches return information about a very large number of documents. Two, much of the returned information may be marginally relevant or completely irrelevant to the domain. The decision maker may not have time to sift through results to find the meaningful information.

Background

Web search queries can be related to each other by the results returned (Wu & Crestani, 2004; Glance, 2000). This knowledge of common results to different queries can assist a new searcher in finding desired information. However, it assumes domain knowledge sufficient to develop a query with keywords, and does not provide corresponding organizational knowledge.

Background

Some Web search engines find information by categorizing the pages in their indexes. One of the first to create a structure as part of their Web index was Yahoo! (http://www.yahoo.com). Yahoo! has developed a hierarchy of documents, which is designed to help users find information faster. This hierarchy acts as a taxonomy of the domain. Yahoo! helps by directing the searcher through the domain. Again, there is no organizational knowledge to put the Web pages into a local context, so the documents must be accessed and assimilated by the searcher.

Web Search For Warehouse

Experts within a domain of knowledge are familiar with the facts and the organization of the domain. In the warehouse design process, the analyst extracts from the expert the domain organization. This organization is the foundation for the warehouse structure and specifically the dimensions that represent the characteristics of the domain.

Web Search For Warehouse


In the Web search process, the data warehouse analyst can use the warehouse dimensions as a starting point for finding more information on the World Wide Web. These dimensions are based on the needs of decision makers and the purpose of the warehouse; they represent the domain organization. The values that populate the dimensions are pieces of the knowledge about the warehouse's domain. These organizational and knowledge facets can be combined to create a dimension-value pair, which is a special case of a taxonomy tree (Kerschberg, Kim & Scime, 2003; Scime & Kerschberg, 2003). This pair is then used as keywords to search the Web for additional information about the domain and this particular dimension value.

Web Search For Warehouse


The pages retrieved as a result of dimension-value pair based Web searches are analyzed to determine relevancy. The metadata of the relevant pages is added to the data warehouse as an extension of the dimension. Keeping the warehouse current with frequent Web searches keeps the knowledge fresh and allows decision makers access to the warehouse and Web knowledge in the domain.

Web Page Collection And Warehouse Extension


1. Select Dimensions: The data warehouse analyst selects the dimension attributes that are likely to have relevant data about their values on the Web. For example, the dimension city would be chosen, as most cities have Web sites.
2. Extract Dimension-Value Pair: As values are added to the selected dimensions, the dimension label and value are extracted as a dimension-value pair and converted into a keyword string. The value Buffalo for the dimension city becomes the keyword string "city Buffalo".
3. Keyword String Query: The keyword string is sent to a search engine (for example, Google).


Web Page Collection And Warehouse Extension


4. Search the World Wide Web: The keyword string is used as a search engine query, and the resulting hit lists containing Web page meta-data are returned. This meta-data typically includes the page URL, title, and some summary information. In our example, the first result is the home page of the City of Buffalo in New York State; on the second page of results is the City of Buffalo, Minnesota.
5. Review Results Lists: The data warehouse analyst reviews the resulting hit list for possibly relevant pages. Given the large number of hits (over 5 million for "city Buffalo"), the analyst must limit consideration of pages to a reasonable amount.

Web Page Collection And Warehouse Extension

6. Select Web Documents: The Web pages selected are those that may add knowledge to the data warehouse. This may be new knowledge or knowledge that extends the warehouse. Because the analyst knows that the city of interest to the data warehouse is Buffalo, New York, he only considers the appropriate pages.
7. Relevancy Review: The analyst reviews the selected pages to ensure that they are relevant to the intent of the warehouse attribute. The meta-data of the relevant Web pages is extracted during this relevancy review. The meta-data includes the Web page URL, title, date retrieved, date created, and summary; it may come from the search engine results list. For the Buffalo home page, this meta-data is shown in Figure 2.
8. Add Meta-Data: The meta-data for the page is added as an extension to the data warehouse. It is added as an extension to the city dimension, creating a snowflake-like schema for the data warehouse.
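A schematic sketch of steps 2 through 8 above; the search call, the table layout, and the relevancy check are placeholders (no real search engine API is used), intended only to show how a dimension-value pair turns into stored Web page meta-data.

import sqlite3
from datetime import date

def search_web(keyword_string):
    # Placeholder for a real search engine query (steps 3-4); returns invented hits.
    return [{"url": "http://www.example-buffalo-ny.gov/",
             "title": "City of Buffalo - Home Page",
             "summary": "Official site of the City of Buffalo, New York."}]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_dim (city TEXT)")
conn.execute("""CREATE TABLE city_web_ext (   -- snowflake-like extension of the dimension
                   city TEXT, url TEXT, title TEXT,
                   summary TEXT, date_retrieved TEXT)""")
conn.execute("INSERT INTO city_dim VALUES ('Buffalo')")

for (value,) in conn.execute("SELECT city FROM city_dim").fetchall():
    keyword_string = f"city {value}"                      # dimension-value pair (step 2)
    for hit in search_web(keyword_string):
        if "Buffalo" in hit["title"]:                     # analyst relevancy review (steps 5-7)
            conn.execute("INSERT INTO city_web_ext VALUES (?, ?, ?, ?, ?)",
                         (value, hit["url"], hit["title"],
                          hit["summary"], date.today().isoformat()))  # step 8

print(conn.execute("SELECT * FROM city_web_ext").fetchall())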

Topic 9: Future Trends

Nowadays, the main shortcoming of systems that aid us in searching for and accessing information is revealed when we want to explore several pages, relating information from several sources. Future work must find regularities in the HTML vocabulary to improve the response of local server search tools, combining them with other aspects such as hyperlink regularities.

Future Trends

A number of researchers are working intensively in the area of text data mining, mainly guided by the need to develop new methods capable of handling interesting real-world problems. One such problem, recognized in the past few years, is reducing the amount of manual work needed for hand-labeling the data. Namely, most of the approaches for automatic document filtering, categorization, user profiling, information extraction, and text tagging require a set of labeled (pre-categorized) data describing the addressed concepts. Using unlabeled data and bootstrap learning are two directions giving research results that enable an important reduction in the needed amount of hand labeling.

Future Trends

In document categorization using unlabeled data, we need a small number of labeled documents and a large pool of unlabeled documents (e.g., classify an article in one of the news groups, classify Web pages). The approach proposed by Nigam, et al. (2001) can be described as follows. First, model the labeled documents and use the trained model to assign probabilistically weighted categories to all unlabeled documents. Then, train a new model using all the documents and iterate until the model remains unchanged. It can be seen that the final result depends heavily on the quality of the categories assigned to the small set of hand-labeled data, but it is much easier to hand label a small set of examples with good quality than a large set of examples with medium quality.
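A simplified self-training sketch in the spirit of this approach (not the exact EM procedure of Nigam et al.), assuming scikit-learn and a small invented corpus; in each round the model's own predictions on the unlabeled pool are fed back as training labels until they stop changing.

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled = ["stocks fell on weak earnings", "the team won the cup final"]
labels = ["business", "sport"]
unlabeled = ["the team scored late in the final", "weak earnings hit the stocks hard"]

# Build the vocabulary over all documents, labeled and unlabeled.
vectorizer = CountVectorizer().fit(labeled + unlabeled)
X_lab = vectorizer.transform(labeled)
X_unl = vectorizer.transform(unlabeled)

model = MultinomialNB().fit(X_lab, labels)
previous = None
while True:
    guessed = list(model.predict(X_unl))      # assign categories to unlabeled documents
    if guessed == previous:                   # iterate until the labels stop changing
        break
    previous = guessed
    model = MultinomialNB().fit(sp.vstack([X_lab, X_unl]), labels + guessed)

print(dict(zip(unlabeled, previous)))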

Future Trends

Bootstrap learning for Web page categorization is based on the fact that most of the Web pages have some hyperlinks pointing to them. Using that, we can describe each Web page either by its content or by the content of the hyperlinks that point to it. First, a small number of documents is labeled, and each is described, using the two descriptions. One model is constructed from each description independently and used to label a large set of unlabeled documents. A few of those documents for which the prediction was the most confident are added to the set of labeled documents, and the whole loop is repeated. In this way, we start with a small set of labeled documents, enlarging it through the iterations and hoping that the initial labels were a good coverage of the problem space.

Future Trends

Some work has also been done in the direction of mining the extracted data (Ghani et al., 2000), where information extraction is used to automatically collect information about different companies from the Web, and data-mining methods are then used on the extracted data. As Web documents are naturally organized in a graph structure through hyperlinks, there are also research efforts on using that graph structure to improve document categorization (Craven & Slattery, 2001), Web search, and visualization of the Web.

Future Trends

In the search engine of the future the linear index of Web pages will be replaced by an ontology. This ontology will be a semantic representation of the Web. Within the ontology the pages may be represented by keywords and will also have connections to other pages. These connections will be the relationships between the pages and may also be weighted. Investigation of an individual page's content, the inter-page hypertext links, the position of the page on its server, and search engine discovered relationships would create the ontology (Guha, McCool & Miller, 2003).

Future Trends

Matches will no longer be query keyword to index keyword, but a match of the data warehouse ontology to the search engine ontology. Rather than point-to-point matching, the query is the best fit of one multi-dimensional space upon another (Doan, Madhavan, Dhamankar, Domingos, & Halevy, 2003). The returned page locations are then more specific to the information need of the data warehouse.

Topic 10: Conclusion

Mining of text data, as described in this presentation, is a fairly wide area, including different methods used to provide computer support for handling text data. Evolving at the intersection of different research areas and existing in parallel with them, text mining gets its methodological inspiration from different fields, while its applications are closely connected to the areas of Web mining (Chakrabarti, 2002) and digital libraries. As many of the already developed approaches provide solutions of reasonable quality, text mining is gaining popularity in applications, and researchers are addressing more demanding problems and approaches that, for instance, go beyond the word-vector representation of text and combine with other areas, such as the semantic Web and knowledge management.

Conclusion - Web Searching

The use of the Web to extend data warehouse knowledge about a domain provides the decision maker with more information than may otherwise be available from the organizational data sources used to populate the data warehouse. The Web pages referenced in the warehouse are derived from the currently available data and from knowledge of the data warehouse structure.
Through the Web search process, the data warehouse analyst sifts through the external, distributed Web to find relevant pages. This Web-generated knowledge is added to the data warehouse for decision-maker consideration.

Conclusion Web Mining


The results of a Web page representation comparison are very dependent on the selected collection and the classification process. In a binary classification with the proposed learning and classification algorithms, the best representation was the binary one, because it obtained the best F-measures in all the cases. We can expect that when the number of classes is increased, this F-measure value will decrease, and the rest of the representations will improve their global results. Finally, apart from the binary function, the ACC and FCC representations have better F-measure values than the rest of the representations inherited from the IR field when the vector sizes are the smallest. This can be the result of considering the tagged text in a different way, depending on the tag semantics, and capturing more information than when only the frequency is considered. A deeper exploration must be carried out to find hidden information behind this HTML vocabulary.

Topic 11: Key Terms

Active Learning: Learning modules that support active learning select the best examples for class labeling and training without depending on a teacher's decision or random sampling.

Document Categorization: A process that assigns one or more of the predefined categories (labels) to a document.

Document Clustering: An unsupervised learning technique that partitions a given set of documents into distinct groups of similar documents based on similarity or distance measures.

EM Algorithm: An iterative method for estimating maximum likelihood in problems with incomplete (or unlabeled) data. The EM algorithm can be used for semi-supervised learning (see below), since it is a form of clustering algorithm that clusters the unlabeled data around the labeled data.

Key Terms

Fuzzy Relation: In fuzzy relations, degrees of association between objects are represented as membership grades rather than as crisp relations, in the same manner as degrees of set membership are represented in a fuzzy set.

Information Extraction: A process of extracting data from text, commonly used to fill fields of a database based on text documents.

Semi-Supervised Clustering: A variant of unsupervised clustering techniques that makes use of external knowledge. Semi-supervised clustering performs the clustering process under various kinds of user constraints or domain knowledge.

Key Terms

Semi-Supervised Learning: A variant of supervised learning that uses both labeled data and unlabeled data for training. Semi-supervised learning attempts to provide a more precisely learned classification model by augmenting labeled training examples with information exploited from unlabeled data.

Supervised Learning: A machine learning technique for inductively building a classification model (or function) over a given set of classes from a set of training (pre-labeled) examples.

Key Terms

Text Classification: The task of automatically assigning a set of text documents to a set of predefined classes. Recent text classification methods adopt supervised learning algorithms such as Naïve Bayes and support vector machines.

Text Summarization: A process of generating a summary from a longer text, usually based on extracting keywords or sentences or on generating new sentences.

Topic Hierarchy (Taxonomy): A topic hierarchy is a formal hierarchical structure for the orderly classification of textual information. It hierarchically categorizes incoming documents according to topic, in the sense that documents in a lower category have increasing specificity.

Key Terms

Topic Identification and Tracking: A process of identifying the appearance of new topics in a stream of data, such as news messages, and tracking the reappearance of a single topic in the stream of text data.

User Profiling: A process for automatic modeling of the user. In the context of Web data, it can be content-based, using the content of the items that the user has accessed, or collaborative, using the ways other users access the same set of items. In the context of text mining, we talk about user profiling when using the content of text documents.

Visualization of Text Data: A process of visual representation of text data, where different methods for visualizing data can be used to place the data, usually in two or three dimensions, and draw a picture.
