
International Journal of Computational Intelligence and Information Security, August 2011, Vol. 2, No. 8

A Novel Approach for Automatic Text Categorization with Classic Summarization Technique
Suneetha Manne and Dr. S. Sameen Fatima
Assistant Professor, Department of IT, VR Siddhartha Engineering College, Vijayawada
HOD & Professor, Department of CSE, Osmania University, Hyderabad
suneethamanne74@gmail.com

Abstract
The emergence of the Internet has made a profound change in the lives of many enthusiasts, innovators and researchers. The information available on the Internet has opened the doors of knowledge and discovery, leading to a new information era. The World Wide Web has succeeded to the extent of delivering whatever information one could need, and various data storage centers came into being. Unfortunately, services such as Google, Yahoo, Bing, Ask, etc. often return content that is irrelevant to the topic supplied by the user, and it is a time-consuming task for researchers to recognize which web content is relevant to their area of interest. A number of Text Categorization techniques have been developed to determine the class under which a given document falls. This paper presents an implementation of Information Extraction and Categorization combined with summarization techniques. The derived model is tested for accuracy by applying various categorization techniques.

Keywords: Text Categorization, Text Preprocessing, Text Summarization, Term Frequency.

1. Introduction
It is observed that there is enormous growth in the addition of new documents on the web. These documents come from various domains and form a huge heterogeneous collection. In such an enormous collection of information, identifying the document relevant to a specific topic is a challenging task in real-time applications: searching for a relevant document is expensive and incurs considerable computational and time complexity. To address this issue, the analysis may be augmented with experiment-based approaches. Text Categorization is one such approach: it identifies the class to which a web document belongs by analyzing the dataset and observing its behavior. This generally involves learning each class and its representation from a set of documents in that class. To understand this better, we first need to focus on the term Information Retrieval.

1.1 Information Retrieval


Information Retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections, usually stored on computers or in data warehouses.

Figure 1: Information availability on Web in various forms


Due to the abundance of text information, information retrieval has found many applications. There exist many information retrieval systems, such as online library catalog systems, online document management systems, and the more recently developed web search engines. A typical information retrieval problem is to locate relevant documents in a document collection based on the user's query, which is often a set of keywords describing an information need, although it could also be an example of a relevant document. Text Categorization is one of the fields of Data Mining and Information Retrieval that uses predefined categories to label new documents.

1.2 Text Categorization


Text Categorization is the process of identifying the class to which a text document belongs. This categorization process has many applications such as document routing, document management, and document dissemination [1]. Traditionally, each incoming document is analyzed and categorized manually by domain experts based on the content of the document. A large amount of human resources has to be spent on carrying out such a task. To facilitate the process of text categorization, automatic categorization schemes are required.


Figure 2: Text Categorization operation

2. Background
Over the past two decades, the automatic management of electronic documents has been a major research field in computer science. Text documents have become the most common type of information repository, especially with the increased popularity of the Internet and the World Wide Web (WWW). Internet and web sources such as web pages, emails, newsgroup messages and internet news feeds contain millions or even billions of text documents. Content-based document management tasks have therefore gained a prominent status in the information systems and information retrieval fields, due to the increased availability of documents in digital form [2] [3].

Until the late 1980s, the text classification task was based on Knowledge Engineering, where a set of rules was defined manually to encode expert knowledge on how to classify documents under the given categories. Since knowledge engineering requires human intervention, researchers in the 1990s proposed many machine learning techniques to manage text documents automatically [4]. The advantages of a machine learning based approach are that its accuracy is comparable to that achieved by human experts and that no intervention from either knowledge engineers or domain experts is needed for the construction of a document management tool [5]. Many text mining methods such as document retrieval, clustering, classification, routing and filtering are used for effective management of text documents; in this work we concentrate only on classification of text documents. A classifier can be built by systematically training on a set of training documents D, where all of the documents belonging to D are labelled [15].

Text classification presents many challenges and difficulties. First, it is difficult to capture high-level semantics and abstract concepts of natural languages from just a few keywords. Furthermore, semantic analysis, a major step in designing an information retrieval system, is not well understood, although some techniques have been successfully applied to a limited number of domains. Second, documents on the web are characterized by high dimensionality (thousands of features) and variable length, content and quality. These place both efficiency and accuracy demands on classification systems [5].

Since a classifier cannot understand a document in its raw format, a document has to be converted into a standard representation. Extensive work has been carried out in the literature on text representation techniques and text classification methods [9]. However, a software developer must have a complete knowledge of all existing representation schemes and classifiers in order to select the representation scheme and classifier that best suit his purpose or application. In this context, we reiterate that our focus is only on the widely accepted representation schemes and classifiers.

Automatic text summarization involves reducing a text document or a larger corpus of multiple documents into a short set of words or a paragraph that conveys the main meaning of the text. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate; such a summary might contain words not explicitly present in the original. State-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods.

3. Text Representation
In automatic text classification, it has been shown that the term is the best unit for text representation and classification [6]. Though a text document expresses a vast range of information, it lacks the imposed structure of a traditional database. Therefore, unstructured data, particularly free-running text, has to be transformed into structured data; many pre-processing techniques have been proposed in the literature for this purpose [7, 8]. After converting unstructured data into structured data, we need an effective document representation model to build an efficient classification system. The Bag of Words (BoW) is one of the basic methods of representing a document: it forms a vector representing the document using the frequency count of each term in the document. This method of document representation is called the Vector Space Model (VSM) [9]. Unfortunately, the BoW representation scheme has its own limitations, among them the high dimensionality of the representation, the loss of correlation with adjacent words and the loss of semantic relationships that exist among the terms in a document [10]. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms and improve the performance of text classification; in our work we used a binary representation for the given document.

The following formats are available as web content and need to be organized.

HTML format: Hypertext Markup Language is the predominant markup language for web pages. HTML tags normally come in pairs like <h1> and </h1>; the first tag in a pair is the start tag and the second is the end tag. The purpose of a web browser is to read HTML documents and compose them into visual or audible web pages. The browser does not display the HTML tags, but uses them to interpret the content of the page. HTML provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes and other items. It can embed scripts in languages such as JavaScript which affect the behavior of HTML web pages.

PDF format: Portable Document Format (PDF) is used for representing documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a fixed-layout 2D document that includes the text, fonts, images, and 2D vector graphics which compose the document.

DOC format: A document file format is a text or binary file format for storing documents on a storage medium, especially for use by computers. Binary DOC files often contain more text formatting information (as well as scripts and undo information) than files using other document file formats like Rich Text Format and Hypertext Markup Language, but are usually less widely compatible.
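To make the BoW/VSM idea concrete, the following is a minimal Java sketch (class and method names are ours, not the paper's) that builds a unigram term-frequency vector for one document:

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Builds a term -> frequency map for a single document (unigram counts).
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Split on any non-letter character; a real system would use a proper tokenizer.
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        String doc = "Text categorization assigns a category to a text document.";
        System.out.println(termFrequencies(doc));
    }
}
```

In a full VSM these vectors are further weighted (for example with binary or tf-based weights, as used in this work) and compared using a similarity measure.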

3.1 Overview of Text Classifiers


3.1.1 Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.

Abstractly, the probability model for a classifier is a conditional model p(C | F1, ..., Fn) over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features n is large, or a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)

In plain English, the above equation can be written as

posterior = (prior × likelihood) / evidence

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model [17][20].

This can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C, F1) ... p(Fn | C, F1, ..., Fn-1)

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j i. This means that

for j ≠ i, and so the joint model can be expressed as

p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C)

This means that, under the above independence assumptions, the conditional distribution over the class variable C can be expressed as

p(C | F1, ..., Fn) = (1/Z) p(C) ∏i p(Fi | C)

where Z is a scaling factor dependent only on F1, ..., Fn, i.e., it is constant if the values of the feature variables are known. Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n·r·k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for classification and prediction.

Thus, in the context of text classification, using the bag of words representation for a document di, we can compute P(di | Ck) as

P(di | Ck) = P(BoW(di) | Ck) = P(w1,i, w2,i, w3,i, ..., w|V|,i | Ck).

The assumption of the Naive Bayes classifier is that the j-th word wj,i of the i-th text document is not correlated with the other words, so

P(di | Ck) = P(w1,i, w2,i, w3,i, ..., w|V|,i | Ck) = ∏j P(wj,i | Ck).

The problem is thus reduced to estimating the probability of a single word wj,i in the class Ck.
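To make the word-level estimate concrete, here is a minimal Java sketch of a multinomial Naive Bayes scorer with add-one (Laplace) smoothing. The class and method names are illustrative and the smoothing choice is our assumption; this is not the implementation described in the paper.

```java
import java.util.*;

public class NaiveBayesText {
    private final Map<String, Map<String, Integer>> wordCount = new HashMap<>(); // per-class word counts
    private final Map<String, Integer> totalWords = new HashMap<>();             // total words per class
    private final Map<String, Integer> docCount = new HashMap<>();               // documents per class
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String c, List<String> words) {
        totalDocs++;
        docCount.merge(c, 1, Integer::sum);
        Map<String, Integer> counts = wordCount.computeIfAbsent(c, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(c, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // log p(C_k) + sum_j log p(w_j | C_k), with add-one (Laplace) smoothing
    public double logScore(String c, List<String> words) {
        double score = Math.log((double) docCount.get(c) / totalDocs);
        Map<String, Integer> counts = wordCount.get(c);
        int total = totalWords.get(c);
        for (String w : words) {
            int n = counts.getOrDefault(w, 0);
            score += Math.log((n + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }
}
```

A document is assigned to the class with the highest logScore; logarithms are used to avoid numeric underflow when multiplying many small probabilities.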

3.1.2 K-Nearest Neighbor Classifier


The Nearest Neighbor classifier is used for text classification. Nearest Neighbor classification is a nonparametric method, and it can be shown that for large datasets the error rate of the 1-Nearest Neighbor classifier is never larger than twice the optimal error rate. In this classifier, to decide whether the document di belongs to the class Ck, the similarity Sim(di, dj) or dissimilarity Dissim(di, dj) to all documents dj in the training set is determined. The k most similar training documents are selected. The proportion of neighbors having the same class may be taken as an estimator for the probability of that class, and the class with the largest proportion is assigned to the document di. The algorithm has two parameters (k and the similarity/dissimilarity measure) which determine the performance of the classifier and are empirically chosen; the optimal number k of neighbors may also be estimated from additional training data by cross validation [8, 11, 12]. The major drawback of the classifier is the computational effort during classification, as the similarity of a document with respect to all other documents of the training set has to be determined [14].
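A compact sketch of the k-NN decision rule described above, using cosine similarity over sparse term-frequency vectors; all names are illustrative and the paper does not prescribe this particular similarity measure:

```java
import java.util.*;

public class KNearestNeighbor {
    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Assigns the label held by the majority of the k most similar training documents.
    // Similarities are recomputed inside the comparator for clarity; a production
    // version would precompute them once per training document.
    static String classify(Map<String, Double> query,
                           List<Map<String, Double>> trainDocs,
                           List<String> trainLabels, int k) {
        Integer[] idx = new Integer[trainDocs.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) ->
                Double.compare(cosine(query, trainDocs.get(j)), cosine(query, trainDocs.get(i))));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, idx.length); i++)
            votes.merge(trainLabels.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```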

3.1.3 Centroid based classifier


The Centroid based classifier is a popular supervised approach used to classify texts into a set of predefined classes with relatively low computation [18]. Based on the vector space model, the performance of the classifier depends on how the terms in documents are weighted in order to construct a representative class vector for each class, and on how spherical each class is. Based on the documents in each class, the Centroid based classifier selects a single representative, called the centroid, and then works like a k-NN classifier with k = 1. Given a set S of documents and their representation, we compute the summed centroid and the normalized centroid of class Ci, denoted Cis and Cin. We then calculate the similarity between a document d and each class Ci and, based on these similarities, assign d the class label corresponding to the most similar centroid. This, in brief, is the procedure of the Centroid based classifier.
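A minimal sketch of the centroid rule, reusing the cosine function from the k-NN sketch above; averaging as the centroid and cosine as the similarity measure are assumptions made for illustration:

```java
import java.util.*;

public class CentroidClassifier {
    private final Map<String, Map<String, Double>> centroids = new HashMap<>();

    // Average the term-frequency vectors of all training documents of each class.
    public void train(Map<String, List<Map<String, Double>>> docsByClass) {
        for (Map.Entry<String, List<Map<String, Double>>> e : docsByClass.entrySet()) {
            Map<String, Double> centroid = new HashMap<>();
            for (Map<String, Double> doc : e.getValue())
                doc.forEach((term, w) -> centroid.merge(term, w, Double::sum));
            int n = e.getValue().size();
            centroid.replaceAll((term, w) -> w / n);
            centroids.put(e.getKey(), centroid);
        }
    }

    // Assign the class whose centroid is most similar to the document vector.
    public String classify(Map<String, Double> doc) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : centroids.entrySet()) {
            double sim = KNearestNeighbor.cosine(doc, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }
}
```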

3.1.4 Decision Trees


Decision trees (DT) are among the most widely used inductive learning methods. Their robustness to noisy data and their capability to learn disjunctive expressions make them suitable for document classification. Among the most well known decision tree algorithms are ID3 and its successors C4.5 and C5. It is a top-down method which recursively constructs a decision tree classifier. A DT text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by the weight that the term has in the test document, and leaves are labeled by categories. Such a classifier categorizes a text document dj by recursively testing the weights that the terms labeling the internal nodes have in the vector dj until a leaf node is reached; the label of this node is then assigned to dj. A possible method for learning a DT for category Cj follows a divide and conquer strategy: check whether all the training examples have the same label; if not, select a term tk, partition the documents into groups that have the same value for tk, and repeat the process on each group until each leaf contains documents of a single class Cj, which is then chosen as the label for the leaf. The key step is the choice of the term tk on which to operate the partition; generally, the choice is made according to an information gain or entropy criterion. However, such a fully grown tree may be prone to overfitting, as some branches may be too specific to the training data. Most DT learning methods therefore include a method for growing the tree and one for pruning it, i.e. for removing the overly specific branches.
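To illustrate the entropy criterion used for choosing the split term tk, here is a small helper (illustrative only) that computes the entropy of a class distribution; information gain is the reduction of this quantity after a candidate partition:

```java
public class EntropyUtil {
    // Entropy in bits of a class distribution given by raw counts.
    public static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // A 50/50 split has maximal entropy (1 bit); a pure node has entropy 0.
        System.out.println(entropy(new int[]{5, 5}));   // 1.0
        System.out.println(entropy(new int[]{10, 0}));  // 0.0
    }
}
```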

4. Proposed Text Classifier


We need to overcome the problems faced by each of these classification algorithms and develop a new approach to classify a document into its proper class. Here we adopt an approach that first converts the unorganized content into organized content by applying a summarization technique, and then classifies the document based on feature term identification.

4.1 Text Analysis


As a part of categorization, we need to identify the feature terms which represent the document. This

involves a considerable amount of text analysis. We assume that the input document can be in any document format (e.g. PDF, HTML, DOC, TXT, RTF), hence the system first applies document converters to extract the text from the input document. In our system we have used document converters that can convert PDF, MS Word, text, RTF and HTML documents into text.

4.2 Text Normalization


Text normalization is a rule based component which removes unimportant objects like figures and tables, identifies the headings and subheadings, and handles non-standard words like web URLs, emails, and so on. The text is then divided into sentences for further processing.

4.3 Sentence Marker


This module divides the document into sentences. At first glance, it may appear that using end-of-sentence punctuation marks, such as periods, question marks, and exclamation points, is sufficient for marking sentence boundaries. The exclamation point and question mark are somewhat less ambiguous, but the dot ('.') in real text is highly ambiguous and need not always mean a sentence boundary. The sentence marker considers the following ambiguities in marking the boundaries of sentences:
a. Non-standard words such as web URLs, emails, acronyms, and so on, may contain '.'
b. Every sentence starts with an uppercase letter
c. Document titles and subtitles can be written either in upper case or title case
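A simplified sketch of such a sentence marker in Java, using the heuristics listed above; the abbreviation list and the regular expression are illustrative, not the paper's rules:

```java
import java.util.*;
import java.util.regex.*;

public class SentenceMarker {
    // A few common abbreviations whose trailing dot should not end a sentence.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Dr", "Mr", "Mrs", "Prof", "e.g", "i.e", "etc"));

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // Candidate boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
        Matcher m = Pattern.compile("[.!?]\\s+(?=[A-Z])").matcher(text);
        int start = 0;
        while (m.find()) {
            String before = text.substring(start, m.start());
            String lastToken = before.substring(before.lastIndexOf(' ') + 1);
            // Skip boundaries inside URLs/emails or right after a known abbreviation.
            if (lastToken.contains("@") || lastToken.startsWith("http")
                    || ABBREVIATIONS.contains(lastToken)) continue;
            sentences.add(text.substring(start, m.end()).trim());
            start = m.end();
        }
        sentences.add(text.substring(start).trim());
        return sentences;
    }
}
```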

4.4 Syntactic Parsing


This module analyzes the sentence structure with the help of available Natural Language Processing tools such as Brill's tagger and a named entity extractor. A named entity extractor can identify named entities (persons, locations and organizations), temporal expressions (dates and times) and certain types of numerical expressions in text. It uses both syntactic and contextual information; the context information is identified in the form of POS tags of the words and used in the named entity rules, some of which are general while the rest are domain specific.

4.5 Feature Extraction


The system extracts both sentence level and word level features, which are used in calculating the importance or relevance of a sentence to the document.

The sentence level features include:
Position of the sentence in the input document: The sentence number is normalized to the scale of 0 to 1, and the weight corresponding to the sentence position is calculated.
Presence of a verb in the sentence: Based on the assumption that a complete sentence contains a verb, this feature helps in deciding the candidate sentences for generating the summary.
Referring pronouns: The score for a sentence is attributed by the words that are present in the sentence. During this process most IR/IE systems neglect the stop words that occur in the document, and the referring pronouns are also neglected as stop words. However, to get the actual sentence score one should also consider the proper nouns to which these pronouns refer.
Length of the sentence: Since long sentences contain more words, they usually get a higher score. This factor needs to be considered while calculating the score of the sentence; in our system we normalize the sentence score by the number of words in that sentence, giving the score of the sentence per word.

The word level features include:
Term frequency tf(w): Term frequency is calculated using both the unigram and bigram frequency. We considered only nouns while computing the bigram frequencies; a sequence of two nouns occurring together denotes a bigram. The unigram/bigram frequency denotes the number of times the unigram/bigram occurs in the document. Typically bigrams occur less often than unigrams, so we used a factor that converts the bigram frequency to the unigram scale. All the bigrams in which the word occurs are taken and normalized to the unigram scale, and finally the maximum of the unigram and normalized bigram frequency is taken as the term frequency of the word.
Length of the word l(w): Smaller words occur more frequently than larger words; in order to negate this effect we considered the word length as a feature.
Parts of speech tag p(w): Brill's tagger is used to find the POS tag of the word. We ranked the tags and assigned weights based on the information that they contribute to the sentence.

All the above features are normalized to a 0-1 scale. A weighted combination of all these features is used in calculating the score of a sentence.

4.6 Summary Generation


Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user or task [13]. The task of summarization is to identify informative evidence in a given document that is most relevant to its content, and to create a shorter version of the document from this information. The informative evidence associated with the techniques used in summarization may also provide clues for text categorization in determining the appropriate category of the document. Summary generation includes tasks such as calculating the score for each sentence, selecting the sentences with high scores, and refining the selected pool of sentences.

4.6.1 Sentence Ranking


Some of the word features depend on the context in which the word occurs, i.e. they also depend on the sentence number, so the score of a word is also dependent on the sentence number. Once the feature vector for each sentence is extracted, the score of a sentence is the sum of the scores of its individual words, influenced by the sentence level features:

Score(l, w) = Σi fi(w)
Score(l) = Σi Score(l, wi)

where l denotes the sentence number, w denotes a word that occurs in the sentence, and fi(w) denotes the value of the i-th feature of the word. In order to compute the effect of referring pronouns on the sentence score, we assumed that pronouns in a given sentence refer to nouns in the immediately preceding sentence; this applies only if the pronoun occurs in the first half of the sentence, otherwise the pronoun is assumed to refer to a noun within the same sentence. Based on this assumption, the actual score of the sentence (as if those nouns existed in the same sentence) is calculated as

Score(l) ← Score(l) + (No. of coreferents × SPW(l-1))
SPW(l) = Score(l) / length(l)

where SPW(l) denotes the score per word of sentence l. The SPW is multiplied by the positional value of the sentence to get the final score of the sentence.

4.6.2 Sentence Selection


After the sentences are scored, we need to select the sentences that make a good summary. One strategy is to pick the top N sentences for the summary, but this creates a problem of coherence. The selection of sentences depends on the type of summary requested [21]. The process of selecting the sentences for the final summary can be viewed as a Markov process; that is, the selection of the next sentence for the summary depends on the sentences already selected. This approach is important for obtaining a meaningful and coherent summary.

4.6.3 Coherence Score (CS)


This is a measure of the amount of information that is common to the set of sentences already selected and the new sentence that is going to be selected. We have used a bag of words technique to calculate the coherence of the information flow, CS. Let sw represent the set of words present in the sentences already selected, and lw the set of words in the new sentence; then the coherence score is the sum of the scores of the words that are common to both sw and lw. The score of the new sentence is now given by

CF × CS(l) + (1 − CF) × SPW(l)

where CF is the Coherence Factor, a user defined parameter.
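A small sketch of how the coherence score and the combined score could be computed (illustrative names; CF, CS and SPW follow the definitions above):

```java
import java.util.*;

public class CoherenceSelector {
    // Coherence score: summed word scores of the candidate's words already seen in the summary.
    static double coherenceScore(Set<String> selectedWords, List<String> candidateWords,
                                 Map<String, Double> wordScore) {
        double cs = 0.0;
        for (String w : candidateWords)
            if (selectedWords.contains(w)) cs += wordScore.getOrDefault(w, 0.0);
        return cs;
    }

    // Combined score of a candidate sentence: CF * CS(l) + (1 - CF) * SPW(l).
    static double combinedScore(double cf, double cs, double spw) {
        return cf * cs + (1 - cf) * spw;
    }
}
```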

4.6.4 Summary Refinement


The sentence selection module gives a set of sentences which satisfies the user criteria. To further increase the readability of the summary, the following transformations are applied, in the specified order:
1. Add sentences to the pool so as to avoid dangling discourse relations. We keep a list of discourse connectors that are commonly used to connect different sentences in a document. For example, if a sentence starts with "afterwards" or "but", then this sentence is marked as dependent on the previous sentence at the discourse level. At this stage the system can optionally follow step 2 or step 3.
2. Mark the preceding sentence as important and add it to the pool of selected sentences.
3. Remove this sentence from the selected list.
4. Remove some sentences depending on the length of the desired summary. If a short summary is requested, then it is better to select many short sentences and remove very long sentences. If the length of the summary is comparable with the length of the document, then sentences shorter than some threshold are removed from the pool.
5. Remove questions, titles and subtitles from the set of sentences.
6. Rewrite sentences by deleting marked parenthetical units.
7. If a coreferent is found in a given sentence, then the previous sentence is also included in the set of selected sentences.
In the final step, we order the sentences based on their occurrence in the document and generate the summary by concatenating the ordered sentences.

4.7 Text Categorization Operations


For effective and accurate classification of data, Information Extraction is needed, and it is often a complex task when dealing with HTML files, PDFs, DOCs, etc., as they involve various media and graphical items. First we need to parse the web content into useful text and then apply various procedures. The general purpose of Knowledge Discovery is to extract implicit, previously unknown, and potentially useful information from data [19]. Information Extraction mainly deals with identifying words or feature terms within a textual file; feature terms can be defined as those which are directly related to the domain. In order to identify feature terms and process the data, the web document has to undergo several operations, which are presented in the following subsections. The operations involved in text categorization are as follows.

4.7.1 Preprocessing
In the preprocessing step, the given document, which may be in any of the prescribed formats, is parsed. The parsing of the file varies for the different formats.

If the given document ends with an extension specifying an HTML file, then we need to check whether it is truly an HTML file. This can be done using conformance checkers, whose purpose is to check whether the resource is in HTML syntax. If it is an HTML document, then the following common rules are used to remove HTML tags [22]:
1. If the sequence of bytes starts with <!, </ or <?, then advance the pointer past the matching !> or >, simply skipping the bytes in between.
2. If the sequence of bytes starts with <meta, or with the attributes http-equiv, content or charset followed by a space or slash, advance the position pointer so that it points past the next >.
3. If the byte at the current position is one of ASCII TAB, ASCII LF, ASCII FF, ASCII CR or ASCII space, then advance the position to the next byte and repeat the rules.
4. All HTML tags are treated in a case-insensitive manner.

If the document is in PDF format, we can use any of the available PDF parsing APIs such as PDFBox, PDFOne or Asprise. In our model we used the PDFBox API: we instantiate its PDFParser class, feeding our file to the getDocument() method, and use the text stripper to remove the markup and return the plain text, which in turn is stored in a temporary file for further processing. If the given document is in DOC format, we can use the document reader classes in the POI and POI Scratchpad APIs; this removes all the media items and returns only text. If the document is in plain text or RTF format, then no parsing is needed and we simply process the same file in the further steps.
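A minimal sketch of the PDF-to-text step, assuming the Apache PDFBox 2.x API (PDDocument and PDFTextStripper); the surrounding file handling is illustrative:

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {
    // Loads a PDF file and returns its plain text content.
    public static String extract(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = extract(new File(args[0]));
        System.out.println(text);
    }
}
```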

4.7.2 Stemming
Stemming refers to identifying the root of a given word in the document. Any text document, in general, contains repetitions of the same word with variations in grammar: sometimes a word appears in the past tense, sometimes in the present tense, and sometimes as a gerund ('ing' suffixed at the end). Stemming is of two types: derivational stemming and inflectional stemming. Derivational stemming aims at creating a new word from an existing word, most often by changing the grammatical category, e.g. Rationalize - Rational, Useful - Use, Musical - Music, Finalize - Final. An example of derivational stemming is the affix removal technique, in which the simplest stemming algorithm truncates words at the Nth symbol (keeping words shorter than N letters unaffected). Though not used for stemming in real systems, this algorithm provides a good baseline for evaluating other algorithms [3]. Another simple approach is the so-called "S"-stemmer, in which we conflate the singular and plural forms of English nouns.
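A sketch of the "S"-stemmer mentioned above, using one common formulation of its three rules (the exact rule set is an assumption, not taken from the paper):

```java
public class SStemmer {
    // Conflates plural and singular forms of English nouns using a common
    // formulation of the "S"-stemmer rules (only the first matching rule is applied).
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies"))
            return w.substring(0, w.length() - 3) + "y";
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes"))
            return w.substring(0, w.length() - 2) + "e";
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss"))
            return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("categories")); // category
        System.out.println(stem("classes"));    // classe (the rule keeps a trailing 'e')
        System.out.println(stem("documents"));  // document
    }
}
```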

4.7.3 Stop words removal


Some words are extremely common and occur in a large majority of documents. For example, articles such as "a", "an", "the" and "by" appear in almost every text but do not carry much semantic information. Categorization is based on the feature terms, not on commas, full stops, colons, semicolons and the like. To reduce the search space and processing time, these stop words are dealt with separately: they are recognized in the stemmed file and removed from the document's token list, so that these words are not stored in the signature file. Some stop words found to be non-considerable are listed in Table 2; these are the top stop words prescribed by Cambridge University. We save these stop words in a text file and compare them with each word in the temporary text file.

4.7.4 String Tokenization


The processing of text often consists of parsing a formatted input string. Parsing in this context refers to the division of text into a set of discrete parts, or tokens, which in a certain sequence can convey a semantic meaning. The StringTokenizer class provides the first step in the parsing process, often called the lexer or scanner. StringTokenizer implements the Enumeration interface; therefore, given an input string, we can enumerate the individual tokens contained in it using StringTokenizer. To use StringTokenizer, we specify an input text file, which is nothing but the temporary file we created in the previous section, and a string containing the delimiters. Delimiters are characters that separate tokens. Each character in the delimiter string is considered a valid delimiter; for example, [?!\"(),.;:] sets the delimiters to the respective symbols. The default set of delimiters consists of the whitespace characters: space, tab, newline, carriage return and form feed.
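A brief usage example of StringTokenizer with an explicit delimiter set (the sample text and delimiter string are illustrative):

```java
import java.util.StringTokenizer;

public class TokenizeExample {
    public static void main(String[] args) {
        String text = "Text categorization, in short, assigns a category to a document.";
        // Whitespace plus the listed punctuation characters act as delimiters.
        StringTokenizer st = new StringTokenizer(text, " \t\n\r?!\"(),.;:");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}
```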

4.7.5 Term Frequency and Weighting


The words in the text file are counted after the tokenization process, i.e. the number of terms in the word list is determined. The next step is to find the number of times each string is repeated throughout the file. For this, we pass each string of the array into a temporary string and compare each element of the array with it; if they are the same, the word count is incremented by 1. After the word count of each string is calculated, we calculate the weight of each string over the entire text by dividing the frequency of each term by the total count; multiplying by one hundred gives the percentage.
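A compact sketch of this counting and weighting step (names are illustrative):

```java
import java.util.*;

public class TermWeights {
    // Returns term -> weight as a percentage of all tokens in the word list.
    public static Map<String, Double> weights(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);

        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            weights.put(e.getKey(), 100.0 * e.getValue() / tokens.size());
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(weights(Arrays.asList("text", "mining", "text", "categorization")));
        // e.g. {text=50.0, mining=25.0, categorization=25.0}
    }
}
```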

4.7.6 Comma Separated Values file generation


In order to analyze the strings, we need to convert the attributes (1) Term, (2) Term Frequency and (3) Term Weight into an organized form. For simplicity, we have organized the entire data in Comma Separated Values (CSV) format, a format supported by Microsoft Excel. We bring the data into this format by simply printing each attribute separated by a comma, repeating the same procedure for each term. The file can be viewed in WordPad or Excel as follows:

Figure 4: A CSV file in Excel sheet and word pad
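A sketch of how such a CSV file could be written (the file path, header row and absence of quoting are illustrative simplifications; terms containing commas would need quoting in a production version):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

public class CsvWriter {
    // Writes one "term,frequency,weight" row per term.
    public static void write(String path, Map<String, Integer> freq,
                             Map<String, Double> weight) throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("Term,Term Frequency,Term Weight");
            for (String term : freq.keySet()) {
                out.println(term + "," + freq.get(term) + "," + weight.get(term));
            }
        }
    }
}
```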

5. Experimental Results and Outcome


The method provides successful organization of the data with a Java implementation. We can identify the most important term in the text file and compare its category against the predefined list of category terms simply by evaluating word.MaxFreq(). The result we achieve is the best-fit category under which the file falls. As the number of feature terms in the trained file increases, the accuracy also increases appreciably. The method we developed performed the best among the tools we tested, including WEKA. The user interface, file submission, categorization and sequence of results are shown in Figure 5.

Figure 5: Resulting Best fit Category displayed on the output area

The generated CSV file is fed to the WEKA tool, version 3.6. The results are compared across the various classifiers built into it. We applied various attribute selections and tested which instances are classified correctly. The count was found to be the best criterion on which to build the classifier. Figure 6 shows the results.

Figure 6: Best result on the basis of count; accuracy is 96.55%

6. Conclusion and Future Research


It is true that different algorithms perform differently depending on the data collection; none of the classification algorithms appears globally superior to the others. However, to a certain extent, summarization based on a term-weighted Vector Space Model representation scheme performs well in many text classification tasks. Thus, with our proposed method we have justified the statement that the summarization technique is effective not only for extracting useful patterns from files but also for assigning a file to the needed domain. The method has a vast scope for extended study and can be extended in the following areas:
1. Extend to various formats like ASP, XML, PHP, etc.
2. Lemmatization can also be included while processing the text.
3. Summarization can also be included in the same software.
4. The techniques can be useful for extractive summarization of data.
5. Apply to various categories of trained data.

References
[1] Lam, W. and Ho, N.Y. (1998), "Using a Generalized Instance Set for Automatic Text Categorization", SIGIR '98, pp. 81-89.
[2] Dinesh, R., Harish, B.S., Guru, D.S., and Manjunath, S. (2009), "Concept of Status Matrix in Text Classification", Proceedings of the Indian International Conference on Artificial Intelligence, Tumkur, India, pp. 2071-2079.
[3] Guru, D.S., Harish, B.S., and Manjunath, S. (2009), "Clustering of Textual Data: A Brief Survey", Proceedings of the International Conference on Signal and Image Processing, pp. 409-413.
[4] Mitra, V., Wang, C.J., and Banerjee, S. (2007), "Text Classification: A Least Square Support Vector Machine Approach", Journal of Applied Soft Computing, Vol. 7, pp. 908-914.
[5] Fung, G.P.C., Yu, J.X., Lu, H., and Yu, P.S. (2006), "Text Classification without Negative Examples Revisit", IEEE Transactions on Knowledge and Data Engineering, Vol. 18, pp. 23-47.
[6] Song, F., Liu, S., and Yang, J. (2005), "A Comparative Study on Text Representation Schemes in Text Categorization", Journal of Pattern Analysis and Applications, Vol. 8, pp. 199-209.
[7] Porter, M.F. (1980), "An Algorithm for Suffix Stripping", Program, Vol. 14(3), pp. 130-137.
[8] Hotho, A., Nürnberger, A., and Paaß, G. (2005), "A Brief Survey of Text Mining", Journal for Computational Linguistics and Language Technology, Vol. 20, pp. 19-62.
[9] Salton, G., Wong, A., and Yang, C.S. (1975), "A Vector Space Model for Automatic Indexing", Communications of the ACM, Vol. 18, pp. 613-620.
[10] Bernotas, M., Karklius, K., Laurutis, R., and Slotkiene, A. (2007), "The Peculiarities of the Text Document Representation, Using Ontology and Tagging-Based Clustering Technique", Journal of Information Technology and Control, Vol. 36, pp. 217-220.
[11] Yang, Y., Slattery, S., and Ghani, R. (2002), "A Study of Approaches to Hypertext Categorization", Journal of Intelligent Information Systems, Vol. 18(2), pp. 219-241.
[12] Sandip, K. (2003), "An Experimental Study of Some Algorithms for Text Categorization", M.Tech Thesis, IIT Kanpur, India.
[13] Lawrence Wong, "Automatic News Summarization and Extraction System", MEng Computing, Department of Computing, Imperial College. http://www.doc.ic.ac.uk/
[14] Gongde Guo (2004), "A kNN Model-Based Approach and its Application in Text Categorization".
[15] Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", Consiglio Nazionale delle Ricerche, Italy.
[16] Atika Mustafa, Ali Akbar, "Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization", International Journal of Multimedia and Ubiquitous Engineering, Vol. 4, No. 2.
[17] Ioan Pop (2006), "An Approach of the Naive Bayes Classifier for the Document Classification", General Mathematics, Vol. 14, No. 4, pp. 135-138.
[18] Lifei Chen, Yanfang Ye, Qingshan Jiang (2008), "A New Centroid-Based Classifier for Text Categorization", 22nd International Conference on Advanced Information Networking and Applications - Workshops (AINAW), pp. 1217-1222.
[19] W. Frawley, G. Piatetsky-Shapiro, C. Matheus (1991), "Knowledge Discovery in Databases - An Overview", in Knowledge Discovery in Databases, pp. 1-30.
[20] Padmavati Shrivastava, Uzma Ansari (2010), "Text Categorization based on Associative Classification", Special Issue of IJCCT, Vol. 1, Issues 2, 3, 4 [ACCTA-2010].
[21] H. Loftsson, E. Rögnvaldsson, S. Helgadóttir (Eds.) (2010), "Summarization as Feature Selection for Document Categorization on Small Datasets", LNAI 6233, pp. 39-44, Springer-Verlag Berlin Heidelberg.
[22] Suneetha Manne and S. Sameen Fatima (2011), "A Novel Approach for Text Categorization of Unorganized Data Based with Information Extraction", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3, No. 7, pp. 2846-2854.

