A Novel Approach for Automatic Text Categorization with Classic Summarization Technique
Suneetha Manne and Dr. S. Sameen Fatima
Assistant Professor, Department of IT, VR Siddhartha Engineering College, Vijayawada
HOD & Professor, Department of CSE, Osmania University, Hyderabad
suneethamanne74@gmail.com

Abstract
The emergence of the Internet has profoundly changed the lives of many enthusiasts, innovators and researchers. The information available on the Internet has opened the doors of knowledge and discovery, leading to a new information era. The World Wide Web has succeeded to such an extent in forwarding the information one could need that various data storage centers came into existence. Unfortunately, the hypertext results provided by search services such as Google, Yahoo, Bing and Ask often include content irrelevant to the topic supplied by the user. It is a time-consuming task for researchers to recognize which web content is relevant to their area of interest. So far, a number of Text Categorization techniques have been developed to determine the class under which a given document falls. This paper focuses on an implementation of information extraction and categorization combined with summarization techniques. The derived model is tested for accuracy by applying various categorization techniques.

Keywords: Text Categorization, Text Preprocessing, Text Summarization, Term Frequency.
1. Introduction
It is observed that there is enormous growth in the addition of new documents to the web. These documents come from various domains and form a huge heterogeneous collection. In such an enormous collection of information, identifying the documents relevant to a specific topic is a genuinely challenging task in real-time applications. The time taken to search for a relevant document is expensive and results in high computational as well as time complexity. To address this kind of issue, the analysis may be augmented with experiment-based approaches. Text Categorization is one such approach: it identifies the class to which a web document belongs by analyzing the dataset and observing its behavior. This generally involves learning each class and its representation from a set of documents in that class. To understand this better, we first need to focus on the term Information Retrieval.
Due to the abundance of text information, information retrieval has found many applications. There exist many information retrieval systems, such as on-line library catalog systems, online document management systems, and the more recently developed web search engines. A typical information retrieval problem is to locate relevant documents in a document collection based on the user's query, which is often some keywords describing an information need, although it could also be an example of a relevant document. Text Categorization is one of the fields of Data Mining and Information Retrieval which uses predefined categories to label new documents.
[Figure: Text Categorization]
2. Background
Over the past two decades, the automatic management of electronic documents has been a major research field in computer science. Text documents have become the most common type of information repository, especially with the increased popularity of the Internet and the World Wide Web (WWW). Internet and web documents such as web pages, emails, newsgroup messages and internet news feeds contain millions or even billions of text documents. In recent decades, content-based document management tasks have gained a prominent status in the information systems field, due to the increased availability of documents in digital form [2] [3]. Until the late 80s, the text classification task was based on Knowledge Engineering, where a set of rules was defined manually to encode expert knowledge on how to classify documents under the given categories. Since knowledge engineering requires human intervention, researchers in the 90s proposed many machine learning techniques to manage text documents automatically [4]. The advantages of a machine learning based approach are that the accuracy is comparable to that achieved by human experts and that no intervention from either knowledge engineers or domain experts is needed for the construction of a document management tool [5]. Many text mining methods, such as document retrieval, clustering, classification, routing and filtering, are often used for the effective management of text documents. However, in this work we concentrate only on the classification of text documents. A classifier can be built by training systematically on a set of training documents D, where all of the documents belonging to D are labelled [15]. Text classification presents many challenges and difficulties. First, it is difficult to capture high-level semantics and abstract concepts of natural languages from just a few keywords. Furthermore, semantic analysis, a major step in designing an information retrieval system, is not well understood, although some techniques have been successfully applied to a limited number of domains. Second, high dimensionality (thousands of features) and variable length, content and quality are characteristics of a huge number of documents on the web. These place both efficiency and accuracy demands on classification systems [5].
Since a classifier cannot understand a document in its raw format, a document has to be converted into a standard representation. Extensive work has been carried out in the literature to propose various text representation techniques and text classification methods [9]. However, it is essential for a software developer to have complete knowledge of the existing representation schemes and classifiers in order to select the representation scheme and classifier which best suit his purpose or application. In this context, we reiterate that our focus is only on the widely accepted representation schemes and classifiers. Automatic text summarization involves reducing a text document, or a larger corpus of multiple documents, into a short set of words or a paragraph that conveys the main meaning of the text. Extractive methods work by selecting a subset of existing words, phrases, or sentences from the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary might contain words not explicitly present in the original. The state-of-the-art abstractive methods are still quite weak, so most research has focused on extractive methods.
3. Text Representation
In automatic text classification, it has been shown that the term is the best unit for text representation and classification [6]. Though a text document expresses a vast range of information, it unfortunately lacks the imposed structure of a traditional database. Therefore, unstructured data, particularly free-running text, has to be transformed into structured data. To do this, many pre-processing techniques have been proposed in the literature [7, 8]. After converting unstructured data into structured data, we need an effective document representation model to build an efficient classification system. Bag of Words (BoW) is one of the basic methods of representing a document: it forms a vector representing the document using the frequency count of each term in the document. This method of document representation is called the Vector Space Model (VSM) [9]. Unfortunately, the BoW representation scheme has its own limitations, among them high dimensionality of the representation, loss of correlation with adjacent words, and loss of the semantic relationships that exist among the terms of a document [10]. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms to improve the performance of text classification; in this work we use a binary representation of the given document. The following formats are available as web content and need to be organized.

HTML format: Hypertext Markup Language is the predominant markup language for web pages. HTML tags normally come in pairs like <h1> and </h1>. The first tag in a pair is the start tag; the second is the end tag. The purpose of a web browser is to read HTML documents and compose them into visual or audible web pages. The browser does not display the HTML tags, but uses the tags to interpret the content of the page. HTML provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes and other items. It can embed scripts in languages such as JavaScript which affect the behavior of HTML web pages.

PDF format: Portable Document Format (PDF) is used for representing documents in a manner independent of the application software, hardware and operating system. Each PDF file encapsulates a complete description of a fixed-layout 2D document, including the text, fonts, images and 2D vector graphics which compose the document.

DOC format: A document file format is a text or binary file format for storing documents on storage media, especially for use by computers. Binary DOC files often contain more text formatting information (as well as scripts and undo information) than files using other document file formats like Rich Text Format and Hypertext Markup Language, but are usually less widely compatible.
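To make the BoW/VSM idea concrete, the following minimal Java sketch builds a term-frequency vector for one document. The class and method names are our own illustrative choices, not part of the described system.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal Bag-of-Words sketch: counts term frequencies in a document.
public class BagOfWords {

    // Tokenize on non-letter characters and count each lower-cased term.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        String doc = "Text categorization labels text documents with predefined categories.";
        // Prints e.g. {text=2, categorization=1, labels=1, documents=1, ...}
        System.out.println(termFrequencies(doc));
    }
}
```

A binary representation, as used in this work, can be obtained from the same counts by replacing each nonzero frequency with 1.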
Abstractly, the probability model for a classifier is a conditional model over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F_1 through F_n. The problem is that if the number of features n is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we write

p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features F_i are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model [17][20]

p(C, F_1, \dots, F_n)
This can be rewritten as follows, using repeated applications of the definition of conditional probability:

p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})
Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j i. This means that
This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed as:

p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)
where Z is a scaling factor dependent only on F_1, ..., F_n, i.e., it is constant if the values of the feature variables are known. Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(F_i | C). If there are k classes and if a model for each p(F_i | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n·r·k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are often
common, so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for classification and prediction. Thus, in this context of text classification, using the bag-of-words representation for documents d_i ranging over i = 1, ..., n, we can compute P(d_i | C_k) as

P(d_i \mid C_k) = P(\mathrm{BoW}(d_i) \mid C_k) = P(w_{1,i}, w_{2,i}, \dots, w_{|V|,i} \mid C_k)

The assumption of a naive Bayes classifier is that the j-th word w_{j,i} of the i-th text document is not correlated with the other words, so

P(d_i \mid C_k) = P(w_{1,i}, w_{2,i}, \dots, w_{|V|,i} \mid C_k) = \prod_{j} P(w_{j,i} \mid C_k)

Here the problem is reduced to estimating the probability of a single word w_{j,i} in the class C_k.
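As a worked illustration of the classification rule just derived, the following Java sketch scores a class C_k for a tokenized document by summing log-probabilities, which avoids numerical underflow from multiplying many small probabilities. The add-one (Laplace) smoothing for unseen words and all class and method names are our illustrative assumptions, not necessarily the paper's exact implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal multinomial naive Bayes: score(C_k) = log p(C_k) + sum_j log p(w_j | C_k).
public class NaiveBayesScorer {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // class -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // class -> total tokens
    private final Map<String, Integer> docCounts = new HashMap<>();               // class -> #documents
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, List<String> words) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Log-score of one class for a tokenized document; classify by taking the
    // class with the highest score. Laplace smoothing handles unseen words.
    public double logScore(String label, List<String> words) {
        double score = Math.log(docCounts.get(label) / (double) totalDocs); // log p(C_k)
        Map<String, Integer> counts = wordCounts.get(label);
        int total = totalWords.get(label);
        for (String w : words) {
            int c = counts.getOrDefault(w, 0);
            score += Math.log((c + 1.0) / (total + vocabulary.size()));     // log p(w_j | C_k)
        }
        return score;
    }
}
```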
The proposed system involves a considerable amount of text analysis. We assume that the input document can be of any format (e.g. pdf, html, doc, txt, rtf, ...), hence the system first applies document converters to extract the text from the input document. In our system we have used document converters that can convert PDF, MS Word, text, RTF and HTML documents into plain text.
The sentence selection step begins by calculating the score of each sentence.

If a short summary is requested, then it is good to select many short sentences and remove very long sentences. If the length of the summary is comparable with the length of the document, then sentences shorter than some threshold are removed from the pool. The remaining rules are:
5. Remove questions, titles and subtitles from the set of sentences.
6. Rewrite sentences by deleting marked parenthetical units.
7. If a co-referent is found in a given sentence, then the previous sentence is also included in the set of selected sentences.
In the final step, we order the sentences based on their occurrence in the document and generate the summary by concatenating the ordered sentences, as sketched below.
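Here is a minimal sketch of the length-based filtering, question removal and final reordering described above, under the assumption that each sentence already carries a relevance score and its position in the document; all names and thresholds are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of sentence selection: drop very long sentences and questions,
// keep the best-scoring ones, then restore original document order.
public class SentenceSelector {
    static class Scored {
        final String sentence; final int position; final double score;
        Scored(String s, int p, double sc) { sentence = s; position = p; score = sc; }
    }

    static String summarize(List<Scored> scored, int maxSentences, int maxWordsPerSentence) {
        List<Scored> pool = new ArrayList<>();
        for (Scored s : scored) {
            boolean tooLong = s.sentence.split("\\s+").length > maxWordsPerSentence;
            boolean question = s.sentence.trim().endsWith("?");
            if (!tooLong && !question) pool.add(s);   // prefer short, declarative sentences
        }
        pool.sort((a, b) -> Double.compare(b.score, a.score));          // best scores first
        List<Scored> chosen = pool.subList(0, Math.min(maxSentences, pool.size()));
        chosen.sort(Comparator.comparingInt(s -> s.position));          // original document order
        StringBuilder summary = new StringBuilder();
        for (Scored s : chosen) summary.append(s.sentence).append(' ');
        return summary.toString().trim();
    }
}
```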
4.6.1 Preprocessing
In the preprocessing step, the given document, which is in any of the prescribed formats, is parsed. The parsing of the file may vary for different formats.

If the given document ends with an extension specifying an HTML file, then we need to check whether it is truly an HTML file. This is done using conformance checkers, whose purpose is to check whether the resource is actually in HTML syntax. If it is an HTML document, then the following common rules are used to remove the HTML tags [22]:
1. If the sequence of bytes starts with <!, </ or <?, advance the pointer so that it points past the closing > (preceded by ! where applicable), simply skipping the bytes in between.
2. If the sequence of bytes starts with any of the attributes <meta, http-equiv, content or charset followed by a space or slash, advance the position pointer so that it points past the next >.
3. If the byte at the current position is one of ASCII TAB, ASCII LF, ASCII FF, ASCII CR or ASCII space, advance the position to the next byte, then repeat the rules.
4. All HTML tags are treated in a case-insensitive manner.

If the document is in PDF format, we can use any of the available open source PDF parsing APIs such as PDFBox, PDFOne or Asprise. In our model we used the PDFBox API, in which there is a PDFParser class that we instantiate in our program, obtaining the parsed document through its getDocument() method; a text stripper then removes the formatting and returns the text, which is in turn stored in a temporary file for further processing, as sketched below. If the given document is in DOC format, we can use the document reader classes in the POI and POI-scratchpad APIs, which remove all the media elements and return only the text. If the document is in plain text or RTF format, then no parsing is needed and we simply process the same file in the further steps.
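A minimal sketch of the PDF branch, assuming the Apache PDFBox 2.x API: the paper drives the lower-level PDFParser directly, whereas the higher-level loader shown here is an equivalent, simpler route to the same extracted text.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Sketch: extract plain text from a PDF with Apache PDFBox (2.x API assumed).
public class PdfToText {
    public static String extract(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Returns the text only; layout, fonts and images are discarded.
            return stripper.getText(document);
        }
    }
}
```

The returned string can then be written to the temporary text file that the later stemming and stop word removal steps operate on.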
4.6.2 Stemming
Stemming refers to identifying the root of a given word in the document. Any text document in general contains repetitions of the same word with grammatical variations: sometimes a word appears in the past tense, sometimes in the present, and sometimes as a gerund (with the suffix -ing). Stemming is of two types: derivational stemming and inflectional stemming. Derivational stemming aims at creating a new word from an existing word, most often by changing the grammatical category, e.g. Rationalize → Rational, Useful → Use, Musical → Music, Finalize → Final. An example of derivational stemming is the affix removal technique, whose simplest algorithm truncates words at the Nth symbol (keeping words shorter than N letters unaffected). Though not used for stemming in real systems, this algorithm provides a good baseline for evaluating other algorithms [3]. Another simple approach is the so-called "S"-stemmer, which conflates the singular and plural forms of English nouns, as sketched below.
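For illustration, here is a sketch of a common three-rule formulation of the "S"-stemmer (following Harman's version); the paper does not spell out its exact rules, so this should be read as one plausible variant rather than the authors' implementation.

```java
// Sketch of the three-rule "S"-stemmer (after Harman, 1991): conflates plural
// and singular forms of English nouns; at most one rule fires per word.
public class SStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies"))
            return w.substring(0, w.length() - 3) + "y";   // queries -> query
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes"))
            return w.substring(0, w.length() - 1);         // wishes -> wishe (crude by design)
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss"))
            return w.substring(0, w.length() - 1);         // documents -> document
        return w;                                          // unchanged otherwise
    }
}
```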
Each token is compared against the stop word list, and matches are removed from the token list so that these words will not be stored in the signature file. Some of the stop words, listed below in Table 2, are found to be non-considerable. These are the top stop words prescribed by Cambridge University. We save these stop words in a text file and compare them with each word in the temporary text file.
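A minimal sketch of this stop word filtering step, assuming the stop word file holds one lowercase word per line; the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: load a stop word list from a text file (one word per line) and
// drop matching tokens so they never reach the signature file.
public class StopWordFilter {
    public static List<String> filter(List<String> tokens, String stopWordFile) throws IOException {
        Set<String> stopWords = new HashSet<>(Files.readAllLines(Paths.get(stopWordFile)));
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t.toLowerCase())) kept.add(t); // keep content words only
        }
        return kept;
    }
}
```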
... is found to be the best throughout, as we verified on various other tools like WEKA. The user interface, file submission, categorization and the sequence of results are shown in Figure 5.
The generated CSV file is fed to WEKA version 3.6, and the results are compared across the various classifiers built into it. We applied various attribute selection methods and tested which instances were classified correctly. The term count is found to be the best criterion on which to build the classifier. Figure 6 shows the results. A sketch of producing such a CSV appears below.
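As a hedged sketch of how such a CSV might be produced (the paper does not give its exact format): WEKA's CSV loader expects a header row naming each attribute, with the class label here placed in the final column. All names are illustrative.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

// Sketch: write per-document term counts plus a class label as a CSV that
// WEKA 3.6 can open; column order must match the header row.
public class WekaCsvWriter {
    public static void write(String path, List<String> vocabulary,
                             List<Map<String, Integer>> docs, List<String> labels) throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println(String.join(",", vocabulary) + ",class");           // header row
            for (int i = 0; i < docs.size(); i++) {
                StringBuilder row = new StringBuilder();
                for (String term : vocabulary) {
                    row.append(docs.get(i).getOrDefault(term, 0)).append(',');
                }
                row.append(labels.get(i));                                  // class label last
                out.println(row);
            }
        }
    }
}
```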
3. Summarization can also be included in the same software.
4. The techniques can be useful for extractive summarization of data.
5. Apply to various categories of trained data.
References
[1] Lam, W. and Ho, N.Y. (1998), Using a Generalized Instance Set for Automatic Text Categorization, SIGIR'98, pp. 81-89.
[2] Dinesh, R., Harish, B.S., Guru, D.S. and Manjunath, S. (2009), Concept of Status Matrix in Text Classification, Proceedings of the Indian International Conference on Artificial Intelligence, Tumkur, India, pp. 2071-2079.
[3] Guru, D.S., Harish, B.S. and Manjunath, S. (2009), Clustering of Textual Data: A Brief Survey, Proceedings of the International Conference on Signal and Image Processing, pp. 409-413.
[4] Mitra, V., Wang, C.J. and Banerjee, S. (2007), Text Classification: A Least Square Support Vector Machine Approach, Journal of Applied Soft Computing, Vol. 7, pp. 908-914.
[5] Fung, G.P.C., Yu, J.X., Lu, H. and Yu, P.S. (2006), Text Classification without Negative Examples Revisit, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, pp. 23-47.
[6] Song, F., Liu, S. and Yang, J. (2005), A Comparative Study on Text Representation Schemes in Text Categorization, Journal of Pattern Analysis and Applications, Vol. 8, pp. 199-209.
[7] Porter, M.F. (1980), An Algorithm for Suffix Stripping, Program, Vol. 14(3), pp. 130-137.
[8] Hotho, A., Nürnberger, A. and Paaß, G. (2005), A Brief Survey of Text Mining, Journal for Computational Linguistics and Language Technology, Vol. 20, pp. 19-62.
[9] Salton, G., Wong, A. and Yang, C.S. (1975), A Vector Space Model for Automatic Indexing, Communications of the ACM, Vol. 18, pp. 613-620.
[10] Bernotas, M., Karklius, K., Laurutis, R. and Slotkiene, A. (2007), The Peculiarities of the Text Document Representation, Using Ontology and Tagging-Based Clustering Technique, Journal of Information Technology and Control, Vol. 36, pp. 217-220.
[11] Yang, Y., Slattery, S. and Ghani, R. (2002), A Study of Approaches to Hypertext Categorization, Journal of Intelligent Information Systems, Vol. 18(2), pp. 219-241.
[12] Sandip, K. (2003), An Experimental Study of Some Algorithms for Text Categorization, M.Tech Thesis, IIT Kanpur, India.
[13] Lawrence Wong, Automatic News Summarization and Extraction System, MEng Computing, Department of Computing, Imperial College, http://www.doc.ic.ac.uk/
[14] Gongde Guo (2004), A kNN Model-Based Approach and its Application in Text Categorization.
[15] Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, Consiglio Nazionale delle Ricerche, Italy.
[16] Atika Mustafa and Ali Akbar, Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization, International Journal of Multimedia and Ubiquitous Engineering, Vol. 4, No. 2.
[17] Ioan Pop (2006), An Approach of the Naive Bayes Classifier for the Document Classification, General Mathematics, Vol. 14, No. 4, pp. 135-138.
[18] Lifei Chen, Yanfang Ye and Qingshan Jiang (2008), A New Centroid-Based Classifier for Text Categorization, 22nd International Conference on Advanced Information Networking and Applications - Workshops (AINAW), pp. 1217-1222.
[19] Frawley, W., Piatetsky-Shapiro, G. and Matheus, C. (1991), Knowledge Discovery in Databases: An Overview, in Knowledge Discovery in Databases, pp. 1-30.
[20] Padmavati Shrivastava and Uzma Ansari (2010), Text Categorization based on Associative Classification, Special Issue of IJCCT, Vol. 1, Issues 2, 3, 4 [ACCTA-2010].
[21] Loftsson, H., Rögnvaldsson, E. and Helgadóttir, S. (Eds.) (2010), Summarization as Feature Selection for Document Categorization on Small Datasets, LNAI 6233, pp. 39-44, Springer-Verlag Berlin Heidelberg.
[22] Suneetha Manne and Dr. S. Sameen Fatima (2011), A Novel Approach for Text Categorization of Unorganized Data Based with Information Extraction, International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3, No. 7, pp. 2846-2854.