
A Summary Sentence Extraction Method for Web-based Mailing List Review Application and Its Effectiveness Study

Satoru FUJITANI, Kanji AKAHORI Graduate School of Decision Science and Technology, Dept. of Human System Science, Tokyo Institute of Technology. 2-12-1, O-okayama, Meguro, Tokyo 152-8552, Japan, mike@cradle.titech.ac.jp, akahori@cradle.titech.ac.jp
E-mail-based communication is gradually making its way into distant collaborative learning environments. Compared with a traditional lecture-and-discussion setting, however, e-mail-based collaborative discussion offers only limited information about the learners, which makes it difficult to know their latest status and to provide immediate, effective feedback. The authors propose an information retrieval method for mailing list review. Relevant nouns are pulled out of e-mail messages as keywords, and summary sentences are extracted using these keywords. Japanese natural language processing technology is employed in the proposed method. These extraction procedures form the basis of a Web-based mailing list reference application. The authors claim that these features help to identify an outline of the discussion topics in a mailing list and improve the readability of e-mails. A statistical evaluation found the method to be an effective retrieval method for e-mail-based communication. In addition, a comparative experiment between the sentences extracted by 46 undergraduate students and those extracted by the algorithm suggests that the proposed method can properly detect the major sentences in e-mail articles.

Keywords: e-mail discussion, mailing list, web-based application, natural language processing

1 Introduction
Of all Internet services, e-mail is the most widely used. E-mail is increasingly used as a means of storing the latest information and of interactive learning among learners. In particular, mailing lists help learners send messages to a whole group of people and receive messages back from them. Messages sent through a mailing list can be followed by all members of the group [1]. In the future, the use of mailing lists as a part of collaborative learning is expected to increase even more.

In e-mail-based collaborative learning, it is difficult for teachers to provide spontaneous analysis and feedback. Teachers must therefore try to facilitate the learning process effectively. Learners themselves face the same difficulties in e-mail communication, e.g. a deluge of e-mail [2]. Without a teacher, learners need to exert more effort to choose what is important among the numerous e-mail messages that they process. Our earlier research on a network-based collaborative learning project facilitated by the mailing list of overseas Japanese schools [3] suggests that sorting and indexing support tools are crucial in order to apply appropriate ideas for educational improvement. We further suggest that extracting stored information according to the learners' interests can help them keep their focus on the discussion topic.

Many improvements have been suggested for an easy-to-use network collaborative learning environment, but most of them concern the graphical user interface [4-5]. In this paper, therefore, the authors propose a trial method for extracting keywords and summaries from a series of e-mail messages (called a thread) in a mailing list. Japanese natural language processing technology is employed for appropriate keyword extraction. The keywords extracted from a thread clarify the topics that the subscribers deal with.

The purpose of this research is to verify empirically where indicators, that is, important words for summary extraction, occur in a series of e-mail texts (a thread), and to develop a summary extraction system based on these empirical findings. Many researchers have discussed methods for content-based text extraction [6-12]. Previous researchers equate the summary with a number of important sentences selected from the document. This approach assumes that there are certain indicators of the importance of words, phrases, or complete sentences.

We propose an e-mail discussion extraction model in order to acquire new information. We assume that there is some cohesion between the original e-mail and the reply e-mail. The sentence extraction algorithm is based on empirical experiments. The keyword extraction algorithm assumes a continuity of keyword appearance in the e-mail thread, which makes it possible to parse the thread and extract typical terms that occur with high frequency in each message. This keyword extraction method is well suited to domain-specific knowledge. With the help of the e-mail's original title and the given keywords, learners and newcomers can learn about the message topics. A computer learning environment can easily archive the result of e-mail exchanges. With this method, learners' activities can be inspected and reviewed. This keyword extraction method has formed the basis for a WWW-based mailing list reference environment [13-14].

2 Extracting keywords from e-mail thread


2.1 Some characteristics of Japanese Sentences for the extraction
Spoken Japanese is characterized as follows: (1) Pronouns are not usually used to replace proper nouns; the original nouns are repeated instead. (2) Repeated nouns are not usually replaced with pronouns, and are used with demonstratives affixed. For instance, in the context of e-mails, the other party's name tends to be repeated without being replaced by a pronoun. Japanese people are used to greeting and referring to others by surname, which is why the listener or receiver may not be able to tell the sender's gender. Common nouns are also used in this way. Because of this practice, we can expect to find the same noun, written in the same form, across documents that refer to one another.

2.2 Characteristics of paragraph structure in E-mail threads


The language style in electronic exchanges is informal, and it is located somewhere between the spoken language and the written language. The usage and the structure in electronic exchanges are very different [15]. It is difficult to characterize the style of an e-mail document. Also, heuristics based on the location of the selected sentences in the document can seldom be used. For instance, we can expect to see a reference to a preceding document anywhere in the reply e-mail.

2.3 Why Keywords Extraction?


Take the case of a text-based discussion on a network. When a person wants to participate in an ongoing discussion on a certain topic, (s)he first reads the series of e-mails, or thread, one by one from the beginning. Moreover, if the topics diverge and someone wants to pursue one of them, (s)he also needs to read all the relevant threads in detail. Although no locational rules or structure can be found in e-mail documents, there is a coreference between a preceding e-mail and its reply. Also, quite commonly, e-mail documents are used for reference. In this paper, retrieval by keywords is based on these features of e-mails, and a summary sentence extraction method is proposed later.

In this paper, we assume that mailing list communication takes place as follows:
(1) In a mailing list, participants refer to the content of the preceding messages, and cite them occasionally.
(2) Also in a mailing list, important words in the discussion appear many times, especially nouns that correspond to the topic.
(3) Therefore, important words in the discussion accumulate at a high frequency in a discussion thread.

We propose the following keyword extraction method. First, keywords that are related to the content of the messages in the discussion are extracted and displayed automatically. Based on the keywords, sentences thought to be necessary for the discussion are pulled out and listed together. Using this method, the topics in the discussion are expected to emerge more clearly. Furthermore, the flow of the discussion can be summarized to some extent.

Appending a subject is another practice among users of e-mail. Those who use e-mail can append a subject by writing a topic in the header of the e-mail, and the subject is considered the best way to represent the topic of a message. In practice, however, users seldom change the subject in a reply e-mail. The reply e-mail's subject is changed automatically by the computer, which prefixes "Re:" to the original subject, e.g. "Re: (original subject)". With the subsequent exchange of reply e-mails, the topic of the thread gradually changes while the subject does not. As such, we cannot rely on the subject as a keyword of the message.

2.4 Characteristics of Extracted Keywords in This Research

Several methods of information retrieval have been used to extract a part of a Japanese text. For example, Kurohashi et al. [17] assume that the most important description of a word in a text is the part where the word occurs with the highest frequency. They tried their method on 20 books. This method uses the density of the individual word in a book. Earlier research by Fujitani & Akahori [13] shows that mailing lists have less continuity of topics than Netnews or WWW-based bulletin board systems, even if participants post their messages as responses to the preceding message. In contrast, in this paper we identify the keyword(s) in a mailing list. Participants in the mailing list, who are members of a special interest group, have a tendency to refer to these keywords repeatedly in their e-mails.

2.5 Algorithm for Keyword Extraction

In this paper, the authors apply the morpheme analysis system JUMAN [18] for the extraction. Figure 1 shows a diagram of this keyword extraction method. The procedure is as follows:
(1) The contributing message is analyzed into morphemes, and by judging the part of speech of every morpheme, the nouns are pulled out. These are taken as the keyword candidates of the message. The authors apply Japanese language heuristics in this step, as established by preceding research. First, nouns written only in Hiragana are excluded because many of them are used in a way that differs from their original literal meaning. Second, nouns of one Kanji character are excluded [19] because most of them are so common that they are not suitable as keywords.
(2) In the same way, keyword candidates are selected from the preceding message and from the response message. For the response message, if two or more messages are sent consecutively, nouns that are pulled out from at least one of them are all included in the list of keyword candidates.
(3) For consecutive messages, the intersection of the keyword candidates is obtained. The authors define this noun set as the Pre noun set of the target (response) message, which relates it to the preceding message. Another set of nouns can be obtained similarly from the target message and the response message. That set of nouns is referred to as the Post noun set of the target (preceding) message.

Figure 1: Diagram of the keyword extraction method

In this paper, the noun sets listed in Figure 2 are defined. In a thread with more than two messages, the intersection of the Pre and Post noun sets can be obtained for each message that has both a preceding and a response message. These noun sets take more specific characteristics into account than simple noun sets. For example, one noun set contains only the nouns that newly appeared. These noun sets are used to look for a more appropriate keyword extraction method. An evaluation experiment of the keyword extraction method is described in a later section.
Figure 2: List of defined noun sets in this research
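The candidate filtering and the Pre and Post noun sets described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the nouns of each message have already been produced by a morphological analyzer such as JUMAN (the analyzer call itself is not shown), and the function names, the Unicode ranges used to recognize Hiragana and Kanji, and the final "newly appeared" difference set are assumptions made for the example.

```python
import re

# Assumed character classes: the Hiragana block and the basic CJK ideograph block.
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309f]+")
SINGLE_KANJI = re.compile(r"[\u4e00-\u9fff]")


def keyword_candidates(nouns):
    """Drop Hiragana-only nouns and one-Kanji nouns (heuristics of step 1)."""
    return {
        n for n in nouns
        if not HIRAGANA_ONLY.fullmatch(n) and not SINGLE_KANJI.fullmatch(n)
    }


def pre_post_sets(preceding_nouns, target_nouns, response_noun_lists):
    """Return the (Pre, Post) noun sets of the target message.

    preceding_nouns / target_nouns: nouns of the preceding and target message.
    response_noun_lists: one noun list per consecutive response message;
    nouns found in at least one response all count as candidates (step 2).
    """
    target = keyword_candidates(target_nouns)
    preceding = keyword_candidates(preceding_nouns)
    responses = set()
    for nouns in response_noun_lists:
        responses |= keyword_candidates(nouns)
    return target & preceding, target & responses  # step 3: intersections


# Hypothetical noun lists standing in for analyzer output.
preceding = ["学校", "こと", "インターネット", "私"]
target = ["学校", "メール", "インターネット", "もの"]
responses = [["メール", "授業"], ["学校", "人"]]

pre, post = pre_post_sets(preceding, target, responses)
newly_appeared = post - pre  # one possible combined set (an assumption)
```

Plain Python sets keep the intersection and difference operations explicit, so other combinations of Pre and Post can be tried by changing a single expression.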

3 Important Sentence Extraction Method


Generally, a summary is a highly simplified expression of content of a certain length, e.g., a research paper. In individual learning, too, summary sentences are used to understand the meaning of the content [16]. Instead of composing new sentences for the summary, we generate a list of important sentences. In order to identify the sentences that provide the context for the keywords, we use the important sentence extraction method. By the important sentences of a target message, we mean the sentences that contain the keywords already extracted from that message. We also assume that the summary is generated by compiling the important sentences. Usually, several sentences are extracted as important sentences from each message.
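The definition above, that important sentences are those containing an extracted keyword, can be illustrated with a short Python sketch. The sentence splitter is an assumption: it simply cuts on Japanese sentence-final punctuation and line breaks, which the paper does not specify, and the function names are hypothetical.

```python
import re

# Assumed sentence boundary: Japanese full stop, exclamation or question mark,
# or a line break. Real e-mail text would need more careful handling.
SENTENCE = re.compile(r"[^。！？\n]+[。！？]?")


def split_sentences(text):
    """Split message text into sentences, dropping empty fragments."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def important_sentences(text, keywords):
    """Keep, in original order, the sentences containing at least one keyword."""
    return [s for s in split_sentences(text) if any(k in s for k in keywords)]


message = (
    "来週の授業でメールを使います。"
    "天気が良いですね。"
    "学校のインターネット環境を確認してください。"
)
summary = important_sentences(message, {"メール", "学校"})
```

Keeping the original sentence order means the compiled list can be read as a summary without reordering.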

4 Evaluation Experiment of Keywords and Summary Sentences Extraction

Given the above theoretical framework, we conducted an evaluation experiment of the keyword and summary sentence extraction methods.

4.1 Purpose of the Experiment


The purpose of this experiment is to verify which keyword extraction method is appropriate for identifying e-mail topics or for making summary sentences. To do so, we compare the summary sentences extracted with keywords obtained by the methods defined in this research against the summary sentences selected through human cognition. In this research, the summary sentences are generated from the result of the keyword extraction method. Specifying which summary extraction method best matches human cognition can therefore be replaced with specifying which keyword extraction method is most suitable.

4.2 Subjects
The subjects were 46 female university undergraduate students who were taking a lecture in a teacher preparation course. All participants had their own e-mail accounts and had received instruction on e-mail usage and basic computer operations at the university.

4.3 Procedure
In this experiment, the authors used e-mail messages from a mailing list concerning the applied educational use of the Internet in overseas Japanese schools. The mailing list was maintained by the authors. It had about 300 members, of whom about 90% were elementary or junior-high school teachers. During the experiment, the subjects were first given a brief explanation of mailing lists in general. After that, they each received handouts consisting of a series of eleven e-mail messages.

The subjects were instructed to read each message and perform extraction work. They were instructed to pick out the important sentences in each message as described below.
(1) First, select the one sentence considered to be the most important, and enclose it with a square.
(2) Then underline the sentence(s) considered to be important and necessary to understand the content.
We collected the handouts one week later, and counted the adoption frequency of each important sentence.

4.4 Evaluation method and results


As mentioned before, summary sentences are defined as the sentences that the subjects marked with either a square or an underline. To compare the results of the experiment with the important sentences selected by the extraction methods, we defined two index values. The first index value, the summary degree of each message, evaluates the summary obtained in the experiment. When the subjects concentrate their marks on a very small number of sentences, the value of the summary degree grows.

(Summary Degree) = (the total number of marks on sentences by all the subjects) / (the number of marked sentences)

The second index value, the selection degree of each important sentence extraction method, is defined as follows.

(Selection Degree) = (the total number of marks by all the subjects on extracted sentences using the method) / (the number of extracted sentences)
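A small worked example may make the two indexes concrete. The Python sketch below follows the two definitions directly; the mark counts are invented for illustration and are not data from the experiment, and returning 0.0 when nothing is marked or extracted is an assumption, since the paper does not discuss that case.

```python
def summary_degree(marks):
    """marks: dict mapping sentence id -> number of marks by all subjects.

    Total number of marks divided by the number of sentences that
    received at least one mark.
    """
    marked = [m for m in marks.values() if m > 0]
    return sum(marked) / len(marked) if marked else 0.0  # 0.0 is an assumption


def selection_degree(marks, extracted_ids):
    """Total marks on the extracted sentences divided by how many were extracted."""
    if not extracted_ids:
        return 0.0  # assumption: undefined case mapped to 0.0
    total = sum(marks.get(i, 0) for i in extracted_ids)
    return total / len(extracted_ids)


# Invented marks for a four-sentence message (sentence 1 was never marked).
marks = {0: 30, 1: 0, 2: 10, 3: 6}

human = summary_degree(marks)             # 46 marks over 3 marked sentences
focused = selection_degree(marks, {0, 2})  # method picks two well-marked sentences
diluted = selection_degree(marks, {0, 1})  # method also picks an unmarked sentence
```

Here the focused extraction scores 20.0, above the summary degree of about 15.3, while the extraction that includes an unmarked sentence drops to 15.0 because only its denominator grows.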

If the sentences extracted by a method are the same as the important sentences chosen by the subjects, both indexes become equal. When a method detects a sentence that the subjects did not adopt in the experiment, only the denominator of the selection degree grows and the selection degree becomes small. On the other hand, if a method extracts only the sentences that a considerable number of subjects adopted, the selection degree grows. Therefore, when a method produces a more pointed summary and its selection degree becomes larger than the summary degree, this is interpreted to mean that the important sentence extraction method can make a more refined summary.

Table 1 and Figure 3 show the selection degree of each method and the summary degree from the experiment, calculated for the 11 messages used in the experiment.

Table 1: Comparison between the summary degree from the experiment and the selection degree of the important sentence extraction methods. Bold-faced values are larger than the summary degree.

Figure 3: Comparison between the experiment result and the important sentence extraction methods (vertical axis: Summary Degree / Selection Degree; horizontal axis: Article No. 809, 810, 812, 843, 845, 847, 850, 851, 852, 853, 854)

In Table 1 and Figure 3, the selection degree becomes higher than the summary degree when we compare the summary from the experiment with the important sentences extracted using the sets of nouns that involve the response message. In particular, the result from one of these noun sets is remarkably high. It shows that the subjects adopted as summary sentences the sentences that contain nouns which newly appeared in the message and continued to appear thereafter in the thread. Although we should note that this result depends on the topic of the e-mail, we may also explain that the subjects judged newly appearing topics in e-mail messages to be suitable for a summary of mailing list messages.

5 Statistical Evaluation of the Retrieval Method


The summary sentence extraction method has already been discussed in Section 2. A statistical comparison of our method with another keyword retrieval method is therefore needed. The evaluation proceeded as follows, and the proposed method was found to be an effective retrieval method for e-mail-based communication.

5.1 Purpose of the Evaluation


The purpose of this evaluation is to verify the appropriateness of the keyword extraction method for identifying e-mail topics or for making summary sentences, by comparing its results with those of keyword extraction using the tf*idf keyword segmentation rule. The method proposed by the authors is frequency-based, but it also uses the coreference between a preceding e-mail and its reply. The authors examine whether the proposed coreference-based method is better than the tf*idf keyword extraction method.

5.2 Procedure
The tf*idf value for keyword segmentation gives a weight to each individual word [6]. For each word in a document, the following weighted measure [16] is used:

$$ w = (\text{term frequency}) \times (\text{document frequency}) $$

where

$$ \text{term frequency} = \text{number of occurrences of the term in the document} $$

$$ \text{document frequency} = \log \frac{\text{number of documents in the corpus}}{\text{number of documents that include the term}} $$
This identifies the words in a document that are relatively unique (hence, signature) to that document. The authors applied this weighted measure to all the independent words that appear in the e-mail thread used in the evaluation experiment. The authors define the important sentences of the target message as the sentences that contain the above-mentioned keywords. This definition of important sentences under the tf*idf rule is the same as in the proposed method. There is no limit on the number of signature words used for a given document. The significance threshold is determined empirically as the one that gives the best selection degree in the e-mail thread.

Figure 4 shows the selection degree of the important sentence extraction methods.

Figure 4: The relationship of selection degree between methods (vertical axis: Summary Degree / Selection Degree; horizontal axis: Article No. 809, 810, 812, 843, 845, 847, 850, 851, 852, 853, 854; series: Proposed Method, tf*idf Method (labelled tf.idf>1), Experiment)

The tf*idf-based rule did not achieve a high score anywhere in the thread. With the present extraction method, however, high scores are obtained in the middle part of the e-mail thread. We suggest that this method does not follow the desired keyword towards the end of the thread, because the documents in an e-mail thread tend to disregard connectivity, which makes this method weak.
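The tf*idf weighting used for comparison can be sketched as follows. This is an illustration of the weighted measure as defined in this section, not the authors' code: each message is assumed to be already tokenized into independent words, the function names are hypothetical, and the threshold passed to `signature_words` is an arbitrary example value, since the paper determines it empirically.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """documents: list of token lists, one per message in the thread.

    Returns one dict per document mapping term -> tf * log(N / df), where
    tf is the count of the term in the document, N the number of documents,
    and df the number of documents that include the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights


def signature_words(doc_weights, threshold):
    """Terms whose weight exceeds the (empirically chosen) threshold."""
    return {t for t, w in doc_weights.items() if w > threshold}


# Hypothetical three-message thread.
thread = [["学校", "メール", "メール"], ["学校", "授業"], ["学校", "メール"]]
weights = tfidf_weights(thread)
keywords = signature_words(weights[1], threshold=1.0)  # threshold is illustrative
```

A word that occurs in every message of the thread, such as "学校" here, receives weight zero. This is one way to see why a measure that rewards uniqueness to a document behaves differently from the proposed method, which looks for nouns shared between consecutive messages.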

6 Generation of Summary Sentence and Digest


The authors developed summary sentence and digest generation functions as a compilation of the important sentences extracted from the messages. We use the important sentence extraction method with the noun set chosen according to the result of the experiment. In this procedure, the authors regard these important sentences as summary sentences. These functions are an addition to the WWW-based mailing list reference interface.

Figure 5 shows the system diagram for summary and digest generation from e-mail messages. Members of the mailing list use their MUA (Mail User Agent) to post e-mail to the mailing list. While the e-mail is distributed by the MTI (Mail Transfer Interface), the summary and digest generation server extracts keywords and reconfigures an e-mail summary database. After that, the summary and digest generation server produces the following outputs.
(1) A WWW page with a tree hierarchy view of the e-mail messages. The hierarchy of messages can be followed according to each e-mail's headers. This WWW page contains the message number, the posted date and time, the sender's e-mail address, the title, and the summary sentences.
(2) A digest of the mailing list using the summary sentences. First, the digest generation server puts the e-mail messages in order. According to the hierarchy of every group of messages, called a thread, the server joins the summary of each message in the order of the message number. The authors call this the digest of a thread. This digest is distributed by the mailing list at periodic intervals.

However, the extraction method described above requires that messages be sent consecutively back and forth. Thus, the authors defined the following rules. For the message at the head of a thread, the important sentences obtained with the noun set Post are expediently taken as the summary sentences. For the message at the bottom of a thread, the noun set Pre is expediently used to generate the summary sentences. Moreover, when a new consecutive message arrives and the number of consecutive messages increases, the summary sentences are generated again.
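The digest step and the head and bottom rules can be sketched as below. This is only an illustration under several assumptions: the `Message` record, its `thread_id` field (standing in for the hierarchy the server derives from e-mail headers), the digest line format, and the `middle` argument (standing in for whichever noun set the experiment favoured) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    number: int                 # message number in the mailing list
    thread_id: str              # assumed to be derived from the e-mail headers
    summary: list = field(default_factory=list)  # summary sentences


def keywords_for_position(pre, post, middle, has_preceding, has_response):
    """Choose the noun set for a message according to its place in the thread."""
    if not has_preceding:
        return post    # head of the thread: only the Post set can be formed
    if not has_response:
        return pre     # bottom of the thread: only the Pre set can be formed
    return middle      # otherwise use the set chosen from the experiment


def thread_digests(messages):
    """Join the summaries of each thread in order of message number."""
    threads = {}
    for msg in sorted(messages, key=lambda m: m.number):
        threads.setdefault(msg.thread_id, []).append(msg)
    return {
        tid: "\n".join(f"[{m.number}] " + " ".join(m.summary) for m in msgs)
        for tid, msgs in threads.items()
    }


msgs = [
    Message(852, "t1", ["B."]),
    Message(850, "t1", ["A1.", "A2."]),
    Message(851, "t2", ["C."]),
]
digests = thread_digests(msgs)
```

Because the noun set depends on whether a message currently has a response, calling these functions again whenever a new message arrives reproduces the regeneration behaviour described above.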

7 Conclusions
We have described a summary sentence extraction method for e-mail discussion and its web-based application to mailing list review. We have conducted an empirical trial to identify the location of important sentences, and developed summary sentence and digest generation functions as a compilation of the important sentences extracted from the messages. The main conclusion is that extracted summary sentences which include new keywords correspond better to human summaries than summaries based on sentences which emphasize continuing thread keywords. The authors feel the need for further empirical verification of the appropriateness of this keyword and summary extraction method in practical use.

Acknowledgement
Our special thanks are due to Madhumita Bhattacharya, Ph.D., the National Institute of Multimedia Education, Chiba, Japan, and Dr. Tanmoy Bhattacharya, Department of Phonetics and Linguistics, University College of London, U.K., for reading the manuscript and making a number of helpful suggestions.

References
[1] Roerden, L. P., Net Lessons: Web-based projects for your classroom, CA: O'Reilly & Assoc. (1997)
[2] Chandra, H., Electric Mail: An Examination of High-End Users, Proceedings of Selected Research and Development Presentations at the 1996 National Convention of the Association for Educational Communications and Technology (1996)
[3] Akahori, K. & Fujitani, S., A Survey of Educational Use of Internet in Overseas Japanese Schools (in Japanese), Proceedings of the 12th annual conference of Japan Society for Educational Technology, pp.63-64 (1996)
[4] Ahern, T. C., The Effect of Interface on the Structure of Interaction in Computer-Mediated Small-Group Discussion, Journal of Educational Computing Research, vol.11, no.3, pp.235-250 (1994)
[5] Shintani, T. & Uchimura, T., Adventure of Media-kids -Record of education practice on Internet- (in Japanese), Tokyo: NTT publishing (1996)
[6] Salton, G. & McGill, M. J., Introduction to Modern Information Retrieval, N.Y.: McGraw-Hill (1983)
[7] Brandow, R., Mitze, K., & Rau, L. F., Automatic Condensation of Electronic Publications by Sentence Selection, Information Processing & Management, vol.31, no.5, pp.675-685 (1995)
[8] Edmundson, H. P., New methods in automatic abstracting, Journal of the Association for Computing Machinery, vol.16, no.2, pp.264-285 (1969)
[9] Kupiec, J., Pedersen, J., & Chen, F., A Trainable Document Summarizer, Proceedings of the 18th Annual International ACM/SIGIR Conference, pp.68-73 (1995)
[10] Sato, M., Sato, S., & Shinoda, Y., Automatic Digesting of the NetNews (in Japanese), Journal of IPSJ, vol.36, no.10, pp.2371-2378 (1995)
[11] Ishihara, M. & Akahori, K., Development of a System to Generate Digests of Internet Articles for Supporting Discussions (in Japanese), Japan Journal of Educational Technology, vol.22, no.1, pp.1-13 (1998)
[12] Hearst, M. A. & Plaunt, C., Subtopic Structuring for Full-Length Document Access, Proceedings of the 16th Annual International ACM/SIGIR Conference (1993)
[13] Fujitani, S. & Akahori, K., Keyword Sampling and Interface Improvement for the Review of e-mail (in Japanese), Proceedings of the 21st annual conference of Japan Association of Science Education, pp.57-58 (1997)
[14] Fujitani, S. & Akahori, K., Protocol Analysis of Decision Making Strategies under Computer Network Environment, Proceedings of First International Conference on Cognitive Science (ICCS97), Seoul, South Korea, pp.140-145 (1997)
[15] Robinson, B., Electronic Communication and Collaborative Writing (in Japanese), Global Commons Inc., Japan, Trans., An Environment for Collaborative Learning: Proceedings of 4th Computer Pals International Conference [On-line], Available WWW: http://kids-commons.net/cpaw/jirei02/jirei018.html (1991)
[16] Sato, S., Information Retrieval, In Nagao, M., Kurohashi, S., Sato, S., Ikehara, S., & Nakano, H. (Eds.), Linguistic Sciences: vol. 9. Language Information Processing (in Japanese), Tokyo: Iwanami Book Publishing (1998)
[17] Kurohashi, S., Shiraki, N., & Nagao, M., A Method for Detecting Important Descriptions of a Word based on its Density Distribution in Text (in Japanese), IPSJ SIG Notes, Tokyo: Information Processing Society of Japan (IPSJ), 96-NL-115, pp.43-50 (1996)
[18] Matsumoto, Y., Japanese morpheme analysis system JUMAN ver.3.0 (in Japanese), Kyoto, Japan: Kyoto University (1996)
[19] Kokogawa, T., Keywords Extraction Methods for Group Information Sharing (in Japanese), Proceedings of IEICE Technical Report, Tokyo: the Institute of Electronics, Information, and Communication Engineers (IEICE), AI96-15, pp.51-55 (1996)
