
Information Processing and Management 47 (2011) 323–335


A method of extracting malicious expressions in bulletin board systems by using context analysis
Hiroshi Hanafusa, Kazuhiro Morita, Masao Fuketa, Jun-ichi Aoe
Department of Information Science and Intelligent Systems, University of Tokushima, Tokushima 770-8506, Japan

Abstract
Bulletin board systems are well-known basic services on the Internet for frequent information exchange. The convenience of bulletin boards enables us to communicate with other people and to read the communication contents at any time. However, malicious postings about crimes are serious problems for service companies and users. The extracting scheme of traditional methods depends on words or a sequence of words without considering the context of articles, and it therefore takes a lot of human effort to alert malicious articles. In order to reduce this human effort, this paper presents a new filtering algorithm that can recover the error rate of false positives for non-malicious articles by using context analysis. The presented scheme builds detection knowledge by introducing multi-attribute rules. Experimental results for 11,019 test data show that the specificity and sensitivity of the presented method are 38.7 and 24.1 percentage points higher, respectively, than those of the traditional method. © 2010 Elsevier Ltd. All rights reserved.

Article history: Received 29 September 2009; Received in revised form 17 August 2010; Accepted 17 August 2010.

Keywords: Malicious expressions; Bulletin board systems; Filtering systems; Context analysis; Multi-attribute rules; Separate co-occurrence expressions.

1. Introduction

Bulletin board systems (BBS) have been used as basic services on the Internet for frequent information exchange. Representative examples of BBS are 2 channel <2 channel> in Japan and Yahoo! Bulletin Board <Yahoo! BBS> worldwide. Social networking services (SNS) and blog services can be considered applications of BBS because they provide a communication place on the Internet (Claypool, Brown, LE, & Waseda, 2001). Mixi <Mixi> and Yokoku-In <Yokoku.in> are well-known services in Japan, and myspace <myspace> and Livejournal <Livejournal> are representative services worldwide. Therefore, the BBS discussed in this paper include the above SNS applications. The convenience of bulletin boards is that users can casually communicate with other people owing to the anonymity (security) of the services, and can read the communication contents at any time. However, malicious postings in bulletin boards, such as postings about mental abuse and warnings of crimes, are serious problems for users. Each country has taken action against these problems. In America, the Children's Internet Protection Act (CIPA) <Children Internet Protection Act> was established in 2000, which requires public schools and libraries to use filtering software. In Germany, the Youth Protection Act <Youth Protection Act> was established in 2002, which requires providers to filter harmful content. In France, the Digital Economic Act <Digital Economic Act> was established in 2004, which requires explanation of filtering for accessing online communication services to the public. In Japan, the Provider Liability Act <Provider Liability Act> was established in May 2002, the action plan for the dissemination of and enlightenment about filtering was started in March 2006, and the Internet Environmental Improvement Act was established in June 2008. Moreover, there were cabinet decisions on the comprehensive measures against suicide established in June 2007, and the prohibition of harm to third
Corresponding author.
E-mail address: aoe@is.tokushima-u.ac.jp (J.-i. Aoe). doi:10.1016/j.ipm.2010.08.003


persons was appended in a revised version in December 2008. Considering those laws and measures, legislative preparation in Japan is more advanced than in other countries, and there are many court cases and case examples.
In order to solve these serious problems for service companies and users, typical filtering schemes use URL filtering systems <Anichiva>, which are knowledge bases representing malicious sites. This scheme can completely filter sites that include malicious expressions; however, the problem is that the URL knowledge must be built by humans for rapidly increasing Web information. Therefore, it cannot be applied to filter only part of the malicious articles within the same site. The useful schemes (Goldberg, Nichols, Oki, & Terry, 1992; Goldberg, Roeder, Guptra, & Perkins, 2001; Good et al., 1996; Heckerman, Chickering, Meek, Rounthwite, & Kadie, 2000; Herlocker, Konstan, & Riedl, 2002; Kim, Min, Jeon, Man Ro, & Han, 2009; Lee, Lee, Chung, & An, 2007; Pennock, Horvitz, Lawrence, & Giled, 2000; Reddy, Kitsuregawa, Sreekanth, & Rao, 2002; Wang, Arjen, & Marcel, 2006) are automatic detection filtering systems. In these studies, content analysis (Kim et al., 2009; Lee et al., 2007) was introduced for malicious domains such as pornography, drugs, violence and crime, but the basic scheme uses two steps of classification to detect harmful words based on SVM filtering and does not use context analysis. In these methodologies, there are two types of detection knowledge bases: rule-based schemes (Francis, Frantz, & Mathieu, 2000; Landau, Sillion, & Vichot, 1993) and statistics-based schemes (Gharieb, 2000; Yoohwan, Wing, Mooi, & Chao, 2006). Rule-based techniques can detect precise locations of the expected expressions and it is easy to improve parts of the rules, but they require practical matching engines and expert persons to build the knowledge. The typical statistics-based scheme is the Support Vector Machine (SVM) (Larry & Malik, 2001), for which it is easy to build detection knowledge through its strong learning technique and to control decision engines, but it is weak at locating the expected expressions precisely and at partly improving the detection knowledge, because of its automatic learning. These automatic filtering techniques are only supporting systems in current applications. Although the current techniques can detect a sequence of words, there are no practical techniques that can consider the context of articles, which is their biggest limitation. Therefore, in current automatic filtering systems, the rate of false positives (Shiraki et al., 2004; Xu, Chong, Lu, & Zhou, 2004) for non-malicious articles is high. Consequently, humans must check a lot of articles in the application systems. For example, the DeNA BBS service (3.14 million articles per day) needs 300 persons and 2000 million Yen for checking, but critical problems remain because not all articles can be checked, given the large number of inappropriate Web articles, even when automatic filtering techniques are used.
In order to solve these problems, this paper presents a new context filtering algorithm to reduce human effort and to improve the rate of false positives without degrading the rate of false negatives. First of all, the presented method defines separate co-occurrence (SC) expressions that cannot be detected by the word sequences of the traditional methods.
Moreover, context analyses for SC expressions are proposed by introducing multi-attribute rules, which are suitable for extracting expressions in the changing Web, especially inappropriate Web contents, together with a common hierarchy method. The presented method is evaluated on 11,019 test data including malicious and non-malicious articles. It is verified that the presented method can improve the rate of false positives of the traditional method without degrading the rate of false negatives.
Section 2 describes the outline of the context analysis and introduces the outline of the presented system by classifying malicious expressions into inadequate and crime expressions. Section 3 proposes multi-attribute rules for these expressions. In Section 4, a context analysis algorithm is presented. Section 5 evaluates the accuracy and time performance of the presented method on the experimental data. Section 6 presents conclusions and possible further work.

2. The presented system

2.1. The outline of context analysis

Fig. 1 shows examples of malicious and non-malicious expressions, where the abbreviation RPG in (b1) means a role-playing game. Texts (a1), (a2) and (a3) in Fig. 1a are malicious because the underlined expressions are crime and inadequate expressions. However, texts (b1), (b2) and (b3) in Fig. 1b are not malicious because the bold expressions are negative expressions that deny the malicious expressions. In (b1), the malicious expressions "I have a strong sword" and "I kill them" can be denied by a game field identified by the SC expression "RPG". In (b2), there are negative expressions given by a computer field identified by the SC expressions "machine" and "processes". In (b3), there are negative expressions given by the attention SC expression "Do not write malicious articles", which can deny the malicious expressions "You are ugly" and "this woman is BBW". The difficulty of the traditional method is that it
(a1) I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them.
(a2) This machine is very heavy because there are many muzzles. I will try to kill them soon.
(a3) You are ugly in any jacket, always lying in the meeting at work and you speak about BBW.
(b1) I have a strong sword. Bring your company in the next scene of RPG tomorrow. I kill them.
(b2) This machine is very heavy because there are many processes. I will try to kill them soon.
(b3) Do not write malicious articles in BBS. For example, "You are ugly in any jacket" or "this woman is BBW".

(a) Malicious expression text

(b) Non-malicious expression text

Fig. 1. Examples of malicious and non-malicious expressions.


has no scheme for detecting the separate co-occurrence (SC) expressions formed by the combinations (I have a strong sword, RPG) and (RPG, I kill them) in (b1), (machine and processes, I will try to kill them) in (b2), and (Do not write malicious articles, this woman is BBW) in (b3). In order to achieve the above solution, this paper presents a new filtering algorithm to detect SC expressions by introducing context analysis and multi-attribute rules. Note that the expressions to be detected in the presented method combine sequences of expressions based on the traditional method and SC expressions corresponding to context analysis.

2.2. Inadequate and crime expressions

In this paper, expressions to be detected are classified into two categories: inadequate and crime expressions.

2.2.1. Inadequate expressions

Inadequate expressions, which make people feel irritated, have four main categories as follows:

(a) <<ABUSE>> expressions involve violent or insulting comments towards someone, or cause the psychological state of being annoyed by someone, as follows:
(1) You are ugly in any jacket. (2) This woman is BBW. (3) You are always lying in the meeting at work. (4) Everybody in the company says you are stupid. (5) Are you crazy?
(b) <<DISCRIMINATION>> means treating people differently through prejudice: unfair treatment of one person or group, usually because of prejudice about race and ethnicity, as follows:
(1) I think she is deaf because she can't understand what I say all the time. (2) He is a bad man, as all his talk is about BBW. (3) Yellow monkeys can't use this room because this is for white people.
(c) <<DATING SERVICE WEBSITE>> refers to a dating system which allows individuals, couples and groups to make contact and communicate with each other over the Internet, as follows:
(1) I'm a 16-year-old girl. I can go out with guys at 3.¹
(d) <<OBSCENITY>> means the trait of behaving in an obscene manner, as follows:
(1) I want to see you naked. (2) I want to buy kid porn. (3) He will visit that building to buy some kid porn.

Although there are also overlap expressions, which are successive postings with the same contents (trolling), and ungrammatical expressions, both are excluded from this paper's discussion because they have no SC expressions.

2.3. Crime expressions

Bulletin boards include expressions which warn about crimes. They are very important expressions to detect because such terribly malicious postings have the possibility to affect people and organizations seriously. As some cases actually arose from postings warning of crimes, those postings should not be permitted even if they are fake. There are four categories of expressions with warnings of crimes as follows:

(a) <<MURDER&VIOLENCE>>, as defined in common-law countries, is the unlawful killing of another human being with intent, as follows:
(1) Your friends are immediately killed. (2) Killed some people at Tokyo station last week.
(b) <<EXPLOSION&ARSON>> means the crime of deliberately and maliciously destroying or setting fire to structures or wildland, as follows:
(1) A strange boy set fire to his grandparents' house last night. (2) A female terrorist destroyed a big shopping mall with dynamite in Wakayama last Monday.
(c) <<CRIME MATERIAL>> means the tools which are used in crime processing, as follows:
(1) I get a strong sword. I will kill them with it. (2) This machine is very heavy because there are many muzzles.
(d) <<DRUG>> means a chemical substance that affects the processes of the mind or body, as follows:
(1) S crystal, high quality, 0.0002 g.
(2) White and clear SS, high quality, ice ice ice.
¹ Where 3 means the amount of money.


2.4. The construction of the presented system

Fig. 2 shows the construction of the presented system. In Fig. 2, the posting identification information (article number/posting title/acquisition) is extracted from the bulletin board (text data), and then the input sentence is segmented by morpheme analysis and named entity processing. In the next step, the two context phases for inadequate and crime expressions are carried out in parallel by their respective extraction rules. Finally, risk judgment is conducted according to the above results.
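Purely as an illustrative sketch of this flow (the function names and stubs below are our assumptions in Python, not the authors' modules), the construction can be outlined as follows:

    # Sketch of the Fig. 2 flow with stub analysis steps (our assumptions, not the real modules).
    def morpheme_and_ne_analysis(text):
        # Placeholder: one attribute set per token.
        return [frozenset({("STR", w)}) for w in text.split()]

    def inadequate_expressions_test(structures):
        return []  # placeholder for the <<ABUSE>>/<<DISCRIMINATION>>/... extraction rules

    def crime_expression_test(structures):
        return []  # placeholder for the <<MURDER&VIOLENCE>>/<<DRUG>>/... extraction rules

    def risk_judgment(text):
        """Run both context phases on the segmented input and keep the fixed (FIX) decisions."""
        structures = morpheme_and_ne_analysis(text)
        decisions = inadequate_expressions_test(structures) + crime_expression_test(structures)
        fixed = [d for d in decisions if d and d[0] == "FIX"]
        return {"malicious": bool(fixed), "decisions": fixed}

    print(risk_judgment("I get a strong sword"))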

3. Rule-based extracting knowledge

3.1. Definition of multi-attribute rules

For extraction of the expected expressions in natural language processing, it is important to introduce an efficient algorithm that can match multi-attribute rules by formation (morphological, syntactic and semantic). In order to build efficient detection rules, or knowledge, the fundamental concept was proposed by Ando, Mizobuchi, Shishibori, and Aoe (1998) and has been utilized for the target-based approach to sentence classification (Kadoya et al., 2005). Moreover, this approach has been applied to the classification of medical reports (Kiyoi, Atlam, Fuketa, Yoshinari, & Aoe, 2008) and to emotion analysis (Yoshinari, Atlam, Morita, Kiyoi, & Aoe, 2008). Generally, these attributes include strings (words), parts of speech (categories) and concepts (semantics, or meanings). Suppose that A_NAME represents the attribute name and let A_VALUE represent the attribute value. Then, let R be a finite set of pairs (A_NAME, A_VALUE); R is called a rule structure, and the attributes are as follows:

STR: string, or word spelling.
CAT: category by general concepts, or a part of speech.
SEM: semantic information to be defined in this paper.

The formal definition follows the description by Kadoya et al. (2005), but all rules correspond to inadequate and crime expressions. For example, by using these attributes, the input structures of the sentence "He kills someone" are defined as follows:

N1 = {(STR, He), (CAT, HUMAN)}.
N2 = {(STR, kills), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.
N3 = {(STR, someone), (CAT, HUMAN)}.

Fig. 2. The construction of the presented system: input from the bulletin board (text data); acquisition of posting identification information (article number/posting title/acquisition); morpheme and named entity analysis using a morpheme dictionary; the inadequate expressions test (<<ABUSE>>, <<DISCRIMINATION>>, <<DATING SERVICE WEBSITE>>, <<OBSCENITY>>) and the crime expression test (<<MURDER&VIOLENCE>>, <<EXPLOSION&ARSON>>, <<CRIME MATERIAL>>, <<DRUG>>), driven by their respective extraction rules; and risk judgment.


In the above examples, (SEM, MURDER&VIOLENCE) denotes the semantics of crime expressions, (CAT, HUMAN) denotes a category by general concepts, and (CAT, HUMAN) and (CAT, VERB) denote parts of speech. A huge number of crime expressions can be represented by multi-attribute rules using these semantics. The p-th multi-attribute matching rule Rule (p) is defined as follows:

Rule (p) = Rp1 Rp2 ... Rpm, m = np, 0 < np.


A rule to match the above input expression "He kills someone" can be defined by Rule (1) as follows:

Rule (1) = R11 R12 R13, where R11 = {(CAT, HUMAN)}, R12 = {(SEM, MURDER&VIOLENCE)}, R13 = {(CAT, HUMAN)}.
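As an illustration only (a Python sketch, not the authors' C implementation), the input structures and Rule (1) above can be written as sets of (A_NAME, A_VALUE) pairs; the matching condition used later in Section 4, namely that an input structure N matches a rule structure R when N includes R, then becomes a plain set-inclusion test:

    # Illustrative sketch: multi-attribute structures as sets of (A_NAME, A_VALUE) pairs.
    Structure = frozenset  # a structure is a set of (attribute name, attribute value) pairs

    # Input structures for "He kills someone" (Section 3.1).
    N_SEQ = [
        Structure({("STR", "He"), ("CAT", "HUMAN")}),
        Structure({("STR", "kills"), ("CAT", "VERB"), ("SEM", "MURDER&VIOLENCE")}),
        Structure({("STR", "someone"), ("CAT", "HUMAN")}),
    ]

    # Rule (1) as the sequence of rule structures R11 R12 R13.
    RULE_1 = [
        Structure({("CAT", "HUMAN")}),
        Structure({("SEM", "MURDER&VIOLENCE")}),
        Structure({("CAT", "HUMAN")}),
    ]

    def matches(inputs, rule):
        """True if each aligned input structure Ni includes the rule structure Ri (Ni includes Ri)."""
        return len(inputs) == len(rule) and all(n >= r for n, r in zip(inputs, rule))

    print(matches(N_SEQ, RULE_1))  # True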
3.2. Multi-attribute descriptions

3.2.1. Semantic information

Semantic information (SEM) follows Section 2, so only the typical semantics are explained. Inadequate and crime expressions are described by combining a variety of words, phrases, categories and semantics as follows:

(1) Abuse expressions: (SEM, ABUSE) is used for violent or insulting comments towards someone. For example, abuse expressions are "ugly", "liar", "stupid" and "crazy".
(2) Discrimination expressions: (SEM, DISCRIMINATION) is used for treating people differently through prejudice. For example, a discrimination expression is "deaf".
(3) Obscenity expressions: (SEM, OBSCENITY) is used for the trait of behaving in an obscene manner. For example, obscenity expressions are "naked" and "kid porn".
(4) Murder & violence expressions: (SEM, MURDER&VIOLENCE) is used for the unlawful killing of another human being with intent. For example, murder and violence expressions are "kill" and "shoot".
(5) Explosion & arson expressions: (SEM, EXPLOSION&ARSON) is used for destroying or setting fire to structures or wildland areas. For example, explosion and arson expressions are "terrorists", "destroy" and "set fire".
(6) Crime material expressions: (SEM, CRIME MATERIAL) is used for the tools which are used in crime processing. For example, crime material expressions are "sword" and "muzzles".

3.2.2. Multi-attribute rules

Context analysis by multi-attribute expressions (MULTI) is carried out in two stages. The first stage determines candidates of inadequate and crime expressions in the text and produces results (CON, x), where CON and x represent features for the context analysis of the second stage. The second stage determines the final results, or risk judgment, by using (CON, x), and produces (FIX, y) if the result is fixed, otherwise (NON, y). The detailed method will be discussed in the next section. Tables 1 and 2 show rule-based knowledge for the first and second stages, respectively. Table 1 uses general concepts such as CLOTHES, JOB and DOCUMENT. (CON, NEGATION) neglects inadequate and crime decisions. Negation can also be performed by the concepts denoting special fields, (CON, GAME) and (CON, COMPUTER). In Table 1, Rule (8) can match the input crime expression "He kills someone" and produces (CON, MURDER&VIOLENCE) for the context analysis of the second stage. In Table 2, Rule (18) = {{(CON, CRIME MATERIAL)}{(CON, PLACE)}{(CON, TIME)}{(CON, MURDER&VIOLENCE)}} is the decision rule for <<CRIME MATERIAL>> and <<MURDER&VIOLENCE>>, where <<>> corresponds to the types of inadequate and crime classes in Section 2. This rule takes the features of the first-stage context analysis in Table 1 and produces the final judgment (FIX, <<CRIME MATERIAL>>, <<MURDER&VIOLENCE>>), where this notation means (FIX, <<CRIME MATERIAL>>) and (FIX, <<MURDER&VIOLENCE>>). If the final judgment is not FIX, then the output becomes (NON, <<CRIME MATERIAL>>), (NON, <<MURDER&VIOLENCE>>) or (NON, <<ABUSE>>), as in Table 2.

4. Multi-attribute matching

4.1. Construction of machines

For multi-attribute matching, Ando et al. (1998) proposed a set matching algorithm, whose implementation was developed in the C programming language. Kadoya et al. (2005) used this approach for sentence classification, Kiyoi et al. (2008) used it for medical reports, and Yoshinari et al. (2008) used it for emotion analysis.


Table 1
Examples of extracting knowledge of the first stage.

Rule (1) = {{(SEM, ABUSE)}{(CAT, CLOTHES)}} → (CON, ABUSE); example: ugly in any jacket
Rule (2) = {{(SEM, ABUSE)}{(CAT, JOB)}} → (CON, ABUSE); example: liar at work
Rule (3) = {{(SEM, ABUSE)}{(CAT, DOCUMENT)}} → (CON, ABUSE); example: malicious articles
Rule (4) = {{(CAT, HUMAN)}{(CAT, VERB)}{(SEM, DISCRIMINATION)}} → (CON, DISCRIMINATION); example: she is deaf
Rule (5) = {{(CAT, VERB)}{(CAT, HUMAN)}{(SEM, OBSCENITY)}} → (CON, OBSCENITY); example: see you naked
Rule (6) = {{(CAT, HUMAN)}{(CAT, VERB)}{(SEM, OBSCENITY)}} → (CON, OBSCENITY); example: I buy kid porn
Rule (7) = {{(SEM, MURDER&VIOLENCE)}{(CAT, HUMAN)}} → (CON, MURDER&VIOLENCE); example: kill people
Rule (8) = {{(CAT, HUMAN)}{(SEM, MURDER&VIOLENCE)}{(CAT, HUMAN)}} → (CON, MURDER&VIOLENCE); example: He kills someone
Rule (9) = {{(CAT, HUMAN)}{(SEM, EXPLOSION&ARSON)}} → (CON, EXPLOSION&ARSON); example: strange boy set fire
Rule (10) = {{(SEM, EXPLOSION&ARSON)}{(CAT, ORGANIZATION)}} → (CON, EXPLOSION&ARSON); example: destroy a high school
Rule (11) = {{(CAT, HUMAN)}{(CAT, VERB)}{(SEM, CRIME MATERIAL)}} → (CON, CRIME MATERIAL); example: I get a strong sword
Rule (12) = {{(CAT, MACHINE)}{(SEM, CRIME MATERIAL)}} → (CON, CRIME MATERIAL); example: machine has many muzzles
Rule (13) = {{(CAT, VERB)}{(CAT, GAME)}} → (CON, GAME); example: play RPG
Rule (14) = {{(CAT, MACHINE)}{(CAT, VERB)}{(CAT, PROCESS)}} → (CON, COMPUTER); example: machine has many processes
Rule (15) = {{(SEM, NEGATIVE)}} → (CON, NEGATION); example: Do not write malicious articles
Rule (16) = {{(CAT, NAME)}{(CAT, STATION)}} → (CON, PLACE); example: Tokyo station
Rule (17) = {{(CAT, TIME)}} → (CON, TIME); example: tomorrow

Table 2
Examples of decision knowledge of the second stage (context analysis).

Rule (18) = {{(CON, CRIME MATERIAL)}{(CON, PLACE)}{(CON, TIME)}{(CON, MURDER&VIOLENCE)}} → (FIX, <<CRIME MATERIAL>>, <<MURDER&VIOLENCE>>); example: Fig. 1 (a1) "I get a strong sword. Bring your company to Tokyo station tomorrow. I will kill them."
Rule (19) = {{(CON, CRIME MATERIAL)}{(CON, GAME)}} → (NON, <<CRIME MATERIAL>>); example: Fig. 1 (b1) "I get a strong sword. Bring your company to play RPG tomorrow. I will kill them."
Rule (20) = {{(CON, CRIME MATERIAL)}{(CON, MURDER&VIOLENCE)}} → (FIX, <<CRIME MATERIAL>>, <<MURDER&VIOLENCE>>); example: Fig. 1 (a2) "This machine has many muzzles. I will try to kill them soon."
Rule (21) = {{(CON, COMPUTER)}{(CON, MURDER&VIOLENCE)}} → (NON, <<MURDER&VIOLENCE>>); example: Fig. 1 (b2) "This machine has many processes. I will try to kill them soon."
Rule (22) = {{(CON, ABUSE)}{(CON, ABUSE)}} → (FIX, <<ABUSE>>); example: Fig. 1 (a3) "You are ugly in any jacket, always lying in the meeting at work and you speak about BBW."
Rule (23) = {{(CON, NEGATION)}{(CON, ABUSE)}} → (NON, <<ABUSE>>); example: Fig. 1 (b3) "Do not write malicious articles in BBS. For example, 'You are ugly in any jacket' or 'this woman is BBW'."
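For illustration only, the two-stage knowledge of Tables 1 and 2 can be viewed as data: first-stage rules map input structures to (CON, x) features, and second-stage rules map sequences of those features to (FIX, ...) or (NON, ...) decisions. The sketch below encodes a few of the rules above in that form and checks second-stage rules with a simple subsequence test rather than the MAPM machine of Section 4; the data layout and function name are our own assumptions:

    # Sketch: a few first-stage rules (Table 1) and second-stage rules (Table 2) as data.
    FIRST_STAGE = [
        # (rule id, sequence of attribute sets to match, produced feature)
        (8,  [{("CAT", "HUMAN")}, {("SEM", "MURDER&VIOLENCE")}, {("CAT", "HUMAN")}], ("CON", "MURDER&VIOLENCE")),
        (11, [{("CAT", "HUMAN")}, {("CAT", "VERB")}, {("SEM", "CRIME MATERIAL")}],   ("CON", "CRIME MATERIAL")),
    ]

    SECOND_STAGE = [
        # (rule id, sequence of (CON, x) features, decision)
        (18, [("CON", "CRIME MATERIAL"), ("CON", "PLACE"), ("CON", "TIME"), ("CON", "MURDER&VIOLENCE")],
             ("FIX", "<<CRIME MATERIAL>>", "<<MURDER&VIOLENCE>>")),
        (19, [("CON", "CRIME MATERIAL"), ("CON", "GAME")], ("NON", "<<CRIME MATERIAL>>")),
        (23, [("CON", "NEGATION"), ("CON", "ABUSE")],      ("NON", "<<ABUSE>>")),
    ]

    def second_stage_decisions(features, rules=SECOND_STAGE):
        """Return the decisions of second-stage rules whose feature pattern occurs as a
        subsequence of the first-stage feature sequence (a simplification of MAPM)."""
        decisions = []
        for rule_id, pattern, decision in rules:
            it = iter(features)
            if all(p in it for p in pattern):  # subsequence test
                decisions.append((rule_id, decision))
        return decisions

    # First-stage features for Fig. 1 (b1): crime material, a game context, then a murder expression.
    print(second_stage_decisions([("CON", "CRIME MATERIAL"), ("CON", "GAME"), ("CON", "MURDER&VIOLENCE")]))
    # -> [(19, ('NON', '<<CRIME MATERIAL>>'))]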

Suppose that R is a sequence of input structures. The multi-attribute pattern-matching (MAPM) machine in this method takes R as the input and produces matching results as the output corresponding to the rules. Formally, the machine MAPM consists of a set of states, and each state is represented by a number. The matching operation of the machine MAPM is similar to the multi-keyword string pattern-matching method of Aho–Corasick (Aho & Corasick, 1975; Ando et al., 1998), but it has the following distinctive features:

4.1.1. goto and output functions

Let T be a set of states and let L be a set of rule structures R; then the behaviour of the machine MAPM is defined by the following two functions:

(a) goto function, goto: T × L → T ∪ {fail}, where the function goto maps a pair consisting of a state and a rule structure into a state or the message fail. A transition label of the goto function is extended to a set notation. Therefore, in the machine MAPM, a confirming transition is decided by the inclusion relationship, namely whether the input structure N includes the rule structure R or not.
(b) output function, output: T → A, where A is a set of pairs (p, (x, y)) for rule number p and matching result (x, y). For Rule (1) in Table 1, the matching result becomes (CON, ABUSE), and then {(1, (CON, ABUSE))} is the proper representation.

The input structures to be matched by the matching rules are also defined by the same set representation. N is used as the notation for input structures to distinguish them from R. In order to consider the abstraction of the rule structure, matching of the rule structure R and the input structure N is decided by the inclusion relationship such that N includes R, denoted by N ⊇ R. Therefore, the machine MAPM is also called a set matching machine. Let R_SET be the set of Rules (p) for inadequate and crime expressions. Consider the following Rule (7) and Rule (8) in Table 1.

R71 = {(SEM, MURDER&VIOLENCE)}, R72 = {(CAT, HUMAN)}.
R81 = {(CAT, HUMAN)}, R82 = {(SEM, MURDER&VIOLENCE)}, R83 = {(CAT, HUMAN)}.


Suppose that the input "he kills someone" has the following structures:

N1 = {(STR, he), (CAT, HUMAN)}.
N2 = {(STR, kills), (CAT, VERB), (SEM, MURDER&VIOLENCE)}.
N3 = {(STR, someone), (CAT, NOUN), (CAT, HUMAN)}.

Each input structure can include the corresponding rule structure as follows:

N2 ⊇ R71, N3 ⊇ R72.
N1 ⊇ R81, N2 ⊇ R82, N3 ⊇ R83.

The machine MAPM becomes non-deterministic if there are two or more rules that can match the input structure. The ambiguity is solved by selecting the longest applicable rule with high priority. Figs. 3 and 4 show the goto and output functions for Tables 1 and 2, respectively. In these figures, Rules 10, 15, 16, 17 and 19 are omitted because only some sample rules are used.

4.2. Multi-stage matching

The context analysis of MULTI is carried out in two stages, where the rules of Table 1 are used for the first-stage matching and those of Table 2 are used for the second stage. The following procedure summarizes the behaviour of the machine MAPM as the procedure MAPM(a, M); the context analysis of the proposed method MULTI is carried out by calling this procedure MAPM(a, M) twice (Kiyoi et al., 2008).

4.2.1. Procedure MAPM(a, M)

A sequence a of input structures is N1, N2, ..., Nn, where each Ni (0 < i < n + 1) is an input structure. M is a machine MAPM defined by the goto and output functions. Note that the input of the first stage is the result of named entity processing, and that of the second stage is the sequence of outputs with the notation (CON, x) from the first-stage matching. The function NEXT(a) returns the first structure N1 and modifies a to N2...Nn, i.e., the first structure N1 is removed.
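As a simplified, non-authoritative sketch of the goto/output mechanism of Section 4.1.1 (the transition table below follows the first-stage walkthrough of Table 3 for "I get a strong sword"; the helper names and the linear search over transitions are our own assumptions), matching can be driven as follows:

    # Sketch: the state path of Table 3, with goto keyed by (state, rule structure)
    # and output keyed by state (our simplification of the MAPM machine).
    FAIL = "fail"

    GOTO = {
        (1, frozenset({("CAT", "HUMAN")})): 2,
        (2, frozenset({("CAT", "VERB")})): 5,
        (5, frozenset({("SEM", "CRIME MATERIAL")})): 8,
    }
    OUTPUT = {8: [(11, ("CON", "CRIME MATERIAL"))]}  # Rule (11) fires at state 8

    def step(state, n):
        """Take a confirming transition whose label R satisfies n >= R (set inclusion), otherwise fail."""
        for (s, r), nxt in GOTO.items():
            if s == state and n >= r:
                return nxt
        return FAIL

    inputs = [frozenset({("STR", "I"), ("CAT", "HUMAN")}),
              frozenset({("STR", "get"), ("CAT", "VERB")}),
              frozenset({("STR", "a strong sword"), ("CAT", "NOUN"), ("SEM", "CRIME MATERIAL")})]

    state, results = 1, []
    for n in inputs:
        state = step(state, n)
        results.extend(OUTPUT.get(state, []))
    print(results)  # [(11, ('CON', 'CRIME MATERIAL'))]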

Fig. 3. The goto and output functions for some sample rules in Table 1. ((CAT, CLOTHES), (CAT, JOB) and (CAT, DOCUMENT) are merged into the same transition, and output(16) merges Rules 1, 2 and 3.)


Fig. 4. The goto and output functions for rules in Table 2.

begin
  STATE := 0;
  while a ≠ NULL do
  begin
    N := NEXT(a);
    while STATE ≠ fail do
      STATE := goto(STATE, R) such that N ⊇ R;
    MAPM(a, M);
    Output := output(p) for matched Rule (p);
    N := NEXT(a);
  end
end

Consider the input sentence "I get a strong sword" in (a1) of Fig. 1, with the following sequence of structures:

N1 = {(STR, I), (CAT, HUMAN)}.
N2 = {(STR, get), (CAT, VERB)}.
N3 = {(STR, a strong sword), (CAT, NOUN), (SEM, CRIME MATERIAL)}.

Table 3 shows the matching flow of the first stage for the above input structures. The state transitions are 1, 2, 5 and 8, and then state 8 produces output(8) = {(11, (CON, CRIME MATERIAL))}, which becomes the input of the second-stage matching. Table 4 shows the matching flow of the second stage for (a1) in Fig. 1. Suppose that the following results are obtained from the first stage: "I get a strong sword." is N1 = {(CON, CRIME MATERIAL)}; "Bring your company to Tokyo station tomorrow." is N2 = {(CON, PLACE)} and N3 = {(CON, TIME)}; and "I will kill them" is N4 = {(CON, MURDER&VIOLENCE)}. The state transitions are 1, 4, 5, 6 and 7, and then output(7) produces the final decision that (a1) is <<CRIME MATERIAL>> and <<MURDER&VIOLENCE>>. In the same manner, Table 5 shows the matching flow of the second stage for part of (b1) in Fig. 1. Suppose that the following results are obtained from the first stage:
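The second call to MAPM can be sketched in the same simplified style (our own illustration, not the authors' implementation); the states and outputs below follow Tables 4 and 5:

    # Sketch (our own simplification): the second stage, driven over the first-stage
    # (CON, x) features; states and outputs follow Tables 4 and 5.
    GOTO2 = {
        (1, ("CON", "CRIME MATERIAL")): 4,
        (4, ("CON", "PLACE")): 5,
        (5, ("CON", "TIME")): 6,
        (6, ("CON", "MURDER&VIOLENCE")): 7,
        (1, ("CON", "GAME")): 11,
        (11, ("CON", "MURDER&VIOLENCE")): 12,
    }
    OUTPUT2 = {
        7:  [(18, ("FIX", "<<CRIME MATERIAL>>", "<<MURDER&VIOLENCE>>"))],
        12: [(21, ("NON", "<<MURDER&VIOLENCE>>"))],
    }

    def run_second_stage(features, start=1):
        """Walk the second-stage machine; collect any outputs attached to visited states."""
        state, decisions = start, []
        for f in features:
            state = GOTO2.get((state, f), "fail")
            if state == "fail":
                break
            decisions.extend(OUTPUT2.get(state, []))
        return decisions

    # (a1): crime material, place, time, then a murder expression -> FIX (Table 4).
    print(run_second_stage([("CON", "CRIME MATERIAL"), ("CON", "PLACE"),
                            ("CON", "TIME"), ("CON", "MURDER&VIOLENCE")]))
    # (b1), in part: game context, then a murder expression -> NON (Table 5).
    print(run_second_stage([("CON", "GAME"), ("CON", "MURDER&VIOLENCE")]))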

Table 3
Example of the matching process in the first stage.

STATE   N    R                          goto/output
1       N1   {(CAT, HUMAN)}             2
2       N2   {(CAT, VERB)}              5
5       N3   {(SEM, CRIME MATERIAL)}    output(8) = {(11, (CON, CRIME MATERIAL))}

Table 4
Example of the matching process in the second stage.

STATE   N    R                           goto/output
1       N1   {(CON, CRIME MATERIAL)}     4
4       N2   {(CON, PLACE)}              5
5       N3   {(CON, TIME)}               6
6       N4   {(CON, MURDER&VIOLENCE)}    output(7) = {(18, (FIX, <<CRIME MATERIAL>>, <<MURDER&VIOLENCE>>))}

Table 5
Example of the matching process in the second stage.

STATE   N    R                           goto/output
1       N1   {(CON, GAME)}               11
11      N2   {(CON, MURDER&VIOLENCE)}    output(12) = {(21, (NON, <<MURDER&VIOLENCE>>))}

"Bring your company in the next scene of RPG tomorrow" is N1 = {(CON, GAME)}, and "I kill them" is N2 = {(CON, MURDER&VIOLENCE)}. The state transitions are 1, 11 and 12. Then, output(12) produces the final decision that the text is not <<MURDER&VIOLENCE>>.

5. Experimental results

5.1. Basic detection knowledge and experimental data

Basic detection knowledge has been built to detect expressions of abuse, obscenity, drugs and crime. Table 6 shows the contents of the detection knowledge for each expression type, where the following abbreviations are used:

NUM-WORD: The number of word expressions.
NUM-RULE: The number of multi-attribute rules.
NUM-PAT: The number of surface patterns.

Experimental data have been collected from 22 bulletin boards, and the number of articles with possibly inappropriate expressions is 8450. For the main sites 2 channel <2 channel>, Yahoo! <Yahoo! BBS>, Gakkou-Ura <Gakkou-Ura> and Yokoku-In <Yokoku.in>, the fields of the above articles include criticism, requests, hospitals, educational problems, cartoons, betrayal, dirt, comics, rumors, arrest, notices and crimes. 1525 inadequate (I) and 388 crime (C) expressions to be evaluated have been obtained from the 8450 malicious articles. For the non-inadequate (NI) and non-crime (NC) test data, 2569 non-malicious expressions (1277 non-inadequate (NI) and 1382 non-crime (NC) expressions) have been prepared from Web pages, like Fig. 1(b), including basic single words such as kill, sword, sex, adults and so on. That is to say, test data I and C are malicious data, while NI and NC are non-malicious data.

5.2. Experimental results

The presented method, based on multi-attribute expressions in context, is called MULTI, and the traditional method, based on sequences of morphemes or words, is called SINGLE in contrast to MULTI. To evaluate MULTI and SINGLE, specificity and sensitivity (Altman & Bland, 1994) are used as follows:

True positive (TP): Malicious expressions correctly determined as malicious.
False positive (FP): Non-malicious expressions incorrectly determined as malicious.
True negative (TN): Non-malicious expressions correctly determined as non-malicious.
False negative (FN): Malicious expressions incorrectly determined as non-malicious.

Let NUM_TP, NUM_FP, NUM_TN and NUM_FN be the numbers of TP, FP, TN and FN, respectively.

SPECIFICITY: The rate (%) of specificity, calculated as NUM_TN/(NUM_TN + NUM_FP).
SENSITIVITY: The rate (%) of sensitivity, calculated as NUM_TP/(NUM_TP + NUM_FN).
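As a small worked example of these two measures (our own helper functions; the counts are the MULTI figures for the inadequate data reported later in Tables 7 and 8):

    # Sketch: specificity and sensitivity as defined above (values in %).
    def specificity(num_tn, num_fp):
        return 100.0 * num_tn / (num_tn + num_fp)

    def sensitivity(num_tp, num_fn):
        return 100.0 * num_tp / (num_tp + num_fn)

    # MULTI counts for the inadequate data (Tables 7 and 8): TP = 1247, FN = 252, TN = 1078, FP = 199.
    print(round(specificity(1078, 199), 1), round(sensitivity(1247, 252), 1))  # 84.4 83.2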
Table 6
The contents of basic detection knowledge.

              NUM-WORD   NUM-RULE   NUM-PAT
Inadequate    16,239     1281       12,875,138
Crime         12,681     1378       9,486,523
Total         28,920     2659       22,361,661


A specificity of 100% means that all non-malicious expressions will be detected as non-malicious expressions. A sensitivity of 100% means that all malicious expressions will be detected as malicious expressions. We can say that the presented method has a low error rate and can reduce human effort for a large number of malicious candidates. Other notations for the malicious test data (I and C) are as follows:

ALL_DATA: The number of all data to be evaluated.
ALL_CORR: The number of all correct expressions to be extracted.
NUM_EXTR: The number of extracted expressions.

Table 7 shows the experimental results for malicious expressions, where SINGLE(I) and SINGLE(C) represent SINGLE for I and C, respectively; MULTI(I) and MULTI(C) have the same meaning. Table 8 shows the experimental results for the non-malicious test data (NI and NC), where SINGLE(NI) and SINGLE(NC) represent SINGLE for NI and NC, respectively; MULTI(NI) and MULTI(NC) have the same meaning. Table 9 shows the evaluation results for SPECIFICITY and SENSITIVITY obtained from the TP, FN, TN and FP in Tables 7 and 8. From the simulation results in Table 9, it turns out that the SPECIFICITY and SENSITIVITY of MULTI are improved by 38.7 and 24.1 percentage points, respectively, over those of SINGLE. The main reason is that the error rate by false positives (FP) of MULTI is much lower than that of SINGLE in Table 8. The high TP and TN of MULTI are also related to the improvement. In general, the number of malicious expressions is far larger than that of non-malicious expressions. Therefore, the presented method contributes to the reduction of human effort. Moreover, the presented rule-based method has two useful advantages as follows:

(a) Unknown words: The presented rule-based method is suitable for extracting expressions in the changing Web, especially inappropriate Web contents. The reason is based on the set matching ability such that N ⊇ R. Suppose that N has (CAT, UNKNOWN) when the input has unknown expressions. Then, R is replaced by {(CAT, UNKNOWN)} and N ⊇ R is confirmed.
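A minimal sketch of this recovery, under our own assumptions about how the inclusion test might be relaxed (not the authors' code), is:

    # Sketch: error recovery for unknown words in the inclusion test N ⊇ R.
    UNKNOWN = ("CAT", "UNKNOWN")

    def includes_with_recovery(n, r):
        """N includes R, except that a structure tagged (CAT, UNKNOWN) satisfies any rule
        structure, as if R were replaced by {(CAT, UNKNOWN)}."""
        return True if UNKNOWN in n else n >= r

    n_unknown = frozenset({("STR", "ABC"), UNKNOWN})
    print(includes_with_recovery(n_unknown, frozenset({("CAT", "HUMAN")})))  # True: no fail transition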
Table 7
Simulation results for inadequate (I) and crime (C) expressions.

                        ALL_CORR   NUM_EXTR   NUM_TP   NUM_FN   TP (%)   FN (%)
SINGLE(I)               1525       920        891      634      96.8     58.4
MULTI(I)                1525       1453       1247     252      85.8     81.8
MULTI(I)-SINGLE(I)      NON        533        356      382      11.0     23.3
SINGLE(C)               388        229        228      160      99.6     58.8
MULTI(C)                388        324        312      76       96.3     80.4
MULTI(C)-SINGLE(C)      NON        95         84       84       3.3      21.6
SINGLE                  1913       1149       1119     794      97.4     58.5
MULTI                   1913       1777       1559     328      87.7     81.5
MULTI-SINGLE            NON        628        440      466      9.7      23.0

NON represents the empty data.

Table 8
Simulation results for non-inadequate (NI) and non-crime (NC) expressions.

                        ALL_NON   NUM_TN   NUM_FP   TN (%)   FP (%)
SINGLE(NI)              1277      526      751      41.1     58.8
MULTI(NI)               1277      1078     199      84.4     15.6
MULTI(NI)-SINGLE(NI)    NON       552      552      43.3     43.2
SINGLE(NC)              1382      228      160      56.7     43.3
MULTI(NC)               1382      312      76       91.2     8.8
MULTI(NC)-SINGLE(NC)    NON       84       84       34.5     34.5
SINGLE                  2659      754      911      49.2     50.8
MULTI                   2659      1390     275      87.9     12.1
MULTI-SINGLE            NON       636      636      37.8     38.7

NON represents the empty data.

Table 9
Evaluation results by SPECIFICITY and SENSITIVITY.

                 SPECIFICITY (%)   SENSITIVITY (%)
SINGLE(I + NI)   41.2              58.4
MULTI(I + NI)    84.4              83.2
SINGLE(C + NC)   56.7              58.8
MULTI(C + NC)    91.2              80.4
SINGLE           49.2              58.5
MULTI            88.0              82.6


Consider the following input and rule structures:

N1 = {(STR, Killed), (SEM, MURDER&VIOLENCE)}.
N2 = {(STR, ABC), (CAT, UNKNOWN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.
Rule (7) = R71 R72 R73, where R71 = {(SEM, MURDER&VIOLENCE)}, R72 = {(CAT, HUMAN)}, R73 = {(CAT, PLACE)}.

In this case, N1 ⊇ R71, N2 ⊇ R72 and N3 ⊇ R73 are satisfied because R72 = {(CAT, HUMAN)} is replaced by R72 = {(CAT, UNKNOWN)}. Therefore, transitions always succeed by this error recovery. Although this error recovery produces many accessible transitions, it is a very practical scheme with robustness because it is easy to restrict the upper bound of possible transitions in a practical system. It is a difficult task to register new words and expressions into dictionaries together with their categories and semantics. For this problem, it is clear that the above method does not need to be extended together with such registration. The important point of the robustness issue is how to extract possible candidates from malicious expressions with many syntax errors and argotic words.

(b) Hierarchical concept matching: Consider "delete people at Tokyo station" with the following structures:

N1 = {(STR, delete), (SEM, DELETE)}.
N2 = {(STR, people), (CAT, NOUN), (CAT, HUMAN)}.
N3 = {(STR, Tokyo Station), (CAT, PLACE)}.

Suppose that the semantic meaning of the verb "delete" is the super-category DELETE of the category MURDER&VIOLENCE, and suppose that R71 of the above Rule (7) is {(SEM, DELETE\MURDER&VIOLENCE)}, where \ denotes a hierarchical notation. Hierarchical concept matching succeeds if the DELETE of N1 is equal to the super-category DELETE of DELETE\MURDER&VIOLENCE in R71. This matching is weak because it is not perfect, but the extended matching is practical as an error recovery in that similar expressions can be extracted. That is to say, it enables us to support rule-based knowledge using concepts.
The rule bases of the presented method MULTI are being built for frequent expressions step by step, but there are difficult problems, as shown in the following example: "RQJmcf2O kill Aaaaqqqbbb", where RQJmcf2O and Aaaaqqqbbb are user IDs. Context analysis for a sequence of articles including past information should be proposed, but the current system has no ability to describe applicable rules. This technique depends on discourse analysis and remains for future research. Moreover, there are some <ungrammatical sentences> as follows:

D R U G S
To solve this problem, special frozen analysis must be introduced case by case, and this also remains for future research.
The Support Vector Machine (SVM) is a well-known approach. SVMs depend on words or a sequence of words without considering the context of articles, and they have several disadvantages (Burgess, 1998) as follows:

(1) The biggest limitation of the support vector approach lies in the choice of the kernel.
(2) The second limitation is speed and size, both in training and testing.
(3) Discrete data presents another problem.
(4) The most serious problem with SVMs is the high algorithmic complexity and extensive memory requirements in large-scale tasks (Horvath, 2003).

However, the results detected by the presented method can be used by SVM schemes as learning features, because SVMs require a lot of correct training data. That is to say, SVMs and the presented method can work in a coordinated manner.
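Relating to point (b) above, a minimal sketch of hierarchical concept matching, assuming a one-level hierarchy written as SUPER\SUB (our own illustration, not the authors' implementation), is:

    # Sketch: hierarchical concept matching for SEM values such as DELETE\MURDER&VIOLENCE.
    def sem_matches(input_sem, rule_sem):
        """The input matches if it equals the rule value or any level of its hierarchy
        (the rule value is written as SUPER\\SUB)."""
        return input_sem in rule_sem.split("\\")

    print(sem_matches("DELETE", "DELETE\\MURDER&VIOLENCE"))           # True: super-category match
    print(sem_matches("MURDER&VIOLENCE", "DELETE\\MURDER&VIOLENCE"))  # True
    print(sem_matches("ABUSE", "DELETE\\MURDER&VIOLENCE"))            # False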


Fig. 5. Time evaluation of the presented method and the traditional method.

5.3. Time evaluation and error analysis

The detecting ability of the presented method is excellent, as described above, but it is very important to evaluate the time performance together with essential error analysis for the whole system, which consists of the following modules. The first module, FOCUS, determines the essential text by removing redundant text (advertising parts) from Web pages. This module is carried out by HTML tag processing. Highly frequent pages with the same tag format, such as recommended products and news, are removed in this module to reduce the error rate of false positives. In the second module, morphological analysis (MOPH) determines parts of speech and fundamental concepts. For error analysis, unknown expressions are detected in this module. The module KW determines keywords consisting of sequential expressions from the results of morphological analysis. The module FIELD determines document fields (Atlam, Elmarhomy, Fuketa, Morita, & Aoe, 2006; Atlam, Fuketa, Morita, & Aoe, 2003; Fuketa, Lee, Tsuji, Okada, & Aoe, 2000; Fuketa et al., 2005). This module is carried out by matching field association words to the results of morphological analysis. The results of this module can be used to reduce the error rate by false positives; examples are the game and computer fields for (b1) and (b2) in Fig. 1 and for Rule (13) and Rule (14) in Table 1, respectively. The next module, NE, determines named entities such as names, organizations, places and so on. This module is carried out on the results of the keyword analysis (Asahara & Matsumoto, 2003; Wright & Budin, 1997). For example, "ABC Station" is a station name and "Nagoya company" is a company name. In the error analysis of this module, the unknown word "ABC" and the ambiguous name "Nagoya" from the module MOPH can be resolved. The module ATTR determines SC expressions by using the presented multi-attribute method. SINGLE uses FOCUS, MOPH, KW and NE. MULTI uses FOCUS, MOPH, NE and ATTR, where NE is included in ATTR. The presented system has been developed on a Windows 2003 server with two Intel Xeon E5440 CPUs (2.83 GHz) and 2 GB of main memory. Fig. 5 shows the time expenses of the above modules, where the analysis time is estimated for 100 articles in HTML and their text (TEXT). The sizes of HTML and TEXT are 1 MB and 60 KB, respectively. For the HTML documents in Fig. 5, it turns out that the time of the presented method is practical, although MULTI is about 1.28 times slower than SINGLE. In fact, MOPH and FOCUS can be performed by preprocessing servers, so the analysis time of the main module ATTR of MULTI becomes 20 ms for a text article.

6. Conclusion

The extracting scheme of traditional methods depends on words or a sequence of words without considering the context of articles. Therefore, many irrelevant candidates of possible malicious expressions are extracted. Although the current filtering scheme can precisely alert malicious articles, many non-malicious articles are not recognized correctly. In order to solve these problems, this paper has presented a new filtering algorithm to detect SC expressions by introducing multi-attribute rules (MULTI). For 11,019 articles, it has been verified that the presented method can improve the rate of false positives of the traditional method without degrading the rate of false negatives. Therefore, we can say that the presented method MULTI is a very useful approach for filtering services for inadequate expressions. In future work, rule-based knowledge needs to be built for many more types of malicious postings, together with the error recovery.
References
2 channel. <http://www.2ch.net/>.
Aho, A. V., & Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6), 333–340.
Altman, D. G., & Bland, J. M. (1994). Diagnostic tests: Sensitivity and specificity. BMJ, 308, 1552.
Ando, K., Mizobuchi, S., Shishibori, M., & Aoe, J. (1998). Efficient multi-attribute pattern matching. An International Journal of Computer Mathematics, 66(1+2), 21–38.


Anichiva. <http://cn.anchiva.com/download/Commtouch%20URL%20Filtering%20White%20Paper_Anichiva_En.pdf>.
Asahara, M., & Matsumoto, Y. (2003). Japanese named entity extraction with redundant morphological analysis. In Proc. of HLT-NAACL 03 (pp. 8–15).
Atlam, E.-S., Elmarhomy, G., Fuketa, M., Morita, K., & Aoe, J. (2006). Automatic building of new field association word candidates using search engine. Information Processing & Management Journal, 42(4), 951–962.
Atlam, E.-S., Fuketa, M., Morita, K., & Aoe, J. (2003). Documents similarity measurement using field association terms. An International Journal of Information Processing and Management, 39(6), 809–824.
Burgess (1998). <http://www.svms.org/disadvantages.html>.
Children Internet Protection Act. <http://en.wikipedia.org/wiki/Childrens_Internet_Protection_Act>.
Claypool, M., Brown, D., LE, P., & Waseda, M. (2001). Inferring user interest. IEEE Internet Computing, 5, 32–39.
Digital Economic Act. <http://www.legifrance.gouv.fr/afchTexte.do?cidTexte=JORFTEXT000000801164&dateTexte>.
Francis, W., Frantz, V., & Mathieu, S. (2000). Using learning-based filters to detect rule-based filtering obsolescence. In Proceedings of RIAO 2000, Paris.
Fuketa, M., Kadoya, Y., Atlam, E.-S., Kunikata, T., Morita, K., Kashiji, S., et al. (2005). A method of extracting and evaluating good and bad reputations for natural language expressions. Information Technology & Decision Making, 4(2), 77–196.
Fuketa, M., Lee, S., Tsuji, T., Okada, M., & Aoe, J. (2000). A document classification method by using field association words. An International Journal of Information Sciences, 126(1), 57–70.
Gakkou-Ura. <http://schecker.jp/> (in Japanese).
Gharieb, R. R. (2000). Higher order statistics based IIR notch filtering scheme for enhancing sinusoids in colored noise. IEE Proceedings Vision Image and Signal Processing, 147(2), 115–121.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35, 61–70.
Goldberg, K., Roeder, T., Guptra, D., & Perkins, C. (2001). Eigentaste: A constant-time collaborative filtering algorithm. Information Retrieval, 4, 133–151.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B. M., & Harter, S. P. (1996). Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47, 37–49.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49–75.
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2002). An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5, 287–310.
Horvath (2003). In Suykens et al., p. 392.
Kadoya, Y., Morita, K., Fuketa, M., Ohono, M., Atlam, E.-S., Sumitomo, T., et al. (2005). A sentence classification technique by using intention association expressions. Computer Mathematics, 82(7), 777–792.
Kim, S., Min, H., Jeon, J., Man Ro, Y., & Han, S. (2009). Malicious content filtering based on semantic features. In Proceedings of the ACM international conference on interaction sciences: Information technology, culture and human (Vol. 403, pp. 802–806), Seoul, Korea.
Kiyoi, K., Atlam, E.-S., Fuketa, M., Yoshinari, T., & Aoe, J. (2008). A method for extracting knowledge from medical texts including numerical representation. International Journal of Computer Applications in Technology, 33(2/3), 226–236.
Landau, M. C., Sillion, F., & Vichot, F. (1993). Exoseme: A thematic document filtering system. In Intelligence Artificial, Avignon, France.
Larry, M., & Malik, Y. (2001). One-class SVMs for document classification. Journal of Machine Learning Research, 139–154.
Lee, W., Lee, S., Chung, S., & An, D. (2007). Harmful contents classification using the harmful word filtering and SVM. In Proceedings of the 7th international conference on computational science, Part III: ICCS 2007 (pp. 18–25), May 27–30, 2007, Beijing, China.
Livejournal. <http://www.livejournal.com/>.
Mixi. <http://mixi.jp/> (in Japanese).
myspace. <http://us.myspace.com/>.
Pennock, D. M., Horvitz, E., Lawrence, S., & Giled, C. L. (2000). Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In Proceedings of the sixteenth annual conference on uncertainty in artificial intelligence (UAI-2000) (pp. 473–480), Morgan Kaufmann, San Francisco.
Provider Liability Act. <http://law.e-gov.go.jp/htmldata/H13/H13HO137.html>.
Reddy, P. K., Kitsuregawa, P., Sreekanth, P., & Rao, S. S. (2002). A graph based approach to extract a neighborhood customer community for collaborative filtering. In Lecture notes in computer science: Databases in networked information systems, second international workshop (pp. 188–200), Springer.
Shiraki, N., Hara, M., Ogino, H., Shibamoto, Y., Iida, A., Tamaki, T., et al. (2004). False-positive and true-negative hilar and mediastinal lymph nodes on FDG-PET: Radiological–pathological correlation. Annals of Nuclear Medicine, 18(1), 23–28.
Wang, J., Arjen, P., & Marcel, J. T. (2006). Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proceedings of SIGIR 2006, August 6–11, 2006, Seattle, Washington, USA.
Wright, S. E., & Budin, G. (1997). Handbook of terminology management. Basic aspects of terminology management (Vol. 1). Amsterdam, Philadelphia: John Benjamins.
Xu, J., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the 30th VLDB (pp. 204–215), Toronto, Canada.
Yahoo! BBS. <http://messages.yahoo.co.jp/index.html>.
Yokoku.in. <http://yokoku.in/> (in Japanese).
Yoohwan, K., Wing, C., Mooi, C., & Chao, H. Jonathan (2006). PacketScore: A statistics-based packet filtering scheme against distributed denial-of-service attacks. IEEE Transactions on Dependable and Secure Computing, 3(2), 141–155.
Yoshinari, T., Atlam, E.-S., Morita, K., Kiyoi, K., & Aoe, J. (2008). Automatic acquisition for sensibility knowledge using co-occurrence relation. International Journal of Computer Applications in Technology, 33(2/3), 218–225.
Youth Protection Act. <http://www.wien.gv.at/recht/landesrecht-wien/landesgesetzblatt/jahrgang/2002/html/lg2002017.htm>.
