
Almost Unsupervised Content Filtering Using Topic Models

Swapnil Hingmire
Tata Research Development and Design Centre, Pune, INDIA
swapnil.hingmire@tcs.com

Sandeep Chougule
Tata Research Development and Design Centre, Pune, INDIA
sandeep.chougule@tcs.com

Girish K. Palshikar
Tata Research Development and Design Centre, Pune, INDIA
gk.palshikar@tcs.com

Sutanu Chakraborti
Indian Institute of Technology (IIT), Madras, Chennai, INDIA
sutanuc@iitm.ac.in

ABSTRACT
One important way to help users deal effectively with document repositories is to automatically filter out those parts of the documents which are irrelevant to their goal at hand. While such a goal-driven content-filtering system is an ambitious target, we propose an almost unsupervised approach to separate useless (irrelevant) sentences in a document from useful (relevant) sentences. We automatically construct a topic model using a standard algorithm [2], manually divide the topics into useful and useless, aggregate all useful topics into a single useful topic and all useless topics into a single useless topic, and then automatically assign a label, useful or useless, to each sentence, depending on whether it is closer to the aggregated useful or useless topic. We illustrate the approach on a real-life dataset of IT infrastructure support (ITIS) tickets containing highly noisy problem descriptions and solutions, where the task is to remove all sentences not relevant as solution steps. We experimentally demonstrate that our results, apart from requiring minimal user intervention, improve over other content-filtering approaches. The approach can be generalized to remove useless sentences (from any perspective) from other types of documents.

General Terms
Theory, Performance, Experimentation, Verification

Keywords
Information Filtering, Text Mining, Noisy Text, Topic Modelling

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text analysis; I.5.4 [Pattern Recognition]: Applications - Text processing

*Research scholar at Indian Institute of Technology (IIT), Madras, Chennai, India

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AND '12, Mumbai, India. Copyright 2012 ACM 978-1-4503-1919-5/12/12 ...$10.00.

1. INTRODUCTION

With the advent of cheap and fast storage, there is an explosive growth in the size and number of documents available in enterprise document repositories. An important part of text mining is helping users make effective use of the knowledge and insights hidden in a large collection of complex documents. Techniques such as corpus visualization, topic modelling, document clustering, classification, summarization and key-phrase identification are different approaches towards this end.

One important way to help the user deal effectively with a given document is to automatically filter out those parts of the given document(s) which are irrelevant to his/her goal or task at hand. These irrelevant parts in the document constitute noise, as far as the reader's specific goal is considered. Thus, identifying irrelevant text for a given document analysis goal can be considered as a problem of handling a new category of noise.

As an example, when analyzing a news item about an accident, the reader may want to focus only on the different descriptions of the accident (e.g., as stated by the driver(s), passengers, witnesses or police). Or else, the reader may just want to read about the rescue actions taken (e.g., by bystanders, police, fire-brigade etc.). In a help-desk application, the user may just want to focus on either the symptoms, root-causes or solution actions taken for a particular problem (e.g., in a previously solved ticket for the problem "unable to print a WORD document"). Table 1 shows an example. For such applications, ideally the document repository management system should provide a content filtering facility, which allows the user to specify his/her goal and then automatically filters out all those parts (e.g., sentences) in the given document which are irrelevant to the stated goal.

PROBLEM: Need new id file and password for Lotus Notes
SOLUTION: [User was getting the message "You are not authorised to access the database" upon clicking on the MAIL Icon on the welcome page.]symptoms [Reason: incorrect mail file location in the connection document. user had the connection document of a different user selected.]diagnosis [created a new connection document.]solution [user was able to access the mails.]outcome [Resolving the ticket with the user's permission. Please provide your valuable feedback before closing the ticket.]noise

Table 1: An example help-desk ticket containing the user's problem and the resolver's solution.

While such a goal-driven content-filtering system is an ambitious target, we propose a simple approach to separate useless (irrelevant) sentences in a document from useful (relevant) sentences. For clarity, we focus on documents that contain highly noisy problem descriptions and solutions for IT infrastructure support (ITIS) tickets. The useful sentences are those that contain actions that solve the user's problem; we consider all other types of sentences as useless. The approach can be easily generalized to remove useless sentences (from any perspective) from other types of documents.

The main motivation for this work is the automatic construction of runbooks (also called troubleshooting or problem-solving manuals) from large datasets of such ITIS tickets. This is a well-known problem [7]. Such a runbook contains a sequence of well-organized and unambiguous steps that can be carried out by the support executive. Such runbooks codify extensive problem-solving knowledge obtained from experience (e.g., past tickets handled), help in improving the productivity of the support executives, and allow them to efficiently solve many tickets requiring different skills and backgrounds.

We perform content filtering using almost unsupervised text classification. The central idea in our content filtering approach is to automatically construct a topic model using a standard algorithm [2], then manually divide the topics into useful and useless, aggregate all useful topics into a single useful topic and all useless topics into a single useless topic, and then automatically assign a label, useful or useless, to each sentence, depending on whether it is closer to the aggregated useful or useless topic. We illustrate the approach on a real-life dataset of ITIS tickets. Our experiments demonstrate that our results, apart from requiring minimal supervision, are better than some other approaches.

The paper is organized as follows. Section 2 reviews the related work. In section 3, we review dimensionality reduction and topic modelling. Section 4 contains our sentence filtering algorithm. In section 5, we demonstrate the effectiveness of our approach with experiments on real world data. We end our paper with conclusions and future prospects for our work in section 6.

2. RELATED WORK

In this section we review related work in Case Based Reasoning (CBR), noisy text analysis and text classification using unsupervised and semi-supervised techniques.

Case Based Reasoning (CBR): CBR is a widely used technique for knowledge management in the help desk domain [7]. Most of these systems have manually created case bases. A successful CBR system requires a high quality case base, which provides rich and efficient solutions for solving real world problems [22]. Creating the case base of a CBR system can be challenging if the problem solving experiences are captured as unstructured or semi-structured text [21, 23]. Our filtering approach can be used to create a high quality case base from noisy and unstructured ITIS tickets.

Noisy text analysis: Subramaniam et al. [19] survey different types of text noise and techniques to handle the noise. They define noise as any kind of difference in the surface form of an electronic text from the intended, correct or original text. Agarwal et al. [1] study the effect of different kinds of noise on automatic text classification. Usually a noisy text contains spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing letter case information, pause-filling words etc. [1]. Apart from the noise as defined above, we assume that text which is meaningless, useless, unwanted, off-topic and irrelevant with respect to a specific goal is also noise.

Noisy text analysis in the help desk domain: Mishne et al. [15] introduced a method of estimating the domain-specific importance of conversation fragments, based on divergence of corpus statistics. They computed the significance level of a text fragment by combining the importance of each word in the fragment. The importance of a word is computed by comparing the frequency of the word in the domain-specific corpus to its frequency in an open-domain corpus, using a statistical test for distribution difference such as the χ² or log-likelihood measures. In this approach, the importance of a word depends on the open-domain corpus used for analysis. Takeuchi et al. [20] proposed a χ² statistic based technique to find effective expressions from text segments. This approach needs a labeled dataset of effective and non-effective expressions.

Unsupervised text classification using labeled and unlabeled documents: Blum et al. [3] proposed a co-training method to build a text classifier from labeled and unlabeled documents. Nigam et al. [16] defined an algorithm for text classification from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. However, our approach does not require any labeled dataset.

Unsupervised text classification using keywords: Ko et al. [12] proposed a text classification method based on unlabeled documents and the title word of each category. In the first step of this method, documents are labeled using a bootstrapping technique, and in the second step, a feature projection technique is used to build a robust text classifier. McCallum et al. [14] described an approach to build a classifier by labeling keywords on an unlabeled dataset. Initially a small set of keywords from each class is used to assign approximate labels to unlabeled documents by term matching. These preliminary labels become the starting point for a bootstrapping process that learns a naive Bayes classifier using EM and hierarchical shrinkage. Qiu et al. [17] proposed a three step approach for unsupervised text classification. In the first step, a set of keywords

is used to retrieve a set of positive documents from the unlabeled documents. In the second step, these positive documents are used to extract more positive documents from the unlabeled documents. In the third step, these positive and unlabeled documents are used to build a text classifier. This approach requires WordNet and a search engine. As WordNet has limited coverage, it might not be suitable for domains like help desk. The approach defined in Qiu et al. [18] is similar to [17]; it differs in the first step, where a set of keywords is used to retrieve a set of related pages from Wikipedia. These Wikipedia pages are used in the second step. This approach requires access to Wikipedia, and Wikipedia may not be suitable for certain domains like help desk.

In the keyword based approaches discussed above, the key problem is to find the right sets of keywords for each class. One solution to this problem is to ask a user to provide a list of keywords. But it is difficult for the user to analyze the dataset and provide a complete list of keywords. Liu et al. [13] proposed an approach which uses hard clustering and feature selection techniques to assist the user to select keywords, and then builds a text classifier using EM and the naive Bayes algorithm.

We observed that our dataset contains words with polysemy (the same word with different meanings). Also, the importance of a word depends on its context. So while creating the list of keywords it is necessary to consider the context of each word. In our approach, the topics learnt after the construction of an LDA based topic model perform soft clustering on the documents and provide their compact and interpretable representation. The topics capture contextual information of words, so it is easy for the user to find the most probable words (keywords) along with their disambiguation. The results of LDA are presented in the form of probabilities, and thus can be directly incorporated into other probabilistic models and analyzed with standard statistical techniques [4].

3. DIMENSIONALITY REDUCTION AND TOPIC MODELLING

In this section we review dimensionality reduction and topic modelling. The bag-of-words (BOW) representation of documents is widely used in text analysis. It has a very high dimensionality. Also it does not capture polysemy and synonymy. A dimensionality reduction technique identifies a lower dimensional representation of the original documents while preserving important properties of the documents, such as similarity between documents. There are two important techniques of dimensionality reduction for documents: 1. Latent semantic indexing (LSI) [5] 2. Probabilistic topic models ([2], [10]).

3.1 Latent semantic indexing (LSI)

LSI is based on the Vector Space Model (VSM). VSM is an algebraic model based on the term-document matrix. LSI uses singular value decomposition (SVD) of the term-document matrix to project both the documents and terms into a lower dimensional space of concepts. The concepts derived from LSI are difficult to interpret; they are expressed in the form of numbers, so they cannot be mapped into natural language concepts. Also, these concepts are unable to handle polysemy. LSI assumes that term counts follow a Gaussian distribution. This assumption is hard to justify. So, LSI is ad-hoc [11].

3.2 Probabilistic topic models

There are two important probabilistic topic models: 1. Probabilistic latent semantic indexing (pLSI) [10] 2. Latent Dirichlet Allocation (LDA) [2]. Probabilistic latent semantic indexing (pLSI) [10] is a probabilistic formulation of latent semantic indexing. Training a pLSI model on a document collection requires a large number of parameters and it does not allow modelling of unseen documents. Latent Dirichlet Allocation (LDA) [2] is an unsupervised generative probabilistic model for collections of discrete data such as text documents. LDA has better modelling flexibility than pLSI. As compared to pLSI, LDA requires fewer parameters to be learned. Also, it can be used to infer topics for unseen documents. Crain et al. [4] provide a good survey of dimensionality reduction and topic modelling techniques.

3.3 Latent Dirichlet Allocation (LDA)

[Figure 1: Diagram of the LDA graphical model]

Figure 1 shows the graphical model of LDA. The generative process of LDA in Figure 1 for a document d can be interpreted as follows (refer to Table 2 for a description of the notations used in this paper):

1. For each topic t = 1, ..., T:
   (a) Select the word probabilities for topic t: $\phi_t \sim \mathrm{Dirichlet}(\beta)$
2. For each document d in D:
   (a) Select the topic distribution for document d: $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
   (b) For each word w at position n in d:
      i. Select the topic: $z_d^n \sim \mathrm{Multinomial}(\theta_d)$
      ii. Select the word: $w_d^n \sim \mathrm{Multinomial}(\phi_{z_d^n})$
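To make the generative story above concrete, the following Python sketch (an illustration added here, not code from the original system) samples a tiny corpus from the LDA model using numpy; the vocabulary size, number of topics and prior values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

V, T = 6, 2                      # toy vocabulary size and number of topics
alpha = np.full(T, 0.5)          # topic Dirichlet prior (per document)
beta = np.full(V, 0.01)          # word Dirichlet prior (per topic)

# Step 1: for each topic t, select word probabilities phi_t ~ Dirichlet(beta)
phi = rng.dirichlet(beta, size=T)            # shape (T, V)

def generate_document(n_words):
    # Step 2(a): select the topic distribution theta_d ~ Dirichlet(alpha)
    theta_d = rng.dirichlet(alpha)
    words, topics = [], []
    for _ in range(n_words):
        # Step 2(b)i: select the topic z ~ Multinomial(theta_d)
        z = rng.choice(T, p=theta_d)
        # Step 2(b)ii: select the word w ~ Multinomial(phi_z)
        w = rng.choice(V, p=phi[z])
        topics.append(z)
        words.append(w)
    return words, topics

for d in range(3):
    print(generate_document(n_words=8))
```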

Training an LDA model involves estimation of the word-topic distributions $\Phi$ and the topic distributions $\theta_d$ for each document d in the training corpus. The empirical likelihood of the training documents under a given LDA model is computed as:

$$\mathcal{L} = \prod_{d=1}^{N_D} \prod_{n=1}^{N_d} p(w_d^n \mid z_d^n, \Phi)\, p(z_d^n \mid \theta_d)\, p(\theta_d \mid \alpha)\, p(\Phi \mid \beta) \quad (1)$$

Direct and exact estimation of parameters that maximizes the likelihood is intractable. In this paper we use Collapsed Gibbs sampling for approximate parameter estimation of LDA. This parameter estimation technique is well explained in [8] and [9].

$D = \{d\}$: the set of all documents in the corpus
$N_D$: the number of documents in the corpus
$N_d$: the number of words in document d
$W = \{w\}$: the set of all unique words in the corpus
$V$: the number of all unique words in the corpus
$T$: the number of topics, specified as a parameter
$Z = \{z\}$: the set of all the topics
$\alpha$: the parameters of the topic Dirichlet prior
$\beta$: the parameters of the word Dirichlet prior
$w_d^n$: the word at position n in the document d
$z_d^n$: the topic assigned to the word at position n in the document d
$z_d^{-n}$: the topics assigned to all words in the document d except the word at position n
$n_{w,t}$: the count of word w assigned to topic t
$n_{t,d}$: the count of the topic t assigned to some word w in document d
$\phi_t$: the word probabilities for topic t
$\phi_{w,t}$: the probability of the word w assigned to the topic t
$\theta_d$: the topic probability distribution for document d
$\theta_{t,d}$: the probability of the topic t assigned to the document d
$\Phi = \{\phi_{w,t}\}$: the word-topic distributions for all words in W over all topics Z
$\Theta = \{\theta_{t,d}\}$: the topic-document distributions for all topics in Z over all documents in D

Table 2: Key notations

Gibbs sampling is a form of Markov Chain Monte Carlo (MCMC) technique. In Gibbs sampling, initially random values are assigned to each variable, and then, while sampling, each variable is conditioned on the values of the other variables. The sampling process is executed until the sampled values approximate the target distribution. In Collapsed Gibbs sampling, certain variables are marginalized out of the model. Griffiths and Steyvers [8] propose this sampling method for LDA, in which both $\Phi$ and $\Theta$ are marginalized. In this method, the probability $p(z_d^n = t)$, i.e. of the topic t being assigned to the word $w \in W$ at position n in the document d, is conditioned on all other topic assignments to other words in W and on $\alpha$, $\beta$:

$$p(z_d^n = t \mid z_d^{-n}, w, W) = \frac{n_{w,t} + \beta_w - 1}{\sum_{v \in W} (n_{v,t} + \beta_v) - 1} \cdot \frac{n_{t,d} + \alpha_t - 1}{\sum_{i=1}^{T} (n_{i,d} + \alpha_i) - 1} \quad (2)$$

where

$$z_d^{-n} = \{z_d^i\}_{i=1,\, i \neq n}^{N_d} \quad (3)$$

After performing Collapsed Gibbs sampling for LDA, the word-topic distributions and topic-document distributions are computed as:

$$\phi_{w,t} = \frac{n_{w,t} + \beta_w}{\sum_{v \in W} (n_{v,t} + \beta_v)} \quad (4)$$

$$\theta_{t,d} = \frac{n_{t,d} + \alpha_t}{\sum_{i=1}^{T} (n_{i,d} + \alpha_i)} \quad (5)$$
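The update in equation (2) can be implemented directly from the count matrices of Table 2. The following is a minimal, unoptimized Python sketch of one collapsed Gibbs sweep (an illustration added here, assuming symmetric scalar priors); since the document-length denominator in equation (2) is constant in t, it disappears after normalization.

```python
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_td, n_t, alpha, beta, rng):
    """One collapsed Gibbs sweep over every word position.

    docs: list of documents, each a list of word ids.
    z:    matching list of current topic assignments.
    n_wt[w, t]: count of word w assigned to topic t (Table 2).
    n_td[t, d]: count of topic t in document d.
    n_t[t]:     total number of words assigned to topic t.
    """
    V, T = n_wt.shape
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            t_old = z[d][n]
            # Exclude the current assignment from the counts
            # (this plays the role of the "-1" terms in equation (2)).
            n_wt[w, t_old] -= 1
            n_td[t_old, d] -= 1
            n_t[t_old] -= 1
            # Conditional distribution over topics, as in equation (2);
            # the term that is constant in t drops out when we normalize.
            p = (n_wt[w, :] + beta) / (n_t + V * beta) * (n_td[:, d] + alpha)
            p /= p.sum()
            t_new = rng.choice(T, p=p)
            # Add the new assignment back into the counts.
            n_wt[w, t_new] += 1
            n_td[t_new, d] += 1
            n_t[t_new] += 1
            z[d][n] = t_new
```

After enough sweeps, equations (4) and (5) read $\Phi$ and $\Theta$ off the same count matrices.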

3.4 The Dirichlet distribution

The Dirichlet distribution is defined as:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{t=1}^{T} \alpha_t\right)}{\prod_{t=1}^{T} \Gamma(\alpha_t)} \prod_{t=1}^{T} \theta_t^{\alpha_t - 1} \quad (6)$$

where $\theta = \{\theta_1, ..., \theta_t, ..., \theta_T\}$ is a point on the $(T-1)$ simplex (i.e. $0 < \theta_t < 1$ and $\sum_{t=1}^{T} \theta_t = 1$) and $\alpha = (\alpha_1, ..., \alpha_t, ..., \alpha_T)$ is a set of parameters with $\alpha_t > 0$. So,

$$\theta = (\theta_1, ..., \theta_t, ..., \theta_T) \sim \mathrm{Dirichlet}(\alpha_1, ..., \alpha_t, ..., \alpha_T) \quad (7)$$

Aggregation property of the Dirichlet distribution: The Dirichlet distribution has a useful fractal-like property called the aggregation property: the aggregation of any subset of Dirichlet distribution variables yields a Dirichlet distribution, with corresponding aggregation of the parameters. If $\{A_1, A_2, ..., A_r\}$ is a partition of $\{1, 2, ..., T\}$, then

$$\theta' = \left(\sum_{t \in A_1} \theta_t, \sum_{t \in A_2} \theta_t, ..., \sum_{t \in A_r} \theta_t\right) \sim \mathrm{Dirichlet}\left(\sum_{t \in A_1} \alpha_t, \sum_{t \in A_2} \alpha_t, ..., \sum_{t \in A_r} \alpha_t\right) \quad (8)$$

As an example, let $\mathrm{Dirichlet}(\alpha_1, \alpha_2, ..., \alpha_6)$ be a Dirichlet distribution over a six-sided die. To compute the probability of rolling an odd number versus the probability of rolling an even number, aggregate $(\alpha_1 + \alpha_3 + \alpha_5)$ into $\alpha_{odd}$ and $(\alpha_2 + \alpha_4 + \alpha_6)$ into $\alpha_{even}$ using the aggregation property. This aggregation yields $\mathrm{Dirichlet}(\alpha_{odd}, \alpha_{even})$ over the two-event sample space of odd versus even. The aggregation property of the Dirichlet distribution is explained and proved in detail in [6].
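The dice example can be checked empirically. The following Python sketch (illustrative; the parameter values are assumptions) samples from a six-dimensional Dirichlet with numpy and verifies that the aggregated odd mass behaves like one component of a two-dimensional Dirichlet:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # Dirichlet over a six-sided die

# Sample die-face probabilities and aggregate into odd vs. even faces
samples = rng.dirichlet(alpha, size=100_000)        # shape (100000, 6)
odd = samples[:, [0, 2, 4]].sum(axis=1)             # faces 1, 3, 5

# By equation (8), odd ~ the first component of Dirichlet(alpha_odd, alpha_even),
# i.e. Dirichlet(9, 12) on the two-event simplex.
a_odd = alpha[[0, 2, 4]].sum()
a_even = alpha[[1, 3, 5]].sum()
print("empirical mean:  ", odd.mean())              # ~ 9 / 21 = 0.4286
print("theoretical mean:", a_odd / (a_odd + a_even))
```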

4. SENTENCE FILTERING

In this section, we describe our sentence filtering algorithm. Initially, a corpus D of all the sentences from all the solutions of a specific problem is created. We consider each sentence as a separate document. Using Collapsed Gibbs sampling for LDA, T topics are learnt such that $Z = \{z_1, z_2, ..., z_T\}$. Now an expert assigns a label to each topic $z_i \in Z$ as either useless or useful. Create $Z = \{Z^1, Z^2\}$ as a partition of Z such that $Z^1 = \{z_i \mid z_i \in Z \text{ and } z_i \text{ is useful}\}$ and $Z^2 = \{z_i \mid z_i \in Z \text{ and } z_i \text{ is useless}\}$. If, for a document d in the corpus D, $\theta_d = (\theta_{1,d}, \theta_{2,d}, ..., \theta_{T,d}) \sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, ..., \alpha_T)$, then using the aggregation property of the Dirichlet distribution, define $\theta'_d$ as:

$$\theta'_d = \left(\sum_{z_t \in Z^1} \theta_{t,d}, \sum_{z_t \in Z^2} \theta_{t,d}\right) \sim \mathrm{Dirichlet}\left(\sum_{z_t \in Z^1} \alpha_t, \sum_{z_t \in Z^2} \alpha_t\right) \quad (9)$$

Initialize $\Phi'$ and $\Theta'$ using the following equations:

$$n_{w,Z^1} = \sum_{t \in Z^1} [n_{w,t} + \beta_w] \qquad n_{w,Z^2} = \sum_{t \in Z^2} [n_{w,t} + \beta_w] \quad (10)$$

$$\phi'_{w,Z^1} = \frac{n_{w,Z^1}}{\sum_{v \in W} n_{v,Z^1}} \qquad \phi'_{w,Z^2} = \frac{n_{w,Z^2}}{\sum_{v \in W} n_{v,Z^2}} \quad (11)$$

$$n_{Z^1,d} = \sum_{t \in Z^1} [n_{t,d} + \alpha_t] \qquad n_{Z^2,d} = \sum_{t \in Z^2} [n_{t,d} + \alpha_t] \quad (12)$$

$$\theta'_{Z^1,d} = \frac{n_{Z^1,d}}{n_{Z^1,d} + n_{Z^2,d}} \qquad \theta'_{Z^2,d} = \frac{n_{Z^2,d}}{n_{Z^1,d} + n_{Z^2,d}} \quad (13)$$

Using Collapsed Gibbs sampling for LDA, update $\Phi'$ and $\Theta'$. A sentence $d \in D$ is useful if $\theta'_{Z^1,d} > \theta'_{Z^2,d}$; otherwise it is useless. Remove all useless sentences from D; D will then contain only useful sentences. The procedure is described in Algorithm 1.

Algorithm 1: Filtering useless sentences
input:  D = {d}: a set of sentences
output: useful sentences in D
1  begin
2      Learn T topics on D such that Z = {z_1, z_2, ..., z_T};
3      Compute Φ and Θ using equations (4) and (5);
4      Create a partition Z = {Z^1, Z^2} such that Z^1 = {z_i | z_i ∈ Z and z_i is useful} and Z^2 = {z_i | z_i ∈ Z and z_i is useless};
5      Initialize Φ' and Θ' using equations (11) and (13) respectively;
6      Update Φ' and Θ' using Collapsed Gibbs sampling;
7      for d ∈ D do
8          Infer θ'_{Z^1,d} and θ'_{Z^2,d} using equation (13);
9          d_label = useful if θ'_{Z^1,d} > θ'_{Z^2,d}, otherwise useless;   (14)
10     end
11     Remove every d ∈ D such that d_label = useless;
12 end
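As an illustration of the decision rule, the following Python sketch applies equations (12)-(14) directly to the aggregated counts of a trained model. It is a sketch added here, not the paper's implementation: the function and argument names are our assumptions, and it omits the extra Gibbs update of Φ' and Θ' in line 6 of Algorithm 1.

```python
import numpy as np

def filter_sentences(n_td, alpha, useful_topics):
    """Return indices of sentences labeled useful by Algorithm 1's rule.

    n_td[t, d]: topic-sentence counts from a trained LDA model (Table 2).
    alpha: numpy array of topic Dirichlet parameters, one per topic.
    useful_topics: set of topic ids the expert labeled useful (Z^1).
    """
    T, n_docs = n_td.shape
    Z1 = sorted(useful_topics)
    Z2 = sorted(set(range(T)) - set(useful_topics))

    useful = []
    for d in range(n_docs):
        # Equation (12): aggregate topic counts (plus priors) per partition
        n_z1 = (n_td[Z1, d] + alpha[Z1]).sum()
        n_z2 = (n_td[Z2, d] + alpha[Z2]).sum()
        # Equation (13): aggregated topic probabilities for sentence d
        theta_z1 = n_z1 / (n_z1 + n_z2)
        theta_z2 = n_z2 / (n_z1 + n_z2)
        # Equation (14): keep the sentence if the useful mass dominates
        if theta_z1 > theta_z2:
            useful.append(d)
    return useful
```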

5. EXPERIMENTAL EVALUATION

In this section, we evaluate our proposed approach. We compare our approach with the following approaches: 1. sentence filtering using labeled and unlabeled sentences with EM (EM LU) [16], 2. sentence filtering using keywords and EM (EM keywords) [14] and 3. clustering based sentence filtering.

5.1 Dataset

The dataset used for our experiments was collected from the TCS Global Help Desk (GHD). Initially, we collected all solutions of Lotus Notes related problems from GHD. These solutions were then segmented into sentences. Using these segmented sentences, we created a corpus of 996 randomly selected sentences. We then manually labeled these sentences as either useful or useless. We represent our dataset as $C = \{c_1, c_2\}$, where $c_1$ is the set of all useful sentences and $c_2$ is the set of all useless sentences.

5.2 Evaluation measure

We evaluated the effectiveness of our approach by computing the macro F-measure (Macro F1). Though accuracy is a widely used measure to evaluate a classifier, it is not suitable whenever there is class imbalance. The F1 of class $c_i \in C$ is the harmonic mean of the precision and recall values of class $c_i$. The Macro F1 measure is the average of the F1 of each class; it gives the overall effectiveness of a classifier. It is defined as:

$$\text{Macro } F_1 = \frac{1}{|C|} \sum_{c_i \in C} \frac{2 \cdot \mathrm{Precision}(c_i) \cdot \mathrm{Recall}(c_i)}{\mathrm{Precision}(c_i) + \mathrm{Recall}(c_i)} \quad (15)$$

where

$$\mathrm{Precision}(c_i) = \frac{\text{number of correct classifications into class } c_i}{\text{number of classifications into class } c_i} \quad (16)$$

$$\mathrm{Recall}(c_i) = \frac{\text{number of correct classifications into class } c_i}{\text{number of sentences in class } c_i} \quad (17)$$
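Equations (15)-(17) translate directly into code. Below is a small self-contained sketch (the label vectors are toy examples added for illustration); a standard implementation such as scikit-learn's f1_score with average='macro' should give the same result.

```python
def macro_f1(gold, predicted, classes=("useful", "useless")):
    """Macro F1 per equations (15)-(17)."""
    f1s = []
    for c in classes:
        correct = sum(1 for g, p in zip(gold, predicted) if g == p == c)
        predicted_c = sum(1 for p in predicted if p == c)   # eq. (16) denominator
        gold_c = sum(1 for g in gold if g == c)             # eq. (17) denominator
        precision = correct / predicted_c if predicted_c else 0.0
        recall = correct / gold_c if gold_c else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)                              # eq. (15)

gold = ["useful", "useful", "useless", "useless", "useful"]
pred = ["useful", "useless", "useless", "useless", "useful"]
print(macro_f1(gold, pred))   # 0.8
```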

5.3 Experiment Settings

We preprocessed the corpus as follows. We removed stopwords and non-ASCII characters from the corpus. Using regular expressions, we replaced each URL with the word URL and each email-id with the word EMAIL ID.
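A sketch of this preprocessing in Python follows; the exact regular expressions and stopword list are not given above, so the patterns below (and the EMAIL_ID token spelling) are illustrative assumptions.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}  # illustrative subset

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def preprocess(sentence):
    # Replace each URL and each email-id with placeholder tokens
    sentence = URL_RE.sub("URL", sentence)
    sentence = EMAIL_RE.sub("EMAIL_ID", sentence)
    # Remove non-ASCII characters
    sentence = sentence.encode("ascii", errors="ignore").decode("ascii")
    # Remove stopwords
    tokens = [t for t in sentence.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Mail john.doe@tcs.com or see http://mallet.cs.umass.edu/ for the notes"))
```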

5.3.1 LDA based sentence filtering (LDA)

We used Mallet (http://mallet.cs.umass.edu/) to run LDA on the corpus. Mallet uses Gibbs sampling to estimate the parameters of LDA. We experimented with different values of the number of topics (T), ranging from 20 to 40. The Dirichlet parameter β was chosen to be 0.01 and α was 50/T. We performed 1000 iterations of Gibbs sampling. We also decided to find the inter-expert agreement on the labeling of topics and its relation with Macro F1. For each T, we asked two experts to label each topic as either useful or useless. Then we used the Kappa coefficient to find the inter-expert agreement.
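The Kappa computation itself is a one-liner with a standard implementation. For instance, assuming scikit-learn's cohen_kappa_score (the implementation actually used is not stated above) and hypothetical expert labels for T = 5 topics:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two experts for T = 5 topics (1 = useful, 0 = useless)
expert_a = [1, 1, 0, 1, 0]
expert_b = [1, 0, 0, 1, 0]

print(cohen_kappa_score(expert_a, expert_b))
```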

5.3.2 Sentence filtering using labeled and unlabeled sentences with EM (EM LU)

In this experiment, we followed the approach proposed in [16]. We experimented with various sizes of the labeled dataset: we used 10% to 50% of the corpus as the labeled dataset and the rest of the corpus as the unlabeled dataset.

Algorithm           | c1 Precision | c1 Recall | c1 F1 | c2 Precision | c2 Recall | c2 F1 | Macro F1
LDA (T = 32)        |     0.90     |   0.87    | 0.892 |     0.74     |   0.80    | 0.77  |  0.831
Clustering (N = 28) |     0.88     |   0.86    | 0.87  |     0.72     |   0.75    | 0.74  |  0.805
EM LU [16]          |     0.85     |   0.86    | 0.855 |     0.70     |   0.68    | 0.69  |  0.7725
EM keywords [14]    |     0.85     |   0.88    | 0.865 |     0.72     |   0.66    | 0.69  |  0.7775

Table 3: Experimental results of sentence filtering

5.3.3 Sentence filtering using keywords and EM (EM keywords)

In this experiment, we followed the approach proposed in [14]. We manually created a set of keywords for both the useful and useless classes and performed the experiment.

5.3.4 Clustering based sentence filtering (Clustering)

In this approach, we compare the performance of the LDA based approach with text clustering. We cluster the sentences into N clusters, find the representative words of each cluster, and using these representative words, an expert labels each cluster as useful or useless. We label a sentence useful if it belongs to a useful cluster; otherwise we label it useless. For this experiment we use CLUTO (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview). We run this experiment for different values of N ranging from 20 to 40.

[Figure 2: Macro-F1 and Kappa coefficient variation against the number of topics/clusters]

5.4 Experimental Results

Figure 2 plots the variation of the Kappa coefficient for LDA, and the Macro F1 of the LDA and clustering based approaches, against the number of topics/clusters. In the LDA based approach, we observed the highest Macro F1 and the highest Kappa coefficient at T = 32; we are analyzing the variation of the Kappa coefficient against the number of topics. In the clustering based approach, we observed the highest Macro F1 at N = 28. Table 4 shows the topics obtained on our dataset with T = 32 and Table 5 shows examples of a few correctly classified sentences.

Table 3 gives the class-wise performance and Macro F1 of the four approaches described above. We can see that LDA based sentence filtering performs better than the other three approaches. We can also observe from the graph that LDA based sentence filtering always performs better than clustering based sentence filtering for different values of the number of topics or clusters. We observed that the performance of sentence filtering using labeled and unlabeled sentences with EM depends on the labeled dataset and its size.

In the keyword based approach, we have a set of keywords for each class. We observed that there are some sentences where the context of a keyword of one class contains keywords from the other class. We can say that such sentences are at the boundary of the two classes. Due to the mixture of keywords from different classes, these sentences are likely to get misclassified. We observed that in the context of Lotus Notes related problems, the word "notes" is important. In the keyword based approach we can say that this word belongs to the class of useful words. In Table 4, we can observe that the word "notes" is present in useless topics 14 and 19, as well as in many useful topics like 9, 11 and 18. So in unsupervised settings, it is incorrect to assign a single class to the word "notes". But while providing a list of keywords, the context of a word is not considered, so there are chances of misclassification. In the LDA based approach, the most prominent words in a topic frequently co-occur with each other in the dataset. Also, LDA is a generative model, so with the help of topics it is easy for the user to understand the complete dataset, which is not feasible in the other approaches described above.


Topic ID | Most prominent words in the topic
0   email, ultimatix, helpdesk, selection, global, address, hr, tcs, employee
1   steps, mentioned, follow, error, showing, unable, user, attachment, find
2   files, id, hidden, search, folders, pc, inside, kindly, discussed
3*  tcs, call, free, toll, indy, voip, usa, ghd, queries
4   password, temporary, enter, sametime, webmail, login, url, set, accessing
5   file, id, password, hours, hour, note, work, log, wait
6   mail, server, web, tcs, copy, open, mails, network, forwarding
7*  ticket, provide, close, request, feedback, valuable, service, solution, improve
8*  contact, local, idm, call, issue, case, project, incase, person
9   id, file, password, login, notes, rename, replace, copy, folder
10  access, hrs, database, emp, user, replication, status, mails, message
11  lotus, data, settings, notes, number, employee, local, application, folder
12  id, file, ticket, attached, mention, problem, earlier, requested, certified
13  id, email, days, request, file, business, wait, kindly, raise
14* lotus, notes, configuration, document, configure, url, follow, attached, step
15* ticket, raise, request, user, note, unable, pl, reopen, behalf
16  location, hr, server, update, updated, mail, ultimatix, check, current
17  file, id, attached, password, pls, find, configure, recertified, hr
18  ticket, raise, configuration, lotus, notes, idle, expired, case, certificate
19* issues, call, find, notes, configuration, incase, lotus, raise, helpdesk
20  dear, problem, kindly, changed, group, support, services, chat, mailbox
21* ticket, raise, issue, resolved, webmail, user, tickets, single, log
22* user, ticket, called, resolving, informed, phone, issue, reach, gave
23  file, id, attached, download, user, ticket, switch, save, desktop
24  password, change, follow, step, instructions, mentioned, im, carefully, link
25  password, reset, id, note, time, file, ist, pm, normal
26  mail, id, file, emp, nsf, server, access, change, password
27  notes, lotus, configured, user, connected, users, permission, system, archive
28  login, lotus, details, notes, open, client, sametime, exit, logon
29  password, lotus, notes, fine, default, working, user, account, update
30  password, user, change, file, reset, security, policy, enter, click
31  user, email, send, username, switched, format, checked, chn, ramesh

Table 4: Most prominent words of topics (* indicates a useless topic)

1. "Incase you find any issues or queries with respect to above information; please call us at USA TOLL FREE : 1-877-TCS-INDY: PSTN: 080-6060 5555 OR VOIP: 500-5555" (class: useless; closest topic: 3; probability: 0.90)
2. "Also requesting you to provide your valuable feedback to your request :.so that we can improve our service and please close the ticket after satisfactory solution" (class: useless; closest topic: 7; probability: 0.87)
3. "Contacted user over the phone:As informed by user issue is already fixed by IDM and hence resolving the ticket" (class: useless; closest topic: 22; probability: 0.78)
4. "Please access your mail box on server InMumM03//mail//mail3//EMPID.nsf and change the password immd: once Lotus Notes client is configured ." (class: useful; closest topic: 26; probability: 0.70)
5. "Click on HR Management - Employee Self Service - Email selection and select your email address ." (class: useful; closest topic: 0; probability: 0.86)
6. "b) Enter the temporary password:7NxUwX1Bcu6T(Please ensure to copy password correctly) ." (class: useful; closest topic: 4; probability: 0.66)

Table 5: Examples of a few correctly classified sentences by the LDA based sentence filtering algorithm

6. CONCLUSION

In this paper, we suggested modifications to the definition of text noise and presented a novel, inexpensive and almost unsupervised approach to filter out text noise. We summarize our approach as follows: filtering text noise is a text classification problem. As the size of the dataset is huge, it is very expensive to label the dataset and to use a supervised classifier. We use LDA and the aggregation property of the Dirichlet distribution for filtering text noise. We have also demonstrated the effectiveness of our approach by experiments on a real-life dataset of ITIS tickets containing highly noisy problem descriptions and solutions.

In the future, we would like to focus on automatic labeling of topics as useless or useful with respect to a user specified goal. We would like to experiment with hierarchical LDA to capture hierarchical information present in the dataset. We would like to use the filtered sentences in 1. automatic construction of solution manuals, 2. creating a high quality case base for a CBR based help desk system, and 3. skill profiling of help desk resolvers for effective workforce management in the help desk. We would like to generalize Algorithm 1 to find the number of skills and the ticket-to-skill mapping in the help desk. We would also like to use our content filtering approach to filter various aspects of financial reports and news articles.

7. ACKNOWLEDGMENTS

The authors would like to thank K.V.S. Dileep of AIDB Lab at IIT Madras, Darshan Shirodkar and Sangameshwar Patil of SRL at TRDDC for their help and support.

8. REFERENCES

[1] S. Agarwal, S. Godbole, D. Punjani, and S. Roy. How much noise is too much: A study in automatic text classification. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, pages 3–12, October 2007.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
[4] S. P. Crain, K. Zhou, S.-H. Yang, and H. Zha. Dimensionality reduction and topic modeling: From latent semantic indexing to latent dirichlet allocation and beyond. In C. C. Aggarwal and C. Zhai, editors, Mining Text Data, pages 129–161. Springer US, 2012.
[5] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
[6] B. A. Frigyik, A. Kapila, and M. R. Gupta. Introduction to the Dirichlet distribution and related processes. Technical report, University of Washington, Seattle, 2012. https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2010-0006.pdf
[7] M. H. Göker and T. Roth-Berghofer. Development and utilization of a case-based help-desk support system in a corporate environment. In Proceedings of the Third International Conference on Case-Based Reasoning and Development, pages 132–146, 1999.
[8] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, April 2004.
[9] G. Heinrich. Parameter estimation for text analysis. Technical report, University of Leipzig, Germany, 2004. http://www.arbylon.net/publications/text-est.pdf
[10] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57, 1999.
[11] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, January 2001.
[12] Y. Ko and J. Seo. Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing and Management, 45(1):70–83, January 2009.

[13] B. Liu, X. Li, W. S. Lee, and P. S. Yu. Text classification by labeling words. In Proceedings of the 19th National Conference on Artificial Intelligence, pages 425–430, 2004.
[14] A. McCallum and K. Nigam. Text classification by bootstrapping with keywords, EM and shrinkage. In ACL-99 Workshop for Unsupervised Learning in Natural Language Processing, pages 52–58, 1999.
[15] G. Mishne, D. Carmel, R. Hoory, A. Roytman, and A. Soffer. Automatic analysis of call-center conversations. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 453–459, October 2005.
[16] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, Special issue on information retrieval, 39(2-3), May-June 2000.
[17] Q. Qiu, Y. Zhang, and J. Zhu. Building a text classifier by a keyword and unlabeled documents. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 564–571, 2009.
[18] Q. Qiu, Y. Zhang, J. Zhu, and W. Qu. Building a text classifier by a keyword and Wikipedia knowledge. In Proceedings of the 5th International Conference on Advanced Data Mining and Applications, pages 277–287, 2009.
[19] L. V. Subramaniam, S. Roy, T. A. Faruquie, and S. Negi. A survey of types of text noise and techniques to handle noisy text. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, pages 115–122, 2009.
[20] H. Takeuchi, L. V. Subramaniam, T. Nasukawa, and S. Roy. Automatic identification of important segments and expressions for mining of business-oriented conversations at contact centers. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 458–467, 2007.
[21] R. O. Weber, K. D. Ashley, and S. Brüninghaus. Textual case-based reasoning. Knowl. Eng. Rev., 20:255–260, September 2005.
[22] C. Yang, B. Farley, and B. Orchard. Automated case creation and management for diagnostic CBR systems. Applied Intelligence, 28:17–28, February 2008.
[23] M. Zaluski, N. Japkowicz, and S. Matwin. Case authoring from text and historical experiences. In Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence, pages 222–236, 2003.
