
Personalizing Web Directories with the Aid of Web Usage Data: Literature Survey

Computational Intelligence Models for Personalization

Computational Intelligence (CI) has been defined as the study of adaptive mechanisms that enable or facilitate intelligent behavior in complex and changing environments. It is an ongoing and evolving area of research whose roots go back to Artificial Intelligence, a term coined by John McCarthy in 1956. The different CI models related to personalization are given in Figure 1.

Fuzzy Systems (FS) and Fuzzy Logic (FL) mimic the way people think, that is, with approximate reasoning rather than precise values. Fuzzy methods were found to be instrumental in web-based personalization when used with WUM data. User profiles are processed using fuzzy approximate reasoning to recommend personalized URLs. Handling of user profiles with fuzzy concepts has been used by IR systems to provide users with personalized search engine results. Based on users' web usage history data, fuzzy methods have been used to categorize or cluster web objects for web personalization. Fuzzy logic was also used with collective or collaborative data mining techniques to improve the quality of intelligent agents that provide personalized services to users. Evolutionary Algorithms (EA) use mechanisms inspired by biological evolution such as reproduction, mutation, recombination and selection. One of the most popular EAs is the Genetic Algorithm (GA), which mimics the human gene structure and draws on evolutionary theory. GA has been used to address some of the flaws of WUM and to tackle different problems such as

personalized search, IR, query optimization and document representation. GA was applied with user log mining techniques to get a better understanding of user preferences and to discover associations between different URL addresses. GA was also used to introduce randomness into content filtering rather than strict adherence to predefined user profiles; this is known as the element of serendipity in IR. A modified GA was introduced for the optimal design of a website based on multiple optimization criteria, taking download time, visualization and product association level into consideration. Artificial Neural Networks (ANN), or simply Neural Networks (NN), mimic the biological processes of the human brain. A NN can be trained to group users into specified categories or into clusters. This is useful in personalization because each user group may possess similar preferences, and hence the content of a web interface can be adapted to each group. NNs can also be trained to learn the behavior of website users. Inputs for this learning can be derived from WUM data and CF techniques. The learning ability of neural networks can also be used for real-time adaptive interaction instead of only common, static content-based personalization. A NN was used to construct user profiles, and a NN was implemented to categorize e-mail into folders. Swarm Intelligence (SI) is based on the collective behavior of animals in nature such as birds, ants, bees and wasps. Particle Swarm Optimization (PSO) models the convergence behavior of a flock of birds. PSO was used for analyzing the individual behavior of web users by manipulating web access log data and user profile data. Personalized recommendation based on individual user preferences or CF data has also been explored using PSO. This was done by building up profiles of users and then using an algorithm to find profiles similar to the current user by supervised learning. Personalized and automatic content sequencing of learning objects was implemented using PSO. Research has also been done using PSO as a clustering algorithm, but no use of this approach to clustering was found in relation to website personalization. Another SI technique is Ant Colony Optimization (ACO), which models the behavior of ants that leave the nest to wander randomly in search of food and, when it is found, leave a trail of pheromone on their way back to the colony. ACO led to the development of shortest-path optimization algorithms and has applications in routing optimization. ACO has been used to classify web users in WUM (the cAnt-WUM algorithm), allowing personalization of the web system to each user class. Bee Colony Optimization (BCO) is built on basic principles of

collective bee intelligence. It has been applied to web-based systems to improve the IR systems of search engines incorporating WUM data; however, the issue of personalization is not yet known to have been directly addressed. Wasp Colony Optimization (WCO), or Wasp Swarm Optimization (WSO), has not yet been exploited to the same extent as the other SI methods. It models the behavior of wasps in nature. WCO has been applied to the NP-hard optimization problem known as the Multiple Recommendations Problem (MRP), which occurs when several personalized recommenders run simultaneously and results in churning, where a user is presented with uninteresting recommendations. Further research has to be done, however, using WCO on real, scalable and dynamic data sets. Artificial Immune Systems (AIS) mimic the functioning of the human immune system, in which the body learns to handle antigens by producing antibodies based on previous experience. Applications of AIS include pattern recognition, classification tasks, data clustering and anomaly detection. AIS has already been applied to the personalization of web-based systems: the human body is represented by a website, incoming web requests are antigens, and learning is paralleled to the immune system learning to produce the right antibodies to combat each antigen. Using this analogy, an AIS based on WUM was used as a learning system for a website. It is common practice to combine CI techniques to create a hybrid which seeks to overcome the weakness of one technique with the strength of another. Several hybrids were applied to the personalization of web-based systems. NN was combined with FL to give a hybrid Neuro-Fuzzy strategy for web personalization: the topology and parameters of the NN were used to obtain the structure and parameters of fuzzy rules, and the learning ability of the NN was then applied to this set of rules. The ability of evolutionary techniques such as GA to extract implicit information from user logs was combined with fuzzy techniques to include vagueness in decision making. This FL-GA hybrid allows more accurate and flexible modeling of user preferences. User data obtained from web usage data is the input for a NN; the weights and fitness functions derived from NN training are optimized using GA to derive classification rules that govern personalized decision making in eBusiness. A fuzzy-PSO approach was introduced to personalize Content-Based Image Retrieval (CBIR): user logs were analyzed and used as the PSO input, and fuzzy principles were applied to the PSO velocity, position and weight parameters.

Personalization of web-based systems using CI models

Across the eight major CI methods described above, WUM is the common input for all models; data mining in a sense provides the fuel for personalization using CI methods. CI methods are comparable to a taxonomy of intelligent agents for personalization. Building on ideas from this approach, a taxonomy for personalization of web-based systems was proposed (cf. Fig. 2). Two main uses are identified for CI methods when applied to personalization: profile generation and profile exploitation. User profiles can further be used to personalize either the navigation or the content of web-based systems.

Profile generation

Profile generation is the creation of user profiles based on both implicit WUM data and explicit user preferences. User profiles can be generated either per individual user or for groups of users which appear to have similar previous web usage habits, using CF techniques. Five CI methods found in previous work that were applied to user profile generation for web-based systems are FL, NN, PSO, ACO and AIS. FL models are constructed to identify ambiguity in user preferences; however, there are many ways of interpreting fuzzy rules, and translating human knowledge into formal controls can be challenging. NN was trained to identify similarities in user behavior; however, for proper training the sample size must be large, and the NN can become complex due to overfitting. Both PSO and GA were used to link users' behavior by profile matching, but PSO was found to outperform GA in terms of speed, execution and accuracy. ACO was used to model users with relative accuracy and simplicity; however, its computational complexity causes long computing times, and the PSO approach was found to be faster than ACO. AIS was used to dynamically adapt profiles to changing and new behaviors. The theoretical concept of AIS is not fully sound, however, since in reality other human systems support the functioning of the immune system and these are not modeled. The artificial cells in AIS do not work autonomously, therefore the success or failure of one part of the system may determine the performance of the following step.

A hybrid method uses GA to optimize the input values of a NN in order to maximize its output. In this way the slow learning process of the NN is helped by the optimization ability of GA.

Profile exploitation

Profile exploitation personalizes various aspects of a web-based system using predefined user profiles. Two main approaches to personalizing web-based systems were identified: personalization of navigation and personalization of content (cf. Fig. 2).

Personalized navigation

Personalized navigation includes WUM for personalized IR, such as search engine results, and URL recommendations. FL, BCO and GA were the three main CI methods found for navigation personalization (cf. Fig. 2). FL was used for offline processing to recommend URLs to users; it is relatively fast, deals with the natural overlap in user interests, and is suitable for real-time recommendations. Various FL tests, however, showed slightly lower precision, and the fuzzy part is harder to program. GA was applied to search and retrieval, but it is known to be more general and abstract than other optimization methods and does not always provide the optimal solution. BCO was used for IR, but it is not a widely covered area of research and currently there is a better theoretical than experimental understanding; ACO is similar to BCO and has seen more successful applications. A hybrid between GA and FL was applied to this area: fuzzy set techniques were used for better document modeling and genetic algorithms for query optimization, to give personalized search engine results. A Neuro-Fuzzy method combined the

learning ability of NN with the representation of vagueness in Fuzzy Systems, to overcome the NN black-box behavior and present more meaningful results than FL alone.

Personalized content

Personalized content refers to WUM for personalizing the web objects on each web page and the sequence of content. FL, NN, GA, PSO and WCO were the main CI techniques found with applications in this area (cf. Fig. 2). FL was used for a web search algorithm and to automate recommendations to e-commerce customers; it was found to be flexible and able to support e-commerce applications. NN was used to group users into clusters for content recommendations; however, the overfitting problem still exists. GA was applied to devise the best arrangement of web objects and was found to be scalable; however, it is suggested that it be used in collaboration with other data mining tools. PSO was used to sequence Learning Objects and was chosen because of its relatively small number of parameters compared with other techniques such as GA; PSO parameter selection is also a well-researched area. Using a modified PSO for data clustering was found to give accurate results. WCO was applied to the churning problem of uninteresting content recommendations to users; this is mostly a theoretical concept, not well tested on real data, and other biologically inspired algorithms such as ACO have found more success. Fuzzy-PSO was created to help improve the effectiveness of standard PSO particle movement in a content-based system.

PROBABILISTIC LATENT SEMANTIC MODELS OF WEB USER NAVIGATIONS

The overall process of Web usage mining consists of three phases: data preparation and transformation, pattern discovery, and pattern analysis. The data preparation phase transforms raw Web log data into transaction data that can be processed by various data mining tasks. In the pattern discovery phase, a variety of data mining techniques, such as clustering, association rule mining, and sequential pattern discovery, can be applied to the transaction data. The discovered patterns may then be analyzed and interpreted for use in applications such as Web personalization. The usage data preprocessing phase [8, 32] results in a set of n page views, P = {p1, p2, . . . , pn}, and a set of m user sessions, U = {u1, u2, . . . , um}. A page view is an aggregate representation of a collection of Web objects (e.g. pages) contributing to the display on a user's browser resulting from a single user action (such as a click-through, product purchase, or database query). The Web session data can be conceptually viewed as an m × n session-page

view matrix UP = [w(ui, pj)]m×n, where w(ui, pj) represents the weight of page view pj in user session ui. The weights can be binary, representing the existence or non-existence of the page view in the session, or they may be a function of the occurrence or duration of the page view in that session. PLSA is a latent variable model which associates a hidden (unobserved) factor variable Z = {z1, z2, ..., zl} with the observations in the co-occurrence data. In our context, each observation corresponds to an access by a user to a Web resource in a particular session, which is represented as an entry of the m × n co-occurrence matrix UP. The probabilistic latent factor model can be described as the following generative model: 1. select a user session ui from U with probability Pr(ui); 2. pick a latent factor zk with probability Pr(zk|ui); 3. generate a page view pj from P with probability Pr(pj|zk). As a result we obtain an observed pair (ui, pj), while the latent factor variable zk is discarded. Translating this process into a joint probability model results in the following:
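In the standard PLSA formulation, this is:

Pr(ui, pj) = Pr(ui) Σk Pr(zk|ui) Pr(pj|zk),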

where we sum over all possible choices of zk from which the observation could have been generated. Using Bayes' rule, it is straightforward to transform the joint probability into:
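the standard symmetric form:

Pr(ui, pj) = Σk Pr(zk) Pr(ui|zk) Pr(pj|zk).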

Now, in order to explain a set of observations (U, P), we need to estimate the parameters Pr(zk), Pr(ui|zk), Pr(pj |zk), while maximizing the following likelihood L(U, P) of the observations,
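which, in terms of the usage weights w(ui, pj), takes the standard form:

L(U, P) = Σi Σj w(ui, pj) log Pr(ui, pj).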

The Expectation-Maximization (EM) algorithm is a well known approach to performing maximum likelihood parameter estimation in latent variable models. It alternates two steps: (1) an expectation (E) step, where posterior probabilities are computed for the latent variables based on the current estimates of the parameters, and (2) a maximization (M) step, where the parameters are re-estimated in order to maximize the expectation of the complete data likelihood. The EM algorithm begins with some initial values of Pr(zk), Pr(ui|zk), and Pr(pj|zk). In the expectation step we compute:
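in the standard formulation, the posterior probability of each latent factor given an observation:

Pr(zk|ui, pj) = Pr(zk) Pr(ui|zk) Pr(pj|zk) / Σk' Pr(zk') Pr(ui|zk') Pr(pj|zk').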

In the maximization step, we aim at maximizing the expectation of the complete data likelihood E(LC),
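which, in the usual PLSA derivation, is written as:

E(LC) = Σi Σj w(ui, pj) Σk Pr(zk|ui, pj) log [ Pr(zk) Pr(ui|zk) Pr(pj|zk) ],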

while taking into account the constraint Σk Pr(zk) = 1 on the factor probabilities, as well as the following constraints on the two conditional probabilities:
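namely the usual normalization constraints:

Σi Pr(ui|zk) = 1 and Σj Pr(pj|zk) = 1, for each factor zk.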

Through the use of Lagrange multipliers, we can solve the constrained maximization problem to get the following equations for the re-estimated parameters:
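In the standard PLSA solution these are:

Pr(zk) = Σi Σj w(ui, pj) Pr(zk|ui, pj) / Σi Σj w(ui, pj),

Pr(ui|zk) = Σj w(ui, pj) Pr(zk|ui, pj) / Σi' Σj w(ui', pj) Pr(zk|ui', pj),

Pr(pj|zk) = Σi w(ui, pj) Pr(zk|ui, pj) / Σi Σj' w(ui, pj') Pr(zk|ui, pj').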

Iterating the above expectation and maximization steps monotonically increases the total likelihood L(U, P) of the observed data until a local optimum is reached. The computational complexity of this algorithm is O(mnl), where m is the number of user sessions, n is the number of page views, and l is the number of factors. Since the usage observation matrix is, in general, very sparse, the memory requirements can be dramatically reduced using an efficient sparse matrix representation of the data.

DISCOVERY AND ANALYSIS OF USAGE PATTERNS WITH PLSA

One of the main advantages of the PLSA model in Web usage mining is that it generates probabilities which quantify the relationships between Web users and tasks, as well as between Web pages and tasks. From these basic probabilities, using probabilistic inference, we can derive relationships among users, among pages, and between users and pages. This framework thus provides a flexible approach to modeling a variety of types of usage patterns. In this section, we describe various usage patterns that can be derived using the PLSA model. As noted before, the PLSA model generates the probabilities Pr(zk), which measure the probability that a certain task is chosen; Pr(ui|zk), the probability of observing a user session given a certain task; and Pr(pj|zk), the probability of a page being visited given a certain task. Applying Bayes' rule to these probabilities, we can compute the probability that a certain task is chosen given an observed user session:
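By Bayes' rule, this is:

Pr(zk|ui) = Pr(zk) Pr(ui|zk) / Σk' Pr(zk') Pr(ui|zk'),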

and the probability that a certain task is chosen given an observed page view:
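which is computed analogously:

Pr(zk|pj) = Pr(zk) Pr(pj|zk) / Σk' Pr(zk') Pr(pj|zk').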

In the following, we discuss how these models can be used to derive different kinds of usage patterns. We will provide several illustrative examples of such patterns, from real Web usage data, in Section 4.

Characterizing Tasks by Page Views or by User Sessions

Capturing the tasks or objectives of Web users can help the analyst to better understand these users' preferences and interests. Our goal is to characterize each task, represented by a latent factor, in a way that is easy to interpret. One possible approach is to find the prototypical pages that are strongly associated with a given task, but that are not commonly identified as part of other tasks. We call each such page a characteristic page for the task, denoted by pch. This definition of prototypical has two consequences: first, given a task, a page which is seldom visited cannot be a good characteristic page for that task; secondly, if a page is frequently visited as part of a certain task, but is also commonly visited in other tasks, the page is not a good characteristic page. So we define the characteristic pages for a task zk as the set of all pages, pch, which satisfy:
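A condition consistent with the Pr(page|task) · Pr(task|page) scores used for the examples in Section 4 (the symbol μ is assumed notation):

Pr(pch|zk) · Pr(zk|pch) ≥ μ,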

where μ is a predefined threshold. By examining the characteristic pages of each task, we can obtain a better understanding of the nature of these tasks. Characterizing tasks in this way can lead to several applications. For example, most Web sites allow users to search for relevant pages using keywords. If we also allow users to explicitly express their intended task(s) (via inputting task descriptions or choosing from a task list), we can return the characteristic pages for the specified task(s), which are likely to lead users directly to their objectives. A similar

approach can be used to identify prototypical user sessions for each task. We believe that a user session involving only one task can be considered a characteristic session for that task. So, we define the characteristic user sessions, uch, for a task zk as sessions which satisfy
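One plausible form of this condition, reflecting the requirement that the session be dominated by a single task (a reconstruction, with μ again denoting a predefined threshold):

Pr(zk|uch) ≥ μ,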

where μ is again a predefined threshold. When a user selects a task, returning such exemplar sessions can provide a guide to the user for accomplishing the task more efficiently. This approach can also be used in the context of collaborative filtering to identify the closest neighbors to a user based on the tasks performed by that user during an active session.

User Segments Identification

Identifying Web user groups or segments is an important problem in Web usage mining. It helps Web site owners to understand and capture users' common interests and preferences. We can identify user segments in which users perform common or similar tasks by making inferences based on the estimated conditional probabilities obtained in the learning phase. For each task zk, we choose all user sessions whose probability Pr(ui|zk) exceeds a certain threshold to obtain a candidate session set C. Since each user session can also be represented as a page view vector, we can further aggregate these user sessions into a single page view vector to facilitate interpretation. The algorithm for generating user segments is as follows:

1. Input: Pr(ui|zk), the user session-page matrix UP, and a threshold μ.

2. For each zk, choose all the sessions with Pr(ui|zk) ≥ μ to get a candidate session set C.

3. For each zk, compute the weighted average of all the chosen sessions in set C to get a page vector defined as:
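A plausible form of this weighted average, using the session-task probabilities as weights (a reconstruction, not necessarily the exact original definition):

v(zk) = Σ_{u in C} Pr(u|zk) · u / Σ_{u in C} Pr(u|zk), where u denotes the page view vector of a session in C.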

4. For each factor zk, output the resulting page vector.

This page vector consists of a set of weights, one for each page view in P, which represent the relative visit frequencies of the page views for this user segment. We can sort the weights so that the top items in the list correspond to the most frequently visited pages for the user segment. These user segments provide an aggregate representation of the navigational activities of all individual users in a particular group. In addition to their usefulness in Web analytics, user segments also provide the basis for automatically generating item recommendations. Given an active user, we compare her activity to all user segments and find the most similar one. Then, we can recommend items (e.g., pages) with relatively high weights in the aggregate representation of that segment. In Section 4, we conduct an experimental evaluation of the user segments generated from two real Web sites.

Identifying the Underlying Tasks of a User Session

To better understand the preferences and interests of a single user, it is necessary to identify the underlying tasks performed by the user. The PLSA model provides a straightforward way to identify the underlying tasks in a given user session. This is done by examining Pr(task|session), the probability of a task being performed given the observation of a certain user session. For a user session u, we select the top tasks zk with the highest Pr(zk|u) values as the primary task(s) performed by this user. For a new user session, unew, not appearing in the historical navigational data, we can adopt a folding-in method to generate Pr(task|session) via the EM algorithm. In the E-step, we compute
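In the usual folding-in procedure, Pr(p|zk) is held fixed and only the mixing proportions of the new session are estimated; the E-step then computes

Pr(zk|unew, p) = Pr(zk|unew) Pr(p|zk) / Σk' Pr(zk'|unew) Pr(p|zk'),

and the M-step re-estimates

Pr(zk|unew) = Σp w(unew, p) Pr(zk|unew, p) / Σp w(unew, p).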

Here, w(unew, p) represents the new user's visit frequency for the specified page p. After we generate these probabilities, we can use the same method to identify the primary tasks for the new user session. The identification of the primary tasks contained in user sessions can lead to further analysis. For example, after identifying the tasks in all user sessions, each session u can be transformed into a higher-level representation,
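which, using the notation defined next, can be written as the weighted task set

u → {(z1, w1), (z2, w2), ..., (zl, wl)},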

where zi denotes task i and wi denotes Pr(zi|u). This, in turn, would allow the discovery and analysis of task-level usage patterns, such as determining which tasks are likely to be performed together, or which tasks are most (or least) popular. Such higher-level patterns can help site owners better evaluate the Web site organization.

Integration of Usage Patterns with Web Content Information

Recent studies have emphasized the benefits of integrating semantic knowledge about the domain (e.g., from page content features, relational structure, or domain ontologies) into the Web usage mining process. The integration of content information about Web objects with usage patterns involving those objects provides two primary advantages. First, the semantic information provides additional clues about the underlying reasons for which a user may or may not be interested in particular items. Secondly, in cases where little or no rating or usage information is available (such as newly added items, or very sparse data sets), the system can still use the semantic information to draw reasonable conclusions about user interests. The PLSA model described here also provides an ideal and uniform framework for integrating content and usage information. Each page view carries certain semantic knowledge represented by the content information associated with that page view. By applying text mining and information retrieval techniques, we can represent each page view as an attribute vector. Attributes may be keywords extracted from the page views, or structured semantic attributes of the Web objects contained in the page views. As before, we assume there exists a set of hidden factors z ∈ Z = {z1, z2, ..., zl}, each of which represents a

semantic group of pages. These can be a group of pages which have similar functionality for users performing a certain task, or a group of pages which contain similar content information or semantic attributes. However, now, in addition to the set of page views, P, and the set of user sessions, U, we also specify a set of t semantic attributes, A = {a1, a2, . . . , at}. To model the user-page and page-attribute observations, we use
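a pair of latent factor models sharing the page-factor component (a plausible reading, given that the models are combined through Pr(pj|zk) below):

Pr(ui, pj) = Σk Pr(zk) Pr(ui|zk) Pr(pj|zk)  and  Pr(pj, aq) = Σk Pr(zk) Pr(pj|zk) Pr(aq|zk).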

These models can then be combined based on the common component Pr(pj|zk). This can be achieved by maximizing the following log-likelihood function with a predefined weight,
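written here with the relative weight denoted α and the page-attribute weights denoted w(pj, aq), both assumed notation:

L = α Σi Σj w(ui, pj) log Pr(ui, pj) + (1 − α) Σj Σq w(pj, aq) log Pr(pj, aq),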

where α is used to adjust the relative weights of the two types of observations. The EM algorithm can again be used to generate estimates for Pr(zk), Pr(ui|zk), Pr(pj|zk), and Pr(aq|zk). By applying probabilistic inference, we can measure the relationships among users, pages, and attributes, and we are thus able to answer questions such as "What are the most important attributes for a group of users?", or "Given a Web page with a specified set of attributes, will it be of interest to a given user?", and so on.

EXPERIMENTS WITH PLSA MODEL

In this section, we use two real data sets to perform experiments with our PLSA-based Web usage mining framework. We first provide several illustrative examples of characterizing users' tasks, as introduced in the previous section, and of identifying the primary tasks in an individual user session. We then perform two types of evaluation based on the generated user segments. First, we evaluate individual user segments to determine the degree to which they represent the activities of similar users. Secondly, we evaluate the effectiveness of these user segments in the context of generating automatic recommendations. In each case, we compare our approach with the standard clustering approach for the discovery of Web user segments. In order to compare the clustering approach to the PLSA-based model, we adopt a standard algorithm for creating aggregate profiles based on session clusters. In the latter approach, first, we apply a

multivariate clustering technique, such as k-means, to the user-session data in order to obtain a set of user clusters TC = {c1, c2, ..., ck}; then, an aggregate representation, prc, is generated for each cluster c as a set of page view-weight pairs:
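One common form of this representation, with μ a significance threshold (assumed notation):

prc = { (p, weight(p, prc)) : p ∈ P, weight(p, prc) ≥ μ },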

where the significance weight, weight(p, prc), of a page p within the profile is given by weight(p, prc) = (1/|c|) Σ_{u in c} w(p, u), and w(p, u) is the weight of page view p in user session u ∈ c. Thus, each segment is represented as a vector in the page view space. In the following discussion, by a user segment we mean its aggregate representation as a page view vector.

Data Sets

In our experiments, we use Web server log data from two Web sites. The first data set is based on the server log data from the host Computer Science department. This Web site provides various functionalities to different types of Web users. For example, prospective students can obtain program and admissions information or submit online applications. Current students can browse course information, register for courses, make appointments with faculty advisors, and log into the Intranet to do degree audits. Faculty can perform student advising functions online or interact with the faculty Intranet. After data preprocessing, we identified 21,299 user sessions (U) and 692 page views (P), with each user session consisting of at least 6 page views. This data set is referred to as the CTI data. The second data set is from the server logs of a local affiliate of a national real estate company. The primary function of the Web site is to allow prospective buyers to visit pages and information related to some 300 residential properties. The portion of the Web usage data covering the period of analysis contained approximately 24,000 user sessions from 3,800 unique users. During preprocessing, we recorded each user-property pair and the corresponding visit frequency. Finally, the data was filtered to limit the final data set to those users that had visited at least 3 properties. In our final data matrix, each row represented a user vector with properties as dimensions and visit frequencies as the corresponding dimension values. We refer to this data set as the Realty data. Each data set was randomly divided into multiple training and test sets for use with 10-fold cross-validation. By conducting a sensitivity analysis, we chose 30 factors in the case of the CTI data and 15 factors for the Realty data. To avoid overtraining, we implemented the Tempered EM algorithm to train the PLSA model.

Example Usage Patterns Based on the PLSA Model

Figure 1 depicts an example of the characteristic pages for a specific discovered task in the CTI data. The first 6 pages have the highest Pr(page|task) · Pr(task|page) values and are thus considered the characteristic pages of this task. Observing these characteristic pages, we may infer that this

task corresponds to prospective students who are completing an online admissions application. Here, characteristic has two implications. First, if a user wants to perform this task, he/she must visit these pages to accomplish his/her goal. Secondly, if we find that a user session contains these pages, we can claim the user must have performed the online application. Some pages may not be characteristic pages for the task, but may still be useful for the purpose of analysis. An example of such a page is the /news/ page, which has a relatively high Pr(page|task) value and a low

Pr(task|page) value. Indeed, by examining the site structure, we found that this page serves as a navigational page, and it can lead users to different sections of the site to perform different tasks (including the online application). This kind of discovery can help Web site designers to identify the functionalities of pages and reorganize Web pages to facilitate users' navigation. Figure 2 identifies three tasks in the Realty data. In contrast to the CTI data, in this data set the tasks represent common real estate properties visited by users, thus reflecting user interest in similar properties. The similarities are clearly observed when property attributes are shown for each characteristic page. From the characteristic pages of each task, we infer that Task 4 represents users' interest in newer and more expensive properties, while Task 0 reflects interest in older and very low priced properties. Task 5 represents interest in mid-range priced properties. We can also identify prototypical users corresponding to specific tasks. An example of such a user session is depicted in Figure 3, corresponding to yet another task in the Realty data which reflects interest in very high priced and large properties (task not shown here).

Our final example in this section shows how the prominent tasks contained in a given user session can be identified. Figure 4 depicts a random user session from the CTI data. Here we only show the task IDs which have the highest probabilities Pr(task|session). As indicated, the dominant tasks for this user session are Tasks 3 and 25. The former is, in fact, the online application task discussed earlier, and the latter is a task that represents international students who are considering applying for admission. It can easily be observed that, indeed, this session seems to identify an international student who, after checking admission and visa requirements, has applied for admission online.

Evaluation of User Segments and Recommendations

We used two metrics to evaluate the discovered user segments. The first is called the Weighted Average Visit Percentage (WAVP). WAVP allows us to evaluate each segment individually according to the likelihood that a user who visits any page in the segment will visit the rest of the pages in that segment during the same session. Specifically, let T be the set of transactions in the evaluation set, and for a segment s, let Ts denote the subset of T whose elements contain at least one page from s. The weighted average similarity to the segment s over all transactions is then computed, taking both the transactions and the segments as vectors:
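One plausible reconstruction of this metric, normalizing the average transaction-segment match by the total weight of the segment (not necessarily the exact original formula):

WAVP(s) = ( Σ_{t in Ts} (t · s) / |Ts| ) / Σ_{p in s} weight(p, s).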

Note that a higher WAVP value implies better quality of a segment, in the sense that the segment represents the actual behavior of users based on their similar activities. For evaluating recommendation effectiveness, we use a metric called Hit Ratio in the context of top-N recommendation. For each user session in the test set, we took the first K pages as a representation of an active session to generate a top-N recommendation set. We then compared the recommendations with the page view (K + 1) in the test session, with a match being considered a hit. We define the Hit Ratio as the total number of hits divided by the total number of user sessions in the test set. Note that the Hit Ratio increases as the value of N (the number of recommendations) increases. Thus, in our experiments, we pay special attention to smaller numbers of recommendations (between 1 and 20) that result in good hit ratios.

In the first set of experiments we compare the WAVP values for the segments generated using the PLSA model and those generated by the clustering approach. Figures 5 and 6 depict these results for the CTI and Realty data sets, respectively. In each case, the segments are ranked in decreasing order of WAVP. The results clearly show that the probabilistic segments based on the latent factors provide a significant advantage over the clustering approach. In the second set of experiments we compared the recommendation accuracy of the PLSA model with that of the k-means clustering segments. In each case, the recommendations are generated according to the

recommendation algorithm presented in Section 3.2. The recommendation accuracy is measured using the hit ratio for different numbers of generated recommendations. These results are depicted in Figures 7 and 8 for the CTI and Realty data sets, respectively. Again, the results show a clear advantage for the PLSA model. In most realistic situations, we are interested in a small, but accurate, set of recommendations. Generally, a reasonable recommendation set might contain 5 to 10 recommendations. Indeed, this range of values seems to represent the largest improvements of the PLSA model over the clustering approach.

ODP: The Open Directory Project

Description. The DMOZ Open Directory Project (ODP) [20] is the largest, most comprehensive human-edited web page catalog currently available. It covers 4 million sites filed into more than 590,000 categories (under 16 wide-spread top categories, such as Arts, Computers, News, Sports, etc.). Currently, there are more than 65,000 volunteer editors maintaining it. ODP's data structure is organized as a tree, where the categories are internal nodes and pages are leaf nodes. By using symbolic links, nodes can appear to have several parent nodes. Since ODP truly is free and open, everybody can contribute to or re-use the dataset, which is available in RDF (structure and content are available separately). Google, for example, uses ODP as the basis for its Google Directory service.

Applications. Besides its re-use in other directory services, the ODP taxonomy is used as a basis for various other research projects. In Persona, ODP is applied to enhance HITS with dynamic user profiles using a tree coloring technique (by keeping track of the number of times a user has visited pages of a specific category). Users can rate a page as being good or unrelated regarding their interest, and this data is then used to rank and omit interesting/unwanted results. While that approach asks users for feedback, we rely only on user profiles, i.e., a one-time user interaction. Moreover, we do not develop our search algorithm on top of HITS, but as a refinement on top of any search algorithm. A similar approach using the ODP taxonomy has been applied to a recommender system for research papers. The Open Directory can also be used as a reference source containing good pages, to fight web spam containing uninteresting URLs through white listing, as a web corpus for comparisons of ranking algorithms, as well as for focused crawling towards special-

interest pages. Unfortunately, the free availability of ODP also has its downside. A clone of the directory modified to contain some spam pages could trap people into linking to this fake directory, which results in an increased ranking not only for the directory clone, but also for the injected spam pages.

Page Rank and Personalized Page Rank

Page Rank computes Web page scores based on the graph inferred from the link structure of the Web. It is based on the idea that a page has a high rank if the sum of the ranks of its back links is high. Given a page p, and its input I(p) and output O(p) sets of links, the Page Rank formula is:
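Given the description of c and the E vector below, the standard form is:

PR(p) = c · E(p) + (1 − c) · Σ_{q in I(p)} PR(q) / |O(q)|.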

The dampening factor c < 1 (usually 0.15) is necessary to guarantee convergence and to limit the effect of rank sinks [2]. Intuitively, a random surfer will follow an outgoing link from the current page with probability (1 − c) and will get bored and select a random page with probability c (i.e., the E vector has all entries equal to 1/N, where N is the number of pages in the Web graph). Initial steps towards personalized page ranking have already been described, proposing a slight modification of the above algorithm to redirect the random surfer towards preferred pages using the E vector. Several distributions for this vector have been proposed since.

Topic-sensitive Page Rank. Haveliwala builds a topic-oriented Page Rank, starting by computing off-line a set of 16 Page Rank vectors biased on each of the 16 main topics of the Open Directory Project. Then, the similarity between a user query and each of these topics is computed, and the 16 vectors are combined using appropriate weights.

Personalized Page Rank. A more recent investigation uses a different approach: it focuses on user profiles. One Personalized Page Rank Vector (PPV) is computed for each user. The personalization aspect of this algorithm stems from a set of hubs (H), defined here as pages with a high Page Rank (differently from the more popular hubs-and-authorities definition), from which each user has to select her preferred pages. PPVs can be expressed as a linear combination of PPVs for preference vectors with a single non-zero entry corresponding to each

of the pages from the preference set (called basis vectors). The advantage of this approach is that for a hub set of N pages, one can compute 2^N Personalized Page Rank vectors without having to run the algorithm again; in contrast, with the straightforward biased Page Rank the whole computation must be performed for each biasing set. The disadvantages are forcing the users to select their preference set only from within a given group of pages (common to all users), as well as the relatively high computation time for large-scale graphs.

USING ODP METADATA FOR PERSONALIZED SEARCH

Motivation. We presented in Section 2.2 the most popular approaches to personalizing Web search. Even though they are the best so far, they all have some important drawbacks. With the straightforward E-vector personalization, we need to run the entire algorithm for each preference set (or biasing set), which is practically impossible in a large-scale system. At the other end, topic-sensitive Page Rank computes biased Page Rank vectors limited to only the broad 16 top-level categories of the ODP, because of the same problem. The PPV approach improves this somewhat, allowing the algorithm to bias on any subset of a given set of pages (H). Although work has been done in the direction of improving the quality of this latter set [4], one limitation is still that the preference set is restricted to a subset of this given set H (if H = {CNN, FOX News}, we cannot bias on MSNBC, for example). More importantly, the bigger H is, the more time is needed to run the algorithm. Thus, finding

a simpler and faster algorithm with at least similar personalization granularity is still a worthy goal to pursue. In the following we take another step towards this goal.

Introduction. Our first step was to evaluate how ODP search compares with Google search, specifically exploiting the fact that all ODP entries are categorized into the ODP topic hierarchy. We started with the following two observations: 1. Given the fact that ODP includes just 4 million entries, while the Google database includes 8 billion, does ODP-based search stand a chance of being comparable to Google? 2. ODP advanced search offers a rudimentary personalized search feature by restricting the search to the entries of just one of the 16 main categories. Google Directory offers a related feature, by allowing search to be restricted to a specific category or subcategory. Can we improve this personalized search feature, taking the user profile into account in a more

sophisticated way, and how does such an enhanced personalized search over the ODP or Google entries compare to ordinary Google results? Most people would probably answer (1) "No, not yet", and (2) "Yes". In the following section we will prove the correctness of the second answer by introducing a new personalized search algorithm, and then we will concentrate on the first answer in the experiments section.

Algorithm. Our algorithm exploits the annotations accumulated in generic large-scale taxonomies such as the Open Directory. Even though we concentrate our forthcoming discussion on ODP, practically any similar taxonomy can be used. These annotations can easily be used to achieve personalization, and can also be combined with the initial Page Rank algorithm. We define user profiles using a simple approach: each user has to select several topics from the ODP which best fit her interests. For example, a user profile could look like this:
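A hypothetical illustration (these specific topics are only an example, not taken from the original text):

/Arts/Architecture
/Computers/Software/Operating_Systems
/Recreation/Travel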

Then, at run-time, the output given by a search service (Google, ODP Search, etc.) is re-sorted using a calculated distance from the user profile to each output URL. The execution is also depicted in Algorithm 3.1.
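A minimal sketch of this re-ranking step in Python, assuming the profile and each result's ODP or Google Directory topics are given as ODP-style paths and using the naive minimum tree distance described in the next subsection; the function and variable names are illustrative, not taken from the original Algorithm 3.1:

def tree_distance(topic_a, topic_b):
    # Minimum tree distance: edges from each topic up to their deepest
    # common ancestor (the "subsumer"), summed.
    a_parts = topic_a.strip("/").split("/")
    b_parts = topic_b.strip("/").split("/")
    common = 0
    for x, y in zip(a_parts, b_parts):
        if x != y:
            break
        common += 1
    return (len(a_parts) - common) + (len(b_parts) - common)

def profile_distance(profile_topics, result_topics):
    # Distance between the profile set A and the result's topic set B:
    # the minimum over all pairs in the Cartesian product A x B.
    return min(tree_distance(a, b)
               for a in profile_topics for b in result_topics)

def personalize(results, profile_topics):
    # Re-sort search results (each a dict with 'url' and 'topics') by
    # increasing distance to the user profile; untagged results go last.
    def key(result):
        if not result.get("topics"):
            return float("inf")  # no directory topic available
        return profile_distance(profile_topics, result["topics"])
    return sorted(results, key=key)

# Example usage with made-up data:
profile = ["/Arts/Architecture", "/Computers/Software"]
results = [
    {"url": "http://example.org/a", "topics": ["/Arts/Design/Interior_Design"]},
    {"url": "http://example.org/b", "topics": ["/Sports/Soccer"]},
]
print([r["url"] for r in personalize(results, profile)])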

Distance Metrics. When performing a search on the Open Directory, each resulting URL comes with an associated ODP topic. Similarly, a good amount of the URLs output by Google are connected to one or more topics within the Google Directory (almost 50%, as discussed in Section 3.2). Therefore, in both cases, for each output URL we are dealing with two sets of nodes from the topic tree: (1) those representing the user profile (set A), and (2) those associated with the URL (set B). The distance between these sets can then be defined as the minimum distance between all pairs of nodes given by the Cartesian product A × B. Finally, there are quite a few possibilities for defining the distance between two nodes. Even though, as we will see from the experiments, the simplest approaches already provide very good results, we are now performing an optimality study (presented in detail elsewhere) to determine which metric best fits this kind of search. In the following, we present our best solutions so far. Naive Distances. The simplest solution is the minimum tree distance, which, given two nodes a and b, returns the sum of the minimum number of tree edges between a and the subsumer (the deepest node common to both a and b) plus the minimum number of tree edges between b and the subsumer (i.e., the shortest path between a and b). In the example from Figure 1, the distance between /Arts/Architecture and /Arts/Design/Interior Design/Events/Competitions is 5, and the subsumer is /Arts. If we also consider the inter-topic links from the Open Directory, the simplest distance becomes the graph

shortest path between a and b. For example, if there is a link between Interior Design and Architecture in Figure 1, then the distance between Competitions and Architecture is 3. This solution requires loading either the entire topic graph or all the inter-topic links into memory. Furthermore, its utility varies from user to user: the existence of a link between Architecture and Interior Design does not always imply that a famous architect (one level below in the tree) is very close to the area of interior design. We can consider these links in our metric in three ways: 1. Consider the graph containing all inter-topic links and output the shortest path between a and b. 2. Consider the graph containing only the inter-topic links directly connected to a and b and output the shortest path. 3. If there is an inter-topic link between a and b, output 1; otherwise, ignore all inter-topic links and output the tree distance between a and b.

Complex Distances. The main drawback of the above metrics is that they ignore the depth of the subsumer. The bigger this depth is, the more related the nodes are (i.e., the concepts represented by them). This problem is addressed by an approach which investigates ten intuitive strategies for measuring semantic similarity between words using hierarchical semantic knowledge bases such as WordNet [18]. Each of them was evaluated experimentally on a group of testers, the best one having a 0.9015 correlation between human judgment and the following formula:
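Reconstructed here under the assumption that it is the widely used exponential similarity measure with parameters α and β:

sim(a, b) = e^(−α·l) · (e^(β·h) − e^(−β·h)) / (e^(β·h) + e^(−β·h)).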

The parameters α and β were defined as 0.2 and 0.6 respectively, h is the tree depth of the subsumer, and l is the semantic path length between the two words. Considering that several words are attached to each concept and sub-concept, l is 0 if the two words are in the same concept, 1 if they are in different concepts but the two concepts share at least one common word, and the shortest tree path otherwise. Although this measure is very good for words, it is not perfect when applied to the Open Directory topic tree, because it does not distinguish between the distance from a (the profile node) to the subsumer and the distance from b (the output URL) to the subsumer. Consider node a to be /Top/Games and b to be /Top/Computers/Hardware/Components/Processors/x86. A teenager interested in computer games (level 2 in the ODP tree) could be very satisfied receiving a page

about new processors (level 6 in the tree), which might improve his gaming experience. On the other hand, the opposite scenario (profile on level 6 and output URL on level 2) does not hold any more, at least not to the same extent: a processor manufacturer will generally be less interested in the games existing on the market. This leads to the following extension of the above formula:
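One plausible asymmetric extension, consistent with the parameters described next, where γ in [0, 1] weights the two path components (γ and the exact form are assumptions, not the authors' verified formula):

sim(a, b) = e^(−α·(γ·l1 + (1 − γ)·l2)) · (e^(β·h) − e^(−β·h)) / (e^(β·h) + e^(−β·h)),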

with l1 being the shortest path from the profile node to the subsumer, l2 the shortest path from the URL node to the subsumer, and γ a parameter in [0, 1].

Combining the Distance Function with Google Page Rank. And yet something is still missing. If we use Google to do the search and then sort the URLs according to the Google Directory taxonomy, some high quality pages might be missed (i.e., those which are top ranked, but which are not in the directory). In order to integrate these, the above formula can be combined with the Google Page Rank. We propose the following approach:

Conclusion. Human judgment is a non-linear process over information sources, and it is therefore very difficult (if not impossible) to propose a metric which is in perfect correlation with it. A thorough experimental analysis of all these metrics (which we are currently performing, but which is outside the scope of this paper) could give us a good enough approximation. In the next section we will present some experiments using the simple metric presented first, and show that it already yields quite reasonable improvements.

Experimental Results

To evaluate the benefits of our personalization algorithm, we interviewed 17 of our colleagues (researchers in different computer science areas, psychologists, pedagogues and designers), asking each of them to define a user profile according to the Open Directory topics (see Section 3.1 for an example profile), as well as to choose three queries of the following types:

1. One clear query, which they knew to have one or at most two meanings

2. One relatively ambiguous query, which they knew to have two or three meanings

3. One ambiguous query, which they knew to have at least three meanings, preferably more

We then compared test results using the following four types of Web search:

1. Plain Open Directory Search

2. Personalized Open Directory Search, using our algorithm from Section 3.1 to reorder the top 1000 results returned by the ODP Search

3. Google Search, as returned by the Google API [8]

4. Personalized Google Search, using our algorithm from Section 3.1 to reorder the top 100 URLs returned by the Google API, having as input the Google Directory topics returned by the API for each resulting URL

For each algorithm, each tester received the top 5 URLs with respect to each type of query, 15 URLs in total. All test data was shuffled, such that testers were aware neither of the algorithm, nor of the ranking of each assessed URL. We then asked the subjects to rate each URL from 1 to 5, with 1 denoting a very poor result with respect to their profile and expectations (e.g., topic of the result, content, etc.) and 5 a very good one. Finally, for each sub-set of 5 URLs we took the average grade as a measure of the importance attributed to that <algorithm, query type> pair. The average values for all users and for each of these pairs can be found in Table 1, together with the averages over all types of queries for each algorithm. We of course expected the plain ODP search to be significantly worse than the Google search, and that was the case: an average of 2.41 points for ODP versus the 2.76 average received by Google. Also predictable was the dependence of the grading on the query type. If we average the values in the three columns representing each query type, we get 2.54 points for ambiguous queries, 2.91 for semi-ambiguous ones and 3.25 for clear ones; thus, the clearer the query, the better rated the returned URLs.

Personalized Search using ODP. But the same Table 1 also provides us with a more surprising result: the personalized search algorithm is clearly better than Google search, regardless of whether we use the Open Directory or the Google Directory as taxonomy. Therefore, a

personalized search over a well-selected set of 4 million pages often provides better results than a non-personalized one over a set of 8 billion pages. This is a clear indicator that taxonomy-based result sorting is indeed very useful. For the ODP experiments, only our clear queries did not receive a big improvement, mainly because for some of

these queries ODP contains fewer than 5 URLs matching both the query and the topics expressed in the user profile.

Personalized Search using Google. Similarly, personalized search using the Google Directory was far better than the usual Google search. We would have expected it to be even better than the ODP-based personalized search, but the results were probably negatively influenced by the fact that the ODP experiments were run on 1000 results, whereas the Google Directory ones were run on only 100, due to the limited number of Google API licenses we had. The grading results are summarized in Figure 2. Generally, we can conclude that personalization significantly increases output quality for ambiguous and semi-ambiguous queries. For clear queries, one should prefer Google to Open Directory search, but also Google Directory search to

the plain Google search. Also, the answers we sketched in the beginning of this section proved to be true: Google search is still better than Open Directory search, but we provided a personalized search algorithm which outperforms the existing Google and Open Directory search capabilities. Another interesting result is that 40.98% of the top 100 Google pages were also contained in the Google Directory. More specifically, for the ambiguous queries 48.35% of the top pages were in the directory, for the semi-ambiguous ones 41.35%, and for the clear ones 33.23%. Finally, let us add that we performed statistical significance tests on our experiments, obtaining the following results:

1. Statistical significance with an error rate below 1% for the algorithm criterion, i.e., there is a significant difference between the gradings of the algorithms.

2. An error rate below 25% for the query type criterion, i.e., the difference between the average grades with respect to query types is less statistically significant.

3. Statistical significance with an error rate below 5% for the inter-relation between query type and algorithm, i.e., the re-

EXTENDING ODP ANNOTATIONS TO THE WEB

In the previous section we have shown that using ODP entries and their categorization directly for personalized search turns out to work amazingly well. Can this huge annotation effort invested in the ODP project (with 65,000 volunteers participating in building and maintaining the ODP database) be extended to the rest of the Web? This would be useful if we want to find less highly rated pages not contained in the directory. Simply extending the ODP effort does not scale because, first, significantly increasing the number of volunteers seems improbable, and second, extending the selection of ODP entries to a larger percentage of the Web obviously becomes harder and less rewarding once we try to include more than just the most important pages for a specific topic. We start

with the following questions: Given that Page Rank for a large collection of Web pages can be biased towards a smaller subset, can this be done with sets of ODP entries corresponding to given categories/subcategories as well? Specifically, ODP entries consist of many of the most important entries in a given category; do we have enough entries for each topic such that biasing on these entries makes a difference? When does biasing make a difference? One of the most important works investigating Page Rank biasing is the topic-sensitive approach described earlier. It first uses the 16 top levels of the ODP to bias Page Rank on, and then provides a method to combine the 16 resulting vectors into a more query-dependent ranking. But what if we would like to use one or several ODP (sub-)topics to compute a Personalized Page Rank vector? More generally, what if we would like to achieve such personalization by biasing Page Rank towards some generic subset of pages from the current Web crawl we have? Many authors have used such biasing in their algorithms. Yet none have studied the boundaries of this personalization, i.e., the characteristics the biasing set has to exhibit in order to obtain relevant results (rankings which are different enough from the non-biased Page Rank). We will investigate this in the current section. Once these boundaries are defined, we will use them to evaluate (some of) the biasing sets available from ODP in Section 4.2. First, let us establish a characteristic function for biasing sets, which we will use as the parameter determining the effectiveness of biasing. Pages in the World Wide Web can be characterized in quite a few ways. The simplest of them is the out-degree (i.e., the total number of outgoing links), based on the observation that if biasing is targeted at such a page, the newly achieved increase in Page Rank score will be passed forward to all its out-neighbors (pages to which it points). A more sophisticated version of this measure is the hub value of pages. Hubs were originally defined as pages pointing to many other high quality pages; reciprocally, high quality pages pointed to by many hubs are called authorities. There are several algorithms for calculating this measure, the most common ones being HITS and its more stable improvements SALSA and Randomized HITS. Yet biasing on better hub pages will have less influence on the rankings, because the vote a page gives is propagated to its out-neighbors divided by its out-degree. Moreover, there is also an intuitive reason against this measure: Page Rank biasing is usually performed to achieve some degree of personalization, and people tend to prefer highly valued

Given this preference for authorities, a more natural measure is an authority-based one, such as the non-biased PageRank score of a page. Even though most biasing sets consist of high PageRank pages, in order to make this analysis complete we ran our experiments on different choices for these sets, each of which has to be tested with different sizes. For comparison to PageRank, we used two measures of similarity between the non-biased PageRank vector and each resulting biased vector of ranks. They are defined as follows:

1. OSim indicates the degree of overlap between the top n elements of two ranked lists τ1 and τ2. It is defined as

   OSim(τ1, τ2) = |Top_n(τ1) ∩ Top_n(τ2)| / n.

2. KSim is a variant of Kendall's τ distance measure. Unlike OSim, it measures the degree of agreement between the orderings of the two ranked lists. Let U be the union of the items occurring in Top_n(τ1) and Top_n(τ2), let τ1' be the extension of τ1 in which the items of U \ Top_n(τ1) appear after all items of Top_n(τ1), and let τ2' be defined analogously as an extension of τ2. Using these notations, KSim is defined as follows:

   KSim(τ1, τ2) = |{(u, v) : τ1' and τ2' agree on the relative order of u and v, u ≠ v}| / (|U| · (|U| − 1)).
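As an illustration, both measures can be computed directly from the two ranked lists. The sketch below is ours (the list contents and n are arbitrary example values); it counts unordered pairs, which yields the same ratio as the ordered-pair formulation above:

```python
from itertools import combinations

def osim(tau1, tau2, n):
    """Overlap between the top-n items of two ranked lists."""
    top1, top2 = set(tau1[:n]), set(tau2[:n])
    return len(top1 & top2) / n

def ksim(tau1, tau2, n):
    """Fraction of item pairs on whose relative order the two extended lists agree."""
    top1, top2 = tau1[:n], tau2[:n]
    set1, set2 = set(top1), set(top2)
    # extend each top-n list with the items it misses from the union,
    # appended after its own items (here in the order they have in the other list)
    ext1 = top1 + [u for u in top2 if u not in set1]
    ext2 = top2 + [u for u in top1 if u not in set2]
    pos1 = {u: i for i, u in enumerate(ext1)}
    pos2 = {u: i for i, u in enumerate(ext2)}
    union = set1 | set2
    agree = sum(1 for u, v in combinations(union, 2)
                if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v]))
    # unordered pairs: |U|(|U|-1)/2, same ratio as the ordered-pair definition
    return agree / (len(union) * (len(union) - 1) / 2)

# Example with two small rankings and n = 3 (toy data):
print(osim(["a", "b", "c", "d"], ["b", "a", "e", "c"], n=3))   # ≈ 0.67
print(ksim(["a", "b", "c", "d"], ["b", "a", "e", "c"], n=3))   # ≈ 0.67
```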

Even though prior work used n = 20, we chose n = 100, after experimenting with both values and obtaining more stable results with the latter. A general study of different similarity measures for ranked lists can be found in the literature. Let us start by analyzing biasing on high quality pages (i.e., pages with a high PageRank). We consider the most common biasing set to contain pages drawn from the range [0, 10]% of the list of pages sorted by PageRank score. We varied the sum of scores within this set between 0.00005% and 10% of the total sum of scores over all pages (for simplicity, we will call this total TOT hereafter). For very small sets, the biasing produced an output only somewhat different from the non-biased ranking: about 38% Kendall similarity (see Figure 3). The same happened for large sets, especially those above 1% of TOT. Finally, the graph also makes clear where we would obtain the rankings most different from the non-biased ones.
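For illustration only, a biasing set of the kind used in these experiments could be assembled as sketched below; the score dictionary, percentile band and target fraction of TOT are assumed example inputs, not the exact procedure used to generate the figures:

```python
import random

def biasing_set_by_tot(scores, percentile_range=(0.0, 0.10), tot_fraction=0.001, seed=0):
    """Pick pages from a PageRank percentile band until their scores sum to
    `tot_fraction` of the total sum of scores over all pages (TOT in the text)."""
    total = sum(scores.values())                       # this is TOT
    ranked = sorted(scores, key=scores.get, reverse=True)
    lo, hi = (int(len(ranked) * p) for p in percentile_range)
    candidates = ranked[lo:hi]                         # e.g. the top [0, 10]% of pages
    random.Random(seed).shuffle(candidates)

    chosen, accumulated = set(), 0.0
    for page in candidates:
        if accumulated >= tot_fraction * total:
            break
        chosen.add(page)
        accumulated += scores[page]
    return chosen
```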

One might wish to bias only on the very best pages (the top [0, 2]%, as in Figure 4). In this case, the above results are simply shifted a little to the right on the x-axis of the graph, i.e., the highest differences are achieved for set sizes from 0.02% to 0.75% of TOT. This was to be expected, as all pages in the biasing set were already top ranked, and it therefore takes somewhat more effort to produce a different output with such a set. Another possible input set consists of randomly selected pages (Figure 5). Such a set most probably contains many low PageRank pages. This is why, although the biased ranks are very different for low TOT values, they become extremely similar (up to almost identical) once the set's share of TOT exceeds 0.01% (it would take a lot of low PageRank pages to accumulate, for example, 1% of the overall sum of scores). The extreme case is to bias only on low PageRank pages (Figure 6). In this case, the biasing set contains too many pages even sooner, around 0.001% of TOT. The last experiment is mostly of theoretical interest. One would expect to obtain the smallest similarities to the non-biased rankings when using a biasing set drawn from the [2, 5]% range, because these pages are already close to the top, and biasing on them has the best chances to overturn the list. Experimental results support this intuition.

The graphs above were initially generated from a crawl of 3 million pages. Once all of them had been finalized, we selectively reran similar experiments on the Stanford WebBase crawl, obtaining similar results. For example, a biasing set of randomly selected pages whose scores summed up to 1% of TOT produced rankings with a Kendall similarity of 0.622 to the non-biased ones, whereas a set summing up to only 0.0005% of TOT produced a similarity of 0.137. This was necessary in order to show that the graphs discussed above are not influenced by the crawl size. Even so, the limits they establish are not totally accurate, because of the random or targeted random selection (e.g., towards the top [0, 2]% pages) of our experimental biasing sets.

Is biasing possible in the ODP context? The URLs collected in the Open Directory are manually added Web pages supposed to (1) cover the specific topic of the ODP tree leaf they belong to, and (2) be of high quality. Neither requirement is fully satisfied. Sometimes (rarely, though) the pages do not really represent the topic to which they were added. More important for PageRank biasing, they usually cover a large interval of PageRank values, which made us decide for the random biasing model. However, we are aware that the human editors chose many more high quality pages than low quality ones, and thus the conclusions of the analysis are susceptible to errors. Generally, according to the random model of biasing, every set whose scores sum to less than 0.015% of TOT is good for biasing. According to this, all possible biasing sets analyzed in Tables 3, 4 and 5 would generate a sufficiently different PageRank vector. We can therefore conclude that biasing is (most probably) possible on all subsets of the Stanford Open Directory crawl.

Web usage mining has been extensively used to analyze web log data. There exist various methods based on data mining algorithms and probabilistic models; the related literature is very extensive, and many of these approaches fall outside the scope of this paper. There exist many approaches for discovering sequences of visits in a web site. Some of them are based on data mining techniques, whereas others use probabilistic models, such as Markov models, to model the users' visits. Such approaches aim at identifying representative trends and browsing patterns describing the activity in a web site, and they can assist web site administrators in redesigning or customizing the web site, or in improving the performance of their systems. They do not, however, propose any methods for personalizing the web sites themselves. There exist some approaches that use the aforementioned techniques in order to personalize a web site. Contrary to our approach, these approaches do not distinguish between different users or user groups in order to perform the personalization.

Thus, the methods that seem most relevant to ours, in terms of identifying different interest groups and personalizing the web site based on these profiles, are those based on collaborative filtering. Collaborative filtering systems are used for generating recommendations and have been broadly used in e-commerce. Such systems are based on the assumption that users with common interests exhibit similar searching/browsing behavior. Thus, the identification of similar user profiles enables the filtering of relevant information and the generation of recommendations. Similar to such approaches, we also identify users with common interests and use this information to personalize the topic directory. In our work, however, we do not model the user profiles as vectors in order to find similar users. Instead, we use clustering to group users into interest groups. Moreover, we propose the use of sequential pattern mining in order to generate recommendations. Thus, we also capture the sequential dependencies within users' visits, which is not the case with collaborative filtering systems.

All of the aforementioned approaches aim at personalizing generic web sites. Our approach focuses on the personalization of a specific type of web site, that of topic directories. Since topic directories organize web content into meaningful categories, we can regard them as a form of digital library or portal. In this context, we also review here some approaches for personalizing digital libraries and web portals.

Some early approaches were based on explicit user input, and the personalization services provided were limited to simplified search functionalities or alerting services. Other approaches propose the semiautomatic generation of user recommendations based on implicit user input: information is extracted from user accesses to the DL resources and is then used for further retrieval or filtering. As already mentioned, our approach does not limit its personalization services to identifying the preferences of each individual user alone. Rather, we identify user groups with common interests and behavior, expressed by visits to certain categories and information resources. This is also enabled by approaches that are based on collaborative filtering. Those approaches, however, fail to capture the sequential dependencies between the users' visits, as discussed previously.
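As a rough illustration of how sequential information can drive recommendations within one interest group, consider the sketch below. It is a simplified, first-order stand-in, not the sequential pattern mining algorithm actually proposed in this work, and all names and data in it are invented:

```python
from collections import Counter, defaultdict

def mine_next_category(sessions):
    """For each category, count which categories typically follow it in the
    sessions of one interest group (a first-order stand-in for sequential patterns)."""
    followers = defaultdict(Counter)
    for session in sessions:                      # a session is a list of visited categories
        for current, nxt in zip(session, session[1:]):
            followers[current][nxt] += 1
    return followers

def recommend(followers, current_category, k=3):
    """Recommend the k categories most frequently visited right after the current one."""
    return [cat for cat, _ in followers[current_category].most_common(k)]

# Usage with toy sessions of one user cluster (illustrative data):
cluster_sessions = [["Arts", "Arts/Music", "Arts/Music/Jazz"],
                    ["Arts", "Arts/Music", "Arts/Movies"],
                    ["Arts", "Arts/Movies"]]
patterns = mine_next_category(cluster_sessions)
print(recommend(patterns, "Arts/Music"))          # e.g. ['Arts/Music/Jazz', 'Arts/Movies']
```

Unlike a plain collaborative filtering neighbourhood, such counts depend on the order of the visits, which is the property emphasised above.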

MODELLING TOPIC DIRECTORIES

A topic directory is a hierarchical organization of thematic categories. Each category contains resources (i.e., links to web pages). A category may have subcategories and/or related categories. Subcategories narrow the content of broad categories. Related categories contain similar resources, but they may exist in different places of the directory. Note that the related relationship is bidirectional, that is, if category N is related to M, then M is also related to N. A resource cannot belong to more than one category. We consider a graph representation of topic directories.

Definition 3.1. A topic directory D is a labelled graph G(V,E), where V is the set of nodes and E the set of edges, such that: (a) each node in V corresponds to a category of D and is labelled by the category name, (b) for each pair of nodes (n,m) corresponding to categories (N,M), where N is a subcategory of M in D, there is a directed edge from m to n, and (c) for each pair of nodes (n,m) corresponding to categories (N,M), where N and M are related categories in D, there is a bidirected edge between n and m.

The graph G(V,E) may also have shortcuts, which are directed edges connecting nodes in V. Examples of such graphs are illustrated in Figure 4. The role of shortcuts as a means for personalizing the directory will be further discussed in Section 5.

The case study of the Open Directory Project. In our work, we use the Open Directory Project (ODP) as a case study. Figure 1 illustrates a part of the ODP directory.
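For concreteness, a directory graph following Definition 3.1 could be represented roughly as follows; this is a minimal sketch, and the class and method names are ours, not part of the model:

```python
# Minimal sketch of the directory graph of Definition 3.1 (illustrative names).
class TopicDirectory:
    def __init__(self):
        self.subcategory_edges = set()   # directed edges m -> n (N is a subcategory of M)
        self.related_edges = set()       # bidirected edges between related categories
        self.shortcuts = set()           # directed edges added for personalization
        self.resources = {}              # category -> set of links (each link in one category only)

    def add_subcategory(self, parent, child):
        self.subcategory_edges.add((parent, child))

    def add_related(self, a, b):
        # the "related" relationship is bidirectional, so store it symmetrically
        self.related_edges.add(frozenset((a, b)))

    def add_shortcut(self, source, target):
        self.shortcuts.add((source, target))

    def add_resource(self, category, link):
        assert all(link not in links for links in self.resources.values()), \
            "a resource cannot belong to more than one category"
        self.resources.setdefault(category, set()).add(link)
```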

In ODP, there are three types of categories: (a) subcategories (which narrow the content of broad categories), (b) relevant categories (i.e., the ones appearing inside the "see also" section), and (c) symbolic categories (denoted by the "@" character after the category's name). Symbolic categories are subcategories that exist in different places of the directory. We consider relevant categories to be related categories, in the sense of Definition 3.1.

Navigation patterns. To represent the navigation behaviour of users when browsing the directory, we use the notion of navigation patterns. A navigation pattern is the sequence of categories visited by a user during a session. We note that such patterns may include multiple occurrences of the same categories; this might be the result of users going back and forth within a path in the directory. Finally, we also underline that during a session, a user may pursue more than one topic interest.
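A minimal sketch of how navigation patterns could be extracted from a session-annotated click log is given below; the log format and category names are assumptions made for the example, and the log entries are assumed to be in visit order:

```python
from collections import defaultdict

def navigation_patterns(click_log):
    """Build navigation patterns (ordered category sequences, repeats kept)
    from (session_id, category) pairs listed in visit order."""
    patterns = defaultdict(list)
    for session_id, category in click_log:
        patterns[session_id].append(category)
    return dict(patterns)

log = [(1, "Top/Arts"), (1, "Top/Arts/Music"), (1, "Top/Arts"), (1, "Top/Arts/Movies"),
       (2, "Top/Science"), (2, "Top/Science/Biology")]
print(navigation_patterns(log))
# Session 1 revisits "Top/Arts" (back-and-forth browsing) and pursues two
# different interests (Music and Movies) within the same session.
```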
