School of Information Science and Technology, Dalian Maritime University, Dalian, 116026, PR China
Institute for Photogrammetry and Remote Sensing, Chinese Academy of Surveying and Mapping, Beijing, 100039, PR China
Wuhan National Lab for Optoelectronics (WNLO), Huazhong University of Science and Technology, Wuhan, 430074, PR China
Article info
Article history:
Received 30 November 2009
Received in revised form
6 April 2010
Accepted 16 April 2010
Available online 2 May 2010
Keywords:
Collaborative filtering
Recommendation
Sparse matrix
Eigenvalue matrix
Peer-to-peer (P2P) networks
Abstract
With the increasing number of commerce facilities using peer-to-peer (P2P) networks, it is challenging
to recommend interesting or useful products and services to a particular customer. Collaborative
filtering (CF) is one of the most successful techniques for recommending items (such as music,
movies and web sites) that are likely to interest a user. However, conventional collaborative
filtering faces a number of challenges to its recommendation accuracy. One of the most important
is the sparsity inherent in the rating data. Another is that existing CF methods mainly consider
user-based or item-based ratings separately. In this paper, a
P2P-based hybrid collaborative filtering mechanism that combines user-based and item
attribute-based ratings is considered. We take advantage of the inherent item attributes to construct
a Boolean matrix to predict the blank elements of a sparse user-item matrix. Furthermore, a hybrid
collaborative filtering (HCF) algorithm is presented to improve the predictive accuracy. Case studies and
experimental results illustrate that our approaches not only help to predict the unrated blank data
of a sparse matrix but also improve the prediction accuracy as expected.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
In recent years, peer-to-peer (P2P) file-sharing networks have
become a popular new way to exchange resources, information
and services across a large number of autonomous peers [1,2].
Examples of P2P file sharing systems are: Gnutella, BitTorrent
and P2P Music streaming systems like iTunes [3]. These systems
enable users to form communities for sharing different types of
files. However, due to the explosive growth of the volume of
information, such as in the web, users should be able to make
choices without knowing all of the alternatives [4,5]. Moreover,
both the users and the data are distributed and dynamically
changing, which makes it difficult to filter, search and localize
the available content within the P2P network [6].
These developments and the associated
challenges have motivated the development of recommendation
systems. Collaborative filtering (CF) [7] is a personalized
recommendation technique that has been very promising both
in research and in industry. CF leverages the usage history of
groups of similar users in order to make recommendations to a
Corresponding author.
E-mail addresses: zhbliu@gmail.com (Z. Liu), eunice.qu@gmail.com,
buyhorse@gmail.com (W. Qu), lhtao@casm.ac.cn (H. Li), cs_xie@mail.hust.edu.cn
(C. Xie).
0167-739X/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2010.04.002
centers. (3) Assign each node to the nearest cluster center. (4)
Recompute the new cluster centers. (5) Repeat the two previous
steps until some convergence criterion is met (usually that the
assignment hasn't changed). Because the resulting clusters depend
on the initial random assignments, one disadvantage is that the algorithm
may not yield the same result on each run [25]. Another
disadvantage is that searching for neighbors across the whole P2P
network may degrade performance. The traditional
k-means algorithm for collaborative filtering is mainly applied to
item-based or user-based ratings separately. For this reason, in [26] the
authors propose a hybrid predictive algorithm with smoothing that
considers both the users' aspects and the items' aspects. In [27] eSciGrid
was presented to take into account the physical distance between
peers and the amount of traffic carried by each node. Wang
et al. present a unified probabilistic model for collaborative
filtering that uses Parzen-window density estimation to acquire
the probabilities of the proposed unified relevance model [28,29].
However, these approaches may not be well suited to P2P network
application scenarios or may ignore real-time performance
while finding closer neighbors.
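The k-means steps (3) to (5) listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the paper: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice of drawing the initial centers at random from the data points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: the initial centers are k points drawn at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step (3): assign each node to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step (5): stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centers = kmeans(points, k=2)
```

Changing `seed` changes the initial centers, which is the source of the run-to-run variation mentioned in the text.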
3. Sparsity limitation solution
In this section we present a method for alleviating the sparsity
challenge in collaborative filtering based on the item attributes
Boolean matrix.
3.1. Collaborative filtering based on item attributes
The first approach is to compose the eigenvalue matrix of items.
Its methodology is as follows. Each item can be divided into
several dimensions, and each dimension is an attribute of the item. At the
same time, each attribute has its initial eigenvalue. To simplify
the problem, we use a Boolean variable (uniformly 0 or 1) to
construct the eigenvalue matrix. We assume that a 1 indicates
that the item has that attribute and a 0 indicates that it does not. In order to
determine which items are similar, we need to define a similarity
function. We use the Euclidean distance to calculate the
degree of similarity among items,
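$$d(\mathrm{item}_i, \mathrm{item}_j) = \sqrt{\sum_{m=1}^{n} \left(p_{\mathrm{item}_i} - p_{\mathrm{item}_j}\right)^2} \tag{1}$$

$$\mathrm{sim}(\mathrm{item}_i, \mathrm{item}_j) = \frac{1}{1 + d(\mathrm{item}_i, \mathrm{item}_j)} \tag{2}$$

$$R_i = \frac{\sum_{j=1}^{n} R_j \cdot \mathrm{sim}(\mathrm{item}_i, \mathrm{item}_j)}{\sum_{j=1}^{n} \mathrm{sim}(\mathrm{item}_i, \mathrm{item}_j)} \tag{3}$$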
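[Table 1: user-item matrix with rows user 1, user 2, user 3, ..., user m and columns item1, item2, item3, ..., itemn. Ratings listed in the source: 5, 3, 4, 3, 4.]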
are m users and n items (movies), from which we obtain the
user-item matrix (as shown in Table 1). Each movie is
regarded as an item. Each item often has detailed information
on primitive concept levels. If a rating exists, the element
of the matrix is the user's rating of the item; otherwise the
element is blank, which means that there has been no such
rating. In this paper, each movie item contains four preferences:
genre, language, release year and country. Each preference has
several primitive values. For instance, genre contains {action,
adventure, mystery, drama, documentary, romance, comedy,
etc.}. Language contains {Chinese, Cantonese, English, Japanese,
Korean, etc.}. Release year contains {less than two years, less than
five years, more than five years}. Country contains {Chinese
Mainland, HK & Taiwan, Occident, Japan, Korea, etc.}.
Each user of the system expresses an opinion about the movies
he or she likes or dislikes by giving each a score. The opinion of a
customer is expressed on a five-level rating scale (from
1 to 5). All of these ratings are captured in the user-item matrix.
In this matrix, users are in the rows and items (movies) are in the
columns. Each cell contains the user's rating of that item, as shown
in Table 1. For example, 1 star expresses that the user finds the movie awful
and 5 stars expresses that the user finds it excellent. From Table 1,
we can see that user 1 rates the movie item1 5 stars, and
user 2 rates the same movie 3 stars.
Obviously, most of the elements in the user-item matrix are unrated blank data. To improve the accuracy of filling them in, we first construct the Boolean matrix in terms of item attributes. In this specific
instance, item1 is Crouching Tiger, Hidden Dragon,
item2 is Tomb Raider, item3 is The Lord of the Rings and item4 is
Mr and Mrs Smith. As a result, we can get the eigenvalue matrix
of each item (as shown in Table 2(a)-(d)).
Table 2
Eigenvalue matrix of items.
[Panels (a)-(d): one Boolean eigenvalue matrix per item, item1 to item4, with rows Genre, Language, Year and Country.]

To compute the prediction rating of user 1 for item2, according to
Eq. (1), we first calculate the Euclidean distance between item2 and
item1, item3 and item4 respectively:

$$d(\mathrm{item}_2, \mathrm{item}_1) = \sqrt{7} = 2.646,$$
$$d(\mathrm{item}_2, \mathrm{item}_3) = 1,$$
$$d(\mathrm{item}_2, \mathrm{item}_4) = 2.$$
Hence, as Fig. 1 shows, the shortest distance is the one
from item2 to item3.
In order to compute the similarity between item2 and item1,
item3 and item4 respectively, according to Eq. (2) we get:
sim(item2, item1) = 0.27,
sim(item2, item3) = 0.5,
sim(item2, item4) = 0.33.
As a result, in terms of Eq. (3), the prediction rating of user 1 for item2
is:
$$\frac{5 \times 0.27 + 4 \times 0.5 + 5 \times 0.33}{0.27 + 0.5 + 0.33} = 4.54 \approx 5 \quad \text{(after rounding to the nearest integer).}$$
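The worked example above can be reproduced with a short Python sketch of Eqs. (1) to (3). The function names are illustrative assumptions. The distances and the ratings 5, 4 and 5 are taken from the example in the text; the Boolean vectors passed to `euclidean_distance` are hypothetical and do not come from Table 2.

```python
import math


def euclidean_distance(a, b):
    """Eq. (1): Euclidean distance between two Boolean attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(distance):
    """Eq. (2): similarity derived from a distance."""
    return 1.0 / (1.0 + distance)


def predict_rating(ratings, sims):
    """Eq. (3): similarity-weighted average of the known ratings."""
    return sum(r * s for r, s in zip(ratings, sims)) / sum(sims)


# Hypothetical Boolean vectors that differ in one attribute only.
example_distance = euclidean_distance([1, 0, 1, 0], [1, 0, 0, 0])

# Distances from item2 to item1, item3 and item4, as given in the text.
distances = [math.sqrt(7), 1.0, 2.0]
# Rounded to two decimals, as in the worked example.
sims = [round(similarity(d), 2) for d in distances]
# Ratings of user 1 for item1, item3 and item4 used in the example.
ratings = [5, 4, 5]
prediction = predict_rating(ratings, sims)
print(sims, round(prediction, 3))
```

The unrounded prediction is about 4.545, which the text reports as 4.54 before rounding to the nearest integer.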
In the same way, we can obtain the rest of the unrated data in
the sparse user-item matrix. To assess the prediction accuracy
of our methodology, we also compute predictions for the rated data. After filling
in all the elements, the new user-item matrix (m × n) is
depicted in Table 3.
Table 3
The user-item matrix (m × n) after data prediction.

          item1   item2   item3   itemn
user 1    4.48    4.54    5       4.22
user 2    4       3.64    3.64    3.58
user 3    3.5     4       3.41    3.45
user m    3.64    3.35    4       3.47
[Figure: two plots comparing the initial and the predicted ratings R(user, item); vertical axis from 0 to 6.]
cause additional time to be spent on finding the closer
neighbors. It could also increase the space complexity of the clustering
algorithm. On the other hand, in a large-scale P2P network,
the number of active users may affect network congestion.
Therefore, determining the right network scale is very important
for the operation of a P2P collaborative filtering clustering algorithm.
Considering P2P scalability and clustering efficiency, in this
paper the P2P users are classified into different groups
(clusters) with respect to their personality features. In other
words, users within a cluster are similar to each other and
dissimilar to the users belonging to other clusters. Our hybrid
collaborative filtering algorithm based on k-means searches for
neighbors within the cluster of similar users instead of searching the
whole user space. As a result, it not only reduces the algorithm
complexity but also improves the prediction accuracy because it
considers the preferences of users.
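Restricting the neighbor search to one cluster can be sketched as follows. This is a minimal illustration under stated assumptions: `vectors` maps a user to a numeric feature vector, `assignment` maps a user to a cluster label, and both names, like the function `neighbors_in_cluster`, are hypothetical. The Euclidean distance is used here only for the example.

```python
import math


def neighbors_in_cluster(target, vectors, assignment, k):
    """Return the k users closest to `target` among users of the same cluster."""
    cluster = assignment[target]
    # Only users that share the target's cluster are candidates.
    candidates = [
        u for u in vectors if u != target and assignment[u] == cluster
    ]
    candidates.sort(key=lambda u: math.dist(vectors[target], vectors[u]))
    return candidates[:k]


vectors = {"a": (0, 0), "b": (0, 1), "c": (0, 3), "d": (10, 10)}
assignment = {"a": 0, "b": 0, "c": 0, "d": 1}
nearest = neighbors_in_cluster("a", vectors, assignment, k=2)
```

The number of distance computations depends on the size of the cluster rather than on the size of the whole user space, which is the reduction in complexity the paragraph refers to.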
4.1. A quantitative approach for P2P user attributes
Before proposing our new hybrid collaborative filtering algorithm, the personality features of P2P users should be expressed
quantitatively. Generally, when a new customer registers in
a P2P network, we can get the user profile, such as age, gender, career, character and preference, which is usually stored in
a database. Obviously, age is a numerical value and gender is a binary value (for example, 0 means male, 1 means female). Educational background can be divided into elementary school, middle
school, bachelor, master and Ph.D., which can be coded from 1
to 5 respectively. To quantify profession and character,
we can adopt a hierarchical tree.
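The numeric coding of age, gender and educational background described above can be sketched as a small encoding function. The names `EDUCATION_LEVELS`, `GENDER_CODES` and `encode_user` are assumptions for illustration, and the hierarchical tree for profession and character is not covered here.

```python
# Educational background coded from 1 to 5, as described in the text.
EDUCATION_LEVELS = {
    "elementary school": 1,
    "middle school": 2,
    "bachelor": 3,
    "master": 4,
    "ph.d.": 5,
}

# Gender as a binary value: 0 means male, 1 means female.
GENDER_CODES = {"male": 0, "female": 1}


def encode_user(age, gender, education):
    """Turn a user profile into a numeric feature vector."""
    return [age, GENDER_CODES[gender.lower()], EDUCATION_LEVELS[education.lower()]]


vector = encode_user(30, "Female", "Master")
```

In practice the components would probably need to be rescaled before computing distances, since age spans a much wider range than the other two values; the text does not specify a scaling.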
The following is the quantifying stage of profession and character
for P2P users. In his theory of career choice, psychologist John L.
Holland created the Holland Codes to measure an individual's type
and match it with a list of career choices [30]. In this section,
E denotes all the paths between the nodes. The initial state is an
unconnected graph T = (V, {}), which has n nodes and no edges.
Each node forms a connected
vector by itself. If the nodes associated with the minimum-cost edge belong
to the sub-vector of T, then this edge is put into T; otherwise this edge is removed
and the next minimum-cost edge is selected. Proceeding
in the same way, when some connected nodes form a loop, all nodes
associated with this loop are added to the same cluster M and, at
the same time, removed from the set T. The above procedure is repeated
until all nodes are allocated to k clusters. After recalculating
the centroid of each cluster as the new centroid, we then apply the
conventional k-means to finish the operations. The algorithm
stops when all the distances become less than the initialized
threshold.
A motivating example is illustrated in Fig. 7. Suppose there are
6 users in a P2P network. As shown in Fig. 7(a), the two nodes (users
v1 and v3) with the minimum distance are connected by an edge.
The rest proceeds in the same manner until the state in Fig. 7(f),
where all the nodes compose two loops. They are separated
into two clusters, as shown in Fig. 7(g) and (h) respectively. In other
words, elements in the same cluster are similar in some sense. The
average distance of each cluster is treated as the new centroid,
and then the classical k-means algorithm is executed.
Through the similar-user clustering method, customers
with similar attributes or behaviors are gathered into the same
cluster. Determining the relative nearest neighbors becomes more effective and precise when
the clustering algorithm is performed directly within each cluster vector only.
It also conforms to
the real-time requirements of the recommendation system.
Regarding the new-user problem in collaborative
filtering, theoretically speaking, users with similar
information do not differ greatly in their interests.
Therefore, we can recommend the average-scored item from the
same cluster to the new user. Incidentally, this resolves the cold-start
problem effectively.
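One possible reading of the cold-start rule above is to rank items by their average score within the new user's cluster and recommend the top ones. The sketch below follows that reading only; the function name `recommend_for_new_user`, the `top_n` parameter and the nested-dictionary layout of `cluster_ratings` are assumptions, and the paper does not state how ties or sparsely rated items are handled.

```python
def recommend_for_new_user(cluster_ratings, top_n=1):
    """Rank items by their average rating among users of one cluster.

    cluster_ratings maps each user in the cluster to a dict of
    item -> rating; unrated items are simply absent.
    """
    totals, counts = {}, {}
    for ratings in cluster_ratings.values():
        for item, rating in ratings.items():
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    averages = {item: totals[item] / counts[item] for item in totals}
    # Highest average first; ties broken by item name for determinism.
    ranked = sorted(averages, key=lambda item: (-averages[item], item))
    return ranked[:top_n], averages


cluster = {"u1": {"a": 5, "b": 3}, "u2": {"a": 4, "c": 2}}
recommended, averages = recommend_for_new_user(cluster, top_n=2)
```

Because the averages use only ratings from users in the same cluster, no rating history is needed from the new user.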
5. Experimental evaluation
5.1. Test dataset
To test the efficiency of our methodology, in this section we
experiment with a classical movie-rating dataset: the MovieLens [31] dataset. The MovieLens dataset was collected by the
GroupLens group through the MovieLens Web site during the period between September 1997 and April 1998. The basic characteristics of the MovieLens dataset are summarized in
Table 4. The dataset contains three files: Movies.dat, Rating.dat and
User.dat. Movies.dat contains 1682 movies (items), including
detailed information: movie code, name and type (for example Action,
Adventure, Animation, Comedy, Crime, Documentary, Drama, Fantasy, Horror, Romance, etc.). User.dat holds the features of 943 users, such
as user ID, gender, age and profession. Rating.dat contains ratings
by 943 users for 1682 movies (items). Each user has rated at least
20 movies. The rating scale takes values from 1 (lowest rating) to 5
(highest rating). As a result, the sparsity of the MovieLens dataset
is
$$1 - \frac{100000}{943 \times 1682} = 0.936953 \approx 93.7\%.$$

Table 4
The basic characteristics of the test dataset.

Dataset     Number of users   Number of items   Sparsity   Rating scale   Training set   Test set
MovieLens   943               1682              93.7%      1-5            80%            20%
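The sparsity figure can be checked directly from the counts given in the text (100000 ratings, 943 users, 1682 items). The function name `sparsity` is an illustrative assumption.

```python
def sparsity(num_ratings, num_users, num_items):
    """Fraction of cells in the user-item matrix that hold no rating."""
    return 1.0 - num_ratings / (num_users * num_items)


value = sparsity(100_000, 943, 1682)
print(f"{value:.1%}")
```

The result rounds to the 93.7% listed in Table 4.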
5.2. Item attributes impact
In order to evaluate how close the predictions are to the
experimental ratings, we report our results using the mean absolute
error (MAE) evaluation metric. As its name implies, the mean
absolute error is the average of the absolute errors |p_i - q_i|, where p_i
is the predicted value and q_i is the true value. For all test datasets,
we have
$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} |p_i - q_i|}{N} \tag{4}$$
where N denotes the number of tested ratings. MAE expresses
the average absolute deviation of the predictions from
the actual data. Note that a smaller value indicates better
performance.
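Eq. (4) translates directly into code. The sketch below is a minimal version with an assumed function name, `mean_absolute_error`, and made-up sample values that are not taken from the experiments.

```python
def mean_absolute_error(predictions, actuals):
    """Eq. (4): average absolute deviation between predictions and true ratings."""
    if not predictions or len(predictions) != len(actuals):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - q) for p, q in zip(predictions, actuals))
    return total / len(predictions)


# Illustrative values only.
mae = mean_absolute_error([4, 3, 5], [5, 3, 3])
```

Here the absolute errors are 1, 0 and 2, so the MAE is 1.0.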
The prediction quality of CF is
mainly attributed to two factors: one is the sparsity level of the
datasets, and the other is the number of neighbors. Based on these
Table 5
The number of neighbors found by conventional CF algorithm.

Num. of clusters   User ID 16   User ID 121   User ID 317   User ID 608   User ID 912   Avg. (%)
2                  11           11            11            10            11            90
3                  11           10            10            9             10            83.3
4                  10           9             9             8             9             75
5                  10           9             8             8             8             71.7

Table 6
The number of neighbors found by HCF algorithm.

Num. of clusters   User ID 16   User ID 121   User ID 317   User ID 608   User ID 912   Avg. (%)
2                  12           12            11            12            11            96.7
3                  12           11            10            11            11            91.7
4                  11           10            9             9             10            81.7
5                  10           9             8             8             8             71.7
Fig. 8. MAE impact of different CF for different sparsity levels.
[Plot: MAE (0.6 to 0.95) versus sparsity level (0.9, 0.84, 0.8, 0.75, 0.72); series: Item based CF, Traditional CF.]
[Plot: MAE (0.6 to 0.95) versus number of neighbors (10 to 40); series: Traditional CF, Item based CF.]
[Plot: MAE (0.6 to 0.9) versus number of neighbors (5 to 40); series: Traditional CF, K-means based CF, Hybrid CF.]
users with similar purchasing motives. Case studies and experimental results illustrate that our approach is a feasible technique
for recommendation in a P2P network. The prediction of our hybrid mechanism
mainly depends on the user similarity of P2P networks.
In future work, we intend to deal with fraudulent behavior, anonymity and privacy problems under P2P network
conditions.
Acknowledgements
This work has been partially supported by the National
Natural Science Foundation of China (Grant No. 90818002,
60973115 and 60933002), Ph.D. Programs Foundation of Ministry
of Education of China (Grant No. 20070151020), National Basic
Research Program of China (973 Program) under the Grant No.
2006CB701303 and Hi-Tech Research and Development Program
of China (863 Program) under the Grant No. 2007AA12Z151 and
2009AA01A402.
References
[1] Amund Tveit, Peer-to-peer based recommendations for mobile commerce, in:
Proceedings of the 1st International Workshop on Mobile Commerce, 2001,
pp. 26-29.
[2] Fuyong Yuan, Jian Liu, Chunxia Yin, Yulian Zhang, Nan Shen, A novel collaborative filtering mechanism for product recommendation in P2P networks,
in: Third International IEEE Conference on Signal-Image Technologies and
Internet-Based System, 2007, pp. 254-261.
[3] S. Eyheramendy, D. Lewis, D. Madigan, On the naive bayes model for text
categorization, in: Proc. of Artificial Intelligence and Statistics, 2003.
[4] Giancarlo Ruffo, Rossano Schifanella, A peer-to-peer recommender system
based on spontaneous affinities, ACM Transactions on Internet Technology 9
(1) (2009) Article 4.
[5] Keqiu Li, Hong Shen, Francis Y.L. Chin, Si-Qing Zheng, Optimal methods
for coordinated enroute web caching for tree networks, ACM Transactions on
Internet Technology 5 (3) (2005) 480-507.
[6] Jun Wang, Johan Pouwelse, Reginald L. Lagendijk, Marcel J.T. Reinders,
Distributed collaborative filtering for peer-to-peer file sharing systems, in:
Proceedings of the 2006 ACM Symposium on Applied Computing, 2006,
pp. 1026-1030.
[7] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open
architecture for collaborative filtering of netnews, in: Proceedings of ACM
Conference on Computer Supported Cooperative Work, 1994.
[8] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, J. Riedl, Grouplens:
applying collaborative filtering to usenet news, Communications of the ACM
40 (3) (1997) 77-87.
[9] B. Smyth, P. Cotter, Personalized electronic programme guides, Artificial
Intelligence Magazine 21 (2) (2001).
[10] Greg Linden, Brent Smith, Jeremy York, Amazon.com recommendations: item-to-item collaborative, in: IEEE Internet Computing, vol. 7, IEEE Computer
Society, 2003, pp. 76-80.
[11] Keqiu Li, Hong Shen, Francis Y.L. Chin, Weishi Zhang, Multimedia object
placement for transparent data replication, IEEE Transactions on Parallel and
Distributed System 18 (2) (2007) 212-224.
[12] Hao Ma, Irwin King, Michael R. Lyu, Effective missing data prediction for
collaborative filtering, in: Proceedings of the 30th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
2007, pp. 39-46.
[13] Derry O'Sullivan, David Wilson, Barry Smyth, Preserving recommender
accuracy and diversity in sparse datasets, in: FLAIRS Conference 2003,
pp. 139-143.
[14] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item
collaborative filtering, IEEE Internet Computing (January) (2003).
[15] Manos Papagelis, Dimitris Plexousakis, Themistoklis Kutsuras, Alleviating the
sparsity problem of collaborative filtering using trust inferences, in: iTrust
International Conference 2005, in: LNCS, vol. 3477, 2005, pp. 224-239.
[16] Arnaud De Bruyn, C. Lee Giles, David M. Pennock, Offering collaborative-like recommendations when data is sparse: the case of attraction-weighted
information filtering, in: International Conference on Adaptive Hypermedia
and Adaptive Web-based Systems, in: Lecture Notes in Computer Science,
vol. 3137, 2004, pp. 393-396.
[17] Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, Steve Lawrence, Probabilistic models for unified collaborative and content-based recommendation
in sparse-data environments, in: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001, pp. 437-444.
[18] Jun Wang, Arjen P. de Vries, Marcel J.T. Reinders, Unified relevance models for
rating prediction in collaborative filtering, ACM Transactions on Information
Systems 26 (3) (2008) 1-42. Article 16.
[19] K. Goldberg, T. Roeder, D. Gupta, C. Perkins, Eigentaste: a constant time
collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133-151.
[20] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing
by latent semantic analysis, Journal of the American Society for Information
Science 41 (6) (1990).
[21] T. Hofmann, Collaborative filtering via Gaussian probabilistic latent semantic
analysis, in: Proc. of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2003.
[22] M. Balabanovic, Y. Shoham, Fab: content-based, collaborative recommendation, Communications of the ACM 40 (1997) 66-72.
[23] C.-N. Ziegler, G. Lausen, L. Schmidt-Thieme, Taxonomy-driven computation of
product recommendations, in: Proceedings of the Thirteenth ACM Conference
on Information and Knowledge Management, 2004.
[24] J.B. MacQueen, Some methods for classification and analysis of multivariate
observations, in: Proceedings of 5-th Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1, University of California Press, Berkeley,
pp. 281-297.
[25] http://en.wikipedia.org/wiki/Clusteranalysis.
[26] Rong Hu, Yansheng Lu, A hybrid user and item-based collaborative filtering
with smoothing on sparse data, in: 16th International Conference on Artificial
Reality and Telexistence, 2006, pp. 184-189, doi:10.1109/ICAT.2006.12.
[27] Marc Sánchez-Artigas, Pedro García-López, eSciGrid: A P2P-based e-science
Grid for scalable and efficient data sharing, Future Generation Computer
Systems 26 (5) (2010) 704-719.
[28] J. Wang, A.P. De Vries, M.J.T. Reinders, A user-item relevance model for
log-based collaborative filtering, in: Proceedings of the European Conference
on IR Research, Springer, London, 2006, pp. 37-48.
[29] J. Wang, A.P. De Vries, M.J.T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in: Proceedings
of the 29th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, ACM Press, New York, 2006,
pp. 501-508.
[30] http://www.absoluteastronomy.com/topics/Holland_Codes.
[31] Grouplens, EachMovil, datadet, MovieLens, 2003. http://www.grouplens.org/.