and that each action is independent of other actions. Naturally, the stream is a single coherent listening experience for the user. The nature of this experience leads us to expect that there is actually some correlation between those actions; for example, if many of the videos are from the same artist, the sequential ordering of the videos might affect the actions. However, we point out that with curation-based recommendation, the underlying curation exerts a strong force on the resulting recommendations. In particular, it serves to control redundancy, and there is only a small probability that the videos in the stream originate from the same artist.
4.3 From user feedback to user state
Although the types of feedback actions gathered from XITE users are limited, this information does provide important hints as to the state of the user, i.e., the level of engagement with the music video stream. The feedback types in our use case can be understood as involving both explicit and implicit feedback. When the user clicks ‘like’, it can be considered explicit feedback, while a skip and a complete play of a music video can be considered implicit feedback. We argue that understanding the actual utility of the feedback is more complex than understanding the effort of the user relating to the individual feedback types. A complete play, for example, might carry some level of uncertainty, especially in a session with long runs of continuously played videos without any user interruption, which can be considered a passive session [4]. As mentioned in Section 3.1, we cannot be sure that all those videos carry the same level of positive preference, because the user is perhaps not paying attention to all of them. Would it not hurt the performance of our recommender system if we treated all of these videos as expressing the same level of preference? If we suspect it would, the next question is how to model such uncertainty in the complete-play signals.
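As a concrete illustration of how this feedback could feed into training, the sketch below maps feedback events to preference weights. The event names and weight values are our own hypothetical choices, not the actual XITE logging schema: ‘like’ is treated as strong explicit positive feedback, a skip as negative, and a complete play as positive but discounted to reflect its uncertainty.

# Hypothetical mapping from feedback events to preference weights.
# Event names and weights are illustrative assumptions, not the actual
# XITE schema; complete plays are discounted to reflect their uncertainty.
FEEDBACK_WEIGHTS = {
    "like": 1.0,           # explicit positive feedback
    "complete_play": 0.5,  # implicit, positive but uncertain
    "skip": 0.0,           # implicit negative feedback
}

def preference_weight(event: str) -> float:
    """Return the preference weight assigned to a single feedback event."""
    return FEEDBACK_WEIGHTS.get(event, 0.0)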
To the best of our knowledge, there is little research on modeling user attention during long passive sessions with the goal of inferring different levels of implicit positive preference for individual items. We argue that such understanding is important and therefore cannot simply be neglected when developing recommender systems for continuous streams of multimedia content. In this work, we introduce a rudimentary method for taking user state into account. Specifically, in Section 6.3, we propose an extended version of our evaluation metrics which imposes a threshold on the length of the ‘streaks’ that are taken into account by our evaluation metric.
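To make the idea of a streak threshold concrete, the following sketch shows one way such a cut-off could be applied when deriving relevance labels from complete-play events. It is a minimal illustration under our own assumptions (the threshold value and helper names are hypothetical; the actual metric extension is only introduced in Section 6.3): consecutive complete plays are grouped into streaks, and plays inside a streak longer than the threshold are treated as uncertain rather than as positive preference.

# Minimal sketch of a streak threshold on complete-play events.
# The threshold value and function name are illustrative assumptions.
from itertools import groupby
from typing import List

def cap_passive_streaks(events: List[str], max_streak: int = 5) -> List[int]:
    """Map a session's feedback events to binary relevance labels,
    discounting complete plays inside overly long passive streaks."""
    labels = []
    for event, group in groupby(events):
        run = list(group)
        if event == "complete_play" and len(run) > max_streak:
            # A long uninterrupted run of complete plays suggests a passive
            # session, so these plays are not counted as positive preference.
            labels.extend([0] * len(run))
        else:
            labels.extend([1 if event in ("like", "complete_play") else 0] * len(run))
    return labels

# Example: a streak of eight complete plays exceeds the threshold
# and is therefore not counted as positive preference.
session = ["like"] + ["complete_play"] * 8 + ["skip"]
print(cap_passive_streaks(session))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]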
5 OFFLINE EVALUATION FRAMEWORK

Our goal is an offline evaluation framework that allows us to evaluate the same recommender algorithms that are used online at XITE. The framework consists of a definition of the recommendation task for the offline data, a procedure for splitting the data into test and training sets, and, finally, the evaluation pipeline. Each of these is covered in this section.

5.1 Offline curation-based recommendation

Recall that the online XITE recommender is a curation-based recommender. For a given channel, the recommender generates a personalized stream for a user by selecting items from a pre-selected item set. This item set has been chosen by human curators as being suited for the channel.

For our offline evaluation framework, we do not have access to information about which items were available in the pre-selected item set during the specific user sessions. For this reason, we need to make some well-chosen assumptions in order to create an offline evaluation framework that emulates online recommendation. The first assumption is supported by the technical specifications of our dataset (i.e., how it was generated). These specifications allow us to assume that user sessions were either not personalized, lightly personalized, or personalized in a not-yet-optimal way. From this point, we make a second assumption, namely, that the videos that we observe in a user session are reflective of the underlying videos in the pre-selected item set from which the recommender is allowed to choose.

We contrast our approach with one of the most popular protocols for offline evaluation of recommender systems: one-plus-random [2]. In this approach, for each user in the test set, one rating (or item) is selected as the held-out item. Then, randomly chosen unrated items (1,000 are used in [2]) are also picked, and these items, together with the held-out item, are ranked based on the scores produced by the algorithm. Based on this order, various types of ranking metrics can be measured. In contrast, in our use case, we argue that randomly picking from the universe of unobserved items is not fair, since there is a known control on which items are presented: the items presented are drawn from the curated item set for the channel. A more realistic assumption is that the candidate items are drawn from the items in the session. In order to implement this assumption, we calculate our evaluation score at the session level, and not at the item level. Moreover, as we will see, a session-level score allows us to capture information about streaks of consecutive relevant items. The importance of such streaks derives from our user requirements.
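For reference, the one-plus-random protocol described above can be summarized in a few lines. The sketch below is our own illustration (the model interface and parameter values are hypothetical, not the implementation of [2]): it reports whether the held-out item lands in the top N of the ranked candidates, which is the per-item contribution to recall@N.

# Illustrative sketch of the one-plus-random protocol; the score()
# interface and default parameters are assumptions for this example.
import random

def one_plus_random_hit(model, user, held_out_item, unrated_items,
                        n_random=1000, top_n=10):
    """Return True if the held-out item ranks in the top-n among
    n_random randomly sampled unrated items (a 'hit')."""
    candidates = random.sample(unrated_items, n_random) + [held_out_item]
    ranked = sorted(candidates, key=lambda item: model.score(user, item),
                    reverse=True)
    return held_out_item in ranked[:top_n]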
Based on these considerations, we arrive at the following procedure for evaluating an online curation-based recommender in the offline scenario. We evaluate the performance of the recommender algorithm for each user session in the test set individually. The input to the recommender is the unordered set of all videos occurring in that session. The recommender treats this set as the pre-selected item set.
(3) For each fold, the curation-based recommender system is trained on the training set. This model is then applied to each video in the test set. It produces a score that is specific to the user and to the video.
(4) Based on the predicted scores of the music videos, a new predicted session is produced. High-scoring videos occur at the beginning of the session. Effectively, session re-ordering is a re-ranking process.
(5) Using the ground truth of binary relevance values for each music video in the test set, the evaluation score is calculated for each session. The scores can then be averaged over all sessions in the test set, and/or averaged over all folds.
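To make steps (3) through (5) concrete, the following sketch outlines the per-session evaluation loop for a single fold. It is a minimal illustration under our own assumptions: the recommender is assumed to expose a score(user, video) method, and session_metric is a placeholder for the session-level metrics introduced in the next section.

# Sketch of the session-level evaluation loop for one fold.
# The model interface and session_metric are assumptions for illustration.
from statistics import mean

def evaluate_fold(model, test_sessions, session_metric):
    """Session-level evaluation of a trained recommender for one fold.

    test_sessions: list of (user, videos, relevance) tuples, where `videos`
    is the unordered set of videos observed in the session and `relevance`
    maps each video to its binary ground-truth label.
    """
    session_scores = []
    for user, videos, relevance in test_sessions:
        # Step (3): score every video in the session for this user.
        scored = {video: model.score(user, video) for video in videos}
        # Step (4): re-rank the session, highest-scoring videos first.
        predicted_session = sorted(scored, key=scored.get, reverse=True)
        # Step (5): compute the session-level score from binary relevance.
        labels = [relevance[video] for video in predicted_session]
        session_scores.append(session_metric(labels))
    # Average over all sessions in the fold; fold averages can in turn
    # be averaged over all folds.
    return mean(session_scores)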
In the next section, we go on to introduce our metrics for evalu-
ating sessions.
REFERENCES
[1] Riccardo Bambini, Paolo Cremonesi, and Roberto Turrin. 2011. A Recommender
System for an IPTV Service Provider: a Real Large-Scale Production Environment.
In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira,
and Paul B. Kantor (Eds.). Springer US, Boston, MA, 299–331.
[2] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Rec-
ommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the 4th
ACM Conference on Recommender Systems (RecSys ’10). ACM, New York, NY, USA,
39–46.
[3] Christopher C Johnson. 2014. Logistic Matrix Factorization for Implicit Feedback
Data. In Proceedings of NIPS 2014 Workshop on Distributed Machine Learning and
Matrix Computations. 1–9.
[4] Suzana Kordumova, Ivana Kostadinovska, Mauro Barbieri, Verus Pronk, and Jan
Korst. 2010. Personalized Implicit Learning in a Music Recommender System. In
User Modeling, Adaptation, and Personalization, Paul De Bra, Alfred Kobsa, and
David Chin (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 351–362.
[5] Martha Larson, Alessandro Zito, Babak Loni, and Paolo Cremonesi. 2017. Towards
Minimal Necessary Data: The Case for Analyzing Training Data Requirements
of Recommender Algorithms. In Proceedings of FATREC Workshop on Responsible
Recommendation.
[6] Babak Loni, Roberto Pagano, Martha Larson, and Alan Hanjalic. 2016. Bayesian
Personalized Ranking with Multi-Channel User Feedback. In Proceedings of the
10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY,
USA, 361–364.
[7] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi.
2018. Current Challenges and Visions in Music Recommender Systems Research.
International Journal of Multimedia Information Retrieval 7, 2 (Jun 2018), 95–116.
[8] Guy Shani and Asela Gunawardana. 2011. Evaluating Recommendation Systems.
In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira,
and Paul B. Kantor (Eds.). Springer US, Boston, MA, 257–297.