
A Data Generator for Multi-Stream Data

Zaigham Faraz Siddiqui, Myra Spiliopoulou, Panagiotis Symeonidis, and Eleftherios Tiakas
University of Magdeburg; University of Thessaloniki. [siddiqui,myra]@iti.cs.uni-magdeburg.de; [symeon,tiakas]@csd.auth.gr

Abstract. We present a stream data generator. The generator is mainly intended for multiple interrelated streams, in particular for objects with temporal properties that are fed by dependent streams. Such data are, e.g., customers with their transactions, web users with their clicks, patients with their treatments, students with their class enrolments and exams. However, it can also be used for conventional stream data. The generator is appropriate for testing classification and clustering algorithms on concept discovery and adaptation to concept drift. The number of concepts in the data can be specified as a parameter to the generator; the same holds for the membership of an instance to a class. Hence, it is also appropriate for synthetic datasets on overlapping classes or clusters.

1 Introduction

Most of the data stored in databases, archives and resource repositories are not static collections: they accumulate over time, and sometimes they cannot (or should not) even be stored permanently - they are observed and then forgotten. Many incremental learners and stream mining algorithms have been proposed in recent years, accompanied by methods for evaluating them [2, 1]. However, modern applications ask for more sophisticated stream learners than can currently be evaluated on synthetically generated data. In this work, we propose a generator for complex stream data that adhere to multiple concepts and exhibit drift. The generator can be used for the evaluation of (multi-class) stream classifiers, stream clustering algorithms over high-dimensional data and relational learners on streams. Our generator is inspired by the domain of recommendation engines, where data are essentially a combination of static objects and adjoint streams: people rate items - the ratings constitute a conventional stream; new items show up, while old items are removed from the provider's portfolio - the items constitute a slow stream; new users show up, while old users re-appear and rate items again, possibly exhibiting different preferences than before - users constitute a slow stream. In [3] we devised the term perennial objects for a stream of objects that appear more than once and may change properties. In conventional relational learning, these objects would be static and have one fixed label, while in relational stream learning they may have different labels at different times. In [3] we proposed a decision tree stream classifier for a stream of perennial objects.

The core idea of our generator is as follows. The preference of a user towards some item depends on the attitude of the user towards the item's attributes. The attitude is the user's profile; the attributes of the item constitute the item's profile. Multiple users adhere to one user profile, and multiple items are characterized by one item profile. The learning task is then to predict the profile to which a user adheres, i.e., the label of the user, given the ratings this user assigns to items. Noise can be imputed to the data by forcing a user to rate in discordance with her profile with some probability. Drift is imputed to the data by allowing a class to exist only for some timepoints and then forcing it to mutate into one or more classes. In relational stream learning, the generator is intended for the task of learning the valid classes at each moment and for the task of adapting to new classes. For multi-class stream learning, the matrix/stream of the user-item ratings must be extended by the attributes constituting each item's profile. Adding the attributes constituting each user's profile (without the label) makes the learning task easier. The same matrix extension is needed for high-dimensional clustering, where the learning task is to discover the user clusters and trace their evolution. The paper is organized as follows. We explain the multi-relational generator in Section 2. We summarise and discuss future improvements in Section 3. At the end, we summarise the generator of [5], upon which we build.
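To make the rating rule and the noise mechanism concrete, the following minimal Python sketch illustrates one possible reading of this core idea. It is not taken from the paper: the function name rate, the 0..100 rating range and the way discordance is modelled are our own assumptions, anticipating the affinity probabilities and rating statistics introduced in Section 2.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def rate(user_profile, item_profile, affinity, rating_stats, p_noise=0.05):
        # affinity[U] is the probability vector of user profile U over item profiles
        # (Fig. 2, row 1); rating_stats[U][I] is the (mean, variance) of the rating
        # that a user of profile U gives to an item of profile I (Fig. 2, row 2).
        # With probability p_noise the user rates in discordance with her profile,
        # modelled here (assumption) by borrowing the statistics of a random item profile.
        if rng.random() < p_noise:
            item_profile = rng.integers(len(affinity[user_profile]))
        mean, var = rating_stats[user_profile][item_profile]
        # assumed rating range 0..100, matching the magnitudes in Fig. 2
        return float(np.clip(rng.normal(mean, np.sqrt(var)), 0, 100))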

2 Generating Profiles and Transactions with Concept Drift

Our generator is inspired by the idea of predicting user ratings in a recommendation engine, and builds upon the generator of [5], which is summarized in the Appendix. In particular, our generator creates data according to the following scenario: each user adheres to a user profile, while each item adheres to an item profile; the profiles correspond to classes.
Table 1. Parameters of the generator.

  Param   Description
  N_i     number of item profiles
  N_u     number of user profiles
  n_i     number of items per item profile
  n_u     number of users per user profile
  v_i     number of synthetic variables that describe an item profile
  v_u     number of synthetic variables that describe a user profile
  d       number of drift levels across the time axis
  L       max lifetime of a drift level as number of timepoints
  R       max number of items rated by a user at any timepoint
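As an illustration only, the parameters of Table 1 could be bundled into a small configuration object. The following Python sketch uses hypothetical names and default values, including the two global parameters UP2IP and UP2UP introduced later in this section.

    from dataclasses import dataclass

    @dataclass
    class GeneratorConfig:
        # hypothetical container for the parameters of Table 1
        N_i: int = 4        # number of item profiles
        N_u: int = 3        # number of user profiles
        n_i: int = 100      # items generated per item profile
        n_u: int = 200      # users generated per user profile
        v_i: int = 4        # synthetic variables per item profile
        v_u: int = 3        # synthetic variables per user profile
        d: int = 5          # drift levels across the time axis
        L: int = 20         # max lifetime of a drift level, in timepoints
        R: int = 10         # max items rated by a user at any timepoint
        UP2IP: float = 0.1  # affinity concentration (see text)
        UP2UP: float = 0.1  # transition concentration (see text)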

             IP1      IP2      IP3      IP4
  Var 1   23 ± 4    2 ± 9   72 ± 7    3 ± 4
  Var 2   52 ± 9    1 ± 7   71 ± 3    2 ± 4
  Var 3   97 ± 2   46 ± 3   26 ± 2   27 ± 5
  Var 4    8 ± 9   91 ± 6   52 ± 1   25 ± 3

Fig. 1. Item profiles with mean and variance for each synthetic variable, where v_i = 4.

The rating of a user u for an item i depends on how close the profile of u is to the profile of i; ratings are generated at each timepoint t. At certain timepoints, user profiles mutate, implying that the ratings of the users for the items change. The learning objective is to predict each user's profile at each timepoint, given the user's ratings. In more detail, our generator takes as input the parameters depicted in Table 1 and described in the sequel. It generates: item profiles and from them items; user profiles and from them users; and ratings of users for items at each timepoint. A user profile may live at most L timepoints before it mutates.

Generation of item profiles and items. Item profiles are described by v_i synthetic variables. The generator creates N_i item profiles and stores for each one the mean and variance of each of the v_i variables. Next, each of these item profiles is used as a prototype for the generation of n_i items, producing n_i × N_i items in total. Items also adhere to the v_i variables; the value of each variable in an item adhering to profile I is determined by the mean and variance of this variable in the profile I. The description of the item profiles is depicted in Figure 1. The number of items considered (rated) at some timepoint may vary from one timepoint to the next, but there is no bias towards items of some specific profile(s). Hence, item profiles do not exhibit concept drift.

Generation and transition of user profiles. User profiles are described by v_u synthetic variables. The generator creates N_u user profiles and stores for each one the mean and variance of each of the v_u variables. User profiles serve as templates for the generation of users, in much the same way as item profiles are used to generate items. However, there are two main differences. First, user profiles are subject to transition, and not all of them are active at each drift level 1, . . . , d. Second, a user profile exhibits affinity towards some item profiles, expressed through probabilities between the user profile and the item profiles. The description of the user profiles is depicted in Figure 2. The affinity of user profiles towards item profiles manifests itself in the user ratings: a generated user adheres to some user profile and rates items belonging to the item profile(s) preferred by her user profile. The affinity of a user profile U towards an item profile I is defined as the probability p(U, I) that a user of profile U selects an item from item profile I for rating. These probabilities are governed by a global parameter UP2IP ∈ [0, 1]: if UP2IP is close to zero, each user profile shows strong affinity towards a certain item profile, while if the value is closer to 1, the probabilities are initialised randomly.
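The item-generation step could be sketched as follows. This is a sketch in Python with numpy; the function names and the use of a normal distribution parameterised by the stored mean and variance are assumptions, since the paper only states that item variables follow the per-profile mean and variance.

    import numpy as np

    rng = np.random.default_rng(seed=7)

    def make_item_profiles(N_i, v_i):
        # one mean and one variance per synthetic variable and item profile (cf. Fig. 1);
        # the value ranges 0..100 and 0..10 are purely illustrative
        means = rng.uniform(0, 100, size=(N_i, v_i))
        variances = rng.uniform(0, 10, size=(N_i, v_i))
        return means, variances

    def make_items(means, variances, n_i):
        # n_i items per profile, i.e. n_i * N_i items in total; each item's variables
        # are drawn from the mean/variance of its profile (normal distribution assumed)
        N_i, v_i = means.shape
        items, item_profile_of = [], []
        for I in range(N_i):
            items.append(rng.normal(means[I], np.sqrt(variances[I]), size=(n_i, v_i)))
            item_profile_of.extend([I] * n_i)
        return np.vstack(items), np.array(item_profile_of)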
          Var 1    Var 2    Var 3  |           IP1      IP2      IP3      IP4
  UP1    13 ± 0   22 ± 5   51 ± 1  |  prob.    0.6      0.25     0.05     0.1
                                   |  rating 20 ± 5   50 ± 8   80 ± 9   29 ± 1
  UP2    34 ± 4   55 ± 0   68 ± 9  |  prob.    0.1      0.1      0.1      0.7
                                   |  rating 90 ± 2   25 ± 5   93 ± 4    9 ± 1
  UP3    21 ± 5   98 ± 4    1 ± 5  |  prob.    0.5      0.1      0.2      0.2
                                   |  rating 42 ± 2   95 ± 2   10 ± 0   12 ± 5

Fig. 2. User profiles with mean and variance for each synthetic variable, together with the probability of selecting an item from a certain item profile (row 1) and the mean and variance of the rating for that item (row 2), where v_u = 3.
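One possible way to realise the role of UP2IP when initialising the affinity probabilities p(U, I) is sketched below. The interpolation between a one-hot vector and a random probability vector is an assumption, chosen so that values near 0 yield strong affinity to a single item profile and values near 1 yield random probabilities.

    import numpy as np

    rng = np.random.default_rng(seed=11)

    def make_affinities(N_u, N_i, UP2IP):
        # one probability vector over the item profiles per user profile (Fig. 2, row 1):
        # UP2IP close to 0 concentrates the mass on one preferred item profile,
        # UP2IP close to 1 yields essentially random probabilities
        affinities = np.empty((N_u, N_i))
        for U in range(N_u):
            concentrated = np.zeros(N_i)
            concentrated[rng.integers(N_i)] = 1.0      # one preferred item profile
            random_part = rng.dirichlet(np.ones(N_i))  # random probability vector
            p = (1 - UP2IP) * concentrated + UP2IP * random_part
            affinities[U] = p / p.sum()
        return affinities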

At each drift level 1, . . . , d, only a subset of the N_u user profiles is active; the number of active profiles is chosen randomly. For each drift level after the first, the generator maps the profiles of the previous level to the new profiles of the current level based on similarity, i.e. the transition probability from an old to a new profile is a function of the similarity between the two profiles. The result is a profile transition graph, an example of which is depicted in Figure 3. This graph is generated and then the thread of each profile is recorded for inspection (cf. Figure 3). The coupling of profile transition to the profile similarity function ensures that profile mutation corresponds to a drift rather than a shift. Profile transitions are further governed by a global variable UP2UP ∈ [0, 1] that determines the strength of preference of an old user profile for a new user profile. A value close to zero means that the most similar new profile will always be preferred. Larger values allow for a weaker preferential attachment, while a value close to 1 means that the new profile is chosen randomly, and the transition is essentially a concept shift rather than a drift. The similarity between two user profiles U and U' is defined in Equation 1:

    sim(U, U') = Σ_I p(U, I) · p(U', I)        (1)

where p(U, I) is the probability that a user of profile U rates an item from item profile I. Affinity is also affected by profile transitions.
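A sketch of how Equation 1 and UP2UP could translate into transition probabilities between the profiles of two consecutive drift levels is given below. The interpolation between the most-similar choice and a uniform choice is an assumption, chosen to match the described behaviour at the two extremes.

    import numpy as np

    def profile_similarity(p_U, p_Uprime):
        # Equation 1: similarity of two user profiles via their affinity vectors p(., I)
        return float(np.dot(p_U, p_Uprime))

    def transition_probabilities(old_affinity, new_affinities, UP2UP):
        # probability of the old profile mutating into each profile of the next drift
        # level: UP2UP -> 0 (almost) always picks the most similar new profile (drift),
        # UP2UP -> 1 picks a new profile uniformly at random (shift)
        sims = np.array([profile_similarity(old_affinity, p) for p in new_affinities])
        greedy = np.zeros_like(sims)
        greedy[np.argmax(sims)] = 1.0
        uniform = np.ones_like(sims) / len(sims)
        return (1 - UP2UP) * greedy + UP2UP * uniform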

Fig. 3. A profile transition graph; each column corresponds to a timepoint, indicating that the number of profiles/classes may change from one timepoint to the next.

Once a user profile U mutates to U', all its users adhere to the profile U': they prefer the item profiles towards which U' shows affinity, and rate items adhering to these item profiles.

Generation of users and ratings. For each of the N_u user profiles, the generator creates n_u users. As with items, users are described by the same v_u synthetic variables as the user profiles; the value of each variable in a user adhering to profile U is determined by the mean and variance of this variable in the profile U. The profiles of each drift level exist for at most L timepoints before a profile transition occurs; the lifetime of a profile is chosen randomly. At each of these timepoints, the generator creates ratings for all users of each active profile. For each user u adhering to a user profile U, an item profile I is selected with probability p(U, I). An item i is then chosen at random from I, and a rating value is generated from the mean and variance that U stores for ratings of items from I (cf. Figure 2). A user can rate at most R items per timepoint.
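Putting the pieces together, the rating generation for one user at one timepoint could look as follows (hypothetical names; the exact number of rated items and the rating distribution are assumptions consistent with Figure 2 and the bound R).

    import numpy as np

    rng = np.random.default_rng(seed=23)

    def ratings_for_user(u, U, affinity, rating_mean, rating_var, items_by_profile, R):
        # ratings of user u (adhering to user profile U) at one timepoint: pick at most R
        # items; for each, choose an item profile I with probability p(U, I), then a random
        # item of I, then a rating from the (mean, variance) that U stores for I (Fig. 2)
        ratings = []
        for _ in range(rng.integers(1, R + 1)):
            I = rng.choice(len(affinity[U]), p=affinity[U])
            i = rng.choice(items_by_profile[I])
            value = rng.normal(rating_mean[U][I], np.sqrt(rating_var[U][I]))
            ratings.append((u, i, float(value)))
        return ratings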

3 Conclusions

We presented a multi-stream generator that is inspired by the domain of recommender systems. It generates rating data for users according to user profiles. Over time, the profiles mutate into new ones. The mutation can be adjusted to simulate drastic shifts as well as more gradual drifts. The generator can be used for evaluating supervised and unsupervised learning tasks on the discovery of and adaptation to concept drift. Preliminary results with the data generator can be found in our earlier work on classification rule mining from perennial streams [4].

Appendix A

The generator of [5] is intended for the evaluation of recommendation engines. Let U be the set of users and I be the set of items. The generator first builds a rating matrix R ∈ [0, 1]^(|U|×|I|): one row corresponds to the ratings of one user for the items (the columns of the matrix). From matrix R, the generator derives a user-user similarity matrix S using cosine similarity, and from it a friendship matrix F ∈ {0, 1}^(|U|×|U|). The core idea of deriving F is that two users are the more likely to be friends, the more similar they are to each other. Hence, for user i and user j, the generator sets F_ij using a function that takes as input the value S_ij but allows for some randomness. This randomness implies that F is asymmetric, i.e. friendship is an asymmetric relation. The matrices R, S can be used to evaluate the performance of a conventional recommender, while the matrices R, S, F serve the evaluation of recommenders that take social relations into account.
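The derivation of S and F from R can be sketched as follows; the concrete thresholding rule for F is an assumption, the only requirements from [5] being that the probability of friendship grows with the similarity and that some randomness is involved.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    def similarity_matrix(R):
        # user-user cosine similarity derived from the rating matrix R (|U| x |I|)
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        norms[norms == 0] = 1.0                  # guard against users without ratings
        X = R / norms
        return X @ X.T

    def friendship_matrix(S):
        # F_ij = 1 with a probability that grows with S_ij; the independent random draw
        # per ordered pair (i, j) makes F asymmetric, i.e. friendship is not reciprocal
        F = (rng.random(S.shape) < S).astype(int)
        np.fill_diagonal(F, 0)                   # no self-friendship
        return F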

References
1. A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 139-148, New York, NY, USA, 2009. ACM.
2. J. Gama, R. Sebastião, and P. P. Rodrigues. Issues in evaluation of stream learning algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 329-338, New York, NY, USA, 2009. ACM.
3. Z. F. Siddiqui and M. Spiliopoulou. Tree induction over perennial objects. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, SSDBM '10, pages 640-657. Springer-Verlag, 2010.
4. Z. F. Siddiqui and M. Spiliopoulou. Classification rule mining for a stream of perennial objects. In Proceedings of the 5th International Symposium on Rules: Research Based and Industry Focused, RuleML '11. Springer-Verlag, 2011.
5. P. Symeonidis, E. Tiakas, and Y. Manolopoulos. Transitive node similarity for link prediction in social networks with positive and negative links. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 183-190, New York, NY, USA, 2010. ACM.
