STEREO

Baseline scenario

The amount of user-generated data on today's Internet, and the rate at which such data is created, pose a challenge to state-of-the-art recommender system solutions. For instance, the Twitter micro-blogging service has surpassed 200 million active users, generating more than 500 million tweets (micro-blog posts) per day, at rates that recently (August 2013) peaked at 143,199 tweets per second [1]. The main challenge is to satisfy the dynamic, short-term information needs of these users in a timely manner.
Collaborative filtering (CF) is among the most widely adopted approaches to building recommender systems. The essence of CF lies in analyzing the known preferences of a group of users to make predictions about the unknown preferences of other users.
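
To make this concrete, the following is a minimal, purely illustrative sketch of user-based CF in Python: the users, items and ratings are hypothetical toy data, and the prediction is a similarity-weighted average of the ratings given by similar users.

```python
import math

# Toy user-item rating data (hypothetical values, for illustration only).
ratings = {
    "alice": {"item1": 5.0, "item2": 3.0, "item3": 4.0},
    "bob":   {"item1": 4.0, "item2": 3.0, "item4": 5.0},
    "carol": {"item2": 2.0, "item3": 5.0, "item4": 4.0},
}

def cosine_similarity(u, v):
    """Cosine similarity between two users' rating vectors."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in ratings[u].values()))
    norm_v = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    neighbors = [(cosine_similarity(user, v), ratings[v][item])
                 for v in ratings if v != user and item in ratings[v]]
    sim_sum = sum(s for s, _ in neighbors)
    if sim_sum == 0:
        return None
    return sum(s * r for s, r in neighbors) / sim_sum

print(predict("alice", "item4"))  # predicted from bob's and carol's ratings
```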

Current state-of-the-art shortcomings

Research efforts spent on this topic over the last ten years have yielded several solutions [state-of-the-art] that, as of today, provide accurate rating predictions, but may incur prohibitive computational costs and large time-to-prediction intervals when applied to large data sets.
Moreover, current CF algorithms are mostly batch-based: they take huge amounts of historical data as input and produce a single output. The batch-based model is therefore agnostic to the incremental nature of the data production process that lies behind these applications.
The sheer amount of data to be analyzed and the growing timeliness requirements of modern online applications are pushing this batch-based approach to its limits. Quickly analyzing data to provide timely results requires large computational infrastructures that often repeat the same calculations on very similar data multiple times.
This problem calls for the design of new solutions able to work incrementally on streams of incoming data, continuously providing up-to-date results. In the last couple of years, some proposals that try to attack this problem have appeared in the state of the art [state-of-the-art]. However, as of today, none has shown the level of maturity of well-established batch-based solutions. Moreover, these solutions still build upon the batch-based computational model, adapting it to work incrementally on relatively small data bursts.

Practical impact

We advocate a transition of collaborative filtering from the well-established batch-based model to a more modern, dynamic and flexible stream-based model.
A data stream is a continuous and potentially unbounded flow of data. Algorithms that process data streams are constantly fed with new information, and they have to avoid the cost of retraining a model every time new data points arrive. Data streams differ from the conventional stored-relation model in several ways [2]: (i) the data elements in the stream arrive online; (ii) the system has no control over the order in which data elements arrive to be processed; (iii) data streams are potentially unbounded in size; (iv) once an element from a data stream has been processed, it is discarded or archived. The data stream model therefore poses new challenges in storage, computation and data analysis. An algorithm that learns from data streams should respect the following properties [3] (a minimal sketch of such a learner follows the list):

  1. it must require small, constant time for each data element;
  2. its main memory requirements must be bounded and independent of the number of data elements;
  3. only a single pass over the data should be necessary to build the model;
  4. the model must be available and up-to-date at any time, not only when it finishes processing the data, since data processing may never cease;
  5. it should produce a model equivalent to the one that would be obtained by the corresponding batch algorithm.
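
As an illustration of these properties, the following Python sketch implements a deliberately simple anytime predictor: a running global mean plus per-user and per-item offsets, a common CF baseline. The class name and interface are our own assumptions, not an API from the literature.

```python
from collections import defaultdict

class StreamingMeanModel:
    """Anytime rating predictor updated in O(1) per observation.

    A minimal sketch: memory grows with the number of distinct users
    and items, not with the number of data elements seen, and the model
    is queryable at any point in the stream.
    """

    def __init__(self):
        self.n = 0
        self.global_mean = 0.0
        self.user_sum = defaultdict(float)
        self.user_cnt = defaultdict(int)
        self.item_sum = defaultdict(float)
        self.item_cnt = defaultdict(int)

    def update(self, user, item, rating):
        # Property 1: constant work per incoming data element.
        self.n += 1
        self.global_mean += (rating - self.global_mean) / self.n
        self.user_sum[user] += rating
        self.user_cnt[user] += 1
        self.item_sum[item] += rating
        self.item_cnt[item] += 1

    def predict(self, user, item):
        # Property 4: the model is usable at any time during processing.
        if self.n == 0:
            return 0.0
        u_dev = (self.user_sum[user] / self.user_cnt[user] - self.global_mean
                 if self.user_cnt[user] else 0.0)
        i_dev = (self.item_sum[item] / self.item_cnt[item] - self.global_mean
                 if self.item_cnt[item] else 0.0)
        return self.global_mean + u_dev + i_dev

# Property 3: a single pass over a (hypothetical) event stream.
model = StreamingMeanModel()
for user, item, rating in [("u1", "i1", 4.0), ("u2", "i1", 5.0)]:
    model.update(user, item, rating)
print(model.predict("u1", "i1"))
```
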
The most serious drawback of traditional CF algorithms is that they are currently only usable under the batch prediction model. However, most practical applications fit the data stream model (social networks, for instance, are moving from batch data mining to on-the-fly event processing [4]), and new solutions for on-the-fly predictions are needed.
This computational model exactly captures the timeliness requirements of modern applications that use CF as a means to improve their outcomes. It allows applications to be more reactive to changes in user behavior, as the data observations containing these changes are immediately available for analysis by the CF algorithm. Moreover, it allows more efficient resource utilization, as internal solution models are continuously updated, without the need to restart calculations from scratch every time a new data batch becomes available. Finally, it still allows for horizontal scalability of CF solutions, as stream operators can be parallelized on demand as soon as the input data flows require larger computational resources. Interestingly, this model also allows for dynamic reconfiguration of the stream application deployment, making it possible to release unused computational resources as soon as the input data flows become less intense.

Proposal

The STEREO project aims at building a novel collaborative filtering system able to work incrementally on streams of incoming data, continuously providing up-to-date and accurate predictions.

Our solutions will be based on well-known data analysis techniques already used in the CF domain. However, we will revise and adapt these techniques to make them fit the stream-based computational model. The STEREO project comprises the following steps, which we expect to bring us to the desired outcome.
  1. Transition to a flexible data representation model.
    Most existing CF solutions adopt a static, matrix-based representation for input data. The transition to a stream-based approach first requires a more flexible scheme for storing and analyzing data. This will let the solution work fully in memory, in a distributed fashion with shared-nothing data structures. The STEREO project is currently working on this first step by exploring alternative solutions. We have already evaluated statistical-based representations [publications], and are currently exploring dynamic graph-based representations (see the first sketch after this list).
  2. Adaptation of existing algorithmic solutions to the new data model.
    The second step requires a redesign of existing algorithmic solutions so that they can work incrementally on these new representation models (see the second sketch after this list).
  3. Integration of these solutions in a stream-based computing architecture.
    The algorithmic solutions identified in the previous step will then be embedded within a streaming computation architecture, where the full CF solution will be designed as a graph of operators connected by flowing data streams (see the third sketch after this list).
  4. Evaluation of the proposed solutions.
    This fourth step includes several subtasks needed to adequately assess the quality of the identified solutions: the definition of performance metrics, the identification of appropriate testing data sets, the implementation of the proposed solutions and of baseline solutions (used for comparison) in a testing environment, the execution of the tests, and the assessment of the obtained results (see the fourth sketch after this list).
  5. Release of the solution as an open source platform.
    Ideally, the final step of the project should include the release of the best identified solutions as open source within a usable framework.
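
To illustrate the kind of representation explored in step 1, the following is a minimal, hypothetical sketch of a dynamic graph-based representation: a bipartite user-item graph maintained incrementally, whose adjacency structures are independent, shared-nothing units, in contrast to a static rating matrix. Class and method names are our own.

```python
from collections import defaultdict

class DynamicBipartiteGraph:
    """Incrementally maintained user-item graph (illustrative sketch).

    Unlike a static rating matrix, edges are added as events arrive,
    and each adjacency structure is self-contained, so it can live on
    whichever node of a distributed system owns that user or item.
    """

    def __init__(self):
        self.user_edges = defaultdict(dict)  # user -> {item: rating}
        self.item_edges = defaultdict(dict)  # item -> {user: rating}

    def add_rating(self, user, item, rating):
        # O(1) insertion/overwrite; no matrix reallocation needed.
        self.user_edges[user][item] = rating
        self.item_edges[item][user] = rating

    def raters_of(self, item):
        """Users who rated `item`: raw material for user-user similarity."""
        return self.item_edges[item]

g = DynamicBipartiteGraph()
g.add_rating("u1", "i1", 4.0)
g.add_rating("u2", "i1", 5.0)
print(g.raters_of("i1"))  # {'u1': 4.0, 'u2': 5.0}
```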
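
For step 2, one way a classic batch technique can be adapted is sketched below: user-user cosine similarity where dot products and squared norms are maintained incrementally, so a new rating only touches the co-raters of its item and no pass over historical data is ever repeated. The interface is our own and is simplified to rating insertions (re-ratings are not handled).

```python
from collections import defaultdict

class IncrementalCosine:
    """Incrementally maintained user-user cosine similarity (sketch)."""

    def __init__(self):
        self.norm2 = defaultdict(float)       # user -> sum of rating^2
        self.dot = defaultdict(float)         # (u, v) -> dot product
        self.item_raters = defaultdict(dict)  # item -> {user: rating}

    def add_rating(self, user, item, rating):
        self.norm2[user] += rating * rating
        # Only users who co-rated this item need their dot product updated.
        for other, r_other in self.item_raters[item].items():
            key = (min(user, other), max(user, other))
            self.dot[key] += rating * r_other
        self.item_raters[item][user] = rating

    def similarity(self, u, v):
        key = (min(u, v), max(u, v))
        denom = (self.norm2[u] * self.norm2[v]) ** 0.5
        return self.dot[key] / denom if denom else 0.0
```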
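
For step 3, the following sketch illustrates, under our own simplifying assumptions, the operator-graph idea: rating events are hash-partitioned by user, so each parallel instance of an update operator owns a disjoint shard of the model state, and the degree of parallelism can be changed on demand.

```python
import hashlib

class UpdateOperator:
    """One parallel instance of a (hypothetical) model-update stage."""
    def __init__(self):
        self.shard = {}  # this instance's share of the model state

    def process(self, event):
        user, item, rating = event
        self.shard.setdefault(user, {})[item] = rating

def route(event, operators):
    """Key-based routing: all events of a user reach the same instance."""
    user = event[0]
    idx = int(hashlib.md5(user.encode()).hexdigest(), 16) % len(operators)
    operators[idx].process(event)

operators = [UpdateOperator() for _ in range(4)]  # parallelism degree 4
for ev in [("u1", "i1", 4.0), ("u2", "i1", 5.0), ("u1", "i2", 3.0)]:
    route(ev, operators)
```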
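
For step 4, one widely used methodology for evaluating anytime models on streams is prequential (test-then-train) evaluation: each event first tests the current model and is then used to update it. The sketch below computes a running RMSE this way, reusing the predictor interface sketched earlier; names and data are illustrative.

```python
import math

def prequential_rmse(model, stream):
    """Running RMSE under test-then-train evaluation over a stream."""
    sq_err, n = 0.0, 0
    for user, item, rating in stream:
        pred = model.predict(user, item)   # test on the current model...
        sq_err += (pred - rating) ** 2
        n += 1
        model.update(user, item, rating)   # ...then train on the event
    return math.sqrt(sq_err / n) if n else float("nan")

# Usage with the StreamingMeanModel sketched above (hypothetical data):
# rmse = prequential_rmse(StreamingMeanModel(), events)
```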

References

[1] Twitter Engineering Blog, August 2013. https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
[2] Babcock, Brian, et al. "Models and issues in data stream systems." Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2002.
[3] Domingos, Pedro, and Geoff Hulten. "Catching up with the Data: Research Issues in Mining Data Streams." DMKD. 2001.
[4] Eyers, David, et al. "Living in the present: on-the-fly information processing in scalable web architectures." Proceedings of the 2nd International Workshop on Cloud Computing Platforms. ACM, 2012.