Recommender Architecture
April 16th, 2019
Who are We?
● E-commerce founded in late 2014
○ Internal Engineering founded in early 2015
○ Launched our in-house website in mid-2015, app in late-2015
● Concentrated on women's fashion
○ Around Rp 100K range as opposed to > Rp 200K range
○ Number 1 fashion e-commerce in Indonesia
● Recently rebranded to Sorabel in January 2019
Google Play App Rankings (5th)
Talk Overview
● Existing Recommender Development Workflow
○ Exploration, Computation, Serving
● New Recommender Development Workflow
○ Exploration, Computation, Serving
○ The Impact
● In-Memory Recommender Architecture
○ In-Depth Architecture
○ Pros & Cons
What is a Recommender?
● Software that tries to optimize the future actions of its users
○ Personalized: based on a specific user’s past behavior
○ Non-personalized: based on an aggregate, contextual model that is not tied to any single user (see the sketch below)
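A minimal sketch of the two flavors in Go (the language our serving stack uses); the types and fields here are hypothetical, for illustration only:

```go
// Hypothetical sketch of the two recommender flavors described above.
package recommender

// Recommender returns a ranked list of product IDs for a user.
type Recommender interface {
	Recommend(userID string, limit int) []string
}

// Personalized: ranks products from the user's own past behavior,
// e.g. their recently viewed products.
type PersonalizedRecommender struct {
	ViewsByUser map[string][]string // userID -> recently viewed product IDs
}

func (p *PersonalizedRecommender) Recommend(userID string, limit int) []string {
	views := p.ViewsByUser[userID]
	if len(views) > limit {
		views = views[:limit]
	}
	return views
}

// NonPersonalized: ranks products from aggregate, non-user context,
// e.g. global best-sellers -- the same list for every user.
type NonPersonalizedRecommender struct {
	BestSellers []string // precomputed from aggregate data
}

func (n *NonPersonalizedRecommender) Recommend(_ string, limit int) []string {
	if len(n.BestSellers) > limit {
		return n.BestSellers[:limit]
	}
	return n.BestSellers
}
```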
Recommenders in Sorabel
● We have recommenders for many things:
○ For our buyers: Restock, Budget, Trend Recommenders
○ For our warehouse: Order-Item Picking Route Optimizer, Item Placement
Optimizer, Logistics Partner Order Allocator
○ For our customers: Product Recommender
● In this talk, we focus on the Product Recommender
Uses of Product Recommender
● Home Feed
● Catalogue Feed
● Similar Products
● Search, and others!
Development Workflow
● Exploration
● Computation
● Serving
Development Workflow
● Exploration
○ Data scientists sift through the data to derive deeper insights by looking beyond the basic metrics
● Computation
○ Which classes of algorithm are computationally feasible, and how each model should be built, fit, validated, and then recomputed regularly
● Serving
○ How models are ultimately stored post-computation and served during production to our millions of users
Previous Development Workflow
Previous Recommender Architecture
● BigQuery: data warehouse, fed from multiple data sources -- MySQL, Cassandra, Kafka
● Apache Spark: engine for large-scale data processing
● Dataproc: runs Apache Spark jobs inside GCP
● ElasticSearch: an open-source, distributed search engine
Exploration
● Data scientists explore and analyze the data, trying to build the right model / heuristics
○ Most of their exploration is done in Python
● This is usually done locally → limited hardware resources
○ Inherently lower limit to the size of data during experimentation → data scientists are then limited to “toy” datasets during this stage
○ Harder for data scientists to collaborate
Computation
● Data engineers translate the data scientists’ work into appropriate Spark jobs
○ Computation was mostly done inside Google Dataproc
● Data engineers make the changes necessary for the model to be production-ready
○ For example, dummy dataset vs production-scale dataset
○ Long back-and-forth feedback loop between data scientists & engineers
● Recommendations were largely precomputed at a less-than-optimal scope: at the feed level, done daily
○ Computation and disk writes take a long time (+ storage costs!)
○ Low usage rate → not everyone visits their precomputed feed on a daily basis
Serving
● The production read path that serves the actual models / recommendations to our users
● A dual-layer architecture (see the sketch below):
○ Highly stateful data storage layer -- ElasticSearch
○ Stateless (horizontally scaled) REST API servers that mostly read from the stateful layer with minimal post-processing
● Implemented and maintained by backend engineers
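A minimal sketch of this dual-layer read path in Go: a stateless handler that fetches a user’s precomputed feed from ElasticSearch over its standard document REST API. The index name, document layout, and host below are hypothetical, not our actual schema:

```go
// Stateless read-path sketch: the REST server holds no model state and
// mostly proxies precomputed documents out of ElasticSearch.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

const esHost = "http://elasticsearch:9200" // hypothetical ES endpoint

func recommendations(w http.ResponseWriter, r *http.Request) {
	userID := r.URL.Query().Get("user_id")

	// Read the precomputed document from the stateful layer.
	// (GET /<index>/_doc/<id> is ElasticSearch's standard document API.)
	resp, err := http.Get(fmt.Sprintf("%s/user_feed/_doc/%s", esHost, userID))
	if err != nil {
		http.Error(w, "storage layer unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Minimal post-processing: pass the stored document through as-is.
	w.Header().Set("Content-Type", "application/json")
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/recommendations", recommendations)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```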
Recap: Existing Workflow Problems
● Exploration is usually done locally
○ Local resources are limited
○ Data scientists usually work with tiny subsets of data to get the work done locally
○ Harder for data scientists to collaborate
● Going back and forth between data scientists and data engineers took longer than it should
● Long indexing time (the daily job took ~4-8 hrs)
● Non-trivial cost and complexity in the serving infra (Dataproc + ElasticSearch)
New Development Workflow
Exploration
● Data scientists can utilize Sciencebox
○ JupyterLab running in dedicated containers on top of Kubernetes for each data scientist
○ Instant access to powerful resources
■ Large core count & RAM
■ GPU if needed
● No longer need to play around with tiny subsets of data
● Easier to collaborate, share, and evaluate work between data scientists
Computation
● Introducing DataQuery
○ A platform where anyone can build their own derived tables easily
○ A derived table is a composite-data table -- a table whose data is composed from multiple tables
■ From raw tables / other derived tables
■ Mostly defined by a SQL query
■ Editable data refresh frequency
○ Built on top of Google BigQuery
Computation
● Frequently, a simpler model can be realized using just DataQuery
○ No need for any Spark jobs in most cases
○ Data scientists can do this independently (see the sketch below)
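A minimal sketch of the kind of query that could back a DataQuery derived table, run here through the standard BigQuery Go client; the project, dataset, table, and column names are hypothetical:

```go
// Hypothetical "popularity model" as a derived-table query: top products
// by 7-day view count, composed from a raw events table in BigQuery.
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	q := client.Query(`
		SELECT product_id, COUNT(*) AS views
		FROM ` + "`my-gcp-project.raw.product_view_events`" + `
		WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
		GROUP BY product_id
		ORDER BY views DESC
		LIMIT 1000`)

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row struct {
			ProductID string `bigquery:"product_id"`
			Views     int64  `bigquery:"views"`
		}
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row.ProductID, row.Views)
	}
}
```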
Serving
● The serving infrastructure is now a single-layer, Go-based in-memory service (IM for short)
○ We load the “models” from DataQuery (or any other data) into the service’s resident memory as “Components”, conditionally at startup or as-needed
○ Components are built on top of each other to form a more complete and capable “component tree” that then serves the actual recommendations as a group (see the sketch below)
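A minimal sketch of the Component idea in Go; this is a hypothetical API for illustration, not the actual IM code:

```go
// Hypothetical Component API: leaves hold model data in resident memory,
// parents compose their children into a larger "component tree".
package im

// Component produces a ranked list of product IDs for a user.
type Component interface {
	Recommend(userID string, limit int) []string
}

// InMemoryTable is a leaf: model rows loaded at startup (or as-needed).
type InMemoryTable struct {
	Rows []string // e.g. product IDs pre-ranked by a DataQuery model
}

func (t *InMemoryTable) Recommend(_ string, limit int) []string {
	if len(t.Rows) > limit {
		return t.Rows[:limit]
	}
	return t.Rows
}

// Interleave is a parent: it stitches its children's results together
// round-robin, forming a more capable component from simpler ones.
type Interleave struct {
	Children []Component
}

func (iv *Interleave) Recommend(userID string, limit int) []string {
	out := make([]string, 0, limit)
	for i := 0; len(out) < limit; i++ {
		added := false
		for _, c := range iv.Children {
			if recs := c.Recommend(userID, limit); i < len(recs) && len(out) < limit {
				out = append(out, recs[i])
				added = true
			}
		}
		if !added {
			break // every child is exhausted
		}
	}
	return out
}
```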
Serving
● Serving infrastructure is now a single-layer, Go-based in-memory service (IM for short)
○ Additional computations (including but not limited to inter-model component stitching, data re-sorting, realtime-data-sensitive logic, etc.) can be done within the Components on-request, on-the-fly
○ A centralized “component registry” handles caching / re-computation of different parts of the component tree for best performance with little manual work, not dissimilar to React’s Virtual DOM concept used in user interfaces (see the sketch below)
○ A much larger chunk of the user-specific recommendation computation can now be done on-the-fly, only when the user comes to the site
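A minimal sketch of such a registry, building on the hypothetical Component interface above: it memoizes the output of a component subtree with a TTL, recomputing only when the cached entry goes stale:

```go
// Hypothetical component registry: caches subtree results per
// (component, user) so repeated requests skip recomputation.
package im

import (
	"sync"
	"time"
)

type cacheEntry struct {
	recs      []string
	expiresAt time.Time
}

// Registry memoizes Recommend calls keyed by component name and user.
type Registry struct {
	mu    sync.Mutex
	ttl   time.Duration
	cache map[string]cacheEntry
}

func NewRegistry(ttl time.Duration) *Registry {
	return &Registry{ttl: ttl, cache: make(map[string]cacheEntry)}
}

// Recommend returns a cached result for this part of the component tree
// if it is still fresh; otherwise it recomputes and re-caches it.
func (r *Registry) Recommend(name string, c Component, userID string, limit int) []string {
	key := name + "|" + userID
	r.mu.Lock()
	if e, ok := r.cache[key]; ok && time.Now().Before(e.expiresAt) {
		r.mu.Unlock()
		return e.recs
	}
	r.mu.Unlock()

	recs := c.Recommend(userID, limit) // recompute this subtree
	r.mu.Lock()
	r.cache[key] = cacheEntry{recs: recs, expiresAt: time.Now().Add(r.ttl)}
	r.mu.Unlock()
	return recs
}
```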
Serving
● Backend engineers implement the components
○ However, due to its simplicity, data scientists often implement components themselves
○ A data scientist’s workflow for a new feature is now very simple:
i. Play around with algorithms & data in Sciencebox
ii. Write the production version of the algorithm as a DataQuery table
iii. “Wrap” the DataQuery model in an IM `Component` (see the sketch below)
○ We’ll talk about how this works in more depth later on
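A minimal sketch of step iii, reusing the hypothetical types from the earlier sketches: a loader that reads a pre-ranked DataQuery derived table (via BigQuery, which DataQuery is built on) into resident memory and exposes it as a Component. Table and column names are hypothetical:

```go
// Hypothetical "wrap" step: turn a DataQuery derived table into an
// in-memory Component at service startup.
package im

import (
	"context"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

// LoadDataQueryComponent reads a pre-ranked derived table into memory
// and returns it as a Component (see the Component sketch above).
func LoadDataQueryComponent(ctx context.Context, client *bigquery.Client, table string) (Component, error) {
	q := client.Query("SELECT product_id FROM `" + table + "` ORDER BY score DESC")
	it, err := q.Read(ctx)
	if err != nil {
		return nil, err
	}
	var rows []string
	for {
		var row struct {
			ProductID string `bigquery:"product_id"`
		}
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			return nil, err
		}
		rows = append(rows, row.ProductID)
	}
	return &InMemoryTable{Rows: rows}, nil
}
```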
Workflow Comparison
[Diagram] Previous workflow: data scientists own Exploration; data engineers / backend engineers own Computation and Serving.
New workflow: