Engineering Challenges in Vertical Search Engines

Aleksandar Bradic, Senior Director,
Engineering and R&D, Vast.com
+
Introduction
  Vertical Search
  Search focused on vertical data
  Vertical Data – data inherently described by its structure:
  Items/Properties for sale (Automotive, Real Estate..)
  Geographical Data (Neighborhoods, Locations..)

  Services (Hotels, Transportation..)
  Businesses (Restaurants, Nightlife..)
  Events (Concerts, Plays..)
  Auction items (Collectibles, Art..)
  Metadata (News, Social Data, Reviews..)
  …
+
Introduction
  Vertical Search != Full Text Search

  Full Text Search queries:
  “Cheap tickets for Broadway shows this week”
  “Trendy Restaurants in San Francisco near SoMa”
  “3-day trips from NYC to anywhere under $1000”
  Vertical Search queries:
  “price-sorted results below two standard deviations from tickets
category with Broadway as location and date range of 2010-04-11 to
2010-04-18”
  “distance-sorted results relative to center of SF/SoMa matching the
appropriate threshold of composite score of user review scores and
historical change in query/review volume”
  “total cost-sorted results for all 3-day intervals within next 6 months
combining hotel and airfare price below max value of $1000 for all
valid locations”
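The first vertical query above can be sketched as a filter over structured records followed by a sort. This is a minimal illustration, not Vast.com's implementation: the `Listing` type, its field names, and the reading of "below two standard deviations" as "priced under mean minus k standard deviations of the matching subset" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Listing:
    # Hypothetical record layout for a "tickets" vertical.
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Price-sorted listings priced more than k standard deviations
    below the mean price of the matching subset (assumed interpretation)."""
    subset = [
        l for l in listings
        if l.category == category
        and l.location == location
        and start <= l.event_date <= end
    ]
    if len(subset) < 2:
        return []  # not enough data to estimate a spread
    prices = [l.price for l in subset]
    threshold = mean(prices) - k * pstdev(prices)
    return sorted((l for l in subset if l.price < threshold),
                  key=lambda l: l.price)
```

Unlike a full-text query, every term here is a typed constraint (category, location, date range) or a statistic over the matching subset.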
+
Introduction
  Vertical Search = search on structured data
  Vertical Search at Web-Scale:

  Web-Scale datasets
  Web-Scale query volumes
  Interactive operation
  Low latency requirements
  Utility maximization across all involved parties
  => loads of fun ! : )

+
@Vast.com
  Vast.com : Vertical Search & Analytics Platform
  Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc.
+
@Vast.com
  Daily processing of up to 1Tb of unstructured and semi-structured Web data
  Managing an operational dataset of ~150M records across multiple verticals
  Handling peak search loads of > 1000 queries/sec
  We’re hiring ! : )
+
Challenges in Vertical Search Engines
  Web Data Retrieval
  Unstructured Data
  Data Processing Infrastructures
  Vertical Search
  Data Analytics
  Computational Advertising
+
Web Data Retrieval
  Crawler Architecture
  Queue Management
  Crawl Ordering Policies
  Duplicate URL Detection
  Content Hash Management
  Politeness Management
  Coverage Measurement
  Freshness Optimization
  Incremental Crawling
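Three of the items above (queue management, duplicate URL detection, content hashing, plus a simple form of politeness) can be sketched together as a crawl frontier. This is a toy, single-process sketch under stated assumptions: the `Frontier` class, its method names, and the fixed per-host delay are hypothetical, and a production crawler would persist and distribute this state.

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivial variants map to one key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case scheme and host, drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class Frontier:
    def __init__(self, min_delay=1.0):
        self.queue = deque()       # FIFO crawl order (breadth-first)
        self.seen_urls = set()     # duplicate URL detection
        self.seen_hashes = set()   # duplicate content detection
        self.next_allowed = {}     # host -> earliest next fetch time
        self.min_delay = min_delay

    def add(self, url):
        norm = normalize_url(url)
        if norm in self.seen_urls:
            return False
        self.seen_urls.add(norm)
        self.queue.append(norm)
        return True

    def next_url(self, now):
        """Next URL whose host may be fetched at time `now`, else None."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlsplit(url).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.min_delay
                return url
            self.queue.append(url)  # host still cooling down; retry later
        return None

    def is_new_content(self, body: bytes):
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

An exact content hash only catches byte-identical pages; near-duplicate detection would need a different technique.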
+
Web Data Retrieval
  “Deep Web” crawling

  Locating Deep Web Content Sources
  Selecting Relevant Sources
  Estimating Database Size
  Understanding Content / Form Detection
  Automatic Dispatch of HTML Forms
  Predicting content in free text forms
  Crawling non-HTML Content
  Estimating Query Result Sparsity
  URL Generation problem
  Query Covering Problem
+
Web Data Retrieval
  Focused (Topical) Crawling

  Content Classification
  Link Content Prediction
  Topic Relevance Estimation
  Modeling Temporal Characteristics

  Site-Level Evolution
  Page-Level Evolution
  Adversarial Crawling
  Web Spam Detection
  Cloaked Content Detection
+
Unstructured Data
  Unstructured Data – information that does not have a pre-defined data model
  Handling Unstructured Data:

  Data Cleaning
  Tagging with Metadata
  Vertical Classification
  Schema Matching
  Information Extraction
Example post: Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!
Fields to extract: make, model, year, trim, price ???
+
Unstructured Data
  Information extraction from unstructured, ungrammatical data
  Reference Sets - relational data sets that consist of a collection of
known entities with associated common attributes
  Reference Set Selection
  Reference Set Generation
  Record Linkage : Finding the “best matching” member of the reference
set for a given post
  Challenge : Automatic Generation of Reference Sets
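Record linkage against a reference set can be sketched with simple token overlap. This is a minimal sketch, assuming a tiny hand-written `REFERENCE_SET` (hypothetical entries) and plain regular expressions for year and price; real systems would use fuzzier matching and much larger reference sets.

```python
import re

# Hypothetical reference set: known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Focus", "trim": "Sedan"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set):
    """Return the reference entry with the most attributes found in the post."""
    tokens = tokenize(post)

    def score(entry):
        # Count attributes whose tokens all appear in the post.
        return sum(1 for value in entry.values() if tokenize(value) <= tokens)

    best = max(reference_set, key=score)
    return dict(best) if score(best) > 0 else None


def extract(post, reference_set=REFERENCE_SET):
    record = link_record(post, reference_set) or {}
    # Naive patterns: a 19xx/20xx number is taken as the year,
    # a number after "$" as the price.
    year = re.search(r"\b(19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    record["year"] = int(year.group(0)) if year else None
    record["price"] = int(price.group(1).replace(",", "")) if price else None
    return record
```

Applied to the example post "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!", `extract` links the first reference entry and fills in year 2008 and price 7000. The year pattern would misfire on a price such as $2000, which is one reason the slide calls this ungrammatical data hard.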
+
Data Processing Infrastructures
  Infrastructures for continuous processing of unbounded streams of unstructured data
  Information Extraction as part of processing (non-trivial
computation for each processed entry)
  Inherently distributed infrastructures - in order to support performance and scalability
  Time-to-site constraints. Ability to process out-of-band data.
  Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering …)
  Support for data archival and off-line analysis

+
  Distributed Computing Platforms:
  Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)
  Stream-oriented (Flume, S4, Stream SQL…)
  Distributed Data Stores (Dynamo/Cassandra/Riak…)
  The curse of the CAP Theorem:
  It is impossible for a distributed system to simultaneously provide all three of the following guarantees:
  Consistency
  Availability
  Partition tolerance
+
Vertical Search
  Large-Scale structured data search
  Providing both analytic and canonical sets of Information Retrieval functionalities
  Entries are represented in a Vector Space Model
  Each result is represented as a data point – a tuple consisting of an appropriate number of fields : (make, model, year, trim …)

+
Vertical Search
  Search in Vector Space Model

  Resulting subset generation
  Sorting as linearization using selected metric
  Dynamic subset criteria calculation
  Search Result Clustering
  “Similar” result search
  …
… with up to ~100 ms response time

… at 10M+ records in index
… handling 100+ queries/sec/host
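The operations listed above (subset generation, sorting by a selected metric, “similar” result search) can be sketched over records treated as points. This is an in-memory illustration only: the field names, the min-max scaling, and the Euclidean distance are assumptions, and nothing here addresses the latency or index-size targets on the slide.

```python
import math

NUMERIC_FIELDS = ("year", "price", "mileage")  # assumed numeric dimensions


def select(records, **criteria):
    """Subset generation: records matching all exact-value criteria."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]


def sort_by(records, field, reverse=False):
    """Sorting as linearization of the result set along one metric."""
    return sorted(records, key=lambda r: r[field], reverse=reverse)


def similar(records, target, fields=NUMERIC_FIELDS, k=2):
    """k nearest records to `target` after min-max scaling each field."""
    spans = {}
    for f in fields:
        values = [r[f] for r in records]
        lo, hi = min(values), max(values)
        spans[f] = (lo, (hi - lo) or 1)  # avoid division by zero

    def vec(r):
        return [(r[f] - spans[f][0]) / spans[f][1] for f in fields]

    t = vec(target)
    others = [r for r in records if r is not target]
    return sorted(others, key=lambda r: math.dist(vec(r), t))[:k]
```

Scaling matters because raw price and mileage differences would otherwise dominate the year dimension in the distance.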
+
Vertical Search
  Faceted Search
  fac-et (fas’it) :
  1. One of the flat polished surfaces cut on a gemstone or occurring
naturally on a crystal.
  2. One of numerous aspects, as of a subject.
  Vocabulary problem for faceted data

  Facet Design / selection
  “the keywords that are assigned by indexers are often at
odds with those tried by searchers.”
  Selection of information-distinguishing facet values
  User-specific faceted search
  Dynamic correlated facet generation
  Distributing facet computation
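At its simplest, facet computation is a count of distinct values per field over the current result set. The sketch below assumes dictionary records and exact-match filters (both hypothetical); it does not cover facet selection, correlated facets, or distribution across hosts.

```python
from collections import Counter


def facet_counts(records, facet_fields, **filters):
    """Count facet values among the records that match the active filters."""
    matching = [r for r in records
                if all(r.get(f) == v for f, v in filters.items())]
    return {
        field: Counter(r[field] for r in matching if field in r)
        for field in facet_fields
    }
```

For example, `facet_counts(records, ["model", "year"], make="Ford")` returns, for each of `model` and `year`, how many Ford records carry each value, which is what a faceted UI shows next to each refinement link.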
+
Data Analytics
  Clickstream Data Analysis
  Learning from implicit user feedback
  Anonymous user clustering
  Learning to rank
  Inventory/Market Trends
  Rare Event detection
  Price Prediction
  Spam Content detection

+
Data Analytics
  Challenges:
  “Good Deal” detection
  Recommendation Systems for Vertical Data with no explicit user
feedback
  Accuracy of Automatic Valuation Models
  Data-driven feature design
  Click Prediction
  User Behavior Modeling
+
Computational Advertising
  The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.
[Figure: ads shown alongside search results]
+
  Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”
[Figure labels: “ad ?”, “ad ?”]
+
  Central challenge:
  Find the “best match” between a given user in a given context
and a suitable advertisement
  “best match” – maximizing the value for :
  Users
  Advertisers
  Publishers
  Each of the parties has different set of utilities:
  Users want relevance
  Advertisers want ROI and volume

  Publishers want revenue per impression/search
+
  CTR (Click-Through Rate) Estimation:

  Reactive (statistically significant historical CTR)
  Predictive (CTR estimated from features of ads)
  Hybrid (historical + predictive)
  Personalization of CTR Computation ?

  Dynamic CTR Estimation (online algorithms)
P(click) = ?
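One common way to combine the reactive and predictive estimates above is to treat the predicted CTR as a prior and let historical counts override it as impressions accumulate. The sketch below is an assumption, not the method used in the talk: the logistic model, the feature weights, and the `prior_strength` parameter are all hypothetical.

```python
import math


def predicted_ctr(features, weights, bias):
    """Predictive estimate: logistic model over ad features."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def hybrid_ctr(clicks, impressions, prior_ctr, prior_strength=100.0):
    """Hybrid estimate: historical counts smoothed toward a prior.

    prior_strength acts like a number of pseudo-impressions at prior_ctr,
    so new ads start at the prediction and converge to observed CTR.
    """
    return (clicks + prior_strength * prior_ctr) / (impressions + prior_strength)
```

With no history, `hybrid_ctr(0, 0, p)` returns the prior `p`; with many impressions the result approaches `clicks / impressions`.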
+
  Analytical Apparatus:
  Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)
  Game Theory (Nash Equilibria, dominant strategy)
  Auction Theory (Vickrey, GSP, VCG…)
  Graph Theory (random walks on graphs, graph matching, etc.)
  Information Retrieval Techniques (similarity metrics, etc.)
  …
+
Conclusion
  Vertical Search & Analytics at Web Scale == fun !!!
  Source of a large number of relevant research & engineering problems !
  Opportunity to tackle a wide spectrum of techniques across all areas of Computer Science and Engineering !
Jump on the bandwagon ! : )
