
+

Engineering Challenges
in Vertical Search Engines
Aleksandar Bradic, Senior Director,
Engineering and R&D, Vast.com
+
Introduction

  Vertical Search
  Search focused on vertical data
  Vertical Data – data inherently described by its structure:
  Items/Properties for sale (Automotive, Real Estate..)

  Geographical Data (Neighborhoods, Locations..)


  Services (Hotels, Transportation..)
  Businesses (Restaurants, Nightlife..)
  Events (Concerts, Plays..)
  Auction items (Collectibles, Art..)
  Metadata (News, Social Data, Reviews..)
  …
+
Introduction

  Vertical Search != Full Text Search


  Full Text Search queries:
  “Cheap tickets for Broadway shows this week”
  “Trendy Restaurants in San Francisco near SoMa”
  “3-day trips from NYC to anywhere under $1000”
  Vertical Search queries:
  “price-sorted results below two standard deviations from tickets
category with Broadway as location and date range of 2010-04-11 to
2010-04-18”
  “distance-sorted results relative to center of SF/SoMa matching the
appropriate threshold of composite score of user review scores and
historical change in query/review volume”
  “total cost-sorted results for all 3-day intervals within next 6 months
combining hotel and airfare price below max value of $1000 for all
valid locations”
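
A minimal sketch of the first vertical query above, for illustration only. It assumes that "below two standard deviations" means priced under mean minus two standard deviations of the filtered subset; the `Listing` type, its field names and `cheap_listings` are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Listing:
    """Hypothetical structured record for one ticket listing."""
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Return price-sorted listings priced below mean - k * stddev.

    The statistics are computed over the subset that matches the
    category, location and date-range constraints.
    """
    subset = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.event_date <= end
    ]
    if len(subset) < 2:
        return []  # not enough data to estimate a spread
    prices = [item.price for item in subset]
    threshold = mean(prices) - k * pstdev(prices)
    return sorted(
        (item for item in subset if item.price < threshold),
        key=lambda item: item.price,
    )
```

Unlike a full-text query, every constraint here is a predicate or a statistic over typed fields, which is what makes sorting and thresholding well defined.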
+
Introduction

  Vertical Search = search on structured data

  Vertical Search at Web-Scale:


  Web-Scale datasets
  Web-Scale query volumes
  Interactive operation
  Low latency requirements
  Utility maximization across all involved parties

  => loads of fun ! : )


+
@Vast.com

  Vast.com : Vertical Search & Analytics Platform

  Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc.
+
@Vast.com

  Daily processing of up to 1Tb of unstructured and semi-structured Web data

  Managing an operational dataset of ~150M records across multiple verticals

  Handling peak search loads of > 1000 queries/sec

  We’re hiring ! : )
+
Challenges in Vertical Search Engines
  Web Data Retrieval

  Unstructured Data

  Data Processing Infrastructures

  Vertical Search

  Data Analytics

  Computational Advertising
+
Web Data Retrieval

  Crawler Architecture
  Queue Management
  Crawl Ordering Policies
  Duplicate URL Detection
  Content Hash Management
  Politeness Management
  Coverage Measurement
  Freshness Optimization
  Incremental Crawling
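
Two of the items above, Duplicate URL Detection and Content Hash Management, can be sketched as follows. This is an in-memory illustration under simple assumptions, not a description of any production crawler; `normalize_url` and `SeenFilter` are hypothetical names, and a real system would likely need a persistent or probabilistic store instead of Python sets.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Map trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenFilter:
    """Tracks which URLs and which page bodies have already been seen."""

    def __init__(self):
        self._urls = set()
        self._content_hashes = set()

    def is_new_url(self, url):
        key = normalize_url(url)
        if key in self._urls:
            return False
        self._urls.add(key)
        return True

    def is_new_content(self, body):
        """`body` is the raw page as bytes; exact duplicates hash equally."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._content_hashes:
            return False
        self._content_hashes.add(digest)
        return True
```

An exact hash only catches byte-identical pages; near-duplicate detection would need a different technique.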
+
Web Data Retrieval

  “Deep Web” crawling


  Locating Deep Web Content Sources
  Selecting Relevant Sources
  Estimating Database Size
  Understanding Content / Form Detection
  Automatic Dispatch of HTML Forms
  Predicting content in free text forms
  Crawling non-HTML Content
  Estimating Query Result Sparsity
  URL Generation problem
  Query Covering Problem
+
Web Data Retrieval

  Focused (Topical) Crawling


  Content Classification
  Link Content Prediction
  Topic Relevance Estimation

  Modeling Temporal Characteristics


  Site-Level Evolution
  Page-Level Evolution

  Adversarial Crawling
  Web Spam Detection
  Cloaked Content Detection
+
Unstructured Data

  Unstructured Data – information that does not have a pre-defined data model

  Handling Unstructured Data:


  Data Cleaning
  Tagging with Metadata
  Vertical Classification
  Schema Matching
  Information Extraction

Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!


make model year trim price ???
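
A toy sketch of how the post above could be mapped onto the (make, model, year, trim, price) fields. The vocabularies are tiny and hypothetical, and treating "Convertible" as a trim value is an assumption taken from the slide's field list; real posts are far messier than this handles.

```python
import re

# Hypothetical vocabularies, for illustration only.
MAKES = {"ford", "honda"}
MODELS = {"ford": {"focus", "mustang"}, "honda": {"civic"}}
TRIMS = {"convertible", "sedan", "coupe"}


def extract(post):
    """Fill a structured record from a free-text classified post."""
    record = {"make": None, "model": None, "year": None,
              "trim": None, "price": None}
    # Tokens are either words or numbers with an optional leading "$".
    for token in re.findall(r"[A-Za-z]+|\$?\d+", post):
        low = token.lower()
        if token.startswith("$"):
            record["price"] = int(token[1:])
        elif token.isdigit() and 1900 <= int(token) <= 2100:
            record["year"] = int(token)
        elif low in MAKES:
            record["make"] = low
        elif record["make"] and low in MODELS[record["make"]]:
            record["model"] = low
        elif low in TRIMS:
            record["trim"] = low
    return record


post = "Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!"
print(extract(post))
```

Everything the vocabularies do not cover ("just", "Absolute Beauty") is simply ignored, which hints at why hand-written rules do not scale across verticals.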
+
Unstructured Data

  Information extraction from unstructured, ungrammatical data
  Reference Sets - relational data sets that consist of a collection of
known entities with associated common attributes
  Reference Set Selection
  Reference Set Generation
  Record Linkage: finding the “best matching” member of the reference
set for a given post
  Challenge : Automatic Generation of Reference Sets
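
A minimal sketch of record linkage against a reference set, assuming a plain token-overlap score. `REFERENCE_SET`, `best_match` and the 0.5 cutoff are hypothetical; practical systems typically also need fuzzy matching to cope with misspellings and abbreviations.

```python
import re

# Hypothetical reference set of known entities with common attributes.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Sedan"},
    {"make": "Honda", "model": "Civic", "trim": "Coupe"},
]


def tokenize(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set, min_score=0.5):
    """Return (entity, score) for the reference entity that best fits the post.

    The score is the fraction of the entity's attribute tokens that occur
    in the post. If no entity reaches min_score, entity is None.
    """
    post_tokens = tokenize(post)
    best, best_score = None, 0.0
    for entity in reference_set:
        entity_tokens = tokenize(" ".join(entity.values()))
        if not entity_tokens:
            continue
        score = len(entity_tokens & post_tokens) / len(entity_tokens)
        if score > best_score:
            best, best_score = entity, score
    if best_score < min_score:
        return None, best_score
    return best, best_score
```

Once a post is linked to a reference entity, the entity's clean attribute values can stand in for whatever the post spelled out badly or left out.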
+
Data Processing Infrastructures

  Infrastructures for continuous processing of unbounded streams of unstructured data
  Information Extraction as part of processing (non-trivial
computation per processed entry)

  Inherently distributed infrastructures - in order to support performance and scalability

  Time-to-site constraints. Ability to process out-of-band data.

  Support for complex operations on aggregated data (de-duplication,
static ranking, data enrichment, data cleaning/filtering …)

  Support for data archival and off-line analysis


+
Data Processing Infrastructures

  Distributed Computing Platforms:

  Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

  Stream-oriented (Flume, S4, Stream SQL…)

  Distributed Data Stores (Dynamo/Cassandra/Riak…)

  The curse of the CAP Theorem:


  It is impossible for a distributed system to simultaneously provide
all three of the following guarantees:
  Consistency
  Availability
  Partition tolerance
+
Vertical Search

  Large-Scale structured data search

  Providing both analytic and canonical sets of Information Retrieval functionality

  Entries are represented in Vector Space Model

  Each result is represented as a data point – a tuple consisting of an
appropriate number of fields: (make, model, year, trim …)


+
Vertical Search

  Search in Vector Space Model


  Resulting subset generation
  Sorting as linearization using selected metric
  Dynamic subset criteria calculation
  Search Result Clustering
  “Similar” result search
  …

… with up to ~100 ms response time


… at 10M+ records in index
… handling 100+ queries/sec/host
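
The first operations listed above (subset generation, sorting as linearization, “similar” result search) can be sketched over records treated as points. This is a brute-force illustration with hypothetical helper names and field scales; it says nothing about how an index would meet the latency and volume targets above.

```python
import math


def select(records, **criteria):
    """Subset generation: keep records whose fields satisfy every predicate."""
    return [
        r for r in records
        if all(predicate(r[field]) for field, predicate in criteria.items())
    ]


def linearize(records, metric):
    """Sorting as linearization: order points by a scalar metric."""
    return sorted(records, key=metric)


def distance(a, b, fields, scales):
    """Scaled Euclidean distance between two records over numeric fields."""
    return math.sqrt(sum(((a[f] - b[f]) / scales[f]) ** 2 for f in fields))


def similar(records, target, fields, scales, k=3):
    """'Similar' result search: the k nearest records to the target."""
    others = [r for r in records if r is not target]
    return sorted(others, key=lambda r: distance(r, target, fields, scales))[:k]


cars = [
    {"id": 1, "make": "ford", "year": 2008, "price": 7000},
    {"id": 2, "make": "ford", "year": 2009, "price": 7500},
    {"id": 3, "make": "honda", "year": 2001, "price": 3000},
    {"id": 4, "make": "ford", "year": 2003, "price": 4000},
]
recent_fords = select(cars, make=lambda m: m == "ford", year=lambda y: y >= 2005)
by_price = linearize(recent_fords, metric=lambda r: r["price"])
```

The `scales` argument is there because fields such as year and price live on very different numeric ranges; how to choose those scales is itself a design question.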
+
Vertical Search

  Faceted Search
  fac-et (fas’it) :
  1. One of the flat polished surfaces cut on a gemstone or occurring
naturally on a crystal.
  2. One of numerous aspects, as of a subject.

  Vocabulary problem for faceted data


  Facet Design / selection
  “the keywords that are assigned by indexers are often at
odds with those tried by searchers.”
  Selection of information-distinguishing facet values
  User-specific faceted search
  Dynamic correlated facet generation
  Distributing facet computation
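
A small sketch of facet counting and of the last item above, distributing facet computation. It assumes each shard counts facet values locally and the partial counts are then merged; the function names and the sample records are hypothetical.

```python
from collections import Counter


def facet_counts(records, facet_fields):
    """Count how many records carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            value = record.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def merge_facet_counts(partials):
    """Combine per-shard facet counts into global counts."""
    merged = {}
    for partial in partials:
        for field, counter in partial.items():
            merged.setdefault(field, Counter()).update(counter)
    return merged


def refine(records, field, value):
    """Narrow a result set by selecting one facet value."""
    return [r for r in records if r.get(field) == value]


shard_a = [{"make": "ford", "year": 2008}, {"make": "ford", "year": 2009}]
shard_b = [{"make": "honda", "year": 2008}]
fields = ["make", "year"]
merged = merge_facet_counts(
    [facet_counts(shard_a, fields), facet_counts(shard_b, fields)]
)
```

Plain counts merge by addition, which is what makes this easy to distribute; correlated or user-specific facets do not decompose as simply.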
+
Data Analytics

  Clickstream Data Analysis

  Learning from implicit user feedback

  Anonymous user clustering

  Learning to rank

  Inventory/Market Trends

  Rare Event detection

  Price Prediction

  Spam Content detection


+
Data Analytics

  Challenges:
  “Good Deal” detection
  Recommendation Systems for Vertical Data with no explicit user
feedback
  Accuracy of Automatic Valuation Models
  Data-driven feature design
  Click Prediction
  User Behavior Modeling
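
One possible reading of “Good Deal” detection, sketched under strong assumptions: a one-feature least-squares line (price as a function of year) stands in for an Automatic Valuation Model, and a listing counts as a good deal when it is priced at least 20% under the predicted value. Both the model and the threshold are hypothetical choices for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def good_deals(listings, discount=0.2):
    """Return listings priced more than `discount` below the fitted value."""
    slope, intercept = fit_line(
        [item["year"] for item in listings],
        [item["price"] for item in listings],
    )
    deals = []
    for item in listings:
        predicted = slope * item["year"] + intercept
        if item["price"] < (1 - discount) * predicted:
            deals.append(item)
    return deals
```

The quality of the flagged deals depends entirely on the valuation model, which is why accuracy of Automatic Valuation Models appears as its own challenge above.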
+
Computational Advertising

  The central problem of computational advertising is to find the “best match”
between a given user in a given context and a suitable advertisement.

[Figure: page layout with regions labeled “ads”, “ads” and “search results !”]
+
Computational Advertising

  Vertical Search presents an additional challenge in the sense that any of the
actual search results can be “sponsored”

[Figure: search results individually labeled “ad ?”]
+
Computational Advertising

  Central challenge:
  Find the “best match” between a given user in a given context
and a suitable advertisement
  “best match” – maximizing the value for :
  Users
  Advertisers
  Publishers
  Each of the parties has a different set of utilities:
  Users want relevance

  Advertisers want ROI and volume


  Publishers want revenue per impression/search
+
Computational Advertising

  CTR (Click-Through Rate) Estimation:


  Reactive (statistically significant historical CTR)
  Predictive (CTR estimated from features of ads)
  Hybrid (historical + predictive)

  Personalization of CTR Computation ?


  Dynamic CTR Estimation (online algorithms)

P(click) = ?
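
The three estimation styles above can be sketched as follows. The hybrid form shown here, a historical rate smoothed toward a feature-based prediction, is one common way to combine the two and is an assumption of this sketch; the feature names, weights and `prior_strength` are hypothetical.

```python
import math


def reactive_ctr(clicks, impressions):
    """Historical CTR; None when there is no history to draw on."""
    if impressions == 0:
        return None
    return clicks / impressions


def predictive_ctr(features, weights, bias=0.0):
    """Logistic model: P(click) estimated from ad features."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def hybrid_ctr(clicks, impressions, predicted, prior_strength=100.0):
    """Blend history with the prediction.

    The prediction acts like `prior_strength` pseudo-impressions, so it
    dominates for new ads and fades as real impressions accumulate.
    """
    return (clicks + prior_strength * predicted) / (impressions + prior_strength)
```

With no impressions, `hybrid_ctr` returns the predicted value unchanged; with many impressions, it approaches the historical rate.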
+
Computational Advertising

  Analytical Apparatus:
  Regression Analysis (Linear, Logistic, probit model, High
Dimensional methods)
  Game Theory (Nash Equilibria, dominant strategy)
  Auction Theory (Vickrey, GSP, VCG…)
  Graph Theory (random walks on graphs, graph matching, etc.)
  Information Retrieval Techniques (similarity metrics, etc.)
  …
+
Conclusion

  Vertical Search & Analytics at Web Scale == fun !!!

  Source of a large number of relevant research & engineering problems !

  Opportunity to apply a wide spectrum of techniques across all areas of
Computer Science and Engineering !

Jump on the bandwagon ! : )
