Lighthouse Capital Partners V, L.P. ("Lighthouse"), the senior secured lender to SearchMe, Inc.,
("SearchMe"), (www.SearchMe.com) is soliciting interest for the acquisition of all or substantially all of
SearchMe's assets, including its Intellectual Property ("IP"), in whole or in part (collectively, the
"SearchMe Assets"). Please be advised that the SearchMe Assets are being offered for sale pursuant to
Section 9-610 of the Uniform Commercial Code. Purchasers of the SearchMe Assets will receive all of
SearchMe's rights in the purchased portion of Lighthouse's collateral, which consists of substantially all of
SearchMe's assets, as provided in the Uniform Commercial Code.
The sale is being conducted by Lighthouse with the cooperation of SearchMe. SearchMe has advised
Lighthouse that it will use its best efforts to make its employees available to assist purchasers with due
diligence and assist with a prompt and efficient transition at a mutually convenient time.
The information in this memorandum does not constitute the whole or any part of an offer or a contract.
The information contained in this memorandum relating to the SearchMe Assets has been supplied by
third parties and obtained from a variety of sources. It has not been independently investigated or verified
by Lighthouse or its respective agents.
Potential purchasers should not rely on any information contained in this memorandum or provided by
Lighthouse (or its respective staff, agents, and attorneys) in connection herewith, whether transmitted
orally or in writing (the "information"), as a statement, opinion, or representation of fact. Please note that
all information provided herein relating to the operations of SearchMe's business and its market positions
relates to periods on or prior to March 31, 2009. Interested parties should satisfy themselves through
independent investigations as they see fit.
Lighthouse and its respective staff, agents, and attorneys, (i) disclaim any and all implied warranties
concerning the truth, accuracy, and completeness of any information provided in connection herewith and
(ii) do not accept liability for the information, including that contained in this memorandum, whether that
liability arises by reasons of Lighthouse's negligence or otherwise.
Any sale of the SearchMe Assets will be made on an "as-is," "where-is," and "with all faults" basis,
without any warranties, representations, or guarantees, either express or implied, of any kind, nature, or
type whatsoever from, or on behalf of, Lighthouse. Without limiting the generality of the foregoing,
Lighthouse, and its respective staff, agents, and attorneys, hereby expressly disclaim any and all implied
warranties concerning the condition of the SearchMe Assets and any portions thereof, including, but not
limited to, environmental conditions, compliance with any government regulations or requirements, the
implied warranties of habitability, merchantability, or fitness for a particular purpose.
This memorandum contains confidential information and is not to be supplied to any person without
Lighthouse's prior consent.
$45 Million
of equity capital proceeds
from institutional venture capital funds
and other private investors from 2004-2008
THE FOLLOWING IS PRESENTED FOR INFORMATIONAL PURPOSES ONLY ON A "BEST EFFORTS"
BASIS. NO WARRANTY IS PROVIDED WITH REGARD TO THE ACCURACY OF THE INFORMATION
HEREIN OR THE VALUE OF SEARCHME'S ASSETS. PAST PERFORMANCE MAY NOT BE INDICATIVE
OF FUTURE RESULTS.
SUMMARY POINTS
• SearchMe Invented, Built and Ran a Proprietary, Large Scale Search Technology from the
ground up. It is one of the few such systems built in the last decade, alongside those of
Inktomi, Yahoo, Google, and MSN/Bing.
• SearchMe is one of the Early Pioneers of Integrated Media Blending: mixing video, music
and web pages into a single search. SearchMe's technology is aware of the different
characteristics of rich media, and can disambiguate between media types with similar names.
• SearchMe's Vertical Architecture delivers Query understanding within 100 milliseconds across
hundreds of categories. This leverages a complex and deep taxonomy of web pages trained over
one thousand categories.
• Significant Investment in Intellectual Property and Assets – over $45 million of equity capital
has funded the development of SearchMe's business and Intellectual Property.
• Funded and vetted by well-known institutional venture capital funds, including Sequoia
Capital. SearchMe's most recent formal round of equity capital in April 2008 valued the
company's assets at approximately $250 million post-money.
• Well-Known Brand Name, Trademark, and domain name related to Visual Search.
SearchMe is a large-scale general-purpose search engine comprised of back-end server software and
multiple front-end client software for the web and five native mobile platforms. SearchMe specializes in
(1) the visual presentation of web pages, various types of rich media, and other forms of Internet search
results, and (2) the delivery and retrieval of highly relevant results for a given search query within a
vertical domain. SearchMe's visual approach to search is designed to create a more natural and human
experience for the user, while at the same time crafting an environment that may be a better fit for the
delivery of rich streams of Internet video and other types of media. This approach to Internet search has
the potential to create a rich and differentiated platform for Internet advertising campaigns. In addition to
its advantages in rich media search, SearchMe's novel User Interface may have applications beyond
traditional Internet search, including the potential to be repurposed as an elegant user interface for the set-
top box market, or as an embedded UI for electronic devices. SearchMe's technology assets have
leveraged deep industry knowledge from some of the best minds in search and UI design to create a
proprietary technology platform composed of (A) a well-known and elegant front-end User Interface
design and related assets, and (B) a back-end search and advertising architecture and related assets.
SearchMe has been prominently featured in numerous major U.S. and global publications, including Time
Magazine, The Washington Post, TechCrunch, and Bloomberg (Venture Show) among others. Time
Magazine heralded SearchMe as one of the "50 Best Web Sites of 2008." [2] At its peak, the company
generated approximately 5 million "people per month."
The company's assets are well-positioned to capitalize on several important industry trends – the robust
growth of Internet advertising and Internet search; the increased demand for online videos and other types
of rich multimedia within search; and the eventual convergence of television with broadband Internet.
SearchMe is a privately held company. SearchMe (founded in 2004 as Kavam, Inc.) is headquartered in
Mountain View, California. To date, SearchMe has secured over $45 million in equity financing, with the
company achieving a post-money valuation of nearly $250MM in April 2008. SearchMe's institutional
investors include Sequoia Capital, DAG and Deep Fork Capital, among others.
THE MARKET:
SearchMe has historically competed primarily in the market for Internet Advertising and specifically in
the market for Internet search engine advertising.
According to PWC and the Interactive Advertising Bureau, Internet advertising revenues ("revenues") in the
United States totaled $23.4 billion for the full year of 2008, with Q3 accounting for approximately $5.8
billion and Q4 totaling approximately $6.1 billion. Internet advertising revenues for the full year of 2008
increased 10.6 percent over 2007.
According to PWC and the Interactive Advertising Bureau, in 2008, search engines represented the largest
online advertising revenue format, accounting for 45 percent of 2008 full year Internet advertising
1
ALL INFORMATION PROVIDED HEREIN RELATING TO THE OPERATIONS OF SEARCHME'S BUSINESS AND THE MARKET POSITIONS
RELATES TO PERIODS ON OR PRIOR TO MARCH 31, 2009 AND IS PROVIDED ON A "BEST-EFFORTS" BASIS. INTERESTED PARTIES SHOULD
SATISFY THEMSELVES THROUGH INDEPENDENT INVESTIGATIONS AS THEY OR THEIR LEGAL AND FINANCIAL ADVISORS SEE FIT THAT
THE INFORMATION IS ACCURATE. LIGHTHOUSE MAKES NO WARRANTY AS TO THE ACCURACY OF ANY INFORMATION CONTAINED HEREIN
OR THE VALUE OF THE SEARCHME ASSETS.
2
http://www.time.com/time/specials/2007/article/0,28804,1809858_1809955_1811466,00.html
It is widely assumed that each percentage share of the search market is worth $1 billion in market cap
valuation. [3]
SearchMe has historically experienced strong growth and has been among the leaders in next generation
search technology and progressive, visual interface designs for the presentation of rich media from a
multitude of sources. However, recent working capital constraints have created the opportunity for all or a
portion of SearchMe's assets to be sold.
Value: Over $45 million of equity capital investment in SearchMe's business &
intellectual property. Due to market circumstances, the technology assets and
intellectual property are available for purchase at a material discount to the
substantial total amount of equity invested in SearchMe since 2004.
3
http://dondodge.typepad.com/the_next_big_thing/2007/05/why_1_of_search.html
http://www.altsearchengines.com/2009/06/10/1-market-share-in-search-is-worth-up-to-3-billion-dollars/
IMPORTANT TECHNOLOGY:
(A) Valuable & Elegant User Interface that can be repurposed across markets: The
Company's proprietary user interface design provides superior functionality relative to
competitive solutions.
Ribbon Control – an elegant Flex control used to display large amounts of visually
compelling media in a horizontal list format. Though this control resembles Apple's Cover
Flow feature, it also supports unique "ribbon" views and a standard "film" view.
Query Auto Complete – given the beginning of a query, this feature will predict what the
user is typing and return a list of suggestions. This feature comes with the logic for the auto
complete and the data to power it.
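As an illustration only, prefix-based query completion of the kind described above can be sketched as follows. The query list, the popularity counts, and the function name `suggest` are hypothetical and are not taken from SearchMe's implementation; the sketch simply assumes suggestions are ranked by how often a query has been seen.

```python
import bisect

# Hypothetical query log: query -> popularity count. This stands in for the
# data that would power the feature; the values are invented.
QUERY_COUNTS = {
    "saturn": 90,
    "saturn cars": 40,
    "saturn planet": 70,
    "search engine": 120,
    "sequoia": 30,
}

# Sorting once lets every lookup jump straight to the first candidate.
SORTED_QUERIES = sorted(QUERY_COUNTS)


def suggest(prefix, limit=3):
    """Return up to `limit` known queries starting with `prefix`,
    most popular first."""
    prefix = prefix.lower()
    start = bisect.bisect_left(SORTED_QUERIES, prefix)
    matches = []
    for query in SORTED_QUERIES[start:]:
        if not query.startswith(prefix):
            break  # sorted order: no later query can match
        matches.append(query)
    matches.sort(key=lambda q: -QUERY_COUNTS[q])
    return matches[:limit]


print(suggest("sat"))
```

A production system would likely use a trie or a precomputed prefix table rather than a sorted list, but the contract is the same: a partial query in, a ranked list of suggestions out.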
(B) Innovative and Scalable Search & Advertising Architecture: The Company's proprietary
back-end search and advertising architecture is purpose-built to provide highly-relevant
search results paired with rich multimedia advertising capabilities. SearchMe's back-end
architecture includes the following systems, feeds, data, and technology components:
Randy Adams, Founder and CEO: Over 25 years of experience. Randy founded, built, and sold seven
venture-backed start-ups in the technology sector; raised more than $200 million in venture capital;
arranged for the initial funding of Yahoo, Inc. and served on the Yahoo, Inc. board of directors; created
the first internet commerce company, the Internet Shopping Network, which was sold to the Home Shopping
Network; and served as Division President of the Home Shopping Network and as Director of Engineering
for Adobe Systems, where he envisioned and created Adobe Acrobat and the PDF file format.
John Galatea, Vice President of Sales & Marketing: Over 25 years of experience in Sales, Business
Development and Marketing in the Silicon Valley. John was an early member of the Inktomi team that
revolutionized OEM search. He also co-founded the Paid Inclusion platform for monetizing algorithmic
search that was acquired by Yahoo in 2003 and grew revenues to over $200M annually for Yahoo. In
addition, John was asked to evangelize, develop, and lead sales organizations for Yahoo in both Sponsored
Search and Display across Digital Agencies, Direct Clients and SEMs.
Eric Glover, Principal Scientist & Classification Architect: Eric Glover has a PhD from University of
Michigan and has over ten years of commercial and academic web search experience. Previously at Ask
Jeeves and before that NEC Laboratories America, Eric has a proven track record of creating effective,
commercial, and web-scale categorization systems. Eric has numerous highly cited publications and over
ten filed patents.
Timothy Huertas, Client Applications Architect: Timothy received his undergraduate degree from University
of Central Missouri and has over 7 years of professional experience in industries ranging from finance to
photo sharing. While at SearchMe Timothy was responsible for SearchMe's web and web service
interface. Prior to joining SearchMe Timothy was a member of Snapfish's (a Hewlett Packard company)
Emerging Technologies Group where some of his contributions include the site's image editor, text editor
and slide show, which are still used by millions of people in multiple countries.
SearchMe has identified certain principal technologists that may be available to an acquirer of the
SearchMe Assets to help render the company's technology, systems, data, and architecture.
4
THE BIOGRAPHICAL INFORMATION CONCERNING THE CURRENT MANAGEMENT OF SEARCHME IS INCLUDED FOR INFORMATION
PURPOSES ONLY. ALTHOUGH THIS SALE IS BEING CONDUCTED WITH SEARCHME'S COOPERATION, THIS SALE IS STRICTLY AN ASSET
SALE OFFERED BY LIGHTHOUSE AS SEARCHME'S SENIOR LENDER PURSUANT TO ARTICLE 9 OF THE UNIFORM COMMERCIAL CODE.
LIGHTHOUSE HAS NO ARRANGEMENT PURSUANT TO WHICH BUYER OF THE SEARCHME ASSETS COULD BE ASSURED THE FUTURE
SERVICES OF ANY SEARCHME OFFICERS OR EMPLOYEES.
At its peak SearchMe had over 5,000,000 unique people per month, making it the number one visual
search engine at the time. The significant positive press and reviews demonstrate the unique capabilities
of SearchMe's technology. This technology was born out of tens of millions of dollars of research and
several PhDs from top institutions as well as former executives and high-ranking employees from major
commercial search engines.
Most people know SearchMe for its visual, media-rich UI, which leveraged our own high-resolution
thumbnail generation technology. In addition, the back-end system demonstrated relevance that was
competitive with the top five search engines. Beyond thumbnails, embedded media, and powerful
relevance algorithms, SearchMe is also differentiated by its advanced categorization technology. Our
system labels every web page in our index against our taxonomy of more than 1000 topic categories.
This powerful system allows SearchMe to offer vertical suggestions as a form of real-time
disambiguation, such as knowing the different meanings of 'diamondback', 'bonds', 'saturn', etc. For
each query, the system would know in real time the percent affinity for (and intersection of) each of the
nearly 300 exposed categories.
In addition to the system, SearchMe amassed a wealth of valuable data. In order to train a Machine
Learned Relevance function (MLR), and nearly 1000 automated category classifiers, we have about 1
million human labeled judgments. These judgments include tens of thousands of categorical labels for
classifier training and hundreds of thousands of relevance judgments for query/url pairs. This data is
extremely valuable for any company interested in the search space.
SearchMe's IP is divided among many sub-groups; Data (such as human judgments and category
classifiers), Online System, Offline Processing (which includes categorization), and specific technologies
and patent filings. Several of the inventors are available to assist in explaining the benefits and issues with
trying to apply this technology for your organization.
Lighthouse is seeking a buyer for the SearchMe Assets, in whole or in part. Interested parties may bid
on all or any part of SearchMe's brand name, core technology, front-end user interface, or back-end
search and advertising architecture, enabling the purchaser to leverage SearchMe's brand name, core
technology, front-end user interface, and/or back-end architecture to establish an Internet search engine
with a visual approach, to enhance the user interface of an existing search engine, or to leverage the
potential relevancy improvements.
Interested and qualified parties will be expected to sign a nondisclosure agreement (Exhibit C hereto) to
have access to due diligence documentation and key members of the management and development teams
(the "Due Diligence Access"). Each interested party, as a consequence of the Due Diligence Access
granted to it, shall be deemed to acknowledge and represent (i) that it is bound by the bidding procedures
described herein; (ii) that it has an opportunity to inspect and examine the SearchMe Assets and to review
all pertinent documents and information with respect thereto; (iii) that it is not relying upon any written or
oral statements, representations, or warranties of Lighthouse or SearchMe, or their respective staff,
agents, or attorneys; and (iv) all such documents and reports have been provided solely for the
convenience of the interested party, and Lighthouse and SearchMe (and their respective staff, agents, or
attorneys) do not make any representations as to the accuracy or completeness of the same.
Indications of Interest (outlining value range and specific assets to be purchased) must be received by
Lighthouse no later than October 9, 2009 at 5pm Pacific Time ("Indication Deadline"), and may be
subject to the completion of due diligence. Based on Lighthouse's evaluation of the Indications of
Interest, Lighthouse will invite those interested parties that Lighthouse deems in its sole discretion to be
viable candidates to deliver binding Letters of Intent consistent with the terms of a standard foreclosure
sale agreement prepared by Lighthouse ("Sale Agreement") and provided by Lighthouse to all parties
invited to submit Letters of Intent. The Sale Agreement will require the bidder to close and fund the purchase
price within seven days of Lighthouse's delivery of its signature to the Sale Agreement. Letters of Intent
must be received no later than October 31, 2009 at 5pm Pacific Time ("Offer Deadline").
Indications of Interest must include the name of the purchasing entity, purchase price range, assets to be
purchased, and any contingencies to closing. Letters of Intent must be accompanied by the bidder's duly
executed final version of the Sale Agreement with a comparison showing all variations and changes from
the form of proposed Sale Agreement provided by Lighthouse. Delivery of the bidder's duly executed Sale
Agreement shall constitute a binding, unconditional offer to purchase the identified property. This will
be an "as is", "where is" sale with no representations or warranties provided by Lighthouse or
SearchMe. Exclusivity will not be provided and it is the winning bidder's sole responsibility to set the
closing agenda.
Lighthouse reserves the right to close the bidding process immediately with or without notice to interested
parties. Interested parties are encouraged to complete due diligence and submit offers as soon as
practicable.
Any person or other entity making a bid must be prepared to provide independent confirmation that they
possess the financial resources to complete the purchase where applicable. Lighthouse reserves the right
to, in its sole discretion, accept or reject any bid, or withdraw any or all assets from sale.
All sales, transfer, and recording taxes, or similar taxes, if any, relating to the sale of the SearchMe Assets
shall be the sole responsibility of the successful bidder and shall be paid to Lighthouse at the closing of
each transaction.
SearchMe's Patent applications are described in an attachment to this document listed as Exhibit A, and
are available for sale in conjunction with or separate from any of SearchMe's other assets.
SearchMe's brand name is well known as one of the first recognizable brands in visual search engines.
(B) Trademarks
The following excerpt comes from the USPTO Trademark Electronic Search System:
Word Mark: SearchMe
Goods and Services:
IC 009. US 021 023 026 036 038. G & S: Computer software for searching, retrieving,
mining, classifying, and collecting information on computer networks within individual
workstations and personal computers via the internet
IC 038. US 100 101 104. G & S: Transmission of data, images, video and sound clips via
electronic global computer networks
IC 042. US 100 101. G & S: Computer services, namely, providing computer search engine
software, which makes use of an index of documents, over a network for obtaining
customized on-line user-queried information; providing search engines for obtaining data on
global computer networks, namely, providing search engines that provide indexed
information, such indexed information including web sites, on-line links, and other
information extracted and retrieved from global computer networks, providing search engines
that provide information in the form of text, electronic documents, databases, graphic, and
audio visual information extracted and retrieved from global computer networks, and
providing search engines that retrieve documents, or portions thereof, available on the
Internet and classify such documents, or portions thereof, using classifiers in order to provide
an indexed set of documents that can be queried by a user
SearchMe's User Interface system - which is easily separable from its back-end architecture - can
formulate queries and receive an XML feed that can contain a mix of regular web results and special
results which specify a particular rendering engine to allow for in-line media (e.g., YouTube, Hulu, Imeem,
MTV, etc.).
SearchMe's front-end user interface has been implemented on several different platforms, including AS3,
JavaScript, iPhone, Android, S60, Windows Mobile and Blackberry Storm.
The key technology components of SearchMe's front-end user interface are the (A) Thumbnail Server
and (B) Ribbon Control.
SearchMe's thumbnail server leverages Firefox's plug-in framework technology, a combination of XUL
(markup similar to HTML) and JavaScript. This permits manipulation and modification by junior-level
staff. The thumbnailer is unique in the following 2 ways:
1. A user does a search and is presented with snapshots of every web page found.
2. The user hits the info button and can opt to make several pieces of the image interactive.
8. The user can opt to highlight every phone number on the page. The plan was to make the phone
number clickable and tie it to an online phone service (Google Voice, Skype).
SearchMe's IP includes user interfaces for 4 major mobile phone platforms. The UI can easily be ported
to support any search engine giving buyers an instant mobile presence. The use of SearchMe's thumbnail
technology makes for a fast way to browse the web on even the slowest connections. See images below:
IPhone
SearchMe's IP also includes a Firefox tool bar. This toolbar allows users to search directly from their
browser. This toolbar can also be used to gather metrics about the machine it is installed on and its user's
browsing behavior. The toolbar can be ported to fit any search engine.
The mini search widget allows users to embed the power of any visual search technology on their blog,
web site or social network with just a few lines of code. Like SearchMe's site, this widget supports
multiple types of media.
Given a URL and image size (width x height), the server will return an image for the given URL.
Given a URL, the thumbnail server will return the following metadata:
o The bounding box (x, y, width, height) of every piece of text, hyperlink, and embed tag on the
page.
o The href of each hyperlink on the page.
o The embed code for every embedded object on the page.
o The majority of the US phone numbers (and their respective bounding boxes) on the page.
o Beta: The system is set up to sniff out addresses as well. The coverage is limited and exact
metrics are unknown.
o The fully parsed, post-onload inner HTML of the web page and all its child documents
(frames).
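A minimal sketch of how such metadata might be modeled, for illustration only: the class names `BoundingBox` and `TextNode`, the function `find_phone_numbers`, and the regular expression are all assumptions, not SearchMe's actual schema or detector. The sketch assumes the rendering step has already produced text fragments with their on-page bounding boxes.

```python
import re
from dataclasses import dataclass


@dataclass
class BoundingBox:
    x: int
    y: int
    width: int
    height: int


@dataclass
class TextNode:
    """A piece of rendered text and where it sits on the page (hypothetical)."""
    text: str
    box: BoundingBox


# Simple pattern for common US formats such as (650) 555-0100 or 650-555-0100.
# A real detector would need to handle many more variants.
US_PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]\d{4}\b")


def find_phone_numbers(nodes):
    """Return (phone number, bounding box of the containing text) pairs."""
    found = []
    for node in nodes:
        for number in US_PHONE.findall(node.text):
            found.append((number, node.box))
    return found


page = [
    TextNode("Call (650) 555-0100 today", BoundingBox(10, 20, 200, 16)),
    TextNode("No number here", BoundingBox(10, 40, 200, 16)),
]
print(find_phone_numbers(page))
```

Returning the bounding box alongside each match is what would let a client overlay a highlight or a clickable region on the thumbnail image.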
The ribbon control is a proprietary method of presenting information in a fluid, "cover flow" style design.
Its presence gives any website a "wow" effect by visually stringing together linked images in a "ribbon"
or liquid image stream. The background of the ribbon can be customized around any color scheme so as
to better fit the unique needs of any branding requirement. In particular, SearchMe's Ribbon Control has
significant value for web applications that need to display an infinite number of items in a finite space;
SearchMe's Ribbon Control technology enables large volumes of "inventory" to be displayed in a central
location. Ribbon Control is a substitute for the commonplace horizontal or vertical list of web
pages, videos, documents, or other items.
Details:
4. The ribbon control can be presented in any color. It has CSS properties that take an array of colors
and their positions used to form a gradient.
The inner workings of the control are an advanced topic that is beyond the scope of this summary. That
said, the complexities are well encapsulated. To use the control, the developer need only create an
item renderer that implements an interface and implement the build, focus and blur methods. The item
renderer must also dispatch an event to let the control know it is ready to be drawn. The control takes an
array collection of data. The control monitors the array collection for changes and responds accordingly.
In other words, if you add or remove records from the array collection the control will logically add or
remove sides. This feature is useful when implementing paging.
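The contract described above can be sketched as follows. The original control is a Flex component; this Python sketch only mirrors the shape of the interaction, and every class and method name in it (`ItemRenderer`, `ObservableList`, `Ribbon`, `TitleRenderer`) is hypothetical rather than taken from SearchMe's code.

```python
from abc import ABC, abstractmethod


class ItemRenderer(ABC):
    """Hypothetical renderer contract: build, focus and blur."""

    @abstractmethod
    def build(self, record): ...

    @abstractmethod
    def focus(self): ...

    @abstractmethod
    def blur(self): ...


class ObservableList:
    """Minimal stand-in for an array collection that reports changes."""

    def __init__(self):
        self._items = []
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def add(self, record):
        self._items.append(record)
        for listener in self._listeners:
            listener("add", record)

    def remove(self, record):
        self._items.remove(record)
        for listener in self._listeners:
            listener("remove", record)


class TitleRenderer(ItemRenderer):
    def __init__(self, on_ready):
        self.on_ready = on_ready
        self.label = None
        self.focused = False

    def build(self, record):
        self.label = record["title"]
        self.on_ready(self)  # tell the control this item can be drawn

    def focus(self):
        self.focused = True

    def blur(self):
        self.focused = False


class Ribbon:
    """Tracks one renderer per record and follows collection changes."""

    def __init__(self, collection, renderer_factory):
        self.ready = []  # renderers that reported ready, in arrival order
        self._by_record = {}
        self._factory = renderer_factory
        collection.subscribe(self._on_change)

    def _on_change(self, kind, record):
        if kind == "add":
            renderer = self._factory(self.ready.append)
            self._by_record[id(record)] = renderer
            renderer.build(record)
        else:
            self.ready.remove(self._by_record.pop(id(record)))


data = ObservableList()
ribbon = Ribbon(data, TitleRenderer)
data.add({"title": "Home"})
print([r.label for r in ribbon.ready])
```

Because the control only reacts to collection events, paging reduces to appending the next batch of records to the collection.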
SearchMe's Back-End Architecture achieves its "magic" through various aspects of online and offline
components with specialized systems and data. The back-end is substantial in its complexity and robust in
its capabilities. It supports a parallel installation for redundancy, and is designed to fit roughly 10M web
documents per Index Server. In addition, the running system has significant support for editorial
functionality such as Paparazzi and Eddie as well as integration of "special services" from the Widget
Server.
(A) Paparazzi
Paparazzi System is a combination of several components which enable a consistent ranking for specified
classes of results. Paparazzi combines human rules, classification and domain-specific lists to enhance
relevance. For example, an editor might decide that for all Actors, the first result should be the 'actor's
homepage', followed by 'IMDB page', followed by a Hulu clip (e.g., the actor's recent appearance on
Letterman), etc. Paparazzi simplifies the process of generating a consistent experience over
automatically classified data.
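The actor example above could be expressed roughly as follows. This is a sketch under stated assumptions: the rule table, the result-type labels, and the function name `apply_rule` are invented for illustration, and the sketch assumes each result has already been classified with a type.

```python
# Hypothetical editorial rule: preferred order of result types for queries
# classified as "actor". The type labels are illustrative only.
RULES = {
    "actor": ["official_homepage", "imdb_page", "hulu_clip"],
}


def apply_rule(query_class, results):
    """Reorder `results` so rule-listed types come first, in rule order.

    Results whose type is not listed keep their original relative order
    after the listed ones.
    """
    order = RULES.get(query_class, [])
    rank = {page_type: i for i, page_type in enumerate(order)}
    # sorted() is stable, so ties preserve the engine's original ranking.
    return sorted(results, key=lambda r: rank.get(r["type"], len(order)))


results = [
    {"url": "http://news.example/story", "type": "news"},
    {"url": "http://imdb.example/name", "type": "imdb_page"},
    {"url": "http://actor.example/", "type": "official_homepage"},
    {"url": "http://hulu.example/clip", "type": "hulu_clip"},
]
for result in apply_rule("actor", results):
    print(result["url"])
```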
(B) Eddie
Eddie is the nickname for the SearchMe system to editorially manage queries. Some queries have results
which should be blocked (inappropriate for the given query), and for other queries the desired content does not
rank as it should. The Eddie system includes a web UI to manage queries or groups of queries to enable
editorial ranking. The Cluster Server honors the Eddie rankings, as does the Index Builder and Index
Servers. The system is designed such that it is possible to rapidly make and incorporate changes - without
requiring a full rebuild of the production/live indexes.
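One way to picture editorial overrides that take effect without an index rebuild is as an overlay applied to the algorithmic ranking at query time. The sketch below assumes that design; the override table, its field names, and `apply_overrides` are hypothetical, and the memorandum does not say how Eddie is actually implemented.

```python
# Hypothetical editorial overrides keyed by normalized query.
OVERRIDES = {
    "saturn": {
        "blocked": {"http://spam.example/saturn"},
        "pinned": ["http://saturn.example/"],
    },
}


def apply_overrides(query, ranked_urls):
    """Apply block and pin rules on top of the algorithmic ranking."""
    rules = OVERRIDES.get(query.strip().lower())
    if rules is None:
        return list(ranked_urls)
    pinned = rules.get("pinned", [])
    blocked = rules.get("blocked", set())
    # Drop blocked URLs, and drop pinned ones so they are not listed twice.
    rest = [u for u in ranked_urls if u not in blocked and u not in pinned]
    return list(pinned) + rest


print(apply_overrides("Saturn", [
    "http://spam.example/saturn",
    "http://nasa.example/saturn",
    "http://saturn.example/",
]))
```

Since the table is consulted per query, an editor's change can take effect as soon as the table is updated.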
The labeled training data is a key part of the value of SearchMe and is valuable to anyone in this space
(even without the associated infrastructure of the SearchMe platform) - in addition to the judgments (a
human evaluation of relevance for a given query or URL) we have substantial other human data about
every query and URL (in our TORGO judgment collection system). This includes indications of
misspellings, official homepages, adult intent, spam and many others. This data could be used (and was
used) for training more than just the MLR system.
One of the strongest differentiators of the SearchMe system is the Classification. The system (CHOCO)
allows a non-engineer to teach the computer a new web-scale category in just a few hours. The nearly
1000 algorithmic categories could be applied to every page in our index in just a few days of offline
processing. This system is loosely tied to part of the SearchMe system [the document storage system or
docstore (I)]. As we built up the nearly 1000 algorithmic categories, we generated substantial training data
and testing data to evaluate each category (M). The labeled pages for each category are also very valuable
by themselves for anyone designing any type of categorization system.
One of the things key to any search engine is the ability to make new features quickly. We have a system
we call "New SFP" (New Special Features Processing) which can take a YAML config file and then use
that to generate a wide variety of page-related features. This system was designed so non-programmers
(e.g., spam or domain experts) could encode complex features, such as "Unique Spam Words in Title", in
just a few lines of Perl or XPath. New SFP would utilize structured data files (lists) and would run
against all documents stored in the docstore (I) offline. This system is separable, but was designed to
work with the proprietary Docstore system.
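To illustrate the idea of config-driven feature generation, here is a small sketch. It is not New SFP itself: the original used YAML with Perl or XPath snippets, whereas this sketch uses a Python dict standing in for a parsed YAML file and a fixed table of operations. All field names, operation names, and the word list are assumptions.

```python
import re

# Stand-in for a parsed YAML config; every key below is hypothetical.
CONFIG = {
    "lists": {"spam_words": ["cheap", "free", "winner"]},
    "features": [
        {"name": "unique_spam_words_in_title", "field": "title",
         "op": "count_unique_in_list", "list": "spam_words"},
        {"name": "title_word_count", "field": "title", "op": "count_words"},
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Each operation takes the field's words and an optional word list.
OPS = {
    "count_words": lambda words, word_list: len(words),
    "count_unique_in_list": lambda words, word_list: len(set(words) & set(word_list)),
}


def compute_features(document, config=CONFIG):
    """Evaluate every configured feature against one document."""
    features = {}
    for spec in config["features"]:
        words = tokenize(document.get(spec["field"], ""))
        word_list = config["lists"].get(spec.get("list"), [])
        features[spec["name"]] = OPS[spec["op"]](words, word_list)
    return features


print(compute_features({"title": "FREE cheap free pills"}))
```

Adding a feature in this design means adding an entry to the config rather than changing the processing code, which is the property the paragraph above attributes to New SFP.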
(F) Aggregation
The SearchMe aggregation system allowed for scanning the entire docstore (> 3B documents) and
generating summary data - such as "nfl.com" is 95% Football, or the average wordcount of the
imdb.com/movies is X. The set of features aggregated is managed by easily editable config files.
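A per-domain summary such as "nfl.com is 95% Football" amounts to counting category labels per domain and dividing by the number of documents seen for that domain. The sketch below shows that arithmetic; the document layout (`url` and `categories` keys) and the function name are assumptions, and a real run over billions of documents would be distributed rather than held in memory.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def aggregate_by_domain(documents):
    """Return {domain: {category: fraction of that domain's documents}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for doc in documents:
        domain = urlparse(doc["url"]).netloc
        totals[domain] += 1
        for category in doc["categories"]:
            counts[domain][category] += 1
    return {
        domain: {cat: n / totals[domain] for cat, n in cats.items()}
        for domain, cats in counts.items()
    }


docs = [
    {"url": "http://nfl.example/scores", "categories": ["Football"]},
    {"url": "http://nfl.example/news", "categories": ["Football", "News"]},
]
print(aggregate_by_domain(docs))
```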
Using a complex trained classifier (built with a system similar to the one used to train the main MLR), we can
assign a numerical score from -1 to +1, where +1 means a strong, high-quality document and -1 means a
low-quality document. This system runs on a .tsv file but requires the production system to generate the
specific features used.
Automated SPAM Classification - One of the biggest challenges for web-scale search engines is SPAM,
or automatically generated content designed to manipulate search ranking. Using technology similar to
our MLR, we trained an automated SPAM classifier that assigns a score to every document considered for
In order to make our search engine operate, it is critical to store the "raw data" and provide a central
repository for all of the offline processing. The Docstore is designed for very fast streaming for in and
outbound requests. It is possible to make requests to stream results to a distributed set of clients. You
could run ten or a thousand Classification clients, or run a hundred spiders, etc. Each could read (in
consecutive streams) or write in bulk to the docstore.
Each document required a sequence of steps to prepare it for indexing - beginning with spidering, then
classification, special features processing, etc. In order to make this processing more efficient, a
customized "pipeline processing architecture" was created. This sub-system was used to process the El
Rapido (near-real-time content), as well as media content for indexing.
All large-scale search engines need to obtain content through some type of 'spider'. SearchMe's spider,
called Charlotte, is designed to be multi-threaded and distributed. The spider streams URLs to crawl from
the distributed docstore and streams back the resultant page data (and responses). In addition to the spider,
there is an associated connection throttler which enforces rates per domain (each domain can have
different associated rates).
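Per-domain rate enforcement of this kind can be sketched as a table of "next allowed fetch time" per domain. The class below is an illustrative assumption, not Charlotte's throttler: the name `DomainThrottler`, its parameters, and the fixed-delay policy are all invented for the example.

```python
import time


class DomainThrottler:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, default_delay=1.0, per_domain=None, clock=time.monotonic):
        self.default_delay = default_delay
        self.per_domain = per_domain or {}  # domain -> delay in seconds
        self.clock = clock                  # injectable for testing
        self._next_allowed = {}

    def wait_time(self, domain):
        """Seconds the caller should wait before fetching from `domain`."""
        return max(0.0, self._next_allowed.get(domain, 0.0) - self.clock())

    def record_fetch(self, domain):
        """Note that a fetch just happened and schedule the next slot."""
        delay = self.per_domain.get(domain, self.default_delay)
        self._next_allowed[domain] = self.clock() + delay


throttler = DomainThrottler(default_delay=1.0, per_domain={"slow.example": 5.0})
throttler.record_fetch("slow.example")
print(throttler.wait_time("slow.example") > 0)
```

In a multi-threaded crawler the two methods would need to be guarded by a lock, and a distributed crawler would need to share this state across machines; both are omitted here.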
The TORGO system is a semi-flexible system (parts tied to the production docstore/spider) for allowing
collection of human evaluations. The system permitted making a new project and automatically fetching
results from our engine or third party engines with customized scrapers. Human judges could select
queries to judge where cached copies of pages are presented (the caching system was an installation of the
docstore and spider). The configuration of TORGO is done via simple SQL entries - the menus and
options are totally configurable. Judgments can cover queries (e.g., adult, navigational, etc.), URLs
(spam, porn, etc.) and query/URL pairs (e.g., highly-relevant, official homepage, bad, etc.).
We have about 1M "judgments". Each judgment includes many pieces of data, ranging from whether a
particular URL is highly relevant for a given query to the information described above in the TORGO
system description. We also have actual result lists from competitor search engines - to allow for DCG
(relevance score) comparison. The judgment scale was 5-point – "Expected", "Comprehensive",
"Acceptable", "Marginal", and "Bad", as well as other no-judgment options (page error, etc…).
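For readers unfamiliar with DCG (discounted cumulative gain), the comparison works by converting each judged result into a numeric gain and discounting it by its rank. The sketch below uses a common formulation, gain (2^g - 1) discounted by log2(rank + 1). The numeric mapping from the five labels to gains is an assumption made for illustration; the memorandum does not state the values SearchMe used.

```python
import math

# Assumed gains for the 5-point scale, best to worst. These numbers are
# illustrative and are not taken from the source.
GAIN = {
    "Expected": 4,
    "Comprehensive": 3,
    "Acceptable": 2,
    "Marginal": 1,
    "Bad": 0,
}


def dcg(labels):
    """Discounted cumulative gain of a ranked list of judgment labels."""
    return sum(
        (2 ** GAIN[label] - 1) / math.log2(rank + 1)
        for rank, label in enumerate(labels, start=1)
    )


ours = ["Expected", "Acceptable", "Bad"]
theirs = ["Acceptable", "Expected", "Bad"]
print(dcg(ours) > dcg(theirs))
```

Scoring the same judged queries against two engines' result lists gives a like-for-like relevance comparison, which is the use the paragraph above describes for the competitor result lists.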
The CHOCO system is one of the strongest differentiators for the SearchMe system - being able to
classify about 1000 algorithmic categories for each and every web page in our index in fractional seconds
each. The CHOCO system was built up over time and each of the about 1000 algorithmic categories has
multiple (ranging from about 40 to 120) positive labeled training examples as well as about 50-100
Like some other big search engines, we needed a way to rapidly insert arbitrary URLs, and provide
tagging of them to allow tracking (potentially for payment). The PI system includes powerful feed
processing that is able to take .tsv (standard formatted) files and automatically insert these into a docstore.
Each production Index Server (IS) required a "built index", which the Index Builder (IB) produces. The IB
streams documents from the docstore and produces an inverted index, along with many other fields required
for the resultant XML (media-fields), and aggregation data. The IB can build an 8.5M-document index in
under a day.
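As background, an inverted index maps each term to the documents that contain it, so a query can be answered by intersecting posting lists instead of scanning documents. The toy sketch below shows only that core idea; the function names are hypothetical, and it omits everything else the Index Builder is described as producing (media fields, aggregation data, ranking features).

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = [(1, "Saturn the planet"), (2, "Saturn cars"), (3, "Planet Earth")]
index = build_inverted_index(docs)
print(search(index, "saturn planet"))
```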
Our engine primarily focused on using URLs discovered as part of the webmap process - an expensive,
time-consuming stage. To speed inclusion of new content (e.g., News, special-interest feeds, etc.), we
built the Paid Inclusion System (N) - which can take a .tsv feed and insert it into the docstore (a separate
production process runs the IB). The El Rapido system takes as input a list of RSS feeds, fetches the
results and produces a .tsv file for the PI system to insert into the docstore.
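The RSS-to-.tsv step can be sketched as follows. This is an assumption-laden illustration rather than El Rapido itself: the column layout (URL, title, tag), the function name `rss_to_tsv`, and the sample feed are invented, and fetching the feeds over the network is left out so the example stays self-contained.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rss_to_tsv(rss_xml, tag):
    """Convert RSS 2.0 items into TSV rows of (url, title, tag).

    The column layout is hypothetical; items without a link are skipped.
    """
    root = ET.fromstring(rss_xml)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if url:
            writer.writerow([url, title, tag])
    return out.getvalue()


SAMPLE_FEED = """<rss version="2.0"><channel><title>News</title>
<item><title>First story</title><link>http://news.example/1</link></item>
<item><title>No link</title></item>
</channel></rss>"""

print(rss_to_tsv(SAMPLE_FEED, "news"), end="")
```

The `tag` column stands in for the kind of tracking label the Paid Inclusion description mentions; the real feed format may carry different or additional fields.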
We built a custom system that can connect to Hulu (implementing their API) and fetch the 'changes'.
The results are then used to generate a .tsv which contains only the 'active content' and the appropriate
MLR related fields.
We also include pages from YouTube, Imeem and Flickr - we have a version of the El Rapido feed
generation which is generic - there are two halves, one which takes RSS inputs and produces an
'intermediate file', and the other tool takes the 'intermediate file' to produce a feed. This way any custom
tool could be used to connect with proprietary APIs.