The general idea is that search is only useful once you have fetched the data
and preprocessed it for the indexer. That is where the real work is, and we
keep reinventing the wheel for each new job.
<item>
<title>My title</title>
<text>My text here</text>
</item>
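As an illustration of how such an item might be read into named fields, here is a minimal sketch using only the JDK's DOM parser. The class name ItemXmlSketch and the method readFields are hypothetical and are not part of OpenPipeline; the slides show only the XML shape.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ItemXmlSketch {
    // Hypothetical helper: maps each child element of <item> to name -> text.
    static Map<String, String> readFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> fields = new LinkedHashMap<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes between the elements.
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                fields.put(child.getNodeName(), child.getTextContent());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><title>My title</title><text>My text here</text></item>";
        System.out.println(readFields(xml));
    }
}
```

The field order follows the document order because the sketch collects into a LinkedHashMap.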
A CORE CONCEPT: THE ITEM CLASS (continued)
Items carry actions such as “delete” or “update”, which makes it possible to
keep a search index in sync with a data source.
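To make the idea of item actions concrete, the following sketch applies a batch of items to an in-memory map that stands in for a search index. Everything here is an assumption made for illustration: the Action enum, the Item class, and the apply method are hypothetical names, not the OpenPipeline API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ItemActionSketch {
    // Hypothetical action set; the slides name only "delete" and "update".
    enum Action { UPDATE, DELETE }

    // Hypothetical stand-in for an item: an id, an action, and the text to index.
    static final class Item {
        final String id;
        final Action action;
        final String text;

        Item(String id, Action action, String text) {
            this.id = id;
            this.action = action;
            this.text = text;
        }
    }

    // Applies each item's action to a map that plays the role of the index.
    static Map<String, String> apply(Map<String, String> index, List<Item> items) {
        for (Item item : items) {
            switch (item.action) {
                case UPDATE:
                    index.put(item.id, item.text);   // add or overwrite the document
                    break;
                case DELETE:
                    index.remove(item.id);           // drop the document from the index
                    break;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("doc-1", "old text");
        index.put("doc-2", "stale document");
        apply(index, List.of(
                new Item("doc-1", Action.UPDATE, "My text here"),
                new Item("doc-2", Action.DELETE, null)));
        System.out.println(index);
    }
}
```

After the batch runs, doc-1 holds the new text and doc-2 is gone, so the index mirrors the current state of the source.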
IMPLEMENTING PLUGINS, GENERALLY
Plugins
Built-in: File Scanner, Web Crawler, SQL Crawler
Commercial: SharePoint, Exchange, Documentum, Vignette, Day (JCR),
Interwoven, Lotus Notes/Domino, portals from IBM/WebSphere, SAP,
etc.
DocFilters
Built-in: HTML, plain text, PDF (basic)
Commercial: MS Office, PDF (enhanced), others
Stages
Built-in: Simple tokenizer, regex extractor, router
Commercial: Enhanced tokenizer, entity extraction, language detection, etc.
PARALLEL PROCESSING
Connector Consolidation
Keep it simple
Straightforward, elegant design. Minimize external dependencies. Really easy
plug-in implementation.
Useful out-of-the-box
Both an application and a framework
OpenPipeline is broader
It encompasses crawlers, docfilters, and a means of routing documents. It does not do
text analytics directly; you'll normally implement that yourself in a stage or use a
third-party plugin.
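A rough sketch of what "implementing it yourself in a stage" could look like follows. The Stage interface, the EmailExtractor class, and the run method are hypothetical and invented for this example; the real OpenPipeline stage contract may differ. The sketch chains two stages over an item's fields and uses a regular expression as a simple stand-in for text analytics.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StageSketch {
    // Hypothetical stage contract: take an item's fields, return them (possibly enriched).
    interface Stage {
        Map<String, String> process(Map<String, String> item);
    }

    // Hypothetical regex extractor: copies every e-mail-like match in "text"
    // into a new comma-separated "emails" field.
    static class EmailExtractor implements Stage {
        private static final Pattern EMAIL =
                Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

        @Override
        public Map<String, String> process(Map<String, String> item) {
            List<String> found = new ArrayList<>();
            Matcher m = EMAIL.matcher(item.getOrDefault("text", ""));
            while (m.find()) {
                found.add(m.group());
            }
            Map<String, String> out = new LinkedHashMap<>(item);
            out.put("emails", String.join(",", found));
            return out;
        }
    }

    // Runs the stages in order, feeding each stage the previous stage's output.
    static Map<String, String> run(List<Stage> stages, Map<String, String> item) {
        Map<String, String> current = item;
        for (Stage stage : stages) {
            current = stage.process(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // A second, trivial stage that lowercases the text before extraction.
        Stage lowercase = item -> {
            Map<String, String> out = new LinkedHashMap<>(item);
            out.put("text", item.getOrDefault("text", "").toLowerCase());
            return out;
        };
        Map<String, String> item = new LinkedHashMap<>();
        item.put("title", "My title");
        item.put("text", "Contact Ann@Example.com or bob@example.org.");
        System.out.println(run(List.of(lowercase, new EmailExtractor()), item));
    }
}
```

Each stage returns a copy rather than mutating its input, which keeps stages independent of one another and easy to reorder.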