
OpenPipeline: Open source software for crawling, parsing, analyzing, and routing documents


Dieselpoint, Inc.

Copyright 2008, Dieselpoint, Inc.


WELCOME TO OPENPIPELINE

OpenPipeline is open source software for crawling, parsing, analyzing, and routing documents.
It has been said that if web search is exciting, enterprise search is its boring
cousin. An even more boring cousin would be document preparation for
search: all the things you need to do to get a document and standardize it
before indexing.

A better title for this presentation would be:


“All the boring stuff you need to do to get enterprise search to work”
A STANDARD INTERFACE

The general idea is that search is only useful once you get the data and
preprocess it before sending it to the indexer. This is where the real work
comes in. We keep reinventing the wheel for each new job.

Why not have an open framework that standardizes integration across all the pieces?
All the major vendors have some form of “pipeline” for document
preprocessing. At Dieselpoint, we have one as well. We were finding it difficult
to integrate our proprietary pipeline framework with third-party connectors and
document analyzers.

So we decided to open source our version of the document pipeline.


Bottom line: This industry has an interoperability problem. We speak of
breaking down corporate silos, only to create more of our own.
THE BIG PICTURE
A DAY IN THE LIFE OF OPENPIPELINE

Acme Corp wants a universal search engine. They need to:

Crawl intranet, with sensitive HR documents

Crawl website, with products, in three languages

Crawl engineering docs in a content system

Crawl CRM records with customer comments


ADDING A CONNECTOR

You start by picking a connector off a menu…


FILE SCANNER

You fill in some crawler parameters…


DOC FILTERS

You might want to alter how the document filters work…


STAGES

You configure the stages in the processing of each document…


A “stage” is a step in the pipeline. Stages modify Items in some manner.
Example stages include tokenizers, stemmers / lemmatizers, entity
recognizers, sentiment analyzers, part-of-speech detectors, etc.
SCHEDULE THE JOB

Pick a time to run the job…


There are a variety of options, including cron expressions, which allow you to
define a trigger like “every day at 4:00 AM except weekends”.
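As a sketch of what such a trigger means in practice, here is a small, self-contained Java method (the class and method names are mine, not part of OpenPipeline) that computes the next "weekday at 4:00 AM" firing time. In Quartz-style cron syntax the same trigger would be written roughly as `0 0 4 ? * MON-FRI`, though the exact cron dialect OpenPipeline accepts is not specified here.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;

public class NextRun {

    // Returns the next 4:00 AM strictly after 'from' that falls on a
    // weekday, i.e. the trigger "every day at 4:00 AM except weekends".
    static LocalDateTime nextWeekday4am(LocalDateTime from) {
        LocalDateTime candidate = from.toLocalDate().atTime(4, 0);
        if (!candidate.isAfter(from)) {
            candidate = candidate.plusDays(1);
        }
        // Skip Saturday and Sunday.
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Friday noon -> next firing is Monday 4:00 AM.
        System.out.println(nextWeekday4am(LocalDateTime.of(2008, 6, 6, 12, 0)));
        // prints 2008-06-09T04:00
    }
}
```

A real scheduler would of course parse the cron expression rather than hard-code the rule; this only illustrates the semantics of the example trigger.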
SCHEDULER

And watch it run…


The scheduler is a general control panel for all the jobs in the system. It fires
connectors on schedule, and provides controls for starting, stopping, and checking status.
This is a generalized job management system; at Dieselpoint we’re using it for
managing indexing, index optimization, building recommender models,
processing queries for reports, and anything else that needs to run on a
schedule.
LOGGING

The system provides a general facility for viewing and managing logs.
APP SERVER

The entire assembly runs in a standard J2EE app server. It ships with Jetty,
but runs fine in WebSphere, BEA, Tomcat, and others. The admin app is a
plain vanilla webapp, built using POJOs.
A CORE CONCEPT: THE ITEM CLASS

Connectors, filters, and stages operate on Items. An Item is the OpenPipeline
conception of a document. It looks a lot like an XML Document Object Model
(DOM), but simpler. We took the best elements from DOM, JDOM, dom4j,
XOM, SAX, Stax, and boiled them down into a straightforward API, with (we
hope) no loss of generality. One benefit of this approach is that it handles
structured as well as unstructured data.

<item>
<title>My title</title>
<text>My text here</text>
</item>
A CORE CONCEPT: THE ITEM CLASS (continued)

Items also carry annotations. An annotation can be a token (word), an entity
that should be tagged, or really any property of the document that should be
named and processed separately.

<annotation type="token" offset="100" len="6" pos="verb"/>
<annotation type="sentiment" value="positive"/>
<annotation type="person" offset="50" len="12" standardized="Bob Smith"/>

Items carry actions like “delete” or “update”. This can be really useful for
keeping a search index in sync with a data source.
IMPLEMENTING PLUGINS, GENERALLY

To implement a plugin, just implement the appropriate interface:


org.openpipeline.pipeline.connector.Connector
org.openpipeline.pipeline.docfilter.DocFilter
org.openpipeline.pipeline.stage.Stage

A plugin is just a jar file. Drop it on the classpath and restart the server. If you
construct the jar file properly, the system will autodiscover the plugin and put it
in the UI. It’s also available through the API:
ConnectorFactory.getConnector()
DocFilterFactory.getFilter()
StageFactory.getStage()
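To make the plugin pattern concrete, here is a self-contained Java sketch that mirrors, but does not reproduce, the Stage interface. The real interface lives in org.openpipeline.pipeline.stage.Stage; its exact method signatures are not shown in this deck, so the Item, Annotation, and Stage types below are illustrative stand-ins of my own.

```java
import java.util.ArrayList;
import java.util.List;

public class StageSketch {

    // Hypothetical stand-in for an annotation on an Item (type + span).
    static class Annotation {
        final String type;
        final int offset;
        final int len;
        Annotation(String type, int offset, int len) {
            this.type = type;
            this.offset = offset;
            this.len = len;
        }
    }

    // Hypothetical stand-in for the Item class: text plus annotations.
    static class Item {
        String text;
        final List<Annotation> annotations = new ArrayList<>();
        Item(String text) { this.text = text; }
    }

    // Hypothetical stand-in for org.openpipeline.pipeline.stage.Stage:
    // one method that receives an Item and modifies it in place.
    interface Stage {
        void processItem(Item item);
    }

    // A trivial whitespace tokenizer stage: adds a "token" annotation
    // for each run of non-whitespace characters in the text.
    static class TokenizerStage implements Stage {
        public void processItem(Item item) {
            String s = item.text;
            int i = 0;
            while (i < s.length()) {
                while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
                int start = i;
                while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
                if (i > start) {
                    item.annotations.add(new Annotation("token", start, i - start));
                }
            }
        }
    }

    public static void main(String[] args) {
        Item item = new Item("My text here");
        new TokenizerStage().processItem(item);
        System.out.println(item.annotations.size()); // prints 3
    }
}
```

Stemmers, entity recognizers, and sentiment analyzers follow the same shape: read the Item, attach or modify annotations, and let the pipeline pass the Item to the next stage.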
AVAILABLE PLUGINS, DOC FILTERS AND STAGES

Plugins
Built-in: File Scanner, Web Crawler, SQL Crawler
Commercial: Sharepoint, Exchange, Documentum, Vignette, Day (JCR),
Interwoven, Lotus Notes/Domino, portals from IBM/WebSphere, SAP,
etc.

DocFilters
Built-in: HTML, plain text, PDF (basic)
Commercial: MS Office, PDF (enhanced), others

Stages
Built-in: Simple tokenizer, regex extractor, router
Commercial: Enhanced tokenizer, entity extraction, language detection, etc.
PARALLEL PROCESSING

Document processing can be centralized or parallelized as needed.

The transport mechanism is simple: web-services XML over HTTP. It’s also
possible to use RSS/Atom feeds.
[Diagrams: three deployment topologies, each built from Connector → Analytics Server → Search Server → Client Browser chains: Parallelization (multiple chains running in parallel), Consolidation (several connectors feeding a single analytics server and search server), and Replication (connectors feeding replicated search servers).]
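For flavor, here is what serializing an Item to the wire XML might look like, using the &lt;item&gt; shape shown on the Item slide. The helper below is mine, not OpenPipeline API; a real deployment would POST this payload over HTTP to the next server in the chain.

```java
public class WireFormat {

    // Builds the <item> XML payload (the format shown on the Item slide).
    // Illustrative only; the real envelope and endpoint are not specified
    // in this deck.
    static String toXml(String title, String text) {
        return "<item>"
             + "<title>" + escape(title) + "</title>"
             + "<text>" + escape(text) + "</text>"
             + "</item>";
    }

    // Minimal XML escaping for element content. '&' must be replaced
    // first so already-escaped entities are not double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toXml("My title", "My text here"));
        // prints <item><title>My title</title><text>My text here</text></item>
    }
}
```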
DEVELOPMENT PHILOSOPHY

Keep it simple
Straightforward, elegant design. Minimize external dependencies. Really easy
plug-in implementation.

Useful out-of-the-box
Both an application and a framework

Design it for massive scalability


Fully distributed

Almost zero-install for everything, including plugins


100% Java
THE COMMERCIAL JUSTIFICATION

Why open source a key technology? Dieselpoint isn’t a charity.

- By bringing in other companies, it allows us to take on much bigger jobs, and
it solves the integration problem.
- We hope that some fraction of those who use the software will use our
search engine.

Why would third parties participate?


- Makes standalone plugins much more valuable
- The OpenPipeline website provides a new channel for search vendors,
plugin vendors, and consultants
- It’s free. There is no downside.

This is a way of getting high-quality software, commercially sold and
supported, while solving the interoperability problem.
THE CURRENT ADVISORY BOARD

Our Advisory Board is in formation. At the moment, we have members from:

Enterprise search companies


Connector companies
Text analytics companies
Consultancies
A COMMON QUESTION: HOW DOES OPENPIPELINE
DIFFER FROM UIMA?
OpenPipeline and UIMA work together nicely

UIMA stands for “Unstructured Information Management Architecture”. It's a
product from IBM currently in incubation at Apache. It's a framework for doing text
analytics on unstructured documents. You build annotators and analysis engines to
discover structure and semantics within a document.

OpenPipeline is broader
It encompasses crawlers, docfilters, and a means of routing documents. It does not do
text analytics directly; you'll normally implement that yourself in a stage, or use a
third-party plugin.

A UIMA annotator can be a plugin within the pipeline


OpenPipeline makes it easy to feed documents to a UIMA annotator and then do
something with the results. UIMA is useful for companies that have deeper needs in text
analytics.
A GENERAL PLATFORM

OpenPipeline is a good platform for text processing generally. At Dieselpoint,
we’ve built our core enterprise search product on it.
CONTACT

Chris Cleveland – ccleveland@dieselpoint.com


http://www.openpipeline.org
