
OpenPipeline: Open source software for crawling, parsing, analyzing, and routing documents


Dieselpoint, Inc.

Copyright 2008, Dieselpoint, Inc.


WELCOME TO OPENPIPELINE

OpenPipeline is open source software for crawling, parsing, analyzing, and routing documents.
It has been said that if web search is exciting, enterprise search is its boring
cousin. An even more boring cousin would be document preparation for
search: all the things you need to do to get a document and standardize it
before indexing.

A better title for this presentation would be:


“All the boring stuff you need to do to get enterprise search to work”
A STANDARD INTERFACE

The general idea is that search is only useful once you get the data and
preprocess it before sending it to the indexer. This is where the real work
comes in. We keep reinventing the wheel for each new job.

Why not have an open framework that standardizes integration across all the pieces?
All the major vendors have some form of “pipeline” for document
preprocessing. At Dieselpoint, we have one as well. We were finding it difficult
to integrate our proprietary pipeline framework with third-party connectors and
document analyzers.

So we decided to open source our version of the document pipeline.


Bottom line: This industry has an interoperability problem. We speak of
breaking down corporate silos, only to create more of our own.
THE BIG PICTURE
A DAY IN THE LIFE OF OPENPIPELINE

Acme Corp wants a universal search engine. They need to:

Crawl intranet, with sensitive HR documents

Crawl website, with products, in three languages

Crawl engineering docs in a content system

Crawl CRM records with customer comments


ADDING A CONNECTOR

You start by picking a connector off a menu…


FILE SCANNER

You fill in some crawler parameters…


DOC FILTERS

You might want to alter how the document filters work…


STAGES

You configure the stages in the processing of each document…


A “stage” is a step in the pipeline. Stages modify Items in some manner.
Example stages include tokenizers, stemmers / lemmatizers, entity
recognizers, sentiment analyzers, part-of-speech detectors, etc.
SCHEDULE THE JOB

Pick a time to run the job…


There are a variety of options, including cron expressions, which allow you to
define a trigger like “every day at 4:00 AM except weekends”.
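As a sketch of what such a trigger means in practice, here is a small, self-contained Java method (the class and method names are mine, not part of OpenPipeline) that computes the next "weekday at 4:00 AM" firing time. In Quartz-style cron syntax the same trigger would be written roughly as `0 0 4 ? * MON-FRI`, though the exact cron dialect OpenPipeline accepts is not specified here.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;

public class NextRun {

    // Returns the next 4:00 AM strictly after 'from' that falls on a
    // weekday, i.e. the trigger "every day at 4:00 AM except weekends".
    static LocalDateTime nextWeekday4am(LocalDateTime from) {
        LocalDateTime candidate = from.toLocalDate().atTime(4, 0);
        if (!candidate.isAfter(from)) {
            candidate = candidate.plusDays(1);
        }
        // Skip Saturday and Sunday.
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Friday noon -> next firing is Monday 4:00 AM.
        System.out.println(nextWeekday4am(LocalDateTime.of(2008, 6, 6, 12, 0)));
        // prints 2008-06-09T04:00
    }
}
```

A real scheduler would of course parse the cron expression rather than hard-code the rule; this only illustrates the semantics of the example trigger.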
SCHEDULER

And watch it run…


The scheduler is a general control panel for all the jobs in the system. It fires
connectors on schedule, and provides controls for starting, stopping, and checking status.
This is a generalized job management system; at Dieselpoint we’re using it for
managing indexing, index optimization, building recommender models,
processing queries for reports, and anything else that needs to run on a
schedule.
LOGGING

The system provides a general facility for viewing and managing logs.
APP SERVER

The entire assembly runs in a standard J2EE app server. It ships with Jetty,
but runs fine in WebSphere, BEA, Tomcat, and others. The admin app is a
plain vanilla webapp, built using POJOs.
A CORE CONCEPT: THE ITEM CLASS

Connectors, filters, and stages operate on Items. An Item is the OpenPipeline
conception of a document. It looks a lot like an XML Document Object Model
(DOM), but simpler. We took the best elements from DOM, JDOM, dom4j,
XOM, SAX, Stax, and boiled them down into a straightforward API, with (we
hope) no loss of generality. One benefit of this approach is that it handles
structured as well as unstructured data.

<item>
<title>My title</title>
<text>My text here</text>
</item>
A CORE CONCEPT: THE ITEM CLASS (continued)

Items also carry annotations. An annotation can be a token (word), an entity
that should be tagged, or really any property of the document that should be
named and processed separately.

<annotation type="token" offset="100" len="6" pos="verb"/>
<annotation type="sentiment" value="positive"/>
<annotation type="person" offset="50" len="12" standardized="Bob Smith"/>

Items carry actions like “delete” or “update”. This can be really useful for
keeping a search index in sync with a data source.
IMPLEMENTING PLUGINS, GENERALLY

To implement a plugin, just implement the appropriate interface:


org.openpipeline.pipeline.connector.Connector
org.openpipeline.pipeline.docfilter.DocFilter
org.openpipeline.pipeline.stage.Stage

A plugin is just a jar file. Drop it on the classpath and restart the server. If you
construct the jar file properly, the system will autodiscover the plugin and put it
in the UI. It’s also available through the API:
ConnectorFactory.getConnector()
DocFilterFactory.getFilter()
StageFactory.getStage()
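To make the plugin pattern concrete, here is a self-contained Java sketch that mirrors, but does not reproduce, the Stage interface. The real interface lives in org.openpipeline.pipeline.stage.Stage; its exact method signatures are not shown in this deck, so the Item, Annotation, and Stage types below are illustrative stand-ins of my own.

```java
import java.util.ArrayList;
import java.util.List;

public class StageSketch {

    // Hypothetical stand-in for an annotation on an Item (type + span).
    static class Annotation {
        final String type;
        final int offset;
        final int len;
        Annotation(String type, int offset, int len) {
            this.type = type;
            this.offset = offset;
            this.len = len;
        }
    }

    // Hypothetical stand-in for the Item class: text plus annotations.
    static class Item {
        String text;
        final List<Annotation> annotations = new ArrayList<>();
        Item(String text) { this.text = text; }
    }

    // Hypothetical stand-in for org.openpipeline.pipeline.stage.Stage:
    // one method that receives an Item and modifies it in place.
    interface Stage {
        void processItem(Item item);
    }

    // A trivial whitespace tokenizer stage: adds a "token" annotation
    // for each run of non-whitespace characters in the text.
    static class TokenizerStage implements Stage {
        public void processItem(Item item) {
            String s = item.text;
            int i = 0;
            while (i < s.length()) {
                while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
                int start = i;
                while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
                if (i > start) {
                    item.annotations.add(new Annotation("token", start, i - start));
                }
            }
        }
    }

    public static void main(String[] args) {
        Item item = new Item("My text here");
        new TokenizerStage().processItem(item);
        System.out.println(item.annotations.size()); // prints 3
    }
}
```

Stemmers, entity recognizers, and sentiment analyzers follow the same shape: read the Item, attach or modify annotations, and let the pipeline pass the Item to the next stage.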
AVAILABLE PLUGINS, DOC FILTERS AND STAGES

Plugins
Built-in: File Scanner, Web Crawler, SQL Crawler
Commercial: Sharepoint, Exchange, Documentum, Vignette, Day (JCR),
Interwoven, Lotus Notes/Domino, portals from IBM/WebSphere, SAP,
etc.

DocFilters
Built-in: HTML, plain text, PDF (basic)
Commercial: MS Office, PDF (enhanced), others

Stages
Built-in: Simple tokenizer, regex extractor, router
Commercial: Enhanced tokenizer, entity extraction, language detection, etc.
PARALLEL PROCESSING

Document processing can be centralized or parallelized as needed.

The transport mechanism is simple: web-services XML over HTTP. It’s also
possible to use RSS/Atom feeds.
[Diagrams: three deployment topologies, each built from Connector → Analytics Server → Search Server → Client Browser chains: Parallelization (multiple chains running in parallel), Consolidation (several connectors feeding a single analytics server and search server), and Replication (connectors feeding replicated search servers).]
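For flavor, here is what serializing an Item to the wire XML might look like, using the &lt;item&gt; shape shown on the Item slide. The helper below is mine, not OpenPipeline API; a real deployment would POST this payload over HTTP to the next server in the chain.

```java
public class WireFormat {

    // Builds the <item> XML payload (the format shown on the Item slide).
    // Illustrative only; the real envelope and endpoint are not specified
    // in this deck.
    static String toXml(String title, String text) {
        return "<item>"
             + "<title>" + escape(title) + "</title>"
             + "<text>" + escape(text) + "</text>"
             + "</item>";
    }

    // Minimal XML escaping for element content. '&' must be replaced
    // first so already-escaped entities are not double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toXml("My title", "My text here"));
        // prints <item><title>My title</title><text>My text here</text></item>
    }
}
```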
DEVELOPMENT PHILOSOPHY

Keep it simple
Straightforward, elegant design. Minimize external dependencies. Really easy
plug-in implementation.

Useful out-of-the-box
Both an application and a framework

Design it for massive scalability


Fully distributed

Almost zero-install for everything, including plugins


100% Java
THE COMMERCIAL JUSTIFICATION

Why open source a key technology? Dieselpoint isn’t a charity.

- By bringing in other companies, it allows us to take on much bigger jobs, and
it solves the integration problem.
- We hope that some fraction of those who use the software will use our
search engine.

Why would third parties participate?


- Makes standalone plugins much more valuable
- The OpenPipeline website provides a new channel for search vendors,
plugin vendors, and consultants
- It’s free. There is no downside.

This is a way of getting high-quality software, commercially sold and
supported, while solving the interoperability problem.
THE CURRENT ADVISORY BOARD

Our Advisory Board is in formation. At the moment, we have members from:

Enterprise search companies


Connector companies
Text analytics companies
Consultancies
A COMMON QUESTION: HOW DOES OPENPIPELINE
DIFFER FROM UIMA?
OpenPipeline and UIMA work together nicely

UIMA stands for “Unstructured Information Management Architecture”. It's a
product from IBM currently in incubation at Apache. It's a framework for doing text
analytics on unstructured documents. You build annotators and analysis engines to
discover structure and semantics within a document.

OpenPipeline is broader
It encompasses crawlers, docfilters, and a means of routing documents. It does not do
text analytics directly; you'll normally implement that yourself in a stage, or use a
third-party plugin.

A UIMA annotator can be a plugin within the pipeline


OpenPipeline makes it easy to feed documents to a UIMA annotator and then do
something with the results. UIMA is useful for companies that have deeper needs in text
analytics.
A GENERAL PLATFORM

OpenPipeline is a good platform for text processing generally. At Dieselpoint,
we’ve built our core enterprise search product on it.
CONTACT

Chris Cleveland – ccleveland@dieselpoint.com


http://www.openpipeline.org
