
FAST Enterprise Search Platform

version:5.3.SP1

BrowserEngine

Document Number: ESP1046, Document Revision: A, February 04, 2009


Copyright
Copyright © 1997-2009 by Fast Search & Transfer ASA (“FAST”). Some portions may be copyrighted
by FAST’s licensors. All rights reserved. The documentation is protected by the copyright laws of Norway,
the United States, and other countries and international treaties. No copyright notices may be removed
from the documentation. No part of this document may be reproduced, modified, copied, stored in a
retrieval system, or transmitted in any form or any means, electronic or mechanical, including
photocopying and recording, for any purpose other than the purchaser’s use, without the written
permission of FAST. Information in this documentation is subject to change without notice. The software
described in this document is furnished under a license agreement and may be used only in accordance
with the terms of the agreement.

Trademarks
FAST ESP, the FAST logos, FAST Personal Search, FAST mSearch, FAST InStream, FAST AdVisor,
FAST Marketrac, FAST ProPublish, FAST Sentimeter, FAST Scope Search, FAST Live Analytics, FAST
Contextual Insight, FAST Dynamic Merchandising, FAST SDA, FAST MetaWeb, FAST InPerspective,
GetSmart, NXT, LivePublish, Folio, FAST Unity, FAST Radar, RetrievalWare, AdMomentum, and all
other FAST product names contained herein are either registered trademarks or trademarks of Fast
Search & Transfer ASA in Norway, the United States and/or other countries. All rights reserved. This
documentation is published in the United States and/or other countries.
Sun, Sun Microsystems, the Sun Logo, all SPARC trademarks, Java, and Solaris are trademarks or
registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
Netscape is a registered trademark of Netscape Communications Corporation in the United States and
other countries.
Microsoft, Windows, Visual Basic, and Internet Explorer are either registered trademarks or trademarks
of Microsoft Corporation in the United States and/or other countries.
Red Hat is a registered trademark of Red Hat, Inc.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
AIX and IBM Classes for Unicode are registered trademarks or trademarks of International Business
Machines Corporation in the United States, other countries, or both.
HP and the names of HP products referenced herein are either registered trademarks or service marks,
or trademarks or service marks, of Hewlett-Packard Company in the United States and/or other countries.
Remedy is a registered trademark, and Magic is a trademark, of BMC Software, Inc. in the United States
and/or other countries.
XML Parser is a trademark of The Apache Software Foundation.
All other company, product, and service names are the property of their respective holders and may be
registered trademarks or trademarks in the United States and/or other countries.

Restricted Rights Legend


The documentation and accompanying software are provided to the U.S. government in a transaction
subject to the Federal Acquisition Regulations with Restricted Rights. Use, duplication, or disclosure of
the documentation and software by the government is subject to restrictions as set forth in FAR 52.227-19
Commercial Computer Software-Restricted Rights (June 1987).
Contact Us

Web Site
Please visit us at: http://www.fastsearch.com/

Contacting FAST
FAST
Cutler Lake Corporate Center
117 Kendrick Street, Suite 100
Needham, MA 02492 USA
Tel: +1 (781) 304-2400 (8:30am - 5:30pm EST)
Fax: +1 (781) 304-2410

Technical Support and Licensing Procedures


Technical support for customers with active FAST Maintenance and Support agreements, e-mail:
tech-support@fastsearch.com
For obtaining FAST licenses or software, contact your FAST Account Manager or e-mail:
customerservice@fastsearch.com
For evaluations, contact your FAST Sales Representative or FAST Sales Engineer.

Product Training
E-mail: fastuniv@microsoft.com
To access the FAST University Learning Portal, go to: http://www.fastuniversity.com/

Sales
E-mail: sales@fastsearch.com
Contents

Preface..................................................................................................ii
Copyright..................................................................................................................................ii
Contact Us...............................................................................................................................iii

Chapter 1: About BrowserEngine.......................................................7


About the BrowserEngine.........................................................................................................8
Architecture..............................................................................................................................8

Chapter 2: Configuring the BrowserEngine....................................11


Enterprise Crawler considerations.........................................................................................12
Configuration via XML File.....................................................................................................12
Modifying BrowserEngine server settings...................................................................12
Setting browser attributes............................................................................................13
Configuring the extractor pipeline................................................................................15
Flash settings..............................................................................................................19
Example.......................................................................................................................19

Chapter 3: Operating the BrowserEngine........................................21


Starting and Stopping.............................................................................................................22
Starting from the administrator interface.....................................................................22
Stopping from the administrator interface....................................................................22
Starting from the command line..................................................................................22
Stopping from the command line.................................................................................22
Logging...................................................................................................................................22
Change the BrowserEngine logging............................................................................22
Monitoring...............................................................................................................................23
Tuning.....................................................................................................................................23
Restrictions.............................................................................................................................24

Chapter 4: BrowserEngine reference information..........................25


BrowserEngine binary............................................................................................................26
XML-RPC Browser Interface..................................................................................................26
XML-RPC Status Interface.....................................................................................................27
Extractor processing examples..............................................................................................28

Chapter

1
About BrowserEngine
Topics: The BrowserEngine is a highly scalable and configurable component that extracts
links and text from JavaScript and Adobe Flash files. The BrowserEngine is used
• About the BrowserEngine by the FAST Enterprise Crawler and may be called from the Document Processing
pipeline.
• Architecture

About the BrowserEngine


The BrowserEngine is a highly scalable and configurable component that extracts links and text from JavaScript
and Adobe Flash files. The BrowserEngine is used by the FAST Enterprise Crawler (EC) and can also be
used from the Document Processing pipeline.
The BrowserEngine is a new component that replaces functionality previously available only to the Enterprise
Crawler. It is intended to provide superior web page content, through the following new features:
• Improved Document Object Model (DOM) coverage
• Cookie extraction
• Frame support
• Extensibility and customization
• Scalable architecture
• Link and metadata extraction from Flash
The new BrowserEngine enables more links to be extracted, widening the scope of a crawl, and improves document content, enhancing index quality. In addition, customers and partners can modify the
behavior of the engine according to individual needs. Because more thorough emulation of a browser
environment requires additional system resources, the design allows the crawler to take advantage of multiple
BrowserEngines (on one or more hosts) in order to distribute the load and scale the number of pages
processed.

Architecture
The BrowserEngine is a stand-alone ESP component, capable of processing HTML documents containing
JavaScript, as well as Flash files. It accomplishes this by emulating a browser's internal environment, without the
need for a display.
The BrowserEngine is implemented in Java and runs as a separate process. This provides isolation from
other components (in particular, from the Enterprise Crawler), in the case of a fatal error. This design also
allows a component to use multiple BrowserEngines, or multiple components can use the same BrowserEngine.
The following diagram illustrates the major functional modules within the BrowserEngine, and shows the
datapaths that will be referenced in the following discussion.

Figure 1: BrowserEngine Architecture


To give an overview of how the BrowserEngine works, consider the flow of an HTML page through the internal
processing. When the BrowserEngine receives a processing request, it assigns the task to a thread from its
pool of idle threads. If the file is a Flash binary content file, it is simply parsed for text and links and the result

returned. Otherwise, it is delivered to the JavaScript Handler. The first step is to run a user-definable page
preprocessor to initialize the DOM tree, before any processing of the page contents takes place. This allows
the BrowserEngine to simulate support for browser plug-ins such as Adobe Reader, Apple QuickTime or
Windows Media Player, and also permits initialization of settings such as User-Agent, or the screen size. The
page preprocessor is written in JavaScript, in order to provide quick and easy customization.
After the page preprocessor has initialized the DOM tree, the BrowserEngine parses the HTML document,
fetches external dependencies and populates the DOM tree with HTML elements. External dependencies,
such as scripts and frames, are looked up in a local dependency cache, or fetched indirectly via the
Enterprise Crawler, which acts as a caching proxy. The BrowserEngine can also fetch resources directly from the
network when used by components other than the crawler. The document is loaded just as a real browser would load it,
by executing scripts and onLoad handlers.
In addition to the page preprocessor, there is an optional script preprocessor that can modify the source code
of every snippet of JavaScript code before it is executed.
After the document is loaded, the constructed DOM tree is passed to a configurable pipeline of extractors.
The pipeline stages create a text representation of the HTML document, extract cookies, generate a document
checksum, simulate user interactions and extract links. This data and metadata is returned to the calling
component.

Chapter

2
Configuring the BrowserEngine
Topics: The BrowserEngine can run out of the box with FAST ESP. However, you may
want to change the preprocessors and/or the pipeline to fit your needs.
• Enterprise Crawler considerations
• Configuration via XML File

Enterprise Crawler considerations


The BrowserEngine does work on behalf of, and in conjunction with, the Enterprise Crawler, and that component
must be configured properly to make use of the BrowserEngine.
There are two requirements in configuring the Enterprise Crawler to make use of the BrowserEngine. The
first is that one of the following attributes must be enabled, by setting it to the value Yes:
• JavaScript support
• Macromedia Flash support
The Enterprise Crawler also needs to be configured with the location of all available BrowserEngines in the
FAST ESP installation. Normally this setup is done by the FAST ESP installation itself, as each BrowserEngine
is enabled. For details, see the section CrawlerGlobalDefaults.xml options in
the FAST Enterprise Crawler Guide.

Configuration via XML File


The BrowserEngine is configured with default settings that are appropriate for most FAST ESP installations.
You can change the configuration, including the preprocessors and the pipeline, to fit the needs of your
installation.
The BrowserEngine is configured through an XML file, located on the Config Server node at:
$FASTSEARCH/etc/config_data/BrowserEngine/BrowserConfig.xml

Changes made to this file, or any other files used by the BrowserEngine configuration, will not take effect
until the BrowserEngine is restarted.

Modifying BrowserEngine server settings


The BrowserEngine XML file includes a server tag that defines the port number range, and other attributes
used to tune the performance.

Parameter Description

port Base port number, which is used to listen for requests from the Enterprise Crawler.
Note: The BrowserEngine also uses port number "port+1". Both ports must be free.

maxThreads The number of BrowserEngine threads created to process documents. This attribute limits the
number of documents which can be processed concurrently. Note that setting this value too
high wastes CPU cycles on thread scheduling, lowering document throughput, and can cause
the BrowserEngine to run out of Java heap space. A better solution is to start multiple instances
of the BrowserEngine.
maxQueueSize The limit on requests that may be accepted and queued, waiting for an available processing
thread. If the queue becomes full, the BrowserEngine will deny further requests from the
Enterprise Crawler until processing threads become available.

Example:
<server maxThreads="100" maxQueueSize="100" port="50000"/>


Setting browser attributes


The browser tag in the BrowserEngine XML file includes general browser attributes, and cache, blacklist,
flash, and javascript sub-tags with corresponding attributes.

Browser Tag
Parameter Description

type Specifies the browser type to emulate. Legal values are:

• Mozilla
• InternetExplorer

allowPopups Specifies whether the BrowserEngine allows pop-ups (boolean).

useSSL Specifies if the BrowserEngine should use SSL when requesting external dependencies from
the Enterprise Crawler. The attribute should be set to false when used in a FAST ESP installation
with the crawler.
Note: This setting only affects the BrowserEngine's interactions with the Enterprise Crawler,
which may still use SSL to retrieve the dependency.

evaluationTimeout The maximum total time (in seconds) that may be spent processing a document, including
time spent waiting for external dependencies. Documents that exceed this limit are aborted
by the BrowserEngine. In this case the Enterprise Crawler will store
the original document and follow the links it finds.
terminateTimeout The terminateTimeout option sets the maximum time (in seconds) a thread can run before the
BrowserEngine is shut down. This prevents runaway threads, not properly
timed out by the evaluationTimeout mechanism, from hogging all system resources.

Example:
<browser type="Mozilla" allowPopups="false" useSSL="true" evaluationTimeout="3600">

Browser sub-tags
Within the browser tag, there are four configurable tags:
• cache
• blacklist
• flash
• javascript

Cache
Parameter Description

size Specifies the cache size in megabytes (MB). The cache improves the performance by reducing
the traffic between the BrowserEngine and the Enterprise Crawler whenever there are external
dependencies.
ttl The maximum time (in milliseconds) that a cache entry may exist in the cache. If the cache
becomes full, cache entries are removed in a Least Recently Used order.

Example:
<cache size="25" ttl="3600000"/>


Blacklist
Parameter Description

regexp value The blacklist tag contains a list of regular expressions used to exclude requests for external
dependencies. Before the BrowserEngine requests an external dependency, it checks if the
URI matches a regular expression. If there is a match, the request is not submitted, and the
BrowserEngine will continue to process the document without downloading the dependency. A
common usage is to block advertisements.

Example:

<blacklist>
<regexp value="as-us\.falkag\.net"/>
<regexp value="doubleclick\.net"/>
</blacklist>

JavaScript
Parameter Description

timeout Specifies the maximum time (in milliseconds) that the JavaScript engine is allowed to execute
a snippet of JavaScript code. If the timeout limit is reached the execution of the JavaScript code
will be aborted. This prevents the BrowserEngine from becoming stuck in endless loops.
scriptPreProcessor Specifies the URL or Java resource path to the script preprocessor JavaScript code.

pagePreProcessor Specifies the URL or Java resource path to the page preprocessor JavaScript code.

Example:

<javascript timeout="5000">
<pagePreProcessor src="/pagePreProcessor.js"/>
<scriptPreProcessor src="/scriptPreProcessor.js"/>
</javascript>

Specifying a customized page preprocessor


The page preprocessor is a regular text file containing JavaScript code. The purpose of the page preprocessor
is to initialize the DOM tree before document processing begins. This allows the BrowserEngine to simulate
support for browser plug-ins, such as Adobe Reader, and allows browser settings such as screen size to be
set.

1. Create or modify the page preprocessor file according to your needs, and save it to the directory containing
the BrowserEngine configuration file.
2. Edit the BrowserEngine configuration file to specify this page preprocessor.
3. Restart the BrowserEngine.

Example: A page preprocessor which emulates support for the Adobe Reader.

navigator.plugins = new Array();
navigator.plugins[0] = new Object();
navigator.plugins[0].name = "Adobe Reader 7.0";
navigator.plugins[0].description = "The Adobe Reader plug-in is used to enable viewing of PDF and FDF files from within the browser.";
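The same mechanism can initialize browser settings such as the User-Agent string and the screen size, as mentioned earlier. The sketch below is illustrative only: inside the BrowserEngine the navigator and screen objects are supplied by the engine; here they are created as plain stand-in objects so the snippet is self-contained, and the property values are assumptions.

```javascript
// Stand-ins for the host objects the BrowserEngine normally provides.
var navigator = { plugins: [] };
var screen = {};

// Pretend to be a common desktop browser with a typical screen size
// (illustrative values, not documented defaults).
navigator.userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1)";
screen.width = 1024;
screen.height = 768;
```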


Specifying a customized script preprocessor


The purpose of the script preprocessor is to modify JavaScript code before it is executed.
The script preprocessor is a text file containing JavaScript code that the BrowserEngine runs on each snippet
of the current document's JavaScript code before executing it, allowing the source code to be modified
first. A script preprocessor file must define a function that accepts four parameters:
• page
• sourceCode
• sourceName
• htmlElement
The last line of the script must return the output of that function. See the example below.

1. Create or modify the script preprocessor file according to your needs and save it to the directory containing
the BrowserEngine configuration file.
2. Edit the BrowserEngine configuration file to specify this script preprocessor.
3. Restart the BrowserEngine.

Script PreProcessor example: Returns the source code unmodified.

function scriptPreProcessor(page, sourceCode, sourceName, htmlElement) {
    return sourceCode;
}
scriptPreProcessor;

The four parameters to the script preprocessor are:

Parameter Description

page The HTML source page

sourceCode The snippet of JavaScript code to be executed

sourceName The script name

htmlElement The "this" object in a JavaScript context
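Building on the identity example above, a preprocessor that actually rewrites code might look like the following sketch. The choice to strip window.print() calls, and the regular expression used, are illustrative assumptions, not documented product behavior.

```javascript
// Hedged sketch: remove window.print() calls before execution, since
// they serve no purpose in a headless environment. The function
// signature matches the four required parameters.
function scriptPreProcessor(page, sourceCode, sourceName, htmlElement) {
    // Strip any "window.print();" call from the snippet's source code.
    return sourceCode.replace(/window\.print\s*\(\s*\)\s*;?/g, "");
}
scriptPreProcessor;
```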

Configuring the extractor pipeline


After document processing is completed, the page is sent through the extractor pipeline, which can be
customized to fit specific needs.

Pipeline tag
The extractor pipeline has four primary responsibilities:
• create the processed HTML document
• retrieve cookies
• create a checksum
• extract links
Additional functionality can also be included in the pipeline.
The configuration of the pipeline consists of parameters to control overall processing, and the list of extractors
to be run for each page.


Attribute Description

maxIterations Sets a limit on the number of times the pipeline is run.

obeyNoIndex Specifies whether the extractors should obey the HTML noindex meta tag or not (boolean).

abortOnFailure Specifies whether the pipeline should abort if an extractor in the pipeline fails, or whether the
BrowserEngine should return the partially processed document. If set to "true" and a document fails, the document
will not be stored by the Enterprise Crawler and none of the links will be followed. If set to "false"
the document will be stored, and the extracted links may be followed (depending on the crawl
collection configuration).

Example:
<pipeline maxIterations="1" obeyNoIndex="false"
abortOnFailure="false">

Pipeline sub-tags
Within the pipeline tag, many extractors may be defined. The BrowserEngine executes the extractors in
the specified order. Each extractor tag has two attributes: name and class. In addition there may be multiple
param tags.

Attribute Description

name The extractor identification

class The extractor class path

param An optional list of parameters. A param tag has three attributes: name, value and type.

HTMLOutput
The extractor generates an HTML document from the DOM tree.
Note: This extractor must always be first in the pipeline!

Example:

<extractor name="HtmlOutput" class="com.fastsearch.jscriptserver.extractors.HtmlOutput">
</extractor>

Cookies
The extractor extracts any cookies which have been created or modified by the executed JavaScript code.
Example:

<extractor name="Cookies" class="com.fastsearch.jscriptserver.extractors.Cookies">
</extractor>

Checksum
This extractor generates an MD5 checksum of the document. The checksum is based on the result of
HTMLOutput, with the HTML tags removed. This is the same algorithm used by default in the Enterprise
Crawler.


Example:

<extractor name="Checksum" class="com.fastsearch.jscriptserver.extractors.Checksum">
</extractor>

AttributeValueExtractor
This extractor retrieves links from HTML attributes. The AttributeValueExtractor takes a series of string
parameters. The "name" parameter is the name of the HTML tag, and "value" is the attribute within this HTML
tag to extract links from.
Example:

<extractor name="AttributeValueExtractor"
class="com.fastsearch.jscriptserver.extractors.AttributeValueExtractor">
<param name="body" value="background" type="str"/>
<param name="embed" value="src" type="str"/>
</extractor>

Clicker
The extractor attempts to simulate user input by “clicking” on elements. This extractor takes one string
parameter, "click". The parameter contains a semicolon separated list of elements to click on.
Example:

<extractor name="Clicker" class="com.fastsearch.jscriptserver.extractors.Clicker">
<param name="click" value="a; area" type="str"/>
</extractor>

EventHandlerRunner
This extractor gets links by triggering JavaScript events. The event handler runner class has one string
parameter, the "events" parameter. The value of this parameter is a semicolon separated list of events, which
the extractor will execute to retrieve new links.
Example:

<extractor name="EventHandlerRunner"
class="com.fastsearch.jscriptserver.extractors.EventHandlerRunner">
<param name="events" value="onFocus; onBlur; onClick; onMouseDown;" type="str"/>
</extractor>

ScriptExtractor
The script extractor uses regular expressions to extract links from JavaScript tags.
Example:

<extractor name="ScriptExtractor"
class="com.fastsearch.jscriptserver.extractors.ScriptExtractor">
</extractor>


FormExtractor
This extractor tries to extract links from forms by "triggering" the forms' submit buttons.
Example:

<extractor name="FormExtractor"
class="com.fastsearch.jscriptserver.extractors.FormExtractor">
</extractor>

CSSExtractor
The extractor retrieves links from Cascading Style Sheet (CSS) definitions.
Example:

<extractor name="CSSExtractor" class="com.fastsearch.jscriptserver.extractors.CSSExtractor">

</extractor>

MetaURLFinder
The MetaURLFinder extractor extracts links from within HTML meta tags.
Example:

<extractor name="MetaURLFinder"
class="com.fastsearch.jscriptserver.extractors.MetaURLFinder">
</extractor>

UserScript
The UserScript extractor makes it possible to create extractors using JavaScript. Thus, if none of the other
extractors is able to retrieve the links, you can write your own extractor. The extractor has one parameter,
"src", which specifies the location of your JavaScript file. It can be a URL or a Java resource path.
Example:

<extractor name="JavaScriptExtractor"
class="com.fastsearch.jscriptserver.extractors.UserScript">
<param name="src" value="/JavaScriptExtractor.js" type="str"/>
</extractor>

Note that this script will be executed like any other script within a page. Please be cautious when naming
variables and functions. The last line in the script must be an object containing the extracted links. The object
must have named properties with their corresponding values being arrays of strings. The name of a property
is the link type, and the array is the list of URIs found for that particular link type.
Example: A user script which extracts image links from a page.

var links = new Object();
links['images'] = new Array();
for (var i = 0; i < document.images.length; i++) {
    var image = document.images[i];
    links['images'].push(image.src);
}
links;

Flash settings

Setting Description

config Specifies the URI of the Flash configuration file, which is used to configure Flash extraction.

timeout Maximum time (in milliseconds) that the BrowserEngine will use to process a Flash file before
the processing is aborted.

Example:
<flash config="file:///home/user/FlashConfig.xml" timeout="5000"/>

If a Flash configuration file is not specified in the BrowserEngine configuration, the BrowserEngine will use
its default configuration for Flash processing.

Configuration file
The Flash configuration file includes an ExtractLinksFromText tag. This tag has an enabled attribute, which
can be set to true or false. Setting this attribute to true allows the BrowserEngine to identify links in the
text extracted from the Flash file. Enabling this option will increase the processing time of Flash files.
Note: Most of the links in a Flash file are not contained within the text itself, thus this is just an extra
option to find additional links.

Setting Description

prefix Specifies a prefix. Tokens starting with this value will be identified as links.

suffix Specifies a suffix. Tokens ending with this value will be identified as links.

Below is an example of a Flash configuration file:

<FlashConfig>
<ExtractLinksFromText enabled="false">
<prefix> http </prefix>
<suffix> txt </suffix>
<prefix> ftp </prefix>
<suffix> js </suffix>
<suffix> html </suffix>
</ExtractLinksFromText>
</FlashConfig>

Example
Below is an example file.

<config>

<server maxThreads="50" maxQueueSize="20" port="50000"/>

<browser type="Mozilla" allowPopups="false" useSSL="true">


<cache size="25" ttl="3600000"/>

<blacklist>
<regexp value="http://ads\."/>
<regexp value="doubleclick\.net"/>
</blacklist>

<javascript timeout="5000">
<scriptPreProcessor src="/scriptPreProcessor.js"/>
<pagePreProcessor src="/pagePreProcessor.js"/>
</javascript>

</browser>

<pipeline maxIterations="1" obeyNoIndex="false" abortOnFailure="true">

<extractor name="HtmlOutput"
class="com.fastsearch.jscriptserver.extractors.HtmlOutput">
</extractor>

<extractor name="Cookies" class="com.fastsearch.jscriptserver.extractors.Cookies">

</extractor>

<extractor name="Checksum" class="com.fastsearch.jscriptserver.extractors.Checksum">

</extractor>

<extractor name="MetaURLFinder"
class="com.fastsearch.jscriptserver.extractors.MetaURLFinder">
</extractor>
</pipeline>
</config>

Chapter

3
Operating the BrowserEngine
Topics: This chapter describes how to perform tasks such as starting/stopping, monitoring
and logging of the BrowserEngine.
• Starting and Stopping
• Logging
• Monitoring
• Tuning
• Restrictions

Starting and Stopping


Starting and stopping of the BrowserEngine can be done from the administrator interface or from the command
line.

Starting from the administrator interface


To start the BrowserEngine from the administrator interface:

1. Select System Management on the navigation bar.


2. Locate the Browser Engine on the Installed module list - Module name. Select the Start symbol.

Stopping from the administrator interface


To stop the BrowserEngine from the administrator interface:

1. Select System Management on the navigation bar.


2. Locate the Browser Engine on the Installed module list - Module name. Select the Stop symbol.

Starting from the command line


Use the nctrl tool to start the BrowserEngine from the command line.
Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information.
Run the following command to start the BrowserEngine:

1. $FASTSEARCH/bin/nctrl start browserengine

Stopping from the command line


Use the nctrl tool to stop the BrowserEngine from the command line.
Refer to the nctrl Tool appendix in the FAST ESP Operations Guide for nctrl usage information.
Run the following command to stop the BrowserEngine:

1. $FASTSEARCH/bin/nctrl stop browserengine

Logging
The BrowserEngine produces logs which can help determine the state of a URI or the state of the whole
system. By default, it logs to the $FASTSEARCH/var/log/browserengine directory.
Startup, shutdown, and status messages are the only types of messages sent to the Log Server, in order to
reduce network traffic. Document-level messages are therefore logged only on the node the BrowserEngine runs on.
If you are using the Enterprise Crawler with the BrowserEngine, it also produces log messages that can be
valuable in tracking down what is happening to a specific URI. Refer to the FAST Enterprise Crawler Guide
for more information.

Change the BrowserEngine logging


It is generally not recommended to change the log level. However, you may sometimes need to change it
to reveal why a certain page failed to be processed.


Knowledge about log4j is required. General log4j information is available at http://logging.apache.org/log4j/docs/


Note: If multiple BrowserEngines run on the same machine, they will all log to the same file. To log to
different files, the log4j configuration has to be different for each engine.

1. Open $FASTSEARCH/components/browserengine/WEB-INF/classes/log4j.xml
2. Change the configuration to your needs and save the file.
3. Using the Node Controller, restart the BrowserEngine.
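As an illustration only, raising the log level for a subsystem in a log4j XML configuration typically looks like the fragment below. The logger name shown is an assumption based on the package names used elsewhere in this guide; use the names actually present in the BrowserEngine's log4j.xml.

```xml
<!-- Illustrative fragment: enable DEBUG logging for one subsystem.
     The logger name is an assumed example, not a documented value. -->
<logger name="com.fastsearch.jscriptserver">
  <level value="DEBUG"/>
</logger>
```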

Monitoring
The BrowserEngine can currently be monitored by reading the log files and by using a set of methods exposed
through XML-RPC.
If you are using the BrowserEngine in combination with the crawler, the crawleradmin tool has an option that
displays statistics for a particular Master (Crawler) node:
$FASTSEARCH/bin/crawleradmin --browserengine

When run on an UberMaster, the output is a list of all the BrowserEngines that are used by the Master nodes.

Tuning
The BrowserEngine may easily become overloaded or run out of Java heap space, because processing
an HTML document like a browser and executing JavaScript is a heavy task. This section explains how to
modify configuration settings in order to balance the workload.

Server
Performance may be improved by changing the maxThreads setting to increase or decrease the thread pool
size. If the BrowserEngine uses too many threads, valuable CPU cycles are wasted on thread scheduling,
thus lowering throughput. Configuring the BrowserEngine with too many threads also increases the probability
of running out of Java heap space; in that case, a better solution may be to run multiple BrowserEngine instances.
Configuring an engine with too few threads may likewise result in low throughput, as many of the threads
may be blocked waiting for external dependencies. The optimal number of threads depends on the
operating system, the hardware, and the content that is crawled. To tune it, you need to closely monitor the
system before and after the thread pool size has been modified, and measure the effect of each change on
performance.

Browser
Increase the Cache section size parameter or the TTL setting. This should increase the cache hit ratio,
which reduces the number of requests for external scripts and frames. As a result, fewer threads
in the BrowserEngine will be blocked.

Pipeline
Configure the pipeline to use the minimal set of extractors you need. For instance, if you are only interested
in extracting image links, the default pipeline configuration would involve too much unneeded processing.


Node deployment
Move the BrowserEngine to a faster (or less heavily utilized) server, or run multiple BrowserEngine instances
on several nodes. Note that the Enterprise Crawler must be reconfigured if the BrowserEngine deployment
is changed.

Enterprise Crawler tuning


The Enterprise Crawler generates a heavy load on the BrowserEngine during the first crawl cycle, as all
documents are new and need to be processed. On subsequent crawl cycles a great portion of these
documents are unmodified, so the load on the BrowserEngine is significantly reduced. It is therefore
recommended to set up several BrowserEngine nodes for the first crawl cycle. After this cycle the number of
nodes can be reduced. Try to limit the number of documents that the BrowserEngine processes.
This can be achieved in the crawler by creating subdomains with JavaScript enabled. Furthermore, decreasing
the javascript_delay attribute in the crawler will improve the throughput of the BrowserEngine, as less time
will be spent waiting for external dependencies.

Restrictions
In this section two common limitations of the BrowserEngine are discussed.

AJAX
The BrowserEngine does not fully support AJAX (Asynchronous JavaScript and XML). It will extract all links
found in XMLHttpRequest calls; thus, if permitted, the crawler will follow these links. However, note that the
BrowserEngine will not try to download and execute the requested code.

The HTTP POST method


The BrowserEngine and crawler do not support the HTTP POST method. The POST method is quite
commonly used to update frames and/or iframes. A potential workaround for this issue is to create a customized
stage in the BrowserEngine's pipeline that extracts the links and returns them to the crawler as GET operations.
Hence, the required content can be obtained.
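Such a stage would need to map each POST form to an equivalent GET request. The following is a minimal sketch of the URL rewriting involved, assuming the form fields are available as name/value pairs; the actual pipeline-stage API is not shown in this document, so this helper is hypothetical:

```python
from urllib.parse import urlencode, urljoin

def form_to_get_url(page_url, action, fields):
    """Hypothetical helper for a custom pipeline stage: rewrite a POST form
    into a GET URL the crawler can fetch. page_url is the page containing
    the form; fields are the form's input name/value pairs."""
    # Resolve the (possibly relative) form action against the page URL
    target = urljoin(page_url, action)
    if fields:
        # Encode the form fields as a query string, as a GET submission would
        target = target + "?" + urlencode(fields)
    return target
```

Whether such rewriting is acceptable depends on the site: a server that changes state on POST should not be crawled this way.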

Chapter 4

BrowserEngine reference information

This chapter contains reference information about the BrowserEngine, such as command line parameters
and the XML-RPC interface.

Topics:
• BrowserEngine binary
• XML-RPC Browser Interface
• XML-RPC Status Interface
• Extractor processing examples

BrowserEngine binary
The BrowserEngine is invoked by a shell script located at:
UNIX: $FASTSEARCH/components/browserengine/bin/browserengine.sh
Windows: %FASTSEARCH%\components\browserengine\bin\browserengine.cmd
Syntax: browserengine.(sh|cmd) [options] configfile

Option Description

-h Displays the option list.

-v Shows version information.

-p Sets the listening port number.


Note: The option overrides a value set in the BrowserEngine configuration file.

-l Sets the log directory path.

configfile The configuration file as a URL or Java resource path. You can specify a configuration file from
the configserver by using the following URL syntax:
configserver://<ModuleName>/<FilePath>

For instance:
configserver://BrowserEngine/BrowserConfig.xml

Note: If you want to specify a configuration file on the file system, the URL looks like this:
file:///<FilePath>/<FileName>

XML-RPC Browser Interface


The BrowserEngine exposes methods for processing HTML and Flash through XML-RPC on its baseport.

HTML processing
Map Browser.process(String url, byte[] content, List headers, String proxyHost, int
proxyPort, List extraHeaders)

where

Option Description

url The URL of the page.

content The content of the page.

headers A list of HTTP headers where each entry in the list is a list of length two containing the name
and value of a header. As a minimum, a content-type header with text/html must be supplied.
By adding Set-Cookie headers, you can define which cookies should be available to
JavaScript on the page.
proxyHost The hostname or IP address of an HTTP proxy (if any).

proxyPort The port number of an HTTP proxy (if any).


Option Description
extraHeaders Headers that will be sent with external dependency requests.

Returns a map containing the result (links, cookies, HTML and so on).
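For illustration, Browser.process can be invoked from any XML-RPC client. The sketch below uses Python's standard library; the port value 15500 is an assumption (use your BrowserEngine's actual baseport), and the headers list shows the mandatory content-type entry plus an optional cookie:

```python
import xmlrpc.client

def make_headers(cookies=()):
    """Build the headers argument: a list of [name, value] pairs.
    A content-type of text/html must always be present."""
    headers = [["content-type", "text/html"]]
    for cookie in cookies:
        # Set-Cookie headers define which cookies page JavaScript can see
        headers.append(["Set-Cookie", cookie])
    return headers

def browse(url, html_bytes, host="localhost", baseport=15500, cookies=()):
    # 15500 is an assumed port; substitute the configured baseport.
    server = xmlrpc.client.ServerProxy("http://%s:%d/" % (host, baseport))
    return server.Browser.process(
        url,
        xmlrpc.client.Binary(html_bytes),  # byte[] is sent as an XML-RPC binary value
        make_headers(cookies),
        "",   # proxyHost: empty when no HTTP proxy is used
        0,    # proxyPort
        [],   # extraHeaders sent with external dependency requests
    )
```

The returned map then contains the extracted links, cookies, processed HTML, and so on, as described above.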

Flash processing
byte[] Flash.process(String url, byte[] content)

where

Option Description

url The URL or some other identifier for the Flash content.

content The content of the Flash file.

Returns an XML representation of the Flash file.

XML-RPC Status Interface


The BrowserEngine exposes methods that return various status information through XML-RPC on baseport
+ 1. The required configserver module methods (ping, ReRegister, ConfigurationChanged and so on) are
also implemented on this port, but are not described further in this document.
Map statistics()

Returns a map containing various statistical information about the server since it started. Example output (in
the form of a python dictionary):

...
'Total Requests': 2,
'Failed Requests': 0,
'Percentage Statistics':
{ 'CacheHit': 50.0
},
'Pipeline Performance (ms)':
{ 'AttributeValueExtractor': {'avg': 39, 'count': 2, 'max': 54, 'min': 24, 'tot':
78},
'CSSExtractor': {'avg': 3, 'count': 2, 'max': 4, 'min': 3, 'tot': 7},
...
},
'Time Statistics (ms)':
{ 'ExternalResource': {'avg': 532, 'count': 1, 'max': 532, 'min': 532, 'tot': 532},
'PageLoading': {'avg': 1193, 'count': 2, 'max': 1990, 'min': 397, 'tot': 2387},
...
}
...

Map threads()

Returns a map where the keys are thread-ids and the values are maps describing the work status of the
corresponding thread. Example output (in the form of a python dictionary):

...
'pool-2-thread-43': {'started': 1180012778,
'status': 'loading_page',
'url': 'http://somewhere.com/somepage1.html'},
'pool-2-thread-44': {'status': 'idle/dead'},
'pool-2-thread-45': {'started': 1180013128,
'status': 'processing_page',


'url': 'http://somewhere.com/somepage2.html'},
...

Map getQueueStatus()

Returns a map containing two values, QueueSize and MaxQueueSize. This can be useful to determine whether
or not the BrowserEngine is overloaded.
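For example, getQueueStatus can be polled to estimate how loaded the engine is. In this sketch the status port 15501 assumes a baseport of 15500; substitute your own configuration:

```python
import xmlrpc.client

def fill_ratio(status):
    """Fraction of the request queue in use; values near 1.0 suggest overload."""
    return status["QueueSize"] / float(status["MaxQueueSize"])

def current_fill_ratio(host="localhost", statusport=15501):
    # The status interface listens on baseport + 1; 15501 assumes a
    # baseport of 15500.
    server = xmlrpc.client.ServerProxy("http://%s:%d/" % (host, statusport))
    return fill_ratio(server.getQueueStatus())
```

A monitoring script could call current_fill_ratio periodically and alert when the ratio stays high.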
void quit()

Terminates the server.

Extractor processing examples


Below are examples demonstrating how the different extractors work, and how they extract URIs.

HTMLOutput
Input to extractor:

<html>
<head>
<script language="javascript">
document.writeln('standalone<br>');
function test(arg) {
document.writeln('## function test run from: '+arg+'<br>');
}

test('HEADER');
</script>
</head>

<body>
<script language="javascript">test('BODY');</script>
</body>
</html>

Output from extractor:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
standalone<br/>
## function test run from: HEADER<br/>
## function test run from: BODY<br/>
</body>
</html>

Cookies extractor
Input to extractor:

<html>
<head>
<script language="javascript">
function test() {
var param = "cookie_name_";
for (i=1; i<10; i++) {
createCookie(param+i, "val"+i, i);
}
}

function createCookie(name, value, days) {
var date = new Date();
date.setTime(date.getTime()+(days*24*60*60*1000));
var expires = "; expires="+date.toGMTString();
document.cookie = name+"="+value+expires+"; path=/";
}
</script>

</head>
<body>
<script language="javascript"> test() </script>
</body>
</html>

Cookies extracted from page:

{'domain': 'www.example.com', 'name': 'cookie_name_1', 'value': 'val1', 'max-age': 86399,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_2', 'value': 'val2', 'max-age': 172799,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_3', 'value': 'val3', 'max-age': 259199,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_4', 'value': 'val4', 'max-age': 345599,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_5', 'value': 'val5', 'max-age': 431999,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_6', 'value': 'val6', 'max-age': 518399,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_7', 'value': 'val7', 'max-age': 604799,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_8', 'value': 'val8', 'max-age': 691199,
'path': '/', 'spec': 'rfc2109'},
{'domain': 'www.example.com', 'name': 'cookie_name_9', 'value': 'val9', 'max-age': 777599,
'path': '/', 'spec': 'rfc2109'}

Checksum generator
Input to extractor:

<html>
<head>
<script language="javascript">
function test() {
document.writeln("<a href=\"test.html\">test.html </a>");
}
</script>
</head>
<body>
<script language="javascript"> test() </script>
</body>
</html>

HTML used for checksum generation in BrowserEngine:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

<body>
<a href="test.html">test.html </a>

</body>
</html>

Checksum generated by BrowserEngine: eac0a7ec83537763d3ba7671828d0989

If the BrowserEngine is not configured, and the Enterprise Crawler generates the checksum, the result can
be a different checksum: because the JavaScript code of the document is not processed, the document
content may differ.

Checksum generated by Enterprise Crawler: 1ed6cfe48b7a613ef93848c98aa1f88b

If the Enterprise Crawler were to process an HTML document that is identical to the JavaScript processed
document, it would generate the same checksum as the BrowserEngine.
Example: HTML used in Enterprise Crawler for checksum generation

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
<body>
<a href="test.html">test.html </a>
</body>
</html>

Checksum generated by Enterprise Crawler: eac0a7ec83537763d3ba7671828d0989
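The comparison above holds for any deterministic digest: identical post-processing HTML yields identical checksums, and any content difference changes them. A minimal sketch (MD5 is an assumption inferred from the 32-hex-digit values shown; the BrowserEngine's actual normalization and hash algorithm may differ):

```python
import hashlib

def checksum(html):
    # MD5 chosen for illustration only; the real algorithm and the HTML
    # normalization performed before hashing are not specified here.
    return hashlib.md5(html.encode("utf-8")).hexdigest()
```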

AttributeValue extractor
Input to extractor:

<img src="img_src_dyn.gif">

Links reported to the Enterprise Crawler:

img_src_dyn.gif

Clicker extractor
Input to extractor:

<html>
<head>
<title> JavaScript testing... </title>

<script language="javascript">
function createLink() {
var protocol = "http";
var sitename = "www.example.com";
var doc = "/cl.html";

document.getElementById("click").innerHTML = "<a href=\"deadlink.html\">Dead link</a><br><br>";
document.getElementById("click").innerHTML += "<a href=\"" + protocol + "://" +
sitename + "/" + doc + "\">New link</a>";
}
</script>


</head>

<body>
<center>
<div id="click">
<img src="image.jpg" onclick="createLink();">

</div>
</center>
</body>
</html>

Links reported to the Enterprise Crawler:

deadlink.html
http://www.example.com/cl.html

EventHandlerRunner extractor
Input to extractor:

<html>
<head>
<script language="javascript">
function createLink() {
var protocol = "http";
var sitename = "www.example.com";
var doc = "event.html";

document.getElementById("click").innerHTML = "<a href=\"deadlink.html\">Dead link</a><br><br>";
document.getElementById("click").innerHTML += "<a href=\"" + protocol + "://" +
sitename + "/" + doc + "\">New link</a>";
}
</script>
</head>

<body>
<center>
<div id="click">
<img src="picture.jpg" onMouseOut="createLink();">
</div>
</center>
</body>
</html>

Links reported to the Enterprise Crawler:

deadlink.html
http://www.example.com/event.html

Script extractor
Input to extractor:

// document.location = 'http://www.example.com/docLoc.html';
// window.open("http://www.example.com/someOpen4.html", "window name");

Links reported to the Enterprise Crawler:

http://www.example.com/docLoc.html
http://www.example.com/someOpen4.html


Form extractor
Input to extractor:

document.writeln("<form action=\"action_dyn.html\" method=\"post\"><input type=\"submit\" value=\"Send\"> <input type=\"reset\"></form>");
<form action="action_static.html" type="submit"></form>

Links reported to the Enterprise Crawler:

action_dyn.html
action_static.html

CSS extractor
Input to extractor:

<style type="text/css">
@import "1.css";
@import url('2.css');
body{background-image: url('3.jpg')}
</style>

Links reported to the Enterprise Crawler:

1.css
2.css
3.jpg

MetaURLFinder extractor
Input to extractor:

<meta name="description" content="'http://fast.no/link.html'">
<meta name="description" content="http://noquoutesused.no/wont_find.html">

Links reported to the Enterprise Crawler:

http://fast.no/link.html

UserScript extractor
JavaScript defined as userscript:

var test = new Object();
test['test'] = new Array();
test['test'].push(testvar);
test;

Input to userscript:

<html>
<body>
<script language="javascript">
var testvar = 'test.html';
testvar = 'MAGIC_'+testvar;
</script>
</body>
</html>

Links reported to the Enterprise Crawler:

MAGIC_test.html
