Contents

About This Book .......................................................................................xiii
Audience .......................................................................................................... xiii
Prerequisites ..................................................................................................... xiii
Conventions ...................................................................................................... xiv
What's New in SAS Information Retrieval Studio 1.3 ............................xv
1 About SAS Information Retrieval Studio .............................................1
1.1 What Is SAS Information Retrieval Studio? .............................................. 1
1.2 Benefits of Using SAS Information Retrieval Studio ................................ 3
1.3 How Does SAS Information Retrieval Studio Work? ............................... 3
1.4 How Does SAS Information Retrieval Studio Fit into the SAS Product Line? ... 4
1.5 How to Get Help for SAS Information Retrieval Studio ........................... 4
1.6 What Is a Document? ................................................................................. 5
1.7 Architecture ................................................................................................ 5
2 SAS Information Retrieval Studio Interface .........................................7
2.1 Your First Look at the SAS Information Retrieval Studio User Interface . 7
2.2 Access the SAS Information Retrieval Studio User Interface ................... 8
2.3 Viewing the Overview Pane ....................................................................... 11
2.3.1 Overview of the Overview Pane ...................................................... 11
2.3.2 The Log Tab ..................................................................................... 11
2.4 Viewing the Web Crawler Pane ................................................................. 12
2.4.1 Overview of the Web Crawler Pane ................................................ 12
2.4.2 The Buttons ...................................................................................... 13
2.4.3 The Status Tab ................................................................................. 13
2.4.4 The Configuration Tab ..................................................................... 14
2.4.4.A The Main Tabs in the Configuration Tab ............................. 14
2.4.4.B The General Settings Tab ...................................................... 15
2.4.4.C The Entry Points Tab ............................................................ 18
2.4.4.D The Scope Tab ...................................................................... 20
2.4.4.E The Filename Extensions Tab ............................................... 21
2.4.4.F The Credentials Tab .............................................................. 23
2.4.4.G The Log Tab .......................................................................... 24
2.5 Viewing the File Crawler Pane ...................................................................24
2.5.1 Overview of the File Crawler Pane ..................................................24
2.5.2 The Buttons ......................................................................................24
2.5.3 The Status Tab ..................................................................................25
2.5.4 The Configuration Tab .....................................................................25
2.5.4.A The Main Tabs in the Configuration Tab ..............................25
2.5.4.B The General Settings Pane .....................................................26
2.5.4.C The Paths Pane ......................................................................28
2.5.4.D The Paths to Exclude Pane ....................................................29
2.5.4.E The Filename Extensions Pane ..............................................30
2.6 Viewing the Feed Crawler Pane .................................................................31
2.6.1 Overview of the Feed Crawler Pane .................................................31
2.6.2 The Buttons ......................................................................................31
2.6.3 The Status Tab ..................................................................................32
2.6.4 The Configuration Tab .....................................................................32
2.6.4.A The Main Tabs in the Configuration Tab ..............................32
2.6.4.B The General Settings Tab ......................................................33
2.6.4.C The Feeds Tab .......................................................................34
2.6.5 The Log Tab .....................................................................................35
2.7 Viewing the Proxy Server Pane ..................................................................35
2.7.1 Overview of the Proxy Server Pane .................................................35
2.7.2 The Buttons ......................................................................................36
2.7.3 The Status Tab ..................................................................................36
2.7.4 The Configuration Tab .....................................................................37
2.7.5 The Log Tab .....................................................................................39
2.8 Viewing the Pipeline Server Pane ..............................................................39
2.8.1 Overview of the Pipeline Server Tab ...............................................39
2.8.2 The Buttons ......................................................................................40
2.8.3 The Status Tab ..................................................................................40
2.8.4 The Document Processors Tab .........................................................41
2.8.5 The Document Inspector Tab ...........................................................42
2.8.6 The Log Tab .....................................................................................44
2.9 Viewing the Indexing Server Pane .............................................................44
2.9.1 Overview of the Indexing Server Pane .............................................44
2.9.2 The Buttons ......................................................................................45
2.9.3 The Status Tab ..................................................................................46
2.9.4 The Configuration Tab .....................................................................46
2.9.5 The Log Tab .....................................................................................47

SAS Information Retrieval Studio: Administrator's Guide

2.10 Viewing the Query Server Pane ............................................................... 48
2.10.1 Overview of the Query Server Tab ................................................ 48
2.10.2 The Buttons .................................................................................... 48
2.10.3 The Status Tab ............................................................................... 48
2.10.4 The Log Tab ................................................................................... 49
2.11 Viewing the Query Web Server Pane ...................................................... 49
2.11.1 Overview of the Query Web Server Tab ....................................... 49
2.11.2 The Buttons .................................................................................... 49
2.11.3 The Status Tab ............................................................................... 50
2.11.4 The Configuration Tab ................................................................... 50
2.11.4.A The Main Tabs in the Configuration Tab ........................... 50
2.11.4.B The Matching Tab ............................................................... 51
2.11.4.C The Sorting Tab .................................................................. 53
2.11.4.D The Labels Tab ................................................................... 56
2.11.4.E The Match Formatting Tab ................................................. 58
2.11.4.F The Theme Tabs .................................................................. 60
2.11.5 The Log Tab ................................................................................... 65
2.12 Viewing the Query Statistics Server Pane ............................................... 65
2.12.1 Overview of the Query Statistics Server Pane ............................... 65
2.12.2 The Buttons .................................................................................... 65
2.12.3 The Status Tab ............................................................................... 65
2.12.4 The Query Statistics Tab ................................................................ 66
2.12.4.A The Buttons ......................................................................... 66
2.12.4.B The Most Frequent Queries Tab ......................................... 68
2.12.4.C The Most Frequent Queries without Matches Tab .............. 69
2.12.4.D The Hourly Query Rate Tab ............................................... 70
2.12.4.E The Daily Query Rate Tab .................................................. 71
2.12.4.F The Monthly Query Rate Tab .............................................. 72
2.12.5 The Log Tab ................................................................................... 72
2.13 The Add Document Processor Windows ................................................. 73
2.13.1 Overview of Document Processor Windows ................................. 73
2.13.2 Access Document Processor Window ........................................... 73
2.13.3 The Document Processor: add_field Window ............................... 77
2.13.4 The Document Processor: content_categorization Wizard ............ 78
2.13.4.A Overview of the content_categorization Document Processor ..... 78
2.13.4.B Configure SAS Content Categorization Server .................. 78
2.13.4.C Specify the Projects ............................................................. 80
2.13.4.D Specify Input ....................................................................... 82
2.13.4.E Specify Categories ............................................................... 83
2.13.4.F Specify Concepts .................................................................84
2.13.4.G Specify Facts .......................................................................88
2.13.5 The Document Processor: default_mime_type_from_url Window 95
2.13.6 The Document Processor: default_title_from_url Window ...........95
2.13.7 The Document Processor: document_converter Window ..............96
2.13.8 The Document Processor: export_csv Window .............................97
2.13.9 The Document Processor: export_to_files Window ......................100
2.13.10 The Document Processor: export_to_odbc Window ....................102
2.13.11 The Document Processor: export_to_sentiment_analysis_workbench Window ...104
2.13.12 The Document Processor: extract_abstract Window ...................106
2.13.13 The Document Processor: extract_pdate Window .......................107
2.13.14 The Document Processor: heuristic_parse_html Window ...........108
2.13.15 The Document Processor: invalidate_duplicates_by_url Window ......110
2.13.16 The Document Processor: match_and_copy Window .................110
2.13.17 The Document Processor: modify_field_name Window .............112
2.13.18 The Document Processor: parse_html Window ...........................112
2.13.19 The Document Processor: parse_xml Window ............................114
2.13.20 The Document Processor: send Window .....................................115
2.13.21 The Document Processor: strip_html Window ............................116
2.13.22 The Document Processor: substitute Window .............................116
2.14 Miscellaneous Windows ...........................................................................118
2.14.1 The Import Settings Window .........................................................118
2.14.2 The Export Settings Window .........................................................119
2.14.3 The Select an HTTP Proxy Window ..............................................120
2.14.4 The Add Entry Point Window ........................................................121
2.14.5 The Edit Entry Point Window ........................................................125
2.14.6 The Add Feed Window ..................................................................126
2.14.7 The Edit Feed Window ...................................................................129
2.14.8 The Add Scope Rule Window ........................................................130
2.14.9 The Edit Scope Rule Window ........................................................132
2.14.10 The Add Filename Extension Window ........................................133
2.14.11 The Edit Filename Extension Window ........................................134
2.14.12 The Add Credential Window .......................................................135
2.14.13 The Edit Credential Window ........................................................136
2.14.14 The Add Path Window .................................................................137
2.14.15 The Edit Path Window .................................................................138
2.14.16 The Add Path to Exclude Window ...............................................138
2.14.17 The Edit Path to Exclude Window ............................................... 139
2.14.18 The Add Extension Window ........................................................ 139
2.14.19 The Edit Extension Window ........................................................ 140
2.14.20 The Add Backend Window .......................................................... 141
2.14.21 The Edit Backend Window .......................................................... 142
2.14.22 The Add Field Window for the Indexing Server ......................... 143
2.14.23 The Add Field Window for the Query Web Server Matching Pane ..... 146
2.14.24 The Edit Field Window for the Query Web Server Matching Pane .... 147
2.14.25 The Add Field Window: Query Web Server Labels Pane ........... 148
2.14.26 The Edit Field Window: Query Web Server Labels .................... 150
2.14.27 The Color Box Window ............................................................... 151
2.14.28 Status Windows ........................................................................... 153
2.14.28.A Overview of Status Windows ........................................... 153
2.14.28.B The Confirmation Window ............................................... 153
2.14.28.C The Error Window ............................................................ 153
3 Choosing Your Components ................................................................155
3.1 Before You Choose Your Components ...................................................... 155
3.2 Choosing a Crawler .................................................................................... 156
3.3 Purposes of the Proxy Server ..................................................................... 157
3.4 Choosing Document Processors in the Pipeline Server ............................. 158
3.4.1 Overview of the Pipeline Server ...................................................... 158
3.4.2 Choosing a Document Processor ..................................................... 158
3.4.3 The Export Operations Performed by the Pipeline Server ............... 160
3.5 How the Indexing Server Works ................................................................ 161
3.6 Querying the Index ..................................................................................... 161
3.6.1 Overview of Querying ............................................................... 161
3.6.2 Using the Query Server .................................................................... 161
3.6.3 Using the Query Web Server ........................................................... 162
3.6.4 Using the Query Statistics Server .................................................... 162
3.7 Defining Labels for Facetted Search .......................................................... 163
3.8 After You Choose Your Components ........................................................ 164
3.9 Exporting and Importing Component Specifications ................................. 164
4 Sample Configurations ..........................................................................165
4.1 Why You Want to Understand Sample Configurations ............................. 165
4.2 Before You Use a Sample Configuration to Create Your Own Application 166
4.3 Sample Configurations That Use the Web Crawler ...................................167
4.3.1 A Web Crawler, Indexing, and Searching Configuration ................167
4.3.2 The Web Crawler with Exporting and Indexing Processes ..............178
4.4 A Sample Configuration That Uses the File Crawler .................................178
4.5 A Sample Configuration That Uses the Feed Crawler ...............................184
5 Configuring the Web Crawler ...............................................................191
5.1 Overview of the Web Crawler ....................................................................191
5.2 Configuring the Web Crawler ....................................................................192
5.2.1 Overview of Configuring the Web Crawler .....................................192
5.2.2 Specify the General Settings ............................................................193
5.2.3 Specify Entry Points for the Web Crawler .......................................196
5.2.4 Specify the Scope of the Crawl ........................................................198
5.2.5 Exclude Certain Types of Files ........................................................202
5.2.6 Specify Access Information for Password-Protected Sites ..............203
5.3 Run the Web Crawler .................................................................................205
5.4 Troubleshoot with the Log File ..................................................................206
6 Configuring the File Crawler ................................................................. 207
6.1 Overview of the File Crawler .....................................................................207
6.2 Configure the File Crawler .........................................................................207
6.2.1 Overview of Configuring the File Crawler ......................................207
6.2.2 Specify the General Settings ............................................................208
6.2.3 Specify the Paths to Crawl ...............................................................209
6.2.4 Specify the Paths to Exclude ............................................................211
6.2.5 Specify the Types of Files to Return ................................................212
6.3 Run the File Crawler ...................................................................................213
6.4 Troubleshoot with the Log File ..................................................................214
7 Configuring the Feed Crawler ...............................................................217
7.1 Overview of the Feed Crawler ....................................................................217
7.2 Configure the Feed Crawler .......................................................................218
7.2.1 Overview of Configuring the Feed Crawler .....................................218
7.2.2 Specify the General Settings ............................................................218
7.2.3 Specify the Feeds ..............................................................................220
7.3 Run the Feed Crawler .................................................................................222
7.4 Troubleshoot with the Log File ..................................................................223
8 Configuring the Proxy Server ...............................................................225
8.1 Overview of the Proxy Server .................................................................... 225
8.2 View the Status of the Proxy Server and Input Files ................................. 226
8.3 Configure the Proxy Server ........................................................................ 228
8.4 Run the Proxy Server ................................................................................. 229
8.5 Troubleshoot with the Log File .................................................................. 230
9 Configuring the Pipeline Server ...........................................................233
9.1 Overview of the Pipeline Server ................................................................ 233
9.1.1 Processing Documents and Related SAS Applications ................... 233
9.1.1.A How Document Processing and Export Operations Work Together ..... 233
9.1.1.B Process Documents ............................................................... 234
9.1.1.C Export Processed Documents ................................................ 235
9.2 Configuring the Pipeline Server ................................................................. 235
9.2.1 Overview of the Document Processors ............................................ 235
9.2.2 Checking Program Installations ....................................................... 237
9.2.3 Configure the Document Processors ................................................ 238
9.3 See Input Documents with the Document Inspector .................................. 240
9.4 Add a New Field to Input Documents ........................................................ 242
9.5 Match Categories, Concepts, and Facts ..................................................... 246
9.6 Export Categories and Concept Matches .................................................. 256
9.7 Advanced Installation ................................................................................. 258
9.8 Run the Pipeline Server .............................................................................. 258
9.9 Troubleshoot with the Log File .................................................................. 260
10 Creating Facetted Search Labels Using content_categorization ....261
10.1 Before You Begin Using This Example ................................................... 261
10.1.1 How the content_categorization Document Processor Creates Facetted Search Labels ... 261
10.1.2 Using Related Programs to Define Labels ..................................... 262
10.1.3 Mapping to Labels ......................................................................... 264
10.1.4 Before You Build Your SAS Content Categorization Studio Project ... 266
10.1.5 Before You Use the Example in This Chapter ............................... 269
10.2 Creating a Sample Project ........................................................................ 274
10.2.1 Access the Projects on SAS Content Categorization Server ......... 274
10.2.2 Add Projects ................................................................................... 277
10.2.3 Determine the Input, Matching, and Output ...................................279
10.2.3.A How Input Documents Are Handled ...................................279
10.2.3.B Specify Input Fields .............................................................279
10.2.3.C Specify Categories ...............................................................280
10.2.3.D Specify Concepts .................................................................281
10.2.3.E Specify Facts ........................................................................286
10.2.4 Specify Output ................................................................................292
10.2.5 Apply content_categorization to Input Documents ........................293
10.3 Seeing the Results in the Query Interface ................................................295
11 Configuring the Indexing Server ........................................................ 297
11.1 Overview of the Indexing Server ..............................................................297
11.2 Configure an Index ...................................................................................298
11.3 Changes That Affect the Indexing Server ................................................301
11.4 Run the Indexing Server ...........................................................................302
11.5 Troubleshoot with the Log File ................................................................303
12 Configuring the Query Server ............................................................. 305
12.1 Overview of the Query Server ..................................................................305
12.2 Run the Query Server ...............................................................................305
12.3 Troubleshoot with the Log File ................................................................306
13 Configuring the Query Web Server .................................................... 309
13.1 Overview of the Query Web Server .........................................................309
13.2 Choosing How Search Returns Are Displayed .........................................310
13.2.1 Displays with or without Labels .....................................................310
13.2.2 No Labels Example ........................................................................312
13.2.3 Hierarchical Labels Example .........................................................312
13.2.4 No Hierarchical Display Example ..................................................314
13.2.5 Flattened Hierarchical Example .....................................................315
13.3 Configure the Query Web Server .............................................................316
13.3.1 Overview of Configuring the Query Web Server ..........................316
13.3.2 Specify the Server Port ...................................................................317
13.3.3 Specify How Matching Is Performed .............................................317
13.3.3.A Match Types ........................................................................317
13.3.3.B Select a Match Type ............................................................318
13.3.4 Specify How Matches Are Sorted ..................................................319
13.3.5 Specify Labels for Facetted Search ................................................321
13.3.6 Specify the Formatting for the Matches .........................................326
13.3.7 Specify the Theme of the Search Window .................................... 329
13.3.7.A Theme Overview ................................................................. 329
13.3.7.B Specify the Colors of the Search Window .......................... 331
13.3.7.C Load New Images into the Search Window ........................ 335
13.4 Run the Query Web Server ...................................................................... 336
13.5 Troubleshoot with the Log File ................................................................ 337
14 Configuring the Query Statistics Server ............................................339
14.1 Overview of the Query Statistics Server .................................................. 339
14.2 Run the Query Statistics Server ............................................................... 340
14.3 View the Query Statistics for a Selected Time Period ............................. 341
14.3.1 Overview of Time Period Views ................................................... 341
14.3.2 See the Statistics for Today ............................................................ 341
14.3.3 See the Statistics for This Month ................................................... 344
14.3.4 See the Statistics for This Year ...................................................... 346
14.3.5 See the Statistics for All Time ....................................................... 348
14.4 After You View the Query Data .............................................................. 349
14.5 Troubleshoot with the Log File ................................................................ 349

Appendixes ............................................................................ 351


A Regular Expressions and XML Field Extraction File ..........................353
A.1 Regular Expressions .................................................................................. 353
A.2 XML File Field Extraction File Format .................................................... 353
B Recommended Reading ........................................................................355
C Glossary .................................................................................................359
Index ...........................................................................................................363

About This Book


Audience
SAS Information Retrieval Studio is designed for the following
administrators:
- Persons who install the software.
- Persons who determine what components are used in the custom
  information retrieval application that your organization requires.
- Persons who choose the components, and their configurations, for
  information retrieval.
- Persons who design the search window that is used by end users to
  query the index.

You could be assigned one of these functions, or all of them.


You can use SAS Information Retrieval Studio together with other SAS
products. This documentation focuses on the tasks that define and configure
the information retrieval application and the search interface.

Prerequisites
Here are the prerequisites for using SAS Information Retrieval Studio:
- SAS Information Retrieval Studio installed on your machine.
- A supported browser installed on your desktop client.
- Access to data sources.
- (Optional) Rules such as category rules and concept definitions created
  in other SAS applications.

If you have any questions about whether you are ready to use SAS Information
Retrieval Studio, contact your system administrator.

Conventions
This manual uses the following typographical conventions:

Convention     Description
TGM_ROOT       The root directory where SAS Information Retrieval Studio is
               installed, typically the following:
               Windows: C:/Program Files/SAS Information Retrieval Studio
               UNIX: /opt/SAS Information Retrieval Studio
.xml           Code examples are shown in a fixed-width font.
Start button   The labels for user interface controls are shown in a bold,
               sans-serif font.
www.sas.com    The hypertext links are shown in a light blue, fixed-width
               font, and are underlined.

This manual contains instructional text that is subject to change.


What's New in SAS Information Retrieval Studio 1.3
New and enhanced features in SAS Information Retrieval Studio include the
following:
- SAS licensing replaces the Teragram license.
- The content_categorization Document Processor wizard replaces the
  categorizer, concept_extractor, and contextual_extractor processors.
- The add_field Document Processor enables you to add a field with a
  constant value to each input document.
- The export_to_files document processor now enables you to mark
  pre-escaped fields for XML documents. Use this processor to create nested
  XML tags.
- The parse_xml document processor can now be instantiated multiple
  times. This feature enables you to support multiple document schemas.
  This processor can also copy the original URL of the compound
  document into each resulting, split document.
- The export_csv document processor now supports a non-escaped output
  mode.
- Entry point quota control is now available for the web crawler. This
  feature enables seed-only crawling.
- The match_and_copy document processor is similar to the substitute
  document processor. Use the match_and_copy document processor to
  write the output to a different field from the input.
- The default fields ctime, mtime, and atime are included in the Input
  fields to exclude field for the content categorization document
  processor. This setting excludes these timestamps from processing by
  SAS Content Categorization Server.
- The passwords in the web crawler Credentials pane are now obscured.


About SAS Information Retrieval Studio
- What Is SAS Information Retrieval Studio?
- Benefits of Using SAS Information Retrieval Studio
- How Does SAS Information Retrieval Studio Work?
- How Does SAS Information Retrieval Studio Fit into the SAS Product
  Line?
- How to Get Help for SAS Information Retrieval Studio
- What is a Document?
- Architecture

1.1 What Is SAS Information Retrieval Studio?


In many organizations, diverse information consumers need to quickly access
specific data. In an environment where data, and its types, grow exponentially,
the related processes need to be automated. SAS Information Retrieval
Studio combines several key technologies to provide a comprehensive solution
for data collection, processing, indexing, and searching. These technologies are
bundled into one customizable product.
Easy information retrieval
The web, feed, and file crawlers gather the documents that you specify
according to your parameters. Documents are chunks of text, with or
without markup tags, gathered from the Internet, feeds, and databases.
These chunks of text can be treated by the document processors that parse,
convert, categorize, extract concepts and facts, and so on. The documents
can then be sent to the index or to another program such as SAS Sentiment
Analysis Workbench. If indexed, the documents can be searched by your
end users.
Build a custom information retrieval pipeline
Choose to build an information retrieval system that is customized to meet
the needs of your organization. You can choose all, or some, of the
following components:
Crawlers
Choose the web, feed, or file crawlers to gather documents from the
Web, feeds, and file systems, respectively.
Pipeline server
Choose your document processors that parse, categorize, extract
concepts, locate facts, convert documents into text, and so on. These
processors can also hand the gathered documents to other applications
such as SAS Sentiment Analysis Workbench.
Indexing server
Choose whether, and how, input documents are indexed. End users can
search indexed documents using a customized search window that
runs on the query web server.
Query web server
Specify how the matching documents are returned in the search
window, the appearance of this window, and how end users can
navigate the returns.
Query statistics server
See the counts for the entered queries according to various time
frames.
Easy component customization
Easy-to-use windows and wizards simplify the process of customizing the
information retrieval components that you choose. These panes also
provide log files, statistics, information about the processes involved, and
data on documents in the pipeline.
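
The flow described in this section (crawlers gather documents, document processors transform them, and an indexing step makes them searchable) can be pictured with a short Python sketch. This is an illustration only: the function names, the field names, and the in-memory index are assumptions made for this example and are not part of the SAS Information Retrieval Studio interface.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]                    # a document as named fields
Processor = Callable[[Document], Document]   # one processing step


def normalize_whitespace(doc: Document) -> Document:
    """Hypothetical processor: collapse runs of whitespace in the body."""
    out = dict(doc)
    out["body"] = " ".join(doc.get("body", "").split())
    return out


def add_constant_field(name: str, value: str) -> Processor:
    """Hypothetical processor factory: add a field with a constant value."""
    def processor(doc: Document) -> Document:
        out = dict(doc)
        out[name] = value
        return out
    return processor


def run_pipeline(docs: Iterable[Document],
                 processors: List[Processor]) -> List[Document]:
    """Pass every gathered document through each processor in order."""
    results = []
    for doc in docs:
        for process in processors:
            doc = process(doc)
        results.append(doc)
    return results


def build_index(docs: List[Document]) -> Dict[str, List[int]]:
    """Toy inverted index: word -> positions of the documents containing it."""
    index: Dict[str, List[int]] = {}
    for position, doc in enumerate(docs):
        for word in sorted(set(doc.get("body", "").lower().split())):
            index.setdefault(word, []).append(position)
    return index
```

In this sketch each component is optional in the same sense as in the product: the processor list can be empty, and the indexing step can be skipped when the documents are handed to another program instead.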


1.2 Benefits of Using SAS Information Retrieval Studio
SAS Information Retrieval Studio provides the following benefits:
Empowers business owners by locating data
SAS Information Retrieval Studio includes functionality that is designed
to fit your organization's requirements. Use this program to locate,
process, index, and customize a search window for your data. See various
types of informational statistics.
Improves the business value of IT and the corporate data that it manages
SAS Information Retrieval Studio provides you with easy, self-service
access to the information contained in your documents. Use SAS
Information Retrieval Studio to locate, process, index, and search your
data.
Saves money on training and support costs
SAS Information Retrieval Studio is so simple that you can quickly
become self-sufficient, with minimal IT support and no need for extensive
training. Once you start using SAS Information Retrieval Studio you are
no longer dependent on the IT staff.

1.3 How Does SAS Information Retrieval Studio Work?
SAS Information Retrieval Studio is an application that anyone can use to
locate documents on the Internet or in file systems. You can specify how these
documents are processed, and then index them or send them to another SAS
program. If you choose to index the documents, end users can query this data
in the search window that you customize.
Use the SAS Information Retrieval Studio window to select the crawler and
document processors. Also determine whether the documents are indexed or
sent to another SAS application, and customize a search window. All of these
processes are optional. You can specify the components that you want to use,
configure the components, and enable your end users to perform faceted
search using labels.

1.4 How Does SAS Information Retrieval Studio Fit into the SAS Product Line?
As an integral part of the SAS product line, SAS Information Retrieval Studio
provides crawlers, indexing, and searching capabilities. These functionalities
facilitate the processes of information retrieval and management. Use these
capabilities with the following SAS products, among others:
Export document collections to SAS Sentiment Analysis Workbench and SAS
Text Miner
Export the files that the web and feed crawlers gather in SAS Information
Retrieval Studio to SAS Sentiment Analysis Workbench. Here you can see
reports about overall sentiment. Analysts can also see and review
individual documents in SAS Sentiment Analysis Workbench. You can
also export to SAS Text Miner to locate topics and themes in your input
documents.
Category, concept, and fact extraction
SAS Information Retrieval Studio enables you to deploy the rules defined
in SAS Content Categorization Studio and SAS Contextual Extraction
Studio to your gathered documents.
Document conversion
Use SAS Document Conversion to extract text from input files such as
Adobe PDF and Microsoft Office.

1.5 How to Get Help for SAS Information Retrieval Studio
Select Help --> Contents or Help --> Using this Window.


1.6 What is a Document?


A document consists of a single text. For example, a document can be any of
the following:
- an HTML page
- a Microsoft Word file
- a PDF file
- one row in a CSV file or a database
- one article or summary in a feed

In SAS Information Retrieval Studio, each document is represented as a
configurable set of fields. Each field has a name and a value. Unnecessary fields
can be either left empty or omitted from the document.
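
As a rough illustration of a document as a set of named fields, the following Python sketch builds documents as dictionaries. The helper name `make_document` and the field names (`url`, `title`, `body`, `author`) are hypothetical choices for this example, not names defined by the product.

```python
from typing import Dict, Optional


def make_document(**fields: Optional[str]) -> Dict[str, str]:
    """Build a document as field name -> value.

    A field passed as None is omitted from the document; a field passed
    as an empty string is kept but left empty.
    """
    return {name: value for name, value in fields.items() if value is not None}


# One HTML page and one CSV row, each represented as a set of fields.
page = make_document(url="http://example.com/index.html",
                     title="Home", body="Welcome", author=None)
row = make_document(id="42", body="one row in a CSV file", title="")
```

Here `page` has no `author` field at all, while `row` carries an empty `title`, which mirrors the two options of omitting a field or leaving it empty.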

1.7 Architecture
Use the architecture diagram below to gain an overview of the application
processes that you can choose to use in your customized configuration.
Figure 1-1 SAS Information Retrieval Studio

SAS Information Retrieval Studio Interface
- Your First Look at the SAS Information Retrieval Studio User Interface
- Access the SAS Information Retrieval Studio User Interface
- Viewing the Overview Pane
- Viewing the Web Crawler Pane
- Viewing the File Crawler Pane
- Viewing the Feed Crawler Pane
- Viewing the Proxy Server Pane
- Viewing the Pipeline Server Pane
- Viewing the Indexing Server Pane
- Viewing the Query Server Pane
- Viewing the Query Web Server Pane
- Viewing the Query Statistics Server Pane
- The Add Document Processor Windows
- Miscellaneous Windows

2.1 Your First Look at the SAS Information Retrieval Studio User Interface
The SAS Information Retrieval Studio user interface provides the workspace
necessary to configure the crawlers and servers that you select to gather and
process information. The gathered documents are processed according to the
specifications that you set and sent to either the indexing server or to another
program. For example, choose to send your documents to SAS Sentiment
Analysis Workbench where the sentiment that they express can be aggregated. If
you want your end users to be able to query the documents that your crawlers
locate, choose to index these documents.
Use the windows in SAS Information Retrieval Studio to choose your
components according to the tasks that you want to perform:
1. Start, stop, configure, and monitor the status of the web, file, and feed
   crawlers. Specify the parameters for each type of crawl. A crawl is
   defined as the entire run of the crawler, instead of a single-page
   download.
2. Determine how input documents are processed and where they are sent
   using the pipeline server.
3. Specify how input documents are stored in the index by determining
   what fields are indexed and how the information in these fields is
   handled.
4. Choose the type of search that is available to end users and how
   matching and sorting are determined. You can also determine whether
   and how labels that facilitate faceted search are made available to end
   users.
5. Monitor the status of input queries using the query statistics server.

2.2 Access the SAS Information Retrieval Studio User Interface
To access the SAS Information Retrieval Studio user interface, complete this
step:
Select Start > Programs > SAS Information Retrieval Studio >
Administration.

Display 2.1 SAS Information Retrieval Studio User Interface

Use the components of this pane as specified below:

Table 2-1: Components of the Main Window
Component Description

Refresh button

By default, Auto-refresh is selected; you can select Refresh instead. With
the default setting, the status of the components is updated every few
seconds. If you have a slow connection, you can disable the auto-refresh
functionality. In this case, click Refresh to update the status of any of
the components of SAS Information Retrieval Studio.

Overview

Find information for the application and the import and export operations
that apply to selected components of the application.

Web Crawler

Start, stop, and configure the web crawler.

File Crawler

Start, stop, and configure the file crawler.

Feed Crawler

Start, stop, and configure the feed crawler.

Proxy Server

Start, stop, and configure the proxy server. Also see and search a log file
of output for this server.

Pipeline Server

Start, stop, and configure the pipeline server. Observe the progress of
gathered documents in the Status pane and see the specified log file
output for this server. Use this component to specify the processors that
act on each input document. These processors act on the document using
the specified operation or pass the document to another component.

Indexing Server

Start, stop, and configure the indexing server. Use the indexing server if
you plan to perform search operations using SAS Information Retrieval
Studio.

Query Server

Start, stop, and configure the query server. The query server passes queries
to the index from the search window and hands the results back to the
query web server.

Query Web Server

Start, stop, and configure the query web server. Specify how matching and
sorting operations are performed. Format the labels, document matches,
and the theme of the search window.

Query Statistics Server

Start and stop the query statistics server. See the most frequent queries
submitted, those search terms that are not matched, and query rates for
specified time periods.

The width of these panels can be adjusted by dragging the icon located
between the two main panes.

2.3 Viewing the Overview Pane


2.3.1 Overview of the Overview Pane
For information about this pane, see Section 2.2 Access the SAS Information
Retrieval Studio User Interface on page 8.

2.3.2 The Log Tab


See the log pane that describes the operations performed.
Display 2.2 Log Pane.

Number of lines

(default is 20) see this maximum number of timestamped lines of text that
form the log for the proxy server. Change this value to reset the limit.

Retrieve button

see the log file contents in the Log pane below.

Text to highlight

enter the term that you are seeking to match in the input document.

Find button

click to highlight the matched text in the log pane below.

Log pane

see the specified number of lines in the log file here.
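
The behavior of the Log tab (show at most a set number of the most recent lines, then highlight a search term) can be approximated with the sketch below. It is an assumption-laden illustration: the function names and the `**` marker are invented here, and the real pane highlights text visually rather than by inserting characters.

```python
from collections import deque
from typing import Iterable, List


def last_lines(lines: Iterable[str], limit: int = 20) -> List[str]:
    """Keep only the most recent `limit` lines (20 mirrors the default)."""
    return list(deque(lines, maxlen=limit))


def highlight(lines: Iterable[str], term: str, marker: str = "**") -> List[str]:
    """Wrap each occurrence of `term` in a marker; an empty term changes nothing."""
    if not term:
        return list(lines)
    return [line.replace(term, f"{marker}{term}{marker}") for line in lines]
```

A typical use is `highlight(last_lines(log_lines), "error")`, which corresponds to clicking Retrieve and then Find.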

2.4 Viewing the Web Crawler Pane


2.4.1 Overview of the Web Crawler Pane
The web crawler searches the Web and returns Web pages according to the
parameters that you specify. The operations that are available when you click
the Web Crawler tab are explained in the following subsections.
Display 2.3 Web Crawler Pane


2.4.2 The Buttons


The buttons that are available in the Web Crawler pane are listed below from
left to right:
Start

begin the crawl.


Stop

end the crawl.


Apply Changes

modify the behavior of the web crawler according to the changes that you
made in this pane.
Revert

return to the last applied settings.

2.4.3 The Status Tab


See whether the web crawler is running.
Display 2.4 Status Pane


2.4.4 The Configuration Tab


2.4.4.A The Main Tabs in the Configuration Tab
The web crawler does not run until it is configured. Specify the settings for the
web crawler, the points where the crawler enters the Web, the limits of its
crawl, and the file types that it returns. You can also specify the permissions
necessary to access password-protected sites.

General Settings

specify how the web crawler runs. For more information, see Section
2.4.4.B The General Settings Tab on page 15.
Entry Points

enter the starting URLs. The web crawler starts at these Web addresses
and follows their links to gather documents. For more information, see
Section 2.4.4.C The Entry Points Tab on page 18.
Scope

allow, or exclude, Web addresses from the crawling process. In either case,
specify patterns with regular expressions, or a list of specific URLs. For
more information, see Section 2.4.4.D The Scope Tab on page 20.
Filename Extensions

use the default list of excluded file types, or define your own list to
include, or exclude, from the crawl. For more information, see Section
2.4.4.E The Filename Extensions Tab on page 21.
Credentials

enter the necessary names and passwords for password-protected sites
where access is prevented without this information. When you specify this
data, you enable the pages on these sites to be collected. For more
information, see Section 2.4.4.F The Credentials Tab on page 23.

2.4.4.B The General Settings Tab


Specify the settings that determine the overall way that the web crawler runs.
Display 2.5 General Settings Pane

Use the components of this window to specify the general settings.


Table 2-2: General Settings Pane Components
Component Description

HTTP proxy

Specify the proxy server here, or click Auto-detect as explained below.

Auto-detect button

Access the Select an HTTP Proxy window where you can choose a proxy
server. For more information, see Section 2.14.3 The Select an HTTP
Proxy Window on page 120.

Quota (files)

(Default: 25) Use the controls to change the file limit for the web crawler.

Quota (megabytes)

(Default: 1000) Use the controls to change the megabyte limit for the
crawler. This is the maximum size of all collected files.

Number of downloader threads

(Default: 1) Use the controls to change the total number of threads that
can be created.

Sleep interval (seconds)

(Default: 1) Use the controls to change the number of seconds that the
web crawler pauses between page downloads.

Timeout (seconds)

(Default: 300) Use the controls to change the number of seconds the web
crawler waits before it stops attempting to download a specific page.

Maximum number of retries

(Default: 3) Use the controls to change the highest number of times the
crawler can attempt to download a page before it tries the next one.

Retry delay (seconds)

(Default: 300) Use the controls to change the highest number of seconds
that the web crawler waits before it reattempts to download a page.

Respect robots.txt

(Default: Yes) You can select No instead. The robots.txt standard enables
Web site authors to request that crawlers (robots) avoid downloading
some portions of their site. Select No to ignore this request.

Find links in Javascript and Flash

(Default: Yes) You can select No instead. Select No to prohibit the web
crawler from returning links that the crawler finds in either of these two
types of code.

Link traversal order

(Default: Breadth first) You can select Depth first instead. Breadth first
means the top layer of linked pages at one site are gathered before the
links to the next layer are followed. If you select Depth first, the links
are followed on the first page. Then the crawler goes to the second page,
and so on.
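
The difference between the two link traversal orders, together with the file quota, can be sketched in Python as follows. This is a simplified model and not the crawler's actual algorithm: the link graph is a plain dictionary, and the function name and the `order` values are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List


def crawl_order(links: Dict[str, List[str]], entry: str,
                order: str = "breadth", quota: int = 25) -> List[str]:
    """Return the pages visited, in order, until the quota is reached.

    links maps each page to the pages it links to. With order="breadth"
    the frontier is a queue; with any other value it is used as a stack.
    """
    frontier = deque([entry])
    seen = {entry}
    visited: List[str] = []
    while frontier and len(visited) < quota:
        url = frontier.popleft() if order == "breadth" else frontier.pop()
        visited.append(url)
        children = links.get(url, [])
        if order != "breadth":
            # Reverse so that the first link on a page is followed first.
            children = list(reversed(children))
        for child in children:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return visited


site = {"/": ["/a", "/b"], "/a": ["/a1", "/a2"], "/b": ["/b1"]}
```

With this small site, breadth-first order gathers `/a` and `/b` before any of their children, while depth-first order follows `/a` down to `/a1` and `/a2` before returning to `/b`. A small quota therefore yields different pages depending on the order chosen.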


2.4.4.C The Entry Points Tab


Specify the Web addresses, or URLs, where the web crawler begins its crawl.
You can also specify the quota, or maximum number of files that can be
downloaded.
Display 2.6 Entry Points Pane

URL

see the list of the entry points to the Web for the web crawler. The
crawler accesses the URLs in the order in which they are listed in the Entry
Points pane.
Hint:

This ordering can affect the number of documents returned when you enter
a quota for the web crawler or for each URL. For more information about
the quota for the web crawler, see Section 2.4.4.B The General Settings
Tab on page 15.

Quota

see the maximum number of files that can be collected for each Web
address. When you specify the quota for a URL and the web crawler, the
smaller of the two numbers applies. In other words, if you specify 100 for
the web crawler and 35 for the selected URL, only 35 documents can be
downloaded for this URL. On the other hand, if you specify 100
documents for this URL and 35 for the web crawler, only 35
documents can be downloaded for this URL.
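
The quota rule above reduces to taking the smaller of the two limits. A minimal sketch follows; the function name is hypothetical, and the handling of a URL without its own quota is an assumption.

```python
from typing import Optional


def effective_quota(crawler_quota: int, url_quota: Optional[int] = None) -> int:
    """Return the number of files that can be downloaded for one entry point.

    Assumption: when no quota is set for the URL, only the crawler quota applies.
    """
    if url_quota is None:
        return crawler_quota
    return min(crawler_quota, url_quota)
```

For the two cases in the text, `effective_quota(100, 35)` and `effective_quota(35, 100)` both give 35.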
Add

access the Add Entry Point window where you can specify a Web address
to begin the crawl. See Section 2.14.4 The Add Entry Point Window on
page 121.
Remove

delete the selected entry point and quota from the address pane.
Edit

access the Edit Entry Point window to make changes to the selected Web
address. See Section 2.14.5 The Edit Entry Point Window on page 125.


2.4.4.D The Scope Tab


Specify the Web addresses, or URLs, that the web crawler is allowed to crawl
or must exclude from the crawl.
Display 2.7 Scope Pane

URL Pattern

specify a pattern for matching URLs.


Match Type

see prefix or regular expression. Both are patterns. Select prefix to
specify a match at the beginning of the URL. Select regular expression
when you want to use Teragram regular expressions to specify the pattern
of the searchable URLs. Regular expressions enable you to apply greater
precision to the collection operation.
Action

specifies Allow or Exclude to either download, or to prevent a download,
for pages whose URLs match the specified pattern.
Add

access the Add Scope Rule window. Scope determines the links that the
crawler follows, if the URL has the specified prefix. The links in this URL
are followed only if no other scope rules exclude this URL. See Section
2.14.8 The Add Scope Rule Window on page 130.
Remove

delete the selected URL pattern and its attributes.


Edit

access the Edit Scope Rule window. See Section 2.14.9 The Edit Scope
Rule Window on page 132.
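
One way to picture how scope rules combine is the sketch below. It rests on several assumptions: Python's `re` module stands in for Teragram regular expressions (whose syntax may differ), the class and function names are invented for this example, and a URL that matches no Allow rule is treated as out of scope.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ScopeRule:
    pattern: str
    match_type: str  # "prefix" or "regex"
    action: str      # "allow" or "exclude"

    def matches(self, url: str) -> bool:
        if self.match_type == "prefix":
            return url.startswith(self.pattern)
        # Stand-in only: Python regex syntax, not Teragram syntax.
        return re.search(self.pattern, url) is not None


def in_scope(url: str, rules: List[ScopeRule]) -> bool:
    """A URL is followed if some Allow rule matches and no Exclude rule does."""
    matching = [rule for rule in rules if rule.matches(url)]
    if any(rule.action == "exclude" for rule in matching):
        return False
    return any(rule.action == "allow" for rule in matching)


rules = [
    ScopeRule("http://example.com/", "prefix", "allow"),
    ScopeRule(r"\.pdf$", "regex", "exclude"),
]
```

With these two rules, pages under `http://example.com/` are followed unless their URL ends in `.pdf`, which shows an Exclude rule overriding an Allow rule that matches the same URL.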

2.4.4.E The Filename Extensions Tab


Specify the file extensions that are excluded or included.
Note:

If you specify one file type as included, only the specified file types
are returned.
Display 2.8 Filename Extensions Pane

Extension

see the list of file types that are listed by their file extension.
Action

see the status of each file type, that is, whether this extension is excluded
or allowed. If the file type is specified as Allow, the crawler can return this
type of page. If you enable at least one type of file to be returned, only
those files with the Allow operation are returned.
Add

access the Add Filename Extension window to add an extension that is allowed
or prohibited. See Section 2.14.10 The Add Filename Extension Window
on page 133.
Remove

delete the selected file type.


Edit

access the Edit Filename Extension window to make a change to the file
extension or the operation. See Section 2.14.11 The Edit Filename
Extension Window on page 134.
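
The interaction between Allow and Exclude entries for filename extensions can be expressed as a small decision function. This is a sketch under stated assumptions: the function name and the dictionary layout are hypothetical, and extensions are compared without the leading dot and without regard to case.

```python
import os
from typing import Dict


def extension_allowed(filename: str, rules: Dict[str, str]) -> bool:
    """Decide whether a file can be returned.

    rules maps an extension (no dot) to "allow" or "exclude". If at least
    one extension is allowed, only allowed extensions pass; otherwise
    everything passes except the excluded extensions.
    """
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    allowed = {ext for ext, action in rules.items() if action == "allow"}
    if allowed:
        return extension in allowed
    return rules.get(extension) != "exclude"
```

For example, with only `gif` and `zip` excluded, an `.html` page passes; once `pdf` is added as an Allow entry, that same `.html` page no longer passes.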


2.4.4.F The Credentials Tab


Specify the sign-in information that enables you to gain access to the specified
password-protected sites.
Display 2.9 Credentials Pane

Site

see a list of Web sites that are password-protected.


Username

see the name of the user for each password-protected site.


Password

see the password assigned to each user. The password is the second
component of the credentials required for HTTP authentication.
Add

access the Add Credential window to add a password-protected site to
crawl. See Section 2.14.12 The Add Credential Window on page 135.
Remove

delete the selected site with its credentials.


Edit

access the Edit Credential window to make a change to the specifications
for the password-protected site. See Section 2.14.13 The Edit Credential
Window on page 136.

2.4.4.G The Log Tab


This pane provides information about the web crawler operations. For more
information, see Section 2.3.2 The Log Tab on page 11.

2.5 Viewing the File Crawler Pane


2.5.1 Overview of the File Crawler Pane
The file crawler searches the files on your file system and returns these files.
The operations that are available when you click the File Crawler tab are
explained in the following subsections.
Display 2.10 File Crawler Pane

2.5.2 The Buttons


The buttons that are available in the File Crawler pane are the same buttons
that are available for the web crawler. For more information, see Section 2.4.2
The Buttons on page 13.


2.5.3 The Status Tab


See whether the file crawler is running.
Display 2.11 Status Pane

2.5.4 The Configuration Tab


2.5.4.A The Main Tabs in the Configuration Tab
The file crawler does not run until it is configured. Specify the settings for the
file crawler, the points where the crawler enters a file system, the limits of its
crawl, and the file types that it returns.
Display 2.12 Configuration Tab


General Settings

specify how the file crawler runs, specify a date range for returned
documents, and how .xml files are handled. For more information, see
Section 2.5.4.B The General Settings Pane on page 26.
Paths

specify the directories that the file crawler accesses. For more information,
see Section 2.5.4.C The Paths Pane on page 28.
Paths to Exclude

exclude certain directories from the crawl. For more information, see
Section 2.5.4.D The Paths to Exclude Pane on page 29.
Filename Extensions

(optional) if you choose to specify the types of files that can be returned,
only these are permitted. Leave this pane empty if you want the file
crawler to return all file types. For more information, see Section 2.5.4.E
The Filename Extensions Pane on page 30.

2.5.4.B The General Settings Pane


Specify the settings that determine the overall way that the file crawler runs.
Display 2.13 General Settings Pane

Maximum file size (megabytes)

(default is 10) use the controls to reset the limit for the size of each file.

Oldest date

access the calendar where you can select the first date that the
crawler can use for the files that it returns to a query.

Crawl continuously

(default setting is No) you can select Yes instead. If you leave the default
selection, start the crawler when you update the files in your file system.

Encapsulate XML files

(default setting is No) select Yes to keep XML files intact throughout
processing.

By default, XML files are passed by the pipeline server with top-level tags
turned into similar-named fields in the document. If you want to exercise
more control over these fields, set this specification to Yes. This setting
enables you to turn nested tags into fields. When you make this selection,
also specify the parse_xml document processor in the pipeline server.
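
The default handling of XML files described above, in which top-level tags become similarly named fields, can be illustrated with Python's standard `xml.etree.ElementTree` module. This sketch is not the parse_xml processor: the function name is hypothetical, "top-level" is assumed to mean the direct children of the root element, and the text of nested tags is simply flattened into the parent field.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def top_level_fields(xml_text: str) -> Dict[str, str]:
    """Turn each direct child of the root element into a field.

    Text inside nested tags is concatenated into the parent's value,
    so nested structure is not preserved in this default-style handling.
    """
    root = ET.fromstring(xml_text)
    fields: Dict[str, str] = {}
    for child in root:
        fields[child.tag] = "".join(child.itertext()).strip()
    return fields


sample = "<doc><title>Report</title><body>Sales <b>rose</b>.</body></doc>"
```

Calling `top_level_fields(sample)` yields a `title` field and a `body` field; the nested `b` tag does not become a field of its own, which is the kind of case where encapsulating the file and parsing it later gives more control.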


2.5.4.C The Paths Pane


See the addresses that the crawler can use to gather documents. You add these
paths using the Add button.
Display 2.14 Paths Pane

Add

access the Add Path window. See Section 2.14.14 The Add Path Window
on page 137.
Remove

delete the selected path.


Edit

access the Edit Path window to change the text that specifies an address.
See Section 2.14.15 The Edit Path Window on page 138.


2.5.4.D The Paths to Exclude Pane


See the paths that are not crawled here. This pane also enables you to identify
exceptions to the specified list in the Paths pane. These exceptions are
subdirectories, or files. They are not crawled.
Display 2.15 Paths to Exclude Pane

Add

access the Add Path to Exclude window. See Section 2.14.16 The Add
Path to Exclude Window on page 138.
Remove

delete the selected path.


Edit

access the Edit Path to Exclude window. See Section 2.14.17 The Edit
Path to Exclude Window on page 139.
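
How the file crawler settings might combine (paths, paths to exclude, maximum file size, and oldest date) is sketched below on plain metadata records, so that no real file system is needed. The record type, the function name, the use of POSIX-style paths, and the comparison on modification date are all assumptions made for this illustration.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Iterable, List, NamedTuple, Optional, Sequence


class FileInfo(NamedTuple):
    path: str
    size_bytes: int
    modified: datetime


def _under(path: str, roots: Sequence[str]) -> bool:
    """True if path equals one of the roots or lies beneath it."""
    candidate = PurePosixPath(path)
    return any(candidate == PurePosixPath(root)
               or PurePosixPath(root) in candidate.parents
               for root in roots)


def select_files(files: Iterable[FileInfo],
                 include_paths: Sequence[str],
                 exclude_paths: Sequence[str] = (),
                 max_megabytes: int = 10,
                 oldest: Optional[datetime] = None) -> List[str]:
    """Return the paths that pass every file crawler setting."""
    max_bytes = max_megabytes * 1024 * 1024
    selected = []
    for info in files:
        if not _under(info.path, include_paths):
            continue  # outside the crawled paths
        if _under(info.path, exclude_paths):
            continue  # an excluded subdirectory or file
        if info.size_bytes > max_bytes:
            continue  # larger than the maximum file size
        if oldest is not None and info.modified < oldest:
            continue  # older than the oldest date
        selected.append(info.path)
    return selected
```

The exclusions act as exceptions to the included paths: a file under both an included path and an excluded path is not returned.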


2.5.4.E The Filename Extensions Pane


Specify the file extensions that can be returned. If you enter any file
extensions, only files with those extensions are crawled.
Display 2.16 Filename Extensions Pane

Add button

access the Add Extension window. See Section 2.14.10 The Add Filename
Extension Window on page 133.
Remove

delete the selected file type.


Edit

access the Edit Filename Extension window. See Section 2.14.11 The Edit
Filename Extension Window on page 134.


2.6 Viewing the Feed Crawler Pane


2.6.1 Overview of the Feed Crawler Pane
The feed crawler collects frequently updated documents on the Web. For
example, the feed crawler collects pages from blogs and from forums. It can
also collect pages such as press releases. Unlike the web crawler, the feed
crawler collects only the Web documents that come from feeds.
The operations that are available when you click the Feed Crawler tab are
explained in the following subsections.
Display 2.17 Feed Crawler Pane

2.6.2 The Buttons


The buttons that are available in the Feed Crawler pane are the same buttons
that are available for the web crawler. For more information, see Section 2.4
Viewing the Web Crawler Pane on page 12.


2.6.3 The Status Tab


See whether the feed crawler is running.
Display 2.18 Status Pane.

2.6.4 The Configuration Tab


2.6.4.A The Main Tabs in the Configuration Tab
The feed crawler does not run until it is configured. Specify the settings for the
feed crawler, the URLs where this crawler enters the Internet, and the limits of
its crawl.
Display 2.19 Configuration Pane

General Settings

specify how the feed crawler runs, the server, and other information
necessary to the crawl. For more information, see Section 2.6.4.B The
General Settings Tab on page 33.
Feeds

crawl one, or more, feeds using this pane. For more information, see
Section 2.6.4.C The Feeds Tab on page 34.


2.6.4.B The General Settings Tab


Use the General Settings tab for the feed crawler to configure the feed server
and to specify information specific to the crawl.
Display 2.20 General Settings Tab

HTTP proxy

specify the server that you are accessing here, or click Auto-detect as
explained below.
Auto-detect

access the Select an HTTP Proxy window where you can choose a proxy
server or enter the address for this server. For more information, see
Section 2.14.3 The Select an HTTP Proxy Window on page 120.
Crawl continuously

(default setting is Yes) you can select No instead. If you select Yes, fresh
content is always available. For more information about this setting, see
Section 7.2 Configure the Feed Crawler on page 218.
Recrawl interval

(default is 600) use the controls to select another wait time in seconds.

User agent

(default agent is SAS Feed Crawler) enter the name of a third-party feed
crawler, if you choose.


2.6.4.C The Feeds Tab


Use the Feeds tab for the feed crawler to configure the feed crawler, to specify
the feed addresses, whether links are followed, and other information.
Display 2.21 Feeds Tab

Feed URL

specify the Web address for a feed.


Follow Links

choose whether the feed crawler should crawl links in the selected feed.
For more information, see Section 2.14.3 The Select an HTTP Proxy
Window on page 120.
Add

access the Add Feed window where you can choose the address for the
feed that you want to crawl. For more information, see Section 2.14.6 The
Add Feed Window on page 126.
Remove

delete the selected feed URL.


Edit

access the Edit Feed window. See Section 2.14.7 The Edit
Feed Window on page 129.

2.6.5 The Log Tab


See Section 2.3.2 The Log Tab on page 11.

2.7 Viewing the Proxy Server Pane


2.7.1 Overview of the Proxy Server Pane
The proxy server is an intermediary server. The proxy server takes the
documents gathered by the crawlers and sends them to the pipeline server for
processing. As an intermediary server, the proxy server provides two benefits.
First, the proxy server enables you to pause the flow of documents. The
incoming documents form a queue until the proxy server is resumed. Use the
pause operation to perform maintenance on the system without interrupting the
crawlers.
Second, you can choose to specify more than one pipeline server. In this case,
the pipeline servers run on different machines and the proxy server sends a
copy of each incoming document to each machine. These servers serve as
mirrors in case of hardware failure.
The operations that are available when you click the Proxy Server tab are
explained in the following subsections.
Display 2.22 Proxy Server Pane
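
The two benefits described above can be illustrated with a small model. This is only a sketch, not the product's implementation: the class name ProxySketch, its methods, and the use of plain Python lists to stand in for pipeline servers are all assumptions made for illustration.

```python
from collections import deque


class ProxySketch:
    """Toy model of a proxy: queue documents while paused, copy to every backend."""

    def __init__(self, backends):
        # Each backend is a plain list standing in for one pipeline server (assumption).
        self.backends = backends
        self.queue = deque()
        self.paused = False
        self.received = 0
        self.processed = 0

    def receive(self, document):
        """Accept a document from a crawler and forward it unless paused."""
        self.received += 1
        self.queue.append(document)
        if not self.paused:
            self._drain()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self._drain()

    def _drain(self):
        # Send a separate copy of each queued document to every backend.
        while self.queue:
            document = self.queue.popleft()
            for backend in self.backends:
                backend.append(dict(document))
            self.processed += 1


primary, mirror = [], []
proxy = ProxySketch([primary, mirror])
proxy.receive({"title": "a"})
proxy.pause()
proxy.receive({"title": "b"})      # held in the queue while paused
queued_while_paused = len(proxy.queue)
proxy.resume()                     # the queued document is now forwarded
```

The counters in the sketch loosely mirror the Documents received, Documents processed, and Documents queued values that the Status tab reports.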


2.7.2 The Buttons


The Start, Stop, and Apply Changes buttons work for the proxy server like
they work for the crawlers. For more information, see Section 2.4 Viewing the
Web Crawler Pane on page 12.
Pause

stop the proxy server temporarily.


Resume

restart the proxy server operations.

2.7.3 The Status Tab


Use the Status tab to see whether the proxy server is running.
Display 2.23 Status Pane

Documents received

see the number of documents handed to the proxy server.


Documents processed

see the number of documents handled by the proxy server.


Documents queued

see the number of documents waiting to be processed by the proxy server.


Last document received

see the timestamp of the latest document that the proxy server accepted.
Last document processed

see the timestamp of the latest document that the proxy server handed to
another server.

2.7.4 The Configuration Tab


Use the Configuration pane to specify the pipeline server where the proxy
server sends a copy of each input document. You can choose to specify several
pipeline servers running on different ports.
You can use multiple pipeline servers for mirroring. To perform this operation,
configure the mirror servers with the same specifications that you set for the
local pipeline server.
You can also set up multiple pipeline servers to perform different sets of
operations on your input documents.


Display 2.24 Configuration Pane

Host

see the name of the pipeline server. By default, when this server is
running, the information for the local machine appears in the
Configuration pane.
Port

see the number of the port where the pipeline server is running.
Status

see whether the pipeline server is running.


Add

click to access the Add Backend window. Here you can specify additional
pipeline servers. For more information, see Section 2.14.20 The Add
Backend Window on page 141.
Remove

delete the selected server.


Edit

click to access the Edit Backend window. Here you can change the
pipeline server. For more information, see Section 2.14.21 The Edit
Backend Window on page 142.

2.7.5 The Log Tab


This pane provides information about the proxy server operations. For more
information, see Section 2.3.2 The Log Tab on page 11.

2.8 Viewing the Pipeline Server Pane


2.8.1 Overview of the Pipeline Server Tab
The pipeline server is used to analyze, modify, and export each document
before it is sent to the indexer. The operations that are available when you
click the Pipeline Server tab are explained in the following subsections.
Display 2.25 Pipeline Server Pane


2.8.2 The Buttons


The Start and Stop buttons work for the pipeline server in the same ways that
they work for the crawlers. For more information, see Section 2.4.2 The
Buttons on page 13.
Apply Changes

click, if the index server is running. The indexing server is restarted and
the changes take effect for the new index.

2.8.3 The Status Tab


See whether the pipeline server is running and observe the document
processing stages.
Display 2.26 Status Pane

Pipeline Stage

see the four stages of the pipeline server:


Overall

see the documents that have completed all of the document processing
XML parsing stages.
Document processing

see how the document processors are acting on input documents.


Sending to the indexer

see the documents that are in the indexing process.


Pending

see the number of documents that are in the queue for each stage, with the
exception of Overall, in the pipeline.
Finished

see the number of documents that completed the processing operations in
each stage.
Last busy time

see the latest processing date and time.

2.8.4 The Document Processors Tab


Document processors act on the documents in the pipeline. Document
processors analyze the data in input documents, perform various operations on
the documents, and export the data. Each document consists of a set of
field-value pairs. These operations are performed before the documents are
added to the index. In some cases, document processors pass documents
directly to another application for analysis, such as SAS Sentiment Analysis
Workbench.
Display 2.27 Document Processors Pane


Add

click to access the Add Document Processor window. For more
information, see Section 2.13 The Add Document Processor Windows on
page 73.
Remove

delete the selected processor.


Edit

access the Document Processor window that is specific to the selected
operation. You can make some changes here. To make a more
comprehensive set of changes including changes to the caption, see
Section 13.3.5 Specify Labels for Facetted Search on page 321.
Move Up

reorder the document processing operations by moving the selected
operation up one level.
Move Down

reorder the document processing operations by moving the selected
operation down one level.
Note: Order the document processor operations according to the types of
operations that are necessary to achieve the desired results. For more
information, see Section 9.2 Configuring the Pipeline Server on page 235.
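
To make the ordering point concrete, the sketch below models documents as dictionaries of field-value pairs and document processors as functions applied in sequence. It is a minimal illustration under stated assumptions, not the pipeline server's code. The name add_field is borrowed from a processor listed in this guide, but its implementation here is invented, and strip_markup, tag_category, and run_pipeline are hypothetical helpers.

```python
import re


def add_field(name, value):
    """Return a processor that sets the same field value on every document."""
    def processor(document):
        document[name] = value
        return document
    return processor


def strip_markup(document):
    """Hypothetical processor: remove anything that looks like a markup tag."""
    document["body"] = re.sub(r"<[^>]+>", "", document.get("body", ""))
    return document


def tag_category(document):
    """Hypothetical toy categorizer: a keyword test on the body text."""
    if "sale" in document.get("body", "").lower():
        document["category"] = "commerce"
    return document


def run_pipeline(document, processors):
    """Apply processors in order and keep a copy of the document after each stage."""
    snapshots = [dict(document)]
    for processor in processors:
        document = processor(dict(document))
        snapshots.append(dict(document))
    return snapshots


raw = {"body": '<b class="sale">Quarterly report</b>'}

# Markup is removed first, so the categorizer sees only the visible text.
good = run_pipeline(raw, [strip_markup, tag_category, add_field("source", "intranet")])[-1]

# Categorizing first matches the word "sale" inside the tag attribute.
bad = run_pipeline(raw, [tag_category, strip_markup])[-1]
```

In the second ordering the document is labeled by text that the reader never sees, which is the kind of result that a deliberate processor order avoids. The list of per-stage copies returned by run_pipeline is also a rough analogy for viewing a document at each pipeline stage.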

2.8.5 The Document Inspector Tab


The Document Inspector pane enables you to see all of the versions of the
input document for each stage of the pipeline, simultaneously. At each stage of
the pipeline, the original document changes, but you can still see its original
text.
This snapshot operation is available for one document at a time. It is available
only when the documents are in the pipeline server.


Display 2.28 Document Inspector Pane

Take Snapshot

click this button and you can see the document in the various pipeline
stages.
Cancel

click this button to stop the snapshot operation.


Processing Stage

see a document. Click on a document field to see the document number in
the Document pane.
Document

see the number of this document. Click on the document number to see the
field names for this document in the Field pane.


Field

see the field names in this document. Click on a field to see the contents of
the selected document in the Document Inspector pane.
Document Inspector pane

see the contents of the chosen field of the selected document at a specific
stage.

2.8.6 The Log Tab


This pane provides information about the pipeline server operations. For more
information, see Section 2.3.2 The Log Tab on page 11.

2.9 Viewing the Indexing Server Pane


2.9.1 Overview of the Indexing Server Pane
The indexing server builds a searchable index of the input documents. If you
want to enable end users to search the collected documents, build an index.
Each document in the index consists of a set of field-value pairs. These fields
are populated when they are matched to similarly named fields in the
documents passed to the indexing server by the pipeline server.
You can build one index at a time from the documents provided by the
continuously running crawlers. Use the different types of index fields for
querying.
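
The name matching described above can be sketched as follows. This is an illustrative model only: the function build_index and the posting-set structure are assumptions, and the actual index format is not described in this guide. The field names title, date, and body are the defaults that the guide lists for the index configuration.

```python
from collections import defaultdict

# Default index field names mentioned in this guide.
INDEX_FIELDS = ("title", "date", "body")


def build_index(documents, index_fields=INDEX_FIELDS):
    """Keep only fields whose names match the index configuration, then
    record which documents contain each (field, term) pair."""
    stored = []
    postings = defaultdict(set)
    for doc_id, document in enumerate(documents):
        kept = {name: document[name] for name in index_fields if name in document}
        stored.append(kept)
        for name, value in kept.items():
            for term in str(value).lower().split():
                postings[(name, term)].add(doc_id)
    return stored, postings


documents = [
    {"title": "Feed crawler", "body": "crawls feeds", "author": "x"},
    {"title": "Proxy", "body": "queues feeds"},
]
stored, postings = build_index(documents)
```

In this sketch the author field is dropped because no index field has that name, while a lookup such as postings[("body", "feeds")] finds both documents.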


Display 2.29 Indexing Server Pane

2.9.2 The Buttons


The Start, Stop, and Apply Changes buttons that are available in the
indexing server pane are the same buttons that are available for the web
crawler. For more information, see Section 2.4.2 The Buttons on page 13.
Delete Index

remove the existing index. A new index can be built with the new
configuration after you restart the crawler.
Revert

click when the indexing server is running and the existing index is deleted.
A new index can be built when new documents are input.


2.9.3 The Status Tab


See whether the indexing server is running.
Display 2.30 Indexing Server Status Pane

2.9.4 The Configuration Tab


The indexing server can be reconfigured according to your specifications.
Display 2.31 Indexing Server Configuration Pane

Field Name

add to, or delete from, the list of field names entered by default. The
default list includes title, date, and body.
Functionality

add to, or delete from, the list of uses for these fields. For more
information, see Section 2.14.22 The Add Field Window for the Indexing
Server on page 143.


Add

access the Add Field window where you add fields to the index according
to the purpose that they are intended to serve. For more information, see
Section 2.14.22 The Add Field Window for the Indexing Server on
page 143.
Remove

delete the selected field from the index configuration. Use this button to
change the configuration of the next index that is built. Any changes do
not affect the current index.
Edit

access the Edit Field window where you make changes to the fields that
you added to the index. For more information, see Section 2.14.22 The
Add Field Window for the Indexing Server on page 143.
Language

optimize the index for the selected language. Click to select another
language.

2.9.5 The Log Tab


This pane provides information about the indexing server operations. For more
information, see Section 2.3.2 The Log Tab on page 11.


2.10 Viewing the Query Server Pane


2.10.1 Overview of the Query Server Tab
The query server uses the index built by the indexing server to locate matching
documents in response to queries. To pass queries to the query server, use the
query web server, which is one of the components of SAS Information Retrieval
Studio, or a custom application that you write with the query API. For more
information about the query API, see the SAS Search and Indexing: User and C
and Java API Guide.
Display 2.32 Query Server Pane

2.10.2 The Buttons


The Start and Stop buttons work for the query server like they work for the
crawlers. For more information, see Section 2.4.2 The Buttons on page 13.

2.10.3 The Status Tab


See whether the query server is running.
Display 2.33 Query Server Status Pane


2.10.4 The Log Tab


This pane provides information about the query server operations. For more
information, see Section 2.3.2 The Log Tab on page 11.

2.11 Viewing the Query Web Server Pane


2.11.1 Overview of the Query Web Server Tab
Use the query web server to format the search window that the end user sees.
Also configure this server to display the search results with or without
labels, and to specify other ways of rendering the search results.
Display 2.34 Query Web Server Pane

2.11.2 The Buttons


The Start, Stop, and Apply Changes buttons work for the query web server
like they work for the crawlers. For more information, see Section 2.4.2 The
Buttons on page 13.
Revert

restore the default settings.


2.11.3 The Status Tab


See whether the query web server is running.
Display 2.35 Query Web Server Pane

2.11.4 The Configuration Tab


2.11.4.A The Main Tabs in the Configuration Tab
Specify the settings for the query web server in the Configuration tab. You
configure the query web server when you specify the parser, the processors,
and the order for document processing.
Display 2.36 Configuration Pane

Server Port

(default is 9100) click the up or down arrow to select another port where
the query web server runs.

Matching tab

specify the types of searches that the end user can input, the fields the user
can specify, and the weight of each field. For more information, see
Section 2.11.4.B The Matching Tab on page 51.


Sorting tab

specify how the matching documents are ranked and their parameters. For
more information, see Section 2.11.4.C The Sorting Tab on page 53.
Labels tab

specify labels when you choose to enable facetted search. Facetted search
uses a web-like system of related labels to enable users to intuitively
locate the results that they seek. For more information, see Section
2.11.4.D The Labels Tab on page 56.
Match Formatting tab

specify how documents that match the query are displayed in the list of
results. For more information, see Section 2.11.4.E The Match
Formatting Tab on page 58.
Theme tab

specify the look and feel of the query interface. For more information, see
Section 2.11.4.F The Theme Tabs on page 60.

2.11.4.B The Matching Tab


Use the Matching pane to specify the types of searches users can use to query
the index. Before you specify the fields in the index that can be searched and
the weight of the matches returned, consider the simple and advanced search
types.
Select the simple, or fsearch query syntax, and the user enters words and
phrases. The user can also mark these words and quoted phrases as required or
excluded by prefixing them with plus (+) or minus (-) signs.
Select the advanced bsearch query syntax, and the user enters field names as
part of a query expression. The words and phrases can be combined with
Boolean, positional, and counting operators.
For more information about searching, see the SAS Information Retrieval
Studio: Users Guide.
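
As a rough illustration of the simple syntax described above, the sketch below sorts the words and quoted phrases of a query into required, excluded, and optional terms based on the plus and minus prefixes. It is not the fsearch parser. The function name parse_simple_query, the regular expression, and the returned dictionary layout are assumptions made for this example.

```python
import re

# An optional + or - sign, followed by a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_simple_query(query):
    """Split a simple query into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue  # skip an empty quoted phrase
        bucket = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[bucket].append(term)
    return parsed


example = parse_simple_query('+sas "text mining" -draft')
```

Here example holds sas as a required term, draft as an excluded term, and the phrase text mining as an optional term. A hyphen inside a word, as in well-known, is not treated as a prefix.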


Display 2.37 Matching Pane

Search type

(default setting is Simple) click to select Advanced.

Field Name

(default is Body) see the names of the fields to search with input queries.
Weight

(default is 1) specify a scaling factor that sets the importance of one field
relative to another.


Add

access the Add Field window. For more information, see Section 2.14.23
The Add Field Window for the Query Web Server Matching Pane on
page 146.
Remove

delete the selected field.


Edit

access the Edit Field window. For more information, see Section 2.14.24
The Edit Field Window for the Query Web Server Matching Pane on
page 147.

2.11.4.C The Sorting Tab


Use the Sorting tab to specify how documents are ranked when a query returns
multiple matches. The fields in the Sorting pane change according to the Sort
type that you select. At this time, only the combinations that appear for a
specific selection are possible. The default selection, Relevancy, is shown
below:
Display 2.38 Sorting Pane

To use the components of this pane, see the following table:


Table 2-3: Sorting Pane Components

Sort type

Specify the relative importance of the matching documents according to the
metrics that you choose. For example, Cosine Weight (the only metric that is
part of relevancy by default) and Freshness Weight. The Sort type selections
are as follows:

Relevancy

Use the weight of specified fields, in combination with the following
metrics, to determine the best returns. These metrics are Cosine Weight,
Proximity Weight, Position Weight, Density Weight, and Freshness Weight.

Number of matching terms

Returns the matching document with the highest number of terms that match
those in the query syntax. You can select a tiebreaker to determine a match
when two or more documents meet this threshold.

Number of matching fields

Returns the matching document with the highest number of matching fields.
You can also choose a tiebreaker in cases where there are two or more
matching documents.

Date (newest first)

Returns the matching document with the most recent date. The tiebreaker is
Order added to the index.

Date (oldest first)

Returns the matching document with the earliest date. The tiebreaker is
Order added to the index.

Field value (largest first)

Returns the matching documents with the highest numeric value stored in a
field. For example, price or rating. You determine the field in Sort field.
Choose a tiebreaker for two or more matching documents.

Field value (smallest first)

Returns the matching documents with the lowest numeric value stored in a
field. For example, price or rating. You determine the field in Sort field.
Choose a tiebreaker for two or more matching documents.

Field value (alphabetical)

Returns the matching documents with matches in alphabetical order. You
determine the field in Sort field. Choose a tiebreaker for two or more
matching documents.

Order added to the index

Select to make the first document input to the index the matching document.
This is true when two or more documents both meet the match requirements.

Tiebreaker

Click to make a different selection. See the relevant Sort type above. There
can be as many as three tiebreaker fields, depending on the selection that
you make in the Sort type field.

Specify the following weights according to your values. In other words, if
the density weight is more important than any of the other weights, specify
the highest weight number for this field.

Cosine Weight

(Default: 1) Click the up or down arrow to change this metric that weights
frequently occurring terms more highly than those that are infrequent. This
metric also takes noise words into consideration. (Noise words are the words
that appear with enough frequency that they are ranked down.)

Proximity Weight

(Default: 0) Click the up or down arrow to specify how to weight matching
query terms that are located close together in the document.

Position Weight

(Default: 0) Click the up or down arrow to change the weight assigned to
matches on words located close to the beginning of the document.

Density Weight

(Default: 0) Click the up or down arrow to change the metric that balances
the number of matched query terms with the total number of words in the
matching document. The number of match instances is measured as a percentage
of the document.

Freshness Weight

(Default: 0) Click the up or down arrow to change the number that determines
the age of the matching document. This metric combines several factors
besides the age of the document.
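
The sketch below illustrates two ideas from the table: combining metrics by weight, and breaking ties by the order in which documents were added to the index. The guide does not state how the product combines the weights, so the weighted sum in relevancy_score is an assumption chosen for illustration, as are the function names, the metric values, and the record fields matching_terms and order_added.

```python
# Default weights from the table: only Cosine Weight is nonzero.
DEFAULT_WEIGHTS = {"cosine": 1, "proximity": 0, "position": 0, "density": 0, "freshness": 0}


def relevancy_score(metrics, weights):
    """Assumed combination rule: a weighted sum of the per-document metrics."""
    return sum(weights.get(name, 0) * value for name, value in metrics.items())


def sort_by_matching_terms(matches):
    """Most matching terms first; ties go to the document added to the index earlier."""
    return sorted(matches, key=lambda match: (-match["matching_terms"], match["order_added"]))


metrics = {"cosine": 0.5, "proximity": 0.25, "density": 0.75}
default_score = relevancy_score(metrics, DEFAULT_WEIGHTS)
density_score = relevancy_score(metrics, {**DEFAULT_WEIGHTS, "density": 2})

matches = [
    {"id": "a", "matching_terms": 2, "order_added": 3},
    {"id": "b", "matching_terms": 3, "order_added": 5},
    {"id": "c", "matching_terms": 2, "order_added": 1},
]
ranked_ids = [match["id"] for match in sort_by_matching_terms(matches)]
```

Under this assumed rule, raising the density weight above the others lets the density metric dominate the score, which matches the guidance to give the most important metric the highest weight number.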


2.11.4.D The Labels Tab


Specify labels that enable your end users to use facetted search to intuitively
locate the information that they seek. For more information about facetted
search, see Section 3.7 Defining Labels for Facetted Search on page 163.
Display 2.39 Labels Pane

Field Name

see the names of the fields that you entered with the Add button.
Caption

see the label that you added with the Add button.
Add

access the Add Field window. For more information, see Section 2.14.25
The Add Field Window: Query Web Server Labels Pane on page 148.
Remove

delete the selected field.


Edit

access the Edit Field window. For more information, see Section 2.14.26
The Edit Field Window: Query Web Server Labels on page 150.
Move Up

change the location of the selected field. Click to move the selected field
up one level in the display shown on the search results page.
Move Down

change the location of the selected field. Click to move the selected field
down one level in the hierarchical taxonomy.
Maximum number of related labels

(default is 10) specify the maximum number of labels that can be displayed
in response to a query.


2.11.4.E The Match Formatting Tab


Specify how matching documents are displayed in the results list.
Display 2.40 Match Formatting Pane

Use the components of this pane as explained below:


Table 2-4: Match Formatting Pane Components

Title source

(Default: Text field) Click to select HTML field or None. Use the title
fields in this pane to identify the type of field where the title of the
document is located in the input document. Select None if you do not want to
use the title fields for search. (When you select None, the Title field
disappears.)

Title field

(Default: title) Click to select a different info field in the index.

Use filename when document has no title

(Default: Yes) Click to select No.

Abstract source

(Default: Concordance) Click to select Text field or HTML field. Use the
abstract source fields to locate the summary of the input document. Use the
Concordance selection if you want to enable hit highlighting. Hit
highlighting bolds the matched query term in an input document.

Link source

(Default: Text field) Click to select None, URL, or HTML field. The link
fields specify the type of fields that provide a path to the input text.

Link prefix

Modify the URL at display time for the purposes of passing an argument from
your own CGI script.

Link suffix

Modify the URL at display time for the purposes of passing an argument from
your own CGI script.
Note: For more information about link suffixes and prefixes, see Section
13.3.6 Specify the Formatting for the Matches on page 326.

Add keywords to PDF links

(Default: Yes) Click to select No. Modify the URLs of PDF files to instruct
Adobe Reader to highlight the search terms in an input document. This
operation functions like a concordance, but works for the entire document,
not only the abstract in the results list.

MIME type source

(Default: None) Click to select Text field. This field specifies where the
name for the document format is located.

Date source

(Default: None) Click to select Date or Text field. This field is used to
locate the source of the document date.
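
One way to picture the Link prefix and Link suffix settings is as text placed before and after the stored document URL when a result link is displayed. The sketch below assumes plain concatenation. The guide does not say whether the product encodes the URL or changes it in any other way, and the function name format_link, the host names, and the script path are all hypothetical.

```python
def format_link(url, prefix="", suffix=""):
    """Assumed behavior: wrap the stored URL with the configured prefix and suffix."""
    return f"{prefix}{url}{suffix}"


link = format_link(
    "http://docs.example.com/a.pdf",
    prefix="http://intranet.example.com/cgi-bin/track.cgi?target=",
    suffix="&source=search",
)
```

With these example values, the displayed link sends the request to the hypothetical tracking script and passes the original document address as an argument.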


2.11.4.F The Theme Tabs


Specify the settings for the query web server user interface in the Theme tab
and its related tabs. For more information about formatting the search window,
see www.w3.org.
Display 2.41 Theme Pane

Title

(default is SAS Information Retrieval Studio) enter a new name to
change the name that appears in the search window that end users see
when they query the index.
Font

(default is sans-serif) enter a new font type into this field to change the
display look of the title. For example, enter Times New Roman.


Font size

(default is 10) click the up or down arrow to change the size of the display
letters.

Use pop-up menus

(default setting is Yes) click to select No. Select No to disable the
pop-up menus functionality for older browsers that do not support
JavaScript.
Colors tab

specify the colors for the user interface. (See www.w3.org for more
information.)
Display 2.42 Colors Pane


Use the operations in this pane as follows:


Table 2-5: Colors Pane Components

Header background color

(Default: Custom) Click to select an image. You can also click to access the
Color Box window to change the header color.
Note: For more information about the Color Box window, see Section 2.14.27
The Color Box Window on page 151.

Header text color

(Default: Custom) Click to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory. You can
also click to access the Color Box window to select the text color for
headers.

Link color

(Default: Custom) Click to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory. You can
also click to access the Color Box window and select the color of the links.

Visited Link color

(Default: Custom) Click to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory. You can
also click to access the Color Box window and select the color for your
visited links.

Hover Link color

(Default: Custom) Click to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory. You can
also click to access the Color Box window and select the color of the links
when a user slides the cursor over them.

Menu border color

(Default: WindowFrame) Click to select one of the images that you loaded
into the work/query-web-server subdirectory of your installation directory.
This selection specifies how the color is applied to menus.

Menu unselected background color

(Default: Window) Click to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory. This
selection specifies how the color is applied to menus.

Menu unselected text color

(Default: WindowText) Click to select one of the images that you loaded into
the work/query-web-server subdirectory of your installation directory. This
selection specifies how the color is applied to unselected text in menus.

Menu selected background color

(Default: Highlight) Click to select one of the images that you loaded into
the work/query-web-server subdirectory of your installation directory. This
selection specifies how the color is applied to the background of menus.

Menu selected text color

(Default: HighlightText) Click to select one of the images that you loaded
into the work/query-web-server subdirectory of your installation directory.
This selection specifies how the color is applied to the text in menus.

Reset to Default button

Click to restore the default settings.

Images tab

load the images that you plan to use for the search window into the
work/query-web-server subdirectory of your installation directory.

Left header image

(default is None) click to select one of the images that you
loaded into the work/query-web-server subdirectory of your
installation directory.
Right header image

(default is sas.png) click to select one of the images that
you loaded into the work/query-web-server subdirectory of your
installation directory.


2.11.5 The Log Tab


This pane provides information about the query web server operations. For
more information, see Section 2.3.2 The Log Tab on page 11.

2.12 Viewing the Query Statistics Server Pane


2.12.1 Overview of the Query Statistics Server Pane
The query statistics server enables you to see the input query terms and
information about when these terms were entered.
Display 2.43 Query Statistics Server Pane

2.12.2 The Buttons


The Start and Stop buttons work for the query statistics server like they work
for the crawlers. For more information, see Section 2.4.2 The Buttons on
page 13.

2.12.3 The Status Tab


See whether the query statistics server is running.


Display 2.44 Query Statistics Server Status Pane

2.12.4 The Query Statistics Tab


2.12.4.A The Buttons
Specify the settings for the query statistics server in the Query Statistics tab.
You can see the various query analytics run by SAS Information Retrieval
Studio when you use this pane.
Display 2.45 Query Statistics Pane

Use the components of this pane as follows:


Table 2-6: Query Statistics Pane Components

Today

Click to see the date of the current day in the Year, Month, and Day fields.

This Month

Click to see the current month in the Month field and the current day in the
Day field. The Year field is unavailable.

This Year

Click to see the current year in the Year field. The Day and Month fields
are inaccessible.

All Time

Click to make the Year, Month, and Day fields and the Previous and Next
buttons inaccessible.

Year field

Click to select a year. You can select a year as far back as 1980, or leave
the default selection --.

Month field

Click to select a month.

Day field

Click to select a day.

Previous

Click to enter the preceding date. For example, if you selected 2010, 2009
appears in the Year field.

Next

Click to select the following date. For example, if you selected 2010, 8,
and 20, the next day 21 appears in the Day field.


2.12.4.B The Most Frequent Queries Tab


Use this pane to see the terms that are most often searched during the
selected time period.
Display 2.46 Most Frequent Queries Pane

Query

see a list of the input search terms ranked from the most to the least
frequently submitted.
Number of Occurrences

see the total number of times that each query was submitted.


2.12.4.C The Most Frequent Queries without Matches Tab


Use this pane to see the terms that are most often searched, but not matched
in the input documents, during the selected time period.
Display 2.47 Most Frequent Queries without Matches Pane

Query

see a list of the input search terms that are unmatched in the index. This
list is ordered from the highest to the lowest number of entries.
Number of Occurrences

see the total number of times that each query was submitted.


2.12.4.D The Hourly Query Rate Tab


See the number of queries submitted during each hour of the day for the
selected time period.
Display 2.48 Hourly Query Rate Pane

Hour

see each of the 24 hours.


Number of Queries

see the total number of queries submitted each hour.


2.12.4.E The Daily Query Rate Tab


See the number of queries submitted for each day of the week in this pane.
Display 2.49 Daily Query Rate Pane

Day

see a list of the seven days of the week.


Number of Queries

see the total number of queries submitted each day.


2.12.4.F The Monthly Query Rate Tab


See the number of queries submitted for each month of the year in this pane.
Display 2.50 Monthly Query Rate Pane

Month

see a list of the 12 months of the year.


Number of Queries

see the total number of queries submitted each month.
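
The counts shown in the query statistics tabs can be pictured as simple tallies over a query log. The sketch below is illustrative only: the record layout (timestamp, query text, number of matches) and the function name query_statistics are assumptions, and the guide does not describe how the query statistics server stores or computes its data.

```python
from collections import Counter
from datetime import datetime


def query_statistics(log):
    """Tally a log of (timestamp, query, match_count) records (hypothetical layout)."""
    stats = {
        "most_frequent": Counter(),
        "without_matches": Counter(),
        "hourly": Counter(),    # keyed by hour of day, 0 to 23
        "daily": Counter(),     # keyed by weekday number, Monday is 0
        "monthly": Counter(),   # keyed by month number, 1 to 12
    }
    for timestamp, query, match_count in log:
        stats["most_frequent"][query] += 1
        if match_count == 0:
            stats["without_matches"][query] += 1
        stats["hourly"][timestamp.hour] += 1
        stats["daily"][timestamp.weekday()] += 1
        stats["monthly"][timestamp.month] += 1
    return stats


log = [
    (datetime(2010, 8, 20, 9, 15), "sas", 4),
    (datetime(2010, 8, 20, 9, 40), "sas", 4),
    (datetime(2010, 8, 21, 14, 5), "text minning", 0),
]
stats = query_statistics(log)
top_query = stats["most_frequent"].most_common(1)
```

A frequently submitted query with no matches, such as the misspelled entry in this example log, is the kind of item that the Most Frequent Queries without Matches tab brings to attention.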

2.12.5 The Log Tab


This pane provides information about the query statistics server operations.
For more information, see Section 2.3.2 The Log Tab on page 11.


2.13 The Add Document Processor Windows


2.13.1 Overview of Document Processor Windows
Use the Add Document Processor window to perform a processing operation
on input documents. For example, use these windows to remove mark-up tags,
categorize, extract concepts, and so on.
You can add more than one processor in order to perform several types of
operations on a single input document. The document processors act on the
incoming documents in the order specified.
Note: The order of the document processing operations is important. For this
reason, perform operations such as parse_html and heuristic_parse_html
before operations such as categorizer.
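
The effect of processor order can be illustrated with a minimal Python sketch. The two functions below are hypothetical stand-ins, not the product's parse_html or categorizer processors. They only show that each processor sees whatever the previous processor left in the document.

```python
import re

def strip_tags(doc):
    """Hypothetical stand-in for an HTML-parsing processor: removes mark-up tags."""
    cleaned = dict(doc)
    cleaned["body"] = re.sub(r"<[^>]+>", " ", doc["body"])
    return cleaned

def tag_if_about_table(doc):
    """Hypothetical stand-in for a categorizer: flags documents that contain the word 'table'."""
    tagged = dict(doc)
    words = re.findall(r"[a-z]+", doc["body"].lower())
    tagged["categories"] = "furniture" if "table" in words else ""
    return tagged

def run_pipeline(doc, processors):
    """Apply each processor to the document in the order given."""
    for processor in processors:
        doc = processor(doc)
    return doc

doc = {"body": "<table><tr><td>Quarterly sales figures</td></tr></table>"}

# Tags are removed first, so the categorizer sees only the text.
good = run_pipeline(doc, [strip_tags, tag_if_about_table])
# The categorizer runs first and is misled by the <table> mark-up.
bad = run_pipeline(doc, [tag_if_about_table, strip_tags])
```

In this sketch, the second ordering assigns a category because of the mark-up rather than the text, which is the kind of false match that the recommended order avoids.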

2.13.2 Access Document Processor Window


To access a Document Processor window, complete these steps:
1. Select Pipeline Server --> Configuration --> Document Processors.
2. Click Add in the Document Processors pane for the pipeline server. The
Add Document Processor window appears.


3. Select one of the following processors:


add_field
add a new field with the value that you specify to each input
document. This field has one name and one value that is the same
for every indexed document. For more information, see Section
2.13.3 The Document Processor: add_field Window on page 77.

content_categorization
match categories, concepts, and facts in input documents. For more
information, see Section 2.13.4 The Document Processor:
content_categorization Wizard on page 78. Use this document
processor to create labels to use in facetted search. You can see
these labels in the labels pane in the Query Web Server
Configuration pane.

default_mime_type_from_url
return the document type that is located in the address fields of input
documents. For more information, see Section 2.13.5 The
Document Processor: default_mime_type_from_url Window on
page 95.

default_title_from_url
return the document title from the Web address of any input
documents. For more information, see Section 2.13.6 The
Document Processor: default_title_from_url Window on page 95.

document_converter
deploy SAS Document Conversion to change incoming files, such
as Adobe PDF and Microsoft Office documents, into text. For more
information, see Section 2.13.7 The Document Processor:
document_converter Window on page 96.

export_csv
transfer document text into a comma-separated format that can be
exported into another program such as SAS Text Miner or Excel.
For more information, see Section 2.13.8 The Document Processor:
export_csv Window on page 97.

export_to_files
save each document to a separate file. The name of the file is based
on a hash of its contents. For more information, see Section 2.13.9
The Document Processor: export_to_files Window on page 100.

export_to_odbc
send the documents directly to a database. For more information,
see Section 2.13.10 The Document Processor: export_to_odbc
Window on page 102.

export_to_sas_sentiment_analysis_workbench
send the document to SAS Sentiment Analysis Workbench. For
more information, see Section 2.13.11 The Document Processor:
export_to_sentiment_analysis_workbench Window on page 104.

extract_abstract
generate an abstract for the document based on the first 25 to 50
words in the body of the input text. For more information, see
Section 2.13.12 The Document Processor: extract_abstract
Window on page 106.

extract_pdate
normalize the date to a specific format that is understood by SAS
Search and Indexing. For more information, see Section 2.13.13
The Document Processor: extract_pdate Window on page 107.

heuristic_parse_html
separate the body of an HTML document from its tags. This is a
more advanced version of parse_html. heuristic_parse_html
searches for paragraphs of text without many links and extracts
these bodies of text. For more information, see Section 2.13.14 The
Document Processor: heuristic_parse_html Window on page 108.

invalidate_duplicates_by_url
stop more than one document with the same Web address from
being returned. For more information, see Section 2.13.15 The
Document Processor: invalidate_duplicates_by_url Window on
page 110.

match_and_copy
write the output to a different field than the input field. This
processor is otherwise similar to the substitute document processor.
For more information, see Section 2.13.16 The Document Processor:
match_and_copy Window on page 110. Also see Section 2.13.22 The
Document Processor: substitute Window on page 116.

modify_field_name
change the name of a field in an input document. For more
information, see Section 2.13.17 The Document Processor:
modify_field_name Window on page 112.

parse_html
separate the text from the HTML mark-up tags. For more
information, see Section 2.13.18 The Document Processor:
parse_html Window on page 112 and strip_html below.
Use this operation when you have an HTML document and you want
to extract the body of this document, and possibly the metadata.

parse_xml
separate the text from the XML mark-up tags. For more
information, see Section 2.13.19 The Document Processor:
parse_xml Window on page 114.

send
save each input document to each pipeline server. For more
information, see Section 2.13.20 The Document Processor: send
Window on page 115.

strip_html
return only the text without the HTML mark-up tags. For more
information, see Section 2.13.21 The Document Processor:
strip_html Window on page 116 and parse_html above.
Use strip_html when you have a field that contains some HTML
code that you want to convert into plain text. For example, use it
when input XML documents contain HTML code.

substitute
exchange all occurrences of one term, tag, or other attribute for
another. For more information, see Section 2.13.22 The Document
Processor: substitute Window on page 116.

4. Click OK and the appropriate Document Processor window appears. For
more information, see Section 2.13 The Add Document Processor
Windows on page 73. The selected operation appears in the Document
Processors pane.

2.13.3 The Document Processor: add_field Window


Use the Document Processor: add_field window to add a field with a constant
string to each document passed to the index. For example, use this operation to
tag each indexed document with an identifier for the collection of documents
that it belongs to.
To use the Document Processor: add_field window, complete these steps:
1. Select add_field in the Add Document Processor window and click
Next. The first Document Processor: add_field window appears.
2. Enter a field name into field. For example, enter
corporate_documents.
Note: Field names can be entered only in lowercase letters.
3. Enter a value for the field into value. For example, enter MyCompanyName.
4. Click Finish.
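
As a rough illustration of what this processor does to each document, the following Python sketch adds one constant field to every document in a list. The function name and the dictionary representation of a document are assumptions made for this example. They are not the product's internal API.

```python
def add_field(doc, field, value):
    """Return a copy of the document with one constant field added."""
    if field != field.lower():
        # Mirrors the note that field names are entered in lowercase letters.
        raise ValueError("field names must be lowercase")
    updated = dict(doc)
    updated[field] = value
    return updated

docs = [{"id": "a.html", "body": "first"}, {"id": "b.html", "body": "second"}]
tagged = [add_field(d, "corporate_documents", "MyCompanyName") for d in docs]
```

Every document in tagged now carries the same corporate_documents value, which a later query could use to restrict results to this collection.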


2.13.4 The Document Processor: content_categorization Wizard

2.13.4.A Overview of the content_categorization Document Processor
Use the Document Processor: content_categorization window to specify the
categories, concepts, and facts that can be matched in indexed documents. You
can also specify the labels that are used for facetted search using this wizard.
These processors automatically populate the index and the query web server
components with index fields and labels for facetted search. You can specify
these labels or use the default settings.

2.13.4.B Configure SAS Content Categorization Server


The categories, concepts, and facts are applied by SAS Content Categorization
Server. For this reason, you configure the server to work with the
content_categorization Document Processor.
To configure SAS Content Categorization Server, complete these steps:
1. Select content_categorization in the Add Document Processor
window and click Next. The Document Processor:
content_categorization window appears.
2. (Optional) By default, the name of the server where SAS Content
Categorization Server is running is specified in the Hostname field.
For example, see localhost. You can enter a different server name if
SAS Content Categorization Server is running on another server.
3. (Optional) By default, the port number for the specified server is entered
in Port. For example, see 6500. Use the controls in this field to select a
different port number.
4. (Optional) By default, 10 is entered into Timeout. Use the controls in
this field to select a different number. This is the number of seconds that
the Pipeline Server waits before it stops the download process.
5. Click Next to save these settings.


2.13.4.C Specify the Projects


SAS Content Categorization Server applies categories, concepts, and facts to
documents in the SAS Information Retrieval Studio Pipeline Server.
To specify the projects, complete these steps:
1. The Document Processor: content_categorization window appears. This
window lists the projects and their types. The categories, concepts, and
facts that are applied by the pipeline server are limited to those in the
projects that you specify.
2. Click Add. The Document Processor: content_categorization window
appears.
3. (Optional) By default, Categorization is selected in Type. This is
true if categories are part of the taxonomy in one of the SAS Content
Categorization Studio projects that you uploaded to SAS Content
Categorization Server. Alternatively, Concept extraction or
Contextual extraction is selected. Use the control in this field to make
a different selection.
4. (Optional) By default, a project is selected. For example, see
Lifestyle. Use the control in this field to select a different project. For
example, select Sports. This selection limits the available categories,
concepts, and facts to those in the project.
5. Click Ok and the project appears in the Document Processor:
content_categorization window.
6. (Optional) Repeat Step 2. through Step 5. above until you have added all
of your projects.
7. Click Next to save your changes.


2.13.4.D Specify Input


SAS Content Categorization Server applies the categories, concepts, and facts
to input fields. These matches are labelled or exported as output.
To specify the input and output processing, complete these steps:
1. After you complete Step 7. on page 81, the Document Processor:
content_categorization wizard appears.
2. (Optional) By default, Input Fields is blank. Enter any field names that
you want to search for matches for your categories, concepts, and facts.
If you leave this field blank, all of the fields are searched with the
exception of any fields entered into Input fields to exclude.
3. (Optional) By default, Input Fields to exclude contains metadata
fields. Enter any field names that you want to exclude from the search
for your categories, concepts, and facts.
Hint: To ensure that excess time stamped fields are not sent to SAS Content
Categorization Server, leave the ctime, atime, and mtime fields in this
list. These field names represent created, accessed, and modified dates
for a file. For more information, see Section 2.8.5 The Document
Inspector Tab on page 42.
4. (Optional) Click Finish to save your changes.


2.13.4.E Specify Categories


Select all of the categories in the SAS Content Categorization Studio projects
that you uploaded to SAS Content Categorization Server. These category rules
are applied by SAS Content Categorization Server to the documents in the
Pipeline Server running in SAS Information Retrieval Studio.
Note: When you select categories, you select all of the categories in all of
the projects. However, when you select concepts and facts, you can choose
all or some of the concepts and facts that exist in the selected projects.

To specify that all of the categories in the uploaded projects can be used by
SAS Content Categorization Server, complete these steps:
1. Click Categories to access the Categories pane.
2. (Optional) By default, categories is entered into the Field name field.
You can enter a new field name.
3. (Optional) By default, Categories is entered into Caption. You can
enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246 and Chapter 10: Creating Facetted Search Labels
Using content_categorization.
4. (Optional) By default, %c is entered into Format for the category name.
You can enter a new format that might include %% for a literal percent
sign. You can also use x as a modifier to request XML escaping. For
example, enter %xc.
5. (Optional) Enter a regular expression into the Category name pattern
field.
6. (Optional) Enter a string into the Category name replacement field.
This string is a constant value that replaces all of the category names.
7. (Optional) By default, ; (semicolon) appears in Separator. Enter a new
separator such as a comma (,).
8. (Optional) By default, the highest number of categories that can be
matched in any single input document is 15. Use the controls in Max
categories to change this default selection.
9. Click Finish.
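
To make the Format, Separator, and Max categories settings more concrete, here is a minimal Python sketch of how a list of matched category names might be turned into one field value. It is an illustration under assumptions, not the product's implementation: the function names are hypothetical, only %c, %xc, and %% are handled, and the way the product chooses which categories to keep when the maximum is exceeded is not modeled.

```python
import re
from xml.sax.saxutils import escape

def format_category(fmt, name):
    """Expand %c (category name), %xc (XML-escaped name), and %% (literal percent)."""
    def expand(match):
        modifier, symbol = match.group(1), match.group(2)
        if symbol == "%":
            return "%"
        return escape(name) if modifier == "x" else name
    return re.sub(r"%(x?)([c%])", expand, fmt)

def categories_field(names, fmt="%c", separator=";", max_categories=15):
    """Join at most max_categories formatted names with the separator."""
    return separator.join(format_category(fmt, n) for n in names[:max_categories])

matched = ["Sports", "Food & Drink"]
plain = categories_field(matched)
escaped = categories_field(matched, fmt="%xc", separator=",")
```

With the default settings the names are joined by semicolons. With %xc, characters such as the ampersand are written as XML entities, which matters when the field value is later embedded in XML output.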

2.13.4.F Specify Concepts


Select the concepts in the SAS Content Categorization Studio projects that you
uploaded to SAS Content Categorization Server. These concept definitions are
applied by SAS Content Categorization Server to the documents in the
Pipeline Server running in SAS Information Retrieval Studio.
To specify some of the concepts, complete the following steps. (If you want to
specify all of the concepts for all of the selected projects, see Step 1. and then
go to Step 10.)
1. Click Concepts to access the Concepts pane. Use this pane to add all of
the concepts and contextual extraction concepts. The concepts that are
specified by PREDICATE or SEQUENCE definitions are added in the Facts
pane.
2. Click Add. The Document Processor: content_categorization window
appears. Use this pane to specify the settings for each individual
concept. These settings override the settings specified for all of the
concepts in the Concepts tab.
3. Select a concept in the Concept field. For example, select SPORTS from
the drop-down menu.

4. (Optional) By default, sports is entered into Field name. You can
enter a new field name.
5. (Optional) By default, Sports is entered into the Caption field. You can
enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246.
6. (Optional) By default, % is entered into the Format field for the concept
name. You can also use any of the following symbols:

Table 2-7: Default Format Symbols

Symbol   Description
%c       Match the concept name.
%p       Add to %c to include the path with the concept name.
%m       Match the text.
%i       Match the information associated with the entity, or the match text
         if no information is available.
%I       Match the information associated with the entity unconditionally.
%%       Match the literal percent sign.
x        Use as a modifier, such as in %xc, to request XML escaping.

If you want to output nested XML tags, specify the format for these
tags such as <body>%xi</body>. For more information, see Section
2.13.9 The Document Processor: export_to_files Window on
page 100.
7. (Optional) By default, ; (semicolon) appears in the Default separator
field. Enter a new separator such as a comma (,).
8. Click Ok. If you want to make changes to your edits, click Copy
Defaults.


9. (Optional) Repeat Step 2. on page 85 through Step 8. on page 86 until
you have added all of the concepts that you want to deploy in SAS
Information Retrieval Studio.
10. (Optional) By default, concepts is entered into Default field name.
You can enter a new field name.
11. (Optional) By default, Concepts is entered into Default caption. You
can enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246 and Chapter 10.
12. (Optional) By default, %c: %i is entered into the Default format field
for the concept name. You can also use any of the following symbols:

Table 2-8: Default Format Symbols

Symbol   Description
%c       Match the concept name.
%p       Exclude the path.
%m       Match the text.
%i       Match the information associated with the entity, or the match text
         if no information is available.
%%       Match the literal percent sign.
x        Use as a modifier, such as in %xc, to request XML escaping.

13. (Optional) By default, ; (semicolon) appears in the Default separator
field. Enter a new separator such as a comma (,).
14. (Optional) By default, the highest number of concepts that can be
matched in any single input document is 15. Use the controls in the Max
concepts field to change this default selection.
15. Click Finish.
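
The format symbols for concepts can be illustrated with a small Python sketch. It is a hedged approximation of the expansion described in Table 2-7, not the product's code: the function name is hypothetical, %p is left out because its behavior is not fully specified here, and returning an empty string for %I when no information exists is an assumption.

```python
import re
from xml.sax.saxutils import escape

def format_concept(fmt, concept, match_text, info=None):
    """Expand %c, %m, %i, %I, %% and the x modifier for one concept match."""
    values = {
        "c": concept,
        "m": match_text,
        "i": info if info else match_text,  # falls back to the matched text
        "I": info or "",                    # information only (empty when absent: an assumption)
    }
    def expand(match):
        modifier, symbol = match.group(1), match.group(2)
        if symbol == "%":
            return "%"
        value = values[symbol]
        return escape(value) if modifier == "x" else value
    return re.sub(r"%(x?)([cmiI%])", expand, fmt)

default = format_concept("%c: %i", "SPORTS", "Red Sox", info="Boston Red Sox")
fallback = format_concept("%c: %i", "SPORTS", "Red Sox")
nested = format_concept("<body>%xi</body>", "ORG", "AT&T")
```

The last call shows why the x modifier is useful for nested XML tags: the literal tags in the format are written as they are, while the inserted value is escaped so that the result stays well-formed.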

2.13.4.G Specify Facts


Select the facts in the SAS Content Categorization Studio projects that you
uploaded to SAS Content Categorization Server. The arguments in the
PREDICATE and SEQUENCE rules of the fact definitions are applied by SAS
Content Categorization Server to input documents.
To specify some of the facts, complete the following steps. (If you want to
specify all of the facts for all of the selected projects, see Step 1. and then
go to Step 13. on page 93. In other words, when you leave facts in Default
field name, all of the facts in the project are selected by default.)
1. Click Facts to access the Facts pane. (Facts are the contextual
extraction concepts that are defined by at least one PREDICATE or
SEQUENCE rule. Unlike other concept rules, PREDICATE or SEQUENCE
rules have arguments.)
2. Click Add to specify a fact with its field name and caption for facetted
search.
3. Select a fact in the Fact field. Facts are contextual extraction concepts
that contain at least one PREDICATE or SEQUENCE rule. For example,
select SIDE_EFFECT from the drop-down menu.

4. (Optional) When you select a fact using Step 3. above, the Field name
is automatically entered. For example, see sideeffect. Enter a new
name if you choose.
5. (Optional) When you select a fact, the Caption field is automatically
filled in. For example, see Side Effect. Enter a new name if you
choose.
6. (Optional) By default, the format for the matched fact is entered into the
Format field. For example, see the following format:

SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})

In this example, drug and sideeffect are the returned arguments for
the fact SIDE_EFFECT if these arguments are matched. The match
strings for the arguments are %v{drug} and %v{sideeffect}. If there
is more than one PREDICATE or SEQUENCE rule in the definition with
these same arguments, this line is specified for all rules. If there are
any other types of rules in the definition, this fact also appears as a
concept when you select the Concepts tab.
7. (Optional) By default, %n: %v is entered into the Argument format
field. You can also use any of the following symbols:

Table 2-9: Default Argument Format Symbols

Symbol     Description
%f         Match the fact name.
%a         Match a formatted list of arguments.
           Note: If you do not specify the argument symbol, the
           Argument format field, even when specified, does not apply.
%v{name}   Output the value for a specific argument.
%m         Match the text.
%s         Return the concordance list.
           Note: If you do not specify the concordance, the concordance
           is not returned. This is true even when you specify the
           Concordance type and Surrounding words in the Facts pane.
%%         Match the literal percent sign.
x          Use as a modifier, such as in %xf, to request XML escaping.

Argument List:
%n         Match the argument name.
x          Use as a modifier, such as in %xf, to request XML escaping.

8. (Optional) By default, ; (semicolon) appears in the Default separator
field. Enter a new separator such as a comma (,).
9. (Optional) By default, the highest number of concepts that can be
matched in any single input document is 15. Use the controls in the Max
concepts field to change this default selection.
10. (Optional) Click Copy Defaults to insert the entries from the main
Facts tab into all of the fields with the exception of the Fact field.
11. Click Ok.


12. (Optional) Repeat Step 2. on page 90 through Step 11. on page 92
until you have added all of the facts that you want to apply in SAS
Information Retrieval Studio.
13. (Optional) By default, facts is entered into the Default field name
field. You can enter a new field name.
14. (Optional) By default, Facts is entered into the Default caption field.
You can enter a new caption name to change the label for facetted
search. For more information about facetted search, see Chapter 10.
15. (Optional) By default, %f(%a) is entered into the Default format field
for the concept name. You can edit this entry using any of the symbols
in Table 2-9 on page 91 with the exception of %v{name}.
Note: Unless you specify %a, no arguments are called. This is true even if
you make entries in the Default argument format field.

16. (Optional) By default, %n: %v is entered into the Default argument
format field for the concept name. You can edit this entry using the %n,
%v, %%, and the x modifier symbols in Table 2-9 on page 91.
17. (Optional) By default, , (comma) is entered in the Default separator
field. You can enter a new separator such as a semicolon (;).
18. (Optional) By default, Surrounding words is selected in Concordance
type. Use the control in this field to select Full sentence.
Notes: Concordance refers to the surrounding text that is returned with the
match.
When you select Full sentence, the Surrounding words field disappears.
If you do not specify the concordance using %s, the concordance is not
returned. This is true even when you specify the Concordance type and
Surrounding words in the Facts pane.
19. (Optional) By default, 10 is selected in the Surrounding words field.
Use the controls in this field to change this default selection.
20. (Optional) By default, the highest number of facts that can be matched
in any single input document is 15. Use the controls in the Max facts
field to change this default selection.
21. Click Finish to save your selections.
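
The interaction between the format, the argument format, and the separator can be sketched in Python as follows. This is an illustrative approximation only: the function name is hypothetical, and the concordance symbol (%s), the match text symbol (%m), and the x modifier are left out for brevity.

```python
import re

def format_fact(fact, args, fmt="%f(%a)", arg_fmt="%n: %v", separator=","):
    """Expand %f, %a, %v{name}, and %% for one matched fact."""
    # %a is the list of arguments, each rendered with the argument format.
    arg_list = separator.join(
        arg_fmt.replace("%n", name).replace("%v", value)
        for name, value in args.items()
    )
    def expand(match):
        token = match.group(0)
        if token == "%%":
            return "%"
        if token == "%f":
            return fact
        if token == "%a":
            return arg_list
        return args.get(match.group(1), "")  # %v{name}
    return re.sub(r"%%|%f|%a|%v\{(\w+)\}", expand, fmt)

args = {"drug": "aspirin", "sideeffect": "nausea"}
default = format_fact("SIDE_EFFECT", args)
explicit = format_fact(
    "SIDE_EFFECT", args,
    fmt="SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})",
)
no_args = format_fact("SIDE_EFFECT", args, fmt="%f")
```

The last call reflects the note above: when the format does not contain %a, the argument format has no effect on the output.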


2.13.5 The Document Processor: default_mime_type_from_url Window

Use the Document Processor: default_mime_type_from_url window to return
the document type from the address fields of any input documents. To perform
this operation, SAS Information Retrieval Studio looks for the filename
extension in the URL of an input document.
To use the Document Processor: default_mime_type_from_url window,
complete these steps:
1. Select default_mime_type_from_url in the Add Document Processor
window and click Next. The Document Processor:
default_mime_type_from_url window appears.
2. Leave the default specification, mimetype, or enter a new field name
into mime-type-field.
3. Leave the default specification, id, or enter a new field name into
url-field.
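
The idea of deriving a document type from the filename extension in a URL can be sketched with Python's standard mimetypes module. This is not the product's implementation. The function and parameter names are hypothetical, and filling the field only when it is empty is an assumption based on the word "default" in the processor name.

```python
import mimetypes
from urllib.parse import urlparse

def default_mime_type_from_url(doc, mime_type_field="mimetype", url_field="id"):
    """Fill the MIME type field from the URL's filename extension when it is missing."""
    updated = dict(doc)
    if not updated.get(mime_type_field):
        # Use only the path so that query strings do not hide the extension.
        path = urlparse(updated.get(url_field, "")).path
        guessed, _encoding = mimetypes.guess_type(path)
        if guessed:
            updated[mime_type_field] = guessed
    return updated

pdf_doc = default_mime_type_from_url({"id": "http://example.com/docs/report.pdf?v=2"})
kept = default_mime_type_from_url({"id": "http://example.com/a.pdf", "mimetype": "text/plain"})
```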

2.13.6 The Document Processor: default_title_from_url Window

Use the Document Processor: default_title_from_url window to return the
document title using the Web address of the input documents.
To use the Document Processor: default_title_from_url window, complete
these steps:


1. Select default_title_from_url in the Add Document Processor
window and click Next. The Document Processor:
default_title_from_url window appears.
2. Leave the default specification, title, or enter a new field that specifies
what field is searched to locate the title of the input document. If the
document has no value for the field, the value of the URL field is used.
3. Leave the default specification, id, or enter the field where the
document title can be located into url-field.
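
The fallback described in Step 2 can be expressed in a few lines of Python. The sketch below is only an illustration: the function name and the title_field parameter name are hypothetical, and documents are modeled as plain dictionaries.

```python
def default_title_from_url(doc, title_field="title", url_field="id"):
    """Use the URL field as the title when the document has no title value."""
    updated = dict(doc)
    if not updated.get(title_field) and updated.get(url_field):
        updated[title_field] = updated[url_field]
    return updated

untitled = default_title_from_url({"id": "http://example.com/news/story.html"})
titled = default_title_from_url({"id": "http://example.com/a.html", "title": "Annual Report"})
```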

2.13.7 The Document Processor: document_converter Window

Use the Document Processor: document_converter window to extract plain
text from other types of file formats, such as Adobe PDF and Microsoft Word
documents. This document processor uses SAS Document Conversion.
To use the Document Processor: document_converter window, complete these
steps:


1. Select document_converter in the Add Document Processor window
and click Next. The Document Processor: document_converter
window appears.
2. Replace the default specification, localhost:54321, with a string
naming the server and its port in the server field. (The server name and
port number are separated by a colon [:].)
3. Leave the default specification, mimetype, or enter a new field where
the document type is found into mime-type-field. This field specifies
that non-ASCII text can be formatted into text.
4. Leave the default entry, id, or specify a different field where the
identification information for the file can be located into
filename-field.
5. Leave the default specification, raw, in the input-field and the
document processor gets the content in the body and title fields.
6. Leave the default specification, body, or enter a new location for the
output information into output-field. The output field is where the
plain text version of the document is stored.

2.13.8 The Document Processor: export_csv Window


Use the Document Processor: export_csv window to save documents to a .csv
file under the column headings that you specify with fields. (CSV stands for
comma-separated values.) You can also specify categories, concepts, and facts
instead of fields when new files are created. Export these files with escaped or
nonescaped characters, to be used in SAS Text Miner, Base SAS, Microsoft
Excel, and so on.
To use the Document Processor: export_csv window, complete these steps:
1. Select export_csv in the Add Document Processor window and click
Next. The Document Processor: export_csv window appears.
2. Rename the default comma-separated file, articles.csv, or leave this
entry in the filename field. If you add %s to the entry in this field, the
file is timestamped.
3. Leave the default entry 1 in the append field if you want to append to an
existing file with the specified name. This operation takes place when
the pipeline is restarted. Enter 0 to overwrite an existing file when the
pipeline is restarted.


4. Leave the default entry 0 in the new-file-after-n-rows field, or enter
1 if you appended %s to the entry in the filename field.
Notes: This field, like the following four fields, controls when a file is
closed and another file is accessed.
The operations in Step 4. through Step 7. are not mutually exclusive. For
this reason, a new file is started when any of the enabled conditions is
true.
5. Leave the default entry 0 in the new-file-after-n-idle-seconds field,
or enter 1 if you appended %s to the entry in the filename field. A new
file is created after the pipeline server is idle for the specified number of
seconds.
6. Leave the default entry 0 in the new-file-after-n-seconds field, or
enter 1 if you appended %s to the entry in the filename field. A new file
is created after the specified number of seconds.
7. Leave the default entry 0 in the new-file-each-hour field. Enter 1 if
you appended %s to the entry in the filename field and want a new file
to be created every hour.
8. Leave the default entry 0 in the new-file-each-day field. Enter 1 if
you appended %s to the entry in the filename field and want a new file
to be created every day.
9. Leave the default entries id, title, and body in the columns field. These
entries form a comma-separated list of field names that corresponds to
the columns in the CSV file. Alternatively, edit this list or enter a new
list of field names.
If you plan to use categorization, specify categories. If you want to
perform concept extraction, list the concepts in this field. This
includes the contextual extraction concepts and facts.
10. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.


11. Leave the default specification 0 in the cleanup-white-space field if
you want to remove any new lines in the document. This operation
makes it easier to parse the document text. Alternatively, enter 1 to keep
these lines in the document as it is parsed.
12. Leave the default comma (,) that is entered into the delimiter field.
You can enter another character that is used to delimit the fields in the
output file.
13. Leave the default 1 setting in the excel-quoting field. Set this number
to 0 for nonescaped output.
14. Leave the default utf-8 encoding specification for input files in the
encoding field.
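
The following Python sketch shows, under stated assumptions, what an export with the default columns and delimiter might produce. It uses Python's csv module, whose default dialect follows Excel quoting conventions. The function names, the header row, and the time stamp format substituted for %s are assumptions for illustration and are not taken from the product.

```python
import csv
import io
import time

def timestamped(filename, now=None):
    """Replace %s in the filename with a time stamp (the stamp format is an assumption)."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return filename.replace("%s", stamp)

def export_csv(docs, out, columns=("id", "title", "body"), delimiter=","):
    """Write a header row, then one row per document, using the listed fields as columns."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(columns)
    for doc in docs:
        # Fields that a document lacks are written as empty cells.
        writer.writerow([doc.get(name, "") for name in columns])

docs = [{"id": "a.html", "title": "Hello, world", "body": "First document"}]
buffer = io.StringIO()
export_csv(docs, buffer)
csv_text = buffer.getvalue()
name = timestamped("articles-%s.csv")
```

Because the title contains the delimiter, the writer places it in double quotation marks, so that a program such as Excel reads it as a single cell.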

2.13.9 The Document Processor: export_to_files Window

Use the Document Processor: export_to_files window to save each document
to a separate file.
To use the Document Processor: export_to_files window, complete these
steps:


1. Select export_to_files in the Add Document Processor window and
click Next. The Document Processor: export_to_files window appears.
2. Leave the default selection, work/export-to-files, or change this
folder in the directory field. SAS Information Retrieval Studio sends
each document to a separate file in the specified directory, based on a
hash of its contents.
3. Leave the default specification, xml, or enter text into the format field.
4. Enter a comma-separated list of field names to include in the output file.
If you leave fields blank, all of the document fields appear in the output
file.
5. Leave the default selections raw and mimetype in fields-to-exclude.
You can also specify different field names. The text in these fields does
not appear in the output file.
6. (Optional) When you want to output nested XML tags, enter the name
of the field whose value contains the XML syntax into xmlpreescaped-fields.
This field name is listed in the Field name of the Document Processor:
content_categorization window. For example, organization. For more
information, see Section 2.13.4.F Specify Concepts on page 84. This
field name is commonly used when XML escaping is requested in the
Format field of the Document Processor: content_categorization
window. See the following example:
7. Leave the default specification, article, if you specified XML for the
output format. If you are using text as the output format, enter a
different document tag type into the document-tag field.
8. Leave the default utf-8 encoding specification for input files in the
encoding field.
9. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
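
The behavior described in the steps above can be sketched in Python. This is only an illustration, not the product's implementation: the function name export_to_files, the use of SHA-1, and the choice to name each file after the hash are all assumptions. The guide says only that each document goes to a separate file based on a hash of its contents.

```python
import hashlib
from pathlib import Path
from xml.sax.saxutils import escape


def export_to_files(doc, directory, fields=None,
                    exclude=("raw", "mimetype"), document_tag="article"):
    """Write one document (a dict of field name -> text) to its own XML file."""
    # An empty field list means "all fields", minus the excluded ones.
    names = [n for n in (fields or doc) if n in doc and n not in exclude]
    body = "".join(f"<{n}>{escape(str(doc[n]))}</{n}>" for n in names)
    xml = f"<{document_tag}>{body}</{document_tag}>"
    # Assumption: the file name is the SHA-1 digest of the serialized document.
    digest = hashlib.sha1(xml.encode("utf-8")).hexdigest()
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{digest}.xml"
    path.write_text(xml, encoding="utf-8")
    return path
```

Because the name is derived from the content, writing the same document twice produces one file rather than two.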

2.13.10 The Document Processor: export_to_odbc Window
Use the Document Processor: export_to_odbc window to send the documents
directly to a database.
To use the Document Processor: export_to_odbc window, complete these
steps:


1. Select export_to_odbc in the Add Document Processor window and
click Next. The Document Processor: export_to_odbc window appears.
2. (Optional) Enter the name of the ODBC driver into the connectionstring
field.
Note: Consult your database documentation for details before you use
this step and Step 6 below.
3. Enter the name of the database table into the table field.
4. Use the table-init field to specify the operation that is performed if the
database table already exists. If drop is specified, the existing table is
removed, allowing a new table to be created. The columns in the new
table replace those in the old. If set to truncate, the existing table is
preserved, but all of the rows in it are deleted.
5. Leave the default entries for the columns in the database table, or
specify new fields in the columns field.
6. Leave the default entry id in the merge-column field if you want to
add new rows with a merge operation. If you use the merge operation,
specify the name of a column. (Also see the note above.)
7. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
8. Leave the default setting of 1024 in the max-length field. You can also
change the highest number of characters permitted in the value specified
for a single database column.
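
The difference between the drop and truncate settings of table-init can be illustrated with a short Python sketch. It uses the standard sqlite3 module as a stand-in for an ODBC connection, and the function name init_table is hypothetical. The SQL that the product actually issues is not documented here.

```python
import sqlite3


def init_table(conn, table, columns, table_init):
    """Prepare the target table according to a table-init style setting."""
    cols = ", ".join(f'"{c}" TEXT' for c in columns)
    if table_init == "drop":
        # drop: remove the old table so the new column list takes effect.
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    if table_init == "truncate":
        # truncate: keep the existing table and its columns, delete all rows.
        conn.execute(f'DELETE FROM "{table}"')
```

With truncate, a changed column list has no effect because the existing table definition is preserved.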

2.13.11 The Document Processor: export_to_sentiment_analysis_workbench Window
Use the Document Processor: export_to_sentiment_analysis_workbench
window to send the documents directly to SAS Sentiment Analysis Workbench.
To use the Document Processor: export_to_sentiment_analysis_workbench
window, complete these steps:


1. Select export_to_sentiment_analysis_workbench in the Add
Document Processor window and click Next. The Document Processor:
export_to_sentiment_analysis_workbench window appears.
2. Leave the default setting, localhost, or enter a new server name into
the hostname field.
3. Leave the default setting, 4000, or enter a new number into the port
field.
4. Enter the name of the SAS Sentiment Analysis Workbench project into
project-name.
5. Enter the name of the output folder into document-set-name.
6. Leave the default selection, docid, or delete this entry in docid-field. If
this field is empty, a unique identifier is automatically generated for
each input document.
7. Leave the default selection, link, or delete this entry in link-field. If
this field is empty, a unique link to each document is automatically
generated.
8. Leave the createtime entry in createtime-field to specify the time
that the document was created. You can also specify another field for
this entry. In either case, the format of the contents of this field is
specified in the createtime-format field below.
9. Specify the format of the matching createtime data in the
createtime-format field. For example, %m/%d/%Y %I:%M:%S %p for SAS
Sentiment Analysis Workbench, or %Y%m%d for SAS Search and Indexing.
10. Leave the default specification, title, or enter the field where the name
of the document is located into title-field.
11. Leave the default specification, author, or enter the field where the
name of the person who wrote the document is located into author-field.
12. Leave the default specification, geolocation, or enter the field that
specifies where the location is found into geolocation-field.
13. Leave the default specification, source, or enter the field that specifies
where the document originates into source-field.
14. Leave the default specification, body, or enter the field where the text of
the document is located into body-field.
15. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
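
The two format strings in Step 9 use strptime-style directives. As a quick way to check a format string against your own data, the following Python sketch parses a sample timestamp with the first format and renders it with the second. The sample value is made up for illustration, and Python's datetime module is used only because it accepts the same directives.

```python
from datetime import datetime

# Format strings quoted in Step 9.
workbench_format = "%m/%d/%Y %I:%M:%S %p"
search_format = "%Y%m%d"

# Hypothetical createtime value from an input document.
created = datetime.strptime("03/15/2012 02:30:00 PM", workbench_format)

# The same moment rendered in the compact date format.
compact = created.strftime(search_format)
```

If the format string does not match the data, strptime raises an error, which makes a mismatch easy to spot before you run the pipeline.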

2.13.12 The Document Processor: extract_abstract Window
Use the Document Processor: extract_abstract window to generate the text in
the <abstract> tag. This document processor takes approximately the first 25 to
50 words from an existing field in a document, such as the body field, to
generate the abstract. (This location typically contains summary information for
a technical or scientific document.)
The abstract functions like the concordance if the document is sent to the
search engine. However, the abstract is static and therefore independent of any
query, whereas the concordance is query-specific. For this reason, the
concordance is available only when a search operation is performed.
To use the Document Processor: extract_abstract window, complete these
steps:
1. Select extract_abstract in the Add Document Processor window and
click Next. The Document Processor: extract_abstract window appears.
2. Leave the default specification, body, or enter a new source for the
<body> tag into source-field.
3. Leave the default specification, abstract, or enter the name of the
format tag where the document summary can be located into
abstract-field.
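
As a rough illustration of what this processor does, the following Python sketch copies the leading words of one field into another. The function name and the fixed limit of 50 words are assumptions. The guide states only that approximately the first 25 to 50 words are used.

```python
def extract_abstract(doc, source_field="body", abstract_field="abstract",
                     max_words=50):
    """Store the first max_words words of source_field in abstract_field."""
    words = doc.get(source_field, "").split()
    doc[abstract_field] = " ".join(words[:max_words])
    return doc
```

The result is computed once per document, which is why the abstract, unlike the concordance, does not change with the query.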

2.13.13 The Document Processor: extract_pdate Window
Use the Document Processor: extract_pdate window to convert the date value
in an input document into the pdate format. The pdate format is understood
by the search operation.
To use the Document Processor: extract_pdate window, complete these steps:


1. Select extract_pdate in the Add Document Processor window and
click Next. The Document Processor: extract_pdate window appears.
2. Leave the default specification, date, or enter a new source for the
document date into date-field. This date is converted into the pdate for
the search operation.
3. (Optional) Enter the strptime format of the date field in the input
document into date-format. If this field is left empty, the RFC822
format (Internet text message) is used.
4. Leave the default specification, pdate, or define a new field where the
pdate is stored in new-field.
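
The fallback rule in Step 3 can be sketched in Python as follows. The function name parse_document_date is hypothetical, and the sketch stops at a parsed date value. The encoding of the pdate format itself is not described in this section, so it is not shown.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_document_date(value, date_format=""):
    """Parse a date with a strptime format, or as RFC 822 when none is given."""
    if date_format:
        return datetime.strptime(value, date_format)
    # Empty format: treat the value as an Internet message date.
    return parsedate_to_datetime(value)
```

For example, a value such as Tue, 15 Nov 1994 08:12:31 +0000 needs no format string, whereas 2012-03-15 requires %Y-%m-%d.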

2.13.14 The Document Processor: heuristic_parse_html Window
Use the Document Processor: heuristic_parse_html window to separate the
body of an HTML document from its tags. This operation skips sections of the
document that it determines to be navigation sections.


To use the Document Processor: heuristic_parse_html window, complete these
steps:
1. Select heuristic_parse_html in the Add Document Processor window
and click Next. The Document Processor: heuristic_parse_html
window appears.
2. Leave the default specification, raw, or enter a new body field into
input-field. The raw specification returns the text in the title and
body fields.
3. Leave the default specification, title, or enter a new title field into
title-output-field. The plain text of the title is output to this field.
4. Leave the default specification, body, or enter a new body field into
body-output-field. The plain text of the body field is output to this
field.
5. The entry 1 in require-mime-type specifies that matching documents
have a mimetype of HTML. If you enter 0, this field is not required.
6. Leave the mimetype entry in mime-type-field, or specify a different
field.
7. The entry 1 in the base64-input field specifies that the text is encoded
in the mime content transfer encoding. If you enter 0, this encoding is
not used.


2.13.15 The Document Processor: invalidate_duplicates_by_url Window
Use the Document Processor: invalidate_duplicates_by_url window to run a
checksum operation that eliminates the accidental error of storing duplicate
documents with the same Web address. You can also specify where to store
these checksum URLs that can be tracked even if a restart operation is
performed on the pipeline server.
To use the Document Processor: invalidate_duplicates_by_url window, complete
these steps:
1. Select invalidate_duplicates_by_url in the Add Document
Processor window and click Next. The Document Processor:
invalidate_duplicates_by_url window appears.
2. Leave the default specification, id, or enter a new field where the URL
is stored into the url_field.
3. (Optional) Enter the name of the checksum file into the checksumfile
field. When you enter this name, duplicate URLs continue to be
eliminated even after the pipeline is restarted.
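
The effect of the optional checksum file can be illustrated with a small Python sketch. The class name, the SHA-256 digest, and the one-digest-per-line file layout are all assumptions made for the example. The product's own checksum algorithm and file format are not documented in this section.

```python
import hashlib
from pathlib import Path


class UrlDeduplicator:
    """Remember URL checksums, optionally persisting them across restarts."""

    def __init__(self, checksum_file=None):
        self.path = Path(checksum_file) if checksum_file else None
        self.seen = set()
        if self.path and self.path.exists():
            # Reload checksums recorded before the last restart.
            self.seen.update(self.path.read_text(encoding="utf-8").split())

    def is_duplicate(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return True
        self.seen.add(digest)
        if self.path:
            with self.path.open("a", encoding="utf-8") as handle:
                handle.write(digest + "\n")
        return False
```

Without a checksum file, the set of seen URLs lives only in memory, so a restart would let previously seen URLs through again.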

2.13.16 The Document Processor: match_and_copy Window
Use the Document Processor: match_and_copy window in ways that are
similar to the Document Processor: substitute window. However, the
match_and_copy window enables you to write the output to a field that is
different from the input field.
To use the Document Processor: match_and_copy window, complete these
steps:
1. Select match_and_copy in the Add Document Processor window and
click Next. The Document Processor: match_and_copy window
appears.
2. Enter the name of the field to be located into input-field.
3. Specify the pattern of the regular expression for this field in the pattern
field.
4. Enter the name of the field where the output is placed into
output-field.
5. (Optional) If the format field contains a value, the value controls how
each match is formatted. Otherwise, the matches are copied in full.
6. (Optional) The append parameter controls whether these values are
added to the end of an existing value for the output field, or these values
replace an existing value.
7. (Optional) By default the semicolon character (;) is entered into the
separator field. You can enter a different character, or a string, if the
output-field is used as a label or sent to the index.
8. Click Finish.
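
The steps above can be sketched in Python to show how the parameters interact. This is an illustration under stated assumptions, not the product's implementation: the function name is hypothetical, the regular expression dialect is Python's, and the format value is assumed to use Python group references such as \1.

```python
import re


def match_and_copy(doc, input_field, pattern, output_field,
                   fmt="", append=False, separator=";"):
    """Copy every regex match from input_field into output_field."""
    matches = []
    for m in re.finditer(pattern, doc.get(input_field, "")):
        # With a format, expand group references; otherwise copy the match.
        matches.append(m.expand(fmt) if fmt else m.group(0))
    value = separator.join(matches)
    if append and doc.get(output_field):
        existing = doc[output_field]
        value = separator.join([existing, value]) if value else existing
    doc[output_field] = value
    return doc
```

The input field is left untouched, which is the difference from the substitute processor.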

2.13.17 The Document Processor: modify_field_name Window
Use the Document Processor: modify_field_name window to change a field
name.
To use the Document Processor: modify_field_name window, complete these
steps:
1. Select modify_field_name in the Add Document Processor window
and click Next. The Document Processor: modify_field_name window
appears.

2. Enter the name of the field that you want to change into oldname.
3. Enter the name of the new field into newname.
4. Click Finish.

2.13.18 The Document Processor: parse_html Window
Use the Document Processor: parse_html window to extract the contents of an
input HTML document.
To use the Document Processor: parse_html window, complete these steps:


1. Select parse_html in the Add Document Processor window and click
Next. The Document Processor: parse_html window appears.
2. Leave the default specification raw, or enter a new location where this
processor can locate the unmodified document data, into input-field. If
you leave raw, the parse HTML tool extracts the text and puts it into
title-output-field and the body-output-field.
3. Leave the default specification title, or enter a new title field into
title-output-field. The plain text of the title is output to this field. If
you leave this field empty, no title text is output.
4. Leave the default specification body, or enter a new body field into
body-output-field. The plain text of the body field is output to this
field. If you leave this field empty, no body text is output.
5. Leave the default specification 0 in output-metadata to specify that
additional information in other fields is output. (The text in the body and
title fields is always output.) For example, description and
keywords might be output. The fields that are used for output depend on
the meta field types that appear in the HTML documents.
6. The entry 1 in the require-mime-type field specifies that the
mimetype field is required. If you enter 0, this field is not required.
7. Leave the mimetype entry in mime-type-field, or specify a different
type field.
8. The entry 1 in the base64-input field specifies that the text is encoded
in the mime content transfer encoding. If you enter 0, this encoding is
not used.

2.13.19 The Document Processor: parse_xml Window


Use the Document Processor: parse_xml window to extract the contents of an
input XML document. You can instantiate this document processor multiple
times in order to support multiple document schemas.
To use the Document Processor: parse_xml window, complete these steps:
1. Select parse_xml in the Add Document Processor window and click
Next. The Document Processor: parse_xml window appears.
2. Leave the default specification, raw, in input-field and the document
processor gets the text from the document.
3. Leave the default specification mimetype in mime-type-field. Only
documents with a mimetype of XML are processed.
4. (Optional) Enter the name of the file that tells the application how to
treat fields in input documents. The template-filename field specifies
the name and location of this file.
5. (Optional) Enter the name of the tag in input XML documents that
contains the string that is the identifier for output documents into the
copy-url-from-field.
6. (Optional) Enter the name of the tag in output documents that contains
the identifying string for these documents into the copy-url-to-field.
7. (Optional) Enter the name of the file that tells the application how to
treat fields in input documents. The template-filename specifies the
name and location of this file.
For more information, and to see an example of this file format, see Section
A.2 XML File Field Extraction File Format on page 353.

2.13.20 The Document Processor: send Window


Use the Document Processor: send window to pass each document to another
pipeline server. This operation is used when you want to deploy multiple
pipeline servers.
To use the Document Processor: send window, complete these steps:
1. Select send in the Add Document Processor window and click Next.
The Document Processor: send window appears.
2. Enter a new server name into the host field.
3. Enter a new number into the port field.
Note: Change these settings to prevent an endless loop.
4. Leave the default entry id in id-field, or enter the name of a new field
where the document identifier is located.
5. Leave the default setting 0 in the invalidate field if you want to stop
the input files at this point in the pipeline. Alternatively, enter 1 to send
each instance of a document to another instance of the pipeline.

2.13.21 The Document Processor: strip_html Window


Use the Document Processor: strip_html window when you have a field that
contains some HTML code that you want to convert into plain text. For
example, use this window if input XML documents contain HTML code. This
operation leaves the textual contents and removes the mark-up tags. (If the
entire document is in HTML, use the parse_html operation instead.)
To use the Document Processor: strip_html window, complete these steps:
1. Select strip_html in the Add Document Processor window and click
Next. The Document Processor: strip_html window appears.
2. Leave the body field, or add new fields that are separated by commas
into Fields. These are the fields where the HTML tags are stripped in
order to return the text that they contain.
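
To show what stripping tags from selected fields means in practice, here is a minimal Python sketch built on the standard html.parser module. The function and class names are hypothetical, and the product's own handling of entities and whitespace might differ.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(doc, fields=("body",)):
    """Replace each listed field with its text content, tags removed."""
    for name in fields:
        if name in doc:
            collector = _TextCollector()
            collector.feed(doc[name])
            collector.close()
            doc[name] = "".join(collector.parts)
    return doc
```

Fields that are not listed are left as they are, so mark-up in other fields survives.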

2.13.22 The Document Processor: substitute Window


Use the Document Processor: substitute window to perform regular expression
substitutions.


To use the Document Processor: substitute window, complete these steps:
1. Select substitute in the Add Document Processor window and click
Next. The Document Processor: substitute window appears.
2. Enter the name of the first regular expression field to be located into
Field.
3. Specify the pattern of the regular expression in the Pattern field.
4. Enter the replacement for the first regular expression field into
replacement.

For more information about regular expressions, see Appendix A.
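
As an illustration of a regular expression substitution on a single field, the following Python sketch rewrites the field in place. The function name is hypothetical and the pattern syntax shown is Python's, which might differ in detail from the syntax described in Appendix A.

```python
import re


def substitute(doc, field, pattern, replacement):
    """Replace every match of pattern in doc[field] with replacement."""
    if field in doc:
        doc[field] = re.sub(pattern, replacement, doc[field])
    return doc
```

For example, the pattern \s+ with a single space as the replacement collapses runs of whitespace in the chosen field.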


2.14 Miscellaneous Windows


2.14.1 The Import Settings Window
Use the Import Settings window to specify the XML file to import and the
affected components of SAS Information Retrieval Studio. When you select
this operation, you choose to modify the selected components of SAS
Information Retrieval Studio with the settings that you import.
To access and use the Import Settings window, complete these steps:
1. Click Import Settings in the Overview pane.
The Import Settings window appears.
2. Enter the name of the file that you want to import into the Filename
field. For example, enter ProjASettings.xml.
3. (Default setting: all components are selected) Deselect any of the
components that you do not want to modify with the imported file in the
Components section. For example, deselect Feed crawler, Indexing
server, and Query web server.
4. Click OK to save these settings.

2.14.2 The Export Settings Window


Use the Export Settings window to save the settings for the components that
you configured in SAS Information Retrieval Studio as an XML file. You can
then import this file to use it with another project.
To access and use the Export Settings window, complete these steps:
1. Click Export Settings in the Overview pane.
The Export Settings window appears.
2. Enter the name of the file that you want to export into the Filename
field.
3. Click OK to save these settings.


2.14.3 The Select an HTTP Proxy Window


When you select the HTTP proxy server, you choose a server that is not the
proxy server for SAS Information Retrieval Studio. The HTTP proxy server is
an intermediary between the crawler and the Web site. It evaluates requests
before passing them to the web server.
To access and use the Select an HTTP Proxy window, complete these steps:
1. Select Configuration --> General Settings in the Web Crawler pane.
2. Click Auto-detect and the Select an HTTP Proxy window appears.
3. Choose an HTTP Proxy. For example, select HTTPProxyname.
4. Click OK and the server appears in the HTTP proxy field.

2.14.4 The Add Entry Point Window


Use the Add Entry Point window to specify the URL that the web crawler uses
to begin its Web crawl. You can also limit the scope of the crawl and the
number of files that are downloaded from this site.
To access and use the Add Entry Point window, complete these steps:
1. Click the Configuration tab in the Web Crawler pane.
2. Click the Entry Points tab.
3. Click Add. The Add Entry Point window appears.
4. Enter a Web address into the URL field. For example, enter
www.sas.com.
Hint: The http:// part of the address is automatically inserted for you
after you click OK.
5. (Optional) Leave the default selection Yes in the Add to scope field or
select No. Unless there are scope rules, the crawler
follows all links found on the entry point page, the links found on those
pages, and so on. Scope rules limit the links that the crawler follows.
Use this feature to constrain the crawl to a single site, or section of the
site. In other words, the scope rule follows the way that many Web
pages are laid out. When you leave Yes selected, the URL is
automatically added to the Scope tab in the web crawler Configuration
pane. For more information, see Section 2.4.4.D The Scope Tab on
page 20.
6. (Optional) Reset the number in the Quota field.
For example, specify 90000. When you specify a quota for the links
from the entry point, the overall quota for the crawler, or this number,
applies.
7. Click OK and this address appears in the entry points list.


2.14.5 The Edit Entry Point Window


Use the Edit Entry Point window to change the URL or Quota that you added
using the Add Entry Point window.
To access and use the Edit Entry Point window, complete these steps:
1. Select an entry point and click Edit in the Entry Points tab of the
Configuration pane for the web crawler. The Edit Entry Point window
appears.
2. Enter your changes into the URL field. For example, enter
http://.*\.sas\.com.
3. (Optional) Reset the number in the Quota field.
4. Click OK to save these settings.


2.14.6 The Add Feed Window


The feed crawler collects postings, whether full texts or summaries, from both
RSS and Atom feeds. Use the Add Feed window to add the URL for a feed to
the Feeds pane for the feed crawler. In order to select a URL, you locate the
Web page where the feed is located and copy the entire feed URL.
If you choose to collect summaries, select Yes in the Follow links field of the
Add Feed window.
Hint: If the feed parser collects summaries, select the parse_html
document processor in the pipeline server. You can also specify a custom
document processor to handle the pages returned from these links. However,
the follow links operation does not perform recursively like it does for the
web crawler.

To obtain a feed URL, not a page URL, complete these steps:
1. Locate the Web page with the orange box that symbolizes an RSS feed.
For example, http://support.sas.com/community/rss/.
2. Click the icon located to the left of the feed that you want. For
example, Press Releases.
3. The feed page appears.
4. Copy the feed URL from the URL field in the browser. For example,
copy http://www.sas.com/news/preleases/SASRecentPress.xml.
After you copy the URL for an RSS feed using the steps above, complete these
steps:
1. Select Feed Crawler --> General Settings --> Feeds. The Feeds
pane appears.
2. Click Add and the Add Feed window appears.
3. Paste the copied RSS feed URL into the Feed URL field.
4. Leave the default setting No, or select Yes in the Follow links
field. (If you are collecting summaries, select Yes.)
5. Click OK and the URL appears in the Feeds pane.


2.14.7 The Edit Feed Window


Use the Edit Feed window to change a URL that you added to the Feeds pane.
To edit a feed URL, complete these steps:
1. Click Edit in the Feeds tab of the Configuration pane for the feed
crawler. The Edit Feed window appears.
2. Place your cursor into the Feed URL field and make any necessary
changes.
3. Leave the default setting No, or select Yes in the Follow links field.
4. Click OK to save these entries.


2.14.8 The Add Scope Rule Window


Use the Add Scope Rule window to set limits for the pages that the web
crawler can follow on the Internet.
To access and use the Add Scope Rule window, complete these steps:
1. Click Add in the Scope tab of the Configuration pane for the web
crawler. The Add Scope Rule window appears.
2. Enter a Web address, or enter a regular expression to define a matching
pattern for URLs, into the URL Pattern field.
A URL pattern is a pattern that matches against URLs. It is not a URL
itself. SAS Information Retrieval Studio supports two types of
patterns. The first is the prefix pattern, which matches against the
beginning of the URL. For example, the prefix pattern http://www.sas.com
matches http://www.sas.com/technologies/analytics/index.html. The
second is a regular expression pattern, which matches the whole URL but
also supports wildcards and other operators.
3. Leave the default selection Prefix in the Match type field unless you
specified a regular expression in the URL Pattern field. In this case,
select Regular expression.
4. Leave the default selection Allow in the Action field unless you want to
exclude URLs that match this pattern from the crawl. In this case,
select Exclude.
5. Click OK and this address appears under the URL Pattern heading.
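
The two match types and the two actions can be sketched in Python as follows. The names ScopeRule and in_scope are hypothetical, Python's regular expression syntax stands in for the product's, and the rule that an Exclude match overrides an Allow match is an assumption that this section does not confirm.

```python
import re
from dataclasses import dataclass


@dataclass
class ScopeRule:
    pattern: str
    match_type: str = "prefix"  # "prefix" or "regex"
    action: str = "allow"       # "allow" or "exclude"

    def matches(self, url):
        if self.match_type == "prefix":
            # A prefix pattern only has to match the beginning of the URL.
            return url.startswith(self.pattern)
        # A regular expression pattern has to match the whole URL.
        return re.fullmatch(self.pattern, url) is not None


def in_scope(url, rules):
    """Return True if some Allow rule matches and no Exclude rule matches."""
    matched = [rule for rule in rules if rule.matches(url)]
    if any(rule.action == "exclude" for rule in matched):
        return False
    return any(rule.action == "allow" for rule in matched)
```

Note that the regular expression http://.*\.sas\.com matches only a URL that ends at the host name, so a pattern meant to cover whole sites usually needs a trailing .* as well.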


2.14.9 The Edit Scope Rule Window


Use the Edit Scope Rule window to change the limits of the web crawler's
Internet search.
To access and use the Edit Scope Rule window, complete these steps:
1. Click Edit in the Scope tab of the Configuration pane for the web
crawler. The Edit Scope Rule window appears.
2. Use Step 2 through Step 5 in Section 2.14.8 The Add Scope Rule
Window on page 130 to make any necessary changes.


2.14.10 The Add Filename Extension Window


Use the Add Filename Extension window to create a list of file types that
should specifically be excluded, or included, in the crawl. If you choose to
include one or more file extension types, all others are excluded. For this
reason, if you want to include all file types do not use the steps below.
Note: This matching is case-sensitive.
To access and use the Add Filename Extension window, complete these steps:
1. Click Add in the Filename Extensions tab of the Configuration pane
for the web crawler. The Add Filename Extension window appears.
2. Enter the file extension that you want to return or to exclude in the
Extension field. For example, enter gif.
3. Leave the default selection. If you select Exclude, the selected file
type is not returned.
4. Click OK and the change appears in the Filename Extensions pane.
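
The include and exclude behavior described in this section, including the case-sensitive matching, can be illustrated with a short Python sketch. The function name and the way the extension is taken from the URL path are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse


def extension_allowed(url, allowed=(), excluded=()):
    """Apply case-sensitive extension rules to a URL."""
    ext = posixpath.splitext(urlparse(url).path)[1].lstrip(".")
    if ext in excluded:
        return False
    if allowed:
        # Once any extension is included, all other extensions are excluded.
        return ext in allowed
    return True
```

Because the comparison is case-sensitive, excluding gif does not exclude a file named logo.GIF. Add both spellings if you need to cover both.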


2.14.11 The Edit Filename Extension Window


Use the Edit Filename Extension window to change the Allow or Exclude
operation for the selected file type.
Note: This matching is case-sensitive.
To access and use the Edit Filename Extension window, complete these steps:
1. Click Edit in the Filename Extensions tab of the Configuration pane
for the web crawler. The Edit Filename Extension window appears.
2. Use Step 2 through Step 4 in Section 2.14.10 The Add Filename
Extension Window on page 133 to make any changes.


2.14.12 The Add Credential Window


Use the Add Credential window to set the user name and password that are
required to access a password-protected site.
To access and use the Add Credential window, complete these steps:
1. Click Add in the Credentials tab of the Configuration pane for the web
crawler. The Add Credential window appears.
2. Enter the address for a Web site that requires credentials into the Site
field.
3. Enter the name of the user into the Username field.
4. Enter the password for this user into the Password field.
5. Click OK and these entries appear in the Credentials pane.


2.14.13 The Edit Credential Window


Use the Edit Credential window to make changes to the information that you
entered in the Add Credential window. Use this window to narrow the
crawling scope. In other words, crawl everything but the specified file or
directory.
To access and use the Edit Credential window, complete these steps:
1. Click Edit in the Credentials tab of the Configuration pane for the web
crawler. The Edit Credential window appears.
2. Use Step 2 through Step 5 in Section 2.14.12 The Add Credential
Window on page 135 to make any changes.


2.14.14 The Add Path Window


Use the Add Path window to specify the location of a file or directory for the
file crawler. You can use either relative or absolute paths. Relative paths are
resolved relative to the component that uses them. However, absolute paths are
recommended for accurate search returns.
To access and use the Add Path window, complete these steps:
1. Click Add in the Paths tab of the Configuration pane for the file
crawler. The Add Path window appears.
2. Enter a file or directory name into the Path field. For Windows
fileshares, use universal naming conventions (UNC) instead of local
paths.
3. Click OK and this entry appears in the Paths pane.


2.14.15 The Edit Path Window


Use the Edit Path window to make a change to the selected path.
To access and use the Edit Path window, complete these steps:
1. Select a path and click Edit in the Paths tab of the Configuration pane
for the file crawler. The Edit Path window appears.
2. Use Step 2 through Step 3 in Section 2.14.14 The Add Path
Window on page 137 to make any necessary changes.

2.14.16 The Add Path to Exclude Window


Use the Add Path to Exclude window to deny the file crawler access to a file
or directory. This window enables you to limit the scope of the crawl.
To access and use the Add Path to Exclude window, complete these steps:
1. Click Add in the Paths to Exclude tab of the Configuration pane for
the file crawler. The Add Path to Exclude window appears.
2. Enter a path to the files that the file crawler should not access in the
Path field.
3. Click OK and this entry appears in the Paths to Exclude pane.


2.14.17 The Edit Path to Exclude Window


Use the Edit Path to Exclude window to make a change in the directory path.
To access and use the Edit Path to Exclude window, complete these steps:
1. Click Edit in the Paths to Exclude tab of the Configuration pane for
the file crawler. The Edit Path to Exclude window appears.
2. Use Step 2 through Step 3 in Section 2.14.16 The Add Path to
Exclude Window on page 138 to make any changes.

2.14.18 The Add Extension Window


Use the Add Extension window to limit the access of the file crawler to the list
of files specified in this pane. If you enable at least one type of file to be
returned, only these files are returned.
To access and use the Add Extension window, complete these steps:
1. Click Add in the Filename Extensions tab of the Configuration pane
for the file crawler. The Add Extension window appears.
2. Enter a string into the Extension field. For example, enter html.
3. Click OK and this entry appears in the Filename Extensions pane.


2.14.19 The Edit Extension Window


Use the Edit Extension window to limit file crawler access to the list of files
specified in this pane.
To access and use the Edit Extension window, complete these steps:
1. Click Edit in the Filename Extensions tab of the Configuration pane
for the file crawler. The Edit Extension window appears.
2. Use Step 2 through Step 3 in Section 2.14.18 The Add Extension
Window on page 139 to make any necessary changes.


2.14.20 The Add Backend Window


Use the Add Backend window to add a pipeline server to the configuration
pane for the proxy server.
To access and use the Add Backend window, complete these steps:
1. Click Add in the Configuration pane for the proxy server. The Add
Backend window appears.

2. Enter a new string into the Host field. For example, enter newhost.
3. Click the up or down arrow to reset the number in the Port field. For
example, specify 9008.

4. Click OK. The new server information appears in the configuration
pane.


2.14.21 The Edit Backend Window


Use the Edit Backend window to change information about the pipeline server
that appears in the proxy server configuration pane.
To access and use the Edit Backend window, complete these steps:
1. Click Edit in the Configuration pane for the proxy server. The Edit
Backend window appears.

2. Use Step 2 through Step 4 in Section 2.14.20 The Add Backend
Window on page 141 to make any necessary changes.


2.14.22 The Add Field Window for the Indexing Server


Use the Add Field window to define the fields of the document stored in the
index. You also use this window to change the specifications that you entered
when you click Edit in the Configuration pane of the Indexing Server.
To access and use the Add Field window, complete these steps:
1. Click Add in the Configuration pane of the Indexing Server. The Add
Field window appears.

2. Enter the field name into the Name field. You can specify any field
name that appears in your documents.

3. Click the drop-down arrow to choose one of the following selections:

Searching

(Default) Search for words that match the input query terms in this
field. This choice is equivalent to the standard functionality.


Label

(default) select this type of usage for facetted search. For more
information about facetted search labels, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.

Display and Sorting

display the matching URLs according to the sorting type that you
select. Sort the results alphabetically or numerically, instead of by
relevancy. This selection corresponds to marking the field as info.

Identification

choose this selection to identify the field that contains the individual
identification number for each document. Each document in the index
requires a unique identifier. If a new document is added that has the
same identification number as an old document, the new document
replaces the old document.

Custom

select one or more of the following field types:

Standard

make this field a regular field. This selection enables searching
within this field.


Boolean

enable searches with Boolean operators. Boolean fields require
an exact match on a Boolean field in a document. If the entire
contents of a Boolean field are equal to the term, in a byte-for-byte
manner, there is a match. In other words, case, punctuation,
and whitespace characters for the matched term are identical.

Info

make this field an information field. The information field is
used to pass static data. Information fields are not modified and
they cannot be matched by a query.

URL

contains either a Web address or a unique string.

Date

contains a field that represents the date of the document.

Number

contains an integer value that is associated with the document,
such as a price or rating. Use this selection for range-based
query constraints at query time, if you choose to use the query
API. For more information about the query API, see the SAS
Search and Indexing: User and C and Java API Guide.

4. Click OK to add this field to the list of fields specified for this index.
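
Three of the behaviors described for these field types can be illustrated with a small Python sketch: replacement by identification number, byte-for-byte Boolean matching, and range constraints on Number fields. This is a toy in-memory model, not the indexing server or the SAS Search and Indexing query API. The class name TinyIndex and its methods are hypothetical.

```python
class TinyIndex:
    """Toy in-memory index keyed by each document's identification field.

    Hypothetical illustration only; not the product's index format or API.
    """

    def __init__(self, id_field="id"):
        self.id_field = id_field
        self.docs = {}

    def add(self, doc):
        # A new document with an existing ID replaces the old document.
        self.docs[doc[self.id_field]] = doc

    def match_boolean(self, field, term):
        # A Boolean field matches only when its entire contents equal the
        # term byte for byte: same case, punctuation, and whitespace.
        wanted = term.encode("utf-8")
        return [
            doc for doc in self.docs.values()
            if field in doc and doc[field].encode("utf-8") == wanted
        ]

    def match_number_range(self, field, low, high):
        # A Number field holds an integer and supports range constraints.
        return [
            doc for doc in self.docs.values()
            if isinstance(doc.get(field), int) and low <= doc[field] <= high
        ]
```

In this sketch, adding two documents with the same id leaves one document in the index, and a Boolean query for EN does not match a field that contains en.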


2.14.23 The Add Field Window for the Query Web Server Matching Pane


Use the Add Field window to add fields to the Matching pane of the query web
server.
Note: The only fields that are available are those added to the
index with search functionality.

To access and use the Add Field window, complete these steps:
1. Click Add in the Matching pane of the Query Web Server. The Add
Field window appears.

2. Leave the default selection, such as id, to search the identification fields
of the input documents, or click the drop-down arrow to choose another
field. The fields in this drop-down list are added in the pipeline server.

3. Leave the default Weight value of 1, or click the up or down arrow to
reset the weight assigned to this field. The weight value sets a number that
is relative to the other matching fields and is used to prioritize matches.

4. Click OK to save this field in the Matching pane.
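
The idea of a weight that is relative to the other matching fields can be sketched as follows. This guide does not state the scoring formula that the query web server uses, so the weighted sum below is an assumption chosen only to show how relative weights could prioritize matches. The function name weighted_score is hypothetical.

```python
def weighted_score(doc, query_terms, field_weights):
    """Sum the term hits in each matching field, scaled by the field weight.

    Hypothetical scoring rule for illustration; not the product's formula.
    """
    score = 0
    for field, weight in field_weights.items():
        words = str(doc.get(field, "")).lower().split()
        hits = sum(words.count(term.lower()) for term in query_terms)
        score += weight * hits
    return score
```

With a title weight of 3 and a body weight of 1, one hit in the title counts as much as three hits in the body. Only the ratio between the weights matters in this sketch.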


2.14.24 The Edit Field Window for the Query Web Server Matching Pane


Use the Edit Field window to change field entries in the query web server
Matching pane.
To access and use the Edit Field window, complete these steps:
1. Click Edit in the Configuration pane of the Query Web Server. The
Edit Field window appears.

2. Use Step 2 through Step 3 in Section 2.14.23 The Add Field Window
for the Query Web Server Matching Pane on page 146 as necessary.


2.14.25 The Add Field Window: Query Web Server Labels Pane


Use the Add Field window to add labels to the query web server for use with
facetted search.
To access and use the Add Field window, complete these steps:
1. Click Add in the Labels pane of the Query Web Server. The Add a
Field window appears.

2. Click the drop-down arrow to select a field. Any field in the index that
has label functionality is available in the drop-down list. For example,
categories is added as a document processor to the pipeline server.

3. Enter a label name into the Caption field. This term appears for any
matches on the categories. If you added concepts in the Pipeline Server
tab, you could specify a different label for each concept.

4. Leave the default selection No in the Hierarchical field, or click the
drop-down arrow to select Yes or Flattened. No displays the list view, Yes
displays the tree view, and Flattened displays the tree in a list format.


Note: You can specify Yes or Flattened in the Hierarchical
field for categories only. Concepts and facts do not
have parent-child relationships.

5. Leave the default selection No in the Display counts field, or click the
drop-down arrow to select Yes to see the number of matching values for each
label field.

6. Leave the Show in matches value 0, or click the up or down arrow to
reset the number of labels found in each individual matching
document that are displayed in the results list.

7. Click OK to save these entries in the Labels pane.


2.14.26 The Edit Field Window: Query Web Server Labels


Use the Edit Field window to make changes to the labels in the query web
server for use with facetted search.
To access and use the Edit Field window, complete these steps:
1. Select a label in the Labels pane.
2. Click Edit. The Edit Field window appears.

3. Use Step 2 through Step 7 in Section 2.14.25 The Add Field Window:
Query Web Server Labels Pane on page 148 as necessary.


2.14.27 The Color Box Window


Use the color box window to make changes to the colors that you use for the
query web server user interface.
To access and use the color box window, complete these steps:
1. Select Query Web Server --> Configuration --> Theme --> Colors.

2. Click [icon] in the Color pane. The color box window appears.

3. Select [icon] to make a color change.

4. Click a color box beneath Basic to select a color.

5. See the colors that you previously selected in the Recently used boxes.


6. Click Custom Colors to access the expanded version of the color box
window.

7. See the color range for the selected color in the large pane.
8. Slide the [icon] and [icon] buttons up or down to select a new color.
9. Click the up or down arrow to reset the default number 112 assigned to the
Green field.

10. Click the up or down arrow to reset the default number 138 assigned to the
Red field.

11. Click the up or down arrow to reset the default number 116 assigned to the
Blue field.


12. See the newly selected color in the New pane.

13. See the existing color in the Current pane.
14. See the hexadecimal color code that corresponds with the selected color
for the Web page.

15. Click OK to save this color in the selected field.
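
The relationship between the Red, Green, and Blue numbers and the hexadecimal color code can be shown with a short Python sketch. It uses the standard six-digit Web notation, in which each channel is written as two hexadecimal digits. The function names and the sample values are illustrative assumptions.

```python
def rgb_to_hex(red, green, blue):
    """Convert three 0-255 channel values to a six-digit Web color code."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each channel must be in the range 0-255")
    return "#{:02X}{:02X}{:02X}".format(red, green, blue)


def hex_to_rgb(code):
    """Convert a six-digit Web color code back to (red, green, blue)."""
    digits = code.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
```

For example, a color with Red 138, Green 112, and Blue 116 corresponds to the code #8A7074.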

2.14.28 Status Windows


2.14.28.A Overview of Status Windows
Some, but not all, of the status windows that appear in SAS Information
Retrieval Studio are displayed in this section. Use the status windows in this
application to understand the processes, to catch errors, and to make changes
to your application.

2.14.28.B The Confirmation Window


Use the Delete Index window to remove an index from the server.
To access and use the Delete Index window, complete these steps:
1. Click Delete Index in the Indexing Server pane. The Delete Index
window appears.

2. Click OK. The index that you have compiled is deleted.

2.14.28.C The Error Window


The Error window appears when an operation cannot be completed. This
window contains a string providing relevant information. See the example
below:


Display 2.51 Error Window

Click OK to close this window and make your changes.


3 Choosing Your Components


- Before You Choose Your Components
- Choosing a Crawler
- Purposes of the Proxy Server
- Choosing Document Processors in the Pipeline Server
- How the Indexing Server Works
- Querying the Index
- Defining Labels for Facetted Search
- After You Choose Your Components
- Exporting and Importing Component Specifications

3.1 Before You Choose Your Components


SAS Information Retrieval Studio enables you, the administrator, to create a
custom document acquisition and processing application. You choose the
components that fit the requirements of your organization and you configure
each of these components. For example, if you choose to use the web crawler,
you install SAS Web Crawler before installing SAS Information Retrieval
Studio. If you want to perform search and indexing operations, you install
SAS Search and Indexing before you install SAS Information Retrieval
Studio.
You also choose the order for the document processors and decide whether to
index the processed texts or to send them to another application. For
information about sample configurations, see Chapter 4: Sample Configurations.

3.2 Choosing a Crawler


Choose a crawler to locate and return the documents that contain the
information that you seek. Crawlers crawl the Internet and your corporate
files. Each returned document could be a Web page, a blog post, or a file that
is posted to the Internet or on your local machine.
Each document is a unit of textual data. For example, a document can be an
HTML page, a Microsoft Word or a PDF file, one row in a CSV file, or an
article or summary in a feed.
In SAS Information Retrieval Studio, each document is represented as a
configurable set of fields. Each field has a name and a value. The value in
this name-value pair might be returned as binary data (for example, Word
documents). Each document has an associated ID tag that identifies the input
document as it is collected, processed, and output by SAS Information
Retrieval Studio.
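
The document model described above, an ID plus a set of named fields whose values are text or binary data, can be pictured with a small Python sketch. The class name Document, its attribute names, and the sample values are hypothetical and are not the product's internal representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# A field value is either text or raw binary data (for example, a Word file).
FieldValue = Union[str, bytes]


@dataclass
class Document:
    """Hypothetical stand-in for a collected document."""

    doc_id: str  # ID tag that follows the document through processing
    fields: Dict[str, FieldValue] = field(default_factory=dict)

    def is_binary(self, name):
        # True when the named field holds binary data rather than text.
        return isinstance(self.fields.get(name), bytes)


page = Document(
    "http://example.com/index.html",
    {"title": "Home", "body": "<html><body>Welcome</body></html>"},
)
report = Document("file:///reports/q1.doc", {"body": b"\xd0\xcf\x11\xe0"})
```

In this sketch, the body of the Web page is text that an HTML processor can parse, while the body of the Word file is binary data that must be converted to text first.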
There are three types of crawlers in SAS Information Retrieval Studio:
Web crawler

crawls the Web, according to the parameters that you set. These
parameters define the types of documents and information that you seek,
and they also limit the scope of the crawl. Limiting the scope, or the
breadth and depth of the crawl, prevents the crawler from attempting to
return every document that appears on the Internet.
When you limit the scope of the Web crawl, you optimize the crawl and
minimize the time that it takes to return this data. You can also specify the
credentials that are necessary to access password-protected sites.

File crawler

crawls fileshares on your organization's network, or your local machine,
for the types of files that you specify. You input the parameters of the
crawl to limit the retrieval operation to the document types and paths that
you select or exclude.


Feed crawler

obtains blog posts, user forum pages, and other trending data such as
press releases.

3.3 Purposes of the Proxy Server


All deployments of SAS Information Retrieval Studio require the proxy
server. The proxy server controls the flow of documents. As an intermediary
server, the proxy server copies the collected data to multiple places. This copy
process prevents the loss of data in the case of hardware failure. The proxy
server sends the same set of documents to each pipeline server.
The proxy server performs two basic types of operations. These functions
make it an integral link between the crawlers and the servers that together
form your customized SAS Information Retrieval Studio application. For this
reason, the proxy server is not optional.
- First, the proxy server enables you to pause the flow of incoming
  documents from a crawler. When you stop the flow, the incoming
  documents form a queue until you instruct the proxy server to resume
  operations. You can use the pause operation to perform maintenance
  without interrupting crawling.
- Second, the proxy server enables you to specify a list of pipeline servers
  running on different machines. A copy of each incoming document can be
  sent to each of these servers.

Use the proxy server to send the same set of documents to multiple pipeline
servers:
- To create mirrors, specify the same configuration for each server. The
  servers that run on different machines act as mirrors in case of hardware
  failure.
- To create mirrors for multiple types of document processors, specify
  identical document processing capabilities for each type of pipeline server.
  In case of hardware failure, the input documents are saved in another
  pipeline server.
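
The two proxy server operations described above, queuing documents while paused and copying each document to every pipeline server, can be sketched in Python. This is a conceptual model only. The class name ProxySketch is hypothetical, and plain lists stand in for pipeline servers.

```python
from collections import deque


class ProxySketch:
    """Conceptual model of the pause and fan-out behavior (hypothetical)."""

    def __init__(self, backends):
        # Each backend is a list that stands in for one pipeline server.
        self.backends = backends
        self.paused = False
        self.queue = deque()

    def submit(self, doc):
        # While paused, incoming documents wait in a queue.
        if self.paused:
            self.queue.append(doc)
        else:
            self._forward(doc)

    def pause(self):
        self.paused = True

    def resume(self):
        # Drain the queue in arrival order, then continue normally.
        self.paused = False
        while self.queue:
            self._forward(self.queue.popleft())

    def _forward(self, doc):
        # Every backend receives its own copy of the document.
        for backend in self.backends:
            backend.append(dict(doc))
```

Because every backend receives the same documents in the same order, two backends with identical configurations hold the same content, which is the mirror arrangement described above.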


3.4 Choosing Document Processors in the Pipeline Server


3.4.1 Overview of the Pipeline Server
Use the pipeline server to specify the document processors that act on input
documents. For example, strip the markup tags from HTML documents, or
convert Microsoft Word documents into text. These document processors take
the input documents and prepare them to be used by another component of
SAS Information Retrieval Studio or another application.
After the input documents are processed by the pipeline server, choose to
export the data or build a searchable index of the documents using the
indexing server. Use the pipeline server to pass the input documents to the
selected program or to the indexing server.

3.4.2 Choosing a Document Processor


All deployments of SAS Information Retrieval Studio that process incoming
documents require the pipeline server. Input documents can be processed and
sent to the index or they can be passed to another application. For example,
HTML documents collected by the web crawler are passed to the proxy server
and then to the pipeline server. In the pipeline server, you can select the
document processors that act on this document before it is indexed or sent to
another application.
For example, if a document is in HTML format, it requires processing before
the text is used by another application. Use the parse_html or
heuristic_parse_html processor to perform HTML processing before the
pipeline server performs any additional document processing operations on
the input document.
Use the following document processing operations before you index a
document:
categorizer

match category rule terms that appear in one, or more, fields of an input
document. These rules are found in the categories project running on SAS
Content Categorization Server.


concept_extractor

extract any matching concepts from an input document. These terms are
located in the specified concepts project running on SAS Content
Categorization Server.

contextual_extractor

return the matching contextual extraction concepts and facts in the
specified project running on SAS Content Categorization Server.

default_mime_type_from_url

determine the type of the original input document based on the filename
extension found in its Web address.

default_title_from_url

name documents that lack a title based on their Web addresses.

document_converter

change the format of incoming files, such as Adobe PDF and Microsoft
Office documents, into text using the SAS Document Conversion
application.

extract_abstract

obtain the summary for each document.

extract_pdate

normalize the document dates into a format understood by SAS Search
and Indexing.

heuristic_parse_html

separate the body of an HTML document from its tags using a heuristic
algorithm that aims for the best result. This operation searches for
paragraphs of text without many tags and extracts these bodies of text.

invalidate_duplicates_by_url

prevent multiple copies of the same document from being returned.

modify_field_name

change the name of a field in an input document.


parse_html

separate the text from the HTML mark-up tags in an input HTML
document.

parse_xml

separate the text from the XML mark-up tags in an input XML document.

send

save each input document to each pipeline server.

strip_html

remove the mark-up tags and return only the text from an HTML-formatted
field in a document.

substitute

use regular expressions to modify fields that match a specified pattern.
Regular expressions are used to locate patterns. For more information, see
Appendix A.
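
To make the idea of chained document processors concrete, the following Python sketch imitates two of the operations listed above, strip_html and substitute, and applies them in order to one document. The functions share names with the processors only for readability. They are hypothetical stand-ins built on the Python standard library, not the product's processors or their configuration options.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(doc, field):
    # Remove the mark-up tags and keep only the text of one field.
    collector = _TextCollector()
    collector.feed(doc[field])
    collector.close()
    result = dict(doc)
    result[field] = " ".join("".join(collector.parts).split())
    return result


def substitute(doc, field, pattern, replacement):
    # Rewrite every part of the field that matches a regular expression.
    result = dict(doc)
    result[field] = re.sub(pattern, replacement, doc[field])
    return result


def run_pipeline(doc, processors):
    # Apply each processor in list order; order matters.
    for processor in processors:
        doc = processor(doc)
    return doc
```

The processors run in list order, so the regular expression in substitute sees plain text only if strip_html runs first.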

3.4.3 The Export Operations Performed by the Pipeline Server


Use the pipeline server to perform the following operations. The export
operation is a function of the pipeline server. You can configure each of the
following operations for the pipeline server:

export_csv

transfer document text into a comma-separated format. The columns in this
file match the fields that are processed and placed into the output file. The
identification (ID) field in this file tracks the URL that identifies the
document.
The output can be used by many applications. For example, use this file type
to place a document into a spreadsheet.

export_to_files

save each document to a separate file whose name is based on a hash of its
contents.

export_to_odbc

send documents to a database or to another ODBC provider.


export_to_sentiment_analysis_workbench

pass the information directly to the SAS Sentiment Analysis Workbench
application for further analysis.

Note: By default, when SAS Search and Indexing is installed,
documents are sent to the index if they are not
exported to other applications.
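
Two of the export ideas above can be sketched in Python: writing documents as comma-separated rows whose columns are fields, and naming an output file after a hash of the document contents. This guide does not say which hash function or file extension the product uses, so SHA-256 and .txt are assumptions, as are the function names.

```python
import csv
import hashlib
import io


def export_csv(docs, field_names):
    """Write one row per document, with an id column and one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id"] + list(field_names),
        extrasaction="ignore",   # fields that are not exported are skipped
        lineterminator="\n",
    )
    writer.writeheader()
    for doc in docs:
        writer.writerow(doc)
    return buffer.getvalue()


def export_file_name(doc, field="body"):
    """Derive a file name from a hash of the contents (SHA-256 is assumed)."""
    digest = hashlib.sha256(doc[field].encode("utf-8")).hexdigest()
    return digest + ".txt"
```

Because the file name depends only on the contents in this sketch, two documents with identical text map to the same file name.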

3.5 How the Indexing Server Works


After the gathered documents are processed, they are passed to the indexing
server that builds a searchable index of the documents. Each document in the
index consists of a set of fields. These fields are populated with the data in the
matched fields of the documents passed to the indexing server by the pipeline
server. Use the different field types in the index to enable different types of
queries. If you choose to enable facetted search, you can also specify fields as
labels and enable intuitive search.

3.6 Querying the Index


3.6.1 Overview of Querying
The query server and the query web server work together to obtain queries and
pass them to the index. You can format the query web server to display the
matched documents in the search window. To see a statistical analysis of these
queries, use the selections that are available for the query statistics server.

3.6.2 Using the Query Server


The query server controls the flow of queries to the index and the matches that
are returned to input queries. You can see a log file for the query server if you
want to run a check on the server.


You can design a Web page that enables users to input queries and to obtain
search results. However, you can also use the query server with an application
that does not require an interface to search the index. In this case, you can
write a custom program to provide a connection between the query server and
your application.

3.6.3 Using the Query Web Server


The query web server provides the capabilities that are necessary to customize
a Web-browser interface for the query server. Use this window to specify the
following types of parameters:
Searches

specify simple or advanced searches. Simple searches match an input
term. Advanced searches match field names and take various types of
operators. Advanced searches narrow the results and return them more
accurately. You can also specify facetted search using labels. These labels
enable users to follow related search terms to intuitively locate the results
that they seek.

Sort the results

decide whether to return search results based on relevancy, date, or the
number of matching terms or fields.

Format the user interface and results

select the way that results are displayed in the custom user interface that
you design. You can also specify themes and colors.

3.6.4 Using the Query Statistics Server


The query statistics server enables you to monitor the queries entered from the
query server. Choose to perform this monitoring by specifying a date range.
You can choose to see any, or all, of the following statistics:
Most frequent queries

see a list of the most frequent query terms and the number of occurrences
for each term.


Most frequent queries without matches

see a list of the most frequent query terms that did not locate results in the
index. You can also see the number of occurrences for each term.
Query rate by hour

see the number of queries for each hour in a day.


Query rate by day

see the number of queries for each day of the week.


Query rate by month

see the number of queries for each month in a year.


Query rate for all time

see the number of queries since you installed the SAS Information
Retrieval Studio application.
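
The statistics listed above can all be derived from a log of queries. The following Python sketch computes three of them for a date range. The log layout, a list of (timestamp, query, number of matches) tuples, and the function name are assumptions for illustration and do not describe the query statistics server's storage format.

```python
from collections import Counter
from datetime import datetime


def query_statistics(log, start, end):
    """Summarize queries whose timestamps fall between start and end.

    Each log entry is a (timestamp, query, match_count) tuple (assumed layout).
    """
    in_range = [entry for entry in log if start <= entry[0] <= end]
    frequent = Counter(query for _, query, _ in in_range)
    unmatched = Counter(query for _, query, count in in_range if count == 0)
    by_hour = Counter(stamp.hour for stamp, _, _ in in_range)
    return {
        "most_frequent": frequent.most_common(),
        "most_frequent_without_matches": unmatched.most_common(),
        "rate_by_hour": dict(by_hour),
    }


log = [
    (datetime(2024, 1, 1, 9, 0), "sas", 5),
    (datetime(2024, 1, 1, 9, 30), "sas", 5),
    (datetime(2024, 1, 1, 14, 0), "xyzzy", 0),
    (datetime(2024, 2, 1, 9, 0), "sas", 5),
]
stats = query_statistics(log, datetime(2024, 1, 1), datetime(2024, 1, 31, 23, 59))
```

The rates by day and by month follow the same pattern, grouping on the weekday or the month of each timestamp instead of the hour.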

3.7 Defining Labels for Facetted Search


Facetted search enables end users to query the index using clusters of related
labels to intuitively locate the information that they seek. The labels that
appear in the interface are those that occur most frequently in the matching
documents.
End users can navigate between labels without using the Back button in the
Web interface or breadcrumbs. Instead, users select meaningful terms and
navigate by using them to refine their query.
Labels are used to identify matching categories and concepts in indexed
documents. In other words, labels use the names of the SAS Content
Categorization Studio categories and concepts to cluster matching documents.
You specify these labels when you create a SAS Content Categorization Studio
project and upload it to SAS Content Categorization Server.
Users can click on labels to see related documents, or select a document to
view related labels. These webs of links, provided by the document
classifications, are an alternative to the linear paths provided by breadcrumbs,
or to the Back button in the browser.
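
The two ideas behind facetted search described in this section, showing the labels that occur most often in the matching documents and refining the results by a selected label, can be sketched in Python. The field name categories, the sample labels, and the function names are hypothetical.

```python
from collections import Counter


def label_counts(matching_docs, label_field="categories"):
    """Count how many matching documents carry each label, most frequent first."""
    counts = Counter()
    for doc in matching_docs:
        # Count each label once per document.
        counts.update(set(doc.get(label_field, [])))
    return counts.most_common()


def refine(matching_docs, label, label_field="categories"):
    """Keep only the matching documents that carry the selected label."""
    return [doc for doc in matching_docs if label in doc.get(label_field, [])]


docs = [
    {"id": 1, "categories": ["Sports", "News"]},
    {"id": 2, "categories": ["Sports"]},
    {"id": 3, "categories": ["Finance"]},
]
```

In this sketch, a user who selects the Sports label narrows the three matching documents to two, and the label counts can then be recomputed for the narrower result set.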


3.8 After You Choose Your Components


After you select the SAS Information Retrieval Studio components that fit the
document retrieval, processing, and search requirements of your organization,
you can construct your application. When you design the application, it is
important to consider the order of the processes involved. For example, you
cannot use the query web server interface to search documents that are not
indexed.
These specifications, as well as all of the information necessary to configure
each component of SAS Information Retrieval Studio, are explained in the
following chapters. You need to review only the chapters that discuss the
components that you choose to use.

3.9 Exporting and Importing Component Specifications


After you develop your SAS Information Retrieval Studio application, you can
export the specifications for your components. When you choose to use this
process, you create an XML file that can be imported into another project. Use
this process to create a new project using the old settings for some, or all, of
the SAS Information Retrieval Studio components. For more information, see
Section 2.14.1 The Import Settings Window on page 118 and Section 2.14.2
The Export Settings Window on page 119.


4 Sample Configurations


- Why You Want to Understand Sample Configurations
- Before You Use a Sample Configuration to Create Your Own Application
- Sample Configurations That Use the Web Crawler
- A Sample Configuration That Uses the File Crawler
- A Sample Configuration That Uses the Feed Crawler

4.1 Why You Want to Understand Sample Configurations


When you understand how some of the sample configurations for SAS
Information Retrieval Studio work, you have a better idea of how to build a
customized application. For this reason, these examples include the types of
processes, specifications, and purposes for these configurations.
Specific examples of the settings for the necessary document processors in the
pipeline server are also included. In the pipeline server, select the document
processors that act on input documents to prepare them to be handled by
another server or application.
Use these samples to understand how the various components work together, and
then configure the components for the application that meets the requirements
of your organization.

4.2 Before You Use a Sample Configuration to Create Your Own Application


Sample configurations are examples that are designed to be changed to meet
your organization's requirements. It is important to understand some of the
operations that are necessary when you develop an application and
then choose to make changes.
All of the following information is contained in the appropriate chapters that
follow this chapter. For your convenience, a summary of the operations that
are necessary when you make changes to your configuration is outlined below:
- Make sure that your document processors are listed in the order of
  logical operations:
  a. Normalize input text. For example, place parse_html,
     heuristic_parse_html, or document_converter first in the list of
     document processors. Each of these processors extracts the text from
     the input document.
  b. Process the text. For example, categorize, extract concepts,
     extract an abstract, and so on. These processors act only on
     normalized text.
  c. Export the documents to a SAS application or to third-party
     software. Send only the processed and normalized text that can be
     used by an index (by default, if you install SAS Search and
     Indexing, the documents are indexed) or by other applications.
     These applications include SAS Text Miner and SAS Sentiment
     Analysis Workbench.

- Click Apply Changes before you leave the tab for any component
  where you make changes.

- Delete the index if you want all of the gathered documents to be indexed
  according to the changed settings. If you do not delete the index, the
  documents that were indexed according to the old settings remain in the
  index. The documents that are added after you save your changes to the
  new index are indexed according to the new settings.

- If necessary, stop and restart the web, file, or feed crawler that is
  running. If you delete the index, stop and restart the web, file, or feed
  crawler that you chose to build the index.
  Hint: When you restart the crawling process, the documents
  that were previously gathered are collected again.

- If you decide to check the results of your index by entering query terms
  using the search window, consider the path and scope of your crawl. In
  other words, if you limit your crawl to SAS documents, do not expect to
  enter medical terms and locate matches in these documents.
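
The ordering rule in the first item above (normalize, then process, then export) can be checked mechanically. The following Python sketch flags processors that appear after a later stage has already started. The stage assignments are a simplification made for this example: the three normalizers named in this section are stage 0, any name that starts with export_ is stage 2, and every other processor is treated as stage 1. The function names are hypothetical.

```python
# Processors that this section says to place first in the list.
NORMALIZERS = {"parse_html", "heuristic_parse_html", "document_converter"}


def stage(name):
    """Map a processor name to 0 (normalize), 1 (process), or 2 (export)."""
    if name in NORMALIZERS:
        return 0
    if name.startswith("export_"):
        return 2
    return 1  # simplification: everything else is a processing step


def check_order(processors):
    """Return the processors that are listed later than their stage allows."""
    problems = []
    highest_seen = 0
    for name in processors:
        current = stage(name)
        if current < highest_seen:
            problems.append(name)
        highest_seen = max(highest_seen, current)
    return problems
```

For example, check_order(["categorizer", "parse_html", "export_csv"]) reports parse_html, because the categorizer would run on text that has not been normalized yet.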

4.3 Sample Configurations That Use the Web Crawler


4.3.1 A Web Crawler, Indexing, and Searching Configuration
For this configuration, using the web crawler with several other SAS
Information Retrieval Studio components, make sure that the following
components are installed:
- SAS Information Retrieval Studio
- SAS Search and Indexing
- SAS Web Crawler

Optional application: A category, concepts, or contextual extraction project
developed in SAS Content Categorization Studio and loaded on SAS Content
Categorization Server.
Note: It is necessary to choose an HTML document processor
when you use the web crawler.


To set up a simple project that crawls the Web, builds an index, and configures
the query server, complete these steps:
1. Select Web Crawler --> Configuration --> General Settings.

2. Click Auto-detect and the Select an HTTP Proxy window appears.

Use this window to select the proxy server that is located between the
web crawler and the Internet.

3. Select a server. For example, choose my.default.proxy.server.


4. Click OK. The selected server appears in the HTTP proxy field of the
General Settings pane.

5. (Optional) Make changes to the web crawler. For example, use these
steps:
a. Increase the number of files that the web crawler can collect from
25 to 3000 in the Quota field.

b. Increase the total size of the files that can be collected to 3000
megabytes in the Quota field.

c. If you increase the Number of downloader threads, the web
crawler can access more files quickly. However, too many threads
can overwhelm the site that the web crawler is crawling.
6. Click OK and the server appears in the General Settings pane.
7. Select Configuration --> Entry Points to specify the Web site where
the web crawler begins its Internet crawl.


8. Click Add. The Add Entry Point window appears.

9. Enter the Web address that the web crawler uses to enter the Internet
into the URL field.

10. (Optional) To limit this crawl to the specified site, select Add to scope.
If you specify at least one permitted site, every other site is excluded.
For more information, see Section 5.2.4 Specify the Scope of the Crawl
on page 198.
11. (Optional) Change the limit on the number of files downloaded from
this site in the Quota field using the up or down arrow. The lesser of the
numbers entered into this field and the Quota field in the General
Settings pane applies.
12. Click OK. This address appears in the Entry Points pane.
13. (Optional) To limit the file types that are returned, click the Filename
Extensions tab. For more information, see Section 5.2.5 Exclude
Certain Types of Files on page 202.
14. (Optional) To specify user names and passwords for password-protected
sites, click the Credentials tab. For more information, see Section
5.2.6 Specify Access Information for Password-Protected Sites on
page 203.


15. Click Apply Changes to save the new web crawler configuration.
Note: Do not start the web crawler until you have configured
all of the components for your application. If you start
the web crawler before you configure the indexing
server, delete and rebuild the index.

16. Select Pipeline Server --> Document Processors. The Document
Processors pane appears.


17. Click Add and the Add Document Processor window appears.

18. Select heuristic_parse_html.

Hint: You could also select parse_html, but the heuristic_parse_html
processor uses a heuristic to exclude navigation text.

19. Click Next and the Document Processor: heuristic_parse_html

window appears.

20. Leave the default settings or make changes. For more information about
these fields, see Section 2.13.14 The Document Processor:
heuristic_parse_html Window on page 108.


21. Click Finish and the document processor that you selected appears in the
Document Processors tab.

22. Repeat Step 17 through Step 21 until you have added all of the document
processors that you require. For example, if you want to add labels to
enable faceted search, use the content_categorization document
processor. For more information, see Section 2.13.4 The Document
Processor: content_categorization Wizard on page 78.
In this example, both categories and concepts are added to enable
faceted search.
23. Select a document processor and click Edit to make a change to a
document processor. If you want to change the functionality of a
category or concept field, see Step 26 below.
Note: Be sure to make the heuristic_parse_html document processor the first
in the list in the Document Processors pane.

24. Click Apply Changes.


25. Click the Configuration tab in the Indexing Server.

26. (Optional) Leave the default settings or click Edit to make changes to
the functionality of an index field.

Hint: If you added fields such as categories or concepts, these field names
automatically appear in the Configuration pane. The categories field, and
each concept field, has the Label functionality.

27. (Optional) By default, the index is optimized for the English language.
To change this setting, select another language from the drop-down menu.
28. Select Query Web Server --> Configuration --> Matching. Use
this pane to set the priorities for field matches. Weights are a relative
setting. The priority value that you specify for each field has meaning
only in relation to the other matched fields in a document.

29. (Optional) To add a field and specify its weight, click Add, and the Add

Field window appears.

30. In the Name field, select a field that appears in the Configuration pane.
For example, select title.

31. (Optional) To change the default setting of 1, use the controls in the
Weight field.

32. Click OK and this field and weight appear in the Matching pane.
33. Select the Labels pane to see all of the selected categories, concepts,

and facts.

34. (Optional) Select a field and click Edit to make changes to this field.
For example, use the Edit Field window that appears to change the
caption, or label, name. For more information, see Section 2.14.25 The
Add Field Window: Query Web Server Labels Pane on page 148.

35. Use the controls in the Maximum number of related labels field to change
the default setting of 10. This is the highest number of related labels
that can be displayed in response to a query. The end user sees these
labels after entering a query into the SAS Information Retrieval Studio
search window.

36. Click Apply Changes.


37. Select the Web Crawler pane and click Start.


38. Select Query Web Server --> Status.

39. Click the blue hyperlink and the search window appears.
40. Enter a query into the search field in the SAS Information Retrieval

Studio search window that appears. For example, enter analytics.

41. Click Search and see the labels that match the returned documents on

the left side of the search window. On the right side see the matching
documents with links to the full text for each document.
42. To see the statistics for queries, click the Query Statistics Server tab.

For more information, see Section 14.3 View the Query Statistics for a
Selected Time Period on page 341.


4.3.2 The Web Crawler with Exporting and Indexing Processes

You can send the same set of documents, collected by the web crawler, to an
index and SAS Text Miner. To perform these operations, configure the
pipeline server with the document processors appropriate to the index and to
the export operation.

4.4 A Sample Configuration That Uses the File Crawler

This configuration uses several SAS Information Retrieval Studio components
and processes. Make sure that the following components are installed:
- SAS Information Retrieval Studio
- SAS Search and Indexing
- SAS Document Conversion
- Optional application: A category, concepts, or contextual extraction project
developed in SAS Content Categorization Studio and loaded on SAS Content
Categorization Server.
Note: It is necessary to choose the document_converter processor for this
configuration.

To set up a simple project that crawls the files on your machine and exports
these files, complete the following steps:
1. Select File Crawler --> Configuration --> Paths.

2. Click Add to add one, or more, paths to the Paths pane. The Add Path

window appears.

3. Select a directory. For example, enter \\MyComputer\Documents\FolderA.

4. Click OK and the path appears in the Paths pane.


5. (Optional) Repeat Step 2 through Step 4 until you have added all of your
paths.

6. Change any settings in the other tabs of the Configuration pane that you
want to use to configure the file crawler.


7. Click Apply Changes to save the new file crawler configuration.
Note: Do not start the file crawler until all of the configuration processes
are complete. If you start the file crawler before configuring components
such as the indexing server, delete the index and rebuild it.

8. Select Pipeline Server --> Document Processors. The Document

Processors pane appears.


9. Click Add and the Add Document Processor window appears.

10. Select document_converter. This document processor extracts plain

text from input document formats such as Microsoft Office and Adobe
PDF files. This document processor is relevant for the file crawler, but it
can also be used with the web crawler after the parse_html document
processor is used.

11. Click Next and the Document Processor: document_converter window

appears.
12. (Optional) Change any of the settings in this window. For more
information, see Section 2.13.7 The Document Processor:
document_converter Window on page 96.

13. Click Finish and the selected document processor appears in the

Document Processors pane.


14. Click Add and select export_to_files in the Add Document Processor

window that appears.

15. Click Next and the Document Processor: export_to_files window

appears.

16. (Optional) Add a field name such as body to fields.

Note: If you add one field, only the specified field is included. In this
example, the body field was selected in the Document Processor:
document_converter window.

17. (Optional) Make any other changes to the fields in the Document
Processor: export_to_files window. For more information, see
Section 2.13.9 The Document Processor: export_to_files Window on
page 100.
18. Click Finish and the document processor that you added appears in the
Document Processors pane.

19. Repeat Step 14 through Step 18 until you have added all of the document
processors required. For example, if you want to add labels to enable
faceted search, see Section 2.13.4 The Document Processor:
content_categorization Wizard on page 78.
Note: If you add any additional document processors, be sure to move them
above the export_to_files document processor in the Document Processors
pane.

20. Click Edit to make any changes to your fields.


21. Click Apply Changes.
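
The ordering rule in the Note above can be pictured with a small Python sketch. This is only an illustration of why a processor that exports documents belongs at the end of the list: the function bodies, the add_label step, and the dictionary-based document are assumptions made for this example and are not the product's implementation or API.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Processor = Callable[[Document], Document]

# Stands in for the files that an export step would write (illustration only).
exported: List[Document] = []


def document_converter(doc: Document) -> Document:
    # Hypothetical stand-in: extract plain text into a "body" field.
    return {**doc, "body": doc["raw"].strip()}


def add_label(doc: Document) -> Document:
    # Hypothetical stand-in for a processor that adds a label field.
    label = "report" if "report" in doc["body"] else "other"
    return {**doc, "category": label}


def export_to_files(doc: Document) -> Document:
    # Record a snapshot of the document exactly as it looks at this point.
    exported.append(dict(doc))
    return doc


def run_pipeline(doc: Document, processors: List[Processor]) -> Document:
    # Apply the processors one after another, in list order.
    for processor in processors:
        doc = processor(doc)
    return doc


if __name__ == "__main__":
    run_pipeline({"raw": "  annual report  "},
                 [document_converter, add_label, export_to_files])
    print(exported[-1])
```

If export_to_files ran before add_label in this sketch, the exported snapshot would not contain the category field, because the field would not exist yet when the snapshot is taken.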


4.5 A Sample Configuration That Uses the Feed Crawler

This configuration uses several SAS Information Retrieval Studio components
and processes. Make sure that the following components are installed:
- SAS Information Retrieval Studio
- SAS Search and Indexing
- Optional application: A category, concepts, or contextual extraction project
developed in SAS Content Categorization Studio and loaded on SAS Content
Categorization Server.
Note: It is necessary to choose an HTML document processor for this
configuration.

When you set up the feed crawler, you can choose to return either summaries
or full-length texts. If the feed collects summaries, you can enable the feed
crawler to follow the links contained in the summaries to the full text of each
article. The steps below enable this capability.
To set up the feed crawler, complete the following steps:

1. Select Feed Crawler --> Configuration.

2. Click Add and the Add Feed window appears.

3. Access your Web browser and locate the Web page with the orange box that
symbolizes an RSS feed. For example, http://support.sas.com/community/rss/.

4. Click the icon located to the left of the feed that you want. For example,
choose Media Coverage.

5. The feed page appears.

6. Copy the feed URL from the URL field in the browser. Paste this URL
into the Feed URL field in the Add Feed window. For example, copy
http://www.sas.com/news/mediacoverage/SASRecentMediaCoverage.xml
into the Feed URL field.

7. The RSS feed shown in the example above consists of summaries of news
articles. For this reason, select Yes in the Follow links field in the
Add Feed window.

8. Click OK in the Add Feed window.

9. Select Pipeline Server --> Document Processors and the

Document Processors pane appears.

10. Click Add and the Add Document Processor window appears.

11. Select parse_html. In this example, summaries are collected and the

feed crawler is instructed to follow links to the HTML pages that are
linked to each summary. (See Step 7. on page 187 where Yes is selected
in the Follow Links drop-down menu.)

12. Click Next and the Add Document Processor: parse_html window
appears.

13. (Optional) Make any changes that you choose.


14. Click Finish. The parse_html document processor appears in the

Document Processors pane.


15. Click Apply Changes.
Note: You can also add custom document processors to perform operations on
the input feed text. For example, when the Follow links selection is
enabled in the Add Feed window, documents that contain both a post and a
list of comments or replies are returned. If you want to separate each post
into a separate document, write a site-specific document processor. Use
this document processor instead of parse_html.
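
To make the summaries-versus-links distinction concrete, the following Python sketch reads a small RSS 2.0 document and lists, for each item, the summary and the page that would be fetched when links are followed. It is a rough illustration only: the sample feed, the www.example.com URLs, and the read_feed function are invented for this example, and the sketch does not show how the feed crawler itself is implemented.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed whose items carry summaries plus links to full pages.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <link>http://www.example.com/stories/1.html</link>
      <description>Summary of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/stories/2.html</link>
      <description>Summary of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_feed(feed_xml, follow_links):
    """Return one record per feed item.

    When follow_links is True, each record names the HTML page that a
    crawler would download next. Otherwise only the summary is kept.
    """
    channel = ET.fromstring(feed_xml).find("channel")
    documents = []
    for item in channel.findall("item"):
        documents.append({
            "title": item.findtext("title", default=""),
            "summary": item.findtext("description", default=""),
            "page_to_fetch": item.findtext("link") if follow_links else None,
        })
    return documents


if __name__ == "__main__":
    for doc in read_feed(SAMPLE_FEED, follow_links=True):
        print(doc["title"], "->", doc["page_to_fetch"])
```

The pages named in page_to_fetch are HTML, which is consistent with the requirement above to add an HTML document processor such as parse_html when links are followed.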


5 Configuring the Web Crawler

- Overview of the Web Crawler
- Configuring the Web Crawler
- Run the Web Crawler
- Troubleshoot with the Log File

5.1 Overview of the Web Crawler


The SAS Web Crawler is controlled by SAS Information Retrieval Studio. The
web crawler searches the Internet and returns the documents that it locates,
according to the parameters that you set. You specify the types of files to
return, the Web addresses where the collection process begins, and the scope
of the crawl. You can also specify the user names and passwords that are
necessary to crawl password-protected sites.
The web crawler passes the documents that it collects to the proxy server,
which sends them to the pipeline server. Depending on the processing that the
pipeline server performs, the documents can be sent to an application, a
database, or the indexing server where they can be queried.
After the web crawler collects the maximum number of pages allowed, it stops
running. You can restart the web crawler at any time.
Note: If you plan to crawl blogs, user forums, or other time-sensitive data
such as press releases, use the feed crawler instead.

5.2 Configuring the Web Crawler


5.2.1 Overview of Configuring the Web Crawler
You configure the web crawler in stages, or according to the parameters set up
for each tab in the Configuration pane. Each tab, or set of configurations,
defines a specific aspect of the crawler. This section is set up as a how-to
guide, but it also contains the background information that is necessary to set
the specific parameters for each tab.
Display 5.1 Web Crawler Configuration Pane

Use each of the following sections to configure your web crawler, with one
exception: the Credentials information is necessary only when you choose to
crawl password-protected sites.
After you make all of your changes, click Apply Changes in the Web Crawler
pane. If the web crawler is running when you click this button, the Restart Web
Crawler window appears.
Display 5.2 Restart Web Crawler Window

Click Yes.
If the web crawler is not running, click Start.


5.2.2 Specify the General Settings


You configure the web crawler to specify how the crawl and download
operations work. As you work through each of the steps below, the appropriate
background information is included.
To specify the parameters for the web crawler, complete these steps:
1. Click the General Settings tab in the Web Crawler pane.

2. Click Auto-detect to access the Select an HTTP Proxy window.
a. Select a proxy server. For example, choose MyHTTPProxyServer.
The HTTP proxy server is an intermediary between the crawler and the
Web site. The HTTP proxy server is not the proxy server for SAS
Information Retrieval Studio. The HTTP proxy server evaluates
requests before passing them to the web server.
b. Click OK and this server appears in the HTTP proxy field.

3. Use the controls in the Quota (files) field to change the default setting
of 25. This is the maximum number of files that the web crawler collects.

4. Use the controls in the Quota (megabytes) field to change the default
limit of 1000 megabytes. This limit applies to the combined size of all
of the collected documents.

5. Use the controls in the Number of downloader threads field to change the
total number of threads that can be created. For example, change this
setting to 16. (The default setting is 1.) The more threads you specify,
the faster the download process becomes. However, too many threads can
also overwhelm a site and shut it down.

6. Use the controls in the Sleep interval field to change the number of
seconds that the web crawler rests between page downloads. (The default
setting is 1.) This setting enables the web crawler to be polite. In
other words, a single thread does not overwhelm a site with download
requests. This protection weakens if you use this setting together with
many threads. For example, 100 threads operating on 5-second sleep
intervals could potentially launch 100 requests simultaneously to a site.

7. Use the controls in the Timeout field to change the number of seconds
before the web crawler stops trying to download a page. (The default
setting is 300.)

8. Use the controls in the Maximum number of retries field to change the
number of times that the web crawler tries to download a page before it
stops. (The default setting is 3.)

9. Use the controls in the Retry delay field to change the highest number
of seconds that the web crawler waits before it tries to download a page
again. (The default setting is 300.)

10. To ignore a Web site author's request not to crawl specific portions of
a site, select No in the Respect robots.txt field. (The default setting
is Yes.)

11. To prohibit the web crawler from following links found in JavaScript or
Flash code, select No in the Find links in Javascript and Flash field.
(The default setting is Yes.)

12. To change the order in which links are followed, select Depth first in
the Link traversal order field. (The default setting is Breadth first.)

In breadth-first mode, the crawler searches all of the links in the
point-of-entry page. The crawler then searches all of the links in the
first layer of child pages. The crawler repeats this process for the
second layer of child pages, and so on, until it has crawled all of the
links related to the point-of-entry page. This is a first in, first out
operation.
In depth-first mode, the crawler follows one set of links through all of
its children. The crawler then backtracks to the next child page and
crawls the links of its children, and so on. This process is repeated
until all of the links in a page are crawled. In other words, this
operation drills deep and then backtracks.
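
The difference between the two traversal orders can be sketched in a few lines of Python. This is a generic illustration of breadth-first and depth-first traversal over a made-up link graph, not the web crawler's code: the LINKS dictionary, the page names, and the crawl_order function are assumptions for this example.

```python
from collections import deque

# Hypothetical link graph: each page maps to the links found on it.
LINKS = {
    "entry": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(start, links, depth_first=False):
    """Return the order in which pages are visited."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Breadth first takes the oldest page (first in, first out).
        # Depth first takes the newest page (last in, first out).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        children = links.get(page, [])
        # Reverse for depth first so the first link on a page is followed first.
        for child in (reversed(children) if depth_first else children):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


if __name__ == "__main__":
    print(crawl_order("entry", LINKS))
    print(crawl_order("entry", LINKS, depth_first=True))
```

In this sketch the only difference between the two modes is which end of the frontier the next page is taken from, which is why breadth first is described as a first in, first out operation.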


5.2.3 Specify Entry Points for the Web Crawler


After you specify the general settings for the web crawler, add the entry
points. Entry points are the Web addresses that are used by the crawler to
begin its crawl. Unless you specify otherwise in the Scope pane, the entire
entry point site and all of its links are crawled. For example, if you want to
crawl the SAS Web site, you could enter www.sas.com.
To specify the entry points for the web crawler, complete these steps:
1. Click the Entry Points tab in the Web Crawler pane.


2. Click Add and the Add Entry Point window appears.

3. Enter the Web address for the first site into the URL field.
4. (Optional) To add this address to the Scope pane as an allowed site for
the crawl, leave the default selection Yes in the Add to scope rules
field. If you do not want to add this address to the Scope pane, select No.
Note: If you add any URL patterns to the Scope pane, all of the other URLs
are excluded from the crawl.

5. By default, the Quota (files) field is set to 100000000. Use the controls
in this field to change this number.


6. Click OK and the Web address that you entered appears in the Entry

Points pane.
7. Repeat this process until you have added all of your URLs to the Entry
Points pane.

5.2.4 Specify the Scope of the Crawl


After you specify one or more entry points, you can add a list of permitted and
excluded sites. By default, the Scope pane is empty, which means that all of
the Web addresses on the Internet are allowed. If you specify at least one
permitted site, every other site is excluded. This is true whether you specify
that a site is part of the scope of the crawl within this window, or in the Add
Entry Point window. You can also exclude a segment of a site that you
permitted.
For example, you could use the Scope pane to limit the crawl to the SAS
publications pages and exclude the new books pages from the crawl.
Continuing with Section 5.2.3 Specify Entry Points for the Web Crawler on
page 196, you limit the crawl to one section of the SAS Web site. Within the
publications section, you exclude any new books.

To specify the scope of the web crawl, complete these steps:


1. Click the Scope tab in the Web Crawler pane.

2. Click Add and the Add Scope Rule window appears.

3. Enter a pattern for a URL, or a regular expression, into the URL Pattern
field. Both enable the web crawler to match patterns. For example, type
https://support.sas.com/pubscat/complete.jsp.
Note: For information about how to write regular expressions, see Section A.1
Regular Expressions on page 353.

4. Leave the default setting Prefix in the Match type field, or select
Regular Expression. This setting tells the web crawler how to use the
characters entered in the URL Pattern field. A prefix match is one that
matches against the beginning of the URL.

5. Leave the default setting Allow in the Action field. (Select Exclude if
you do not want the crawler to download pages from this URL.)
This URL appears in the Scope pane.

6. Click Add and a new Add Scope Rule window appears.


7. Continuing with this example, type
https://support.sas.com/pubscat/newbooks.jsp into the URL Pattern field.

8. Leave the default selection Prefix.

9. Select Exclude in the Action field.

10. Click OK and see the complete list of included and excluded URLs. In

this example, the web crawler searches only the publications pages of
the SAS Web site. It does not search the pages that list recent books.
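
The way that scope rules combine can be sketched in Python. This is a minimal illustration under stated assumptions, not the crawler's actual logic: the ScopeRule class and in_scope function are invented here, the sketch assumes that an Exclude rule always wins over an Allow rule, and it assumes that a regular expression may match anywhere in the URL.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ScopeRule:
    pattern: str
    match_type: str = "prefix"   # "prefix" or "regex"
    action: str = "allow"        # "allow" or "exclude"

    def matches(self, url: str) -> bool:
        if self.match_type == "prefix":
            # A prefix match compares against the beginning of the URL.
            return url.startswith(self.pattern)
        # Assumption: a regular expression may match anywhere in the URL.
        return re.search(self.pattern, url) is not None


def in_scope(url: str, rules: List[ScopeRule]) -> bool:
    allows = [r for r in rules if r.action == "allow"]
    excludes = [r for r in rules if r.action == "exclude"]
    # Assumption: an excluded pattern wins over an allowed pattern.
    if any(r.matches(url) for r in excludes):
        return False
    # With no Allow rules, every address is allowed. Otherwise, only
    # addresses that match at least one Allow rule are in scope.
    return not allows or any(r.matches(url) for r in allows)


RULES = [
    ScopeRule("https://support.sas.com/pubscat/complete.jsp"),
    ScopeRule("https://support.sas.com/pubscat/newbooks.jsp", action="exclude"),
]

if __name__ == "__main__":
    for url in ("https://support.sas.com/pubscat/complete.jsp?page=2",
                "https://support.sas.com/pubscat/newbooks.jsp",
                "http://www.sas.com/"):
        print(url, in_scope(url, RULES))
```

With an empty rule list, in_scope returns True for every URL, which mirrors the empty default Scope pane.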


5.2.5 Exclude Certain Types of Files


After you specify the scope of the crawl, you might want to limit the types of
files that are returned by the web crawler. For example, you could exclude
files that contain programs or images.
To specify the file types that are excluded from a crawl, complete these steps:
1. Click the Filename Extensions tab in the Web Crawler pane.

2. Click the Add button to access the Add Filename Extension window.

3. Enter the extension of the file type that you want to exclude into the
Extension field.
Note: The file type extensions are case-sensitive.

4. Select Exclude to prevent the crawler from gathering this type of file.
(The default setting is Allow.) If you allow at least one type of file,
only the file types with the Allow specification are returned.
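
The following Python sketch shows one way to read the Allow and Exclude behavior described above, including the case-sensitive comparison. It is an assumption-laden illustration rather than the product's logic: the function names, the rule dictionaries, and the www.example.com URLs are invented, and how the crawler treats URLs that have no extension is not stated in this guide.

```python
import posixpath
from urllib.parse import urlsplit


def extension_of(url):
    """Return the filename extension of a URL path, without the dot."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lstrip(".")


def is_collected(url, rules):
    """rules maps an extension to "allow" or "exclude" (hypothetical format)."""
    # Dictionary lookup is case-sensitive, so "EXE" and "exe" are different.
    action = rules.get(extension_of(url))
    if action == "exclude":
        return False
    if "allow" in rules.values():
        # Once any type is allowed, only allowed types are returned.
        return action == "allow"
    return True


if __name__ == "__main__":
    exclude_programs = {"exe": "exclude"}
    print(is_collected("http://www.example.com/setup.exe", exclude_programs))
    print(is_collected("http://www.example.com/setup.EXE", exclude_programs))
    only_pdf = {"pdf": "allow"}
    print(is_collected("http://www.example.com/report.pdf", only_pdf))
    print(is_collected("http://www.example.com/index.html", only_pdf))
```

The second call shows why case matters: under a case-sensitive comparison, a rule for exe does not cover a file that ends in EXE, so both spellings would need their own entries.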

5.2.6 Specify Access Information for Password-Protected Sites


Some sites are password-protected. To crawl these sites, you provide the web
crawler with the information that it requires to download these pages.
To specify the sites and the user and password information that the web
crawler uses to download pages, complete these steps:
1. Click the Credentials tab in the Web Crawler pane.


2. Click the Add button to access the Add Credential window.

3. Enter the URL followed by a colon (:) and the port number for the host

into the Site field. For example, enter www.medscape.com:80.


4. Enter the name of a user, who has access to this site, into the Username

field. For example, enter UserMD.


5. Enter the password for this user into the Password field. For example,

enter mdpassword. When you enter this password, the characters that
comprise the password are represented by the asterisk symbol (*) in the
Credentials pane.
6. Click OK and this site with its credentials is added to the Credentials

pane.
7. Click Apply Changes in the Web Crawler pane.


5.3 Run the Web Crawler


After you configure the web crawler, you can run it. You should configure all
of the components that you plan to use before you run the web crawler. Click
Apply Changes after you modify any of the default settings for these
components.
To start, restart, and stop the web crawler, complete any of these actions:
- Click Start in the Web Crawler pane.
- Click Stop to stop the crawl.
- (Optional) If you make any changes to the configuration while the web
crawler is running, click Apply Changes. The Restart Web Crawler
window appears. Click Yes. If the web crawler is not running, click Start.
- Click Revert to return to the last applied settings.
The appropriate message appears in the Status pane after any of these
operations.

5.4 Troubleshoot with the Log File


The Log pane enables you to see a history of the operations performed by the
web crawler. Use the contents of the Log pane when you require customer
support.
To access and use the Log pane, complete these steps:
1. Click the Log tab in the Web Crawler pane.

2. (Optional) Use the controls in the Number of lines field to change the
default setting of 20. This field specifies the maximum number of
timestamped lines that are displayed for the searchable log file in
this pane.

3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field. For

example, enter sas.


5. Click Find to locate all instances of the entered term in this pane.
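
The Retrieve and Find operations above resemble a common pattern: keep the last N lines of a log and then search them. The Python sketch below illustrates that pattern only. The log lines, the function names, and the choice of a case-insensitive search are assumptions for this example and do not describe the Log pane's implementation.

```python
from collections import deque


def last_lines(lines, count=20):
    """Keep only the most recent `count` lines."""
    return list(deque(lines, maxlen=count))


def find_term(lines, term):
    """Return (position, line) pairs for lines that contain the term.

    Assumption: the comparison ignores case.
    """
    needle = term.lower()
    return [(i, line) for i, line in enumerate(lines, 1)
            if needle in line.lower()]


# Made-up log content for illustration.
SAMPLE_LOG = [
    "2011-01-01 10:00:00 crawl started",
    "2011-01-01 10:00:01 downloaded http://www.sas.com/",
    "2011-01-01 10:00:03 downloaded http://www.example.com/",
    "2011-01-01 10:00:05 quota reached, crawl stopped",
]

if __name__ == "__main__":
    shown = last_lines(SAMPLE_LOG, count=3)
    for position, line in find_term(shown, "sas"):
        print(position, line)
```

Because the search runs only over the retrieved lines, a term that appears earlier in the log is not found until the number of lines is increased.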

6 Configuring the File Crawler

- Overview of the File Crawler
- Configure the File Crawler
- Run the File Crawler
- Troubleshoot with the Log File

6.1 Overview of the File Crawler


The file crawler crawls your organization's file system and returns documents,
according to the parameters that you set. These specifications include the
paths to crawl, file types to return, and whether the crawl is continuous. They
also include the oldest date and maximum file size that can be returned.
The file crawler passes the documents to the proxy server, which passes them
to the pipeline server. Depending on the processing that the pipeline server
performs, the files can be sent to an application, a database, or the indexing
server where they can be queried.

6.2 Configure the File Crawler


6.2.1 Overview of Configuring the File Crawler
You configure the file crawler using the four tabs in the Configuration pane.
Each stage, or set of configurations, defines a specific aspect of the crawler.
This section is set up as a how-to guide, but it also contains the background
information that is necessary to set the specific parameters for each stage.

Display 6.1 File Crawler Configuration Pane

Use each of the following sections to configure your file crawler. After each
change, click Apply Changes in the File Crawler pane. If the file crawler is
running when you click this button, the Restart File Crawler window appears.

Click Yes.

6.2.2 Specify the General Settings


You configure the file crawler to specify how the crawl and download
operations work for the files that the file crawler collects. As you work
through each of the steps below, the appropriate background information is
included.
To specify the parameters for the file crawler, complete these steps:


1. Click the General Settings tab in the File Crawler pane.

2. Use the controls in the Maximum file size field to change the default
setting of 10. Increasing or decreasing this number affects the size of
the documents collected. For example, you might want to gather white
papers but not books.

3. In the Oldest date field, use the calendar to select the first date for
the crawl. Documents that have creation dates before the date specified
in the Oldest date field are not collected by the file crawler.

4. To crawl continuously, select Yes in the Crawl continuously field. (The
default setting is No.) Choose to continuously crawl your file system
only when it is constantly updated.

5. In the Encapsulate XML files field, the default setting is No. With this
setting, only top-level XML tags are turned into fields. If you select
Yes, you can exert more control over this process. For example, you can
turn nested tags into fields. In this case, also select the parse_xml
document processor.
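
The size and date limits above act as simple filters on each candidate file. The Python sketch below illustrates that idea with a hypothetical is_collected function. It assumes that the Maximum file size value is expressed in megabytes, which this section does not state, and it takes the file's size and date as plain arguments rather than reading them from a file system.

```python
from datetime import date


def is_collected(size_bytes, file_date, max_size_mb=10, oldest_date=None):
    """Decide whether a file passes the size and date limits.

    Assumption: max_size_mb is in megabytes (the guide gives only the number).
    """
    if size_bytes > max_size_mb * 1024 * 1024:
        return False  # larger than the maximum file size
    if oldest_date is not None and file_date < oldest_date:
        return False  # dated before the oldest date
    return True


if __name__ == "__main__":
    cutoff = date(2010, 1, 1)
    # A 2 MB white paper dated after the cutoff passes both limits.
    print(is_collected(2 * 1024 * 1024, date(2011, 6, 1), oldest_date=cutoff))
    # A 50 MB book exceeds the default limit of 10.
    print(is_collected(50 * 1024 * 1024, date(2011, 6, 1), oldest_date=cutoff))
    # A small file dated before the cutoff is skipped.
    print(is_collected(1024, date(2009, 12, 31), oldest_date=cutoff))
```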

6.2.3 Specify the Paths to Crawl


After you specify the general settings for the file crawler, you can enter a list
of paths to crawl. When you specify a list of paths to crawl, only those paths
are crawled. These paths should be absolute instead of relative. For
Windows fileshares, use Universal Naming Convention (UNC) names instead
of local paths.
To specify the paths for the file crawl, complete these steps:
1. Click the Paths tab.

2. Click Add and the Add Path window appears.

3. Enter an absolute path into the Path field. If you specify a Windows

fileshare, enter a name that is written according to UNC conventions.


4. Click OK and the path appears in the Paths pane.
5. Repeat this process until you have added all of the paths that you want
crawled.

6.2.4 Specify the Paths to Exclude


After you specify the general settings, you can enter a list of paths that should
not be crawled. This pane enables you to specify limits within the crawl that
you set in the Paths pane. For example, choose to exclude the Trash folder on
your local computer. Or choose one or more subdirectories to exclude from
the crawl. These paths should be absolute instead of relative. For Windows
fileshares, use Universal Naming Convention (UNC) names instead of local
paths.
To specify the paths that the file crawler does not crawl, complete these steps:
1. Click the Paths to Exclude tab.

2. Click Add and the Add Path to Exclude window appears.

3. Enter an absolute path into the Path field. If you specify a Windows

fileshare, enter a name that is written according to UNC conventions.


4. Click OK and the path appears in the Paths to Exclude pane.

5. Repeat this process until you have added all of the paths that you want
to exclude.
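
The interaction between the Paths pane and the Paths to Exclude pane, and the difference between UNC names and local paths, can be sketched with Python's pathlib module. This is an illustration only: the helper functions and the Trash subfolder are assumptions for this example, and the sketch assumes that an excluded path takes priority over a permitted one.

```python
from pathlib import PureWindowsPath


def is_unc(path):
    """True for UNC names such as \\\\Server\\Share\\Folder."""
    return PureWindowsPath(path).drive.startswith("\\\\")


def is_under(path, root):
    """True if path is root itself or lies somewhere below it."""
    p, r = PureWindowsPath(path), PureWindowsPath(root)
    return p == r or r in p.parents


def is_crawled(path, paths, excluded):
    if not PureWindowsPath(path).is_absolute():
        return False  # relative paths are not used
    # Assumption: an excluded path wins over a permitted path.
    if any(is_under(path, e) for e in excluded):
        return False
    return any(is_under(path, r) for r in paths)


PATHS = [r"\\MyComputer\Documents\FolderA"]
EXCLUDED = [r"\\MyComputer\Documents\FolderA\Trash"]  # hypothetical subfolder

if __name__ == "__main__":
    print(is_unc(PATHS[0]))
    print(is_crawled(r"\\MyComputer\Documents\FolderA\report.doc", PATHS, EXCLUDED))
    print(is_crawled(r"\\MyComputer\Documents\FolderA\Trash\old.doc", PATHS, EXCLUDED))
```

PureWindowsPath is used so that the sketch behaves the same way on any operating system. It compares Windows paths without regard to case.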

6.2.5 Specify the Types of Files to Return


You can choose to limit the types of files that the crawl returns. If you do not
specify any file types, all of the files that the file crawler locates are sent
to the proxy server.
To specify the file types for the file crawl, complete these steps:
1. Click the Filename Extensions tab.

2. Click Add and the Add Filename Extension window appears.

3. Enter a file extension into the Extension field. For example, enter txt
or png. If you specify any file extension, only those file types are
returned. No other files are collected.

4. To exclude this file type, select Exclude in the Action field. (The
default setting is Allow.)

5. Click OK and the extension appears in the Filename Extensions pane.

6. Repeat Step 2 through Step 5 until you have added all of the file
extension types that you want returned.

6.3 Run the File Crawler


After you configure the file crawler, you can run it. You should also configure
all of the components that you plan to use before you run the file crawler.
Click Apply Changes after you modify the default settings for any of these
components.
To start, restart, and stop the file crawler, complete any of these steps:
- Click Start in the File Crawler pane.
- (Optional) If you make any changes to the configuration while the file
crawler is running, click Apply Changes. The Restart File Crawler
window appears. Click Yes. If the file crawler is not running, click Start.
- (Optional) Click Revert to return to the last applied settings.
- To stop the crawl, click Stop.
The appropriate message appears in the Status pane after any of these
operations.

6.4 Troubleshoot with the Log File


This log pane enables you to see a history of the operations performed by the
file crawler. Use the contents of the Log pane when you require customer
support.
To access and use this Log pane, complete these steps:


1. Click the Log tab in the File Crawler pane.

2. (Optional) Change the default setting of 20 in the Number of lines
field. This field specifies the maximum number of timestamped lines that
are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For
example, enter filename.

5. Click Find to locate all instances of the entered term in this pane.
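
The Retrieve and Find operations behave roughly like a tail followed by a text search. The sketch below only imitates that behavior on an in-memory list. The function names, the sample log lines, and the ** markers that stand in for highlighting are all assumptions.

```python
from collections import deque


def retrieve(lines, number_of_lines=20):
    """Keep only the last number_of_lines entries, like the Retrieve button."""
    return list(deque(lines, maxlen=number_of_lines))


def find(lines, term):
    """Return the lines that contain term, with each occurrence marked."""
    return [line.replace(term, f"**{term}**") for line in lines if term in line]


# Hypothetical log content; the real log format is not shown in this guide.
log = [f"2010-09-17 12:{i:02d} opened filename{i}.txt" for i in range(30)]
shown = retrieve(log)
print(len(shown))
print(find(shown, "filename")[0])
```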


Configuring the Feed Crawler


- Overview of the Feed Crawler
- Configure the Feed Crawler
- Run the Feed Crawler
- Troubleshoot with the Log File

7.1 Overview of the Feed Crawler


The feed crawler crawls the Internet for frequently updated content and returns
these documents according to the parameters that you set. Like the web
crawler, the feed crawler is used only for Web content. Unlike the web
crawler, the feed crawler seeks newly updated information in the form of a
Web feed. You specify the Web address where the feed crawler begins its
crawl, and you determine whether it follows links and crawls continuously.
The feed crawler passes the documents to the proxy server, which passes them
to the pipeline server. Depending on the processing that the pipeline server
performs, the documents can be sent to an application, to a database, or to
the indexing server, where they can be queried. For example, the feed crawler
is used to gather documents that express sentiment from blogs and customer
reviews for SAS Sentiment Analysis Workbench.

7.2 Configure the Feed Crawler


7.2.1 Overview of Configuring the Feed Crawler
You configure the feed crawler in stages, or according to the parameters set up
for each tab in the Configuration pane. Each tab, or set of configurations,
defines a specific aspect of the crawler. This section is set up as a how-to
guide, but it also contains the background information that is necessary to set
the specific parameters for each stage.
Display 7.1 Feed Crawler Configuration Pane

Use each of the following sections to configure your feed crawler.


After you make all of your changes, click Apply Changes in the Feed Crawler
pane. If the feed crawler is running when you click this button, the Restart
Feed Crawler window appears.

Click Yes.
If the feed crawler is not running, click Start.

7.2.2 Specify the General Settings


You configure the feed crawler to specify the location of the feed. As you
work through each of the steps below, the appropriate background information
for each setting is included.


To specify the parameters for the feed crawler, complete these steps:
1. Select Configuration --> General Settings in the Feed Crawler pane.

2. Click Auto-detect to access the Select an HTTP Proxy window.

   a. Select a proxy server. For example, choose MyHTTPProxyServer.

      The HTTP proxy server is a server that is an intermediary between
      the crawler and the Web site. The HTTP proxy server is not the
      proxy server for SAS Information Retrieval Studio. The HTTP
      proxy server evaluates requests before passing them to the web
      server.

   b. Click OK and this server appears in the HTTP proxy field.

3. (Optional) Select No in the Crawl continuously field. The default
setting is Yes. The crawler seeks updated items posted to the feed over
time, unless this operation is prohibited.

4. Change the default setting of 600 in the Recrawl interval field if you
want a different number of seconds.

5. (Optional) Enter another name into the User agent field if you choose
to change the name of this crawler.
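
The interaction between the Crawl continuously and Recrawl interval settings can be pictured as a polling loop. The sketch below is a simplified model, not the crawler's implementation. The function crawl_feed, the fetch callback, and the max_passes and sleep parameters (added so that the loop can be exercised without waiting) are all hypothetical.

```python
import time


def crawl_feed(fetch, continuous=True, recrawl_interval=600,
               max_passes=None, sleep=time.sleep):
    """Call fetch() once, or repeatedly with a pause between passes."""
    passes = 0
    while True:
        fetch()
        passes += 1
        if not continuous or (max_passes is not None and passes >= max_passes):
            return passes
        # Wait recrawl_interval seconds before checking the feed again.
        sleep(recrawl_interval)


# Exercise the loop without real waiting by recording the requested pauses.
waits = []
print(crawl_feed(lambda: None, max_passes=3, sleep=waits.append))
print(waits)
```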

7.2.3 Specify the Feeds


The feed crawler collects postings, whether full texts or summaries, from both
RSS and Atom feeds. For more information, see Section 2.14.5 The Edit
Entry Point Window on page 125.
You specify the feed URLs and whether links are followed in the Feeds tab.
To perform these operations, complete these steps:
1. Select Configuration --> Feeds in the Feed Crawler pane.

2. Click Add to access the Add Feed window.

   a. Paste an address for a feed into the Feed URL field. For example,
      enter http://www.sas.com/success/SASRecentSuccess.xml.

   b. (Optional) Select No in the Follow links field. The default setting
      is Yes. This setting specifies whether links from the Web
      address set in the Feed URL field are crawled. If you select Yes,
      these links might lead to other feeds.

      There are two common types of feeds: full-content feeds and
      summary-only feeds. In a full-content feed, all of the information
      that you seek is present in the feed itself. In a summary-only feed,
      only a brief description of the content is passed. In this case, the
      link is followed, like a traditional Web page link, to locate the
      rest of the content.

      If you want to crawl summary-only feeds, select Yes in the
      Follow links field. Also select the parse_html document processor
      in the pipeline server. However, the follow links operation does not
      perform recursively like the web crawler.

   c. Click OK and this information appears in the Feeds pane.

3. Enter the Web address that you want to crawl into the Feed URL field.
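
The difference between a full-content feed and a summary-only feed, and the single-hop nature of the Follow links setting, can be illustrated with a small RSS 2.0 example. This is a sketch under stated assumptions: the sample feed, the function collect_items, the fetch_page callback, and the field names in the resulting dictionaries are hypothetical, and Atom feeds (which use namespaced entry elements) are omitted for brevity.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Full post</title><link>http://example.com/a</link>
        <description>Complete text of the posting.</description></item>
  <item><title>Teaser</title><link>http://example.com/b</link>
        <description>Read more...</description></item>
</channel></rss>"""


def collect_items(feed_xml, follow_links, fetch_page):
    """Return one document per feed item; follow each item's link at most once."""
    docs = []
    for item in ET.fromstring(feed_xml).iter("item"):
        doc = {"title": item.findtext("title"),
               "url": item.findtext("link"),
               "body": item.findtext("description")}
        if follow_links:
            # One hop only: links found on the fetched page are not followed.
            doc["raw"] = fetch_page(doc["url"])
        docs.append(doc)
    return docs


# A stand-in for an HTTP fetch so that the sketch runs without a network.
pages = collect_items(SAMPLE_RSS, True, lambda url: f"<html>{url}</html>")
print([d["title"] for d in pages])
```

When links are followed, the fetched page is HTML, which is why an HTML-stripping processor is needed later in the pipeline.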


7.3 Run the Feed Crawler


After you configure the feed crawler, you can run it. You should configure all
of the components that you plan to use before you run the feed crawler. Click
Apply Changes after you modify any of the default settings for these
components.
To start, restart, and stop the feed crawler, complete any of these steps:
- Click Start in the Feed Crawler pane.

  The appropriate message appears in the Status pane after any of these
  operations.

- (Optional) If you make any changes to the configuration while the feed
  crawler is running, click Apply Changes. The Restart Feed Crawler
  window appears. Click Yes. If the feed crawler is not running, click
  Start.

- (Optional) Click Revert to return to the last applied settings.

- To stop the crawl, click Stop.

7.4 Troubleshoot with the Log File


The Log pane enables you to see a history of the operations performed by the
feed crawler. Use the contents of the Log pane when you require customer
support.
To access and use this Log pane, complete these steps:
1. Click the Log tab in the Feed Crawler pane.

2. (Optional) Change the default setting of 20 in the Number of lines
field. This field specifies the maximum number of timestamped lines that
are displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.

4. (Optional) Enter a search term into the Text to highlight field. For
example, enter close.

5. Click Find to locate all instances of the entered term in this pane.


Configuring the Proxy Server


- Overview of the Proxy Server
- View the Status of the Proxy Server and Input Files
- Configure the Proxy Server
- Run the Proxy Server
- Troubleshoot with the Log File

8.1 Overview of the Proxy Server


The proxy server sends the documents that it receives from one or more
crawlers to the pipeline server for processing. The proxy server can also pass
the same set of documents to a pipeline server that was set up with a second
installation of SAS Information Retrieval Studio.
When the proxy server passes the documents that it receives, it passes the
same set to each server. This functionality makes it possible for you
to perform different operations on the same set of documents on the respective
servers. Because the proxy server is an intermediary, you configure only those
specifications that are necessary for it to pass documents to another server.
Use the proxy server for the following purposes:
- Pause the flow of documents if you want to perform maintenance on
  one or more of the components in your application.

- Send the documents to multiple pipeline servers for the following
  reasons:

  - Create mirrors. These are pipeline servers that perform identical
    processing operations on the input documents. Multiple servers are
    used for backup purposes in case of hardware failure.

  - Use the same set of documents for multiple purposes. In this case,
    send the input documents to pipeline servers that are configured
    differently. For example, send the documents to one pipeline server
    for indexing and searching. Send this same set of documents to a
    second pipeline server that analyzes the sentiment located in them.

You can find information about the number of documents at different stages in
this server and see a log file.
For all of these reasons, the proxy server is an integral part of any customized
configuration of SAS Information Retrieval Studio.
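
The two roles described above (pausing the flow of documents and sending the same documents to several pipeline servers) can be modeled in a few lines. The class ProxySketch below is a hypothetical in-memory model, not the product's implementation. Plain Python callables stand in for the pipeline servers.

```python
from collections import deque


class ProxySketch:
    def __init__(self, pipelines):
        self.pipelines = pipelines  # callables standing in for pipeline servers
        self.queue = deque()
        self.paused = False

    def receive(self, document):
        """Accept a document from a crawler and forward it if not paused."""
        self.queue.append(document)
        self.flush()

    def flush(self):
        while self.queue and not self.paused:
            document = self.queue.popleft()
            for send in self.pipelines:
                send(document)  # every pipeline gets the same document

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self.flush()


indexing, sentiment = [], []
proxy = ProxySketch([indexing.append, sentiment.append])
proxy.receive("doc-1")
proxy.pause()
proxy.receive("doc-2")  # held in the queue while paused
proxy.resume()
print(indexing == sentiment)
```

Holding documents in a queue while paused is what lets a downstream component be taken offline for maintenance without losing input in this model.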

8.2 View the Status of the Proxy Server and Input Files

Use the Status pane to see whether the proxy server is running and where the
input files are in the various processing stages. This pane provides view-only
displays that show the current statistics for the proxy server. You can also use
the Status pane to troubleshoot any backups in the input process.
By default, the proxy server is running. If you add any servers in the
configuration pane, click Apply Changes. You can then see the statistics for
these operations in the Status pane.
To see whether the proxy server is running and to see the statistics for this
server, complete the following steps:


1. Click Status in the proxy server pane.

2. See the number of documents that were input to the proxy server in the
Documents received field. For example, 25 documents were received.
In this example, the Quota (files) setting was set in the General
Settings pane of the Configuration pane at 25 for the web crawler. This
is the only crawler in this configuration. This crawler has returned the
maximum number of allowed documents.

3. See the number of documents that the proxy server sent to the pipeline
server in the Documents processed field. For example, see 25.

4. See the number of documents that are waiting to be received by the
proxy server in the Documents queued field. For example, see 0.

5. See the date and time that the last document entered this server in the
Last documents received field. For example, 2010-09-17 is the date
and 12:46 is the time.

6. See the date and time that the last document was processed by this
server in the Last documents processed field. See the example in Step 5.

7. Click Refresh if the maximum number of documents specified has not
been returned.
If you see a discrepancy, you can use the Log pane to see the connections and
errors that might be the cause. For more information, see Section 8.5
Troubleshoot with the Log File on page 230.


8.3 Configure the Proxy Server


When you configure the proxy server, you can either add a new pipeline server
or you can change the settings for the default pipeline server. This server
appears by default in the Configuration pane under the Host heading. Use the
Configuration pane to add pipeline servers to the proxy server or to change the
Host, Port, or Status settings.
You can also add multiple pipeline servers. Choose to add these servers for
backup purposes or to specify different types of processes for input
documents.
Note: The same input documents are passed to each pipeline server.

To add a new pipeline server, complete the following steps:


1. Click Configuration in the Proxy Server pane.


2. Click Add and the Back-end Server window appears. Use this window
to add another pipeline server to the customized application that you are
building.

3. Enter the name of the machine into the Host field. For example, enter
Mirror1.

4. If the default setting of 9004 in the Port field is incorrect, change
it. For example, change the port to 9100.

5. Click OK and the new server is added to the Configuration pane. (The
new server is automatically started and its status is running.)

6. (Optional) Repeat Step 2 through Step 5 to add more servers to your
pipeline.


7. Click Apply Changes in the Proxy Server pane.
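
The Host and Port values entered in the Back-end Server window can be thought of as a short list of records. The sketch below is illustrative only: the BackEndServer class and its validation rules are hypothetical, and the host name localhost for the default pipeline server is an assumption. Only the default port 9004 and the Mirror1 and 9100 example values come from the steps above.

```python
from dataclasses import dataclass

DEFAULT_PORT = 9004  # default value of the Port field


@dataclass(frozen=True)
class BackEndServer:
    host: str
    port: int = DEFAULT_PORT

    def __post_init__(self):
        # Basic sanity checks; the product's own validation is not documented.
        if not self.host:
            raise ValueError("Host must not be empty")
        if not 1 <= self.port <= 65535:
            raise ValueError(f"Port out of range: {self.port}")


# "localhost" is an assumed name for the default pipeline server.
servers = [BackEndServer("localhost"), BackEndServer("Mirror1", 9100)]
for server in servers:
    print(f"{server.host}:{server.port}")
```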

8.4 Run the Proxy Server


If you make any configuration changes to the proxy server, or to another
component of SAS Information Retrieval Studio, you can restart the proxy
server. (By default, the proxy server is always running.) Click Apply Changes
after you modify the default settings for any of these components.
To start, stop, pause, resume, or apply changes to the proxy server, complete
any of these steps:

- If you have stopped the proxy server for any reason, click Start in the
  Proxy Server pane.

  The appropriate message appears in the Status pane after any of these
  operations.

- (Optional) Click Stop and the proxy server ceases its running process.

- (Optional) Click Pause to interrupt the running process.

- If you have stopped or paused the proxy server, click Resume.

- If you make any changes to the configuration while the proxy server is
  running, click Apply Changes.

8.5 Troubleshoot with the Log File


The log pane enables you to see the history of the operations performed by the
proxy server. Use the contents of the Log pane when you require customer
support.
To see this Log pane, complete these steps:


1. Click Log in the Proxy Server pane.

2. Use the default selection, Connections, or select Errors.

3. Change the default setting of 20 in the Number of lines field. This
field specifies the number of lines that are displayed for the searchable
log file in this pane.

4. Click Retrieve to see the specified number of lines in the log file pane
below.

5. Enter the text that you want to locate in the Text to highlight field. For
example, enter 10.

6. Click Find to see these terms, highlighted in bold font, in the log file
pane. For example, see each instance of 10 highlighted in the dates and
find it in the queue.


Configuring the Pipeline Server


- Overview of the Pipeline Server
- Configuring the Pipeline Server
- See Input Documents with the Document Inspector
- Add a New Field to Input Documents
- Match Categories, Concepts, and Facts
- Export Categories and Concept Matches
- Advanced Installation
- Run the Pipeline Server
- Troubleshoot with the Log File

9.1 Overview of the Pipeline Server


9.1.1 Processing Documents and Related SAS Applications

9.1.1.A How Document Processing and Export Operations Work Together

The pipeline server enables you to select document processors that act on
input documents to prepare these texts to be handled by another server or
application. These processes are known as normalization, analysis, and export
operations.
For example, normalization includes the process of stripping Web documents
of their HTML markup tags and using SAS Document Conversion on
documents collected by the file crawler. You can then use the SAS Content
Categorization Studio to analyze input documents. Finally, export documents
to SAS programs such as SAS Sentiment Analysis Workbench and SAS Text
Miner.

Note: Before you can analyze or export your documents, make sure that any
required software is installed and running.

For more information about installing these software applications, see SAS
Information Retrieval Studio: Installation Guide or the installation guide for
each SAS application that you want to use.

9.1.1.B Process Documents


The pipeline server performs many operations that are integral to document
handling and processing. These normalization, analysis, and export operations
include category matching, concept extraction, contextual extraction,
document conversion, and exporting documents to other applications.
The analysis operations of the SAS Content Categorization Studio document
processors are also used for the labels associated with facetted search. These
labels, or captions, are specific to the categorization, concepts extraction, or
contextual extraction matching technologies in SAS Content Categorization
Studio. When you choose to create labels, a series of windows enables you to
track an input document field. You can track this field from the crawler
through the pipeline and indexing servers and into the query web server. You
can see the results in input documents when a query is entered in the search
page.
Make sure that the following programs are installed and running before you
try to process documents using them:
SAS Content Categorization Server
    identifies categories, concepts, and facts from SAS Content
    Categorization Studio and SAS Contextual Extraction Studio. Make
    sure that the taxonomies that you want to apply to the documents input
    to SAS Information Retrieval Studio are uploaded to SAS Content
    Categorization Server before you configure the pipeline server.

SAS Document Conversion
    extracts plain text from documents such as Microsoft Word and PDF
    files.

9.1.1.C Export Processed Documents


Export the documents that were collected by a crawler to the following
programs:
SAS Content Categorization Studio
    uses input files for training and testing purposes.

SAS Sentiment Analysis Workbench
    analyzes the sentiment in input documents.

SAS Text Miner
    identifies topics and themes in input documents.

9.2 Configuring the Pipeline Server


9.2.1 Overview of the Document Processors
Input documents are defined as one chunk of text returned as the result of a
crawl. This text can be a news article, a file received from the file crawler, or a
PDF document. Each of these documents is processed using the operations
that you specify before the document is passed to another server. By default, if
SAS Search and Indexing is installed, all input documents go to the indexing
server. This is true even if the documents are also sent to other applications.
Choose your document processors according to the operations that you want to
perform:
First, consider the crawlers that you defined and the document types that they
are configured to return in order to normalize the input text. For example, the
web crawler can return PDF and Microsoft Word documents in addition to
HTML documents. For this reason, choose a processor to strip the HTML tags
from the text such as parse_html or heuristic_parse_html. You can also
select the document_converter operation to extract text from documents such
as Microsoft Word and PDF.


Second, if you choose to use the feed crawler, you might select
invalidate_duplicates_by_url. This operation ensures that only one copy
of a document is passed to another process. This document processor is
important for applications such as SAS Sentiment Analysis Workbench where
the freshness of the document matters.
Third, choose the content_categorization document processor if you want
to enable facetted search using the categorizer, concept, or contextual
extraction processors. You can also use these processors to categorize and
extract concepts and facts from your input documents before passing them to
another operation.
Fourth, use the export_csv and export_to_files processors to export the
normalized (and analyzed) documents in a format that can be used by another
application. To send documents directly to SAS Sentiment Analysis Workbench,
specify export_to_sas_sentiment_analysis_workbench.
Note: You can also add deployment-specific document processors by placing
them into the bin/postrpocessors subdirectory of your installation.

By default, when SAS Search and Indexing is installed, all input documents go
to the indexing server. This is true even if the documents are also sent to
other applications.
After you consider these available operations, use the Add Document
Processor window to add and configure your document processors. You can
choose to use one document processor, or you can build a pipeline that orders
several processors. For example, use the heuristic_parse_html operation to
extract paragraphs of text without their HTML tags. The next processor in the
pipeline might be the export_to_files processor that enables you to export
the file in XML or in text format. In either case, you can specify whether the
document stops here in the pipeline or goes to the indexing server.
The operations that you specify in the Document Processors pane occur in the
same order that they are listed in this pane. You can specify the document
processors in any order and use the Move up and Move down buttons to
reorder these operations. If document processing operations are incorrectly
ordered, unexpected results might occur.
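
The idea that processors run in listed order, and that a duplicate filter can stop a document, can be sketched as a list of functions. The code below is a conceptual model only: strip_tags and dedupe_by_url are simplified stand-ins written for this example, not the product's parse_html or invalidate_duplicates_by_url processors, and the field names raw, body, and url are assumed.

```python
import re


def strip_tags(doc):
    """Stand-in for an HTML-stripping stage: raw markup in, plain body out."""
    doc["body"] = re.sub(r"<[^>]+>", "", doc["raw"]).strip()
    return doc


def make_dedupe():
    seen = set()

    def dedupe_by_url(doc):
        """Stand-in for a duplicate filter: drop a document whose URL was seen."""
        if doc["url"] in seen:
            return None
        seen.add(doc["url"])
        return doc

    return dedupe_by_url


def run_pipeline(doc, stages):
    """Apply stages in list order; a stage that returns None stops the document."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


stages = [make_dedupe(), strip_tags]
first = run_pipeline({"url": "http://example.com/a", "raw": "<p>Hi</p>"}, stages)
again = run_pipeline({"url": "http://example.com/a", "raw": "<p>Hi</p>"}, stages)
print(first["body"], again)
```

In this model, a stage that reads the body field fails if it is placed before the stage that creates that field, which is one way that a wrong order produces unexpected results.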


9.2.2 Checking Program Installations


Document processors are specified and configured within the pipeline server.
If you choose to use one of the following processing operations, make sure
that the necessary application is running:
- SAS Document Conversion: If you want to process documents such
  as Microsoft Word and PDF files, install and run this application before
  you specify this document processor.

- SAS Content Categorization Server: Identify categories, concepts,
  and facts from SAS Content Categorization Studio and SAS Contextual
  Extraction Studio. Make sure that the taxonomies that you want to apply
  to the documents input to SAS Information Retrieval Studio are
  uploaded to SAS Content Categorization Server. Run SAS Content
  Categorization Server before you configure the pipeline server.

  Before you use SAS Content Categorization Server, create projects
  using SAS Content Categorization Studio and SAS Contextual
  Extraction Studio.
You can also export documents to another SAS program after you specify the
document processor and start the program.
- SAS Content Categorization Studio: Use the files that you export
  from SAS Information Retrieval Studio for training and testing
  purposes.

- SAS Sentiment Analysis Workbench: Analyze the sentiment
  expressions in input documents.

- SAS Text Miner: Identify entities.

For more information, see the SAS Information Retrieval Studio: Installation
Guide.


9.2.3 Configure the Document Processors


To add the parse_html processor, or to use this section as an example of how
to add a different processor, complete these steps:
1. Select Pipeline Server --> Configuration --> Document Processors.

2. Click Add in the Document Processors pane. The Add Document
Processor window appears.

3. Select parse_html.


4. Click Next and the Document Processor: parse_html window appears.

5. Leave the default specification, raw, or enter a new field name in the
input-field. The processor reads the unmodified document data from this
field. By default, the original, unmodified content of the HTML document
is stored under the field name raw.

6. Leave the default specification, title, in the title-output-field. You
can also enter a different field name where the processor stores the plain
text of the document title.

7. Leave the default specification, body, in the body-output-field. You
can enter a different field where the processor stores the body text
located in the input document. This field is used by other applications,
such as SAS Content Categorization Studio, when they are part of the
processing pipeline.

8. Change the default entry to 1 in the output-metadata field and this
processor populates other fields, such as keywords and description,
with values taken from the HTML document.

9. The entry 1 in the require-mime-type field specifies that a document
is checked to ensure that it is an HTML document. If you enter 0, this
check is not required.


10. Leave the mimetype entry in the mime-type-field, or specify a
different field.

11. The entry 1 in the base64-input field specifies that the text is
preserved in the MIME content transfer encoding. If you enter 0, this
encoding is not used.

12. Click Finish to save these settings.

13. (Optional) Continue adding the document processors to the pipeline.

14. (Optional) To make changes to the specifications for a processor, click
Edit.

15. (Optional) To change the ordering of the processors in the pipeline, click
Move up or Move down until the order is correct.

Note: For more information, see Section 2.8.4 The Document Processors Tab
on page 41.
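
To make the parse_html settings concrete, the following sketch imitates them with the Python standard library. It is not the product's processor. The function below only borrows the setting names from the steps above, and several behaviors are assumptions: the default values of 0 for output-metadata and 1 for require-mime-type and base64-input, the exact MIME string text/html, UTF-8 decoding, and leaving non-HTML documents unchanged.

```python
import base64
from html.parser import HTMLParser


class _Extractor(HTMLParser):
    """Collect title text, body text, and keywords/description meta values."""

    def __init__(self):
        super().__init__()
        self.title, self.body, self.meta = [], [], {}
        self._stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag in ("title", "body"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack:
            target = self.title if self._stack[-1] == "title" else self.body
            target.append(data)


def parse_html(doc, input_field="raw", title_output_field="title",
               body_output_field="body", output_metadata=0,
               require_mime_type=1, mime_type_field="mimetype", base64_input=1):
    if require_mime_type and doc.get(mime_type_field) != "text/html":
        return doc  # assumed behavior: leave non-HTML documents unchanged
    raw = doc[input_field]
    if base64_input:
        raw = base64.b64decode(raw).decode("utf-8")  # UTF-8 is an assumption
    parser = _Extractor()
    parser.feed(raw)
    doc[title_output_field] = "".join(parser.title).strip()
    doc[body_output_field] = " ".join("".join(parser.body).split())
    if output_metadata:
        doc.update(parser.meta)
    return doc


page = ("<html><head><title>News</title>"
        "<meta name='keywords' content='sas,search'></head>"
        "<body><p>Hello <b>world</b></p></body></html>")
doc = {"raw": base64.b64encode(page.encode("utf-8")).decode("ascii"),
       "mimetype": "text/html"}
print(parse_html(doc, output_metadata=1)["body"])
```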

9.3 See Input Documents with the Document Inspector

Use the Document Inspector pane to see all of the versions of the input
document. You can see each version, simultaneously, at each stage in the
pipeline server. The original document changes at each stage of the pipeline,
but you can still see its original text.
This snapshot operation is available for one document at a time, but only when
the documents are moving through the pipeline server. In this pane, you can
see each document as it moves, whether it is intact or split into multiple
documents.


Display 9-1 Viewing a Document in the Document Inspector Pane

To use the Document Inspector pane to see a document, use the following
steps:
1. Click Take Snapshot.

2. Click a document processing operation that appears in the
Processing Stage pane. For example, click heuristic_parse_html. A
document number appears in the Document pane.

3. Click the number in the Document pane and the fields in this document
appear in the Field pane. For example, click 1.

4. Click one of the fields that the document consists of in the Field
pane. For example, click body.

5. See the contents of the selected field in the Document Inspector pane.
For example, see http://money.cnn.com/2010/01/21/technology/sas_best_companies.fortune/.
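
Conceptually, a snapshot keeps a copy of one document as it looks after each processing stage. The sketch below models that idea with plain dictionaries. The take_snapshot function, the stage named upper_body, and the field names are hypothetical and do not describe how the Document Inspector is implemented.

```python
import copy


def take_snapshot(doc, stages):
    """Return [(stage_name, copy_of_document_after_stage), ...] for one document."""
    snapshot = [("input", copy.deepcopy(doc))]
    for name, stage in stages:
        doc = stage(doc)
        snapshot.append((name, copy.deepcopy(doc)))
    return snapshot


# A hypothetical stage that derives a body field from the raw field.
stages = [("upper_body", lambda d: {**d, "body": d["raw"].upper()})]
for stage_name, version in take_snapshot({"raw": "hello"}, stages):
    print(stage_name, sorted(version))
```

Copying the document at every stage is what allows the original text to remain visible after later stages have changed it.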

9.4 Add a New Field to Input Documents


You can add a new field with a constant value to each of the input documents.
Use this feature to assign the same field to each indexed document. For
example, you might add a field that specifies that the documents were indexed
from a particular source, during a specific time period, or for other defining
purposes.
When you choose to add a field, you also specify the alphanumeric string that
is assigned to each document.
To add a field to each input document, complete these steps:
1. Select Pipeline Server --> Document Processors. The Document
Processors pane appears.

2. Click Add. The Add Document Processor window appears.

3. Select add_field.
4. Click Next. The Document Processor: add_field window appears.

5. Enter the name of the field that you want to add to each input document
into the field field. For example, type Date.

6. Enter the value that populates the added field into the value field. For
example, type 062011.

7. Click Finish to see this addition in the Document Processors pane.

8. Click Stop to halt the Pipeline Server.

9. Click Apply Changes.

10. Repeat Step 8 and Step 9 for the crawler that you are using.

11. Repeat Step 8 and Step 9 for the Proxy Server.

12. Repeat Step 8 and Step 9 for the Indexing Server.


13. Select Pipeline Server --> Document Inspector.

14. Click Take Snapshot.


15. Click Processing Stage to see a list of the document processors.
Select a processor. For example, click add_field. The document
number appears in the Document pane.

16. Click the document number under Document to see the fields for this
document in the Field pane. For example, click 1 in the Document
pane to see concepts, Date, promotion, and id in the Field pane.

17. Click a field in the Field pane to see the related information in the
empty pane. For example, click Date to see that the value 062011 that
you assigned to the add_field processor was assigned to document 1.
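
The effect of the add_field processor is the same constant field on every document. The sketch below shows that idea with a closure. The Python function is a stand-in written for this example, and only the Date and 062011 example values come from the steps above.

```python
def add_field(field, value):
    """Build a stage that sets the same field/value pair on every document."""
    def stage(doc):
        doc[field] = value
        return doc
    return stage


stamp = add_field("Date", "062011")
docs = [stamp({"id": n}) for n in (1, 2)]
print(docs)
```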

9.5 Match Categories, Concepts, and Facts


You can match categories, concepts, and facts in input documents using the
content_categorization Document Processor. You use this processor to specify
the categories, and classifier and grammar concepts, that you created and
defined in SAS Content Categorization Studio. The concepts that you define in
the SAS Content Categorization Studio add-on program SAS Contextual
Extraction Studio are used as concepts or facts. Any concept that is developed
in SAS Contextual Extraction Studio and specified with a PREDICATE or
SEQUENCE rule is a fact.
The content_categorization Document Processor is the client for SAS Content
Categorization Server. The categories, concepts, and facts are applied by SAS
Content Categorization Server to the documents processed by SAS
Information Retrieval Studio.
The following example uses concepts. If you want to use categories or facts,
make the appropriate substitutions. Also see Chapter 10: Creating Facetted
Search Labels Using content_categorization. This chapter uses the Document
Processor: content_categorization wizard to create labels for facetted search.
To map concepts to labels, complete these steps:
1. Select Pipeline Server.

2. Open the Document Processors pane.

3. Click Add and the Add Document Processor window appears.

4. Select content_categorization. The Document Processor:
content_categorization window appears.

5. (Optional) By default, the name of the server where SAS Content
Categorization Server is running is specified in the Hostname field.
For example, see localhost. You can enter a different server name if
SAS Content Categorization Server is running on another server.

6. (Optional) By default, the port number for the specified server is entered
in the Port field. For example, see 6500. You can select a different port
number.

7. (Optional) By default, 10 is entered into the Timeout field. You can
select a different number. This is the number of seconds that the Pipeline
Server waits before this server stops attempting to download an input field.


8. Click Next. The Document Processor: content_categorization window
appears. Use this window to add any of the projects that are uploaded to
SAS Content Categorization Server to SAS Information Retrieval
Studio.

9. Click Add and the Document Processor: content_categorization
window appears.

10. (Optional) Select Concept extraction unless this processor is already
selected.

11. (Optional) Select a project that you added, unless the project that you
want to use is already selected. For example, select Entities.

12. Click OK and the project appears in the Document Processor:
content_categorization window. For example, see Entities under
Project and Concept extraction under Type. Your selection limits
the available concepts to those in the project.

13. (Optional) Continue to add projects using Step 9 through Step 12. The
concepts in each of the projects that you select are available to match
your input documents.
14. Click Next. The Document Processor: content_categorization window
appears. By default, the Input tab is displayed.

15. (Optional) Enter the fields in your input documents where you want to
locate matches for your concepts. Enter these fields as a
comma-separated list into the Input fields field. If you leave this
field blank, all fields are searched, with the exception of those listed
in the Input fields to exclude field.

16. (Optional) By default, fields that contain information about the
document are listed in the Input fields to exclude field. You can add
fields to this list or delete fields from it:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime

If you edit this list, be sure to insert a comma (,) between each field.
17. (Optional) If you make any changes, click Finish to save these edits.

18. Click Concepts and the Concepts pane appears.

19. Click Add to specify the concepts that are matched in input documents.
The Document Processor: content_categorization window appears.

20. Click the drop-down arrow to select the concept that you want to match
in the Concept field. For example, select LOCATION.

Hint: Only the concepts that are part of the selected project are
available in the drop-down menu that appears.

21. (Optional) By default, the name of the concept is entered into the Field
name field. For example, see location. Enter a new name, if you
choose.

22. (Optional) By default, the name of the label for the facetted search is
entered into the Caption field. For example, see Location. Enter a new
caption, if you choose. For more information about facetted search, see
Chapter 10: Creating Facetted Search Labels Using
content_categorization.
23. (Optional) By default, %c: %i is entered into the Format field. These
symbols indicate that information about the concept (%c) followed by
information about the entity (%i) is output. You can choose different
symbols from those that are available.

Table 9-1: Default Format Symbols
Symbol   Description
%c       Match the concept name.
%p       Add to %c to include the path with the concept name.
%m       Match the text.
%i       Match the information associated with the entity, or the match
         text if no information is available.
%I       Match the information associated with the entity unconditionally.
%%       Match the literal percent sign.
x        Use as a modifier, such as in %xc, to request XML escaping.

24. (Optional) By default, ; (the semicolon) is used to separate the output
fields. Enter a different separator character if you choose.

25. (Optional) Click Copy Defaults to revert to the concept entries in the
Concepts tab.

26. Click Ok to save your changes. The Document Processor:
content_categorization window appears.

27. See the newly entered concept with its field name and caption. For
example, see Location under Concept, location under Field name,
and Basketball Player under Caption.

28. (Optional) If you want to continue to add concepts, click Add. Repeat
Step 19 through Step 26 on page 253 until you have added all
of the concepts that you want to use for facetted search.

Note: By default, you can add a maximum of 10 concepts to
the project. To change this number, see Section
13.2.1 Displays with or without Labels on page 310.

29. (Optional) By default, concepts is entered into the Default field
name field. You can choose to enter a different name into this field.

30. (Optional) By default, Concepts is entered into the Default caption
field. You can choose to enter a different name into this field.

31. (Optional) By default, %c: %i is entered into the Default format field.
You can edit this entry using any of the symbols in Table 9-1 on
page 252 with the exception of %I.

32. (Optional) By default, ; (semicolon) is entered into the Default
separator field. Enter a different separator character if you choose.

33. (Optional) By default, 15 is entered into the Max concepts field. This
is the highest number of concepts that can be located in an input
document. Click the up or down arrow to enter a different number.

34. Click Finish to save these settings.

35. If an index was in the process of building while you added captions for
your concepts, the Delete Index window might appear.

36. Click Yes to delete the index.

37. When this operation is complete, the name that you entered into Field
name appears in the Configuration pane of the indexing server.

38. Click Start in the main Pipeline Server window to restart the Pipeline
Server.

39. When you click the Add button in the Matching pane of the query web
server, you can select this field in the Add Field window. This caption
name appears as a field in the Matching pane of the Query Web Server.
This caption also appears in the user interface when a matching term is
located in an input document.
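
The format symbols and the separator described in the steps above can be illustrated with a minimal Python sketch. This is not the product's implementation. The names expand_format and join_labels are hypothetical, the sketch covers only %c, %m, %i, %I, %%, and the x modifier, and it leaves out %p because the text does not spell out how the path is rendered.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical helper, not part of SAS Information Retrieval Studio.
# It expands a subset of the format symbols listed in Table 9-1.
_SYMBOL = re.compile(r"%%|%(x?)([cmiI])")

def expand_format(fmt, concept, match_text, info=None):
    values = {
        "c": concept,             # concept name
        "m": match_text,          # matched text
        "i": info or match_text,  # entity information, else the matched text
        "I": info or "",          # entity information only
    }

    def substitute(m):
        if m.group(0) == "%%":
            return "%"            # literal percent sign
        value = values[m.group(2)]
        # The x modifier (as in %xc) requests XML escaping.
        return escape(value) if m.group(1) else value

    return _SYMBOL.sub(substitute, fmt)

def join_labels(labels, separator=";"):
    # The separator defaults to the semicolon, as in the Separator field.
    return separator.join(labels)

print(expand_format("%c: %i", "LOCATION", "Paris"))
```

With the default format %c: %i and no entity information, the sketch falls back to the matched text, which mirrors the description of %i in the table.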

9.6 Export Categories and Concept Matches


You can export matches on your category and concept fields using file, CSV,
and ODBC operations.
To export matched category and concept fields, complete these steps:

1. Use the steps in Section 9.5 Match Categories, Concepts, and Facts on
page 246.

2. (Optional) If you plan to export your matched fields without indexing
them, deselect the Label field in the index check box.

3. Deselect the Label field in the query web server check box.
4. Select one of the following operations:
- File export: fields are exported as files
- CSV export: fields are exported in comma-separated format
- ODBC export: fields are exported into a database


5. Click Finish.
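
To make the CSV export option concrete, the following Python sketch writes matched label fields to comma-separated output with the standard csv module. It only illustrates the idea. The column layout, the function name export_matches_csv, and the sample document are assumptions, and the file that the product actually writes might be organized differently.

```python
import csv
import io

def export_matches_csv(documents, fields, out):
    """Write one row per document with the chosen label fields (hypothetical layout)."""
    writer = csv.DictWriter(out, fieldnames=["id"] + fields)
    writer.writeheader()
    for doc in documents:
        row = {"id": doc["id"]}
        for name in fields:
            # A document without a match for a field gets an empty cell.
            row[name] = doc.get(name, "")
        writer.writerow(row)

# Hypothetical sample data: one document with two LOCATION matches.
docs = [{"id": "doc1", "location": "LOCATION: Paris;LOCATION: Rome"}]
buffer = io.StringIO()
export_matches_csv(docs, ["location"], buffer)
print(buffer.getvalue())
```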


9.7 Advanced Installation


With the advanced installation, you can configure two or more pipeline
servers. In this type of SAS Information Retrieval Studio configuration, one
pipeline server performs some of the document processing operations. This
pipeline server can then send a copy of the processed documents to another
pipeline server, where more document processors can act on them.
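
The hand-off between two pipeline servers can be pictured with a short Python sketch. It is a conceptual model only: the PipelineServer class, the forward_to parameter, and the two sample processors are hypothetical names, and the sketch does not reflect how the servers communicate.

```python
import copy

class PipelineServer:
    """Conceptual stand-in for a pipeline server with an ordered processor list."""

    def __init__(self, name, processors, forward_to=None):
        self.name = name
        self.processors = processors
        self.forward_to = forward_to

    def process(self, document):
        # Run every document processor configured on this server, in order.
        for processor in self.processors:
            document = processor(document)
        if self.forward_to is None:
            return document
        # Send a copy of the processed document to the next pipeline server.
        return self.forward_to.process(copy.deepcopy(document))

def strip_whitespace(doc):
    # Hypothetical processor that runs on the first server.
    return {**doc, "body": doc["body"].strip()}

def add_language(doc):
    # Hypothetical processor that runs on the second server.
    return {**doc, "language": "en"}

second = PipelineServer("second", [add_language])
first = PipelineServer("first", [strip_whitespace], forward_to=second)
print(first.process({"body": "  sample text  "}))
```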

9.8 Run the Pipeline Server


By default, the pipeline server is running. Configure all of the components that
you plan to use. Click Apply Changes after you modify any of the default
settings for these components. Perform these operations before you view the
statistics for the pipeline server.
To start, restart, or stop the pipeline server, complete any of these steps:
- Click Start in the Pipeline Server pane.

The appropriate message appears in the Status pane after any of these
operations.
See the progress of the input documents in the Status pane:
a. The Overall - Pending table cell is always empty.

b. See how many documents have finished all of the processing
operations in the Overall - Finished table cell. For example, 32.

c. See how many XML documents are in the process of having their
XML tags removed in the XML parsing - Pending table cell. For
example, 1.

d. See how many XML documents have completed the process of
XML tag removal in the XML parsing - Finished table cell. For
example, 31.

e. See how many documents are in the pipeline process in the
Document processing - Pending table cell. For example, 22.

f. See how many documents have completed all of the pipeline
operations in the Document processing - Finished table cell. For
example, 8.

g. See how many documents are going to the indexing server in the
Sending to the indexer - Pending table cell. For example, 7.

h. See how many documents have completed the indexing process in
the Sending to the indexer - Finished table cell. For example, 0.

- (Optional) If you make any changes to the configuration while the
pipeline server is running, click Apply Changes.
- (Optional) Click Revert to return to the last applied settings.
- (Optional) Click Refresh to see any changes in this pane.
- To stop the pipeline server, click Stop.


9.9 Troubleshoot with the Log File


The log pane enables you to see the operations performed by the pipeline
server. Use the contents of the Log pane when you require customer support.
To access and use the log pane, complete these steps:
1. Click the Log tab in the Pipeline Server pane.

2. Use the default selection Connections. Click the drop-down arrow
to select Errors instead.

3. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.

4. Click Retrieve to display the specified number of lines in the log file.
5. (Optional) Enter a search term into the Text to highlight field. For
example, enter target machine.


6. Click Find to locate all instances of this term in this pane.
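
The Retrieve and Find operations correspond to two simple text operations: keep the last N lines of a log, then locate the lines that contain a term. The Python sketch below shows that idea for a log that is already available as a list of lines. The function names, the case-insensitive comparison, and the sample log lines are assumptions for illustration.

```python
from collections import deque

def last_lines(lines, count=20):
    """Return the last `count` lines, like the Number of lines setting (default 20)."""
    return list(deque(lines, maxlen=count))

def find_term(lines, term):
    """Return (position, line) pairs for lines that contain the term, ignoring case."""
    needle = term.lower()
    return [(pos, line) for pos, line in enumerate(lines, start=1)
            if needle in line.lower()]

# Hypothetical log content.
log = ["Connecting to target machine", "Connection accepted",
       "Target machine refused the connection"]
for pos, line in find_term(last_lines(log), "target machine"):
    print(pos, line)
```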


10 Creating Facetted Search Labels Using content_categorization

- Before You Begin Using This Example
- Creating a Sample Project
- Seeing the Results in the Query Interface

10.1 Before You Begin Using This Example


10.1.1 How the content_categorization Document Processor Creates Facetted Search Labels

Facetted search applies identifying labels to matched documents. These labels
enable you to intuitively navigate to the documents that match your input
query terms. Unlike traditional search, facetted search enables you to search
instinctively and faster because the matching texts are pre-organized. (You can
also apply in-line tagging using these labels. This tagging can be used by a
third-party program at this time.)
Labels are values within fields. These fields can have display names that are
specified in the Caption field in the Document Processor:
content_categorization windows. Captions do not have formatting restrictions,
unlike internal field names that can contain only lowercase English letters.

Figure 10.1 Example of Facetted Labels

10.1.2 Using Related Programs to Define Labels


When you define labels for facetted search, you use the following programs:
- SAS Content Categorization Studio
- (Optional) SAS Contextual Extraction Studio
- SAS Content Categorization Server

Use the following architectural diagram to gain an overview of these
applications:
Figure 10.1 Architecture for Facetted Label Creation


Define your labels using the categories and concepts that you specify in SAS
Content Categorization Studio with or without SAS Contextual Extraction
Studio. Labels apply the matching requirements set by the rules that define
categories and concepts. Labels also enable facetted search operations in the
query interface of SAS Information Retrieval Studio.
Use SAS Content Categorization Studio alone to develop categories that
identify documents based on their subject matter. Also define concepts that
locate relevant terms based on rules that are specified by lists of matching
terms or parts of speech and other symbols.
Display 10.1 SAS Content Categorization Studio

When you use the add-on SAS Contextual Extraction Studio application with
SAS Content Categorization Studio, you can define LITI concepts. These
concepts increase matching precision (matches only the relevant texts) and
recall (matches all of the relevant texts). LITI concepts differ from the
classifier and grammar concepts in SAS Content Categorization Studio
because you can mix rule types within a single definition.
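
Precision and recall have standard definitions that a small calculation makes explicit. Precision is the share of returned matches that are relevant, and recall is the share of relevant texts that are returned. The Python sketch below computes both for sets of text identifiers. The function name and the sample identifiers are illustrative only.

```python
def precision_recall(matched, relevant):
    """Return (precision, recall) for a set of returned matches and a set of relevant texts."""
    matched, relevant = set(matched), set(relevant)
    true_positives = len(matched & relevant)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Four texts were returned; two of the three relevant texts are among them.
precision, recall = precision_recall({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t5"})
print(precision, recall)
```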
Contextual Extraction, or LITI, concepts can also include facts. Facts are rules
that are defined by arguments. Arguments are defined by concepts that are
related if they are matched by the fact rule. For this reason, facts return related
pieces of information in input documents. For example, define facts when you
want to identify relationships between drugs, symptoms, and gender.


Display 10.2 Two Facts in One LITI Concept

Note: Rules appear on only one line. The rules that appear on more than
one line in this example are spaced only for illustrative purposes.

When you use facts as labels, you can specify the string that is returned for the
label. Each string contains terms that are custom filled according to the
matched text.

10.1.3 Mapping to Labels


You can enter the names of the labels for the categories, concepts, and facts
that you want to serve as navigation tags for facetted search. These labels link
to one or more matched documents in the query interface. For this reason, the
names of the labels and the rules that define each taxonomy node should be
part of a schema that reflects appropriate ways of searching. For example, if
you want drugs to be part of the taxonomy for your SAS Content
Categorization Studio project, you might also define the SIDE_EFFECTS,
GENDER, and DISEASE concepts.
Use the Document Processor: content_categorization wizard to select
categorization, concept, and fact extraction processors. These processors
locate matching terms in the input text, or within the document fields that you
specify, and return matches.


When you choose categories, SAS Information Retrieval Studio applies all of
the categories in the selected project to input texts. Although the default
selections for concepts and facts are the same, you can select specific facts and
concepts to apply.
All LITI concept definitions that include PREDICATE and SEQUENCE rules are
treated as facts. If a LITI concept rule contains one, or more, facts and other
concept rules, the facts and the concepts are applied separately. The default
settings in the Document Processor: content_categorization wizard return the
matched fact and concept rules for each LITI definition under the same label
name. For this reason, consider renaming either the fact or the concept label
and field name for each LITI definition that contains a concept and a fact.
Choose the display selection that works best for your end users:
Display 10.3 Example of Default Setting


Display 10.4 Facts and Concepts Labeled Differently

PREDICATE and SEQUENCE rules match two or more concepts to reveal
otherwise overlooked relationships in input documents. The related matches
are facts. For this reason, facts consist of at least two arguments. In the
example above, the arguments for SIDE_EFFECT are drug and sideeffect.
These arguments match Topamax and restlessness, respectively, in the input
document.
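
A fact match can be thought of as a set of named arguments, and a fact label as a string whose slots are filled from those arguments. The Python sketch below uses the drug and sideeffect arguments from the example. The template syntax is Python's own str.format placeholder style and is an assumption made for illustration; it is not the label syntax of the product.

```python
def fact_label(template, arguments):
    """Fill a label template with the matched text of each fact argument."""
    return template.format(**arguments)

# Argument names and matched text come from the SIDE_EFFECT example.
arguments = {"drug": "Topamax", "sideeffect": "restlessness"}
print(fact_label("{drug} may cause {sideeffect}", arguments))
```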

10.1.4 Before You Build Your SAS Content Categorization Studio Project

Before you develop a SAS Content Categorization Studio project, or choose
an existing one, consider the types of labels that you want to display in the
query interface. The category, concept, and fact names that you define in SAS
Content Categorization Studio are the default settings for SAS Information
Retrieval Studio. (You can also write a custom string that displays these
names.) For this reason, use care when specifying names and writing
PREDICATE and SEQUENCE rules that specify terms that are visible to the end
user.
Also use care when writing rules that return many matches. For example, you
might develop a SAS Content Categorization Studio project that includes an
EMAIL concept. This concept might contain rules defined by regular
expressions that are designed to return all e-mail accounts within internal
company documents. The inclusion of this EMAIL concept might not be
appropriate for a facetted search on the Web.
Before you upload a SAS Content Categorization Studio project to SAS
Content Categorization Server, check the Project Settings - Misc tab of SAS
Content Categorization Studio. If there are entries in the XML Default Fields
field, remove these fields and leave the XML Default Fields field blank.
(These fields apply to categories and to LITI concepts and facts. For this
reason, grammar and classifier concepts in SAS Content Categorization Studio
are matched regardless of the field entries, but other matches that should
occur might not.)
Display 10.5 Project Settings - Misc Tab


Use care when changing rules and uploading projects to avoid propagating the
same rule or its variations. For example, you might upload a SAS Content
Categorization Studio project to SAS Content Categorization Server. If you
then change a concept definition and upload the same project to the server
under a new name, both versions of the rule are available for matching. This
occurs if you add both projects to your SAS Information Retrieval Studio
project using the Document Processor: content_categorization wizard.
In other words, matches might be made on concept definitions where one or
more of the definitions use an outdated rule. This behavior can occur
because SAS Information Retrieval Studio consolidates all of the rules for
categories, concepts, and facts with the same names.
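
The effect of consolidating rules by name can be sketched in a few lines of Python. The sketch models each project as a mapping from concept names to rule strings and merges the projects by name, so an outdated rule stays active next to its revision. The function name, the second project name, and the rule strings are hypothetical placeholders.

```python
from collections import defaultdict

def consolidate(projects):
    """Merge the rules of all added projects by concept name."""
    merged = defaultdict(list)
    for project_name, concepts in projects.items():
        for concept, rules in concepts.items():
            # Rules with the same concept name end up in the same list.
            merged[concept].extend((project_name, rule) for rule in rules)
    return dict(merged)

# The same project uploaded twice, the second time renamed and with a revised rule.
projects = {
    "MedicalProj": {"SIDE_EFFECT": ["original rule"]},
    "MedicalProj_renamed": {"SIDE_EFFECT": ["revised rule"]},
}
for source, rule in consolidate(projects)["SIDE_EFFECT"]:
    print(source, rule)
```

Both rules are listed under SIDE_EFFECT, which is why removing the older project is the safer choice when a definition changes.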
Naming also affects LITI facts and concepts. For example, you might have a
LITI concept definition that includes both fact and concept definitions. See the
example below:
Figure 10.2 Facts and Concept Rules in One Concept Definition

Note: For the purposes of this example only, each fact rule appears on
two lines.

In this example, if matches occurred for both the facts and concepts, all of
these matches would return a match on the SIDE_EFFECT concept. However,
you can use the content categorization document processor to specify different
names for the concept and fact matches.


10.1.5 Before You Use the Example in This Chapter


Before you follow the example in this chapter, install the following programs:
- SAS Content Categorization Studio
- (Optional) SAS Contextual Extraction Studio
- SAS Content Categorization Server

Note: This chapter provides one example of how labels are mapped to
concepts and facts and viewed in the query interface for SAS Information
Retrieval Studio. For more general information, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.

To understand how to use these applications with SAS Information Retrieval
Studio, complete the following steps:
1. Develop a sample SAS Content Categorization Studio project with, or
without, SAS Contextual Extraction Studio concepts and facts. For
more information, see SAS Content Categorization Studio: Users
Guide and SAS Contextual Extraction Studio: Users Guide.

2. Use the Build menu to build, compile, and upload the relevant
categories and concepts projects to SAS Content Categorization Server.
For more information, see SAS Content Categorization Studio:
Administrators Guide.

3. Specify the name of the project in the Upload window that appears. The
entry in the Server Project Name field can be unique for the SAS
Information Retrieval Studio project.

4. (If you uploaded your project a while ago) Select Start --> Programs
--> SAS Content Categorization Server to make sure that the server
is running.

5. Configure a sample SAS Information Retrieval Studio project using the
content categorization document processor that references your sample
SAS Content Categorization Studio project. See the following sections
of this chapter for step-by-step directions.

6. Check the matching results using the Document Inspector tab.
(Always click Take Snapshot before you start a crawler.)

7. Select Start --> Programs --> SAS Information Retrieval Studio
--> Query Interface.

8. Enter a query term such as side effects.

9. Click Search to see the results. The facetted search labels appear on the
left side of the query interface.

10. (Optional) Check your search results against the original project to
ensure that the expected results occur.

10.2 Creating a Sample Project


10.2.1 Access the Projects on SAS Content Categorization Server

Use the Document Processor: content_categorization window to specify the
location where SAS Content Categorization Server is running. The category,
concept, and fact definitions are uploaded in projects to SAS Content
Categorization Server.
The content_categorization Document Processor is the client for SAS Content
Categorization Server. The categories, concepts, and fact definitions are
applied by SAS Content Categorization Server to the documents processed by
SAS Information Retrieval Studio.

Note: The following steps apply when text documents are input to SAS
Information Retrieval Studio. If HTML or XML documents are input, use the
appropriate parser. For example, add the parse_html document processor.

To specify the location of SAS Content Categorization Server, complete these
steps:
1. Select Pipeline Server --> Document Processors.

Hint: The Pipeline Server can either be running or stopped.

2. Click Add. The Add Document Processor window appears.

3. Select content_categorization.
4. Click Next. The Document Processor: content_categorization window
appears.

5. (Optional) By default, the name of the server where SAS Content
Categorization Server is running is specified in the Hostname field.
For example, see localhost. Enter a different server name if SAS
Content Categorization Server is running on another server.

6. (Optional) By default, the port number for the specified server is entered
into the Port field. For example, see 6500. Click the up or down arrow
to select a different port number.

7. (Optional) By default, 10 is entered into the Timeout field. Click the up
or down arrow to select a different number of seconds that the
pipeline server waits before it stops trying to complete a matching
operation.

8. Click Next. You can now add the projects that you uploaded to SAS
Content Categorization Server.
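
Before you enter a host name and port, it can help to confirm that something is listening at that address. The Python sketch below attempts a plain TCP connection using the default values mentioned in the steps (localhost, 6500, and a 10-second timeout). It checks network reachability only and knows nothing about the protocol of SAS Content Categorization Server. The function name server_reachable is hypothetical.

```python
import socket

def server_reachable(host="localhost", port=6500, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs, and name resolution failures.
        return False

if __name__ == "__main__":
    print(server_reachable())
```

A result of False means only that the port did not accept a connection from this machine; the server might be stopped, or a firewall might block the port.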

10.2.2 Add Projects


Use this section to specify the projects that you uploaded to SAS Content
Categorization Server that are used by SAS Information Retrieval Studio. You
use these projects to select the categories, concepts, and facts that SAS
Information Retrieval Studio applies to input documents. For this reason, you
select the project that you want to use for each type of label source.
To add projects, complete the following steps:
1. Click Add in the Document Processor: content_categorization window
that appears after you click Next in Step 8 above.
The Document Processor: content_categorization window appears.

2. (Optional) By default, Categorization is selected in the Type field.
This is true if you uploaded a categories project to SAS Content
Categorization Server. Otherwise, Concept extraction or
Contextual extraction is selected. Click the drop-down arrow to change
the default selection.

3. (Optional) By default, a project such as Sample is selected in the
Project field. Click the drop-down arrow to select a different project
that is running on SAS Content Categorization Server with the
appropriate matching technology. For example, if Sample is selected,
only categories are available for matching. This is true because
Categorization is selected in the Type field and Sample was uploaded as
a categories project.

4. Click Ok and the project appears in the Document Processor:
content_categorization window.

5. (Optional) Repeat Step 1 on page 277 through Step 4 above until you
have added all of the projects and their matching types. For example,
add MedicalProj to include concepts. Add MedicalProj2 to match
LITI concepts and facts. If you have multiple projects for a specific
matching technology, you can add all of these projects.

6. Click Next.

10.2.3 Determine the Input, Matching, and Output


10.2.3.A How Input Documents Are Handled
The content_categorization document processor enables you to specify the
input fields, the matching field names, and how the fields are labeled or
exported. These specifications determine how the content of input
documents is handled.

10.2.3.B Specify Input Fields


Input documents such as HTML and XML documents contain fields, some of
which are for informational purposes only. You can choose to limit the fields
that are searched for categories, concepts, and facts. You can also exclude
some fields. When you specify input fields, only the listed fields are
searched.
Before you use the steps below, consider the types of documents that you plan
to input. These document types determine the fields to include or exclude.
To select the input fields, complete these steps after you click Next in Step 6
on page 278. The Document Processor: content_categorization window
appears. By default, the Input tab is selected.

1. (Optional) By default, the Input Fields field is blank. Use a comma (,)
separated list to specify any field names that you want to search for
matches for your categories, concepts, and facts. If you leave this field
blank, all of the fields are searched with the exception of any fields
entered into the Input fields to exclude field. If you specify any
fields, only the listed fields are searched.

2. (Optional) By default, the Input Fields to exclude field contains
these fields:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime

Using a comma-separated format, you can edit this list.

3. (Optional) Click Finish to save your changes.
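
The interaction between the two field lists can be summarized in a short Python sketch. It treats a document as a dictionary of fields and returns the names of the fields that would be searched. The function names are hypothetical, and the sketch assumes that an explicit input list takes precedence over the exclusion list, which is how the steps above read.

```python
DEFAULT_EXCLUDED = ("id,url,feed_url,raw,mimetype,date,pdate,source,"
                    "promotion,ctime,atime,mtime")

def parse_field_list(text):
    """Split a comma-separated field list and drop empty entries."""
    return [name.strip() for name in text.split(",") if name.strip()]

def fields_to_search(document, input_fields="", excluded=DEFAULT_EXCLUDED):
    listed = parse_field_list(input_fields)
    if listed:
        # Explicit input fields: only the listed fields are searched.
        return [name for name in listed if name in document]
    # Blank input list: search every field that is not excluded.
    skip = set(parse_field_list(excluded))
    return [name for name in document if name not in skip]

doc = {"id": "1", "url": "http://example.com", "title": "Title", "body": "Text"}
print(fields_to_search(doc))
print(fields_to_search(doc, "body"))
```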

10.2.3.C Specify Categories


Categories define the information that is located in input documents by
specifying the subject matter of the documents. When you select categories,
unlike concepts and facts, all of the categories in the project are applied to
input documents.
To add categories to the project, complete these steps:
1. Click Categories to access the Categories pane.

2. (Optional) By default, categories is entered into the Field name field.
You can enter a new field name.

3. (Optional) By default, Categories is entered into the Caption field.
You can enter a new caption name for facetted search.

4. (Optional) By default, %c is entered into the Format field for each
individual category name. You can enter a new format that might
include %% for a literal percent sign. You can also use x as a modifier to
request XML escaping. For example, enter %xc.

5. (Optional) Enter a regular expression into the Category name pattern
field. Regular expressions specify the pattern for the category name.

6. (Optional) Enter a string into the Category name replacement field.
This string is a constant value that replaces each of the individual
category names with the name that you specify here.

7. (Optional) By default, ; (semicolon) appears in the Separator field.
Enter a new separator such as a comma (,) for the matched categories.

8. (Optional) By default, the highest number of categories that can be
matched in any single input document is 15. Click the up or down arrow
to change this default selection in the Max categories field.

Hint: This field specifies the number of categories that have the highest
numbers of matches. For example, matches might occur for 25 categories.
However, the results for this example are displayed only for the 15
categories with the most matches in input documents.

9. Click Finish.
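
The category settings above (name pattern, replacement, separator, and maximum number of categories) can be combined in one small Python sketch. It is an interpretation and not the product's logic: the function name is hypothetical, the pattern is applied with Python's re.sub, and ties between categories with equal match counts are broken alphabetically, which the text does not specify.

```python
import re

def category_label(match_counts, pattern=None, replacement="",
                   separator=";", max_categories=15):
    """Build the label string for the categories with the most matches."""
    # Keep the categories with the highest match counts (ties: alphabetical).
    ranked = sorted(match_counts, key=lambda name: (-match_counts[name], name))
    top = ranked[:max_categories]
    if pattern:
        # Rewrite the part of each name that matches the pattern.
        top = [re.sub(pattern, replacement, name) for name in top]
    return separator.join(top)

counts = {"Top/Sports": 3, "Top/Health": 7, "Top/Finance": 1}
print(category_label(counts, pattern=r"^Top/", max_categories=2))
```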

10.2.3.D Specify Concepts


Concepts identify metadata, or data about the information in a document.
You specify concepts to locate specific types of information in input
documents using SAS Content Categorization Studio. You can add the
classifier and grammar concepts in SAS Content Categorization Studio or any
of the concept types in SAS Contextual Extraction Studio.
However, concepts that include SEQUENCE and PREDICATE rules in their
definitions are added as facts. After you upload these concepts to SAS
Content Categorization Server, the projects that contain these rules can be
applied by SAS Information Retrieval Studio.
Matches for any of the concepts that you specify explicitly appear in the
table at the top of the Document Processor: content_categorization window.
These matches appear in the specified format and are placed into the
specified output field. Matches for any other concepts that are not in the
table are assigned the default format. The text of these matches appears in
the field that is named in Default field name.
You can also choose to exclude concepts from matching. To exclude all of the
matches for concepts that are not specified explicitly, leave the Default
field name empty in the Concepts tab. To keep one or more specific concepts
out of the output, leave the Field name blank when you specify those
concepts.
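
The routing rules for concept matches (an explicit field name, the default field name, or a blank name that excludes the match) can be expressed as a short Python sketch. The function name and the data shapes are hypothetical. The field names concepts and negativeeffects are taken from the examples in this chapter.

```python
def route_matches(matches, explicit_fields, default_field="concepts"):
    """Group matched text by output field.

    matches: list of (concept name, matched text) pairs.
    explicit_fields: concept name -> field name; a blank name excludes the concept.
    default_field: field for all other concepts; blank excludes them.
    """
    output = {}
    for concept, text in matches:
        field = explicit_fields.get(concept, default_field)
        if not field:
            continue  # blank field name: leave this match out of the output
        output.setdefault(field, []).append(text)
    return output

matches = [("SIDE_EFFECT", "restlessness"), ("EMAIL", "user@example.com"),
           ("LOCATION", "Paris")]
explicit = {"SIDE_EFFECT": "negativeeffects", "EMAIL": ""}
print(route_matches(matches, explicit))
print(route_matches(matches, explicit, default_field=""))
```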
To add concepts to the project, complete these steps:
1. Click Concepts to access the Concepts pane. You can use this pane to
add all of the concepts and contextual extraction concepts. If any of the
LITI concepts include PREDICATE or SEQUENCE rules, these rules are
matched as facts. Access these facts using the Facts pane.

2. Click Add. The Document Processor: content_categorization window
appears. Use this window to specify the settings for each individual
concept. These settings override the specifications for all of the concepts
in the Concepts pane.

3. Click the drop-down arrow in the Concept field to select a concept from
the available projects. For example, select SIDE_EFFECT from the
drop-down menu.

4. (Optional) When you select a concept using Step 3 above, the name of
the selected concept appears in the Field name field.
In this example, the concept SIDE_EFFECT also contains PREDICATE
and SEQUENCE rules. For this reason, SIDE_EFFECT also appears in the
Facts drop-down list. To avoid ambiguity in the search results,
you can choose to rename either the concept or the fact. In this
example, negativeeffects is entered.

5. (Optional) The name of the selected concept appears in the Caption
field after you make a selection in the Concept field. You can enter a
new caption name such as Negative Effects. For more information, see
Section 9.5 Match Categories, Concepts, and Facts on page 246. You can
also use the sample project in Chapter 4: Sample Configurations.

Note: If you change the Field name field, also change the name that
appears in the Caption field.

6. (Optional) By default, % is entered into the Format field for the concept
name. You can also use any of the following symbols:

Table 10-1: Concept Output Format Symbols

Symbol    Description
%c        Output the concept name.
%p        Output the concept name with its path. You can specify %c to
          include the path with the concept name.
%m        Output the text.
%i        Output the information associated with the entity, or the match
          text if no information is available.
%I        Output the information associated with the entity unconditionally.
%%        Output the literal percent sign.
x         Use as a modifier, such as in %xc to request XML escaping.

7. (Optional) By default, ; (semicolon) appears in the Separator field.

You can choose to enter a new separator such as a comma (,).


8. (Optional) Repeat Step 2 on page 282 through Step 7 above until
you have added all of your concepts.


9. Click Ok. If you want to reload the default settings, click Copy
Defaults.


10. See the concepts in the Concepts tab. Make sure that you loaded all of

the concepts that you want to use in your project.

11. (Optional) By default, concepts is entered into the Default field name
field. You can enter a new field name.

12. (Optional) By default, Concepts is entered into the Default caption

field. You can enter a new caption name for facetted search.
13. (Optional) By default, %c: %i is entered into the Default format field

for the concept name. You can edit this entry using any of the symbols
in Table 10-1 on page 284.
14. (Optional) By default, ; (semicolon) appears in the Default separator

field. You can enter a new separator such as a comma (,).


15. (Optional) By default, the highest number of concepts that can be
matched in any single input document is 15. Click the up or down arrow to
change this default selection in the Max concepts field.

Hint: This field specifies the number of concepts that have
the highest numbers of matches. For example,
matches might occur for 25 concepts. However, the
results for this example are displayed only for the 15
concepts with most matches in input documents.

16. Click Finish.
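
To make the concept format symbols more concrete, here is a minimal Python sketch of how a format string such as %c: %i might be expanded for one match. It is an illustration under assumptions, not the product's code: the function name format_concept_match is hypothetical, the path notation Top/SIDE_EFFECT is invented for the example, %I is assumed to yield an empty string when no information exists, and joining results with the separator is also an assumption.

```python
import re
from xml.sax.saxutils import escape


def format_concept_match(fmt, name, path, text, info=None):
    """Expand Table 10-1 style symbols in fmt for a single concept match."""
    values = {
        "c": name,                      # concept name
        "p": path,                      # concept name with its path
        "m": text,                      # matched text
        "i": info if info else text,    # information, or the match text
        "I": info or "",                # information only (assumed empty if absent)
    }

    def expand(match):
        modifier, symbol = match.group(1), match.group(2)
        if symbol == "%":
            return "%"
        value = values[symbol]
        # The x modifier requests XML escaping of the substituted value.
        return escape(value) if modifier else value

    return re.sub(r"%(x?)([cpmiI%])", expand, fmt)


labels = [
    format_concept_match("%c: %i", "SIDE_EFFECT", "Top/SIDE_EFFECT", text)
    for text in ("nausea", "headache")
]
print(";".join(labels))  # SIDE_EFFECT: nausea;SIDE_EFFECT: headache
```

The last line joins two formatted matches with the default semicolon separator, which mirrors how the Format and Separator fields are described as working together.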

10.2.3.E Specify Facts


Facts match two, or more, concepts to provide otherwise overlooked
relationships in input documents. Facts consist of at least two arguments and
are defined by PREDICATE and SEQUENCE rules. If a contextual extraction
concept contains either a PREDICATE or a SEQUENCE rule, these rules are treated
as a fact by SAS Information Retrieval Studio. Facts are automatically
separated from contextual extraction concepts when these concepts are
uploaded to SAS Content Categorization Server. The arguments and the
matched values appear in the query interface.
Matches for any of the facts that you specify explicitly appear in the table at
the top of the Document Processor: content_categorization window. These
matches have the specified format. Matches for any other facts that are not in
the table are assigned the default format. The text of these matches appears in
the Default field name.
You can also choose to exclude facts from matching. For example, exclude all
of the matches that are not specified when you leave the Default field name
empty in the Facts tab. If you want to specify one or more facts to exclude,
leave the Field name blank when you specify the excluded facts.
To add facts to the project, complete these steps:


1. Click Facts to access the Facts pane.


2. Click Add and the Document Processor: content_categorization
window appears.

3. Click the drop-down arrow to select a fact in the Fact field. Facts are
contextual extraction concepts that contain at least one PREDICATE or
SEQUENCE rule. For example, select SIDE_EFFECT from the drop-down menu.

4. (Optional) When you select a fact using Step 3. above, the Field name

field is automatically entered. For example, see sideeffect. Enter a


new name if you choose.
5. (Optional) When you select a fact, the Caption field is automatically

entered. For example, see Side Effect. Enter a new name if you
choose.
6. (Optional) By default, the format for the matched fact is entered into the
Format field. This is the argument string that fills in the document
matches as a label. For example, see the following format:
SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})
In this example, the SIDE_EFFECT concept has two arguments, drug and
sideeffect.

7. You can use the following symbols to edit this field.


Table 10-2: Fact Output Format Symbols

Symbol      Description
%f          Output the fact name.
%a          Output a formatted list of arguments.
            Note: If you do not specify the argument symbol, the
            Argument format field, even when specified, does not apply.
%v{name}    Output the value for a specific argument.
%m          Output the text.
%s          Return the concordance list.
            Note: If you do not specify the concordance, the concordance is
            not returned. This is true even when you specify the
            Concordance type and Surrounding words in the Facts pane.
%%          Output the literal percent sign.
x           Use as a modifier, such as in %xf to request XML escaping.

The following symbols appear in the format string of the
argument for a matched fact.

Symbol      Description
%n          Output the argument name for the arguments that comprise the
            definition.
%v          Output the value for the specified argument.

8. (Optional) By default, %n: %v is entered into the Argument format

field. You can also use any of the symbols in Table 10-2 above.
9. (Optional) By default, a , (comma) appears in the Argument separator
field. Enter a new separator such as a period (.).

10. (Optional) By default, a ; (semicolon) appears in the Separator field.

Enter a new separator such as a hyphen (-).


11. (Optional) Repeat Step 2 on page 288 through Step 10 above until
you add all of your facts.

12. Click Ok. If you want to use the same settings specified in the Facts

tab, click Copy Defaults. The Facts tab appears.

13. (Optional) By default, facts is entered into the Default field name

field. You can enter a new field name.


Note: If you do not change the default entry, facts, in the Default
field name field in the Facts tab, all of the concepts are
matched.

14. (Optional) By default, Facts is entered into the Default caption field.

You can enter a new caption name for facetted search.


15. (Optional) By default, %f(%a) is entered into the Default format field

for the concept name. You can edit this entry using any of the symbols
in Table 10-2 on page 289 with the exception of %v{name}.

Note: Unless you specify %a, no arguments are returned. This is
true even if you make entries in the Default argument
format field.

16. (Optional) By default, %n: %v is entered into the Default argument
format field for the concept name. You can edit this entry using the %n,
%v, %%, and the x modifier symbols in Table 10-2 on page 289.

17. (Optional) By default, , (comma) is entered in the Default separator

field. You can enter a new separator such as a semicolon (;).


18. (Optional) By default, Surrounding words is selected in Concordance
type. Click the drop-down arrow to select Full sentence.
Concordance refers to the surrounding text that is returned with the
match. When you select Full sentence, the Surrounding words field
disappears. If you do not specify the concordance using %s, the
concordance is not returned. This is true even when you specify the
Concordance type and Surrounding words in the Facts pane.

19. (Optional) By default, 10 is selected in the Surrounding words field.
Click the up or down arrow to change this default selection.

20. (Optional) By default, the highest number of facts that can be
matched in any single input document is 15. Click the up or down arrow to
change this default selection in the Max facts field.

Hint: This field specifies the number of facts that have the
highest numbers of matches. For example, matches
might occur for 25 facts. However, the results for this
example are displayed only for the 15 facts with most
matches in input documents.

21. Click Finish to save your selections.
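
The interaction between the Format, Argument format, and Argument separator fields can be sketched as follows. This is a hedged illustration only: the function name format_fact_match is hypothetical, the sample values aspirin and nausea are invented, and the sketch covers only %f, %a, %v{name}, %n, %v, and %% (it omits %m, %s, and the x modifier).

```python
import re


def format_fact_match(fmt, fact, args, arg_fmt="%n: %v", arg_sep=","):
    """Expand a subset of the Table 10-2 symbols for one fact match.

    args maps each argument name to its matched value, in order.
    """
    def render_arg(name, value):
        # %n and %v are only meaningful inside the argument format string.
        table = {"n": name, "v": value, "%": "%"}
        return re.sub(r"%([nv%])", lambda m: table[m.group(1)], arg_fmt)

    # %a is the formatted argument list, joined with the argument separator.
    arg_list = arg_sep.join(render_arg(n, v) for n, v in args.items())

    def expand(match):
        token = match.group(0)
        if token == "%%":
            return "%"
        if token == "%f":
            return fact
        if token == "%a":
            return arg_list
        return args.get(match.group(1), "")  # %v{name}

    return re.sub(r"%%|%f|%a|%v\{(\w+)\}", expand, fmt)


args = {"drug": "aspirin", "sideeffect": "nausea"}  # hypothetical match values
print(format_fact_match("%f(%a)", "SIDE_EFFECT", args))
print(format_fact_match(
    "SIDE_EFFECT(drug: %v{drug}, sideeffect: %v{sideeffect})", "SIDE_EFFECT", args))
print(format_fact_match("%f", "SIDE_EFFECT", args))  # no %a, so no arguments
```

The third call shows the behavior that the notes describe: when the format omits %a, the argument format has no effect on the output.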


10.2.4 Specify Output


After you specify the input and matching requirements, choose how the
matched fields are treated.
1. Click Output to access the Output window.

2. (Optional) If you do not want to enable facetted search, deselect Label
field in the index. (This field applies only to the index.) When you
deselect this check box, you can use the label fields for another purpose.

3. (Optional) If you do not want to enable facetted search, deselect Label
field in the query web server. This field applies only to the query
web server pane. If you are using a custom query interface, you might
select this operation. In this case, the Label field in the query web
server operation is irrelevant.

4. (Optional) If you want to export these fields as files, select File export.
5. (Optional) If you want to export these fields in comma-separated

format, select CSV export. Choose this selection to export your files
into programs such as SAS Text Miner or Microsoft Excel.
6. (Optional) If you want to export these fields to a database through
ODBC, select ODBC export.


7. Click Finish and see the categories field listed in the Document

Processors pane.

10.2.5 Apply content_categorization to Input Documents

After you specify the content_categorization document processor, you can
apply these operations to input documents.
To apply content_categorization to input documents, complete these steps:


1. Click Stop, Apply Changes, and Start to apply changes and to restart

the Document Processor.

2. Begin with the selected crawler and work down through the list of

components clicking Stop, Apply Changes, and Start.


You can also perform any of these operations:

- Click Edit in the Document Processors pane. You can follow any of the
  steps in Section 10.2.1 Access the Projects on SAS Content
  Categorization Server on page 274 through Section 10.2.4 Specify
  Output on page 292.

- Click Move up or Move down to reorder your document processors.

- Select Pipeline Server --> Document Inspector. Click Take
  Snapshot to see the results.

  Hint: The Document Inspector captures the next document
  that is sent to the pipeline server. Click Start in the
  Document Inspector pane (below the Document Processors pane)
  before you start a crawl.

10.3 Seeing the Results in the Query Interface


You can see, and test, the results of the content categorization document
processor when you use the query interface. For comprehensive directions, see
SAS Information Retrieval Studio: Users Guide.
To test the results of the content categorization document processor that you
defined, complete these steps:
1. Select Start --> Program --> SAS Information Retrieval Studio --> Query Interface.

2. Enter a search term into the blank field to the left of the Search button.
3. Click Search.
4. See the results.

11 Configuring the Indexing Server

- Overview of the Indexing Server
- Configure an Index
- Changes That Affect the Indexing Server
- Run the Indexing Server
- Troubleshoot with the Log File

11.1 Overview of the Indexing Server


The indexing server works like the index that is located in the back of a
textbook. The index is a list of unique words and the locations where these
words occur. Unless you specify another application, all of the documents that
are collected by the crawlers are automatically sent to the indexing server.
The unique fields in the index are populated by the data from the input
documents. These fields are specified when you use the Document Processor
windows in the Pipeline Server pane. The various types of fields in the index
are used for different query functionalities.
You can select a language when you build an index. Your language selection
does not prevent documents written in other languages from being indexed,
but it does optimize the index for the selected language. The index matches
only words in the document.
You cannot change the existing index; you can configure only the next index
that is built. For this reason, the Apply
Changes button deletes the current index and configures the new index.
The index can be searched during the build process, or after the index is built.
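
The textbook-index analogy can be illustrated with a small inverted index in Python. This is a generic sketch of the idea (unique words mapped to the locations where they occur), not a description of how the indexing server stores its data. The function name build_index and the two sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each unique lowercased word to {document id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical input documents.
docs = {
    "doc1": "Aspirin can cause nausea",
    "doc2": "Nausea is a side effect",
}
index = build_index(docs)
print(index["nausea"])  # {'doc1': [3], 'doc2': [0]}
```

A lookup returns every document that contains the word together with the word positions, which is the kind of information that positional and counting operators need.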

11.2 Configure an Index


You configure an index in order to specify the fields that are used for search
operations. You also determine how the information that is located in these
fields is stored in the index.
To configure an index, complete these steps:
1. Select Indexing Server --> Configuration.

See the list of field names that are the default selections for the index.
For example, see id, title, date, and so on.

- Click Remove to delete an entry in the current index configuration.
- Click Edit to make changes to the purposes specified for a field.
- Click Add to enter a new field name with its functionality. You can
  enter any field name that is found in any of the input documents. It
  is not necessary for every document to contain each of the specified
  fields.

2. When you click Add or Edit, the Add Field window appears.

For detailed explanations for each of the functionalities that are
available in this window, see Table 11-1 below:
Table 11-1: Field Functionalities

Field Type            Purpose
Searching             (Default) Search for words that match the input query
                      terms. This selection is equivalent to the standard
                      function.
Label                 Select for facetted search, only. This selection is
                      equivalent to marking the field as both standard and
                      Boolean.
                      For more information about facetted search labels, see
                      Section 9.5 Match Categories, Concepts, and Facts on
                      page 246.
Display and Sorting   Sort the results alphabetically, or numerically, instead
                      of by relevancy. The field is returned with the URL of
                      each document in the results list. This choice is
                      equivalent to marking the field as info.
Identification        Identify a document. This field corresponds to marking
                      the field as URL. It is not necessary for this field to
                      contain a standard-compliant URL. However, it is
                      necessary for this field to contain a unique string.
Custom                Choose one, or more, of the following selections:
                      - Standard: Make this field a regular field.
                      - Info: Make this field an information field.
                      - Boolean: Use Boolean, counting, and positional
                        operators.
                      - URL: Specify either a Web address or a unique string.

3. Click the drop-down arrow to select a language for index optimization.
However, the index is built with the returned documents regardless of
language.

4. Use the Add Field window to add, and make changes to, all of the fields

in the index.
5. Click Apply Changes to delete the current index and to set the

configuration for the new index.
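
The equivalences listed in Table 11-1 can be summarized as a set of flags per field type. The following Python sketch is only a way to visualize those equivalences: the FieldFlag class, the flags_for helper, and the assignment of field types to the sample fields are hypothetical and do not reflect the product's configuration format.

```python
from enum import Flag, auto


class FieldFlag(Flag):
    STANDARD = auto()
    INFO = auto()
    BOOLEAN = auto()
    URL = auto()


# Equivalences as stated in Table 11-1.
FIELD_TYPES = {
    "Searching": FieldFlag.STANDARD,
    "Label": FieldFlag.STANDARD | FieldFlag.BOOLEAN,
    "Display and Sorting": FieldFlag.INFO,
    "Identification": FieldFlag.URL,
}


def flags_for(field_type, custom=FieldFlag(0)):
    """Return the flag set for a field type; Custom uses the caller's flags."""
    if field_type == "Custom":
        return custom
    return FIELD_TYPES[field_type]


# Hypothetical index configuration for illustration.
index_config = {
    "id": flags_for("Identification"),
    "title": flags_for("Searching"),
    "categories": flags_for("Label"),
    "date": flags_for("Custom", FieldFlag.INFO | FieldFlag.BOOLEAN),
}
searchable = [name for name, flags in index_config.items()
              if FieldFlag.STANDARD in flags]
print(searchable)  # ['title', 'categories']
```

Seen this way, the Custom type is simply a way to pick any combination of the four underlying flags.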


11.3 Changes That Affect the Indexing Server


If you choose to build an index, many other operations affect the index. For
example, see the following list of operations:
- Starting and stopping a crawler affects the flow of documents to the
  server. For example, if you stop the crawler and then restart it, the same
  documents are collected.

- Some of the document processing operations in the pipeline server
  specify the names of the fields passed to the indexing server. If you
  make a change to one of these document processors while the indexing
  server is running, click Apply Changes.

- If you change the field names, types, and functionalities that you specify
  in the Configuration pane of the indexing server, the index is affected.

Whenever you make a change to any of these operations, the current index is not
affected. These changes can affect only the new index. For this reason, you have
two choices:

- Click the Delete Index button to remove the existing index. A new
  index can be built with the specified changes after you restart the
  crawler.

- Click the Apply Changes button when the indexing server is running.
  The existing index is deleted and the indexing server is restarted so that
  a new index can be built.

For example, if you make changes to fields in the pipeline server, they can
affect the indexing server.


11.4 Run the Indexing Server


By default, the indexing server is running. Configure all of the components
that you plan to use. Click Apply Changes after you modify any of the default
settings for these components. You might also want to delete the existing
index after you make changes, and before end users enter queries.
To start, restart, and stop the indexing server, complete any of these steps:
- To start the server, click Start in the Indexing Server pane.

- To stop the server, click Stop.

- (Optional) If you make any changes that affect the index, click Delete
  Index. This operation removes the old index. For example, if you add a
  title field to the list of indexed fields, a new index might be necessary.

- (Optional) Click Apply Changes if the indexing server is running. This
  operation deletes the old index and restarts the indexing server. A new
  index can be built with this set of changes.

- (Optional) Click Revert to return to the last applied settings.

The appropriate message appears in the Status pane after any of these
operations.

11.5 Troubleshoot with the Log File


The indexing server log pane enables you to locate information about the
operations of the indexing server. Use the contents of the Log pane when you
require customer support.
To access and use the log pane, complete these steps:
1. Click the Log tab in the Indexing Server pane.

2. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.

3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field. For
example, enter ID.


5. Click Find to locate all instances of this term in this pane.
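
If you need to inspect a saved copy of a log outside the interface, the Retrieve and Find steps above can be approximated with a short script. This is a sketch under assumptions: the helper names retrieve_lines and find_term are hypothetical, the search is assumed to be case-insensitive, and the demo writes a throwaway file instead of reading a real server log, because the log location is not stated here.

```python
import os
import tempfile
from collections import deque


def retrieve_lines(path, number_of_lines=20):
    """Return the last number_of_lines lines of a log file."""
    with open(path, encoding="utf-8", errors="replace") as log:
        # A deque with maxlen keeps only the newest lines while streaming.
        return [line.rstrip("\n") for line in deque(log, maxlen=number_of_lines)]


def find_term(lines, term):
    """Return (position, line) pairs for the lines that contain term."""
    needle = term.lower()
    return [(i, line) for i, line in enumerate(lines) if needle in line.lower()]


# Demo on a throwaway file that stands in for a server log.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "example.log")
    with open(log_path, "w", encoding="utf-8") as log:
        for n in range(1, 31):
            suffix = f" ID={n}" if n % 10 == 0 else ""
            log.write(f"entry {n}{suffix}\n")
    lines = retrieve_lines(log_path)   # newest 20 of the 30 lines
    hits = find_term(lines, "ID")

print(len(lines), hits)
```

The default of 20 lines mirrors the default in the Number of lines field.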


12 Configuring the Query Server

- Overview of the Query Server
- Run the Query Server
- Troubleshoot with the Log File

12.1 Overview of the Query Server


The query server serves the queries that it receives from the query web server
to the index. The query server then returns the matched documents from the
index to the query web server. The query web server displays these matches to
the end user according to the parameters that you specify. The query server is
merely the conduit that passes queries and results and logs these interactions
with both servers.

12.2 Run the Query Server


The query server uses the index built by the indexing server to locate matching
documents in response to queries. Use the query web server or specify a
custom application that you write using the Query API, to pass queries to the
query server.
By default, the query server is running. Configure all of the components that
you plan to use in SAS Information Retrieval Studio. Click Apply Changes
after you modify any of the default settings for these components. The query
server does not require updates.
To start and stop the query server, complete any of these steps:

- To start the server, click Start in the Query Server pane.
- To stop the server, click Stop.

The appropriate message appears in the Status pane after both the start
and stop operations.

12.3 Troubleshoot with the Log File


The log pane enables you to see information about the query processing
operations of the query server. Use the contents of the Log pane when you
require customer support.
To use this log pane, complete these steps:
1. Select Query Server --> Log.

2. Click the up or down arrow to select a new Number of lines value. The
default setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to
highlight field. For example, enter INDEX.

5. Click Find to display this text in the log file.


13 Configuring the Query Web Server

- Overview of the Query Web Server
- Choosing How Search Returns Are Displayed
- Configure the Query Web Server
- Run the Query Web Server
- Troubleshoot with the Log File

13.1 Overview of the Query Web Server


Use the query web server to specify the fields that are searched when an end
user enters a query, and how this information is matched and prioritized. The
links to the matches are displayed according to the selections that you choose.
For example, you can specify the URLs, the text that is displayed, and what
fields in the input document are searched to return this text.
You can also specify labels to make facetted search possible. Facetted search
enables users to locate the information that they seek by moving in an
intuitive, instead of a linear, progression. These labels appear in a
hierarchical, or flat, layout to the left of the search returns, which are
displayed in list format on the right.
You can also choose the display settings and design the query window for your
end users. When you use the query web server to specify the look and feel of
the search window, choose the banner, colors, and other components for this
window. You can also access the search window through the link provided in
the Status pane of the query web server window.

Display 13-1 SAS Information Retrieval Studio Search Window

13.2 Choosing How Search Returns Are Displayed

13.2.1 Displays with or without Labels
You can customize the way that end users see and navigate the matches that
are located for their input query terms. These display selections enable the end
user to navigate the returned documents and to optimize search within the
returns to locate the results that they seek.
To specify how labels are displayed for categories, concepts, and facts,
complete these steps:

1. Select Query Web Server --> Configuration --> Labels.

2. Click Add and the Add Field window appears.

3. Click the drop-down arrow in the Hierarchical field to choose a
hierarchical, nonhierarchical, or a flattened display of the labels. In this
example, Yes is selected to enable a hierarchical display of the categories.

4. Make any other changes and click OK to see this selection in the Labels
pane.
See the following examples that include search windows that do, and do not,
display labels.

13.2.2 No Labels Example


If you choose to use no labels, search results are displayed in list format. To
see the matched document, select the blue hyperlink that appears to the right
of the number in the ordered list. (Returns are ordered according to the
specifications that you select in the Sorting pane of the Query Web Server.)
Display 13.1 No Labels

13.2.3 Hierarchical Labels Example


If you specify a hierarchical ordering of labels, the matching sections of
the taxonomy for categories are displayed. This is a matched portion of the same
taxonomy that appears in the Taxonomy pane of the SAS Content
Categorization Studio user interface. You can also choose to see a count of the
matching documents.

Note: Concepts and facts do not have a parent-child
relationship. For this reason, this specification does not
work for concepts and facts.

Display 13.2 Hierarchical Taxonomy of Matched Labels

You can click the left mouse button on a hyperlink label to make one of the
following selections:
Require

the path to the selected label appears below the search box. The displayed
documents match both the query term and the selected label. If you specify
more than one label, the documents match the query term and the selected
labels. In this case each path is appended with a plus sign (+).
Exclude

one label, preceded by the minus sign (-) appears in the SAS Information
Retrieval Studio search window. The displayed documents match the
query term, but not the selected labels.
View

one label appears in the SAS Information Retrieval Studio search window
that displays all of the matching documents for this label, only. Bolded
matches for existing query terms no longer appear below the document
links on the right side of the search window.


Remove

this operation is available for the label, or path, appearing at the top of the
SAS Information Retrieval Studio search window. This selection is
available only after you use any of the above operations.
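
The effect of the Require and Exclude selections on a result list can be sketched as set operations on each document's labels. This is an illustration of the described behavior, not the query web server's implementation. The function name filter_by_labels and the sample labels (modeled on a cars example) are hypothetical.

```python
def filter_by_labels(documents, required=(), excluded=()):
    """Return ids of documents with every required label and no excluded label."""
    required, excluded = set(required), set(excluded)
    return [
        doc_id
        for doc_id, labels in documents.items()
        if required <= set(labels) and not excluded & set(labels)
    ]


# Hypothetical documents that already match the query term.
results = {
    "doc1": ["Cars/Parts", "Cars/Repair"],
    "doc2": ["Cars/Parts"],
    "doc3": ["Cars/Antique"],
}

print(filter_by_labels(results, required=["Cars/Parts"]))
print(filter_by_labels(results, required=["Cars/Parts"], excluded=["Cars/Repair"]))
print(filter_by_labels(results))  # all selections removed
```

View differs from Require in that it drops the query term and shows every document for the label, so it would run the same filter over the whole index instead of over the current results.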

13.2.4 No Hierarchical Display Example


If you select No for the hierarchical display selections of the related taxonomy,
the matched categories are displayed with slashes (/). These slashes indicate
the paths, or parent-child relationships that exist in the SAS Content
Categorization Studio taxonomy that they match.
The hierarchical view does not work for concepts and facts because
they do not exist in a parent-child relationship.
Display 13.3 No Hierarchy


13.2.5 Flattened Hierarchical Example


If you select a flattened display of the related taxonomy, the matched
categories are displayed. However, the full path to that category does not
appear.
Note: You can also use the Require, Exclude, View, and
Remove operations. For more information, see Section
13.2.3 Hierarchical Labels Example on page 312.
Display 13.4 Flattened Hierarchy

Hover the mouse over a category or concept to see the hierarchy, or
parent-child relationships, existing in SAS Content Categorization Studio.
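
The three display choices differ only in how a category path is rendered. The following Python sketch shows one way to picture the difference. The function name render_labels, the mode names, the indentation style, and the sample paths are all hypothetical.

```python
def render_labels(paths, mode):
    """Render category paths as hierarchical, no_hierarchy, or flattened lines."""
    if mode == "no_hierarchy":
        # Full paths, with slashes that show the parent-child relationships.
        return sorted(paths)
    if mode == "flattened":
        # Only the matched category; the path to it is not shown.
        return sorted({path.rsplit("/", 1)[-1] for path in paths})
    if mode == "hierarchical":
        lines, seen = [], set()
        for path in sorted(paths):
            parts = path.split("/")
            for depth in range(len(parts)):
                prefix = tuple(parts[: depth + 1])
                if prefix not in seen:  # print each taxonomy node once
                    seen.add(prefix)
                    lines.append("  " * depth + parts[depth])
        return lines
    raise ValueError(f"unknown mode: {mode}")


paths = ["Cars/Parts", "Cars/Repair/Brakes"]  # hypothetical matched categories
for mode in ("hierarchical", "no_hierarchy", "flattened"):
    print(mode, render_labels(paths, mode))
```

In the flattened case the path information is not lost, only hidden, which is consistent with the hover behavior described above.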


13.3 Configure the Query Web Server


13.3.1 Overview of Configuring the Query Web Server
You configure the query web server using the configuration sets in each pane
of the Configuration tab.
Use this pane to specify the type of search and how results are sorted. If you
enable facetted search, you also specify the captions that appear as label
names. Choose the fields where matches can be located and design the
appearance of the SAS Information Retrieval Studio search window.
This section is set up as a how-to guide, but it also contains the background
information that is necessary for each tab.
Display 13.5 Query Web Server Configuration Pane

13.3.2 Specify the Server Port


The Server Port field displays the number of the port where the query web
server is running. The default is 9100.
Display 13.6 Query Web Server Port

To change the query web server port, click the up or down arrow to select a
new port number that is not already in use.
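
Before you choose a new port, you can check from the server machine whether the port is already taken. The following Python sketch uses the standard socket module. The function name port_is_free is hypothetical, and the check covers only the loopback address by default, so treat the result as a hint, not a guarantee.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


print(port_is_free(9100))  # 9100 is the documented default port
```

A result of False for 9100 on a machine where the query web server is running is expected, because the server itself holds the port.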

13.3.3 Specify How Matching Is Performed


13.3.3.A Match Types
Use the Matching pane to specify how the terms that match the query
are located. Each document is indexed as a set of field-value pairs.
When you leave the default selection, Simple, selected, you can specify the
fields that are searched in the index.
There are two types of searches that you can specify for your end users:
Simple (fsearch)

specify the query fields and enable end users to mark words and quoted
phrases as required or excluded by prefixing them with plus (+) and minus (-)
signs. When you make this selection, you specify the indexing fields that are
searched.
Advanced (bsearch)

enable end users to specify the query fields. Query terms can be combined
when you specify the following operators:
- Boolean operators such as AND, OR, and NOT add precision to your search.

- Positional words such as SENT and PAR specify that matches are located
  only if the specified words appear in the same sentence or paragraph,
  respectively.

- Counting words such as MINOC_n total the number of matched words. A
  match occurs only when there are at least this number (n) of
  matching words in the input document.
When you specify Advanced search, you do not select any index fields.
Instead, you choose the sorting and weights for matches.
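
To illustrate the Simple search prefixes, here is a minimal Python sketch that splits a query into required, excluded, and optional terms. It is not the fsearch parser. The function name parse_simple_query is hypothetical, and the sketch assumes that plus marks a required term and minus marks an excluded term.

```python
import re

# An optional sign, followed by either a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')


def parse_simple_query(query):
    """Split a query into required, excluded, and optional terms."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for sign, phrase, word in TOKEN.findall(query):
        term = phrase or word
        key = {"+": "required", "-": "excluded"}.get(sign, "optional")
        parsed[key].append(term)
    return parsed


print(parse_simple_query('+aspirin -"stomach ache" nausea'))
```

Quoted phrases are kept together as one term, so a sign in front of a phrase applies to the whole phrase.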

13.3.3.B Select a Match Type


To determine the match type, complete these steps:
1. Select Query Web Server --> Configuration --> Matching.

2. Leave the default selection, Simple search, or click the drop-down arrow
to select Advanced.

3. Click Add and the Add Field window appears.

4. Leave the default selection such as title. Click the drop-down arrow to
select a different field. Each of these fields is listed in the Configuration
pane of the indexing server. In other words, the fields that appear in this
drop-down menu are also in the index.

5. Click the up or down arrow to select a new Weight. For example, choose 5 to
weight matches that are located in the body field more heavily than
those in the title field. The weight setting is relative across all fields.

6. Click OK to see this field in the Configuration pane.


7. Click Remove in the Matching pane to delete a field.
8. Click Edit to make a change in one of the fields.
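
Because the weight setting is relative across all fields, only the ratio between weights matters. The following Python sketch shows that idea with a simple normalized sum. The scoring formula and the function name weighted_field_score are assumptions made for illustration; the product's actual relevancy calculation is not described here.

```python
def weighted_field_score(hits_per_field, field_weights):
    """Combine per-field hit counts using relative field weights."""
    total_weight = sum(field_weights.values())
    if total_weight == 0:
        return 0.0
    weighted_hits = sum(
        field_weights.get(field, 0) * hits
        for field, hits in hits_per_field.items()
    )
    # Dividing by the total makes the weights relative to each other.
    return weighted_hits / total_weight


weights = {"title": 1, "body": 5}
title_only = weighted_field_score({"title": 1, "body": 0}, weights)
body_only = weighted_field_score({"title": 0, "body": 1}, weights)
print(title_only, body_only)
```

With these weights, one match in the body field counts five times as much as one match in the title field. Doubling both weights would leave the scores unchanged.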

13.3.4 Specify How Matches Are Sorted


Matches are sorted by date, field values, the number of matching terms or
fields, or by relevancy. You can affect relevancy when you specify a weight
for a field that ranks matches on one field higher than those on another field.
For more information, see Section 13.3.3 Specify How Matching Is Performed
on page 317.
The following set of steps provides information about sorting by relevancy. To
use the other selections that are available in this pane, see Section 2.11.4.C
The Sorting Tab on page 53.
To specify how matches are sorted, complete these steps:


1. Select Query Web Server --> Configuration --> Sorting.

2. Leave the default selection Relevancy, or click the drop-down arrow to
choose a different selection in the Sort type drop-down menu. The selection
that you make in this field determines the fields that are displayed below
this field.

3. Specify the following weights according to your values. In other words,
if the density weight is more important than any of the other weights,
specify the highest weight number for this field:

a. (Default selection is 1) Click the up or down arrow to select a new Cosine
Weight. This metric assigns the highest weights to the most
frequently occurring terms. It takes noise words into consideration.
(Noise words are the words that appear with enough frequency
that they are ranked down.)

b. (Optional) Click the arrow to add Proximity Weight to the relevancy
metric. Specify higher numbers when multiple query terms appear
close to each other in a document.

c. (Optional) Click the arrow to add Position Weight to the relevancy
metric. Specify higher weights for query terms that are matched at
the beginning of a document.

d. (Optional) Click the arrow to add Density Weight to the relevancy
metric. Density measures the proximity of matches as a percentage
of the document size.

e. (Optional) Click the arrow to add Freshness Weight to the relevancy
metric. Freshness enables you to combine date sorting with the
other measures.

4. Click Apply Changes before you select another pane.
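
The guide does not state how the five weights are combined, so the following is only a sketch of one common approach: a weighted average of per-metric scores. The function name combined_relevancy, the assumption that each metric is scaled to the range 0 to 1, and the sample values are all hypothetical.

```python
def combined_relevancy(component_scores, weights):
    """Weighted average of the component scores (each assumed to lie in 0..1).

    A metric with weight 0 does not contribute; a missing score counts as 0.
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * component_scores.get(name, 0.0)
                       for name in weights)
    return weighted_sum / total_weight


# Hypothetical configuration in which density matters most.
weights = {"cosine": 1, "proximity": 0, "position": 0,
           "density": 3, "freshness": 0}

# Hypothetical per-document metric values.
doc_scores = {"cosine": 0.4, "density": 0.8}

score = combined_relevancy(doc_scores, weights)
print(round(score, 3))
```

With these weights, a change in the density score moves the result three times as far as the same change in the cosine score.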

13.3.5 Specify Labels for Facetted Search


Labels cluster matching documents. Labels apply the matched categories and
concepts that occur most frequently in the matched documents. You can see a
general label that uses the caption that you specify in the search window.
Beneath this label, you can see a taxonomy, or list, of matched categories and
concepts.
You choose whether to use categories, concepts, or both. You make this choice
in the Document Processor window of the pipeline server.
You specify labels when you want to enable facetted search in the search
window for your end users. Facetted search enables your users to intuitively
search and locate documents that match the input word. For example, if a user
enters the word cars, all of the documents that match cars are returned.
Related labels such as car parts, car repair, and antique cars might also
appear. These labels, or categories and concepts, enable the end user to see
related terms that are also matched in the returned documents.
Use the Labels pane to specify the caption for the categories and concepts that
users see. For categories, the caption replaces the Top node that you see in the
SAS Content Categorization Studio Taxonomy pane. You specify a caption for
each concept, or SAS Information Retrieval Studio uses the name of the
concept, by default. These captions are applied to index fields in order to
rename them with user-friendly text. For more information about the fields to
labels process, see Section 3.7 Defining Labels for Facetted Search on
page 163.


Note: Only index fields where the specified functionality is Label can be
accessed in this window and can have a caption.

To specify labels, complete these steps:

1. Select Query Web Server --> Configuration --> Labels.

2. Click Add and the Add Field window appears.

3. Leave the default field selection in Field name. For example, if you
selected categories as the field with Label functionality in the
Indexing Server pane, you can leave this default selection. The only
selections that are available in this field are those with the Label
functionality.

4. Enter a new name for the label into the Caption field. For example,
enter Categories, which begins with an uppercase letter.

5. Leave the default selection No in the Hierarchical field or
click the drop-down arrow to select either Yes or Flattened. To see the types
of results that are displayed for these selections, see Section 13.2 Choosing
How Search Returns Are Displayed on page 310.

6. By default, Yes is selected in the Display counts field. Click the
drop-down arrow to select No. If you select No, the numbers of matching
documents do not appear to the right of the labels in the SAS Information
Retrieval Studio search window.

7. Click the up or down arrow to select a number of matching fields in the
Show in matches field. For example, choose to display the three categories
with the highest number of matches in the SAS Information Retrieval
Studio search window. (If there are more than the specified number of
fields, the term and other information appears. This term is appended to
the list to indicate that the display is incomplete.)

8. Click Move Up to relocate a match on this field when it is displayed in
the search results window.

9. Click Move Down to relocate a match on this field when it is displayed in
the search results window.

10. Click the up or down arrow to change the Maximum number of related
labels to display.

11. Click OK to save these settings.

12. Click the link in the Status tab to see the results of your changes in the
search window.
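
The way label counts and a display limit interact can be sketched as follows. This is an illustration only, not the product's implementation: the function name top_labels, the shape of the matched documents, and the sample categories are assumptions.

```python
from collections import Counter


def top_labels(matched_docs, field, show=3):
    """Return the `show` most frequent labels with their document counts.

    The second return value is True when more labels exist than are shown,
    which is when an "incomplete list" indicator would be appended.
    """
    counts = Counter(label
                     for doc in matched_docs
                     for label in doc.get(field, []))
    ranked = counts.most_common()
    return ranked[:show], len(ranked) > show


# Hypothetical documents that matched the query "cars".
matches = [
    {"categories": ["car parts", "car repair"]},
    {"categories": ["car parts", "antique cars"]},
    {"categories": ["car parts", "car repair", "racing"]},
    {"categories": ["car parts", "car repair", "antique cars"]},
]

labels, truncated = top_labels(matches, "categories", show=3)
for label, count in labels:
    print(f"{label} ({count})")
print("list is incomplete" if truncated else "list is complete")
```

Setting Display counts to No would correspond to printing only the label text without the number in parentheses.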


13.3.6 Specify the Formatting for the Matches


You can choose how matched information appears to the end user. For
example, use the Match Formatting pane to specify how the links and fields
for matching documents are displayed in the SAS Information Retrieval
Studio search results window. This pane also enables you to specify the
sources and the allowed prefix and suffix for the displayed links.
To specify the formatting for the matching documents returned to queries,
complete these steps:
1. Select Query Web Server --> Configuration --> Match Formatting.

2. (Default is Text field) Click the drop-down arrow to select HTML field in
the Title source field. This field identifies the location of the title in the
input document. Select None if you choose not to display the document title.
In this case, the Title field disappears and No Title is displayed for
each match.

3. (Default is title) Click the drop-down arrow in the Title field to select
from the info fields in the index. This is the field where the document title
can be located.

4. (Default is Yes) Click the drop-down arrow to select No in the Use filename
when document has no title field. Use this selection to generate a title for
the document when the title field in an input document is empty.

5. (Default is Concordance) Click the drop-down arrow to select Text field or
HTML field in the Abstract source field. Use these fields to locate the type
of field where the summary of the input document is located. If you
select None or Concordance, the Abstract field disappears.
Select Concordance to enable hit highlighting. Matched query terms in
an input document appear in bold.

6. (Default is Title) Click the drop-down arrow to select another field in the
Abstract field. This is the field where the abstract can be located.

7. (Default is URL) Click the drop-down arrow to select None or Text field in
the Link Source field. This field specifies the link to the matched document.

8. Enter a string to prepend to the URL, before it is displayed, into the
Link prefix field. Specify how to modify the prefix of the URL at
display time for the purposes of passing an argument from your own
CGI script.

9. Enter a string to append to the URL, before it is displayed, into the
Link suffix field. Specify how to modify the suffix of the URL at
display time for the purposes of passing an argument from your own
CGI script.

For example, the link field might contain the unique identifier 12345.
However, the browser does not understand this string. In this case, set
the link prefix to http://host/script?id= and the link suffix to
&format=html. The browser now sees the link as
http://host/script?id=12345&format=html. The only other requirements are that
the CGI script exists and that the browser can render this type of ID.

10. (Default is Yes) Click the drop-down arrow to select No in the Add
keywords to PDF links field. When you leave the default selection Yes, you
instruct Adobe Reader to highlight the search terms in an input .pdf
document. This operation functions like a concordance, but works for
the entire document, not only the abstract in the results list.

11. (Default is None) Click the drop-down arrow to select Text field in the
MIME type source field. This field is used to locate the type of the input
document.

12. (Default is None) Click the drop-down arrow to select Date or Text field
in the Date source field. This field is used to locate the creation date of the
matched document.
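
The link prefix and suffix example in step 9 amounts to plain string concatenation at display time. The sketch below reproduces it; the function name display_link is hypothetical, and the sketch assumes that no URL encoding is applied to the stored value.

```python
def display_link(link_value, prefix="", suffix=""):
    """Build the link shown to the user from the stored link field value."""
    return f"{prefix}{link_value}{suffix}"


# Values taken from the example in the text.
url = display_link("12345",
                   prefix="http://host/script?id=",
                   suffix="&format=html")
print(url)  # http://host/script?id=12345&format=html
```

Leaving both the prefix and the suffix empty returns the stored value unchanged, which is appropriate when the link field already holds a complete URL.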


13.3.7 Specify the Theme of the Search Window


13.3.7.A Theme Overview
There are three parts to the Theme pane. The first four fields provide the
specifications for the matched documents. The Colors pane enables you to
select the colors for the search window. The Images pane enables you to
specify the images that replace the existing title bar.
Display 13.7 SAS Information Retrieval Studio Search Window

Determine the look and feel of the SAS Information Retrieval Studio search
window.
To specify the theme of the search window, complete these steps:
1. Select Query Web Server --> Configuration --> Theme.

2. Leave the default selection SAS Information Retrieval Studio, or
enter a new name into the Title field.

3. Leave the default selection sans-serif, or enter a new display font into
the Font field.

4. Leave the default selection 10, or click the up or down arrow to select a
new size for the display letters in the Font size field. For example, choose
12 to display the search returns in a larger font size.

5. (Default is Yes) Click the drop-down arrow to select No in the Use popup
menus field. Select No when you want to disable pop-up menus for older
browsers that cannot use JavaScript.

6. Click the link in the Status tab to see the results of your changes in the
search window.


13.3.7.B Specify the Colors of the Search Window


Determine the colors that are displayed in the SAS Information Retrieval
Studio search window. For example, change the background color and the
visited link colors to appear in red.
Display 13.8 New Colors Specified for the Search Window

For more information about formatting the search window, see
http://www.w3.org/TR/CSS/ui.html#system-colors.


To specify the colors in the search interface, complete these steps:


1. Select Query Web Server --> Configuration --> Theme --> Colors.

2. Leave the default selection Custom in the Header background color
field. Click the drop-down arrow to select the location that you want to
color. You can also access the color box window and select a color
such as red.

Note: For more information about the Color Box window, see
Section 2.14.27 The Color Box Window on page 151.

3. Leave the default selection Custom in the Header text color field.
Click the drop-down arrow to select ActiveBorder, ActiveCaption, AppWorkspace,
Background, and so on. You can also access the color box window to select
another color.

4. Leave the default selection Custom in the Link color field.
Click the drop-down arrow to select ActiveBorder, ActiveCaption, AppWorkspace,
Background, and so on. You can also access the color box window and select a
color such as red.

5. Leave the default selection Custom in the Visited link color field.
Click the drop-down arrow to select ActiveBorder, ActiveCaption, AppWorkspace,
Background, and so on. You can also access the color box window and select
another color.

6. Leave the default selection Custom in the Hover link color field.
Click the drop-down arrow to select ActiveBorder, ActiveCaption, AppWorkspace,
Background, and so on. You can also access the color box window and select a
color such as red.

7. Leave the default selection Window in the Menu border color field.
Click the drop-down arrow to select ActiveBorder, ActiveCaption, AppWorkspace,
Background, and so on.

8. Leave the default selection Window in the Menu unselected
background color field. Click the drop-down arrow to select ActiveBorder,
ActiveCaption, AppWorkspace, Background, and so on. You can also access the
color box window and select a different color.

9. Leave the default selection Custom in the Menu unselected text
color field. Click the drop-down arrow to select Custom, ActiveBorder,
ActiveCaption, AppWorkspace, Background, and so on.

10. Leave the default selection GrayText in the Menu selected
background color field. Click the drop-down arrow to select Custom,
ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.

11. Leave the default selection HighlightText in the Menu selected
text color field. Click the drop-down arrow to select Custom, ActiveBorder,
ActiveCaption, AppWorkspace, Background, and so on.

12. (Optional) Click Reset to Default to revert to the standard SAS
Information Retrieval Studio settings.

13. Click Apply Changes before you select another pane.

14. Click the link in the Status tab to see the results of your changes.


13.3.7.C Load New Images into the Search Window


You can upload images, or borders, into the search window that your end users
see. Before you use the steps below, make sure that you load your images into
the work/query-web-server subdirectory of your installation directory.
Note: PNG, JPEG, and GIF images are all supported.

To upload images or borders, complete these steps:


1. Select Query Web Server --> Configuration --> Theme --> Images.

2. Leave the default selection None in the Left header image field.
Click the drop-down arrow to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory.

3. Leave the default selection sas.png in the Right header image field.
Click the drop-down arrow to select one of the images that you loaded into the
work/query-web-server subdirectory of your installation directory.

4. Click Apply Changes.

5. Click the link in the Status tab to see the results of your changes in the
search window.
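
Before you open the Images pane, it can help to confirm which files in the image directory have a supported format. The sketch below is an assumption-laden helper, not part of the product: the function name selectable_images is hypothetical, the installation path in the usage line is a placeholder, and the file-extension check is only an approximation of what the drop-down menu is likely to list.

```python
from pathlib import Path

# Extensions for the formats that the text lists as supported.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def selectable_images(install_dir):
    """List supported image files in <install_dir>/work/query-web-server."""
    image_dir = Path(install_dir) / "work" / "query-web-server"
    if not image_dir.is_dir():
        return []
    return sorted(p.name for p in image_dir.iterdir()
                  if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES)


# Placeholder path; substitute your own installation directory.
print(selectable_images("/path/to/installation"))
```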


13.4 Run the Query Web Server


By default, the query web server is running. Configure all of the components
that you plan to use. Click Apply Changes after you modify any of the default
settings for these components.
To start, restart, and stop the query web server, complete any of these steps:
- Click Start in the Query Web Server pane. The appropriate message appears
in the Status pane after any of these operations.

- Click the link to the machine where the query web server is running to
see the SAS Information Retrieval Studio search window.

- (Optional) If you make any changes to the configuration while the
indexing server is running, click Apply Changes.

- (Optional) Click Revert to return to the last applied settings.

- To stop the server, click Stop.

13.5 Troubleshoot with the Log File


The Log pane enables you to see information about the queries entered by an
end user in the search pane. Use the contents of the Log pane when you require
customer support.
To use this Log pane, complete these steps:

1. Select Query Web Server --> Log.

2. Click the up or down arrow to select a new Number of lines. The default
setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.

4. (Optional) Enter the terms that you want to locate in the Text to
highlight field. For example, enter query.

5. Click Find to display this text in the log file.
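
The Retrieve and Find operations behave like a tail of the log followed by a text search. The following sketch mimics that behavior outside the product. The function names, the sample log lines, and the case-insensitive comparison are assumptions; the actual log format is not described in this guide.

```python
from collections import deque


def retrieve_lines(log_lines, number_of_lines=20):
    """Return the last `number_of_lines` lines, like the Retrieve button."""
    return list(deque(log_lines, maxlen=number_of_lines))


def find_lines(lines, text):
    """Return the positions of the lines that contain `text`.

    The comparison ignores case, which is an assumption of this sketch.
    """
    needle = text.lower()
    return [i for i, line in enumerate(lines) if needle in line.lower()]


# Hypothetical log content.
log = [f"line {n}: server heartbeat" for n in range(1, 31)]
log[27] = "line 28: received query 'sas'"

visible = retrieve_lines(log, number_of_lines=5)
print(visible)
print(find_lines(visible, "query"))
```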


14 Configuring the Query Statistics Server

- Overview of the Query Statistics Server
- Run the Query Statistics Server
- View the Query Statistics for a Selected Time Period
- After You View the Query Data
- Troubleshoot with the Log File

14.1 Overview of the Query Statistics Server


The query statistics server monitors the queries that end users enter into the
SAS Information Retrieval Studio search window. The query statistics server
tracks and displays information such as the most frequent query terms, query
terms that did not return matching documents, and other time-related
information. This server also provides the query analytics that enable you to
troubleshoot or to see numbers that interpret the flow of traffic through the
SAS Information Retrieval Studio search window.
You can select the time periods and types of data that you want to see when
you use the panes in this window. See the most frequent queries or monitor the
flow of traffic hour by hour. Some choices provide access to all of the tabs in
this pane. Other selections limit the panes that you can see to those that
apply to your selection.

Display 14-1 Query Statistics Pane

14.2 Run the Query Statistics Server


By default, the query statistics server is running.
To start and stop the query statistics server, complete any of these steps:

- Click Start in the Query Statistics Server pane. The appropriate message
appears in the Status pane after either of these operations.

- To stop the server, click Stop.

14.3 View the Query Statistics for a Selected Time Period

14.3.1 Overview of Time Period Views

By default, the Query Statistics pane displays the four periods of time that you
can use to see related analytics. Use the Today, This Month, This Year, and
All Time buttons in this tab to select a time period. You can then select one of
the following panes: Most Frequent Queries, Most Frequent Queries Without
Matches, Hourly Query Rate, Daily Query Rate, or Monthly Query Rate.
The availability of these panes depends on the time period button that you
click. For example, when you click All Time, you see all of the available tabs.
If you click Today, the first three tabs are available.

14.3.2 See the Statistics for Today


When you click Today, the Most Frequent Queries, Most Frequent
Queries Without Matches, and the Hourly Query Rate tabs remain
accessible.
To see the query statistics for searches performed today or yesterday, complete
these steps:
1. Select Query Statistics Server --> Query Statistics.

2. Click Today to see the screen shown above. Today's date is displayed
by Year, Month, and Day. For example, 2010, 9, 15.

3. (Optional) Click Previous to see the date assigned to yesterday and to
see these results. For example, 2010, 9, 14.

4. Click the Most Frequent Queries tab to see the query terms and the
number of times that these words were entered into the search window.

5. See the query with the highest number of entries at the top of the list
under Query. For example, see the word sas.

6. See the total count for the number of entries under Number of
Occurrences. For example, sas was entered 73 times.

7. Click the Most Frequent Queries Without Matches tab to see the
search terms that are not located.

8. See any search terms that were not matched by the searched corpus
under Query. For example, see the term produc.

Hint: In this example, this term is also listed in the Most
Frequent Queries pane.

9. Under Number of Occurrences, see the number of times this search
term was input by end users. For example, produc was entered one time.

10. Click the Hourly Query Rate tab to see the query traffic over the
current 24-hour period.

11. See each Hour and the Number of Queries. For example, see 8 am,
17 and 9 am, 12.
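
The three Today tabs correspond to simple counts over the day's queries. The sketch below shows those counts on a made-up record list. The record layout (timestamp, query text, number of matches) and the function names are assumptions, because the guide does not describe how the query statistics server stores its data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (timestamp, query text, number of matches returned).
records = [
    (datetime(2010, 9, 15, 8, 5), "sas", 12),
    (datetime(2010, 9, 15, 8, 40), "sas", 12),
    (datetime(2010, 9, 15, 9, 10), "produc", 0),
    (datetime(2010, 9, 15, 9, 30), "sas", 12),
]


def most_frequent(records):
    """Queries and their counts, most frequent first."""
    return Counter(query for _, query, _ in records).most_common()


def most_frequent_without_matches(records):
    """Same count, restricted to queries that returned no documents."""
    return Counter(query for _, query, hits in records
                   if hits == 0).most_common()


def hourly_rate(records):
    """Number of queries per hour of the day, in hour order."""
    return sorted(Counter(ts.hour for ts, _, _ in records).items())


print(most_frequent(records))
print(most_frequent_without_matches(records))
print(hourly_rate(records))
```

As in the Hint above, a query without matches (produc) appears in both of the first two lists.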

14.3.3 See the Statistics for This Month


When you click This Month, the Most Frequent Queries, Most Frequent
Queries Without Matches, Hourly Query Rate, and the Daily Query Rate
tabs remain accessible.
To see the query statistics for searches performed this month, complete these
steps:


1. Select Query Statistics Server --> Query Statistics.

2. Click This Month.

3. See the screen shown above. The date for this month is displayed by
Year and Month. For example, 2010 and 9.

4. Use Step 3 on page 342 through Step 11 on page 344.

5. Click Daily Query Rate.

6. See each Day of the week and the Number of Queries input for that
day. For example, see Wednesday, 74.


14.3.4 See the Statistics for This Year


When you click This Year, all of the tabs remain accessible.
To see the query statistics for searches performed this year, complete these
steps:
1. Select Query Statistics Server --> Query Statistics.

2. Click This Year.

3. See the date that is displayed in year format in Year. For example, 2010.

4. Use Step 3 on page 342 through Step 11 on page 344.

5. Use Step 5 through Step 6 on page 345 for the Daily Query Rate tab.

6. Click Monthly Query Rate to see the total number of queries that were
input during each month of the selected year.

7. See each Month and the Number of Queries for that month. For
example, see September, 97.
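
The Daily Query Rate and Monthly Query Rate tabs group queries by weekday name and by month name. A minimal sketch of that grouping follows. The function names and sample timestamps are hypothetical, and the product's own aggregation might differ.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def daily_rate(timestamps):
    """Count queries per weekday name, across all weeks in the period."""
    return Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)


def monthly_rate(timestamps):
    """Count queries per month name."""
    return Counter(MONTHS[ts.month - 1] for ts in timestamps)


# Hypothetical query timestamps.
timestamps = [
    datetime(2010, 8, 2, 12, 0),
    datetime(2010, 9, 14, 10, 0),
    datetime(2010, 9, 15, 8, 0),
    datetime(2010, 9, 15, 9, 0),
]

print(daily_rate(timestamps))
print(monthly_rate(timestamps))
```

Because the daily count is keyed by weekday name, all Wednesdays in the selected period are added together.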


14.3.5 See the Statistics for All Time


When you click All Time, all of the tabs remain accessible.
To see the query statistics for searches performed from the time when your end
users began to query until now, complete these steps:
1. Select Query Statistics Server --> Query Statistics.

2. Click All Time.

3. See that no date is displayed and the Previous and Next buttons are not
accessible for this date selection.

4. Use Step 4 on page 342 through Step 11 on page 344.

5. Use Step 5 through Step 6 on page 345 for the Daily Query Rate tab.
The number for each day of the week matches the total number of input
queries received on each of the respective weekdays over the course of
the year.

Hint: The statistics are preserved when you restart the
application, but not when SAS Information Retrieval
Studio is reinstalled.

6. Use Step 6 through Step 7 on page 347 for the Monthly Query Rate
tab.


14.4 After You View the Query Data


The query statistics help you to understand whether changes should be made
to the current configuration of SAS Information Retrieval Studio. For
example, you can find the following information:
- Discover your peak query hours, days, and months. If performance
should be increased, consider adding additional hardware or network
bandwidth.

- Discover any changes that should be made to the index. For example,
see whether queries without matches might be matched if an additional
field is added to the index.

- See whether the most frequent query terms adequately match the
searched corpus. If not, add a new link.

14.5 Troubleshoot with the Log File


The query statistics server Log pane enables you to see the history of the
queries entered by an end user in the search pane. Use the contents of the Log
pane when you require customer support.
To use this Log pane, complete these steps:

1. Select Query Statistics Server --> Log.

2. Click the up or down arrow to select a new Number of lines. The default
setting is 20. For example, choose 25 to see more lines.

3. Click Retrieve to display this number of lines in the blank pane below.

4. (Optional) Enter the terms that you want to locate in the Text to
highlight field. For example, enter time.

5. Click Find to display this text in the log file.


Appendixes

- Appendix A: Regular Expressions and XML Field Extraction File on page 353
- Appendix B: Recommended Reading on page 355
- Appendix C: Glossary on page 359

Appendix A: Regular Expressions and XML Field Extraction File

- Regular Expressions
- XML File Field Extraction File Format

A.1 Regular Expressions


The document processors, file crawler, and feed crawler use the
Python-compatible equivalent of PCRE. The query web server uses the
Java-compatible equivalent of PCRE. The web crawler uses the SAS wrapper for
PCRE.
For more information, see the following pages:

- PCRE: http://www.pcre.org/
- Python-compatible equivalent: http://docs.python.org/library/re.html
- Java-compatible equivalent:
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
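
Because the document processors, file crawler, and feed crawler accept Python-compatible expressions, you can try a pattern in Python's re module before you enter it into a field. The pattern and file names below are invented for illustration. Whether a given field applies the expression to the whole value or to any part of it is not stated here, so this sketch assumes a whole-value match.

```python
import re

# Hypothetical pattern: names that end in .htm, .html, or .pdf, in any case.
pattern = re.compile(r".*\.(html?|pdf)", re.IGNORECASE)

candidates = ["index.html", "report.PDF", "notes.txt", "page.htm"]

# fullmatch requires the whole name to match (an assumption of this sketch).
matching = [name for name in candidates if pattern.fullmatch(name)]
print(matching)
```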

A.2 XML File Field Extraction File Format


Use this section when you want to extract the contents of a specific XML
document. For more information, see Section The Document Processor:
parse_xml Window on page 114.
Example A.1: Original Document
<article>
<content>foo bar</content>
<tsrc>unwanted garbage</tsrc>
<thumbnail>
<tsrc>http://img.com/</tsrc>
</thumbnail>
</article>

Suppose you want to extract the value of the content field, and the value of
the tsrc field in the thumbnail field. In order to extract only the tsrc field
that is located inside the thumbnail field, specify the following syntax:
<article>
<content />
<tsrc index="no" />
<thumbnail index="no">
<tsrc />
</thumbnail>
</article>

In this example, the attribute index has the value "no". This value specifies
that the parser does not add the value of this field to its list of documents.
The default value of the index attribute is "yes". This specification means that
every field in the input XML that does not have the index attribute remains in
the document.
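
The behavior of the index attribute can be reproduced with a short sketch that uses Python's standard xml.etree.ElementTree module. This is an illustration of the rules as described above, not the parse_xml document processor itself. The function name extract_fields and the use of slash-separated paths as field names are assumptions, and the sketch only visits elements that are named in the extraction file.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<article>
<content>foo bar</content>
<tsrc>unwanted garbage</tsrc>
<thumbnail>
<tsrc>http://img.com/</tsrc>
</thumbnail>
</article>"""

EXTRACTION_FILE = """<article>
<content />
<tsrc index="no" />
<thumbnail index="no">
<tsrc />
</thumbnail>
</article>"""


def extract_fields(spec_elem, doc_elem, path=""):
    """Walk the extraction file and the document in parallel.

    An element contributes its text unless its spec entry has index="no".
    Children are still visited, so a nested field can be kept even when
    its parent is excluded.
    """
    here = f"{path}/{spec_elem.tag}" if path else spec_elem.tag
    fields = {}
    text = (doc_elem.text or "").strip()
    if spec_elem.get("index", "yes") != "no" and text:
        fields[here] = text
    for spec_child in spec_elem:
        for doc_child in doc_elem.findall(spec_child.tag):
            fields.update(extract_fields(spec_child, doc_child, here))
    return fields


fields = extract_fields(ET.fromstring(EXTRACTION_FILE),
                        ET.fromstring(DOCUMENT))
for name, value in fields.items():
    print(name, "=", value)
```

The top-level tsrc value is skipped, while the tsrc value inside thumbnail is kept, which matches the goal stated for this example.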


Appendix B: Recommended Reading

The following books are recommended as companion guides:

- SAS Information Retrieval Studio: Installation Guide: Install SAS
Information Retrieval Studio and prerequisite software.

- SAS Information Retrieval Studio: User's Guide: Use the search
window that an administrator customized to query the index built in
SAS Information Retrieval Studio.

- SAS Sentiment Analysis Studio: User's Guide: Create a SAS
Sentiment Analysis Studio project, test, and upload it to SAS
Sentiment Analysis Server.

- SAS Sentiment Analysis Server: Administrator's Guide: Automate the
process of applying the rules that you define in SAS Sentiment Analysis
Studio to your input documents.

- SAS Sentiment Analysis Workbench: Installation Guide: Install SAS
Sentiment Analysis Workbench and prerequisite software.

- SAS Sentiment Analysis Workbench: Administrator's Guide: Set up
SAS Sentiment Analysis Studio projects, add users, and specify the files
to be used. These files include SAS Sentiment Analysis Studio and SAS
Content Categorization Studio files.

- SAS Sentiment Analysis Workbench: User's Guide: Review and edit the
automated analyses and create reports with graphs that
illustrate these analyses.

- SAS Content Categorization: User's Guide: Create a SAS Content
Categorization Studio project, test, and upload to SAS Content
Categorization Server.

- SAS Content Categorization Studio: Quick Start Guide: Advanced
users can learn how to expeditiously set up a SAS Content
Categorization Studio project.

- SAS Content Categorization: Installation Guide: Install SAS Content
Categorization Server.

- SAS Content Categorization Server: Administrator's Guide:
Understand how SAS Content Categorization Server applies the .mco
and .concepts files to input documents. Program this application using
the Java language.

- SAS Contextual Extraction Studio: Administrator's Guide: Use this
add-on application to SAS Content Categorization Studio to write
complex concept definitions that can include multiple rule types within
a single definition.

- SAS Contextual Extraction Studio: Installation Guide: Install SAS
Contextual Extraction Studio.

- SAS Document Conversion: Developer's Guide: Use this C API for
SAS Document Conversion to convert documents in formats such as
Adobe PDF and Microsoft Office into text.

- Use the language book that applies to the language that you use to create
your project. Each of the SAS world language books contains a
comprehensive list of part-of-speech tags.

- SAS offers instructor-led training and self-paced e-learning courses to
help you get started with the SAS add-in, learn how the SAS add-in
works with the other products in the SAS Enterprise Intelligence
Platform, and learn how to run stored processes in the SAS add-in.
For more information about the courses available, see
support.sas.com/training.

For a complete list of SAS publications, see the current SAS Publishing
Catalog. To order the most current publications or to receive a free copy of the
catalog, contact a SAS representative at

SAS Publishing Sales
SAS Campus Drive
Cary, NC 27513
Telephone: (800) 727-3228*
Fax: (919) 677-8166
E-mail: sasbook@sas.com
Web address: support.sas.com/pubs

* For other SAS Institute business, call (919) 677-8000.
Customers outside the United States should contact their local SAS office.


Appendix: C
Glossary
Boolean operators

specifies words such as AND, OR, and NOT, to construct logical definitions
that locate the matches that you seek.
caption

specifies an alternative version of a label field name that is displayed to


the end user during facetted search. Also see label.
categorization

concisely defines the subject matter of a document, in other words, the


main idea or subject of the document.
checksum operation

eliminates a duplicate document according to the type of operation that


you specify. For example, choose to remove old documents, or eliminate
documents with the same URL.
concept

specifies any of the followinga string, token, or an argumentto locate


in an input document. A concept locates the metadata of the input
document.
contextual extraction

specifies concepts and facts.


corpus

specifies one set of documents. For multiple sets, see corpora.


corpora

specifies multiple sets of training documents. See corpus for one set
crawl

an entire run of a crawler, instead of a single page download.

359

definition

defines a concept. There can be many rules for each concept definition.
This term is also used interchangeably with rule. See rule.
document

is a unit of textual data. For example, a document can be an HTML page, a


Microsoft Word file, a PDF file, or one row in a CSV file or a database. A
document can also be an article or summary in a feed.
In SAS Information Retrieval Studio, each document is represented as a
configurable set of fields. Each file has a name and a value. Unnecessary
fields can be either left empty or omitted from the document.
dominance

when an object is dominant, it is mentioned more frequently than the other


comparable objects that you defined in the Products tab of SAS
Sentiment Analysis Studio.
Facetted search

applies identifying labels to matched documents. These labels enable you


to intuitively navigate to the documents that match your input query terms.
fact

links two, or more, concepts to provide otherwise overlooked relationships


in input documents.
filter

criteria that restrict the data that is displayed in a graph.


hash

change a string of characters into a value that can be indexed. The hash
process expedites the search process.
label

specify the value of the field that is passed to the query web server for
each match that appears within the search window. Also see caption.
metadata

identifies information about information.


MIME

is an acronym for Multipurpose Internet Mail Extensions. Non ASCII


messages are formatted using MIME to be sent over the Internet.

360

SAS Information Retrieval Studio: Administrators Guide

MIME type

is the format of the input document.


Noise words

appear with enough frequency that they are ranked down in the metrics for
weight.
polite

means that a single thread does not overwhelm a site with download
requests, but respects the robots.txt standard. This standard enables Web
site developers to specify portions of their sites that should not be crawled.
precision

measures the accuracy of the model. It reflects the percentage of


documents that were correctly classified.
prominence

indicates where the information about the product is located in the
document. The information can appear primarily in the top 20%, or in the
bottom 80%, of the selected document.
raw

specifies the original, unmodified content that was placed into an HTML
document using this identifying field name.
recall

measures the proportion of the documents that match the definition that
were successfully returned.
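
As a worked illustration, the Python sketch below computes precision and
recall by using the definitions that are common in information retrieval:
precision is the share of returned documents that are true matches, and recall
is the share of true matches that were returned. The product might calculate
its metrics differently. The function name and the document IDs are
hypothetical.

```python
def precision_recall(returned: set, relevant: set) -> tuple:
    """Return (precision, recall) for a set of returned documents."""
    true_positives = len(returned & relevant)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

returned = {"doc1", "doc2", "doc3", "doc4"}  # what the model returned
relevant = {"doc2", "doc3", "doc5"}          # what it should have returned
p, r = precision_recall(returned, relevant)
```

Here two of the four returned documents are true matches, so precision is
2/4. Two of the three true matches were returned, so recall is 2/3.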
rule

defines a category. Unless you use SAS Contextual Extraction Studio,
only one rule defines each category. This term is also used interchangeably
with definition. See definition.
sentiment

expresses a feeling, such as like or dislike.


string

refers to a group of words or characters that you specify for a rule.

Index
A
Abstract source
defined ...............................................................................................................59
Action heading
defined ...............................................................................................................20
Add Backend window
usage ................................................................................................................141
Add button
defined ............................................... 19, 20, 22, 23, 28, 29, 30, 38, 42, 47, 53, 56
Add Credential window
usage ................................................................................................................135
Add Entry Point window
usage ................................................................................................................121
Add Extension window
usage ................................................................................................................139
Add Field window
usage ................................................................................................ 143, 146, 148
Add Filename Extension window
usage ................................................................................................................133
Add HTTP Proxy window ......................................................................................120
usage ................................................................................................ 118, 119, 120
Add keywords to PDF links
defined ...............................................................................................................59
Add Path to Exclude window
usage ................................................................................................................138
Add Path window
usage ................................................................................................................137
Add Scope Rule window
usage ................................................................................................................130
add_field
document processor ...........................................................................................74
All Time button
defined ...............................................................................................................67
Apply Changes button
defined .............................................................................................................. 13
pipeline server ................................................................................................... 40
Auto-detect button
defined .........................................................................................................15, 33
feed crawler ..................................................................................................33, 34

B
bsearch
defined ............................................................................................................ 317

C
Caption heading
defined .............................................................................................................. 56
categorizer
defined ............................................................................................................ 158
color box window
usage ............................................................................................................... 151
colors
search window ................................................................................................ 331
Colors tab
defined .............................................................................................................. 61
concept_extractor
defined ............................................................................................................ 159
Configuration pane
defined .............................................................................................................. 37
content_categorization
document processor .......................................................................................... 74
contextual_extractor
defined ............................................................................................................ 159
Cosine Weight
defined .............................................................................................................. 55
Crawl continuously
defined .........................................................................................................27, 33
Credentials tab
web crawler ....................................................................................................... 14

D
Date
Sort tab ..............................................................................................................54
Date source
defined ...............................................................................................................59
Day
defined ...............................................................................................................71
Day field
defined ...............................................................................................................67
default_mime_type_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
default_title_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
Delete Index button
defined ....................................................................................................... 45, 303
Delete Index window
usage ................................................................................................................153
Density Weight
defined ...............................................................................................................55
document
defined ...............................................................................................................44
Document processing heading
defined ...............................................................................................................40
Document Processor
add_field window ..............................................................................................77
content_categorization window ................................................................. 78, 274
default_mime_type_from_url window ..............................................................95
default_title_from_url window ..........................................................................95
document_converter window ............................................................................96
export_csv window ............................................................................................97
export_to_files window ...................................................................................100
export_to_odbc window ..................................................................................102
export_to_sentiment_analysis_workbench window ........................................104
extract_abstract window ..................................................................................106
extract_pdate window ......................................................................................107
heuristic_parse_html window ..........................................................................108
invalidate_duplicates_by_url window .............................................................110
match_and_copy window ................................................................................111
modify_field_name window ............................................................................112
parse_html window ......................................................................................... 112
parse_xml window .......................................................................................... 114
strip_html window ...................................................................................115, 116
substitute window ........................................................................................... 117
document processors
choose ............................................................................................................. 235
document_converter
defined ............................................................................................................ 159
document processor .......................................................................................... 74
Documents processed
defined .............................................................................................................. 36
Documents queued
defined .............................................................................................................. 36
Documents received
defined .............................................................................................................. 36

E
Edit Backend window
usage ............................................................................................................... 142
Edit button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 39, 42, 47, 53, 57
Edit Credential window
usage ............................................................................................................... 136
Edit Entry Point window
usage ............................................................................................................... 125
Edit Extension window
usage ............................................................................................................... 140
Edit Field window
usage ........................................................................................................147, 150
Edit Filename Extension window
usage ............................................................................................................... 134
Edit Path to Exclude window
usage ............................................................................................................... 139
Edit Path window
usage ............................................................................................................... 138
Encapsulate XML files
defined .............................................................................................................. 27
entry points
defined ............................................................................................................ 196
Entry Points tab
usage ................................................................................................................196
web crawler .......................................................................................................14
Export Settings window ..........................................................................................119
export_csv
defined .............................................................................................................160
document processor ...........................................................................................74
export_to_files
defined .............................................................................................................160
document processor ...........................................................................................74
export_to_odbc
defined .............................................................................................................160
document processor ...........................................................................................75
export_to_sas_sentiment_analysis_workbench
document processor ...........................................................................................75
export_to_sentiment_analysis_workbench
defined .............................................................................................................161
Extension
defined ...............................................................................................................21
extract_abstract
defined .............................................................................................................159
document processor ...........................................................................................75
extract_pdate
defined .............................................................................................................159
document processor ...........................................................................................75

F
facetted search
defined .............................................................................................................163
feed crawler
configure ..........................................................................................................218
defined ......................................................................................................... 9, 157
Feeds pane .........................................................................................................34
General Settings pane ........................................................................................33
operations ..........................................................................................................31
run ....................................................................................................................222
usage ................................................................................................................217
Feed URL
feed crawler .......................................................................................................34
Feeds tab
feed crawler ....................................................................................................... 32
Field Name heading
defined .........................................................................................................46, 56
Field Name tab
defined .............................................................................................................. 52
Field value ................................................................................................................ 54
Sort tab .............................................................................................................. 54
file crawler
configure ......................................................................................................... 207
defined .........................................................................................................9, 156
general settings ............................................................................................... 208
run ................................................................................................................... 213
Filename Extensions tab
file crawler ........................................................................................................ 26
web crawler ....................................................................................................... 14
Find button
defined .............................................................................................................. 12
Finished heading
defined .............................................................................................................. 41
flattened hierarchy
search returns .................................................................................................. 315
Follow Links
feed crawler ....................................................................................................... 34
Font
defined .............................................................................................................. 60
Font size
defined .............................................................................................................. 61
formatting
query web server ............................................................................................. 162
Freshness Weight
defined .............................................................................................................. 55
fsearch
defined ............................................................................................................ 317
Functionality heading
defined .............................................................................................................. 46

G
General Settings tab
feed crawler .......................................................................................................32
file crawler .........................................................................................................26
web crawler ............................................................................................... 14, 193

H
Header background color
defined ...............................................................................................................62
heuristic_parse_html
defined .............................................................................................................159
document processor ...........................................................................................75
Host
defined ...............................................................................................................38
Hour
defined ...............................................................................................................70
HTTP proxy
defined ...............................................................................................................15
feed crawler .......................................................................................................33

I
Import Settings window ..........................................................................................118
index
configure ..........................................................................................................298
defined .............................................................................................................161
input documents ..................................................................................................8
indexing server
defined ...............................................................................................................10
run ....................................................................................................................302
usage ................................................................................................................297
input documents
index ....................................................................................................................8
invalidate_duplicates_by_url
defined .............................................................................................................159
document processor ...........................................................................................75

L
labels
defined ............................................................................................................ 163
hierarchy ......................................................................................................... 312
navigation tools ............................................................................................... 264
usage ............................................................................................................... 321
Labels tab
defined .............................................................................................................. 51
Last busy time heading
defined .............................................................................................................. 41
Last document processed
defined .............................................................................................................. 37
Last document received
defined .............................................................................................................. 37
Left header image
defined .............................................................................................................. 64
Link prefix
defined .............................................................................................................. 59
Link source
defined .............................................................................................................. 59
Link suffix
defined .............................................................................................................. 59
Link traversal order
defined .............................................................................................................. 17
log file
entire application ............................................................................................... 11
feed crawler ..................................................................................................... 223
file crawler ...................................................................................................... 214
indexing server ................................................................................................ 303
pipeline server ................................................................................................. 260
proxy server ...............................................................................................12, 230
query server ..................................................................................................... 306
query web server ......................................................................................337, 349
troubleshoot .................................................................................................... 223

M
Match Formatting tab
defined .............................................................................................................. 51
usage ............................................................................................................... 326
match type
select ................................................................................................................318
Match Type heading
defined ...............................................................................................................20
match_and_copy
document processor ...........................................................................................75
matches
sort ...................................................................................................................319
Matching tab
defined ...............................................................................................................50
usage ................................................................................................................317
Maximum file size (megabytes)
defined ...............................................................................................................26
Maximum number of related labels
defined ...............................................................................................................57
Maximum number of retries
defined ...............................................................................................................16
Menu selected background color
defined ...............................................................................................................63
Menu selected text color
defined ...............................................................................................................64
Menu unselected background color
defined ...............................................................................................................63
Menu unselected text color
defined ...............................................................................................................63
MIME type source
defined ...............................................................................................................59
modify_field_name
defined .............................................................................................................159
document processor ...........................................................................................76
Month
defined ...............................................................................................................72
Month field
defined ...............................................................................................................67
most frequent queries
defined .............................................................................................................162
query statistics server ......................................................................................162
most frequent queries without matches
query statistics server ......................................................................................163
Move Down button
defined ......................................................................................................... 42, 57
Move Up button
defined .........................................................................................................42, 57

N
Next button
defined .............................................................................................................. 67
no hierarchy
search returns .................................................................................................. 314
no labels
search returns .................................................................................................. 312
Number of downloader threads
defined .............................................................................................................. 16
Number of lines
defined .............................................................................................................. 11
Number of matching fields
Sort tab .............................................................................................................. 54
Number of matching terms
Sort tab .............................................................................................................. 54
Number of Occurrences
defined .............................................................................................................. 69
Number of Occurrences heading
defined .............................................................................................................. 68
Number of Queries
defined ...................................................................................................70, 71, 72

O
Oldest date
defined .............................................................................................................. 27
operation history
log file ............................................................................................................. 206
Order added to the index
Sort tab .............................................................................................................. 55
Overall heading
defined .............................................................................................................. 40

P
parse_html
defined .............................................................................................................160
document processor ...........................................................................................76
parse_xml
defined .............................................................................................................160
document processor ...........................................................................................76
Password
defined ...............................................................................................................23
password-protected sites
crawl ................................................................................................................203
Paths tab
file crawler .........................................................................................................26
paths to crawl
specify .............................................................................................................209
paths to exclude
file crawler .......................................................................................................211
Paths to Exclude tab
file crawler .........................................................................................................26
Pending heading
defined ...............................................................................................................41
pipeline server
defined .................................................................................................................9
operations ........................................................................................................234
run ....................................................................................................................258
Pipeline Server tab
operations ..........................................................................................................39
Pipeline Stage
stages .................................................................................................................40
Port
defined ...............................................................................................................38
Position Weight
defined ...............................................................................................................55
Previous button
defined ...............................................................................................................67
processes
order .................................................................................................................164
Proximity Weight
defined ...............................................................................................................55
proxy server
configure ......................................................................................................... 228
defined ...................................................................................................9, 35, 157
operations ........................................................................................................ 157
run ................................................................................................................... 229
status ............................................................................................................... 226
usage ............................................................................................................... 225

Q
queries ........................................................................................................................ 8
Query
defined .............................................................................................................. 69
Query heading
defined .............................................................................................................. 68
query rate by day
defined ............................................................................................................ 163
query rate by hour
defined ............................................................................................................ 163
query rate by month
defined ............................................................................................................ 163
query rate for all time
defined ............................................................................................................ 163
query rates
query statistics server ...................................................................................... 163
query server
defined .................................................................................................10, 48, 305
usage ............................................................................................................... 305
query statistics
all queries ........................................................................................................ 348
see ............................................................................................................341, 344
this year ........................................................................................................... 346
usage ............................................................................................................... 349
Query Statistics pane
usage ............................................................................................................... 341
query statistics server
defined .............................................................................................................. 10
run ................................................................................................................... 340
usage ............................................................................................................... 339

query web server
configure ..........................................................................................................316
defined ...............................................................................................................10
run ....................................................................................................................336
usage ................................................................................................................309
Quota (files)
defined ...............................................................................................................16
Quota (megabytes)
defined ...............................................................................................................16
Quota heading
defined ...............................................................................................................18

R
Recrawl interval
defined ...............................................................................................................33
Refresh button
defined ........................................................................................................... 9, 13
Relevancy
Sort tab ..............................................................................................................54
Remove button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 38, 42, 47, 53, 56
Reset to Default button
defined ...............................................................................................................64
Respect robots.txt
defined ...............................................................................................................16
Retrieve button
defined ...............................................................................................................11
Retry delay (seconds)
defined ...............................................................................................................16
Revert button
defined ...............................................................................................................13
indexing server ..................................................................................................45
Right header image
defined ...............................................................................................................64
robots.txt
defined ...............................................................................................................16

S
sample project
set up ........................................................................................................168, 179
SAS Content Categorization Server ................................................................234, 237
install ................................................................................................234, 235, 237
taxonomies ...............................................................................................234, 237
SAS Content Categorization Studio ................................................................235, 237
concepts and categories .................................................................................. 234
SAS Contextual Extraction Studio
concepts and facts ....................................................................................234, 237
SAS Document Conversion
install ............................................................................................................... 234
SAS Sentiment Analysis Workbench
install ........................................................................................................235, 237
SAS Text Miner ..............................................................................................235, 237
install ........................................................................................................235, 237
scope
defined ............................................................................................................ 198
Scope tab
defined .............................................................................................................. 14
search
query web server ............................................................................................. 162
type ...................................................................................................................... 8
search box
customize ........................................................................................................ 310
Search type
defined .............................................................................................................. 52
send
defined ............................................................................................................ 160
document processor .......................................................................................... 76
Sending to the indexer heading
defined .............................................................................................................. 40
Server Port field
defined .............................................................................................................. 50
specify ............................................................................................................. 317
Site
defined .............................................................................................................. 23
Sleep interval (seconds)
defined .............................................................................................................. 16
sort
matches ........................................................................................................... 319

sort the results
query web server .............................................................................................162
Sort type
defined ...............................................................................................................53
Sort Type field
defined ...............................................................................................................55
Sorting tab
defined ......................................................................................................... 51, 53
Start button
defined ...............................................................................................................13
Status
defined ...............................................................................................................38
Status tab
defined ...............................................................................................................36
Stop button
defined ...............................................................................................................13
strip_html
defined .............................................................................................................160
document processor ...........................................................................................76
substitute
defined .............................................................................................................160
document processor ...........................................................................................76

T
Take Snapshot
usage ..................................................................................................................43
Text to highlight
defined ...............................................................................................................12
Theme pane
usage ................................................................................................................329
Theme tab
defined ...............................................................................................................51
This Month button
defined ...............................................................................................................66
This Year button
defined ...............................................................................................................66
Tiebreaker
defined ...............................................................................................................55
Timeout (seconds)
defined ...............................................................................................................16

Title
defined .............................................................................................................. 60
Title field
defined .............................................................................................................. 58
Title source
defined .............................................................................................................. 58
Today button
defined .............................................................................................................. 66
types of files
limit ................................................................................................................. 202

U
URL heading
defined .............................................................................................................. 18
URL Pattern heading
defined .............................................................................................................. 20
Use pop-up menus
defined .............................................................................................................. 61
User agent
defined .............................................................................................................. 33
Username
defined .............................................................................................................. 23

W
web crawler
configure ......................................................................................................... 192
defined .........................................................................................................9, 156
run ................................................................................................................... 205
specify operations ........................................................................................... 193
Web Crawler pane
operations .......................................................................................................... 12
Weight tab
defined .............................................................................................................. 52

X
XML document
extract contents ............................................................................................... 353

Y
Year field
defined ...............................................................................................................67
