Persons who design the search window that is used by end users to
query the index.
Prerequisites
Here are the prerequisites for using SAS Information Retrieval Studio:
If you have any questions about whether you are ready to use SAS Information
Retrieval Studio, contact your system administrator.
Conventions
This manual uses the following typographical conventions:
Convention
Description
TGM_ROOT
.xml
Start button
The labels for user interface controls are shown in a bold, sans-serif font.
www.sas.com
The export_to_files document processor now enables you to mark pre-escaped fields for XML documents. Use this processor to create nested XML tags.
Entry point quota control is now available for the web crawler. This
feature enables seed-only crawling.
The default fields ctime, mtime, and atime are included in the Input fields to exclude field for the content categorization document processor. Excluding these fields prevents these timestamps from being processed by SAS Content Categorization Server.
The passwords in the web crawler Credentials pane are now obscured.
How Does SAS Information Retrieval Studio Fit into the SAS Product
Line?
What is a Document?
Architecture
configure the components, and enable your end users to perform facetted
search using labels.
an HTML page
a PDF file
1.7 Architecture
Use the architecture diagram below to gain an overview of the application
processes that you can choose to use in your customized configuration.
Figure 1-1
Miscellaneous Windows
what fields are indexed and how the information in these fields is
handled.
4. Choose the type of search that is available to end users and how
matching and sorting are determined. You can also determine whether
and how labels that facilitate facetted search are made available to end
users.
5. Monitor the status of input queries using the query statistics server.
Find information for the application and the import and export operations
that apply to selected components of the application.
Web Crawler
File Crawler
Feed Crawler
Proxy Server
Start, stop, and configure the proxy server. You can also view and search a log file of output for this server.
Pipeline Server
Start, stop, and configure the pipeline server. Observe the progress of
gathered documents in the Status pane and see the specified log file
output for this server. Use this component to specify the processors that
act on each input document. These processors act on the document using
the specified operation or pass the document to another component.
Indexing Server
Start, stop, and configure the indexing server. Use the indexing server if you plan to perform search operations using SAS Information Retrieval Studio.
Query Server
Start, stop, and configure the query server. The query server passes queries
to the index from the search window and hands the results back to the
query web server.
Query Web Server
Start, stop, and configure the query web server. Specify how matching and
sorting operations are performed. Format the labels, document matches,
and the theme of the search window.
Query Statistics Server
Start and stop the query statistics server. See the most frequent queries
submitted, those search terms that are not matched, and query rates for
specified time periods.
The width of these panels can be adjusted by dragging this icon located
between the two main panes.
Number of lines
(Default: 20) see this maximum number of timestamped lines of text that form the log for the proxy server. Click the up or down arrow to change this number.
Retrieve button
Text to highlight
enter the term that you are seeking to match in the input document.
Find button
Apply
modify the behavior of the web crawler according to the changes that you made in this pane.
Revert
General Settings
specify how the web crawler runs. For more information, see Section
2.4.4.B The General Settings Tab on page 15.
Entry Points
enter the starting URLs. The web crawler starts at these Web addresses
and follows their links to gather documents. For more information, see
Section 2.4.4.C The Entry Points Tab on page 18.
Scope
allow, or exclude, Web addresses from the crawling process. In either case,
specify patterns with regular expressions, or a list of specific URLs. For
more information, see Section 2.4.4.D The Scope Tab on page 20.
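Allow/exclude scope rules based on regular expressions can be sketched briefly. This is an illustrative model, not the product's matching code; the rule set, URL patterns, and precedence (exclude wins over allow) are assumptions for the sketch:

```python
import re

# Hypothetical scope rules: (action, compiled pattern) pairs.
SCOPE_RULES = [
    ("allow", re.compile(r"^https?://www\.example\.com/docs/")),
    ("exclude", re.compile(r".*\.(jpg|gif)$")),
]

def in_scope(url: str) -> bool:
    """Return True if some allow rule matches and no exclude rule does."""
    allowed = False
    for action, pattern in SCOPE_RULES:
        if pattern.match(url):
            if action == "exclude":
                return False  # an exclude match overrides any allow
            allowed = True
    return allowed

print(in_scope("https://www.example.com/docs/intro.html"))  # True
print(in_scope("https://www.example.com/docs/logo.jpg"))    # False
```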
Filename Extensions
use the default list of excluded file types, or define your own list to
include, or exclude, from the crawl. For more information, see Section
2.4.4.E The Filename Extensions Tab on page 21.
Credentials
supply login data for the sites that require it. By providing this data, you enable the pages on these sites to be collected. For more information, see Section 2.4.4.F The Credentials Tab on page 23.
Description
HTTP proxy
Auto-detect button
Access the Select an HTTP Proxy window where you can choose a proxy server. For more information, see Section 2.14.3 The Select an HTTP Proxy Window on page 120.
Description
Quota (files)
(Default: 25) Click the up or down arrow to change the maximum number of files that can be collected by the web crawler.
Quota (megabytes)
(Default: 1000) Click the up or down arrow to change the megabyte limit for the crawler. This is the maximum size of all collected files.
Number of downloader threads
(Default: 1) Click the up or down arrow to change the number of downloader threads that can be created.
Sleep interval (seconds)
(Default: 1) Click the up or down arrow to change the number of seconds that the web crawler pauses between page downloads.
Timeout (seconds)
(Default: 300) Click the up or down arrow to change the number of seconds the web crawler waits before it stops attempting to download a specific page.
Maximum number of retries
(Default: 3) Click the up or down arrow to change the highest number of times the crawler can attempt to download a page before it tries the next one.
Retry delay (seconds)
Respect robots.txt
(Default: Yes) Click the arrow to select No. The robots.txt standard enables Web site authors to request that crawlers (robots) avoid downloading some portions of their site. Select No to ignore this request.
Description
Link traversal order
(Default: Breadth first) Click the arrow to select Depth first. Breadth first means that the top layer of linked pages at one site is gathered before the links to the next layer are followed. Depth first means that the links on the first page are followed as far as they go before the crawler moves on to the second page, and so on.
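The difference between the two traversal orders comes down to how the crawl frontier is consumed. A minimal sketch, using a toy link graph that stands in for a Web site (the page names are made up):

```python
from collections import deque

# Toy link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(entry: str, order: str = "breadth") -> list:
    """Visit pages breadth first (layer by layer) or depth first
    (exhaust one branch of links before moving to the next)."""
    frontier = deque([entry])
    seen = []
    while frontier:
        # Breadth first uses the frontier as a queue; depth first as a stack.
        page = frontier.popleft() if order == "breadth" else frontier.pop()
        if page in seen:
            continue
        seen.append(page)
        frontier.extend(LINKS[page])
    return seen

print(crawl("home", "breadth"))  # ['home', 'a', 'b', 'a1', 'a2', 'b1']
print(crawl("home", "depth"))    # ['home', 'b', 'b1', 'a', 'a2', 'a1']
```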
URL
see the list of the entry points to the Web for the web crawler. The crawler accesses the URLs in the order in which they are listed in the Entry Points pane.
Hint:
Quota
see the maximum number of files that can be collected for each Web address. When you specify a quota for both a URL and the web crawler, the smaller of the two numbers applies. In other words, if you specify 100 for the web crawler and 35 for the selected URL, only 35 documents can be downloaded for this URL. Likewise, if you specify 100 documents for this URL and 35 for the web crawler, only 35 documents can be downloaded for this URL.
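The rule above reduces to taking the minimum of the two quotas. A one-line sketch (the function name is illustrative):

```python
# The effective per-URL quota is the smaller of the crawler-wide quota
# and the quota set on the entry point itself.
def effective_quota(crawler_quota: int, url_quota: int) -> int:
    return min(crawler_quota, url_quota)

print(effective_quota(100, 35))  # 35 (URL quota is smaller)
print(effective_quota(35, 100))  # 35 (crawler quota is smaller)
```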
Add
access the Add Entry Point window where you can specify a Web address
to begin the crawl. See Section 2.14.4 The Add Entry Point Window on
page 121.
Remove
delete the selected entry point and quota from the address pane.
Edit
access the Edit Entry Point window to make changes to the selected Web
address. See Section 2.14.5 The Edit Entry Point Window on page 125.
URL Pattern
Add
access the Add Scope Rule window. Scope determines the links that the crawler follows, if the URL has the specified prefix. The links in this URL are followed only if no other scope rules exclude this URL. See Section 2.14.8 The Add Scope Rule Window on page 130.
Edit
access the Edit Scope Rule window. See Section 2.14.9 The Edit Scope Rule Window on page 132.
Extension
see the list of file types, identified by their file extensions.
Action
see the status of each file type: whether the extension is excluded or allowed. If the file type is specified as Allow, the crawler can return this type of page. If you enable at least one type of file to be returned, only those files with the Allow operation are returned.
Add
access the Add File Extension window to add an extension that is allowed
or prohibited. See Section 2.14.10 The Add Filename Extension Window
on page 133.
Edit
access the Edit Filename Extension window to make a change to the file extension or the operation. See Section 2.14.11 The Edit Filename Extension Window on page 134.
Site
Password
see the password assigned to each user. The password is the second component of the credentials required for HTTP authentication.
Add
General Settings
specify how the file crawler runs, specify a date range for returned documents, and specify how .xml files are handled. For more information, see Section 2.5.4.B The General Settings Pane on page 26.
Paths
specify the directories that the file crawler accesses. For more information,
see Section 2.5.4.C The Paths Pane on page 28.
Paths to Exclude
exclude certain directories from the crawl. For more information, see
Section 2.5.4.D The Paths to Exclude Pane on page 29.
Filename Extensions
(optional) if you choose to specify the types of files that can be returned,
only these are permitted. Leave this pane empty if you want the file
crawler to return all file types. For more information, see Section 2.4.4.E
The Filename Extensions Tab on page 21.
Oldest date
click
to access the calendar where you can select the first date that the
crawler can use for the files that it returns to a query.
Crawl continuously
By default, XML files are passed by the pipeline server with top-level tags
turned into similar-named fields in the document. If you want to exercise
more control over these fields, set this specification to Yes. This setting
enables you to turn nested tags into fields. When you make this selection,
also specify the parse_xml document processor in the pipeline server.
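The default flattening and the nested handling that parse_xml enables can be sketched with the standard library XML parser. This is an illustrative sketch of the idea, not the product's code; the sample document and field names are made up:

```python
import xml.etree.ElementTree as ET

doc = """<article>
  <title>Example</title>
  <body>Some text<footnote>nested</footnote></body>
</article>"""

root = ET.fromstring(doc)

# Default behavior (sketch): only top-level tags become similarly named
# fields; nested tags are swallowed into their parent's text.
fields = {child.tag: "".join(child.itertext()).strip() for child in root}
print(fields)  # {'title': 'Example', 'body': 'Some textnested'}

# Nested handling (what parse_xml makes possible): inner tags can become
# fields of their own.
nested = {el.tag: (el.text or "").strip() for el in root.iter() if el is not root}
print(nested)  # {'title': 'Example', 'body': 'Some text', 'footnote': 'nested'}
```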
Add
access the Add Path window. See Section 2.14.14 The Add Path Window
on page 137.
Edit
access the Edit Path window to change the text that specifies an address. See Section 2.14.15 The Edit Path Window on page 138.
Add
access the Add Path to Exclude window. See Section 2.14.16 The Add
Path to Exclude Window on page 138.
Edit
access the Edit Path to Exclude window. See Section 2.14.17 The Edit Path to Exclude Window on page 139.
Add button
access the Add Extension window. See Section 2.14.10 The Add Filename
Extension Window on page 133.
Edit
access the Edit Filename Extension window. See Section 2.14.11 The Edit Filename Extension Window on page 134.
General Settings
specify how the feed crawler runs, the server, and other information
necessary to the crawl. For more information, see Section 2.5.4.B The
General Settings Pane on page 26.
Feeds
crawl one, or more, feeds using this pane. For more information, see
Section 2.5.4.B The General Settings Pane on page 26.
HTTP proxy
specify the server that you are accessing here, or click Auto-detect as
explained below.
Auto-detect
access the Select an HTTP Proxy window where you can choose a proxy
server or enter the address for this server. For more information, see
Section 2.14.3 The Select an HTTP Proxy Window on page 120.
Crawl continuously
User agent
(default agent is SAS Feed Crawler) enter the name of a third-party feed
crawler, if you choose.
Feed URL
choose whether the feed crawler should crawl links in the selected feed.
For more information, see Section 2.14.3 The Select an HTTP Proxy
Window on page 120.
Add
access the Add Feed window where you can choose the address for the
feed that you want to crawl. For more information, see Section 2.14.6 The
Add Feed Window on page 126.
Edit
access the Edit Feed window. See Section 2.14.7 The Edit Feed Window on page 129.
Documents received
see the timestamp of the latest document that the proxy server accepted.
Last document processed
see the timestamp of the latest text that the proxy server handed to another
server.
Host
see the name of the pipeline server. By default, when this server is
running, the information for the local machine appears in the
Configuration pane.
Port
see the number of the port where the pipeline server is running.
Status
Add
click to access the Add Backend window. Here you can specify additional pipeline servers. For more information, see Section 2.14.20 The Add Backend Window on page 141.
Remove
Edit
click to access the Edit Backend window. Here you can change the
pipeline server. For more information, see Section 2.14.21 The Edit
Backend Window on page 142.
Apply
click, if the indexing server is running. The indexing server is restarted and the changes take effect for the new index.
Pipeline Stage
see the documents that have completed each of the document processing stages, such as XML parsing.
Document processing
Pending
see the number of documents that are in the queue for each pipeline stage, with the exception of Overall.
Finished
Add
Take Snapshot
click this button and you can see the document in the various pipeline
stages.
Cancel
Document
see the number of this document. Click the document number to see the field names for this document in the Field pane.
Field
see the field names in this document. Click on a field to see the contents of
the selected document in the Document Inspector pane.
Document Inspector pane
see the contents of the chosen field of the selected document at a specific stage.
remove the existing index. A new index can be built with the new
configuration after you restart the crawler.
Revert
click when the indexing server is running and the existing index is deleted.
A new index can be built when new documents are input.
Field Name
add to, or delete from, the list of field names entered by default. The
default list includes title, date, and body.
Functionality
add to, or delete from, the list of uses for these fields. For more
information, see Section 2.14.22 The Add Field Window for the Indexing
Server on page 143.
Add
access the Add Field window where you add fields to the index according
to the purpose that they are intended to serve. For more information, see
Section 2.14.22 The Add Field Window for the Indexing Server on
page 143.
Remove
delete the selected field from the index configuration. Use this button to
change the configuration of the next index that is built. Any changes do
not affect the current index.
Edit
access the Edit Field window where you make changes to the fields that
you added to the index. For more information, see Section 2.14.22 The
Add Field Window for the Indexing Server on page 143.
Language
Click the arrow to select another language.
Server Port
Matching tab
specify the types of searches that the end user can input, the fields the user
can specify, and the weight of each field. For more information, see
Section 2.11.4.B The Matching Tab on page 51.
Sorting tab
specify how the matching documents are ranked and their parameters. For
more information, see Section 2.11.4.C The Sorting Tab on page 53.
Labels tab
specify labels when you choose to enable facetted search. Facetted search
uses a web-like system of related labels to enable users to intuitively
locate the results that they seek. For more information, see Section
2.11.4.D The Labels Tab on page 56.
Match Formatting tab
specify how documents that match the query are displayed in the list of
results. For more information, see Section 2.11.4.E The Match
Formatting Tab on page 58.
Theme tab
specify the look and feel of the query interface. For more information, see
Section 2.11.4.F The Theme Tabs on page 60.
Search type
Click the arrow to select Advanced.
Field Name
(default is Body) see the name of the fields to search with input queries.
Weight
Add
access the Add Field window. For more information, see Section 2.14.23
The Add Field Window for the Query Web Server Matching Pane on
page 146.
Remove
access the Edit Field window. For more information, see Section 2.14.24
The Edit Field Window for the Query Web Server Matching Pane on
page 147.
Sort type
Description
Relevancy
Specify the relative importance of the matching documents according to the metrics that you choose, for example, Cosine Weight (the only metric that is part of relevancy by default) and Freshness Weight.
Number of matching terms
Returns the matching document with the highest number of terms that match those in the query syntax. You can select a tiebreaker to determine a match when two or more documents meet this threshold.
Number of matching fields
Returns the matching document with the highest number of matching fields. You can also choose a tiebreaker in cases where there are two, or more, matching documents.
Date (newest first)
Returns the matching document with the most recent date. The tiebreaker is Order added to the index.
Date (oldest first)
Field value (largest first)
Field value (smallest first)
Field value (alphabetical)
Order added to the index
Select to make the first document input to the index the matching document. This applies when two or more documents meet the match requirements.
Selection
Description
Tiebreaker
Click the arrow to make a different selection; see the relevant Sort Type above. There can be as many as three tiebreaker fields, depending on the selection that you make in the Sort Type field.
Specify the following weights according to your priorities. In other words, if the density weight is more important than any of the other weights, specify the highest weight number for this field.
Cosine Weight
(Default: 1) Click the up or down arrow to change this metric, which weights frequently occurring terms more highly than those that are infrequent. This metric also takes noise words into consideration. (Noise words are the words that appear with enough frequency that they are ranked down.)
Proximity Weight
(Default: 0) Click the up or down arrow to specify how to weight matching query terms that are located close together in the document.
Position Weight
(Default: 0) Click the up or down arrow to change the weight assigned to matches on words located close to the beginning of the document.
Density Weight
(Default: 0) Click the up or down arrow to change the metric that balances the number of matched query terms against the total number of words in the matching document. The number of match instances is measured as a percentage of the document.
Freshness Weight
(Default: 0) Click the up or down arrow to change the number that determines the age of the matching document. This metric combines several factors besides the age of the document.
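These weights combine per-metric scores into one relevancy value. A minimal sketch of the idea, assuming a simple weighted sum; the metric scores below are made-up numbers, and the actual combination inside the product is not documented here:

```python
# Weights as set in the Sorting tab (defaults shown above).
weights = {
    "cosine": 1,     # default 1: the only metric that counts by default
    "proximity": 0,
    "position": 0,
    "density": 0,
    "freshness": 0,
}

def relevancy(metric_scores: dict) -> float:
    """Weighted sum of per-metric scores (an assumed combination rule)."""
    return sum(weights[name] * score for name, score in metric_scores.items())

# Made-up per-document scores in [0, 1] for illustration.
doc_scores = {"cosine": 0.8, "proximity": 0.5, "position": 0.2,
              "density": 0.4, "freshness": 0.9}
print(relevancy(doc_scores))  # 0.8 — only Cosine Weight contributes by default
```

Raising, say, Freshness Weight to 1 would let the freshness score contribute as well.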
Field Name
see the names of the fields that you entered with the Add button.
Caption
see the label that you added with the Add button.
Add
access the Add Field window. For more information, see Section 2.14.25
The Add Field Window: Query Web Server Labels Pane on page 148.
Remove
Edit
access the Edit Field window. For more information, see Section 2.14.26
The Edit Field Window: Query Web Server Labels on page 150.
Move Up
change the location of the selected field. Click to move the selected field
up one level in the display shown on the search results page.
Move Down
change the location of the selected field. Click to move the selected field
down one level in the hierarchical taxonomy.
Maximum number of related labels
leave the default setting 10. You can also specify a new highest number of
labels that can be displayed in response to a query.
Description
Title source
(Default: Text field) Click the arrow to select HTML field or None. Use the title fields in this pane to identify the type of field where the title of the document is located in the input document. Select None if you do not want to use the title fields for search. (When you select None, the Title field disappears.)
Title field
(Default: title) Click the arrow to select another field from the index.
Description
Abstract source
(Default: Concordance) Click the arrow to select Text field or HTML field. Use the abstract source fields to locate the summary of the input document. Use the Concordance selection if you want to enable hit highlighting. Hit highlighting bolds the matched query term in an input document.
Link source
(Default: Text field) Click the arrow to select None, URL, or HTML field. The link fields specify the type of fields that provide a path to the input text.
Link prefix
Link suffix
Add keywords to PDF links
(Default: Yes) Click the arrow to select No. Modify the URLs of PDF files to instruct Adobe Reader to highlight the search terms in an input document. This operation functions like a concordance, but works for the entire document, not only the abstract in the results list.
MIME type source
(Default: None) Click the arrow to select Text field. This field specifies where the name for the document format is located.
Date source
(Default: None) Click the arrow to select Date or Text field. This field is used to locate the source of the document date.
Title
(default is sans-serif) enter a new font type into this field to change the
display look of the title. For example, enter Times New Roman.
Font size
Click the up or down arrow to change the font size.
specify the colors for the user interface. (See www.w3.org for more
information.)
Display 2.42 Colors Pane
Description
Header background color
(Default: Custom)
Link color
(Default: Custom) Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also click the color box to access the color window and select the color of the links.
Description
Description
load the images that you plan to use for the search window into the work/query-web-server subdirectory of your installation directory.
Component
Description
Today
Click to see the date of the current day in the Year, Month, and Day
fields.
This Month
Click to see the current month in the Month field and the current day
in the Day field. The Year field is unavailable.
This Year
Click to see the current year in the Year field. The Day and Month
fields are inaccessible.
All Time
Click and the Year, Month, Day fields and the Previous and
Next buttons are all inaccessible.
Year field
Click the arrow to select a year. You can select any year back to 1980, or leave the default selection --.
Month field
Click the arrow to select a month.
Day field
Click the arrow to select a day.
Previous
Click to enter the preceding date. For example, if you selected 2010,
2009 appears in the Year field.
Next
Click to select the following date. For example, if you selected 2010,
8, and 20, the next day 21 appears in the Day field.
Query
see a list of the input search terms, ranked from the most to the least frequently submitted.
Number of Occurrences
Query
see a list of the input search terms that are unmatched in the index. This
list is ordered from the highest to the lowest number of entries.
Number of Occurrences
see the total number of times that each query was submitted.
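Ranking queries by their number of occurrences is a simple frequency count. A sketch of the idea (the sample queries are made up):

```python
from collections import Counter

# Count submitted queries and rank them from most to least frequent,
# the way the query statistics server presents them.
submitted = ["sas", "index", "sas", "crawler", "sas", "index"]
ranked = Counter(submitted).most_common()
print(ranked)  # [('sas', 3), ('index', 2), ('crawler', 1)]
```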
Hour
Day
Month
2. Click Add in the Document Processors pane for the pipeline server. The Add Document Processor window appears.
add_field
add a new field with the value that you specify to each input document. This field has one name and one value that is the same for every indexed document. For more information, see Section 2.13.3 The Document Processor: add_field Window on page 77.
content_categorization
default_mime_type_from_url
return the document type that is located in the address fields of input documents. For more information, see Section 2.13.5 The Document Processor: default_mime_type_from_url Window on page 95.
default_title_from_url
return the document title from the Web address of any input
documents. For more information, see Section 2.13.6 The
Document Processor: default_title_from_url Window on page 95.
document_converter
invalidate_duplicates_by_url
stop more than one document with the same Web address from being returned. For more information, see Section 2.13.15 The Document Processor: invalidate_duplicates_by_url Window on page 110.
match_and_copy
parse_html
separate the text from the HTML mark-up tags. For more information, see Section 2.13.18 The Document Processor: parse_html Window on page 112 and strip_html below.
Use this operation when you have an HTML document and you want to extract the body of this document, and possibly the metadata.
parse_xml
separate the text from the XML mark-up tags. For more
information, see Section 2.13.19 The Document Processor:
parse_xml Window on page 114.
send
strip_html
return only the text without the HTML mark-up tags. For more information, see Section 2.13.21 The Document Processor: strip_html Window on page 116 and parse_html above.
Use strip_html when you have a field that contains some HTML code that you want to convert into plain text, for example, when input XML documents contain HTML code.
substitute
3. (Optional) By default, the port number for the specified server is entered. Click the up or down arrow to select a different port number.
window lists the projects and their types. The categories, concepts, and
facts that are applied by the pipeline server are limited to those that are
specified in the projects that you specify.
appears.
true if categories are part of the taxonomy in one of the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. Alternately, Concept extraction or Contextual extraction is selected. Click the arrow to make a different selection.
content_categorization window.
6. (Optional) Repeat Step 2. through Step 5. above until you have added all
of your projects.
7. Click Next to save your changes.
2. (Optional) By default, Input Fields is blank. Enter any field names that
you want to search for matches for your categories, concepts, and facts.
If you leave this field blank, all of the fields are searched with the
exception of any fields entered into Input fields to exclude.
3. (Optional) By default, Input Fields to exclude contains metadata
fields. Enter any field names that you want to exclude from the search
for your categories, concepts, and facts.
Hint:
To specify that all of the categories in the uploaded projects can be used by
SAS Content Categorization Server, complete these steps:
1. Click Categories to access the Categories pane.
enter a new caption name to change the label for facetted search.
You can enter a new format that might include %% for a literal percent
sign. You can also use x as a modifier to request XML escaping. For
example, enter %xc.
5. (Optional) Enter a regular expression into the Category name pattern
field.
6. (Optional) Enter a string into the Category name replacement field.
This string is a constant value that replaces all of the category names.
7. (Optional) By default, ; (semicolon) appears in Separator. Enter a new separator if you choose.
9. Click Finish.
the concepts and contextual extraction concepts. The concepts that are
appears. Use this pane to specify the settings for each individual
concept. These settings override the settings specified for all of the
concepts in the Concepts tab.
3. Click
enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246.
6. (Optional) By default, % is entered into the Format field for the concept name. You can use any of the following symbols:
Description
%c
%p
%m
%i
Match the information associated with the entity, or the match text
if no information is available.
%I
%%
If you want to output nested XML tags, specify the format for these
tags such as <body>%xi</body>. For more information, see Section
2.13.9 The Document Processor: export_to_files Window on
page 100.
7. (Optional) By default, ; (semicolon) appears in the Default separator field.
reiteratively, until you have added all of the concepts that you want to
deploy in SAS Information Retrieval Studio.
can enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246 and Chapter 10.
12. (Optional) By default, %c: %i is entered into the Default format field
for the concept name. You can also use any of the following symbols:
Table 2-8: Default Format Symbols
Symbol
Description
%c
%p
%m
%i
Match the information associated with the entity, or the match text
if no information is available.
%%
2. Click Add to specify a fact with its field name and caption for facetted
search.
3. Click the arrow to select a fact.
4. (Optional) When you select a fact using Step 3. above, the Field name is filled in. For example, see Side Effect. Enter a new name if you choose.
6. (Optional) By default, the format for the matched fact is entered into the Format field.
In this example, drug and sideffect are the returned arguments for the fact SIDE_EFFECT if these arguments are matched. The match strings for the arguments are %v{drug} and %v{sideeffect}. If there is more than one PREDICATE or SEQUENCE rule in the definition with these same arguments, this line is specified for all rules. If there are any other types of rules in the definition, this fact also appears as a concept when you select the Concepts tab.
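The default formats shown in these steps (%f(%a) for a fact, %n: %v for each argument, %v{name} for a named argument's match string) amount to a small template substitution. A sketch of that idea, in which the symbol meanings are inferred from the defaults above and are assumptions, not the product's documented definitions:

```python
import re

def format_fact(fact, args, fact_fmt="%f(%a)", arg_fmt="%n: %v", sep="; "):
    """Expand a fact format string. Assumed meanings: %f = fact name,
    %a = formatted argument list, %n/%v = each argument's name and value,
    %v{name} = value of one named argument, %% = literal percent sign."""
    arg_list = sep.join(
        arg_fmt.replace("%n", n).replace("%v", v) for n, v in args.items()
    )
    out = fact_fmt.replace("%f", fact).replace("%a", arg_list)
    out = re.sub(r"%v\{(\w+)\}", lambda m: args.get(m.group(1), ""), out)
    return out.replace("%%", "%")

print(format_fact("SIDE_EFFECT", {"drug": "aspirin", "sideeffect": "nausea"}))
# SIDE_EFFECT(drug: aspirin; sideeffect: nausea)
```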
7. (Optional) By default, %n: %v is entered into the Argument format
field for the argument format. You can also use any of the following
symbols:
Table 2-9: Default Argument Format Symbols
Symbol
Description
%f
%a
%v{name}
%m
%s
%%
Argument List:
%n
10. (Optional) Click Copy Defaults to insert the entries from the main Facts tab into all of the fields with the exception of the Fact field.
12. (Optional) Use Step 2. on page 90 through Step 11. on page 92,
reiteratively, until you have added all of the facts that you want to apply
in SAS Information Retrieval Studio.
13. (Optional) By default, facts is entered into the Default field name field. You can enter a new caption name to change the label for facetted search. For more information about facetted search, see Chapter 10.
15. (Optional) By default, %f(%a) is entered into the Default format field
for the concept name. You can edit this entry using any of the symbols
in Table 2-9 on page 91 with the exception of %v{name}.
Note:
20. (Optional) By default, the highest number of facts that can be matched is entered. Click the up or down arrow to change this number.
into mime-type-field.
3. Leave the default specification, id, or enter the new field name into url-field.
95
2. Leave the default specification, title, or enter a new field that specifies
what field is searched to locate the title of the input document. If the
document has no value for the field, the value of the URL field is used.
3. Leave the default specification, id, or enter the field where the
naming the server and its port in the server field. (The server name and
port number are separated by a colon [:]).
3. Leave the default specification, mimetype, or enter a new field where
document processor gets the content in the body and title fields.
6. Leave the default specification, body, or enter a new location for the
instead of fields when new files are created. Export these files with escaped, or
nonescaped characters, to be used in SAS Text Miner, Base SAS, Microsoft
Excel, and so on.
To use the Document Processor: export_csv window, complete these steps:
1. Select export_csv in the Add Document Processor window and click
Next.
entry in the filename field. If you add %s to the entry in this field, the
file is timestamped.
3. Leave the default entry 1, if you want to append to an existing file with
the specified name in the append field. This operation takes place when
the pipeline is restarted. Enter 0 to overwrite an existing file when the
pipeline is restarted.
Notes: This field, like the following four fields, controls when a new file is created.
enter 1 if you appended %s to the entry in the filename field. A new file is created after the pipeline server is idle for the specified number of seconds.
7. Leave the default entry 0 in the new-file-each-hour field. Enter 1 if
you appended %s to the entry in the filename field and a new file is
created every hour.
8. Leave the default entry 0 in the new-file-each-day field. Enter 1 if
you appended %s to the entry in the filename field and a new file is
created every day.
9. Leave the default entries id, title, and body as a comma-separated list
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
you want to remove any new lines in the document. This operation
makes it easier to parse the document text. Alternatively, enter 1 to keep
these lines in the document as it is parsed.
12. Leave the default comma (,) that is entered into the delimiter field.
You can enter another character that is used to delimit the fields in the
output file.
13. Leave the default 1 setting in the excel-quoting field. Set this number to 0 to change how values are quoted in the output file.
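The delimiter and quoting settings above determine how the exported file is read back by other tools. A minimal sketch of reading such a file with Python's csv module (the sample text and field names are illustrative, not actual export_csv output):

```python
import csv
import io

# A made-up sample of the kind of file export_csv might produce: the field
# names form the header row and the default comma is the delimiter.
sample = 'id,title,body\n"doc-1","A Title","Body text, with a comma"\n'

# Excel-style quoting wraps values in double quotes so that embedded
# delimiters survive; Python's "excel" dialect reads that convention back.
rows = list(csv.reader(io.StringIO(sample), dialect="excel", delimiter=","))

header, first = rows[0], rows[1]
print(header)    # ['id', 'title', 'body']
print(first[2])  # Body text, with a comma
```

If you change the delimiter field, pass the same character to the reader's `delimiter` argument so the columns line up.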
If you leave fields blank, all of the document fields appear in the output
file.
5. Leave the default selections raw and mimetype in fields-to-exclude.
You can also specify different field names. The text in these fields does
not appear in the output file.
6. (Optional) When you want to output nested XML tags, enter the name of the field whose value contains the XML syntax into xml-preescaped-fields. This field name is listed in the Field name of the Document Processor: content_categorization window. For example, organization. For more information, see Section 2.13.4.F Specify Concepts on page 84.
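The distinction behind pre-escaped fields is that an exporter normally escapes field values so that markup characters cannot break the output XML, whereas a pre-escaped field is trusted to already contain valid XML and is copied through verbatim. A small sketch of that distinction (the function and field names are illustrative, not part of the product):

```python
from xml.sax.saxutils import escape

def render_field(name, value, preescaped=False):
    # Pre-escaped fields are trusted to hold valid XML (e.g. nested tags)
    # and are emitted as-is; everything else is escaped first.
    text = value if preescaped else escape(value)
    return "<{0}>{1}</{0}>".format(name, text)

# An ordinary field: markup characters are escaped.
print(render_field("title", "Profit & Loss <2012>"))
# <title>Profit &amp; Loss &lt;2012&gt;</title>

# A pre-escaped field (like one listed in xml-preescaped-fields):
# its nested tags pass through intact.
print(render_field("organization", "<org><name>SAS</name></org>", preescaped=True))
# <organization><org><name>SAS</name></org></organization>
```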
7. Leave the default specification, article, if you specified XML for the output format. If you are using text as the output format, enter a different document tag type into the document-tag field.
8. Leave the default utf-8 encoding specification for input files in the encoding field.
9. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
2. (Optional) Enter the name of the ODBC driver into the connection-string field.
Note: Consult your database documentation for details before you use this step and Step 6. below.
3. Enter the name of the database table into the table field.
4. Use the table-init field to specify the operation that is performed if the
6. Specify whether to add new rows with a merge operation. If you use the merge operation, specify the name of a column. (Also see the note above.)
7. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
8. Leave the default setting of 1024 in the max-length field. You can also change this maximum length.
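The merge behavior described above, keying on a named column and truncating values to max-length, can be sketched with SQLite standing in for the ODBC driver. This is not the product's implementation; the table and column names are illustrative:

```python
import sqlite3

# Stand-in for the export_to_odbc merge operation, using SQLite instead of
# an ODBC driver: when a row with the same key column already exists, the
# merge replaces it rather than adding a duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT, body TEXT)")

def merge_row(conn, doc, max_length=1024):
    # Truncate long values, mirroring the max-length setting.
    row = (doc["id"], doc["title"][:max_length], doc["body"][:max_length])
    # INSERT OR REPLACE keyed on the merge column (id here).
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", row)

merge_row(conn, {"id": "d1", "title": "old", "body": "x"})
merge_row(conn, {"id": "d1", "title": "new", "body": "y"})

count, title = conn.execute("SELECT COUNT(*), MAX(title) FROM docs").fetchone()
print(count, title)  # 1 new
```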
2. Leave the default setting, localhost, or enter a new server name into the host field.
4. Enter the name of the SAS Sentiment Analysis Workbench project into
project-name.
8. Specify the field that contains the time that the document was created. You can also specify another field for this entry. In either case, the format of the contents of this field is specified in the createtime-format field below.
9. Specify the format of the matching createtime data in the createtime-format field. For example, %m/%d/%Y %I:%M:%S %p for SAS Sentiment Analysis Workbench, or %Y%m%d for SAS Search and Indexing.
10. Leave the default specification, title, or enter the field where the name of the document is located.
11. Enter the field where the name of the person who wrote the document is located into author-field.
12. Leave the default specification, geolocation, or enter the field that contains the location information.
Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
generate the abstract. (This location typically contains summary information for
a technical or scientific document.)
The abstract functions like the concordance if the document is sent to the search engine. However, the abstract is static and therefore independent of any query, whereas the concordance is query-specific. For this reason, the concordance is available only when a search operation is performed.
To use the Document Processor: extract_abstract window, complete these
steps:
1. Select extract_abstract in the Add Document Processor window and click Next.
2. Leave the default specification, body, or enter a new source for the <body> format tag where the document summary can be located into abstract-field.
2. Leave the default specification, date, or enter a new source for the
document date into date-field. This date is converted into the pdate for
the search operation.
3. (Optional) Enter the strptime format of the date field in the input documents.
2. Leave the default specification, raw, or enter a new field into input-field. This document processor gets the content for the title and body fields.
3. Leave the default specification, title, or enter a new title field into
title-output-field.
4. Leave the default specification, body, or enter a new body field into
body-output-field.
5. The entry 1 in the require-mime-type field specifies that matching documents must have the specified MIME type.
7. The entry 1 in the base64-input field specifies that the text is encoded in base64.
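When base64-input is set to 1, the document text arrives base64-encoded and must be decoded before processing. The round trip in miniature (the sample string is illustrative):

```python
import base64

# Encode some text as a sender with base64 output would, then decode it
# as a processor with base64-input set to 1 would.
encoded = base64.b64encode("Some document text".encode("utf-8"))
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # Some document text
```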
2. Leave the default specification, id, or enter a new field where the URL is located.
4. Enter the name of the field where the output is placed into output-field.
5. (Optional) If the format field contains a value, the value controls how the output is formatted.
6. (Optional) Specify whether values are added to the end of an existing value for the output field, or these values replace an existing value.
7. (Optional) By default, the semicolon character (;) is entered into the separator field. You can enter a different character, or a string, if the output-field is used as a label or sent to the index.
8. Click Finish.
2. Enter the name of the field that you want to change into oldname.
3. Enter the name of the new field into newname.
4. Click Finish.
2. Leave the default specification raw, or enter a new location where this document processor finds its input.
4. Leave the default specification body, or enter a new body field into body-output-field. The plain text of the body field is output to this field. If you leave this field empty, no body text is output.
5. (Optional) Specify whether additional information in other fields is output. (The text in the body and title fields is always output.) For example, description and keywords might be output. The fields that are used for output depend on the meta field types that appear in the HTML documents.
6. The entry 1 in the require-mime-type field specifies that the document must have the MIME type that is specified in the mimetype field.
8. The entry 1 in the base64-input field specifies that the text is encoded in base64.
5. (Optional) Enter the name of the tag in input XML documents that
contains the string that is the identifier for output documents into the
copy-url-from-field.
6. (Optional) Enter the name of the tag in output documents that contains the document identifier.
4. Leave the default entry id in id-field, or enter the name of a new field that contains the document identifier.
Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to send each instance of a document to another instance of the pipeline.
2. Leave the body field, or add new fields that are separated by commas
into Fields. These are the fields where the HTML tags are stripped in
order to return the text that they contain.
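Stripping HTML tags from a field, as described above, can be sketched with Python's standard HTML parser. This is an illustration of the idea, not the product's implementation:

```python
from html.parser import HTMLParser

# A minimal tag stripper in the spirit of the processor described above:
# it discards the mark-up and keeps only the text content of a field.
class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.chunks)

print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```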
2. Enter the name of the first regular expression field to be located into
Field.
3. Specify the pattern of the regular expression into the Pattern field.
4. Enter the replacement for the first regular expression field into
replacement.
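The Field, Pattern, and replacement settings above amount to applying a regular-expression substitution to one field of a document. A sketch in Python (the field value and pattern are illustrative):

```python
import re

# Apply a regular-expression substitution to one field of a document,
# mirroring the Field / Pattern / replacement settings described above.
doc = {"title": "Q1  report   draft"}

pattern = r"\s+"   # example pattern: collapse runs of whitespace
replacement = " "

doc["title"] = re.sub(pattern, replacement, doc["title"])
print(doc["title"])  # Q1 report draft
```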
2. Enter the name of the file that you want to import into the Filename field.
3. (Optional) Deselect the components that you do not want to modify with the imported file in the Components section. For example, deselect Feed crawler, Indexing server, and Query web server.
4. Click OK to save these settings.
2. Enter the name of the file that you want to export into the Filename
field.
3. Click OK to save these settings.
4. Enter a Web address into the URL field. For example, enter
www.sas.com.
5. (Optional) Leave the default selection Yes in the Add to scope field, or select No. Unless there are scope rules, the crawler
follows all links found on the entry point page, the links found on those
pages, and so on. Scope rules limit the links that the crawler follows.
Use this feature to constrain the crawl to a single site, or section of the
site. In other words, the scope rule follows the way that many Web
pages are laid out. When you leave Yes selected, the URL is
automatically added to the Scope tab in the web crawler Configuration
pane. For more information, see Section 2.4.4.D The Scope Tab on
page 20.
6. (Optional) Reset the number in the Quota field. For example, specify 90000. When you specify a quota for the links from the entry point, the overall quota for the crawler, or this number, applies.
1. Click Edit in the Entry Points tab of the Configuration pane for the web crawler. The Edit Entry Point window appears.
2. Enter your changes into the URL field. For example, enter http://
.*\.sas\.com.
3. (Optional) Reset the number in the Quota field.
2. Click the orange box located to the left of the feed that you want. For example, Press Releases.
4. Copy the feed URL from the URL field in the browser. For example,
copy http://www.sas.com/news/preleases/SASRecentPress.xml.
After you copy the URL for an RSS feed using the steps above, complete these
steps:
1. Select Feed Crawler --> General Settings --> Feeds. The Feeds
pane appears.
3. Paste the copied RSS feed URL into the Feed URL field.
2. Place your cursor into the Feed URL field and make any necessary
changes.
3. Make any necessary changes in the Follow links field.
4. Leave the default selection Allow in the Action field unless you want to exclude URLs that match this pattern from the crawl. In this case, select Exclude.
5. Click OK and this address appears under the URL Pattern heading.
2. Use Step 2. through Step 5. in Section 2.14.8 The Add Scope Rule Window as necessary.
To access and use the Add Filename Extension window, complete these steps:
1. Click Add in the Filename Extensions tab of the Configuration pane
for the web crawler. The Add Filename Extension window appears.
2. Enter the file extension that you want to return or to exclude in the Extension field.
3. Click OK and this entry appears in the Filename Extensions pane.
To access and use the Edit Filename Extension window, complete these steps:
1. Click Edit in the Filename Extensions tab of the Configuration pane
for the web crawler. The Edit Filename Extension window appears.
2. Use Step 2. through Step 4. in Section 2.14.10 The Add Filename Extension Window as necessary.
2. Enter the address for a Web site that requires credentials into the Site
field.
3. Enter the name of the user into the Username field.
4. Enter the password for this user into the Password field.
5. Click OK and these entries appear in the Credentials pane.
2. Use Step 2. through Step 5. in Section 2.14.12 The Add Credential Window as necessary.
2. Enter a file or directory name into the Path field.
2. Use Step 2. through Step 3. in Section 2.14.14 The Add Path Window as necessary.
2. Enter a path to the files that the file crawler should not access in the Path field.
2. Enter a string into the Extension field. For example, enter html.
3. Click OK and this entry appears in the Filename Extensions pane.
2. Enter a new string into the Host field. For example, enter newhost.
3. Change the number in the Port field. For example, specify 9008.
2. Enter the field name into the Name field. You can specify any field.
3. Select the type of usage for this field:
Searching
(Default) Search for words that match the input query terms in this
field. This choice is equivalent to the standard functionality.
Label
(Default) Select this type of usage for facetted search. For more
information about facetted search labels, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.
Display and Sorting
display the matching URLs according to the sorting type that you
select. Sort the results alphabetically, or numerically, instead of by
relevancy. This selection corresponds to marking the field as info.
Identification
choose this field to identify the field that contains the individual
identification number for each document. Each field in the index
requires a unique identifier. If a new document is added that has the
same identification number as an old document, the new document
replaces the old document.
Custom
Standard
Boolean
The only fields that are available are those added to the
index with search functionality.
To access and use the Add Field window, complete these steps:
1. Click Add in the Matching pane of the Query Web Server and the Add Field window appears.
2. Reset the weight assigned to this field. The weight value sets a number that is relative to the other matching fields and is used to prioritize matches.
2. Use Step 2. through Step 3. in Section 2.14.23 The Add Field Window
for the Query Web Server Matching Pane on page 146 as necessary.
2. Any field in the index that has label functionality is available in the Name field.
3. Leave the default selection No, or select Yes or Flattened. No displays the list view, Yes displays the tree view, and Flattened displays the tree in a list format.
5. Select Yes to see the number of matching values for each label field.
6. Leave the Show in matches value 0. You can reset the number of labels found in each individual matching document that are displayed in the results list.
3. Use Step 2. through Step 7. in Section 2.14.25 The Add Field Window: Query Web Server Labels Pane on page 148 as necessary.
6. Click Custom Colors to access the expanded version of the color box window.
7. See the color range for the selected color in the large pane.
8. Slide the marker to select a color in this range.
9. Change the numbers in the Red, Green, and Blue fields to specify an exact color.
Choosing a Crawler
Web crawler
crawls the Web, according to the parameters that you set. These parameters define the types of documents and information that you seek, and they also limit the scope of the crawl. The scope, or breadth and depth of the crawl, prevents the crawler from attempting to return every document that appears on the Internet.
When you limit the scope of the Web crawl, you optimize the crawl and
minimize the time that it takes to return this data. You can also specify the
credentials that are necessary to access password-protected sites.
File crawler
Feed crawler
use the feed crawler when you want to obtain blog posts, user forum
pages, and other trending data such as press releases.
content_categorization
match category rule terms that appear in one, or more, fields of an input document. These rules are found in the categories project running on SAS Content Categorization Server.
concept_extractor
extract any matching concepts from an input document. These terms are
located in the specified concepts project running on SAS Content
Categorization Server.
contextual_extractor
determine the type of the original, input document based on the filename
extension found in its Web address.
default_title_from_url
change the format of incoming files, such as Adobe PDF and Microsoft
Office documents into text using the SAS Document Conversion
application.
extract_abstract
separate the body of an HTML document from its tags using an operation
that provides an algorithm to obtain the optimal result. This operation
searches for paragraphs of text without many tags and extracts these
bodies of text.
invalidate_duplicates_by_url
prevent the collection of multiple copies of the same document from being
returned.
modify_field_name
parse_html
separate the text from the HTML mark-up tags in an input HTML
document.
parse_xml
separate the text from the XML mark-up tags in an input XML document.
send
remove the mark-up tags and return only the text from an HTML-formatted field in a document.
substitute
save each document to a separate file whose name is based on a hash of its
contents.
export_to_odbc
export_to_sentiment_analysis_workbench
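The hash-based naming scheme described above, where each document is saved to a file named after a hash of its contents, can be sketched as follows. The choice of SHA-1 and the file extension are illustrative assumptions, not the product's actual algorithm:

```python
import hashlib

# Name each output file after a hash of the document's contents, so that
# identical content always maps to the same filename.
def filename_for(content, extension=".txt"):
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return digest + extension

a = filename_for("same text")
b = filename_for("same text")
c = filename_for("different text")
print(a == b, a == c)  # True False
```

A side effect of this scheme is natural deduplication: writing the same document twice overwrites a single file rather than creating two.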
You can design a Web page that enables users to input queries and to obtain
search results. However, you can also use the query server with an application
that does not require an interface to search the index. In this case, you can
write a custom program to provide a connection between the query server and
your application.
select the way that results are displayed in the custom user interface that
you design. You can also specify themes and colors.
see a list of the most frequent query terms and the number of occurrences
for each term.
see a list of the most frequent query terms that did not locate results in the
index. You can also see the number of occurrences for each term.
Query rate by hour
see the number of queries since you installed the SAS Information
Retrieval Studio application.
Sample Configurations
-
Make sure that your document processors are listed in the order of logical operations:
a. Normalize input text. For example, place parse_html before the document processors that use its output.
b. Send only the processed and normalized text that can be used by an index (by default, if you install SAS Search and Indexing, the documents are indexed) or by other applications. These applications include SAS Text Miner and SAS Sentiment Analysis Workbench.
Click Apply Changes before you leave the tab for any component
where you make changes.
Delete the index if you want all of the gathered documents to be indexed
according to the changed settings. If you do not delete the index, the
documents that were indexed according to the old settings remain in the
index. The documents that are added after you save your changes to the
new index are indexed according to the new settings.
If necessary, stop and restart the web, file, or feed crawler that is
running. If you delete the index, stop and restart the web, file, or feed
crawler that you chose to build the index.
Hint:
If you decide to check the results of your index by entering query terms
using the search window, consider the path and scope of your crawl. In
other words, if you limit your crawl to SAS documents, do not expect to
enter medical terms and locate matches in these documents.
To set up a simple project that crawls the Web, builds an index, and configures
the query server, complete these steps:
1. Select Web Crawler --> Configuration --> General Settings.
Use this window to select the proxy server that is located between the
web crawler and the Internet.
4. Click OK and the selected server appears in the HTTP proxy field of the General Settings pane.
5. (Optional) Complete these steps:
a. Increase the number of files that the web crawler can collect from the default of 25.
b. Increase the total size of the files that can be collected to 3000 megabytes.
c. Increase the number of downloader threads so that the web crawler can access more files quickly. However, too many threads can overwhelm the site that the web crawler is crawling.
6. Click OK and the server appears in the General Settings pane.
7. Select Configuration --> Entry Points to specify the Web site where
9. Enter the Web address that the web crawler uses to enter the Internet.
10. (Optional) Specify the scope of the crawl. When you specify at least one permitted site, every other site is excluded. For more information, see Section 5.2.4 Specify the Scope of the Crawl on page 198.
11. (Optional) Change the limit on the number of files downloaded from each entry point.
12. (Optional) To specify credentials for password-protected sites, click the Credentials tab. For more information, see Section 5.2.6 Specify Access Information for Password-Protected Sites on page 203.
15. Click Apply Changes to save the new web crawler configuration.
17. Click Add and the Add Document Processor window appears.
20. Leave the default settings or make changes. For more information about
21. Click Finish and the document processor that you select appears in the Document Processors tab.
22. Use Step 17. through Step 21. above, reiteratively, until you have added all of the document processors that you require. For example, if you want to add labels to enable facetted search, use the content_categorization document processor. For more information, see
Section 2.13.4 The Document Processor: content_categorization
Wizard on page 78.
In this example, both categories and concepts are added to enable
facetted search.
23. Select a document processor and click Edit to make a change to its settings.
26. (Optional) Leave the default settings or click Edit to make changes to the fields. Use this pane to set the priorities for field matches. Weights are a relative setting. The priority value that you specify for each field is determined only in relationship to other matched fields in a document.
29. (Optional) To add a field and specify its weight, click Add, and the Add Field window appears.
32. Click OK and this field and weight appear in the Matching pane.
33. Select the Labels pane to see all of the selected categories, concepts,
and facts.
34. (Optional) Select a field and click Edit to make changes to this field.
For example, use the Edit Field window that appears to change the
caption, or label, name. For more information, see Section 2.14.25 The
Add Field Window: Query Web Server Labels Pane on page 148.
35. (Optional) Change the default setting 10 in the Maximum field. This is the highest number of related labels that can be displayed in response to a query. The end user sees these labels after entering a query into the SAS Information Retrieval Studio search window.
39. Click the blue hyperlink and the search window appears.
40. Enter a query into the search field in the SAS Information Retrieval Studio search window.
41. Click Search and see the labels that match the returned documents on
the left side of the search window. On the right side see the matching
documents with links to the full text for each document.
42. To see the statistics for queries, click the Query Statistics Server tab.
For more information, see Section 14.3 View the Query Statistics for a
Selected Time Period on page 341.
To set up a simple project that crawls the files on your machine and exports
these files, complete the following steps:
1. Select File Crawler --> Configuration --> Paths.
2. Click Add to add one, or more, paths to the Paths pane. The Add Path
window appears.
This document processor extracts text from input document formats such as Microsoft Office and Adobe PDF files. It is relevant for the file crawler, but it can also be used with the web crawler after the parse_html document processor is used.
12. (Optional) Change any of the settings in this window. For more
13. Click Finish and the selected document processor appears in the Document Processors tab.
17. (Optional) Make any other changes to the fields in the Document Processors tab, reiteratively, until you have added all of the document processors required. For example, if you want to add labels to enable facetted search, see Section 2.13.4 The Document Processor: content_categorization Wizard on page 78.
When you set up the feed crawler, you can choose to return either summaries or full-length texts. If the feed collects summaries, you can enable the feed crawler to follow the links contained in the summaries to the full text of each article. Enable this capability using the steps below.
To set up the feed crawler, complete the following steps:
3. Access your Web browser and locate the Web page with the orange feed box.
4. Click the orange box located to the left of the feed that you want. For example, Media Coverage.
6. Copy the feed URL from the URL field in the browser. Paste this URL into the Feed URL field in the Add Feed window. For example, copy http://www.sas.com/news/mediacoverage/SASRecentMediaCoverage.xml into the Feed URL field.
10. Click Add and the Add Document Processor window appears.
11. Select parse_html. In this example, summaries are collected and the
feed crawler is instructed to follow links to the HTML pages that are
linked to each summary. (See Step 7. on page 187 where Yes is selected
in the Follow Links drop-down menu.)
12. Click Next and the Add Document Processor: parse_html window appears.
Use each of the following sections to configure your web crawler with one
exception. The Credentials information is necessary only when you choose to
crawl password-protected sites.
After you make all of your changes, click Apply Changes in the Web Crawler pane. If the web crawler is running when you click this button, the Restart Web Crawler window appears.
Display 5.2 Restart Web Crawler Window
Click Yes.
If the web crawler is not running, click Start.
3. Change the default setting of 25 in the Quota (files) field. This is the maximum number of files collected by the web crawler.
4. Change the default megabyte limit of 1000 in the Quota (megabytes) field. This is the maximum size for all of the collected documents.
5. Change the total number of threads that can be created
created in the Number of downloader threads field. For example,
change this setting to 16. (The default setting is 1.) The more threads
you specify, the faster the download process becomes. However, a
higher number of downloaded files can also overwhelm a site and shut
it down.
6. Change the number of seconds that the web
crawler rests between page downloads in the Sleep interval field.
(The default setting is 1.) This setting enables the web crawler to be
polite. In other words, a single thread does not overwhelm a site with
download requests. This is not true if you use this setting but have many threads. For example, 100 threads operating on 5-second sleep
intervals could potentially launch 100 requests simultaneously to a
site.
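The arithmetic behind that warning can be made explicit: each thread issues roughly one request per sleep interval, so the aggregate load scales with the thread count even when each thread is individually polite. A small sketch (the function name is illustrative):

```python
# Average aggregate request rate across all downloader threads, assuming
# each thread issues about one request per sleep interval and downloads
# are fast relative to the interval. Bursts can be worse if the threads'
# intervals happen to align.
def requests_per_second(threads, sleep_seconds):
    return threads / sleep_seconds

print(requests_per_second(1, 1))    # 1.0  - one polite thread
print(requests_per_second(100, 5))  # 20.0 - 100 threads on 5-second sleeps
```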
7. Change the number of seconds before the web
crawler stops trying to download a page in the Timeout field. (The
default setting is 300.)
8. Change the number of times that the web
crawler tries to download a page before it stops in the Maximum
number of retries field. (The default setting is 3.)
9. Change the highest number of seconds that the
web crawler waits before it tries to download a page again in the
Retry delay field. (The default setting is 300.)
12. Select the crawl mode: breadth-first or depth-first.
In the breadth-first mode, the crawler searches all of the links in the
point-of-entry page. The crawler then searches all of the links in the
first layer of child pages. The crawler repeats this process for the
second layer of child pages, and so on, until it has crawled all of the
links related to the point-of-entry page. This is a first in, first out
operation.
In depth-first mode, the crawler follows one set of links through all of
its children. The crawler then backtracks to the next child page and
crawls the links of its children, and so on. This process is repeated
reiteratively until all of the links in a page are crawled. This operation
drills deep and then backtracks, reiteratively.
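The two crawl modes described above differ only in which page the crawler takes from its frontier next: breadth-first takes the oldest entry (first in, first out), depth-first the newest. A sketch over a toy link graph (the graph itself is made up for illustration):

```python
from collections import deque

# A toy link graph: each page maps to the links it contains.
links = {
    "entry": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(start, depth_first=False):
    order, frontier, seen = [], deque([start]), {start}
    while frontier:
        # Depth-first pops the most recently added page (LIFO);
        # breadth-first pops the oldest (FIFO).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        for link in links[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("entry"))                    # ['entry', 'a', 'b', 'a1', 'a2', 'b1']
print(crawl("entry", depth_first=True))  # ['entry', 'b', 'b1', 'a', 'a2', 'a1']
```

Breadth-first visits every child layer by layer; depth-first drills down one branch and then backtracks, matching the descriptions above.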
3. Enter the Web address for the first site into the URL field.
4. (Optional) To add this address to the Scope pane as an allowed site for
the crawl, leave the default selection Yes in the Add to scope rules
field. If you do not want to add this address to the Scope pane, select No.
Click OK and this address appears in the Entry Points pane.
7. Use this process reiteratively until you have added all of your URLs to the Entry Points pane.
Select Regular Expression if you do not want a prefix match. This setting tells the web crawler how to use the characters entered in the URL Pattern field. A prefix match is one that matches against the beginning of the URL. Select Exclude if you do not want the crawler to download pages from this URL.
9. Click OK and this URL appears in the Add Scope pane.
10. Click OK and see the complete list of included and excluded URLs. In
this example, the web crawler searches only the publications pages of
the SAS Web site. It does not search the pages that list recent books.
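The two scope-rule modes amount to two different URL tests: a prefix rule matches the beginning of the URL literally, while a regular-expression rule uses pattern syntax. A sketch in Python, reusing the example pattern from this chapter (the function name is illustrative):

```python
import re

# A prefix rule matches the beginning of the URL literally;
# a regular-expression rule applies the pattern from the start of the URL.
def matches_scope(url, pattern, regex=False):
    if regex:
        return re.match(pattern, url) is not None
    return url.startswith(pattern)

# Prefix match: only URLs under this exact path are in scope.
print(matches_scope("http://www.sas.com/publications/x.html",
                    "http://www.sas.com/publications"))      # True

# Regular-expression match: any host ending in .sas.com is in scope.
print(matches_scope("http://support.sas.com/doc.html",
                    r"http://.*\.sas\.com", regex=True))      # True
```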
2. Click the Add button to access the Add Filename Extension window.
3. Enter the extension of the file type that you want to exclude into the Extension field.
4. Click OK.
3. Enter the URL followed by a colon (:) and the port number for the host into the Site field.
5. Enter the password. For example, enter mdpassword. When you enter this password, the characters that comprise the password are represented by the asterisk symbol (*) in the Credentials pane.
6. Click OK and this site with its credentials is added to the Credentials
pane.
7. Click Apply Changes in the Web Crawler pane.
The appropriate message appears in the Status pane after any of these
operations.
-
(Optional) If you make any changes to the configuration while the web
crawler is running, click Apply Changes. The Restart Web Crawler
window appears.
Click Yes.
If the web crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Use each of the following sections to configure your file crawler. After each
change, click Apply Changes in the File Crawler pane. If the file crawler is
running when you click this button, the Restart File Crawler window appears.
Click Yes.
2. Change the default setting 10 that is specified in
the Maximum file size field. Increasing or decreasing this number
affects the size of the documents collected. For example, you might
want to gather white papers but not books.
3. Access the calendar where you can select the first date for
the crawl. Documents that have creation dates before the date
specified in the Oldest date field are not collected by the file crawler.
4. Specify whether the file crawler runs continuously.
3. Enter an absolute path into the Path field.
3. Enter an absolute path into the Path field.
5. Continue this process, reiteratively, until you have added all of the paths.
3. Enter a file extension into the Extension field. For example, enter txt
or png. If you specify any file extension, only those file types are
returned. No other files are collected.
4. Click OK.
The appropriate message appears in the Status pane after any of these
operations.
(Optional) If you make any changes to the configuration while the file
crawler is running, click Apply Changes. The Restart File Crawler
window appears.
Click Yes.
If the file crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Click Yes.
If the feed crawler is not running, click Start.
To specify the parameters for the feed crawler, complete these steps:
1. Select Configuration --> General Settings in the Feed Crawler
pane.
4. Change the default setting of 600 for the number
of seconds for the Recrawl interval field.
5. (Optional) Enter another name for the crawler into the User agent field.
a. Paste an address for a feed into the Feed URL field. For example,
choose http://www.sas.com/success/SASRecentSuccess.xml.
b. Specify whether to follow links in the Follow links field.
There are two common types of feeds. These are the full content and
summary-only feeds. In the full content feed, all of the information
that you seek is present in the feed itself. In the summary-only feed, only a brief description of the content is passed. In this case, the link
is followed, like a traditional Web page link, to locate the rest of the
content.
If you want to crawl the summary-only feeds, select Yes in the Follow links field. Also select the parse_html document processor in the pipeline server. However, the follow links operation does not perform recursively like the web crawler.
c. Click OK and this information appears in the Feeds pane.
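The summary-only case above can be seen in a miniature RSS item: the description holds only a teaser, so a crawler that wants the full article must follow the item's link. The feed text below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A miniature summary-only RSS item: the description is only a teaser,
# so the full article lives behind <link>.
rss = """<rss><channel><item>
  <title>Press Release</title>
  <link>http://www.example.com/full-story.html</link>
  <description>A short summary only.</description>
</item></channel></rss>"""

item = ET.fromstring(rss).find("channel/item")
title = item.findtext("title")
link = item.findtext("link")

# With Follow links set to Yes, the crawler would fetch `link` next and
# hand the resulting HTML page to parse_html.
print(title, "->", link)
```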
3. Enter the Web address that you want to crawl into the Feed URL field.
The appropriate message appears in the Status pane after any of these
operations.
-
(Optional) If you make any changes to the configuration while the feed
crawler is running, click Apply Changes. The Restart Feed Crawler
window appears.
Click Yes.
If the feed crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Use the same set of documents for multiple purposes. In this case,
send the input documents to pipeline servers that are configured
differently. For example, send the documents to one pipeline server
for indexing and searching. Send this same set of documents to a
second pipeline server that analyzes the sentiment located in them.
You can find information about the number of documents at different stages in
this server and see a log file.
For all of these reasons, the proxy server is an integral part of any customized
configuration of SAS Information Retrieval Studio.
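The fan-out role of the proxy server described above can be pictured with a short sketch. This is an analogy only; the real proxy server communicates with pipeline servers over the network, and the callables below are hypothetical stand-ins.

```python
def proxy_dispatch(documents, pipeline_servers):
    """Send every input document to each configured pipeline server.

    Each 'server' here is just a callable that consumes a document,
    mimicking how one document set can feed differently configured
    pipeline servers (for example, indexing and sentiment analysis).
    """
    received = 0
    for doc in documents:
        received += 1
        for server in pipeline_servers:
            server(doc)
    return received

indexed, analyzed = [], []
count = proxy_dispatch(
    [{"id": 1}, {"id": 2}],
    [indexed.append, analyzed.append],  # two differently configured pipelines
)
print(count, len(indexed), len(analyzed))
```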
2. See the number of documents that were input to the proxy server in the
Documents received field.
In this example, the Quota (files) setting was set in the General
Settings pane of the Configuration pane at 25 for the web crawler. This
is the only crawler in this configuration. This crawler has returned the
maximum number of allowed documents.
3. See the number of documents that the proxy server sent to the pipeline
server.
If you see a discrepancy, you can use the Log pane to see the connections and
errors that might be the cause. For more information, see Section 8.5
Troubleshoot with the Log File on page 230.
2. Click Add and the Back-end Server window appears. Use this window
to add another pipeline server to the customized application that you are
building.
3. Enter the name of the machine into the Host field. For example, enter
Mirror1.
4. Click the up or down arrow to change the default setting of 9004 in the
Port field. For example, change the port to 9100.
5. Click OK and the new server is added to the Configuration pane.
- If you have stopped the proxy server for any reason, click Start in the
Proxy Server pane.
- (Optional) Click Stop and the proxy server stops running.
- If you make any changes to the configuration while the proxy server is
running, click Apply Changes.
The appropriate message appears in the Status pane after any of these
operations.
2. Click the arrow to select Errors.
3. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
4. Click Retrieve to see the specified number of lines in the log file pane
below.
5. Enter the text that you want to locate in the Text to highlight field.
For example, see each instance of 10 highlighted in the dates and find it
in the queue.
Advanced Installation
to SAS programs such as SAS Sentiment Analysis Workbench and SAS Text
Miner.
Note:
For more information about installing these software applications, see SAS
Information Retrieval Studio: Installation Guide or the installation guide for
each SAS application that you want to use.
extract plain text from documents such as Microsoft Word and PDF
files.
Second, if you choose to use the feed crawler, you might select
invalidate_duplicates_by_url. This operation ensures that only one copy
of a document is passed to another process. This document processor is
important for applications such as SAS Sentiment Analysis Workbench where
the freshness of the document matters.
Third, choose the content_categorization document processor if you want
to enable facetted search using the categorizer, concept, or contextual
extraction processors. You can also use these processors to categorize and
extract concepts and facts from your input documents before passing them to
another operation.
Fourth, use the export_csv and the export_to_files processors to export the
normalized (and analyzed) documents to put these documents into a format
that can be used by another application. To send documents directly to SAS
Sentiment Analysis Workbench, specify
export_to_sas_sentiment_analysis_workbench.
Note:
By default, when SAS Search and Indexing is installed, all input documents go
to the indexing server. This is true even if the documents are also sent to
other applications.
After you consider these available operations, use the Add Document
Processor window to add and configure your document processors. You can
choose to use one document processor, or you can build a pipeline that orders
several processors. For example, use the heuristic_parse_html operation to
extract paragraphs of text without their HTML tags. The next processor in the
pipeline might be the export_to_files processor that enables you to export
the file in XML or in text format. In either case, you can specify whether the
document stops here in the pipeline or goes to the indexing server.
The operations that you specify in the Document Processors pane occur in the
same order that they are listed in this pane. You can specify the document
processors in any order and use the Move up and Move down buttons to
reorder these operations. If document processing operations are incorrectly
ordered, unexpected results might occur.
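The ordering rule above can be sketched as a simple chain of functions. The two processors below are hypothetical stand-ins for heuristic_parse_html and export_to_files, written only to show why list order matters.

```python
import re

def run_pipeline(document, processors):
    """Apply document processors in the order they are listed.

    Mirrors the Document Processors pane: reordering the list changes
    the result, which is why incorrect ordering can give unexpected output.
    """
    for process in processors:
        document = process(document)
    return document

# Hypothetical stand-ins for the real processors.
def strip_tags(doc):
    # Remove HTML tags from the raw content, keeping the plain text.
    return {**doc, "body": re.sub(r"<[^>]+>", "", doc["raw"])}

def mark_exported(doc):
    # Flag the document as having reached the export stage.
    return {**doc, "exported": True}

doc = run_pipeline({"raw": "<p>Hello</p>"}, [strip_tags, mark_exported])
print(doc["body"], doc["exported"])
```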
SAS Content Categorization Studio: Use the files that you export
from SAS Information Retrieval Studio for training and testing
purposes.
Identify entities.
For more information, see the SAS Information Retrieval Studio: Installation
Guide.
3. Select parse_html.
5. Leave the default specification, raw, or enter a new field name in the
input-field. The processor uses this field to obtain the unmodified
document data. raw specifies that the original, unmodified content was
placed into the HTML document using this identifying field name.
6. Leave the default specification, title, in the title-output-field. You
can also enter a different field name where the processor stores the plain
text of the document title.
7. Leave the default specification, body, in the body-output-field. You
can enter a different field where the processor stores the body text
located in the input document. This field is used by other applications
such as SAS Content Categorization Studio, when they are part of the
processing pipeline.
8. Change the default entry to 1 in the output-metadata field and this
different field.
11. The entry 1 in the base64-input field specifies that the text is
15. (Optional) To change the ordering of the processors in the pipeline, click
To use the Document Inspector pane to see a document, use the following
steps:
1. Click Take Snapshot.
2. Click on a document processing operation that appears in the
Processing Stage pane. For example, click on
heuristic_parse_html. A document number appears in the
Document pane.
3. Click the number in the Document pane and the fields in this document
5. See the contents of the selected field in the Document Inspector pane.
3. Select add_field.
4. Click Next. The Document Processor: add_field window appears.
5. Enter the name of the field that you want to add to each input document
16. Click the document number under Document to see the fields for this
document in the previously empty pane. For example, click Data to see
that the value 062011 that you assigned to the add_field processor was
assigned to document 1.
Extraction Studio are used as concepts or facts. Any concept that is developed
in SAS Contextual Extraction Studio and specified with a PREDICATE or
SEQUENCE rule is a fact.
The content_categorization Document Processor is the client for SAS Content
Categorization Server. The categories, concepts, and facts are applied by SAS
Content Categorization Server to the documents processed by SAS
Information Retrieval Studio.
The following example uses concepts. If you want to use categories or facts,
make the appropriate substitutions. Also see Chapter 10: Creating Facetted
Search Labels Using content_categorization. This chapter uses the Document
Processor: content_categorization wizard to create labels for facetted search.
To map concepts to labels, complete these steps:
1. Select Pipeline Server.
2. Click the up or down arrow to select a different number. This is the
number of seconds that the Pipeline Server waits before this server stops
attempting to download an input field.
appears. Use this window to add any of the projects that are uploaded to
SAS Content Categorization Server to SAS Information Retrieval
Studio.
window appears.
Project
13. (Optional) Continue to add projects using Step 9. through Step 12. The
concepts in each of the projects that you select are available to match
your input documents.
14. Click Next. The Document Processor: content_categorization window
appears.
15. (Optional) Enter the fields that are in any of your input documents
where you want to locate matches for your concepts. Enter these fields
as a comma-separated list into the Input fields field. If you leave this
field blank, all fields, with the exception of those listed in the Input
fields to exclude field, are searched.
16. (Optional) By default, fields that contain information about the
document are listed in the Input fields to exclude field. You can add
additional fields, or delete fields from this list:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime
If you edit this list, be sure to insert a comma (,) between each field.
17. (Optional) If you make any changes, click Finish to save these edits.
19. Click Add to specify the concepts that are matched in input documents.
20. Click
Concept
Hint:
21. (Optional) By default, the name of the concept is entered into the Field
name field.
22. (Optional) By default, the name of the label for the facetted search is
entered into the Caption field. For example, see Location. Enter a new
caption, if you choose. For more information about facetted search, see
Chapter 10: Creating Facetted Search Labels Using
content_categorization.
23. (Optional) By default, %c: %i is entered into the Format field. The
available format symbols are %c, %p, %m, %i, %I, and %%. The %i symbol
outputs the information associated with the entity, or the match text
if no information is available.
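The expansion of a format string such as the default %c: %i can be illustrated with a small sketch. This is not the product's formatter; the meanings assumed here are only %c as the concept name, %i as the match information (falling back to the match text), and %% as a literal percent sign.

```python
def expand_format(fmt, concept, info, match_text):
    """Expand a label format string such as the default '%c: %i'.

    Assumed meanings for this sketch: %c is the concept name, %i is the
    information associated with the entity (or the match text if no
    information is available), and %% yields a literal '%'.
    """
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "c":
                out.append(concept)
            elif code == "i":
                out.append(info if info else match_text)
            elif code == "%":
                out.append("%")
            else:
                out.append(fmt[i:i + 2])  # leave unknown codes untouched
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)

print(expand_format("%c: %i", "Location", "", "Paris"))
```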
25. (Optional) Click Copy Defaults to revert to the concepts entries in the
Concepts tab.
27. See the newly entered concept with its field name and caption.
28. (Optional) If you want to continue to add concepts, click Add. Repeat
Step 19 through Step 26 until you have added all of the concepts that you
want to use for facetted search.
Note:
field. You can choose to enter a different name into this field.
field. You can choose to enter a different name into this field.
31. (Optional) By default, %c: %i is entered into the Default format field.
You can choose to enter different symbols into this field. You can edit
this entry using any of the symbols in Table 9-1 on page 252 with the
exception of %I.
32. (Optional) By default, ; (semicolon) is entered into the Default
separator field.
33. (Optional) By default, 15 is entered into the Max concepts field. This
37. See that the name that you entered into Field name appears in the
38. Click Start in the main Pipeline Server window to restart the Pipeline
Server.
39. When you click the Add button in the Matching pane of the query web
server, you can select this field in the Add Field window. This caption
name appears as a field in the Matching pane of the Query Web Server.
This caption also appears in the user interface when a matching term is
located in an input document.
1. Use the steps in Section 9.5 Match Categories, Concepts, and Facts on
page 246.
The appropriate message appears in the Status pane after any of these
operations.
See the progress of the input documents in the Status pane:
a. The Overall - Pending table cell is always empty.
XML tags removed in the XML parsing - Pending table cell. For
example, 1.
d. See how many XML documents have completed the process of
XML tag removal in the XML parsing - Finished table cell. For
example, 31.
e. See how many documents are in the pipeline process in the
Document processing - Pending table cell.
f.
g. See how many documents are going to the indexing server in the
Sending to the indexer - Pending table cell.
2. Click the arrow to select Errors.
3. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
4. Click Retrieve to display the specified number of lines in the log file.
5. (Optional) Enter a search term into the Text to highlight field.
10
Creating Facetted Search Labels Using content_categorization
Define your labels using the categories and concepts that you specify in SAS
Content Categorization Studio with or without SAS Contextual Extraction
Studio. Labels apply the matching requirements set by the rules that define
categories and concepts. Labels also enable facetted search operations in the
query interface of SAS Information Retrieval Studio.
Use SAS Content Categorization Studio alone to develop categories that
identify documents based on their subject matter. Also define concepts that
locate relevant terms based on rules that are specified by lists of matching
terms or parts of speech and other symbols.
Display 10.1 SAS Content Categorization Studio
When you use the add-on SAS Contextual Extraction Studio application with
SAS Content Categorization Studio, you can define LITI concepts. These
concepts increase matching precision (matches only the relevant texts) and
recall (matches all of the relevant texts). LITI concepts differ from the
classifier and grammar concepts in SAS Content Categorization Studio
because you can mix rule types within a single definition.
Contextual Extraction, or LITI, concepts can also include facts. Facts are rules
that are defined by arguments. Arguments are defined by concepts that are
related if they are matched by the fact rule. For this reason, facts return related
pieces of information in input documents. For example, define facts when you
want to identify relationships between drugs, symptoms, and gender.
Note:
When you use facts as labels, you can specify the string that is returned for the
label. Each string contains terms that are custom filled according to the
matched text.
When you choose categories, SAS Information Retrieval Studio applies all of
the categories in the selected project to input texts. Although the default
selections for concepts and facts are the same, you can select specific facts and
concepts to apply.
All LITI concept definitions that include PREDICATE and SEQUENCE rules are
treated as facts. If a LITI concept definition contains one or more facts and
other concept rules, the facts and the concepts are applied separately. The default
settings in the Document Processor: content_categorization wizard return the
matched fact and concept rules for each LITI definition under the same label
name. For this reason, consider renaming either the fact or the concept label
and field name for each LITI definition that contains a concept and a fact.
Choose the display selection that works best for your end users:
Display 10.3 Example of Default Setting
Content Categorization Studio are the default settings for SAS Information
Retrieval Studio. (You can also write a custom string that displays these
names.) For this reason, use care when specifying names and writing
PREDICATE and SEQUENCE rules that specify terms that are visible to the end
user.
Also use care when writing rules that return many matches. For example, you
might develop a SAS Content Categorization Studio project that includes an
EMAIL concept. This concept might contain rules defined by regular
expressions that are designed to return all e-mail accounts within internal
company documents. The inclusion of this EMAIL concept might not be
appropriate for a facetted search on the Web.
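A pattern like the EMAIL concept described above can be illustrated in Python's regex dialect. Real concept rules in SAS Content Categorization Studio use their own rule syntax, so this is only an analogy, and the pattern is deliberately simplified.

```python
import re

# A deliberately simple e-mail pattern of the kind an EMAIL concept rule
# might express; it is not the product's rule syntax.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact jdoe@example.com or the team at support@corp.example.org."
print(EMAIL.findall(text))
```

A rule this broad would surface every address in internal documents, which is exactly why it might be inappropriate for a public facetted search.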
Before you upload a SAS Content Categorization Studio project to SAS
Content Categorization Server, check the Project Settings - Misc tab of SAS
Content Categorization Studio. If there are entries in the XML Default Fields
field, remove these fields and leave the XML Default Fields field blank.
(These fields apply to categories and to LITI concepts and facts. For this
reason, grammar and classifier concepts in SAS Content Categorization Studio
are matched regardless of the field entries. Other matches that should occur
might not.)
Display 10.5 Project Settings - Misc Tab
Use care when changing rules and uploading projects to avoid propagating the
same rule or its variations. For example, you might upload a SAS Content
Categorization Studio project to SAS Content Categorization Server. If you
change a concept definition and upload the same project with a new name to
the server, both rules are available for matching. This is true if you add both
projects to your SAS Information Retrieval Studio project using the Document
Processor: content_categorization wizard.
In other words, matches might be made on concept definitions where one or
more definitions is specified using an outdated rule. This behavior can occur
because SAS Information Retrieval Studio consolidates all of the rules for
categories, concepts, and facts with the same names.
Naming also affects LITI facts and concepts. For example, you might have a
LITI concept definition that includes both fact and concept definitions. See the
example below:
Figure 10.2 Facts and Concept Rules in One Concept Definition
Note:
In this example, if matches occurred for both the facts and concepts, all of
these matches would return a match on the SIDE_EFFECT concept. However,
you can use the content categorization document processor to specify different
names for the concept and fact matches.
2. Use the Build menu to build, compile, and upload the relevant project.
3. Specify the name of the project in the Upload window that appears. The
entry in the Server Project Name field can be unique for the SAS
Information Retrieval Studio project.
4. (If you uploaded your project a while ago) Select Start --> Programs -->
SAS Content Categorization Server to confirm that SAS Content
Categorization Server is running.
9. Click Search to see the results. The facetted search labels appear on the
10. (Optional) Check your search results against the original project to
Hint:
3. Select content_categorization.
4. Click Next. The Document Processor: content_categorization window
appears.
Click the up or down arrow to select a different number of seconds that
the pipeline server waits before it stops trying to complete a matching
operation.
8. Click Next. You can now add your projects to SAS Content
Categorization Server.
selection.
3. (Optional) By default, a project, such as Sample, is selected in the
Project field. Click the arrow to make a different selection.
content_categorization window.
5. (Optional) Repeat Step 1 through Step 4 above until you have added all
of the projects and their matching types. For example, add MedicalProj
to include concepts. Add MedicalProj2 to match LITI concepts and
facts. If you have multiple projects for a specific matching technology,
you can upload all of these projects.
6. Click Next.
1. (Optional) By default, the Input Fields field is blank. Use a
comma-separated list to specify any field names that you want to search
for matches for your categories, concepts, and facts. If you leave this
field blank, all of the fields are searched, with the exception of any
fields entered into the Input fields to exclude field. If you specify
any fields, only the listed fields are searched.
these fields:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime
individual category name. You can enter a new format that might
include %% for a literal percent sign. You can also use x as a modifier to
request XML escaping. For example, enter %xc.
5. (Optional) Enter a regular expression into the Category name pattern
field. Regular expressions specify the pattern for the category name.
6. (Optional) Enter a string into the Category name replacement field.
Enter a new separator such as a comma (,) for the matched categories.
8. (Optional) By default, the highest number of categories that can be
9. Click Finish.
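The Category name pattern and Category name replacement fields in Steps 5 and 6 can be pictured as a regular-expression substitution. The exact engine and syntax the product uses are not documented here, so this sketch assumes Python-style regex behavior.

```python
import re

def rename_category(name, pattern, replacement):
    """Apply a category-name pattern and replacement.

    Assumes the fields behave like a regular-expression substitution
    over each matched category name.
    """
    return re.sub(pattern, replacement, name)

# e.g. strip a 'Top/' prefix from every matched category name
print(rename_category("Top/Sports/Golf", r"^Top/", ""))
```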
Matches for any of the concepts that you specify explicitly, appear in the table
at the top of the Document Processor: content_categorization window. These
matches appear in the specified format and are placed into the specified output
field. Matches for any other concepts that are not in the table are assigned the
default format. The text of these matches appears in the Default field name.
You can also choose to exclude concepts from matching. For example, exclude
all of the matches that are not specified when you leave the Default field
name empty in the Concepts tab. If you want to specify one or more concepts
to exclude, leave the Field name blank when you specify the excluded
concepts.
If you want to prevent a specific concept from appearing in the output, leave
the Field name field empty.
To add concepts to the project, complete these steps:
1. Click Concepts to access the Concepts pane. You can use this pane to
add all of the concepts and contextual extraction concepts. If any of the
LITI concepts include PREDICATE or SEQUENCE rules, these rules are
matched as facts. Access these facts using the Facts pane.
appears. Use this pane to specify the settings for each individual
concept. These settings override the specifications for all of the concepts
in the Concepts pane.
3. Click
4. (Optional) When you select a concept using Step 3. above, the name of
the selected concept appears in the Field name field after you make a
selection in the Concept field.
In this example, the concept SIDE_EFFECT also contains PREDICATE
and SEQUENCE rules. For this reason, SIDE_EFFECT appears in the Facts
drop-down list also. In order to avoid ambiguity in the search results,
you can choose to rename either the concept or the fact. In this
example, negativeeffects is entered.
5. (Optional) The name of the selected concept appears in the Caption
field after you make a selection in the Concept field. For example,
Negative Effects. You can enter a new caption name such as
Negative Effects. For more information, see Section 9.5 Match
Categories, Concepts, and Facts on page 246. You can also use the
sample project in Chapter 4: Sample Configurations
Note:
6. (Optional) By default, a format is entered into the Format field for the
concept. The available format symbols are %c, %p, %m, %i, %I, and %%.
The %p symbol outputs the concept name with its path; it works like %c
but includes the path with the concept name.
10. See the concepts in the Concepts tab. Make sure that you loaded all of
the concepts.
field. You can enter a new caption name for facetted search.
13. (Optional) By default, %c: %i is entered into the Default format field
for the concept name. You can edit this entry using any of the symbols
in Table 10-1 on page 284.
14. (Optional) By default, ; (semicolon) appears in the Default separator
field.
Hint:
window appears.
3. Click
4. (Optional) When you select a fact using Step 3 above, the name of the
fact is entered into the Field name field. For example, see Side
Effect. Enter a new name if you choose.
6. (Optional) By default, the format for the matched fact is entered into the
Format field.
7. In this example, the SIDE_EFFECT concept has two arguments: drug and
gender.
The available fact format symbols are %f, %a, %v{name}, %m, %s, and %%.
The %a symbol outputs the argument name for the arguments that comprise the
definition.
8. (Optional) Enter a different format into the Format field. You can also
use any of the symbols in Table 10-2 above.
9. (Optional) By default, a , (comma) appears in the Argument separator
field.
12. Click OK. If you want to use the same settings specified in the Facts
13. (Optional) By default, facts is entered into the Default field name
field.
14. (Optional) By default, Facts is entered into the Default caption field.
for the concept name. You can edit this entry using any of the symbols
in Table 10-2 on page 289 with the exception of %v{name}.
Note:
4. (Optional) If you want to export these fields as files, select File export.
5. (Optional) If you want to export these fields in comma-separated
format, select CSV export. Choose this selection to export your files
into programs such as SAS Text Miner or Microsoft Excel.
6. (Optional) If you want to export these fields to a database, select
ODBC export.
7. Click Finish and see the categories field listed in the Document
Processors pane.
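What a comma-separated export produces can be sketched with the standard csv module. This is an illustration of the output format only; the real export_csv processor's column layout and options may differ.

```python
import csv
import io

def export_csv(documents, fields):
    """Write the selected document fields in comma-separated format.

    Fields that a document carries but that are not requested (such as
    raw content) are silently skipped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(documents)
    return buf.getvalue()

docs = [{"id": "1", "title": "Annual report", "raw": "<html>..."}]
print(export_csv(docs, ["id", "title"]))
```

A file in this shape can then be read by programs such as SAS Text Miner or Microsoft Excel, as the step above notes.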
1. Click Stop, Apply Changes, and Start to apply changes and to restart
2. Begin with the selected crawler and work down through the list of
Click Edit in the Document Processors pane. You can follow any of the
steps in Section 10.2.1 Access the Projects on SAS Content
Categorization Server on page 274 through Section 10.2.4 Specify
Output on page 292.
2. Enter a search term into the blank field to the left of the Search button.
3. Click Search.
4. See the results.
11
Configure an Index
See the list of field names that are the default selections for the index.
For example, see id, title, date, and so on.
Click Add to enter a new field name with its functionality. You can
enter any field name that is found in any of the input documents. It
is not necessary for every document to contain each of the specified
fields.
2. When you click Add or Edit, the Add Field window appears.
Purpose
Searching: (Default) Search for words that match the input query terms.
This selection is equivalent to the standard function.
Label: Select for facetted search only. This selection is equivalent to
marking the field as both standard and Boolean.
For more information about facetted search labels, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.
Other Purpose selections include Display and Sorting, Identification, and
Custom.
3. Click
4. Use the Add Field window to add, and make changes to, all of the fields
in the index.
5. Click Apply Changes to delete the current index and to set the
If you change the field names, types, and functionalities that you specify
in the Configuration pane of the indexing server, the index is affected.
Whenever you make a change to any of these operations, the current index is not
affected. These changes can affect only the new index. For this reason, you have
two choices:
- Click the Delete Index button to remove the existing index. A new
index can be built with the specified changes after you restart the
crawler.
- Click the Apply Changes button when the indexing server is running.
The existing index is deleted and the indexing server is restarted so that
a new index can be built.
For example, if you make changes to fields in the pipeline server, they can
affect the indexing server.
The appropriate message appears in the Status pane after any of these
operations.
(Optional) If you make any changes that affect the index, click Delete
Index. This operation removes the old index. For example, if you add a
title field to the list of indexed fields, a new index might be necessary.
2. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
12
The appropriate message appears in the Status pane after both the start
and stop operations.
2. Click the up or down arrow to select a new Number of lines; the default
setting is 20. For example, choose 25 to see more lines.
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to
highlight field.
13
3. Click the arrow in the Hierarchical field to choose a hierarchical,
non-hierarchical, or flattened display of the labels. In this example,
Yes is selected to enable a hierarchical display of the categories.
4. Make any other changes and click OK to see this selection in the Labels
pane.
See the following examples that include search windows that do, and do not,
display labels.
Note:
You can click the left mouse button on a hyperlink label to make one of the
following selections:
Require
the path to the selected label appears below the search box. The displayed
documents match both the query term and the selected label. If you specify
more than one label, the documents match the query term and the selected
labels. In this case, each path is appended with a plus sign (+).
Exclude
one label, preceded by the minus sign (-), appears in the SAS Information
Retrieval Studio search window. The displayed documents match the
query term, but not the selected labels.
View
one label appears in the SAS Information Retrieval Studio search window
that displays all of the matching documents for this label only. Bolded
matches for existing query terms no longer appear below the document
links on the right side of the search window.
Remove
this operation is available for the label, or path, appearing at the top of the
SAS Information Retrieval Studio search window. This selection is only
available after you use any of the above operations.
Hover the mouse over a category or concept to see the hierarchy, or
parent-child relationships, existing in SAS Content Categorization Studio.
specify the query fields and enable end users to mark required and
excluded words and quoted phrases by prefixing them with plus (+) and
minus (-) signs. When you make this selection, you specify the indexing
fields that are searched.
Advanced (bsearch)
enable end users to specify the query fields. Query terms can be combined
when you specify the following operators:
- Boolean operators such as AND, OR, and NOT add precision to your search.
- Positional words such as SENT and PAR specify that matches are located
only if the specified words appear in the same sentence or paragraph,
respectively.
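The Boolean part of the Advanced (bsearch) syntax can be modeled with a toy evaluator. This is not the product's query language; the real syntax, including the positional words SENT and PAR, is richer than this sketch.

```python
def matches(doc_terms, query):
    """Evaluate a tiny AND/OR/NOT query against a document's term set.

    Queries are nested tuples, e.g. ("AND", ("TERM", "a"), ("NOT", ("TERM", "b"))).
    """
    op, *args = query
    if op == "TERM":
        return args[0] in doc_terms
    if op == "AND":
        return all(matches(doc_terms, a) for a in args)
    if op == "OR":
        return any(matches(doc_terms, a) for a in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(op)

doc = {"sas", "search", "index"}
q = ("AND", ("TERM", "search"), ("NOT", ("TERM", "sentiment")))
print(matches(doc, q))
```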
Click the arrow to select Advanced.
Click the arrow to select a different field. Each of these fields is listed
in the Configuration pane of the indexing server. In other words, the fields
that appear in this drop-down menu are also in the index.
Click the up or down arrow to select a new Weight. For example, choose 5 to
weight matches that are located in the body field more heavily than
those in the title field. The weight setting is relative across all fields.
5. Click the arrow to choose a different selection in the Sort type
drop-down menu. The selection that you make in this field determines the
fields that are displayed below this field.
If the density weight is more important than any of the other weights,
specify the highest weight number for this field.
a. (Optional) Click the up or down arrow to select a new Cosine
weight. This metric assigns the highest weights to the most
frequently occurring terms. It takes noise words into consideration.
(Noise words are the words that appear with enough frequency
that they are ranked down.)
b. (Optional) Click
c. (Optional) Click
d. (Optional) Click
e. (Optional) Click
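The idea of relative field weights can be sketched with a simple hit-count scorer. This is only an analogy: the product's actual ranking metrics (cosine, density, and so on) are more sophisticated than counting term occurrences.

```python
def score(document, query_terms, field_weights):
    """Rank by counting query-term hits per field, scaled by field weight.

    With body weighted 5 and title weighted 1, a hit in the body
    contributes five times as much as a hit in the title.
    """
    total = 0
    for field, weight in field_weights.items():
        words = document.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in query_terms)
    return total

doc = {"title": "sas studio", "body": "sas information retrieval studio uses sas"}
print(score(doc, ["sas"], {"title": 1, "body": 5}))
```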
Note:
3. Leave the default field selection in Field name. For example, click the
arrow to select either Yes or Flattened. To see the types of results
that are displayed for these selections, see Section 13.2 Choosing How
Search Returns Are Displayed on page 310.
Click the arrow to select No. If you select No, the numbers of matching
documents do not appear to the right of the labels in the SAS Information
Retrieval Studio search window.
6. Click the up or down arrow to select a number of matching fields in the
Show in matches field. For example, choose to display the three
categories with the highest number of matches in the SAS Information
Retrieval Studio search window. (If there are more than the specified
number of fields, the term and other information appears. This term is
appended to the list to indicate that the display is incomplete.)
7. Click the up or down arrow to select a number of labels to display in
the search window.
For example, the link field might contain the unique identifier 12345.
However, the browser does not understand this string. In this case, set
Determine the look and feel of the SAS Information Retrieval Studio search
window.
To specify the theme of the search window, complete these steps:
1. Select Query Web Server --> Configuration --> Theme.
3. Leave the default selection sans-serif, or enter a new display font into
this field. Click the up or down arrow to select a new size for the
display letters in the Font size field. For example, choose 12 to
display the search returns in a larger font size.
6. Click the link in the Status tab to see the results of your changes in the
search window.
For more information about formatting the search window, see http://
www.w3.org/TR/CSS/ui.html#system-colors.
field. Click the arrow to select a color such as red.
Note:
3. Leave the default selection Custom in the Header text color field.
Click the arrow to select a different color.
4. Click the color box to select a new color.
5. Leave the default selection Custom in the Visited link color field. Click the color box to select a new color.
6. Leave the default selection Custom in the Hover link color field. Click the color box to select a new color.
7. Leave the default selection Window in the Menu border color field. Click the arrow to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.
8. Leave the default selection Window in the Menu unselected background color field. Click the arrow to select ActiveBorder, ActiveCaption, and so on.
2. Leave the default selection None in the Left header image field. Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.
3. Leave the default selection sas.png in the Right header image field. Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.
4. Click the link in the Status tab to see the results of your changes in the search window.
The appropriate message appears in the Status pane after any of these
operations.
Click the link to the machine where the query web server is running to
see the SAS Information Retrieval Studio search window.
1. Click the up or down arrow to select a new Number of lines; the default setting is 20. For example, choose 25 to see more lines.
2. Click
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to highlight field.
2. Click Today to see the screen shown above. Today's date is displayed.
number of times that these words were entered into the search window.
5. See the query with the highest number of entries at the top of the list.
7. Click the Most Frequent Queries Without Matches tab.
8. See any search terms that were not matched by the searched corpus and the number of times that each term was input by end users. For example, produc was entered one time.
10. Click the Hourly Query Rate tab to see the query traffic by hour.
11. See each Hour and the Number of Queries. For example, see 8 am with 17 queries.
6. See each Day of the week and the Number of Queries input for that day.
6. Click the Monthly Query Rate tab to see the total number of queries that were received each month.
7. See each Month and the Number of Queries for that month.
The number for each day of the week matches the total number of input queries received on each of the respective weekdays over the course of the year.
Hint:
6. Use Step 6 through Step 7 on page 347 for the Monthly Query Rate tab.
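As an illustrative sketch (the timestamps are hypothetical, not product data), the weekday totals described above can be computed by bucketing each query's timestamp by its day of the week:

```python
from collections import Counter
from datetime import datetime

# Hypothetical query timestamps; in practice these would come from the query log.
timestamps = [
    "2011-06-13 08:15",  # a Monday
    "2011-06-14 09:00",  # a Tuesday
    "2011-06-20 10:30",  # a Monday in a later week
]

# Bucket every query by its day of the week, across the whole period.
by_weekday = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%A") for ts in timestamps
)
print(by_weekday["Monday"])  # -> 2
```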
Discover any changes that should be made to the index. For example,
see whether queries without matches might be matched if an additional
field is added to the index.
See whether the most frequent query terms adequately match the
searched corpus. If not, add a new link.
1. Click the up or down arrow to select a new Number of lines; the default setting is 20. For example, choose 25 to see more lines.
2. Click
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to highlight field.
Appendixes
Appendix A: Regular Expressions and XML Field Extraction File
Regular Expressions
PCRE: http://www.pcre.org/
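As a brief illustration, the following uses Python's re module, whose syntax largely overlaps with PCRE, to capture a date field from raw text. The pattern and sample text are hypothetical, not taken from the product:

```python
import re

# Hypothetical raw document text; the pattern below is an example only.
text = "Published: 2011-06-15 by staff"

# A PCRE-style pattern with a named capture group for the date field.
match = re.search(r"Published:\s*(?P<pdate>\d{4}-\d{2}-\d{2})", text)
if match:
    print(match.group("pdate"))  # -> 2011-06-15
```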
<thumbnail>
<tsrc>http://img.com/</tsrc>
</thumbnail>
</article>
Suppose that you want to extract the value of the content field and the value of the tsrc field in the thumbnail field. To extract only the tsrc field that is located inside the thumbnail field, specify the following syntax:
<article>
<content />
<tsrc index="no" />
<thumbnail index="no">
<tsrc />
</thumbnail>
</article>
In this example, the attribute index has the value "no". This value specifies that the parser does not add the value of this field to its list of documents. The default value of the index attribute is "yes". This default means that every field in the input XML that does not have the index attribute remains in the document.
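A minimal sketch of this filtering rule, assuming a simplified model in which an element's path in the specification determines whether its text is extracted. The parser below is illustrative only and is not the product's implementation:

```python
import xml.etree.ElementTree as ET

# The field extraction specification from the example above.
SPEC = """<article>
  <content />
  <tsrc index="no" />
  <thumbnail index="no">
    <tsrc />
  </thumbnail>
</article>"""

# A hypothetical input document that matches the structure discussed above.
DOC = """<article>
  <content>Story text</content>
  <tsrc>http://other.com/</tsrc>
  <thumbnail>
    <tsrc>http://img.com/</tsrc>
  </thumbnail>
</article>"""

def build_rules(spec_elem, path=""):
    """Map each element path in the specification to True (index) or False (skip)."""
    rules = {}
    for child in spec_elem:
        p = path + "/" + child.tag
        rules[p] = child.get("index", "yes") == "yes"  # the default value is "yes"
        rules.update(build_rules(child, p))
    return rules

def extract(doc_elem, rules, path=""):
    """Collect (path, text) pairs for every element whose rule allows indexing."""
    fields = []
    for child in doc_elem:
        p = path + "/" + child.tag
        if rules.get(p, True) and child.text and child.text.strip():
            fields.append((p, child.text.strip()))
        fields.extend(extract(child, rules, p))
    return fields

rules = build_rules(ET.fromstring(SPEC))
fields = extract(ET.fromstring(DOC), rules)
print(fields)  # -> [('/content', 'Story text'), ('/thumbnail/tsrc', 'http://img.com/')]
```

Only the content field and the tsrc field nested inside thumbnail survive; the top-level tsrc field is excluded by its index="no" attribute.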
Appendix B: Recommended Reading
The following books are recommended as companion guides:
- SAS Sentiment Analysis Workbench: User's Guide: Review and edit the automated analyses and create reports with graphs that illustrate these analyses.
Use the language book that applies to the language that you use to create your project. Each of the SAS world language books contains a comprehensive list of part-of-speech tags.
For a complete list of SAS publications, see the current SAS Publishing
Catalog. To order the most current publications or to receive a free copy of the
catalog, contact a SAS representative at
SAS Publishing Sales
SAS Campus Drive
Cary, NC 27513
Telephone: (800) 727-3228*
Fax: (919) 677-8166
E-mail: sasbook@sas.com
Web address: support.sas.com/pubs
* For other SAS Institute business, call (919) 677-8000.
Customers outside the United States should contact their local SAS office.
Appendix C: Glossary
Boolean operators
specify words such as AND, OR, and NOT that construct logical definitions to locate the matches that you seek.
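For example (the search terms are hypothetical), a query that combines these operators might look like this:

```text
(printer OR scanner) AND drivers NOT linux
```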
caption
specifies multiple sets of training documents. See corpus for one set.
crawl
definition
defines a concept. There can be many rules for each concept definition.
This term is also used interchangeably with rule. See rule.
document
hash
changes a string of characters into a value that can be indexed. The hash process expedites the search process.
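As an illustrative sketch only (the product's actual hash function is not documented here), a simple polynomial hash maps a string of characters to a fixed-range integer that can be indexed:

```python
# A toy polynomial rolling hash; illustrative only, not the product's algorithm.
def simple_hash(s: str, buckets: int = 1024) -> int:
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % buckets  # multiply-and-add over character codes
    return h

print(simple_hash("document"))  # -> 283
```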
label
specifies the value of the field that is passed to the query web server for each match that appears within the search window. Also see caption.
metadata
MIME type
noise words
appear with enough frequency that they are ranked down in the metrics for weight.
polite
means that a single thread does not overwhelm a site with download requests and that the crawler respects the robots.txt standard. This standard enables Web site developers to specify the portions of their sites that should not be crawled.
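For illustration, a minimal robots.txt file (the path is hypothetical) that asks all crawlers to skip one directory looks like this:

```text
User-agent: *
Disallow: /private/
```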
precision
see where the information about the product is located in the document.
The information can appear primarily in the top 20%, or in the bottom
80%, of the selected document.
raw
specifies the original, unmodified content that was placed into an HTML
document using this identifying field name.
recall
measures the number of documents that are a match for the definition out
of those texts that were successfully returned.
rule
Index
A
Abstract source
defined ...............................................................................................................59
Action heading
defined ...............................................................................................................20
Add Backend window usage ..................................................................................141
Add button
defined ............................................... 19, 20, 22, 23, 28, 29, 30, 38, 42, 47, 53, 56
Add Credential window
usage ................................................................................................................135
Add Entry Point window
usage ................................................................................................................121
Add Extension window
usage ................................................................................................................139
Add Field window
usage ................................................................................................ 143, 146, 148
Add Filename Extension window
usage ................................................................................................................133
Add HTTP Proxy window ......................................................................................120
usage ................................................................................................ 118, 119, 120
Add keywords to PDF links
defined ...............................................................................................................59
Add Path to Exclude window
usage ................................................................................................................138
Add Path window
usage ................................................................................................................137
Add Scope Rule window
usage ................................................................................................................130
add_field
document processor ...........................................................................................74
All Time button
defined ...............................................................................................................67
B
bsearch
defined ............................................................................................................ 317
C
Caption heading
defined .............................................................................................................. 56
categorizer
defined ............................................................................................................ 158
color box window
usage ............................................................................................................... 151
colors
search window ................................................................................................ 331
Colors tab
defined .............................................................................................................. 61
concept_extractor
defined ............................................................................................................ 159
Configuration pane
defined .............................................................................................................. 37
content_categorization
document processor .......................................................................................... 74
contextual_extractor
defined ............................................................................................................ 159
Cosine Weight
defined .............................................................................................................. 55
Crawl continuously
defined .........................................................................................................27, 33
Credentials tab
web crawler ....................................................................................................... 14
D
Date
Sort tab ..............................................................................................................54
Date source
defined ...............................................................................................................59
Day
defined ...............................................................................................................71
Day field
defined ...............................................................................................................67
default_mime_type_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
default_title_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
Delete Index button
defined ....................................................................................................... 45, 303
Delete Index window
usage ................................................................................................................153
Density Weight
defined ...............................................................................................................55
document
defined ...............................................................................................................44
Document processing heading
defined ...............................................................................................................40
Document Processor
add_field window ..............................................................................................77
content_categorization window ................................................................. 78, 274
default_mime_type_from_url window ..............................................................95
default_title_from_url window ..........................................................................95
document_converter window ............................................................................96
export_csv window ............................................................................................97
export_to_files window ...................................................................................100
export_to_odbc window ..................................................................................102
export_to_sentiment_analysis_workbench window ........................................104
extract_abstract window ..................................................................................106
extract_pdate window ......................................................................................107
heuristic_parse_html window ..........................................................................108
invalidate_duplicates_by_url window .............................................................110
match_and_copy window ................................................................................111
modify_field_name window ............................................................................112
E
Edit Backend window
usage ............................................................................................................... 142
Edit button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 39, 42, 47, 53, 57
Edit Credential window
usage ............................................................................................................... 136
Edit Entry Point window
usage ............................................................................................................... 125
Edit Extension window
usage ............................................................................................................... 140
Edit Field window
usage ........................................................................................................147, 150
Edit Filename Extension window
usage ............................................................................................................... 134
Edit Path to Exclude window
usage ............................................................................................................... 139
Edit Path window
usage ............................................................................................................... 138
Encapsulate XML files
defined ............................................................................................................... 27
entry points
defined ............................................................................................................ 196
F
facetted search
defined .............................................................................................................163
feed crawler
configure ..........................................................................................................218
defined ......................................................................................................... 9, 157
Feeds pane .........................................................................................................34
General Settings pane ........................................................................................33
operations ..........................................................................................................31
run ....................................................................................................................222
usage ................................................................................................................217
Feed URL
feed crawler .......................................................................................................34
Feeds tab
feed crawler ....................................................................................................... 32
Field Name heading
defined .........................................................................................................46, 56
Field Name tab
defined .............................................................................................................. 52
Field value ................................................................................................................ 54
Sort tab .............................................................................................................. 54
file crawler
configure ......................................................................................................... 207
defined .........................................................................................................9, 156
general settings ............................................................................................... 208
run ................................................................................................................... 213
Filename Extensions tab
file crawler ........................................................................................................ 26
web crawler ....................................................................................................... 14
Find button
defined .............................................................................................................. 12
Finished heading
defined .............................................................................................................. 41
flattened hierarchy
search returns .................................................................................................. 315
Follow Links
feed crawler ....................................................................................................... 34
Font
defined .............................................................................................................. 60
Font size
defined .............................................................................................................. 61
formatting
query web server ............................................................................................. 162
Freshness Weight
defined .............................................................................................................. 55
fsearch
defined ............................................................................................................ 317
Functionality heading
defined .............................................................................................................. 46
G
General Settings tab
feed crawler .......................................................................................................32
file crawler .........................................................................................................26
web crawler ............................................................................................... 14, 193
H
Header background color
defined ...............................................................................................................62
heuristic_parse_html
defined .............................................................................................................159
document processor ...........................................................................................75
Host
defined ...............................................................................................................38
Hour
defined ...............................................................................................................70
HTTP proxy
defined ...............................................................................................................15
feed crawler .......................................................................................................33
I
Import Settings window ..........................................................................................118
index
configure ..........................................................................................................298
defined .............................................................................................................161
input documents ..................................................................................................8
indexing server
defined ...............................................................................................................10
run ....................................................................................................................302
usage ................................................................................................................297
input documents
index ....................................................................................................................8
invalidate_duplicates_by_url
defined .............................................................................................................159
document processor ...........................................................................................75
L
labels
defined ............................................................................................................ 163
hierarchy ......................................................................................................... 312
navigation tools ............................................................................................... 264
usage ............................................................................................................... 321
Labels tab
defined .............................................................................................................. 51
Last busy time heading
defined .............................................................................................................. 41
Last document processed
defined .............................................................................................................. 37
Last document received
defined .............................................................................................................. 37
Left header image
defined .............................................................................................................. 64
Link prefix
defined .............................................................................................................. 59
Link source
defined .............................................................................................................. 59
Link suffix
defined .............................................................................................................. 59
Link traversal order
defined .............................................................................................................. 17
log file
entire application ............................................................................................... 11
feed crawler ..................................................................................................... 223
file crawler ...................................................................................................... 214
indexing server ................................................................................................ 303
pipeline server ................................................................................................. 260
proxy server ...............................................................................................12, 230
query server ..................................................................................................... 306
query web server ......................................................................................337, 349
troubleshoot .................................................................................................... 223
M
Match Formatting tab
defined .............................................................................................................. 51
usage ............................................................................................................... 326
match type
select ................................................................................................................318
Match Type heading
defined ...............................................................................................................20
match_and_copy
document processor ...........................................................................................75
matches
sort ...................................................................................................................319
Matching tab
defined ...............................................................................................................50
usage ................................................................................................................317
Maximum file size (megabytes)
defined ...............................................................................................................26
Maximum number of related labels
defined ...............................................................................................................57
Maximum number of retries
defined ...............................................................................................................16
Menu selected background color
defined ...............................................................................................................63
Menu selected text color
defined ...............................................................................................................64
Menu unselected background color
defined ...............................................................................................................63
Menu unselected text color
defined ...............................................................................................................63
MIME type source
defined ...............................................................................................................59
modify_field_name
defined .............................................................................................................159
document processor ...........................................................................................76
Month
defined ...............................................................................................................72
Month field
defined ...............................................................................................................67
most frequent queries
defined .............................................................................................................162
query statistics server ......................................................................................162
most frequent queries without matches
query statistics server ......................................................................................163
Move Down button
defined ......................................................................................................... 42, 57
Move Up button
defined .........................................................................................................42, 57
N
Next button
defined .............................................................................................................. 67
no hierarchy
search returns .................................................................................................. 314
no labels
search returns .................................................................................................. 312
Number of downloader threads
defined .............................................................................................................. 16
Number of lines
defined .............................................................................................................. 11
Number of matching fields
Sort tab .............................................................................................................. 54
Number of matching terms
Sort tab .............................................................................................................. 54
Number of Occurrences
defined .............................................................................................................. 69
Number of Occurrences heading
defined .............................................................................................................. 68
Number of Queries
defined ...................................................................................................70, 71, 72
O
Oldest date
defined .............................................................................................................. 27
operation history
log file ............................................................................................................. 206
Order added to the index
Sort tab .............................................................................................................. 55
Overall heading
defined .............................................................................................................. 40
P
parse_html
defined .............................................................................................................160
document processor ...........................................................................................76
parse_xml
defined .............................................................................................................160
document processor ...........................................................................................76
Password
defined ...............................................................................................................23
password-protected sites
crawl ................................................................................................................203
Paths tab
file crawler .........................................................................................................26
paths to crawl
specify .............................................................................................................209
paths to exclude
file crawler .......................................................................................................211
Paths to Exclude tab
file crawler .........................................................................................................26
Pending heading
defined ...............................................................................................................41
pipeline server
defined .................................................................................................................9
operations ........................................................................................................234
run ....................................................................................................................258
Pipeline Server tab
operations ..........................................................................................................39
Pipeline Stage
stages .................................................................................................................40
Port
defined ...............................................................................................................38
Position Weight
defined ...............................................................................................................55
Previous button
defined ...............................................................................................................67
processes
order .................................................................................................................164
Proximity Weight
defined ...............................................................................................................55
proxy server
defined ...................................................................................9, 35, 157
configure .........................................................................................228
operations ........................................................................................157
run ....................................................................................................229
status ...............................................................................................226
usage ...............................................................................................225
Q
queries ........................................................................................................................ 8
Query
defined .............................................................................................................. 69
Query heading
defined .............................................................................................................. 68
query rate by day
defined ............................................................................................................ 163
query rate by hour
defined ............................................................................................................ 163
query rate by month
defined ............................................................................................................ 163
query rate for all time
defined ............................................................................................................ 163
query rates
query statistics server ...................................................................................... 163
query server
defined .................................................................................................10, 48, 305
usage ............................................................................................................... 305
query statistics
all queries ........................................................................................................ 348
see ............................................................................................................341, 344
this year ........................................................................................................... 346
usage ............................................................................................................... 349
Query Statistics pane
usage ............................................................................................................... 341
query statistics server
defined .............................................................................................................. 10
run ................................................................................................................... 340
usage ............................................................................................................... 339
R
Recrawl interval
defined ...............................................................................................................33
Refresh button
defined ........................................................................................................... 9, 13
Relevancy
Sort tab ..............................................................................................................54
Remove button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 38, 42, 47, 53, 56
Reset to Default button
defined ...............................................................................................................64
Respect robots.txt
defined ...............................................................................................................16
Retrieve button
defined ...............................................................................................................11
Retry delay (seconds)
defined ...............................................................................................................16
Revert button
defined ...............................................................................................................13
indexing server ..................................................................................................45
Right header image
defined ...............................................................................................................64
robots.txt
defined ...............................................................................................................16
S
sample project
set up ........................................................................................................168, 179
SAS Content Categorization Server ................................................................234, 237
install ................................................................................................234, 235, 237
taxonomies ...............................................................................................234, 237
SAS Content Categorization Studio ................................................................235, 237
concepts and categories .................................................................................. 234
SAS Contextual Extraction Studio
concepts and facts ....................................................................................234, 237
SAS Document Conversion
install ............................................................................................................... 234
SAS Sentiment Analysis Workbench
install ........................................................................................................235, 237
SAS Text Miner ..............................................................................................235, 237
install ........................................................................................................235, 237
scope
defined ............................................................................................................ 198
Scope tab
defined .............................................................................................................. 14
search
query web server ............................................................................................. 162
type ...................................................................................................................... 8
search box
customize ........................................................................................................ 310
Search type
defined .............................................................................................................. 52
send
defined ............................................................................................................ 160
document processor .......................................................................................... 76
Sending to the indexer heading
defined .............................................................................................................. 40
Server Port field
defined .............................................................................................................. 50
specify ............................................................................................................. 317
Site
defined .............................................................................................................. 23
Sleep interval (seconds)
defined .............................................................................................................. 16
sort
matches ........................................................................................................... 319
T
Take Snapshot
usage ..................................................................................................................43
Text to highlight
defined ...............................................................................................................12
Theme pane
usage ................................................................................................................329
Theme tab
defined ...............................................................................................................51
This Month button
defined ...............................................................................................................66
This Year button
defined ...............................................................................................................66
Tiebreaker
defined ...............................................................................................................55
Timeout (seconds)
defined ...............................................................................................................16
Title
defined .............................................................................................................. 60
Title field
defined .............................................................................................................. 58
Title source
defined .............................................................................................................. 58
Today button
defined .............................................................................................................. 66
types of files
limit ................................................................................................................. 202
U
URL heading
defined .............................................................................................................. 18
URL Pattern heading
defined .............................................................................................................. 20
Use pop-up menus
defined .............................................................................................................. 61
User agent
defined .............................................................................................................. 33
Username
defined .............................................................................................................. 23
W
web crawler
configure ......................................................................................................... 192
defined .........................................................................................................9, 156
run ................................................................................................................... 205
specify operations ........................................................................................... 193
Web Crawler pane
operations .......................................................................................................... 12
Weight tab
defined .............................................................................................................. 52
X
XML document
extract contents ............................................................................................... 353
Y
Year field
defined ...............................................................................................................67