Persons who design the search window that is used by end users to
query the index.
Prerequisites
Here are the prerequisites for using SAS Information Retrieval Studio:
If you have any questions about whether you are ready to use SAS Information
Retrieval Studio, contact your system administrator.
Conventions
This manual uses the following typographical conventions:
Convention
Description
TGM_ROOT
.xml
Start button
The labels for user interface controls are shown in a bold, sans-serif font.
www.sas.com
The export_to_files document processor now enables you to mark pre-escaped fields for XML documents. Use this processor to create nested XML tags.
Entry point quota control is now available for the web crawler. This
feature enables seed-only crawling.
The default fields ctime, mtime, and atime are included in the Input fields to exclude field for the content categorization document processor. Excluding these fields prevents these timestamps from being processed by SAS Content Categorization Server.
The passwords in the web crawler Credentials pane are now obscured.
How Does SAS Information Retrieval Studio Fit into the SAS Product
Line?
What is a Document?
Architecture
configure the components, and enable your end users to perform facetted
search using labels.
an HTML page
a PDF file
1.7 Architecture
Use the architecture diagram below to gain an overview of the application
processes that you can choose to use in your customized configuration.
Figure 1-1
Miscellaneous Windows
what fields are indexed and how the information in these fields is
handled.
4. Choose the type of search that is available to end users and how
matching and sorting are determined. You can also determine whether
and how labels that facilitate facetted search are made available to end
users.
5. Monitor the status of input queries using the query statistics server.
Find information for the application and the import and export operations
that apply to selected components of the application.
Web Crawler
File Crawler
Feed Crawler
Proxy Server
Start, stop, and configure the proxy server. You can also view and search a log file of output for this server.
Pipeline Server
Start, stop, and configure the pipeline server. Observe the progress of
gathered documents in the Status pane and see the specified log file
output for this server. Use this component to specify the processors that
act on each input document. These processors act on the document using
the specified operation or pass the document to another component.
Indexing Server
Start, stop, and configure the indexing server. Use the indexing server if you plan to perform search operations using SAS Information Retrieval Studio.
Query Server
Start, stop, and configure the query server. The query server passes queries
to the index from the search window and hands the results back to the
query web server.
Query Web Server
Start, stop, and configure the query web server. Specify how matching and
sorting operations are performed. Format the labels, document matches,
and the theme of the search window.
Query Statistics Server
Start and stop the query statistics server. See the most frequent queries
submitted, those search terms that are not matched, and query rates for
specified time periods.
The width of these panels can be adjusted by dragging this icon located
between the two main panes.
Number of lines
(Default: 20) see this maximum number of timestamped lines of text that form the log for the proxy server. Click the up or down arrow to change this number.
Retrieve button
Text to highlight
enter the term that you are seeking to match in the input document.
Find button
Apply
modify the behavior of the web crawler according to the changes that you made in this pane.
Revert
General Settings
specify how the web crawler runs. For more information, see Section
2.4.4.B The General Settings Tab on page 15.
Entry Points
enter the starting URLs. The web crawler starts at these Web addresses
and follows their links to gather documents. For more information, see
Section 2.4.4.C The Entry Points Tab on page 18.
Scope
allow, or exclude, Web addresses from the crawling process. In either case,
specify patterns with regular expressions, or a list of specific URLs. For
more information, see Section 2.4.4.D The Scope Tab on page 20.
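Allow/exclude scope rules based on regular expressions can be sketched briefly. This is an illustrative model, not the product's matching code; the rule set, URL patterns, and precedence (exclude wins over allow) are assumptions for the sketch:

```python
import re

# Hypothetical scope rules: (action, compiled pattern) pairs.
SCOPE_RULES = [
    ("allow", re.compile(r"^https?://www\.example\.com/docs/")),
    ("exclude", re.compile(r".*\.(jpg|gif)$")),
]

def in_scope(url: str) -> bool:
    """Return True if some allow rule matches and no exclude rule does."""
    allowed = False
    for action, pattern in SCOPE_RULES:
        if pattern.match(url):
            if action == "exclude":
                return False  # an exclude match overrides any allow
            allowed = True
    return allowed

print(in_scope("https://www.example.com/docs/intro.html"))  # True
print(in_scope("https://www.example.com/docs/logo.jpg"))    # False
```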
Filename Extensions
use the default list of excluded file types, or define your own list to
include, or exclude, from the crawl. For more information, see Section
2.4.4.E The Filename Extensions Tab on page 21.
Credentials
supply login data for the sites that require it. By providing this data, you enable the pages on these sites to be collected. For more information, see Section 2.4.4.F The Credentials Tab on page 23.
Description
HTTP proxy
Auto-detect button
Access the Select an HTTP Proxy window where you can choose a proxy server. For more information, see Section 2.14.3 The Select an HTTP Proxy Window on page 120.
Description
Quota (files)
(Default: 25) Click the up or down arrow to change the maximum number of files that can be collected by the web crawler.
Quota (megabytes)
(Default: 1000) Click the up or down arrow to change the megabyte limit for the crawler. This is the maximum size of all collected files.
Number of downloader threads
(Default: 1) Click the up or down arrow to change the number of downloader threads that can be created.
Sleep interval (seconds)
(Default: 1) Click the up or down arrow to change the number of seconds that the web crawler pauses between page downloads.
Timeout (seconds)
(Default: 300) Click the up or down arrow to change the number of seconds the web crawler waits before it stops attempting to download a specific page.
Maximum number of retries
(Default: 3) Click the up or down arrow to change the highest number of times the crawler can attempt to download a page before it tries the next one.
Retry delay (seconds)
Respect robots.txt
(Default: Yes) Click the arrow to select No. The robots.txt standard enables Web site authors to request that crawlers (robots) avoid downloading some portions of their site. Select No to ignore this request.
Description
Link traversal order
(Default: Breadth first) Click the arrow to select Depth first. Breadth first means that the top layer of linked pages at one site is gathered before the links to the next layer are followed. Depth first means that the links on the first page are followed as far as they go before the crawler moves on to the second page, and so on.
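The difference between the two traversal orders comes down to how the crawl frontier is consumed. A minimal sketch, using a toy link graph that stands in for a Web site (the page names are made up):

```python
from collections import deque

# Toy link graph: each page maps to the pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(entry: str, order: str = "breadth") -> list:
    """Visit pages breadth first (layer by layer) or depth first
    (exhaust one branch of links before moving to the next)."""
    frontier = deque([entry])
    seen = []
    while frontier:
        # Breadth first uses the frontier as a queue; depth first as a stack.
        page = frontier.popleft() if order == "breadth" else frontier.pop()
        if page in seen:
            continue
        seen.append(page)
        frontier.extend(LINKS[page])
    return seen

print(crawl("home", "breadth"))  # ['home', 'a', 'b', 'a1', 'a2', 'b1']
print(crawl("home", "depth"))    # ['home', 'b', 'b1', 'a', 'a2', 'a1']
```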
URL
see the list of the entry points to the Web for the web crawler. The crawler accesses the URLs in the order in which they are listed in the Entry Points pane.
Hint:
Quota
see the maximum number of files that can be collected for each Web address. When you specify a quota for both a URL and the web crawler, the smaller of the two numbers applies. In other words, if you specify 100 for the web crawler and 35 for the selected URL, only 35 documents can be downloaded for this URL. Likewise, if you specify 100 documents for this URL and 35 for the web crawler, only 35 documents can be downloaded for this URL.
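The rule above reduces to taking the minimum of the two quotas. A one-line sketch (the function name is illustrative):

```python
# The effective per-URL quota is the smaller of the crawler-wide quota
# and the quota set on the entry point itself.
def effective_quota(crawler_quota: int, url_quota: int) -> int:
    return min(crawler_quota, url_quota)

print(effective_quota(100, 35))  # 35 (URL quota is smaller)
print(effective_quota(35, 100))  # 35 (crawler quota is smaller)
```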
Add
access the Add Entry Point window where you can specify a Web address
to begin the crawl. See Section 2.14.4 The Add Entry Point Window on
page 121.
Remove
delete the selected entry point and quota from the address pane.
Edit
access the Edit Entry Point window to make changes to the selected Web
address. See Section 2.14.5 The Edit Entry Point Window on page 125.
URL Pattern
Add
access the Add Scope Rule window. Scope determines the links that the crawler follows, if the URL has the specified prefix. The links in this URL are followed only if no other scope rules exclude this URL. See Section 2.14.8 The Add Scope Rule Window on page 130.
Edit
access the Edit Scope Rule window. See Section 2.14.9 The Edit Scope Rule Window on page 132.
Extension
see the list of file types, identified by their file extensions.
Action
see the status of each file type: whether the extension is excluded or allowed. If the file type is specified as Allow, the crawler can return this type of page. If you enable at least one type of file to be returned, only those files with the Allow operation are returned.
Add
access the Add File Extension window to add an extension that is allowed
or prohibited. See Section 2.14.10 The Add Filename Extension Window
on page 133.
Edit
access the Edit Filename Extension window to make a change to the file extension or the operation. See Section 2.14.11 The Edit Filename Extension Window on page 134.
Site
Password
see the password assigned to each user. The password is the second component of the credentials required for HTTP authentication.
Add
General Settings
specify how the file crawler runs, specify a date range for returned documents, and specify how .xml files are handled. For more information, see Section 2.5.4.B The General Settings Pane on page 26.
Paths
specify the directories that the file crawler accesses. For more information,
see Section 2.5.4.C The Paths Pane on page 28.
Paths to Exclude
exclude certain directories from the crawl. For more information, see
Section 2.5.4.D The Paths to Exclude Pane on page 29.
Filename Extensions
(optional) if you choose to specify the types of files that can be returned,
only these are permitted. Leave this pane empty if you want the file
crawler to return all file types. For more information, see Section 2.4.4.E
The Filename Extensions Tab on page 21.
Oldest date
click
to access the calendar where you can select the first date that the
crawler can use for the files that it returns to a query.
Crawl continuously
By default, XML files are passed by the pipeline server with top-level tags
turned into similar-named fields in the document. If you want to exercise
more control over these fields, set this specification to Yes. This setting
enables you to turn nested tags into fields. When you make this selection,
also specify the parse_xml document processor in the pipeline server.
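The default flattening and the nested handling that parse_xml enables can be sketched with the standard library XML parser. This is an illustrative sketch of the idea, not the product's code; the sample document and field names are made up:

```python
import xml.etree.ElementTree as ET

doc = """<article>
  <title>Example</title>
  <body>Some text<footnote>nested</footnote></body>
</article>"""

root = ET.fromstring(doc)

# Default behavior (sketch): only top-level tags become similarly named
# fields; nested tags are swallowed into their parent's text.
fields = {child.tag: "".join(child.itertext()).strip() for child in root}
print(fields)  # {'title': 'Example', 'body': 'Some textnested'}

# Nested handling (what parse_xml makes possible): inner tags can become
# fields of their own.
nested = {el.tag: (el.text or "").strip() for el in root.iter() if el is not root}
print(nested)  # {'title': 'Example', 'body': 'Some text', 'footnote': 'nested'}
```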
Add
access the Add Path window. See Section 2.14.14 The Add Path Window
on page 137.
Edit
access the Edit Path window to change the text that specifies an address. See Section 2.14.15 The Edit Path Window on page 138.
Add
access the Add Path to Exclude window. See Section 2.14.16 The Add
Path to Exclude Window on page 138.
Edit
access the Edit Path to Exclude window. See Section 2.14.17 The Edit Path to Exclude Window on page 139.
Add button
access the Add Extension window. See Section 2.14.10 The Add Filename
Extension Window on page 133.
Edit
access the Edit Filename Extension window. See Section 2.14.11 The Edit Filename Extension Window on page 134.
General Settings
specify how the feed crawler runs, the server, and other information
necessary to the crawl. For more information, see Section 2.5.4.B The
General Settings Pane on page 26.
Feeds
crawl one, or more, feeds using this pane. For more information, see
Section 2.5.4.B The General Settings Pane on page 26.
HTTP proxy
specify the server that you are accessing here, or click Auto-detect as
explained below.
Auto-detect
access the Select an HTTP Proxy window where you can choose a proxy
server or enter the address for this server. For more information, see
Section 2.14.3 The Select an HTTP Proxy Window on page 120.
Crawl continuously
User agent
(default agent is SAS Feed Crawler) enter the name of a third-party feed
crawler, if you choose.
Feed URL
choose whether the feed crawler should crawl links in the selected feed.
For more information, see Section 2.14.3 The Select an HTTP Proxy
Window on page 120.
Add
access the Add Feed window where you can choose the address for the
feed that you want to crawl. For more information, see Section 2.14.6 The
Add Feed Window on page 126.
Edit
access the Edit Feed window. See Section 2.14.7 The Edit Feed Window on page 129.
Documents received
see the timestamp of the latest document that the proxy server accepted.
Last document processed
see the timestamp of the latest text that the proxy server handed to another
server.
Host
see the name of the pipeline server. By default, when this server is
running, the information for the local machine appears in the
Configuration pane.
Port
see the number of the port where the pipeline server is running.
Status
Add
click to access the Add Backend window. Here you can specify additional pipeline servers. For more information, see Section 2.14.20 The Add Backend Window on page 141.
Remove
Edit
click to access the Edit Backend window. Here you can change the
pipeline server. For more information, see Section 2.14.21 The Edit
Backend Window on page 142.
Apply
click, if the indexing server is running. The indexing server is restarted and the changes take effect for the new index.
Pipeline Stage
see the documents that have completed each of the document processing stages, such as XML parsing.
Document processing
Pending
see the number of documents that are in the queue for each pipeline stage, with the exception of Overall.
Finished
Add
Take Snapshot
click this button and you can see the document in the various pipeline
stages.
Cancel
Document
see the number of this document. Click the document number to see the field names for this document in the Field pane.
Field
see the field names in this document. Click on a field to see the contents of
the selected document in the Document Inspector pane.
Document Inspector pane
see the contents of the chosen field of the selected document at a specific stage.
remove the existing index. A new index can be built with the new
configuration after you restart the crawler.
Revert
click when the indexing server is running and the existing index is deleted.
A new index can be built when new documents are input.
Field Name
add to, or delete from, the list of field names entered by default. The
default list includes title, date, and body.
Functionality
add to, or delete from, the list of uses for these fields. For more
information, see Section 2.14.22 The Add Field Window for the Indexing
Server on page 143.
Add
access the Add Field window where you add fields to the index according
to the purpose that they are intended to serve. For more information, see
Section 2.14.22 The Add Field Window for the Indexing Server on
page 143.
Remove
delete the selected field from the index configuration. Use this button to
change the configuration of the next index that is built. Any changes do
not affect the current index.
Edit
access the Edit Field window where you make changes to the fields that
you added to the index. For more information, see Section 2.14.22 The
Add Field Window for the Indexing Server on page 143.
Language
Click the arrow to select another language.
Server Port
Matching tab
specify the types of searches that the end user can input, the fields the user
can specify, and the weight of each field. For more information, see
Section 2.11.4.B The Matching Tab on page 51.
Sorting tab
specify how the matching documents are ranked and their parameters. For
more information, see Section 2.11.4.C The Sorting Tab on page 53.
Labels tab
specify labels when you choose to enable facetted search. Facetted search
uses a web-like system of related labels to enable users to intuitively
locate the results that they seek. For more information, see Section
2.11.4.D The Labels Tab on page 56.
Match Formatting tab
specify how documents that match the query are displayed in the list of
results. For more information, see Section 2.11.4.E The Match
Formatting Tab on page 58.
Theme tab
specify the look and feel of the query interface. For more information, see
Section 2.11.4.F The Theme Tabs on page 60.
Search type
Click the arrow to select Advanced.
Field Name
(default is Body) see the name of the fields to search with input queries.
Weight
Add
access the Add Field window. For more information, see Section 2.14.23
The Add Field Window for the Query Web Server Matching Pane on
page 146.
Remove
access the Edit Field window. For more information, see Section 2.14.24
The Edit Field Window for the Query Web Server Matching Pane on
page 147.
Sort type
Description
Relevancy
Specify the relative importance of the matching documents according to the metrics that you choose, for example, Cosine Weight (the only metric that is part of relevancy by default) and Freshness Weight.
Number of matching terms
Returns the matching document with the highest number of terms that match those in the query syntax. You can select a tiebreaker to determine a match when two or more documents meet this threshold.
Number of matching fields
Returns the matching document with the highest number of matching fields. You can also choose a tiebreaker in cases where there are two, or more, matching documents.
Date (newest first)
Returns the matching document with the most recent date. The tiebreaker is Order added to the index.
Date (oldest first)
Field value (largest first)
Field value (smallest first)
Field value (alphabetical)
Order added to the index
Select to make the first document input to the index the matching document. This applies when two or more documents meet the match requirements.
Selection
Description
Tiebreaker
Click the arrow to make a different selection; see the relevant Sort Type above. There can be as many as three tiebreaker fields, depending on the selection that you make in the Sort Type field.
Specify the following weights according to your priorities. In other words, if the density weight is more important than any of the other weights, specify the highest weight number for this field.
Cosine Weight
(Default: 1) Click the up or down arrow to change this metric, which weights frequently occurring terms more highly than those that are infrequent. This metric also takes noise words into consideration. (Noise words are the words that appear with enough frequency that they are ranked down.)
Proximity Weight
(Default: 0) Click the up or down arrow to specify how to weight matching query terms that are located close together in the document.
Position Weight
(Default: 0) Click the up or down arrow to change the weight assigned to matches on words located close to the beginning of the document.
Density Weight
(Default: 0) Click the up or down arrow to change the metric that balances the number of matched query terms against the total number of words in the matching document. The number of match instances is measured as a percentage of the document.
Freshness Weight
(Default: 0) Click the up or down arrow to change the number that determines the age of the matching document. This metric combines several factors besides the age of the document.
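These weights combine per-metric scores into one relevancy value. A minimal sketch of the idea, assuming a simple weighted sum; the metric scores below are made-up numbers, and the actual combination inside the product is not documented here:

```python
# Weights as set in the Sorting tab (defaults shown above).
weights = {
    "cosine": 1,     # default 1: the only metric that counts by default
    "proximity": 0,
    "position": 0,
    "density": 0,
    "freshness": 0,
}

def relevancy(metric_scores: dict) -> float:
    """Weighted sum of per-metric scores (an assumed combination rule)."""
    return sum(weights[name] * score for name, score in metric_scores.items())

# Made-up per-document scores in [0, 1] for illustration.
doc_scores = {"cosine": 0.8, "proximity": 0.5, "position": 0.2,
              "density": 0.4, "freshness": 0.9}
print(relevancy(doc_scores))  # 0.8 — only Cosine Weight contributes by default
```

Raising, say, Freshness Weight to 1 would let the freshness score contribute as well.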
Field Name
see the names of the fields that you entered with the Add button.
Caption
see the label that you added with the Add button.
Add
access the Add Field window. For more information, see Section 2.14.25
The Add Field Window: Query Web Server Labels Pane on page 148.
Remove
Edit
access the Edit Field window. For more information, see Section 2.14.26
The Edit Field Window: Query Web Server Labels on page 150.
Move Up
change the location of the selected field. Click to move the selected field
up one level in the display shown on the search results page.
Move Down
change the location of the selected field. Click to move the selected field
down one level in the hierarchical taxonomy.
Maximum number of related labels
leave the default setting 10. You can also specify a new highest number of
labels that can be displayed in response to a query.
Description
Title source
(Default: Text field) Click the arrow to select HTML field or None. Use the title fields in this pane to identify the type of field where the title of the document is located in the input document. Select None if you do not want to use the title fields for search. (When you select None, the Title field disappears.)
Title field
(Default: title) Click the arrow to select another field from the index.
Description
Abstract source
(Default: Concordance) Click the arrow to select Text field or HTML field. Use the abstract source fields to locate the summary of the input document. Use the Concordance selection if you want to enable hit highlighting. Hit highlighting bolds the matched query term in an input document.
Link source
(Default: Text field) Click the arrow to select None, URL, or HTML field. The link fields specify the type of fields that provide a path to the input text.
Link prefix
Link suffix
Add keywords to PDF links
(Default: Yes) Click the arrow to select No. Modify the URLs of PDF files to instruct Adobe Reader to highlight the search terms in an input document. This operation functions like a concordance, but works for the entire document, not only the abstract in the results list.
MIME type source
(Default: None) Click the arrow to select Text field. This field specifies where the name for the document format is located.
Date source
(Default: None) Click the arrow to select Date or Text field. This field is used to locate the source of the document date.
Title
(default is sans-serif) enter a new font type into this field to change the
display look of the title. For example, enter Times New Roman.
Font size
Click the up or down arrow to change the font size.
specify the colors for the user interface. (See www.w3.org for more
information.)
Display 2.42 Colors Pane
Description
Header background color
(Default: Custom)
Link color
(Default: Custom) Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory. You can also click the color box to access the color window and select the color of the links.
Description
Description
load the images that you plan to use for the search window into the work/query-web-server subdirectory of your installation directory.
Component
Description
Today
Click to see the date of the current day in the Year, Month, and Day
fields.
This Month
Click to see the current month in the Month field and the current day
in the Day field. The Year field is unavailable.
This Year
Click to see the current year in the Year field. The Day and Month
fields are inaccessible.
All Time
Click and the Year, Month, Day fields and the Previous and
Next buttons are all inaccessible.
Year field
Click the arrow to select a year. You can select any year back to 1980, or leave the default selection --.
Month field
Click the arrow to select a month.
Day field
Click the arrow to select a day.
Previous
Click to enter the preceding date. For example, if you selected 2010,
2009 appears in the Year field.
Next
Click to select the following date. For example, if you selected 2010,
8, and 20, the next day 21 appears in the Day field.
Query
see a list of the input search terms, ranked from the most to the least frequently submitted.
Number of Occurrences
Query
see a list of the input search terms that are unmatched in the index. This
list is ordered from the highest to the lowest number of entries.
Number of Occurrences
see the total number of times that each query was submitted.
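Ranking queries by their number of occurrences is a simple frequency count. A sketch of the idea (the sample queries are made up):

```python
from collections import Counter

# Count submitted queries and rank them from most to least frequent,
# the way the query statistics server presents them.
submitted = ["sas", "index", "sas", "crawler", "sas", "index"]
ranked = Counter(submitted).most_common()
print(ranked)  # [('sas', 3), ('index', 2), ('crawler', 1)]
```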
Hour
Day
Month
2. Click Add in the Document Processors pane for the pipeline server. The Add Document Processor window appears.
add_field
add a new field with the value that you specify to each input document. This field has one name and one value that is the same for every indexed document. For more information, see Section 2.13.3 The Document Processor: add_field Window on page 77.
content_categorization
default_mime_type_from_url
return the document type that is located in the address fields of input documents. For more information, see Section 2.13.5 The Document Processor: default_mime_type_from_url Window on page 95.
default_title_from_url
return the document title from the Web address of any input
documents. For more information, see Section 2.13.6 The
Document Processor: default_title_from_url Window on page 95.
document_converter
invalidate_duplicates_by_url
stop more than one document with the same Web address from being returned. For more information, see Section 2.13.15 The Document Processor: invalidate_duplicates_by_url Window on page 110.
match_and_copy
parse_html
separate the text from the HTML mark-up tags. For more information, see Section 2.13.18 The Document Processor: parse_html Window on page 112 and strip_html below.
Use this operation when you have an HTML document and you want to extract the body of this document, and possibly the metadata.
parse_xml
separate the text from the XML mark-up tags. For more
information, see Section 2.13.19 The Document Processor:
parse_xml Window on page 114.
send
strip_html
return only the text without the HTML mark-up tags. For more information, see Section 2.13.21 The Document Processor: strip_html Window on page 116 and parse_html above.
Use strip_html when you have a field that contains some HTML code that you want to convert into plain text, for example, when input XML documents contain HTML code.
substitute
3. (Optional) By default, the port number for the specified server is entered. Click the up or down arrow to select a different port number.
window lists the projects and their types. The categories, concepts, and
facts that are applied by the pipeline server are limited to those that are
specified in the projects that you specify.
appears.
true if categories are part of the taxonomy in one of the SAS Content Categorization Studio projects that you uploaded to SAS Content Categorization Server. Alternately, Concept extraction or Contextual extraction is selected. Click the arrow to make a different selection.
content_categorization window.
6. (Optional) Repeat Step 2. through Step 5. above until you have added all
of your projects.
7. Click Next to save your changes.
2. (Optional) By default, Input Fields is blank. Enter any field names that
you want to search for matches for your categories, concepts, and facts.
If you leave this field blank, all of the fields are searched with the
exception of any fields entered into Input fields to exclude.
3. (Optional) By default, Input Fields to exclude contains metadata
fields. Enter any field names that you want to exclude from the search
for your categories, concepts, and facts.
Hint:
To specify that all of the categories in the uploaded projects can be used by
SAS Content Categorization Server, complete these steps:
1. Click Categories to access the Categories pane.
enter a new caption name to change the label for facetted search.
You can enter a new format that might include %% for a literal percent
sign. You can also use x as a modifier to request XML escaping. For
example, enter %xc.
5. (Optional) Enter a regular expression into the Category name pattern
field.
6. (Optional) Enter a string into the Category name replacement field.
This string is a constant value that replaces all of the category names.
7. (Optional) By default, ; (semicolon) appears in Separator. Enter a new separator if you choose.
9. Click Finish.
the concepts and contextual extraction concepts. The concepts that are
appears. Use this pane to specify the settings for each individual
concept. These settings override the settings specified for all of the
concepts in the Concepts tab.
3. Click
enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246.
6. (Optional) By default, % is entered into the Format field for the concept name. You can use any of the following symbols:
Description
%c
%p
%m
%i
Match the information associated with the entity, or the match text
if no information is available.
%I
%%
If you want to output nested XML tags, specify the format for these
tags such as <body>%xi</body>. For more information, see Section
2.13.9 The Document Processor: export_to_files Window on
page 100.
7. (Optional) By default, ; (semicolon) appears in the Default separator field.
reiteratively, until you have added all of the concepts that you want to
deploy in SAS Information Retrieval Studio.
can enter a new caption name to change the label for facetted search. For
more information, see Section 9.5 Match Categories, Concepts, and
Facts on page 246 and Chapter 10.
12. (Optional) By default, %c: %i is entered into the Default format field
for the concept name. You can also use any of the following symbols:
Table 2-8: Default Format Symbols
Symbol
Description
%c
%p
%m
%i
Match the information associated with the entity, or the match text
if no information is available.
%%
2. Click Add to specify a fact with its field name and caption for facetted
search.
3. Click the arrow to select a fact.
4. (Optional) When you select a fact using Step 3. above, the Field name is filled in. For example, see Side Effect. Enter a new name if you choose.
6. (Optional) By default, the format for the matched fact is entered into the Format field.
In this example, drug and sideffect are the returned arguments for the fact SIDE_EFFECT if these arguments are matched. The match strings for the arguments are %v{drug} and %v{sideeffect}. If there is more than one PREDICATE or SEQUENCE rule in the definition with these same arguments, this line is specified for all rules. If there are any other types of rules in the definition, this fact also appears as a concept when you select the Concepts tab.
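The default formats shown in these steps (%f(%a) for a fact, %n: %v for each argument, %v{name} for a named argument's match string) amount to a small template substitution. A sketch of that idea, in which the symbol meanings are inferred from the defaults above and are assumptions, not the product's documented definitions:

```python
import re

def format_fact(fact, args, fact_fmt="%f(%a)", arg_fmt="%n: %v", sep="; "):
    """Expand a fact format string. Assumed meanings: %f = fact name,
    %a = formatted argument list, %n/%v = each argument's name and value,
    %v{name} = value of one named argument, %% = literal percent sign."""
    arg_list = sep.join(
        arg_fmt.replace("%n", n).replace("%v", v) for n, v in args.items()
    )
    out = fact_fmt.replace("%f", fact).replace("%a", arg_list)
    out = re.sub(r"%v\{(\w+)\}", lambda m: args.get(m.group(1), ""), out)
    return out.replace("%%", "%")

print(format_fact("SIDE_EFFECT", {"drug": "aspirin", "sideeffect": "nausea"}))
# SIDE_EFFECT(drug: aspirin; sideeffect: nausea)
```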
7. (Optional) By default, %n: %v is entered into the Argument format
field for the argument format. You can also use any of the following
symbols:
Table 2-9: Default Argument Format Symbols
Symbol
Description
%f
%a
%v{name}
%m
%s
%%
Argument List:
%n
10. (Optional) Click Copy Defaults to insert the entries from the main Facts tab into all of the fields with the exception of the Fact field.
12. (Optional) Use Step 2. on page 90 through Step 11. on page 92,
reiteratively, until you have added all of the facts that you want to apply
in SAS Information Retrieval Studio.
13. (Optional) By default, facts is entered into the Default field name field. You can enter a new caption name to change the label for facetted search. For more information about facetted search, see Chapter 10.
15. (Optional) By default, %f(%a) is entered into the Default format field
for the concept name. You can edit this entry using any of the symbols
in Table 2-9 on page 91 with the exception of %v{name}.
Note:
20. (Optional) By default, the highest number of facts that can be matched is entered. Click the up or down arrow to change this number.
into mime-type-field.
3. Leave the default specification, id, or enter the new field name into url-field.
95
2. Leave the default specification, title, or enter a new field that specifies
what field is searched to locate the title of the input document. If the
document has no value for the field, the value of the URL field is used.
3. Leave the default specification, id, or enter the field where the
naming the server and its port in the server field. (The server name and
port number are separated by a colon [:]).
3. Leave the default specification, mimetype, or enter a new field where
document processor gets the content in the body and title fields.
6. Leave the default specification, body, or enter a new location for the
instead of fields when new files are created. Export these files with escaped, or
nonescaped characters, to be used in SAS Text Miner, Base SAS, Microsoft
Excel, and so on.
To use the Document Processor: export_csv window, complete these steps:
1. Select export_csv in the Add Document Processor window and click
Next.
entry in the filename field. If you add %s to the entry in this field, the
file is timestamped.
3. Leave the default entry 1, if you want to append to an existing file with
the specified name in the append field. This operation takes place when
the pipeline is restarted. Enter 0 to overwrite an existing file when the
pipeline is restarted.
Notes: This field, like the following four fields, controls when a new file is created.
enter 1 if you appended %s to the entry in the filename field. A new file is created after the pipeline server is idle for the specified number of seconds.
7. Leave the default entry 0 in the new-file-each-hour field. Enter 1 if
you appended %s to the entry in the filename field and a new file is
created every hour.
8. Leave the default entry 0 in the new-file-each-day field. Enter 1 if
you appended %s to the entry in the filename field and a new file is
created every day.
9. Leave the default entries id, title, and body as a comma-separated list
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
you want to remove any new lines in the document. This operation
makes it easier to parse the document text. Alternatively, enter 1 to keep
these lines in the document as it is parsed.
12. Leave the default comma (,) that is entered into the delimiter field.
You can enter another character that is used to delimit the fields in the
output file.
13. Leave the default 1 setting in the excel-quoting field. Set this number to 0 to change how values are quoted in the output file.
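The delimiter and quoting settings above determine how the exported file is read back by other tools. A minimal sketch of reading such a file with Python's csv module (the sample text and field names are illustrative, not actual export_csv output):

```python
import csv
import io

# A made-up sample of the kind of file export_csv might produce: the field
# names form the header row and the default comma is the delimiter.
sample = 'id,title,body\n"doc-1","A Title","Body text, with a comma"\n'

# Excel-style quoting wraps values in double quotes so that embedded
# delimiters survive; Python's "excel" dialect reads that convention back.
rows = list(csv.reader(io.StringIO(sample), dialect="excel", delimiter=","))

header, first = rows[0], rows[1]
print(header)    # ['id', 'title', 'body']
print(first[2])  # Body text, with a comma
```

If you change the delimiter field, pass the same character to the reader's `delimiter` argument so the columns line up.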
If you leave fields blank, all of the document fields appear in the output
file.
5. Leave the default selections raw and mimetype in fields-to-exclude.
You can also specify different field names. The text in these fields does
not appear in the output file.
6. (Optional) When you want to output nested XML tags, enter the name of the field whose value contains the XML syntax into xml-preescaped-fields. This field name is listed in the Field name of the Document Processor: content_categorization window. For example, organization. For more information, see Section 2.13.4.F Specify Concepts on page 84.
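The distinction behind pre-escaped fields is that an exporter normally escapes field values so that markup characters cannot break the output XML, whereas a pre-escaped field is trusted to already contain valid XML and is copied through verbatim. A small sketch of that distinction (the function and field names are illustrative, not part of the product):

```python
from xml.sax.saxutils import escape

def render_field(name, value, preescaped=False):
    # Pre-escaped fields are trusted to hold valid XML (e.g. nested tags)
    # and are emitted as-is; everything else is escaped first.
    text = value if preescaped else escape(value)
    return "<{0}>{1}</{0}>".format(name, text)

# An ordinary field: markup characters are escaped.
print(render_field("title", "Profit & Loss <2012>"))
# <title>Profit &amp; Loss &lt;2012&gt;</title>

# A pre-escaped field (like one listed in xml-preescaped-fields):
# its nested tags pass through intact.
print(render_field("organization", "<org><name>SAS</name></org>", preescaped=True))
# <organization><org><name>SAS</name></org></organization>
```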
7. Leave the default specification, article, if you specified XML for the output format. If you are using text as the output format, enter a different document tag type into the document-tag field.
8. Leave the default utf-8 encoding specification for input files in the encoding field.
9. Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
2. (Optional) Enter the name of the ODBC driver into the connection-string field.
Note: Consult your database documentation for details before you use this step and Step 6. below.
3. Enter the name of the database table into the table field.
4. Use the table-init field to specify the operation that is performed if the
6. Specify whether to add new rows with a merge operation. If you use the merge operation, specify the name of a column. (Also see the note above.)
7. Leave the default specification 0 in the invalidate field if you want to
stop the input files at this point in the pipeline. Alternatively, enter 1 to
enable further document processing in the pipeline.
8. Leave the default setting of 1024 in the max-length field. You can also change this maximum length.
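The merge behavior described above, keying on a named column and truncating values to max-length, can be sketched with SQLite standing in for the ODBC driver. This is not the product's implementation; the table and column names are illustrative:

```python
import sqlite3

# Stand-in for the export_to_odbc merge operation, using SQLite instead of
# an ODBC driver: when a row with the same key column already exists, the
# merge replaces it rather than adding a duplicate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT, body TEXT)")

def merge_row(conn, doc, max_length=1024):
    # Truncate long values, mirroring the max-length setting.
    row = (doc["id"], doc["title"][:max_length], doc["body"][:max_length])
    # INSERT OR REPLACE keyed on the merge column (id here).
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", row)

merge_row(conn, {"id": "d1", "title": "old", "body": "x"})
merge_row(conn, {"id": "d1", "title": "new", "body": "y"})

count, title = conn.execute("SELECT COUNT(*), MAX(title) FROM docs").fetchone()
print(count, title)  # 1 new
```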
2. Leave the default setting, localhost, or enter a new server name into the host field.
4. Enter the name of the SAS Sentiment Analysis Workbench project into
project-name.
8. Specify the field that contains the time that the document was created. You can also specify another field for this entry. In either case, the format of the contents of this field is specified in the createtime-format field below.
9. Specify the format of the matching createtime data in the createtime-format field. For example, %m/%d/%Y %I:%M:%S %p for SAS Sentiment Analysis Workbench, or %Y%m%d for SAS Search and Indexing.
10. Leave the default specification, title, or enter the field where the name of the document is located.
11. Enter the field where the name of the person who wrote the document is located into author-field.
12. Leave the default specification, geolocation, or enter the field that contains the location information.
Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to enable further document processing in the pipeline.
generate the abstract. (This location typically contains summary information for
a technical or scientific document.)
The abstract functions like the concordance if the document is sent to the search engine. However, the abstract is static and therefore independent of any query, whereas the concordance is query-specific. For this reason, the concordance is available only when a search operation is performed.
To use the Document Processor: extract_abstract window, complete these
steps:
1. Select extract_abstract in the Add Document Processor window and click Next.
2. Leave the default specification, body, or enter a new source for the <body> format tag where the document summary can be located into abstract-field.
2. Leave the default specification, date, or enter a new source for the
document date into date-field. This date is converted into the pdate for
the search operation.
3. (Optional) Enter the strptime format of the date field in the input documents.
2. Leave the default specification, raw, or enter a new field into input-field. This document processor gets the content for the title and body fields.
3. Leave the default specification, title, or enter a new title field into
title-output-field.
4. Leave the default specification, body, or enter a new body field into
body-output-field.
5. The entry 1 in the require-mime-type field specifies that matching documents must have the specified MIME type.
7. The entry 1 in the base64-input field specifies that the text is encoded in base64.
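When base64-input is set to 1, the document text arrives base64-encoded and must be decoded before processing. The round trip in miniature (the sample string is illustrative):

```python
import base64

# Encode some text as a sender with base64 output would, then decode it
# as a processor with base64-input set to 1 would.
encoded = base64.b64encode("Some document text".encode("utf-8"))
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # Some document text
```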
2. Leave the default specification, id, or enter a new field where the URL is located.
4. Enter the name of the field where the output is placed into output-field.
5. (Optional) If the format field contains a value, the value controls how the output is formatted.
6. (Optional) Specify whether values are added to the end of an existing value for the output field, or these values replace an existing value.
7. (Optional) By default, the semicolon character (;) is entered into the separator field. You can enter a different character, or a string, if the output-field is used as a label or sent to the index.
8. Click Finish.
2. Enter the name of the field that you want to change into oldname.
3. Enter the name of the new field into newname.
4. Click Finish.
2. Leave the default specification raw, or enter a new location where this document processor finds its input.
4. Leave the default specification body, or enter a new body field into body-output-field. The plain text of the body field is output to this field. If you leave this field empty, no body text is output.
5. (Optional) Specify whether additional information in other fields is output. (The text in the body and title fields is always output.) For example, description and keywords might be output. The fields that are used for output depend on the meta field types that appear in the HTML documents.
6. The entry 1 in the require-mime-type field specifies that the document must have the MIME type that is specified in the mimetype field.
8. The entry 1 in the base64-input field specifies that the text is encoded in base64.
5. (Optional) Enter the name of the tag in input XML documents that
contains the string that is the identifier for output documents into the
copy-url-from-field.
6. (Optional) Enter the name of the tag in output documents that contains the document identifier.
4. Leave the default entry id in id-field, or enter the name of a new field that contains the document identifier.
Leave the default specification 0 in the invalidate field if you want to stop the input files at this point in the pipeline. Alternatively, enter 1 to send each instance of a document to another instance of the pipeline.
2. Leave the body field, or add new fields that are separated by commas
into Fields. These are the fields where the HTML tags are stripped in
order to return the text that they contain.
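Stripping HTML tags from a field, as described above, can be sketched with Python's standard HTML parser. This is an illustration of the idea, not the product's implementation:

```python
from html.parser import HTMLParser

# A minimal tag stripper in the spirit of the processor described above:
# it discards the mark-up and keeps only the text content of a field.
class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.chunks)

print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```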
2. Enter the name of the first regular expression field to be located into
Field.
3. Specify the pattern of the regular expression into the Pattern field.
4. Enter the replacement for the first regular expression field into
replacement.
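The Field, Pattern, and replacement settings above amount to applying a regular-expression substitution to one field of a document. A sketch in Python (the field value and pattern are illustrative):

```python
import re

# Apply a regular-expression substitution to one field of a document,
# mirroring the Field / Pattern / replacement settings described above.
doc = {"title": "Q1  report   draft"}

pattern = r"\s+"   # example pattern: collapse runs of whitespace
replacement = " "

doc["title"] = re.sub(pattern, replacement, doc["title"])
print(doc["title"])  # Q1 report draft
```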
2. Enter the name of the file that you want to import into the Filename field.
3. (Optional) Deselect the components that you do not want to modify with the imported file in the Components section. For example, deselect Feed crawler, Indexing server, and Query web server.
4. Click OK to save these settings.
2. Enter the name of the file that you want to export into the Filename
field.
3. Click OK to save these settings.
4. Enter a Web address into the URL field. For example, enter
www.sas.com.
5. (Optional) Leave the default selection Yes in the Add to scope field, or select No. Unless there are scope rules, the crawler
follows all links found on the entry point page, the links found on those
pages, and so on. Scope rules limit the links that the crawler follows.
Use this feature to constrain the crawl to a single site, or section of the
site. In other words, the scope rule follows the way that many Web
pages are laid out. When you leave Yes selected, the URL is
automatically added to the Scope tab in the web crawler Configuration
pane. For more information, see Section 2.4.4.D The Scope Tab on
page 20.
6. (Optional) Reset the number in the Quota field. For example, specify 90000. When you specify a quota for the links from the entry point, the overall quota for the crawler, or this number, applies.
1. Click Edit in the Entry Points tab of the Configuration pane for the web crawler. The Edit Entry Point window appears.
2. Enter your changes into the URL field. For example, enter http://
.*\.sas\.com.
3. (Optional) Reset the number in the Quota field.
2. Click the orange box located to the left of the feed that you want. For example, Press Releases.
4. Copy the feed URL from the URL field in the browser. For example,
copy http://www.sas.com/news/preleases/SASRecentPress.xml.
After you copy the URL for an RSS feed using the steps above, complete these
steps:
1. Select Feed Crawler --> General Settings --> Feeds. The Feeds
pane appears.
3. Paste the copied RSS feed URL into the Feed URL field.
2. Place your cursor into the Feed URL field and make any necessary
changes.
3. Make any necessary changes in the Follow links field.
4. Leave the default selection Allow in the Action field unless you want to exclude URLs that match this pattern from the crawl. In this case, select Exclude.
5. Click OK and this address appears under the URL Pattern heading.
2. Use Step 2. through Step 5. in Section 2.14.8 The Add Scope Rule Window as necessary.
To access and use the Add Filename Extension window, complete these steps:
1. Click Add in the Filename Extensions tab of the Configuration pane
for the web crawler. The Add Filename Extension window appears.
2. Enter the file extension that you want to return or to exclude in the Extension field.
3. Click OK and this entry appears in the Filename Extensions pane.
To access and use the Edit Filename Extension window, complete these steps:
1. Click Edit in the Filename Extensions tab of the Configuration pane
for the web crawler. The Edit Filename Extension window appears.
2. Use Step 2. through Step 4. in Section 2.14.10 The Add Filename Extension Window as necessary.
2. Enter the address for a Web site that requires credentials into the Site
field.
3. Enter the name of the user into the Username field.
4. Enter the password for this user into the Password field.
5. Click OK and these entries appear in the Credentials pane.
2. Use Step 2. through Step 5. in Section 2.14.12 The Add Credential Window as necessary.
2. Enter a file or directory name into the Path field.
2. Use Step 2. through Step 3. in Section 2.14.14 The Add Path Window as necessary.
2. Enter a path to the files that the file crawler should not access in the Path field.
2. Enter a string into the Extension field. For example, enter html.
3. Click OK and this entry appears in the Filename Extensions pane.
2. Enter a new string into the Host field. For example, enter newhost.
3. Change the number in the Port field. For example, specify 9008.
2. Enter the field name into the Name field. You can specify any field.
3. Select the type of usage for this field:
Searching
(Default) Search for words that match the input query terms in this
field. This choice is equivalent to the standard functionality.
Label
(Default) Select this type of usage for facetted search. For more
information about facetted search labels, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.
Display and Sorting
display the matching URLs according to the sorting type that you
select. Sort the results alphabetically, or numerically, instead of by
relevancy. This selection corresponds to marking the field as info.
Identification
choose this field to identify the field that contains the individual
identification number for each document. Each field in the index
requires a unique identifier. If a new document is added that has the
same identification number as an old document, the new document
replaces the old document.
Custom
Standard
Boolean
The only fields that are available are those added to the
index with search functionality.
To access and use the Add Field window, complete these steps:
1. Click Add in the Matching pane of the Query Web Server and the Add Field window appears.
2. Reset the weight assigned to this field. The weight value sets a number that is relative to the other matching fields and is used to prioritize matches.
2. Use Step 2. through Step 3. in Section 2.14.23 The Add Field Window
for the Query Web Server Matching Pane on page 146 as necessary.
2. Any field in the index that has label functionality is available in the Name field.
3. Leave the default selection No, or select Yes or Flattened. No displays the list view, Yes displays the tree view, and Flattened displays the tree in a list format.
5. Select Yes to see the number of matching values for each label field.
6. Leave the Show in matches value 0. You can reset the number of labels found in each individual matching document that are displayed in the results list.
3. Use Step 2. through Step 7. in Section 2.14.25 The Add Field Window: Query Web Server Labels Pane on page 148 as necessary.
6. Click Custom Colors to access the expanded version of the color box window.
7. See the color range for the selected color in the large pane.
8. Slide the marker to select a color in this range.
9. Change the numbers in the Red, Green, and Blue fields to specify an exact color.
Choosing a Crawler
Web crawler
crawls the Web, according to the parameters that you set. These parameters define the types of documents and information that you seek, and they also limit the scope of the crawl. The scope, or breadth and depth of the crawl, prevents the crawler from attempting to return every document that appears on the Internet.
When you limit the scope of the Web crawl, you optimize the crawl and
minimize the time that it takes to return this data. You can also specify the
credentials that are necessary to access password-protected sites.
File crawler
Feed crawler
use the feed crawler when you want to obtain blog posts, user forum
pages, and other trending data such as press releases.
content_categorization
match category rule terms that appear in one, or more, fields of an input document. These rules are found in the categories project running on SAS Content Categorization Server.
concept_extractor
extract any matching concepts from an input document. These terms are
located in the specified concepts project running on SAS Content
Categorization Server.
contextual_extractor
determine the type of the original, input document based on the filename
extension found in its Web address.
default_title_from_url
change the format of incoming files, such as Adobe PDF and Microsoft
Office documents into text using the SAS Document Conversion
application.
extract_abstract
separate the body of an HTML document from its tags using an operation
that provides an algorithm to obtain the optimal result. This operation
searches for paragraphs of text without many tags and extracts these
bodies of text.
invalidate_duplicates_by_url
prevent the collection of multiple copies of the same document from being
returned.
modify_field_name
parse_html
separate the text from the HTML mark-up tags in an input HTML
document.
parse_xml
separate the text from the XML mark-up tags in an input XML document.
send
remove the mark-up tags and return only the text from an HTML-formatted field in a document.
substitute
save each document to a separate file whose name is based on a hash of its
contents.
export_to_odbc
export_to_sentiment_analysis_workbench
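The hash-based naming scheme described above, where each document is saved to a file named after a hash of its contents, can be sketched as follows. The choice of SHA-1 and the file extension are illustrative assumptions, not the product's actual algorithm:

```python
import hashlib

# Name each output file after a hash of the document's contents, so that
# identical content always maps to the same filename.
def filename_for(content, extension=".txt"):
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return digest + extension

a = filename_for("same text")
b = filename_for("same text")
c = filename_for("different text")
print(a == b, a == c)  # True False
```

A side effect of this scheme is natural deduplication: writing the same document twice overwrites a single file rather than creating two.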
You can design a Web page that enables users to input queries and to obtain
search results. However, you can also use the query server with an application
that does not require an interface to search the index. In this case, you can
write a custom program to provide a connection between the query server and
your application.
select the way that results are displayed in the custom user interface that
you design. You can also specify themes and colors.
see a list of the most frequent query terms and the number of occurrences
for each term.
see a list of the most frequent query terms that did not locate results in the
index. You can also see the number of occurrences for each term.
Query rate by hour
see the number of queries since you installed the SAS Information
Retrieval Studio application.
Sample Configurations
-
Make sure that your document processors are listed in the order of logical operations:
a. Normalize input text. For example, place parse_html before the document processors that use its output.
b. Send only the processed and normalized text that can be used by an index (by default, if you install SAS Search and Indexing, the documents are indexed) or by other applications. These applications include SAS Text Miner and SAS Sentiment Analysis Workbench.
Click Apply Changes before you leave the tab for any component
where you make changes.
Delete the index if you want all of the gathered documents to be indexed
according to the changed settings. If you do not delete the index, the
documents that were indexed according to the old settings remain in the
index. The documents that are added after you save your changes to the
new index are indexed according to the new settings.
If necessary, stop and restart the web, file, or feed crawler that is
running. If you delete the index, stop and restart the web, file, or feed
crawler that you chose to build the index.
Hint:
If you decide to check the results of your index by entering query terms
using the search window, consider the path and scope of your crawl. In
other words, if you limit your crawl to SAS documents, do not expect to
enter medical terms and locate matches in these documents.
To set up a simple project that crawls the Web, builds an index, and configures
the query server, complete these steps:
1. Select Web Crawler --> Configuration --> General Settings.
Use this window to select the proxy server that is located between the
web crawler and the Internet.
4. Click OK and the selected server appears in the HTTP proxy field of the General Settings pane.
5. (Optional) Complete these steps:
a. Increase the number of files that the web crawler can collect from the default of 25.
b. Increase the total size of the files that can be collected to 3000 megabytes.
c. Increase the number of downloader threads so that the web crawler can access more files quickly. However, too many threads can overwhelm the site that the web crawler is crawling.
6. Click OK and the server appears in the General Settings pane.
7. Select Configuration --> Entry Points to specify the Web site where
9. Enter the Web address that the web crawler uses to enter the Internet.
10. (Optional) Specify the scope of the crawl. When you specify at least one permitted site, every other site is excluded. For more information, see Section 5.2.4 Specify the Scope of the Crawl on page 198.
11. (Optional) Change the limit on the number of files downloaded from each entry point.
12. (Optional) To specify credentials for password-protected sites, click the Credentials tab. For more information, see Section 5.2.6 Specify Access Information for Password-Protected Sites on page 203.
15. Click Apply Changes to save the new web crawler configuration.
17. Click Add and the Add Document Processor window appears.
20. Leave the default settings or make changes. For more information about
21. Click Finish and the document processor that you select appears in the Document Processors tab.
22. Use Step 17. through Step 21. above, reiteratively, until you have added all of the document processors that you require. For example, if you want to add labels to enable facetted search, use the content_categorization document processor. For more information, see
Section 2.13.4 The Document Processor: content_categorization
Wizard on page 78.
In this example, both categories and concepts are added to enable
facetted search.
23. Select a document processor and click Edit to make a change to its settings.
26. (Optional) Leave the default settings or click Edit to make changes to the fields. Use this pane to set the priorities for field matches. Weights are a relative setting. The priority value that you specify for each field is determined only in relationship to other matched fields in a document.
29. (Optional) To add a field and specify its weight, click Add, and the Add Field window appears.
32. Click OK and this field and weight appear in the Matching pane.
33. Select the Labels pane to see all of the selected categories, concepts,
and facts.
34. (Optional) Select a field and click Edit to make changes to this field.
For example, use the Edit Field window that appears to change the
caption, or label, name. For more information, see Section 2.14.25 The
Add Field Window: Query Web Server Labels Pane on page 148.
35. (Optional) Change the default setting 10 in the Maximum field. This is the highest number of related labels that can be displayed in response to a query. The end user sees these labels after entering a query into the SAS Information Retrieval Studio search window.
39. Click the blue hyperlink and the search window appears.
40. Enter a query into the search field in the SAS Information Retrieval Studio search window.
41. Click Search and see the labels that match the returned documents on
the left side of the search window. On the right side see the matching
documents with links to the full text for each document.
42. To see the statistics for queries, click the Query Statistics Server tab.
For more information, see Section 14.3 View the Query Statistics for a
Selected Time Period on page 341.
To set up a simple project that crawls the files on your machine and exports
these files, complete the following steps:
1. Select File Crawler --> Configuration --> Paths.
2. Click Add to add one, or more, paths to the Paths pane. The Add Path
window appears.
This document processor extracts text from input document formats such as Microsoft Office and Adobe PDF files. It is relevant for the file crawler, but it can also be used with the web crawler after the parse_html document processor is used.
12. (Optional) Change any of the settings in this window. For more
13. Click Finish and the selected document processor appears in the Document Processors tab.
17. (Optional) Make any other changes to the fields in the Document Processors tab, reiteratively, until you have added all of the document processors required. For example, if you want to add labels to enable facetted search, see Section 2.13.4 The Document Processor: content_categorization Wizard on page 78.
When you set up the feed crawler, you can choose to return either summaries or full-length texts. If the feed collects summaries, you can enable the feed crawler to follow the links contained in the summaries to the full text of each article. Enable this capability using the steps below.
To set up the feed crawler, complete the following steps:
3. Access your Web browser and locate the Web page with the orange feed box.
4. Click the orange box located to the left of the feed that you want. For example, Media Coverage.
6. Copy the feed URL from the URL field in the browser. Paste this URL into the Feed URL field in the Add Feed window. For example, copy http://www.sas.com/news/mediacoverage/SASRecentMediaCoverage.xml into the Feed URL field.
10. Click Add and the Add Document Processor window appears.
11. Select parse_html. In this example, summaries are collected and the
feed crawler is instructed to follow links to the HTML pages that are
linked to each summary. (See Step 7. on page 187 where Yes is selected
in the Follow Links drop-down menu.)
12. Click Next and the Add Document Processor: parse_html window appears.
Use each of the following sections to configure your web crawler with one
exception. The Credentials information is necessary only when you choose to
crawl password-protected sites.
After you make all of your changes, click Apply Changes in the Web Crawler pane. If the web crawler is running when you click this button, the Restart Web Crawler window appears.
Display 5.2 Restart Web Crawler Window
Click Yes.
If the web crawler is not running, click Start.
3. Change the default setting of 25 in the Quota (files) field. This is the maximum number of files collected by the web crawler.
4. Change the default megabyte limit of 1000 in the Quota (megabytes) field. This is the maximum size for all of the collected documents.
5. Change the total number of threads that can be created
created in the Number of downloader threads field. For example,
change this setting to 16. (The default setting is 1.) The more threads
you specify, the faster the download process becomes. However, a
higher number of downloaded files can also overwhelm a site and shut
it down.
6. Change the number of seconds that the web
crawler rests between page downloads in the Sleep interval field.
(The default setting is 1.) This setting enables the web crawler to be
polite. In other words, a single thread does not overwhelm a site with
download requests. This is not true if you use this setting but have many threads. For example, 100 threads operating on 5-second sleep
intervals could potentially launch 100 requests simultaneously to a
site.
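The arithmetic behind that warning can be made explicit: each thread issues roughly one request per sleep interval, so the aggregate load scales with the thread count even when each thread is individually polite. A small sketch (the function name is illustrative):

```python
# Average aggregate request rate across all downloader threads, assuming
# each thread issues about one request per sleep interval and downloads
# are fast relative to the interval. Bursts can be worse if the threads'
# intervals happen to align.
def requests_per_second(threads, sleep_seconds):
    return threads / sleep_seconds

print(requests_per_second(1, 1))    # 1.0  - one polite thread
print(requests_per_second(100, 5))  # 20.0 - 100 threads on 5-second sleeps
```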
7. Change the number of seconds before the web
crawler stops trying to download a page in the Timeout field. (The
default setting is 300.)
8. Change the number of times that the web
crawler tries to download a page before it stops in the Maximum
number of retries field. (The default setting is 3.)
9. Change the highest number of seconds that the
web crawler waits before it tries to download a page again in the
Retry delay field. (The default setting is 300.)
12. Select the crawl mode: breadth-first or depth-first.
In the breadth-first mode, the crawler searches all of the links in the
point-of-entry page. The crawler then searches all of the links in the
first layer of child pages. The crawler repeats this process for the
second layer of child pages, and so on, until it has crawled all of the
links related to the point-of-entry page. This is a first in, first out
operation.
In depth-first mode, the crawler follows one set of links through all of
its children. The crawler then backtracks to the next child page and
crawls the links of its children, and so on. This process is repeated
reiteratively until all of the links in a page are crawled. This operation
drills deep and then backtracks, reiteratively.
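The two crawl modes described above differ only in which page the crawler takes from its frontier next: breadth-first takes the oldest entry (first in, first out), depth-first the newest. A sketch over a toy link graph (the graph itself is made up for illustration):

```python
from collections import deque

# A toy link graph: each page maps to the links it contains.
links = {
    "entry": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(start, depth_first=False):
    order, frontier, seen = [], deque([start]), {start}
    while frontier:
        # Depth-first pops the most recently added page (LIFO);
        # breadth-first pops the oldest (FIFO).
        page = frontier.pop() if depth_first else frontier.popleft()
        order.append(page)
        for link in links[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("entry"))                    # ['entry', 'a', 'b', 'a1', 'a2', 'b1']
print(crawl("entry", depth_first=True))  # ['entry', 'b', 'b1', 'a', 'a2', 'a1']
```

Breadth-first visits every child layer by layer; depth-first drills down one branch and then backtracks, matching the descriptions above.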
3. Enter the Web address for the first site into the URL field.
4. (Optional) To add this address to the Scope pane as an allowed site for
the crawl, leave the default selection Yes in the Add to scope rules
field. If you do not want to add this address to the Scope pane, select No.
Click OK and this address appears in the Entry Points pane.
7. Use this process reiteratively until you have added all of your URLs to the Entry Points pane.
Select Regular Expression if you do not want a prefix match. This setting tells the web crawler how to use the characters entered in the URL Pattern field. A prefix match is one that matches against the beginning of the URL. Select Exclude if you do not want the crawler to download pages from this URL.
9. Click OK and this URL appears in the Add Scope pane.
10. Click OK and see the complete list of included and excluded URLs. In
this example, the web crawler searches only the publications pages of
the SAS Web site. It does not search the pages that list recent books.
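The two scope-rule modes amount to two different URL tests: a prefix rule matches the beginning of the URL literally, while a regular-expression rule uses pattern syntax. A sketch in Python, reusing the example pattern from this chapter (the function name is illustrative):

```python
import re

# A prefix rule matches the beginning of the URL literally;
# a regular-expression rule applies the pattern from the start of the URL.
def matches_scope(url, pattern, regex=False):
    if regex:
        return re.match(pattern, url) is not None
    return url.startswith(pattern)

# Prefix match: only URLs under this exact path are in scope.
print(matches_scope("http://www.sas.com/publications/x.html",
                    "http://www.sas.com/publications"))      # True

# Regular-expression match: any host ending in .sas.com is in scope.
print(matches_scope("http://support.sas.com/doc.html",
                    r"http://.*\.sas\.com", regex=True))      # True
```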
2. Click the Add button to access the Add Filename Extension window.
3. Enter the extension of the file type that you want to exclude into the Extension field.
4. Click OK.
3. Enter the URL followed by a colon (:) and the port number for the host into the Site field.
5. Enter the password. For example, enter mdpassword. When you enter this password, the characters that comprise the password are represented by the asterisk symbol (*) in the Credentials pane.
6. Click OK and this site with its credentials is added to the Credentials
pane.
7. Click Apply Changes in the Web Crawler pane.
The appropriate message appears in the Status pane after any of these
operations.
-
(Optional) If you make any changes to the configuration while the web
crawler is running, click Apply Changes. The Restart Web Crawler
window appears.
Click Yes.
If the web crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Use each of the following sections to configure your file crawler. After each
change, click Apply Changes in the File Crawler pane. If the file crawler is
running when you click this button, the Restart File Crawler window appears.
Click Yes.
2. Change the default setting 10 that is specified in
the Maximum file size field. Increasing or decreasing this number
affects the size of the documents collected. For example, you might
want to gather white papers but not books.
3. Access the calendar where you can select the first date for
the crawl. Documents that have creation dates before the date
specified in the Oldest date field are not collected by the file crawler.
4. Specify whether the file crawler runs continuously.
3. Enter an absolute path into the Path field.
3. Enter an absolute path into the Path field.
5. Continue this process, reiteratively, until you have added all of the paths.
3. Enter a file extension into the Extension field. For example, enter txt
or png. If you specify any file extension, only those file types are
returned. No other files are collected.
4. Click OK.
The appropriate message appears in the Status pane after any of these
operations.
(Optional) If you make any changes to the configuration while the file
crawler is running, click Apply Changes. The Restart File Crawler
window appears.
Click Yes.
If the file crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Click Yes.
If the feed crawler is not running, click Start.
To specify the parameters for the feed crawler, complete these steps:
1. Select Configuration --> General Settings in the Feed Crawler
pane.
4. Change the default setting of 600 for the number
of seconds for the Recrawl interval field.
5. (Optional) Enter another name for the crawler into the User agent field.
a. Paste an address for a feed into the Feed URL field. For example,
choose http://www.sas.com/success/SASRecentSuccess.xml.
b. Specify whether to follow links in the Follow links field.
There are two common types of feeds. These are the full content and
summary-only feeds. In the full content feed, all of the information
that you seek is present in the feed itself. In the summary-only feed, only a brief description of the content is passed. In this case, the link
is followed, like a traditional Web page link, to locate the rest of the
content.
If you want to crawl the summary-only feeds, select Yes in the Follow links field. Also select the parse_html document processor in the pipeline server. However, the follow links operation does not perform recursively like the web crawler.
c. Click OK and this information appears in the Feeds pane.
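The summary-only case above can be seen in a miniature RSS item: the description holds only a teaser, so a crawler that wants the full article must follow the item's link. The feed text below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A miniature summary-only RSS item: the description is only a teaser,
# so the full article lives behind <link>.
rss = """<rss><channel><item>
  <title>Press Release</title>
  <link>http://www.example.com/full-story.html</link>
  <description>A short summary only.</description>
</item></channel></rss>"""

item = ET.fromstring(rss).find("channel/item")
title = item.findtext("title")
link = item.findtext("link")

# With Follow links set to Yes, the crawler would fetch `link` next and
# hand the resulting HTML page to parse_html.
print(title, "->", link)
```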
3. Enter the Web address that you want to crawl into the Feed URL field.
The appropriate message appears in the Status pane after any of these
operations.
-
(Optional) If you make any changes to the configuration while the feed
crawler is running, click Apply Changes. The Restart Feed Crawler
window appears.
Click Yes.
If the feed crawler is not running, click Start.
2. (Optional) Change the default
setting of 20 in the Number of lines field. This field specifies the
maximum number of timestamped lines that are displayed for the
searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
Use the same set of documents for multiple purposes. In this case,
send the input documents to pipeline servers that are configured
differently. For example, send the documents to one pipeline server
for indexing and searching. Send this same set of documents to a
second pipeline server that analyzes the sentiment located in them.
You can find information about the number of documents at different stages in
this server and see a log file.
For all of these reasons, the proxy server is an integral part of any customized
configuration of SAS Information Retrieval Studio.
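The fan-out role of the proxy server described above can be pictured with a short sketch. This is an analogy only; the real proxy server communicates with pipeline servers over the network, and the callables below are hypothetical stand-ins.

```python
def proxy_dispatch(documents, pipeline_servers):
    """Send every input document to each configured pipeline server.

    Each 'server' here is just a callable that consumes a document,
    mimicking how one document set can feed differently configured
    pipeline servers (for example, indexing and sentiment analysis).
    """
    received = 0
    for doc in documents:
        received += 1
        for server in pipeline_servers:
            server(doc)
    return received

indexed, analyzed = [], []
count = proxy_dispatch(
    [{"id": 1}, {"id": 2}],
    [indexed.append, analyzed.append],  # two differently configured pipelines
)
print(count, len(indexed), len(analyzed))
```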
2. See the number of documents that were input to the proxy server in the
Documents received field.
In this example, the Quota (files) setting was set in the General
Settings pane of the Configuration pane at 25 for the web crawler. This
is the only crawler in this configuration. This crawler has returned the
maximum number of allowed documents.
3. See the number of documents that the proxy server sent to the pipeline
server.
If you see a discrepancy, you can use the Log pane to see the connections and
errors that might be the cause. For more information, see Section 8.5
Troubleshoot with the Log File on page 230.
2. Click Add and the Back-end Server window appears. Use this window
to add another pipeline server to the customized application that you are
building.
3. Enter the name of the machine into the Host field. For example, enter
Mirror1.
4. Click the up or down arrow to change the default setting of 9004 in the
Port field. For example, change the port to 9100.
5. Click OK and the new server is added to the Configuration pane.
- If you have stopped the proxy server for any reason, click Start in the
Proxy Server pane.
- (Optional) Click Stop and the proxy server stops running.
- If you make any changes to the configuration while the proxy server is
running, click Apply Changes.
The appropriate message appears in the Status pane after any of these
operations.
2. Click the arrow to select Errors.
3. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
4. Click Retrieve to see the specified number of lines in the log file pane
below.
5. Enter the text that you want to locate in the Text to highlight field.
For example, see each instance of 10 highlighted in the dates and find it
in the queue.
Advanced Installation
to SAS programs such as SAS Sentiment Analysis Workbench and SAS Text
Miner.
Note:
For more information about installing these software applications, see SAS
Information Retrieval Studio: Installation Guide or the installation guide for
each SAS application that you want to use.
extract plain text from documents such as Microsoft Word and PDF
files.
Second, if you choose to use the feed crawler, you might select
invalidate_duplicates_by_url. This operation ensures that only one copy
of a document is passed to another process. This document processor is
important for applications such as SAS Sentiment Analysis Workbench where
the freshness of the document matters.
Third, choose the content_categorization document processor if you want
to enable facetted search using the categorizer, concept, or contextual
extraction processors. You can also use these processors to categorize and
extract concepts and facts from your input documents before passing them to
another operation.
Fourth, use the export_csv and the export_to_files processors to export the
normalized (and analyzed) documents to put these documents into a format
that can be used by another application. To send documents directly to SAS
Sentiment Analysis Workbench, specify
export_to_sas_sentiment_analysis_workbench.
Note:
By default, when SAS Search and Indexing is installed, all input documents go
to the indexing server. This is true even if the documents are also sent to
other applications.
After you consider these available operations, use the Add Document
Processor window to add and configure your document processors. You can
choose to use one document processor, or you can build a pipeline that orders
several processors. For example, use the heuristic_parse_html operation to
extract paragraphs of text without their HTML tags. The next processor in the
pipeline might be the export_to_files processor that enables you to export
the file in XML or in text format. In either case, you can specify whether the
document stops here in the pipeline or goes to the indexing server.
The operations that you specify in the Document Processors pane occur in the
same order that they are listed in this pane. You can specify the document
processors in any order and use the Move up and Move down buttons to
reorder these operations. If document processing operations are incorrectly
ordered, unexpected results might occur.
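The ordering rule above can be sketched as a simple chain of functions. The two processors below are hypothetical stand-ins for heuristic_parse_html and export_to_files, written only to show why list order matters.

```python
import re

def run_pipeline(document, processors):
    """Apply document processors in the order they are listed.

    Mirrors the Document Processors pane: reordering the list changes
    the result, which is why incorrect ordering can give unexpected output.
    """
    for process in processors:
        document = process(document)
    return document

# Hypothetical stand-ins for the real processors.
def strip_tags(doc):
    # Remove HTML tags from the raw content, keeping the plain text.
    return {**doc, "body": re.sub(r"<[^>]+>", "", doc["raw"])}

def mark_exported(doc):
    # Flag the document as having reached the export stage.
    return {**doc, "exported": True}

doc = run_pipeline({"raw": "<p>Hello</p>"}, [strip_tags, mark_exported])
print(doc["body"], doc["exported"])
```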
SAS Content Categorization Studio: Use the files that you export
from SAS Information Retrieval Studio for training and testing
purposes.
Identify entities.
For more information, see the SAS Information Retrieval Studio: Installation
Guide.
3. Select parse_html.
5. Leave the default specification, raw, or enter a new field name in the
input-field. The processor uses this field to obtain the unmodified
document data. raw specifies that the original, unmodified content was
placed into the HTML document using this identifying field name.
6. Leave the default specification, title, in the title-output-field. You
can also enter a different field name where the processor stores the plain
text of the document title.
7. Leave the default specification, body, in the body-output-field. You
can enter a different field where the processor stores the body text
located in the input document. This field is used by other applications
such as SAS Content Categorization Studio, when they are part of the
processing pipeline.
8. Change the default entry to 1 in the output-metadata field and this
different field.
11. The entry 1 in the base64-input field specifies that the text is
15. (Optional) To change the ordering of the processors in the pipeline, click
To use the Document Inspector pane to see a document, use the following
steps:
1. Click Take Snapshot.
2. Click on a document processing operation that appears in the
Processing Stage pane. For example, click on
heuristic_parse_html. A document number appears in the
Document pane.
3. Click the number in the Document pane and the fields in this document
5. See the contents of the selected field in the Document Inspector pane.
3. Select add_field.
4. Click Next. The Document Processor: add_field window appears.
5. Enter the name of the field that you want to add to each input document
16. Click the document number under Document to see the fields for this
document in the previously empty pane. For example, click Data to see
that the value 062011 that you assigned to the add_field processor was
assigned to document 1.
Extraction Studio are used as concepts or facts. Any concept that is developed
in SAS Contextual Extraction Studio and specified with a PREDICATE or
SEQUENCE rule is a fact.
The content_categorization Document Processor is the client for SAS Content
Categorization Server. The categories, concepts, and facts are applied by SAS
Content Categorization Server to the documents processed by SAS
Information Retrieval Studio.
The following example uses concepts. If you want to use categories or facts,
make the appropriate substitutions. Also see Chapter 10: Creating Facetted
Search Labels Using content_categorization. This chapter uses the Document
Processor: content_categorization wizard to create labels for facetted search.
To map concepts to labels, complete these steps:
1. Select Pipeline Server.
2. Click the up or down arrow to select a different number. This is the
number of seconds that the Pipeline Server waits before this server stops
attempting to download an input field.
appears. Use this window to add any of the projects that are uploaded to
SAS Content Categorization Server to SAS Information Retrieval
Studio.
window appears.
Project
13. (Optional) Continue to add projects using Step 9. through Step 12. The
concepts in each of the projects that you select are available to match
your input documents.
14. Click Next. The Document Processor: content_categorization window
appears.
15. (Optional) Enter the fields that are in any of your input documents
where you want to locate matches for your concepts. Enter these fields
as a comma-separated list into the Input fields field. If you leave this
field blank, all fields, with the exception of those listed in the Input
fields to exclude field, are searched.
16. (Optional) By default, fields that contain information about the
document are listed in the Input fields to exclude field. You can add
additional fields, or delete fields from this list:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime
If you edit this list, be sure to insert a comma (,) between each field.
17. (Optional) If you make any changes, click Finish to save these edits.
19. Click Add to specify the concepts that are matched in input documents.
20. Click
Concept
Hint:
21. (Optional) By default, the name of the concept is entered into the Field
name field.
22. (Optional) By default, the name of the label for the facetted search is
entered into the Caption field. For example, see Location. Enter a new
caption, if you choose. For more information about facetted search, see
Chapter 10: Creating Facetted Search Labels Using
content_categorization.
23. (Optional) By default, %c: %i is entered into the Format field. The
available format symbols are %c, %p, %m, %i, %I, and %%. The %i symbol
outputs the information associated with the entity, or the match text
if no information is available.
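The expansion of a format string such as the default %c: %i can be illustrated with a small sketch. This is not the product's formatter; the meanings assumed here are only %c as the concept name, %i as the match information (falling back to the match text), and %% as a literal percent sign.

```python
def expand_format(fmt, concept, info, match_text):
    """Expand a label format string such as the default '%c: %i'.

    Assumed meanings for this sketch: %c is the concept name, %i is the
    information associated with the entity (or the match text if no
    information is available), and %% yields a literal '%'.
    """
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "c":
                out.append(concept)
            elif code == "i":
                out.append(info if info else match_text)
            elif code == "%":
                out.append("%")
            else:
                out.append(fmt[i:i + 2])  # leave unknown codes untouched
            i += 2
        else:
            out.append(fmt[i])
            i += 1
    return "".join(out)

print(expand_format("%c: %i", "Location", "", "Paris"))
```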
25. (Optional) Click Copy Defaults to revert to the concepts entries in the
Concepts tab.
27. See the newly entered concept with its field name and caption.
28. (Optional) If you want to continue to add concepts, click Add. Repeat
Step 19 through Step 26 until you have added all of the concepts that you
want to use for facetted search.
Note:
field. You can choose to enter a different name into this field.
field. You can choose to enter a different name into this field.
31. (Optional) By default, %c: %i is entered into the Default format field.
You can choose to enter different symbols into this field. You can edit
this entry using any of the symbols in Table 9-1 on page 252 with the
exception of %I.
32. (Optional) By default, ; (semicolon) is entered into the Default
separator field.
33. (Optional) By default, 15 is entered into the Max concepts field. This
37. See that the name that you entered into Field name appears in the
38. Click Start in the main Pipeline Server window to restart the Pipeline
Server.
39. When you click the Add button in the Matching pane of the query web
server, you can select this field in the Add Field window. This caption
name appears as a field in the Matching pane of the Query Web Server.
This caption also appears in the user interface when a matching term is
located in an input document.
1. Use the steps in Section 9.5 Match Categories, Concepts, and Facts on
page 246.
The appropriate message appears in the Status pane after any of these
operations.
See the progress of the input documents in the Status pane:
a. The Overall - Pending table cell is always empty.
XML tags removed in the XML parsing - Pending table cell. For
example, 1.
d. See how many XML documents have completed the process of
XML tag removal in the XML parsing - Finished table cell. For
example, 31.
e. See how many documents are in the pipeline process in the
Document processing - Pending table cell.
f.
g. See how many documents are going to the indexing server in the
Sending to the indexer - Pending table cell.
2. Click the arrow to select Errors.
3. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
4. Click Retrieve to display the specified number of lines in the log file.
5. (Optional) Enter a search term into the Text to highlight field.
10
Creating Facetted Search Labels Using content_categorization
Define your labels using the categories and concepts that you specify in SAS
Content Categorization Studio with or without SAS Contextual Extraction
Studio. Labels apply the matching requirements set by the rules that define
categories and concepts. Labels also enable facetted search operations in the
query interface of SAS Information Retrieval Studio.
Use SAS Content Categorization Studio alone to develop categories that
identify documents based on their subject matter. Also define concepts that
locate relevant terms based on rules that are specified by lists of matching
terms or parts of speech and other symbols.
Display 10.1 SAS Content Categorization Studio
When you use the add-on SAS Contextual Extraction Studio application with
SAS Content Categorization Studio, you can define LITI concepts. These
concepts increase matching precision (matches only the relevant texts) and
recall (matches all of the relevant texts). LITI concepts differ from the
classifier and grammar concepts in SAS Content Categorization Studio
because you can mix rule types within a single definition.
Contextual Extraction, or LITI, concepts can also include facts. Facts are rules
that are defined by arguments. Arguments are defined by concepts that are
related if they are matched by the fact rule. For this reason, facts return related
pieces of information in input documents. For example, define facts when you
want to identify relationships between drugs, symptoms, and gender.
Note:
When you use facts as labels, you can specify the string that is returned for the
label. Each string contains terms that are custom filled according to the
matched text.
When you choose categories, SAS Information Retrieval Studio applies all of
the categories in the selected project to input texts. Although the default
selections for concepts and facts are the same, you can select specific facts and
concepts to apply.
All LITI concept definitions that include PREDICATE and SEQUENCE rules are
treated as facts. If a LITI concept definition contains one or more facts and
other concept rules, the facts and the concepts are applied separately. The default
settings in the Document Processor: content_categorization wizard return the
matched fact and concept rules for each LITI definition under the same label
name. For this reason, consider renaming either the fact or the concept label
and field name for each LITI definition that contains a concept and a fact.
Choose the display selection that works best for your end users:
Display 10.3 Example of Default Setting
Content Categorization Studio are the default settings for SAS Information
Retrieval Studio. (You can also write a custom string that displays these
names.) For this reason, use care when specifying names and writing
PREDICATE and SEQUENCE rules that specify terms that are visible to the end
user.
Also use care when writing rules that return many matches. For example, you
might develop a SAS Content Categorization Studio project that includes an
EMAIL concept. This concept might contain rules defined by regular
expressions that are designed to return all e-mail accounts within internal
company documents. The inclusion of this EMAIL concept might not be
appropriate for a facetted search on the Web.
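A pattern like the EMAIL concept described above can be illustrated in Python's regex dialect. Real concept rules in SAS Content Categorization Studio use their own rule syntax, so this is only an analogy, and the pattern is deliberately simplified.

```python
import re

# A deliberately simple e-mail pattern of the kind an EMAIL concept rule
# might express; it is not the product's rule syntax.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact jdoe@example.com or the team at support@corp.example.org."
print(EMAIL.findall(text))
```

A rule this broad would surface every address in internal documents, which is exactly why it might be inappropriate for a public facetted search.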
Before you upload a SAS Content Categorization Studio project to SAS
Content Categorization Server, check the Project Settings - Misc tab of SAS
Content Categorization Studio. If there are entries in the XML Default Fields
field, remove these fields and leave the XML Default Fields field blank.
(These fields apply to categories and to LITI concepts and facts. For this
reason, grammar and classifier concepts in SAS Content Categorization Studio
are matched regardless of the field entries. Other matches that should occur
might not.)
Display 10.5 Project Settings - Misc Tab
Use care when changing rules and uploading projects to avoid propagating the
same rule or its variations. For example, you might upload a SAS Content
Categorization Studio project to SAS Content Categorization Server. If you
change a concept definition and upload the same project with a new name to
the server, both rules are available for matching. This is true if you add both
projects to your SAS Information Retrieval Studio project using the Document
Processor: content_categorization wizard.
In other words, matches might be made on concept definitions where one or
more definitions is specified using an outdated rule. This behavior can occur
because SAS Information Retrieval Studio consolidates all of the rules for
categories, concepts, and facts with the same names.
Naming also affects LITI facts and concepts. For example, you might have a
LITI concept definition that includes both fact and concept definitions. See the
example below:
Figure 10.2 Facts and Concept Rules in One Concept Definition
Note:
In this example, if matches occurred for both the facts and concepts, all of
these matches would return a match on the SIDE_EFFECT concept. However,
you can use the content categorization document processor to specify different
names for the concept and fact matches.
2. Use the Build menu to build, compile, and upload the relevant project.
3. Specify the name of the project in the Upload window that appears. The
entry in the Server Project Name field can be unique for the SAS
Information Retrieval Studio project.
4. (If you uploaded your project a while ago) Select Start --> Programs -->
SAS Content Categorization Server to confirm that SAS Content
Categorization Server is running.
9. Click Search to see the results. The facetted search labels appear on the
10. (Optional) Check your search results against the original project to
Hint:
3. Select content_categorization.
4. Click Next. The Document Processor: content_categorization window
appears.
Click the up or down arrow to select a different number of seconds that
the pipeline server waits before it stops trying to complete a matching
operation.
8. Click Next. You can now add your projects to SAS Content
Categorization Server.
selection.
3. (Optional) By default, a project, such as Sample, is selected in the
Project field. Click the arrow to make a different selection.
content_categorization window.
5. (Optional) Repeat Step 1 through Step 4 above until you have added all
of the projects and their matching types. For example, add MedicalProj
to include concepts. Add MedicalProj2 to match LITI concepts and
facts. If you have multiple projects for a specific matching technology,
you can upload all of these projects.
6. Click Next.
1. (Optional) By default, the Input Fields field is blank. Use a
comma-separated list to specify any field names that you want to search
for matches for your categories, concepts, and facts. If you leave this
field blank, all of the fields are searched, with the exception of any
fields entered into the Input fields to exclude field. If you specify
any fields, only the listed fields are searched.
these fields:
id,url,feed_url,raw,mimetype,date,pdate,source,
promotion,ctime,atime,mtime
individual category name. You can enter a new format that might
include %% for a literal percent sign. You can also use x as a modifier to
request XML escaping. For example, enter %xc.
5. (Optional) Enter a regular expression into the Category name pattern
field. Regular expressions specify the pattern for the category name.
6. (Optional) Enter a string into the Category name replacement field.
Enter a new separator such as a comma (,) for the matched categories.
8. (Optional) By default, the highest number of categories that can be
9. Click Finish.
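The Category name pattern and Category name replacement fields in Steps 5 and 6 can be pictured as a regular-expression substitution. The exact engine and syntax the product uses are not documented here, so this sketch assumes Python-style regex behavior.

```python
import re

def rename_category(name, pattern, replacement):
    """Apply a category-name pattern and replacement.

    Assumes the fields behave like a regular-expression substitution
    over each matched category name.
    """
    return re.sub(pattern, replacement, name)

# e.g. strip a 'Top/' prefix from every matched category name
print(rename_category("Top/Sports/Golf", r"^Top/", ""))
```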
Matches for any of the concepts that you specify explicitly, appear in the table
at the top of the Document Processor: content_categorization window. These
matches appear in the specified format and are placed into the specified output
field. Matches for any other concepts that are not in the table are assigned the
default format. The text of these matches appears in the Default field name.
You can also choose to exclude concepts from matching. For example, exclude
all of the matches that are not specified when you leave the Default field
name empty in the Concepts tab. If you want to specify one or more concepts
to exclude, leave the Field name blank when you specify the excluded
concepts.
If you want to prevent a specific concept from appearing in the output, leave
the Field name field empty.
To add concepts to the project, complete these steps:
1. Click Concepts to access the Concepts pane. You can use this pane to
add all of the concepts and contextual extraction concepts. If any of the
LITI concepts include PREDICATE or SEQUENCE rules, these rules are
matched as facts. Access these facts using the Facts pane.
appears. Use this pane to specify the settings for each individual
concept. These settings override the specifications for all of the concepts
in the Concepts pane.
3. Click
4. (Optional) When you select a concept using Step 3. above, the name of
the selected concept appears in the Field name field after you make a
selection in the Concept field.
In this example, the concept SIDE_EFFECT also contains PREDICATE
and SEQUENCE rules. For this reason, SIDE_EFFECT appears in the Facts
drop-down list also. In order to avoid ambiguity in the search results,
you can choose to rename either the concept or the fact. In this
example, negativeeffects is entered.
5. (Optional) The name of the selected concept appears in the Caption
field after you make a selection in the Concept field. For example,
Negative Effects. You can enter a new caption name such as
Negative Effects. For more information, see Section 9.5 Match
Categories, Concepts, and Facts on page 246. You can also use the
sample project in Chapter 4: Sample Configurations
Note:
6. (Optional) By default, a format is entered into the Format field for the
concept. The available format symbols are %c, %p, %m, %i, %I, and %%.
The %p symbol outputs the concept name with its path; it works like %c
but includes the path with the concept name.
10. See the concepts in the Concepts tab. Make sure that you loaded all of
the concepts.
field. You can enter a new caption name for facetted search.
13. (Optional) By default, %c: %i is entered into the Default format field
for the concept name. You can edit this entry using any of the symbols
in Table 10-1 on page 284.
14. (Optional) By default, ; (semicolon) appears in the Default separator
field.
Hint:
window appears.
3. Click
4. (Optional) When you select a fact using Step 3 above, the name of the
fact is entered into the Field name field. For example, see Side
Effect. Enter a new name if you choose.
6. (Optional) By default, the format for the matched fact is entered into the
Format field.
7. In this example, the SIDE_EFFECT concept has two arguments: drug and
gender.
The available fact format symbols are %f, %a, %v{name}, %m, %s, and %%.
The %a symbol outputs the argument name for the arguments that comprise the
definition.
8. (Optional) Enter a different format into the Format field. You can also
use any of the symbols in Table 10-2 above.
9. (Optional) By default, a , (comma) appears in the Argument separator
field.
12. Click OK. If you want to use the same settings specified in the Facts
13. (Optional) By default, facts is entered into the Default field name
field.
14. (Optional) By default, Facts is entered into the Default caption field.
for the concept name. You can edit this entry using any of the symbols
in Table 10-2 on page 289 with the exception of %v{name}.
Note:
4. (Optional) If you want to export these fields as files, select File export.
5. (Optional) If you want to export these fields in comma-separated
format, select CSV export. Choose this selection to export your files
into programs such as SAS Text Miner or Microsoft Excel.
6. (Optional) If you want to export these fields to a database, select
ODBC export.
7. Click Finish and see the categories field listed in the Document
Processors pane.
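What a comma-separated export produces can be sketched with the standard csv module. This is an illustration of the output format only; the real export_csv processor's column layout and options may differ.

```python
import csv
import io

def export_csv(documents, fields):
    """Write the selected document fields in comma-separated format.

    Fields that a document carries but that are not requested (such as
    raw content) are silently skipped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(documents)
    return buf.getvalue()

docs = [{"id": "1", "title": "Annual report", "raw": "<html>..."}]
print(export_csv(docs, ["id", "title"]))
```

A file in this shape can then be read by programs such as SAS Text Miner or Microsoft Excel, as the step above notes.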
1. Click Stop, Apply Changes, and Start to apply changes and to restart
2. Begin with the selected crawler and work down through the list of
Click Edit in the Document Processors pane. You can follow any of the
steps in Section 10.2.1 Access the Projects on SAS Content
Categorization Server on page 274 through Section 10.2.4 Specify
Output on page 292.
2. Enter a search term into the blank field to the left of the Search button.
3. Click Search.
4. See the results.
11
Configure an Index
See the list of field names that are the default selections for the index.
For example, see id, title, date, and so on.
Click Add to enter a new field name with its functionality. You can
enter any field name that is found in any of the input documents. It
is not necessary for every document to contain each of the specified
fields.
2. When you click Add or Edit, the Add Field window appears.
Purpose
Searching: (Default) Search for words that match the input query terms.
This selection is equivalent to the standard function.
Label: Select for facetted search only. This selection is equivalent to
marking the field as both standard and Boolean.
For more information about facetted search labels, see Section 9.5 Match
Categories, Concepts, and Facts on page 246.
Other Purpose selections include Display and Sorting, Identification, and
Custom.
3. Click
4. Use the Add Field window to add, and make changes to, all of the fields
in the index.
5. Click Apply Changes to delete the current index and to set the
If you change the field names, types, and functionalities that you specify
in the Configuration pane of the indexing server, the index is affected.
Whenever you make a change to any of these operations, the current index is not
affected. These changes can affect only the new index. For this reason, you have
two choices:
- Click the Delete Index button to remove the existing index. A new
index can be built with the specified changes after you restart the
crawler.
- Click the Apply Changes button when the indexing server is running.
The existing index is deleted and the indexing server is restarted so that
a new index can be built.
For example, if you make changes to fields in the pipeline server, they can
affect the indexing server.
The appropriate message appears in the Status pane after any of these
operations.
(Optional) If you make any changes that affect the index, click Delete
Index. This operation removes the old index. For example, if you add a
title field to the list of indexed fields, a new index might be necessary.
2. Click the up or down arrow to change the default setting of 20 in the
Number of lines field. This field specifies the number of lines that are
displayed for the searchable log file in this pane.
3. Click Retrieve to display the specified number of lines in the log file.
4. (Optional) Enter a search term into the Text to highlight field.
12
The appropriate message appears in the Status pane after both the start
and stop operations.
2. Click the up or down arrow to select a new Number of lines; the default
setting is 20. For example, choose 25 to see more lines.
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to
highlight field.
13
3. Click the arrow in the Hierarchical field to choose a hierarchical,
non-hierarchical, or flattened display of the labels. In this example,
Yes is selected to enable a hierarchical display of the categories.
4. Make any other changes and click OK to see this selection in the Labels
pane.
See the following examples that include search windows that do, and do not,
display labels.
Note:
You can click the left mouse button on a hyperlink label to make one of the
following selections:
Require
the path to the selected label appears below the search box. The displayed
documents match both the query term and the selected label. If you specify
more than one label, the documents match the query term and the selected
labels. In this case, each path is appended with a plus sign (+).
Exclude
one label, preceded by the minus sign (-), appears in the SAS Information
Retrieval Studio search window. The displayed documents match the
query term, but not the selected labels.
View
one label appears in the SAS Information Retrieval Studio search window
that displays all of the matching documents for this label only. Bolded
matches for existing query terms no longer appear below the document
links on the right side of the search window.
Remove
this operation is available for the label, or path, appearing at the top of the
SAS Information Retrieval Studio search window. This selection is only
available after you use any of the above operations.
Hover the mouse over a category or concept to see the hierarchy, or
parent-child relationships, existing in SAS Content Categorization Studio.
specify the query fields and enable end users to mark required and
excluded words and quoted phrases by prefixing them with plus (+) and
minus (-) signs. When you make this selection, you specify the indexing
fields that are searched.
Advanced (bsearch)
enable end users to specify the query fields. Query terms can be combined
when you specify the following operators:
- Boolean operators such as AND, OR, and NOT add precision to your search.
- Positional words such as SENT and PAR specify that matches are located
only if the specified words appear in the same sentence or paragraph,
respectively.
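The Boolean part of the Advanced (bsearch) syntax can be modeled with a toy evaluator. This is not the product's query language; the real syntax, including the positional words SENT and PAR, is richer than this sketch.

```python
def matches(doc_terms, query):
    """Evaluate a tiny AND/OR/NOT query against a document's term set.

    Queries are nested tuples, e.g. ("AND", ("TERM", "a"), ("NOT", ("TERM", "b"))).
    """
    op, *args = query
    if op == "TERM":
        return args[0] in doc_terms
    if op == "AND":
        return all(matches(doc_terms, a) for a in args)
    if op == "OR":
        return any(matches(doc_terms, a) for a in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(op)

doc = {"sas", "search", "index"}
q = ("AND", ("TERM", "search"), ("NOT", ("TERM", "sentiment")))
print(matches(doc, q))
```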
Click the arrow to select Advanced.
Click the arrow to select a different field. Each of these fields is listed
in the Configuration pane of the indexing server. In other words, the fields
that appear in this drop-down menu are also in the index.
Click the up or down arrow to select a new Weight. For example, choose 5 to
weight matches that are located in the body field more heavily than
those in the title field. The weight setting is relative across all fields.
5. Click the arrow to choose a different selection in the Sort type
drop-down menu. The selection that you make in this field determines the
fields that are displayed below this field.
If the density weight is more important than any of the other weights,
specify the highest weight number for this field.
a. (Optional) Click the up or down arrow to select a new Cosine
weight. This metric assigns the highest weights to the most
frequently occurring terms. It takes noise words into consideration.
(Noise words are the words that appear with enough frequency
that they are ranked down.)
b. (Optional) Click
c. (Optional) Click
d. (Optional) Click
e. (Optional) Click
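The idea of relative field weights can be sketched with a simple hit-count scorer. This is only an analogy: the product's actual ranking metrics (cosine, density, and so on) are more sophisticated than counting term occurrences.

```python
def score(document, query_terms, field_weights):
    """Rank by counting query-term hits per field, scaled by field weight.

    With body weighted 5 and title weighted 1, a hit in the body
    contributes five times as much as a hit in the title.
    """
    total = 0
    for field, weight in field_weights.items():
        words = document.get(field, "").lower().split()
        total += weight * sum(words.count(t) for t in query_terms)
    return total

doc = {"title": "sas studio", "body": "sas information retrieval studio uses sas"}
print(score(doc, ["sas"], {"title": 1, "body": 5}))
```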
Note:
3. Leave the default field selection in Field name. For example, click the
arrow to select either Yes or Flattened. To see the types of results
that are displayed for these selections, see Section 13.2 Choosing How
Search Returns Are Displayed on page 310.
Click the arrow to select No. If you select No, the numbers of matching
documents do not appear to the right of the labels in the SAS Information
Retrieval Studio search window.
6. Click the up or down arrow to select a number of matching fields in the
Show in matches field. For example, choose to display the three
categories with the highest number of matches in the SAS Information
Retrieval Studio search window. (If there are more than the specified
number of fields, the term and other information appears. This term is
appended to the list to indicate that the display is incomplete.)
7. Click the up or down arrow to select a number of labels to display in
the search window.
For example, the link field might contain the unique identifier 12345.
However, the browser does not understand this string. In this case, set
Determine the look and feel of the SAS Information Retrieval Studio search
window.
To specify the theme of the search window, complete these steps:
1. Select Query Web Server --> Configuration --> Theme.
3. Leave the default selection sans-serif, or enter a new display font into
this field. Click the up or down arrow to select a new size for the
display letters in the Font size field. For example, choose 12 to
display the search returns in a larger font size.
6. Click the link in the Status tab to see the results of your changes in the
search window.
For more information about formatting the search window, see http://
www.w3.org/TR/CSS/ui.html#system-colors.
field. Click the arrow to select a color such as red.
Note:
3. Leave the default selection Custom in the Header text color field.
Click the arrow to select a different color.
4. Click the color box to select a new color.
5. Leave the default selection Custom in the Visited link color field. Click the color box to select a new color.
6. Leave the default selection Custom in the Hover link color field. Click the color box to select a new color.
7. Leave the default selection Window in the Menu border color field. Click the arrow to select ActiveBorder, ActiveCaption, AppWorkspace, Background, and so on.
8. Leave the default selection Window in the Menu unselected background color field. Click the arrow to select ActiveBorder, ActiveCaption, and so on.
2. Leave the default selection None in the Left header image field. Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.
3. Leave the default selection sas.png in the Right header image field. Click the arrow to select one of the images that you loaded into the work/query-web-server subdirectory of your installation directory.
4. Click the link in the Status tab to see the results of your changes in the search window.
The appropriate message appears in the Status pane after any of these
operations.
Click the link to the machine where the query web server is running to
see the SAS Information Retrieval Studio search window.
1. Click the up or down arrow to select a new Number of lines; the default setting is 20. For example, choose 25 to see more lines.
2. Click
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to highlight field.
2. Click Today to see the screen shown above. Today's date is displayed.
number of times that these words were entered into the search window.
5. See the query with the highest number of entries at the top of the list.
7. Click the Most Frequent Queries Without Matches tab.
8. See any search terms that were not matched by the searched corpus and the number of times that each term was input by end users. For example, produc was entered one time.
10. Click the Hourly Query Rate tab to see the query traffic by hour.
11. See each Hour and the Number of Queries. For example, see 8 am with 17 queries.
6. See each Day of the week and the Number of Queries input for that day.
6. Click the Monthly Query Rate tab to see the total number of queries that were received each month.
7. See each Month and the Number of Queries for that month.
The number for each day of the week matches the total number of input queries received on each of the respective weekdays over the course of the year.
Hint:
6. Use Step 6 through Step 7 on page 347 for the Monthly Query Rate tab.
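As an illustrative sketch (the timestamps are hypothetical, not product data), the weekday totals described above can be computed by bucketing each query's timestamp by its day of the week:

```python
from collections import Counter
from datetime import datetime

# Hypothetical query timestamps; in practice these would come from the query log.
timestamps = [
    "2011-06-13 08:15",  # a Monday
    "2011-06-14 09:00",  # a Tuesday
    "2011-06-20 10:30",  # a Monday in a later week
]

# Bucket every query by its day of the week, across the whole period.
by_weekday = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%A") for ts in timestamps
)
print(by_weekday["Monday"])  # -> 2
```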
Discover any changes that should be made to the index. For example,
see whether queries without matches might be matched if an additional
field is added to the index.
See whether the most frequent query terms adequately match the
searched corpus. If not, add a new link.
1. Click the up or down arrow to select a new Number of lines; the default setting is 20. For example, choose 25 to see more lines.
2. Click
3. Click Retrieve to display this number of lines in the blank pane below.
4. (Optional) Enter the terms that you want to locate in the Text to highlight field.
Appendixes
Appendix A: Regular Expressions and XML Field Extraction File
Regular Expressions
PCRE: http://www.pcre.org/
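As a brief illustration, the following uses Python's re module, whose syntax largely overlaps with PCRE, to capture a date field from raw text. The pattern and sample text are hypothetical, not taken from the product:

```python
import re

# Hypothetical raw document text; the pattern below is an example only.
text = "Published: 2011-06-15 by staff"

# A PCRE-style pattern with a named capture group for the date field.
match = re.search(r"Published:\s*(?P<pdate>\d{4}-\d{2}-\d{2})", text)
if match:
    print(match.group("pdate"))  # -> 2011-06-15
```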
<thumbnail>
<tsrc>http://img.com/</tsrc>
</thumbnail>
</article>
Suppose that you want to extract the value of the content field and the value of the tsrc field in the thumbnail field. To extract only the tsrc field that is located inside the thumbnail field, specify the following syntax:
<article>
<content />
<tsrc index="no" />
<thumbnail index="no">
<tsrc />
</thumbnail>
</article>
In this example, the attribute index has the value "no". This value specifies that the parser does not add the value of this field to its list of documents. The default value of the index attribute is "yes". This default means that every field in the input XML that does not have the index attribute remains in the document.
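A minimal sketch of this filtering rule, assuming a simplified model in which an element's path in the specification determines whether its text is extracted. The parser below is illustrative only and is not the product's implementation:

```python
import xml.etree.ElementTree as ET

# The field extraction specification from the example above.
SPEC = """<article>
  <content />
  <tsrc index="no" />
  <thumbnail index="no">
    <tsrc />
  </thumbnail>
</article>"""

# A hypothetical input document that matches the structure discussed above.
DOC = """<article>
  <content>Story text</content>
  <tsrc>http://other.com/</tsrc>
  <thumbnail>
    <tsrc>http://img.com/</tsrc>
  </thumbnail>
</article>"""

def build_rules(spec_elem, path=""):
    """Map each element path in the specification to True (index) or False (skip)."""
    rules = {}
    for child in spec_elem:
        p = path + "/" + child.tag
        rules[p] = child.get("index", "yes") == "yes"  # the default value is "yes"
        rules.update(build_rules(child, p))
    return rules

def extract(doc_elem, rules, path=""):
    """Collect (path, text) pairs for every element whose rule allows indexing."""
    fields = []
    for child in doc_elem:
        p = path + "/" + child.tag
        if rules.get(p, True) and child.text and child.text.strip():
            fields.append((p, child.text.strip()))
        fields.extend(extract(child, rules, p))
    return fields

rules = build_rules(ET.fromstring(SPEC))
fields = extract(ET.fromstring(DOC), rules)
print(fields)  # -> [('/content', 'Story text'), ('/thumbnail/tsrc', 'http://img.com/')]
```

Only the content field and the tsrc field nested inside thumbnail survive; the top-level tsrc field is excluded by its index="no" attribute.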
Appendix B: Recommended Reading
The following books are recommended as companion guides:
- SAS Sentiment Analysis Workbench: User's Guide: Review and edit the automated analyses and create reports with graphs that illustrate these analyses.
Use the language book that applies to the language that you use to create your project. Each of the SAS world language books contains a comprehensive list of part-of-speech tags.
For a complete list of SAS publications, see the current SAS Publishing
Catalog. To order the most current publications or to receive a free copy of the
catalog, contact a SAS representative at
SAS Publishing Sales
SAS Campus Drive
Cary, NC 27513
Telephone: (800) 727-3228*
Fax: (919) 677-8166
E-mail: sasbook@sas.com
Web address: support.sas.com/pubs
* For other SAS Institute business, call (919) 677-8000.
Customers outside the United States should contact their local SAS office.
Appendix C: Glossary
Boolean operators
specify words such as AND, OR, and NOT that construct logical definitions to locate the matches that you seek.
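For example (the search terms are hypothetical), a query that combines these operators might look like this:

```text
(printer OR scanner) AND drivers NOT linux
```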
caption
specifies multiple sets of training documents. See corpus for one set.
crawl
definition
defines a concept. There can be many rules for each concept definition.
This term is also used interchangeably with rule. See rule.
document
hash
changes a string of characters into a value that can be indexed. The hash process expedites the search process.
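As an illustrative sketch only (the product's actual hash function is not documented here), a simple polynomial hash maps a string of characters to a fixed-range integer that can be indexed:

```python
# A toy polynomial rolling hash; illustrative only, not the product's algorithm.
def simple_hash(s: str, buckets: int = 1024) -> int:
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % buckets  # multiply-and-add over character codes
    return h

print(simple_hash("document"))  # -> 283
```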
label
specifies the value of the field that is passed to the query web server for each match that appears within the search window. Also see caption.
metadata
MIME type
noise words
appear with enough frequency that they are ranked down in the metrics for weight.
polite
means that a single thread does not overwhelm a site with download requests and that the crawler respects the robots.txt standard. This standard enables Web site developers to specify the portions of their sites that should not be crawled.
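For illustration, a minimal robots.txt file (the path is hypothetical) that asks all crawlers to skip one directory looks like this:

```text
User-agent: *
Disallow: /private/
```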
precision
see where the information about the product is located in the document.
The information can appear primarily in the top 20%, or in the bottom
80%, of the selected document.
raw
specifies the original, unmodified content that was placed into an HTML
document using this identifying field name.
recall
measures the number of documents that are a match for the definition out
of those texts that were successfully returned.
rule
Index
A
Abstract source
defined ...............................................................................................................59
Action heading
defined ...............................................................................................................20
Add Backend window usage ..................................................................................141
Add button
defined ............................................... 19, 20, 22, 23, 28, 29, 30, 38, 42, 47, 53, 56
Add Credential window
usage ................................................................................................................135
Add Entry Point window
usage ................................................................................................................121
Add Extension window
usage ................................................................................................................139
Add Field window
usage ................................................................................................ 143, 146, 148
Add Filename Extension window
usage ................................................................................................................133
Add HTTP Proxy window ......................................................................................120
usage ................................................................................................ 118, 119, 120
Add keywords to PDF links
defined ...............................................................................................................59
Add Path to Exclude window
usage ................................................................................................................138
Add Path window
usage ................................................................................................................137
Add Scope Rule window
usage ................................................................................................................130
add_field
document processor ...........................................................................................74
All Time button
defined ...............................................................................................................67
B
bsearch
defined ............................................................................................................ 317
C
Caption heading
defined .............................................................................................................. 56
categorizer
defined ............................................................................................................ 158
color box window
usage ............................................................................................................... 151
colors
search window ................................................................................................ 331
Colors tab
defined .............................................................................................................. 61
concept_extractor
defined ............................................................................................................ 159
Configuration pane
defined .............................................................................................................. 37
content_categorization
document processor .......................................................................................... 74
contextual_extractor
defined ............................................................................................................ 159
Cosine Weight
defined .............................................................................................................. 55
Crawl continuously
defined .........................................................................................................27, 33
Credentials tab
web crawler ....................................................................................................... 14
D
Date
Sort tab ..............................................................................................................54
Date source
defined ...............................................................................................................59
Day
defined ...............................................................................................................71
Day field
defined ...............................................................................................................67
default_mime_type_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
default_title_from_url
defined .............................................................................................................159
document processor ...........................................................................................74
Delete Index button
defined ....................................................................................................... 45, 303
Delete Index window
usage ................................................................................................................153
Density Weight
defined ...............................................................................................................55
document
defined ...............................................................................................................44
Document processing heading
defined ...............................................................................................................40
Document Processor
add_field window ..............................................................................................77
content_categorization window ................................................................. 78, 274
default_mime_type_from_url window ..............................................................95
default_title_from_url window ..........................................................................95
document_converter window ............................................................................96
export_csv window ............................................................................................97
export_to_files window ...................................................................................100
export_to_odbc window ..................................................................................102
export_to_sentiment_analysis_workbench window ........................................104
extract_abstract window ..................................................................................106
extract_pdate window ......................................................................................107
heuristic_parse_html window ..........................................................................108
invalidate_duplicates_by_url window .............................................................110
match_and_copy window ................................................................................111
modify_field_name window ............................................................................112
E
Edit Backend window
usage ............................................................................................................... 142
Edit button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 39, 42, 47, 53, 57
Edit Credential window
usage ............................................................................................................... 136
Edit Entry Point window
usage ............................................................................................................... 125
Edit Extension window
usage ............................................................................................................... 140
Edit Field window
usage ........................................................................................................147, 150
Edit Filename Extension window
usage ............................................................................................................... 134
Edit Path to Exclude window
usage ............................................................................................................... 139
Edit Path window
usage ............................................................................................................... 138
Encapsulate XML files
defined ............................................................................................................... 27
entry points
defined ............................................................................................................ 196
F
facetted search
defined .............................................................................................................163
feed crawler
configure ..........................................................................................................218
defined ......................................................................................................... 9, 157
Feeds pane .........................................................................................................34
General Settings pane ........................................................................................33
operations ..........................................................................................................31
run ....................................................................................................................222
usage ................................................................................................................217
Feed URL
feed crawler .......................................................................................................34
Feeds tab
feed crawler ....................................................................................................... 32
Field Name heading
defined .........................................................................................................46, 56
Field Name tab
defined .............................................................................................................. 52
Field value ................................................................................................................ 54
Sort tab .............................................................................................................. 54
file crawler
configure ......................................................................................................... 207
defined .........................................................................................................9, 156
general settings ............................................................................................... 208
run ................................................................................................................... 213
Filename Extensions tab
file crawler ........................................................................................................ 26
web crawler ....................................................................................................... 14
Find button
defined .............................................................................................................. 12
Finished heading
defined .............................................................................................................. 41
flattened hierarchy
search returns .................................................................................................. 315
Follow Links
feed crawler ....................................................................................................... 34
Font
defined .............................................................................................................. 60
Font size
defined .............................................................................................................. 61
formatting
query web server ............................................................................................. 162
Freshness Weight
defined .............................................................................................................. 55
fsearch
defined ............................................................................................................ 317
Functionality heading
defined .............................................................................................................. 46
G
General Settings tab
feed crawler .......................................................................................................32
file crawler .........................................................................................................26
web crawler ............................................................................................... 14, 193
H
Header background color
defined ...............................................................................................................62
heuristic_parse_html
defined .............................................................................................................159
document processor ...........................................................................................75
Host
defined ...............................................................................................................38
Hour
defined ...............................................................................................................70
HTTP proxy
defined ...............................................................................................................15
feed crawler .......................................................................................................33
I
Import Settings window ..........................................................................................118
index
configure ..........................................................................................................298
defined .............................................................................................................161
input documents ..................................................................................................8
indexing server
defined ...............................................................................................................10
run ....................................................................................................................302
usage ................................................................................................................297
input documents
index ....................................................................................................................8
invalidate_duplicates_by_url
defined .............................................................................................................159
document processor ...........................................................................................75
L
labels
defined ............................................................................................................ 163
hierarchy ......................................................................................................... 312
navigation tools ............................................................................................... 264
usage ............................................................................................................... 321
Labels tab
defined .............................................................................................................. 51
Last busy time heading
defined .............................................................................................................. 41
Last document processed
defined .............................................................................................................. 37
Last document received
defined .............................................................................................................. 37
Left header image
defined .............................................................................................................. 64
Link prefix
defined .............................................................................................................. 59
Link source
defined .............................................................................................................. 59
Link suffix
defined .............................................................................................................. 59
Link traversal order
defined .............................................................................................................. 17
log file
entire application ............................................................................................... 11
feed crawler ..................................................................................................... 223
file crawler ...................................................................................................... 214
indexing server ................................................................................................ 303
pipeline server ................................................................................................. 260
proxy server ...............................................................................................12, 230
query server ..................................................................................................... 306
query web server ......................................................................................337, 349
troubleshoot .................................................................................................... 223
M
Match Formatting tab
defined .............................................................................................................. 51
usage ............................................................................................................... 326
match type
select ................................................................................................................318
Match Type heading
defined ...............................................................................................................20
match_and_copy
document processor ...........................................................................................75
matches
sort ...................................................................................................................319
Matching tab
defined ...............................................................................................................50
usage ................................................................................................................317
Maximum file size (megabytes)
defined ...............................................................................................................26
Maximum number of related labels
defined ...............................................................................................................57
Maximum number of retries
defined ...............................................................................................................16
Menu selected background color
defined ...............................................................................................................63
Menu selected text color
defined ...............................................................................................................64
Menu unselected background color
defined ...............................................................................................................63
Menu unselected text color
defined ...............................................................................................................63
MIME type source
defined ...............................................................................................................59
modify_field_name
defined .............................................................................................................159
document processor ...........................................................................................76
Month
defined ...............................................................................................................72
Month field
defined ...............................................................................................................67
most frequent queries
defined .............................................................................................................162
query statistics server ......................................................................................162
most frequent queries without matches
query statistics server ......................................................................................163
Move Down button
defined ......................................................................................................... 42, 57
Move Up button
defined .........................................................................................................42, 57
N
Next button
defined .............................................................................................................. 67
no hierarchy
search returns .................................................................................................. 314
no labels
search returns .................................................................................................. 312
Number of downloader threads
defined .............................................................................................................. 16
Number of lines
defined .............................................................................................................. 11
Number of matching fields
Sort tab .............................................................................................................. 54
Number of matching terms
Sort tab .............................................................................................................. 54
Number of Occurrences
defined .............................................................................................................. 69
Number of Occurrences heading
defined .............................................................................................................. 68
Number of Queries
defined ...................................................................................................70, 71, 72
O
Oldest date
defined .............................................................................................................. 27
operation history
log file ............................................................................................................. 206
Order added to the index
Sort tab .............................................................................................................. 55
Overall heading
defined .............................................................................................................. 40
P
parse_html
defined .............................................................................................................160
document processor ...........................................................................................76
parse_xml
defined .............................................................................................................160
document processor ...........................................................................................76
Password
defined ...............................................................................................................23
password-protected sites
crawl ................................................................................................................203
Paths tab
file crawler .........................................................................................................26
paths to crawl
specify .............................................................................................................209
paths to exclude
file crawler .......................................................................................................211
Paths to Exclude tab
file crawler .........................................................................................................26
Pending heading
defined ...............................................................................................................41
pipeline server
defined .................................................................................................................9
operations ........................................................................................................234
run ....................................................................................................................258
Pipeline Server tab
operations ..........................................................................................................39
Pipeline Stage
stages .................................................................................................................40
Port
defined ...............................................................................................................38
Position Weight
defined ...............................................................................................................55
Previous button
defined ...............................................................................................................67
processes
order .................................................................................................................164
Proximity Weight
defined ...............................................................................................................55
proxy server
defined ...................................................................................9, 35, 157
configure .........................................................................................228
operations ........................................................................................157
run ....................................................................................................229
status ...............................................................................................226
usage ...............................................................................................225
Q
queries ........................................................................................................................ 8
Query
defined .............................................................................................................. 69
Query heading
defined .............................................................................................................. 68
query rate by day
defined ............................................................................................................ 163
query rate by hour
defined ............................................................................................................ 163
query rate by month
defined ............................................................................................................ 163
query rate for all time
defined ............................................................................................................ 163
query rates
query statistics server ...................................................................................... 163
query server
defined .................................................................................................10, 48, 305
usage ............................................................................................................... 305
query statistics
all queries ........................................................................................................ 348
see ............................................................................................................341, 344
this year ........................................................................................................... 346
usage ............................................................................................................... 349
Query Statistics pane
usage ............................................................................................................... 341
query statistics server
defined .............................................................................................................. 10
run ................................................................................................................... 340
usage ............................................................................................................... 339
R
Recrawl interval
defined ...............................................................................................................33
Refresh button
defined ........................................................................................................... 9, 13
Relevancy
Sort tab ..............................................................................................................54
Remove button
defined ......................................... 19, 21, 22, 23, 28, 29, 30, 34, 38, 42, 47, 53, 56
Reset to Default button
defined ...............................................................................................................64
Respect robots.txt
defined ...............................................................................................................16
Retrieve button
defined ...............................................................................................................11
Retry delay (seconds)
defined ...............................................................................................................16
Revert button
defined ...............................................................................................................13
indexing server ..................................................................................................45
Right header image
defined ...............................................................................................................64
robots.txt
defined ...............................................................................................................16
S
sample project
set up ........................................................................................................168, 179
SAS Content Categorization Server ................................................................234, 237
install ................................................................................................234, 235, 237
taxonomies ...............................................................................................234, 237
SAS Content Categorization Studio ................................................................235, 237
concepts and categories .................................................................................. 234
SAS Contextual Extraction Studio
concepts and facts ....................................................................................234, 237
SAS Document Conversion
install ............................................................................................................... 234
SAS Sentiment Analysis Workbench
install ........................................................................................................235, 237
SAS Text Miner ..............................................................................................235, 237
install ........................................................................................................235, 237
scope
defined ............................................................................................................ 198
Scope tab
defined .............................................................................................................. 14
search
query web server ............................................................................................. 162
type ...................................................................................................................... 8
search box
customize ........................................................................................................ 310
Search type
defined .............................................................................................................. 52
send
defined ............................................................................................................ 160
document processor .......................................................................................... 76
Sending to the indexer heading
defined .............................................................................................................. 40
Server Port field
defined .............................................................................................................. 50
specify ............................................................................................................. 317
Site
defined .............................................................................................................. 23
Sleep interval (seconds)
defined .............................................................................................................. 16
sort
matches ........................................................................................................... 319
T
Take Snapshot
usage ..................................................................................................................43
Text to highlight
defined ...............................................................................................................12
Theme pane
usage ................................................................................................................329
Theme tab
defined ...............................................................................................................51
This Month button
defined ...............................................................................................................66
This Year button
defined ...............................................................................................................66
Tiebreaker
defined ...............................................................................................................55
Timeout (seconds)
defined ...............................................................................................................16
Title
defined .............................................................................................................. 60
Title field
defined .............................................................................................................. 58
Title source
defined .............................................................................................................. 58
Today button
defined .............................................................................................................. 66
types of files
limit ................................................................................................................. 202
U
URL heading
defined .............................................................................................................. 18
URL Pattern heading
defined .............................................................................................................. 20
Use pop-up menus
defined .............................................................................................................. 61
User agent
defined .............................................................................................................. 33
Username
defined .............................................................................................................. 23
W
web crawler
configure ......................................................................................................... 192
defined .........................................................................................................9, 156
run ................................................................................................................... 205
specify operations ........................................................................................... 193
Web Crawler pane
operations .......................................................................................................... 12
Weight tab
defined .............................................................................................................. 52
X
XML document
extract contents ............................................................................................... 353
Y
Year field
defined ...............................................................................................................67