
Title: Web Harvesting; Request for Information

PSC: D317
NAICS: 519130
Response Date: 12/30/14; 1:00pm
Background Summary (Extended background provided below)
The Library of Congress, Office of Strategic Initiatives (OSI) is seeking information from
potential contractors about how best to design a requirement related to saving and reviewing
information from the Internet. The Library is seeking information, e.g., current/existing
commercial solutions, design solutions, etc., on how best to meet this web harvesting
requirement.
This RFI is to determine if potential offerors can meet the Library's technical and production
requirements for harvesting web content and to receive feedback on pricing models and
reasonable quality assurance. The Library is actively seeking suggested solutions and alternatives
that will meet our requirements.

The purpose of this RFI is to gain information related to the following issues:
1) Is the Draft Scope of Work (below) sufficiently clear? If not, please explain
what additional information should be provided, or what alternate description would be
more appropriate.
2) Are the requirements identified in the Draft Scope of Work reasonable? If not, please
identify those requirements that are considered unreasonable as well as propose a version
that is considered to be reasonable.
3) Are there alternatives to sections in the Scope of Work that would be beneficial to the
Library in some way?
4) Feedback on, or suggested alternatives to, the Library's potential
requirements as articulated in the Description of Potential Requirements
below.
5) Suggested definition of web harvesting contract deliverables to take into
account and accommodate fluctuating numbers of seed URLs, varying
magnitude of the sites harvested, technical difficulties, and varying
frequencies for individual seeds or groups of seeds.
6) Pricing models for web harvesting services, consistent with the defined
deliverables. The Library is NOT interested in time and materials pricing
models for this work.
7) Proposed methods (manual or automated) for vendor quality assurance of
content capture. The Library is particularly interested in how vendors ensure
that web content is captured successfully. A successful crawl would be
complete, as scoped by the Library, and would include all digital objects
required to render the identified seeds and scopes as they were at the time of
capture, taking into account documented limitations of the Contractor's tools as
agreed to by the Library.
8) Capabilities and limitations of vendors' crawler and access tools in their
ability to capture and replay content.
9) The qualifications/certifications a responsible contractor should have to
perform this service.
10) The terms and conditions under which this service is typically procured.
11) Opinion on whether or not this is considered a commercial service.
12) Would you consider bidding on a solicitation for these services?

Responder Requirements
Responders are requested to provide:
1) Company name, company contact, contact address, phone number, and e-mail address
2) Information addressing one or more of the Issues listed above
3) Descriptions of technologies or procedures that will satisfy the Library's requirements
4) The Library's preference is to issue a firm-fixed-price contract, with a price set for a firm
amount before the contract service begins. Please describe benchmarks on which payments
for such a contract could be arranged, or your recommendation for how such a contract
should be billed.
5) Rough estimates of what the described service may cost and what the major cost drivers
of that estimate are.

Questions
Questions regarding this announcement shall be submitted in writing by e-mail to:
Name: Sherman Mayle
Email: smay@loc.gov
Verbal questions will NOT be accepted. Questions may be answered by posting answers to the
FedBizOpps website; accordingly, questions shall NOT contain proprietary or classified
information. The Government does not guarantee that questions will be answered.

RFI Qualification
THIS IS A REQUEST FOR INFORMATION (RFI) ONLY to identify sources that can provide
the requirement as described in this announcement. The information provided in the RFI is
subject to change and is not binding on the Government. The purpose of this RFI is for planning
and market research purposes only. The information obtained from responses to this RFI may be
used in the development of an acquisition strategy and in the development of a future
solicitation.
Proprietary information or other sensitive information should be clearly marked. The Library of
Congress has not made a commitment to procure any of the items discussed, and release of this
RFI should not be construed as such a commitment or as authorization to incur cost for which
reimbursement would be required or sought. All submissions become Government property and
will not be returned.

Location
Library of Congress
101 Independence Avenue, S.E.
Washington, D.C., 20540.

RFI CONTENTS

Extended Background Summary
Technical Definitions
Draft Scope of Work
Attachment 1, Sample Seed List
Attachment 2, Sample Host Data Report


Extended Background Summary


The Library of Congress is the nation's oldest federal cultural institution and the
largest library in the world. Among its vast holdings are digital materials such as
websites. Since 2000, the Library has collected and preserved web content related to
a variety of thematic web and event-based topics, such as the United States National
Elections, Public Policy Topics, Congressional and Legislative Branch, and Web
Comics. The harvesting of selected websites for the Library's collections supports
the Library's strategic goal to acquire, preserve, and provide access to a universal
collection of knowledge and the record of America's creativity.
Please see Definitions of Terminology (Section J, Attachment 1 of the Description
of Potential Requirements) for descriptions of terms used in this RFI. Currently the
Library is collecting over 5,000 seed URLs (see definitions below) on frequency
schedules of one-time, weekly, monthly, semi-yearly, or yearly, which are part of
event or thematic collections. In the past, the Library has specified that seeds be
grouped into monthly or weekly crawls (currently two monthly crawls lasting 7 days
each, as well as weekly crawls for our general collections and separate weekly crawls
for our Election archives). Harvested content is stored in the WARC (Web ARChive)
file format (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml). The
Library collects anywhere from 4 TB a month to 20 TB a month (total compressed
WARC size); a typical month averages 9.5 TB. The amount collected each month
and year depends on a number of factors, including:
a. The collecting needs of the curatorial divisions of the Library. The specific
seed URLs identified for capture for each crawl can range anywhere from 50
seed URLs to 3000 seed URLs.
b. Websites often have functionality or structure that challenges crawler
technologies and sometimes affects the size of the data captured.
c. Websites of interest to the Library can vary in size greatly, and very large
sites (more than 100,000 digital objects, for example), can be a challenge to
harvest completely.
d. The number of crawls and seed URLs increases in election years, during
which the Library harvests increased amounts of content on a more frequent
basis. The Election crawls usually start small and then gradually increase in
size as Election day gets nearer and candidate websites are identified for
collection. In years with Presidential races, the Library begins crawling
presidential campaign sites about a year and a half out from Election day;
during midterm elections, the crawls begin 6-7 months ahead of the election.
This activity is predictable; however, it varies by year depending on the type of
election and the range of candidates running for office.
e. Funding availability.


Definitions

Breadth: The scope and limitations bounding a particular seed URL in a
specific crawl. The breadth might consider a list of paths (such as the seed
URL's domain, subdomain, subdirectories, or files) to include or exclude in
the specific crawl.
Crawl: A process of harvesting a discrete batch of URLs.
Depth: The distance from a given seed, measured in link-to-link hops. Depth
does not correspond to a website's structure of directories and subdirectories, so it
is an arbitrary way of limiting the crawl's scope.
Deduplication / Deduping / Duplication Reduction: The process of
eliminating duplicate content from a given crawl's output. Some web
harvesting tools have a feature that allows for deduplication based on a
defined set of rules or collection of earlier crawl content files.
Digital Object: Any object or document that actively contributes to the
complete rendering of a website in a standard web browser.
Frequency: How often a seed URL or seed list is to be harvested, for
example: weekly, monthly, yearly.
Crawler, or Harvester: The software (or robot) used to harvest sites on
the web.
Crawling or Harvesting: Downloading web content via a crawler or
harvester.
Holey Bag: A directory container that follows the metadata and file-layout
structure outlined in the BagIt specification, which includes the optional
fetch.txt file along with an empty payload (or data) directory.
Scopes: Additional URLs associated with a particular seed URL that provide
further instructions on what content to harvest for the given seed website (i.e.,
the links that shall be followed).
Seed list: A list of seed URLs, each of which may or may not include an
associated list of scope URLs, provided to the crawler.
Seed URL, or seed: The crawler's entry point for a given website.
SURT: Sort-friendly URI Reordering Transform, a transformation applied to
URLs that makes their left-to-right representation better match the natural
hierarchy of domain names. For example, the SURT-formatted version of
http://www.example.com/path/to/file.html would be
http://(com,example,www,)/path/to/file.html (an illustrative sketch follows this list).
WARC (Web ARChive) File Format: ISO standard format which specifies
a method for combining multiple digital resources into an aggregate archival
file together with related information.
See: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717
and http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml for more
information.
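
For illustration only, a minimal sketch of the SURT transformation described above. It is simplified (it ignores ports and other edge cases that production SURT implementations, such as those used with Heritrix, handle), and the function name is hypothetical:

```python
# Simplified illustration of the SURT transformation; real implementations
# handle additional edge cases (ports, userinfo, trailing dots, etc.).
from urllib.parse import urlparse

def to_surt(url: str) -> str:
    """Reorder the host labels of a URL to match the domain-name hierarchy."""
    parsed = urlparse(url)
    # www.example.com -> com,example,www
    reversed_host = ",".join(reversed(parsed.hostname.split(".")))
    return f"{parsed.scheme}://({reversed_host},){parsed.path}"

print(to_surt("http://www.example.com/path/to/file.html"))
# -> http://(com,example,www,)/path/to/file.html
```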


Draft Scope of Work


C.1. BACKGROUND
Many of the activities of the digital lifecycle for harvested web content occur at the
Library of Congress, including seed URL nomination, permissions gathering,
scoping and preparation of a seed list, quality review, and public access to
researchers. The Library's web harvesting curator tools and infrastructure have been
developed for the inputs and outputs of open source tools (Heritrix for harvesting,
and Wayback Machine for access). The potential requirements described here are to
support the Library's large-scale, ongoing harvesting efforts, plus storage for the life
of any potential contract, indexing for access, restricted access to the content for
processing by Library staff, and transfer to the Library for long-term storage.
Although the following provides a general description of the Library's potential
requirements, the Library is actively seeking suggested alternatives to the
requirements discussed below, where appropriate.
C.2. CONTENT CAPTURE
The Library will provide the Contractor with SURT-formatted seed lists (See
Attachment 1) for each crawl to be harvested in advance of scheduled crawls (which
include URLs in SURT format, scopes, plus any robots.txt instructions and other
rules for that crawl). The Library reserves the right to revise the seed list up to 24
hours in advance of the scheduled crawl.
C.2.1. The Contractor's harvester shall be able to process Library-provided seed lists.
C.2.2. The Contractor shall harvest batches of URLs identified by the Library. The
Contractor shall be able to accommodate varying numbers of seed URLs per crawl,
as seeds are added to or removed from the Library's collections.
C.2.3. The Contractor shall crawl according to crawl schedules (day and time the
crawls take place and/or the estimated start date and length of time to extract data)
negotiated between Contractor and the Library.
C.2.4. The Contractor shall establish procedures to ensure that the latest copy of the
seed list is used for the appropriate crawl.
C.2.5. The Contractor shall allow for robots.txt exclusions either to be ignored or to
be respected for seeds within a crawl.
C.2.6. The Contractor shall employ politeness factors when capturing content.
C.2.7. In order to minimize storage space requirements, upon request from the
Library, the Contractor shall perform any deduping (duplicate reduction) of content
between crawls and within crawls, storing only data that has been modified when
compared with previously captured and stored data. Typically, the Library elects to
perform a baseline crawl of all content once per year and then dedupe content for the
remainder of the year.
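
For illustration only, a minimal sketch of hash-based duplicate reduction between crawls, assuming content digests from a prior baseline crawl are available as a lookup; actual deduplication rules and digest algorithms are vendor- and tool-specific, and the names below are hypothetical:

```python
# Hash-based duplicate reduction between crawls (illustrative sketch only).
import hashlib

def dedupe(captures, baseline_digests):
    """Return only captures whose payload differs from the prior baseline crawl.

    captures:          iterable of (url, payload_bytes) from the current crawl
    baseline_digests:  dict mapping url -> hex digest recorded at the baseline crawl
    """
    novel = []
    for url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        if baseline_digests.get(url) != digest:
            novel.append((url, payload))   # new or modified content is stored
        # unchanged content is recorded as a duplicate and not re-stored
    return novel
```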
C.2.8. The Contractor shall be capable of crawling very large sites (over 100K digital
objects, such as state.gov, whitehouse.gov, AFT.org, RAND.org, or news sites such
as townhall.com).
C.2.9. The Contractor shall ensure that content captured is output and stored in the
WARC (Web ARChive) file format
(http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) to maintain
consistency across the Library's web archive collections.
C.2.10. The Contractor shall monitor the crawls and within 24 hours take action to
resolve any issues related to crawler behavior or interruption of service. The
Contractor shall also resolve issues identified by Library staff or site owners related
to crawler behavior within 24 hours of the Library's reporting the issue to the Contractor.
Additionally, the Contractor shall acknowledge reports of issues identified by the
Library within 48 hours of the Library's submitting them. Within technical
constraints, any problem should be rectified within 24 hours. If the problem cannot
be rectified, the Library shall be notified as soon as the Contractor has made that
determination.
C.2.11. The Contractor shall be capable of supporting automated, alert-driven
harvesting of parts of sites, where a feed (such as RSS) is used to identify new or
updated parts of a site (such as for large news sites).
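
For illustration only, a minimal sketch of feed-driven seed discovery of the kind described in C.2.11, assuming a standard RSS 2.0 feed; actual vendor integrations and scheduling will differ, and the function name is hypothetical:

```python
# Illustrative sketch of alert-driven seed discovery from an RSS feed.
import urllib.request
import xml.etree.ElementTree as ET

def new_urls_from_feed(feed_url):
    """Return the <link> of each <item> in an RSS feed, e.g. newly published pages."""
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    return [item.findtext("link") for item in tree.iter("item")]
```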
C.3 CONTRACTOR QUALITY ASSURANCE
Completeness of the harvest of any identified websites (as defined in the crawl
instructions) is of very high importance to the Library, as are any indicators that can
be provided about that completeness.
C.3.1. The Contractor shall ensure that crawls are complete, as scoped by the
Library, and include all digital objects required to render the identified seeds
and scopes as they were at the time of capture, taking into account documented
limitations of the Contractor's tools as agreed to by the Library. Digital objects may
include, but are not limited to, html, images, flash, PDFs, and current versions of
audio and video files, content appearing on multiple domains including third-party
social media sites (as instructed by the Library via scoping instructions), and any
embedded content.
C.3.2. The Contractor shall perform monitoring and quality assurance processes and
provide reasonable quality assurances about the content harvested. The Contractor
shall propose a method for analyzing the capture success rate for specified types of
content.
C.3.3. The Contractor shall investigate quality assurance issues discovered by
Library staff that require Contractor assistance.


C.3.4. The Contractor shall take additional measures to monitor and employ solutions
to successfully capture social media content important to the Library, such as
YouTube, Facebook, and Twitter, or other new social media identified by the Library
for collection.
C.3.5. The Contractor shall run additional patch crawls if necessary to harvest
materials missed during a crawl.
C.4. CAPTURE TOOLS
C.4.1. The Contractor shall ensure that capture tools are kept up to date over the life
of the contract, with any relevant technical, functional, and security requirements and
updates.
C.4.2. Contractor shall provide documentation of capture tool limitations and issues.
In order for the Library to assess the results of the crawls, the Contractor shall keep
this documentation up-to-date during the contract period.
C.4.3. The Contractor shall document how it will successfully capture the following:
html, images, flash, PDFs, current versions of audio and video files, and social media
content.

C.5. ACCESS TOOLS


C.5.1. The Contractor shall make the results of all crawls, including the archived data
collected and reports generated, accessible to the Library within five (5) business
days of completion, via a version of the open source Wayback Machine software no
more than one version behind the latest official major release.
C.5.2. The Contractor shall provide the Wayback interface and indexes to enable
search, retrieval, and display by URL, so that the results of the crawls can be
accurately accessed, and shall configure the Wayback to allow for archival mode and
proxy mode. The Contractor shall allow the implementation of a custom Wayback
banner to display when archived content is viewed in archival mode.
C.5.3. The Contractor shall allow for up to 25 simultaneous sessions (restricted by IP
address to Library premises only, and available via password to be used by approved
Library staff at offsite locations).
C.5.4. The Contractor shall ensure that all captured content is fully functional when
displayed via the access tools, or document where this is not possible given the
Contractor's technical approach. For multimedia content that does not replay in the
Wayback (such as YouTube), yet is captured during the harvests, the Contractor shall
employ alternative replay methods.


C.5.5. The Contractor shall ensure that access tools are kept up to date over the life of
the contract, with any relevant technical, functional, and security requirements and
updates.
C.5.6. The Contractor shall provide documentation of access tool limitations and issues.
In order for the Library to assess the results of the crawls, the Contractor shall keep
this documentation up-to-date during the contract period.
C.5.7. The Contractor shall document how it will successfully display the following
in the access tools: html, images, flash, PDFs, current versions of audio and video
files, and social media content.
C.5.8. The Contractor shall provide an action plan to monitor, report to the Library,
and remedy breaks in access service. When it is within the Contractor's ability to
schedule maintenance outages, outages should be scheduled outside of Library
business hours (6:30 a.m. to 8:00 p.m. EST) if possible.
C.6. LIBRARY QUALITY REVIEW
Library staff will perform quality review and inspection processes to ensure that the
crawl results are as expected given Contractor-identified issues with crawler and
access tools during the time of capture. The Library will record issues, modify
scoping instructions as needed, and report issues as needed to Contractor for further
investigation or to rectify problems with the crawl.
Inspection Process and Criteria
The inspection process will occur within 30 days of notice by the Contractor that a crawl
has completed, is indexed, and reports are available.
The Library will review provided reports, including any reports of quality assurance
performed by the Contractor, and will inspect a sampling of harvested content to
determine if the results of the crawls are as expected given Contractor-identified
issues with crawler and access tools during the time of capture. After review of the
sampling, for any crawls not accepted by the Library, the same resources or different
resources shall be harvested (again) at the Contractor's expense.
Content accepted by the Library shall be verified with the Library's transfer tools for
completeness and file integrity upon completion of transfer.
C.7 REPORTS
C.7.1. The Contractor shall establish a mechanism (such as a ticketing system),
available via a secure login, to provide for exchange of seed lists between the Library
and the Contractor, reporting of issues, quality assurance inquiries and responses,
tracking transfer of content, and other general communications.
C.7.2 Within five (5) business days of the completion of each crawl (and any patch
crawling that is required), the Contractor shall provide reports to the Library of:

C.7.2.1. The Contractor's quality assurance process results


C.7.2.2. Any major issues encountered with the crawl, including those related to:
C.7.2.2.1. Seeds that cannot be harvested at the time of crawl (for example,
seeds blocking the crawler or unavailable at the time of the crawl, DNS
failures)
C.7.2.2.2. Seeds that return HTTP codes (3XX, 4XX, 5XX)
C.7.2.2.3. Seeds that have errors in surting or scoping and were fixed by
Contractor prior to crawl
C.7.3. Crawl reports shall also be made available to the Library for review and
analysis and downloadable by the Library, within 5 days of completion of the crawl.
For each crawl, the Contractor shall provide the following reports:
C.7.3.1. Status Reports: The status reports shall be in text and xml format (unless
noted differently below), so that they may be ingested by the Library's in-house tools
that are used to transfer content, perform inspection and QA processes, and track
statistics about harvested data (an illustrative serialization sketch follows this list of
reports). Reports shall include:
C.7.3.1.1. total seeds harvested and not harvested,
C.7.3.1.2. total hosts harvested,
C.7.3.1.3. total documents per host,
C.7.3.1.4. total documents harvested,
C.7.3.1.5. number of documents and size in bytes by mimetype,
C.7.3.1.6. HTTP response code and number of documents,
C.7.3.1.7. total unique documents harvested,
C.7.3.1.8. documents left in frontier at end of crawl,
C.7.3.1.9. processed documents per second,
C.7.3.1.10. bandwidth in kilobytes per second,
C.7.3.1.11. total raw data size in bytes,
C.7.3.1.12. total compressed WARC size,
C.7.3.1.13. novel bytes,
C.7.3.1.14. duplicate by hash bytes,
C.7.3.1.15. non-modified bytes,
C.7.3.1.16. and any other data available from the crawl.
C.7.3.1.17. the duration (start and end time) of the crawl in ISO 8601 format:
yyyy-mm-ddThh:mm.
C.7.3.2. Unsuccessful Crawl Reports: Crawl information for hypertext transfer
protocol (HTTP) return codes 3XX, 4XX and 5XX.
C.7.3.3. Associated URLs with seeds: A harvested seed list with the associated
URLs discovered from each seed.
C.7.3.4. Archival Seed list: A copy of the crawl's seed list, provided by the
Library to the Contractor, shall be maintained with other crawl reports for the
crawl.


C.7.3.5. Count of WARCs. A count of WARCs written for the crawl.


C.7.3.6. Host Data Report: Provide a report with one line entry per unique host
encountered during the crawl (for an example, see Attachment 2). Each line entry
should contain the following field data:
C.7.3.6.1. Total number of URIs harvested for the host
C.7.3.6.2. Total number of bytes harvested for the host
C.7.3.6.3. The host name string
C.7.3.6.4. Number of URIs excluded for the host, due to robots.txt
restriction. This number does not consider/count links from URIs
specifically excluded in robots.txt.
C.7.3.6.5. Number of un-crawled URIs for this host still in queue
C.7.3.6.6. Number of new URIs crawled for this host since the last crawl
C.7.3.6.7. Number of URIs that exhibited a hash code that matched a
previous crawl, for the host. This number indicates the number of URIs
that exhibited duplicate contentthat is, content previously captured in
the same state
C.7.3.6.8. Number of bytes associated with URIs that exhibited a hash
code that matched a previous crawl, for the host
C.7.3.6.9. The number of URIs for the host that returned a 304 http status
code
C.7.3.6.10. The number of bytes associated with URIs that returned 304
http status codes for the host
C.7.3.7. PDF Extract: An extracted list of PDFs harvested during the crawl.
C.7.3.8. Seed Report: Report of http code, crawl status (crawled/not crawled), seed
(from seed list provided) and for all 301/302 http code, the redirect.
C.7.3.9. Crawler data: All associated crawl instructions and crawl log (provided as
a tar.gz file).
C.7.3.10. Captured but not displayable multimedia content reports: For
multimedia content that does not replay in the Wayback (such as YouTube), yet is
captured during the harvests, reports shall be provided to allow the Library to ensure
that capture is complete.
C.7.3.11. SURTs report: A comprehensive list of approved scopes in SURT format for
each crawl.
C.7.3.12. Crawler Instructions and Rules: All crawler specifications, patches,
scripts, and configurations (e.g., order file, special rules for the crawler) used for
generating each crawl shall be provided to the Library and made available for transfer
and reuse by the Library.


C.7.3.13. The Contractor shall provide visualizations of crawl report data in the form
of charts, graphs, and any other mechanisms that enable the Library to make
determinations about the results of each crawl, including reports about numbers of
seeds, documents crawled, hosts crawled, size of harvested data, response codes,
mime types (# of URLs), mime types (by size), seeds changed, and hosts
changed, or any other data provided in the crawl reports.
C.7.3.14. The Contractor shall also provide a report on technical performance issues
and root cause analysis for documented performance issues.
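
For illustration only, a minimal sketch of how a few of the counters listed in C.7.3.1 might be serialized in both text and XML form for machine ingestion. The field names, placeholder values, and file layout are hypothetical; the actual report schema would be agreed with the Library:

```python
# Hypothetical sketch: serializing a few C.7.3.1-style counters as text and XML.
import xml.etree.ElementTree as ET

def write_status_report(stats, basename):
    """Write crawl statistics both as a plain-text file and as an XML file."""
    # Text form: one "key value" pair per line
    with open(basename + ".txt", "w") as f:
        for key, value in stats.items():
            f.write(f"{key} {value}\n")
    # XML form: one child element per statistic
    root = ET.Element("crawlReport")
    for key, value in stats.items():
        ET.SubElement(root, key).text = str(value)
    ET.ElementTree(root).write(basename + ".xml", encoding="utf-8", xml_declaration=True)

# Example with placeholder values
write_status_report(
    {"seedsHarvested": 1200, "totalDocuments": 250000,
     "compressedWarcBytes": 750000000000,
     "crawlStart": "2014-11-01T00:00", "crawlEnd": "2014-11-08T00:00"},
    "crawl-status")
```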
C.8 HOSTING
C.8.1. The Contractor shall maintain a reliable and secure data storage and
maintenance infrastructure capable of storing all web content harvested within the
scope of this contract incorporating the following minimum capabilities:
C.8.2. The Contractor shall store and maintain two replicas of web content harvested
under the scope of this contract: one for access by Library staff, one for long-term
storage. Storage of the two replicas shall be for the life of the contract.
C.8.3. The Contractor shall ensure the infrastructure for crawling, storing, and
providing access to content provides documented physical security, electronic
security, and operational practices typical of a commercial hosting provider. The
Contractor shall abide by Library specifications as to who is allowed to have access
credentials to the Library content for Quality Review and for transfer to the Library.
C.8.4. The Contractor shall ensure that content harvested under the terms of this
contract is not co-mingled with the Contractor's own data or with that of other
customers.
C.8.5. The Contractor shall ensure that content harvested under the terms of this
contract is not included in or copied to the Contractor's own services/collections and is
not made publicly available or to researchers without the Library's explicit
permission.
C.8.6. The Contractor shall ensure integrity monitoring for data hosted under the
terms of this contract:
C.8.6.1. Checksums shall be performed on content every 30 days, with error
reporting provided to the Library.
C.8.6.2. Operational status monitoring of machines, including reports on status
and any errors and remedies performed, shall be provided and reported to the
Library. Errors and remedies should be reported within one week of
detection and repair.
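
For illustration only, a minimal sketch of a periodic fixity check of the kind described in C.8.6.1, assuming a stored manifest of "<digest> <filepath>" lines (for example, a BagIt-style SHA-256 manifest); the monitoring tooling and digest algorithm remain the Contractor's choice, and the names below are hypothetical:

```python
# Minimal fixity-check sketch; assumes a manifest of "<sha256-digest> <filepath>" lines.
import hashlib
from pathlib import Path

def verify_manifest(manifest_path):
    """Return the paths whose current SHA-256 digest no longer matches the manifest."""
    errors = []
    for line in Path(manifest_path).read_text().splitlines():
        expected_digest, filepath = line.split(maxsplit=1)
        actual_digest = hashlib.sha256(Path(filepath).read_bytes()).hexdigest()
        if actual_digest != expected_digest:
            errors.append(filepath)   # flag for error reporting to the Library
    return errors
```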
C.9 TRANSFER OF CONTENT TO THE LIBRARY


The Library transfers and stores content in the BagIt File Packaging Format
specification, a hierarchical file packaging format. The Library's transfer tools are
also based on the specification. See
http://www.digitalpreservation.gov/documents/bagitspec.pdf for information and
instructions.
C.9.1. Once crawls are accepted by the Library, the Contractor shall prepare the
content for transfer via the following steps:
C.9.1.1. Calculate sub-1-terabyte divisions of the complete collection of WARC
files associated with the crawl. For a simple example, a crawl that yields 3500
1GB WARC files, totaling 3.5TB, should be divided into four subdivisions,
which might exhibit the following structure:
999 WARC files -> .999TB
999 WARC files -> .999TB
999 WARC files -> .999TB
503 WARC files -> .503TB
C.9.1.2. Create a Holey Bag for each sub-1-terabyte division according to the
specifications of the BagIt File Packaging Format
(http://www.digitalpreservation.gov/documents/bagitspec.pdf).
C.9.1.3. Make the resulting bags available for network transfer (an illustrative
sketch of these steps follows).
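
For illustration only, a minimal sketch of the division and packaging steps above. It writes only a minimal bagit.txt, an empty data/ payload directory, and a fetch.txt per bag, and omits the manifests and checksums that a complete BagIt implementation would include; the directory names and input structure are illustrative assumptions:

```python
# Illustrative sketch: divide WARCs into sub-1-terabyte groups and lay out a
# "holey" bag (empty data/ payload plus fetch.txt) for each group.
import os

ONE_TB = 10**12  # bytes

def make_holey_bags(warc_files, out_dir):
    """warc_files: list of (url, size_in_bytes, relative_path) for the crawl's WARCs."""
    groups, current, current_size = [], [], 0
    for url, size, path in warc_files:
        if current and current_size + size > ONE_TB:   # close the group before exceeding 1 TB
            groups.append(current)
            current, current_size = [], 0
        current.append((url, size, path))
        current_size += size
    if current:
        groups.append(current)

    for i, group in enumerate(groups, start=1):
        bag_dir = os.path.join(out_dir, f"bag_{i:03d}")
        os.makedirs(os.path.join(bag_dir, "data"), exist_ok=True)   # empty payload directory
        with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
            f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
        with open(os.path.join(bag_dir, "fetch.txt"), "w") as f:
            for url, size, path in group:   # fetch.txt lists the WARCs that fill the bag
                f.write(f"{url} {size} data/{path}\n")
```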
C.9.2. The Contractor shall propose a method for transfer of the content to the
Library that does not use the commercial Internet. The Librarys preferred method is
network transfer using Internet2. The Library will also consider transfer using
physical storage devices such as external hard drives, or another method that does not
use the commercial Internet. If the Contractor can provide access to the content via
Internet2 or another proposed network method that does not use the commercial
Internet, then the Library will propose one or more specific protocols to be used (e.g.,
rsync, HTTP), method, and schedule for pulling content from the designated access
point. The Contractor may propose an alternative protocol, method, or schedule, if
useful for improving performance or integrity of transfer.
C.9.3. The Contractor shall ensure that content transferred is equal in size (measured
in bytes) to that which has been generated for the Library in harvest services for that
contract year, plus any existing backlog the Library is able to transfer, within the
following requirements:
C.9.3.1. The Contractor shall maintain machines and networks operational for
transfer/pull by the Library of Congress, as described above.
C.9.3.2. The Contractor shall address and rectify transfer failures falling
within its machine(s) and/or network sphere. A transfer failure is a factor that
causes transfer of currently requested content to fail.
C.9.3.2.1. The Contractor shall acknowledge a transfer failure query
from the Library within 24 hours.
C.9.3.2.2. If the transfer failure is determined to fall within the
Contractors machine(s) and/or network sphere, the Contractor shall
rectify the failure within 5 business days.

C.9.3.2.3. If rectifying the failure takes more than 5 business days, the
total time required to rectify it shall be counted as downtime (see next
section).
C.9.3.3. The Contractor shall ensure system availability for transfer of
harvested content to the Library during normal business hours Eastern time.
In no case shall the Contractor's annual downtime exceed 20 business days or
portions thereof.
C.9.3.3.1. Downtime is a period of time during which the Library is
unable to transfer any content from any of the Contractor's machines.
C.9.3.3.2. The Contractor shall give the Library 24 hours' notice of
downtime (exclusive of transfer failures that become downtime).
C.9.4. The Contractor shall make adjustments in transfer tools as necessary during
the course of the contract to accommodate the Library's security or technology
refresh requirements.
C.10. CLOSE OUT
At the end of the final option year of the contract, and upon request, the close out
tasks include ensuring delivery (per the Transfer specifications listed above) of any
remaining content and reports to the Library.
Any copies of the content stored and hosted during the life of the Contract by the
Contractor shall be deleted at the termination of this contract, and not retained or
co-mingled with the Contractor's content.

Attachment 1: SURT-Formatted Seed List (excerpt)

Sample Seed List


#----------------------# 835 record(s):
# > 835 monthly
#----------------------# Legislative Branch
# United States Congressional Web Archive
# Federal Courts Archive


#----------------------# monthly
#----------------------#----------------------#

GLOBAL SURTS

#----------------------#do not crawl query strings on /rss but allow the root rss feed for newsroom.lds.org
# do not crawl
-http://(com,solanocounty,www,)/bosagenda/MG22254/AS22274/
-http://(com,delicious,)/save
-http://(com,secretaryofinnovation,
-http://(com,wikileaks,
-http://(gov,alabama,joblink,
-http://(gov,mt,leg,)/css/committees/session/minutes/07minwrittenaudio.asp?chamber=house
-http://(gov,mt,leg,)/css/committees/session/minutes/07minwrittenaudio.asp?chamber=senate
-http://(gov,mt,leg,)/css/committees/session/minutes/05minwrittenaudio.asp?houseid=1&sessionid=88
-http://(gov,mt,leg,)/css/committees/session/minutes/05minwrittenaudio.asp?houseid=2&sessionid=88
-http://(gov,mt,leg,)/css/sessions/61st/archives.asp
-http://(gov,mt,leg,)/css/sessions/62nd/archives.asp
-http://(info,wikileaks,
-http://(mil,navy,news,
-http://(org,au,members,)/site/userlogin
-http://(org,northcountrygazette,
-http://(org,wikileaks,
-http://(com.tinyurl.errorhelp,
-http://(com,msn,r,


# respect robots
# respect robots on http://joblink.delaware.gov/
# respect robots on http://search.scdhec.gov
# respect robots on http://search.sheriff.org
# respect robots on http://search.socialsecurity.gov
# respect robots on http://search.sos.state.tx.us
# respect robots on http://search.tpwd.texas.gov
# respect robots on http://search.trade.gov
# URL shorteners
+http://(be,yout,
+http://(BO,OFA,
+http://(cc,tiny,
+http://(co,t,
+http://(com,epurl,
+http://(com,tinyurl,
+http://(com,yfrog,
+http://(gd,ta,
+http://(gl,goo,
+http://(gov,usa,go,
#----------------------#----------------------# COLLECTION SURTS
#----------------------# lcwa0039
#----------------------+http://(com,granicus,


+http://(gov,houselive,
+http://207.7.154.110:443/
+http://(207.7.154.110)
+http://(me,bcove,
+http://(com,brightcove,link,
+http://(net,edgeboss,

#----------------------#----------------------#

Monthly Seeds

#----------------------#----------------------# Lesser Frequency


#----------------------# PRIORITIZED SEEDS
#----------------------# Record ID 94168 - oce.house.gov
http://oce.house.gov/
+http://(gov,house,oce,)/
+http://(com,youtube,www,)/CongressionalEthics
+http://(com,twitter,www,)/congressethics
+http://(com,twitter,)/congressethics
+http://(com,facebook,www,)/OfficeofCongressionalEthics
# Record ID 94169 - chaplain.house.gov
http://chaplain.house.gov/
+http://(gov,house,chaplain,)/


# Record ID 94170 - clerk.house.gov


http://clerk.house.gov/
+http://(gov,house,clerk,
+http://(gov,house,clerkkids,)/
+http://(com,youtube,www,)/clerkofthehouse
+http://(com,granicus,podcache-101,)/clerkhouse/
+http://(gov,house,lobbyingdisclosure,)/
# Record ID 94171 - cao.house.gov
http://cao.house.gov/
+http://(gov,house,cao,
# Record ID 94172 - republicanwhip.house.gov
http://republicanwhip.house.gov/
+http://(gov,house,republicanwhip,)/
+http://(com,twitter,)/GOPWhip
# Record ID 94174 - intelligence.house.gov
http://intelligence.house.gov/
+http://(gov,house,intelligence,)/
# Record ID 94175 - www.house.gov
http://www.house.gov/
+http://(gov,house,
#+http://(com,youtube,www,)/user/househub
#Decided to keep the youtube link above
+http://(com,twitter,)/HouseFloor
+http://(net,edgeboss,
+http://(tv,blip,)
+http://(com,streamos,boss,)


+http://(com,google,video,)
+http://(com,facebook,www,)/government
+http://(com,facebook,www,)/congress
+http://(com,youtube,www,)/househub
+http://(gov,cosponsor,
# Record ID 94178 - republican.senate.gov
http://republican.senate.gov/
+http://(gov,senate,republican,
+http://(com,twitter,)/Senate_GOPs
+http://(com,youtube,www,)/user/RepublicanSenators
+http://(com,facebook,www,)/RepublicanSenators
# Record ID 94179 - www.stennis.gov
http://www.stennis.gov/
+http://(gov,stennis,www,)/
# Record ID 94180 - www.aoc.gov
http://www.aoc.gov/
+http://(gov,aoc,www,)
+http://(com,flickr,www,)/photos/uscapitol/
+http://(com,youtube,www,)/user/AOCgov
+http://(com,facebook,www,)/ArchitectoftheCapitol
+http://(com,instagram,)/uscapitol
+http://(com,twitter,)/uscapitol
# Record ID 94182 - www.uscc.gov
http://www.uscc.gov/
+http://(gov,uscc,www,
# Record ID 94183 - medpac.gov


http://medpac.gov/
+http://(gov,medpac,
# Record ID 94184 - www.jct.gov
http://www.jct.gov/
+http://(gov,jct,
# Record ID 94185 - www.majorityleader.gov
http://www.majorityleader.gov/
+http://(gov,majorityleader,
# Record ID 94186 - www.loc.gov
http://www.loc.gov/
+http://(gov,loc,www,)
+http://(gov,loc,search,)/
+http://(gov,loc,memory,)/
+http://(gov,loc,mic,)/
+http://(gov,loc,lcweb,)/
+http://(gov,loc,marvel,)/
+http://(gov,loc,international,)/
+http://(gov,digitalpreservation,www,)/
+http://(gov,americaslibrary,www,)/
+http://(gov,read,www,)/
#+http://(gov,loc,thomas,)/
#+http://(com,twitter,)/librarycongress
+http://(com,facebook,www,)/libraryofcongress
+http://(com,facebook,www,)/americanfolklifecenter
+http://(com,facebook,www,)/booksandbeyond
+http://(com,facebook,www,)/lawlibraryofcongress


+http://(com,facebook,www,)/digitalpreservation
+http://(gov,loc,blogs,)/
+http://(gov,loc,www,)/chroniclingamerica/
+http://(gov,loc,thomas,)/cgi-bin/dailydigest
+http://(gov,access,gpo,frwebgate,)/cgi-bin/getpage.cgi?dbname=
+http://(gov,loc,thomas,)/home/lawsmade.toc.html
+http://(gov,loc,thomas,)/home/lawsmade.bysec/
+http://(gov,loc,thomas,)/home/lawsmade.toc.html
+http://(gov,loc,www,)/crsinfo/
+http://(com,twitter,)/librarycongress
+http://(com,youtube,www,)/libraryofcongress
#update 4-6-07 for loc.gov exclusion
#special rule for www.loc.gov
+http://(gov,loc,memory,)/ammem/aaohtml
#there was a space before 'memory' that was deleted BF
+http://(gov,loc,www,)/exhibits/african/
+http://(com,flickr,www,)/photos/library_of_congress/
+http://(com,twitter,)/LawlibCongress
+http://(com,twitter,)/LOCMaps
+http://(com,twitter,)/ndiipp
+http://(com,twitter,)/THOMASdotgov
+http://(com,twitter,)/CopyrightOffice
+http://(com,twitter,)/wdlorg
+http://(gov,loc,lcweb2,)
+http://(gov,loc,media,
+http://(com,pinterest,www,)/LibraryCongress/


+http://(gov,congress,
+http://(com,youtube,www,)/user/LibraryOfCongress/
# Record ID 94187 - www.gpo.gov
http://www.gpo.gov/
+http://(gov,gpo,www,)/
+http://(gov,gpo,bensguide,)/
+http://(gov,gpoaccess,www,)/
+http://(com,twitter,)/usgpo
+http://(com,youtube,www,)/user/gpoprinter

Attachment 2: Sample Host Data Report


[#urls] [#bytes] [host] [#robots] [#remaining] [#novel-urls] [#novel-bytes] [#dup-by-hash-urls] [#dup-by-hash-bytes] [#not-modified-urls] [#not-modified-bytes]
163912 5690495364 example.com 0 2065196 149272 5647558614 14640 42936750 0 0
147607 14152427270 2.exampletwo.com 0 11068 147273 14152222823 334 204447 0 0
126073 6885025254 www.example3.com 0 73675 71378 2498172867 54695 4386852387 0 0
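
For illustration only, a minimal sketch of parsing the host data report lines above into labeled fields; the field names follow the header row, and the code is hypothetical rather than the Library's ingest tooling:

```python
# Illustrative parser for host data report lines; field names follow the header row.
FIELDS = ["urls", "bytes", "host", "robots", "remaining", "novel_urls",
          "novel_bytes", "dup_by_hash_urls", "dup_by_hash_bytes",
          "not_modified_urls", "not_modified_bytes"]

def parse_host_report(lines):
    """Yield one dict per host line, converting counts and byte totals to int."""
    for line in lines:
        if not line.strip() or line.lstrip().startswith("["):
            continue                        # skip blank lines and the header row
        values = line.split()
        if len(values) != len(FIELDS):
            continue                        # skip malformed lines
        record = dict(zip(FIELDS, values))
        for key in FIELDS:
            if key != "host":
                record[key] = int(record[key])
        yield record

for rec in parse_host_report(["163912 5690495364 example.com 0 2065196 149272 "
                              "5647558614 14640 42936750 0 0"]):
    print(rec["host"], rec["urls"], rec["bytes"])
```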

