PSC: D317
NAICS: 519130
Response Date: 12/30/14; 1:00pm
Background Summary (Extended background provided beginning on p. 3)
The Library of Congress, Office of Strategic Initiatives (OSI) is seeking information from
potential contractors about how best to design a requirement related to saving and reviewing
information from the Internet. The Library is seeking information (e.g., current/existing
commercial solutions, design solutions) on how best to meet this web harvesting
requirement.
This RFI is to determine whether potential offerors can meet the Library's technical and production
requirements for harvesting web content, and to receive feedback on pricing models and
reasonable quality assurance. The Library is actively seeking suggested solutions and alternatives
that will meet our requirements.
The purpose of this RFI is to gain information related to the following issues:
1) Is the Draft Scope of Work (beginning on p.5) sufficiently clear? If not, please explain
what additional information should be provided, or what alternate description would be
more appropriate.
2) Are the requirements identified in the Draft Scope of Work reasonable? If not, please
identify those requirements that are considered unreasonable as well as propose a version
that is considered to be reasonable.
3) Are there alternatives to sections in the Scope of Work that would be beneficial to the
Library in some way?
4) Feedback on, or suggested alternatives to, the Library's potential
requirements as articulated in the Description of Potential Requirements
below.
5) Suggested definition of web harvesting contract deliverables to take into
account and accommodate fluctuating numbers of seed URLs, varying
magnitude of the sites harvested, technical difficulties, and varying
frequencies for individual seeds or groups of seeds.
6) Pricing models for web harvesting services, consistent with the defined
deliverables. The Library is NOT interested in time and materials pricing
models for this work.
7) Proposed methods (manual or automated) for vendor quality assurance of
content capture. The Library is particularly interested in how vendors ensure
that web content is captured successfully. A successful crawl would be
complete, as scoped by the Library, and would include all digital objects
required to render the identified seeds and scopes as they were at the time of
capture, taking into account documented limitations of the Contractor's tools as
agreed to by the Library.
8) Capabilities and limitations of vendors' crawler and access tools in their
ability to capture and replay content.
9) The qualifications/certifications a responsible contractor should have to
perform this service.
10) The terms and conditions under which this service is typically procured.
11) Opinion on whether or not this is considered a commercial service.
12) Would you consider bidding on a solicitation for these services?
Responder Requirements
Responders are requested to provide:
1) Company name, company contact, contact address, phone number, and e-mail address
2) Information addressing one or more of the Issues listed above
3) Descriptions of technologies or procedures that will satisfy the Library's requirements
4) The Library's preference is to issue a firm-fixed-price contract, with a price set for a firm
amount before the contract service begins. Please provide descriptions of benchmarks
based on how payments should be arranged to be issued for such a contract, or your
recommendation for how such a contract should be billed.
5) Rough estimates of what the described service may cost and what the major cost drivers
of that estimate are.
Questions
Questions regarding this announcement shall be submitted in writing by e-mail to:
Name: Sherman Mayle
Email: smay@loc.gov
Verbal questions will NOT be accepted. Questions may be answered by posting answers to the
FedBizOpps website; accordingly, questions shall NOT contain proprietary or classified
information. The Government does not guarantee that questions will be answered.
RFI Qualification
THIS IS A REQUEST FOR INFORMATION (RFI) ONLY to identify sources that can provide
the requirement as described in this announcement. The information provided in the RFI is
subject to change and is not binding on the Government. The purpose of this RFI is for planning
and market research purposes only. The information obtained from responses to this RFI may be
used in the development of an acquisition strategy and in the development of a future
solicitation.
Proprietary information or other sensitive information should be clearly marked. The Library of
Congress has not made a commitment to procure any of the items discussed, and release of this
RFI should not be construed as such a commitment or as authorization to incur cost for which
reimbursement would be required or sought. All submissions become Government property and
will not be returned.
Location
Library of Congress
101 Independence Avenue, S.E.
Washington, D.C., 20540.
RFI CONTENTS
Definitions
perform a baseline crawl once per year of all content, and then dedupe content for the
remainder of the year.
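The baseline-plus-dedupe approach described above can be sketched as digest-based deduplication: after the yearly baseline crawl, only content whose payload digest has not been seen before is stored. A minimal illustration (the URLs and digests here are hypothetical, and production crawlers consult a persistent digest index rather than an in-memory set):

```python
def dedupe(records, seen_digests):
    """Keep only records whose payload digest has not already been archived,
    so that after a baseline crawl only new or changed content is stored.
    records is a list of (url, digest) pairs; seen_digests is mutated."""
    fresh = []
    for url, digest in records:
        if digest not in seen_digests:
            seen_digests.add(digest)
            fresh.append((url, digest))
    return fresh

# Hypothetical example: the re-fetch of an unchanged page is skipped.
baseline = {"d41d8"}
crawl = [("http://example.gov/a", "d41d8"), ("http://example.gov/b", "9e107")]
print(dedupe(crawl, baseline))  # [('http://example.gov/b', '9e107')]
```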
C.2.8 The Contractor shall be capable of crawling very large sites (over 100K digital
objects) (such as state.gov, whitehouse.gov, AFT.org, RAND.org, or news sites such
as townhall.com, etc.).
C.2.9. The Contractor shall ensure that content captured is output and stored in the
WARC (Web ARChive) file format
(http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) to maintain
consistency across the Library's web archive collections.
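For reference, a response record in the WARC format named above has a simple layout: a version line, named headers, a blank line, and the captured HTTP payload. This minimal sketch serializes one record by hand; a contractor's tools would use mature WARC libraries and add further headers (block digests, source IP) and gzip compression:

```python
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(target_uri, http_payload):
    """Serialize one WARC/1.0 response record. Headers and payload blocks
    are separated by CRLF; the record ends with two CRLFs."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    return head.encode("utf-8") + b"\r\n" + http_payload + b"\r\n\r\n"

record = warc_response_record("http://www.loc.gov/",
                              b"HTTP/1.1 200 OK\r\n\r\n<html></html>")
```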
C.2.10. The Contractor shall monitor the crawls and within 24 hours take action to
resolve any issues related to crawler behavior or interruption of service. The
Contractor shall also resolve issues identified by Library staff or site owners related
to crawler behavior within 24 hours of the Library's reporting the issue to the Contractor.
Additionally, the Contractor shall acknowledge reports of issues identified by the
Library within 48 hours of the Library's submitting them. Within technical
constraints, any problem should be rectified within 24 hours. If the problem cannot
be rectified, the Library shall be notified as soon as the Contractor has made that
determination.
C.2.11. The Contractor shall be capable of supporting automated, alert-driven
harvesting of parts of sites, where a feed (such as RSS) is used to identify new or
updated parts of a site (such as for large news sites).
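Alert-driven harvesting of this kind might begin with something as simple as extracting item links from an RSS feed as candidate seed URLs for a follow-up crawl. A standard-library sketch (the feed content is hypothetical, and a real pipeline would also dedupe against URLs already captured):

```python
import xml.etree.ElementTree as ET

def new_seed_urls(rss_xml):
    """Extract item links from an RSS 2.0 feed as candidate seed URLs
    for an alert-driven patch crawl."""
    root = ET.fromstring(rss_xml)
    return [link.text for link in root.iterfind("./channel/item/link")]

feed = """<rss version="2.0"><channel>
  <item><link>http://example.gov/story-1</link></item>
  <item><link>http://example.gov/story-2</link></item>
</channel></rss>"""
print(new_seed_urls(feed))  # ['http://example.gov/story-1', 'http://example.gov/story-2']
```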
C.3 CONTRACTOR QUALITY ASSURANCE
Completeness of the harvest of any identified websites (as defined in instructions) is
of very high importance to the Library, as are any indicators that can be provided
about that completeness.
C.3.1. The Contractor shall ensure that crawls are complete, as scoped by the
Library, and include all digital objects required to render the identified seeds
and scopes as they were at the time of capture, taking into account documented
limitations of the Contractor's tools as agreed to by the Library. Digital objects may
include, but are not limited to, html, images, flash, PDFs, and current versions of
audio and video files, content appearing on multiple domains including third-party
social media sites (as instructed by the Library via scoping instructions), and any
embedded content.
C.3.2. The Contractor shall perform monitoring and quality assurance processes and
provide reasonable quality assurances about the content harvested. The Contractor
shall propose a method for analyzing the capture success rate for specified types of
content.
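One possible shape for the capture-success analysis C.3.2 asks the Contractor to propose: reduce crawl-log entries to (mime type, HTTP status) pairs and compute a per-type success rate. This is a sketch; treating 2xx responses as "success" is an assumption here, not a Library definition:

```python
from collections import Counter

def capture_rates(log_entries):
    """Summarize capture success per MIME type from (mime_type, status_code)
    pairs, e.g. as parsed from a crawl log."""
    attempts, successes = Counter(), Counter()
    for mime, status in log_entries:
        attempts[mime] += 1
        if 200 <= status < 300:
            successes[mime] += 1
    return {m: successes[m] / attempts[m] for m in attempts}

entries = [("text/html", 200), ("text/html", 404), ("application/pdf", 200)]
print(capture_rates(entries))  # {'text/html': 0.5, 'application/pdf': 1.0}
```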
C.3.3. The Contractor shall investigate quality assurance issues discovered by
Library staff and needing Contractor assistance.
C.3.4. The Contractor shall take additional measures to monitor and employ solutions
to successfully capture social media content important to the Library, such as
YouTube, Facebook, and Twitter, or other new social media identified by the Library
for collection.
C.3.5. The Contractor shall run additional patch crawls if necessary to harvest
materials missed during a crawl.
C.4. CAPTURE TOOLS
C.4.1. The Contractor shall ensure that capture tools are kept up to date over the life
of the contract, with any relevant technical, functional, and security requirements and
updates.
C.4.2. The Contractor shall provide documentation of capture tool limitations and issues.
In order for the Library to assess the results of the crawls, the Contractor shall keep
this documentation up-to-date during the contract period.
C.4.3. The Contractor shall document how it will successfully capture the following:
html, images, flash, PDFs, current versions of audio and video files, and social media
content.
C.5.5. The Contractor shall ensure that access tools are kept up to date over the life of
the contract, with any relevant technical, functional, and security requirements and
updates.
C.5.6. The Contractor shall provide documentation of access tool limitations and issues.
In order for the Library to assess the results of the crawls, the Contractor shall keep
this documentation up-to-date during the contract period.
C.5.7. The Contractor shall document how it will successfully display the following
in the access tools: html, images, flash, PDFs, current versions of audio and video
files, and social media content.
C.5.8. The Contractor shall provide an action plan to monitor, report to the Library,
and remedy breaks in access service. When it is within the Contractor's ability to
schedule maintenance outages, outages should be scheduled outside of Library
business hours (6:30am to 8:00pm EST) if possible.
C.6. LIBRARY QUALITY REVIEW
Library staff will perform quality review and inspection processes to ensure that the
crawl results are as expected given Contractor-identified issues with crawler and
access tools during the time of capture. The Library will record issues, modify
scoping instructions as needed, and report issues as needed to Contractor for further
investigation or to rectify problems with the crawl.
Inspection Process and Criteria
The inspection process will occur within 30 days of notice by the Contractor that the crawl has
completed, is indexed, and reports are available.
The Library will review provided reports, including any reports of quality assurance
performed by the Contractor, and will inspect a sampling of harvested content to
determine if the results of the crawls are as expected given Contractor-identified
issues with crawler and access tools during the time of capture. After review of the
sampling, for any crawls not accepted by the Library, the same resources or different
resources shall be harvested (again) at the Contractor's expense.
Content accepted by the Library shall be verified with the Library's transfer tools for
completeness and file integrity upon completion of transfer.
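Such integrity verification typically re-hashes each transferred file against a manifest of expected digests. A sketch (the manifest shape mirrors BagIt's manifest-sha256.txt; the file names are hypothetical, and this is not the Library's actual transfer tooling):

```python
import hashlib

def verify_transfer(manifest, read_bytes):
    """Re-check transferred files against (sha256_hex, relative_path) pairs.
    read_bytes maps a path to the file's bytes; returns paths that failed."""
    failures = []
    for expected, path in manifest:
        if hashlib.sha256(read_bytes(path)).hexdigest() != expected:
            failures.append(path)
    return failures

# In-memory stand-in for transferred files (hypothetical names).
files = {"data/crawl-0001.warc.gz": b"warc bytes"}
manifest = [(hashlib.sha256(b"warc bytes").hexdigest(), "data/crawl-0001.warc.gz")]
print(verify_transfer(manifest, files.__getitem__))  # [] -> no integrity failures
```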
C.7 REPORTS
C.7.1. The Contractor shall establish a mechanism (such as a ticketing system),
available via a secure login, to provide for exchange of seed lists between the Library
and the Contractor, reporting of issues, quality assurance inquiries and responses,
tracking transfer of content, and other general communications.
C.7.2 Within five (5) business days of the completion of each crawl (and any patch
crawling that is required), the Contractor shall provide reports to the Library of:
C.7.3.13. The Contractor shall provide visualizations of crawl report data in the form
of charts, graphs, and any other mechanisms that enable the Library to make
determinations about the results of each crawl, including reports about numbers of
seeds, documents crawled, hosts crawled, size of harvested data, response codes,
mime types (# of URLs), mime types (by size), seeds change report, and hosts
changed, or any other data provided in the crawl reports.
C.7.3.14. The Contractor shall also provide a report on technical performance issues
and root cause analysis for documented performance issues.
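As a trivial instance of the kind of visualization C.7.3.13 describes, response-code counts from a crawl report can be rendered as a plain-text bar chart; a reporting UI with real charts and graphs would replace this in practice:

```python
from collections import Counter

def response_code_chart(codes):
    """Render response-code counts as a text bar chart, one bar per code."""
    counts = Counter(codes)
    return "\n".join(f"{code}: {'#' * n} ({n})"
                     for code, n in sorted(counts.items()))

print(response_code_chart([200, 200, 200, 301, 404]))
# 200: ### (3)
# 301: # (1)
# 404: # (1)
```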
C.8 HOSTING
C.8.1. The Contractor shall maintain a reliable and secure data storage and
maintenance infrastructure capable of storing all web content harvested within the
scope of this contract incorporating the following minimum capabilities:
C.8.2. The Contractor shall store and maintain two replicas of web content harvested
under the scope of this contract: one for access by Library staff, one for long-term
storage. Storage of the two replicas shall be for the life of the contract.
C.8.3. The Contractor shall ensure the infrastructure for crawling, storing, and
providing access to content provides documented physical security, electronic
security, and operational practices typical of a commercial hosting provider. The
Contractor shall abide by Library specifications as to who is allowed to have access
credentials to the Library content for Quality Review and for transfer to the Library.
C.8.4. The Contractor shall ensure that content harvested under the terms of this
contract is not co-mingled with the Contractor's own data or with that of other
customers.
C.8.5. The Contractor shall ensure that content harvested under the terms of this
contract is not included in or copied to the Contractor's own services/collections and is
not made publicly available or to researchers without the Library's explicit
permission.
C.8.6. The Contractor shall ensure integrity monitoring for data hosted under the
terms of this contract:
C.8.6.1. Checksums shall be performed on content every 30 days, with error
reporting provided to the Library.
C.8.6.2. Operational status monitoring of machines, including reports on status
and any errors and remedies performed, shall be provided and reported to the
Library. Errors and remedies should be reported within a one-week period of
detection and repair.
C.9 TRANSFER OF CONTENT TO THE LIBRARY
The Library transfers and stores content in the BagIt File Packaging Format
specification, a hierarchical file packaging format. The Library's transfer tools are
also based on the specification. See
http://www.digitalpreservation.gov/documents/bagitspec.pdf for information and
instructions.
C.9.1. Once crawls are accepted by the Library, the Contractor shall prepare the
content for transfer via the following steps:
C.9.1.1. Calculate sub-1-terabyte divisions of the complete collection of WARC
files associated with the crawl. For a simple example, a crawl that yields 3500
1GB WARC files, totaling 3.5TB, should be divided into four subdivisions,
which might exhibit the following structure:
999 WARC files -> .999TB
999 WARC files -> .999TB
999 WARC files -> .999TB
503 WARC files -> .503TB
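The grouping in the worked example can be computed greedily, in file order. A sketch (any grouping that keeps each division under 1 TB would satisfy C.9.1.1):

```python
def subdivide(sizes_gb, limit_gb=999):
    """Group WARC file sizes (in whole GB) into sub-1-terabyte divisions,
    starting a new division whenever adding the next file would exceed
    the limit."""
    groups, current, total = [], [], 0
    for size in sizes_gb:
        if current and total + size > limit_gb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# 3500 files of 1 GB each -> divisions of 999, 999, 999, and 503 files.
print([len(g) for g in subdivide([1] * 3500)])  # [999, 999, 999, 503]
```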
C.9.1.2. Create a Holey Bag for each sub-1-terabyte division according to the
specifications of the BagIt File Packaging Format
(http://www.digitalpreservation.gov/documents/bagitspec.pdf)
C.9.1.3. Make resulting bags available for network transfer
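A holey bag's defining feature is its fetch.txt: payload files are listed with retrieval URLs rather than copied into the bag, so the receiver can pull them over the network. A minimal sketch of the layout (the bag-info.txt and payload/tag manifests a complete bag requires are omitted, and the WARC URL is hypothetical):

```python
import os
import tempfile

def make_holey_bag(bag_dir, remote_files):
    """Lay out a minimal 'holey' bag: an empty data/ payload directory,
    a bagit.txt declaration, and a fetch.txt listing each remote file as
    (url, size_bytes, relative_path)."""
    os.makedirs(os.path.join(bag_dir, "data"), exist_ok=True)
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "fetch.txt"), "w") as f:
        for url, size, rel_path in remote_files:
            f.write(f"{url} {size} {rel_path}\n")

# Hypothetical WARC location; a real bag lists every file in the division.
bag = tempfile.mkdtemp()
make_holey_bag(bag, [("http://example.gov/warcs/a.warc.gz",
                      1073741824, "data/a.warc.gz")])
```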
C.9.2. The Contractor shall propose a method for transfer of the content to the
Library that does not use the commercial Internet. The Library's preferred method is
network transfer using Internet2. The Library will also consider transfer using
physical storage devices such as external hard drives, or another method that does not
use the commercial Internet. If the Contractor can provide access to the content via
Internet2 or another proposed network method that does not use the commercial
Internet, then the Library will propose one or more specific protocols to be used (e.g.,
rsync, HTTP), method, and schedule for pulling content from the designated access
point. The Contractor may propose an alternative protocol, method, or schedule, if
useful for improving performance or integrity of transfer.
C.9.3. The Contractor shall ensure that the content transferred is equal in size (measured
in bytes) to that which has been generated for the Library in harvest services for that
contract year, plus any existing backlog the Library is able to transfer,
within the following requirements:
C.9.3.1. The Contractor shall maintain machines and networks operational for
transfer/pull by the Library of Congress, as described above.
C.9.3.2. The Contractor shall address and rectify transfer failures falling
within its machine(s) and/or network sphere. A transfer failure is a factor that
causes transfer of currently requested content to fail.
C.9.3.2.1. The Contractor shall acknowledge a transfer failure query
from the Library within 24 hours.
C.9.3.2.2. If the transfer failure is determined to fall within the
Contractor's machine(s) and/or network sphere, the Contractor shall
rectify the failure within 5 business days.
C.9.3.2.3. If rectifying the failure takes more than 5 business days, the
total time required to rectify it shall be counted as downtime (see next
section).
C.9.3.3. The Contractor shall ensure system availability for transfer of
harvested content to the Library during normal business hours Eastern time.
In no case shall the Contractor's annual downtime exceed 20 business days or
portions thereof.
C.9.3.3.1. Downtime is a period of time during which the Library is
unable to transfer any content from any of the Contractor's machines.
C.9.3.3.2. The Contractor shall give the Library 24 hours' notice of
downtime (exclusive of transfer failures that become downtime).
C.9.4. The Contractor shall make adjustments in transfer tools as necessary during
the course of the contract to accommodate the Library's security or technology
refresh requirements.
C.10. CLOSE OUT
At the end of the final option year of the contract, and upon request of the Library, close out
tasks include ensuring delivery (per the Transfer specifications listed above) of any
remaining content and reports to the Library.
Any copies of the content stored and hosted during the life of the Contract by the
Contractor shall be deleted at the termination of this contract, and not retained or
co-mingled with the Contractor's content.
#----------------------#
# monthly
#----------------------#
# GLOBAL SURTS
#----------------------#
# do not crawl query strings on /rss but allow the root rss feed for newsroom.lds.org
# do not crawl
-http://(com,solanocounty,www,)/bosagenda/MG22254/AS22274/
-http://(com,delicious,)/save
-http://(com,secretaryofinnovation,
-http://(com,wikileaks,
-http://(gov,alabama,joblink,
-http://(gov,mt,leg,)/css/committees/session/minutes/07minwrittenaudio.asp?chamber=house
-http://(gov,mt,leg,)/css/committees/session/minutes/07minwrittenaudio.asp?chamber=senate
-http://(gov,mt,leg,)/css/committees/session/minutes/05minwrittenaudio.asp?houseid=1&sessionid=88
-http://(gov,mt,leg,)/css/committees/session/minutes/05minwrittenaudio.asp?houseid=2&sessionid=88
-http://(gov,mt,leg,)/css/sessions/61st/archives.asp
-http://(gov,mt,leg,)/css/sessions/62nd/archives.asp
-http://(info,wikileaks,
-http://(mil,navy,news,
-http://(org,au,members,)/site/userlogin
-http://(org,northcountrygazette,
-http://(org,wikileaks,
-http://(com,tinyurl,errorhelp,
-http://(com,msn,r,
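The rules above and below are written as SURT (Sort-friendly URI Reordering Transform) prefixes: host labels are reversed and comma-separated inside parentheses, so all subdomains of a site sort and match together under one prefix. A sketch of the transformation (real crawler implementations also normalize case, ports, and userinfo):

```python
from urllib.parse import urlparse

def to_surt(url):
    """Convert a plain URL into SURT prefix form: reverse the host labels,
    join with commas, and wrap in parentheses with a trailing comma."""
    parts = urlparse(url)
    host_reversed = ",".join(reversed((parts.hostname or "").split(".")))
    return f"{parts.scheme}://({host_reversed},){parts.path}"

print(to_surt("http://www.loc.gov/"))  # http://(gov,loc,www,)/
```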
# respect robots
# respect robots on http://joblink.delaware.gov/
# respect robots on http://search.scdhec.gov
# respect robots on http://search.sheriff.org
# respect robots on http://search.socialsecurity.gov
# respect robots on http://search.sos.state.tx.us
# respect robots on http://search.tpwd.texas.gov
# respect robots on http://search.trade.gov
# URL shorteners
+http://(be,yout,
+http://(BO,OFA,
+http://(cc,tiny,
+http://(co,t,
+http://(com,epurl,
+http://(com,tinyurl,
+http://(com,yfrog,
+http://(gd,ta,
+http://(gl,goo,
+http://(gov,usa,go,
#----------------------#
# COLLECTION SURTS
#----------------------#
# lcwa0039
+http://(com,granicus,
+http://(gov,houselive,
+http://207.7.154.110:443/
+http://(207.7.154.110)
+http://(me,bcove,
+http://(com,brightcove,link,
+http://(net,edgeboss,
#----------------------#
# Monthly Seeds
+http://(com,google,video,)
+http://(com,facebook,www,)/government
+http://(com,facebook,www,)/congress
+http://(com,youtube,www,)/househub
+http://(gov,cosponsor,
# Record ID 94178 - republican.senate.gov
http://republican.senate.gov/
+http://(gov,senate,republican,
+http://(com,twitter,)/Senate_GOPs
+http://(com,youtube,www,)/user/RepublicanSenators
+http://(com,facebook,www,)/RepublicanSenators
# Record ID 94179 - www.stennis.gov
http://www.stennis.gov/
+http://(gov,stennis,www,)/
# Record ID 94180 - www.aoc.gov
http://www.aoc.gov/
+http://(gov,aoc,www,)
+http://(com,flickr,www,)/photos/uscapitol/
+http://(com,youtube,www,)/user/AOCgov
+http://(com,facebook,www,)/ArchitectoftheCapitol
+http://(com,instagram,)/uscapitol
+http://(com,twitter,)/uscapitol
# Record ID 94182 - www.uscc.gov
http://www.uscc.gov/
+http://(gov,uscc,www,
# Record ID 94183 - medpac.gov
http://medpac.gov/
+http://(gov,medpac,
# Record ID 94184 - www.jct.gov
http://www.jct.gov/
+http://(gov,jct,
# Record ID 94185 - www.majorityleader.gov
http://www.majorityleader.gov/
+http://(gov,majorityleader,
# Record ID 94186 - www.loc.gov
http://www.loc.gov/
+http://(gov,loc,www,)
+http://(gov,loc,search,)/
+http://(gov,loc,memory,)/
+http://(gov,loc,mic,)/
+http://(gov,loc,lcweb,)/
+http://(gov,loc,marvel,)/
+http://(gov,loc,international,)/
+http://(gov,digitalpreservation,www,)/
+http://(gov,americaslibrary,www,)/
+http://(gov,read,www,)/
#+http://(gov,loc,thomas,)/
#+http://(com,twitter,)/librarycongress
+http://(com,facebook,www,)/libraryofcongress
+http://(com,facebook,www,)/americanfolklifecenter
+http://(com,facebook,www,)/booksandbeyond
+http://(com,facebook,www,)/lawlibraryofcongress
+http://(com,facebook,www,)/digitalpreservation
+http://(gov,loc,blogs,)/
+http://(gov,loc,www,)/chroniclingamerica/
+http://(gov,loc,thomas,)/cgi-bin/dailydigest
+http://(gov,access,gpo,frwebgate,)/cgi-bin/getpage.cgi?dbname=
+http://(gov,loc,thomas,)/home/lawsmade.toc.html
+http://(gov,loc,thomas,)/home/lawsmade.bysec/
+http://(gov,loc,www,)/crsinfo/
+http://(com,twitter,)/librarycongress
+http://(com,youtube,www,)/libraryofcongress
#update 4-6-07 for loc.gov exclusion
#special rule for www.loc.gov
+http://(gov,loc,memory,)/ammem/aaohtml
#there was a space before 'memory' that was deleted BF
+http://(gov,loc,www,)/exhibits/african/
+http://(com,flickr,www,)/photos/library_of_congress/
+http://(com,twitter,)/LawlibCongress
+http://(com,twitter,)/LOCMaps
+http://(com,twitter,)/ndiipp
+http://(com,twitter,)/THOMASdotgov
+http://(com,twitter,)/CopyrightOffice
+http://(com,twitter,)/wdlorg
+http://(gov,loc,lcweb2,)
+http://(gov,loc,media,
+http://(com,pinterest,www,)/LibraryCongress/
+http://(gov,congress,
+http://(com,youtube,www,)/user/LibraryOfCongress/
# Record ID 94187 - www.gpo.gov
http://www.gpo.gov/
+http://(gov,gpo,www,)/
+http://(gov,gpo,bensguide,)/
+http://(gov,gpoaccess,www,)/
+http://(com,twitter,)/usgpo
+http://(com,youtube,www,)/user/gpoprinter