
Web Archiving for curators

QA and Patch Crawls

Quality Assurance (QA) is an important step after your crawls have run. Working from Wayback or from a hosts report, you check the captured content to confirm that links behave as expected. If they do not, "patch" crawls are run.

Patch crawls: A patch crawl captures and patches in documents that were not captured in your original crawl (see the Glossary). READ: Modify scope and run patch crawls from your report (link below). Patch crawls do not usually take up much data.

There are a number of ways to run patch crawls, but they can only be run on production crawls or saved test crawls.

"Automatic" Quality Assurance - From within the post-crawl hosts report, scoping rules can be edited and patch crawls run.

Manual Quality Assurance - Browse in Wayback to the relevant areas of each site and check for relevant files. Missing URLs are logged so that they can be crawled later in a patch crawl. READ: How to use the Wayback QA Tool (link below).

  1. From your crawl's Seeds report or the collection's seed list, work through the seeds, opening each seed in Wayback from the links provided. (Note: there are other routes from seeds to Wayback, for example the "Archive" tab.)
  2. Select “Enable QA” from the top Archive-It banner.  This will allow the system to log missing files as you navigate through a site. 
  3. Important: Make sure QA is enabled at all times! Otherwise you are navigating the site without logging any missing URLs.
  4. Navigate the site page by page to ensure you have captured what needs to be captured.  Try dividing the site into sections so that you can pick up where you left off. 
  5. What happens when the archived website does not look like the web version? This can happen when images and other files are not crawled. Check the page as archived and look to see whether it has the navigation bar or site map as text, with links from that. Another common issue with archived pages is that the URL has not been crawled.
  6. Additional tool for QA: viewing in proxy mode
  7. When your work in Wayback is done, select "Run Patch Crawls".
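
The Wayback QA tool logs missing URLs for you, so no scripting is needed to follow the steps above. For curators who also keep their own list of missing URLs while working through a site section by section, the following is a minimal Python sketch of how such a list could be tidied before a patch crawl. It is an illustration only: the function name `build_patch_list`, the sample URLs, and the idea of an "in scope" host list are assumptions made for this example and are not part of Archive-It.

```python
from urllib.parse import urlsplit


def build_patch_list(missing_urls, in_scope_hosts):
    """Return a sorted, de-duplicated list of missing URLs on in-scope hosts.

    missing_urls   -- URLs noted by hand during manual QA (hypothetical input)
    in_scope_hosts -- set of lowercase hostnames the curator wants to patch
    """
    seen = set()
    for raw in missing_urls:
        url = raw.strip()          # drop stray spaces from copy and paste
        if not url:
            continue               # skip blank entries
        host = urlsplit(url).hostname  # lowercase hostname, or None if absent
        if host is None or host not in in_scope_hosts:
            continue               # ignore hosts we do not intend to patch
        seen.add(url)
    return sorted(seen)


# Hypothetical notes taken while browsing one section of a site.
notes = [
    "https://www.example.org/img/logo.png",
    "https://www.example.org/img/logo.png ",   # duplicate with trailing space
    "https://cdn.example.net/style.css",
    "",
    "https://ads.example.com/track.js",        # out of scope
]

patch_list = build_patch_list(notes, {"www.example.org", "cdn.example.net"})
for url in patch_list:
    print(url)
```

Keeping one such list per site section makes it easier to pick up where you left off, as suggested in step 4.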

Final Steps:

  1. Review metadata, edit and add as needed.
  2. Make sure your collection is public.
  3. Let WAWG know when your collection should be crawled again. Our collections tend to be crawled annually or "one time only".
Creative Commons License
This work by The University of Victoria Libraries is licensed under a Creative Commons Attribution 4.0 International License unless otherwise indicated when material has been used from other sources.