
Web Archiving for Curators

Test Crawl Run and Review

Test Crawls – Reviewing for crawler traps and missing data:

  • Review the terms COLLECTION, SEED, HOST, and DOCUMENT in the Archive-It Glossary.
  • The Hosts tab of a crawl report includes information on every distinct host site to which your crawl was led, which can include your seed URLs in addition to all other sites considered or directed to be in scope. See How to Read Your Crawl Hosts Report.
  • The Seeds tab of a crawl report provides the status of each seed at crawl time and the total number of documents and volume of data archived from each. View how each seed site renders when replayed by Wayback (the replay URL pattern is sketched below this list). It is also possible to drill down by seed to view a report detailing every host to which each seed led the crawler.
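If you want to spot-check replay outside the Archive-It web application, replayed seeds live at a predictable Wayback URL. The Python sketch below is a minimal illustration; the collection ID (12345) and seed URL are placeholders, not real values.

    # Build an Archive-It Wayback replay URL for a seed.
    # Pattern: https://wayback.archive-it.org/<collectionID>/<timestamp>/<seedURL>
    # A timestamp of "*" requests the calendar of all captures for that URL.
    def wayback_replay_url(collection_id, seed_url, timestamp="*"):
        return f"https://wayback.archive-it.org/{collection_id}/{timestamp}/{seed_url}"

    # Placeholder collection ID and seed URL:
    print(wayback_replay_url(12345, "https://www.example.org/"))
    # -> https://wayback.archive-it.org/12345/*/https://www.example.org/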

How to Run your Test Crawl

  1. From within your collection, select your seeds; a "Run Crawl" button will appear. Fill in the dialog box, choosing "Test Crawl".

How to Review your Test Crawl

  1. Log in to Archive-It with your personal account.
  2. Select your collection.
  3. Select the Crawls tab.
  4. Select a Crawl ID.
    1. There may be more than one; select the most complete or recent one.
  5. Look at the Crawl Overview.
    1. Look at the statistics: Did the crawl finish? How much data? How many documents? Too much? Too little?
  6. Select "Seeds" report
    1. Were all your seeds crawled? Any "404" messages? If so, consider editing the seed; if a seed is not crawled, there will be no hosts and no documents. Any "robots.txt" messages? Read up on status messages. (A quick script for pre-checking seed URLs is sketched after this list.)
    2. Review your seeds in Wayback to ensure you have what you want. If you don't, content may have been "blocked" or deemed "out of scope". Note: you cannot run patch crawls or use Wayback QA on test crawls until the data has been saved. DO NOT SAVE YET.
  7. Select "Hosts" report
    1. Check the Blocked column first: blocked content consists of documents that were discovered but could not be captured. THERE WILL ALWAYS BE SOME. Click on the number to open a .txt file and look for URLs that you might want. The fix? More scoping rules in the next test crawl and, later, patch crawls. (A sketch for scanning the blocked list follows this list.)
    2. Look in the Queued column (read up on Queued); any hosts with large numbers of documents in the Queued column need to be checked for things like:
      1. "Repeating directories" and calendars: these are crawler traps, which prevent a crawl from ending. (A sample pattern for spotting them appears after this list.)
      2. Be sure auto-scoping rules have been applied, and add any rules needed. There is no need for rules for unwanted queued content, for example .css, .png, or Drupal files.
      3. .css files may be queued; these will be gathered during production crawl QA (a later step).
    3. Check "out of scope" column. for hosts with large numbers of documents in this column.  There might be content not captured if a "host" was a "link" rather than "embedded" in a "seed." you may need to add more "seeds" to your collection. Alternatively, the content might have been captured  elsewhere. 
  8. IMPORTANT: run another test crawl if you had to add scoping rules or edit your seed list. 
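A quick way to pre-check seed URLs (step 6.1): the Python sketch below requests each seed and reports its HTTP status, and notes whether the site's robots.txt contains Disallow rules. It is an illustration only, independent of Archive-It; the seed URLs are placeholders, and it requires the third-party requests library.

    import requests
    from urllib.parse import urljoin

    seeds = [  # placeholder seed URLs; substitute your own
        "https://www.example.org/",
        "https://www.example.org/old-page",
    ]

    for seed in seeds:
        try:
            # HEAD is enough to surface 404s and redirects without downloading pages.
            resp = requests.head(seed, allow_redirects=True, timeout=10)
            print(f"{resp.status_code}  {seed}")
            # robots.txt lives at the site root and can block crawlers.
            robots = requests.get(urljoin(seed, "/robots.txt"), timeout=10)
            if robots.ok and "Disallow" in robots.text:
                print(f"      note: robots.txt for {seed} contains Disallow rules")
        except requests.RequestException as err:
            print(f"ERR   {seed}  ({err})")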
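For step 7.1, the blocked list can run to thousands of lines, so a short script makes it easier to pick out URLs worth patch-crawling later. A minimal sketch, assuming the report was downloaded as blocked.txt (a placeholder filename) with one URL per line; the keywords are placeholders too.

    # Filter a downloaded blocked-URLs report for paths you may want to recover.
    keywords = ("pdf", "report", "news")  # placeholder terms of interest

    with open("blocked.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and any(k in url.lower() for k in keywords):
                print(url)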
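Crawler traps (step 7.2.1) usually betray themselves as endlessly varying URLs: the same directory repeated, or calendar pages parameterized by date. The regular expression below is an illustrative heuristic for scanning a list of queued URLs, not an Archive-It scoping-rule format.

    import re

    # Flag repeated path segments (/events/events/) and calendar-style
    # query parameters (?month=..., &year=...), both classic trap shapes.
    trap = re.compile(
        r"(/([^/]+)/\2/)"
        r"|([?&](date|month|year|calendar)=)",
        re.IGNORECASE,
    )

    samples = [  # placeholder URLs for illustration
        "https://www.example.org/events/events/page",
        "https://www.example.org/calendar?month=2024-05",
        "https://www.example.org/about",
    ]

    for url in samples:
        print("possible trap" if trap.search(url) else "ok           ", url)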

Saving (or not) your crawl

  1. Saving your Crawl - Once your seeds are scoped and your crawl is what you want, you have two choices: either
    1. save your test crawl, OR
    2. re-run as a production one-time crawl, then save. 

Recommendation: re-run as a one-time crawl, and save. This may not be possible with time-sensitive crawls, e.g. where a website might change (after an election) or go down (after an event). In that case, save the test crawl.

Next step: Quality Assurance (QA)

Creative Commons License
This work by The University of Victoria Libraries is licensed under a Creative Commons Attribution 4.0 International License unless otherwise indicated when material has been used from other sources.