
Web Archiving for Curators

Test Crawl Run and Review

Test Crawls – Reviewing for crawler traps and missing data:

  • Review the terms COLLECTION, SEED, HOST, and DOCUMENT in the Archive-It Glossary.
  • The Hosts tab of a crawl report includes information on every distinct host site to which your crawl was led, which can include your seed URLs in addition to all other sites considered or directed to be in scope. See How to Read Your Crawl Hosts Report.
  • The Seeds tab of a crawl report provides the status of each seed at crawl time and the total number of documents and volume of data archived from each. View how each seed site renders when replayed by Wayback (the replay URL pattern is sketched below this list). It is also possible to drill down by seed to view a report detailing every host to which each seed led the crawler.
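If you want to spot-check replay outside the Archive-It web application, replayed seeds live at a predictable Wayback URL. The Python sketch below is a minimal illustration; the collection ID (12345) and seed URL are placeholders, not real values.

    # Build an Archive-It Wayback replay URL for a seed.
    # Pattern: https://wayback.archive-it.org/<collectionID>/<timestamp>/<seedURL>
    # A timestamp of "*" requests the calendar of all captures for that URL.
    def wayback_replay_url(collection_id, seed_url, timestamp="*"):
        return f"https://wayback.archive-it.org/{collection_id}/{timestamp}/{seed_url}"

    # Placeholder collection ID and seed URL:
    print(wayback_replay_url(12345, "https://www.example.org/"))
    # -> https://wayback.archive-it.org/12345/*/https://www.example.org/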

How to Run your Test Crawl

  1. From within your collection, select your seeds; a "Run Crawl" button will appear. Fill in the dialog box, choosing "Test Crawl".

How to Review your Test Crawl

  1. Log in to Archive-It with your personal account.
  2. Select your collection.
  3. Select the Crawls tab.
  4. Select a Crawl ID.
    1. There may be more than one; select the most complete or recent one.
  5. Look at the Crawl Overview.
    1. Look at the statistics: Did the crawl finish? How much data? How many documents? Too much? Too little?
  6. Select "Seeds" report
    1. Were all your seeds crawled? Any "404" messages? If so, consider editing the seed; if a seed is not crawled, there will be no hosts and no documents. Any "robots.txt" messages? Read up on status messages. (A quick script for pre-checking seed URLs is sketched after this list.)
    2. Review your seeds in Wayback to ensure you have what you want. If you don't, content may have been "blocked" or deemed "out of scope". Note: you cannot run patch crawls or use Wayback QA on test crawls until the data has been saved. DO NOT SAVE YET.
  7. Select "Hosts" report
    1. Check the Blocked column first: blocked content consists of documents that were discovered but could not be captured. THERE WILL ALWAYS BE SOME. Click on the number to open a .txt file and look for URLs that you might want. The fix? More scoping rules in the next test crawl and, later, patch crawls. (A sketch for scanning the blocked list follows this list.)
    2. Look in the Queued column (read up on Queued); any hosts with large numbers of documents in the Queued column need to be checked for things like:
      1. "Repeating directories" and calendars: these are crawler traps, which prevent a crawl from ending. (A sample pattern for spotting them appears after this list.)
      2. Be sure auto-scoping rules have been applied, and add any rules needed. There is no need for rules for unwanted queued content, for example .css, .png, or Drupal files.
      3. .css files may be queued; these will be gathered during production crawl QA (a later step).
    3. Check "out of scope" column. for hosts with large numbers of documents in this column.  There might be content not captured if a "host" was a "link" rather than "embedded" in a "seed." you may need to add more "seeds" to your collection. Alternatively, the content might have been captured  elsewhere. 
  8. IMPORTANT: run another test crawl if you had to add scoping rules or edit your seed list. 
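A quick way to pre-check seed URLs (step 6.1): the Python sketch below requests each seed and reports its HTTP status, and notes whether the site's robots.txt contains Disallow rules. It is an illustration only, independent of Archive-It; the seed URLs are placeholders, and it requires the third-party requests library.

    import requests
    from urllib.parse import urljoin

    seeds = [  # placeholder seed URLs; substitute your own
        "https://www.example.org/",
        "https://www.example.org/old-page",
    ]

    for seed in seeds:
        try:
            # HEAD is enough to surface 404s and redirects without downloading pages.
            resp = requests.head(seed, allow_redirects=True, timeout=10)
            print(f"{resp.status_code}  {seed}")
            # robots.txt lives at the site root and can block crawlers.
            robots = requests.get(urljoin(seed, "/robots.txt"), timeout=10)
            if robots.ok and "Disallow" in robots.text:
                print(f"      note: robots.txt for {seed} contains Disallow rules")
        except requests.RequestException as err:
            print(f"ERR   {seed}  ({err})")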
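For step 7.1, the blocked list can run to thousands of lines, so a short script makes it easier to pick out URLs worth patch-crawling later. A minimal sketch, assuming the report was downloaded as blocked.txt (a placeholder filename) with one URL per line; the keywords are placeholders too.

    # Filter a downloaded blocked-URLs report for paths you may want to recover.
    keywords = ("pdf", "report", "news")  # placeholder terms of interest

    with open("blocked.txt", encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and any(k in url.lower() for k in keywords):
                print(url)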
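Crawler traps (step 7.2.1) usually betray themselves as endlessly varying URLs: the same directory repeated, or calendar pages parameterized by date. The regular expression below is an illustrative heuristic for scanning a list of queued URLs, not an Archive-It scoping-rule format.

    import re

    # Flag repeated path segments (/events/events/) and calendar-style
    # query parameters (?month=..., &year=...), both classic trap shapes.
    trap = re.compile(
        r"(/([^/]+)/\2/)"
        r"|([?&](date|month|year|calendar)=)",
        re.IGNORECASE,
    )

    samples = [  # placeholder URLs for illustration
        "https://www.example.org/events/events/page",
        "https://www.example.org/calendar?month=2024-05",
        "https://www.example.org/about",
    ]

    for url in samples:
        print("possible trap" if trap.search(url) else "ok           ", url)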

Saving (or not) your crawl

  1. Saving your Crawl - Once your seeds are scoped and your crawl is what you want, you have two choices: either
    1. save your test crawl, OR
    2. re-run as a production one-time crawl, then save. 

Recommendation: re-run as a one-time crawl, and save. This may not be possible with time-sensitive crawls, e.g. where a website might change (after an election) or go down (after an event). In that case, save the test crawl.

Next step: Quality Assurance (QA)

Creative Commons License
This work by The University of Victoria Libraries is licensed under a Creative Commons Attribution 4.0 International License unless otherwise indicated when material has been used from other sources.