
Web Archiving for curators

Scoping Rules advice from WAWG

The goal is to capture what we want and exclude what we don't want.

There are a number of decisions to make:

  1. Ending slash - copy the URL directly from the browser window. If a slash is there, keep it; otherwise too much content may be crawled.
  2. Seed type
    1. Standard
    2. Standard Plus external 
    3. One page
    4. One page plus external
  3. Social Media
    1. Twitter - add an ending '/' to the URL, for example: http://twitter.com/internetarchive/. This allows you to archive only the feed that you specify, rather than all of Twitter.
    2. Facebook - use https, use an ending slash, and use the Standard seed type.
    3. YouTube - specific videos on YouTube are hosted on a "watch" page with a URL in the following format: https://www.youtube.com/watch?v=XXXXXXXXXX. Follow this format and always use the "One page" seed type to avoid scoping in all of youtube.com.
  4. Calendars - block! These can be "crawler traps". Use the regular expression ^.*calendar.*$ in your scope notes to block them.
  5. Note: you can add scoping rules before or after you upload your seed list. Let WAWG know when you are ready for a test crawl, or schedule the test crawl yourself.
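
To see why the ending slash and the calendar rule matter, the two ideas can be sketched in a few lines of Python. This is only an illustration, not how Archive-It is implemented: it assumes a simplified model in which a seed's scope is everything up to and including the last '/' in its path, and the helper names (scope_prefix, in_scope) and the example.org URL are hypothetical. The regular expression is the one given in the list above.

```python
import re
from urllib.parse import urlsplit

# The calendar block rule from the advice above.
CALENDAR_BLOCK = re.compile(r"^.*calendar.*$")


def scope_prefix(seed: str) -> str:
    """Assumed, simplified scope model: keep the seed URL up to and
    including the last '/' in its path (hypothetical helper)."""
    parts = urlsplit(seed)
    path = parts.path or "/"
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_scope(seed: str, url: str) -> bool:
    """Return True if url would be crawled under this simplified model."""
    if CALENDAR_BLOCK.match(url):
        return False  # blocked by the calendar rule
    return url.startswith(scope_prefix(seed))


with_slash = "http://twitter.com/internetarchive/"
without_slash = "http://twitter.com/internetarchive"
other_feed = "http://twitter.com/someoneelse"

print(scope_prefix(with_slash))            # scope stays on the one feed
print(scope_prefix(without_slash))         # scope widens to the whole host
print(in_scope(with_slash, other_feed))    # another feed is out of scope
print(in_scope(without_slash, other_feed)) # another feed is now in scope
print(in_scope("https://example.org/", "https://example.org/events/calendar?month=5"))
```

Under this model, dropping the ending slash widens the scope from the single feed to all of twitter.com, which is the "too much content" problem described above. Note that the pattern as written is case-sensitive in this sketch, so a URL containing "Calendar" with a capital C would not match it here.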

 

Archive-It videos and guides

Creative Commons License
This work by The University of Victoria Libraries is licensed under a Creative Commons Attribution 4.0 International License unless otherwise indicated when material has been used from other sources.