The goal is to get what we want and exclude what we don't want.
There are a number of decisions to make:
- Ending slash - copy the URL directly from the browser window. If a slash is there, keep it. If you drop it, too much content may be crawled.
- Seed type
  - Standard
  - Standard plus external
  - One page
  - One page plus external
- Social Media
  - Twitter - add an ending '/' to the URL, for example: http://twitter.com/internetarchive/. This lets you archive only the feed that you specify, rather than all of Twitter!
  - Facebook - use https, use an ending slash, and use the Standard seed type.
  - YouTube - specific videos on YouTube are hosted on a "watch" page with a URL in the following format: https://www.youtube.com/watch?v=XXXXXXXXXX. Follow this format and always use the "One page" seed type to avoid scoping in all of youtube.com.
- Calendars - block! These can be “crawler traps”. Use the regular expression ^.*calendar.*$ in your scope notes to block them.
- Note: you can add scoping rules before or after you upload your seed list. Let WAWG know when you are ready for a test crawl, or schedule the test crawl yourself!
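
To sanity-check a seed list before uploading, the rules above can be sketched in a few lines of Python. This is only an illustration, not part of any crawler or web-archiving service: the function names `is_blocked` and `suggest_seed` are made up for this example, and the seed type is returned as `None` wherever the notes above do not name one. The only thing taken directly from the notes is the calendar regular expression.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Same pattern as in the Calendars item above.
CALENDAR_RULE = re.compile(r"^.*calendar.*$")


def is_blocked(url: str) -> bool:
    """Return True if the calendar pattern matches the URL (case-sensitive)."""
    return CALENDAR_RULE.match(url) is not None


def suggest_seed(url: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (seed_url, seed_type) following the notes.

    seed_type is None when the notes do not say which type to use.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # YouTube "watch" pages: keep the URL as-is and use the One page type.
    if host == "youtube.com" and parsed.path == "/watch":
        return url, "One page"

    # Twitter and Facebook: make sure the URL ends with a slash.
    if host in ("twitter.com", "facebook.com"):
        if not parsed.path.endswith("/"):
            parsed = parsed._replace(path=parsed.path + "/")
        if host == "facebook.com":
            # Facebook: use https and the Standard seed type.
            parsed = parsed._replace(scheme="https")
            return parsed.geturl(), "Standard"
        return parsed.geturl(), None

    # Anything else: keep the URL exactly as copied from the browser.
    return url, None


if __name__ == "__main__":
    for candidate in [
        "http://twitter.com/internetarchive",
        "http://www.facebook.com/internetarchive",
        "https://www.youtube.com/watch?v=XXXXXXXXXX",
    ]:
        print(suggest_seed(candidate))
    print(is_blocked("http://example.org/events/calendar/2020-01"))
```

Note that the pattern as written only matches a lowercase "calendar" in this sketch, so a URL containing "Calendar" would not be flagged by `is_blocked`.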