- Web crawling is not feasible with one machine
  - All of the above steps are distributed
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Even non-malicious pages pose challenges
  - Latency and bandwidth to remote servers vary
  - Webmasters' stipulations
    - How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: do not hit a server too often
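The politeness and crawl-depth points above can be illustrated with a small sketch. It is only one possible approach, not a method given in the slides: the names `PolitenessGate`, `url_depth`, and `should_fetch`, the 2-second delay, and the depth limit of 5 are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical helper: the class name and the default delay are
    illustrative assumptions, not taken from the source slides.
    """

    def __init__(self, min_delay_s: float = 2.0):
        self.min_delay_s = min_delay_s
        self._next_allowed = {}  # host -> earliest permitted fetch time

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before fetching `url`; 0.0 means fetch now."""
        host = urlsplit(url).netloc.lower()
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that `url`'s host was contacted at time `now`."""
        host = urlsplit(url).netloc.lower()
        self._next_allowed[host] = now + self.min_delay_s


def url_depth(url: str) -> int:
    """Count non-empty path segments, a crude proxy for hierarchy depth."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def should_fetch(url: str, gate: PolitenessGate, now: float,
                 max_depth: int = 5) -> bool:
    """Fetch only if the URL is shallow enough and its host has rested."""
    return url_depth(url) <= max_depth and gate.wait_time(url, now) == 0.0
```

In a running crawler, `now` would typically come from `time.monotonic()`. A fixed depth cap also gives only partial protection against spider traps, since dynamically generated pages can produce unlimited distinct URLs at a shallow depth.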
- (no access) - TMI1-WST3-05-Crawling.pdf, p7