- Web crawling is not feasible with one machine
  - All of the above steps must be distributed
- Malicious pages
  - Spam pages
  - Spider traps, including dynamically generated ones
- Even non-malicious pages pose challenges
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations
    - How "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: do not hit a server too often
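
A few of these complications (politeness, crawl depth, and exact duplicate URLs) can be illustrated with a toy URL frontier. This is only a minimal single-machine sketch, not a production design: the class name `PoliteFrontier`, the parameters `min_delay` and `max_depth`, and the idea of measuring depth as the number of path segments are all assumptions made for illustration. It does not address distribution, spam detection, or near-duplicate content.

```python
import time
from urllib.parse import urlsplit


class PoliteFrontier:
    """Toy URL frontier (hypothetical): per-host delay, depth cap, URL dedup."""

    def __init__(self, min_delay=2.0, max_depth=5, clock=time.monotonic):
        self.min_delay = min_delay   # assumed minimum seconds between hits to one host
        self.max_depth = max_depth   # assumed cap on path depth, a crude spider-trap guard
        self.clock = clock           # injectable clock so the behaviour can be tested
        self.queue = []              # pending URLs in insertion order
        self.seen = set()            # URLs already accepted (exact-match dedup only)
        self.next_ok = {}            # host -> earliest time the next fetch is allowed

    @staticmethod
    def depth_of(url):
        # Depth is taken here as the number of non-empty path segments.
        return len([seg for seg in urlsplit(url).path.split("/") if seg])

    def add(self, url):
        """Queue a URL unless it is a duplicate or deeper than max_depth."""
        if url in self.seen or self.depth_of(url) > self.max_depth:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop_ready(self):
        """Return the first queued URL whose host is not cooling down, else None."""
        now = self.clock()
        for i, url in enumerate(self.queue):
            host = urlsplit(url).netloc
            if self.next_ok.get(host, float("-inf")) <= now:
                self.next_ok[host] = now + self.min_delay
                return self.queue.pop(i)
        return None
```

With this sketch, two URLs on the same host are never handed out less than `min_delay` seconds apart, while URLs on other hosts can be fetched in the meantime. A real crawler would also need to honour each webmaster's stated crawling rules and detect mirrored content, which this example leaves out.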
