Web crawling is not feasible with one machine
- All of the above steps must be distributed
- Malicious pages: spam pages, spider traps (including dynamically generated ones)
- Even non-malicious pages pose challenges:
  - Latency/bandwidth to remote servers vary
  - Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
  - Site mirrors and duplicate pages
- Politeness: do not hit a server too often
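The politeness requirement above is commonly implemented as a per-host minimum delay between requests. A minimal sketch (the class name `PolitenessThrottle` and the one-second default are illustrative assumptions, not from the slides):

```python
import time
from urllib.parse import urlparse


class PolitenessThrottle:
    """Enforce a minimum delay between successive requests to the same host.

    Illustrative sketch: a real crawler would also honour robots.txt
    crawl-delay directives and coordinate across distributed workers.
    """

    def __init__(self, min_delay_s: float = 1.0):
        self.min_delay_s = min_delay_s
        self.last_hit: dict[str, float] = {}  # host -> time of last request

    def wait(self, url: str) -> None:
        """Block until it is polite to fetch this URL, then record the hit."""
        host = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_hit.get(host, float("-inf"))
        if elapsed < self.min_delay_s:
            time.sleep(self.min_delay_s - elapsed)
        self.last_hit[host] = time.monotonic()
```

Requests to different hosts are not delayed against each other, which is why distributed crawlers typically partition the URL frontier by host.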
Source: TMI1-WST3-05-Crawling.pdf, p. 7