Edited, memorised or added to reading list on 02-Feb-2019 (Sat)


THE PRESENT RESEARCH





Flashcard 3814143167756

Tags
#has-images
Question
What are basic crawler operations?
Answer
  • Begin with known "seed" URLs in a queue
  • Fetch and parse them
    • Extract URLs they point to
    • Place the extracted URLs on a queue
  • Fetch each URL on the queue and repeat
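The basic crawler operations above can be sketched as a breadth-first traversal over a URL queue. This is a minimal illustration, not a production crawler: the `fetch_links` callable is a stand-in for the real "fetch and parse" step (an HTTP request plus HTML link extraction), and the toy in-memory `web` dictionary replaces the network.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: begin with known "seed" URLs in a queue,
    fetch and parse each one, extract the URLs it points to, and
    enqueue any URL not yet seen. `fetch_links(url)` is assumed to
    return the list of out-links of the page at `url`."""
    queue = deque(seed_urls)
    seen = set(seed_urls)       # avoid re-enqueuing duplicate URLs
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)             # "fetch and parse" step
        for link in fetch_links(url):   # extract URLs the page points to
            if link not in seen:
                seen.add(link)
                queue.append(link)      # place extracted URLs on the queue
    return visited

# Usage with a toy in-memory "web" instead of real HTTP:
web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}
print(crawl(["a"], lambda u: web.get(u, [])))  # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the "repeat" step from looping forever on cyclic link structures such as `d -> a` above.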









Web crawling is not feasible with one machine: all of the above steps must be distributed.
  • Malicious pages
    • Spam pages
    • Spider traps, including dynamically generated ones
  • Even non-malicious pages pose challenges
    • Latency/bandwidth to remote servers vary
    • Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
    • Site mirrors and duplicate pages
  • Politeness: do not hit a server too often
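The politeness requirement (do not hit a server too often) is often enforced with a per-host rate limiter. A minimal sketch, assuming a single-threaded fetch loop; the `PolitenessGate` class name and the 1-second default delay are illustrative choices, not a standard, and the injectable `clock`/`sleep` parameters exist only to make the sketch easy to test without real waiting.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Tracks the time of the last request per host and blocks until
    at least `min_delay` seconds have passed before hitting the same
    server again. Requests to different hosts are never delayed."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.last_fetch = {}   # host -> timestamp of last request
        self.clock = clock
        self.sleep = sleep

    def wait(self, url):
        """Call immediately before fetching `url`; sleeps if needed."""
        host = urlparse(url).netloc
        now = self.clock()
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            self.sleep(self.min_delay - (now - last))
        self.last_fetch[host] = self.clock()
```

A distributed crawler would need a shared variant of `last_fetch` (or host-based partitioning of the URL queue, so each host is owned by one fetcher), since the point of the delay is per-server load, not per-machine load.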
