Edited, memorised or added to reading list on 02-Feb-2019 (Sat)


THE PRESENT RESEARCH





Flashcard 3814143167756

Tags
#has-images
Question
What are basic crawler operations?
Answer
  • Begin with known "seed" URLs in a queue
  • Fetch and parse them
    • Extract URLs they point to
    • Place the extracted URLs on a queue
  • Fetch each URL on the queue and repeat
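The basic crawler operations above can be sketched as a breadth-first traversal over a URL queue. This is a minimal illustration, not a production crawler: the `fetch_links` callable is a stand-in for the real "fetch and parse" step (an HTTP request plus HTML link extraction), and the toy in-memory `web` dictionary replaces the network.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: begin with known "seed" URLs in a queue,
    fetch and parse each one, extract the URLs it points to, and
    enqueue any URL not yet seen. `fetch_links(url)` is assumed to
    return the list of out-links of the page at `url`."""
    queue = deque(seed_urls)
    seen = set(seed_urls)       # avoid re-enqueuing duplicate URLs
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)             # "fetch and parse" step
        for link in fetch_links(url):   # extract URLs the page points to
            if link not in seen:
                seen.add(link)
                queue.append(link)      # place extracted URLs on the queue
    return visited

# Usage with a toy in-memory "web" instead of real HTTP:
web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}
print(crawl(["a"], lambda u: web.get(u, [])))  # ['a', 'b', 'c', 'd']
```

The `seen` set is what keeps the "repeat" step from looping forever on cyclic link structures such as `d -> a` above.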









Web crawling is not feasible with one machine: all of the above steps must be distributed.
  • Malicious pages
    • Spam pages
    • Spider traps, including dynamically generated ones
  • Even non-malicious pages pose challenges
    • Latency/bandwidth to remote servers vary
    • Webmasters' stipulations: how "deep" should you crawl a site's URL hierarchy?
    • Site mirrors and duplicate pages
  • Politeness: do not hit a server too often
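The politeness requirement (do not hit a server too often) is often enforced with a per-host rate limiter. A minimal sketch, assuming a single-threaded fetch loop; the `PolitenessGate` class name and the 1-second default delay are illustrative choices, not a standard, and the injectable `clock`/`sleep` parameters exist only to make the sketch easy to test without real waiting.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Tracks the time of the last request per host and blocks until
    at least `min_delay` seconds have passed before hitting the same
    server again. Requests to different hosts are never delayed."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.last_fetch = {}   # host -> timestamp of last request
        self.clock = clock
        self.sleep = sleep

    def wait(self, url):
        """Call immediately before fetching `url`; sleeps if needed."""
        host = urlparse(url).netloc
        now = self.clock()
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            self.sleep(self.min_delay - (now - last))
        self.last_fetch[host] = self.clock()
```

A distributed crawler would need a shared variant of `last_fetch` (or host-based partitioning of the URL queue, so each host is owned by one fetcher), since the point of the delay is per-server load, not per-machine load.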
