Moving into the user sessions on the first day of the MySQL Conference 2007, I attended "Building a Vertical Search Engine in a Day".
Here are some of my notes for reference.
Web Crawling 101
- Injection list: the seed URLs you are starting from
 - Fetching the pages
 - Parsing the content - words and links
 - Updating the crawl DB
 - Whitelist
 - Blacklist
 - Convergence — avoiding the honey pots
 - Index
 - Map-reduce — split a large problem into little pieces, process in parallel, then combine results
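
To make the steps above concrete, here is a minimal sketch of one crawl loop: start from the injection list, fetch, parse words and links, update the crawl DB, apply the whitelist and blacklist, and build a small index. This is not code from the talk. The names `FAKE_WEB`, `LinkAndWordParser`, and `crawl` are hypothetical, the "web" is an in-memory dict so the example runs without network access, and the `max_pages` cap plus the `seen` set are only a crude stand-in for real convergence handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://spam.test/x">spam</a> mysql search',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a> vertical search',
    "http://example.com/b": "crawl index",
    "http://spam.test/x": "junk",
}


class LinkAndWordParser(HTMLParser):
    """Collects outgoing links and lowercase words from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seeds, whitelist, blacklist, fetch, max_pages=100):
    """Breadth-first crawl; returns (crawl_db, index)."""
    crawl_db = {}          # url -> words found on that page
    index = {}             # word -> set of urls containing it
    queue = deque(seeds)   # the injection list seeds the frontier
    seen = set(seeds)      # never queue the same URL twice
    while queue and len(crawl_db) < max_pages:
        url = queue.popleft()
        host = urlparse(url).hostname
        if host in blacklist or host not in whitelist:
            continue
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndWordParser(url)
        parser.feed(html)
        parser.close()
        crawl_db[url] = parser.words
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawl_db, index


crawl_db, index = crawl(
    seeds=["http://example.com/"],
    whitelist={"example.com"},
    blacklist=set(),
    fetch=FAKE_WEB.get,
)
```

In a real crawler `fetch` would be an HTTP client that respects robots.txt and rate limits, and the parsing and indexing steps are the parts that would typically be pushed through map-reduce.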
 
Focused content == vertical crawl
- 20 billion pages out there, a lot of it junk
 - Breadth-first would take years and cost millions of dollars
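
One common way to avoid a full breadth-first crawl is a best-first frontier: links found on on-topic pages are visited before links found on junk. My notes do not record how the talk itself prioritised URLs, so treat the sketch below as an illustration only. The function names, the term-overlap `score`, and the sample `pages` and `links` data are all hypothetical.

```python
import heapq


def score(text, topic_terms):
    """Fraction of words on the page that are topic terms (a toy relevance measure)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in topic_terms)
    return hits / len(words)


def focused_crawl(seeds, pages, links, topic_terms, max_pages=10):
    """Visit pages in order of the relevance of the page that linked to them."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text = pages.get(url)
        if text is None:
            continue
        visited.append(url)
        page_score = score(text, topic_terms)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-page_score, link))
    return visited


# Hypothetical in-memory data: url -> page text, url -> outgoing links.
pages = {
    "seed": "mysql search engine",
    "on": "mysql index tuning",
    "off": "celebrity gossip news",
    "deep_on": "mysql replication",
    "deep_off": "more gossip",
}
links = {"seed": ["off", "on"], "on": ["deep_on"], "off": ["deep_off"]}
topic_terms = {"mysql", "search", "index", "replication"}

visited = focused_crawl(["seed"], pages, links, topic_terms, max_pages=4)
```

With a budget of four pages, the link found on the on-topic page is fetched ahead of the link found on the gossip page, which is the point of a vertical crawl: spend the limited fetch budget where relevant content is likely to be.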