Moving into the user sessions on the first day at MySQL Conference 2007, I attended Building a Vertical Search Engine in a Day.
Some of my notes for reference.
Web Crawling 101
- Injection list - the seed URLs you are starting from
- Fetching the pages
- Parsing the content - words and links
- Updating the crawl DB
- Whitelist
- Blacklist
- Convergence — avoiding the honey pots
- Index
- Map-reduce — split a large problem into little pieces, process in parallel, then combine results
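To make the steps above concrete for myself, here is a minimal sketch of the crawl loop in Python. It is not what the speaker showed. The names (`crawl`, `LinkAndWordParser`, `fetch`, `max_pages`) are my own, the whitelist and blacklist are assumed to be sets of host names, and the page cap is only a crude stand-in for real convergence handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndWordParser(HTMLParser):
    """Collects outgoing links and words from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Resolve relative hrefs against the page URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seeds, fetch, whitelist, blacklist, max_pages=100):
    """Crawl from the seed URLs; `fetch(url)` returns HTML or None.

    Assumption: whitelist and blacklist are sets of host names.
    """
    frontier = deque(seeds)   # injection list: the seed URLs
    crawl_db = {}             # url -> "fetched" or "failed"
    index = {}                # word -> set of urls containing it

    # max_pages is a crude guard against crawling forever.
    while frontier and len(crawl_db) < max_pages:
        url = frontier.popleft()
        host = urlparse(url).netloc
        if url in crawl_db or host in blacklist or host not in whitelist:
            continue

        html = fetch(url)                      # fetch the page
        if html is None:
            crawl_db[url] = "failed"
            continue

        parser = LinkAndWordParser(url)        # parse words and links
        parser.feed(html)
        parser.close()

        crawl_db[url] = "fetched"              # update the crawl DB
        for word in parser.words:              # update the index
            index.setdefault(word, set()).add(url)
        frontier.extend(parser.links)          # queue newly found links

    return crawl_db, index
```

Passing `fetch` in as a callable keeps the sketch testable without a network: a dictionary's `get` method works as a fake web, and a real HTTP client could be swapped in later.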
Focused content == vertical crawl
- 20 billion pages out there, a lot of junk
- Breadth-first would take years and cost millions