Why is MSNBot ignoring robots.txt?

Today, the root file system on our public SVN server nearly ran out of disk space. The reason? The /tmp directory was quickly filling up with temporary files created by websvn, which I had set up in parallel to the FishEye repository browser for testing purposes. A quick investigation of the Apache log files revealed the culprit: a crawler from Microsoft was running haywire and ignoring the rules in the robots.txt file, even though it had actually fetched the file beforehand!

Here is what robots.txt looked like (I have since changed it to disallow everything):

User-agent: *
Disallow: /fisheye/
Disallow: /websvn/
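
As a sanity check on my reading of these rules, here is a minimal sketch using Python's standard urllib.robotparser module. It only shows how a standards-following parser interprets the file; it says nothing about what msnbot does internally. The helper name is_allowed and the example paths are my own, chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules as they were served at the time (copied from this post).
ROBOTS_TXT = """\
User-agent: *
Disallow: /fisheye/
Disallow: /websvn/
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if the rules above permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, picked to cover both disallowed trees and one allowed page.
    for path in ("/websvn/dl.php?repname=Eventum", "/fisheye/", "/index.html"):
        print(path, is_allowed("msnbot/1.1", path))
```

Since the only record is "User-agent: *", the same answer applies to any user agent string, msnbot-media included.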

If I am not mistaken, no crawler should even consider going into the SVN browser directories. Some snippets from the Apache log:

$ grep robots.txt /var/log/apache2/access_log | grep msn
65.55.208.178 - - [03/Aug/2008:16:58:35 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [03/Aug/2008:19:05:55 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.235.139 - - [03/Aug/2008:22:14:47 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.25.136 - - [04/Aug/2008:00:31:32 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [04/Aug/2008:00:57:38 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.235.139 - - [04/Aug/2008:06:49:33 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [04/Aug/2008:07:16:21 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.25.136 - - [04/Aug/2008:09:29:17 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.104.156 - - [04/Aug/2008:11:08:24 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [04/Aug/2008:11:29:34 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.212.64 - - [05/Aug/2008:13:30:20 +0200] "GET /robots.txt HTTP/1.0" 200 53 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
65.55.208.178 - - [05/Aug/2008:16:17:59 +0200] "GET /robots.txt HTTP/1.1" 200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

Good boy, it checks the robots.txt file. But what is this?

$ grep msnbot /var/log/apache2/access_log | tail -20
65.55.208.164 - - [05/Aug/2008:22:48:15 +0200] "GET /websvn/filedetails.php?repname=MySQL+Documentation&path=%2Fworkbench%2Fall-entities.ent&rev=9981&sc=1 HTTP/1.1" 200 6408 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:15 +0200] "GET /websvn/dl.php?repname=MySQL+Connector%2FJ&path=%2Fbranches%2Fbranch_5_0%2Fconnector-j%2F&rev=6600&isdir=1 HTTP/1.1" 200 40960 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:19 +0200] "GET /websvn/rss.php?repname=MySQL+Documentation&path=%2Fproto-doc%2F&rev=9994&sc=1&isdir=1 HTTP/1.1" 200 36907 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=MySQL+Documentation&path=%2Ffalcon%2F&rev=8323&sc=0&isdir=1 HTTP/1.1" 200 15278 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=MySQL+Proxy&path=%2Ftrunk%2FDoxyfile&rev=365&sc=1&isdir=0 HTTP/1.1" 200 4162 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?repname=Eventum&path=%2Feventum%2Freports%2F&rev=3542&sc=1&isdir=1 HTTP/1.1" 200 90591 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:23 +0200] "GET /websvn/log.php?repname=MySQL+Documentation&path=%2Fndbapi%2F&rev=9749&sc=0&isdir=1 HTTP/1.1" 200 21440 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.208.164 - - [05/Aug/2008:22:48:23 +0200] "GET /websvn/log.php?repname=MySQL+Documentation&path=%2Ffalcon%2F&rev=8511&sc=0&isdir=1 HTTP/1.1" 200 18541 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

As you can see, it is happily crawling everything below /websvn/, which also includes links named "Tarball" - guess what they are good for? Yes, they create tarballs of a given SVN directory, using /tmp to build up the archive file... Within a very short time, this used up more than 6 GB of disk space, because websvn seems to leave these temporary directories behind if the connection is aborted or times out. We do have a cron job that wipes files older than a certain number of days from /tmp, but /tmp currently fills up much faster than the cron job clears it. I need to investigate whether it is actually a bug in websvn to leave these temporary directories behind.
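
To get a rough count of how often a bot hit the disallowed paths, something along these lines could be run over the access log. This is only a sketch under a few assumptions: the log is in Apache's "combined" format (as in the excerpts above), the bot is identified by a substring of its user agent, and the function name disallowed_hits and the regular expression are my own. The two sample lines are copied from the log excerpts in this post.

```python
import re

# Prefixes that robots.txt disallowed at the time (from this post).
DISALLOWED_PREFIXES = ("/fisheye/", "/websvn/")

# Apache "combined" log format: request line and user agent are quoted fields.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

# Two lines taken from the excerpts above: one robots.txt fetch, one websvn hit.
SAMPLE = [
    '65.55.208.178 - - [03/Aug/2008:16:58:35 +0200] "GET /robots.txt HTTP/1.1" '
    '200 53 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
    '65.55.208.164 - - [05/Aug/2008:22:48:21 +0200] "GET /websvn/rss.php?'
    'repname=MySQL+Proxy&path=%2Ftrunk%2FDoxyfile&rev=365&sc=1&isdir=0 HTTP/1.1" '
    '200 4162 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
]


def disallowed_hits(lines, agent_substring="msnbot"):
    """Return (ip, time, path) for each request by the agent to a disallowed path."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if agent_substring not in match.group("agent").lower():
            continue
        if match.group("path").startswith(DISALLOWED_PREFIXES):
            hits.append((match.group("ip"), match.group("time"), match.group("path")))
    return hits


if __name__ == "__main__":
    for ip, time, path in disallowed_hits(SAMPLE):
        print(ip, time, path)
```

To use it on the real log, the SAMPLE list would be replaced by the lines of an open file such as /var/log/apache2/access_log.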

Hello Microsoft? Can you please fix your bots so that they not only read but also honor robots.txt files and stop DoSing our site? Thanks!