# See for
#   detailed info on excluding robots from a site.
#
# See for
#   a way to validate the contents of this file.
#
# updated: 14-Jan-2006, George A. Theall

# Linkcheckers get pretty much free rein.
#
# o MOMspider can get everywhere.
User-agent: MOMspider
Disallow: /nogo

# Selected search engine 'bots get pretty much free rein.
#
# nb:
#   appie => Walhello, http://www.walhello.com/
#   boitho.com-dc => Boitho, http://www.boitho.com/, Norwegian search engine
#   fast => Fastsearch (used by alltheweb.com)
#   gaisbot => Gais, http://gais.cs.ccu.edu.tw/, Taiwanese search engine
#   GalaxyBot => Galaxy, http://www.galaxy.com/
#   Googlebot => Google
#   Mercator + Scooter => AltaVista
#   Mj12bot => Majestic-12, http://www.majestic12.co.uk/projects/dsearch/mj12bot.php, a distributed search engine.
#   mogimogi => http://www.goo.ne.jp/, Japanese search engine.
#   mozDex => http://www.mozdex.com/, an open source search engine
#   msnbot => MSN Search.
#   NG => Exalead, http://www.exalead.com/, French search engine
#   Nutch =>, http://www.nutch.org/, open-source search engine
#   Pompos => dir.com, http://dir.com, French search engine
#   QuepasaCreep => quepasa.com, Latin American portal / search engine
#   Slurp => Inktomi (includes MSN Search and HotBot)
#   VIAS => http://vias.ncsa.uiuc.edu/viasarchivinginformation.html
#   VoilaBot => http://www.voila.com (French search engine)
#   Zao => Kototai, http://www.kototai.org/, Japanese search engine research project
#   ZyBorg => WiseNut, http://www.wisenut.com/, and Looksmart
User-agent: appie
User-agent: boitho.com-dc
User-agent: fast
User-agent: gaisbot
User-agent: GalaxyBot
User-agent: Googlebot
User-agent: Mercator
User-agent: Mj12bot
User-agent: mogimogi
User-agent: mozDex
User-agent: msnbot
User-agent: NG
User-agent: Nutch
User-agent: Pompos
User-agent: QuepasaCreep
User-agent: Scooter
# NB: for the month of July 2004, Inktomi's slurp 'bot has done nothing
#     but try to grab invalid URLs (other than robots.txt), URLs that
#     *never* existed here.
#     Can you say "database corruption"? :-(
#User-agent: Slurp
User-agent: VIAS
User-agent: VoilaBot
User-agent: Zao
# NB: starting in January 2005, Looksmart seems to have switched from
#     WiseNut to grub for its crawler. The latter doesn't bother
#     requesting robots.txt and doesn't seem to understand response
#     codes of 403. So should WiseNut ever come back, screw 'em.
# User-agent: Zyborg
Disallow: /cgi-bin
Disallow: /code
Disallow: /hidden
Disallow: /icons
Disallow: /nogo
Disallow: /zips
Disallow: /~amanda/pics
Disallow: /~amanda/videos
Disallow: /~gpt/pics
Disallow: /~gpt/videos
Disallow: /~theall/bookmarks
Disallow: /~theall/wedding

# Other 'bots that I'm ok with.
# o IBM Almaden Research Center.
User-agent: http://www.almaden.ibm.com/cs/crawler
Disallow: /cgi-bin
Disallow: /code
Disallow: /hidden
Disallow: /icons
Disallow: /nogo
Disallow: /zips
Disallow: /~amanda/pics
Disallow: /~amanda/videos
Disallow: /~gpt/pics
Disallow: /~gpt/videos
Disallow: /~theall/bookmarks
Disallow: /~theall/wedding

# o The Internet Archive, http://www.archive.org/.
User-agent: ia_archiver
Disallow: /cgi-bin
Disallow: /code
Disallow: /hidden
Disallow: /icons
Disallow: /nogo
Disallow: /zips
Disallow: /~amanda/pics
Disallow: /~amanda/videos
Disallow: /~gpt/pics
Disallow: /~gpt/videos
Disallow: /~theall/bookmarks
Disallow: /~theall/wedding

# o LinkWalker, http://www.seventwentyfour.com/, for checking links.
User-agent: LinkWalker
Disallow: /cgi-bin
Disallow: /code
Disallow: /hidden
Disallow: /icons
Disallow: /nogo
Disallow: /zips
Disallow: /~amanda/pics
Disallow: /~amanda/videos
Disallow: /~gpt/pics
Disallow: /~gpt/videos
Disallow: /~theall/bookmarks
Disallow: /~theall/wedding

# o research project from Kitsuregawa Laboratory, The University of Tokyo.
User-agent: Steeler
Disallow: /cgi-bin
Disallow: /code
Disallow: /hidden
Disallow: /icons
Disallow: /nogo
Disallow: /zips
Disallow: /~amanda/pics
Disallow: /~amanda/videos
Disallow: /~gpt/pics
Disallow: /~gpt/videos
Disallow: /~theall/bookmarks
Disallow: /~theall/wedding

# All robots are excluded by default. Please direct requests to
# allow access to webmaster@tifaware.com.
#
# 'bots I know about but don't want to bother with
#   o Girafabot
#     Used by girafa.com to visualize search results. I'd be ok
#     with this if only they'd respect robots.txt.
#   o grub-client, http://grub.org/html/documents.php?op=robots-faq
#     Distributed crawler for the grub search engine. I'd be ok
#     with this if only they'd respect robots.txt.
#   o lachesis, ftp://ftp.imag.fr/pub/labo-LSR/DRAKKAR/internet-performance/lachesis/
#     Supposedly an Intel tool for measuring ISP latency, although
#     after examining it I think it's mis-identified.
#   o larbin, http://larbin.sourceforge.net/index-eng.html
#     Multi-purpose web crawler.
#   o Mozilla/4.0 (efp@gmx.net)
#     Spammer tool to scrape email addresses.
#   o NPBot, http://www.nameprotect.com/botinfo.html
#     Used by NameProtect to scan for brand / IP violations.
#   o Psbot, http://www.picsearch.com/bot.html
#     Used by Picsearch to index pictures. I don't really have any
#     pictures here that I want indexed.
#   o Teoma
#     Used by AskJeeves search engine. I'd be ok with it if only
#     it would respect exclusions in robots.txt.
#   o TurnitinBot, http://www.turnitin.com/robot/crawlerinfo.html
#     Used by Turnitin.com to prevent plagiarism.
User-agent: *
Disallow: /
Disallow: /nogo