Restricting Robots

Robots Exclusion Standard

  • A convention for keeping robots out of all or part of a publicly viewable web site
  • Uses a robots.txt file to blacklist unwanted robots or whitelist wanted robots
  • Blocking all bots with a few exceptions is usually easier to maintain than blacklisting every unwanted bot, because new bots keep appearing

Robots.txt

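  • Name the file robots.txt (lowercase) and place it in the root of your web site hierarchy, so that it is served at /robots.txt
  • The file contains groups of User-agent: and Disallow: lines in the format shown in the examples below
  • Disallow: with no path is equivalent to allowing everything
  • The file is advisory: it only affects bots that obey the Robots Exclusion Protocol, so it is not an access control mechanism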
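
The check is performed by the bot itself, which is why only cooperating bots are affected. As a minimal sketch of what a well-behaved bot might do before fetching a page, the following uses Python's standard urllib.robotparser module. The helper name is_allowed, the bot name ExampleBot, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"

page = "https://example.com/page.html"  # hypothetical URL
print(is_allowed(BLOCK_ALL, "ExampleBot", page))       # False
print(is_allowed(EMPTY_DISALLOW, "ExampleBot", page))  # True
```

A bot that never runs a check like this is not restricted by the file at all.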

Robots.txt Block All Robots Example

  • # To exclude all robots from the entire site
  • User-agent: *
  • Disallow: /

Robots.txt Allow All Robots Example

  • # To allow all robots complete access
  • User-agent: *
  • Disallow:
  • # Alternatively, serve an empty robots.txt file or omit the file entirely
  • # The original 1994 standard defines no "Allow" field; RFC 9309 and most major crawlers now recognize one

Robots.txt Excluding Specific Folder Hierarchies Example

  • # To exclude all robots from part of the server
  • User-agent: *
  • Disallow: /globals/
  • Disallow: /temp/
  • Disallow: /dev-only/
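
Each Disallow: path is matched as a prefix of the requested path, so the trailing slash matters. The sketch below feeds the rules above to Python's standard urllib.robotparser and tries a few paths; the bot name ExampleBot and the sample paths are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# To exclude all robots from part of the server
User-agent: *
Disallow: /globals/
Disallow: /temp/
Disallow: /dev-only/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Check one path on behalf of a hypothetical bot named ExampleBot."""
    return parser.can_fetch("ExampleBot", path)


print(allowed("/index.html"))            # True: no rule matches
print(allowed("/temp/cache.html"))       # False: under /temp/
print(allowed("/dev-only/"))             # False: the folder itself
print(allowed("/temporary/notes.html"))  # True: "/temp/" is not a prefix of this path
```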

Robots.txt Blacklisting Specific Robots Example

  • # blocking only specific search robots
  • User-agent: 123People
  • User-agent: findestars
  • User-agent: MyOnID
  • User-agent: peekyou
  • Disallow: /
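
Several User-agent: lines can share one set of rules, and any bot not named in the group stays unrestricted. A short sketch using Python's standard urllib.robotparser shows how such a file is interpreted; the path /about.html is a made-up example, and bots are identified here by their bare name rather than a full User-Agent header.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# blocking only specific search robots
User-agent: 123People
User-agent: findestars
User-agent: MyOnID
User-agent: peekyou
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The first two bots are named in the group; Googlebot is not, and there is
# no "User-agent: *" group, so it falls through to the default of "allowed".
for bot in ("peekyou", "MyOnID", "Googlebot"):
    print(bot, parser.can_fetch(bot, "/about.html"))
```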

Robots.txt Whitelisting Specific Robots Example

  • # allowing only specific robots
  • User-agent: ArchitextSpider
  • User-agent: Baiduspider
  • User-agent: Googlebot
  • User-agent: IsraBot
  • User-agent: Jeeves
  • User-agent: msnbot
  • User-agent: Orthogaffe
  • User-agent: Scooter
  • User-agent: Slurp
  • User-agent: Teoma
  • User-agent: Yahoo-Blogs
  • User-agent: Yahoo-MMCrawler
  • User-agent: Yandex
  • Disallow:
  • # disallow all the rest
  • User-agent: *
  • Disallow: /
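
Because a whitelist is just a list of names followed by two fixed groups, it can be generated from a list. The following sketch builds such a file and checks it with Python's standard urllib.robotparser. The function name build_whitelist, the shortened bot list, and the name UnlistedBot are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def build_whitelist(allowed_bots: list[str]) -> str:
    """Build a robots.txt body that admits only the named bots."""
    lines = [f"User-agent: {bot}" for bot in allowed_bots]
    lines.append("Disallow:")      # empty path: the listed bots may fetch everything
    lines.append("User-agent: *")  # every other bot falls into this group
    lines.append("Disallow: /")    # and is excluded from the whole site
    return "\n".join(lines) + "\n"


robots_txt = build_whitelist(["Baiduspider", "Googlebot", "msnbot", "Slurp", "Yandex"])

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("Googlebot", "/index.html"))    # True
print(parser.can_fetch("UnlistedBot", "/index.html"))  # False
```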

Robots-NoContent Example

  • robots-nocontent is a value of the class attribute (class="robots-nocontent"), not a meta tag
  • Content marked this way is ignored by the Yahoo! crawler and not included in its search index
  • Can be applied to any Web page tag as needed
    • <div class="robots-nocontent">excluded content</div>
    • <span class="robots-nocontent">excluded content</span>
    • <p class="robots-nocontent">excluded content</p>
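
To illustrate the effect, here is a rough sketch of how an indexer might skip marked elements, written with Python's standard html.parser module. It is not Yahoo!'s implementation; the class NoContentFilter, the function indexable_text, and the sample markup are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoContentFilter(HTMLParser):
    """Collect text, skipping anything inside a robots-nocontent element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.kept.append(data)


def indexable_text(html: str) -> str:
    parser = NoContentFilter()
    parser.feed(html)
    parser.close()
    # Join the kept fragments and normalize whitespace
    return " ".join(" ".join(parser.kept).split())


SAMPLE = ('<p>Keep this.</p>'
          '<div class="robots-nocontent">excluded <span>nested</span> content</div>'
          '<p>And this.</p>')
print(indexable_text(SAMPLE))  # Keep this. And this.
```

The sketch assumes reasonably well-formed markup; unbalanced tags would throw off the depth counter.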
