robots.txt is a plain text file in the root directory of a website. In this file, the site administrator can declare which parts of the site should not be visited by robots, or specify that a given search engine may index only specified content.

When a search robot (also called a search spider or crawler) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot uses the contents of the file to determine what it may access; if the file does not exist, the robot simply crawls along the links.

robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.

The Robots protocol (also known as the crawler protocol, robot protocol, etc.) is formally called the Robots Exclusion Protocol. Through it, a website tells search engines which pages may be crawled and which may not.
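Because the file always lives at the site root under a fixed lowercase name, a crawler can derive its location from any page URL. The following is a minimal Python sketch of that step; the function name robots_txt_url and the example.com URL are illustrative assumptions, not part of any crawler's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host; drop the path, query and fragment,
    # because robots.txt is only looked up at the site root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post.html?page=2"))
```

A real crawler would then download that URL and fall back to unrestricted crawling if the file is missing.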

robots.txt syntax

  • User-agent

    A robots.txt file generally starts with User-agent:. The value of this field is the name of the search engine robot.

    For example, User-agent: Baiduspider introduces rules that apply to the Baidu spider. A robots.txt file must contain at least one User-agent record. If the value is set to * (wildcard), the rules apply to any search engine robot. Only one User-agent: * record may appear in the file.



  • Disallow

    Describes a URL that you do not want search robots to visit. The value is written relative to the site root, that is, with the domain name removed. It can be a complete path to a file or just a directory name.

    Each Disallow line corresponds to one file or directory; write as many Disallow lines as there are files or directories to block.

    Disallow: /admin/ blocks crawling of everything under the admin directory.

    Disallow: /cgi-bin/*.htm blocks access to all URLs under the /cgi-bin/ directory (including subdirectories) that have the ".htm" suffix.

    Disallow: /*.jpg$ blocks crawling of all images in .jpg format.

    Disallow: /ab/adc.html blocks crawling of the adc.html file in the ab folder.

    Blocking access to a file in the Admin directory


    Blocking a particular search engine crawler entirely:

    User-agent: BadBot

    Disallow: /

  • Allow

    Allow: /cgi-bin/ allows crawling of everything under the cgi-bin directory.

    Allow: /tmp allows crawling of the entire tmp directory.

    Allow: /*.htm$ permits access to URLs ending in ".htm" (to permit only those URLs, pair it with Disallow: /).

    Allow: /*.gif$ keeps GIF-format images crawlable alongside web pages.

  • Sitemap

    Sitemap: the site map. It tells the crawler that the given page is a site map.

  Sitemap: <>

Allow and Sitemap were not part of the original robots.txt convention, and possibly only some large search engines support them. To ensure compatibility, it is recommended to use only User-agent and Disallow in robots.txt.

User-agent: is followed by the name of the search robot; if it is *, the record applies to all search robots. Disallow: is followed by the file or directory that must not be accessed.
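How User-agent and Disallow records combine can be checked with Python's standard urllib.robotparser module. The sketch below is only an illustration: the rule text reuses the BadBot, /admin/ and /ab/adc.html examples from this article, while the name SomeBot and the helper allowed are assumptions made for the demo. Note that this module matches Disallow values as plain path prefixes, so it is not a model of how every search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: one crawler is blocked entirely,
# everyone else is kept out of two specific locations.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/
Disallow: /ab/adc.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from memory, no network access


def allowed(agent, path):
    """Return True if the given robot may fetch the given path."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("BadBot", "/index.html"),
    ("SomeBot", "/index.html"),
    ("SomeBot", "/admin/login.html"),
    ("SomeBot", "/ab/adc.html"),
]:
    print(agent, path, allowed(agent, path))
```

The record naming a specific robot takes precedence for that robot; the * record is used only when no named record applies.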

Usage examples:

Block all robots from accessing specific file types

User-agent: *

Disallow: /*.js$

Disallow: /*.inc$

Disallow: /*.css$
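Suffix rules like these depend on two extensions that some crawlers support: * stands for any sequence of characters and a trailing $ anchors the rule to the end of the URL. The following Python sketch shows one way such a rule could be matched, assuming exactly those semantics; rule_to_regex and rule_matches are hypothetical helpers, not any search engine's implementation. It also shows that a rule written without the *, such as /.js$, would match only the literal path /.js.

```python
import re


def rule_to_regex(rule):
    """Translate a rule path with * and an optional trailing $ into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece, then let each * match any characters.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule, path):
    # re.match anchors at the start of the path, like a prefix rule.
    return rule_to_regex(rule).match(path) is not None


print(rule_matches("/*.js$", "/static/app.js"))
print(rule_matches("/*.js$", "/static/app.js?v=2"))
print(rule_matches("/.js$", "/static/app.js"))
print(rule_matches("/cgi-bin/*.htm", "/cgi-bin/a/b.html"))
```

Without the trailing $, the rule behaves as a prefix match, which is why /cgi-bin/*.htm also covers .html files in this sketch.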

Block all robots

User-agent: *

Disallow: /

Allow all robots

User-agent: *

Disallow:



Other ways to influence search engine behavior include the robots meta tag:

<meta name="robots" content="noindex,nofollow" />

The Robots META tag mainly targets individual pages. Like other META tags (such as the language used, the page description, and keywords), the Robots META tag is placed in the page's <head></head> section, and it tells search engine robots how to crawl the content of the page.

How to write the Robots META tag: the tag is not case-sensitive. name="Robots" addresses all search engines; for a specific search engine it can be written as, for example, name="BaiduSpider".

The content attribute has four directive options: index, noindex, follow, and nofollow, separated by ",". The INDEX directive tells the search robot that it may index the page; the FOLLOW directive tells the search robot that it may continue to crawl along the links on the page. The default value of the Robots META tag is INDEX and FOLLOW, with the exception of Inktomi, for which the default is INDEX,NOFOLLOW. This gives four combinations:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can also be written as <META NAME="ROBOTS" CONTENT="NONE">.

At present, most search engine robots follow the rules in robots.txt, while support for the Robots META tag is less widespread but growing. The well-known search engine Google supports it fully, and Google has also added an "archive" directive (noarchive in its negated form) that controls whether Google keeps a snapshot of the page. For example:

<meta name="googlebot" content="index,follow,noarchive">

This means: index the pages on this site and crawl along the links in them, but do not keep a snapshot of the page on Google.
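To make the tag handling concrete, here is a small Python sketch that reads the directives addressed to a given robot from a page's META tags, using the standard html.parser module. The class _MetaCollector, the function robots_directives and the sample page are assumptions made for illustration; the sketch only extracts directives and does not decide how a real crawler would act on them.

```python
from html.parser import HTMLParser


class _MetaCollector(HTMLParser):
    """Collect content directives from <meta> tags addressed to one robot."""

    def __init__(self, agent):
        super().__init__()
        self.agent = agent.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = dict(attrs)
        # The tag is case-insensitive, so compare the name in lowercase.
        if (attr.get("name") or "").lower() == self.agent:
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html, agent="robots"):
    """Return the set of directives that the page gives to the named robot."""
    collector = _MetaCollector(agent)
    collector.feed(html)
    return collector.directives


page = (
    '<html><head>'
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    '<meta name="googlebot" content="index,follow,noarchive" />'
    '</head><body></body></html>'
)
print(sorted(robots_directives(page)))
print(sorted(robots_directives(page, "googlebot")))
```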


