Writing a Website's robots.txt File

Intro

robots.txt is a plain text file placed in the root directory of a website. In it, the site administrator can declare which parts of the site should not be visited by robots, or specify that a given search engine may only index certain content.

When a search robot (sometimes called a search spider or crawler) visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the robot determines the scope of its visit according to the contents of that file; if the file does not exist, the robot simply crawls along whatever links it finds.

robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.

The Robots protocol (also called the crawler protocol or robot protocol), whose full name is the Robots Exclusion Protocol, is the mechanism by which a website tells search engines which pages may be crawled and which may not.
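As a sketch of this check from the crawler's side, Python's standard urllib.robotparser module can download a site's robots.txt and answer whether a given URL may be fetched before any page is requested (the domain and paths below are placeholders):

    import urllib.robotparser

    # Download and parse the site's robots.txt (placeholder domain).
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()

    # Ask whether a given user-agent may fetch a given URL.
    print(rp.can_fetch("*", "http://www.example.com/admin/page.html"))  # False if /admin/ is disallowed
    print(rp.can_fetch("*", "http://www.example.com/index.html"))       # True if not blocked

If the file does not exist (HTTP 404), the parser allows everything, which matches the behavior described above.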

robots.txt syntax

  • User-agent

    A robots.txt file normally begins with a User-agent: line. The value of this field names the search engine robot to which the following rules apply.

    For example, User-agent: Baiduspider defines a set of rules for Baidu's spider. A robots.txt file must contain at least one User-agent record. If the value is set to * (a wildcard), the rules apply to every search engine robot, and there can be only one such wildcard record.

    User-agent: Baiduspider

    User-agent: *

  • Disallow

    Disallow describes a URL that you do not want search robots to visit. The value is written relative to the site root, i.e. with the domain name removed, leaving only the directory or file path (beginning with /).

    Each Disallow line corresponds to one file or directory; write as many Disallow lines as there are directories or files you need to block.

    Disallow: /admin/    Blocks crawling of everything under the /admin/ directory.

    Disallow: /cgi-bin/*.htm    Blocks access to all URLs under the /cgi-bin/ directory (including subdirectories) with the ".htm" suffix.

    Disallow: /*.jpg$    Blocks crawling of all .jpg images.

    Disallow: /ab/adc.html    Blocks crawling of the adc.html file under the /ab/ directory.

    Block access to files in the /Admin/ directory:

    Disallow: /Admin/

    Block a specific search engine crawler:

    User-agent: BadBot

    Disallow: /

  • Allow

    Allow: /cgi-bin/    Allows crawling of everything under the /cgi-bin/ directory.

    Allow: /tmp    Allows crawling of the entire /tmp directory.

    Allow: /*.htm$    Allows access only to URLs ending in ".htm".

    Allow: /*.gif$    Allows crawling of web pages and .gif images (used together with Disallow rules that block other content).

    (A combined Disallow/Allow example is shown under "Examples of use" below.)

  • Sitemap

    The Sitemap directive tells crawlers where the website's sitemap file is located.

    Sitemap: http://www.example.com/sitemap.xml
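Putting these directives together, a complete robots.txt might look like the following sketch (the domain, paths and robot name are illustrative only); it contains one group of records for a specific robot and one wildcard group for all other robots:

User-agent: Baiduspider
Disallow: /admin/
Disallow: /tmp/

User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Allow: /cgi-bin/public/

Sitemap: http://www.example.com/sitemap.xml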

Allow and Sitemap are non-standard extensions and may only be supported by some of the large search engines. To ensure compatibility, it is recommended to use only User-agent and Disallow in robots.txt.

The value after User-agent: is the name of the search robot; if it is *, the record applies to all search robots. The value after Disallow: is the directory or file path that must not be accessed.

Examples of use:

Block all robots from accessing specific file types:

User-agent: *

Disallow: /*.js$

Disallow: /*.inc$

Disallow: /*.css$

Block all robots:

User-agent: *

Disallow: /

Allow all robots:

User-agent: *

Disallow:
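Open up one subdirectory inside an otherwise blocked directory (the combined Disallow/Allow example referred to above; paths are illustrative):

User-agent: *

Allow: /photos/public/

Disallow: /photos/

Note that crawlers resolve conflicts between Allow and Disallow differently: Google applies the most specific (longest) matching rule, while some older parsers use the first rule that matches, so placing the Allow line before the broader Disallow line is the safer ordering.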

More

Another way to influence search engine behavior is the robots meta tag:

<meta name="robots" content="noindex,nofollow" />

The robots META tag applies to a specific page. Like other META tags (such as the language, the page description, or keywords), it is placed in the page's <head></head> section and tells search engine robots how to crawl the content of that page.

How to write the robots META tag: the tag is case-insensitive. name="Robots" applies to all search engines; to address a specific search engine, write its name instead, for example name="Baiduspider".

The content attribute has four directive options: index, noindex, follow and nofollow, separated by ",". The index directive tells the search robot to index the page; the follow directive tells it that it may continue crawling along the links on the page. The default value of the robots meta tag is index, follow, with the exception of the Inktomi engine, for which the default was index, nofollow. This gives four combinations:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

Here <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can also be written as <META NAME="ROBOTS" CONTENT="ALL">, and <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> can be written as <META NAME="ROBOTS" CONTENT="NONE">.

At present, most search engine robots follow the robots.txt rules, while support for the robots META tag is still limited, although it is growing. Google, for example, supports it fully, and Google has also added a directive, "archive", which controls whether Google keeps a cached snapshot of the page. For example:

<meta name="googlebot" content="index,follow,noarchive">

This tells Googlebot to index the page and follow the links on it, but not to keep a cached snapshot of the page on Google.
