Website robots.txt Documentation
robots.txt is a plain text file placed in the root directory of a website. In it, the site administrator can declare which parts of the site should not be visited by robots, or specify that a given search engine may only index certain content.

When a search robot (sometimes called a search spider or crawler) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines its crawling scope according to the contents of that file; if it does not, the robot simply follows links and crawls whatever it finds.

robots.txt must be placed in the root directory of the site, and the file name must be entirely lowercase.

The Robots protocol (also known as the crawler protocol or robot protocol), whose full name is the Robots Exclusion Protocol, is the mechanism by which a website tells search engines which pages may be crawled and which may not.
A robots.txt file generally starts with a User-agent: line, whose value names the search engine robot that the record applies to.

For example, User-agent: Baiduspider creates a set of rules that constrain Baidu's spider. A robots.txt file must contain at least one User-agent record. If the value is set to * (a wildcard), the record applies to every search engine robot; when the wildcard is used, there can be only one such record.
Disallow: describes a URL that you do not want search robots to visit. The URL can be a full path (domain name plus directory or file name) or a relative one (the domain removed, leaving only the directory or file name). Each Disallow line corresponds to one file or directory; write as many Disallow lines as you have paths to block.
Disallow: /admin/ — forbids crawling anything under the admin directory.
Disallow: /cgi-bin/*.htm — forbids access to every URL under /cgi-bin/ (including subdirectories) whose name ends in ".htm".
Disallow: /*.jpg$ — forbids crawling all images in .jpg format.
Disallow: /ab/adc.html — forbids crawling the file adc.html in the ab folder.
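The effect of plain-prefix Disallow rules like those above can be checked with Python's built-in urllib.robotparser. Note that the standard-library parser does not understand the * and $ wildcard extensions, so this sketch only uses prefix rules; example.com and the file names are hypothetical.

```python
import urllib.robotparser

# Hypothetical rules mirroring the prefix-style Disallow examples above
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /ab/adc.html
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Paths under /admin/ and the named file are blocked; anything else is allowed
print(rp.can_fetch("*", "https://example.com/admin/settings.html"))  # False
print(rp.can_fetch("*", "https://example.com/ab/adc.html"))          # False
print(rp.can_fetch("*", "https://example.com/about.html"))           # True
```

A well-behaved crawler makes exactly this check before requesting each URL.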
To block access to a single file in the Admin directory, name that file in a Disallow line, e.g. Disallow: /Admin/index.html (the file name here is only an example).
To disable access for one particular search engine crawler, write a record whose User-agent names that crawler, followed by Disallow: /.
Allow: /cgi-bin/ — allows crawling everything under the cgi-bin directory.
Allow: /tmp — allows crawling the entire tmp directory.
Allow: /*.htm$ — allows access only to URLs ending in ".htm".
Allow: /*.gif$ — allows crawling web pages and images in .gif format.
Sitemap: <URL of the sitemap> — tells the crawler where the site's sitemap is located.
Allow and Sitemap are non-standard extensions and may only be supported by some of the larger search engines. To ensure compatibility, it is safest to rely primarily on User-agent and Disallow in robots.txt.
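Putting the directives together, a complete robots.txt might look like the following. This is a hypothetical example: the paths and the sitemap URL are placeholders, not values from any real site.

```
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Allow: /cgi-bin/public/
Sitemap: https://example.com/sitemap.xml
```

Records are grouped by User-agent; a blank line conventionally separates one group from the next when several are present.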
In summary: User-agent: is followed by the name of a search robot (or *, meaning all robots), and Disallow: is followed by a directory or file that must not be accessed.
Examples of use:

Prohibit all robots from accessing a specific file type (for example, .gif):
User-agent: *
Disallow: /*.gif$

Block all robots from the entire site:
User-agent: *
Disallow: /

Allow all robots full access:
User-agent: *
Disallow:
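The "block all" and "allow all" policies above can be verified with urllib.robotparser as well. A minimal sketch, with example.com as a placeholder host:

```python
import urllib.robotparser

# "Block everything" policy: Disallow: / matches every path
block_all = urllib.robotparser.RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

# "Allow everything" policy: an empty Disallow value blocks nothing
allow_all = urllib.robotparser.RobotFileParser()
allow_all.parse(["User-agent: *", "Disallow:"])

# The wildcard record applies to any named crawler, Baiduspider included
print(block_all.can_fetch("Baiduspider", "https://example.com/page.html"))  # False
print(allow_all.can_fetch("Baiduspider", "https://example.com/page.html"))  # True
```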
Another way to influence search engine behavior is the robots meta tag:
<meta name="robots" content="noindex,nofollow" />
The robots META tag targets a specific page. Like other META tags (the language used, the page description, keywords), it is placed inside the page's <head></head> section and is used specifically to tell search engine robots how to crawl that page's content.

How to write the robots META tag: the tag is case-insensitive. name="robots" addresses all search engines; to address a specific one, write, for example, name="BaiduSpider".

The content attribute takes four directives — index, noindex, follow, nofollow — separated by commas. The index directive tells the search robot to index the page; the follow directive tells it that it may continue crawling along the page's links. The default value of the robots META tag is index, follow, with Inktomi as the one exception: for it, the default is index, nofollow. This gives four combinations:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Of these, <META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> can be abbreviated as <META NAME="ROBOTS" CONTENT="ALL">, and <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> as <META NAME="ROBOTS" CONTENT="NONE">.

At present, most search engine robots follow the robots.txt rules, while support for the robots META tag is less widespread, though it is growing. Google, notably, supports it fully and adds an extra directive, "archive", which controls whether Google keeps a cached snapshot of the page. For example:
<meta name="googlebot" content="index,follow,noarchive">

This tells Googlebot to index the page and follow the links on it, but not to keep a cached snapshot of the page.
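A crawler that honors these tags must extract them from the page HTML. A minimal sketch using Python's standard html.parser (the sample page and its directives are hypothetical):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots"> tags (case-insensitive)."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower() for d in a.get("content", "").split(",")]

page = '<html><head><meta name="ROBOTS" content="NOINDEX,FOLLOW"></head></html>'
p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'follow']
```

If "noindex" appears in the result, the crawler should not add the page to its index; if "nofollow" appears, it should not queue the page's outgoing links.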