gecco Crawler

If you are not familiar with gecco, see the gecco home page on GitHub. The gecco crawler is very easy to use: capturing all of JD's product information takes only 9 classes.

Analyzing the JD website

To grab all the product information on the JD website, we first need to analyze the site. The JD (Jingdong) website can be roughly divided into three levels: the home page links to product list pages by category, and each product list page links to a detail page for every product. So by finding all the categories, we can capture the product information level by level.
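
The three-level walk described above can be sketched with a plain queue. This is only an illustration in standard Java, not gecco code: the in-memory LINKS map and the page names in it are hypothetical stand-ins for the category page, the list pages, and the detail pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class ThreeLevelCrawlSketch {

    // Hypothetical in-memory "site": each page maps to the links found on it.
    static final Map<String, List<String>> LINKS = Map.of(
            "allSort", List.of("list/phones", "list/appliances"),
            "list/phones", List.of("item/1", "item/2"),
            "list/appliances", List.of("item/3"));

    // Visits pages in queue order and returns the detail pages that were reached.
    static List<String> crawl(String entry) {
        Deque<String> queue = new ArrayDeque<>();
        List<String> details = new ArrayList<>();
        queue.add(entry);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            List<String> children = LINKS.getOrDefault(page, List.of());
            if (children.isEmpty()) {
                details.add(page);      // no outgoing links: treat as a product detail page
            } else {
                queue.addAll(children); // category or list page: follow its links later
            }
        }
        return details;
    }

    public static void main(String[] args) {
        System.out.println(crawl("allSort"));
    }
}
```

A real crawl would also need to remember visited URLs, since real pages link back to each other; the sketch omits that because its sample data is a tree.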

Entry URL

http://www.jd.com/allSort.aspx is JD's category list covering all products. We start from this page to grab all of JD's product information.

Create the HtmlBean class AllSort for the entry page

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;

    // Household appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

Here we take the mobile phone and household appliance categories as examples. Each top-level category contains several subcategories, which are represented as List<Category>. gecco supports nested Beans, so the structure of an HTML page can be expressed naturally. Category holds the information of one subcategory group, and HrefBean is a shared Bean for links.

public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }
}
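
To make the nesting concrete, here is a small plain-Java sketch of the data shape these beans end up holding: a list of groups, each with a parent name and a list of links. It does not use gecco; the Link and Group classes, the sample titles, and the example.com URLs are all hypothetical stand-ins for HrefBean, Category, and real JD links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedBeanSketch {

    // Hypothetical stand-in for a link bean: link text plus target URL.
    static final class Link {
        final String title;
        final String url;

        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Hypothetical stand-in for Category: one <dl> block on the page.
    static final class Group {
        final String parentName; // text selected by "dt a"
        final List<Link> links;  // one entry per element selected by "dd a"

        Group(String parentName, List<Link> links) {
            this.parentName = parentName;
            this.links = links;
        }
    }

    // Flattens the nested structure into "parent/child -> url" lines.
    static List<String> flatten(List<Group> groups) {
        List<String> out = new ArrayList<>();
        for (Group group : groups) {
            for (Link link : group.links) {
                out.add(group.parentName + "/" + link.title + " -> " + link.url);
            }
        }
        return out;
    }

    // Made-up sample data, used only for illustration.
    static List<Group> sample() {
        return Arrays.asList(
                new Group("Phones", Arrays.asList(
                        new Link("Smartphones", "http://example.com/list?cat=1"),
                        new Link("Feature phones", "http://example.com/list?cat=2"))),
                new Group("Accessories", Arrays.asList(
                        new Link("Cases", "http://example.com/list?cat=3"))));
    }

    public static void main(String[] args) {
        flatten(sample()).forEach(System.out::println);
    }
}
```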

Tips for obtaining the cssPath of page elements

The hard part of the two classes above is working out the cssPath values, so here are some tips. Open the page you want to crawl in Chrome and press F12 to enter developer mode. Select the element you want to extract.

In the Elements panel on the right side of the browser, right-click the selected element and choose Copy > Copy selector. This gives us the cssPath:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you are familiar with jQuery selectors, you will see that the leading part of the path can be dropped. Since we also only want the dl elements, the path can be simplified to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl

Writing the business-processing class for AllSort

Once AllSort has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product list pages can be crawled next. Here is the code:

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        // Only the mobile phone categories are followed in this example.
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list to JD self-operated, in-stock products.
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                // Queue the list page for further crawling.
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }
}

@PipelineName defines the name of the pipeline, and that name is referenced in the @Gecco annotation of AllSort. This way, after gecco has fetched the page and populated the Bean, it calls the pipelines listed in @Gecco one by one. "&delivery=1&page=1&JL=4_10_0&go=0" is appended to each subcategory link so that only JD's self-operated, in-stock products are captured. SchedulerContext.into() puts the link to be crawled into the queue, where it waits for further crawling.
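
The pipeline above appends the filter with a leading "&", which assumes every category link already carries a query string. As a hedged sketch in plain Java (not part of gecco), the hypothetical helper below picks "?" or "&" depending on the link; the example.com URLs are made up for illustration, and only the filter parameters come from the article.

```java
public class JdFilterUrlSketch {

    // Filter parameters taken from the pipeline code in this article.
    static final String FILTER = "delivery=1&page=1&JL=4_10_0&go=0";

    // Appends the filter, using '&' if the URL already has a query string and '?' otherwise.
    static String withFilter(String url) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + FILTER;
    }

    public static void main(String[] args) {
        // Hypothetical category links, used only for illustration.
        System.out.println(withFilter("http://example.com/list.html?cat=1,2,3"));
        System.out.println(withFilter("http://example.com/list.html"));
    }
}
```

Inside the pipeline, the result of such a helper would be passed to currRequest.subRequest(...) in place of the plain string concatenation.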
