If you are not familiar with gecco, take a look at the gecco home page on GitHub. gecco crawlers are very easy to use: capturing all of JD's product information takes only 9 classes.

Analyzing the JD website

To grab all the product information on the JD website, we first need to analyze the site. The Jingdong (JD) website can be roughly divided into three levels: the home page links to product list pages by category, and each product on a list page has its own detail page. So by finding all the categories, we can capture the product information one category at a time.

The entry address is JD's category listing of all products. We start from this page to grab all of JD's product information.

Create the entry-page HtmlBean class AllSort

import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl="", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    // The request that produced this page; the pipeline reads it back later.
    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > > div.items > dl")
    private List<Category> mobile;

    // Household appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

As you can see, we take the mobile phone and household appliance categories as examples. Each top-level class contains several subcategories, represented as List<Category>. gecco supports nested beans, which map well onto the structure of an HTML page. Category holds the information for one subcategory group, and HrefBean is a shared bean for links.

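import java.util.List;

import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    // Name of the parent category, taken from the <dt> link.
    @HtmlField(cssPath="dt a")
    private String parentName;

    // Subcategory links, taken from the <dd> links.
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }
}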
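
To make the nesting concrete, here is a minimal sketch of the same shape in plain Java, with no gecco types or annotations. Link and Group are hypothetical stand-ins for HrefBean and Category (the title and url fields are an assumed, simplified shape), and the example.com URLs are placeholders, not JD addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain stand-ins for the nested beans; nothing here depends on gecco.
final class NestedCategoryDemo {

    // Hypothetical stand-in for HrefBean: a link title plus its URL.
    static final class Link {
        final String title;
        final String url;

        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Hypothetical stand-in for Category: one parent name and its child links.
    static final class Group {
        final String parentName;
        final List<Link> links;

        Group(String parentName, List<Link> links) {
            this.parentName = parentName;
            this.links = links;
        }
    }

    // Flattens the two nested levels into one "parent / child -> url" line per link.
    static List<String> describe(List<Group> groups) {
        List<String> lines = new ArrayList<>();
        for (Group group : groups) {
            for (Link link : group.links) {
                lines.add(group.parentName + " / " + link.title + " -> " + link.url);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Group> mobile = Arrays.asList(
            new Group("Phones", Arrays.asList(
                new Link("Smartphones", "https://example.com/list?cat=1"),
                new Link("Feature phones", "https://example.com/list?cat=2"))));
        for (String line : describe(mobile)) {
            System.out.println(line);
        }
    }
}
```

The outer list corresponds to the dl elements matched on the page, and the inner list to the dd links inside each of them.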

Tips for getting the cssPath of page elements

The hard part of the two classes above is obtaining the cssPath values, so here are some tips. Open the page you want to crawl in Chrome and press F12 to open the developer tools. Then select the element you want to extract.

With the element selected in the panel on the right side of the browser, right-click it and choose Copy > Copy selector. This gives us the element's cssPath:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > > div.items

If you are familiar with jQuery selectors, and given that we only want the dl elements, this can be simplified to:

.category-items > div:nth-child(1) > div:nth-child(2) > > div.items > dl
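
The simplification above follows a simple rule: drop everything before a segment whose class is distinctive on the page, keep the rest, and append the tag you actually want. A rough sketch of that rule as string manipulation is shown below. SelectorTrimmer and its methods are hypothetical helpers, not part of gecco, the input selector in main is a made-up example in the same style, and the result is only valid if the anchor class really is unique on the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that trims a selector copied from the browser.
final class SelectorTrimmer {

    // Keeps the selector from the segment carrying anchorClass onward,
    // rewrites that segment as ".anchorClass", and appends targetTag.
    static String simplify(String copied, String anchorClass, String targetTag) {
        List<String> segments = Arrays.asList(copied.trim().split("\\s*>\\s*"));
        int start = -1;
        for (int i = 0; i < segments.size(); i++) {
            if (hasClass(segments.get(i), anchorClass)) {
                start = i;
                break;
            }
        }
        if (start < 0) {
            throw new IllegalArgumentException("anchor class not found: " + anchorClass);
        }
        List<String> kept = new ArrayList<>();
        kept.add("." + anchorClass);
        kept.addAll(segments.subList(start + 1, segments.size()));
        kept.add(targetTag);
        return String.join(" > ", kept);
    }

    // True if a segment such as "div.a.b" lists cls among its classes.
    static boolean hasClass(String segment, String cls) {
        String[] parts = segment.split("\\.");
        for (int i = 1; i < parts.length; i++) {
            // Strip a pseudo-class suffix such as ":nth-child(1)" if present.
            int colon = parts[i].indexOf(':');
            String name = colon < 0 ? parts[i] : parts[i].substring(0, colon);
            if (name.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.list > div.category-items.clearfix"
            + " > div:nth-child(1) > div.items";
        System.out.println(simplify(copied, "category-items", "dl"));
    }
}
```

In practice it is still worth checking the shortened selector in the browser console before putting it into an @HtmlField annotation.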

Writing the business-processing class for AllSort

Once AllSort has been injected, we need to process it. Here we do not persist the category information or do any other processing; we only extract the category links so that the product list pages can be crawled next. Here is the code:

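import java.util.List;

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list page to JD self-operated, in-stock products.
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                // Queue the list page for crawling as a sub-request of the current one.
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }
}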

@PipelineName defines the name of the pipeline, and that name is referenced in the @Gecco annotation on AllSort. This way, after gecco has fetched the page and injected the bean, it calls the pipelines listed in @Gecco one by one. "&delivery=1&page=1&JL=4_10_0&go=0" is appended to each subcategory link so that only JD's self-operated, in-stock products are crawled. SchedulerContext.into() puts the link to be crawled into the queue, where it waits for further crawling.
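
As a rough illustration of these two steps (appending the filter parameters, then queueing the link), here is a small self-contained sketch. It does not use gecco: CategoryLinkQueue is a hypothetical class, its in-memory queue only imitates the idea behind SchedulerContext.into(), and the example.com URLs are placeholders. Because the suffix starts with "&", it assumes the category link already has a query string; the fallback for links without one is an addition made here for safety.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for "append the filter, then queue the link".
final class CategoryLinkQueue {

    // Filter suffix quoted in the text above.
    static final String FILTER = "&delivery=1&page=1&JL=4_10_0&go=0";

    // First-in, first-out queue of URLs waiting to be crawled.
    private final Queue<String> pending = new ArrayDeque<>();

    // Appends the filter; starts a query string if the link has none yet.
    static String withFilter(String categoryUrl) {
        if (categoryUrl.contains("?")) {
            return categoryUrl + FILTER;
        }
        return categoryUrl + "?" + FILTER.substring(1);
    }

    // Queues every category link with the filter applied.
    void enqueueAll(List<String> categoryUrls) {
        for (String url : categoryUrls) {
            pending.add(withFilter(url));
        }
    }

    // Returns the next URL to crawl, or null when the queue is empty.
    String next() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        CategoryLinkQueue queue = new CategoryLinkQueue();
        queue.enqueueAll(Arrays.asList(
            "https://example.com/list.html?cat=1",
            "https://example.com/list.html?cat=2"));
        String url;
        while ((url = queue.next()) != null) {
            System.out.println(url);
        }
    }
}
```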

