Requirement:

Crawl the detail-page information for every anchor on https://v.taobao.com/v/content/video

Home page analysis

Analysis shows that the data is loaded through an Ajax request.
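For example, once the XHR request is spotted in the browser's Network panel, the list data can be fetched directly. A minimal sketch, using the same endpoint as the complete script below (parameters trimmed to the essentials):

import requests

# Fetch one page of the anchor list from the Ajax endpoint
params = {
    'cateType': 602,       # category for video anchors
    'currentPage': 1,      # page number of the list
    '_output_charset': 'UTF-8',
    '_input_charset': 'UTF-8',
}
headers = {
    'referer': 'https://v.taobao.com/v/content/video',
    'user-agent': 'Mozilla/5.0',  # shortened here; the full script uses a complete browser UA string
}
resp = requests.get('https://v.taobao.com/micromission/req/selectCreatorV3.do',
                    params=params, headers=headers)
print(resp.json())  # JSON payload containing the anchor list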

Request header analysis

Detail page analysis

Comparing the detail-page URL with the detail-data URL

Testing shows that we only need to change the userId value to fetch a different anchor's data.
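As a quick sanity check, a helper like the hypothetical fetch_detail below swaps a different userId into the detail-data URL and returns that anchor's JSON (the URL pattern is the one used in the full script):

import requests

def fetch_detail(user_id, headers):
    # Only the userId query parameter changes between anchors;
    # the _ksTS value is the timestamp token captured from DevTools.
    url = ('https://v.taobao.com/micromission/daren/daren_main_portalv3.do'
           f'?userId={user_id}&_ksTS=1554976401436_17')
    return requests.get(url, headers=headers).json()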

With the analysis done, we can write the code.

The complete code is as follows

import re
import json

import requests
import jsonpath
import pymongo


class VtaoSpider:
    headers = {
        'referer': 'https://v.taobao.com/v/content/video',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    db = None

    def open(self):
        """Connect to the database."""
        client = pymongo.MongoClient(host='106.12.108.236', port=27017)
        self.db = client['trip']

    def get_first_page(self):
        """Fetch every list page of the home page (25 pages of data)."""
        url_lst = []
        for i in range(1, 26):
            params = {
                'cateType': 602,
                'currentPage': i,
                '_ksTS': '1554971959356_87',
                '': '',
                '_output_charset': 'UTF-8',
                '_input_charset': 'UTF-8',
            }
            start_url = 'https://v.taobao.com/micromission/req/selectCreatorV3.do'
            first_data = requests.get(url=start_url, headers=self.headers, params=params)
            url_lst.append(first_data)
            # print(first_data.text)
        return url_lst

    def get_detail_url(self):
        """Collect the detail-page URLs."""
        response_list = self.get_first_page()
        all_detail_url = []
        for response in response_list:
            d_dict = json.loads(response.text)
            # jsonpath returns a list of every matching value
            detail_url = jsonpath.jsonpath(d_dict, '$..homeUrl')
            all_detail_url.extend(detail_url)
        # print(all_detail_url)
        return all_detail_url

    def get_detail_data(self):
        """Fetch each anchor's detail data and store it in MongoDB."""
        url_list = self.get_detail_url()
        # print(url_list)
        for url in url_list:
            try:
                ex = 'userId=(.*?)&'
                user_id = re.findall(ex, url)[0]
                detail_data_url = f'https://v.taobao.com/micromission/daren/daren_main_portalv3.do?userId={user_id}&_ksTS=1554976401436_17'
                # print(detail_data_url)
                # Get the response data
                data = requests.get(url=detail_data_url, headers=self.headers).text
                data_json = json.loads(data)
                darenNick = jsonpath.jsonpath(data_json, '$..darenNick')[0]
                darenScore = jsonpath.jsonpath(data_json, '$..darenScore')[0]
                nick = jsonpath.jsonpath(data_json, '$..nick')[0]
                creatorType = jsonpath.jsonpath(data_json, '$..creatorType')[0]
                rank = jsonpath.jsonpath(data_json, '$..rank')
                res_data = {
                    'darenNick': darenNick,
                    'darenScore': darenScore,
                    'nick': nick,
                    'creatorType': creatorType,
                    'rank': rank,
                }
                # Store the record in the database
                if self.db['vtaobao'].insert_one(res_data):
                    print('save to mongo is successful!')
            except Exception as e:
                print(e)


if __name__ == '__main__':
    vspider = VtaoSpider()
    # The database connection only needs to be opened once
    vspider.open()
    vspider.get_detail_data()

In total, 450 records were crawled, i.e., information on 450 anchors!

This code does not use multiprocessing or multithreading, so the crawl takes longer than you might like. Interested readers can refactor it with multiprocessing or multithreading (a threaded sketch follows below) and share the result so we can all learn from it. Thanks!
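As one possible refactor (not the author's code): a minimal threaded sketch using concurrent.futures.ThreadPoolExecutor, assuming the VtaoSpider class above is in scope. crawl_one is a hypothetical helper that mirrors one iteration of get_detail_data. pymongo's MongoClient is thread-safe, so sharing the connection across threads is fine.

import re
import json
from concurrent.futures import ThreadPoolExecutor

import requests
import jsonpath


def crawl_one(spider, url):
    # Hypothetical helper: same logic as one loop iteration of get_detail_data
    try:
        user_id = re.findall('userId=(.*?)&', url)[0]
        detail_data_url = ('https://v.taobao.com/micromission/daren/daren_main_portalv3.do'
                           f'?userId={user_id}&_ksTS=1554976401436_17')
        data_json = json.loads(requests.get(detail_data_url, headers=spider.headers).text)
        res_data = {
            'darenNick': jsonpath.jsonpath(data_json, '$..darenNick')[0],
            'darenScore': jsonpath.jsonpath(data_json, '$..darenScore')[0],
            'nick': jsonpath.jsonpath(data_json, '$..nick')[0],
            'creatorType': jsonpath.jsonpath(data_json, '$..creatorType')[0],
            'rank': jsonpath.jsonpath(data_json, '$..rank'),
        }
        spider.db['vtaobao'].insert_one(res_data)
    except Exception as e:
        print(e)


if __name__ == '__main__':
    vspider = VtaoSpider()
    vspider.open()
    urls = vspider.get_detail_url()
    # Eight worker threads fetch the detail pages concurrently
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(lambda u: crawl_one(vspider, u), urls)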
