What we covered previously was scraping static pages: each request returns all of the page's information at once. However, on sites such as shopping websites, the product information is loaded by JavaScript, often asynchronously via AJAX. In cases like this, a plain Scrapy Request does not get the information we want; the solution is to use scrapy-splash.

scrapy-splash loads JavaScript-generated data by relying on Splash, a JavaScript rendering service. Splash is a lightweight browser with an HTTP API, implemented in Python using Twisted and QT. With scrapy-splash, the response we finally receive is the page source as the browser sees it after everything has been rendered.

The preparation process

  • Install Docker

    On Windows, the easiest way to install Docker is to use Docker Toolbox. The Docker Engine daemon relies on the Linux kernel, so we cannot run the Docker Engine directly on Windows. Instead, a Linux virtual machine is created on your computer and the Docker Engine runs inside it. Docker Toolbox bundles the tools needed to run Docker on Windows, including the virtual machine.

    First, download Docker Toolbox.

    Run the installer. By default, the following programs will be installed:

    • The Windows version of the Docker client
    • Docker Toolbox management tools and the ISO image
    • Oracle VM VirtualBox
    • Git tools

    Of course, if you already have Oracle VM VirtualBox or the Git tools installed, you can uncheck those two items during installation; after that, just keep clicking Next. When installation is complete, find the Docker Quickstart Terminal icon and double-click it. Wait a short while for it to configure itself, and you will see an interface like the following.



    Note the IP address highlighted in the red box in the terminal output; this is the address assigned to the Docker virtual machine by default, and we will use it later. With that, the Docker tooling is installed.

  • Install Splash

    Double-click Docker Quickstart Terminal to run it, then enter the following:

    docker pull scrapinghub/splash

    This command pulls the Splash image; wait a moment and it will finish.

    Next, start Splash:

    docker run -p 8050:8050 scrapinghub/splash

    This command starts the Splash rendering service on port 8050 of the machine.

    You will see startup output like the following illustration.

    At this point, open your browser and go to 192.168.99.100:8050 (the IP assigned to your Docker VM); you will see an interface like this.



    You can enter any URL in the box at the top (shown in the red box in the screenshot) and click Render me! to see what the page looks like after rendering. You can also check the service from Python; see the sketch after this list.

  • Install scrapy-splash

    pip install scrapy-splash

    With that, our preparations are complete.
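As a quick check that the Splash service set up above is reachable, you can call its render.html HTTP endpoint directly from Python using the requests library. This is a minimal sketch, assuming the default Docker Toolbox address 192.168.99.100 from the screenshots; substitute your own VM's IP if it differs.

import requests

# Ask Splash to fetch a page and render it, waiting 1 second for JavaScript to run
resp = requests.get(
    'http://192.168.99.100:8050/render.html',
    params={'url': 'https://www.taobao.com', 'wait': 1},
)
print(resp.status_code)   # 200 means Splash rendered the page
print(len(resp.text))     # size of the rendered HTML

If this prints a 200 status code, Splash is up and rendering pages.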

Test

Now let's create a project to test whether it really does what we want.

Without scrapy-splash

To have a clear point of comparison, we first try it without scrapy-splash and see what happens. Taking Taobao product listings as the example, create a new project called taobao and enter the following in the spider.py file:

import scrapy


class Spider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = []
    start_urls = ['https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.xpath('//div[@class="row row-2 title"]/a/text()').extract()
        print('This is the title:', title)

We print out the titles of the Taobao food listings. Without JavaScript rendering, the extracted title list comes back empty, which is exactly the problem described at the start:

With scrapy-splash

Now let's do the same thing with scrapy-splash and see what happens.

Using scrapy-splash requires some extra configuration; the main pieces are below.

In the settings.py file, add the following:

# URL of the rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
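The scrapy-splash README also suggests enabling its spider middleware, which keeps duplicate Splash arguments from being stored on every request. It is not part of the original post's configuration, but if you follow the official setup you would add:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}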

In the spider.py file, use the code below:

import scrapy
from scrapy_splash import SplashRequest


class Spider(scrapy.Spider):
    name = 'taobao'
    allowed_domains = []
    start_urls = ['https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                args={'wait': 1}, endpoint='render.html')

    def parse(self, response):
        title = response.xpath('//div[@class="row row-2 title"]/a/text()').extract()
        print('This is the title:', title)

Don't forget to import SplashRequest.
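Besides wait, Splash's render.html endpoint accepts other arguments that can be passed through args in the same way. The variant below is only a sketch of the spider above with a couple of extra optional arguments; the class name and the specific values are examples, not part of the original post.

import scrapy
from scrapy_splash import SplashRequest


class SpiderWithArgs(scrapy.Spider):
    # Hypothetical variant of the spider above, showing extra render.html arguments
    name = 'taobao_args'
    allowed_domains = []
    start_urls = ['https://s.taobao.com/search?q=%E7%BE%8E%E9%A3%9F']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.html',
                args={
                    'wait': 1,      # seconds to let JavaScript run after the page loads
                    'timeout': 30,  # overall timeout for the render, in seconds
                    'images': 0,    # skip downloading images to speed rendering up
                },
            )

    def parse(self, response):
        title = response.xpath('//div[@class="row row-2 title"]/a/text()').extract()
        print('This is the title:', title)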

Now run the project; remember to start the Splash rendering service in Docker first.
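With the Splash container running, start the crawl from the project directory in the usual way; the spider name is the taobao defined in the code above:

scrapy crawl taobao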

The results are shown in the figure below.



As you can see, the information we need is printed out. It is a bit messy; you could clean it up with regular expressions (a sketch follows below), but that is not the main topic of this section, so feel free to try it yourself.
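If you want tidier output, one option (a small sketch, not part of the original spider) is to normalize the whitespace in each extracted title before printing:

import re

def clean_titles(raw_titles):
    # Collapse runs of whitespace in each title and drop empty entries
    cleaned = []
    for raw in raw_titles:
        text = re.sub(r'\s+', ' ', raw).strip()
        if text:
            cleaned.append(text)
    return cleaned

# Example: clean_titles(['\n  fresh fruit \n', '  snacks  ']) returns ['fresh fruit', 'snacks']

Calling clean_titles(title) inside parse gives a flat list of readable product names.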
