scrapy-redis is a set of Redis-based components for Scrapy that makes it easy to build a distributed crawler. Essentially the component provides three pieces of functionality (wired together by the settings sketched after the list):

  • scheduler  - scheduler
  • dupefilter - URL deduplication (used by the scheduler)
  • pipeline   - data persistence
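
A minimal settings.py sketch that enables all three components is shown below. The class paths are the ones scrapy-redis documents; the Redis address is a placeholder to adapt to your environment.

# settings.py -- minimal sketch wiring up the three scrapy-redis components
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # shared dedup filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # Redis-backed scheduler
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,             # item persistence
}
REDIS_URL = 'redis://localhost:6379'                         # placeholder connection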

scrapy-redis Components

1. URL deduplication

"""
Define the deduplication rule (invoked and applied by the scheduler).

a. Internally, the following settings are used to connect to Redis:

# REDIS_HOST = 'localhost'                          # host name
# REDIS_PORT = 6379                                 # port
# REDIS_URL = 'redis://user:pass@hostname:9001'     # connection URL (takes precedence over the settings above)
# REDIS_PARAMS = {}                                 # Redis connection parameters; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
# REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'  # Python class used to connect to Redis; default: redis.StrictRedis
# REDIS_ENCODING = "utf-8"                          # Redis encoding; default: 'utf-8'

b. The deduplication rule is implemented with a Redis set whose key is:
    key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
   Default configuration:
    DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

c. The deduplication rule converts each URL into a unique fingerprint and then checks whether that fingerprint already exists in the Redis set:

from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.cnblogs.com/wupeiqi.html')
result = request.request_fingerprint(req)
print(result)  # 8ea4fd67887449313ccc12e5b6b92510cc53675c

PS:
- URLs whose query parameters appear in a different order yield the same fingerprint;
- Request headers are excluded from the calculation by default; include_headers lets you specify which headers to include.

Example:

from scrapy.utils import request
from scrapy.http import Request

req = Request(url='http://www.baidu.com?name=8&id=1', callback=lambda x: print(x), cookies={'k1': 'vvvvv'})
result = request.request_fingerprint(req, include_headers=['cookies'])
print(result)

req = Request(url='http://www.baidu.com?id=1&name=8', callback=lambda x: print(x), cookies={'k1': 666})
result = request.request_fingerprint(req, include_headers=['cookies'])
print(result)
"""
# Ensure all spiders share same duplicates filter through redis.
# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
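
The dedup check itself boils down to adding the fingerprint to a Redis set and looking at whether it was already there. The sketch below only imitates that idea with the plain redis client; the key name and connection details are illustrative, not the component's actual internals.

# Minimal sketch of the idea behind the request_seen check:
# SADD returns 1 if the fingerprint was new, 0 if it was already in the set.
import redis
from scrapy.http import Request
from scrapy.utils import request

server = redis.StrictRedis(host='localhost', port=6379)   # placeholder connection
key = 'dupefilter:example'                                 # illustrative key name

def seen(req):
    fp = request.request_fingerprint(req)
    return server.sadd(key, fp) == 0

print(seen(Request(url='http://www.cnblogs.com/wupeiqi.html')))  # False on first sight
print(seen(Request(url='http://www.cnblogs.com/wupeiqi.html')))  # True on repeat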

2. Scheduler

"""
Scheduler , The scheduler uses PriorityQueue( Ordered set )、FifoQueue( list )、LifoQueue( list ) Make a save request , And use RFPDupeFilter Yes URL duplicate removal a. Scheduler
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' # Priority queue is used by default ( Default ), other :PriorityQueue( Ordered set ),FifoQueue( list )、LifoQueue( list )
SCHEDULER_QUEUE_KEY = '%(spider)s:requests' # Requests in the scheduler are stored in redis Medium key
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # Save to redis The data in is serialized , By default pickle
SCHEDULER_PERSIST = True # Whether to keep the original scheduler and de duplication record when shutting down ,True= Retain ,False= Empty
SCHEDULER_FLUSH_ON_START = True # Whether to empty before you start Scheduler and de duplication ,True= Empty ,False= Don't empty
SCHEDULER_IDLE_BEFORE_CLOSE = 10 # When getting data from the scheduler , If it is empty , Maximum waiting time ( In the end, there's no data , Not obtained ).
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter' # Go back to the rules , stay redis The corresponding when saving in key
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'# De duplication rule corresponds to the class to be processed """
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler" # Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # Don't cleanup redis queues, allows to pause/resume crawls.
# SCHEDULER_PERSIST = True # Schedule requests using a priority queue. (default)
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' # Alternative queues.
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue' # Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
# SCHEDULER_IDLE_BEFORE_CLOSE = 10
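
For intuition, the priority-queue behaviour can be imitated with a plain Redis sorted set: each request is stored with a score derived from its priority, and the entry with the smallest score is popped first. The snippet below is only an illustration of that idea, not scrapy-redis code; the key name and payloads are made up.

# Illustration only: a Redis sorted set used as a priority queue,
# which is the structure the PriorityQueue setting builds on.
import redis

server = redis.StrictRedis(host='localhost', port=6379)   # placeholder connection
key = 'example:requests'                                   # illustrative key name

# push: higher Scrapy priority -> lower score -> popped earlier
server.zadd(key, {'request-low': 0, 'request-high': -10})  # redis-py 3.x mapping form

# pop: take the entry with the smallest score, then remove it
entry = server.zrange(key, 0, 0)
server.zremrangebyrank(key, 0, 0)
print(entry)  # [b'request-high']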

3. Data persistence

Define persistence: when the spider yields an Item object, the RedisPipeline handles it.

a. When persisting items to Redis, specify the key and the serialization function:

REDIS_ITEMS_KEY = '%(spider)s:items'
REDIS_ITEMS_SERIALIZER = 'json.dumps'

b. Item data is saved in a Redis list.
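
To actually use this, register the pipeline in ITEM_PIPELINES; the resulting Redis list can then be read back with LRANGE. A short sketch, assuming a local Redis and a spider named 'chouti':

# settings.py -- enable the Redis item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Reading the persisted items back (spider name 'chouti' is assumed):
import json
import redis

server = redis.StrictRedis(host='localhost', port=6379)
for raw in server.lrange('chouti:items', 0, -1):
    print(json.loads(raw))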

4. Start URLs

"""
start URL relevant a. Get start URL when , To get from a collection or from a list ?True, aggregate ;False, list
REDIS_START_URLS_AS_SET = False # Get start URL when , If True, Then use self.server.spop; If False, Then use self.server.lpop
b. When writing crawlers , start URL from redis Of Key In order to get
REDIS_START_URLS_KEY = '%(name)s:start_urls' """
# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
# REDIS_START_URLS_AS_SET = False # Default start urls key for RedisSpider and RedisCrawlSpider.
# REDIS_START_URLS_KEY = '%(name)s:start_urls'
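
In practice these start URLs are consumed by a RedisSpider (or RedisCrawlSpider), and you seed them by pushing onto the key from any Redis client. A short sketch, assuming the default list-based key and placeholder names:

# Spider sketch: reads start URLs from the Redis key 'myspider:start_urls'
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'   # matches the REDIS_START_URLS_KEY pattern

    def parse(self, response):
        pass  # parse as usual

# Seeding the queue from Python (LPUSH, since REDIS_START_URLS_AS_SET = False):
import redis
redis.StrictRedis(host='localhost', port=6379).lpush(
    'myspider:start_urls', 'http://www.cnblogs.com/')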

scrapy-redis Example

# DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
#
#
# from scrapy_redis.scheduler import Scheduler
# from scrapy_redis.queue import PriorityQueue
# SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # queue class (PriorityQueue is the default); alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
# SCHEDULER_QUEUE_KEY = '%(spider)s:requests'                 # Redis key under which the scheduler stores requests
# SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"          # serializer for data saved to Redis; pickle by default
# SCHEDULER_PERSIST = True                                    # keep the scheduler queue and dedup records on shutdown? True = keep, False = clear
# SCHEDULER_FLUSH_ON_START = False                            # flush the scheduler queue and dedup records on start? True = flush, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10                            # maximum time to wait when the scheduler queue is empty (if there is still no data, nothing is fetched)
# SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'          # Redis key under which the dedup records are stored
# SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
#
#
#
# REDIS_HOST = '10.211.55.13'                        # host name
# REDIS_PORT = 6379                                  # port
# # REDIS_URL = 'redis://user:pass@hostname:9001'    # connection URL (takes precedence over the settings above)
# # REDIS_PARAMS = {}                                # Redis connection parameters; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
# # REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'  # Python class used to connect to Redis; default: redis.StrictRedis
# REDIS_ENCODING = "utf-8"                           # Redis encoding; default: 'utf-8'

The configuration file

import scrapy


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = (
        'http://www.chouti.com/',
    )

    def parse(self, response):
        for i in range(0, 10):
            yield  # yield Request or Item objects here

Crawler file
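
With the settings above in place, the spider is run as usual (scrapy crawl chouti) and the scheduler/dupefilter state can be inspected directly in Redis. A quick check, assuming the host from the example configuration:

# After 'scrapy crawl chouti', the scrapy-redis keys can be inspected like this:
import redis

server = redis.StrictRedis(host='10.211.55.13', port=6379)  # host from the example config
print(server.zcard('chouti:requests'))      # pending requests in the priority queue
print(server.scard('chouti:dupefilter'))    # fingerprints seen so far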

