This Scrapy spider middleware uses the HCF backend from Scrapinghub's Scrapy Cloud service to retrieve the new urls to crawl and store back the links extracted.
Install scrapy-hcf using pip:
$ pip install scrapy-hcf
To activate this middleware it needs to be added to the SPIDER_MIDDLEWARES
dict, i.e:
SPIDER_MIDDLEWARES = {
'scrapy_hcf.HcfMiddleware': 543,
}
And the following settings need to be defined:
HS_AUTH- Scrapy Cloud API key
HS_PROJECTID- Scrapy Cloud project ID (not needed if the spider is ran on dash)
HS_FRONTIER- Frontier name.
HS_CONSUME_FROM_SLOT- Slot from where the spider will read new URLs.
Note that HS_FRONTIER and HS_CONSUME_FROM_SLOT can be overriden
from inside a spider using the spider attributes hs_frontier
and hs_consume_from_slot respectively.
The following optional Scrapy settings can be defined:
HS_ENDPOINT- URL to the API endpoint, i.e: http://localhost:8003. The default value is provided by the python-hubstorage package.
HS_MAX_LINKS- Number of links to be read from the HCF, the default is 1000.
HS_START_JOB_ENABLED- Enable whether to start a new job when the spider finishes.
The default is
False HS_START_JOB_ON_REASON- This is a list of closing reasons,
if the spider ends with any of these reasons a new job will be started
for the same slot. The default is
['finished'] HS_NUMBER_OF_SLOTS- This is the number of slots that the middleware will use to store the new links. The default is 8.
The following keys can be defined in a Scrapy Request meta in order to control the behavior of the HCF middleware:
'use_hcf'- If set to
Truethe request will be stored in the HCF. 'hcf_params'Dictionary of parameters to be stored in the HCF with the request fingerprint
'qdata'- data to be stored along with the fingerprint in the request queue
'fdata'- data to be stored along with the fingerprint in the fingerprint set
'p'- Priority - lower priority numbers are returned first. The default is 0
The value of 'qdata' parameter could be retrieved later using
response.meta['hcf_params']['qdata'].
The spider can override the default slot assignation function by setting the
spider slot_callback method to a function with the following signature:
def slot_callback(request):
...
return slot