Settings¶
The following settings are Scrapy Cluster specific. For all other Scrapy settings please refer to the official Scrapy documentation here.
Redis¶
REDIS_HOST
Default: 'localhost'
The Redis host.
REDIS_PORT
Default: 6379
The port to use when connecting to the REDIS_HOST
.
REDIS_DB
Default: 0
The Redis database to use when connecting to the REDIS_HOST
.
Kafka¶
KAFKA_HOSTS
Default: 'localhost:9092'
The Kafka host. May have multiple hosts separated by commas within the single string like 'h1:9092,h2:9092'
.
KAFKA_TOPIC_PREFIX
Default: 'demo'
The Kafka Topic prefix to use when generating the outbound Kafka topics.
KAFKA_APPID_TOPICS
Default: False
Flag to send data to both the firehose and Application ID specific Kafka topics. If set to True
, results will be sent to both the demo.outbound_firehose
and demo.outbound_<appid>
Kafka topics, where <appid>
is the Application ID used to submit the request. This is useful if you have many applications utilizing your cluster but only would like to listen to results for your specific application.
KAFKA_BASE_64_ENCODE
Default: False
Base64 encode the raw crawl body from the crawlers. This is useful when crawling malformed utf8 encoded pages, where json encoding throws an error. If an error occurs when encoding the crawl object in the item pipeline, there will be an error thrown and the result will be dropped.
KAFKA_PRODUCER_BATCH_LINGER_MS
Default: 25
The time to wait between batching multiple requests into a single one sent to the Kafka cluster.
KAFKA_PRODUCER_BUFFER_BYTES
Default: 4 * 1024 * 1024
The size of the TCP send buffer when transmitting data to Kafka
Zookeeper¶
ZOOKEEPER_ASSIGN_PATH
Default: /scrapy-cluster/crawler/
The location to store Scrapy Cluster domain specific configuration within Zookeeper
ZOOKEEPER_ID
Default: all
The file identifier to read crawler specific configuration from. This file is located within the ZOOKEEPER_ASSIGN_PATH
folder above.
ZOOKEEPER_HOSTS
Default: localhost:2181
The zookeeper host to connect to.
Scheduler¶
SCHEDULER_PERSIST
Default: True
Determines whether to clear all Redis Queues when the Scrapy Scheduler is shut down. This will wipe all domain queues for a particular spider type.
SCHEDULER_QUEUE_REFRESH
Default: 10
How many seconds to wait before checking for new or expiring domain queues. This is also dictated by internal Scrapy processes, so setting this any lower does not guarantee a quicker refresh time.
SCHEDULER_QUEUE_TIMEOUT
Default: 3600
The number of seconds older domain queues are allowed to persist before they expire. This acts as a cache to clean out queues from memory that have not been used recently.
SCHEDULER_BACKLOG_BLACKLIST
Default: True
Allows blacklisted domains to be added back to Redis for future crawling. If set to False
, domains matching the Zookeeper based domain blacklist will not be added back in to Redis.
Throttle¶
QUEUE_HITS
Default: 10
When encountering an unknown domain, throttle the domain to X number of hits within the QUEUE_WINDOW
QUEUE_WINDOW
Default: 60
The number of seconds to count and retain cluster hits for a particular domain.
QUEUE_MODERATED
Default: True
Moderates the outbound domain request flow to evenly spread the QUEUE_HITS
throughout the QUEUE_WINDOW
.
DUPEFILTER_TIMEOUT
Default: 600
Number of seconds to keep crawlid specific duplication filters around after the latest crawl with that id has been conducted. Putting this setting too low may allow crawl jobs to crawl the same page due to the duplication filter being wiped out.
SCHEDULER_IP_REFRESH
Default: 60
The number of seconds to wait between refreshing the Scrapy process’s public IP address. Used when doing IP based throttling.
PUBLIC_IP_URL
Default: 'http://ip.42.pl/raw'
The default URL to grab the Crawler’s public IP Address from.
IP_ADDR_REGEX
Default: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
The regular expression used to find the Crawler’s public IP Address from the PUBLIC_IP_URL
response. The first element from the results of this regex will be used as the ip address.
SCHEDULER_TYPE_ENABLED
Default: True
If set to true, the crawling process’s spider type is taken into consideration when throttling the crawling cluster.
SCHEDULER_IP_ENABLED
Default: True
If set to true, the crawling process’s public IP Address is taken into consideration when throttling the crawling cluster.
Note
For more information about Type and IP throttling, please see the throttle documentation.
SCHEUDLER_ITEM_RETRIES
Default: 2
Number of cycles through all known domain queues the Scheduler will take before the Spider is considered idle and waits for Scrapy to retry processing a request.
Logging¶
SC_LOGGER_NAME
Default: 'sc-crawler'
The Scrapy Cluster logger name.
SC_LOG_DIR
Default: 'logs'
The directory to write logs into. Only applicable when SC_LOG_STDOUT
is set to False
.
SC_LOG_FILE
Default: 'sc_crawler.log'
The file to write the logs into. When this file rolls it will have .1
or .2
appended to the file name. Only applicable when SC_LOG_STDOUT
is set to False
.
SC_LOG_MAX_BYTES
Default: 10 * 1024 * 1024
The maximum number of bytes to keep in the file based log before it is rolled.
SC_LOG_BACKUPS
Default: 5
The number of rolled file logs to keep before data is discarded. A setting of 5
here means that there will be one main log and five rolled logs on the system, totaling six log files.
SC_LOG_STDOUT
Default: True
Log to standard out. If set to False
, will write logs to the file given by the LOG_DIR/LOG_FILE
SC_LOG_JSON
Default: False
Log messages will be written in JSON instead of standard text messages.
SC_LOG_LEVEL
Default: 'INFO'
The log level designated to the logger. Will write all logs of a certain level and higher.
Note
More information about logging can be found in the utilities Log Factory documentation.
Stats¶
STATS_STATUS_CODES
Default: True
Collect Response status code metrics
STATUS_RESPONSE_CODES
Default:
[
200,
404,
403,
504,
]
Determines the different Response status codes to collect metrics against if metrics collection is turned on.
STATS_CYCLE
Default: 5
How often to check for expired keys and to roll the time window when doing stats collection.
STATS_TIMES
Default:
[
'SECONDS_15_MINUTE',
'SECONDS_1_HOUR',
'SECONDS_6_HOUR',
'SECONDS_12_HOUR',
'SECONDS_1_DAY',
'SECONDS_1_WEEK',
]
Rolling time window settings for statistics collection, the above settings indicate stats will be collected for the past 15 minutes, the past hour, the past 6 hours, etc.
Note
For more information about stats collection, please see the Stats Collector documentation.