The following settings are specific to Scrapy Cluster. For all other Scrapy settings, please refer to the official Scrapy documentation.
The Redis host.
The port to use when connecting to the Redis host.
The Redis database to use when connecting to the Redis host.
The Kafka host. Multiple hosts may be given as a single comma-separated string.
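To illustrate the comma-separated host string, here is a minimal sketch that splits such a value into host and port pairs. The helper name `parse_kafka_hosts` and the example host names are hypothetical, and the sketch assumes every entry has the form `host:port`.

```python
def parse_kafka_hosts(hosts):
    """Split a comma-separated 'host:port' string into (host, port) tuples.

    Hypothetical helper for illustration; not part of Scrapy Cluster.
    """
    pairs = []
    for entry in hosts.split(","):
        # rpartition splits on the last colon, so the port is always the tail.
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


# Example host names are placeholders, not real brokers.
brokers = parse_kafka_hosts("kafka1.example.com:9092, kafka2.example.com:9092")
print(brokers)
```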
The Kafka Topic prefix to use when generating the outbound Kafka topics.
Flag to send data to both the firehose and Application ID specific Kafka topics. If set to True, results will be sent to both the firehose topic and the demo.outbound_<appid> Kafka topic, where <appid> is the Application ID used to submit the request. This is useful if many applications use your cluster but you only want to listen to results for your specific application.
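As a rough illustration of the topic naming described above, the sketch below builds the list of outbound topics for one result. The function name `outbound_topics` and the `outbound_firehose` suffix are assumptions made for this example; only the `demo.outbound_<appid>` pattern is taken from the text.

```python
def outbound_topics(prefix, appid, appid_topics=False):
    """Return the Kafka topics a crawl result would be sent to.

    Hypothetical helper for illustration; not Scrapy Cluster's own code.
    """
    # Assumption: the firehose topic is named '<prefix>.outbound_firehose'.
    topics = ["{}.outbound_firehose".format(prefix)]
    if appid_topics:
        # Application ID specific topic, following 'demo.outbound_<appid>'.
        topics.append("{}.outbound_{}".format(prefix, appid))
    return topics


print(outbound_topics("demo", "testapp", appid_topics=True))
```

With the flag off, only the firehose topic is returned, so a consumer interested in a single application would have to filter the firehose itself.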
Base64 encode the raw crawl body from the crawlers. This is useful when crawling pages with malformed UTF-8 encoding, where JSON encoding throws an error. If an error occurs when encoding the crawl object in the item pipeline, an error is thrown and the result is dropped.
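The failure mode and the workaround can be sketched as follows. This is not the item pipeline's actual code: the `encode_item` helper and the `body` field name are assumptions chosen for the example.

```python
import base64
import json


def encode_item(item, base64_encode=False):
    """Serialize a crawl item to JSON, optionally Base64 encoding the body.

    Hypothetical helper for illustration only.
    """
    out = dict(item)
    if base64_encode:
        # Base64 output is plain ASCII, so it always survives JSON encoding.
        out["body"] = base64.b64encode(out["body"]).decode("ascii")
    else:
        # Raises UnicodeDecodeError if the body is not valid UTF-8.
        out["body"] = out["body"].decode("utf-8")
    return json.dumps(out)


bad_body = b"<html>\xff\xfe broken</html>"  # bytes that are not valid UTF-8
item = {"url": "http://example.com", "body": bad_body}

encoded = encode_item(item, base64_encode=True)
# A consumer reverses the encoding to recover the original bytes.
decoded_body = base64.b64decode(json.loads(encoded)["body"])
```

The trade-off is that every consumer of the topic must Base64 decode the body before using it.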
The time to wait between batching multiple requests into a single one sent to the Kafka cluster.
Default: 4 * 1024 * 1024
The size of the TCP send buffer, in bytes, when transmitting data to Kafka.
The location to store Scrapy Cluster domain specific configuration within Zookeeper.
The file identifier to read crawler specific configuration from. This file is located within the ZOOKEEPER_ASSIGN_PATH folder above.
The Zookeeper host to connect to.
Determines whether to clear all Redis Queues when the Scrapy Scheduler is shut down. This will wipe all domain queues for a particular spider type.
How many seconds to wait before checking for new or expiring domain queues. This is also dictated by internal Scrapy processes, so setting this any lower does not guarantee a quicker refresh time.
The number of seconds older domain queues are allowed to persist before they expire. This acts as a cache to clean out queues from memory that have not been used recently.
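The expiry behaviour can be pictured with a small in-memory sketch. The class `DomainQueueCache` and the timeout value are hypothetical; the real scheduler manages its queues differently, and this only shows the "drop what has not been used recently" idea.

```python
class DomainQueueCache:
    """Tracks when each domain queue was last used (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_used = {}

    def touch(self, domain, now):
        # Record that the queue for this domain was just used.
        self.last_used[domain] = now

    def expire(self, now):
        # Collect domains idle for longer than the timeout, then drop them.
        expired = [d for d, t in self.last_used.items()
                   if now - t > self.timeout_seconds]
        for domain in expired:
            del self.last_used[domain]
        return expired


cache = DomainQueueCache(timeout_seconds=3600)  # example value only
cache.touch("a.example.com", now=0)
cache.touch("b.example.com", now=3000)
expired = cache.expire(now=4000)
```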
Allows blacklisted domains to be added back to Redis for future crawling. If set to False, domains matching the Zookeeper based domain blacklist will not be added back in to Redis.
When encountering an unknown domain, throttle the domain to X number of hits within the hit-counting window.
The number of seconds to count and retain cluster hits for a particular domain.
Moderates the outbound domain request flow to evenly spread the QUEUE_HITS throughout the hit-counting window.
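One way to picture moderation: instead of allowing all hits in a burst at the start of the window, each hit must wait at least window / hits seconds after the previous one. The sketch below is a simplified, in-memory illustration under that assumption; the class `ModeratedThrottle` and the values 10 and 60 are examples, not the cluster's implementation.

```python
class ModeratedThrottle:
    """Allow at most `hits` requests per `window` seconds, evenly spaced.

    Hypothetical in-memory sketch for illustration only.
    """

    def __init__(self, hits, window):
        # Minimum number of seconds that must pass between two allowed hits.
        self.min_gap = float(window) / hits
        self.last_hit = None

    def allow(self, now):
        if self.last_hit is None or now - self.last_hit >= self.min_gap:
            self.last_hit = now
            return True
        return False


throttle = ModeratedThrottle(hits=10, window=60)  # example values
# Try one request per second for a full window and record which get through.
allowed = [t for t in range(0, 60) if throttle.allow(t)]
print(allowed)
```

With these example values a request is let through every 6 seconds, so the 10 hits are spread across the 60 second window rather than used up in the first few seconds.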
Number of seconds to keep crawlid specific duplication filters around after the latest crawl with that id has been conducted. Putting this setting too low may allow crawl jobs to crawl the same page due to the duplication filter being wiped out.
The number of seconds to wait between refreshing the Scrapy process’s public IP address. Used when doing IP based throttling.
The default URL to grab the Crawler’s public IP Address from.
The regular expression used to find the Crawler’s public IP Address from the PUBLIC_IP_URL response. The first element from the results of this regex will be used as the IP address.
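The "first element of the regex results" rule can be sketched like this. The pattern `EXAMPLE_IP_PATTERN` is an assumed dotted-quad IPv4 expression used for illustration, and `extract_public_ip` is a hypothetical helper; the configured expression and the cluster's own lookup code may differ.

```python
import re

# Assumed example pattern for a dotted-quad IPv4 address.
EXAMPLE_IP_PATTERN = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"


def extract_public_ip(response_text, pattern=EXAMPLE_IP_PATTERN):
    """Return the first regex match in the response body, or None."""
    results = re.findall(pattern, response_text)
    return results[0] if results else None


# The response text here is made up; only the first address is used.
ip = extract_public_ip("Your IP is 203.0.113.7 (seen via 198.51.100.1)")
print(ip)
```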
If set to true, the crawling process’s spider type is taken into consideration when throttling the crawling cluster.
If set to true, the crawling process’s public IP Address is taken into consideration when throttling the crawling cluster.
For more information about Type and IP throttling, please see the throttle documentation.
Number of cycles through all known domain queues the Scheduler will take before the Spider is considered idle and waits for Scrapy to retry processing a request.
The Scrapy Cluster logger name.
The directory to write logs into. Only applicable when SC_LOG_STDOUT is set to False.
The file to write the logs into. When this file rolls it will have a numeric suffix such as .2 appended to the file name. Only applicable when SC_LOG_STDOUT is set to False.
Default: 10 * 1024 * 1024
The maximum number of bytes to keep in the file based log before it is rolled.
The number of rolled file logs to keep before data is discarded. A setting of 5 here means that there will be one main log and five rolled logs on the system, totaling six log files.
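The interaction between the maximum size and the number of rolled logs can be tried out with Python's standard `logging.handlers.RotatingFileHandler`. This is a stand-in used for illustration; Scrapy Cluster's own Log Factory may use a different handler, and the logger name, file name, and tiny size limit below are example values.

```python
import logging
import logging.handlers
import os
import tempfile


def build_file_logger(name, log_dir, log_file, max_bytes, backups):
    """Create a logger that rolls its file once it reaches max_bytes."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, log_file),
        maxBytes=max_bytes,
        backupCount=backups,
    )
    logger.addHandler(handler)
    return logger


log_dir = tempfile.mkdtemp()
# A very small size limit so the file rolls many times in this demo.
logger = build_file_logger("example-crawler", log_dir, "example.log",
                           max_bytes=200, backups=5)
for i in range(200):
    logger.info("message number %d with some padding text", i)

log_files = sorted(os.listdir(log_dir))
print(log_files)
```

After enough messages the directory holds the main log plus five rolled files; older data beyond the fifth rolled file is discarded.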
Log to standard out. If set to False, logs are written to the file given by the log directory and log file settings above.
Log messages will be written in JSON instead of standard text messages.
The log level designated to the logger. All logs at this level and higher are written.
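The combined effect of JSON output and a log level can be sketched with the standard library. The `JsonFormatter` class and the field names in its output are assumptions for this example and do not reflect the exact JSON layout the cluster emits.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (illustrative only)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


stream = io.StringIO()  # stands in for standard out in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("example-json-logger")
logger.setLevel(logging.INFO)  # INFO and higher are written
logger.addHandler(handler)
logger.propagate = False

logger.debug("below the configured level, so not written")
logger.info("crawl started")
logger.error("crawl failed")

records = [json.loads(line) for line in stream.getvalue().splitlines()]
```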
More information about logging can be found in the utilities Log Factory documentation.
Collect Response status code metrics.
Default: [200, 404, 403, 504]
Determines the different Response status codes to collect metrics against if metrics collection is turned on.
How often to check for expired keys and to roll the time window when doing stats collection.
Default: ['SECONDS_15_MINUTE', 'SECONDS_1_HOUR', 'SECONDS_6_HOUR', 'SECONDS_12_HOUR', 'SECONDS_1_DAY', 'SECONDS_1_WEEK']
Rolling time window settings for statistics collection. The default above means stats will be collected for the past 15 minutes, the past hour, the past 6 hours, and so on.
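To make the rolling windows concrete, the sketch below maps each window name to a duration in seconds and counts how many events fall inside each window. The mapping is an assumption inferred from the names themselves, and `counts_per_window` is a hypothetical helper; the actual Stats Collector works differently.

```python
# Assumed durations, inferred from the window names.
WINDOW_SECONDS = {
    "SECONDS_15_MINUTE": 15 * 60,
    "SECONDS_1_HOUR": 60 * 60,
    "SECONDS_6_HOUR": 6 * 60 * 60,
    "SECONDS_12_HOUR": 12 * 60 * 60,
    "SECONDS_1_DAY": 24 * 60 * 60,
    "SECONDS_1_WEEK": 7 * 24 * 60 * 60,
}


def counts_per_window(event_times, now):
    """Count events whose age is within each rolling window."""
    return {
        name: sum(1 for t in event_times if 0 <= now - t <= seconds)
        for name, seconds in WINDOW_SECONDS.items()
    }


now = 1000000
# Events 1 minute, 30 minutes, 2 hours, and 3 days before `now`.
events = [now - 60, now - 1800, now - 7200, now - 3 * 86400]
counts = counts_per_window(events, now)
print(counts)
```

Each event is counted in every window large enough to contain it, which is why the longer windows always report counts at least as large as the shorter ones.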
For more information about stats collection, please see the Stats Collector documentation.