
Scrapy Cluster 1.3 Documentation

This documentation provides everything you need to know about the Scrapy-based distributed crawling project, Scrapy Cluster.

Introduction

Overview
Learn about the Scrapy Cluster Architecture.
Quick Start
A Quick Start guide for those who want to jump right in.

Architectural Components

Kafka Monitor

Design
Learn about the design considerations for the Kafka Monitor
Quick Start
How to use and run the Kafka Monitor
API
The default Kafka API that comes with Scrapy Cluster
Plugins
Gives an overview of the different plugin components within the Kafka Monitor, and how to make your own.
Settings
Explains all of the settings used by the Kafka Monitor

Crawler

Design
Learn about the design considerations for the Scrapy Cluster Crawler
Quick Start
How to use and run the distributed crawlers
Controlling
Learning how to control your Scrapy Cluster will enable you to get the most out of it
Extension
How to use both Scrapy and Scrapy Cluster to enhance your crawling capabilities
Settings
Explains all of the settings used by the Crawler

Redis Monitor

Design
Learn about the design considerations for the Redis Monitor
Quick Start
How to use and run the Redis Monitor
Plugins
Gives an overview of the different plugin components within the Redis Monitor, and how to make your own.
Settings
Explains all of the settings used by the Redis Monitor

Rest

Design
Learn about the design considerations for the Rest service
Quick Start
How to use and run the Rest service
API
The API that comes with the endpoint
Settings
Explains all of the settings used by the Rest component

Utilities

Argparse Helper
Simple module to assist in argument parsing with subparsers.
Log Factory
Module for logging multithreaded or concurrent processes to files, stdout, and/or JSON.
Method Timer
A method decorator to time out function calls.
Redis Queue
A module for easily creating Redis-based FIFO, Stack, and Priority Queues.
Redis Throttled Queue
A wrapper around the redis_queue module to enable distributed throttled pops from the queue.
Settings Wrapper
Easy-to-use module that loads both default and local settings for your Python application and returns a dictionary object.
Stats Collector
Module for statistics-based collection in Redis, including counters, rolling time windows, and HyperLogLog counters.
Zookeeper Watcher
Module that watches a Zookeeper file and handles Zookeeper session connection troubles and re-establishment of watches.

Advanced Topics

Upgrade Scrapy Cluster
How to upgrade an older version of Scrapy Cluster to the latest version
Integration with ELK
Visualizing your cluster with the ELK stack gives you new insight into your cluster
Docker
Use Docker to provision and scale your Scrapy Cluster
Crawling Responsibly
Responsible Crawling with Scrapy Cluster
Production Setup
Thoughts on Production Scale Deployments
DNS Cache
DNS Caching is bad for long-lived spiders
Response Time
How the production setup influences cluster response times
Kafka Topics
The Kafka Topics typically generated when running the cluster
Redis Keys
The keys generated when running a Scrapy Cluster in production
Other Distributed Scrapy Projects
A comparison with other Scrapy projects that are distributed in nature

Miscellaneous

Frequently Asked Questions
Scrapy Cluster FAQ
Troubleshooting
Debugging distributed applications is hard; learn how easy it is to debug Scrapy Cluster.
Contributing
Learn how to contribute to Scrapy Cluster
Change Log
View the changes between versions of Scrapy Cluster.
License
Scrapy Cluster is licensed under the MIT License.