Scrapy Cluster 1.2.1 Documentation
This documentation provides everything you need to know about Scrapy Cluster, the Scrapy-based distributed crawling project.
Introduction
- Overview
- Learn about the Scrapy Cluster Architecture.
- Quick Start
- A Quick Start guide for those who want to jump right in.
Architectural Components
Kafka Monitor
- Design
- Learn about the design considerations for the Kafka Monitor
- Quick Start
- How to use and run the Kafka Monitor
- API
- The default Kafka API that comes with Scrapy Cluster
- Plugins
- Gives an overview of the different plugin components within the Kafka Monitor, and how to make your own.
- Settings
- Explains all of the settings used by the Kafka Monitor
Crawler
- Design
- Learn about the design considerations for the Scrapy Cluster Crawler
- Quick Start
- How to use and run the distributed crawlers
- Controlling
- Learning how to control your Scrapy Cluster will enable you to get the most out of it
- Extension
- How to use both Scrapy and Scrapy Cluster to enhance your crawling capabilities
- Settings
- Explains all of the settings used by the Crawler
Redis Monitor
- Design
- Learn about the design considerations for the Redis Monitor
- Quick Start
- How to use and run the Redis Monitor
- Plugins
- Gives an overview of the different plugin components within the Redis Monitor, and how to make your own.
- Settings
- Explains all of the settings used by the Redis Monitor
Rest
- Design
- Learn about the design considerations for the Rest service
- Quick Start
- How to use and run the Rest service
- API
- The API that comes with the endpoint
- Settings
- Explains all of the settings used by the Rest component
Utilities
- Argparse Helper
- Simple module to assist in argument parsing with subparsers.
- Log Factory
- Module for logging multithreaded or concurrent processes to files, stdout, and/or JSON.
- Method Timer
- A method decorator to time out function calls.
- Redis Queue
- A module for easily creating Redis-based FIFO, Stack, and Priority Queues.
- Redis Throttled Queue
- A wrapper around the redis_queue module to enable distributed throttled pops from the queue.
- Settings Wrapper
- Easy-to-use module that loads both default and local settings for your Python application and returns a dictionary object.
- Stats Collector
- Module for statistics-based collection in Redis, including counters, rolling time windows, and HyperLogLog counters.
- Zookeeper Watcher
- Module that watches a Zookeeper file and handles Zookeeper session connection troubles and re-establishment of watches.
Advanced Topics
- Upgrade Scrapy Cluster
- How to update an older version of Scrapy Cluster to the latest
- Integration with ELK
- Visualizing your cluster with the ELK stack gives you new insight into your cluster
- Docker
- Use Docker to provision and scale your Scrapy Cluster
- Crawling Responsibly
- Responsible Crawling with Scrapy Cluster
- Production Setup
- Thoughts on Production Scale Deployments
- DNS Cache
- DNS caching is bad for long-lived spiders
- Response Time
- How the production setup influences cluster response times
- Kafka Topics
- The Kafka topics typically generated when running the cluster
- Redis Keys
- The keys generated when running a Scrapy Cluster in production
- Other Distributed Scrapy Projects
- A comparison with other Scrapy projects that are distributed in nature
Miscellaneous
- Frequently Asked Questions
- Scrapy Cluster FAQ
- Troubleshooting
- Debugging distributed applications is hard; learn how easy it is to debug Scrapy Cluster.
- Contributing
- Learn how to contribute to Scrapy Cluster
- Change Log
- View the changes between versions of Scrapy Cluster.
- License
- Scrapy Cluster is licensed under the MIT License.