Change Log

This page documents the changes made between releases.

Scrapy Cluster 1.3

Date: ??/??/????

Added Python 3 support, removed Python 2 support
Added Redis password support
Fixed assert deprecations in unit tests
Corrected Ansible host list for Zookeeper
Improved cookie handling
Minor documentation changes/updates
Added REDIS_SOCKET_TIMEOUT setting to control socket_timeout and socket_connect_timeout
Removed Ansible support
Fixed Python package dependencies
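
As a rough illustration of the Redis password and REDIS_SOCKET_TIMEOUT entries above, the sketch below shows how one timeout setting could feed both socket_timeout and socket_connect_timeout when building connection arguments for redis-py. Only REDIS_SOCKET_TIMEOUT is named in this change log. The other setting names, the sample values, and the redis_connection_kwargs helper are assumptions for illustration, not the project's actual code.

```python
# Hypothetical settings dictionary. REDIS_SOCKET_TIMEOUT is the only name
# taken from the change log; the other keys and all values are assumed.
SETTINGS = {
    "REDIS_HOST": "localhost",
    "REDIS_PORT": 6379,
    "REDIS_DB": 0,
    "REDIS_PASSWORD": None,      # None means no password is sent
    "REDIS_SOCKET_TIMEOUT": 10,  # seconds
}


def redis_connection_kwargs(settings):
    """Map settings to keyword arguments accepted by redis-py's redis.Redis.

    Illustrative helper (not part of Scrapy Cluster): the single timeout
    setting is applied to both socket_timeout and socket_connect_timeout.
    """
    timeout = settings.get("REDIS_SOCKET_TIMEOUT")
    return {
        "host": settings["REDIS_HOST"],
        "port": settings["REDIS_PORT"],
        "db": settings.get("REDIS_DB", 0),
        "password": settings.get("REDIS_PASSWORD"),
        "socket_timeout": timeout,
        "socket_connect_timeout": timeout,
    }


kwargs = redis_connection_kwargs(SETTINGS)
# With redis-py installed, a connection could then be created with:
#   import redis
#   conn = redis.Redis(**kwargs)
```

Keeping the mapping in one helper means every component would derive both timeouts from the same value, which appears to be the intent of the setting.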

Date: 03/29/2017

Added Coveralls code coverage integration
Added full stack offline unit tests and online integration testing in Travis CI
Upgraded all components to newest Python packages
Switched example Virtual Machine from Miniconda to Virtualenv
Added setting to specify the Redis db across all components
Added Docker support
Improved RedisThrottledQueue implementation to allow for rubber band catch up while under moderation
Added support for CentOS and Ubuntu Virtual Machines

Added ability to control cluster-wide blacklists via Zookeeper
Improved memory management in scheduler for domain-based queues
Added two new spider middlewares for stats collection and meta field passthrough
Removed excess pipeline middleware

Date: 02/23/2016

Added domain-based queue mechanism for better management and control across all components of the cluster
Added easy offline bash script for running all offline tests
Added online bash test script for testing if your cluster is integrated with all other external components
Added new Vagrant virtual machine for easier development and testing
Renamed demo incoming Kafka topic from demo.incoming_urls to demo.incoming, since all crawl/info/stop requests are now serviced through a single topic
Added new scutils package for accessing modules across different components
Added scutils documentation
Added significantly more documentation and improved layout
Created new elk folder for sample Elasticsearch, Logstash, Kibana integration

Upgraded Crawler to be compatible with Scrapy 1.0
Improved code structure for overriding url.encode in default LxmlParserLinkExtractor
Improved logging
Added ability to rate limit the whole crawling cluster based on the domain, spider type, and public IP address of the crawlers
Added ability to explicitly define the crawl rate per domain in Zookeeper and to update it dynamically
Created manual crawler Zookeeper configuration pusher
Updated offline and added online unit tests
Added response code stats collection
Added example Wandering Spider

Date: 5/21/2015