Change Log
This page documents the changes made between releases.
Scrapy Cluster 1.1
Date: 02/23/2016
- Added domain-based queue mechanism for better management and control across all components of the cluster
- Added easy offline bash script for running all offline tests
- Added online bash test script for testing if your cluster is integrated with all other external components
- New Vagrant virtual machine for easier development and testing
- Modified demo incoming kafka topic from demo.incoming_urls to just demo.incoming as now all crawl/info/stop requests are serviced through a single topic
- Added new scutils package for accessing modules across different components
- Added scutils documentation
- Added significantly more documentation and improved layout
- Created new elk folder for sample Elasticsearch, Logstash, Kibana integration
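As a rough illustration of what servicing crawl, info, and stop requests through a single topic implies, the sketch below sorts JSON messages by their contents. The function name `classify_request` and the field names (`url`, `action`, `appid`, `crawlid`, `uuid`) are assumptions made for this example and may not match the project's actual schema.

```python
import json


def classify_request(raw):
    """Return 'crawl', 'info', or 'stop' for one JSON message.

    Hypothetical helper: the "action" and "url" field names are
    assumptions for illustration, not a confirmed schema.
    """
    message = json.loads(raw)
    action = message.get("action")
    if action in ("info", "stop"):
        # Action requests carry an explicit action name.
        return action
    if "url" in message:
        # Anything with a URL and no action is treated as a crawl request.
        return "crawl"
    raise ValueError("unrecognized request: %r" % (message,))


# Three different request types arriving on the same topic.
messages = [
    '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc123"}',
    '{"action": "info", "appid": "testapp", "uuid": "someuuid"}',
    '{"action": "stop", "appid": "testapp", "uuid": "someuuid"}',
]
kinds = [classify_request(m) for m in messages]
print(kinds)
```

With one topic, the consumer rather than the topic name decides how each message is handled, which is why a request that matches no known shape is rejected here instead of being silently dropped.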
Kafka Monitor Changes
- Condensed the Crawler and Actions monitor into a single script
- Renamed kafka-monitor.py to kafka_monitor.py for better PEP 8 standards
- Added plugin functionality for easier extension creation
- Improved kafka topic dump utility
- Added both offline and online unit tests
- Improved logging
- Added defaults to scraper_schema.json
- Added Stats Collection and interface for retrieving stats
Redis Monitor Changes
- Added plugin functionality for easier extension creation
- Added both offline and online unit tests
- Improved logging
- Added Stats Collection
Crawler Changes
- Upgraded Crawler to be compatible with Scrapy 1.0
- Improved code structure for overriding url.encode in default LxmlParserLinkExtractor
- Improved logging
- Added ability to rate limit the whole crawling cluster based upon the domain, spider type, and public IP address of the crawlers
- Added ability for the crawl rate to be explicitly defined per domain in Zookeeper, with the ability to update it on the fly
- Created manual crawler Zookeeper configuration pusher
- Updated offline and added online unit tests
- Added response code stats collection
- Added example Wandering Spider
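The rate limiting described above (per domain, optionally refined by spider type and public IP address) can be pictured as a sliding-window counter keyed on those three values. The following is a minimal in-memory sketch only: the class name `DomainThrottle` and its parameters are hypothetical, and a real cluster would need to keep this state somewhere all crawlers can share rather than in a single process.

```python
import time
from collections import defaultdict, deque


class DomainThrottle:
    """Allow at most `hits` requests per `window` seconds for each key.

    Hypothetical illustration; not the project's implementation.
    """

    def __init__(self, hits, window, clock=time.monotonic):
        self.hits = hits
        self.window = window
        self.clock = clock  # injectable so the demo below is deterministic
        self._seen = defaultdict(deque)  # key -> timestamps of allowed hits

    def allow(self, domain, spider_type="", public_ip=""):
        key = (domain, spider_type, public_ip)
        now = self.clock()
        stamps = self._seen[key]
        # Discard timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) < self.hits:
            stamps.append(now)
            return True
        return False


# Demo with a fake clock: 2 hits allowed per 10 seconds per key.
fake_now = [0.0]
throttle = DomainThrottle(hits=2, window=10.0, clock=lambda: fake_now[0])
first_three = [throttle.allow("example.com", "link", "1.2.3.4") for _ in range(3)]
other_domain = throttle.allow("other.org", "link", "1.2.3.4")
fake_now[0] = 10.0  # the window has elapsed
after_window = throttle.allow("example.com", "link", "1.2.3.4")
print(first_three, other_domain, after_window)
```

Leaving `spider_type` and `public_ip` empty makes every crawler share one budget per domain, while filling them in gives each spider type or IP address its own budget. A per-domain override, as in the Zookeeper item above, would amount to choosing `hits` and `window` per domain instead of globally.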