Redis Monitor¶
The Redis Monitor serves to moderate the redis based crawling queue. It is used to expire old crawls, stop existing crawls, and gather information about current crawl jobs.
Quick Start¶
First, make sure your settings.py is updated with your Kafka and Redis hosts.
Then run the redis monitor to monitor your Scrapy Cluster:
python redis_monitor.py
This is a completely optional component to the Scrapy Cluster. If you do not care about getting information about crawls, stopping, or expiring crawls in the cluster then you can leave this component alone.
- For how to set up the Scrapy crawlers, refer to the Crawler documentation
- To learn more about how to submit crawls, please see the Kafka Monitor documentation
Design Considerations¶
The Redis Monitor is designed to act as a surgical instrument allows an application to peer into the queues and variables managed by Redis that act upon the Scrapy Cluster. It runs independently of the cluster, and interacts with the items within Redis in three distinct ways:
Information Retrieval
All of the crawl information is contained within Redis, but stored in a way that makes it hard for a human to read. The Redis Monitor makes getting current crawl information about your application’s overall crawl statistics or an individual crawl easy. If you would like to know how many urls are in a particularcrawlid
queue, or how many urls are in the overallappid
queue then this will be useful.
Stop Crawls
Sometimes crawls get out of control and you would like the Scrapy Cluster to immediately stop what it is doing, but without killing all of the other ongoing crawl jobs. Thecrawlid
is used to identify exactly what crawl job needs to be stopped, and thatcrawlid
will both be purged from the request queue, and blacklisted so the spider’s know to never crawl urls with that particularcrawlid
again.
Expire Crawls
Very large and very broad crawls often can take longer than the time allotted to crawl the domain. In this case, you can set an expiration time for the crawl to automatically expire after X date. The particular crawl job will be purged from the spider’s queue while allowing other crawls to continue onward.
These three use cases conducted by the Redis Monitor give a user or an application a great deal of control over their Scrapy Cluster jobs when executing various levels of crawling effort. At any one time the cluster could be doing an arbitrary number of different crawl styles, with different spiders or different crawl parameters, and the design goal of the Redis Monitor is to control the jobs through the same Kafka interface that all of the applications use.
Components¶
This section explains the individual files located within the redis monitor project.
redis_monitor.py¶
The Redis Monitor acts upon two different action requests, and also sends out notifications when a crawl job expires from the queues.
All requests adhere to the following three Kafka topics for input and output:
Incoming Action Request Kafka Topic:
demo.inbound_actions
- The topic to feed properly formatted action requests to
Outbound Action Result Kafka Topics:
demo.outbound_firehose
- A firehose topic of all resulting actions within the system. Any single action conducted by the Redis Monitor is guaranteed to come out this pipe.demo.outbound_<appid>
- A special topic created for unique applications that submit action requests. Any application can listen to their own specific action results by listening to the the topic created under theappid
they used to submit the request. These topics are a subset of the action firehose data and only contain the results that are applicable to the application who submitted it.
Information Action
The info
action can be conducted in two different ways.
Application Info Request
python kafka-monitor.py feed -s settings_actions.py '{"action":"info", "appid":"testapp", "uuid":"someuuid", "spiderid":"link"}This returns back all information available about the
appid
in question. It is a summation of the variouscrawlid
statistics.Application Info Response
{ u'server_time': 1429216294, u'uuid': u'someuuid', u'total_pending': 12, u'total_domains': 0, u'total_crawlids': 2, u'appid': u'testapp', u'crawlids': { u'2aaabbb': { u'low_priority': 29, u'high_priority': 29, u'expires': u'1429216389' u'total': 1 }, u'1aaabbb': { u'low_priority': 29, u'high_priority': 39, u'total': 11 } } }Here, there were two different
crawlid
‘s in the queue for thelink
spider that had the specifiedappid
. The json return value is the basic structure seen above that breaks down the differentcrawlid
‘s into their total, their high/low priority in the queue, and if they have an expiration.
Crawl ID Info Request
python kafka-monitor.py feed -s settings_actions.py '{"action":"info", "appid":"myapp", "uuid":"someuuid", "crawlid":"abc123", "spiderid":"link"}'This is a very specific request that is asking to poll a specific
crawlid
in thelink
spider queue. Note that this is very similar to the above request but with one extra parameter. The following example response is generated:
Crawl ID Info Response from Kafka
{ u'server_time': 1429216864, u'crawlid': u'abc123', u'total_pending': 28, u'low_priority': 39, u'high_priority': 39, u'appid': u'testapp', u'uuid': u'someuuid' }The response to the info request is a simple json object that gives statistics about the crawl in the system, and is very similar to the results for an
appid
request. Here we can see that there were 28 requests in the queue yet to be crawled of all the same priority.
Stop Action
The stop
action is used to abruptly halt the current crawl job. A request takes the following form:
Stop Request
python kafka-monitor.py feed -s settings_actions.py '{"action":"stop", "appid":"testapp", "uuid":"someuuid2", "crawlid":"ABC123", "spiderid":"link"}'After the request is processed, only current spiders within the cluster currently in progress of downloading a page will continue. All other spiders will not crawl that same
crawlid
past a depth of 0 ever again, and all pending requests will be purged from the queue.
Stop Response from Kafka
{ u'total_purged': 524, u'uuid': u'someuuid', u'spiderid': u'link', u'appid': u'testapp', u'action': u'stop', u'crawlid': u'ABC123' }The json response tells the application that the stop request was successfully completed, and states how many requests were purged from the particular queue.
Expire Notification
An expire
notification is generated by the Redis Monitor any time an on going crawl is halted because it has exceeded the time it was supposed to stop. A crawl request that includes an expires
attribute will generate an expire notification when it is stopped by the Redis Monitor.
Expire Notification from Kafka
{ u'total_expired': 75, u'crawlid': u'abcdef-1', u'spiderid': u'link', u'appid': u'testapp', u'action': u'expired' }This notification states that the
crawlid
of “abcdef-1” expired within the system, and that 75 pending requests were removed.