Kafka Monitor

The Kafka Monitor serves as the entry point into the crawler architecture. It validates both incoming crawl requests and incoming action requests. The files included within can be used to submit new crawl requests, gather information about currently running crawls, and dump results to the command line.

Quick Start

First, make sure your settings_crawling.py and settings_actions.py are updated with your Kafka and Redis hosts.

Then run the kafka monitor for crawl requests:

python kafka-monitor.py run -s settings_crawling.py

Then run the kafka monitor for action requests:

python kafka-monitor.py run -s settings_actions.py

Finally, submit a new crawl request:

python kafka-monitor.py feed -s settings_crawling.py '{"url": "http://istresearch.com", "appid":"testapp", "crawlid":"ABC123"}'

If you have everything else in the pipeline set up correctly, you should now see the raw HTML of the IST Research home page come through. You need both monitors to run the full cluster, but at a minimum you should have the -s settings_crawling.py monitor running.
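
One way to watch those results from the command line is the kafkadump utility described later in this document. A hypothetical invocation against the crawl result firehose topic (substitute your own Kafka host) would look like:

python kafkadump.py dump demo.crawled_firehose --host=<your-kafka-host>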

  • For how to set up the Scrapy crawlers, refer to the Crawler documentation
  • To learn more about how to see crawl info, please see the Redis Monitor documentation

Design Considerations

The design of the Kafka Monitor stemmed from the need to define a format that allowed for the creation of crawls in the crawl architecture from any application. If an application can read from and write to the Kafka cluster, it can write messages to a particular Kafka topic to create crawls.

Soon enough those same applications wanted the ability to retrieve and stop their crawls from that same interface, so we decided to make a dynamic interface that could support all of the request needs but utilize the same base code. In the future this base code could be expanded to handle any style of request, as long as the request can be validated and there is a place to send the result.

To support our own internal debugging and to verify that other applications were behaving properly, a utility program was also created on the side to interact with and monitor the Kafka messages coming through. This dump utility can be used to monitor any of the Kafka topics within the cluster.

Components

This section explains the individual files located within the Kafka Monitor project.

kafka-monitor.py

The Kafka Monitor consists of two different modes of operation:

python kafka-monitor.py
Usage:
    kafka-monitor.py run --settings=<settings>
    kafka-monitor.py feed --settings=<settings> <json_req>

When in run mode, the class monitors the designated Kafka topic and validates each incoming message against a formal JSON Schema specification. After the request is validated, the designated SCHEMA_METHOD handler function is called to process the data. We designed it this way so you can deploy the same simple Python file but dynamically configure it to validate and process your requests based on a tunable settings file. The formal specifications for submitted crawl requests and action requests are outlined in the sections below.
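
As a rough sketch of what run mode boils down to (not the actual implementation), the loop below validates each message against the configured schema and hands it off to a handler. It assumes the jsonschema and kafka-python libraries, and the host, topic, and handler names are illustrative stand-ins for values normally loaded from the settings file:

import json

from jsonschema import validate, ValidationError  # pip install jsonschema
from kafka import KafkaConsumer                   # pip install kafka-python

# Illustrative values; the real monitor reads these from its settings file
KAFKA_HOSTS = "localhost:9092"
INCOMING_TOPIC = "demo.incoming_urls"

with open("scraper_schema.json") as f:
    schema = json.load(f)

def handle_crawl_request(item):
    # stand-in for the configured SCHEMA_METHOD handler
    print("valid crawl request for appid %s" % item["appid"])

consumer = KafkaConsumer(INCOMING_TOPIC, bootstrap_servers=KAFKA_HOSTS)
for message in consumer:
    try:
        item = json.loads(message.value)
        validate(item, schema)      # raises ValidationError on a malformed request
        handle_crawl_request(item)  # dispatch to the handler
    except (ValueError, ValidationError):
        print("invalid request, ignoring")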

When in feed mode, the class acts as a simple interface to validate and submit a command line JSON request. An example is shown in the Quick Start section above, when you submit a new crawl request. Anything in the formal JSON specification can be submitted via the command line or via a Kafka message generated by another application. The feeder is only intended for submitting manual requests to the API for debugging; any major application should connect via Kafka.
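
For instance, an application that wanted to submit the Quick Start crawl request itself, rather than going through feed mode, could produce directly to the incoming topic listed below. A minimal sketch assuming the kafka-python library and a local Kafka host:

import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # your Kafka host(s)

request = {
    "url": "http://istresearch.com",
    "appid": "testapp",
    "crawlid": "ABC123",
}

# Write the crawl request to the topic the Kafka Monitor is watching
producer.send("demo.incoming_urls", json.dumps(request).encode("utf-8"))
producer.flush()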

Incoming Crawl Request Kafka Topic:

  • demo.incoming_urls - The topic to feed properly formatted crawl requests to

Outbound Crawl Result Kafka Topics:

  • demo.crawled_firehose - A firehose topic of all resulting crawls within the system. Any single page crawled by the Scrapy Cluster is guaranteed to come out this pipe.
  • demo.crawled_<appid> - A special topic created for unique applications that submit crawl requests. Any application can listen to its own specific crawl results by listening to the topic created under the appid it used to submit the request (see the sketch below). These topics are a subset of the crawl firehose data and only contain the results that are applicable to the application that submitted them.
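
For example, a minimal sketch of how an application that submitted requests under the appid testapp might listen to its own results topic, assuming the kafka-python library (the host is a placeholder):

import json

from kafka import KafkaConsumer  # pip install kafka-python

# demo.crawled_testapp follows the demo.crawled_<appid> convention described above
consumer = KafkaConsumer("demo.crawled_testapp", bootstrap_servers="localhost:9092")

for message in consumer:
    result = json.loads(message.value)
    # each message is a crawl result like the example shown later in this section
    print("%s %s %s" % (result["crawlid"], result["response_url"], result["status_code"]))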

Note

For more information about the topics generated and used by the Redis Monitor, please see the Redis Monitor documentation

Example Crawl Requests:

python kafka-monitor.py feed '{"url": "http://www.apple.com/", "appid":"testapp", "crawlid":"myapple"}' -s settings_crawling.py
  • Submits a single crawl of the homepage of apple.com
python kafka-monitor.py feed '{"url": "http://www.dmoz.org/", "appid":"testapp", "crawlid":"abc123", "maxdepth":2, "priority":90}' -s settings_crawling.py
  • Submits a dmoz.org crawl spidering 2 levels deep with a high priority
python kafka-monitor.py feed '{"url": "http://aol.com/", "appid":"testapp", "crawlid":"a23bbqwewqe", "maxdepth":3, "allowed_domains":["aol.com"], "expires":1423591888}' -s settings_crawling.py
  • Submits an aol.com crawl that (at the time it was submitted) expires in 3 minutes, with a larger depth of 3, but limits the crawlers to the aol.com domain so they do not get lost in the weeds of the internet.

Example Crawl Request Output from the kafkadump utility:

{
    u'body': u'<real raw html source here>',
    u'crawlid': u'abc1234',
    u'links': [],
    u'response_url': u'http://www.dmoz.org/Recreation/Food/',
    u'url': u'http://www.dmoz.org/Recreation/Food/',
    u'status_code': 200,
    u'status_msg': u'OK',
    u'appid': u'testapp',
    u'headers': {
        u'Cteonnt-Length': [u'40707'],
        u'Content-Language': [u'en'],
        u'Set-Cookie': [u'JSESSIONID=FB02F2BBDBDBDDE8FBE5E1B81B4219E6; Path=/'],
        u'Server': [u'Apache'],
        u'Date': [u'Mon, 27 Apr 2015 21:26:24 GMT'],
        u'Content-Type': [u'text/html;charset=UTF-8']
    },
    u'attrs': {},
    u'timestamp': u'2015-04-27T21:26:24.095468'
}

For a full specification as to how you can control the Scrapy Cluster crawl parameters, please refer to the scraper_schema.json documentation.

kafkadump.py

The Kafka dump utility stemmed from the need to quickly view the crawl results coming out of Kafka. It is a simple utility designed to do two things:

python kafkadump.py --help

Usage:
    kafkadump list --host=<host>
    kafkadump dump <topic> --host=<host> [--consumer=<consumer>]

When run with the list command, the utility dumps out all of the topics created on your cluster.

When run with the dump command, the utility connects to Kafka and dumps every message in the given topic, starting from the beginning. Once it hits the end it sits and waits for new data to stream through. This is especially useful for command line debugging of the cluster, to ensure that crawl results are flowing back out of the system, or for monitoring the results of information requests.
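
For example, with a placeholder Kafka host, you could list the cluster's topics and then tail the incoming crawl request topic used by the Kafka Monitor:

python kafkadump.py list --host=<your-kafka-host>
python kafkadump.py dump demo.incoming_urls --host=<your-kafka-host>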

scraper_schema.json

The Scraper Schema defines the level of interaction an application gets with the Scrapy Cluster. The following properties are available to control the crawling cluster; an example request combining several of the optional properties is shown after the lists below.

Required

  • appid: The application ID that submitted the crawl request. This should be able to uniquely identify who submitted the crawl request
  • crawlid: A unique crawl ID to track the executed crawl through the system. Crawl IDs are passed along when a maxdepth > 0 is submitted, so anyone can track all of the results from a given seed url. Crawl IDs also serve as a temporary duplication filter, so the same crawl ID will not continue to recrawl pages it has already seen.
  • url: The initial seed url to begin the crawl from. This should be a properly formatted full path url from which the crawl will begin.

Optional:

  • spiderid: The spider to use for the crawl. This feature allows you to choose the spider you wish to execute the crawl with
  • maxdepth: The depth at which to continue to crawl new links found on pages
  • priority: The priority to give to the url to be crawled. The spiders will crawl the highest priorities first.
  • allowed_domains: A list of domains that the crawl should stay within. For example, putting [ "cnn.com" ] will only continue to crawl links of that domain.
  • allow_regex: A list of regular expressions to apply to the links to crawl. Any link matching any of the regexes will be allowed to be crawled next.
  • deny_regex: A list of regular expressions that deny links from being crawled. Any link matching any of these regular expressions will not be crawled, as deny_regex takes precedence over allow_regex.
  • deny_extensions: A list of extensions to deny crawling, defaults to the extensions provided by Scrapy (which are pretty substantial).
  • expires: A unix timestamp in seconds since epoch for when the crawl should expire from the system and halt. For example, 1423669147 means the crawl will expire when the crawl system machines reach 3:39pm on 02/11/2015. This setting does not account for timezones, so if the machine time is set to EST (-5) and you give a UTC time three minutes in the future, the crawl will run for 5 hours and 3 minutes!
  • useragent: The header request user agent to fake when scraping the page. If not provided, it defaults to the Scrapy default.
  • attrs: A generic object that allows an application to pass any type of structured information through the crawl, to be received on the other side. Useful for applications that would like to pass other data along with the crawl results.
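
As an illustrative (hypothetical) request combining several of the optional properties above (the expires value, when used, is just seconds since epoch, e.g. int(time.time()) + 180 in Python for roughly three minutes from now):

python kafka-monitor.py feed -s settings_crawling.py '{"url": "http://www.dmoz.org/", "appid":"testapp", "crawlid":"dmozregex1", "maxdepth":1, "priority":80, "allowed_domains":["dmoz.org"], "allow_regex":[".*Recreation.*"], "useragent":"MyTestAgent/1.0", "attrs":{"campaign":"demo"}}'

With this request, only links within dmoz.org that match the Recreation pattern would be followed from the seed page, and the attrs object would be passed along to the crawl results for that request.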

action_schema.json

The Action Schema allows extra information to be gathered from the Scrapy Cluster, as well as stopping crawls while they are executing. These commands are executed by the Redis Monitor, and the following properties are available to control the action; example action requests are shown after the lists below.

Required

  • appid: The application ID that is requesting the action.
  • spiderid: The spider used for the crawl (in this case, link)
  • action: The action to take place on the crawl. Options are either info or stop
  • uuid: A unique identifier to associate with the action request. This is used for tracking purposes by the applications who submit action requests.

Optional:

  • crawlid: The unique crawlid to act upon. Only needed when stopping a crawl or gathering information about a specific crawl.
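
As illustrative examples (the uuid and crawlid values are arbitrary placeholders), an information request and a stop request for the Quick Start crawl could be fed through the action monitor like so:

python kafka-monitor.py feed -s settings_actions.py '{"appid":"testapp", "spiderid":"link", "action":"info", "uuid":"someuuid123", "crawlid":"ABC123"}'
python kafka-monitor.py feed -s settings_actions.py '{"appid":"testapp", "spiderid":"link", "action":"stop", "uuid":"someuuid456", "crawlid":"ABC123"}'

The results of these action requests come back through topics handled by the Redis Monitor, as described in the Redis Monitor documentation.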