Quick Start¶
This is a complete Scrapy crawling project located in crawler/.

First, create a crawling/localsettings.py file to track your custom settings. You can override any setting in the normal settings.py file, and a typical file may look like the following:
REDIS_HOST = 'scdev'
KAFKA_HOSTS = 'scdev:9092'
ZOOKEEPER_HOSTS = 'scdev:2181'
SC_LOG_LEVEL = 'DEBUG'
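The localsettings file works by overriding keys from the default settings. A minimal sketch of that override pattern is below; the dictionaries and the merge step are illustrative only, not the project's actual settings loader.

```python
# Illustrative defaults, standing in for settings.py values.
default_settings = {
    'REDIS_HOST': 'localhost',
    'KAFKA_HOSTS': 'localhost:9092',
    'ZOOKEEPER_HOSTS': 'localhost:2181',
    'SC_LOG_LEVEL': 'INFO',
}

# Illustrative overrides, standing in for crawling/localsettings.py.
local_overrides = {
    'REDIS_HOST': 'scdev',
    'KAFKA_HOSTS': 'scdev:9092',
    'ZOOKEEPER_HOSTS': 'scdev:2181',
    'SC_LOG_LEVEL': 'DEBUG',
}

# Any key present in the local overrides wins over the default.
settings = {**default_settings, **local_overrides}
print(settings['REDIS_HOST'])  # scdev
```

Keeping machine-specific values like hostnames in a separate local file means settings.py stays untouched and shareable across deployments.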
Scrapy¶
Then run the Scrapy spider like normal:
scrapy runspider crawling/spiders/link_spider.py
To run multiple crawlers, simply run additional spider processes in the background across any number of machines. Because the crawlers coordinate their efforts through Redis, any crawler can be brought up or down on any machine to add or remove crawling capacity.
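The coordination idea can be sketched with an in-memory stand-in for the shared request queue; the real cluster stores this queue in Redis so every crawler on every machine drains the same backlog. The queue contents and function here are purely illustrative.

```python
from collections import deque

# In-memory stand-in for the shared Redis-backed request queue.
shared_queue = deque([
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
])

def crawler_pop(queue):
    """Each crawler pops the next pending request from the shared queue.

    Because the queue is shared, adding or removing crawler processes
    only changes how fast the backlog drains -- no rebalancing needed.
    """
    return queue.popleft() if queue else None

# Two "crawlers" pulling work from the same queue:
first = crawler_pop(shared_queue)
second = crawler_pop(shared_queue)
print(first, second)  # http://example.com/a http://example.com/b
```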
Typical Usage¶
Open four terminals.
Terminal 1:
Monitor your kafka output
$ python kafkadump.py dump -t demo.crawled_firehose -p
Terminal 2:
Run the Kafka Monitor
$ python kafka_monitor.py run
Note
This assumes you have your Kafka Monitor already working.
Terminal 3:
Run your Scrapy spider
scrapy runspider crawling/spiders/link_spider.py
Terminal 4:
Feed an item
$ python kafka_monitor.py feed '{"url": "http://dmoz.org", "appid":"testapp", "crawlid":"09876abc"}'
2016-01-21 23:22:23,830 [kafka-monitor] INFO: Feeding JSON into demo.incoming
{
"url": "http://dmoz.org",
"crawlid": "09876abc",
"appid": "testapp"
}
2016-01-21 23:22:23,832 [kafka-monitor] INFO: Successfully fed item to Kafka
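The fed item is plain JSON carrying a url, appid, and crawlid. A hypothetical validator for such a payload is sketched below; the real Kafka Monitor checks requests against its own schemas, and the function name and required-field set here are assumptions for illustration.

```python
import json

# Fields present in the feed example above (illustrative requirement set).
REQUIRED_FIELDS = {'url', 'appid', 'crawlid'}

def validate_feed(raw):
    """Parse a feed payload and check the illustrative required fields."""
    item = json.loads(raw)
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise ValueError('missing fields: {}'.format(sorted(missing)))
    return item

item = validate_feed(
    '{"url": "http://dmoz.org", "appid": "testapp", "crawlid": "09876abc"}'
)
print(item['crawlid'])  # 09876abc
```

The appid identifies which application submitted the request, and the crawlid lets you correlate the eventual results with this submission.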
You should see a log message come through Terminal 2 stating the message was received.
2016-01-21 23:22:23,859 [kafka-monitor] INFO: Added crawl to Redis
Next, you should see your Spider in Terminal 3 report that the crawl was successful.
2016-01-21 23:22:35,976 [scrapy-cluster] INFO: Scraped page
2016-01-21 23:22:35,979 [scrapy-cluster] INFO: Sent page to Kafka
At this point, your Crawler is up and running!
If you are still listening to the Kafka Topic in Terminal 1, the following should come through.
{
"body": "<body omitted>",
"crawlid": "09876abc",
"response_headers": {
<headers omitted>
},
"response_url": "http://www.dmoz.org/",
"url": "http://www.dmoz.org/",
"status_code": 200,
"status_msg": "OK",
"appid": "testapp",
"links": [],
"request_headers": {
"Accept-Language": "en",
"Accept-Encoding": "gzip,deflate",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"User-Agent": "Scrapy/1.0.4 (+http://scrapy.org)"
},
"attrs": null,
"timestamp": "2016-01-22T04:22:35.976672"
}
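A downstream application could consume a message like the one above by parsing the JSON and routing on its fields. The snippet below is a sketch under that assumption, using a trimmed copy of the sample message; it is not part of the cluster itself.

```python
import json

# Trimmed copy of the sample crawl result shown above.
raw = '''{
    "crawlid": "09876abc",
    "response_url": "http://www.dmoz.org/",
    "url": "http://www.dmoz.org/",
    "status_code": 200,
    "status_msg": "OK",
    "appid": "testapp",
    "links": [],
    "attrs": null,
    "timestamp": "2016-01-22T04:22:35.976672"
}'''

result = json.loads(raw)

# Route by appid and check the HTTP status before further processing.
if result['appid'] == 'testapp' and result['status_code'] == 200:
    print('crawl {} ok: {}'.format(result['crawlid'], result['response_url']))
```

Matching on appid and crawlid is what lets many applications share one firehose topic while each picks out only its own results.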
This completes the crawl submission, execution, and receipt of the requested data.