Frequently Asked Questions

Common questions about Scrapy Cluster organized by category.

General

I can’t get my cluster working, where do I begin?

We always recommend using the latest stable release or commit from the master branch. Code pulled from the master branch should successfully pass all of the tests run by ./offline_tests.sh.

You can then test your online integration settings by running ./online_tests.sh and determining at what point your tests fail. If all of the online tests pass, your cluster is ready for use.
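For reference, a typical test sequence looks like the following, assuming you run the scripts from the root of the repository with all dependencies installed:

$ ./offline_tests.sh
$ ./online_tests.sh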

Online test failures are normally the result of improper settings or network ports not being properly configured. Triple check these!

If you are still stuck, please refer to the Vagrant Quickstart guide for setting up an example cluster, or the Troubleshooting page for more information.

How do I debug a component?

Both the Kafka Monitor and Redis Monitor can have their log level altered by passing the --log-level flag to the command of choice. For example, you can see more verbose debug output from the Kafka Monitor’s run command with the following command.

$ python kafka_monitor.py run --log-level DEBUG

You can also alter the LOG_LEVEL setting in your localsettings.py file to achieve the same effect.
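As a minimal sketch, assuming the default setting name shown above, a localsettings.py that enables verbose logging needs only a single line:

LOG_LEVEL = 'DEBUG'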

If you wish to debug Scrapy Cluster based components in your Scrapy Spiders, use the SC_LOG_LEVEL setting in your localsettings.py file to see Scrapy Cluster specific debug output. Normal Scrapy debugging techniques can be applied here as well, as the Scrapy Cluster debugging is designed not to interfere with Scrapy based debugging.
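Likewise, a minimal sketch for the crawler side, assuming the SC_LOG_LEVEL setting name described above, would be a localsettings.py containing:

SC_LOG_LEVEL = 'DEBUG'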

What branch should I work from?

If you wish to have stable, tested, and documented code please use the master branch. For untested, undocumented bleeding edge developer code please see the dev branch. All other branches should be offshoots of the dev branch and will be merged back in at some point.

Why do you recommend using a localsettings.py file instead of altering the settings.py file that comes with the components?

Local settings allow you to keep your custom cluster settings separate from the defaults provided with each component. If we decide to change variable names, add new settings, or alter default values, you would otherwise face a merge conflict when you pull that new commit down.

By keeping your settings separate, you can also have more than one settings configuration at a time! For example, you can use the Kafka Monitor to push JSON into two different Kafka topics for various testing, or keep separate local debugging and production setups on your machines. Use the --settings flag for either the Kafka Monitor or Redis Monitor to alter their configuration.
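For example, a hypothetical second configuration could be selected at runtime like this (localsettings_testing.py is an illustrative name, and this assumes the --settings flag accepts the path to your settings file):

$ python kafka_monitor.py run --settings localsettings_testing.py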

Note

The local settings override flag does not apply to the Scrapy settings; Scrapy uses its own style of settings overrides, which can be found on this page.

How do I deploy my cluster in an automated fashion?

Deploying a Scrapy Cluster in an automated fashion is highly dependent on the environment you are working in. Because we cannot control the OS you are running, the packages installed, or your network setup, we recommend using an automated deployment framework that fits your needs. Some suggestions include Ansible, Puppet, Chef, Salt, Anaconda Cluster, etc.

Are there other distributed Scrapy projects?

Yes! Please see our breakdown at Other Distributed Scrapy Projects.

I would like to contribute but do not know where to begin, how can I help?

You can find suggestions of things we could use help on here.

How do I contact the community surrounding Scrapy Cluster?

Feel free to reach out by joining the Gitter chat room, or for more formal issues please raise an issue.

Kafka Monitor

How do I extend the Kafka Monitor to fit my needs?

Please see the plugin documentation here for adding new plugins to the Kafka Monitor. If you would like to contribute to core Kafka Monitor development please consider looking at our guide for Submitting Pull Requests.

Crawler

How do I create a Scrapy Spider that works with the cluster?

To use everything Scrapy Cluster has to offer with your new Spider, your class needs to inherit from our RedisSpider base class. It provides a custom self._logger for Scrapy Cluster based logging and a method that allows you to update your spider statistics for each Response:

self._increment_status_code_stat(response)

You can also yield new Requests or items like a normal Scrapy Spider. For more information see the crawl extension documentation.
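A minimal sketch of such a spider is shown below. The import path and class details are assumptions based on the description above, so adjust them to match your project layout:

import scrapy

from redis_spider import RedisSpider  # assumed import path; may vary


class MySpider(RedisSpider):
    # a minimal cluster-aware spider sketch
    name = 'my_spider'

    def parse(self, response):
        # update the cluster's spider statistics for this response
        self._increment_status_code_stat(response)
        # log through the Scrapy Cluster logger
        self._logger.debug("Crawled " + response.url)

        # yield new Requests or items like a normal Scrapy spider
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)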

Can I use everything else that the original Scrapy has to offer, like middlewares, pipelines, etc?

Yes, you can. Our core logic relies on a heavily customized Scheduler, which is not normally exposed to users. If Scrapy Cluster hinders a Scrapy ability you need, please let us know.

Do I have to restart my Scrapy Cluster Crawlers when I push a new domain specific configuration?

No, the crawlers receive a notification from Zookeeper when their configuration changes and automatically update to the new desired settings, without a restart. For more information please see here.

Redis Monitor

How do I extend the Redis Monitor to fit my needs?

Please see the plugin documentation here for adding new plugins to the Redis Monitor. If you would like to contribute to core Redis Monitor development please consider looking at our guide for Submitting Pull Requests.

Utilities

Are the utilities dependent on Scrapy Cluster?

No! The utilities package is located on PyPI here and can be downloaded and used independently of this project.
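Assuming the package is still published under the scutils name (check PyPI to confirm), installation is as simple as:

$ pip install scutils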

Have a question that isn’t answered here or in our documentation? Feel free to read our Raising Issues guidelines about opening an issue.