API¶
The Rest service component exposes both the Kafka Monitor API and a mechanism for getting information that may take a significant amount of time.
Standardization¶
Aside from the Index
endpoint which provides information about the service itself, all objects receive the following wrapper:
{
"status": <status>,
"data": <data>,
"error" <error>
}
The status
key is a human readable SUCCESS
or FAILURE
string, used for quickly understanding what happened. The data
key provides the data that was requested, and the error
object contains information about any internal errors that occurred while processing your request.
Accompanying this object are standard 200
, 400
, 404
, or 500
response codes. Whenever you encounter non-standard response codes from the Rest service you should receive the accompanying information provided both in the response from the service itself and/or in the service logs. Problems might include:
Improperly formatted JSON or content headers are incorrect
{ "data": null, "error": { "message": "The payload must be valid JSON." }, "status": "FAILURE" }
Invalid JSON structure for desired endpoint
{ "data": null, "error": { "cause": "Additional properties are not allowed (u'crazykey' was unexpected)", "message": "JSON did not validate against schema." }, "status": "FAILURE" }
Unexpected Exception within the service itself
{ "data": null, "error": { "message": "Unable to connect to Kafka" }, "status": "FAILURE" }
The desired endpoint does not exist
{ "data": null, "error": { "message": "The desired endpoint does not exist" }, "status": "FAILURE" }
For all of these and any other error, you should expect diagnostic information to be either in the response or in the logs.
Index Endpoint¶
The index endpoint allow you to obtain basic information about the status of the Rest Service, its uptime, and connection to critical components.
Headers Expected: None
Method Types: GET
, POST
URI: /
Example¶
$ curl scdev:5343
Responses¶
Unable to connect to Redis or Kafka
{
"kafka_connected": false,
"my_id": "d209adf2aa01",
"node_health": "RED",
"redis_connected": false,
"uptime_sec": 143
}
Able to connect to Redis or Kafka, but not both at the same time
{
"kafka_connected": false,
"my_id": "d209adf2aa01",
"node_health": "YELLOW",
"redis_connected": true,
"uptime_sec": 148
}
Able to connect to both Redis and Kafka, fully operational
{
"kafka_connected": true,
"my_id": "d209adf2aa01",
"node_health": "GREEN",
"redis_connected": true,
"uptime_sec": 156
}
Here, a human readable node_health
field is provided, as well as information about which service is unavailable at the moment. If the component is not GREEN
in health you should troubleshoot your configuration.
Feed Endpoint¶
The feed endpoint transmits your request into JSON that will be sent to Scrapy Cluster. It follows the API exposed by the Kafka Monitor, and acts as a pass-through to that service. The assumptions made are as follows:
- Crawl requests made to the cluster do not expect a response back via Kafka
- Other requests like Action or Stat expect a response within a designated period of time. If a response is expected but not received, a Poll is used to further poll for the desired response.
Headers Expected: Content-Type: application/json
Method Types: POST
URI: /feed
Data: Valid JSON data for the request
Examples¶
Feed a crawl request
$ curl scdev:5343/feed -H "Content-Type: application/json" -d '{"url":"http://dmoztools.net", "appid":"madisonTest", "crawlid":"abc123"}'
Feed a Stats request
$ curl scdev:5343/feed -H "Content-Type: application/json" -d '{"uuid":"abc123", "appid":"stuff"}'
In both of these cases, we are translating the JSON required by the Kafka Monitor into a Restful interface request. You may use all of the API’s exposed by the Kafka Monitor here when creating your request.
Responses¶
The responses from the feed endpoint should match both the standardized object and the expected return value from the Kafka Monitor API.
Successful submission of a crawl request.
{
"data": null,
"error": null,
"status": "SUCCESS"
}
Successful response from a Redis Monitor request
{
"data": {... data here ...},
"error": null,
"status": "SUCCESS"
}
Unsuccessful response from a Redis Monitor request
{
"data": {
"poll_id": <uuid of request>
},
"error": null,
"status": "SUCCESS"
}
In this case, the response was unable to be obtained within the response time and the poll_id
should be used in the poll request below.
Poll Endpoint¶
The Poll endpoint provides the ability to retrieve data from long running requests that might take longer than the desired response time configured for the service. This is useful when conducting statistics gathering or pruning data via action requests.
Headers Expected: Content-Type: application/json
Method Types: POST
URI: /poll
Data: Valid JSON data for the request
JSON Schema¶
{
"type": "object",
"properties": {
"poll_id": {
"type": "string",
"minLength": 1,
"maxLength": 100,
"description": "The poll id to retrieve"
}
},
"required": [
"poll_id"
],
"additionalProperties": false
}
Example¶
$ curl scdev:5343/poll -XPOST -H "Content-Type: application/json" -d '{"poll_id":"abc123"}'
Responses¶
Successfully found a poll that has been completed, but was not returned initially during the request
{
"data": {... data here ...},
"error": null,
"status": "SUCCESS"
}
Did not find the results for the poll_id
{
"data": null,
"error": {
"message": "Could not find matching poll_id"
},
"status": "FAILURE"
}
Note that a failure to find the poll_id
may indicate one of two things:
- The request has not completed yet
- The request incurred a failure within another component of Scrapy Cluster