Batch ingestion troubleshooting guide
This document answers frequently asked questions about the Adobe Experience Platform Batch Data Ingestion APIs.
Batch API Calls
Are batches immediately active after receiving an HTTP 200 OK from the CompleteBatch API?
No. The 200 OK response from the API means only that the batch has been accepted for processing. The batch becomes active only if it later transitions to the Active state; it can instead end in another final state, such as Failure.
Is it safe to retry the CompleteBatch API call after it fails?
Yes - it is safe to retry the API call. Even when the call reports a failure, the operation may have succeeded and the batch may have been accepted. Clients are expected to have retry mechanisms for API failures and are encouraged to retry. If the operation had already succeeded, the retried call also returns success.
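A retry loop along these lines is one way to act on that advice. This is a minimal sketch, not part of the API: `complete_with_retry` and the `complete_batch` callable are hypothetical names, the callable is assumed to perform the CompleteBatch request and return the HTTP status code, and the exponential backoff schedule is an arbitrary choice.

```python
import time
from typing import Callable, Optional


def complete_with_retry(
    complete_batch: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call complete_batch until it returns a 2xx status or attempts run out.

    complete_batch is a hypothetical caller-supplied function that sends the
    CompleteBatch request and returns the HTTP status code. It may raise
    OSError (for example ConnectionError) on network failures.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            status = complete_batch()
            if 200 <= status < 300:
                return status
            # Simplification: every non-2xx status is retried. A real client
            # may prefer to stop early on 4xx responses.
            last_error = RuntimeError(f"CompleteBatch returned HTTP {status}")
        except OSError as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(
        f"CompleteBatch failed after {max_attempts} attempts"
    ) from last_error
```

Because a retried call returns success when the earlier attempt had in fact gone through, the loop does not need to tell a first success apart from a repeated one.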
When should the Large File Upload API be used?
The recommended file size for using the Large File Upload API is 256 MB or larger. More information about how to use the Large File Upload API can be found here.
Why is the Large File Complete API call failing?
If chunks of a large file are found overlapping or missing, the server responds with an HTTP 400 Bad Request. This can occur because it is possible to upload overlapping chunks, as range validations are done at the time of file completion, when the file chunks are stitched together.
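Since ranges are only validated at completion time, a client can check its own chunk list before calling the Large File Complete API. The sketch below is illustrative: `find_chunk_problems` is a hypothetical helper, and it assumes each chunk is recorded as an inclusive `(start, end)` byte range, in the style of an HTTP Content-Range header.

```python
from typing import List, Tuple


def find_chunk_problems(chunks: List[Tuple[int, int]], file_size: int) -> List[str]:
    """Report gaps and overlaps in uploaded chunk ranges.

    Each chunk is an assumed (start, end) pair of byte offsets, end inclusive.
    An empty result means the chunks cover bytes 0..file_size-1 exactly once.
    """
    problems = []
    expected_start = 0  # first byte not yet covered by any chunk
    for start, end in sorted(chunks):
        if start > expected_start:
            problems.append(f"missing bytes {expected_start}-{start - 1}")
        elif start < expected_start:
            overlap_end = min(end, expected_start - 1)
            problems.append(f"overlap at bytes {start}-{overlap_end}")
        expected_start = max(expected_start, end + 1)
    if expected_start < file_size:
        problems.append(f"missing bytes {expected_start}-{file_size - 1}")
    return problems
```

For example, `find_chunk_problems([(0, 99), (150, 199)], 200)` reports the gap between the two chunks, which is the kind of condition that would otherwise surface as an HTTP 400 at completion.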
Ingestion Support
What are the supported ingest formats?
Currently, both Parquet and JSON are supported. CSV is supported on a legacy basis - data will be promoted to master and preliminary checks will be done, but modern features such as conversion, partitioning, and row validation are not supported.
Where should the batch input format be specified?
The input format should be specified at batch creation time within the payload. An example of how to specify the batch input format can be seen below:
curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-d '{
"datasetId": "{DATASET_ID}",
"inputFormat": {
"format": "json"
}
}'
Why is the uploaded data not appearing in the dataset?
In order for data to appear in the dataset, the batch must be marked as complete. All the files you want to ingest must be uploaded before marking the batch as complete. An example of marking a batch as complete can be seen below:
curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}?action=COMPLETE" \
-H 'Authorization: Bearer {ACCESS_TOKEN}' \
-H 'x-gw-ims-org-id: {ORG_ID}' \
-H 'x-api-key: {API_KEY}' \
-H 'x-sandbox-name: {SANDBOX_NAME}'
How is multi-line JSON ingested?
To ingest multi-line JSON, the isMultiLineJson flag needs to be set at the time of batch creation. An example of this can be seen below:
curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-d '{
"datasetId": "{DATASET_ID}",
"inputFormat": {
"format": "json",
"isMultiLineJson": true
}
}'
What is the difference between JSON lines (single-line JSON) and multi-line JSON?
For JSON lines, there is one JSON object per line. For example:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
For multi-line JSON, one object can occupy multiple lines, while all the objects are wrapped in a JSON array. For example:
[
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
{
"string": "string3",
"int": 3,
"array": [
3,
6,
9
],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
By default, Batch Data Ingestion uses single-line JSON.
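The two layouts carry the same records, so a file in one form can be rewritten in the other before upload. The following is a small sketch using only the Python standard library; the function names are hypothetical and nothing here calls the ingestion API.

```python
import json


def multiline_to_json_lines(text: str) -> str:
    """Convert a JSON array of objects into JSON Lines (one object per line)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep each record on a single line.
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


def json_lines_to_multiline(text: str) -> str:
    """Convert JSON Lines into a single indented JSON array."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.dumps(records, indent=2)
```

Converting to JSON Lines avoids the need to set isMultiLineJson at batch creation, since single-line JSON is the default.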
Is CSV ingestion supported?
CSV ingestion is only supported for flat schemas. Currently, ingesting hierarchical data in CSV is not supported.
To get all the data ingestion features, JSON or Parquet formats need to be used.
What types of validation are performed on the data?
There are three levels of validation performed on the data:
- Schema - Batch Ingestion ensures that the schema of the ingested data matches the schema of the dataset.
- Data Type - Batch Ingestion ensures that the type for each field ingested matches the type defined in the schema of the dataset.
- Constraints - Batch Ingestion ensures constraints, such as “Required”, “enum”, and “format”, are properly defined in the schema definition.
How can an already ingested batch be replaced?
An already ingested batch can be replaced by using the Batch Replay feature. More information about Batch Replay can be found here.
How is batch ingestion monitored?
Once a batch has been signaled for batch promotion, the batch ingestion progress can be monitored with the following request:
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches/{BATCH_ID}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
With this request, you will get a response similar to this:
200 OK
{
"{BATCH_ID}":{
"imsOrg":"{ORG_ID}",
"created":1494349962314,
"createdClient":"{API_KEY}",
"createdUser":"{USER_ID}",
"updatedUser":"{USER_ID}",
"completed":1494349963467,
"externalId":"{EXTERNAL_ID}",
"status":"staging",
"errors":[]
}
}
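The monitoring request can be wrapped in a polling loop that waits for a final state. This is a sketch under stated assumptions: `wait_for_batch` and `fetch_batch` are hypothetical names, `fetch_batch` is assumed to perform the GET request shown above and return the decoded JSON body, and the set of final status strings is inferred from the state names in this guide rather than confirmed against the service.

```python
import time
from typing import Callable, Collection

# Assumed final states, based on the state names used in this guide.
# The exact status strings the service returns should be confirmed first.
ASSUMED_FINAL_STATES = frozenset({"active", "success", "failure"})


def wait_for_batch(
    batch_id: str,
    fetch_batch: Callable[[str], dict],
    final_states: Collection[str] = ASSUMED_FINAL_STATES,
    interval_seconds: float = 30.0,
    max_polls: int = 20,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the batch reaches one of final_states, then return its record.

    fetch_batch is a hypothetical caller-supplied function that performs the
    GET request and returns the decoded JSON body.
    """
    for poll in range(max_polls):
        # The response is keyed by batch ID, as in the sample response.
        batch = fetch_batch(batch_id)[batch_id]
        if batch["status"].lower() in final_states:
            return batch
        if poll < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"batch {batch_id} not final after {max_polls} polls")
```

Passing `fetch_batch` and `sleep` as parameters keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.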
Batch States
What are the possible batch states?
A batch can, in its lifecycle, go through the following states:
... n retries by a Batch Monitoring service, the batch promotion has stalled.
What does “Staging” mean for batches?
When a batch is in “Staging”, it means that the batch was successfully signaled for promotion, and that the data is being staged for consumption downstream.
What does it mean when a batch is “Retrying”?
When a batch is in “Retrying”, it means that the batch’s data ingestion has been temporarily halted due to intermittent issues. When this happens, it requires no customer intervention.
What does it mean when a batch is “Stalled”?
When a batch is in “Stalled”, it means that Data Ingestion Services is experiencing difficulty ingesting the batch and all retries have been exhausted.
What does it mean if a batch is still “Loading”?
When a batch is in “Loading”, it means that the CompleteBatch API has not been called to promote the batch.
Is there a way to know if a batch has been successfully ingested?
Yes, once the batch status is “Active”, the batch has been successfully ingested. To find out the status of the batch, follow the steps detailed earlier.
What happens after a batch fails?
When a batch fails, the process stops and returns a Failure status. The reason it fails can be identified in the errors section of the payload. Examples of errors can be seen below:
"errors":[
{
"code":"106",
"description":"Dataset file is empty. Please upload file with data.",
"rows":[]
},
{
"code":"118",
"description":"CSV file contains empty header row.",
"rows":[]
}
]
Once the errors have been corrected, the batch can be re-uploaded.
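When deciding what to correct, it can help to reduce the errors section to one line per problem. The helper below is a minimal sketch: `summarize_errors` is a hypothetical name, and it assumes the batch record has the shape shown in the examples above, with an `errors` list whose entries carry `code` and `description`.

```python
from typing import List


def summarize_errors(batch: dict) -> List[str]:
    """Return one 'code: description' line per entry in the errors section."""
    return [
        f'{error.get("code", "?")}: {error.get("description", "")}'
        for error in batch.get("errors", [])
    ]
```

A batch record with no errors section, or with an empty one, yields an empty list.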
Batch Support
How should batches be deleted?
Instead of deleting directly from Catalog, batches should be removed using either method provided below:
- If the batch is in progress, the batch should be aborted.
- If the batch is successfully mastered, the batch should be reverted.
What batch-level metrics are available?
The following batch-level metrics are available for batches in the Active/Success state:
Why are metrics not available on some batches?
There are two reasons that metrics may not be available on your batch:
- The batch never successfully made it to the Active/Success state.
- The batch was promoted using a legacy promotion path, such as CSV ingestion.
What do the different status codes mean?
... GetBatch endpoint.