Batch ingestion API overview
The 51黑料不打烊 Experience Platform Batch Ingestion API allows you to ingest data into Platform as batch files. Data being ingested can be profile data from a flat file (such as a Parquet file) or data that conforms to a known schema in the Experience Data Model (XDM) registry.
The provides additional information on these API calls.
The following diagram outlines the batch ingestion process:
Getting started
The API endpoints used in this guide is part of the . Before continuing, please review the getting started guide for links to related documentation, a guide to reading the sample API calls in this document, and important information regarding required headers that are needed to successfully make calls to any Experience Platform API.
Data Ingestion prerequisites
- Data to upload must be either in Parquet or JSON formats.
- A dataset created in the Catalog services.
- Contents of the Parquet file must match a subset of the schema of the dataset being uploaded into.
- Have your unique Access Token after authentication.
Batch ingestion best practices
- The recommended batch size is between 256 MB and 100 GB.
- Each batch should contain at most 1500 files.
Batch ingestion constraints
Batch data ingestion has some constraints:
- Maximum number of files per batch: 1500
- Maximum batch size: 100 GB
- Maximum number of properties or fields per row: 10000
- Maximum number of batches on data lake per minute, per user: 2000
Types
When ingesting data, it is important to understand how Experience Data Model (XDM) schemas work. For more information about how XDM field types map to different formats, please read the Schema Registry developer guide.
There is some flexibility when ingesting data - if a type does not match what is in the target schema, the data will be converted to the expressed target type. If it cannot, it will fail the batch with a TypeCompatibilityException
.
For example, neither JSON nor CSV has a date
or date-time
type. As a result, these values are expressed using (鈥2018-07-10T15:05:59.000-08:00鈥) or Unix Time formatted in milliseconds (1531263959000) and are converted at ingestion time to the target XDM type.
The table below shows the conversions supported when ingesting data.
Using the API
The Data Ingestion API allows you to ingest data as batches (a unit of data that consists of one or more files to be ingested as a single unit) into Experience Platform in three basic steps:
- Create a new batch.
- Upload files to a specified dataset that matches the XDM schema of the data.
- Signal the end of the batch.
Create a batch
Before data can be added to a dataset, it must be linked to a batch, which will later be uploaded into a specified dataset.
POST /batches
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
-H "Content-Type: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
-d '{
"datasetId": "{DATASET_ID}"
}'
datasetId
Reponse
{
"id": "{BATCH_ID}",
"imsOrg": "{ORG_ID}",
"updated": 0,
"status": "loading",
"created": 0,
"relatedObjects": [
{
"type": "dataSet",
"id": "{DATASET_ID}"
}
],
"version": "1.0.0",
"tags": {},
"createdUser": "{USER_ID}",
"updatedUser": "{USER_ID}"
}
id
relatedObjects.id
File upload
After successfully creating a new batch for uploading, files can then be uploaded to a specific dataset.
You can upload files using the Small File Upload API. However, if your files are too large and the gateway limit is exceeded (such as extended timeouts, requests for body size exceeded, and other constrictions), you can switch over to the Large File Upload API. This API uploads the file in chunks, and stitches data together using the Large File Upload Complete API call.
Small file upload
Once a batch is created, data can be uploaded to a preexisting dataset. The file being uploaded must match its referenced XDM schema.
PUT /batches/{BATCH_ID}/datasets/{DATASET_ID}/files/{FILE_NAME}
{BATCH_ID}
{DATASET_ID}
{FILE_NAME}
Request
curl -X PUT "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/datasets/{DATASET_ID}/files/{FILE_NAME}.parquet" \
-H "content-type: application/octet-stream" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
--data-binary "@{FILE_PATH_AND_NAME}.parquet"
{FILE_PATH_AND_NAME}
Reponse
#Status 200 OK, with empty response body
Large file upload - create file
To upload a large file, the file must be split into smaller chunks and uploaded one at a time.
POST /batches/{BATCH_ID}/datasets/{DATASET_ID}/files/{FILE_NAME}?action=initialize
{BATCH_ID}
{DATASET_ID}
{FILE_NAME}
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/datasets/{DATASET_ID}/files/part1=a/part2=b/{FILE_NAME}.parquet?action=initialize" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Reponse
#Status 201 CREATED, with empty response body
Large file upload - upload subsequent parts
After the file has been created, all subsequent chunks can be uploaded by making repeated PATCH requests, one for each section of the file.
PATCH /batches/{BATCH_ID}/datasets/{DATASET_ID}/files/{FILE_NAME}
{BATCH_ID}
{DATASET_ID}
{FILE_NAME}
Request
curl -X PATCH "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/datasets/{DATASET_ID}/files/part1=a/part2=b/{FILE_NAME}.parquet" \
-H "content-type: application/octet-stream" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "Content-Range: bytes {CONTENT_RANGE}" \
--data-binary "@{FILE_PATH_AND_NAME}.parquet"
{FILE_PATH_AND_NAME}
Reponse
#Status 200 OK, with empty response
Signal batch completion
After all files have been uploaded to the batch, the batch can be signaled for completion. By doing this, the Catalog DataSetFile entries are created for the completed files and associated with the batch generated above. The Catalog batch is then marked as successful, which triggers downstream flows to ingest the available data.
Request
POST /batches/{BATCH_ID}?action=COMPLETE
{BATCH_ID}
curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}?action=COMPLETE" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Reponse
#Status 200 OK, with empty response
Check batch status
While waiting for the files to uploaded to the batch, the batch鈥檚 status can be checked to see its progress.
API format
GET /batch/{BATCH_ID}
{BATCH_ID}
Request
curl GET "https://platform.adobe.io/data/foundation/catalog/batch/{BATCH_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "x-api-key: {API_KEY}"
Reponse
{
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "MCDPCatalogService",
"createdUser": "{USER_ID}",
"updatedUser": "{USER_ID}",
"updated": 1494349963467,
"externalId": "{EXTERNAL_ID}",
"status": "success",
"errors": [
{
"code": "err-1494349963436"
}
],
"version": "1.0.3",
"availableDates": {
"startDate": 1337,
"endDate": 4000
},
"relatedObjects": [
{
"type": "batch",
"id": "foo_batch"
},
{
"type": "connection",
"id": "foo_connection"
},
{
"type": "connector",
"id": "foo_connector"
},
{
"type": "dataSet",
"id": "foo_dataSet"
},
{
"type": "dataSetView",
"id": "foo_dataSetView"
},
{
"type": "dataSetFile",
"id": "foo_dataSetFile"
},
{
"type": "expressionBlock",
"id": "foo_expressionBlock"
},
{
"type": "service",
"id": "foo_service"
},
{
"type": "serviceDefinition",
"id": "foo_serviceDefinition"
}
],
"metrics": {
"foo": 1337
},
"tags": {
"foo_bar": [
"stuff"
],
"bar_foo": [
"woo",
"baz"
],
"foo/bar/foo-bar": [
"weehaw",
"wee:haw"
]
},
"inputFormat": {
"format": "parquet",
"delimiter": ".",
"quote": "`",
"escape": "\\",
"nullMarker": "",
"header": "true",
"charset": "UTF-8"
}
}
}
{USER_ID}
The "status"
field is what shows the current status of the batch requested. The batches can have one of the following states: