Developing ETL Integrations for 51黑料不打烊 Experience Platform
The ETL integration guide outlines general steps for creating high-performance, secure connectors for Experience Platform and ingesting data into Platform.
This guide also includes sample API calls to use when designing an ETL connector, with links to documentation that outlines each Experience Platform service, and use of its API, in more detail.
A sample integration is available on GitHub under the Apache License, Version 2.0.
Workflow
The following workflow diagram provides a high-level overview for the integration of 51黑料不打烊 Experience Platform components with an ETL application and connector.
51黑料不打烊 Experience Platform components
There are multiple Experience Platform components involved in ETL connector integrations. The following list outlines several key components and functionalities:
- 51黑料不打烊 Identity Management System (IMS) - Provides framework for authentication to 51黑料不打烊 services.
- IMS Organization - A corporate entity that can own or license products and services and allow access to its members.
- IMS User - Members of an IMS Organization. The Organization to User relationship is many to many.
- Sandbox - A virtual partition of a single Platform instance that helps develop and evolve digital experience applications.
- Data Discovery - Records the metadata of ingested and transformed data in Experience Platform.
- Data Access - Provides users with an interface to access their data in Experience Platform.
- Data Ingestion - Pushes data to Experience Platform with the Data Ingestion APIs.
- Schema Registry - Defines and stores schemas that describe the structure of data to be used in Experience Platform.
Getting started with Experience Platform APIs
The following sections provide additional information that you will need to know or have on-hand in order to successfully make calls to Experience Platform APIs.
Reading sample API calls
This guide provides example API calls to demonstrate how to format your requests. These include paths, required headers, and properly formatted request payloads. Sample JSON returned in API responses is also provided. For information on the conventions used in documentation for sample API calls, see the section on how to read example API calls in the Experience Platform troubleshooting guide.
Gather values for required headers
In order to make calls to Platform APIs, you must first complete the authentication tutorial. Completing the authentication tutorial provides the values for each of the required headers in all Experience Platform API calls, as shown below:
- Authorization: Bearer {ACCESS_TOKEN}
- x-api-key: {API_KEY}
- x-gw-ims-org-id: {ORG_ID}
All resources in Experience Platform are isolated to specific virtual sandboxes. All requests to Platform APIs require a header that specifies the name of the sandbox the operation will take place in:
- x-sandbox-name: {SANDBOX_NAME}
All requests that contain a payload (POST, PUT, PATCH) require an additional header:
- Content-Type: application/json
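If you are scripting calls rather than using curl, these headers can be assembled once and reused across requests. The following is a minimal Python sketch using the requests library; the environment variable names are illustrative only and are not required by Platform.

```python
import os

import requests

BASE_URL = "https://platform.adobe.io/data/foundation"

# Values obtained by completing the authentication tutorial.
# The environment variable names below are illustrative only.
headers = {
    "Authorization": f"Bearer {os.environ['ACCESS_TOKEN']}",
    "x-api-key": os.environ["API_KEY"],
    "x-gw-ims-org-id": os.environ["ORG_ID"],
    "x-sandbox-name": os.environ["SANDBOX_NAME"],
}

# Requests that carry a payload (POST, PUT, PATCH) also need a Content-Type header.
json_headers = {**headers, "Content-Type": "application/json"}

# Example: a simple authenticated GET against the Catalog Service.
response = requests.get(f"{BASE_URL}/catalog/dataSets", headers=headers, params={"limit": 1})
print(response.status_code)
```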
General user flow
To begin, an ETL user logs into the Experience Platform user interface (UI) and creates datasets for ingestion using a standard connector or push-service connector.
In the UI, the user creates the output dataset by selecting a dataset schema. The choice of schema depends on the type of data (record or time series) being ingested into Platform. By clicking on the Schemas tab within the UI, the user will be able to view all available schemas, including the behavior type that the schema supports.
In the ETL tool, the user will start designing their mapping transforms after configuring the appropriate connection (using their credentials). The ETL tool is assumed to already have Experience Platform connectors installed (process not defined in this Integration Guide).
Mockups for a sample ETL tool and workflow have been provided in the ETL workflow. While ETL tools may differ in format, most expose similar functionality.
View list of datasets
Using the source of data for mapping, a list of all available datasets can be fetched using the Catalog API.
You can issue a single API request to view all available datasets (for example, GET /dataSets), but best practice is to include query parameters that limit the size of the response.
In cases where full dataset information is requested, the response payload can exceed 3 GB in size, which can slow overall performance. Using query parameters to return only the information you need therefore makes Catalog queries more efficient.
List filtering
When filtering responses, you can use multiple filters in a single call by separating parameters with an ampersand (&). Some query parameters accept comma-separated lists of values, such as the "properties" filter in the sample request below.
Catalog responses are automatically metered according to configured limits; however, the "limit" query parameter can be used to customize the constraints and limit the number of objects returned. The pre-configured Catalog response limits are:
- If a limit parameter is not specified, the maximum number of objects per response payload is 20.
- The global limit for all other Catalog queries is 100 objects.
- For dataset queries, if observableSchema is requested using the properties query parameter, the maximum number of datasets returned is 20.
- Invalid limit parameters (including limit=0) are met with an HTTP 400 error that outlines proper ranges.
- If limits or offsets are passed as query parameters, they take precedence over those passed as headers.
Query parameters are covered in more detail in the Catalog Service overview.
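As an illustration of the filtering described above, the following Python sketch (using the requests library) issues the same filtered listing shown in the curl request below; the placeholder header values come from the authentication tutorial.

```python
import requests

CATALOG_URL = "https://platform.adobe.io/data/foundation/catalog/dataSets"

headers = {
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
}

# "limit" caps the number of datasets returned; "properties" restricts each
# object to the listed attributes, keeping the response payload small.
params = {
    "limit": 3,
    "properties": "name,description,schemaRef",
}

datasets = requests.get(CATALOG_URL, headers=headers, params=params).json()
for dataset_id, dataset in datasets.items():
    print(dataset_id, dataset.get("name"))
```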
API format
GET /catalog/dataSets
GET /catalog/dataSets?{filter1}={value1},{value2}&{filter2}={value3}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=3&properties=name,description,schemaRef" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Please refer to the Catalog Service overview for detailed examples of how to make calls to the Catalog API.
Response
The response includes three (limit=3) datasets showing the "name", "description", and "schemaRef" as indicated by the properties query parameter.
{
"5b95b155419ec801e6eee780": {
"name": "Store Transactions",
"description": "Retails Store Transactions",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
},
"5c351fa2f5fee300000fa9e8": {
"name": "Loyalty Members",
"description": "Loyalty Program Members",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/fbc52b243d04b5d4f41eaa72a8ba58be",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
},
"5c1823b19e6f400000993885": {
"name": "Web Traffic",
"description": "Retail Web Traffic",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/2025a705890c6d4a4a06b16f8cf6f4ca",
"contentType": "application/vnd.adobe.xed+json;version=1"
}
}
}
View dataset schema
The "schemaRef" property of a dataset contains a URI referencing the XDM schema upon which the dataset is based. The XDM schema ("schemaRef") represents all potential fields that could be used by the dataset, not necessarily the fields that are being used (see "observableSchema" below).
The XDM schema is the schema you use when you need to present the user with a list of all available fields that could be written to.
The first "schemaRef.id" value in the previous response object (https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18) is a URI that points to a specific XDM schema in the Schema Registry. The schema can be retrieved by making a lookup (GET) request to the Schema Registry API.
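The schemaRef.id must be URL encoded before it is appended to the lookup path. The following Python sketch (using the requests library) shows one way to do this; the Accept header value and the placeholder credentials mirror the curl request below.

```python
import urllib.parse

import requests

SCHEMA_REGISTRY_URL = "https://platform.adobe.io/data/foundation/schemaregistry/tenant/schemas"

headers = {
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
    # The Accept header controls the response format (see the Accept headers listed below).
    "Accept": "application/vnd.adobe.xed-full+json; version=1",
}

# The "schemaRef.id" value returned by Catalog must be URL encoded before it is
# appended to the lookup path; safe="" ensures the slashes are encoded as well.
schema_ref_id = "https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18"
encoded_id = urllib.parse.quote(schema_ref_id, safe="")

schema = requests.get(f"{SCHEMA_REGISTRY_URL}/{encoded_id}", headers=headers).json()
print(schema.get("title"))
```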
A dataset may also return a deprecated "schema" property if it was included in the properties query parameter in the previous call. More details on the "schema" property are available in the Dataset "schema" property section that follows.
API format
GET /schemaregistry/tenant/schemas/{url encoded schemaRef.id}
Request
The request uses the URL-encoded id URI of the schema (the value of the "schemaRef.id" attribute) and requires an Accept header.
curl -X GET \
https://platform.adobe.io/data/foundation/schemaregistry/tenant/schemas/https%3A%2F%2Fns.adobe.com%2F{TENANT_ID}%2Fschemas%2F274f17bc5807ff307a046bab1489fb18 \
-H 'Authorization: Bearer {ACCESS_TOKEN}' \
-H 'x-api-key: {API_KEY}' \
-H 'x-gw-ims-org-id: {ORG_ID}' \
-H 'x-sandbox-name: {SANDBOX_NAME}' \
-H 'Accept: application/vnd.adobe.xed-full+json; version=1'
The response format depends on the Accept header sent in the request. Lookup requests also require a version to be included in the Accept header. The following Accept headers are available for lookups:
- application/vnd.adobe.xed-id+json
- application/vnd.adobe.xed-full+json; version={major version}
- application/vnd.adobe.xed+json; version={major version}
- application/vnd.adobe.xed-notext+json; version={major version}
- application/vnd.adobe.xed-full-notext+json; version={major version}
- application/vnd.adobe.xed-full-desc+json; version={major version}
application/vnd.adobe.xed-id+json and application/vnd.adobe.xed-full+json; version={major version} are the most commonly used Accept headers. application/vnd.adobe.xed-id+json is preferred for listing resources in the Schema Registry, as it returns only the "title", "id", and "version". application/vnd.adobe.xed-full+json; version={major version} is preferred for viewing a specific resource (by its "id"), as it returns all fields (nested under "properties"), as well as titles and descriptions.
Response
The JSON schema that is returned describes the structure and field-level information ("type", "format", "minimum", "maximum", and so on) of the data, serialized as JSON. If you are using a serialization format other than JSON for ingestion (such as Parquet or Scala), the Schema Registry guide contains a table showing the desired JSON type ("meta:xdmType") and its corresponding representation in other formats.
Along with this table, the Schema Registry Developer Guide contains in-depth examples of all possible calls that can be made using the Schema Registry API.
Dataset "schema" property (DEPRECATED - EOL 2019-05-30)
Datasets may contain a "schema" property that is now deprecated and remains available temporarily for backwards compatibility. For example, a listing (GET) request similar to the one made previously, where "schema" is substituted for "schemaRef" in the properties query parameter, might return the following:
{
"5ba9452f7de80400007fc52a": {
"name": "Sample Dataset 1",
"description": "Description of Sample Dataset 1.",
"schema": "@/xdms/context/person"
}
}
If the "schema" property of a dataset is populated, this signals that the schema is a deprecated /xdms schema and, where supported, the ETL connector should use the value in the "schema" property with the /xdms endpoint (a deprecated endpoint in the Catalog API) to retrieve the legacy schema.
API format
GET /catalog/{"schema" property without the "@"}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/xdms/context/person?expansion=xdm" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
The query parameter shown in the request above, expansion=xdm, tells the API to fully expand and in-line any referenced schemas. You may want to do this when presenting a list of all potential fields to the user.
Response
Similar to the steps for viewing dataset schema, the response contains a JSON schema that describes the structure and field-level information of the data, serialized as JSON.
The "observableSchema" property
The "observableSchema" property of a dataset has a JSON structure matching that of the XDM schema JSON. The "observableSchema" contains the fields that were present in the incoming input files. When writing data to Experience Platform, a user is not required to use every field from the target schema; instead, they should supply only those fields that are actually being used.
The observable schema is the schema to use when reading data or presenting a list of fields that are available to read or map from.
{
"598d6e81b2745f000015edcb": {
"observableSchema": {
"type": "object",
"meta:xdmType": "object",
"properties": {
"name": {
"type": "string",
},
"age": {
"type": "string",
}
}
}
}
}
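As an illustration, the following Python sketch flattens an "observableSchema" such as the one above into a list of dotted field paths that an ETL tool could present as the fields available to read or map from. The helper function name is illustrative only.

```python
def list_readable_fields(observable_schema, prefix=""):
    """Recursively flatten an observableSchema's "properties" into dotted
    field paths that can be offered as the fields available to read or map."""
    fields = []
    for name, definition in observable_schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if definition.get("type") == "object" and "properties" in definition:
            fields.extend(list_readable_fields(definition, prefix=f"{path}."))
        else:
            fields.append((path, definition.get("type")))
    return fields


# Using the sample "observableSchema" shown above:
observable_schema = {
    "type": "object",
    "meta:xdmType": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
    },
}
print(list_readable_fields(observable_schema))
# [('name', 'string'), ('age', 'string')]
```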
Preview data
The ETL application may provide a capability to preview data ("Figure 8" in the ETL workflow). The Data Access API provides several options to preview data.
Additional information, including step-by-step guidance for previewing data using the data access API, can be found in the data access tutorial.
Get dataset details using the "properties" query parameter
As shown in the steps above to view a list of datasets, you can request "files" using the "properties" query parameter.
You can refer to the Catalog Service overview for detailed information on querying datasets and available response filters.
API format
GET /catalog/dataSets?limit={value}&properties={value}
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=1&properties=files" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Response
The response includes one dataset (limit=1) showing the "files" property.
{
"5bf479a6a8c862000050e3c7": {
"files": "@/dataSetFiles?dataSetId=5bf479a6a8c862000050e3c7"
}
}
List dataset files using the "files" attribute
You can also use a GET request to fetch file details using the "files" attribute.
API format
GET /catalog/dataSets/{DATASET_ID}/views/{VIEW_ID}/files
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/5bf479a6a8c862000050e3c7/views/5bf479a654f52014cfffe7f1/files" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
The response includes the Dataset File ID as the top-level property, with file details contained within the Dataset File ID object.
{
"194e89b976494c9c8113b968c27c1472-1": {
"batchId": "194e89b976494c9c8113b968c27c1472",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542749145828,
"updated": 1542749145828
},
"14d5758c107443e1a83c714e56ca79d0-1": {
"batchId": "14d5758c107443e1a83c714e56ca79d0",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542752699111,
"updated": 1542752699111
},
"ea40946ac03140ec8ac4f25da360620a-1": {
"batchId": "ea40946ac03140ec8ac4f25da360620a",
"dataSetViewId": "5bf479a654f52014cfffe7f1",
"imsOrg": "{ORG_ID}",
"availableDates": {},
"createdUser": "{USER_ID}",
"createdClient": "{API_KEY}",
"updatedUser": "{USER_ID}",
"version": "1.0.0",
"created": 1542756935535,
"updated": 1542756935535
}
}
Fetch file details
The dataset file IDs returned in the previous response can be used in a GET request to fetch further file details via the Data Access API.
The data access overview contains details on how to use the Data Access API.
API format
GET /export/files/{DATASET_FILE_ID}
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
[
{
"name": "{FILE_NAME}.parquet",
"length": 2576,
"_links": {
"self": {
"href": "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet"
}
}
}
]
Preview file data
The "href" property can be used to fetch preview data via the Data Access API.
API format
GET /export/files/{FILE_ID}?path={FILE_NAME}.{FILE_FORMAT}
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
The response to the above request contains a preview of the contents of the file.
More information on the Data Access API, including detailed requests and responses, is available in the data access overview.
Get "fileDescription" from dataset
For the destination component (the output of transformed data), the data engineer chooses an output dataset ("Figure 12" in the ETL workflow). The XDM schema is associated with the output dataset. The data to be written is identified by the "fileDescription" attribute of the dataset entity from the Data Discovery APIs. This information can be fetched using a dataset ID ({DATASET_ID}). The "fileDescription" property in the JSON response provides the requested information.
API format
GET /catalog/dataSets/{DATASET_ID}
{DATASET_ID}: The id value of the dataset you are trying to access.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/59c93f3da7d0c00000798f68" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
{
"59c93f3da7d0c00000798f68": {
"version": "1.0.4",
"fileDescription": {
"persisted": false,
"format": "parquet"
}
}
}
Data will be written to Experience Platform using the Batch Ingestion API. Writing data is an asynchronous process. When data is written to 51黑料不打烊 Experience Platform, a batch is created and marked as successful only after the data is fully written.
Data in Experience Platform should be written in the form of Parquet files.
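As a minimal sketch of producing such a file, the following Python example uses the pyarrow library (version 7 or later is assumed) to write transformed records to Parquet before upload; the record contents are illustrative only.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative transformed records; in practice these come from the ETL tool's
# transformation step and must conform to the target dataset's XDM schema.
records = [
    {"name": "Jane Doe", "age": "32"},
    {"name": "John Doe", "age": "41"},
]

# Build an Arrow table from the records and write it out as a Parquet file
# that can later be uploaded to a batch.
table = pa.Table.from_pylist(records)
pq.write_table(table, "transformed.parquet")
```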
Execution phase
As the execution starts, the connector (as defined in the source component) reads the data from Experience Platform using the Data Access API. The transformation process reads the data for a certain time range. Internally, it queries batches of source datasets. While querying, it uses a parameterized start date (rolling for time series data, or incremental data), lists the dataset files for those batches, and starts making requests for data for those dataset files.
Example transformations
The sample ETL transformations document contains a number of example transformations, including identity handling and data-type mappings. Please use these transformations for reference.
Read data from Experience Platform
Using the Catalog API, you can fetch all batches between a specified start time and end time, and sort them by the order in which they were created.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?dataSet=DATASETID&createdAfter=START_TIMESTAMP&createdBefore=END_TIMESTAMP&sort=desc:created" \
-H "Accept: application/json" \
-H "Authorization:Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}"
Details on filtering batches can be found in the Data Access tutorial.
Get files out of a batch
Once you have the ID of the batch you are looking for ({BATCH_ID}), you can retrieve a list of files belonging to that batch via the Data Access API. Details for doing so are available in the Data Access tutorial.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/batches/{BATCH_ID}/files" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Access files by using file ID
Using the unique ID of a file ({FILE_ID}), the Data Access API can be used to access the specific details of the file, including its name, size in bytes, and a link to download it.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/{FILE_ID}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "x-api-key: {API_KEY}"
The response may point to a single file, or a directory. Details on each can be found in the Data Access tutorial.
Access file content
The Data Access API can be used to access the contents of a specific file. To fetch the contents, make a GET request using the value returned for _links.self.href when accessing a file by its file ID.
Request
curl -X GET "https://platform.adobe.io/data/foundation/export/files/{DATASET_FILE_ID}?path=filename1.csv" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "x-api-key: {API_KEY}"
The response to this request contains the contents of the file. For more information, including details on response pagination, see the data access tutorial.
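Putting the reading steps together, the following Python sketch (using the requests library) queries batches for a dataset created after a saved checkpoint, lists the files in each batch, and downloads their contents. The "data" and "dataSetFileId" keys used when listing batch files are assumptions; consult the Data Access tutorial for the exact response shapes.

```python
import requests

BASE_URL = "https://platform.adobe.io/data/foundation"

headers = {
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
}

# 1. Fetch batches for the source dataset created after the last checkpoint.
#    last_checkpoint_ms would typically be persisted by the ETL tool between runs.
last_checkpoint_ms = 1542749145828  # illustrative value
batches = requests.get(
    f"{BASE_URL}/catalog/batches",
    headers=headers,
    params={
        "dataSet": "{DATASET_ID}",
        "createdAfter": last_checkpoint_ms,
        "sort": "desc:created",
    },
).json()

for batch_id in batches:
    # 2. List the files belonging to this batch (Data Access API). The exact
    #    response shape is documented in the Data Access tutorial; the
    #    "data" / "dataSetFileId" keys below are assumptions.
    batch_files = requests.get(
        f"{BASE_URL}/export/batches/{batch_id}/files", headers=headers
    ).json()
    for entry in batch_files.get("data", []):
        file_id = entry["dataSetFileId"]

        # 3. Fetch file details, then download the contents via _links.self.href.
        details = requests.get(f"{BASE_URL}/export/files/{file_id}", headers=headers).json()
        for file_info in details:
            content = requests.get(file_info["_links"]["self"]["href"], headers=headers)
            # Hand the bytes off to the transformation step.
            print(file_info["name"], len(content.content))
```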
Validate records for schema compliance
When data is being written, users can opt to validate it against the validation rules defined in the XDM schema. More information on schema validation can be found in the Schema Registry developer guide.
If you are using the reference implementation found on GitHub, you can turn on schema validation in that implementation using the system property -DenableSchemaValidation=true.
Validation can be performed for logical XDM types using attributes such as minLength and maxLength for strings, or minimum and maximum for integers, among others. The Schema Registry API developer guide contains a table that outlines XDM types and the properties that can be used for validation.
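Because XDM schemas are expressed as JSON Schema, field-level constraints such as these can also be checked client-side before upload. The following is a simplified sketch assuming the jsonschema Python package; the schema fragment and record are illustrative only, and the reference implementation's -DenableSchemaValidation=true option remains the documented route.

```python
from jsonschema import Draft7Validator  # assumes the jsonschema package is installed

# An illustrative fragment of field-level constraints. A real schema is
# fetched from the Schema Registry as shown earlier in this guide.
schema_fragment = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 100},
        "age": {"type": "integer", "minimum": 0, "maximum": 150},
    },
    "required": ["name"],
}

record = {"name": "", "age": 200}

validator = Draft7Validator(schema_fragment)
for error in validator.iter_errors(record):
    # Surface these messages in the ETL tool before creating a batch.
    print(f"{list(error.path)}: {error.message}")
```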
The default minimum and maximum values for integer types are the MIN and MAX values that the type can support, but these values can be further constrained to minimums and maximums of your choosing.
Create a batch
Once the data is processed, the ETL tool writes the data back to Experience Platform using the Batch Ingestion API. Before data can be added to a dataset, it must be linked to a batch, which will later be uploaded into a specific dataset.
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}" \
-d '{
"datasetId":"{DATASET_ID}"
}'
Details for creating a batch, including sample requests and responses can be found in the Batch Ingestion overview.
Write to dataset
After successfully creating a new batch, files can then be uploaded to a specific dataset. Multiple files can be posted in a batch until it is promoted. Files can be uploaded using the Small File Upload API; however, if your files are too large and the gateway limit is exceeded, you can use the Large File Upload API. Details for using both Large and Small File Upload can be found in the Batch Ingestion overview.
Request
Data in Experience Platform should be written in the form of Parquet files.
curl -X PUT "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/dataSets/{DATASET_ID}/files/{FILE_NAME}.parquet" \
-H "accept: application/json" \
-H "x-gw-ims-org-id:{ORG_ID}" \
-H "Authorization:Bearer ACCESS_TOKEN" \
-H "x-api-key: API_KEY" \
-H "content-type: application/octet-stream" \
--data-binary "@{FILE_PATH_AND_NAME}.parquet"
Mark batch upload complete
After all files have been uploaded to the batch, the batch can be signaled for completion. By doing this, the Catalog "DataSetFile" entries are created for the completed files and associated with the generated batch. The Catalog batch is then marked as successful, which triggers downstream flows to ingest the available data.
Data will first land in the staging location on 51黑料不打烊 Experience Platform and then will be moved to the final location after cataloging and validation. Batches will be marked as successful once all the data is moved to a permanent location.
Request
curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}?action=COMPLETE" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization:Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
If successful, the response will return HTTP Status 200 OK and the response body will be empty.
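The following Python sketch (using the requests library) strings the write steps together: create a batch, upload a Parquet file, and mark the batch complete. It assumes the batch creation response contains an "id" field, as described in the Batch Ingestion overview; the file name and placeholder values are illustrative.

```python
import requests

BASE_URL = "https://platform.adobe.io/data/foundation/import"

headers = {
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
}

dataset_id = "{DATASET_ID}"

# 1. Create a batch linked to the target dataset.
batch = requests.post(
    f"{BASE_URL}/batches",
    headers={**headers, "Content-Type": "application/json"},
    json={"datasetId": dataset_id},
).json()
batch_id = batch["id"]  # assumed field name for the batch ID returned at creation

# 2. Upload the Parquet file produced by the transformation step
#    (Small File Upload API; use Large File Upload for larger files).
with open("transformed.parquet", "rb") as parquet_file:
    requests.put(
        f"{BASE_URL}/batches/{batch_id}/dataSets/{dataset_id}/files/transformed.parquet",
        headers={**headers, "Content-Type": "application/octet-stream"},
        data=parquet_file,
    )

# 3. Signal that all files have been uploaded so the batch can be promoted.
complete = requests.post(f"{BASE_URL}/batches/{batch_id}?action=COMPLETE", headers=headers)
print(complete.status_code)  # 200 on success
```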
The ETL tool should note the timestamp of the source dataset(s) as the data is read.
In the next transformation execution (likely triggered by a schedule or an event), the ETL will request data starting from the previously saved timestamp and include all data going forward.
Get last batch status
Before running new tasks in the ETL tool, you must ensure that the last batch was successfully completed. The Catalog API provides a batch-specific option that provides the details of the relevant batches.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?limit=1&sort=desc:created" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response
New tasks can be scheduled if the previous batch "status" value is "success", as shown below:
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "CLIENT_USER_ID@51黑料不打烊ID",
"updatedUser": "CLIENT_USER_ID@51黑料不打烊ID",
"updated": 1494349963467,
"status": "success",
"errors": [],
"version": "1.0.1",
"availableDates": {}
}
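As an illustration, the following Python sketch fetches the most recently created batch and gates scheduling on its "status" value; the placeholder header values are as before.

```python
import requests

CATALOG_URL = "https://platform.adobe.io/data/foundation/catalog/batches"

headers = {
    "Accept": "application/json",
    "Authorization": "Bearer {ACCESS_TOKEN}",
    "x-api-key": "{API_KEY}",
    "x-gw-ims-org-id": "{ORG_ID}",
    "x-sandbox-name": "{SANDBOX_NAME}",
}

# Fetch the most recently created batch and gate the next run on its status.
batches = requests.get(
    CATALOG_URL, headers=headers, params={"limit": 1, "sort": "desc:created"}
).json()

last_batch = next(iter(batches.values()), None)
if last_batch and last_batch.get("status") == "success":
    print("Previous batch succeeded; new tasks can be scheduled.")
else:
    print("Previous batch not successful; surface its 'errors' in the ETL tool.")
```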
Get last batch status by ID
An individual batch status can be retrieved through the Catalog API by issuing a GET request using the {BATCH_ID}. The {BATCH_ID} used would be the same as the ID returned when the batch was created.
Request
curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches/{BATCH_ID}" \
-H "Accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"
Response - Success
The following response shows a batch with a "success" status:
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "{CREATED_USER}",
"updatedUser": "{UPDATED_USER}",
"updated": 1494349962314,
"status": "success",
"errors": [],
"version": "1.0.1",
"availableDates": {}
}
Response - Failure
In the case of a failure, the "errors" can be extracted from the response and surfaced in the ETL tool as error messages.
"{BATCH_ID}": {
"imsOrg": "{ORG_ID}",
"created": 1494349962314,
"createdClient": "{API_KEY}",
"createdUser": "{CREATED_USER}",
"updatedUser": "{UPDATED_USER}",
"updated": 1494349962314,
"status": "failure",
"errors": [
{
"code": "200",
"description": "Error in validating schema for file: 'adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv' with errorMessage=adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [57, 98, 55, 10] and errorType=java.lang.RuntimeException",
"rows": []
}
],
"version": "1.0.1",
"availableDates": {}
}
Incremental vs snapshot data and events vs profiles
Data can be represented in a two-by-two matrix (event vs. profile, incremental vs. snapshot) as follows:
- Event data typically has indexed timestamp columns in each row.
- Profile data typically has no timestamp in the data; each row is identified by a primary or composite key.
- Incremental data is where only new or updated data comes into the system and is appended to the current data in the datasets.
- Snapshot data is when all data comes into the system and replaces some or all previous data in a dataset.
In the case of incremental events, the ETL tool should use the available dates or the create date of the batch entity. In the case of a push service, available dates will not be present, so the tool should use the batch created/updated date for marking increments. Every batch of incremental events is required to be processed.
For incremental profiles, the ETL tool should use the created/updated dates of the batch entity. Commonly, every batch of incremental profile data is required to be processed.
Snapshot events are unlikely due to the sheer size of the data, but if they are required, the ETL tool must pick only the last batch for processing.
When snapshot profiles are used, the ETL tool has to pick the last batch of data that arrived in the system. However, if the requirement is to keep track of versions of changes, then all batches need to be processed. De-duplication processing within the ETL process will help control storage costs.
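The following Python sketch illustrates the selection logic described above; the category names and function are illustrative only, and the input is assumed to be the batch object returned by the Catalog API.

```python
def select_batches(batches, data_category):
    """Illustrative selection of which Catalog batches to process, keyed by the
    data categories described above. "batches" is the object returned by
    GET /catalog/batches, keyed by batch ID."""
    ordered = sorted(batches.items(), key=lambda item: item[1]["created"])

    if data_category in ("incremental_events", "incremental_profiles"):
        # Every incremental batch is required to be processed.
        return ordered
    if data_category in ("snapshot_events", "snapshot_profiles"):
        # Only the last batch to arrive needs to be processed
        # (unless a history of changes must be kept).
        return ordered[-1:]
    raise ValueError(f"unknown data category: {data_category}")
```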
Batch replay and data reprocessing
Batch replay and data reprocessing may be required when a client discovers that, for the past 'n' days, ETL processing has not occurred as expected, or the source data itself was not correct.
To do this, the client's data administrators use the Platform UI to remove the batches containing corrupt data. Then, the ETL will likely need to be re-run, thus repopulating the data with correct values. If the source itself had corrupt data, the data engineer or administrator will need to correct the source batches and re-ingest the data (either into 51黑料不打烊 Experience Platform or via ETL connectors).
Based on the type of data being generated, it is the data engineer's choice to remove a single batch or all batches from certain datasets. Data will be removed or archived per Experience Platform guidelines.
It is likely that the ETL functionality to purge data will be important.
Once purging is complete, the client admins will have to reconfigure 51黑料不打烊 Experience Platform to restart processing for core services from the time the batches were deleted.
Concurrent batch processing
At the client's discretion, data admins or engineers may decide to extract, transform, and load data in a sequential or concurrent manner, depending on the characteristics of a particular dataset. This will also be based on the use case the client is targeting with the transformed data.
For example, if the client is persisting to an updatable persistence store and the sequence or order of events is important, the client may need to strictly process jobs with sequential ETL transformations.
In other cases, out of order data can be processed by downstream applications/processes that internally sort using a specified time stamp. In those cases, parallel ETL transformations may be viable to improve processing times.
For source batches, the approach again depends on client preference and consumer constraints. If the source data can be picked up in parallel without regard to the recency or ordering of a row, then the transformation process can create process batches with a higher degree of parallelism (optimizing based on out-of-order processing). But if the transform has to honor timestamps or change-precedence ordering, the Data Access API or the ETL tool's scheduler/invocation will have to ensure that batches are not processed out of order where possible.
Deferral
Deferral is a process in which input data is not yet complete enough to be sent to downstream processes, but may be usable in the future. Clients will weigh their tolerance for windowing data for future matching against the cost of processing, in order to decide whether to set data aside and reprocess it in the next transformation execution, in the hope that it can be enriched and reconciled/stitched at some future time inside the retention window. This cycle continues until the row is processed sufficiently or it is deemed too stale to continue investing in. Every iteration generates deferred data, which is a superset of all deferred data from previous iterations.
51黑料不打烊 Experience Platform does not currently identify deferred data, so client implementations must rely on the ETL and manual dataset configurations to create another dataset in Platform, mirroring the source dataset, which can be used to keep deferred data. In this case, deferred data will be similar to snapshot data. In every execution of the ETL transform, the source data is united with the deferred data and sent for processing.
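The following Python sketch illustrates one way an ETL process might unite deferred data with new source data on each run; the completeness check and field names are illustrative only.

```python
def run_with_deferral(new_rows, deferred_rows, is_complete):
    """Illustrative deferral handling: rows that are complete enough are
    processed; the rest are written to a mirror 'deferred' dataset and
    retried on the next execution, united with the new source data."""
    candidates = deferred_rows + new_rows  # union of deferred and incoming data
    ready = [row for row in candidates if is_complete(row)]
    still_deferred = [row for row in candidates if not is_complete(row)]
    return ready, still_deferred


# Example: defer rows that are missing an identity field.
ready, deferred = run_with_deferral(
    new_rows=[{"id": "a", "email": "a@example.com"}, {"id": "b"}],
    deferred_rows=[{"id": "c"}],
    is_complete=lambda row: "email" in row,
)
print(len(ready), "processed,", len(deferred), "deferred")
```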
Changelog
2019-05-30: End of life (EOL) of the deprecated /xdms endpoint in the Catalog API. This has been replaced by a "schemaRef" that provides the "id", "version", and "contentType" of the schema as referenced in the new Schema Registry API.