
Developing ETL Integrations for 51黑料不打烊 Experience Platform

The ETL integration guide outlines general steps for creating high-performance, secure connectors for Experience Platform and ingesting data into Platform.

This guide also includes sample API calls to use when designing an ETL connector, with links to documentation that outlines each Experience Platform service, and use of its API, in more detail.

A sample integration is available on GitHub under the Apache License Version 2.0.

Workflow

The following workflow diagram provides a high-level overview for the integration of 51黑料不打烊 Experience Platform components with an ETL application and connector.

51黑料不打烊 Experience Platform components

There are multiple Experience Platform components involved in ETL connector integrations. The following list outlines several key components and functionalities:

  • 51黑料不打烊 Identity Management System (IMS) - Provides the framework for authentication to 51黑料不打烊 services.
  • IMS Organization - A corporate entity that can own or license products and services and allow access to its members.
  • IMS User - Members of an IMS Organization. The Organization to User relationship is many to many.
  • Sandbox - A virtual partition of a single Platform instance that helps develop and evolve digital experience applications.
  • Data Discovery - Records the metadata of ingested and transformed data in Experience Platform.
  • Data Access - Provides users with an interface to access their data in Experience Platform.
  • Data Ingestion - Pushes data to Experience Platform with Data Ingestion APIs.
  • Schema Registry - Defines and stores schema that describe the structure of data to be used in Experience Platform.

Getting started with Experience Platform APIs

The following sections provide additional information that you will need to know or have on-hand in order to successfully make calls to Experience Platform APIs.

Reading sample API calls

This guide provides example API calls to demonstrate how to format your requests. These include paths, required headers, and properly formatted request payloads. Sample JSON returned in API responses is also provided. For information on the conventions used in documentation for sample API calls, see the section on how to read example API calls in the Experience Platform troubleshooting guide.

Gather values for required headers

In order to make calls to Platform APIs, you must first complete the authentication tutorial. Completing the authentication tutorial provides the values for each of the required headers in all Experience Platform API calls, as shown below:

  • Authorization: Bearer {ACCESS_TOKEN}
  • x-api-key: {API_KEY}
  • x-gw-ims-org-id: {ORG_ID}

All resources in Experience Platform are isolated to specific virtual sandboxes. All requests to Platform APIs require a header that specifies the name of the sandbox the operation will take place in:

  • x-sandbox-name: {SANDBOX_NAME}
NOTE
For more information on sandboxes in Platform, see the sandbox overview documentation.

All requests that contain a payload (POST, PUT, PATCH) require an additional header:

  • Content-Type: application/json
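
For reference, the following is a minimal Python sketch (using the third-party requests library) of how a connector might assemble these headers and issue a simple Catalog call. The environment variable names and helper function are illustrative assumptions, not part of any 51黑料不打烊 SDK.

import os
import requests

# Values obtained by completing the authentication tutorial (variable names are illustrative)
ACCESS_TOKEN = os.environ["AEP_ACCESS_TOKEN"]
API_KEY = os.environ["AEP_API_KEY"]
ORG_ID = os.environ["AEP_ORG_ID"]
SANDBOX_NAME = os.environ["AEP_SANDBOX_NAME"]

def platform_headers(with_payload=False):
    """Build the headers required by all Experience Platform API calls."""
    headers = {
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "x-api-key": API_KEY,
        "x-gw-ims-org-id": ORG_ID,
        "x-sandbox-name": SANDBOX_NAME,
    }
    if with_payload:  # POST, PUT, and PATCH requests also need a Content-Type
        headers["Content-Type"] = "application/json"
    return headers

# Example: list up to three datasets from the Catalog Service
response = requests.get(
    "https://platform.adobe.io/data/foundation/catalog/dataSets",
    headers=platform_headers(),
    params={"limit": 3},
)
response.raise_for_status()
print(response.json())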

General user flow

To begin, an ETL user logs into the Experience Platform user interface (UI) and creates datasets for ingestion using a standard connector or push-service connector.

In the UI, the user creates the output dataset by selecting a dataset schema. The choice of schema depends on the type of data (record or time series) being ingested into Platform. By clicking on the Schemas tab within the UI, the user will be able to view all available schemas, including the behavior type that the schema supports.

In the ETL tool, the user will start designing their mapping transforms after configuring the appropriate connection (using their credentials). The ETL tool is assumed to already have Experience Platform connectors installed (process not defined in this Integration Guide).

Mockups for a sample ETL tool and workflow have been provided in the ETL workflow. While ETL tools may differ in format, most expose similar functionality.

NOTE
The ETL connector must specify a timestamp filter marking the date from which to ingest data, and an offset (that is, the window for which data is to be read). The ETL tool should accept these two parameters in this or another relevant UI. In 51黑料不打烊 Experience Platform, these parameters are mapped to either the available dates (if present) or the captured date present in the batch object of the dataset.
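
As an illustrative (non-normative) sketch, the following Python computes such a window from a start date and an offset in days, producing the epoch-millisecond bounds that can later be matched against the available dates or created date of batch objects:

from datetime import datetime, timedelta, timezone

def ingestion_window(start_date, offset_days):
    """Return (start, end) of a read window in epoch milliseconds.

    start_date  -- ISO date marking where ingestion should begin, for example "2019-01-01"
    offset_days -- size of the window to read, in days
    """
    start = datetime.fromisoformat(start_date).replace(tzinfo=timezone.utc)
    end = start + timedelta(days=offset_days)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

created_after, created_before = ingestion_window("2019-01-01", 7)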

View list of datasets

Using the source of data for mapping, a list of all available datasets can be fetched using the Catalog Service API.

You can issue a single API request to view all available datasets (e.g. GET /dataSets), with best practice being to include query parameters that limit the size of the response.

In cases where full dataset information is being requested, the response payload can exceed 3 GB in size, which can slow overall performance. Therefore, using query parameters to filter only the information needed will make Catalog queries more efficient.

List filtering

When filtering responses, you can use multiple filters in a single call by separating parameters with an ampersand (&). Some query parameters accept comma-separated lists of values, such as the "properties" filter in the sample request below.

Catalog responses are automatically metered according to configured limits; however, the "limit" query parameter can be used to customize the constraints and limit the number of objects returned. The pre-configured Catalog response limits are:

  • If a limit parameter is not specified, the maximum number of objects per response payload is 20.
  • The global limit for all other Catalog queries is 100 objects.
  • For dataset queries, if observableSchema is requested using the properties query parameter, the maximum number of datasets returned is 20.
  • Invalid limit parameters (including limit=0) are met with an HTTP 400 error that outlines proper ranges.
  • If limits or offsets are passed as query parameters, they take precedence over those passed as headers.

Query parameters are covered in more detail in the Catalog Service overview.

API format

GET /catalog/dataSets
GET /catalog/dataSets?{filter1}={value1},{value2}&{filter2}={value3}

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=3&properties=name,description,schemaRef" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}"

Please refer to the Catalog Service overview for detailed examples of how to make calls to the Catalog Service API.

Response

The response includes three (limit=3) datasets showing the "name", "description", and "schemaRef" as indicated by the properties query parameter.

{
    "5b95b155419ec801e6eee780": {
        "name": "Store Transactions",
        "description": "Retails Store Transactions",
        "schemaRef": {
            "id": "https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18",
            "contentType": "application/vnd.adobe.xed+json;version=1"
        }
    },
    "5c351fa2f5fee300000fa9e8": {
        "name": "Loyalty Members",
        "description": "Loyalty Program Members",
        "schemaRef": {
            "id": "https://ns.adobe.com/{TENANT_ID}/schemas/fbc52b243d04b5d4f41eaa72a8ba58be",
            "contentType": "application/vnd.adobe.xed+json;version=1"
        }
    },
    "5c1823b19e6f400000993885": {
        "name": "Web Traffic",
        "description": "Retail Web Traffic",
        "schemaRef": {
            "id": "https://ns.adobe.com/{TENANT_ID}/schemas/2025a705890c6d4a4a06b16f8cf6f4ca",
            "contentType": "application/vnd.adobe.xed+json;version=1"
        }
    }
}

View dataset schema

The "schemaRef" property of a dataset contains a URI referencing the XDM schema upon which the dataset is based. The XDM schema ("schemaRef") represents all potential fields that could be used by the dataset, not necessarily the fields that are being used (see "observableSchema" below).

The XDM schema is the schema you use when you need to present the user with a list of all available fields that could be written to.

The first "schemaRef.id" value in the previous response object (https://ns.adobe.com/{TENANT_ID}/schemas/274f17bc5807ff307a046bab1489fb18) is a URI that points to a specific XDM schema in the Schema Registry. The schema can be retrieved by making a lookup (GET) request to the Schema Registry API.

NOTE
The "schemaRef" property replaces the now-deprecated "schema" property. If "schemaRef" is absent from the dataset or does not contain a value, you will need to check for the presence of a "schema" property. This could be done by replacing "schemaRef" with "schema" in the properties query parameter in the previous call. More details on the "schema" property are available in the Dataset "schema" property section that follows.

API format

GET /schemaregistry/tenant/schemas/{url encoded schemaRef.id}

Request

The request uses the URL-encoded id URI of the schema (the value of the "schemaRef.id" attribute) and requires an Accept header.

curl -X GET \
  https://platform.adobe.io/data/foundation/schemaregistry/tenant/schemas/https%3A%2F%2Fns.adobe.com%2F{TENANT_ID}%2Fschemas%2F274f17bc5807ff307a046bab1489fb18 \
  -H 'Authorization: Bearer {ACCESS_TOKEN}' \
  -H 'x-api-key: {API_KEY}' \
  -H 'x-gw-ims-org-id: {ORG_ID}' \
  -H 'x-sandbox-name: {SANDBOX_NAME}' \
  -H 'Accept: application/vnd.adobe.xed-full+json; version=1'

The response format depends on the type of Accept header sent in the request. Lookup requests also require a version be included in the Accept header. The following list outlines the available Accept headers for lookups:

  • application/vnd.adobe.xed-id+json - List (GET) requests; returns titles, ids, and versions
  • application/vnd.adobe.xed-full+json; version={major version} - $refs and allOf resolved; has titles and descriptions
  • application/vnd.adobe.xed+json; version={major version} - Raw with $ref and allOf; has titles and descriptions
  • application/vnd.adobe.xed-notext+json; version={major version} - Raw with $ref and allOf; no titles or descriptions
  • application/vnd.adobe.xed-full-notext+json; version={major version} - $refs and allOf resolved; no titles or descriptions
  • application/vnd.adobe.xed-full-desc+json; version={major version} - $refs and allOf resolved; descriptors included
NOTE
application/vnd.adobe.xed-id+json and application/vnd.adobe.xed-full+json; version={major version} are the most commonly used Accept headers. application/vnd.adobe.xed-id+json is preferred for listing resources in the Schema Registry as it returns only the "title", "id", and "version". application/vnd.adobe.xed-full+json; version={major version} is preferred for viewing a specific resource (by its "id"), as it returns all fields (nested under "properties"), as well as titles and descriptions.

Response

The JSON schema that is returned describes the structure and field-level information ("type", "format", "minimum", "maximum", etc.) of the data, serialized as JSON. If using a serialization format other than JSON for ingestion (such as Parquet or Scala), the Schema Registry Guide contains a table showing the desired JSON type ("meta:xdmType") and its corresponding representation in other formats.

Along with this table, the Schema Registry Developer Guide contains in-depth examples of all possible calls that can be made using the Schema Registry API.
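
To make the lookup concrete, here is a hedged Python sketch that URL-encodes a schemaRef.id and retrieves the fully resolved schema using the application/vnd.adobe.xed-full+json Accept header. The headers argument is assumed to already contain the required Platform headers described earlier; the function name is illustrative.

from urllib.parse import quote

import requests

SCHEMA_REGISTRY = "https://platform.adobe.io/data/foundation/schemaregistry"

def get_xdm_schema(schema_ref_id, major_version, headers):
    """Look up an XDM schema by its schemaRef.id with $refs and allOf resolved."""
    encoded_id = quote(schema_ref_id, safe="")  # e.g. https%3A%2F%2Fns.adobe.com%2F...
    response = requests.get(
        f"{SCHEMA_REGISTRY}/tenant/schemas/{encoded_id}",
        headers={
            **headers,
            "Accept": f"application/vnd.adobe.xed-full+json; version={major_version}",
        },
    )
    response.raise_for_status()
    return response.json()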

Dataset "schema" property (DEPRECATED - EOL 2019-05-30)

Datasets may contain a "schema" property that is now deprecated and remains available temporarily for backwards compatibility. For example, a listing (GET) request similar to the one made previously, where "schema" was substituted for "schemaRef" in the properties query parameter, might return the following:

{
  "5ba9452f7de80400007fc52a": {
    "name": "Sample Dataset 1",
    "description": "Description of Sample Dataset 1.",
    "schema": "@/xdms/context/person"
  }
}

If the "schema" property of a dataset is populated, this signals that the schema is a deprecated /xdms schema and, where supported, the ETL connector should use the value in the "schema" property with the /xdms endpoint (a deprecated endpoint in the Catalog Service API) to retrieve the legacy schema.

API format

GET /catalog/{"schema" property without the "@"}

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/xdms/context/person?expansion=xdm" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"
NOTE
An optional query parameter, expansion=xdm, tells the API to fully expand and in-line any referenced schemas. You may want to do this when presenting a list of all potential fields to the user.

Response

Similar to the steps for viewing dataset schema, the response contains a JSON schema that describes the structure and field-level information of the data, serialized as JSON.

NOTE
When the "schema" field is empty or absent entirely, the connector should read the "schemaRef" field and use the Schema Registry API as shown in the previous steps to view a dataset schema.

The "observableSchema" property

The "observableSchema" property of a dataset has a JSON structure matching that of the XDM schema JSON. The "observableSchema" contains the fields that were present in the incoming input files. When writing data to Experience Platform, a user is not required to use every field from the target schema. Instead, they should supply only those fields that are being used.

The observable schema is the schema that you would use if reading the data or presenting a list of fields that are available to read/map from.

{
    "598d6e81b2745f000015edcb": {
        "observableSchema": {
            "type": "object",
            "meta:xdmType": "object",
            "properties": {
                "name": {
                    "type": "string",
                },
                "age": {
                    "type": "string",
                }
            }
        }
    }
}
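
When presenting a list of fields that are available to read or map from, a connector typically needs only the leaf fields of the observableSchema. The following small Python helper is an illustrative way to flatten them into dotted field paths; the function name and path convention are assumptions, not part of any Platform API.

def flatten_observable_schema(node, prefix=""):
    """Flatten an observableSchema into a {dotted.field.path: xdm type} mapping."""
    fields = {}
    for name, spec in node.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if spec.get("type") == "object" and "properties" in spec:
            fields.update(flatten_observable_schema(spec, path))
        else:
            fields[path] = spec.get("type", "unknown")
    return fields

observable = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "string"},
    },
}
print(flatten_observable_schema(observable))  # {'name': 'string', 'age': 'string'}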

Preview data

The ETL application may provide a capability to preview data ("Figure 8" in the ETL Workflow). The data access API provides several options to preview data.

Additional information, including step-by-step guidance for previewing data using the data access API, can be found in the data access tutorial.

Get dataset details using the "properties" query parameter

As shown in the steps above to view a list of datasets, you can request "files" using the "properties" query parameter.

You can refer to the Catalog Service overview for detailed information on querying datasets and available response filters.

API format

GET /catalog/dataSets?limit={value}&properties={value}

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets?limit=1&properties=files" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}"

Response

The response will include one dataset (limit=1) showing the "files" property.

{
  "5bf479a6a8c862000050e3c7": {
    "files": "@/dataSetFiles?dataSetId=5bf479a6a8c862000050e3c7"
  }
}

List dataset files using the "files" attribute

You can also use a GET request to fetch file details using the "files" attribute.

API format

GET /catalog/dataSets/{DATASET_ID}/views/{VIEW_ID}/files

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/5bf479a6a8c862000050e3c7/views/5bf479a654f52014cfffe7f1/files" \
  -H "Accept: application/json" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

Response

The response includes the Dataset File ID as the top-level property, with file details contained within the Dataset File ID object.

{
    "194e89b976494c9c8113b968c27c1472-1": {
        "batchId": "194e89b976494c9c8113b968c27c1472",
        "dataSetViewId": "5bf479a654f52014cfffe7f1",
        "imsOrg": "{ORG_ID}",
        "availableDates": {},
        "createdUser": "{USER_ID}",
        "createdClient": "{API_KEY}",
        "updatedUser": "{USER_ID}",
        "version": "1.0.0",
        "created": 1542749145828,
        "updated": 1542749145828
    },
    "14d5758c107443e1a83c714e56ca79d0-1": {
        "batchId": "14d5758c107443e1a83c714e56ca79d0",
        "dataSetViewId": "5bf479a654f52014cfffe7f1",
        "imsOrg": "{ORG_ID}",
        "availableDates": {},
        "createdUser": "{USER_ID}",
        "createdClient": "{API_KEY}",
        "updatedUser": "{USER_ID}",
        "version": "1.0.0",
        "created": 1542752699111,
        "updated": 1542752699111
    },
    "ea40946ac03140ec8ac4f25da360620a-1": {
        "batchId": "ea40946ac03140ec8ac4f25da360620a",
        "dataSetViewId": "5bf479a654f52014cfffe7f1",
        "imsOrg": "{ORG_ID}",
        "availableDates": {},
        "createdUser": "{USER_ID}",
        "createdClient": "{API_KEY}",
        "updatedUser": "{USER_ID}",
        "version": "1.0.0",
        "created": 1542756935535,
        "updated": 1542756935535
    }
}

Fetch file details

The dataset file IDs returned in the previous response can be used in a GET request to fetch further file details via the Data Access API.

The data access overview contains details on how to use the Data Access API.

API format

GET /export/files/{DATASET_FILE_ID}

Request

curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

Response

[
    {
    "name": "{FILE_NAME}.parquet",
    "length": 2576,
    "_links": {
        "self": {
            "href": "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet"
            }
        }
    }
]

Preview file data

The "href" property can be used to fetch preview data via the Data Access API.

API format

GET /export/files/{FILE_ID}?path={FILE_NAME}.{FILE_FORMAT}

Request

curl -X GET "https://platform.adobe.io/data/foundation/export/files/ea40946ac03140ec8ac4f25da360620a-1?path=samplefile.parquet" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

The response to the above request will contain a preview of the contents of the file.

More information on the Data Access API, including detailed requests and responses, is available in the data access overview.
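
Putting the preview steps together, the sketch below lists the files for a dataset view, fetches each file's details, and follows the returned _links.self.href to retrieve a preview of its contents. The endpoints are those shown in the preceding examples; the function name and headers argument are assumptions.

import requests

CATALOG = "https://platform.adobe.io/data/foundation/catalog"
EXPORT = "https://platform.adobe.io/data/foundation/export"

def preview_dataset_files(dataset_id, view_id, headers):
    """List a dataset view's files and preview each one via the Data Access API."""
    files = requests.get(
        f"{CATALOG}/dataSets/{dataset_id}/views/{view_id}/files", headers=headers
    ).json()

    for dataset_file_id in files:  # the response is keyed by dataset file ID
        details = requests.get(f"{EXPORT}/files/{dataset_file_id}", headers=headers).json()
        for entry in details:  # each entry describes one physical file
            href = entry["_links"]["self"]["href"]  # .../files/{id}?path={name}.parquet
            preview = requests.get(href, headers=headers)
            print(entry["name"], entry["length"], preview.status_code)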

Get "fileDescription" from dataset

As the destination component for the output of transformed data, the data engineer will choose an output dataset ("Figure 12" in the ETL Workflow). The XDM schema is associated with the output dataset. The data to be written is identified by the "fileDescription" attribute of the dataset entity from the Data Discovery APIs. This information can be fetched using a dataset ID ({DATASET_ID}). The "fileDescription" property in the JSON response provides the requested information.

API format

GET /catalog/dataSets/{DATASET_ID}
  • {DATASET_ID} - The id value of the dataset you are trying to access.

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/dataSets/59c93f3da7d0c00000798f68" \
-H "accept: application/json" \
-H "x-gw-ims-org-id: {ORG_ID}" \
-H "x-sandbox-name: {SANDBOX_NAME}" \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "x-api-key: {API_KEY}"

Response

{
  "59c93f3da7d0c00000798f68": {
    "version": "1.0.4",
    "fileDescription": {
        "persisted": false,
        "format": "parquet"
    }
  }
}

Data will be written to Experience Platform using the Batch Ingestion API. Writing of data is an asynchronous process. When data is written to 51黑料不打烊 Experience Platform, a batch is created and marked as a success only after the data is fully written.

Data in Experience Platform should be written in the form of Parquet files.

Execution phase

As the execution starts, the connector (as defined in the source component) reads the data from Experience Platform using the Data Access API. The transformation process reads the data for a certain time range. Internally, it queries batches of source datasets. While querying, it uses a parameterized start date (rolling for time series or incremental data), lists the dataset files for those batches, and starts making requests for data from those dataset files.

Example transformations

The sample ETL transformations document contains a number of example transformations, including identity handling and data-type mappings. Please use these transformations for reference.

Read data from Experience Platform

Using the Catalog Service API, you can fetch all batches between a specified start time and end time, and sort them by the order in which they were created.

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?dataSet=DATASETID&createdAfter=START_TIMESTAMP&createdBefore=END_TIMESTAMP&sort=desc:created" \
  -H "Accept: application/json" \
  -H "Authorization:Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}"

Details on filtering batches can be found in the Data Access tutorial.

Get files out of a batch

Once you have the ID for the batch you are looking for ({BATCH_ID}), it is possible to retrieve a list of files belonging to a specific batch via the Data Access API. Details for doing so are available in the Data Access tutorial.

Request

curl -X GET "https://platform.adobe.io/data/foundation/export/batches/{BATCH_ID}/files" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

Access files by using file ID

Using the unique ID of a file ({FILE_ID}), the Data Access API can be used to access the specific details of the file, including its name, size in bytes, and a link to download it.

Request

curl -X GET "https://platform.adobe.io/data/foundation/export/files/{FILE_ID}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "x-api-key: {API_KEY}"

The response may point to a single file, or a directory. Details on each can be found in the Data Access tutorial.

Access file content

The Data Access API can be used to access the contents of a specific file. To fetch the contents, a GET request is made using the value returned for _links.self.href when accessing a file using the file ID.

Request

curl -X GET "https://platform.adobe.io/data/foundation/export/files/{DATASET_FILE_ID}?path=filename1.csv" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "x-api-key: {API_KEY}"

The response to this request contains the contents of the file. For more information, including details on response pagination, see the How to Query Data via data access API tutorial.
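
Taken together, these read steps form the loop an ETL connector runs on each execution. The following is one possible, non-normative arrangement in Python: it queries batches created after the last checkpoint, lists each batch's files, and downloads their contents. The checkpoint handling, function name, and the exact layout of the batch-files response are assumptions; consult the Data Access tutorial for the authoritative response structure.

import requests

CATALOG = "https://platform.adobe.io/data/foundation/catalog"
EXPORT = "https://platform.adobe.io/data/foundation/export"

def read_new_batches(dataset_id, last_checkpoint_ms, headers):
    """Yield (batch_id, file_name, file_bytes) for batches created after the checkpoint."""
    batches = requests.get(
        f"{CATALOG}/batches",
        headers=headers,
        params={
            "dataSet": dataset_id,
            "createdAfter": last_checkpoint_ms,
            "sort": "desc:created",
        },
    ).json()

    for batch_id in batches:  # the Catalog response is keyed by batch ID
        files = requests.get(f"{EXPORT}/batches/{batch_id}/files", headers=headers).json()
        # The "data" key and "dataSetFileId" field below are assumptions about the
        # batch-files response; verify them against the Data Access tutorial.
        for file_meta in files.get("data", []):
            file_id = file_meta["dataSetFileId"]
            details = requests.get(f"{EXPORT}/files/{file_id}", headers=headers).json()
            for entry in details:
                content = requests.get(entry["_links"]["self"]["href"], headers=headers)
                yield batch_id, entry["name"], content.content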

Validate records for schema compliance

When data is being written, users can opt to validate data according to the validation rules defined in the XDM schema. More information on schema validation can be found in the Schema Registry developer guide.

If you are using the reference implementation found on GitHub, you can turn on schema validation in this implementation using the system property -DenableSchemaValidation=true.

Validation can be performed for logical XDM types, using attributes such as minLength and maxLength for strings, minimum and maximum for integers, and more. The Schema Registry API developer guide contains a table that outlines XDM types and the properties that can be used for validation.

NOTE
The minimum and maximum values provided for various integer types are the MIN and MAX values that the type can support, but these values can be further constrained to minimums and maximums of your choosing.
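
As a simplified illustration (not the reference implementation's validation logic), checking a single value against these constraints might look like the following; the function name and error format are assumptions.

def validate_field(value, constraints):
    """Validate one value against a subset of XDM constraints; return a list of errors."""
    errors = []
    if isinstance(value, str):
        if "minLength" in constraints and len(value) < constraints["minLength"]:
            errors.append(f"shorter than minLength {constraints['minLength']}")
        if "maxLength" in constraints and len(value) > constraints["maxLength"]:
            errors.append(f"longer than maxLength {constraints['maxLength']}")
    if isinstance(value, int) and not isinstance(value, bool):
        if "minimum" in constraints and value < constraints["minimum"]:
            errors.append(f"below minimum {constraints['minimum']}")
        if "maximum" in constraints and value > constraints["maximum"]:
            errors.append(f"above maximum {constraints['maximum']}")
    return errors

print(validate_field("ab", {"minLength": 3, "maxLength": 10}))  # ['shorter than minLength 3']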

Create a batch

Once the data is processed, the ETL tool will write the data back to Experience Platform using the Batch Ingestion API. Before data can be added to a dataset, it must be linked to a batch, which will later be uploaded into a specific dataset.

Request

curl -X POST "https://platform.adobe.io/data/foundation/import/batches" \
  -H "accept: application/json" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}" \
  -d '{
        "datasetId":"{DATASET_ID}"
      }'

Details for creating a batch, including sample requests and responses can be found in the Batch Ingestion overview.

Write to dataset

After successfully creating a new batch, files can then be uploaded to a specific dataset. Multiple files can be posted in a batch until it is promoted. Files can be uploaded using the Small File Upload API; however, if your files are too large and the gateway limit is exceeded, you can use the Large File Upload API. Details for using both Large and Small File Upload can be found in the Batch Ingestion overview.

Request

Data in Experience Platform should be written in the form of Parquet files.

curl -X PUT "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}/dataSets/{DATASET_ID}/files/{FILE_NAME}.parquet" \
  -H "accept: application/json" \
  -H "x-gw-ims-org-id:{ORG_ID}" \
  -H "Authorization:Bearer ACCESS_TOKEN" \
  -H "x-api-key: API_KEY" \
  -H "content-type: application/octet-stream" \
  --data-binary "@{FILE_PATH_AND_NAME}.parquet"

Mark batch upload complete

After all files have been uploaded to the batch, the batch can be signaled for completion. By doing this, the Catalog "DataSetFile" entries are created for the completed files and associated with the generated batch. The Catalog batch is then marked as successful, which triggers downstream flows to ingest the available data.

Data will first land in the staging location on 51黑料不打烊 Experience Platform and then will be moved to the final location after cataloging and validation. Batches will be marked as successful once all the data is moved to a permanent location.

Request

curl -X POST "https://platform.adobe.io/data/foundation/import/batches/{BATCH_ID}?action=COMPLETE" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization:Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

If successful, the response will return HTTP Status 200 OK and the response body will be empty.
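
For reference, the three write steps (create a batch, upload Parquet files into it, and signal completion) can be chained as in the following Python sketch. The endpoints, Content-Type values, and action=COMPLETE parameter come from the preceding curl examples; the function name, error handling, and the assumption that the created batch's ID is returned in an "id" field are illustrative.

import requests

IMPORT = "https://platform.adobe.io/data/foundation/import"

def write_parquet_files(dataset_id, parquet_paths, headers):
    """Create a batch, upload Parquet files into it, and mark the batch complete."""
    created = requests.post(
        f"{IMPORT}/batches",
        headers={**headers, "Content-Type": "application/json"},
        json={"datasetId": dataset_id},
    )
    created.raise_for_status()
    batch_id = created.json()["id"]  # field name assumed; the create-batch call returns the new batch ID

    for path in parquet_paths:
        file_name = path.split("/")[-1]
        with open(path, "rb") as parquet_file:
            upload = requests.put(
                f"{IMPORT}/batches/{batch_id}/dataSets/{dataset_id}/files/{file_name}",
                headers={**headers, "Content-Type": "application/octet-stream"},
                data=parquet_file,
            )
        upload.raise_for_status()

    # Signal that all files are uploaded so the batch can be promoted
    complete = requests.post(f"{IMPORT}/batches/{batch_id}?action=COMPLETE", headers=headers)
    complete.raise_for_status()
    return batch_id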

The ETL tool should note the timestamp of the source dataset(s) as the data is read.

In the next transformation execution, likely triggered by a schedule or an event, the ETL tool will request data from the previously saved timestamp onward.

Get last batch status

Before running new tasks in the ETL tool, you must ensure that the last batch was successfully completed. The Catalog Service API provides a batch-specific option that provides the details of the relevant batches.

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches?limit=1&sort=desc:created" \
  -H "Accept: application/json" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

Response

New tasks can be scheduled if the previous batch "status" value is "success" as shown below:

"{BATCH_ID}": {
    "imsOrg": "{ORG_ID}",
    "created": 1494349962314,
    "createdClient": "{API_KEY}",
    "createdUser": "CLIENT_USER_ID@51黑料不打烊ID",
    "updatedUser": "CLIENT_USER_ID@51黑料不打烊ID",
    "updated": 1494349963467,
    "status": "success",
    "errors": [],
    "version": "1.0.1",
    "availableDates": {}
}

Get last batch status by ID

An individual batch status can be retrieved through the Catalog Service API by issuing a GET request using the {BATCH_ID}. The {BATCH_ID} used would be the same as the ID returned when the batch was created.

Request

curl -X GET "https://platform.adobe.io/data/foundation/catalog/batches/{BATCH_ID}" \
  -H "Accept: application/json" \
  -H "x-gw-ims-org-id: {ORG_ID}" \
  -H "x-sandbox-name: {SANDBOX_NAME}" \
  -H "Authorization: Bearer {ACCESS_TOKEN}" \
  -H "x-api-key: {API_KEY}"

Response - Success

The following response shows a "success" status:

"{BATCH_ID}": {
    "imsOrg": "{ORG_ID}",
    "created": 1494349962314,
    "createdClient": "{API_KEY}",
    "createdUser": "{CREATED_USER}",
    "updatedUser": "{UPDATED_USER}",
    "updated": 1494349962314,
    "status": "success",
    "errors": [],
    "version": "1.0.1",
    "availableDates": {}
}

Response - Failure

In case of failure, the "errors" can be extracted from the response and surfaced in the ETL tool as error messages.

"{BATCH_ID}": {
    "imsOrg": "{ORG_ID}",
    "created": 1494349962314,
    "createdClient": "{API_KEY}",
    "createdUser": "{CREATED_USER}",
    "updatedUser": "{UPDATED_USER}",
    "updated": 1494349962314,
    "status": "failure",
    "errors": [
        {
            "code": "200",
            "description": "Error in validating schema for file: 'adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv' with errorMessage=adl://dataLake.azuredatalakestore.net/connectors-dev/stage/BATCHID/dataSetId/contact.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [57, 98, 55, 10] and errorType=java.lang.RuntimeException",
            "rows": []
        }
    ],
    "version": "1.0.1",
    "availableDates": {}
}
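
A connector can gate its next scheduled run on this status. The following minimal Python sketch checks the most recently created batch and surfaces any error descriptions; the field names are taken from the responses above, while the function name is an assumption.

import requests

CATALOG = "https://platform.adobe.io/data/foundation/catalog"

def last_batch_succeeded(headers):
    """Return True if the most recently created batch finished with status 'success'."""
    response = requests.get(
        f"{CATALOG}/batches", headers=headers, params={"limit": 1, "sort": "desc:created"}
    )
    response.raise_for_status()
    batches = response.json()  # keyed by batch ID, e.g. {"{BATCH_ID}": {"status": ..., "errors": [...]}}
    for batch_id, batch in batches.items():
        if batch.get("status") != "success":
            for error in batch.get("errors", []):
                print(f"Batch {batch_id} failed: {error.get('description')}")
            return False
    return True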

Incremental vs snapshot data and events vs profiles

Data can be represented in a two by two matrix as follows:

  • Incremental events
  • Incremental profiles
  • Snapshot events (less likely)
  • Snapshot profiles

Event data typically has an indexed timestamp column in each row.

Profile data typically does not have a timestamp; instead, each row can be identified by a primary or composite key.

Incremental data is where only new or updated data comes into the system and is appended to the current data in the datasets.

Snapshot data is when all data comes into the system and replaces some or all previous data in a dataset.

In the case of incremental events, the ETL tool should use the available dates or created date of the batch entity. In the case of a push-service connector, available dates will not be present, so the tool will use the batch created/updated date for marking increments. Every batch of incremental events must be processed.

For incremental profiles, the ETL tool will use the created/updated dates of the batch entity. Commonly, every batch of incremental profile data must be processed.

Snapshot events are unlikely due to the sheer size of the data. However, if this is required, the ETL tool must pick only the last batch for processing.

When snapshot profiles are used, the ETL tool will have to pick the last batch of data that arrived in the system. However, if the requirement is to keep track of the versions of changes, then all batches must be processed. De-duplication processing within the ETL process will help control storage costs.
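
To summarize the four cases, the sketch below selects which batches to process depending on whether the data is incremental or a snapshot. The mode flag and watermark handling are illustrative connector-side concepts, not Platform features.

def select_batches(batches, incremental, watermark_ms):
    """Pick the batch IDs to process for this run.

    batches      -- mapping of batch ID to batch metadata, as returned by Catalog
    incremental  -- True for incremental events/profiles, False for snapshot data
    watermark_ms -- created/available date of the last batch already processed (epoch ms)
    """
    ordered = sorted(batches.items(), key=lambda item: item[1]["created"])
    if incremental:
        # Incremental data: every batch after the watermark must be processed
        return [batch_id for batch_id, meta in ordered if meta["created"] > watermark_ms]
    # Snapshot data: only the most recent batch replaces prior data
    return [ordered[-1][0]] if ordered else []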

Batch replay and data reprocessing

Batch replay and data reprocessing may be required when a client discovers that, for the past 'n' days, ETL processing has not occurred as expected, or the source data itself was not correct.

To do this, the client's data administrators will use the Platform UI to remove the batches containing corrupt data. The ETL will then likely need to be re-run, repopulating the datasets with correct data. If the source itself had corrupt data, the data engineer/administrator will need to correct the source batches and re-ingest the data (either into 51黑料不打烊 Experience Platform or via ETL connectors).

Based upon the type of data being generated, it will be the data engineer's choice to remove a single batch or all batches from certain datasets. Data will be removed/archived as per Experience Platform guidelines.

It is likely that ETL functionality to purge data will be important.

Once purging is complete, the client admins will have to reconfigure 51黑料不打烊 Experience Platform to restart processing for core services from the time when the batches were deleted.

Concurrent batch processing

At the client's discretion, data admins/engineers may decide to extract, transform, and load data in a sequential or concurrent manner, depending on the characteristics of a particular dataset. This will also be based upon the use case the client is targeting with the transformed data.

For example, if the client is persisting to an updatable persistence store and the sequence or order of events is important, the client may need to strictly process jobs with sequential ETL transformations.

In other cases, out of order data can be processed by downstream applications/processes that internally sort using a specified time stamp. In those cases, parallel ETL transformations may be viable to improve processing times.

For source batches, this will again depend upon client preference and consumer constraints. If the source data can be picked up in parallel without regard to the recency/ordering of a row, then the transformation process can create processing batches with a higher degree of parallelism (optimizing based on out-of-order processing). However, if the transform has to honor timestamps or change precedence ordering, the Data Access API or the ETL tool's scheduler/invocation will have to ensure that batches are not processed out of order where possible.

Deferral

Deferral is a process in which input data is not yet complete enough to be sent to downstream processes, but may be usable in the future. Clients weigh their individual tolerance for data windowing for future matching against the cost of processing, and decide whether to put data aside and reprocess it in the next transformation execution, hoping it can be enriched and reconciled/stitched at some future time inside the retention window. This cycle continues until the row is processed sufficiently or is deemed too stale to continue investing in. Every iteration generates deferred data, which is a superset of all deferred data from previous iterations.

51黑料不打烊 Experience Platform does not currently identify deferred data, so client implementations must rely on manual ETL and dataset configurations to create another dataset in Platform that mirrors the source dataset and can be used to keep deferred data. In this case, deferred data will be similar to snapshot data. In every execution of the ETL transform, the source data will be united with the deferred data and sent for processing.
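
One way to realize this with a mirrored deferral dataset is sketched below. The predicates deciding whether a row is complete enough to process, or too stale to keep, are entirely implementation-specific assumptions.

def run_with_deferral(new_rows, deferred_rows, is_complete, is_stale):
    """Union new input with previously deferred rows and split them for this execution.

    Returns (rows_to_process, rows_to_defer); rows deemed too stale are dropped.
    """
    rows_to_process, rows_to_defer = [], []
    for row in new_rows + deferred_rows:  # deferred data is re-evaluated on every run
        if is_complete(row):
            rows_to_process.append(row)
        elif not is_stale(row):
            rows_to_defer.append(row)  # written back to the mirrored deferral dataset
    return rows_to_process, rows_to_defer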

Changelog

  • 2019-01-19 - Removed "fields" property from datasets: Datasets previously included a "fields" property that contained a copy of the schema. This capability should no longer be used. If the "fields" property is found, it should be ignored and the "observedSchema" or "schemaRef" used instead.
  • 2019-03-15 - "schemaRef" property added to datasets: The "schemaRef" property of a dataset contains a URI referencing the XDM schema upon which the dataset is based and represents all potential fields that could be used by the dataset.
  • 2019-03-15 - All end-user identifiers map to the "identityMap" property: The "identityMap" is an encapsulation of all unique identifiers of a subject, such as CRM ID, ECID, or loyalty program ID. This map is used by Identity Service to resolve all known and anonymous identities of a subject, forming a single identity graph for each end user.
  • 2019-05-30 - EOL and removal of the "schema" property from datasets: The dataset "schema" property provided a reference link to the schema using the deprecated /xdms endpoint in the Catalog API. This has been replaced by a "schemaRef" that provides the "id", "version", and "contentType" of the schema as referenced in the new Schema Registry API.