Model authoring using the Adobe Experience Platform Platform SDK
This tutorial explains how to convert code from data_access_sdk_python to the new Python platform_sdk, in both Python and R. It covers the following operations: building authentication, basic reading of data, and basic writing of data.
Build authentication
Authentication is required to make calls to Adobe Experience Platform, and comprises an API key, an organization ID, a user token, and a service token.
Python
If you are using Jupyter Notebook, please use the code below to build the client_context:
client_context = PLATFORM_SDK_CLIENT_CONTEXT
If you are not using Jupyter Notebook or you need to change the organization, please use the below code sample:
from platform_sdk.client_context import ClientContext
client_context = ClientContext(api_key={API_KEY},
                               org_id={ORG_ID},
                               user_token={USER_TOKEN},
                               service_token={SERVICE_TOKEN})
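When running outside Jupyter Notebook, one way to avoid hard-coding the four credentials is to read them from environment variables. The sketch below is only an illustration: the variable names (PLATFORM_API_KEY and so on) and the helper load_credentials are hypothetical and not part of platform_sdk; only the keyword argument names are taken from the ClientContext call above.

```python
import os

# Hypothetical environment variable names; choose whatever fits your deployment.
ENV_TO_KWARG = {
    "PLATFORM_API_KEY": "api_key",
    "PLATFORM_ORG_ID": "org_id",
    "PLATFORM_USER_TOKEN": "user_token",
    "PLATFORM_SERVICE_TOKEN": "service_token",
}


def load_credentials(env=None):
    """Return ClientContext keyword arguments read from an environment mapping."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way: as missing.
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(sorted(missing)))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}
```

With the SDK installed, the returned dictionary could then be unpacked into the constructor, for example ClientContext(**load_credentials()).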
R
If you are using Jupyter Notebook, please use the code below to build the client_context:
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
py_run_file("../.ipython/profile_default/startup/platform_sdk_context.py")
client_context <- py$PLATFORM_SDK_CLIENT_CONTEXT
If you are not using Jupyter Notebook or you need to change the organization, please use the below code sample:
library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
client_context <- psdk$client_context$ClientContext(api_key={API_KEY},
org_id={ORG_ID},
user_token={USER_TOKEN},
service_token={SERVICE_TOKEN})
Basic reading of data
With the new Platform SDK, the maximum read size is 32 GB, with a maximum read time of 10 minutes.
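If your read is taking too long, you can try one of the following options: filter by offset and limit, filter by date, filter by selected columns, or get sorted results.
Note: The organization is set within the client_context.
Python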
To read data in Python, please use the code sample below:
from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(client_context, "{DATASET_ID}")
df = dataset_reader.limit(100).read()
df.head()
R
To read data in R, please use the code sample below:
DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(client_context, "{DATASET_ID}")
df <- dataset_reader$read()
df
Filter by offset and limit
Since filtering by batch ID is no longer supported, you need to use offset and limit to scope the reading of data.
Python
df = dataset_reader.limit(100).offset(1).read()
df.head()
R
df <- dataset_reader$limit(100L)$offset(1L)$read()
df
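To read a large dataset in several smaller requests, offset and limit can be stepped through in fixed-size pages. The following Python sketch only computes the (offset, limit) pairs; page_windows is a hypothetical helper, and it assumes that offset is the number of rows to skip and that the total row count is known in advance, neither of which this tutorial confirms.

```python
def page_windows(total_rows, page_size):
    """Yield (offset, limit) pairs that cover total_rows in page_size chunks."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter than page_size.
        yield offset, min(page_size, total_rows - offset)


# Example: 250 rows read 100 at a time.
windows = list(page_windows(250, 100))
# Each pair would feed a call such as:
#   DatasetReader(client_context, "{DATASET_ID}").limit(limit).offset(offset).read()
```

Creating a fresh DatasetReader for each page, as in the comment, sidesteps any question of whether a reader keeps state between chained calls.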
Filter by date
Granularity of date filtering is now defined by the timestamp, rather than being set by the day.
Python
df = dataset_reader.where(\
dataset_reader['timestamp'].gt('2019-04-10 15:00:00').\
And(dataset_reader['timestamp'].lt('2019-04-10 17:00:00'))\
).read()
df.head()
R
df2 <- dataset_reader$where(
dataset_reader['timestamp']$gt('2018-12-10 15:00:00')$
And(dataset_reader['timestamp']$lt('2019-04-10 17:00:00'))
)$read()
df2
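Because filtering is now by timestamp, the bounds passed to gt() and lt() are often computed rather than typed by hand. The Python sketch below builds such a pair of strings with the standard library; time_window is a hypothetical helper, and the format string is an assumption inferred from the literals used in the examples above.

```python
from datetime import datetime, timedelta

# Assumed format, matching literals such as '2019-04-10 15:00:00' above.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_window(start, hours):
    """Return (lower, upper) timestamp strings spanning `hours` from `start`."""
    end = start + timedelta(hours=hours)
    return start.strftime(TIMESTAMP_FORMAT), end.strftime(TIMESTAMP_FORMAT)


lower, upper = time_window(datetime(2019, 4, 10, 15, 0, 0), 2)
# lower and upper could replace the literals passed to gt() and lt() above.
```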
The new Platform SDK supports the following operations:
Operation    Function
=            eq()
>            gt()
>=           ge()
<            lt()
<=           le()
&            And()
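When filter conditions come from configuration rather than code, the comparison symbols listed above can be mapped to their method names. This Python sketch is illustrative only: compare and OPERATOR_METHODS are hypothetical names, and the helper assumes nothing about the SDK beyond the method names listed above.

```python
# Symbol-to-method names, copied from the list of supported operations above.
OPERATOR_METHODS = {
    "=": "eq",
    ">": "gt",
    ">=": "ge",
    "<": "lt",
    "<=": "le",
}


def compare(column, symbol, value):
    """Call the comparison method named by `symbol` on a column-like object."""
    try:
        method_name = OPERATOR_METHODS[symbol]
    except KeyError:
        raise ValueError("unsupported operator: " + symbol) from None
    return getattr(column, method_name)(value)
```

Under these assumptions, compare(dataset_reader['timestamp'], '>', '2019-04-10 15:00:00') would make the same call as dataset_reader['timestamp'].gt('2019-04-10 15:00:00').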
Filter by selected columns
To further refine your reading of data, you can also filter by column name.
Python
df = dataset_reader.select(['column-a','column-b']).read()
R
df <- dataset_reader$select(c('column-a','column-b'))$read()
Get sorted results
Results can be sorted by specified columns of the target dataset, each in its own order (asc/desc).
In the following example, the dataframe is sorted by "column-a" first, in ascending order. Rows having the same value for "column-a" are then sorted by "column-b" in descending order.
Python
df = dataset_reader.sort([('column-a', 'asc'), ('column-b', 'desc')]).read()
R
# tuple() comes from reticulate and builds the (column, order) pairs
df <- dataset_reader$sort(list(tuple('column-a', 'asc'), tuple('column-b', 'desc')))$read()
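The ordering described above (first column most significant, ties broken by the next column) can be illustrated on plain in-memory rows. This Python sketch does not use the SDK and is not how the SDK sorts; sort_rows is a hypothetical helper that only demonstrates the expected row order.

```python
def sort_rows(rows, order):
    """Sort dict rows by (column, 'asc' | 'desc') pairs, first pair most significant."""
    result = list(rows)
    # Python's sort is stable, so apply the least significant key first.
    for column, direction in reversed(order):
        result.sort(key=lambda row: row[column], reverse=(direction == "desc"))
    return result


rows = [
    {"column-a": 2, "column-b": 1},
    {"column-a": 1, "column-b": 1},
    {"column-a": 1, "column-b": 3},
]
ordered = sort_rows(rows, [("column-a", "asc"), ("column-b", "desc")])
```

Here the two rows with "column-a" equal to 1 come first, with the row whose "column-b" is 3 ahead of the row whose "column-b" is 1.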
Basic writing of data
Note: The organization is set within the client_context.
To write data in Python and R, use one of the following examples:
Python
from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter
dataset = Dataset(client_context).get_by_id("{DATASET_ID}")
dataset_writer = DatasetWriter(client_context, dataset)
write_tracker = dataset_writer.write({PANDA_DATAFRAME}, file_format='json')
R
dataset <- psdk$models$Dataset(client_context)$get_by_id("{DATASET_ID}")
dataset_writer <- psdk$dataset_writer$DatasetWriter(client_context, dataset)
write_tracker <- dataset_writer$write({PANDA_DATAFRAME}, file_format='json')
Next steps
Once you have configured the platform_sdk data loader, the data undergoes preparation and is then split into the train and val datasets. To learn about data preparation and feature engineering, please visit the section on data preparation and feature engineering in the tutorial for creating a recipe using JupyterLab notebooks.