Data Configuration
To register your reference and evaluation datasets, you must configure a SingleDataInfo object, which contains two elements:
a connection_info object that specifies the location of the dataset; and
a data_params object that characterizes your dataset’s structure and how its contents will be used.
Parameter | Type | Description |
---|---|---|
connection_info | ConnectionInfo | Path to a ConnectionInfo object. |
data_params | DataInfoParams | Path to a data_params object. |
{
  "data_info": {
    "connection_info": {},
    "data_params": {}
  }
}
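If you generate configurations from a script, the skeleton above can be assembled as a plain dictionary and serialized. The following is a minimal sketch rather than an official client API: `build_data_info` is a hypothetical helper, and the file path and column name are placeholders. The `data_file` connection type and the `label_col` parameter are described later on this page.

```python
import json


def build_data_info(connection_info: dict, data_params: dict) -> dict:
    """Wrap the two elements in the top-level "data_info" key (hypothetical helper)."""
    return {
        "data_info": {
            "connection_info": connection_info,
            "data_params": data_params,
        }
    }


# Placeholder values for illustration only.
config = build_data_info(
    connection_info={"data_file": {"path": "s3://example-bucket/eval.csv"}},
    data_params={"label_col": "label"},
)
print(json.dumps(config, indent=2))
```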
Set up your connection_info and data_params as shown below.
Connection Info
The connection_info object specifies how you will connect to your data source. Specify it as shown below for your data source type:
Load a data file from a cloud data store
Use this approach for loading from a cloud storage service such as AWS S3 or Azure Blob Storage.
a) Parameter = data_file
b) Type = DataFileInfo
{
  "connection_info": {
    "data_file": {
      "path": ""
    }
  }
}
Parameter | Type | Description |
---|---|---|
path | string | The path to the file. |
data_type | string | Default value is "DATA_TYPE_UNSPECIFIED". Should be set to "DATA_TYPE_DELTA_TABLE" when the path points to a Delta Table on S3. |
See the AWS data loading and Azure data loading examples.
Load a data file from a custom data loader
a) Parameter = data_loading
b) Type = DataLoadingInfo
{
  "connection_info": {
    "data_loading": {
      "path": "",
      "load_func_name": "",
      "loader_kwargs_json": "",
      "data_endpoint_integration_id": ""
    }
  }
}
Parameter | Type | Description |
---|---|---|
path | string | The path to the file. |
load_func_name | string | The function to call from the file at the path. |
loader_kwargs_json | string | A JSON string of keyword arguments to be passed into the load function. |
data_endpoint_integration_id | Optional string | The UUID of an integration whose secrets, stored as key/value pairs, are provided to the custom loader at runtime. |
See the custom data loader example.
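Because the `loader_kwargs_json` field in the template above holds a JSON string rather than a nested object, the keyword arguments need to be serialized before they go into the configuration. A minimal sketch, assuming a hypothetical loader file `custom_loader.py` that defines a function `load_data`; the helper name and the keyword arguments are also illustrative and not part of the product.

```python
import json


def build_data_loading(path, load_func_name, loader_kwargs, integration_id=None):
    """Build a data_loading connection_info block (hypothetical helper)."""
    data_loading = {
        "path": path,
        "load_func_name": load_func_name,
        # The field holds a JSON string, so serialize the kwargs dict.
        "loader_kwargs_json": json.dumps(loader_kwargs),
    }
    # The integration ID is optional, so only include it when one is given.
    if integration_id is not None:
        data_loading["data_endpoint_integration_id"] = integration_id
    return {"connection_info": {"data_loading": data_loading}}


block = build_data_loading(
    path="custom_loader.py",      # hypothetical file
    load_func_name="load_data",   # hypothetical function name
    loader_kwargs={"start_date": "2023-01-01", "limit": 1000},  # hypothetical
)
print(json.dumps(block, indent=2))
```

Serializing with `json.dumps` avoids hand-escaping quotes inside the string value.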
Load a Databricks Delta Lake table
a) Parameter = databricks
b) Type = DatabricksInfo
{
  "connection_info": {
    "databricks": {
      "table_name": "Table"
    }
  }
}
Parameter | Type | Description |
---|---|---|
table_name | String | The name of the Delta Lake table. |
See the Databricks Delta Lake example.
Load a Hugging Face dataset
a) Parameter = hugging_face
b) Type = HuggingFaceDataInfo
{
  "connection_info": {
    "hugging_face": {
      "dataset_uri": "",
      "split_name": "",
      "loading_params_json": ""
    }
  }
}
Parameter | Type | Description |
---|---|---|
dataset_uri | String | The unique identifier of the dataset. |
split_name | String | The name of a predefined subset of data. |
loading_params_json | String | A JSON serialized string that contains loading parameters. |
See the Hugging Face example.
Load a data stream from a data collector (deprecated)
a) Parameter = data_collector
b) Type = DataCollectorInfo
{
  "connection_info": {
    "data_collector": {
      "data_stream_id": null,
      "start_time": 0,
      "end_time": 0
    }
  }
}
Note: This can only be specified as part of a Continuous Testing configuration.
Parameter | Type | Description |
---|---|---|
data_stream_id | rime.UUID | The unique identifier assigned by Robust Intelligence to a data stream. |
start_time | int64 | The start time in seconds from the UNIX epoch. |
end_time | int64 | The end time in seconds from the UNIX epoch. |
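Since `start_time` and `end_time` are expressed in seconds from the UNIX epoch, calendar dates have to be converted first. A minimal sketch using only the Python standard library; `to_epoch_seconds` is a hypothetical helper and the dates are placeholders.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole seconds since the UNIX epoch."""
    if dt.tzinfo is None:
        # Naive datetimes are interpreted as local time, which is easy to get wrong.
        raise ValueError("use a timezone-aware datetime")
    return int(dt.timestamp())


start = to_epoch_seconds(datetime(2023, 1, 1, tzinfo=timezone.utc))
end = to_epoch_seconds(datetime(2023, 1, 2, tzinfo=timezone.utc))

data_collector = {
    "data_stream_id": None,  # replace with the UUID of your data stream
    "start_time": start,
    "end_time": end,
}
```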
Data Parameters
To configure a data source, specify a mapping in the data_params argument of the main Robust Intelligence JSON configuration file. The data_params configuration can take different forms, offering a tradeoff between simplicity and flexibility.
Default Data Params Template
{
  "data_params": {
    "label_col": "",
    "timestamp_col": "",
    "class_names": [],
    "ranking_info": {
      "query_col": "",
      "nqueries": null,
      "nrows_per_query": null,
      "drop_query_id": null
    },
    "nrows": null,
    "nrows_per_time_bin": null,
    "sample": true,
    "categorical_features": [],
    "protected_features": [],
    "features_not_in_model": [],
    "text_features": [],
    "image_features": [],
    "intersections": [
      {
        "features": []
      }
    ],
    "loading_kwargs": "",
    "feature_type_path": "",
    "image_load_path": ""
  }
}
Parameters for the data_params object
General
Parameter | Type | Description |
---|---|---|
label_col | String | Name of the column that contains the labels. |
timestamp_col | String | Name of the column that contains the timestamps used for Continuous Testing (CT). |
class_names | Repeated String | List of label class names. |
ranking_info | JSON object | Contains parameters used for the ranking model task. |
query_col | String | Name of column in dataset that contains the query IDs. |
nqueries | Optional int64 | Number of queries to consider. Uses all queries when null. |
nrows_per_query | Optional int64 | Number of rows to use per query. Uses all rows when null. |
drop_query_id | Optional Boolean | Specifies whether to drop the query ID column from the dataset in order to prevent passing the query ID column to the model as a feature. |
Dataset sizing
Parameter | Type | Description |
---|---|---|
nrows | Optional int64 | Number of rows of data to load and test. Loads all rows when null and sample is not specified. Infers the maximum number of rows possible when null and sample is true. See Row Sampling. |
nrows_per_time_bin | Optional int64 | Number of rows of data per time bin to load and test in CT. Loads all rows when null. |
sample | Optional Boolean | Specifies whether to sample rows in order to maximize test data given the available memory. Default is true. See Smart Dataset Sampling. |
Feature types and relations
Parameter | Type | Description |
---|---|---|
categorical_features | Repeated String | A list of categorical features. |
protected_features | Repeated String | A list of features that are protected attributes. When the Bias and Fairness category is specified, these tests are only run over the listed features. |
features_not_in_model | Repeated String | A list of features not present in the model. |
text_features | Repeated String | A list of text features to run NLP tests over. |
image_features | Repeated String | A list of image features to run CV tests over. |
Feature intersections
Parameter | Type | Description |
---|---|---|
intersections | Repeated FeatureIntersection | A list of arrays of features, where each array represents the intersection of features on which certain subset and fairness tests are run. |
The FeatureIntersection message is defined as follows:
Parameter | Type | Description |
---|---|---|
features | Repeated string | A list of feature names over which subgroups are generated. |
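Each entry in `intersections` is an object with a single `features` list, so one intersection of two features is written as one entry. A minimal sketch of building that structure; `build_intersections` is a hypothetical helper and the feature names are placeholders.

```python
import json


def build_intersections(feature_groups):
    """Turn groups of feature names into FeatureIntersection-shaped dicts."""
    return [{"features": list(group)} for group in feature_groups]


# Hypothetical feature names.
intersections = build_intersections([("age", "gender"), ("gender", "region")])
data_params = {"intersections": intersections}
print(json.dumps(data_params, indent=2))
```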
External resources
Parameter | Type | Description |
---|---|---|
loading_kwargs | String | Keyword arguments passed to the pandas loading function. Do not specify nrows here. |
feature_type_path | String | Deprecated. Path to a CSV file that specifies the data type of each feature. The file must have two columns, FeatureName and FeatureType. |
image_load_path | String | Path to a Python file that contains a load_image function defining custom logic for loading an image from the file path provided in the dataset. |
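The `loading_kwargs` entry above is a string and must not contain `nrows`. The sketch below guards against that mistake when the value is produced from a script. It assumes the string is JSON-encoded, which this page does not state explicitly, and `build_loading_kwargs` is a hypothetical helper. The `sep` and `encoding` arguments are standard `pandas.read_csv` parameters used here only as examples.

```python
import json


def build_loading_kwargs(**kwargs) -> str:
    """Serialize pandas loading kwargs, rejecting nrows (assumes JSON encoding)."""
    if "nrows" in kwargs:
        # Row counts belong in the nrows field of data_params instead.
        raise ValueError("set nrows in data_params, not in loading_kwargs")
    return json.dumps(kwargs)


loading_kwargs = build_loading_kwargs(sep=";", encoding="utf-8")
data_params = {"loading_kwargs": loading_kwargs}
```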