Data Configuration
Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the
data_info
argument.
The data_info
configuration can take on different forms, offering a tradeoff between simplicity and flexibility. In the default approach,
you may choose to specify the file paths to both the reference and evaluation datasets, along with supporting arguments such as pred_col
,
label_col
, and more. This is the easiest configuration to get started with.
In the split approach, you may choose to specify a separate “single” data info struct for both the reference and evaluation datasets:
ref_data_info
and eval_data_info
. Both the reference and evaluation single data info structs can take on different configuration types,
from a file-path based approach, to a custom data loader, to a Delta Lake table, to our data collector (for Continuous Testing only). This allows
you to specify different data loaders for the reference and evaluation datasets. The configuration
templates for all of these data infos are also detailed below.
NOTE: for AI Continuous Testing, predictions are required. Either pred_col
must be specified or ref_pred_path
and eval_pred_path
must be specified.
Default Data Info Template
{
"data_info": {
"ref_path": "path/to/ref.csv", (REQUIRED)
"eval_path": "path/to/eval.csv", (REQUIRED)
"label_col": "Label",
"pred_col": "Prediction", (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"ref_pred_path": "path/to/ref/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"eval_pred_path": "path/to/eval/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"nrows": null,
"categorical_features": null,
"protected_features": null,
"features_not_in_model": null,
"feature_type_path": null,
"ranking_info": null, (REQUIRED FOR RANKING)
"loading_kwargs": null,
"embeddings": null
},
...
}
Arguments
ref_path
: string, requiredPath to reference data file.
eval_path
: string, requiredPath to evaluation data file.
label_col
: string or null, default =null
Name of column in data that corresponds to the labels.
pred_col
: string or null, default =null
Name of column in data that corresponds to the predictions.
ref_pred_path
: string or null, default =null
Path to a CSV or Parquet file containing the predictions on the reference dataset. This is how predictions are specified for multi-class models.
eval_pred_path
: string or null, default =null
Path to a CSV or Parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.
nrows
: int or null, default =null
Number of rows of data to load and test. If
null
, will load all rows. By default it isnull
.categorical_features
: list or null, default =null
List of categorical features in data. If provided, these should be ALL the categorical features. If
null
, RIME will automatically determine whether a column is categorical or not. By default it isnull
.protected_features
: list or null, default =null
List of protected features in data. If the
Bias And Fairness
category is added tocategories
in the test config (see TestSuiteConfig(), andprotected_features
are included - a set of bias and fairness tests will be run over the protected features.features_not_in_model
: list or null, default =null
List of features in the dataset that are not used by the model. Specifying this will ensure that only relevant tests are run on these features.
feature_type_path
: string, default =null
Path to a CSV file that specifies the data type of each feature. The file should have two columns:
FeatureName
andFeatureType
. The possible values forFeatureType
areBoolCategoricalColumn
,DomainColumn
,EmailColumn
,NumericCategoricalColumn
,StringCategoricalColumn
,UrlColumn
,FloatColumn
,IntegerColumn
.ranking_info
: mapping, default =null
Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task this value should be null. If you are running on a Ranking task, the following keys should be provided:
query_col
: string, requiredName of column in dataset that contains the query ids.
nqueries
: int or null, default =null
Number of queries to consider when running RIME. If
null
, will use all queries.nrows_per_query
: int or null, default =null
Number of rows to use per query when running RIME. If
null
, will use all rows.drop_query_id
: bool, default = TrueWhether to drop the query ID column from the dataset to avoid passing as a feature to the model.
loading_kwargs
: mapping, default =null
Keyword arguments to be passed to the
pandas
loading function (eitherpd.read_CSV
orpd.read_Parquet
, depending on your data format). NOTE: if you wish to specifynrows
, this should NOT be done with these kwargs but rather with thenrows
parameter above.embeddings
: list ornull
, default =null
A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
name
: stringName of the embedding to be shown in the test run results.
cols
: listList of column names corresponding to the embedding. For example, suppose each data row is represented by columns
age
,is_member
,query_0
,query_1
,query_2
, etc., where each column “query_\d
” is a dimension of a sentence embedding extracted from a user query. Specifying"embeddings": [{"name": "user_query", "cols": ["query_0", "query_1", "query_2", ...]}]
(i.e., specifying each column contained by the embedding) would direct the RI Platform to treat these columns as a single dense vector-valued embedding feature.
Split Data Info Template
{
"data_info": {
"ref_data_info": ..., (REQUIRED)
"eval_data_info": ..., (REQUIRED)
},
...
}
Arguments
ref_data_info
: SingleDataInfo, requiredPath to single data info struct (see below).
eval_data_info
: SingleDataInfo, requiredPath to single data info struct (see below).
Single Data Info Templates
Note that these single data info structs can be used to specify both the ref_data_info
as well as eval_data_info
in the split data into template above.
Note that all single data info structs also take in a set of tabular parameters which allow the user to additionally
specify properties of their data. These parameters include fields such as label_col
, pred_col
, nrows
, protected_features
,
and more. The full list is detailed below. Note that for AI Continuous Testing, predictions are required.
Either pred_col
must be specified or pred_path
must be specified.
General Tabular Parameters for Single Data Info
{
"label_col": "Label",
"pred_col": "Prediction", (WORKS FOR ALL TASKS EXCEPT MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"pred_path": "path/to/preds.csv", (ONLY SPECIFY FOR MULTI-CLASS, REQUIRED FOR CONTINUOUS TESTING)
"nrows": null,
"nrows_per_time_bin": null,
"categorical_features": null,
"protected_features": null,
"features_not_in_model": null,
"feature_type_path": null,
"ranking_info": null, (REQUIRED FOR RANKING)
"loading_kwargs": null,
"embeddings": null
}
Arguments
label_col
: string or null, default =null
Name of column in data that corresponds to the labels.
pred_col
: string or null, default =null
Name of column in data that corresponds to the predictions.
pred_path
: string or null, default =null
Path to a CSV or Parquet file containing the predictions on the evaluation dataset. This is how predictions are specified for multi-class models.
nrows
: int or null, default =null
Number of rows of data to load and test. If
null
, will load all rows. By default it isnull
.nrows_per_time_bin
: int or null, default =null
Number of rows of data per time bin to load and test in continuous testing (does not affect stress testing results). If
null
, will load all rows. By default it isnull
.categorical_features
: list or null, default =null
List of categorical features in data. If provided, these should be ALL the categorical features. If
null
, RIME will automatically determine whether a column is categorical or not. By default it isnull
.protected_features
: list or null, default =null
List of protected features in data. If the
Bias And Fairness
category is added tocategories
in the test config (see TestSuiteConfig(), andprotected_features
are included - a set of bias and fairness tests will be run over the protected features.features_not_in_model
: list or null, default =null
List of features in the dataset that are not used by the model. Specifying this will ensure that only relevant tests are run on these features.
feature_type_path
: string, default =null
Path to a CSV file that specifies the data type of each feature. The file should have two columns:
FeatureName
andFeatureType
. The possible values forFeatureType
areBoolCategoricalColumn
,DomainColumn
,EmailColumn
,NumericCategoricalColumn
,StringCategoricalColumn
,UrlColumn
,FloatColumn
,IntegerColumn
.ranking_info
: mapping, default =null
Arguments to be used for Ranking tasks. If you are not running RIME on a Ranking task this value should be null. If you are running on a Ranking task, the following keys should be provided:
query_col
: string, requiredName of column in dataset that contains the query ids.
nqueries
: int or null, default =null
Number of queries to consider when running RIME. If
null
, will use all queries.nrows_per_query
: int or null, default =null
Number of rows to use per query when running RIME. If
null
, will use all rows.drop_query_id
: bool, default = TrueWhether to drop the query ID column from the dataset to avoid passing as a feature to the model.
loading_kwargs
: mapping, default =null
Keyword arguments to be passed to the
pandas
loading function (eitherpd.read_CSV
orpd.read_Parquet
, depending on your data format). NOTE: if you wish to specifynrows
, this should NOT be done with these kwargs but rather with thenrows
parameter above.embeddings
: list ornull
, default =null
A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
name
: stringName of the embedding to be shown in the test run results.
cols
: listList of column names corresponding to the embedding. For example, suppose each data row is represented by columns
age
,is_member
,query_0
,query_1
,query_2
, etc., where each column “query_\d
” is a dimension of a sentence embedding extracted from a user query. Specifying"embeddings": [{"name": "user_query", "cols": ["query_0", "query_1", "query_2", ...]}]
(i.e., specifying each column contained by the embedding) would direct the RI Platform to treat these columns as a single dense vector-valued embedding feature.
File-based Single Data Info Template
{
"file_name": "path/to/file.csv",
**tabular_params
}
Arguments
file_name
: string, requiredPath to data file.
**tabular_params
: DictSee Tabular Parameters above.
Custom Dataloader Single Data Info Template
{
"load_path": "path/to/custom_loader.py",
"load_func_name": "load_fn_name",
"loader_kwargs": null,
"loader_kwargs_json": null
**tabular_params
}
Arguments
load_path
: string, requiredPath to custom loader Python file.
load_func_name
: string, requiredName of the loader function. Must be defined within the Python file.
loader_kwargs
: Dict, default =null
Arguments to pass in to the loader function, in dictionary form. We pass these arguments in as **kwargs. Only one of
loader_kwargs
andloader_kwargs_json
can be specified.**loader_kwargs_json
: DictArguments to pass in to the loader function, in JSON-serialized string form. We pass these arguments in as **kwargs. Only one of
loader_kwargs
andloader_kwargs_json
can be specified.**tabular_params
: DictSee Tabular Parameters above.
Data Collector Single Data Info Template
NOTE: this can only be specified as part of a Continuous Testing config, not offline testing config. See the Continuous Tests Configuration for more details.
{
"start_time": start_time,
"end_time": end_time,
**tabular_params
}
Arguments
start_time
: int, requiredStart time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
end_time
: int, requiredEnd time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
**tabular_params
: DictSee Tabular Parameters above.
Delta Lake Single Data Info Template
NOTE: the Databricks secret access token (along with the “server_hostname” and “http_path”) are specified as part of Data Sources Configuration. If you are launching a manual Stress Test run or Continuous Test run, do not fill in the server_hostname or http_path fields; those fields + the secret token will automatically be inserted if you specify the corresponding data source name when kicking off the run.
{
"server_hostname": "<workspace-id>.cloud.databricks.com",
"http_path": "http/path",
"table_name": "table_name",
"start_time": start_time,
"end_time": end_time,
"time_col": "Timestamp column",
**tabular_params
}
Arguments
server_hostname
: string, requiredServer hostname of the Databricks cluster. This should be specified as part of a Data Source. See Databricks docs for more details.
http_path
: string, requiredHTTP Path of the Databricks cluster. This should be specified as part of a Data Source. See Databricks docs for more details.
table_name
: string, requiredName of the Delta Lake table.
start_time
: int, requiredStart time of the Delta Lake table to fetch data from. Format is UNIX epoch time in seconds.
end_time
: int, requiredEnd time of the Delta Lake table to fetch data from. Format is UNIX epoch time in seconds.
time_col
: string, requiredName of the timestamp column. This is a required field in order to determine the range of data which satisfies the
start_time
andend_time
params.**tabular_params
: DictSee Tabular Parameters above.