Data Configuration
Configuring a data source can be done by specifying a mapping in the main RIME JSON configuration file, under the data_info argument. By default, RIME can load any dataset from disk or cloud storage so long as the files are correctly formatted. RIME additionally supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the Hugging Face datasets hub. The data_info configuration communicates how RIME should ingest data and expects different arguments depending on its type.
The data_info type can direct the RI Platform to load both the reference and evaluation datasets from the same source type, such as files, a user-defined dataloader, or a Hugging Face dataset. Alternatively, to load the reference and evaluation sets from different data sources, you can specify the “split” type. In the split approach, you specify a separate “single” data info struct for each of the reference and evaluation datasets: ref_data_info and eval_data_info. Each of these single data info structs can take on a different configuration type, similar to the types listed above. The full list of single data info configuration templates is given below.
Default Data Info Template
{
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz", (REQUIRED)
        "eval_path": "path/to/eval.jsonl.gz", (REQUIRED)
        "embeddings": null
    },
    ...
}
Arguments
ref_path: string, required
Path to the reference data file. Please reference the NLP file guide for a description of supported file formats.
eval_path: string, required
Path to the evaluation data file. Please reference the NLP file guide for a description of supported file formats.
embeddings: list or null, default = null
A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
    key: string
    Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
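For illustration, a filled-in configuration of this type might look like the sketch below; the file paths and the context_vec embedding key are hypothetical placeholders for values from your own dataset.
{
    "data_info": {
        "ref_path": "data/sentiment/train.jsonl.gz",
        "eval_path": "data/sentiment/test.jsonl.gz",
        "embeddings": [{"key": "context_vec"}]
    },
    ...
}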
Custom Dataloader Template
{
    "data_info": {
        "type": "custom", (REQUIRED)
        "load_path": "path/to/dataloader.py", (REQUIRED)
        "embeddings": null
    },
    ...
}
Arguments
type: string, required
Must be set to “custom”.
load_path: string, required
Path to the custom dataloader file. Please reference the NLP Dataloader documentation for instructions on how to create a compatible file.
embeddings: list or null, default = null
A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
    key: string
    Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
Hugging Face Dataset Template
{
    "data_info": {
        "type": "huggingface", (REQUIRED)
        "dataset_uri": "path", (REQUIRED)
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "text_pair_key": "text_pair",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}
Arguments
type: string, required
Must be set to “huggingface”.
dataset_uri: string, required
The path or tag passed to ‘load_dataset’.
ref_split: string, default = “train”
The key used to access the reference split from the downloaded ‘DatasetDict’.
eval_split: string, default = “test”
The key used to access the evaluation split from the downloaded ‘DatasetDict’.
text_key: string, default = “text”
The feature name for the NLP input text attribute.
text_pair_key: string, default = “text_pair”
The feature name for the second NLP input text attribute (for the NLI model task).
label_key: string or null, default = “label”
The feature name for the label class ID. If null, labels are not loaded.
eval_label_key: string or null, default = “label”
The feature name for the label class ID in the evaluation split. If null, labels are not loaded.
loading_params: dict or null, default = null
Additional kwargs to pass to ‘load_dataset’.
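As a concrete sketch, the hypothetical configuration below points at the public imdb dataset on the Hugging Face hub, which exposes text and label columns and train/test splits; text_pair_key is left at its default because imdb is a single-text classification task.
{
    "data_info": {
        "type": "huggingface",
        "dataset_uri": "imdb",
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}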
Split Data Info Template
{
    "data_info": {
        "ref_data_info": ..., (REQUIRED)
        "eval_data_info": ... (REQUIRED)
    },
    ...
}
Arguments
ref_data_info: SingleDataInfo, required
A single data info struct for the reference dataset (see below).
eval_data_info: SingleDataInfo, required
A single data info struct for the evaluation dataset (see below).
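For example, a split configuration might read the reference set from a file on disk while producing the evaluation set through a custom dataloader. The sketch below composes the file-based and custom dataloader single data info templates described in the next section; the paths and the load_eval_data function name are hypothetical.
{
    "data_info": {
        "ref_data_info": {
            "file_name": "path/to/ref.jsonl.gz"
        },
        "eval_data_info": {
            "load_path": "path/to/custom_loader.py",
            "load_func_name": "load_eval_data"
        }
    },
    ...
}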
Single Data Info Templates
These single data info structs can be used to specify both ref_data_info and eval_data_info in the split data info template above.
Note that all single data info structs also take in a set of NLP parameters which allow the user to additionally specify properties of their data, such as predictions in prediction_info and embeddings in embeddings.
The full list is detailed below.
General NLP Parameters for Single Data Info
{
    "prediction_info": null,
    "embeddings": null
}
Arguments
prediction_info: mapping, default = null
Arguments to specify prediction info. Very similar to the prediction_info struct in the Prediction Configuration page. Note that only one of these two structs can be specified: if prediction_info is specified in the reference and evaluation single data info structs, it cannot also be specified as a separate top-level struct in the JSON configuration.
    path: string or null, default = null
    Path to the prediction cache corresponding to the data file. Please see the NLP Prediction Cache Data Format reference for a description of the supported file format.
    n_samples: int or null, default = null
    Number of samples from each dataset to score. If both ref_path and eval_path are specified, this must be set to null. If either prediction cache is not specified and n_samples is set to null, the default is to score the entire dataset. If model throughput is low, it is recommended to use a prediction cache or specify a smaller value for n_samples.
embeddings: list or null, default = null
A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
    key: string
    Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
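For example, a single data info struct that supplies a prediction cache and marks a dense embedding feature might carry the NLP parameters sketched below (the cache path and the context_vec key are hypothetical); these keys sit alongside the struct's source-specific fields, such as file_name in the file-based template that follows.
{
    "prediction_info": {
        "path": "path/to/eval_prediction_cache.jsonl.gz"
    },
    "embeddings": [{"key": "context_vec"}]
}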
File-based Single Data Info Template
{
    "file_name": "path/to/file.csv",
    **nlp_params
}
Arguments
file_name: string, required
Path to the data file.
**nlp_params: Dict
See NLP Parameters above.
Custom Dataloader Single Data Info Template
{
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_fn_name",
    "loader_kwargs": null,
    "loader_kwargs_json": null,
    **nlp_params
}
Arguments
load_path: string, required
Path to the custom loader Python file.
load_func_name: string, required
Name of the loader function. Must be defined within the Python file.
loader_kwargs: Dict, default = null
Arguments to pass to the loader function, in dictionary form. These arguments are passed to the loader function as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
loader_kwargs_json: string, default = null
Arguments to pass to the loader function, in JSON-serialized string form. These arguments are passed to the loader function as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
**nlp_params: Dict
See NLP Parameters above.
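As a sketch, suppose custom_loader.py defines a hypothetical loader function load_eval_data(split, max_rows); the struct below would invoke it with those keyword arguments. Equivalently, the same arguments could be supplied as a JSON-serialized string through loader_kwargs_json, remembering that only one of the two fields may be set.
{
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_eval_data",
    "loader_kwargs": {
        "split": "eval",
        "max_rows": 5000
    }
}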
Data Collector Single Data Info Template
NOTE: this struct can only be specified as part of a Continuous Testing config, not an offline testing config. See the Continuous Tests Configuration page for more details.
{
    "start_time": start_time,
    "end_time": end_time,
    **nlp_params
}
Arguments
start_time: int, required
Start of the time range from which the data collector fetches data. Format is UNIX epoch time in seconds.
end_time: int, required
End of the time range from which the data collector fetches data. Format is UNIX epoch time in seconds.
**nlp_params: Dict
See NLP Parameters above.
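For instance, a data collector struct covering the first week of January 2023 (UTC) would use the epoch-second values sketched below; prediction info or embeddings could be added alongside these fields through the NLP parameters described above.
{
    "start_time": 1672531200,
    "end_time": 1673136000
}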
Hugging Face Single Data Info Template
{
    "type": "huggingface", (REQUIRED)
    "dataset_uri": "path", (REQUIRED)
    "split_name": "train",
    "text_key": "text",
    "text_pair_key": "text_pair",
    "label_key": "label",
    "loading_params": null
}
Arguments
type: string, required
Must be set to “huggingface”.
dataset_uri: string, required
The path or tag passed to ‘load_dataset’.
split_name: string, default = “train”
The key used to access the split from the downloaded ‘DatasetDict’.
text_key: string, default = “text”
The feature name for the NLP input text attribute.
text_pair_key: string, default = “text_pair”
The feature name for the second NLP input text attribute (for the NLI model task).
label_key: string or null, default = “label”
The feature name for the label class ID. If null, labels are not loaded.
loading_params: dict or null, default = null
Additional kwargs to pass to ‘load_dataset’.
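As an illustrative, hypothetical example, the struct below could serve as the eval_data_info in a split configuration, reading the validation split of the SST-2 subset of the public glue dataset, whose input text is stored under the sentence column; the subset name is forwarded to ‘load_dataset’ through loading_params.
{
    "type": "huggingface",
    "dataset_uri": "glue",
    "split_name": "validation",
    "text_key": "sentence",
    "label_key": "label",
    "loading_params": {"name": "sst2"}
}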