Data Configuration

To configure a data source, specify a mapping under the data_info argument of the main RIME JSON configuration file. By default, RIME can load any dataset from disk or cloud storage as long as the files are correctly formatted. RIME also supports user-defined dataloaders contained in a configured Python file, as well as a native integration with the Hugging Face datasets hub.

The data_info configuration tells RIME how to ingest data and expects different arguments depending on its type. A data_info type can direct the RI Platform to load both the reference and evaluation datasets from the same source type, such as files, a user-defined dataloader, or a Hugging Face dataset.

Alternatively, to load reference and evaluation sets from different data sources, you can specify the “split” type. In the split approach, you may specify a separate “single” data info struct for each of the reference and evaluation datasets: ref_data_info and eval_data_info. Each of the reference and evaluation single data structs can take on a different configuration type, similar to the types listed above. The full list of single data info configuration templates is given below.

Default Data Info Template

{
    "data_info": {
        "ref_path": "path/to/ref.jsonl.gz",        (REQUIRED)
        "eval_path": "path/to/eval.jsonl.gz",      (REQUIRED)
        "embeddings": null
    },
    ...
}

Arguments

ref_path: string, required

Path to the reference data file. Please refer to the NLP file guide for a description of supported file formats.
eval_path: string, required

Path to the evaluation data file. Please refer to the NLP file guide for a description of supported file formats.
embeddings: list or null, default = null

A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
- key: string
  
  Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
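
As a minimal sketch, the default template above could be assembled in Python and serialized to JSON. The helper name build_default_data_info is hypothetical and not part of RIME; the paths are the placeholders from the template, and context_vec is the example key from the description above.

```python
import json


def build_default_data_info(ref_path, eval_path, embedding_keys=None):
    """Assemble a default (file-based) data_info mapping.

    Hypothetical helper for illustration only; not part of RIME.
    """
    embeddings = None
    if embedding_keys:
        # One dictionary per embedding, each naming its key in the data point.
        embeddings = [{"key": key} for key in embedding_keys]
    return {
        "ref_path": ref_path,
        "eval_path": eval_path,
        "embeddings": embeddings,
    }


config = {
    "data_info": build_default_data_info(
        "path/to/ref.jsonl.gz",
        "path/to/eval.jsonl.gz",
        embedding_keys=["context_vec"],
    )
}

# Serialize the mapping as it would appear in the JSON configuration file.
print(json.dumps(config, indent=4))
```

Leaving embedding_keys unset keeps embeddings at its default of null.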

Custom Dataloader Template

{
    "data_info": {
        "type": "custom",                           (REQUIRED)
        "load_path": "path/to/dataloader.py",       (REQUIRED)
        "embeddings": null
    },
    ...
}

Arguments

type: string, required

Must be set to “custom”.
load_path: string, required

Path to the custom dataloader file. Please refer to the NLP Dataloader documentation for instructions on how to create a compatible file.
embeddings: list or null, default = null

A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
- key: string
  
  Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.

Hugging Face Dataset Template

{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "ref_split": "train",
        "eval_split": "test",
        "text_key": "text",
        "text_pair_key": "text_pair",
        "label_key": "label",
        "eval_label_key": "label",
        "loading_params": null
    },
    ...
}

Arguments

type: string, required

Must be set to “huggingface”.
dataset_uri: string, required

The path or tag passed to ‘load_dataset’.
ref_split: string, default = “train”

The key used to access the reference split from the downloaded ‘DatasetDict’.
eval_split: string, default = “test”

The key used to access the evaluation split from the downloaded ‘DatasetDict’.
text_key: string, default = “text”

The feature name for the NLP input text attribute.
text_pair_key: string, default = “text_pair”

The feature name for the second NLP input text attribute (for the NLI model task).
label_key: string or null, default = “label”

The feature name for the label class ID. If null, don’t load labels.
eval_label_key: string or null, default = “label”

The feature name for the label class ID in the evaluation split. If null, don’t load labels.
loading_params: dict or null, default = null

Additional kwargs to pass to ‘load_dataset’.
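
The sketch below illustrates how these arguments plausibly relate to a load_dataset call. It is an assumption about the mapping, not RIME's actual implementation. The loader is passed in as a parameter so the example runs without network access; fake_load_dataset is a hypothetical stand-in for datasets.load_dataset, and the split and feature names in the sample data are invented for illustration.

```python
def load_ref_and_eval(data_info, load_dataset):
    """Resolve the reference and evaluation splits named in data_info.

    `load_dataset` is injected; in practice it would be
    `datasets.load_dataset` from the Hugging Face datasets library.
    """
    loading_params = data_info.get("loading_params") or {}
    dataset_dict = load_dataset(data_info["dataset_uri"], **loading_params)
    ref_split = dataset_dict[data_info.get("ref_split", "train")]
    eval_split = dataset_dict[data_info.get("eval_split", "test")]
    return ref_split, eval_split


def extract_examples(split, text_key="text", label_key="label"):
    """Pull the text (and the label, unless label_key is None) from each row."""
    examples = []
    for row in split:
        example = {"text": row[text_key]}
        if label_key is not None:
            example["label"] = row[label_key]
        examples.append(example)
    return examples


def fake_load_dataset(path, **kwargs):
    """Hypothetical stand-in for datasets.load_dataset (rows as plain dicts)."""
    return {
        "train": [{"sentence": "good movie", "label": 1}],
        "validation": [{"sentence": "bad movie", "label": 0}],
    }


data_info = {
    "type": "huggingface",
    "dataset_uri": "path",
    "eval_split": "validation",
    "text_key": "sentence",
    "eval_label_key": None,  # null: do not load labels for the evaluation split
}

ref_split, eval_split = load_ref_and_eval(data_info, fake_load_dataset)
text_key = data_info.get("text_key", "text")
ref_examples = extract_examples(
    ref_split, text_key, data_info.get("label_key", "label")
)
eval_examples = extract_examples(
    eval_split, text_key, data_info.get("eval_label_key", "label")
)
print(ref_examples)
print(eval_examples)
```

Here ref_split falls back to its default of "train", while eval_split and text_key override their defaults.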

Split Data Info Template

{
    "data_info": {
        "ref_data_info": ...,              (REQUIRED)
        "eval_data_info": ...              (REQUIRED)
    },
    ...
}

Arguments

ref_data_info: SingleDataInfo, required

Single data info struct for the reference dataset (see below).
eval_data_info: SingleDataInfo, required

Single data info struct for the evaluation dataset (see below).
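
For illustration, a split configuration might pair a file-based reference set with an evaluation set produced by a custom loader. This is only a sketch: the values are the placeholders used in the single data info templates on this page, not working paths.

```python
import json

# Reference data comes from a file; evaluation data comes from a custom loader.
split_data_info = {
    "ref_data_info": {
        "file_name": "path/to/file.csv",
    },
    "eval_data_info": {
        "load_path": "path/to/custom_loader.py",
        "load_func_name": "load_fn_name",
    },
}

config = {"data_info": split_data_info}
print(json.dumps(config, indent=4))
```

Each side is configured independently, so the two structs do not need to share a configuration type.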

Single Data Info Templates

Note that these single data info structs can be used to specify both the ref_data_info and the eval_data_info in the split data info template above.

All single data info structs also accept a set of NLP parameters that let the user specify additional properties of their data, such as predictions in prediction_info and embeddings in embeddings. The full list is detailed below.

General NLP Parameters for Single Data Info

{
    "prediction_info": null,
    "embeddings": null
}

Arguments

prediction_info: mapping, default = null

Arguments to specify prediction info, very similar to the prediction_info struct described on the Prediction Configuration page. Note that only one of these two structs can be specified: if prediction_info is specified in the reference and evaluation single data info structs, it cannot also be specified as a separate top-level struct in the JSON configuration.
- path: string or null, default = null
  
  Path to the prediction cache corresponding to the data file. Please see the NLP Prediction Cache Data Format reference for a description of the supported file format.
- n_samples: int or null, default = null
  
  Number of samples from each dataset to score. If both ref_path and eval_path are specified, this must be set to null. If either prediction cache is not specified and n_samples is set to null, the default is to score the entire dataset. If model throughput is low, it is recommended to use a prediction cache or specify a smaller value for n_samples.
embeddings: list or null, default = null

A list of dictionaries corresponding to information for each embedding. The arguments for each dictionary are described below.
- key: string
  
  Name of the key in the data dictionary corresponding to the specified embedding. For example, if each data point is represented by {"text": "", "label": 1, "probabilities": [...], "context_vec": [...]}, specifying embeddings: [{"key": "context_vec"}] in the data_info would direct the RI Platform to treat this value as a dense vector-valued embedding feature.
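
The rule that prediction_info may appear either inside the single data info structs or at the top level, but not both, can be checked before a run. The function below is a hypothetical pre-flight check, not part of RIME. It treats a prediction_info in either single struct as conflicting with a top-level one, which is an assumption about how strictly the rule applies. The prediction cache paths are placeholders.

```python
def check_prediction_info(config):
    """Return the single structs that set prediction_info.

    Raises ValueError if prediction_info is also set at the top level.
    Hypothetical check for illustration; RIME's own validation may differ.
    """
    data_info = config.get("data_info", {})
    nested = [
        name
        for name in ("ref_data_info", "eval_data_info")
        if (data_info.get(name) or {}).get("prediction_info") is not None
    ]
    if nested and config.get("prediction_info") is not None:
        raise ValueError(
            "prediction_info is set in "
            + ", ".join(nested)
            + " and also at the top level; specify only one."
        )
    return nested


config = {
    "data_info": {
        "ref_data_info": {
            "file_name": "path/to/file.csv",
            "prediction_info": {"path": "path/to/ref_preds", "n_samples": None},
        },
        "eval_data_info": {
            "file_name": "path/to/file.csv",
            "prediction_info": {"path": "path/to/eval_preds", "n_samples": None},
        },
    }
}

print(check_prediction_info(config))
```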

File-based Single Data Info Template

{
    "file_name": "path/to/file.csv",
    **nlp_params
}

Arguments

file_name: string, required

Path to data file.
**nlp_params: Dict

See NLP Parameters above.

Custom Dataloader Single Data Info Template

{
    "load_path": "path/to/custom_loader.py",
    "load_func_name": "load_fn_name",
    "loader_kwargs": null,
    "loader_kwargs_json": null,
    **nlp_params
}

Arguments

load_path: string, required

Path to custom loader Python file.
load_func_name: string, required

Name of the loader function. Must be defined within the Python file.
loader_kwargs: Dict, default = null

Arguments to pass in to the loader function, in dictionary form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
loader_kwargs_json: string, default = null

Arguments to pass in to the loader function, in JSON-serialized string form. We pass these arguments in as **kwargs. Only one of loader_kwargs and loader_kwargs_json can be specified.
**nlp_params: Dict

See NLP Parameters above.
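
The two ways of passing loader arguments are equivalent once the JSON string is decoded. The sketch below is an illustration of that behavior rather than RIME's implementation. resolve_loader_kwargs is a hypothetical helper, and the load_fn_name body, its parameters, and its list-of-dictionaries return value are assumptions; the NLP Dataloader documentation defines what a compatible loader must return.

```python
import json


def resolve_loader_kwargs(single_data_info):
    """Return the keyword arguments for the loader function as a dict."""
    kwargs = single_data_info.get("loader_kwargs")
    kwargs_json = single_data_info.get("loader_kwargs_json")
    if kwargs is not None and kwargs_json is not None:
        raise ValueError(
            "Specify only one of loader_kwargs and loader_kwargs_json."
        )
    if kwargs_json is not None:
        # The JSON-serialized form is decoded into a dictionary first.
        return json.loads(kwargs_json)
    return kwargs or {}


def load_fn_name(n_rows=1, label=0):
    """Hypothetical loader; the return format here is an assumption."""
    return [{"text": "", "label": label} for _ in range(n_rows)]


as_dict = {
    "load_func_name": "load_fn_name",
    "loader_kwargs": {"n_rows": 2, "label": 1},
}
as_json = {
    "load_func_name": "load_fn_name",
    "loader_kwargs_json": '{"n_rows": 2, "label": 1}',
}

# Both forms end up unpacked into the loader as **kwargs.
rows_from_dict = load_fn_name(**resolve_loader_kwargs(as_dict))
rows_from_json = load_fn_name(**resolve_loader_kwargs(as_json))
print(rows_from_dict == rows_from_json)
```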

Data Collector Single Data Info Template

NOTE: this can only be specified as part of a Continuous Testing config, not an offline testing config. See the Continuous Tests Configuration for more details.

{
    "start_time": start_time,
    "end_time": end_time,
    **nlp_params
}

Arguments

start_time: int, required

Start time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
end_time: int, required

End time of the data collector to fetch data from. Format is UNIX epoch time in seconds.
**nlp_params: Dict

See NLP Parameters above.
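
Because start_time and end_time are UNIX epoch times in seconds, one way to produce them is from timezone-aware datetime objects. The dates below are arbitrary examples chosen for illustration.

```python
from datetime import datetime, timezone

# Use timezone-aware datetimes so the conversion does not depend on local time.
window_start = datetime(2023, 1, 1, tzinfo=timezone.utc)
window_end = datetime(2023, 1, 8, tzinfo=timezone.utc)

data_collector_info = {
    "start_time": int(window_start.timestamp()),  # whole seconds since the epoch
    "end_time": int(window_end.timestamp()),
}
print(data_collector_info)
```

Naive datetime objects are interpreted in the machine's local timezone by timestamp(), so attaching an explicit tzinfo avoids windows that shift between environments.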

Hugging Face Single Data Info Template

{
    "data_info": {
        "type": "huggingface",                      (REQUIRED)
        "dataset_uri": "path",                      (REQUIRED)
        "split_name": "train",
        "text_key": "text",
        "text_pair_key": "text_pair",
        "label_key": "label",
        "loading_params": null
    },
    ...
}

Arguments

type: string, required

Must be set to “huggingface”.
dataset_uri: string, required

The path or tag passed to ‘load_dataset’.
split_name: string, default = “train”

The key used to access the split from the downloaded ‘DatasetDict’.
text_key: string, default = “text”

The feature name for the NLP input text attribute.
text_pair_key: string, default = “text_pair”

The feature name for the second NLP input text attribute (for the NLI model task).
label_key: string or null, default = “label”

The feature name for the label class ID. If null, don’t load labels.
loading_params: dict or null, default = null

Additional kwargs to pass to ‘load_dataset’.