Preprocessing Your Datasets
Supported File Formats
Robust Intelligence currently supports the following formats:
Delta lake tables
JSON
JSONL
CSV
The column headers of CSV, Parquet, and Delta Lake input files are used as feature names and must be strings. For JSON and JSONL files, the keys in each record are used as feature names.
Text features and image features must have string values. The value of the image feature string is the path to an image file. Specify text features and image features in the data info configuration.
Providing labels and predictions increases the effectiveness of Robust Intelligence, but most tasks do not require either.
Specify predictions in a separate file, and register them using the registry functionality. Include labels with the datasets.
Data requirements by task
Binary Classification
Labels must be the integer values 0 or 1
Predictions must be float values between 0 and 1 that represent the positive class (label = 1) probability
Multi-Class Classification
Labels must be integers referring to class index
Predictions must be a list of floats summing to 1. This can be passed in as a single column of lists, or as a file with one column per prediction.
Natural Language Inference
Labels must be integers referring to class index
Predictions must be a list of floats summing to 1. This can be passed in as a single column of lists, or as a file with one column per prediction.
Ranking
Labels are required
Labels can be any real number
Predictions can be any real number
ranking_info
must be provided in the data configuration
Regression
Labels can be any real number
Predictions can be any real number
Named Entity Recognition
Labels must be lists of dictionaries, with each dictionary corresponding to an entity.
Predictions must be lists of dictionaries, with each dictionary corresponding to an entity.
Each entity dictionary should have a
type
key (specifying the type of the entity) as well as amentions
key that contains all the mentions referring to this entity. Each mention itself a dictionary with two keys: astart_offset
key and anend_offset
key, which are integers referring respectively to the start and end indices of the mention in question.
Object Detection
Labels must be lists of dictionaries, with each dictionary corresponding to an object. Each object dictionary must contain the four keys
x_min
,x_max
,y_min
, andy_max
with float values defining the coordinates of the bounding box, along with aclass_id
key with the integer value of the class label.Predictions must be lists of dictionaries, with each dictionary corresponding to an object. Each object dictionary must contain the four keys
x_min
,x_max
,y_min
, andy_max
with float values defining the coordinates of the bounding box, along with aprobabilities
key with a list of floats defining the predicted probability for each class.