Preprocessing Your Datasets
Supported File Formats
Robust Intelligence currently supports the following formats:
- Delta Lake tables
- Parquet
- JSON
- JSONL
- CSV
The column headers of CSV, Parquet, and Delta Lake input files are used as feature names and must be strings. For JSON and JSONL files, the keys in each record are used as feature names.
Text features and image features must have string values. The value of the image feature string is the path to an image file. Specify text features and image features in the data info configuration.
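As a rough illustration of how feature names fall out of the file contents, the sketch below reads the header row of a CSV document and the record keys of a JSONL document using only the Python standard library. The column names (review_text, image_path, label) and the image paths are hypothetical, and the helper functions are not part of Robust Intelligence.

```python
import csv
import io
import json


def csv_feature_names(csv_text):
    """Return the header row of a CSV document as the list of feature names."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)


def jsonl_feature_names(jsonl_text):
    """Return the keys seen across all JSONL records, in first-seen order."""
    names = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for key in json.loads(line):
            if key not in names:
                names.append(key)
    return names


# Hypothetical data: the column names and paths are made up for illustration.
# The text feature and the image feature both hold string values, and the
# image feature's string is a path to an image file.
csv_text = "review_text,image_path,label\ngreat product,images/0001.png,1\n"
jsonl_text = (
    '{"review_text": "great product", "image_path": "images/0001.png", "label": 1}\n'
    '{"review_text": "arrived broken", "image_path": "images/0002.png", "label": 0}\n'
)

print(csv_feature_names(csv_text))
print(jsonl_feature_names(jsonl_text))
```

Both calls yield the same three feature names, since the CSV header and the JSONL record keys describe the same columns.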
Providing labels and predictions increases the effectiveness of Robust Intelligence, but most tasks do not require either.
Specify predictions in a separate file, and register them using the registry functionality. Include labels with the datasets.
Data Requirements by Task
Binary Classification
- Labels must be the integer values 0 or 1 
- Predictions must be float values between 0 and 1 that represent the positive class (label = 1) probability 
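The binary classification rules above can be checked before upload with a small helper. This is a minimal sketch in plain Python; check_binary_rows is a hypothetical function, not part of Robust Intelligence, and it assumes labels and predictions have already been loaded into lists.

```python
def check_binary_rows(labels, predictions):
    """Return a list of problems found; an empty list means the data looks valid."""
    problems = []
    for i, label in enumerate(labels):
        # bool is a subclass of int in Python, so reject True/False explicitly.
        if isinstance(label, bool) or not isinstance(label, int) or label not in (0, 1):
            problems.append(f"row {i}: label {label!r} is not the integer 0 or 1")
    for i, prob in enumerate(predictions):
        # NaN fails the range comparison, so it is reported as well.
        if not isinstance(prob, float) or not 0.0 <= prob <= 1.0:
            problems.append(f"row {i}: prediction {prob!r} is not a float in [0, 1]")
    return problems


# Each prediction is the probability of the positive class (label = 1).
print(check_binary_rows([0, 1, 1], [0.12, 0.93, 0.5]))
print(check_binary_rows([0, 2], [0.4, 1.7]))
```

The first call reports no problems; the second reports one bad label and one out-of-range prediction.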
Multi-Class Classification
- Labels must be integers referring to class index 
- Predictions must be a list of floats summing to 1. This can be passed in as a single column of lists, or as a file with one column per prediction. 
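The two prediction layouts can be converted into each other. The sketch below assumes that the multi-column layout means one column per class probability, which the text does not state explicitly, and it uses hypothetical column names (prob_0, prob_1, prob_2). It collapses the columns into a single list per row and checks that each list sums to 1 within a small tolerance.

```python
import math


def to_list_column(rows, class_columns):
    """Collapse per-class probability columns into one list of floats per row."""
    return [[float(row[col]) for col in class_columns] for row in rows]


def is_valid_distribution(probs, tol=1e-6):
    """True if every value lies in [0, 1] and the values sum to 1 within tol."""
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    return in_range and math.isclose(sum(probs), 1.0, abs_tol=tol)


# Hypothetical wide-format predictions for a three-class problem.
wide_rows = [
    {"prob_0": 0.2, "prob_1": 0.3, "prob_2": 0.5},
    {"prob_0": 0.7, "prob_1": 0.2, "prob_2": 0.1},
]
class_columns = ["prob_0", "prob_1", "prob_2"]

list_column = to_list_column(wide_rows, class_columns)
print(list_column)
print([is_valid_distribution(p) for p in list_column])
```

The position of each value in the list is taken to correspond to the class index used in the labels. The tolerance guards against floating-point rounding when the probabilities are summed.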
Natural Language Inference
- Labels must be integers referring to class index 
- Predictions must be a list of floats summing to 1. This can be passed in as a single column of lists, or as a file with one column per prediction. 
Ranking
- Labels are required 
- Labels can be any real number 
- Predictions can be any real number 
- ranking_info must be provided in the data configuration
Regression
- Labels can be any real number 
- Predictions can be any real number 
Named Entity Recognition
- Labels must be lists of dictionaries, with each dictionary corresponding to an entity. 
- Predictions must be lists of dictionaries, with each dictionary corresponding to an entity. 
- Each entity dictionary should have a type key (specifying the type of the entity) as well as a mentions key that contains all the mentions referring to this entity. Each mention is itself a dictionary with two keys: a start_offset key and an end_offset key, which are integers referring respectively to the start and end indices of the mention in question.
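To make the nested structure concrete, here is a sketch of one label value for a short sentence, together with a hypothetical validator. The entity type "PERSON" is made up for illustration. The example treats end_offset as exclusive, which is an assumption: the text does not say whether the end index is inclusive or exclusive.

```python
def is_valid_entity_list(entities):
    """Check the list-of-entity-dictionaries shape described above."""
    if not isinstance(entities, list):
        return False
    for entity in entities:
        if not isinstance(entity, dict):
            return False
        if "type" not in entity or not isinstance(entity.get("mentions"), list):
            return False
        for mention in entity["mentions"]:
            if not isinstance(mention, dict):
                return False
            # Both offsets must be present and must be integers.
            if not isinstance(mention.get("start_offset"), int):
                return False
            if not isinstance(mention.get("end_offset"), int):
                return False
    return True


text = "Ada Lovelace met Charles Babbage. Lovelace wrote notes."

# One dictionary per entity; every mention of the same entity is grouped
# under that entity's "mentions" key. Offsets assume an exclusive end index.
label = [
    {
        "type": "PERSON",
        "mentions": [
            {"start_offset": 0, "end_offset": 12},
            {"start_offset": 34, "end_offset": 42},
        ],
    },
    {
        "type": "PERSON",
        "mentions": [{"start_offset": 17, "end_offset": 32}],
    },
]

print(is_valid_entity_list(label))
```

Predictions use the same structure, so the same check applies to both columns.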
Object Detection
- Labels must be lists of dictionaries, with each dictionary corresponding to an object. Each object dictionary must contain the four keys x_min, x_max, y_min, and y_max with float values defining the coordinates of the bounding box, along with a class_id key with the integer value of the class label.
- Predictions must be lists of dictionaries, with each dictionary corresponding to an object. Each object dictionary must contain the four keys x_min, x_max, y_min, and y_max with float values defining the coordinates of the bounding box, along with a probabilities key with a list of floats defining the predicted probability for each class.
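The label and prediction shapes differ only in their last key, which the following sketch illustrates with one example of each and a pair of hypothetical validators. The coordinate values and the three-class setup are made up, and the helper names are not part of Robust Intelligence.

```python
BOX_KEYS = ("x_min", "x_max", "y_min", "y_max")


def has_valid_box(obj):
    """True if obj is a dictionary with all four box keys holding floats."""
    return isinstance(obj, dict) and all(isinstance(obj.get(k), float) for k in BOX_KEYS)


def is_valid_label(objects):
    """Each labeled object needs a bounding box and an integer class_id."""
    return isinstance(objects, list) and all(
        has_valid_box(o)
        and isinstance(o.get("class_id"), int)
        and not isinstance(o.get("class_id"), bool)
        for o in objects
    )


def is_valid_prediction(objects):
    """Each predicted object needs a bounding box and a list of float probabilities."""
    return isinstance(objects, list) and all(
        has_valid_box(o)
        and isinstance(o.get("probabilities"), list)
        and all(isinstance(p, float) for p in o["probabilities"])
        for o in objects
    )


# Hypothetical values for one image in a three-class problem.
label = [
    {"x_min": 10.0, "x_max": 50.0, "y_min": 20.0, "y_max": 80.0, "class_id": 1},
]
prediction = [
    {"x_min": 12.5, "x_max": 48.0, "y_min": 18.0, "y_max": 79.5,
     "probabilities": [0.1, 0.85, 0.05]},
]

print(is_valid_label(label), is_valid_prediction(prediction))
```

Note that integer coordinates such as 10 fail the float check in this sketch; writing them as 10.0 keeps the values unambiguous.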
