# RIME Data Format Check

`rime-data-format-check` is a command-line utility that validates data for use with the RIME platform by performing a series of sanity checks, depending on the nature of the data (tabular vs. NLP).

Be sure to have the Python SDK installed beforehand:

```bash
pip install rime-sdk
```

After installation, you will be able to run `rime-data-format-check` commands in your shell. Try running it with `--help` to see which flags to specify for your data:

```bash
rime-data-format-check --help
usage: rime-data-format-check [-h] (-nlp | -tabular) --ref-path REF_PATH
                              --eval-path EVAL_PATH --task
                              {Multi-class Classification,Regression,Binary Classification,Named Entity Recognition,Text Classification}
                              [--preds-ref-path PREDS_REF_PATH]
                              [--preds-eval-path PREDS_EVAL_PATH]
                              [--label-col-name LABEL_COL_NAME]
                              [--pred-col-name PRED_COL_NAME]
                              [--timestamp-col-name TIMESTAMP_COL_NAME]
```

## Examples

### Tabular

**For a full list of the requirements for tabular input data, see the specification [here](../../configuration/tabular/task_data_format.md).**

These examples were run using the Kaggle Titanic [dataset](https://www.kaggle.com/c/titanic/data).

#### 1. Basic Usage (no labels or predictions)

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv
```

Output (PASSING):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test.csv'
Done!
WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.
WARNING: No label column is provided. Although you can still run RIME without labels, it will not be as powerful as if you run it WITH labels.
---
Your data should work with RIME!
```

#### 2. With Labels (and a sample failure)

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv \
  --label-col-name "Survived"
```

Output (ERROR):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test.csv'
---
Error: Label column (Survived) not found in data (data/titanic/test.csv).
If a label column exists in one dataset, it MUST exist in the other.
```

In this case, it looks like the label column is missing from our evaluation set. For the Titanic dataset, these values are provided in `gender_submission.csv`. Adding the label column to the evaluation set should resolve the issue:

```
import pandas as pd

df_test = pd.read_csv("titanic/test.csv")
df_labels = pd.read_csv("titanic/gender_submission.csv")

# Join the labels onto the evaluation set by passenger ID.
df_test_with_labels = pd.merge(df_test, df_labels, on=["PassengerId"])

# index=False keeps pandas from writing an extra index column to the CSV.
df_test_with_labels.to_csv("titanic/test_with_labels.csv", index=False)
```

Using the updated evaluation set, the data checks should pass:

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test_with_labels.csv \
  --label-col-name "Survived"
```

Output (PASSING):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test_with_labels.csv'
Done!
WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.
---
Your data should work with RIME!
```

The `--pred-col-name` flag operates identically to `--label-col-name`.
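As an illustration of `--pred-col-name`, the sketch below adds a prediction column named `pred` to both datasets before running the check. The model, the feature columns, the `pred` column name, and the output file names are all assumptions for this example, not part of the Titanic data or the RIME tooling:

```
import pandas as pd
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv("titanic/train.csv")
df_test = pd.read_csv("titanic/test_with_labels.csv")

# Hypothetical model: fit on a few numeric, non-null columns just to
# produce a prediction column for the format check.
features = ["Pclass", "SibSp", "Parch"]
model = LogisticRegression().fit(df_train[features], df_train["Survived"])

# Attach predicted probabilities of the positive class as a "pred" column.
df_train["pred"] = model.predict_proba(df_train[features])[:, 1]
df_test["pred"] = model.predict_proba(df_test[features])[:, 1]

df_train.to_csv("titanic/train_with_preds.csv", index=False)
df_test.to_csv("titanic/test_with_preds.csv", index=False)
```

With files like these in place, the check can be re-run against the new paths with `--pred-col-name "pred"` in addition to `--label-col-name "Survived"`.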
### NLP

**For a full list of the requirements for NLP input data, see the specification [here](../../configuration/nlp/task_data_format.md).**

These examples were run on a formatted version of the arXiv [dataset](https://www.kaggle.com/Cornell-University/arxiv), where input files contain data points with the following structure:

```
(in data/classification/arxiv/train.json)
[
    ...
    {
        "text": "On maxima and ladder processes for a dense class of Levy processes",
        "timestamp": "2007-05-23 00:00:00",
        "label": 4,
        "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
    },
    ...
]
```

#### 1. Basic Usage (predictions included with inputs)

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --eval-path data/classification/arxiv/val.json
```

Output (PASSING):

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path ../../../rime-data-format-check/data/classification/arxiv/train.json \
  --eval-path ../../../rime-data-format-check/data/classification/arxiv/val.json
Inspecting '../../../rime-data-format-check/data/classification/arxiv/train.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 33219.50it/s]
Inspecting '../../../rime-data-format-check/data/classification/arxiv/val.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:04<00:00, 33319.64it/s]
---
Your data should work with RIME!
```

#### 2. With Predictions Provided Separately (and a sample failure)

In this example, predictions are omitted from the inputs and provided in a separate file, `data/classification/arxiv/train_preds.json`.

```
(in data/classification/arxiv/train.json)
[
    ...
    {
        "text": "On maxima and ladder processes for a dense class of Levy processes",
        "timestamp": "2007-05-23 00:00:00",
        "label": 4
    },
    ...
]

(in data/classification/arxiv/train_preds.json)
[
    ...
    {
        "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
    },
    ...
]
```

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --preds-ref-path data/classification/arxiv/train_preds.json \
  --eval-path data/classification/arxiv/val.json \
  --preds-eval-path data/classification/arxiv/val_preds.json
```

Output (ERROR):

```
Inspecting 'data/classification/arxiv/train_preds.json':   0%|          | 35/50000 [00:00<00:02, 24377.39it/s]
---
Error: File 'data/classification/arxiv/train_preds.json', Index 35:
Key 'probabilities' error:
Or() did not validate 24
24 should be instance of 'float'
---
Inputs for task 'Text Classification' must adhere to the following structure:
{'probabilities': []}
```

In this case, it looks like one of our prediction entries has an invalid value of `24`, an `int` instead of the expected `float`. The element at the specified file (`data/classification/arxiv/train_preds.json`) and index (`35`) has the following values:

```
{
    "probabilities": [24, 0.01, 0.03, 0.03, 0.68, 0.01, 0.0, 0.0, 0.0, 0.0]
}
```

It appears that the first value, `24`, should actually be `0.24`.
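One way to apply the fix is a short script along these lines (a minimal sketch: it loads the whole file into memory, and the index `35` and corrected value `0.24` are taken from the error message above):

```
import json

path = "data/classification/arxiv/train_preds.json"

with open(path) as f:
    preds = json.load(f)

# The error message points at index 35; replace the stray integer 24
# with the intended probability 0.24.
preds[35]["probabilities"][0] = 0.24

with open(path, "w") as f:
    json.dump(preds, f, indent=2)
```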
Adjusting the value resolves the issue:

```
Inspecting 'data/classification/arxiv/train_preds.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 26954.06it/s]
Inspecting 'data/classification/arxiv/val_preds.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:05<00:00, 27215.18it/s]
Inspecting 'data/classification/arxiv/train.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 32684.76it/s]
Inspecting 'data/classification/arxiv/val.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:04<00:00, 32876.84it/s]
---
Your data should work with RIME!
```
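To avoid this class of error when exporting predictions in the first place, it can help to cast every probability to a plain Python `float` before writing the JSON. A minimal sketch, assuming the scores come from a placeholder NumPy array named `probs` (here filled with random values purely for illustration):

```
import json
import numpy as np

# Placeholder model output: one row of class probabilities per example.
probs = np.random.dirichlet(np.ones(10), size=50000)

# Casting each value to float ensures the JSON contains plain floats
# (e.g. 0.24) rather than ints or NumPy scalar types.
records = [{"probabilities": [float(p) for p in row]} for row in probs]

with open("data/classification/arxiv/train_preds.json", "w") as f:
    json.dump(records, f)
```

Any equivalent cast at export time works; the key point is that every entry in `probabilities` ends up as a JSON float.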