# RIME Data Format Check

`rime-data-format-check` is a command-line utility that validates data for use with the RIME platform by performing a series of sanity checks, depending on the nature of the data (tabular vs. NLP).

Be sure to have the Python SDK installed beforehand:

```bash
pip install rime-sdk
```

After installation, you will be able to run `rime-data-format-check` commands in your shell. Try running it with `--help` to see which flags to specify for your data:

```bash
rime-data-format-check --help
usage: rime-data-format-check [-h] (-nlp | -tabular) --ref-path REF_PATH
                              --eval-path EVAL_PATH --task
                              {Multi-class Classification,Regression,Binary Classification,Named Entity Recognition,Text Classification}
                              [--preds-ref-path PREDS_REF_PATH]
                              [--preds-eval-path PREDS_EVAL_PATH]
                              [--label-col-name LABEL_COL_NAME]
                              [--pred-col-name PRED_COL_NAME]
                              [--timestamp-col-name TIMESTAMP_COL_NAME]
```

## Examples

### Tabular

**For a full list of the requirements for tabular input data, see the specification [here](../../configuration/tabular/task_data_format.md).**

These examples were run using the Kaggle Titanic [dataset](https://www.kaggle.com/c/titanic/data).

#### 1. Basic Usage (no labels or predictions)

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv
```

Output (PASSING):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test.csv'
Done!
WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.
WARNING: No label column is provided. Although you can still run RIME without labels, it will not be as powerful as if you run it WITH labels.
---
Your data should work with RIME!
```

#### 2. With Labels (and a sample failure)

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test.csv \
  --label-col-name "Survived"
```

Output (ERROR):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test.csv'
---
Error: Label column (Survived) not found in data (data/titanic/test.csv).
If a label column exists in one dataset, it MUST exist in the other.
```

In this case, it looks like the label column is missing from our evaluation set. For the Titanic dataset, these values are provided in `gender_submission.csv`. Adding the label column to the evaluation set should resolve the issue:

```
import pandas as pd

df_test = pd.read_csv("titanic/test.csv")
df_labels = pd.read_csv("titanic/gender_submission.csv")

# Join the labels onto the evaluation set by passenger ID.
df_test_with_labels = pd.merge(df_test, df_labels, on=["PassengerId"])

# index=False keeps pandas from writing an extra index column to the CSV.
df_test_with_labels.to_csv("titanic/test_with_labels.csv", index=False)
```

Using the updated evaluation set, the data checks should pass:

```
rime-data-format-check -tabular \
  --task "Binary Classification" \
  --ref-path data/titanic/train.csv \
  --eval-path data/titanic/test_with_labels.csv \
  --label-col-name "Survived"
```

Output (PASSING):

```
Inspecting 'data/titanic/train.csv'
Done!
Inspecting 'data/titanic/test_with_labels.csv'
Done!
WARNING: No prediction column is provided. Although you can still run RIME without predictions, it will not be as powerful as if you run it WITH predictions.
---
Your data should work with RIME!
```

The `--pred-col-name` flag operates identically to `--label-col-name`.
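As an illustration of `--pred-col-name`, the sketch below adds a prediction column named `pred` to both datasets before running the check. The model, the feature columns, the `pred` column name, and the output file names are all assumptions for this example, not part of the Titanic data or the RIME tooling:

```
import pandas as pd
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv("titanic/train.csv")
df_test = pd.read_csv("titanic/test_with_labels.csv")

# Hypothetical model: fit on a few numeric, non-null columns just to
# produce a prediction column for the format check.
features = ["Pclass", "SibSp", "Parch"]
model = LogisticRegression().fit(df_train[features], df_train["Survived"])

# Attach predicted probabilities of the positive class as a "pred" column.
df_train["pred"] = model.predict_proba(df_train[features])[:, 1]
df_test["pred"] = model.predict_proba(df_test[features])[:, 1]

df_train.to_csv("titanic/train_with_preds.csv", index=False)
df_test.to_csv("titanic/test_with_preds.csv", index=False)
```

With files like these in place, the check can be re-run against the new paths with `--pred-col-name "pred"` in addition to `--label-col-name "Survived"`.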
### NLP

**For a full list of the requirements for NLP input data, see the specification [here](../../configuration/nlp/task_data_format.md).**

These examples were run on a formatted version of the arXiv [dataset](https://www.kaggle.com/Cornell-University/arxiv), where input files contain data points with the following structure:

```
(in data/classification/arxiv/train.json)
[
    ...
    {
        "text": "On maxima and ladder processes for a dense class of Levy processes",
        "timestamp": "2007-05-23 00:00:00",
        "label": 4,
        "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
    },
    ...
]
```

#### 1. Basic Usage (predictions included with inputs)

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --eval-path data/classification/arxiv/val.json
```

Output (PASSING):

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path ../../../rime-data-format-check/data/classification/arxiv/train.json \
  --eval-path ../../../rime-data-format-check/data/classification/arxiv/val.json
Inspecting '../../../rime-data-format-check/data/classification/arxiv/train.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 33219.50it/s]
Inspecting '../../../rime-data-format-check/data/classification/arxiv/val.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:04<00:00, 33319.64it/s]
---
Your data should work with RIME!
```

#### 2. With Predictions Provided Separately (and a sample failure)

In this example, predictions are omitted from the inputs and provided in a separate file, `data/classification/arxiv/train_preds.json`.

```
(in data/classification/arxiv/train.json)
[
    ...
    {
        "text": "On maxima and ladder processes for a dense class of Levy processes",
        "timestamp": "2007-05-23 00:00:00",
        "label": 4
    },
    ...
]

(in data/classification/arxiv/train_preds.json)
[
    ...
    {
        "probabilities": [0.11, 0.0, 0.03, 0.01, 0.82, 0.0, 0.03, 0.0, 0.0, 0.0]
    },
    ...
]
```

```
rime-data-format-check -nlp \
  --task "Text Classification" \
  --ref-path data/classification/arxiv/train.json \
  --preds-ref-path data/classification/arxiv/train_preds.json \
  --eval-path data/classification/arxiv/val.json \
  --preds-eval-path data/classification/arxiv/val_preds.json
```

Output (ERROR):

```
Inspecting 'data/classification/arxiv/train_preds.json':   0%|          | 35/50000 [00:00<00:02, 24377.39it/s]
---
Error: File 'data/classification/arxiv/train_preds.json', Index 35:
Key 'probabilities' error:
Or() did not validate 24
24 should be instance of 'float'
---
Inputs for task 'Text Classification' must adhere to the following structure:
{'probabilities': []}
```

In this case, it looks like one of our prediction entries has an invalid value of `24`, an `int` instead of the expected `float`. The element at the specified file (`data/classification/arxiv/train_preds.json`) and index (`35`) has the following values:

```
{
    "probabilities": [24, 0.01, 0.03, 0.03, 0.68, 0.01, 0.0, 0.0, 0.0, 0.0]
}
```

It appears that the first value, `24`, should actually be `0.24`.
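One way to apply the fix is a short script along these lines (a minimal sketch: it loads the whole file into memory, and the index `35` and corrected value `0.24` are taken from the error message above):

```
import json

path = "data/classification/arxiv/train_preds.json"

with open(path) as f:
    preds = json.load(f)

# The error message points at index 35; replace the stray integer 24
# with the intended probability 0.24.
preds[35]["probabilities"][0] = 0.24

with open(path, "w") as f:
    json.dump(preds, f, indent=2)
```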
Adjusting the value resolves the issue:

```
Inspecting 'data/classification/arxiv/train_preds.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 26954.06it/s]
Inspecting 'data/classification/arxiv/val_preds.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:05<00:00, 27215.18it/s]
Inspecting 'data/classification/arxiv/train.json': 100%|█████████████████████████████████████████████████████████████████████████████| 50000/50000 [00:01<00:00, 32684.76it/s]
Inspecting 'data/classification/arxiv/val.json': 100%|███████████████████████████████████████████████████████████████████████████| 147087/147087 [00:04<00:00, 32876.84it/s]
---
Your data should work with RIME!
```
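To avoid this class of error when exporting predictions in the first place, it can help to cast every probability to a plain Python `float` before writing the JSON. A minimal sketch, assuming the scores come from a placeholder NumPy array named `probs` (here filled with random values purely for illustration):

```
import json
import numpy as np

# Placeholder model output: one row of class probabilities per example.
probs = np.random.dirichlet(np.ones(10), size=50000)

# Casting each value to float ensures the JSON contains plain floats
# (e.g. 0.24) rather than ints or NumPy scalar types.
records = [{"probabilities": [float(p) for p in row]} for row in probs]

with open("data/classification/arxiv/train_preds.json", "w") as f:
    json.dump(records, f)
```

Any equivalent cast at export time works; the key point is that every entry in `probabilities` ends up as a JSON float.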