Input Data Format ======================== ## Automated Validation {{ rime_data_format_check_redirect }} --- ## Supported File Formats RIME NLP currently supports both [JSON](https://www.json.org/json-en.html) (`.json`) and [JSON lines](https://jsonlines.org/) (`.jsonl`) formats, optionally compressed using [gzip](https://www.gnu.org/software/gzip/) (`.json.gz` or `.jsonl.gz`). For JSON lines files, RIME expects each line to be a dictionary representing a single data point. For standard JSON files, RIME expects the content to be a list of dictionaries. The structure of each dictionary is task-specific. See below for a detailed description of the expected data format for supported tasks. ## Requirements By Task ### Text Classification For the Text Classification task, each data point is represented by a dictionary containing the following keys: ```python [ { "text": "Hello, world!", (REQUIRED) "label": 1, "probabilities": [0.02, 0.94, 0.04] }, ... ] ``` - **`text`**: string, ***required*** The input string for this data point. - `label`: int The ground truth class label. This should be an integer in `[0, num_classes)`, where `num_classes` is the length of the probability vector output by the model. The label for a class should correspond to the index of that class in the model output. - `probabilities`: List[float] The model prediction for this data point. This should be a normalized vector of class probabilities, with a probability for each possible class. NOTE: predictions also can be provided in a separate file. See the [NLP Prediction Configuration](prediction_info) reference for more on how to provide cached predictions. ### Named Entity Recognition For the Named Entity Recognition task, each data point is represented by a dictionary containing the following keys: ```python [ { "text": "Hello, world!", (REQUIRED) "entities": [ { "mentions": [ { "start_offset": 7, "end_offset": 11 } ], "type": "LOC" } ], "predicted_entities": [ { "mentions": [ { "start_offset": 7, "end_offset": 11 } ], "type": "ORG" } ] }, ... ] ``` - **`text`**: string, ***required*** The input string for this data point. - `entities`: List[dict] The ground truth annotations. This should be a list of dictionaries, with each dictionary corresponding to an entity. Each entity dictionary should have a 'type' key (specifying the type the entity is predicted to be) as well as a 'mentions' key which contains all the mentions predicted to refer to this entity. Each mention itself a dictionary with two keys: a 'start_offset' key and a 'end offset' key, which are both integers referring to the start and end of the mention in question. - `predicted_entities`: List[dict] The model predictions for this data point. This should be a list of dictionaries, with each dictionary corresponding to an entity. Each entity dictionary should have a 'type' key (specifying the type the entity is predicted to be) as well as a 'mentions' key which contains all the mentions predicted to refer to this entity. Each mention itself a dictionary with two keys: a 'start_offset' key and a 'end offset' key, which are both integers referring to the start and end of the mention in question. See the [NLP Prediction Configuration](prediction_info) reference for more on how to provide cached predictions.