Prediction Cache Data Format
==================

### Supported File Formats

RIME NLP supports the same file formats for the prediction cache as it does for the [input data](task_data_format), namely [JSON](https://www.json.org/json-en.html) (`.json`) and [JSON lines](https://jsonlines.org/) (`.jsonl`). Each prediction should be stored in its own dictionary: as an element of the top-level list in a JSON file, or on its own line in a JSONL file.

To use a prediction cache for a given test run, a prediction must currently be present for every data point in the corresponding input data. For example, if a dataset is of size `N`, line `i` in the prediction cache should contain the model output for input example `i` in the dataset, for every `0 <= i < N`.

The data format for each prediction is similar to that for the [input data](task_data_format). The only difference is that the "text" key and the ground truth label key for the NLP task are removed.

#### Text Classification

For the Text Classification task, each prediction is represented by a dictionary containing the following key-value pair:

```python
[
    {
        "probabilities": [0.02, 0.94, 0.04]  # REQUIRED
    },
    ...
]
```

- **`probabilities`**: List[float], ***required***

    The model prediction for this data point. This should be a normalized vector of class probabilities, with a probability for each possible class.

#### Named Entity Recognition

For the Named Entity Recognition task, each prediction is represented by a dictionary containing the following key-value pair:

```python
[
    {
        "predicted_entities": [  # REQUIRED
            {
                "mentions": [
                    {
                        "start_offset": 7,
                        "end_offset": 11
                    }
                ],
                "type": "ORG"
            }
        ]
    },
    ...
]
```

- **`predicted_entities`**: List[dict], ***required***

    The model predictions for this data point. This should be a list of dictionaries, with each dictionary corresponding to an entity.
    Each entity dictionary should have a `type` key (specifying the type the entity is predicted to be) as well as a `mentions` key, which contains all the mentions predicted to refer to this entity. Each mention is itself a dictionary with two keys, `start_offset` and `end_offset`, which are both integers referring to the start and end of the mention in question.

See the [NLP Prediction Configuration](prediction_info) reference for more on how to provide cached predictions.
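
As an illustration of the formats described above, the following is a minimal sketch that writes a Text Classification prediction cache as JSON lines and runs a few sanity checks before the file is used. It relies only on the Python standard library. The helper names (`write_jsonl`, `read_jsonl`, `check_classification_cache`, `check_ner_prediction`), the file name, and the normalization tolerance are assumptions made for this example and are not part of RIME.

```python
import json
import math
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write each prediction dictionary on its own line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_classification_cache(predictions, num_inputs, num_classes):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # One prediction is required for every data point in the input data.
    if len(predictions) != num_inputs:
        problems.append(f"expected {num_inputs} predictions, found {len(predictions)}")
    for i, pred in enumerate(predictions):
        probs = pred.get("probabilities")
        if not isinstance(probs, list) or len(probs) != num_classes:
            problems.append(f"prediction {i}: need a list of {num_classes} probabilities")
        # The tolerance below is an assumption chosen for this sketch.
        elif not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
            problems.append(f"prediction {i}: probabilities are not normalized")
    return problems


def check_ner_prediction(pred):
    """Check the key structure of one Named Entity Recognition prediction."""
    entities = pred.get("predicted_entities")
    if not isinstance(entities, list):
        return False
    for entity in entities:
        mentions = entity.get("mentions")
        if not isinstance(entity.get("type"), str) or not isinstance(mentions, list):
            return False
        for mention in mentions:
            if not all(isinstance(mention.get(k), int) for k in ("start_offset", "end_offset")):
                return False
    return True


# Two predictions for a hypothetical dataset of size N = 2 with three classes.
predictions = [
    {"probabilities": [0.02, 0.94, 0.04]},
    {"probabilities": [0.70, 0.20, 0.10]},
]
cache_path = Path(tempfile.mkdtemp()) / "predictions.jsonl"
write_jsonl(predictions, cache_path)

loaded = read_jsonl(cache_path)
problems = check_classification_cache(loaded, num_inputs=2, num_classes=3)

ner_ok = check_ner_prediction(
    {
        "predicted_entities": [
            {"mentions": [{"start_offset": 7, "end_offset": 11}], "type": "ORG"}
        ]
    }
)
```

Because prediction `i` must correspond to input example `i`, the predictions should be written in the same order as the input data. Checking the count and the normalization up front catches the most common mismatches before a test run.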