RIME Model Performance
Documentation for aspects of the rime.tabular package that relate to model performance improvements.
Utilities related to model performance improvement.
- rime.tabular.performance.get_biggest_errors(df: DataFrame, model: TabularBlackBoxModel, labels: Series, n: int = 100, decision_threshold: float = 0.5, preds: Optional[ndarray] = None) List[Series]
Return the biggest errors.
- Parameters:
df – dataframe to make predictions over.
model – model to use to make predictions.
labels – labels for data.
n – number of worst datapoints to return.
decision_threshold – decision threshold of the model. Only relevant for binary classification.
preds – predictions produced by the model. If specified, will use these predictions instead of calling the model. Defaults to None.
- Returns:
A list of sorted Pandas series containing the largest errors. There are currently three cases:
1) Binary classification: returns [x, y], where x is the false positives sorted by worst false positive rate and y is the false negatives sorted by worst false negative rate.
2) Multi-class classification: returns [a], where a is the data sorted by cross-entropy loss.
3) Regression: returns [b], where b is the data sorted by L1 loss.
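To make the binary and regression cases concrete, here is a minimal, library-independent sketch in plain Python. It does not call RIME and is not its implementation. The helper names `biggest_binary_errors` and `biggest_regression_errors` are hypothetical, and the binary ranking assumes that "worst" means the predicted probability lies furthest from the true label.

```python
from typing import List, Sequence, Tuple


def biggest_binary_errors(
    probs: Sequence[float],
    labels: Sequence[int],
    n: int = 100,
    decision_threshold: float = 0.5,
) -> Tuple[List[int], List[int]]:
    """Return (false positive indices, false negative indices), worst first.

    Assumption: a prediction is positive when prob >= decision_threshold.
    """
    pairs = list(zip(probs, labels))
    false_pos = [i for i, (p, y) in enumerate(pairs)
                 if p >= decision_threshold and y == 0]
    false_neg = [i for i, (p, y) in enumerate(pairs)
                 if p < decision_threshold and y == 1]
    # Most confidently wrong predictions come first in each list.
    false_pos.sort(key=lambda i: probs[i], reverse=True)
    false_neg.sort(key=lambda i: probs[i])
    return false_pos[:n], false_neg[:n]


def biggest_regression_errors(
    preds: Sequence[float], targets: Sequence[float], n: int = 100
) -> List[int]:
    """Return indices sorted by L1 loss |pred - target|, largest first."""
    order = sorted(
        range(len(preds)),
        key=lambda i: abs(preds[i] - targets[i]),
        reverse=True,
    )
    return order[:n]


probs = [0.9, 0.2, 0.7, 0.4, 0.1]
labels = [0, 1, 0, 1, 1]
fp, fn = biggest_binary_errors(probs, labels, n=2)
```

In this toy input, rows 0 and 2 are false positives and rows 1, 3 and 4 are false negatives; `n=2` keeps only the two most confident mistakes of each kind.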
- rime.tabular.performance.preprocess_df(df: DataFrame, columns_to_ignore: Optional[List[str]] = None, impute_nulls: bool = True) DataFrame
Preprocess dataframe.
- Functionality includes:
Mapping categorical values to numeric based on value_counts
Imputing nulls using a reasonable strategy (chosen by detecting whether the column is numeric or object)
- Parameters:
df – dataframe to be preprocessed.
columns_to_ignore – columns to leave untouched by preprocessing. If None, no columns are ignored. Defaults to None.
impute_nulls – whether to impute null values. Defaults to True.
- Returns:
A preprocessed dataframe.
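The two preprocessing steps can be illustrated with a small standard-library sketch. This is not RIME's code. The helpers `preprocess_column` and `preprocess_table` are hypothetical, a table is modeled as a dict of column lists with `None` as the null value, and the concrete choices (most frequent category gets code 0, numeric nulls take the median, categorical nulls take the most frequent category) are assumptions made only for illustration.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Optional


def preprocess_column(values: list, impute_nulls: bool = True) -> list:
    """Encode one column; None stands in for a null value."""
    present = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        # Assumed strategy: fill numeric nulls with the column median.
        fill = float(median(present)) if impute_nulls and present else None
        return [float(v) if v is not None else fill for v in values]
    # Categorical: the most frequent value gets code 0, the next 1, and so on.
    ranked = Counter(present).most_common()
    codes = {category: rank for rank, (category, _) in enumerate(ranked)}
    # Assumed strategy: fill categorical nulls with the most frequent code.
    fill = 0 if impute_nulls and present else None
    return [codes[v] if v is not None else fill for v in values]


def preprocess_table(
    table: Dict[str, list],
    columns_to_ignore: Optional[List[str]] = None,
    impute_nulls: bool = True,
) -> Dict[str, list]:
    """Preprocess every column except those listed in columns_to_ignore."""
    ignore = set(columns_to_ignore or [])
    return {
        name: list(col) if name in ignore else preprocess_column(col, impute_nulls)
        for name, col in table.items()
    }


table = {
    "id": [1, 2, 3, 4],
    "color": ["red", "blue", None, "red"],
    "age": [30, None, 40, 20],
}
out = preprocess_table(table, columns_to_ignore=["id"])
```

Here `id` passes through unchanged, `color` is replaced by frequency ranks, and the missing `age` is filled with the median of the observed ages.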
- rime.tabular.performance.single_model_active_learning(df: DataFrame, model: TabularBlackBoxModel, n: int, uncertainty_overweight: int = 1, decision_threshold: Optional[float] = 0.5, seed: Optional[int] = 0, preds: Optional[ndarray] = None) ndarray
Select points to label by looking at a single model.
Selects points weighted on how uncertain the model is about them.
- Parameters:
df – dataframe of points to consider sampling.
model – current iteration of model.
n – number of datapoints to sample.
uncertainty_overweight – factor to scale up probability of sampling points the model is uncertain about. Defaults to 1 (standard overweight).
decision_threshold – decision threshold of the model. Defaults to 0.5.
seed – seed for any randomness. If None, no seed is set. Defaults to 0.
preds – predictions produced by the model. If specified, will use these predictions instead of calling the model. Defaults to None.
- Returns:
Indices of points to label.
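As a rough illustration of uncertainty-weighted selection, the sketch below samples indices without replacement, favoring points whose predicted probability is close to the decision threshold. It is not RIME's algorithm. The function name `sample_uncertain_points`, the linear uncertainty measure, and the weight formula `1 + uncertainty_overweight * uncertainty` are all assumptions chosen for the example.

```python
import random
from typing import List, Optional, Sequence


def sample_uncertain_points(
    probs: Sequence[float],
    n: int,
    uncertainty_overweight: int = 1,
    decision_threshold: float = 0.5,
    seed: Optional[int] = 0,
) -> List[int]:
    """Pick n distinct indices, favoring probabilities near the threshold."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    span = max(decision_threshold, 1.0 - decision_threshold)
    keyed = []
    for i, p in enumerate(probs):
        # Uncertainty is 1 at the threshold and decreases linearly with
        # distance from it.
        uncertainty = 1.0 - abs(p - decision_threshold) / span
        weight = 1.0 + uncertainty_overweight * uncertainty
        # Efraimidis-Spirakis keys: keeping the n largest u ** (1 / weight)
        # yields a weighted sample without replacement.
        keyed.append((rng.random() ** (1.0 / weight), i))
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:n]]


probs = [0.05, 0.5, 0.95, 0.45, 0.99]
picked = sample_uncertain_points(probs, n=3, seed=0)
```

Every point keeps a nonzero chance of selection because the weight never drops below 1; raising `uncertainty_overweight` only tilts the odds toward the uncertain points.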
- rime.tabular.performance.two_model_active_learning(df: DataFrame, model1: TabularBlackBoxModel, model2: TabularBlackBoxModel, n: int, disagreement_overweight: int = 1, uncertainty_overweight: int = 1, decision_threshold: Optional[float] = 0.5, seed: Optional[int] = 0, preds1: Optional[ndarray] = None, preds2: Optional[ndarray] = None) ndarray
Select points to label based on looking at two models.
Selects points based on a combination of how uncertain the models are about them and how much the models disagree about them.
- Parameters:
df – dataframe of points to consider sampling.
model1 – an iteration of the model.
model2 – a separate iteration of the model.
n – number of datapoints to sample.
disagreement_overweight – factor to scale up probability of sampling points the models disagree about. Defaults to 1 (standard overweight).
uncertainty_overweight – factor to scale up probability of sampling points the models are uncertain about. Defaults to 1 (standard overweight).
decision_threshold – decision threshold of the models. Defaults to 0.5. Only used in the binary classification setting.
seed – seed for any randomness. If None, no seed is set. Defaults to 0.
preds1 – predictions produced by the first model. If specified, will use these predictions instead of calling the model. Defaults to None.
preds2 – predictions produced by the second model. If specified, will use these predictions instead of calling the model. Defaults to None.
- Returns:
Indices of points to label.
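One way to picture how the two overweight factors could combine is to turn each point's disagreement and uncertainty into a sampling probability. The sketch below is an assumption-laden illustration, not RIME's implementation. The function `two_model_sampling_probs`, the use of `|p1 - p2|` as disagreement, and the additive weight formula are hypothetical.

```python
from typing import List, Sequence


def two_model_sampling_probs(
    probs1: Sequence[float],
    probs2: Sequence[float],
    disagreement_overweight: int = 1,
    uncertainty_overweight: int = 1,
    decision_threshold: float = 0.5,
) -> List[float]:
    """Return one sampling probability per point (the values sum to 1)."""
    span = max(decision_threshold, 1.0 - decision_threshold)
    weights = []
    for p1, p2 in zip(probs1, probs2):
        # Disagreement: how far apart the two predicted probabilities are.
        disagreement = abs(p1 - p2)
        # Uncertainty: average closeness of both predictions to the threshold.
        distance = abs(p1 - decision_threshold) + abs(p2 - decision_threshold)
        uncertainty = 1.0 - distance / (2.0 * span)
        weights.append(
            1.0
            + disagreement_overweight * disagreement
            + uncertainty_overweight * uncertainty
        )
    total = sum(weights)
    return [w / total for w in weights]


# Point 0: models agree and are confident. Point 1: models disagree.
# Point 2: models agree but both sit on the threshold.
p_default = two_model_sampling_probs([0.9, 0.9, 0.5], [0.9, 0.1, 0.5])
p_disagree = two_model_sampling_probs(
    [0.9, 0.9, 0.5], [0.9, 0.1, 0.5], disagreement_overweight=3
)
```

With the default factors, the disagreeing point and the uncertain point are both more likely to be drawn than the confident, agreed-upon one; raising `disagreement_overweight` pushes the disagreeing point ahead of the uncertain one.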