Configuring Data Profiling
Robust Intelligence profiles datasets to inform how its tests run. Profiling involves inferring feature types, computing feature relationships, and restricting feature counts. Robust Intelligence uses safe default data profiling values and flags exceptions for each parameter, so setting the parameters below is optional.
Your dataset’s profile is cached during the first test run, and the cache is used on subsequent runs for faster performance. See the performance section for information about cached profiles.
Data profiling template
Specify your dataset profile parameters in the AI Stress Testing Configuration JSON object, which is part of the data_profiling configuration in the profiling_config dictionary.
{
    #...,
    "data_profiling": {
        "num_quantiles": "1001",
        "num_subsets": "10",
        "column_type_info": {
            "min_nunique_for_numeric": "10",
            "numeric_violation_threshold": 0.01,
            "categorical_violation_threshold": 0.05,
            "min_unique_prop": 0.99,
            "allow_float_unique": False,
            "numeric_range_inference_threshold": 1.0,
            "unseen_values_allowed_criteria": 0.25,
        },
        "feature_relationship_info": {
            "features_to_profile": ["foo", "bar", "baz"],
            "num_feats_to_profile": 30,
            "compute_feature_relationships": True,
            "compute_numeric_feature_relationships": False,
            "ignore_nan_for_feature_relationships": True,
        },
    }
}
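The template mixes JSON syntax with Python literals (True, False, a # comment), so it reads most naturally as a Python dictionary. The following is a minimal sketch, assuming the configuration is built in Python and serialized with the standard json module. The variable names are illustrative and are not part of the Robust Intelligence API; the values are copied from the template above.

```python
import json

# Illustrative dictionary; keys and values are taken from the template above.
profiling_config = {
    "data_profiling": {
        "num_quantiles": "1001",
        "num_subsets": "10",
        "column_type_info": {
            "min_nunique_for_numeric": "10",
            "numeric_violation_threshold": 0.01,
            "categorical_violation_threshold": 0.05,
            "min_unique_prop": 0.99,
            "allow_float_unique": False,
            "numeric_range_inference_threshold": 1.0,
            "unseen_values_allowed_criteria": 0.25,
        },
        "feature_relationship_info": {
            "features_to_profile": ["foo", "bar", "baz"],
            "num_feats_to_profile": 30,
            "compute_feature_relationships": True,
            "compute_numeric_feature_relationships": False,
            "ignore_nan_for_feature_relationships": True,
        },
    }
}

# json.dumps converts Python True/False/None to JSON true/false/null
# and rejects nothing here, so the result is strict JSON.
config_json = json.dumps(profiling_config, indent=2)
print(config_json)
```

Serializing this way avoids the trailing commas and Python-style booleans that a strict JSON parser would reject.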
Data profiling arguments
The properties of the data_profiling configuration in the profiling_config are as follows:
num_quantiles: string, default = "1001"
The number of quantiles to store for numerical columns.
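To make num_quantiles concrete: with the default of 1001, one natural reading is that the profile stores values at the evenly spaced probabilities 0, 0.001, ..., 1. The sketch below assumes that reading and uses linear interpolation between sorted values. How Robust Intelligence computes quantiles internally is not stated here, and the function name evenly_spaced_quantiles is hypothetical.

```python
def evenly_spaced_quantiles(values, num_quantiles):
    """Return num_quantiles values at probabilities 0, 1/(n-1), ..., 1.

    Assumption: linear interpolation between neighbouring sorted values.
    """
    if num_quantiles < 2:
        raise ValueError("num_quantiles must be at least 2")
    data = sorted(values)
    if not data:
        raise ValueError("values must be non-empty")

    last = len(data) - 1
    result = []
    for i in range(num_quantiles):
        pos = last * i / (num_quantiles - 1)  # fractional index into sorted data
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        result.append(data[lo] + (data[hi] - data[lo]) * frac)
    return result


# The configuration stores the count as a string, so convert it before use.
quantiles = evenly_spaced_quantiles(range(101), int("1001"))
```

The first and last stored values are the column minimum and maximum, which is why an odd count such as 1001 yields round probability steps.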
num_subsets: string (the template above uses "10")
column_type_info:
min_nunique_for_numeric: string, default = "10"
Minimum number of unique values a column must have to be considered numeric; otherwise the column is considered categorical.
numeric_violation_threshold: float, default = 0.01
Maximum fraction of violations when assigning numeric columns (not including missing values).
categorical_violation_threshold: float, default = 0.05
Maximum fraction of violations when assigning categorical subtypes (not including missing values).
min_unique_prop: float, default = 0.99
If at least min_unique_prop of a column's values are unique, the column is classified as one that must have unique values.
allow_float_unique: bool, default = False
Allow float columns to be inferred as unique.
numeric_range_inference_threshold: (the template above uses 1.0)
unseen_values_allowed_criteria: (the template above uses 0.25)
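The column_type_info thresholds above can be illustrated with a small sketch. It is not the platform's implementation: it assumes that a "violation" is a non-missing value that does not parse as a number, that the unique-count check runs before the violation check, and that missing values are None or NaN. All function names are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def parses_as_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def infer_column_type(values, min_nunique_for_numeric=10,
                      numeric_violation_threshold=0.01):
    """Return "numeric", "categorical", or "unknown" for a column."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return "unknown"
    # Too few distinct values: treat as categorical.
    if len(set(present)) < min_nunique_for_numeric:
        return "categorical"
    # Violations are counted over non-missing values only.
    violations = sum(1 for v in present if not parses_as_number(v))
    if violations / len(present) <= numeric_violation_threshold:
        return "numeric"
    return "categorical"


def must_be_unique(values, min_unique_prop=0.99, allow_float_unique=False):
    """Return True if the column looks like it must hold unique values."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return False
    if not allow_float_unique and any(isinstance(v, float) for v in present):
        return False
    return len(set(present)) / len(present) >= min_unique_prop
```

For example, a column of 200 distinct integers with a single stray "n/a" entry has a violation fraction of 1/201, which is below the default 0.01, so this sketch still labels it numeric.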
feature_relationship_info:
features_to_profile: List[str] or null, default = null
List of features to be analyzed by the Robust Intelligence platform. If specified, only the provided features (with a maximum of 750 features) are analyzed, and num_feats_to_profile is ignored.
num_feats_to_profile: string or null, default = "30"
Number of features to profile for Smart Feature Sampling.
compute_feature_relationships: bool, default = True
Whether to compute feature relationships.
compute_numeric_feature_relationships: bool, default = False
Whether to compute feature relationships for discretized numeric columns.
ignore_nan_for_feature_relationships: bool, default = True
Whether to ignore NaN values when computing feature relationships.
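The precedence between features_to_profile and num_feats_to_profile described above can be sketched as follows. Several details are assumptions made for illustration: exceeding the 750-feature limit raises an error, a null num_feats_to_profile means "profile everything", and Smart Feature Sampling (which is not described here) is replaced by simply taking the first N features. The function name select_features_to_profile is hypothetical.

```python
MAX_EXPLICIT_FEATURES = 750  # limit stated for features_to_profile


def select_features_to_profile(all_features, features_to_profile=None,
                               num_feats_to_profile="30"):
    """Pick which features get profiled, honouring the documented precedence."""
    if features_to_profile is not None:
        # An explicit list wins; num_feats_to_profile is ignored.
        if len(features_to_profile) > MAX_EXPLICIT_FEATURES:
            # Assumption: the real platform may handle this differently.
            raise ValueError(
                f"at most {MAX_EXPLICIT_FEATURES} features can be listed"
            )
        return list(features_to_profile)

    if num_feats_to_profile is None:
        # Assumption: null means no cap on the number of features.
        return list(all_features)

    # Placeholder for Smart Feature Sampling: take the first N features.
    return list(all_features)[: int(num_feats_to_profile)]


columns = [f"f{i}" for i in range(100)]
explicit = select_features_to_profile(columns, ["foo", "bar", "baz"], "5")
sampled = select_features_to_profile(columns)
```

Here explicit contains exactly the three named features even though num_feats_to_profile is "5", and sampled contains 30 features, matching the documented default.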
class_name: List[str] or null, default = null
Optional list of label class names (in label order). For classification tasks only.