NannyML 0.3.1 Release - Soft Launch
Hi Everyone,
Niels from the NannyML engineering team here, just to let you know that version 0.3.1 of the NannyML package is now available! We are also now public on GitHub → https://github.com/NannyML/nannyml 🥳 (a star would go a long way for our visibility)
Installing / upgrading
You can get the latest version using pip:
pip install nannyml==0.3.1
What's new?
1. Performance calculation
We've added a new PerformanceCalculator that lets you quickly calculate and plot well-known performance metrics such as roc_auc, f1 or accuracy on your chunked model inputs and outputs. Since we're dealing with realized performance, you will need to have your target data available. We've added support for specifying both predicted labels and predicted probabilities in the metadata, because some metrics need labels and others need probabilities. A quick example of what this looks like in code:
import pandas as pd
import nannyml as nml
reference, analysis, analysis_target = nml.load_synthetic_sample()
# Threshold probabilities to get predicted labels
reference['y_pred'] = reference['y_pred_proba'].map(lambda p: p >= 0.8)
analysis['y_pred'] = analysis['y_pred_proba'].map(lambda p: p >= 0.8)
# Extract model metadata and provide additional details
metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor')
metadata.target_column_name = 'work_home_actual'
# Create the calculator and fit it on reference data
perf_calc = nml.PerformanceCalculator(model_metadata=metadata, metrics=['roc_auc', 'accuracy'], chunk_size=5000)
perf_calc.fit(reference_data=reference)
# Enrich analysis data with the available target values
analysis_with_targets = analysis.merge(analysis_target, on='identifier')
data = pd.concat([reference, analysis_with_targets], ignore_index=True)
# Perform the calculations and plot!
perf_calc_results = perf_calc.calculate(data)
for metric in perf_calc.metrics:
    perf_calc_results.plot(kind='performance', metric=metric).show()
Check out our guide on performance calculation for more details!
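For intuition, chunked performance calculation boils down to slicing your data into fixed-size chunks and computing each metric per chunk. Here is a minimal standalone sketch of that idea, using plain accuracy as the metric (this is not NannyML's implementation, just the concept):

```python
# Conceptual sketch of chunked performance calculation: split predictions
# into fixed-size chunks and compute a metric (here: accuracy) per chunk.

def chunked_accuracy(y_true, y_pred, chunk_size):
    """Return a list with the accuracy of each consecutive chunk."""
    scores = []
    for start in range(0, len(y_true), chunk_size):
        t = y_true[start:start + chunk_size]
        p = y_pred[start:start + chunk_size]
        correct = sum(1 for a, b in zip(t, p) if a == b)
        scores.append(correct / len(t))
    return scores

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(chunked_accuracy(y_true, y_pred, chunk_size=4))  # [0.75, 0.75]
```

NannyML does the same thing per chunk for each metric you list, then plots the resulting series so you can spot performance changes over time.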
2. Target distribution monitoring
To help you keep a closer eye on the distribution of your model's target, we've added the TargetDistributionCalculator to the nannyml.drift.target package. When provided with target-enriched data it will calculate both the mean and the chi-square statistic for each chunk of target values. The targets are considered to be drifting when the p-value returned by the chi-square test is below 0.05. Here is how to do it in code:
import pandas as pd
import nannyml as nml
reference, analysis, analysis_target = nml.load_synthetic_sample()
# Extract model metadata and provide additional details
metadata = nml.extract_metadata(data = reference, model_name='wfh_predictor')
metadata.target_column_name = 'work_home_actual'
# Create the calculator and fit it on reference data
tgt_calc = nml.TargetDistributionCalculator(model_metadata=metadata, chunk_size=5000)
tgt_calc.fit(reference)
# Enrich analysis data with the available target values
analysis_with_targets = analysis.merge(analysis_target, on='identifier')
data = pd.concat([reference, analysis_with_targets], ignore_index=True)
tgt_results = tgt_calc.calculate(data)
tgt_results.plot(distribution='metric').show() # plot mean
tgt_results.plot(distribution='statistical').show() # plot chi squared statistic
Check out our guide to drift detection for more details!
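For intuition, the drift criterion above can be sketched as a chi-square test on a 2x2 contingency table of target counts (reference vs. chunk). In practice you'd get the p-value from a library such as scipy's chi2_contingency; the hand-rolled version below only computes the statistic and compares it against the critical value for df=1 at alpha=0.05. This is a conceptual sketch, not NannyML's implementation:

```python
# Hand-rolled chi-square statistic for a 2x2 contingency table comparing
# the class counts of binary targets in the reference data vs. one chunk.

def chi_square_stat(ref_counts, chunk_counts):
    """Chi-square statistic for a 2x2 table of (negative, positive) counts."""
    table = [ref_counts, chunk_counts]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Counts of (negative, positive) targets: reference vs. one analysis chunk.
stat = chi_square_stat([700, 300], [550, 450])
# For df=1 and alpha=0.05 the chi-square critical value is ~3.841,
# so stat > 3.841 corresponds to p_value < 0.05, i.e. target drift.
print(stat, stat > 3.841)  # 48.0 True
```

A shift from 30% to 45% positives in a chunk of 1000 rows clears the threshold comfortably, which is exactly the kind of change the TargetDistributionCalculator flags per chunk.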
What's changed?
We've made a lot of changes in this release, mostly small things under the hood. Some deserve a special mention however:
- Plotting now defaults to step plots, since we felt these better represent the interval nature of chunked data.
- We've restructured the nannyml.drift package and its subpackages to make it clearer which kinds of drift calculation are to be used where. Check your existing code, since this might break some imports, and let us know if you think this is an actual improvement!
- Metadata will no longer be considered complete if some features have a FeatureType.UNKNOWN. If your previous code is suddenly bugging out, check whether your extracted metadata was able to identify all columns (e.g. by using the handy DataFrame representation metadata.to_df()).
- We've had to restrict the scipy dependency version to >=1.7.3, <1.8.0 for now. We'll test bumping versions for dependent packages and try to relax that constraint again.
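If you pin dependencies yourself, the scipy constraint from the note above would look like this in a requirements.txt fragment (version bounds taken directly from this release note):

```text
nannyml==0.3.1
scipy>=1.7.3,<1.8.0
```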
What's up next?
We're starting work on an exciting couple of features, too many to describe here. We'd also like your input on a couple of things, so expect some news next week!
Sharing is caring!
As always, the best way to learn is by doing! Use our included datasets or your own for some quick experimentation. If you run into problems, check our docs at https://docs.nannyml.com or reach out to us here for assistance. We're looking forward to your experiences and insights! Have a great week!
Best Regards,
Niels