
Machine learning models are increasingly used in high-stakes domains such as healthcare, finance, and public decision-making, where predictions are often interpreted not only as labels, but as probabilities. In these settings, it is not enough for a model to be accurate: it must also be well-calibrated, meaning that its predicted probabilities should reflect real-world frequencies.

However, calibration becomes much more challenging in multiclass classification, where a model must assign reliable probabilities across many possible classes at once. Existing approaches often focus only on the most confident prediction, overlooking whether the full probability distribution is trustworthy. This can be risky: in sensitive applications, lower-probability classes may still carry important information, such as rare disease states or early signs of critical transitions.
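To make the gap between top-label and full-distribution calibration concrete, here is a minimal sketch in plain Python. The function names (`binned_gap`, `top_label_ece`, `classwise_ece`), the equal-width binning, and the toy data are illustrative assumptions, not metrics taken from the papers discussed here. It compares a binned calibration error computed only on the most confident class with one averaged over every class probability.

```python
def binned_gap(scores, outcomes, n_bins=10):
    """Weighted average, over equal-width bins, of |mean score - mean outcome|."""
    bins = [[] for _ in range(n_bins)]
    for s, o in zip(scores, outcomes):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[b].append((s, o))
    total = len(scores)
    gap = 0.0
    for members in bins:
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            mean_outcome = sum(o for _, o in members) / len(members)
            gap += len(members) / total * abs(mean_score - mean_outcome)
    return gap


def top_label_ece(probs, labels, n_bins=10):
    """Calibration error of the most confident class only."""
    conf = [max(p) for p in probs]
    correct = [float(p.index(max(p)) == y) for p, y in zip(probs, labels)]
    return binned_gap(conf, correct, n_bins)


def classwise_ece(probs, labels, n_bins=10):
    """Calibration error averaged over every class probability."""
    n_classes = len(probs[0])
    gaps = [
        binned_gap([p[k] for p in probs], [float(y == k) for y in labels], n_bins)
        for k in range(n_classes)
    ]
    return sum(gaps) / n_classes


# Toy data (assumed): the same prediction for ten samples,
# six of which belong to class 0 and four to class 2.
probs = [[0.6, 0.3, 0.1]] * 10
labels = [0] * 6 + [2] * 4

print(round(top_label_ece(probs, labels), 3))  # 0.0: confidence 0.6 matches accuracy 0.6
print(round(classwise_ece(probs, labels), 3))  # 0.2: classes 1 and 2 are each off by 0.3
```

In this toy case the top prediction looks perfectly calibrated, yet the probabilities assigned to the two less likely classes are both wrong, which is exactly the kind of error a top-label metric cannot see.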

From global to local multiclass calibration

Another pressing concern for multiclass calibration methods is proximity bias: predictions for samples in sparse or underrepresented regions may be systematically miscalibrated, even when global calibration metrics look acceptable.

Local calibration provides a more fine-grained and trustworthy way to assess model reliability. In particular, Luo et al. [1] introduce a new perspective on calibration: instead of evaluating reliability only globally, they ask whether a model is calibrated locally, around each input in the feature space. The key idea is intuitive: similar instances should have similar class-frequency patterns, and a model’s predicted probability vector should align with the observed class distribution in its local neighbourhood.
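One way to picture the "observed class distribution in a local neighbourhood" is a kernel-weighted class frequency around each point. The sketch below is a generic Nadaraya-Watson style estimator with a Gaussian kernel. The function name `local_class_frequencies`, the kernel, the `bandwidth` parameter, and the toy data are assumptions made for illustration, not the estimator specified in [1] or [2].

```python
import math


def local_class_frequencies(features, labels, n_classes, bandwidth=1.0):
    """Kernel-weighted class frequencies around each sample (Gaussian kernel)."""
    estimates = []
    for x in features:
        weighted_counts = [0.0] * n_classes
        for z, y in zip(features, labels):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
            weighted_counts[y] += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        total = sum(weighted_counts)  # always > 0: each point has weight 1 with itself
        estimates.append([c / total for c in weighted_counts])
    return estimates


# Toy data (assumed): two far-apart clusters with different label mixes.
features = [[0.0], [0.0], [0.0], [0.0], [100.0], [100.0]]
labels = [0, 0, 0, 1, 1, 1]

local = local_class_frequencies(features, labels, n_classes=2)
print(local[0])   # [0.75, 0.25]
print(local[-1])  # [0.0, 1.0]
```

Globally the two classes are balanced (three samples each), so a constant prediction of [0.5, 0.5] would match the global frequencies while disagreeing with the local frequencies in both clusters.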

In Barbera et al. [2], recently presented at AISTATS 2026 in Tangier, the authors extend local calibration to the multiclass setting and connect it theoretically to strong calibration, the strictest notion of global calibration. They also analyze the limitations of existing calibration metrics when applied to local multiclass settings. Finally, they propose LoCal Nets, a practical neural-network-based method that improves local calibration by aligning predictions with local estimates of class frequencies.

LoCal Nets: learning better calibrated representations

Unlike traditional post-hoc calibration methods, which simply rescale the outputs of an already trained model, LoCal Nets go one step further. They learn new feature representations and new calibrated logits, reshaping the geometry of the representation space so that neighbourhoods better reflect true class frequencies.

The method uses the Jensen-Shannon distance to align predicted probability vectors with local kernel-based estimates of class distributions, effectively distilling the kernel-based estimator into the network. This allows the model to learn outputs that are not only globally plausible, but locally consistent with the data around each instance. Importantly, the method uses these local estimates during training, but not at inference time, preserving the efficiency of standard feed-forward neural networks.
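For readers unfamiliar with it, the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence, a symmetrised and bounded variant of the KL divergence. The sketch below computes it in plain Python with base-2 logarithms, so the value lies between 0 and 1. The helper `mean_js_distance`, which averages the distance over a batch, is a hypothetical stand-in for a training objective: the actual LoCal Nets loss, architecture, and kernel choices are not described here and are not reproduced.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing (the usual 0 * log 0 = 0 convention).
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture of the two distributions
    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative round-off


def mean_js_distance(predictions, local_estimates):
    """Hypothetical batch objective: average distance to the local estimates."""
    pairs = list(zip(predictions, local_estimates))
    return sum(js_distance(p, q) for p, q in pairs) / len(pairs)


print(js_distance([0.75, 0.25], [0.75, 0.25]))  # 0.0: identical distributions
print(js_distance([1.0, 0.0], [0.0, 1.0]))      # 1.0: disjoint support
```

Because the distance is symmetric and bounded, a prediction is penalised the same way whether it overshoots or undershoots the local estimate, and the penalty stays finite even when one of the two vectors assigns zero probability to a class.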

Empirical results show that LoCal Nets consistently improve local calibration across benchmark and real-world datasets. Moreover, the method achieves strong reductions in local calibration error while remaining competitive on global calibration metrics and, in several cases, improving predictive performance.

Towards more trustworthy systems

This research highlights an important lesson for trustworthy AI: a model can appear reliable on average while still behaving poorly in specific regions of the data space. Local calibration helps uncover and correct these hidden failures, especially for underrepresented or sparse groups of instances. This is particularly relevant in domains such as medicine, law, finance, and public services, where unreliable probabilities can lead to unfair or unsafe decisions.

By introducing a formal definition, theoretical guarantees, and a practical neural-network calibration method, this work provides a strong foundation for building AI systems whose uncertainty estimates are more meaningful, locally reliable, and better aligned with real-world decision-making needs.

Written by: Andrea Pugnana, UNITN

[1] – Luo, R., Bhatnagar, A., Bai, Y., Zhao, S., Wang, H., Xiong, C., Savarese, S., Ermon, S., Schmerling, E., and Pavone, M. (2022). Local calibration: metrics and recalibration. In UAI, volume 180 of Proceedings of Machine Learning Research, pages 1286–1295. PMLR.

[2] – Barbera, C., Perini, L., De Toni, G., Passerini, A., and Pugnana, A. (2026). Multiclass Local Calibration with the Jensen Shannon Distance. In AISTATS, forthcoming.