
Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they …
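
As a concrete illustration of the position, here is a minimal sketch (not taken from the paper) comparing the CLT-based Wald interval with an exact Clopper-Pearson interval on a hypothetical 50-question eval; the counts n and k are made up for illustration:

```python
# Sketch: CLT-based (Wald) vs. exact Clopper-Pearson intervals on a small,
# hypothetical eval of 50 questions with 41 correct answers.
import numpy as np
from scipy import stats

n, k = 50, 41
p_hat = k / n

# Wald / CLT interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(0.975)
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"Wald (CLT):      [{p_hat - half:.3f}, {p_hat + half:.3f}]")

# Clopper-Pearson exact interval via beta quantiles
lo = stats.beta.ppf(0.025, k, n - k + 1)
hi = stats.beta.ppf(0.975, k + 1, n - k)
print(f"Clopper-Pearson: [{lo:.3f}, {hi:.3f}]")
```

At this sample size the two intervals already disagree noticeably, and the exact interval is the safer default.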

Semantic-Level Confidence Calibration of Language Models via Temperature Scaling

Calibration of language models is typically studied at the token level, with scalar temperature scaling serving as the primary approach for recalibrating models. Recent multi-sampling techniques allow us to elicit semantic uncertainty measures from …
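
For reference, a minimal sketch of the standard scalar temperature-scaling recipe that this abstract takes as its starting point; the synthetic data and the use of scipy.optimize.minimize_scalar are illustrative assumptions, not the paper's setup:

```python
# Sketch of standard scalar temperature scaling (token/class level), fit by
# minimizing held-out negative log-likelihood. All data here is synthetic.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    """Average negative log-likelihood under softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
logits = 3.0 * rng.normal(size=(1000, 10))

# Draw labels from softmax(logits / T_true); the fit should recover T_true.
T_true = 2.5
z = logits / T_true
p = np.exp(z - z.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
labels = np.array([rng.choice(10, p=row) for row in p])

res = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                      method="bounded")
print(f"fitted temperature T = {res.x:.2f}")   # close to T_true = 2.5
```

The semantic-level analogue would apply the same single-parameter recalibration to the multi-sample semantic uncertainty measures the abstract mentions, rather than to raw class logits.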

Is merging worth it? Securely evaluating the information gain for causal dataset acquisition

Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to prospectively gauge which datasets are most beneficial to merge with, without revealing …

Modern Bayesian Experimental Design

Bayesian experimental design (BED) provides a powerful and general framework for optimizing the design of experiments. However, its deployment often poses substantial computational challenges that can undermine its practical use. In this review, we …
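
For orientation, the central quantity in BED is the expected information gain of a design, commonly written as follows (standard notation, not quoted from the review):

```latex
\mathrm{EIG}(\xi)
  = \mathbb{E}_{p(\theta)\,p(y \mid \theta, \xi)}
    \left[ \log \frac{p(y \mid \theta, \xi)}{p(y \mid \xi)} \right],
\qquad
p(y \mid \xi) = \int p(y \mid \theta, \xi)\, p(\theta)\, \mathrm{d}\theta .
```

The computational challenges mentioned above stem largely from the intractable marginal p(y | ξ) inside the logarithm.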

CO-BED: Information-Theoretic Contextual Optimization via Bayesian Experimental Design

We formalize the problem of contextual optimization through the lens of Bayesian experimental design and propose CO-BED -- a general, model-agnostic framework for designing contextual experiments using information-theoretic principles. After …
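
To make such information-theoretic objectives concrete, here is a minimal nested Monte Carlo sketch of expected information gain for a toy linear-Gaussian model; the model, the function name eig_nmc, and all constants are illustrative assumptions rather than CO-BED's actual implementation:

```python
# Sketch: nested Monte Carlo estimate of expected information gain for a
# toy model y ~ N(theta * d, sigma^2) with prior theta ~ N(0, 1).
import numpy as np

def eig_nmc(d, n_outer=2000, n_inner=2000, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_outer)                   # outer prior samples
    y = theta * d + sigma * rng.normal(size=n_outer)   # simulated outcomes

    def log_lik(y, th):
        return (-0.5 * ((y - th * d) / sigma) ** 2
                - np.log(sigma * np.sqrt(2.0 * np.pi)))

    log_num = log_lik(y, theta)                        # log p(y | theta, d)
    theta_in = rng.normal(size=(n_inner, 1))           # fresh samples for marginal
    log_den = (np.logaddexp.reduce(log_lik(y[None, :], theta_in), axis=0)
               - np.log(n_inner))                      # log p(y | d), inner average
    return float(np.mean(log_num - log_den))           # EIG(d) estimate

for d in (0.1, 0.5, 1.0, 2.0):
    print(f"d = {d:3.1f}   EIG ~ {eig_nmc(d):.3f}")
```

Estimators of this form are biased for any finite number of inner samples, which is one reason such frameworks often optimize variational bounds instead.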

Differentiable Multi-Target Causal Bayesian Experimental Design

We introduce a gradient-based approach for the problem of Bayesian optimal experimental design to learn causal models in a batch setting -- a critical component for causal discovery from finite data where interventions can be costly or risky. …

Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference

We propose a method to improve the efficiency and accuracy of amortized Bayesian inference by leveraging universal symmetries in the joint probabilistic model of parameters and data. In a nutshell, we invert Bayes' theorem and estimate the marginal …
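
The symmetry in question can be made explicit by inverting Bayes' theorem; in standard notation (my own rendering, not quoted from the paper):

```latex
p(y) \;=\; \frac{p(\theta)\, p(y \mid \theta)}{p(\theta \mid y)}
\qquad \text{for every value of } \theta,
```

so the right-hand side evaluated with an amortized posterior approximation should be constant across θ for a fixed y, and any variation provides a training signal that requires no additional simulated data.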

Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design

We develop a semi-amortized, policy-based approach to Bayesian experimental design (BED) called Step-wise Deep Adaptive Design (Step-DAD). Like existing fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before …

Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation

Our method enables data-efficient evaluation of the regret of past treatment assignments. Unlike approaches such as A/B testing, our method avoids assigning treatments that are known to be highly sub-optimal, whilst engaging in some exploration to gather pertinent information. We achieve this by introducing an information-based design objective, which we optimise end-to-end.
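
In standard contextual-bandit notation, the regret being evaluated for a past assignment a in context x can be written as (my notation, not necessarily the paper's):

```latex
r(a \mid x) \;=\; \max_{a'} \, \mathbb{E}\!\left[ y \mid x, a' \right]
              \;-\; \mathbb{E}\!\left[ y \mid x, a \right],
```

i.e. the expected shortfall of the assigned treatment relative to the best available one in that context.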

Implicit Deep Adaptive Design: Policy-Based Experimental Design without Likelihoods

iDAD makes it practical to run Bayesian optimal experiments in real time with implicit (likelihood-free) models. Previous methods either relied on an explicit likelihood model of the outcomes or were too computationally costly to run in real time.
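
For context, likelihood-free mutual information estimation in this setting typically relies on contrastive lower bounds that need only joint samples from the model; a standard InfoNCE-style bound reads (standard result, my notation, not quoted from the paper):

```latex
I(\theta; y) \;\ge\;
\mathbb{E}\!\left[
  \frac{1}{K} \sum_{k=1}^{K}
  \log \frac{\exp f(\theta_k, y_k)}
            {\frac{1}{K} \sum_{j=1}^{K} \exp f(\theta_j, y_k)}
\right],
```

where f is a learned critic and the expectation is over K joint samples (θ_k, y_k). Bounds of this kind can be optimized from simulations alone, which is what makes real-time, likelihood-free design feasible.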