There is huge value in developing new benchmarks, and I think the one proposed in the paper 'GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models' by Apple is quite neat and useful! The accompanying analysis, in my opinion, could be substantially improved with some basic statistics. Without them we risk over-interpreting results and drawing misleading conclusions. I never thought I would be the one advocating for the use of hypothesis tests and p-values, but here we are... When it comes to language model evals, it is time to make statistics great again!
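To make the point concrete, here is a minimal sketch (not from the post; counts are made up for illustration) of a two-proportion z-test for whether two models' accuracies on the same benchmark differ by more than chance:

```python
import math

def two_proportion_ztest(k1: int, n1: int, k2: int, n2: int):
    """Two-sided two-proportion z-test for a difference in accuracies.

    k1/n1 and k2/n2 are correct counts and sample sizes for two models.
    Independence is assumed for simplicity; a paired test would be more
    powerful when both models answer the same questions.
    """
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)  # pooled accuracy under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Hypothetical example: 85/100 vs 78/100 looks like a 7-point gap,
# but the test cannot rule out chance at the usual 5% level.
z, p = two_proportion_ztest(85, 100, 78, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The point is not this particular test, it is that without some such check, benchmark deltas of a few points on a hundred questions are easy to over-interpret.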
A simple tutorial on how to implement static BED in BayesFlow
Machine learning, and deep probabilistic modelling in particular, seems to be revolutionising data compression. This short post describes 1) the basic components of a learned data compression pipeline; 2) the objective used to optimise model parameters and its equivalence to training a VAE; and 3) some of the challenges that remain to be solved.
Bayesian Optimal Experimental Design (BOED) is an elegant mathematical framework that enables us to design experiments optimally. This introductory post describes the BOED framework and the computational challenges associated with deploying it in applications.
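For reference, the central quantity in BOED is the expected information gain (EIG) of a design $\xi$; one common form (notation mine, not necessarily the post's) is:

```latex
\mathrm{EIG}(\xi)
  = \mathbb{E}_{p(\theta)\,p(y \mid \theta, \xi)}
    \left[ \log \frac{p(y \mid \theta, \xi)}{p(y \mid \xi)} \right],
\qquad
p(y \mid \xi) = \int p(y \mid \theta, \xi)\, p(\theta)\, \mathrm{d}\theta .
```

The nested marginal $p(y \mid \xi)$ is itself an intractable integral inside an expectation, which is a major source of the computational challenges the post describes.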