This post was prompted by some recent, nicely done videos by Rasmus Baath that provide an intuitive, low-math introduction to Bayesian material. Now, I do not know that these have delivered less than he hoped for, nor have I asked him. However, given similar material I and others have tried out in the past that did not deliver what was hoped for, I am anticipating that and speculating why here. I have real doubts about such material actually enabling others to meaningfully interpret Bayesian analyses, let alone implement them themselves. For instance, in a conversation last year with David Spiegelhalter, his take was that some material I had could easily be followed by many, but the concepts that material was trying to get across were very subtle and few would have the background to connect to them. On the other hand, maybe I am hoping to be convinced otherwise here.
For those too impatient to watch the three roughly 30-minute videos, I will quickly describe my material that David commented on (which I think is fairly similar to Rasmus'). I am more familiar with it, and doing that avoids any risk of misinterpreting anything Rasmus did. It is based on a speculative description of what Francis Galton did in 1885, which was discussed more thoroughly by Stephen Stigler. It also involves a continuous(-like) example, which I highly prefer starting with. I think continuity is of overriding importance, so one should start with it unless they absolutely cannot.
Galton constructed a two stage quincunx (diagram from an 1890 patent application!) with the first stage representing his understanding of the variation of the targeted plant height in a randomly chosen plant seed of a given variety. The pellet haphazardly falls through the pins and lands at the bottom of the first level as the target height of the seed. His understanding, I think, is a better choice of wording than belief, information or even probability (which it can be taken to be, given the haphazardness). Also it is much, much better than prior! Continuing on from the first level, the pellet falls down a second set of pins, landing at the very bottom as the height the plant actually grew to. This second level represents Galton's understanding of how a seed of a given targeted height varies in the height it actually grows to. Admittedly this physical representation is actually discrete, but the level of discreteness can be lessened without preset limits (other than practicality).
Possibly, he would have assessed the ability of his machine to adequately represent his understanding by running it over and over again and comparing the set of plant heights represented on the bottom level with knowledge of the heights this variety of seed usually grew to in the past, if not currently. He should have. Another way to put Galton's work would be that of building (and checking) a two stage simulation to adequately emulate one's understanding of targeted plant heights and the actual plant heights that have been observed to grow. Having assessed his machine as adequate (by surviving a fake data simulation check), he might have then thought about how to learn about a particular given seed's targeted height (possibly already growing or grown), given he would only get to see the actual height grown. The targeted height remains unknown and the actual height becomes known. It is clear that Galton decided that in trying to assess the targeted height from an actual height, one should not look downward from a given targeted height but rather upward from the actual height grown.
Now, by doing multiple drops of pellets, one at a time, and recording where the pellet was at the bottom of the first level if and only if it landed at a particular location on the bottom level matching an actual grown height, he would be doing a two stage simulation with rejection. This clearly provides an exact (smallish) sample from the posterior given the exact joint probability model (physically) specified/simulated by the quincunx. It is exactly the same as the conceptual way to understand Bayes suggested by Don Rubin in 1982. As such, it would have been an early fully Bayesian analysis, even if not actually perceived as such at the time (though Stigler argues that it likely was).
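In code, Galton's drop-and-reject scheme is only a few lines. Here is a minimal sketch in Python, with entirely made-up numbers (Normal first and second levels, heights in cm) standing in for Galton's understanding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers for illustration: prior understanding of a seed's
# targeted height (cm), and how actual growth varies around that target.
prior_mean, prior_sd = 100.0, 10.0   # first level of the quincunx
growth_sd = 5.0                      # second level
observed_height = 110.0              # actual height grown (the one thing seen)
bin_width = 1.0                      # discreteness of the quincunx slots

n_drops = 200_000
targets = rng.normal(prior_mean, prior_sd, n_drops)  # pellet lands at first level
actuals = rng.normal(targets, growth_sd)             # pellet falls to the bottom

# Rejection: keep a pellet only if it lands in the slot matching the observed height
kept = targets[np.abs(actuals - observed_height) < bin_width / 2]

print(f"accepted {len(kept)} of {n_drops}; "
      f"posterior mean ~ {kept.mean():.1f}, sd ~ {kept.std():.1f}")
```

Since both levels here are Normal, the exact posterior is available in closed form (mean 108, sd about 4.5 for these particular numbers), so the accepted pellets can be checked against it.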
This awkward-to-carry-out, but arguably less challenging, way to grasp Bayesian analysis can be worked up to address numerous concepts in statistics (both implementing calculations and motivating formulas) that are again less challenging to grasp (or so it's hoped). This is what I perceive Rasmus, Richard McElreath and others are essentially doing. Authors do differ in their choices of which concepts to focus on. My initial recognition of these possibilities led to this overly exuberant but very poorly thought through post back in 2010 (some links are broken).
To more fully discuss this below (which may be of interest only to those very interested), I will extend the quincunx to multiple samples (n > 1) and multiple parameters, clarify the connection to approximate Bayesian computation (ABC) and point out something much more sensible when there is a formula for the second level of the quincunx (the evil likelihood function). The likelihood might provide a smoother transition to (MCMC) sampling from the typical set instead of the entirety of parameter space. I will also say some nice things about Rasmus's videos and of course make a few criticisms.
Here I am switching back to standard statistical terminology, even though I agree with Richard McElreath that these should not be used in introductory material.
As for n > 1, had Galton conceived of cloning the germinated seeds to allow multiple actual heights with the same targeted height, he could have emulated sample sizes of n > 1 in a direct if very awkward way using multiple quincunxes. In the first quincunx, the first level would represent the prior, and the subset of the pellets that ended up on the bottom of the second level matching the height of the first plant actually grown would represent the posterior given the first plant height. The pellets representing the posterior in the first quincunx (the subset at the bottom of the first level) would then need to be transferred to the bottom of the first level of the second machine (as the new prior). They would then be let fall down to the bottom of its second level to represent the posterior given both the first and second plant heights. And so on for each and every sample height grown from the cloned seed.
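This multiple-quincunx scheme amounts to reusing the accepted pellets as the prior for the next observation. A minimal sketch, with made-up heights for the cloned plants:

```python
import numpy as np

rng = np.random.default_rng(1)

prior_mean, prior_sd = 100.0, 10.0   # first level of the first quincunx
growth_sd = 5.0                      # each second level
heights = [110.0, 104.0, 107.0]      # made-up heights of the cloned plants
bin_width = 1.0

# Pellets at the bottom of the first level of the first quincunx (the prior)
pellets = rng.normal(prior_mean, prior_sd, 1_000_000)

for h in heights:
    # Drop the surviving pellets down a fresh second level...
    actuals = rng.normal(pellets, growth_sd)
    # ...and keep only those landing in the slot of this plant's height;
    # the survivors transfer to the next quincunx as the new prior
    pellets = pellets[np.abs(actuals - h) < bin_width / 2]

print(f"{len(pellets)} pellets left; posterior mean ~ {pellets.mean():.1f}")
```

Note how few pellets survive all three rejections: the awkwardness of the physical scheme shows up as a rapidly shrinking acceptance rate as n grows.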
As for multiple parameters, Rasmus moved on to two parameters in his videos simply by running two quincunxes in parallel. At some point the benefit/cost of this physical analogue (or metaphor) quickly approaches zero and it should be discarded. Perhaps at this point just move on to two stage rejection sampling with multiple parameters and sample sizes > 1.
The history of approximate Bayesian computation is interesting, and perhaps especially so for me, as I thought I had invented it for a class of graduate Epidemiology students in 2005. I needed a way of convincing them that Bayes was not a synonym for MCMC and thought of using two stage rejection sampling to do this. Though, given a later discussion with Don Rubin, it is likely I got it from his paper linked above and just forgot about that. But two stage rejection sampling is and is not ABC.
The motivation for ABC was not having a tractable likelihood but still wanting to do a Bayesian analysis. It was the same motivation as for my DPhil thesis: not having a tractable likelihood for many published summaries (with no access to individual values) but still wanting to do a likelihood based confidence interval analysis (I was not yet favouring a Bayesian approach). In fact, the group that is generally credited with first doing ABC (with that Bayes motivation) included my internal thesis examiner RC Griffiths (Oxford). Now, I first heard about ABC in David Cox's recorded JSM talk in Florida. Afterwards, whenever I exchanged emails with some in Griffiths' group and others doing ABC, there arose a lot of initial confusion.
That was because in my thesis work I did have the likelihood in closed form for individual observations but only had summaries, which usually did not have a tractable likelihood. The ABC group never had a tractable likelihood for individual observations (or it was too expensive to compute). Because of this, when I used ABC to get posteriors from summarised data, because that was all that had been observed, it would actually be approximating the exact posterior (given one had only observed the summaries). So to some of them I was not actually doing ABC but some weird other thing. (I am not aware of anyone having published such an ABC-like analysis, for instance a meta-analysis of published summaries.)
So now into some of the technicalities of real ABC and not quite real ABC. Let's take a very simple example of one continuous observation that was recorded to only two decimal places and one unknown parameter. In general, with any prior and data generating model, two stage rejection sampling matched to two decimal places provides a sample from the exact posterior. So not ABC, just full Bayes done inefficiently. On the other hand, if it was recorded to 5 or 10 or more decimal places, matching all the decimal places may not be feasible, and choosing to match to just two decimal places would be real ABC. Now think of 20, 30 or more samples recorded to two decimal places. Matching all of them, even to 2 decimal places, is not feasible, but deciding to match the sample mean to all the decimal places of the recorded sample mean will be feasible and is ABC having used just the summary. Well, unless one assumes the data generating model is Normal, as then by sufficiency it's not ABC – it's just full Bayes. These distinctions are somewhat annoying – but the degree of approximation does need to be recognised, and admittedly ABC is the wrong label when the posterior is exact rather than approximate.
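These cases can be made concrete with a short sketch (all numbers hypothetical): 20 observations, a Normal data generating model, and matching only the sample mean within a small tolerance. Because the sample mean is sufficient under the Normal model, this is really full Bayes done inefficiently rather than ABC; swap in a non-Normal data generating model and the very same code becomes genuine ABC.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: n observations from N(theta, 1), theta unknown,
# each value recorded to two decimal places.
n = 20
data = np.round(rng.normal(0.3, 1.0, n), 2)
obs_mean = data.mean()

# Draw theta from the prior, simulate a full dataset for each draw,
# and keep draws whose simulated sample mean falls within a tolerance
# of the observed sample mean (matching only the summary)
n_sims = 200_000
theta = rng.normal(0.0, 2.0, n_sims)
sim = rng.normal(theta[:, None], 1.0, (n_sims, n))
sim_means = sim.mean(axis=1)
tol = 0.01
kept = theta[np.abs(sim_means - obs_mean) < tol]

print(f"accepted {len(kept)}; posterior mean ~ {kept.mean():.2f}")
```

With a conjugate Normal prior the accepted draws can be checked against the closed-form posterior, which is how the approximation error from the tolerance can be seen to be negligible here.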
Now some criticism of Rasmus's videos. I really did not like the part where the number of positive responses to a mail out of 16 offers was analysed using a uniform prior – primarily motivated as non-informative – and the resulting (formal) posterior probabilities discussed as if they were relevant or useful. This is not the kind of certainty about uncertainty, obtained through statistical alchemy, that we want to be inadvertently instilling in people. Now, informative priors were later pitched as being better by Rasmus, but I think the damage had already been done.
Unfortunately, the issue is not that well addressed in the statistical literature, and someone even once wrote that most Bayesians would be very clear about the posterior not being the posterior but rather dependent on the particular prior – at least in any published Bayesian analysis. Now, if I was interested in how someone in particular or some group in particular would react, their prior, and hence their implied posterior probabilities given the data they have observed, would be relevant, useful and could even be taken literally. But if I was interested in how the world would react, the prior would need to be credibly related to the world for me to take posterior probabilities as relevant and in any remote sense literal.
Now, if calibrated, posterior probabilities could provide useful uncertainty intervals, but that's a different topic. For an instance of priors being unconnected to the world, Andrew's multiple comparisons post provided an example of a uniform prior on the line that is horribly unrelated to the effect sizes in the research area one is working in, with the studies being far too weak to in any sense get over that lack of relatedness. In introductory material, just don't use flat priors at all. Do mention them, but point out that they can be very dangerous in general (i.e. consult an experienced expert before using them).
I really did like the side by side plots of the sample space and parameter space for the linear regression example. The sample space plot showed the fitted line (and the individual x and y values), and the parameter space plot initially had a dot at the intercept and slope values that give the maximum probability of observing the individual x and y values actually observed. Later, dots were added and scaled to show intercept and slope values that give less probability, and then posterior probabilities were printed over these. Now, I do think it would be better if there was something in the parameter space plot that roughly represented the individual x and y values observed.
Here one could take the value of the intercept (that jointly with the slope gave the maximum probability) as fixed or known, and then, with just one unknown left, find the maximizing slope using each individual x and y value and plot those. Then do the same for the intercept by taking the slope value (that gave the maximum probability) as known. The complication here comes from the intercept and slope parameters being tangled up together. Much more can be done here, but admittedly I have been able to convince very few that this sort of thing would be worth the trouble. (Well, one journal editor, but they found that the technical innovations involved were not worthy of a published paper in their journal.) What about the standard deviation parameter? It was taken as known from the start. Actually, that does not matter as much, as that parameter is much less tangled up with the intercept and slope parameters.
When one does have a closed form for the likelihood (and it is not numerically intensive), two stage rejection sampling is sort of silly. If you think about two stage rejection sampling, in the first stage you get a sample of the proportions certain values have in the prior (i.e. estimates of the prior probabilities of those values). In the second stage you keep the proportion of those values that generated simulated values that matched (or closely approximated) the observed values. The proportions kept in the second stage are estimates of the probabilities of generating the observed values given the parameter values generated in the first stage. Hence they are estimates of the likelihood – P(X|parameter) – but you have that in closed form. So take the parameter values generated in the first stage and simply weight them by the likelihood (i.e. importance sampling) to get a better sample from the posterior much more efficiently. Doing this, one can easily implement more realistic examples such as multiple regression with half a dozen or more covariates. Some are tempted to avoid any estimation at all by using a systematic approximation of the joint distribution on grids of points, leading to P(discretized(X), discretized(parameter)) and P(parameter) * P(X|parameter) for some level of discreteness. I think this is a mistake, as it severely breaks continuity, does not scale to realistic examples and suggests a return to unthinkingly plugging and chugging through weird formulas that one just needs to get used to.
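A minimal sketch of this importance sampling shortcut, using the same kind of Normal prior and Normal data model as the quincunx discussion (numbers made up): draw parameter values from the prior, weight each draw by the closed-form likelihood, and resample by weight if an unweighted posterior sample is wanted.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up model: theta ~ N(100, 10^2), actual height | theta ~ N(theta, 5^2),
# with three observed heights.
obs = np.array([110.0, 104.0, 107.0])
growth_sd = 5.0

n_draws = 100_000
theta = rng.normal(100.0, 10.0, n_draws)   # first stage: draws from the prior

# Closed-form (log) likelihood of all observations for each theta;
# the Normal density's constant factor cancels when weights are normalised
loglik = (-0.5 * ((obs[None, :] - theta[:, None]) / growth_sd) ** 2).sum(axis=1)
w = np.exp(loglik - loglik.max())          # importance weights, stabilised on the log scale
w /= w.sum()

post_mean = np.sum(w * theta)
post_sd = np.sqrt(np.sum(w * (theta - post_mean) ** 2))

# Resampling by weight gives an unweighted (approximate) posterior sample
posterior_sample = rng.choice(theta, size=5_000, p=w)

print(f"posterior mean ~ {post_mean:.1f}, sd ~ {post_sd:.1f}")
```

Working with log-likelihoods and subtracting the maximum before exponentiating avoids numerical underflow, which matters once one moves to the more realistic examples (e.g. multiple regression with many covariates) mentioned above.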
These more realistic, higher dimension examples may help bridge people to the need for sampling from the typical set instead of the entirety of parameter space. I did work it through for bridging to sequential importance sampling by _walking_ the prior to the posterior in smaller, safer steps. But bridging to the typical set would likely be better.
In closing this long post, I feel I should acknowledge the quality of Rasmus's videos, which he is sharing with everyone. I am sure they took a lot of time and work. It took me more time than I want to admit to put this post together, perhaps since it's been a few years since I actually worked on such material. Seeing others make some progress prompted me to try again by at least thinking about the challenges.
The post Seemingly intuitive and low math intros to Bayes never seem to deliver as hoped: Why? appeared first on Statistical Modeling, Causal Inference, and Social Science.