## Friday, May 16, 2014

I've made some good progress with the help of my fantastic mentor, Kerby Shedden, as well as the statsmodels guru Josef Perktold. In general terms, we have a working implementation of MICE for two parametric models that have good theoretical properties: logistic regression and OLS.

On simulation:

Our implementation uses a Gaussian approximation to the posterior distribution of the model parameters. The center of this Gaussian is the maximum likelihood estimate and its covariance matrix is the inverse Fisher information matrix. To impute, we draw a parameter vector from this distribution and then add noise according to the drawn quantities. For OLS, the imputed values are drawn from a normal distribution centered on the conditional mean function, with an (optional) perturbation of the variance parameter given by a chi-square draw. Similarly, for logistic regression the imputed values are drawn from a Bernoulli distribution. The procedure is asymptotically Bayesian: the real Bayesian posterior mean is a weighted combination of the prior and the estimated mean (and likewise for the OLS scale parameter), and as the sample size grows large, the prior is discounted. The approximation is most accurate when the prior distribution is uniform. Note that IVEware, a well-respected MICE package for SAS, uses this methodology.
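
To make the OLS case concrete, here is a minimal sketch of one imputation draw using plain NumPy. It is not the statsmodels code; the function name `impute_ols_once` and the `perturb_scale` flag are made up for illustration, and the design matrices are assumed to already include an intercept column.

```python
import numpy as np

def impute_ols_once(X_obs, y_obs, X_mis, rng, perturb_scale=True):
    """One stochastic imputation draw for a continuous variable (illustrative sketch)."""
    n, p = X_obs.shape
    # Least-squares fit on the rows where y is observed (the Gaussian MLE).
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = XtX_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    df = n - p
    sigma2 = resid @ resid / df
    if perturb_scale:
        # Optional chi-square perturbation of the variance parameter.
        sigma2 = sigma2 * df / rng.chisquare(df)
    # Draw coefficients from the Gaussian approximation: centered at the
    # estimate, covariance equal to the inverse Fisher information.
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # Imputed values: conditional mean plus normal noise.
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
    return X_mis @ beta + noise

# Toy data: y depends linearly on one covariate, about 30% of y is missing.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
missing = rng.random(200) < 0.3
y_imputed = impute_ols_once(X[~missing], y[~missing], X[missing], rng)
```

Calling the function repeatedly gives a different set of imputed values each time, which is what produces the multiple imputed datasets.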

On combining:

The combining rules are given by Rubin's rules, which are described in detail here. The rules are very intuitive, though not trivial to derive, and for our purposes we can take them for granted. The estimated parameters are just the mean of the parameters fit on each imputed dataset, and the squared standard errors are a weighted sum of the within-imputed-data variance and the between-imputed-data variance.
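
As a rough sketch of the combining step (again plain NumPy, with a hypothetical function name `combine_rubin`), the pooled variance under Rubin's rules is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance, where m is the number of imputed datasets:

```python
import numpy as np

def combine_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors (illustrative sketch).

    estimates, variances: array-likes of shape (m, p), one row per imputed dataset.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]
    pooled = estimates.mean(axis=0)            # mean of the fitted parameters
    within = variances.mean(axis=0)            # within-imputed-data variance
    between = estimates.var(axis=0, ddof=1)    # between-imputed-data variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)

# Three imputed datasets, one parameter each.
pooled, se = combine_rubin([[1.0], [2.0], [3.0]], [[0.5], [0.5], [0.5]])
```

The inputs here are made-up numbers; in practice each row would come from fitting the analysis model to one imputed dataset.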

And that's it for the actual statistics! A lot of time was spent figuring out how this can fit into the general statsmodels framework. The next step is to implement predictive mean matching (PMM). This is important because, although it is potentially biased, it is an imputation method that is robust to the choice of model. So when we get into other parametric families that are resistant to our attempts to define a posterior, PMM will become a viable backup.
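
For readers who have not seen PMM, here is a simplified sketch of the idea, not the planned statsmodels implementation. It assumes a linear model for the predicted means, uses the point estimate of the coefficients rather than a posterior draw, and the name `pmm_impute` and the donor-pool size `k` are illustrative choices:

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis, rng, k=5):
    """Simplified predictive mean matching (illustrative sketch)."""
    # Fit a linear model on the observed rows.
    beta, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)
    pred_obs = X_obs @ beta
    pred_mis = X_mis @ beta
    imputed = np.empty(X_mis.shape[0])
    for i, target in enumerate(pred_mis):
        # The k observed rows whose predicted means are closest to this one.
        donors = np.argsort(np.abs(pred_obs - target))[:k]
        # Borrow the actually observed value of a randomly chosen donor.
        imputed[i] = y_obs[rng.choice(donors)]
    return imputed

# Toy data with a skewed outcome that a normal model would fit poorly.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(150), rng.normal(size=150)])
y = np.exp(0.5 * X[:, 1] + 0.3 * rng.normal(size=150))
missing = rng.random(150) < 0.25
X_obs, y_obs, X_mis = X[~missing], y[~missing], X[missing]
imputed = pmm_impute(X_obs, y_obs, X_mis, rng)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of the real data, which is part of why the method tolerates a misspecified model.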

Cool graph on a bunch of imputations run on a single dataset:

Notice that the green dots fulfill our missing-at-random assumption: conditional on the horizontal axis variable, the value of the vertical axis variable is random. The reverse is also true, but harder to see on this graph. More in a week or so.