Business Intelligence

Missing Data Techniques for Dummies

This is not another article explaining various missing data techniques. This is a post about how to use them without getting an advanced degree in statistics or charming that nice young Data Analyst into doing it for you.

If you Google “missing data” you will be barraged with complicated statistical techniques. That’s par for the course, but none of these sites ever seem to tell you how to actually implement these techniques in the real world with real data. So I’ll give it a try.

Not that it’s best, but this is the approach that I use:

  • If less than 5% of data points are missing, I use plain old Listwise Deletion.
  • If less than 10% of data points are missing, I use Mean Imputation. Yes, I know it artificially inflates central tendency and affects standard error. But since such a small amount of data is missing, I can live with it.
  • If more than 10% of data points are missing and I’m confident the missing data are MAR or MCAR, then I use Multiple Imputation.
  • If more than 10% of data points are missing and I believe the missing data are NMAR, then I pull out the big guns and use Heckman Selection Modeling.
  • If the missing variable is Race, I simply assign the record to the “Other” category and forget about it.
  • If the data are longitudinal and it’s not the first wave, then I just use the mean of the subject’s previous observations and forget about it.

And here’s how I go about it:

For Listwise Deletion, I import the data in an Excel worksheet then use Autofill to select the blank records or variables.

For Mean Imputation, I use Excel’s AVERAGE function to calculate the mean and then use an IF statement to insert that mean into all missing records. If your dataset has tens of thousands of rows, this can be excruciatingly slow in Excel. So you should know in those cases it can be done faster in SAS with PROC MEAN.

For Multiple Imputation, I use SAS’ PROC MI function to run the imputations and its PROC MIANALYSE function to calculate the summary statistics on the regressors. I always run five sets of imputations because studies by very smart people have shown this is enough. Excel gurus know that the summary statistics can be done in Excel by searching for the text string _imputation_ (including the underscores) and separating each imputation into a separate worksheet. The string _imputation_ is the delimiter that PROC MI inserts into the dataset.

For Heckman Selection Modeling, I use Stata’s HECKMAN command, because it lets you choose between maximum likelihood and two-stage estimation. I always use the two-stage option, although there’s probably some complicated rule somewhere in the manual explaining when to use which. SAS lovers should know that SAS (version 9 and later) can do Heckman estimation with its PROC QLIM function, but keep in mind it’s limited to the maximum likelihood version of the model. Either way, people who don’t know about these functions will be dazzled with your skills. And it sounds fabulous when the footnotes of your survey say something like, “Missing values were imputed using two-stage Heckman Correction Estimation.”

For what it’s worth:

If somebody asks me why I use Multiple Imputation over alternate methods, I just say, “If it’s good enough for the Census Bureau, then it’s good enough for me.” Then I walk away and go to Starbucks.

If somebody asks me why I use Heckman Selection Modeling, I just say, “If it’s good enough for the Nobel Prize Committee, then it’s good enough for me.” Then I turn my iPod back on and spin back towards my computer screen.

I’ve heard that Stata’s ICE command is better than SAS’s PROC MI function, but I’m so accustomed to using SAS and Excel for this that I’ve never tried it.

I know I won’t win any goodwill points from the American Statistical Association for saying this, but unless your missing data could affect something really important like, say, nuclear missile targeting, I wouldn’t lose sleep over the choice of technique you choose because every one of them has legitimate weaknesses and the probability that no one really cares is 99.99%.

Jamel Cato
The Blue Collar Data Analyst