Multiple Imputation via Local Regression (Miles)

Methods for statistical analysis generally rely on complete rectangular data sets. When the data are incomplete, e.g., because of nonresponse in surveys, the researcher must choose among three alternatives:

1. The analysis rests on the complete cases only: This is almost always the worst option. In market research, for example, missing values occur more often among younger respondents. Because relevant behavior such as media consumption or past purchases often correlates with age, a complete case analysis gives the researcher misleading answers.
2. The missing data are imputed (i.e., filled in) by the application of an ad-hoc method: Ad-hoc methods range from filling in mean values to applying nearest neighbor techniques. Whereas filling in mean values performs poorly, nearest neighbor approaches have the advantage of imputing plausible values and work well in some applications. Yet ad-hoc approaches generally suffer from two limitations: they do not apply to complex missing data patterns, and they distort statistical inference, such as t-tests, on the completed data sets.
3. The missing data are imputed by the application of a method that is based on an explicit model: Such model-based methods can cope with the broadest range of missing data problems. However, they depend on a considerable set of assumptions and are sensitive to violations of those assumptions.
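The first two alternatives can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the thesis or the GfK data: the variable names (`age`, `tv_hours`), the linear relationship, and the missingness probabilities are hypothetical. The sketch shows that when younger respondents are more likely to have missing values, the complete case mean is biased, and that mean imputation keeps this bias while also shrinking the variance.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Simulate a hypothetical survey in which missingness depends on age."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(18, 80, size=n)
    # Hypothetical outcome: weekly TV hours increase with age.
    tv_hours = 5 + 0.3 * age + rng.normal(0, 4, size=n)
    # Younger respondents are more likely to leave the item unanswered.
    p_missing = np.clip(0.8 - 0.01 * age, 0.05, 0.95)
    missing = rng.uniform(size=n) < p_missing
    return age, tv_hours, missing


def complete_case_mean(y, missing):
    """Mean of the observed values only (complete case analysis)."""
    return y[~missing].mean()


def mean_impute(y, missing):
    """Fill every missing value with the mean of the observed values."""
    filled = y.copy()
    filled[missing] = y[~missing].mean()
    return filled


age, tv_hours, missing = simulate()
true_mean = tv_hours.mean()
cc_mean = complete_case_mean(tv_hours, missing)
filled = mean_impute(tv_hours, missing)

print(f"true mean:           {true_mean:.2f}")
print(f"complete case mean:  {cc_mean:.2f}")
print(f"mean-imputed mean:   {filled.mean():.2f}")
print(f"true variance:       {tv_hours.var():.2f}")
print(f"mean-imputed var.:   {filled.var():.2f}")
```

In this setup the complete cases over-represent older respondents, so the complete case mean overstates TV consumption. The mean-imputed data reproduce the same biased mean with an artificially small variance, which is one way in which ad-hoc imputation distorts inference on the completed data.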

This dissertation proposes two new methods, <midastouch> and <Miles>, that build on ideas by Cleveland & Devlin (1988) and Siddique & Belin (2008). Both methods combine model-based imputation with nearest neighbor techniques. Compared to default model-based imputation, they are as broadly applicable but require fewer assumptions and may therefore appeal to practitioners. The text first derives the proposed methods theoretically within the multiple imputation framework (Rubin, 1987) and then assesses their performance using both artificial data and a natural TV consumption data set from the GfK SE company. In highly nonlinear data, we observe that <Miles> outperforms the alternative methods and thus recommend its use in applications.
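For orientation, the multiple imputation framework pools the results from M completed data sets with the standard combining rules commonly attributed to Rubin (1987). The notation below (M, \hat{Q}_m, U_m) is an assumption chosen for this sketch and may differ from the notation used in the thesis.

```latex
% \hat{Q}_m : point estimate from the m-th completed data set
% U_m       : its estimated variance, m = 1, ..., M
\begin{align*}
  \bar{Q} &= \frac{1}{M} \sum_{m=1}^{M} \hat{Q}_m
    && \text{(pooled point estimate)} \\
  \bar{U} &= \frac{1}{M} \sum_{m=1}^{M} U_m
    && \text{(within-imputation variance)} \\
  B &= \frac{1}{M-1} \sum_{m=1}^{M} \bigl(\hat{Q}_m - \bar{Q}\bigr)^2
    && \text{(between-imputation variance)} \\
  T &= \bar{U} + \Bigl(1 + \frac{1}{M}\Bigr) B
    && \text{(total variance)}
\end{align*}
```

The between-imputation term B carries the uncertainty caused by the missing data. A single filled-in data set analyzed as if it were complete ignores this term, which is why single ad-hoc imputation tends to understate standard errors.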

GND Keywords: Datenerhebung (data collection); Fehlende Daten (missing data); Regressionsanalyse (regression analysis)

Keywords: Multiple Imputation; Predictive Mean Matching; Sequential Regressions; Local Regression; Distance-Aided Donor Selection

DDC Classification: 310 Statistics

RVK Classification: QH 235

Type: Doctoral thesis

URI: https://fis.uni-bamberg.de/handle/uniba/42362

Activation date: November 29, 2017
