Multiple Imputation via Local Regression (Miles)
Gaffert, Philipp (2017): Multiple Imputation via Local Regression (Miles), Bamberg: opus, doi: 10.20378/irbo-49884.
Author:
Gaffert, Philipp
Publisher Information:
Year of publication:
2017
Pages:
Supervisor:
Language:
English
Remark:
Dissertation, Otto-Friedrich-Universität Bamberg, 2017
DOI:
10.20378/irbo-49884
Licence:
Abstract:
Methods for statistical analysis generally rely on complete rectangular data sets. When the data are incomplete due to, e.g., nonresponse in surveys, the researcher must choose among three alternatives:
1. The analysis rests on the complete cases only: This is almost always the worst option. In market research, for example, missing values occur more often among younger respondents. Because relevant behavior such as media consumption or past purchases often correlates with age, a complete case analysis provides the researcher with misleading answers.
2. The missing data are imputed (i.e., filled in) by the application of an ad-hoc method: Ad-hoc methods range from filling in mean values to applying nearest neighbor techniques. Whereas filling in mean values performs poorly, nearest neighbor approaches bear the advantage of imputing plausible values and work well in some applications. Yet, ad-hoc approaches generally suffer from two limitations: they do not apply to complex missing data patterns, and they distort statistical inference, such as t-tests, on the completed data sets.
3. The missing data are imputed by the application of a method that is based on an explicit model: Such model-based methods can cope with the broadest range of missing data problems. However, they depend on a considerable set of assumptions and are susceptible to their violations.
This dissertation proposes two new methods, <midastouch> and <Miles>, that build on ideas by Cleveland & Devlin (1988) and Siddique & Belin (2008). Both methods combine model-based imputation with nearest neighbor techniques. Compared to default model-based imputation, they are as broadly applicable but require fewer assumptions, which should make them appealing to practitioners. The text first derives the proposed methods theoretically within the multiple imputation framework (Rubin, 1987) and then assesses their performance using both artificial data and a real-world TV consumption data set from GfK SE. In highly nonlinear data, we observe that <Miles> outperforms alternative methods and thus recommend its use in applications.
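To make the second alternative in the abstract concrete, the following is a minimal Python sketch of the two ad-hoc approaches it names: filling in mean values and a simple nearest neighbor fill. It is not the dissertation's <midastouch> or <Miles> method. The function names, the simulated sine-shaped data, and the 30% missingness rate are all assumptions made for illustration only. The sketch shows one way mean imputation can distort inference: it shrinks the variance of the completed variable.

```python
import numpy as np


def mean_impute(y):
    """Fill missing entries (NaN) of y with the mean of the observed entries."""
    y = np.asarray(y, dtype=float)
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)
    return filled


def nearest_neighbor_impute(x, y):
    """Fill each missing y with the observed y of the case whose x is closest.

    Hypothetical single-covariate variant; x is assumed fully observed.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    donors_x, donors_y = x[~missing], y[~missing]
    filled = y.copy()
    for i in np.flatnonzero(missing):
        nearest = np.argmin(np.abs(donors_x - x[i]))
        filled[i] = donors_y[nearest]
    return filled


# Simulated nonlinear data (an assumption, not data from the dissertation).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, size=n)
y = np.sin(x) + rng.normal(scale=0.1, size=n)
y[rng.random(n) < 0.3] = np.nan  # assumed 30% of values missing at random

var_observed = np.nanvar(y)                      # variance of observed values
var_mean = np.var(mean_impute(y))                # shrunk by mean imputation
var_nn = np.var(nearest_neighbor_impute(x, y))   # donor values keep the spread

print(f"observed: {var_observed:.3f}")
print(f"mean-imputed: {var_mean:.3f}")
print(f"nearest-neighbor-imputed: {var_nn:.3f}")
```

With mean imputation, the variance of the completed variable equals the observed variance multiplied by the share of observed cases, so it is always too small when values are missing. Both functions produce a single completed data set; neither reflects the imputation uncertainty that the multiple imputation framework is designed to capture.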
GND Keywords:
Data collection (Datenerhebung)
Missing data (Fehlende Daten)
Regression analysis (Regressionsanalyse)
Keywords:
Multiple Imputation
Predictive Mean Matching
Sequential Regressions
Local Regression
Distance-Aided Donor Selection
DDC Classification:
RVK Classification:
Type:
Doctoral thesis
Activation date:
November 29, 2017
Permalink
https://fis.uni-bamberg.de/handle/uniba/42362