Generalized Tree-Based Machine Learning Methods with Applications to Small Area Estimation
Frink, Nicolas (2025): Generalized Tree-Based Machine Learning Methods with Applications to Small Area Estimation, Bamberg: Otto-Friedrich-Universität, doi: 10.20378/irb-107660.
Author:
Frink, Nicolas
Publisher Information:
Bamberg: Otto-Friedrich-Universität
Year of publication:
2025
Language:
English
Remark:
Cumulative dissertation, Otto-Friedrich-Universität Bamberg, 2025
DOI:
10.20378/irb-107660
Abstract:
Chapter 1 - Identifying and addressing poverty is challenging in administrative units with limited information on income distribution and well-being. To overcome this obstacle, small area estimation methods have been developed to provide reliable and efficient estimators at disaggregated levels, enabling informed decision-making by policymakers despite data scarcity. We propose a robust and flexible approach for estimating poverty indicators based on binary response variables within the small area estimation context: the generalized mixed effects random forest. Our method employs machine learning techniques to identify predictive, non-linear relationships from data while also modeling hierarchical structures. Mean squared error estimation is explored using a parametric bootstrap. From an applied perspective, we examine how the information loss from converting continuous variables into binary variables affects the performance of small area estimation methods. We evaluate the proposed point and uncertainty estimates in both model- and design-based simulations. Finally, we apply our method to a case study revealing spatial patterns of poverty in the Mexican state of Tlaxcala.
Chapter 2 - We propose small area estimation methods that use generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas from discrete survey data. Specifically, two existing approaches based on random forests - the Generalized Mixed Effects Random Forest (GMERF) and the Mixed Effects Random Forest (MERF) - are extended to accommodate count outcomes, addressing key challenges such as overdispersion. Additionally, three bootstrap methodologies designed to assess the reliability of point estimators for area-level means are evaluated. The numerical analysis shows that the MERF, which does not assume a Poisson distribution to model the mean behavior of count data, excels in scenarios of severe overdispersion. Conversely, the GMERF performs best under conditions where Poisson distribution assumptions are moderately met. In a case study using real-world data from the state of Guerrero, Mexico, the proposed methods effectively estimate area-level means while capturing the uncertainty inherent in overdispersed count data. These findings highlight their practical applicability for small area estimation.
Chapter 3 - The R package SAEforest simplifies the estimation of regionally disaggregated indicators using machine learning techniques for small area estimation. It provides tools for model presentation and diagnostics. Version 1.0.0 of the package includes mixed effects random forests for continuous outcomes. Since version 2.0.0, the package has also included generalized mixed effects random forests for binary and count-based indicators. To assess the uncertainty of the area-level estimates, corresponding mean squared error estimators are implemented. Additionally, version 2.0.0 introduces two new diagnostic plots and an updated hyperparameter tuning function for the generalized random forest components. The functionality of these enhancements is illustrated with examples using synthetic datasets for Austrian districts.
GND Keywords:
Einkommensverteilung (income distribution)
Schätzung (estimation)
Bootstrap-Statistik (bootstrap statistics)
Maschinelles Lernen (machine learning)
Keywords:
Chapter 1 - Data integration, Generalized mixed models, MSE estimation, Parametric bootstrap
Chapter 2 - Bootstrap, Generalized linear mixed models, Overdispersed count data, Random forest, Small area estimation
Chapter 3 - Generalized linear mixed models, Machine learning, Small area estimation, Survey statistics
Type:
Doctoral thesis
Activation date:
June 2, 2025
Permalink
https://fis.uni-bamberg.de/handle/uniba/107660