Filter Methods for Feature Selection in Supervised Machine Learning Applications : Review and Benchmark

Hopf, Konstantin; Reifenrath, Sascha

doi:10.48550/arxiv.2111.12140

Faculty/Chair:

Information Systems and Energy Efficient Systems

Author:

Hopf, Konstantin; Reifenrath, Sascha

Publisher Information:

arXiv

Year of publication:

2021

Pages:

1-38

Language:

English

DOI:

10.48550/arxiv.2111.12140

Abstract:

The amount of data for machine learning (ML) applications is constantly growing. Not only the number of observations but especially the number of measured variables (features) increases with ongoing digitization. Selecting the most appropriate features for predictive modeling is an important lever for the success of ML applications in business and research. Many feature selection methods (FSMs) that are independent of a particular ML algorithm, so-called filter methods, have been suggested, but little guidance exists to help researchers and quantitative modelers choose appropriate approaches for typical ML problems. This review synthesizes the substantial literature on feature selection benchmarking and evaluates the performance of 58 methods in the widely used R environment. For concrete guidance, we consider four typical dataset scenarios that are challenging for ML models (noisy, redundant, and imbalanced data, and cases with more features than observations). Drawing on the experience of earlier benchmarks, which have considered far fewer FSMs, we compare the performance of the methods according to four criteria (predictive performance, number of relevant features selected, stability of the feature sets, and runtime). We found that methods relying on the random forest approach, the double input symmetrical relevance filter (DISR), and the joint impurity filter (JIM) were well-performing candidate methods for the given dataset scenarios.

GND Keywords:

Maschinelles Lernen; Merkmal; Benchmark; Forschungsmethode; Datenanalyse

Keywords:

Business Analytics; Big Data Analytics; Feature Selection; Filter Methods; Machine Learning; Benchmark

DDC Classification:

004 Computer science; 330 Economics

RVK Classification:

ST 530

Type:

Preprint

URI:

https://fis.uni-bamberg.de/handle/uniba/95199

Activation date:

May 10, 2024

Project(s):

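Kombinierte Verhaltens- und Analyse-Innovation zur Steigerung der Energieeffizienz mittels Smart Meter in Privathaushalt; Teilprojekt: Maschinelle Lernverfahren für Energieeffizienz-Feedback [Combined behavioral and analytics innovation for increasing energy efficiency using smart meters in private households; subproject: machine learning methods for energy-efficiency feedback]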
