Abstract
Missing data is a common data quality issue that can make data analysis challenging. Tabular datasets, widely used by companies and organizations in a variety of sectors, often suffer from missing data. Addressing this problem, for example by means of data imputation, is crucial for data analysis because most predictive models are not designed to handle missing data natively.
For several decades, new imputation methods have been developed continuously, yet no single approach performs well across all missing data scenarios. The choice of an appropriate imputation method depends on several factors, such as missingness level and mechanism. Traditional trial-and-error approaches for method selection are time-consuming and computationally expensive. In this thesis, a meta-learning-based recommendation system is proposed to efficiently recommend suitable imputation methods based on dataset characteristics.
The purpose of imputing a dataset is to enable the subsequent development of a predictive model. We show that well or even perfectly reconstructed data does not guarantee the best predictive performance. Therefore, it is important to treat imputation method selection and machine learning method selection as a joint selection problem that accounts for the relationship between them. Our meta-learning approach can provide joint imputer and regressor recommendations at a similarly low computational cost. To ensure transparency and trust in these automated recommendations, we incorporate global and local explanation techniques that provide insight into both general model behaviour and the reasoning behind individual recommendations.
In recent years, large language models (LLMs) have penetrated domains far beyond natural language processing, including tabular data analysis. As the use of LLMs for handling missing data remains understudied, we explore their potential to support algorithm selection and to inspire novel imputation strategies based on the type of missing data mechanism. We find that LLMs offer data analysts a promising alternative approach for addressing missing data issues, both as a new imputation method and as a tool for imputation method selection.
In conclusion, this thesis provides insight and practical tools for addressing missing data issues: an explainable, computationally inexpensive meta-learning-based recommender system, several frameworks for further development,
and promising ways in which LLMs can be utilized for this purpose.
| Original language | English |
|---|---|
| Qualification | Doctor of Philosophy |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 30 Jan 2026 |
| Place of Publication | Enschede |
| Publisher | |
| Print ISBNs | 978-90-365-7062-6 |
| Electronic ISBNs | 978-90-365-7063-3 |
| DOIs | |
| Publication status | Published - 30 Jan 2026 |
Keywords
- Missing values
- Imputation
- Meta-learning
- Large language model
Fingerprint
Dive into the research topics of 'Handling missing data with meta-learning and large language models'. Together they form a unique fingerprint.