
Handling missing data with meta-learning and large language models

  • Işıl Baysal Erez

Research output: Thesis › PhD Thesis - Research UT, graduation UT


Abstract

Missing data is a common data quality issue that can make data analysis challenging. Tabular datasets, widely used by companies and organizations in a variety of sectors, often suffer from missing data problems. Addressing these problems, for example by means of data imputation, is crucial for data analysis because most predictive models are not designed to handle missing data natively.

For several decades, new imputation methods have been developed continuously, yet no single approach performs well across all missing data scenarios. The choice of an appropriate imputation method depends on several factors, such as missingness level and mechanism. Traditional trial-and-error approaches for method selection are time-consuming and computationally expensive. In this thesis, a meta-learning-based recommendation system is proposed to efficiently recommend suitable imputation methods based on dataset characteristics.
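As a rough illustration of the idea of recommending an imputation method from dataset characteristics, the following minimal Python sketch computes a few meta-features and returns the method attached to the most similar entry in a small meta-knowledge base. It is not the system proposed in the thesis: the function names, the choice of meta-features, the nearest-neighbour lookup, and every entry in `META_BASE` are hypothetical and chosen only for illustration.

```python
import math

def meta_features(rows):
    """Compute simple dataset characteristics; None marks a missing cell."""
    n_rows, n_cols = len(rows), len(rows[0])
    n_missing = sum(cell is None for row in rows for cell in row)
    incomplete_rows = sum(any(cell is None for cell in row) for row in rows)
    return (
        n_missing / (n_rows * n_cols),  # overall missingness level
        incomplete_rows / n_rows,       # share of rows with a missing cell
        math.log10(n_rows),             # dataset size on a log scale
    )

# Hypothetical meta-knowledge base: meta-features of previously studied
# datasets paired with the imputer assumed to have worked best on each.
# The values are invented for illustration, not experimental results.
META_BASE = [
    ((0.05, 0.15, 2.0), "mean"),
    ((0.30, 0.80, 3.0), "knn"),
    ((0.50, 0.95, 4.0), "iterative"),
]

def recommend_imputer(rows, meta_base=META_BASE):
    """Return the imputer of the nearest meta-base entry (Euclidean distance)."""
    target = meta_features(rows)
    _, imputer = min(meta_base, key=lambda entry: math.dist(entry[0], target))
    return imputer

# Example: 100 rows, 2 columns, 5 missing cells in the first column.
example = [[1.0, 2.0]] * 95 + [[None, 2.0]] * 5
recommendation = recommend_imputer(example)
```

The lookup costs one distance computation per stored dataset, whereas trial and error would require running every candidate imputation method on the new dataset. A practical recommender would presumably use richer meta-features and a trained meta-model in place of this nearest-neighbour lookup.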

A dataset is imputed to enable the subsequent development of a predictive model. We have shown that well or even perfectly reconstructed data does not guarantee the best predictive performance. Therefore, it is important to treat the selection of an imputation method and the selection of a machine learning method as a joint selection problem that accounts for the relationship between them. Our meta-learning approach can provide joint imputer and regressor recommendations at a similarly low computational cost. To ensure transparency and trust in these automated recommendations, we incorporate global and local explanation techniques that provide insight into both the general model behaviour and the reasoning behind individual recommendations.

In recent years, large language models (LLMs) have penetrated domains far beyond natural language processing, including tabular data analysis. As the use of LLMs for handling missing data remains understudied, we explore their potential to support algorithm selection and to inspire novel imputation strategies based on the type of missing data mechanism. We find that LLMs offer data analysts a promising alternative approach for addressing missing data issues, both as a new imputation method and as a tool for imputation method selection.

In conclusion, this thesis provides insight and practical tools for addressing missing data issues: an explainable, computationally inexpensive meta-learning-based recommender system, several frameworks for further development, and promising ways in which LLMs can be utilized for this purpose.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • van Keulen, Maurice, Supervisor
  • Poel, Mannes, Co-Supervisor
Award date: 30 Jan 2026
Place of Publication: Enschede
Publisher
Print ISBNs: 978-90-365-7062-6
Electronic ISBNs: 978-90-365-7063-3
DOIs
Publication status: Published - 30 Jan 2026

Keywords

  • Missing values
  • Imputation
  • Meta-learning
  • Large language model
