Scalable CNN-based classification of selective sweeps using derived allele frequencies

Sjoerd van den Belt, Hanqing Zhao, Nikolaos Alachiotis*

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Motivation: Selective sweeps can be distinguished from neutral genetic data using summary statistics and likelihood-based methods that analyze single nucleotide polymorphisms (SNPs). However, these methods are sensitive to confounding factors, such as severe population bottlenecks and old migration. Machine learning approaches, specifically convolutional neural networks (CNNs), have recently produced accurate classification models that are robust to confounding factors. However, such methods are more computationally expensive than summary-statistic-based ones, rendering them impractical for processing large-scale genomic data. Moreover, SNP data are frequently preprocessed to improve classification accuracy, which further lengthens the already long analysis times.

Results: To address this, we propose a 1D CNN-based model, dubbed FAST-NN, that requires no preprocessing and uses only derived allele frequencies instead of summary statistics or raw SNP data, thereby yielding a sample-size-invariant, scalable solution. We evaluated several data fusion approaches to account for the variance in the density of genetic diversity across genomic regions (a selective sweep signature), and performed an extensive neural architecture search based on a state-of-the-art reference network architecture (SweepNet). The resulting model, FAST-NN, outperforms the reference architecture by up to 12% in inference accuracy across all evaluated challenging evolutionary scenarios with confounding factors. Moreover, FAST-NN is between 30× and 259× faster on a single CPU core, and between 2.0× and 6.2× faster on a GPU, when processing sample sizes between 128 and 1000 samples. Our work paves the way for the practical use of CNNs in large-scale selective sweep detection.

Availability and implementation: https://github.com/SjoerdvandenBelt/FAST-NN
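To illustrate why derived allele frequencies give a sample-size-invariant input, the following is a minimal sketch and not the FAST-NN implementation. It assumes a binary haplotype matrix (rows are samples, columns are SNPs, 0 is the ancestral allele and 1 the derived allele); the function name `derived_allele_frequencies` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def derived_allele_frequencies(haplotypes: Sequence[Sequence[int]]) -> List[float]:
    """Return one derived allele frequency per SNP column.

    Assumed encoding (not taken from the paper): rows are sampled
    sequences, columns are SNPs, 0 = ancestral, 1 = derived.
    """
    n_samples = len(haplotypes)
    if n_samples == 0:
        raise ValueError("need at least one sample")
    n_snps = len(haplotypes[0])
    if any(len(row) != n_snps for row in haplotypes):
        raise ValueError("all samples must cover the same SNPs")

    # Count derived alleles per SNP, then normalize by the sample count.
    counts = [0] * n_snps
    for row in haplotypes:
        for j, allele in enumerate(row):
            counts[j] += allele
    return [count / n_samples for count in counts]


# Toy data: 4 samples x 4 SNPs.
small = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
# The same population pattern observed with 1000 samples.
large = small * 250

freqs_small = derived_allele_frequencies(small)
freqs_large = derived_allele_frequencies(large)
```

In this sketch both calls return a vector with one entry per SNP, so the length of the input to a 1D network would depend on the number of SNPs in the window rather than on the number of samples.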

Original language: English
Pages (from-to): ii29-ii36
Journal: Bioinformatics
Volume: 40
Issue number: suppl. 2
Publication status: Published - 1 Sept 2024
