Abstract

Local Ancestry Inference (LAI) is the high-resolution prediction of ancestry (African, European, ...) along a DNA sequence. LAI is becoming increasingly important in DNA sequence analysis for the study of human ancestry and migrations, and it is also needed in polygenic risk score research (the prediction of traits and disease risk). Most current LAI models are built for a specific species, set of ancestries, and set of chromosomes, so a new model must be trained from scratch for every slightly different setting. This creates a substantial barrier for research and industry when moving between LAI scenarios. In this thesis we present SALAI-Net, the first statistical method for LAI with a reference panel that can be used on any set of species and ancestries (species-agnostic). Loter, the state of the art among species-agnostic models with a reference panel, is based on a dynamic programming algorithm; however, it is slow and does not perform well in small reference panel settings. Our model is based on a novel hand-engineered template matching block followed by a convolutional smoothing filter optimized to minimize cross-entropy loss on a training dataset. The right choice of DNA sequence encoding, similarity features, and architecture is what allows our model to generalize well to unseen ancestries, species, and chromosomes. We benchmark our models on human whole-genome data and test their ability to generalize to dogs when trained on human data. Across the settings and datasets tested, our models outperform the state-of-the-art method by a large margin in accuracy and are up to two orders of magnitude faster. Our model also shows close to no generalization gap when switching between species.
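
To make the two-stage idea above concrete, the following is a minimal sketch of template matching against a reference panel followed by a smoothing filter. It is an illustration under stated assumptions, not the SALAI-Net implementation: the ±1 allele encoding, the non-overlapping windows, the max over references within each ancestry, and the fixed averaging kernel (standing in for a filter that would be learned by minimizing cross-entropy) are all assumptions, and every function name is hypothetical.

```python
def window_similarity(query, reference, window):
    """Matches minus mismatches per non-overlapping window (alleles in {-1, +1})."""
    n_win = len(query) // window
    return [
        sum(q * r for q, r in zip(query[w * window:(w + 1) * window],
                                  reference[w * window:(w + 1) * window]))
        for w in range(n_win)
    ]


def template_match(query, panel, labels, n_ancestries, window):
    """For each ancestry, keep the best-matching reference in every window.

    Assumes every ancestry has at least one reference in the panel.
    """
    sims = [window_similarity(query, ref, window) for ref in panel]
    scores = []
    for a in range(n_ancestries):
        rows = [s for s, lab in zip(sims, labels) if lab == a]
        scores.append([max(col) for col in zip(*rows)])
    return scores  # shape: n_ancestries x n_windows


def smooth(row, kernel):
    """1D convolution with zero padding; a fixed kernel stands in for a learned one."""
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(row):
                acc += weight * row[j]
        out.append(acc)
    return out


def predict(query, panel, labels, n_ancestries, window=4,
            kernel=(1 / 3, 1 / 3, 1 / 3)):
    """Return the highest-scoring ancestry index for each window."""
    scores = template_match(query, panel, labels, n_ancestries, window)
    smoothed = [smooth(row, kernel) for row in scores]
    n_win = len(smoothed[0])
    return [max(range(n_ancestries), key=lambda a: smoothed[a][w])
            for w in range(n_win)]


# Toy example: one reference per ancestry; the query switches ancestry halfway.
panel = [[1] * 16, [-1] * 16]
labels = [0, 1]
query = [1] * 8 + [-1] * 8
print(predict(query, panel, labels, n_ancestries=2))  # [0, 0, 1, 1]
```

Because the sketch only compares a query with whatever references are supplied and has no ancestry-specific parameters, the same code runs unchanged on a different panel or a different number of ancestries, which is the property the species-agnostic design relies on.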