Abstract

Introduction to Research, BSc Data Science and Engineering, Autumn 2021:

Predicting Dog Phenotypes from Genotypes

In this paper, we analyze dog genotypes – positions in DNA sequences that often vary between dogs – in order to predict the corresponding phenotypes – observable characteristics that result from differences in genetic code. More specifically, given chromosome data from a dog, we aim to predict its breed category, height, and weight. We explore a variety of linear and non-linear classification and regression techniques to accomplish these three tasks. We also investigate the use of a neural network (in both linear and non-linear modes) for breed classification and compare its performance to traditional statistical methods. We show that linear methods generally match or outperform non-linear methods for breed classification, whereas the reverse holds for height and weight regression. We also evaluate all of these methods as a function of the number of input features used in the analysis and demonstrate that phenotypes can be predicted with as few as 0.5% of the input features, and that dog breeds can be classified with 50% balanced accuracy using as few as 0.02% of the full genomic sequences in our analysis.
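For readers unfamiliar with the metric, balanced accuracy is commonly defined as the unweighted mean of per-class recall, which keeps frequent breeds from dominating the score. The sketch below illustrates that standard definition only. The function name and the toy labels are hypothetical, and nothing here reproduces the paper's data or pipeline.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # total samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy example (hypothetical labels): a classifier that always predicts the
# majority breed is right on 3 of 4 samples, but its recall is 1.0 for
# "beagle" and 0.0 for "poodle".
score = balanced_accuracy(
    ["beagle", "beagle", "beagle", "poodle"],
    ["beagle", "beagle", "beagle", "beagle"],
)
```

In this toy case plain accuracy would be 0.75 while the balanced accuracy is 0.5, which is why the metric suits datasets with uneven breed counts.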

 

MergeGenome. A Python-based Toolkit for Merging VCF files

A challenge in genomic studies is the lack of easy-to-access, properly formatted datasets. When more than one dataset is available, combining them is desirable, yet tools that properly merge genomic datasets without losing all non-matching features are lacking. To fill this gap, we present the MergeGenome toolkit, designed to integrate DNA sequences from two files in variant call format (VCF) while targeting data quality. MergeGenome is a robust pipeline of comprehensive steps to standardize nomenclature, remove ambiguities, correct flips, eliminate mismatches, select important features, and filter likely erroneous features (the latter with machine learning). MergeGenome is Python-based and relies on pre-existing software for manipulation and imputation of VCF data. We evaluate the result of merging two datasets of dog DNA sequences of dissimilar lengths and observe that genotype imputation with Beagle v5.1 usually fails for low-frequency alleles. As an alternative, we explore several multi-label machine learning classifiers. Although K-Nearest Neighbors achieves competitive results, none of the methods tried outperforms Beagle v5.1.
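The ambiguity, flip, and mismatch steps named above can be illustrated with a minimal sketch. It assumes the usual conventions for biallelic SNPs: A/T and C/G variants are strand-ambiguous, and a flip means one dataset reports the complementary strand. The function names are hypothetical, and this is not MergeGenome's actual implementation or API.

```python
# Watson-Crick complements for the four bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(ref, alt):
    """True for A/T and C/G SNPs, whose strand cannot be inferred from alleles alone."""
    return COMPLEMENT.get(ref) == alt


def classify_pair(ref_a, alt_a, ref_b, alt_b):
    """Compare the alleles two datasets report at the same position."""
    if is_ambiguous(ref_a, alt_a) or is_ambiguous(ref_b, alt_b):
        return "ambiguous"  # candidate for removal
    if (ref_a, alt_a) == (ref_b, alt_b):
        return "match"      # safe to merge as-is
    if (COMPLEMENT.get(ref_b), COMPLEMENT.get(alt_b)) == (ref_a, alt_a):
        return "flip"       # dataset B is on the opposite strand; complement it
    return "mismatch"       # alleles disagree; candidate for elimination


# Hypothetical positions shared by two datasets.
labels = [
    classify_pair("A", "G", "A", "G"),
    classify_pair("A", "G", "T", "C"),
    classify_pair("A", "T", "A", "T"),
    classify_pair("A", "G", "A", "C"),
]
```

Ambiguous variants are checked first because an A/T or C/G pair looks identical on both strands, so a flip at such a position cannot be detected from the alleles alone.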