Abstract

In the field of computational biology, microarryas are used to measure the activity of thousands of genes at once and create a global picture of cellular function. Microarrays allow scientists to analyze expression of many genes in a single experiment quickly and efficiently. Even if microarrays are a consolidated research technology nowadays and the trends in high-throughput data analysis are shifting towards new technologies like Next Generation Sequencing (NGS), an optimum method for sample classification has not been found yet.

 

Microarray classification is a complicated task, not only due to the high dimensionality of the feature set, but also to an apparent lack of data structure. This characteristic limits the applicability of processing techniques, such as wavelet filtering or other filtering techniques that take advantage of known structural relation. On the other hand, it is well known that genes are not expressed independently from other each other: genes have a high interdependence related to the involved regulating biological process.

 

This thesis aims to improve the current state of the art in microarray classification and to contribute to understand how signal processing techniques can be developed and applied to analyze microarray data. The goal of building a classification framework needs an exploratory work in which algorithms are constantly tried and adapted to the analyzed data. The developed algorithms and classification frameworks in this thesis tackle the problem with two essential building blocks. The first one deals with the lack of a priori structure by inferring a data-driven structure with unsupervised hierarchical clustering tools. The second key element is a proper feature selection tool to produce a precise classifier as an output and to reduce the overfitting risk.

 

The main focus in this thesis is the binary data classification, field in which we obtained relevant improvements to the state of the art. The first key element is the data-driven structure, obtained by modifying hierarchical clustering algorithms derived from the Treelets algorithm from the literature. Several alternatives to the original reference algorithm have been tested, changing either the similarity metric to merge the feature or the way two feature are merged. Moreover, the possibility to include external sources of information from publicly available biological knowledge and ontologies to improve the structure generation has been studied too. About the feature selection, two alternative approaches have been studied: the first one is a modification of the IFFS algorithm as a wrapper feature selection, while the second approach involved an ensemble learning focus. To obtain good results, the IFFS algorithm has been adapted to the data characteristics by introducing new elements to the selection process like a reliability measure and a scoring system to better select the best feature at each iteration. The second feature selection approach is based on Ensemble learning, taking advantage of the microarryas feature abundance to implement a different selection scheme. New algorithms have been studied in this field, improving state of the art algorithms to the microarray data characteristic of small sample and high feature numbers.

 

In addition to the binary classification problem, the multiclass case has been addressed too. A new algorithm combining multiple binary classifiers has been evaluated, exploiting the redundancy offered by multiple classifiers to obtain better predictions.

 

All the studied algorithm throughout this thesis have been evaluated using high quality publicly available data, following established testing protocols from the literature to offer a proper benchmarking with the state of the art.

 

Whenever possible, multiple Monte Carlo simulations have been performed to increase the robustness of the obtained results. 

Demos and Resources