Mathematics and computer science oriented projects

Machine learning classification is a well-established discipline whose algorithms have shown excellent performance in many fields. In healthcare, these algorithms are being developed to improve precision medicine and are already used in the clinic to stratify patients from medical images. Nevertheless, despite the clinical need, classification models based on high-dimensional tabular data (such as gene expression profiles or other high-throughput biological measurements) are currently not applicable in clinical practice because of their poor performance. In this particular area (predictive oncology), new strategies are still required to improve classification models.

1. Development of Wasserstein distance-based classifiers

Our first line of research to improve omics-based classification models was the development of a classifier using the Wasserstein distance (in collaboration with LAREMA, the mathematics lab in Angers). We tested two new strategies. First, we used as a metric the Wasserstein distance derived from optimal transport instead of the commonly used Euclidean distance. The Wasserstein distance has gained popularity with omics data in recent years, but no Wasserstein-based classifier had yet been developed for omics data. Second, our algorithm computes the exact Wasserstein distance (or, more precisely, the optimizer related to the 1-Wasserstein distance) instead of an approximation of the distance, which is typically obtained with a neural network. Using an exact algorithm increases model precision while avoiding complex approximation steps and optimization pitfalls. Exact methods are generally not used because of their high computational cost for large sample sizes, but omics data, with their small sample sizes, allow the use of an algorithm running in O(n^3) time, turning a disadvantage into an advantage. Compared with state-of-the-art classification algorithms, our algorithm is faster to run since it has no hyperparameter tuning steps. We also observed that it was best suited to cases where class information was shared between a large number of informative variables. This work, funded by la Ligue Contre Le Cancer, has been published (Cordier et al., Bioinformatics, 2025).
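
To make the idea of an exact computation concrete, here is a minimal sketch of one standard way to obtain an exact 1-Wasserstein distance between two small groups of samples. It is not the published algorithm: it assumes two groups of equal size with uniform weights and a Euclidean ground cost, in which case the optimal transport plan is a one-to-one matching that SciPy's `linear_sum_assignment` solves exactly. The function name `exact_wasserstein_1` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def exact_wasserstein_1(samples_a, samples_b):
    """Exact 1-Wasserstein distance between two equal-size point clouds.

    Each row is one sample (for example one expression profile) and every
    sample carries the same weight 1/n. This is an illustrative assumption.
    """
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("this sketch assumes two clouds of identical shape")
    # Pairwise Euclidean ground cost between samples of the two clouds.
    cost = cdist(a, b, metric="euclidean")
    # With uniform weights and equal sizes, an optimal transport plan is a
    # permutation, so the optimal assignment gives the exact distance.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()


# Toy data: 20 samples x 50 variables per group (hypothetical values).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(20, 50))
group_b = rng.normal(0.5, 1.0, size=(20, 50))
print(exact_wasserstein_1(group_a, group_b))
```

The assignment step is the costly part, which is why this kind of exact computation stays practical only when the number of samples is small, as is typical for omics cohorts.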

2. Dimension reduction algorithms for high-dimensional data

Molecular omics are high-dimensional, complex, multivariate data used to decipher intricate and multifaceted molecular exchanges in cells. They are generally analyzed with machine learning, but the derived models are often hampered by high background noise. One way to counter this issue is to use dimension reduction methods, which retain important information while removing noise. Here, we tested several autoencoder-based algorithms, including our own. We evaluated their performance in latent space construction as well as on the reconstructed data, in several contexts. This work was done in collaboration with LAREMA (mathematics lab in Angers) and LERIA (computer science lab in Angers) (submitted manuscript).
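
As an illustration of the two quantities mentioned above (the latent space and the reconstructed data), the following sketch trains the simplest possible autoencoder, a linear one, by gradient descent with NumPy. It is an assumption-laden toy and not one of the algorithms evaluated in this project: the architecture, the learning rate, the number of epochs and all function names are hypothetical. A linear autoencoder of this kind essentially recovers the same subspace as principal component analysis.

```python
import numpy as np


def train_linear_autoencoder(data, latent_dim=2, lr=0.005, epochs=1000, seed=0):
    """Fit encoder and decoder matrices by minimizing the squared error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    mean = x.mean(axis=0)
    x = x - mean  # center the variables
    n_samples, n_features = x.shape
    w_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
    w_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))
    for _ in range(epochs):
        latent = x @ w_enc
        error = latent @ w_dec - x
        # Gradients of 0.5 * (squared reconstruction error averaged over samples).
        grad_dec = latent.T @ error / n_samples
        grad_enc = x.T @ (error @ w_dec.T) / n_samples
        w_enc -= lr * grad_enc
        w_dec -= lr * grad_dec
    return mean, w_enc, w_dec


def encode(model, data):
    """Project samples into the latent space."""
    mean, w_enc, _ = model
    return (np.asarray(data, dtype=float) - mean) @ w_enc


def reconstruct(model, data):
    """Map samples to the latent space and back to the original variables."""
    mean, _, w_dec = model
    return encode(model, data) @ w_dec + mean


def reconstruction_error(model, data):
    """Mean squared difference between the data and its reconstruction."""
    x = np.asarray(data, dtype=float)
    return float(np.mean((x - reconstruct(model, x)) ** 2))


# Toy data: 40 samples driven by 2 hidden factors plus noise (hypothetical).
rng = np.random.default_rng(1)
factors = rng.normal(size=(40, 2))
mixing = rng.normal(size=(2, 10))
toy = factors @ mixing + 0.1 * rng.normal(size=(40, 10))

model = train_linear_autoencoder(toy, latent_dim=2)
print(encode(model, toy).shape)
print(reconstruction_error(model, toy))
```

The latent coordinates returned by `encode` and the error returned by `reconstruction_error` correspond to the two evaluation angles described in the paragraph, here in their most basic form.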

3. Data integration and multitask learning

In this project, funded by the SIRIC ILIAD, we aim to address this challenge in predictive oncology by combining data from various modalities and developing a multitask deep learning (MTL) approach to overcome the data scarcity inherent to the field. MTL is a subfield of supervised learning in which a single model is trained simultaneously on multiple tasks. It offers a solution to data scarcity since training on several tasks enhances the model's ability to generalize. Another advantage is that an MTL model can be trained by combining multiple small- and medium-sized heterogeneous datasets from distinct tasks, making it a strong candidate to cope with the limited data availability in oncology. Additionally, multitask models can be coupled with multimodal models to enhance learning, improve generalization and increase robustness to missing data.
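
The sketch below illustrates one common way to set up multitask learning, often called hard parameter sharing: a shared layer feeds one output head per task, and the training loss sums the per-task losses while skipping samples whose label for a given task is unknown. This is only an assumed, simplified layout, not the model developed in the project. It shows the forward pass and the joint loss without a training loop, and the task names, layer sizes and function names are hypothetical.

```python
import numpy as np


def init_params(n_features, n_hidden, task_names, seed=0):
    """Create one shared layer and one output head per task."""
    rng = np.random.default_rng(seed)
    params = {"shared": rng.normal(scale=0.1, size=(n_features, n_hidden))}
    for name in task_names:
        params[name] = rng.normal(scale=0.1, size=n_hidden)
    return params


def forward(params, x, task_names):
    """Return one predicted probability per sample for each task."""
    hidden = np.tanh(x @ params["shared"])  # representation shared by all tasks
    return {
        name: 1.0 / (1.0 + np.exp(-(hidden @ params[name])))
        for name in task_names
    }


def multitask_loss(predictions, labels):
    """Sum of per-task binary cross-entropies; NaN labels are ignored."""
    total = 0.0
    for name, prob in predictions.items():
        y = labels[name]
        known = ~np.isnan(y)  # samples that carry a label for this task
        if not known.any():
            continue
        p = np.clip(prob[known], 1e-7, 1.0 - 1e-7)
        t = y[known]
        total += float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))
    return total


# Two small datasets stacked together; each sample is labeled for one task only.
tasks = ["task_a", "task_b"]  # hypothetical task names
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
labels = {
    "task_a": np.array([1.0, 0.0, 1.0, np.nan, np.nan, np.nan]),
    "task_b": np.array([np.nan, np.nan, np.nan, 0.0, 1.0, 1.0]),
}
params = init_params(n_features=8, n_hidden=4, task_names=tasks)
predictions = forward(params, x, tasks)
print(multitask_loss(predictions, labels))
```

Masking unknown labels is what lets datasets collected for different tasks be pooled into a single training set, since every sample still contributes to the shared layer through whichever task it is labeled for.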

Agnes Basseville, PhD

Omics and Data Science Unit / ICO
