Developing a deep learning algorithm for de novo variant detection in PacBio long-read sequencing data

Note: This position has been filled. Please see our Vacancies page for open vacancies.

Clinical problem

Many developmental disorders, such as intellectual disability, autism spectrum disorder, and multiple congenital anomalies, are known to be caused by de novo mutations (DNMs). Reliable identification of DNMs is therefore of paramount importance both for genetic testing and for research studies. Because the disorders in which DNMs play a major role are genetically heterogeneous, DNMs are typically identified from whole-exome sequencing (WES) or whole-genome sequencing (WGS) data. In our recent study, we developed DeNovoCNN, a deep-learning-based tool that identifies de novo variants in WES and WGS short-read data more accurately than existing tools. However, short-read sequencing is limited in its ability to identify variants in difficult regions. This limitation may contribute significantly to the diagnostic gap in patients who have undergone standard WES and WGS. Emerging long-read sequencing (LRS) technologies improve the characterization of genetic variation, including in regions that are difficult to assess, and LRS is therefore considered the genetics technology of the future. To take advantage of LRS, DeNovoCNN needs to be developed further.

Solution

In previous work, we showed that deep-learning algorithms can efficiently identify de novo variants in short-read sequencing data (WES and WGS). In this project, we will extend DeNovoCNN to detect de novo variants in long-read sequencing data.

Data

We have a set of 33 child-parent trios in-house, 8 of which have 80 fully validated de novo events per trio. Published datasets are also available (Genome in a Bottle consortium data, >1000 events; Noyes et al., AJHG, ~200 events). Additional datasets might be collected through collaboration with PacBio.
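To make the notion of a de novo event in a child-parent trio concrete, the sketch below flags sites where the child carries an alternate allele that neither parent has. It is a naive genotype-based illustration, not the DeNovoCNN method; the function name `is_candidate_de_novo`, the genotype encoding, and the example sites are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed encoding: a genotype is a pair of allele indices,
# where 0 is the reference allele and 1+ are alternate alleles.
Genotype = Tuple[int, int]
Site = Tuple[str, int]


def is_candidate_de_novo(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Return True if the child has an alternate allele absent from both parents."""
    parental_alleles = set(father) | set(mother)
    return any(allele != 0 and allele not in parental_alleles for allele in child)


# Hypothetical trio genotype calls, keyed by (chromosome, position).
trio_calls: Dict[Site, Dict[str, Genotype]] = {
    ("chr1", 12345): {"child": (0, 1), "father": (0, 0), "mother": (0, 0)},
    ("chr2", 67890): {"child": (0, 1), "father": (0, 1), "mother": (0, 0)},
}

# Keep only the sites where the child's alternate allele is not inherited.
candidates: List[Site] = [
    site
    for site, gt in trio_calls.items()
    if is_candidate_de_novo(gt["child"], gt["father"], gt["mother"])
]
print(candidates)
```

In practice such a filter yields many false positives caused by sequencing and genotyping errors, which is why validated de novo events like those described above are needed to train and evaluate a more reliable classifier.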

Results

Long-read sequencing is expected to replace short-read sequencing (SRS) within the next 10 years. LRS is already offered for research projects and is expected to become the standard for genetic diagnostics. The resulting tool will be integrated and supported alongside the existing DeNovoCNN by the genetics department (Production and Support team).

Embedding

You will be embedded in the department of Human Genetics at Radboudumc. We provide access to a GPU machine and research cluster.

Requirements

Proficiency in Python programming
Understanding of machine learning basics and deep learning architectures
Knowledge of bioinformatics is a plus

Information

Project duration: 6 months
Location: Radboud University Medical Center
For more information or to apply for this project, please contact Gelana.Khazeeva@radboudumc.nl.

People

Gelana Khazeeva

PhD student

Radboudumc

Christian Gilissen

Associate professor

Human Genetics, Radboudumc