Bita Khalili is a Senior Algorithm Researcher in our SOPHiA GENETICS Data Science team. She joined the team after completing her PhD in Physics and a post-doctoral research position in Bioinformatics. For the last two years, Bita has been analyzing NGS data at SOPHiA GENETICS and developing copy number variation (CNV) detection modules.
We invite you to spend a few moments with Bita to learn about the challenges associated with CNV detection and how the SOPHiA DDM CNV detection algorithm was developed to overcome these challenges.
Why is CNV detection important when analyzing next-generation sequencing data?
Next-generation sequencing (NGS) is a high-throughput technique that generates high-resolution genomic data, allowing simultaneous detection of many types of genomic variants, such as SNVs, Indels, and CNVs. CNVs are a type of structural variation in which DNA segments of one kilobase or larger are present at a variable copy number (duplications or deletions) compared to a reference genome. They have clinical and diagnostic relevance, as they have been associated with cancers and rare genetic disorders. Although microarray (or SNP-array) comparative genomic hybridization (aCGH) and multiplex ligation-dependent probe amplification (MLPA) are the gold standards for CNV detection, neither can detect small variants such as SNVs and Indels. The decreasing cost of NGS and the ability to detect multiple types of genomic alteration in a single run have encouraged the widespread use of NGS for CNV detection.
Why are CNVs generally difficult to detect using NGS?
CNVs are challenging to detect via targeted capture because the relationship between sequencing depth and copy number is affected by many sources of bias, including GC content, target region length, capture efficiency, amplification efficiency, DNA concentration, hybridization temperature, the nature of the capture, and batch effects. These biases result in coverage heterogeneity, even for diploid regions (copy number of 2), and must be accounted for to accurately infer copy number from coverage data.
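To make the confounding concrete, here is a minimal sketch of what happens if copy number is read directly from depth. The function name, the 200x expected depth, and the 0.5 capture-efficiency factor are all made-up values for illustration, not measurements from any real panel.

```python
def naive_copy_number(depth, diploid_depth):
    """Estimate copy number assuming depth scales linearly with copy number only."""
    return 2.0 * depth / diploid_depth


# Hypothetical expectation: a diploid region reaches 200x coverage.
expected_diploid_depth = 200.0

# Region A: truly diploid (2 copies), but captured at half the usual efficiency.
region_a_depth = expected_diploid_depth * 0.5 * (2 / 2)

# Region B: heterozygous deletion (1 copy), captured at full efficiency.
region_b_depth = expected_diploid_depth * 1.0 * (1 / 2)

cn_a = naive_copy_number(region_a_depth, expected_diploid_depth)
cn_b = naive_copy_number(region_b_depth, expected_diploid_depth)

# Both regions yield the same naive estimate, although only B carries a deletion.
print(cn_a, cn_b)
```

Without a correction for region-specific bias, the poorly captured diploid region and the real deletion are indistinguishable from depth alone.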
What challenges are associated with CNV detection in exome data?
In addition to the biases mentioned above, analyzing the human exome brings the added challenge that only the protein-coding regions (exons) are sequenced. This results in sparse coverage, as the targeted regions cover only about 1% of the whole genome. Because most of the genome is not covered, most CNV breakpoints fall outside the sequenced regions, leaving read depth as the only available source of information for CNV detection. Other challenges with detecting CNVs in exome data include the presence of many polymorphic regions, for which the normal copy number is already higher or lower than two, and the presence of homologous regions, which are problematic for short-read alignment.
How are CNVs detected using the SOPHiA DDM Platform?
CNV analysis by SOPHiA DDM™ is based on coverage analysis of the targeted regions. Our CNV algorithm automatically selects reference samples from among the samples within the same run to perform normalization. We apply a double normalization to account for both sample-specific and region-specific biases. CNVs spanning adjacent regions are then detected using a hidden Markov model (HMM). Additionally, the algorithm provides quality measures for each sample based on the residual noise.
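The general idea of a double normalization followed by HMM segmentation can be sketched as follows. This is an illustrative toy, not the SOPHiA DDM implementation: the median-based normalization, the three-state model, the Gaussian emissions, and the values of `sigma` and `stay` are all assumptions chosen for the example, and every sample in the run is used as a reference rather than an automatically selected subset.

```python
import math
from statistics import median


def double_normalize(coverage):
    """coverage[s][r] is the depth of sample s in region r (no zeros assumed).

    Returns log2 ratios after a sample-wise and then a region-wise correction.
    """
    # Step 1 (sample-specific bias): divide each sample by its own median depth.
    by_sample = [[d / median(row) for d in row] for row in coverage]
    # Step 2 (region-specific bias): divide each region by its median across samples.
    n_regions = len(coverage[0])
    region_median = [median(row[r] for row in by_sample) for r in range(n_regions)]
    return [[math.log2(row[r] / region_median[r]) for r in range(n_regions)]
            for row in by_sample]


# Assumed states: expected log2 ratio for 1, 2 and 3 copies.
STATES = {"deletion": math.log2(1 / 2), "normal": 0.0, "duplication": math.log2(3 / 2)}


def viterbi(log_ratios, sigma=0.15, stay=0.98):
    """Most likely state path for one sample, using Gaussian emissions."""
    names = list(STATES)
    switch = (1.0 - stay) / (len(names) - 1)

    def emit(x, mu):  # Gaussian log-likelihood, up to an additive constant
        return -((x - mu) ** 2) / (2 * sigma ** 2)

    def trans(a, b):  # a high `stay` favours CNVs that span adjacent regions
        return math.log(stay if a == b else switch)

    # Uniform prior over the starting state.
    score = {s: math.log(1 / len(names)) + emit(log_ratios[0], STATES[s]) for s in names}
    back = []
    for x in log_ratios[1:]:
        prev, pointers, score = score, {}, {}
        for s in names:
            best = max(names, key=lambda p: prev[p] + trans(p, s))
            pointers[s] = best
            score[s] = prev[best] + trans(best, s) + emit(x, STATES[s])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(names, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy run: 4 samples x 6 regions, with made-up sample depths and region biases.
sample_depth = [100.0, 200.0, 150.0, 120.0]
region_bias = [1.0, 0.5, 1.5, 0.8, 1.2, 1.0]
coverage = [[d * b for b in region_bias] for d in sample_depth]
# Simulate a heterozygous deletion in regions 2 and 3 of the first sample.
coverage[0][2] *= 0.5
coverage[0][3] *= 0.5

log_ratios = double_normalize(coverage)
calls = [viterbi(row) for row in log_ratios]
print(calls[0])
```

With these toy numbers, the two simulated regions in the first sample should be called as a deletion and everything else as normal. Note that on such a small panel the deletion itself pulls the sample median down slightly, which is one reason a real method needs many target regions and carefully chosen reference samples.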
What is the reasoning behind SOPHiA GENETICS’ approach?
Our normalization approach corrects for read-depth variations among regions by leveraging information from different samples in the same run. Assuming that all samples are processed in parallel, the double-normalization step corrects for all sources of targeted sequencing bias mentioned earlier. We also use our knowledge of the genome to curate target regions for each specific exome panel so that regions that would be problematic for our CNV detection algorithm are excluded, e.g., regions with systematically low coverage, high noise, or polymorphic or homologous regions.
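As a rough illustration of region curation, the sketch below drops regions with systematically low coverage, high noise across samples, or a prior annotation as problematic. The function name, the thresholds (`min_median_depth`, `max_cv`), and the use of the coefficient of variation as a noise measure are assumptions made for this example and do not describe the actual curation criteria.

```python
from statistics import mean, median, pstdev


def curate_regions(coverage, min_median_depth=50.0, max_cv=0.3, excluded=frozenset()):
    """Return the indices of regions kept for CNV calling.

    coverage[s][r] is the depth of sample s in region r.
    excluded holds region indices annotated as polymorphic or homologous.
    """
    # Normalize by each sample's median so the noise measure is not
    # dominated by differences in overall sequencing depth.
    normalized = [[d / median(row) for d in row] for row in coverage]
    kept = []
    for r in range(len(coverage[0])):
        raw = [row[r] for row in coverage]
        norm = [row[r] for row in normalized]
        if r in excluded:
            continue  # known problematic region
        if median(raw) < min_median_depth:
            continue  # systematically low coverage
        if pstdev(norm) / mean(norm) > max_cv:
            continue  # high noise across samples
        kept.append(r)
    return kept


# Toy data: 4 samples x 6 regions (made-up depths).
coverage = [
    [200, 10, 210, 190, 205, 195],
    [220, 12, 60, 225, 215, 230],
    [180, 8, 400, 175, 185, 170],
    [210, 11, 150, 205, 200, 215],
]
# Region 3 is treated as a hypothetical homologous region.
kept = curate_regions(coverage, excluded=frozenset({3}))
print(kept)
```

Here region 1 fails the coverage threshold, region 2 fails the noise threshold, and region 3 is removed by annotation, so only the remaining regions are passed on to CNV calling.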
What parameters does the exome sequencing panel need to achieve for good quality results?
Datasets with high coverage and low capture bias achieve high-quality results.
What resolution of CNVs can be achieved?
It depends on the exome panel, but with high-quality panels (good probe design) and high sequencing depth (~600x), we can achieve even single-exon resolution.
What sets SOPHiA GENETICS’ CNV-calling algorithm apart from others?
Four key features set the SOPHiA GENETICS CNV-calling algorithm apart from others. The algorithm…
- efficiently normalizes coverage without relying on predefined parameters
- uses a hidden Markov model with optimized parameters to call CNVs while considering CNV frequency and length
- provides quality measures for each sample
- curates target regions for each exome panel by excluding regions not appropriate for our CNV detection algorithm.