The Challenge of Accuracy in Next-Generation Sequencing

TAIPEI, TAIWAN, Nov. 30, 2021 - Next-Generation Sequencing (NGS) technologies sequence millions or billions of DNA strands in parallel, delivering far higher throughput than Sanger sequencing and minimizing the need to use it for whole genomes. The output of an NGS instrument is not, however, the complete linear genome sequence of the individual being analyzed. Instead, NGS yields billions of short sequences, known as reads, which analysis software must then piece back together.

Mapping aligns these short reads to a known human reference genome sequence. The individual's mapped sequences are then compared with the reference to find differences, which are called variants. These variants can be very small, such as single nucleotide variants (SNVs), or much larger structural variants up to the size of a chromosome. Many structural variants are associated with genetic diseases, but many are not. Recent research indicates that structural variants are more difficult to detect than single nucleotide polymorphisms (SNPs). Rapidly accumulating evidence indicates that structural variation can account for millions of nucleotides of heterogeneity within every genome and is likely to make an important contribution to human diversity and disease susceptibility.

Figure Resource from: Wikimedia Commons
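The comparison step described above can be illustrated with a minimal sketch. It assumes the read has already been aligned to the reference at a known offset and contains no insertions or deletions, which is a strong simplification of real variant calling. The function name find_snvs and the toy sequences are hypothetical and used only for illustration.

```python
def find_snvs(reference: str, read: str, offset: int) -> list[tuple[int, str, str]]:
    """Return (reference position, reference base, read base) for each mismatch.

    Assumes the read is already aligned at `offset` with no insertions
    or deletions; real variant callers also weigh base quality and coverage.
    """
    snvs = []
    for i, read_base in enumerate(read):
        ref_base = reference[offset + i]
        if read_base != ref_base:
            snvs.append((offset + i, ref_base, read_base))
    return snvs


# Toy data: the read differs from the reference at one position.
reference = "ACGTACGTTAGC"
read = "ACGATAG"  # aligned at offset 4; the reference there reads "ACGTTAG"

print(find_snvs(reference, read, offset=4))  # prints [(7, 'T', 'A')]
```

In practice a single mismatching read is not enough to call a variant, because it may simply be a sequencing error; many overlapping reads are compared before a difference is reported.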

NGS technologies can access as much as 80% to 90% of the genome, using reads of 50 to 250 base pairs with low per-base error rates of approximately 0.1%. However, the remaining 10% to 20% of the genome contains large repetitive structures. These are well known as difficult regions, and mapping short reads to them accurately is the biggest challenge.
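Why repetitive regions are hard to map can be shown with a small sketch. It uses exact string matching on a toy reference, whereas real aligners use indexed, error-tolerant search. The function name exact_match_positions and all sequences are hypothetical.

```python
def exact_match_positions(reference: str, read: str) -> list[int]:
    """Return every start position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference: a unique flank, three tandem copies of a repeat unit, another flank.
unique_flank = "GATTACAGGC"
repeat_unit = "ACGTTGCA"
reference = unique_flank + repeat_unit * 3 + "TTCCGGAA"

# A read that overlaps the unique flank has a single possible placement.
print(exact_match_positions(reference, "TACAGGCACG"))  # prints [3]

# A read that lies entirely inside the repeat fits equally well in three places.
print(exact_match_positions(reference, repeat_unit))  # prints [10, 18, 26]
```

The second read cannot be assigned to one location with any confidence, which is the situation a mapper faces whenever a short read falls entirely within a repeat.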

Most large genomes are filled with repetitive sequences. For instance, nearly half of the human genome is covered by repeats. Although some repeats appear to be nonfunctional, others have played a part in human evolution, at times creating novel functions, but also acting as independent, ‘selfish’ sequence elements. Repeats arise from a variety of biological mechanisms that result in extra copies of a sequence being produced and inserted into the genome.

Triplet repeat disorders, also known as microsatellite expansion diseases, are caused by a type of genetic mutation arising during DNA replication or repair, in which a group of three nucleotides is present in abnormally high numbers. Once the number of repeats reaches a certain threshold, the region becomes unstable and causes disease. Examples include myotonic dystrophy, fragile X syndrome, Huntington's disease, spinal and bulbar muscular atrophy, and several ataxias.
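Counting the copies of a triplet in a sequence can be sketched as follows. The function name longest_triplet_run, the default unit "CAG", and the toy sequence are illustrative assumptions. No clinical threshold is encoded here, and the sketch only works when the whole repeat is present in one contiguous sequence.

```python
import re


def longest_triplet_run(sequence: str, unit: str = "CAG") -> int:
    """Return the largest number of consecutive copies of `unit` in `sequence`."""
    # Match every maximal run of back-to-back copies of the unit.
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Toy sequence: a run of five CAG copies, an interruption, then two more copies.
sequence = "TTG" + "CAG" * 5 + "CCA" + "CAG" * 2 + "TT"

print(longest_triplet_run(sequence))  # prints 5
```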

Repeats come in all shapes and sizes. They may be spread widely across the genome or arranged in tandem, with anywhere from two copies to millions of copies, and they range in size from 1-2 bases to millions of bases. A tandem repeat is a sequence of two or more DNA base pairs that is repeated in such a way that the copies lie adjacent to each other on the chromosome. Tandem repeats are generally associated with non-coding DNA. Moreover, NGS reads are relatively short, up to 150-250 base pairs compared with up to 300-500 base pairs for Sanger sequencing, and are often insufficient to resolve complex structural variants and long insertions.

Figure: Types of structural variants. Unbalanced SVs are represented in the top two rows. (Figure Resource)
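The read-length limitation mentioned above comes down to simple arithmetic. To measure a tandem repeat directly, one read has to cover the whole repeat plus some unique sequence on each side so that it can be placed unambiguously. The sketch below assumes a flank of 10 bases per side, a value chosen purely for illustration. The function name read_can_span is hypothetical.

```python
def read_can_span(unit_len: int, copies: int, read_len: int, min_flank: int = 10) -> bool:
    """Check whether one read can cover a tandem repeat plus unique flanks.

    `min_flank` is an illustrative assumption, not a standard value.
    """
    repeat_len = unit_len * copies
    return read_len >= repeat_len + 2 * min_flank


# 40 copies of a 3-base unit: 120 + 20 = 140 bases, within a 150-base read.
print(read_can_span(unit_len=3, copies=40, read_len=150))  # prints True

# 100 copies of a 3-base unit: 300 + 20 = 320 bases, beyond a 250-base read.
print(read_can_span(unit_len=3, copies=100, read_len=250))  # prints False
```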

The analytic validity of NGS is high for selected regions of the human genome. It is important for physicians to understand both the strengths and the limitations of sequencing technologies. Long-read sequencing technologies are increasingly being applied to resolve large structural variants and complex sequences. For NGS, researchers have developed stringent methods to ensure the quality of the analytical data and to improve characterization of these difficult regions. As advances in technology make Whole Genome Sequencing faster, more accessible and more cost-efficient, scientists are seeing tremendous potential for applications in disease diagnosis as well as in therapeutics and drug development. Next time we will discuss how scientists tackle the problems that repeats in difficult regions pose for NGS.


[1] Challenges of Accuracy in Germline Clinical Sequencing Data

[2] Structural variation in the human genome

[3] Best practices for variant calling in clinical sequencing

[4] Tandem Repeat

[5] Repetitive DNA and next-generation sequencing: computational challenges and solutions

[6] LoQus23 Exits Stealth Mode to Target Huntington’s and Other Triplet Repeat Diseases

About WASAI Technology Inc.

WASAI Technology's mission is to deliver acceleration technologies of High-Performance Data Analysis (HPDA) in future data centers for targeted vertical applications with massive volumes and high velocities of scientific data. To strengthen and advance scientific discovery and technological research via big data-intensive acceleration in high-performance computing, WASAI Technology aims to improve commercialization and commoditization of scientific and technological applications.


WASAI Technology Inc.

4F, No. 6, Zhiyuan 3rd Rd., Beitou Dist., Taipei 112025, Taiwan