An Introduction to the Principles of Three Generations of Genome Sequencing Technology

Abstract: Since the first generation of DNA sequencing technology (the Sanger method) [1] was developed in 1977, sequencing technology has advanced considerably over more than three decades, from the first generation to the third and even the fourth. Read lengths have gone from long to short, and then from short back to long. Although second-generation short-read sequencing still holds an absolutely dominant position in the global sequencing market, third- and fourth-generation technologies have also developed rapidly in the past one or two years. Every change in sequencing technology has had a huge impact on genome research, medical research, drug discovery, breeding, and other fields. Here I give a brief summary of current sequencing technologies and their principles.

Rapidly obtaining the genetic information of living organisms is of great significance to life science research. Figure 1 shows the development of sequencing technology since Watson and Crick established the double-helix structure of DNA in 1953.

First Generation Sequencing Technology

The first generation of DNA sequencing technology used either the chain-termination method, pioneered by Sanger and Coulson in 1975, or the chemical method (chain degradation), invented by Maxam and Gilbert in 1976-1977. In 1977, Sanger determined the first genome sequence, that of phage φX174, with a total length of 5375 bases [1]. From then on, humanity gained the ability to look into the nature of genetic differences in living things, and stepped into the era of genomics. Researchers have continued to improve the Sanger method over the years, and the first human genome draft, published in 2001, was completed with an improved Sanger method. The core principle of the Sanger method is that a ddNTP has no hydroxyl group at either the 2' or the 3' position, so it cannot form a phosphodiester bond with the next nucleotide during DNA synthesis and therefore terminates the synthesis reaction. A certain proportion of radioisotope-labeled ddNTP (ddATP, ddCTP, ddGTP, or ddTTP, one per reaction) is added to each of four DNA synthesis reaction systems. After gel electrophoresis and autoradiography, the DNA sequence of the molecule under test can be read from the positions of the electrophoretic bands (Figure 2).
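The chain-termination idea can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function names, the 10% ddNTP fraction, and the number of molecules are invented for illustration, and the input string stands for the strand being synthesized rather than its template.

```python
import random

def run_reactions(synthesized, n_molecules=2000, dd_fraction=0.1, seed=0):
    """Simulate four chain-termination reactions, one per ddNTP.

    Returns a dict mapping each base to the set of fragment lengths
    (band positions) seen in that reaction's gel lane.
    """
    rng = random.Random(seed)
    lanes = {base: set() for base in "ACGT"}
    for lane_base in "ACGT":
        for _ in range(n_molecules):
            for length, base in enumerate(synthesized, start=1):
                # Only the ddNTP present in this reaction can terminate,
                # and only with a small probability per incorporation.
                if base == lane_base and rng.random() < dd_fraction:
                    lanes[lane_base].add(length)
                    break
    return lanes

def read_gel(lanes):
    """Read the bands from the shortest fragment to the longest."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))

lanes = run_reactions("GATTACAGGCTA")
print(sorted(lanes["T"]))   # band positions in the ddTTP lane
print(read_gel(lanes))      # sequence read off the four lanes
```

Each lane contains only fragments that end in one base, so ordering all bands by length recovers the sequence one position at a time.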

It is worth noting that in the early development of sequencing technology, other methods appeared alongside the Sanger method, such as pyrosequencing and the ligase method. Pyrosequencing is the method later used by Roche's 454 technology [2-4], and ligase sequencing is the method later used by ABI's SOLiD technology [2,4]. Their shared core technique, however, is the same as in the Sanger method [1]: the use of dNTPs that can interrupt the DNA synthesis reaction.

Second Generation Sequencing Technology

Overall, the main features of first-generation sequencing technology are read lengths of up to 1000 bp and accuracy of up to 99.999%, but its high cost and low throughput have seriously limited its large-scale application. First-generation sequencing is therefore not the ideal sequencing method. After continued technological development and improvement, second-generation sequencing was born, represented by Roche's 454 technology, Illumina's Solexa and HiSeq technology, and ABI's SOLiD technology. Second-generation sequencing greatly reduces the cost of sequencing while dramatically increasing its speed and maintaining high accuracy. Sequencing a human genome used to take three years, whereas second-generation technology needs only one week; its read lengths, however, are much shorter than those of first-generation technology. Table 1 and Figure 3 give a brief comparison of the characteristics and sequencing costs of first- and second-generation technologies [5]. Below, I briefly describe the main principles and characteristics of these three major second-generation technologies.
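Accuracy figures such as 99.999% are easier to compare with the error rates quoted later when both are put on one scale. The sketch below assumes the widely used Phred convention, Q = -10 log10(p), which this article does not itself use; the function names are illustrative.

```python
import math

def phred_quality(accuracy):
    """Convert a per-base accuracy (0-1) to a Phred-scaled quality score."""
    error = 1.0 - accuracy
    return -10.0 * math.log10(error)

def error_probability(q):
    """Convert a Phred quality score back to a per-base error probability."""
    return 10.0 ** (-q / 10.0)

print(round(phred_quality(0.99999)))   # 99.999% accuracy is about Q50
print(error_probability(20))           # Q20 corresponds to a 1% error rate
```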

Illumina

Illumina's Solexa and HiSeq are currently the most widely used second-generation sequencers in the world, and the core principles of the two series are the same [2,4]. Both use sequencing by synthesis, and the sequencing process is divided into the following 4 steps, as shown in Figure 4.

(1) Library construction for the DNA to be tested

The DNA sample is sheared into small fragments by sonication. At present, apart from assembly and some other special applications, it is mainly sheared into fragments 200-500 bp long. Different adapters are added to the two ends of these fragments to build a single-stranded DNA library.

(2) Flowcell

The flowcell is the set of channels that captures the flowing DNA fragments. Once the library has been built, its DNA attaches at random to the surface of the channels as it passes through the flowcell. Each flowcell has 8 channels, and the surface of each channel carries many attached adapters. These adapters pair with the adapters added to the ends of the DNA fragments during library construction (which is why the flowcell can capture the library DNA), and they support bridge PCR amplification of the DNA on the surface.

(3) Bridge PCR amplification and denaturation

Bridge PCR uses the adapters immobilized on the flowcell surface to carry out bridge amplification, as shown in Figure 4.a. After repeated cycles of amplification and denaturation, each DNA fragment ends up concentrated in a cluster at its own location, and each cluster contains many copies of a single DNA template. The purpose of this step is to amplify the signal intensity of each base to the level required for sequencing.

(4) Sequencing

Sequencing uses the sequencing-by-synthesis method. DNA polymerase, adapter primers, and 4 dNTPs carrying base-specific fluorescent labels are added to the reaction system at the same time (as in the Sanger method). The 3'-OH of these dNTPs is chemically protected, so only one dNTP can be added per cycle. After a dNTP has been added to the growing strand, all unused free dNTPs and the DNA polymerase are washed away. Next, the buffer required for fluorescence excitation is added, the fluorescent signal is excited with a laser and recorded by optical equipment, and computer analysis converts the optical signal into a base call. After the signal has been recorded, chemical reagents are added to quench the fluorescence and remove the 3'-OH protecting group, so that the next cycle can proceed. Because Illumina's technology adds only one dNTP per cycle, it handles homopolymer lengths accurately. Its main source of sequencing error is base substitution, and the current error rate is between 1% and 1.5%. For human genome resequencing at 30x depth, the sequencing run takes about 1 week.
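The cycle described above (incorporate one blocked dNTP, image, cleave) can be sketched in a few lines. This is an idealised illustration, not Illumina's base-calling software: the dye colours are hypothetical, and the template is given in the order in which the polymerase reads it.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Hypothetical dye assignment, for illustration only.
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_DYE = {dye: base for base, dye in DYE.items()}

def sequence_by_synthesis(template, n_cycles):
    """Return the colour imaged in each cycle of an error-free run."""
    colours = []
    for cycle in range(min(n_cycles, len(template))):
        # The blocked 3'-OH allows exactly one incorporation per cycle,
        # even inside a homopolymer run.
        incorporated = COMPLEMENT[template[cycle]]
        colours.append(DYE[incorporated])  # imaging step
        # Cleavage step: dye and 3' block are removed before the next cycle.
    return colours

def call_bases(colours):
    """Translate the recorded colours back into bases."""
    return "".join(BASE_FOR_DYE[colour] for colour in colours)

colours = sequence_by_synthesis("AAAAGC", n_cycles=6)
print(call_bases(colours))  # TTTTCG
```

The four consecutive A's in the template produce four separate cycles, each with its own image, which is why homopolymer length is not a problem for this chemistry.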

Roche 454

The Roche 454 sequencing system was the first commercial second-generation sequencing platform. Its main sequencing principles are as follows (Figure 5a-c) [2]:

(1) DNA library preparation

Library construction for the 454 system differs from Illumina's. The DNA to be tested is broken into small fragments of 300-800 bp by nebulization and different adapters are added to the two ends of the fragments; alternatively, the DNA is denatured, amplified by PCR with hybridization primers, and ligated to a vector. Either way, the result is a single-stranded DNA library (Figure 5a).

(2) Emulsion PCR (in essence, a distinctive process of injecting water into oil)

The DNA amplification process of 454 is of course also very different from Illumina's. The single-stranded DNA fragments are bound to magnetic beads about 28 µm in diameter, enclosed in water-in-oil droplets, where they are incubated and annealed.

The most important feature of emulsion PCR is that it creates a huge number of separate reaction compartments for DNA amplification. The key technique is "water into oil" (a water-in-oil emulsion). Before the PCR reaction, the aqueous solution containing all the PCR components is injected onto the surface of mineral oil spinning at high speed, and the aqueous solution instantly breaks up into countless small droplets enclosed by mineral oil. Each droplet then forms an independent PCR reaction compartment. Ideally, each droplet contains exactly one DNA template and one magnetic bead.
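How often a droplet really ends up with exactly one template can be estimated with a simple model. The sketch below assumes that templates are distributed among droplets at random and independently, so that the number per droplet follows a Poisson distribution; this model and the example loading values are assumptions, not figures from the 454 protocol.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k templates in a droplet."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def droplet_fractions(mean_templates):
    """Fractions of droplets with zero, one, or more than one template."""
    empty = poisson_pmf(0, mean_templates)
    single = poisson_pmf(1, mean_templates)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}

for mean in (0.1, 0.5, 1.0):
    fractions = droplet_fractions(mean)
    print(mean, {name: round(value, 3) for name, value in fractions.items()})
```

Under this model, diluting the library lowers the share of droplets with mixed templates at the price of more empty droplets, which is one way to see why the one-template-per-droplet case is described as the ideal.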

The surface of the beads enclosed in these droplets carries DNA sequences complementary to the adapters, so the single-stranded DNA fragments bind specifically to the beads. The incubation system also contains the PCR reagents, which ensures that each fragment bound to a bead is amplified independently and that the amplification products also bind to the bead. When the reaction is complete, the emulsion can be broken and the DNA-carrying beads enriched. Each fragment is amplified about 1 million times, which yields the amount of DNA required for the next sequencing step.

(3) Pyrosequencing

Before sequencing, the DNA-carrying beads are treated with polymerase and single-stranded binding protein, and then placed on a PTP plate. The plate is manufactured with many wells about 44 µm in diameter, each of which can hold only one bead. This fixes the position of each bead so that the subsequent sequencing reaction can be detected.

Sequencing uses the pyrosequencing method. The beads, which are smaller in diameter than the wells of the PTP plate, are loaded into the wells and the sequencing reaction begins. The reaction uses the single-stranded DNA amplified in large quantities on the beads as the template, and one type of dNTP is added per reaction for synthesis. If the dNTP pairs with the sequence under test, a pyrophosphate group is released on incorporation. The released pyrophosphate is converted into ATP by the ATP sulfurylase in the reaction system. The ATP, together with luciferase, oxidizes the luciferin in the reaction and light is emitted, which is recorded by the CCD camera on the other side of the PTP plate and then processed by computer to obtain the final sequencing result. Because only one type of dNTP is supplied in each reaction, the base incorporated is known from which dNTP was flowed when the light signal appeared, and the sequence of the molecule under test can be read off accordingly. At the end of the reaction, free dNTPs and ATP are degraded by apyrase, which quenches the light so that the sequencing reaction can move on to the next cycle. Because each sequencing reaction in 454 takes place in its own well on the PTP plate, interference and sequencing bias are greatly reduced.

454 technology's greatest advantage is its long read length, which currently averages up to 400 bp. Unlike Illumina's Solexa and HiSeq technologies, however, one of its main drawbacks is that it cannot accurately measure the length of homopolymers. For example, when the sequence contains a poly(A) run, the sequencing reaction incorporates multiple T's at once, and the number of T's incorporated can only be estimated from the light intensity, which may give inaccurate results. For this reason, 454 technology introduces insertion and deletion errors during sequencing.
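The homopolymer problem is easiest to see in an idealised flowgram, in which the light produced in each flow is proportional to the number of bases incorporated. This is a sketch with assumed details: the flow order "TACG", the function names, and the noisy intensity value are illustrative only, and the template is given in the order in which it is read.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def flowgram(template, flow_order="TACG", n_flows=16):
    """Return (dNTP, light intensity) for each flow of an idealised run."""
    signals = []
    pos = 0
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        # Every consecutive template base that pairs with this dNTP is
        # extended in the same flow, so light scales with the run length.
        while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
            incorporated += 1
            pos += 1
        signals.append((nucleotide, incorporated))
    return signals

def call_sequence(signals):
    """Round each intensity to a whole number of incorporated bases."""
    return "".join(nuc * round(intensity) for nuc, intensity in signals)

signals = flowgram("AAAAGC", n_flows=4)
print(signals)                        # [('T', 4), ('A', 0), ('C', 1), ('G', 1)]
print(call_sequence(signals))         # TTTTCG
print(call_sequence([("T", 4.6)]))    # TTTTT: a noisy signal becomes an insertion
```

A measurement error of 0.6 units is harmless for a single base but changes the call for a run of four, which is how insertions and deletions arise.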

SOLiD Technology

SOLiD is a sequencing instrument that ABI began putting into commercial use in 2007. It is based on the ligase method, that is, it uses DNA ligase and sequences during ligation (Figure 6) [2,4]. Its principle is as follows:

(1) DNA library construction

The DNA is fragmented, sequencing adapters are added to both ends of the fragments, and the fragments are ligated into vectors to construct a single-stranded DNA library.

(2) Emulsion PCR

The PCR process for SOLiD is similar to that of 454 and also uses small-droplet emulsion PCR, but the beads are much smaller than those of the 454 system, at only 1 µm. The 3' end of the amplification product is modified during amplification in preparation for the sequencing step, and the 3'-modified beads are then deposited on a slide. During bead loading, the deposition chamber divides each slide into 1, 4, or 8 sequencing regions (Figure 6-a).

The biggest advantage of the SOLiD system is that each slide can hold a higher density of beads than the 454 system, which makes it easy to achieve higher throughput within the same system.

(3) Ligase sequencing

This step is unique to SOLiD sequencing. Instead of the DNA polymerase commonly used in earlier sequencing methods, it uses ligase. The substrate of the SOLiD ligation reaction is a mixture of 8-base single-stranded fluorescent probes, written here simply as 3'-XXnnnzzz-5'. During ligation, these probes pair with the single-stranded DNA template according to the rules of base complementarity. The 5' ends of the probes are labeled with one of four fluorescent dyes: CY5, Texas Red, CY3, and 6-FAM (Figure 6-a). In each 8-base probe, the bases at positions 1 and 2 (XX) are defined, and a different fluorescent label is attached at positions 6-8 (zzz) depending on which bases they are. This is SOLiD's distinctive approach: two bases determine one fluorescent signal, which amounts to interrogating two bases at a time, so the method is also called two-base sequencing. When a probe pairs with the template and is ligated, it emits the fluorescent signal that represents bases 1 and 2. The color code charts in Figures 6-a and 6-b show how the different combinations of bases 1 and 2 map to fluorescent colors. After the signal has been recorded, the probe is chemically cleaved between bases 5 and 6, which removes the fluorescent label so that the next position can be sequenced. Note that with this method, successive interrogated positions are 5 bases apart: the first ligation reads positions 1 and 2, the second reads positions 6 and 7, and so on.

After sequencing reaches the end, the newly synthesized strand is denatured and washed off. A second round of sequencing is then performed with primer n-1. Primer n-1 differs from primer n in that the position at which it pairs with the adapter is offset by one base (Figure 6-a.8). In other words, primer n-1 shifts the sequencing start by one base towards the 3' end relative to primer n, so that positions 0 and 1, positions 5 and 6, and so on are determined. After the second round, the process is repeated up to a fifth round, by which point the bases at all positions have been read and each position has been interrogated twice. The read length of this technology is 2 × 50 bp, and the downstream sequence assembly is correspondingly complex. Thanks to the double interrogation, the raw accuracy of this technology is as high as 99.94%, and the accuracy at 15x coverage reaches 99.999%, the highest among current second-generation technologies. In the color decoding stage, however, because each fluorescent signal is determined by two bases, a single error easily causes a chain of decoding errors.
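The chain of decoding errors mentioned above can be reproduced with a small color-space encoder and decoder. The article does not spell out the color table, so this sketch assumes the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes; colors are shown as numbers 0-3 rather than dye names.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3

def encode(sequence):
    """Encode a sequence as its first base plus one colour per adjacent pair."""
    codes = [BASES.index(base) for base in sequence]
    colours = [a ^ b for a, b in zip(codes, codes[1:])]
    return sequence[0], colours

def decode(first_base, colours):
    """Rebuild the bases by walking the colours from the known first base."""
    code = BASES.index(first_base)
    decoded = [first_base]
    for colour in colours:
        code ^= colour  # each base depends on the previous decoded base
        decoded.append(BASES[code])
    return "".join(decoded)

first, colours = encode("ATGGC")
print(colours)                         # [3, 1, 0, 3]
print(decode(first, colours))          # ATGGC

# One miscalled colour (the second one) corrupts every later base.
print(decode(first, [3, 2, 0, 3]))     # ATCCG
```

Because decoding is sequential, a single wrong color shifts every base after it, whereas a true single-base variant changes two adjacent colors; the double interrogation of each position is what lets the two cases be told apart.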

Third Generation Sequencing Technology

Sequencing technology has reached a new milestone in the past two or three years. PacBio's SMRT and Oxford Nanopore Technologies' nanopore single-molecule sequencing are known as third-generation sequencing technologies. Compared with the previous two generations, their defining feature is single-molecule sequencing: the sequencing process needs no PCR amplification.

PacBio SMRT technology also applies the idea of sequencing by synthesis [5], with the SMRT chip as the sequencing carrier. The basic principle is as follows: DNA polymerase binds to the template, and the 4 bases (that is, the dNTPs) are labeled with 4 fluorescent colors. During base pairing, each type of base emits a different light when it is added, and the type of the incoming base can be determined from the wavelength and peak of the light. The DNA polymerase is one of the keys to achieving ultra-long reads: read length depends mainly on how long the enzyme stays active, which in turn is mainly limited by laser-induced damage.

Another key to PacBio SMRT technology is distinguishing the reaction signal from the strong fluorescent background of the surrounding free bases. It uses the ZMW (zero-mode waveguide) principle, which can be seen in the many dense small holes in the wall of a microwave oven. If the hole diameter is larger than the microwave wavelength, energy passes through the panel by diffraction and leaks out, interfering with neighboring holes. If the aperture is smaller than the wavelength, the energy does not radiate outward but keeps travelling in a straight line (the principle of light diffraction), so the panel acts as a shield. Similarly, a reaction chamber (the SMRT Cell, for single-molecule real-time) contains many such small circular nanoholes, the ZMWs. Their outer diameter is a little over 100 nm, smaller than the wavelength of the detection laser (several hundred nanometers), so laser light entering from the bottom cannot pass through the hole into the solution above. The energy is confined to a small volume (20 × 10^-21 L), just enough to cover the region to be detected, so the signal comes only from this small reaction zone, and the excess free nucleotide monomers outside the hole remain in the dark, which reduces the background to a minimum.

In addition, some base modifications can be detected from the time between two adjacent base incorporations: if a base is modified, the polymerase slows down as it passes, the distance between the two adjacent peaks increases, and modifications such as methylation can be detected in this way (Figure 7). SMRT technology sequences quickly, at about 10 dNTPs per second. At the same time, its error rate is relatively high, reaching 15% (a problem shared by almost all current single-molecule sequencing technologies). Fortunately, its errors are random and show no systematic bias of the kind seen in second-generation technologies, so they can be corrected effectively by sequencing multiple times.
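Why random errors can be corrected by repeated sequencing follows from a short binomial calculation. The sketch below takes the 15% per-read error rate quoted above and makes two simplifying assumptions of its own: errors in different passes are independent, and a position is called wrongly only when more than half of the passes are wrong there (insertions and deletions are ignored).

```python
from math import comb

def majority_error(per_read_error, n_passes):
    """Probability that more than half of n independent passes are wrong."""
    threshold = n_passes // 2 + 1
    return sum(
        comb(n_passes, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_passes - k)
        for k in range(threshold, n_passes + 1)
    )

for n_passes in (1, 3, 5, 9, 15):
    print(n_passes, majority_error(0.15, n_passes))
```

With three passes the consensus error falls from 15% to about 6%, and it keeps falling as passes are added. A systematic error, by contrast, would recur in every pass and survive the vote.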

The nanopore single-molecule sequencing technology developed by Oxford Nanopore Technologies differs from all earlier sequencing technologies in that it is based on electrical rather than optical signals [5]. One key to the technology is a specially designed nanopore with a molecular adapter covalently bound inside the pore. As DNA bases pass through the nanopore, they change the charge and briefly affect the strength of the current flowing through the pore (each base changes the current by a different amount). Sensitive electronics detect these changes and so identify the base passing through (Figure 8).
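The idea that each base shifts the current by a characteristic amount can be illustrated with a toy classifier that assigns every measurement to the nearest reference level. The current values below are invented for illustration and are not measured data, and the one-level-per-base model is a deliberate simplification of the description above, not the vendor's base-calling method.

```python
# Hypothetical mean current levels in picoamperes; not measured values.
LEVELS_PA = {"A": 52.0, "C": 46.0, "G": 58.0, "T": 40.0}

def call_base(current_pa):
    """Pick the base whose reference level is closest to the measurement."""
    return min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - current_pa))

def call_read(trace_pa):
    """Call one base per current measurement in the trace."""
    return "".join(call_base(value) for value in trace_pa)

print(call_read([51.2, 41.0, 57.5, 46.3]))  # ATGC
```

The closer together the reference levels are relative to the measurement noise, the more often a measurement lands nearer the wrong level, which gives an intuition for the error rates of single-molecule methods.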

The company introduced the first commercial nanopore sequencer at last year's Advances in Genome Biology and Technology (AGBT) annual meeting, which attracted a great deal of attention from the scientific community. Nanopore sequencing (and other third-generation sequencing technologies) is expected to address the shortcomings of current sequencing platforms. The key features of nanopore sequencing are: long read lengths, on the order of tens of kilobases (kb) or even 100 kb; an error rate currently between 1% and 4%, with errors that are random rather than clustered at the ends of reads; data that can be read in real time; high throughput (a 30x human genome is expected to be completed in one day); starting DNA that is not destroyed during sequencing; and sample preparation that is simple and inexpensive. In theory, it can also sequence RNA directly.

Nanopore single-molecule sequencing has another great feature: it can read methylated cytosines directly, without the bisulfite treatment of the genome required by traditional methods. This is extremely helpful for studying epigenetic phenomena directly at the genome level. Moreover, the sequencing accuracy of an improved version of the method can reach 99.8%, and sequencing errors, once found, are easy to correct. However, there appear to be no reports yet of this technology being applied.

Other Sequencing Technologies

There is also a revolutionary new generation of sequencing technology based on semiconductor chips, Ion Torrent [6]. It uses a high-density semiconductor chip covered with small wells, each of which is a sequencing reaction cell. When DNA polymerase adds a nucleotide to the growing DNA strand, a hydrogen ion is released, which changes the pH in the well. An ion sensor beneath the well detects the H+ signal, which is converted directly into a digital signal from which the DNA sequence is read (Figure 9). The inventor of this technology, Jonathan Rothberg, was also one of the inventors of 454 sequencing. Its library and sample preparation are very similar to those of 454, and it can even be described as a replica of 454, except that sequencing detects changes in the H+ signal rather than the light produced from pyrophosphate. Compared with other sequencing technologies, Ion Torrent needs no expensive optical imaging equipment, so it is relatively cheap, compact, and much simpler to operate. It is also fast: apart from the 2 days needed for library preparation, the entire sequencing run completes in 2-3.5 hours. The throughput of a whole chip is not high, however, currently about 10 Gb, but it is well suited to small-genome sequencing and exon validation.
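A quick depth calculation shows why a chip of this throughput suits small genomes. The sketch uses the simple relation depth = total sequenced bases / genome size; the genome sizes are approximate values assumed here for illustration and do not come from the article.

```python
HUMAN_GENOME_BP = 3.1e9    # assumed approximate human genome size
E_COLI_GENOME_BP = 4.6e6   # assumed approximate E. coli genome size

def coverage(total_bases, genome_size):
    """Average sequencing depth obtained from a given amount of data."""
    return total_bases / genome_size

def bases_needed(depth, genome_size):
    """Amount of data required to reach a target average depth."""
    return depth * genome_size

chip_output = 10e9  # about 10 Gb per chip, as quoted above
print(round(coverage(chip_output, HUMAN_GENOME_BP), 1))   # roughly 3x for a human genome
print(round(coverage(chip_output, E_COLI_GENOME_BP)))     # thousands-fold for a bacterium
print(bases_needed(30, HUMAN_GENOME_BP) / chip_output)    # chips needed for a 30x human genome
```

Under these assumptions one chip gives only about 3x depth on a human genome, far below the 30x used for resequencing, but far more than enough for a bacterial genome.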

Summary

The principles of each generation of sequencing technology have been briefly explained above, and Tables 1 and 2 below summarize a comparison of the characteristics of these three generations. Sequencing cost, read length, and throughput are three important indicators for evaluating how advanced a sequencing technology is. Apart from the differences in throughput and cost, the core principles of first- and second-generation sequencing (except for SOLiD, which sequences by ligation) are all based on sequencing by synthesis. The advantages of second-generation sequencing are greatly reduced cost and greatly increased throughput. Its disadvantages are that the PCR step it introduces raises the sequencing error rate to some extent and brings systematic bias, and that its read lengths are shorter. Third-generation sequencing was developed to address these shortcomings. Its fundamental feature is single-molecule sequencing with no PCR step at all, which is intended to avoid the systematic errors caused by PCR bias and to increase read length, while keeping the high throughput and low cost of second-generation technology.

Table 1: Comparison of Sequencing Technologies

Table 2: Comparison of the Sequencing Costs of Mainstream Sequencing Machines

Figure 10 below shows the current distribution of sequencing machines around the world. The hotspots in the figure are mainly located in Shenzhen, China (mainly BGI), Southern Europe, Western Europe, and the United States.

References

Original link: http://www.huangshujia.me/2013/08/02/2013-08-02-An-Introduction-of-NGS-Sequence.html