Unlike first-generation sequencing, NGS adopts the strategy of sequencing by synthesis. The main technical routes are represented by Roche's 454 technology, Illumina's Solexa and HiSeq technology, and ABI's SOLiD technology. To improve sequencing accuracy, multiple copies of the same template are amplified by PCR so that deviations can be corrected. The whole workflow is therefore divided into two steps: PCR amplification (a technique that quickly replicates a large number of identical DNA fragments) and sequencing. However, PCR increases the error rate of the system to some extent, and the errors it introduces are biased, which is one of the problems of second-generation technology.
Illumina's main products include the MiSeq sequencer, the HiSeq X Ten sequencer, the MiSeq FGx sequencer, the NextSeq 500/550 desktop sequencer and the MiniSeq desktop sequencer, covering the needs of different application scenarios.
Second-generation sequencing still faces considerable obstacles to meeting higher research needs and to wider use in medical diagnosis: platform and sequencing cost, turnaround time, experimental difficulties such as library construction, error rate, read length (150-400 bp), and analysis workload. The errors and biases introduced by PCR may hinder its large-scale application in medical diagnosis. Third-generation technology mainly addresses the short read length of the second generation.
PacBio's SMRT technology, Life Technologies' Ion Torrent semiconductor sequencing and Oxford Nanopore's single-molecule nanopore sequencing are the representatives of third-generation sequencing technology.
PacBio SMRT
PacBio's SMRT still uses sequencing by synthesis, but its highly active DNA polymerase is the key to achieving very long reads (~1000 bp). The reaction takes place in nanoscale wells, which makes ultra-high throughput practical. The zero-mode waveguide (ZMW) principle is used to separate the fluorescence signal from the background inside these ultra-small wells. Sequencing is very fast, about 10 dNTPs per second. The current problem is accuracy: raw reads are only about 81-83% accurate, a high error rate that most third-generation technologies still need to solve. However, the errors are random and almost unbiased, which makes it possible to reduce the error rate through correction. The technology is already on the market.
Oxford Nanopore Technologies
The MinION sequencer uses single-molecule nanopore technology, which reads electrical signals and is an innovation compared with sequencing technologies based on optical signals. Its core is a special nanopore with a molecular adapter, formed by embedding a protein pore in an artificial membrane. A voltage applied across the membrane drives a current through the pore. As DNA bases pass through the nanopore, they partially block the current and briefly change its intensity. Different bases affect the current to different degrees, and sensitive electronics capture these differences to identify the base passing through. The technology has many advantages: long reads (tens of kb, even 100 kb), errors that are random rather than clustered at the ends of reads, and high throughput. The company has also tried to simplify sample preparation. In theory, RNA can be sequenced directly and methylated cytosine can be detected with this technique. However, the error rate is not yet under ideal control, which may become an obstacle to its entry into the market.
Life Technologies Ion Torrent
Ion Torrent uses a semiconductor chip and fixes DNA strands in the microwells of the chip. During DNA synthesis, the nucleotides A, G, C and T are added in turn; when the added nucleotide pairs with the template strand and is incorporated, a hydrogen ion is released. The hydrogen ions change the local pH. An ion sensor detects the pH change and converts the chemical signal into sequence information. If the DNA strand has two consecutive identical bases, the recorded signal doubles and can be recognized. If the nucleotide does not match, no change is recorded. Because the technology involves no fluorescence excitation or imaging, the run time is greatly shortened (only a few hours), and no laser light source, optical system, imaging system or fluorescent labeling is needed, which avoids the errors those steps introduce. However, its reads are not very long (about 200 bp), and long runs of consecutive identical bases produce strong pH changes that introduce errors.
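The flow-by-flow readout described above can be sketched in a few lines of Python. This is only an idealized illustration, not the vendor's base-calling algorithm: the function names are hypothetical, the flow order "AGCT" is taken from the text, and the signal is assumed to be exactly proportional to the number of bases incorporated, which real instruments only approximate.

```python
def flow_signals(synthesized, flow_order="AGCT"):
    """Idealized signal per nucleotide flow for the strand being synthesized.

    A signal of 0 means the flowed nucleotide did not match, 1 means one
    base was incorporated, 2 means two identical bases in a row, and so on.
    """
    if set(synthesized) - set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")
    signals = []
    pos = 0
    flow = 0
    while pos < len(synthesized):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive copy of the flowed base is incorporated in one flow.
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flow += 1
    return signals


def decode(signals):
    """Rebuild the sequence from (nucleotide, signal) pairs."""
    return "".join(base * count for base, count in signals)


signals = flow_signals("AAGT")
print(signals)          # [('A', 2), ('G', 1), ('C', 0), ('T', 1)]
print(decode(signals))  # AAGT
```

The sketch also shows where homopolymer errors come from: the difference between a signal of 7 and a signal of 8 is relatively much smaller than the difference between 1 and 2, so long runs are harder to count exactly.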
De novo sequencing, also called sequencing from scratch: a species is sequenced without relying on any existing sequence data, and bioinformatics analysis is used to splice and assemble the reads, yielding the genome map of the species.
Exome sequencing is a genome analysis method in which the exonic DNA of the whole genome is captured and enriched by sequence capture technology and then sequenced at high throughput. Exome sequencing is cheaper than genome resequencing and is well suited to studying SNPs and indels in known genes, but it cannot be used to study structural variation of the genome such as chromosome breakage and recombination.
Metagenomics studies the whole microbial community. Compared with traditional research on single organisms, it has many advantages, two of which are especially important: (1) microorganisms usually live in a niche as communities, and many of their characteristics depend on the interaction between the community environment and the individuals, so metagenomic research can reveal those characteristics better than studies of individuals; (2) metagenomics does not require isolating individual bacteria, so it can study microorganisms that cannot be isolated and cultured in the laboratory.
Single nucleotide polymorphism (SNP), or single nucleotide variation (SNV): a polymorphism caused by variation of a single nucleotide (substitution, insertion or deletion) at the same position in the genomic DNA sequence of different individuals. In other words, the nucleotide at a given position differs between species or individuals. Loci and DNA sequences with such differences can be used as markers for genome mapping. The human genome may contain millions of SNPs; some may be related to disease, but most probably are not. SNPs are an important basis for studying genetic variation in human families and in animal and plant strains. In cancer genome studies, a single nucleotide variation that is specific to the tumor when compared with normal tissue is a somatic mutation and is called an SNV.
Small fragments of the genome (
When a fragment of the genome is deleted, or when a transcript is spliced, a read that spans the deletion site or splice site is cut into two parts when it is mapped back to the genome, and the parts match different regions. Such a read is called a soft-clipped read. Soft-clipped reads play an important role in identifying chromosomal structural variation and the integration of foreign sequences.
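In alignment files that follow the SAM format, soft clipping is recorded with the `S` operation in the CIGAR string of each read. The sketch below shows one way such reads could be picked out; the function name `soft_clips` and the example CIGAR strings are made up for illustration, and the code only inspects the CIGAR string rather than reading a real alignment file.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM format.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left, right) soft-clipped lengths for one CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so drop them first.
    ops = [item for item in ops if item[1] != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


for cigar in ["100M", "30S70M", "60M40S", "5H10S80M5S"]:
    print(cigar, soft_clips(cigar))
```

A read such as `60M40S` aligns for its first 60 bases and leaves the last 40 unaligned at this location; if those 40 bases map elsewhere, the pair of positions is a candidate breakpoint.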
Because most sequencing reads are short, one read can match multiple locations in the genome, and its true origin cannot be determined. Some tools handle such reads with statistical models, for example by assigning them to regions that already have more reads.
Assembly software joins reads based on the overlapping regions between them, and the sequences obtained this way are called contigs.
In de novo genome sequencing, after contigs have been obtained by assembling reads, it is often necessary to construct 454 paired-end libraries or Illumina mate-pair libraries to obtain the sequences at both ends of fragments of a given size (such as 3 kb, 6 kb, 10 kb or 20 kb). Based on these sequences, the order of some contigs can be determined, and contigs with a known order form a scaffold.
Assembling the reads yields contigs of different lengths. Adding up all contig lengths gives the total contig length. The contigs are then sorted from longest to shortest (contig 1, contig 2, contig 3, ..., contig 25) and their lengths are added in that order. When the running sum reaches half of the total contig length, the length of the last contig added is the contig N50. For example, if contig 1 + contig 2 + contig 3 + contig 4 is the first sum to reach total contig length × 1/2, the length of contig 4 is the contig N50. Contig N50 can be used as a criterion for judging the quality of a genome assembly.
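The N50 procedure described above translates directly into code. The following is a minimal sketch; the function name `n50` and the example lengths are illustrative only, and the same function works for contig (overlapping group) lengths and for scaffold lengths.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the sum to half of the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


contig_lengths = [80, 70, 50, 40, 30, 20]  # made-up lengths, total 290
print(n50(contig_lengths))  # 70, because 80 + 70 = 150 >= 145
```

Comparing `running * 2` with `total` avoids fractional arithmetic when the total length is odd.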
Scaffold N50 is defined in the same way as contig N50. Assembling contigs yields scaffolds of different lengths. Adding up all scaffold lengths gives the total scaffold length. The scaffolds are then sorted from longest to shortest (scaffold 1, scaffold 2, scaffold 3, ...) and their lengths are added in that order. When the running sum reaches half of the total scaffold length, the length of the last scaffold added is the scaffold N50. For example, if scaffold 1 + scaffold 2 + scaffold 3 + scaffold 4 + scaffold 5 is the first sum to reach total scaffold length × 1/2, the length of scaffold 5 is the scaffold N50. Scaffold N50 can be used as a criterion for judging the quality of a genome assembly.
Sequencing depth is the ratio of the total number of bases obtained by sequencing to the size of the genome being sequenced. For a genome of 2 Mb sequenced at a depth of 10X, the total amount of data obtained is 20 Mb. Coverage is the proportion of the whole genome that is represented in the sequences obtained. Because the genome contains complex structures such as high-GC regions and repeats, the final assembly cannot cover some regions, which are called gaps. For example, if a bacterial genome is sequenced with 98% coverage, 2% of the sequence has still not been obtained.
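The arithmetic behind depth and coverage can be written out as a short sketch. The function names below are hypothetical, and coverage is computed from a toy list of per-position depths rather than from real alignment data.

```python
def sequencing_depth(total_bases, genome_size):
    """Depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def bases_needed(genome_size, depth):
    """Total bases required to reach a target depth."""
    return genome_size * depth


def coverage_fraction(per_position_depth):
    """Fraction of genome positions covered by at least one read."""
    covered = sum(1 for depth in per_position_depth if depth > 0)
    return covered / len(per_position_depth)


# The example from the text: a 2 Mb genome at 10X needs 20 Mb of data.
print(bases_needed(2_000_000, 10))              # 20000000
print(sequencing_depth(20_000_000, 2_000_000))  # 10.0

# Toy genome of four positions, one of which no read covers (a gap).
print(coverage_fraction([3, 0, 5, 2]))          # 0.75
```

Depth is an average over the genome, so a high depth does not by itself guarantee full coverage: regions that are hard to sequence can remain gaps while other regions are covered many times over.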
Transcripts are reconstructed from the sequencing data. There are two approaches: (1) de novo assembly; (2) reconstruction guided by a reference genome. De novo assembly joins overlapping reads into longer sequences without relying on a reference genome, extending them step by step into contigs and scaffolds. Common tools include tinder, Trans-ABySS and Trinity. Reference-guided reconstruction first maps the reads to the genome and then derives the transcripts from read coverage and splice-junction information. Commonly used tools are Scripture and Cufflinks.
Comparative genomics is a discipline built on genome mapping and sequencing. By comparing known genes and genome structures, it helps explain gene function, expression mechanisms and the evolution of species. Using the homology in coding sequence and structure between the genomes of model organisms and the human genome, researchers can clone human disease genes, reveal the molecular mechanisms of gene function and disease, and clarify the evolutionary relationships of species and the internal structure of genomes.
Q30 means that the reliability of a base call is 99.9%, that is, the error probability is 0.1%. Q20 means that the reliability of a base call is 99%. The Q30 data volume is the total amount of data in a batch whose quality is Q30 or higher.
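These values follow from the Phred relationship Q = -10 × log10(P), where P is the probability that the base call is wrong. The sketch below converts between the two and computes the share of bases at Q30 or above. The function names are illustrative, and the quality string is assumed to use the Phred+33 ASCII encoding common in FASTQ files, which the source does not specify.

```python
def error_probability(q):
    """Error probability for a Phred quality score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def q30_fraction(quality_string, offset=33):
    """Fraction of bases with quality >= 30 in an ASCII-encoded quality string.

    offset=33 assumes Phred+33 encoding (an assumption, not from the source).
    """
    scores = [ord(char) - offset for char in quality_string]
    return sum(1 for score in scores if score >= 30) / len(scores)


print(error_probability(20))  # about 0.01, i.e. 99% reliability
print(error_probability(30))  # about 0.001, i.e. 99.9% reliability

# 'I' encodes Q40, '5' encodes Q20 and '?' encodes Q30 under Phred+33.
print(q30_fraction("II5?"))   # 0.75
```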
PF stands for pass filter, meaning that a read has passed quality control. Illumina sequencers automatically grade the reliability of each read (sequence). The PF criterion looks at the first 25 bases: if two or more of them have a recognition reliability below 0.6, the read is judged unqualified and fails PF. Otherwise, it passes.
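The PF rule above can be expressed as a small predicate. This is a sketch of the rule as stated in the text, not Illumina's actual implementation: the function name and parameters are hypothetical, and the input is assumed to be a list of per-base reliability values between 0 and 1 (Illumina refers to this value as chastity).

```python
def passes_filter(reliability, cycles=25, threshold=0.6, max_low=1):
    """Return True if at most `max_low` of the first `cycles` bases
    have a reliability below `threshold`."""
    low = sum(1 for value in reliability[:cycles] if value < threshold)
    return low <= max_low


good_read = [0.9] * 50
one_bad_base = [0.5] + [0.9] * 49
two_bad_bases = [0.5, 0.5] + [0.9] * 48
bad_only_late = [0.9] * 25 + [0.1] * 25

print(passes_filter(good_read))      # True
print(passes_filter(one_bad_base))   # True: a single low base is tolerated
print(passes_filter(two_bad_bases))  # False: two low bases in the first 25
print(passes_filter(bad_only_late))  # True: only the first 25 bases count
```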
PF is an internationally recognized quality control standard. For mammalian genome resequencing and exome sequencing, we guarantee that more than 80% of the data reaches Q30. For mRNA sequencing and smRNA sequencing, we guarantee that more than 80% of the data in the control lane reaches Q30.
Generally speaking:
In mammalian genome resequencing and exome sequencing, the GC content is about 40%, and the proportion of Q30 data is 80-95%.
In RNA-seq, the GC content is about 50%, and the proportion of Q30 data is about 80%. If the poly(A) content is particularly high, the Q30 proportion will be lower.
In smRNA-seq, many reads are left with little more than a string of A's, so the quality is lower. In our experiments, %Q30 is 70-75%.
Illumina sequencers have high data output and the highest data quality. Because they use fluorescently labeled dNTPs with terminator groups, only one base is added per cycle, so homopolymers (runs of the same base, such as four T's in a row: TTTT) are measured without frameshift misreads.
Roche 454 uses pyrosequencing: the pyrophosphate produced during DNA synthesis drives a light-emitting reaction, and the sequence is read by measuring this light. Its advantage is the longest read length; its disadvantage is the lowest data output.
Ion Torrent, including the PGM and Proton instruments, obtains the sequence by measuring the pH change caused by hydrogen ions released during DNA synthesis. Its advantage is speed, the fastest of these platforms: it takes about 3-4 days to get a sample ready for the instrument, and the run itself takes only 2-4 hours.
SOLiD uses hybridization, ligation reactions and fluorescence measurement. Because it relies on hybridization, it is slow and its reads are short. It has now effectively been phased out.
PacBio is third-generation sequencing, that is, single-molecule sequencing. At present its read length can exceed 1 kb, and it can detect modifications of the DNA sequence. Its disadvantage is very low accuracy, currently only 80-90% per base. In addition, its throughput is low, about 70,000 reads per run.
Partial reference: /p/ACD 38 E4 a 1
In 1977, the British chemist Frederick Sanger invented the dideoxy chain termination method. Together with the chemical degradation method invented by W. Gilbert, it is known as first-generation sequencing technology. Sanger won the Nobel Prize in Chemistry twice, in 1958 and 1980. He was the fourth person to win two Nobel Prizes and the first to win the Chemistry Prize twice. The first prize recognized his determination of the amino acid sequence of insulin, which proved that proteins have a defined structure; the second recognized his invention of the dideoxy chain termination method, also known as the Sanger method. Using this technique, he successfully determined the genome sequence of phage φX174. Sanger is a legendary scientist, and the Sanger Institute, which plays an important role in genome research, is named after him.
First-generation sequencing is characterized by read lengths of up to 1000 bp and accuracy of up to 99.999%. However, its high cost and low throughput seriously limit its large-scale application. Because of its high accuracy, first-generation sequencing is still the gold standard for gene detection and the main means of evaluating and verifying the results of next-generation sequencing. It was first-generation sequencing that made genome research possible at the time and set the stage for the launch of the Human Genome Project.