Genome survey (quality control of second-generation sequencing data)

What data should be prepared for survey analysis?

(1) Introduction to the QC method

(2) Introduction to the NT method

1. Why perform a survey analysis?

2. Preparing the data for survey analysis

3. Quality control software for survey data

4. Summary of key points

Base quality is encoded as ASCII characters. Depending on the quality scheme used in sequencing, the decimal quality value is calculated differently. The common calculation methods are as follows:

Encodings: Phred+33 and Phred+64, where 33 and 64 are the values to subtract when an ASCII value is converted into a quality score.

(1) Phred+64: ASCII value of the quality character minus 64.

(2) Phred+33: ASCII value of the quality character minus 33.
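The two encodings above can be sketched in a few lines of Python. This is a minimal illustration rather than part of any particular QC tool; the function name `decode_quality` and the example quality strings are made up for the demonstration.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred quality scores.

    offset is 33 for Phred+33 and 64 for Phred+64.
    """
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    # Each score is the ASCII code of the character minus the offset.
    return [ord(ch) - offset for ch in quality_string]


# Illustrative quality strings (not taken from real data).
print(decode_quality("#5?I", offset=33))  # [2, 20, 30, 40]
print(decode_quality("BThh", offset=64))  # [2, 20, 40, 40]
```

Decoding a string with the wrong offset shifts every score by 31, so implausible values (negative, or far above 40) are a quick hint that the other encoding is in use.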

In Illumina sequencing, base quality characters range over [B, h] under Phred+64 or [#, I] under Phred+33. Both ranges correspond to quality values from 2 to 40.

The Illumina sequencing error rate corresponds directly to the sequencing quality value. Specifically, if the sequencing error rate is denoted by E and the Illumina base quality value by Q, the relationship is: Q = -10 × log10(E).
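The relationship Q = -10 × log10(E) can be applied in both directions. The sketch below is only an illustration of the arithmetic; the function names are chosen for this example.

```python
import math


def error_to_quality(error_rate):
    """Q = -10 * log10(E)."""
    return -10 * math.log10(error_rate)


def quality_to_error(quality):
    """Inverse of the formula above: E = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


for q in (10, 20, 30, 40):
    print(q, f"{quality_to_error(q):.4f}")
# 10 0.1000
# 20 0.0100
# 30 0.0010
# 40 0.0001
```

In other words, Q20 means one expected error in 100 bases, and Q30 means one in 1,000.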

For each dataset, 10,000 reads are extracted and aligned against the NT database. If all hits are to homologous (closely related) species, there is no contamination. If reads match bacteria or fungi, the data may be contaminated.
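One way to draw the 10,000 reads is sketched below. It assumes that the reads are chosen at random (the text does not say how they are picked) and that the FASTQ file uses the plain four-lines-per-record layout. The helper names `read_fastq`, `sample_reads`, and `write_fasta` are hypothetical.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def sample_reads(records, n=10000, seed=0):
    """Reservoir sampling: a uniform random sample of n records in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    return reservoir


def write_fasta(records, handle):
    """Write sampled reads as FASTA so they can be used as a BLAST query."""
    for header, sequence, _ in records:
        read_id = header[1:].split()[0]  # drop '@' and any description
        handle.write(f">{read_id}\n{sequence}\n")


# Tiny in-memory example; a real run would open the FASTQ file instead.
fastq_text = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
sampled = sample_reads(read_fastq(io.StringIO(fastq_text)), n=2)
print(len(sampled))  # 2
```

Reservoir sampling is used here because it needs only one pass over the file and keeps at most n reads in memory, which matters for large sequencing runs.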

1. NT database

Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ, excluding GSS, STS, PAT, EST, HTG, and WGS.

2. NT alignment

Software: BLAST

Basic Local Alignment Search Tool (BLAST) is the most widely used tool for sequence similarity searching. Some BLAST programs compare protein queries with protein databases or nucleotide queries with nucleotide databases, while others translate nucleotide queries or databases in all six reading frames and compare them with protein databases or queries.
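For a nucleotide query against a nucleotide database, the BLAST+ program is `blastn`. The sketch below builds such a command from Python. The file names, the E-value cutoff, the number of target sequences, and the chosen output columns are all assumptions made for illustration, not settings given in the text. The `sscinames` and `sskingdoms` columns are only filled in when the BLAST taxonomy database files are installed alongside the NT database.

```python
import shlex
import subprocess


def build_blastn_command(query_fasta, output_tsv, database="nt", threads=4):
    """Return a blastn command line as an argument list."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-out", output_tsv,
        # Tabular output with a custom column list (assumed layout).
        "-outfmt", "6 qseqid sseqid pident length evalue bitscore sscinames sskingdoms",
        "-evalue", "1e-5",          # assumed cutoff
        "-max_target_seqs", "5",    # assumed limit
        "-num_threads", str(threads),
    ]


def run_blastn(query_fasta, output_tsv, **kwargs):
    """Run blastn; requires BLAST+ and a local copy of the database."""
    subprocess.run(build_blastn_command(query_fasta, output_tsv, **kwargs), check=True)


# Hypothetical file names; printing the command does not run BLAST.
print(shlex.join(build_blastn_command("sampled_reads.fa", "nt_hits.tsv")))
```

Passing the arguments as a list avoids shell quoting problems with the multi-word `-outfmt` value.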

3. Statistics on the NT alignment result file
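A simple way to summarize the result file is to count, for each read, the species of its first reported hit. The sketch below assumes a tab-separated BLAST result in which the first column is the read ID and the seventh column is the scientific name of the hit; that column layout is an assumption and must match whatever output format was actually requested. The example rows are invented for illustration.

```python
import csv
from collections import Counter


def tally_top_hits(lines, name_column=6):
    """Count the species of the first hit reported for each read."""
    seen_reads = set()
    counts = Counter()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] in seen_reads:
            continue  # skip blank lines and lower-ranked hits of the same read
        seen_reads.add(row[0])
        counts[row[name_column]] += 1
    return counts


# Invented rows: qseqid, sseqid, pident, length, evalue, bitscore, sscinames
example = [
    "read1\tsubjA\t99.0\t150\t1e-70\t270\tHomo sapiens\n",
    "read1\tsubjB\t97.0\t150\t1e-65\t255\tPan troglodytes\n",
    "read2\tsubjC\t98.0\t150\t1e-68\t262\tEscherichia coli\n",
    "read3\tsubjD\t99.5\t150\t1e-71\t272\tHomo sapiens\n",
]
print(tally_top_hits(example).most_common())
# [('Homo sapiens', 2), ('Escherichia coli', 1)]
```

A ranked list like this makes it easy to see whether the most frequent hits are relatives of the sequenced species or unexpected groups such as bacteria and fungi.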

Because it is difficult to quantify species characteristics during the experiment, QC presents the data quantitatively and allows the relevant information to be extracted from it.

It also prepares the data for the subsequent k-mer analysis, so that an accurate genome estimate can be obtained.

Contamination is the most important problem. Whether the sequencing quality in the data report is low or good can usually be seen clearly from the plots.

Contamination, however, may come from * * * bacteria, organelles, the experiment, or the sample itself. This can only be learned from the NT alignment and the GC peaks, and it must also be analyzed together with the characteristics of the species.

For example, some diseased insects carry bacteria, and some mammals also carry associated bacteria.
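The GC peaks mentioned above come from the distribution of GC content across reads. The sketch below shows one way to compute that distribution. The 5% bin width and the function names are assumptions for illustration, and the example sequences are made up.

```python
from collections import Counter


def gc_percent(sequence):
    """GC percentage of a read, ignoring bases that are not A, C, G, or T."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    called = gc + sequence.count("A") + sequence.count("T")
    if called == 0:
        return 0.0
    return 100 * gc / called


def gc_histogram(sequences, bin_width=5):
    """Count reads per GC bin; each key is the lower edge of the bin in percent."""
    histogram = Counter()
    for sequence in sequences:
        bin_start = int(gc_percent(sequence) // bin_width) * bin_width
        histogram[bin_start] += 1
    return histogram


reads = ["ATGC", "GGCC", "ATAT", "ATGCNN"]  # made-up sequences
print(sorted(gc_histogram(reads).items()))
# [(0, 1), (50, 2), (100, 1)]
```

With real data, a single peak is what an uncontaminated sample would typically show, while a second, separate peak is a reason to check the NT alignment results for organisms with a different GC content.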