Introduction to Bioinformatics

Table of Contents
1 Pinyin
2 English Reference
3 Current Research in Bioinformatics
3.1 Obtaining the Complete Genomes of Human and Various Organisms
3.2 Discovering New Genes and Novel Single Nucleotide Polymorphisms
3.3 Non-Protein-Coding Regions in Genomes
3.4 Studying Biological Evolution at the Genome Level
3.5 Comparative Study of Complete Genomes
3.6 From Functional Genomes to Systems Biology
3.7 Simulation of Protein Structure and Drug Design
3.8 Application and Development of Bioinformatics

1 Pinyin

shēng wù xìn xī xué

2 English Reference

Bioinformatics

Bioinformatics is an emerging interdisciplinary field. Many people assume that, because it involves both biology and information science, it must be a very broad subject area. In fact, its content is specific and its scope is clear. Bioinformatics grew up alongside genome research, so its research content closely follows developments in genomics.

Broadly speaking, bioinformatics is concerned with acquiring, processing, storing, distributing, analyzing, and interpreting biological information related to genome research. This definition has two aspects: one is the collection, organization, and serving of massive data, that is, managing the data; the other is discovering new laws from the data, that is, using the data.

Specifically, bioinformatics takes the analysis of genomic DNA sequence information as its starting point: it locates the regions of the genome sequence that code for proteins and RNA genes, elucidates the nature of the information carried in the large non-coding regions of the genome, and deciphers the rules of the genetic language hidden in DNA sequences. On this basis, it summarizes and organizes the transcript profile and protein profile data related to the release of genomic genetic information and its regulation, so as to understand the laws of metabolism, development, differentiation, and evolution.

Bioinformatics also uses the information in the coding regions of the genome to model the spatial structure of proteins and to predict protein function. It combines this information with physiological and biochemical information about organisms and life processes to elucidate their molecular mechanisms, and ultimately to carry out the molecular design of proteins and nucleic acids, drug design, and individualized healthcare design.

Genome informatics, protein structure computation and modeling, and drug design are tightly linked by the central dogma of genetic information transfer, and are therefore necessarily connected.

Why does genome research need to rely on bioinformatics? First, genome research has brought an explosion of information, and there is an urgent need to process this mass of biological data. Since 1995, when scientists deciphered the 1.8-million-nucleotide genome of Haemophilus influenzae, the complete genomes of about 60 microorganisms and of several eukaryotes, such as yeast, a nematode, the fruit fly, and Arabidopsis thaliana, have been sequenced. By the spring of 2001, scientists had published a working draft covering the majority of the human genome. These achievements mean that genome research is entering a new stage of information extraction and data analysis. According to international database statistics, the number of DNA bases on record was 3 billion in December 1999 and 6 billion in April 2000, and has now reached 14 billion, doubling roughly every 14 months. Over the same period, the processing power of computer chips has been doubling about every 18 months. As a result, computers are able to manage and operate on these huge amounts of data effectively.
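The two doubling rates quoted above can be compared with a simple exponential model. The following is a minimal sketch, not a forecast: the function name `projected_size` and the ten-year horizon are assumptions chosen for illustration, and the model assumes both doubling periods stay constant.

```python
def projected_size(start: float, months: float, doubling_months: float) -> float:
    """Size after `months`, assuming steady exponential doubling."""
    return start * 2 ** (months / doubling_months)


# Relative growth over an assumed horizon of 10 years (120 months),
# starting from 1.0 for both sequence data and chip processing power.
data_growth = projected_size(1.0, 120, 14)  # sequence data: doubles every 14 months
chip_growth = projected_size(1.0, 120, 18)  # chip power: doubles every 18 months

print(f"sequence data grows about {data_growth:.0f}-fold")
print(f"chip power grows about {chip_growth:.0f}-fold")
```

Under these assumptions the data volume grows faster than raw chip power, which is one reason efficient algorithms matter as much as hardware.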

But the more fundamental reason is the complexity of genomic data. The genome of an organism is the sum of all the genetic material of that organism. This genetic material is a class of biological macromolecules called deoxyribonucleic acid (DNA), which is composed of four kinds of nucleotides joined in series, usually represented by the characters A, T, G, and C. In layman's terms, the genetic code of a living thing is a long linear chain of these four characters. The chain is often very long: the human genetic code, for example, contains 3.2 billion characters, which would fill a book of more than 1 million pages at 3,000 characters per page. This "heavenly book" (an undecipherable text) contains a vast amount of information about the structure and function of the human body and the processes of life, yet it is written with only four characters, has no visible vocabulary, syntax, or punctuation, and every page looks much like the next. How to read it is a great difficulty. Genome research ultimately translates biological problems into problems of processing strings of symbols. Solving such problems requires new analytical theories, methods, techniques, and tools, and necessarily relies on computer information processing.
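The page count in this analogy, and the idea of treating a genome as a string over four characters, can be checked in a few lines. This is an illustrative sketch only; the constant names and the helper `base_composition` are hypothetical, and the short example sequence is made up.

```python
from collections import Counter

GENOME_LENGTH = 3_200_000_000  # characters in the human genetic code (from the text)
CHARS_PER_PAGE = 3_000         # characters per page (from the text)

# Number of pages needed to print the whole sequence: a little over one million.
pages = GENOME_LENGTH / CHARS_PER_PAGE


def base_composition(seq: str) -> dict[str, float]:
    """Fraction of each of the four bases in a DNA string."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ATGC")
    if total == 0:
        raise ValueError("sequence contains no A, T, G, or C characters")
    return {base: counts[base] / total for base in "ATGC"}


print(f"{pages:,.0f} pages")
print(base_composition("ATGGCGTGCAATCC"))  # made-up example sequence
```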

Research in bioinformatics rests on several foundations. First, it requires adequate computing power, including the corresponding software and hardware. It needs a variety of databases, or effective access to international and domestic database systems, together with a well-developed and stable Internet infrastructure. Second, bioinformatics needs strong and innovative algorithms and software; without algorithmic innovation, it cannot develop sustainably. Finally, it has to establish extensive and strong links with the experimental sciences, especially with automated, large-scale, high-throughput biological research methods and platform technologies. These technologies are both the primary means of generating bioinformatic data and the key means of validating bioinformatics research results. Those engaged in bioinformatics research must therefore have knowledge that spans several disciplines.

China's bioinformatics research and applications already have a certain foundation and are therefore expected to achieve breakthrough results. This is very important for strengthening China's position in basic research and for taking an internationally leading position in some areas. The application of bioinformatics results will also produce huge social and economic benefits.

3 Current Research in Bioinformatics

3.1 Obtaining the Complete Genomes of Human and Various Organisms

The primary goal of genome research is to obtain the complete human genetic code. The human genetic code contains 3.2 billion bases, yet current DNA sequencers can read only a few hundred to a thousand bases per reaction. In other words, to obtain the entire human genetic code, the genome must first be broken into small fragments, which are sequenced individually and then put back together.

It is easy to see, however, that if you tear a book into small pieces of equal size, you cannot put them back together correctly, because the context that links the pieces is lost in the tearing. What can be done? We can take two identical copies of the book and tear each one up in a different way. By cross-referencing the pieces and finding the words they share, we can partially restore the book's context. The more copies are torn up, the more context is recovered. Thus, to obtain the entire human genetic code, the 3.2 billion bases cannot be sequenced just once; they must be sequenced many times over. For example, the working draft of the human genome published in early 2001 in the journals Nature and Science was reported to contain about 2.9 billion bases, with 96% physical map coverage and 94% sequence coverage. More than 90% of the contigs (contiguous sequence clusters) were longer than 100,000 bases, and about 25% were 10 million bases or longer. Some 30,000 to 40,000 protein-coding genes were found in these sequences. Obtaining such a map is equivalent to sequencing the human genome about five times. To do so, tens of millions of small fragments must be rejoined by comparing them with one another, a process usually referred to as the splicing and assembly of genome sequence data.
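The torn-book idea can be illustrated with a toy greedy assembler that repeatedly joins the two fragments sharing the longest overlap. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not the strategy of any particular genome center, and the function names and example fragments are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return frags  # nothing left to merge: these are the contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Made-up overlapping reads from a short made-up sequence.
reads = ["ATGGCGT", "GCGTGCAA", "GCAATCC"]
print(greedy_assemble(reads))
```

With repeated sequences, the largest overlap is no longer guaranteed to be the correct join, which is why a greedy approach like this one breaks down on real genomes.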

Every aspect of large-scale genome sequencing is closely tied to information analysis. From the sampling and analysis of the sequencer's optical density signal, base calling, and vector identification and removal, through splicing and gap filling, to repeat identification, reading frame prediction, and gene annotation, each step depends heavily on bioinformatics software and databases. Of these, sequence splicing and gap filling are the most critical and pressing challenges. The difficulty comes not only from the huge amount of data but also from the highly repetitive sequences the data contain. For this reason, experimental design and information analysis must be linked at every stage of the process. In addition, appropriate algorithms and software must be developed to meet the requirements of each step and to cope with the various complex problems that arise. Many well-known international genome research centers have their own splicing and assembly strategies, and this work is done on supercomputers.

With a complete genome, humans gain a more detailed and precise understanding of themselves. For example, the portion of our genome that actually codes for proteins (the exons) turns out to be very small, only 1.1%; the regions between exons (the introns) account for 24%; and the spacer sequences between genes account for 75%. Regions that do not code for proteins therefore make up the vast majority of the human genome. Human protein-coding genes were found to be more complex and more richly spliced than those of other organisms. Segmental duplication was found to be common in the genome, reflecting the complex evolutionary history of humans. Chromosome 13 was found to be stable, whereas chromosome 12 in males and chromosome 16 in females are variable, and so on.

3.2 Discovery of new genes and new single nucleotide polymorphisms

The discovery of new genes is a hot topic in international genome research, and the use of bioinformatics is an important means of discovering new genes. For example, about 60% of the 6000 genes contained in the complete genome of brewer's yeast were obtained by information analysis.

(1) Computerized cloning of genes

Discovering new genes with the EST database is also known as the computerized cloning of genes. EST sequences are short cDNA sequences derived from expressed genes, each carrying information about a segment of the complete gene. By October 2001, GenBank's EST database held more than 3.8 million human EST sequences, covering roughly 90% or more of human genes.

Our research on finding new genes through computerized cloning began as early as 1996. The principle is very simple: find all the EST fragments that belong to the same gene and link them together. Because EST sequences are generated at random in many laboratories around the world, ESTs belonging to the same gene are bound to share a large number of short overlapping segments. Using these segments as markers, different ESTs can be joined until the full-length sequence is recovered, at which point a gene can be said to have been found by computerized cloning. If the gene has not been reported before, a new gene has been found. In practice, however, computerized cloning programs are complex and computationally intensive.
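The first step of this procedure, grouping ESTs that appear to come from the same gene, can be sketched by linking any two ESTs that share a short exact substring (a k-mer). This is a simplified illustration, not the authors' program: it assumes error-free sequences on one strand, and the function names, the value of `k`, and the example sequences are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: list[str], k: int = 8) -> list[list[int]]:
    """Group EST indices that share a k-mer, directly or via a chain of ESTs."""
    parent = list(range(len(ests)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[str, int] = {}
    for idx, est in enumerate(ests):
        for kmer in kmers(est, k):
            if kmer in first_seen:
                parent[find(idx)] = find(first_seen[kmer])  # same cluster
            else:
                first_seen[kmer] = idx

    groups: dict[int, list[int]] = {}
    for idx in range(len(ests)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Made-up ESTs: the first two overlap, the third is unrelated.
ests = ["ATGGCGTGCA", "GTGCAATCCG", "TTTTTTTTTT"]
print(cluster_ests(ests, k=5))
```

Each resulting cluster would then be assembled into a longer sequence by overlap, and the assembled sequence compared against known genes.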

(2) Predicting new genes from genomic DNA sequences

Predicting new genes from genomic sequences is essentially a matter of distinguishing the regions of the genome that code for proteins from those that do not. The theoretical approach is to identify the mathematical and physical features that differ between coding and non-coding regions. By comparing the predicted sequences with a database of known genes, new genes can then be discovered.
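One of the simplest signals that separates candidate coding regions from the rest is an open reading frame (ORF): a start codon followed, in the same frame, by a run of codons ending at a stop codon. The sketch below scans only the forward strand and ignores introns, so it is far cruder than the statistical features the text refers to; the function name `find_orfs` and the example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop ORFs on the forward strand.

    `min_codons` counts the codons before the stop codon, ATG included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATTTGGGTAACC"  # made-up sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```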

Each newly discovered gene is a step forward in our understanding of life's activities. According to the December 2, 1999 issue of Nature, sequence data from human chromosome 22 identified 679 genes, 55 percent of which were previously unknown. Thirty-five diseases are associated with mutations on this chromosome, such as immune system disorders, congenital heart disease, and schizophrenia. However, the complete and correct integration of all human genes, their corresponding proteins, and their associated functions into a single index remains a very important and daunting task. The International Human Genome Collaboration is working to establish a complete "integrated gene index" and an associated "integrated protein index".

(3) Discovery of Single Nucleotide Polymorphisms (SNPs)

Some people live long lives despite smoking and drinking, while others have been sickly since childhood; the same anti-tumor drug is very effective for some patients but completely ineffective for others. Why? The answer is that their genomes differ. Many of these differences take the form of variations at individual bases, known as single nucleotide polymorphisms (SNPs).

It is now widely recognized that SNP research is an important step in moving the Human Genome Project toward application. This is mainly because SNPs provide a powerful tool for identifying high-risk groups, identifying disease-associated genes, designing and testing drugs, and conducting basic research in biology. SNPs are widely distributed in the genome; recent studies have shown that they occur about once every 300 base pairs in the human genome. The large number of SNP sites provides an opportunity to identify genomic mutations associated with various diseases, including tumors. From the point of view of experimental practice, it is easier to identify disease-associated mutations through SNPs than through pedigree analysis. Some SNPs do not directly cause the expression of disease genes but serve as important markers because they lie close to certain disease genes. SNPs have also played a major role in basic research. In recent years, the analysis of Y-chromosome SNPs has led to a series of important results on human evolution and on the evolution and migration of human populations.
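At its simplest, finding candidate SNPs means comparing aligned sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length, which real data rarely are without a prior alignment step; the function name `find_snps` and the example sequences are hypothetical. It also works out the number of SNP sites implied by the density quoted above.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Positions where two pre-aligned, equal-length sequences differ.

    Returns (position, reference_base, sample_base) tuples.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# One SNP per 300 base pairs across 3.2 billion bases (figures from the text).
expected_snp_sites = 3_200_000_000 // 300

print(find_snps("ATGCATTGCA", "ATGCGTTGCA"))  # made-up sequences
print(f"about {expected_snp_sites:,} SNP sites")
```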

3.3 Structural and Functional Studies of Non-Protein-Coding Regions of the Genome

Studies in recent years have shown that in microorganisms such as bacteria, non-protein-coding regions account for only 10% to 20% of the entire genome sequence. As organisms evolved, the non-coding regions grew, and in the genomes of higher organisms, including humans, non-coding sequences make up the majority of the genome. This suggests that these non-coding sequences must have important biological functions. It is widely believed that they are involved in regulating gene expression.

For the human genome, the only rules that have really been grasped so far concern the DNA regions that encode proteins (the genes), and the latest data show that these account for only 1.1% of the genome. Research on this 1.1% alone has produced dozens of Nobel Prize winners; the roughly 98% that is non-coding can be expected to yield a very substantial number of results. The search for the coding characteristics, information regulation, and expression rules of these regions will therefore remain a hot topic, and a source of important results, for a long time to come.

3.4 Studying Biological Evolution at the Genome Level

In recent years, with the massive increase in genome sequence data, the debate about the relationship between sequence differences and evolution has become more and more intense. First, it was found that evolutionary trees reconstructed for the same group of organisms from different molecular sequences may differ. At the same time, the relationship between "vertical evolution" and "horizontal evolution" is gradually attracting attention: the "horizontal transfer" of genes has been discovered in recent years. That is, genes can move between populations that exist at the same time, and the result can be sequence differences that have nothing to do with descent. Analysis of the human genome has even revealed dozens of human genes that resemble only bacterial genes and not genes found in fruit flies or nematodes. Studying evolution on the basis of these human gene sequences would lead to absurd conclusions. Therefore, molecules that have evolved vertically must be selected as samples in current molecular evolutionary studies. In particular, "similarity" and "homology" are two different concepts in molecular evolutionary analysis. Similarity only reflects that two sequences resemble each other and carries no implication of evolutionary relatedness. Homology, on the other hand, is similarity that derives from a common ancestor.

3.5 Comparative studies of complete genomes

In the post-genomic era, there is an increasing amount of complete genome data, with which one can study a number of major biological questions, such as: Where did life originate? How did life evolve? How did the genetic code originate? What is the minimum number of genes needed by the smallest independently living organism? How do these genes bring an organism to life? These big questions can only be answered at the genome level. For example, the mouse and human genomes are similar in size, both containing about 3 billion base pairs, have similar numbers of genes, and are largely homologous. Why, then, are mice and humans so different? Similarly, some scientists estimate that the genomes of different human populations differ by only 0.1%, and that humans and apes differ by about 1%, yet the differences between their phenotypes are very significant. Such differences should therefore be sought not only in genes and DNA sequences but also across the entire genome, including differences in chromosome organization. This work pioneered comparative genomics.

Scientists have found that all genes can be classified into several categories according to function and phylogeny, including genes related to replication, transcription, translation, molecular chaperones, energy production, ion transport, and various kinds of metabolism. This work also provides a new way to categorize proteins. At the same time, by comparing several complete genomes, scientists have estimated the minimum number of genes required to sustain life at around 250. Similarly, when the mouse and human genomes are compared, the organization of the genomes turns out to be very different, even though they are similar in size and number of genes. For example, the genes present on mouse chromosome 1 are distributed across seven human chromosomes: 1, 2, 5, 6, 8, 13, and 18. Studies have shown that, within the same group of organisms, differences in the order of certain ribosomal protein genes can reflect the kinship between species: the closer the kinship, the more similar the gene order. This allows the phylogenetic relationships between species to be studied by comparing the order of their genes.
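One simple way to quantify how far two gene orders have diverged is to count the neighboring gene pairs that are adjacent in one genome but not in the other, often called a breakpoint count. The sketch below is one possible measure, not necessarily the one used in the studies mentioned; it ignores gene orientation, and the gene labels are hypothetical placeholders.

```python
def adjacencies(order: list[str]) -> set[frozenset[str]]:
    """Unordered pairs of genes that sit next to each other."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def breakpoint_distance(order_a: list[str], order_b: list[str]) -> int:
    """Number of adjacent gene pairs in `order_a` that are not adjacent in `order_b`."""
    return len(adjacencies(order_a) - adjacencies(order_b))


# Hypothetical gene orders in three species.
species_1 = ["g1", "g2", "g3", "g4", "g5"]
species_2 = ["g1", "g2", "g4", "g3", "g5"]  # g3 and g4 swapped
species_3 = ["g3", "g1", "g5", "g2", "g4"]  # heavily rearranged

print(breakpoint_distance(species_1, species_2))
print(breakpoint_distance(species_1, species_3))
```

A smaller count suggests a more similar gene order and, following the argument above, a closer relationship.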

China has been carrying out large-scale sequencing and analysis of complete microbial genomes since 1998. Projects under way or already completed include: a hot-spring Thermotoga species of the order Thermotogales, a thermophilic bacterium first identified in China; Shigella flexneri (Flexner's dysentery bacillus); Leptospira of the icterohaemorrhagiae type, strain Lai; Staphylococcus epidermidis; and a Xanthomonas species. Chinese scientists have completed the sequencing of 1% of the human genome, and have recently completed a "working draft" of the rice genome, which has 430 million base pairs. These data will provide the most direct material for China's research in this field.

3.6 From Functional Genomes to Systems Biology

The number of genes expressed in different tissues varies greatly, with the brain having the largest number of genes expressed, about 30,000-40,000 transcripts, and some tissues having only a few dozen or a few hundred genes expressed. The same tissue expresses different types and numbers of genes at different stages of individual growth and development; some genes are expressed in early childhood, some in middle age, and some not until old age. We need to understand not only the sequence of genes, but also the function of genes, that is, we need to understand the expression profile of genes at different times and in different tissues. This is often referred to as functional genomic research.

To obtain gene expression profiles, new technologies have been developed internationally at both the nucleic acid and protein levels. These are gene chip (or DNA chip) technology at the nucleic acid level, and large-scale protein separation and sequence identification technology, also known as proteomic technology, at the protein level. Because the density of sample spots on a chip is very high, reaching hundreds of thousands per chip, data mining and knowledge discovery on expression profiles become the key to the success of this research. The development of both biochip and proteomic technology relies heavily on bioinformatics theories, techniques, and databases. In the next step, functional genome research will move toward complex systems, that is, toward exploring the interactions among the various parts and levels of biological systems, thus entering the field of systems biology.
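A basic form of expression profile mining is to look for genes whose expression rises and falls together across samples. The sketch below uses the Pearson correlation coefficient for this purpose, which is one common choice among many; the function names are hypothetical and the expression values are made-up numbers, not measurements.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def most_similar(profiles: dict[str, list[float]], gene: str) -> str:
    """Name of the gene whose profile correlates best with that of `gene`."""
    others = (name for name in profiles if name != gene)
    return max(others, key=lambda name: pearson(profiles[gene], profiles[name]))


# Hypothetical expression levels across four tissues or time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(most_similar(profiles, "geneA"))
```

A profile that is constant across all samples has zero variance and would cause a division by zero here; a fuller implementation would filter such genes out first.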

3.7 Protein Structure Simulation and Drug Design

The simulation of protein spatial structure and drug design has a history of two to three decades. With the rapid progress of human genome research, this field faces a new situation: the base sequences of 30,000 to 40,000 human genes are about to be discovered, and the amino acid sequences of their expression products will gradually be determined as a result. Predicting the spatial structures of these proteins, and from them achieving targeted drug design, will then become an urgent task. This, too, is a large-scale computational problem.

3.8 Bioinformatics application and development research

Bioinformatics research results not only have important theoretical value but can also be applied directly to industrial and agricultural production and to medical practice. The analysis and application algorithms, software, and databases associated with bioinformatics therefore have significant economic value, and will eventually become commercial products that deliver economic and social benefits.

(1) Disease-related genetic information and related algorithms and software development

Many diseases are related to gene mutations or gene polymorphisms. It has been estimated that about 1,000 proto-oncogenes and 100 oncogenes are related to cancer, and that 6,000 or more human diseases are associated with changes in various human genes. Many more diseases result from interactions between the environment (including disease-causing microorganisms) and human genes or gene products. As the Human Genome Project advances and we come to know the chromosomal location of every human gene, its sequence characteristics (including SNPs), its expression patterns, and the characteristics of its products (RNA and protein), we will be able to determine the molecular mechanisms of various diseases efficiently and then develop appropriate diagnostic and therapeutic tools. To this end, two bioinformatics efforts are important: building databases of disease-related human genetic information (including SNP databases), and developing bioinformatics algorithms that analyze genotyping data efficiently, in particular computational methods that correlate SNP data with diseases and pathogenic factors.
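One elementary way to correlate a SNP with a disease is a chi-square test on a 2x2 table of carriers and non-carriers among patients and healthy controls. The sketch below shows this standard test as an illustration only; it is not presented as the method of any specific project, the function name is hypothetical, and the counts are invented.

```python
def chi_square_2x2(case_carriers: int, case_noncarriers: int,
                   control_carriers: int, control_noncarriers: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column of the table must be non-empty")
    return n * (a * d - b * c) ** 2 / denominator


# Critical value of the chi-square distribution with 1 degree of freedom
# at the 5% significance level.
CRITICAL_VALUE_5_PERCENT = 3.841

# Invented counts: 60 of 100 patients and 40 of 100 controls carry the variant.
statistic = chi_square_2x2(60, 40, 40, 60)
print(statistic, statistic > CRITICAL_VALUE_5_PERCENT)
```

When many SNPs are tested at once, the significance threshold has to be adjusted for multiple comparisons, which this sketch does not attempt.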

(2) Establishing genomic databases for breeding superior animal and plant varieties, and developing molecular marker-assisted breeding technology

Based on the evolutionary distance between species and the homology of functional genes, it is relatively easy to locate the genes that determine economically important traits in livestock and cash crops, and to further understand the pathways and mechanisms of their development, growth, and resistance to stress. On this basis, relevant genomic molecular markers can be used to speed up breeding and to modify varieties as desired.

(3) Research and development of drug design software and bioinformatics-based molecular biology technology

Human genome information provides new candidate molecules and new candidate drug target genes for drug development. At the same time, the design of expression vectors, PCR and hybridization primers, and the various kits (including DNA microarrays) commonly used in molecular biology must rely on nucleic acid sequence information. The large amount of information provided by genome informatics opens up broad scope for the development of such technologies.
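As a small example of how primer design depends on sequence information, the sketch below computes the reverse complement of a sequence and estimates a primer's melting temperature with the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is only a rough guide for short oligonucleotides. The function names and the primer sequence are hypothetical; practical primer design software uses more detailed thermodynamic models.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


forward_primer = "ATGGCGTGCAATCCGATT"  # made-up 18-base primer
print(reverse_complement(forward_primer))
print(wallace_tm(forward_primer))
```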