Fifteen years on, genome sequencing technology has evolved faster than anyone could have imagined. Ten years ago it was a "glamorous" but expensive research tool confined to the lab; now it is making its way into clinical practice as a "cutting-edge" diagnostic technique, and it is ushering in the era of big data in biomedicine.
It was once predicted that when the cost of sequencing a person's genome dropped to $1,000, the era of personalized medicine would begin. That goal has now essentially been reached. With the rapid development of the technology and the steady fall in costs, sequencing has begun to hand us enormous volumes of data, spanning genomes, proteomes, and other kinds of "omics".
1. Massive data generation
In just the past seven or eight years, the number of personal genomes stored has reached the scale of 10^6 (a million), a staggering figure, and this is only the beginning. Illumina's HiSeq X Ten sequencers, distributed to the world's top sequencing centers, already sequence the genomes of more than 18,000 people each year and generate huge amounts of data every day. The U.K. launched the 100,000 Genomes Project in 2014, and the U.S. and China have both announced plans to collect genomic data from as many as a million people.
Genome sequencing data is growing at an even faster rate. By one projection, the historical cumulative volume of sequencing data will double roughly every seven months after 2015, data produced on Illumina instruments will double every 12 months, and, if growth merely tracked Moore's Law, the volume would double every 18 months. Either way, the result is a huge "data black hole". (Image via nature.com)
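To get a feel for how quickly those three doubling periods diverge, here is a minimal sketch in Python that projects cumulative data volume over ten years under each assumption. The starting volume is a made-up placeholder, not a figure from the article.

```python
# Rough projection of cumulative sequencing data under different doubling periods.
# START_PB is a hypothetical placeholder for the volume at year 0.

START_PB = 100.0   # assumed starting volume, in petabytes (placeholder)
YEARS = 10

def projected_volume(start_pb: float, doubling_months: float, months: int) -> float:
    """Exponential growth: the volume doubles once every `doubling_months` months."""
    return start_pb * 2 ** (months / doubling_months)

for label, doubling in [("historical trend", 7), ("Illumina output", 12), ("Moore's Law", 18)]:
    final = projected_volume(START_PB, doubling, YEARS * 12)
    print(f"{label:16s} (doubling every {doubling:2d} months): "
          f"{final:,.0f} PB after {YEARS} years")
```

Run over ten years, the 7-month curve ends up several orders of magnitude above the 18-month one, which is exactly the gap between sequencing output and Moore's-Law-style growth in computing and storage.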
The above is just a snapshot of the big-data era; other kinds of data are accumulating as well. Alongside the genome projects, the Human Proteome Project and the clinical application of sequencing results have been put forward, and they too are adding bricks to the big-data edifice. The Human Proteome Project aims to characterize all the proteins encoded by human genes. To see what this means in practice, consider the story of one researcher.
Michael Snyder of Stanford University (right).
Michael Snyder is a molecular geneticist at Stanford University. When he had his own genome sequenced out of curiosity, he got a few "surprises". He discovered that he carried a variant predisposing him to type 2 diabetes, even though he had none of the usual risk factors, such as obesity or a family history of the disease. Over the next 14 months, Snyder kept monitoring RNA activity and protein expression in his body. After an infection with a respiratory virus, he noticed changes in his protein expression and the activation of a corresponding biological pathway; shortly afterwards he was diagnosed with diabetes, which appears to have been triggered by the viral infection. He later tracked the same kinds of changes when he developed Lyme arthritis. By that point his research had generated as much as 50 GB of data, and that covered him alone. He has since expanded the study to 100 people and to 13 "omics" (including the proteome and the transcriptome of the gut microbiota), and according to his plan, truly predicting disease will require scaling up to millions of participants. How much data will that generate?
The proliferation of electronic devices and health-data-recording apps has brought a huge amount of data to this day and age, and a sizable pool of research subjects to the medical community. For decades, doctors who wanted to gauge a patient's cardiovascular health would often give them a simple test: have them walk for six minutes on a flat, firm stretch of ground and record the distance covered. The test can be used not only to predict survival in lung transplant recipients, but also to track the progression of muscular dystrophy and to assess the condition of cardiovascular patients. It has featured in many medical studies, yet in the past even the largest of those studies rarely enrolled more than a thousand participants.
Health apps on smartphones now allow researchers to collect data from very large populations. (Image via nature.com)
However, this has changed dramatically in recent years. In a cardiovascular study launched in March 2015, researcher Euan Ashley received test results from 6,000 people within two weeks, thanks to the millions of people who now own smartphones and fitness trackers. By June, the number of participants had reached 40,000, thanks to an Apple-based app called MyHeart Counts (pictured above). With the app, Ashley could even recruit participants from around the world and collect their test results. How much data might that produce? Faced with this situation, many researchers worry that such massive data sets could overwhelm existing analysis pipelines and place unprecedented demands on data storage.
2. Challenges in the era of "big data"
In the wave of population-scale genome research, most efforts focus only on the exome, the protein-coding portion that makes up roughly 1-5% of the whole genome, which cuts the amount of data to be analyzed to about 1% of what whole-genome sequencing would produce. Even so, the data generated can reach 40 million GB per year, which brings us to the first challenge: how do we store that much data?
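To make the scale concrete, here is a back-of-envelope sketch with placeholder assumptions (roughly 100 GB of raw data per whole genome and the exome at about 1% of that volume; neither figure comes from the article) showing what a million samples a year would mean for storage.

```python
# Back-of-envelope storage estimate. Per-sample sizes are rough
# placeholder assumptions, not figures from the article.

GENOME_GB_PER_SAMPLE = 100.0   # assumed raw data per whole genome
EXOME_FRACTION = 0.01          # exome taken as ~1% of the data volume
SAMPLES_PER_YEAR = 1_000_000   # e.g. a million-person project

genome_total_gb = GENOME_GB_PER_SAMPLE * SAMPLES_PER_YEAR
exome_total_gb = genome_total_gb * EXOME_FRACTION

for label, total_gb in [("whole genomes", genome_total_gb),
                        ("exomes only", exome_total_gb)]:
    print(f"{label:14s}: {total_gb:,.0f} GB/year (~{total_gb / 1e6:.1f} PB/year)")
```

Even the exome-only scenario runs to roughly a petabyte per year under these assumptions, which is why storage is the first problem that has to be solved.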
Storage may be the most basic problem in this area, but it still takes enormous resources to solve. This is where one of the most common words on the internet in recent years comes in: the cloud. Data on this scale cannot realistically live on a single fixed device; it has to be held across the network, which is what "cloud storage" means. These data also create a processing crisis, because computing power limits what can be done with them, and the first answer to that problem again lies in the cloud, this time as "cloud computing".
Even if the storage problem is dealt with, a more vexing one remains: what do the data actually tell us? Clinical genomics research now tends to focus on finding the "small errors" in an individual's genome that can disrupt gene function, so-called single-nucleotide variants (SNVs). Even though the variants of interest lie in exons, which make up only about 1% of the genome, an individual still carries nearly 13,000 of them on average, and about 2% of those are predicted to alter the corresponding proteins; pinpointing the specific causative genes for a given disease remains a huge challenge.
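For readers curious what "finding single-nucleotide variants" looks like at the most basic level, here is a minimal sketch that reads a VCF file (the standard variant call format) and counts the single-base substitutions. The file name is a placeholder, and real pipelines rely on dedicated tools and annotation databases rather than a bare parser like this.

```python
# Minimal sketch: count single-nucleotide variants (SNVs) in a VCF file.
# "sample.vcf" is a placeholder path, not a file referenced in the article.

def count_snvs(vcf_path: str) -> int:
    snvs = 0
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):            # skip header/metadata lines
                continue
            fields = line.rstrip("\n").split("\t")
            ref, alt = fields[3], fields[4]     # REF and ALT columns
            # An SNV: a single reference base replaced by a single alternative base.
            if ref in "ACGT" and all(a in "ACGT" for a in alt.split(",")):
                snvs += 1
    return snvs

if __name__ == "__main__":
    print(f"SNVs found: {count_snvs('sample.vcf')}")
```

Counting the variants is the easy part; deciding which of those thousands of changes actually matters for a disease is where the analytical challenge lies.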
Since Barack Obama introduced the concept of "precision medicine", the field has been red-hot. Yet even with sequencing technologies, analytical tools, and the "help" of electronic health records (EHRs), there is still a huge gap between the ideal and the reality of this approach to medicine. For example, even where EHRs are widespread and new therapies have been successfully developed, the clinicians who must apply those therapies often need ongoing training to understand them in enough detail to make sound medical decisions.
In addition, the difficulty of sharing electronic health records (owing to patient-privacy issues) creates a considerable barrier to precision medicine. The specific information about how individual cases were treated is often held by the patients and the treating institutions and never reaches researchers, so treatments cannot be improved on the basis of that information and "individualized medicine" cannot be realized. These problems also reflect how badly the biomedical field needs information-processing experts. Unfortunately, bioinformaticians hold only a small number of positions in academia, let alone in medicine, and more positions and opportunities need to be opened up for them.
3. Opportunities brought by "big data"
Challenges also bring opportunities, and they can be found across the biomedical field: better diagnostic methods, finer disease classification, new directions for drug development, new ways of treating disease, and even new tools for basic biological research.
In 2013, Angelina Jolie's story took the world by storm when she underwent a preventive double mastectomy to reduce her risk of breast cancer, a decision she made after learning that she carried a mutation in a risk gene, BRCA1. Such mutations carry a substantial risk of disease: roughly 55-65% of women with a harmful BRCA1 mutation, and about 45% of those with a harmful BRCA2 mutation, will develop breast cancer. For Jolie, carrying the former was enough to decide on preventive surgery. Her story is a vivid example of how data from individual sequencing can be linked to clinical decisions; it is as if we are recovering lost treasure from our own genomes to help ward off malignant disease. But that is just one benefit of the era, and only a very small part of it.
Take diabetes, for example. Imprecise typing of the disease hampers both early prevention and later treatment. The medical community has long known that upwards of a hundred pathways can lead to diabetes, involving changes in the pancreas, liver, muscle, brain, and even fat, and modern genetic research has shown that the causative genes differ greatly between types. When these different subtypes are lumped together, it becomes very hard to explain why patients carrying the same mutation and receiving the same treatment regimen can have completely different outcomes.
As biochemist Alan Attie puts it, "the path from a disease-causing gene to a phenotype such as body weight or blood glucose usually involves many steps, each of which can be affected by a mutation, and this ultimately weakens the link between gene and phenotype." Looking only at the phenotype (the clinical symptoms) or only at the mutated gene therefore gives a one-sided picture. Only by combining the two can we deepen our understanding of a disease and characterize it precisely enough to "prescribe the right medicine".
The U.S. National Institutes of Health (NIH) launched a large-scale project to build a cancer genome database, The Cancer Genome Atlas (TCGA), which catalogues cancer-related gene mutations and holds some 2.5 million gigabytes of data, greatly improving researchers' understanding of many types of cancer. But that alone has made little difference to the clinical experience of the patients who provided the tissue samples.
Another aspect relevant to cancer treatment is the personal electronic health record and the case-specific information it contains. If hospitals or individuals made this information available, many researchers could put it to good use improving cancer treatment programs. Only by combining sequencing data with the patient's treatment record (from the individual's EHR) and clinical characteristics (from the institution's clinicopathology records) can clinical treatment plans for tumors ultimately be "upgraded".
There is no denying that pharmaceutical R&D can also benefit greatly from big data. Genetic technology companies prefer to run long-term biological studies and link them to clinical data so that drugs can be "right-sized" for each individual; such data can even help pharmaceutical companies make bolder R&D decisions, such as pursuing personalized immunotherapies.
Take microbiome research. One idea being floated is: when will we want to develop drugs that alter the microbial flora in our bodies? These billions of microorganisms, in our gut, on our skin, and in our surroundings, affect not only whether we get sick but also how well a drug treats a disease. Microbiome data are still available for only a small fraction of the population, but could this be a promising research direction? For now, we still lack reliable methods for altering the microbiota in a sustained way that meaningfully influences the course of disease.
What does big data bring to the table for immunology research? To begin with, a whole series of "omics" can feed into it: the genome, microbiome, epigenome, transcriptome, metabolome, pathways, cytome, and proteome. One concrete example is analyzing all the antibody and antigen-receptor molecules of an individual's B and T cells; the results, especially when combined with techniques that identify the antigenic determinants each antibody recognizes, could take clinical diagnostics, antibody drug discovery, and vaccine development to a new level, and offer new insight into how antibodies bind self-antigen peptides.
A path strewn with thorns often leads to the nightingale's song. Big data brings challenges, but it also brings opportunities, especially for the treatment of malignant diseases such as cancer. A single tumor type is often accompanied by a diverse set of genetic mutations, yet as more time and money are invested, more therapeutic targets emerge. And as big-data analysis grows more accurate, our understanding of the whole disease process will deepen; with this tool in hand, more precise treatment options can be generated to help people make better choices.