Reflections on Big Data Mining in Virtual Pharmaceutical Research

1. Virtual medical research cases based on big data mining

Data mining has now developed to the point where, in current terms, it has entered the era of "big data" mining. Let's start with a few relevant cases.

1.1 Virtual Clinical Trials - Big Data Acquisition

Let's first look at one case. In June 2011, Pfizer announced a "virtual" clinical study, an FDA-approved pilot program known by the acronym "REMOTE". REMOTE is the first clinical study in the United States in which patients take part using only cell phones and the Internet rather than making repeated trips to the hospital. Its goal is to determine whether such "virtual" clinical studies can produce the same results as traditional clinical studies. Traditional clinical studies require patients to live near a hospital and travel regularly to a hospital or clinic for an initial checkup and multiple follow-ups. If this program works, patients across the United States could participate in many future medical studies. Groups underrepresented in traditional research programs would be able to take part, data collection would be much faster, costs would likely drop dramatically, and participants would likely be much less prone to dropping out.

From the above example, we can see that the Internet makes it possible to collect clinical data from far more patients than the traditional sample size for clinical research, and that some of this data may come from more convenient wearable health monitoring devices. If such research is rigorously designed, quality standards are effectively enforced, and the various sources of error are effectively controlled, both the efficiency of the research and the credibility of the results can be significantly improved. As Pfizer's Chief Medical Officer Freda Lewis-Hall said, "Enabling a more diverse population to participate in research has the potential to drive medical progress and lead to better outcomes for more patients."

1.2 Virtual Drug Clinical Trials - Big Data Mining

Let's look at another case. In 1992, the antidepressant paroxetine (Paxil) was approved for marketing; in 1996, the cholesterol-lowering drug Pravachol was officially launched. Studies by the two manufacturers showed that each drug is effective and safe when taken alone. However, no one knew, and few had even considered, whether it is safe for patients to take both drugs at the same time. Researchers at Stanford University in the United States used data mining techniques to analyze the electronic medical records of tens of thousands of patients, looking for hidden patterns in blood sugar test results and drug prescriptions. They soon found a surprising answer: patients who took both medications at the same time had higher blood sugar levels. This matters a great deal for diabetics, for whom elevated blood sugar is a serious health risk.
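The kind of comparison described above can be sketched in a few lines of Python. This is only a minimal illustration and not the Stanford researchers' actual method: the record layout (`drugs`, `glucose`), the function name, and the toy values are all hypothetical and invented for demonstration.

```python
from collections import defaultdict
from statistics import mean

def mean_glucose_by_exposure(records, drug_a, drug_b):
    """Group patients by exposure to two drugs and average their glucose.

    `records` is a hypothetical list of dicts with keys "drugs" (a set of
    drug names) and "glucose" (a blood sugar reading in mmol/L).
    """
    groups = defaultdict(list)
    for rec in records:
        has_a = drug_a in rec["drugs"]
        has_b = drug_b in rec["drugs"]
        if has_a and has_b:
            key = "both"
        elif has_a:
            key = drug_a
        elif has_b:
            key = drug_b
        else:
            key = "neither"
        groups[key].append(rec["glucose"])
    return {key: mean(values) for key, values in groups.items()}

# Toy records, invented purely for illustration (not real patient data).
toy_records = [
    {"drugs": {"paxil"}, "glucose": 5.5},
    {"drugs": {"paxil"}, "glucose": 5.7},
    {"drugs": {"pravachol"}, "glucose": 5.6},
    {"drugs": {"pravachol"}, "glucose": 5.4},
    {"drugs": {"paxil", "pravachol"}, "glucose": 7.0},
    {"drugs": {"paxil", "pravachol"}, "glucose": 7.4},
    {"drugs": set(), "glucose": 5.2},
]

summary = mean_glucose_by_exposure(toy_records, "paxil", "pravachol")
for group, value in sorted(summary.items()):
    print(f"{group}: mean glucose {value:.2f}")
```

A real analysis would also have to account for confounding factors such as age, diagnosis, and other medications before attributing a difference to the drug combination.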

A single doctor sees only a limited number of patients taking both drugs. Even if a few of these diabetic patients have inexplicably elevated blood sugar, it is difficult for the doctor to realize that the cause is the combination of Paxil and Pravachol. This is an implicit pattern hidden in big data, and an individual physician would be unlikely to uncover it unless someone had deliberately studied the safety of combining the two drugs. Yet with thousands of clinical drugs in use, how could we possibly study, one by one, the safety and efficacy of every combination of two or three drugs? Data mining is likely to be an effective, rapid, and proactive way to explore the problem of combined medication.
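A quick calculation shows why testing every combination one by one is impractical. The figure of 2,000 drugs below is an assumption standing in for "thousands of clinical drugs"; the text gives no exact number.

```python
from math import comb

def combination_counts(n_drugs):
    """Return how many distinct 2-drug and 3-drug combinations exist."""
    return comb(n_drugs, 2), comb(n_drugs, 3)

# 2,000 is an assumed figure standing in for "thousands of clinical drugs".
pairs, triples = combination_counts(2000)
print(f"pairs:   {pairs:,}")    # 1,999,000
print(f"triples: {triples:,}")  # 1,331,334,000
```

Even under this modest assumption there are about two million pairs and more than a billion triples, far more than dedicated clinical trials could ever cover.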

Investigators no longer have to recruit patients for a dedicated clinical trial, which would be far too expensive. The spread of electronic medical records and their computerized applications has opened up new opportunities for medical data mining. Instead of limiting themselves to traditional research with recruited volunteers, scientists are sifting through "real-life experiments": the large numbers of everyday clinical cases that were not produced by any planned project but are stored in the medical records of many hospitals. Research conducted on such data is virtual research.

In cases like this one, data mining technology allows researchers to identify problems that could not have been foreseen when a drug was approved for marketing, such as how the drug affects a particular population. In addition, data mining of medical records will benefit not only research but also the efficiency of the healthcare delivery system.

1.3 Virtual Drug Target Discovery - Knowledge Discovery

Let's look at another class of research. Drug discovery is usually long, expensive, and risky. Some data show that the average time for new drug development is up to 15 years and the average cost is more than 800 million dollars. Poor efficacy and high toxicity or side effects cause many drug candidates to fail at the clinical stage, resulting in huge economic losses. As the starting point of drug development, the discovery and identification of drug targets plays a pivotal role in the success rate. With the continuing development of bioinformatics and the rapid growth of proteomics and chemical genomics data, data mining combined with traditional biological experiments offers a new technical means of discovering drug targets and a new method for identifying and predicting them. One such line of research builds a drug target database and applies intelligent computing and data mining to explore existing drug target data in depth in order to discover new targets; this is also called knowledge discovery of drug targets.

Traditional drug target discovery usually relies on a large number of repeated biochemical experiments. This is costly and inefficient, and the success rate is very low; like the blind men feeling an elephant, it gives little sense of the overall direction. Data mining, as an automatic, proactive, and efficient exploration technology, makes virtual drug target discovery possible. It can greatly accelerate target discovery, substantially reduce the number and cost of biochemical experiments, and improve the success rate of the experiments that are still carried out.

2. Applications of data mining in virtual medical research

In the era of big data, pharmaceutical research and development faces both more challenges and more opportunities. To save R&D costs, improve the success rate of new drug development, and develop more competitive new drugs, data mining technology can be applied to virtual medical research and drug research. Its applications can be summarized as follows.

2.1 Help pharmaceutical companies reduce R&D costs and improve R&D efficiency through predictive modeling. The model uses data from the preclinical stage and the early clinical stage of a drug to predict clinical results as early as possible. Evaluation factors include product safety, efficacy, potential side effects, and overall trial results. Once modeling and analysis have predicted a drug's clinical outcome, a company can reduce R&D costs by postponing work on sub-optimal drugs or by stopping costly clinical trials of them.

2.2 Mining patient data to assess whether recruited patients are eligible for a trial can speed up the clinical trial process and suggest more effective trial designs. For example, clustering the patient group reveals characteristics such as age, gender, condition, and laboratory indicators, which can be used to determine whether a patient meets the trial conditions and to set up a control group.

2.3 Analyzing clinical trial data and patient records can identify additional indications for a drug and discover side effects. On the basis of such analysis, a drug can be repositioned or marketed for additional indications. Mining trial data with methods such as correlation (association) analysis may reveal results that were not anticipated beforehand, greatly increasing the value obtained from the data.
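One common way to quantify an association between two recorded items is to compute the support, confidence, and lift of a rule such as "drug X -> event Y". The sketch below is an illustration under stated assumptions: the text does not say which measures are used, and the record layout, item names, and toy data are hypothetical.

```python
def association_metrics(records, antecedent, consequent):
    """Compute support, confidence, and lift for the rule antecedent -> consequent.

    `records` is a hypothetical list of sets; each set holds the items
    (drugs, diagnoses, observed events) recorded for one patient.
    """
    n = len(records)
    n_ante = sum(1 for r in records if antecedent in r)
    n_cons = sum(1 for r in records if consequent in r)
    n_both = sum(1 for r in records if antecedent in r and consequent in r)
    if n == 0 or n_ante == 0 or n_cons == 0:
        raise ValueError("antecedent and consequent must both occur in the records")
    support = n_both / n                # share of all records containing both items
    confidence = n_both / n_ante        # share of antecedent records that also contain the consequent
    lift = confidence / (n_cons / n)    # > 1 means the items co-occur more often than expected by chance
    return support, confidence, lift

# Toy records, invented for illustration only.
toy_records = [
    {"drug_x", "event_y"},
    {"drug_x", "event_y"},
    {"drug_x", "event_y"},
    {"drug_x"},
    {"event_y"},
    {"drug_z"},
    {"drug_z"},
    set(),
]

support, confidence, lift = association_metrics(toy_records, "drug_x", "event_y")
print(support, confidence, lift)  # 0.375 0.75 1.5
```

A lift well above 1 only flags a pair for closer review; it does not by itself establish a new indication or a side effect.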

2.4 Real-time or near real-time collection of adverse reaction reports can facilitate pharmacovigilance. Pharmacovigilance is the safety assurance system for marketed drugs; it monitors, evaluates, and prevents adverse drug reactions. Clustering, association analysis, and other big data mining methods can be used to analyze adverse drug reactions in terms of medication use, disease, the manifestations of the reaction, and whether the reaction is related to a particular chemical component. Examples include cluster analysis of adverse reaction symptoms and association analysis between chemical components and adverse reaction symptoms. In addition, clinical trials sometimes hint at an effect without providing enough statistical data to prove it; analysis based on clinical trial big data can now supply the evidence.
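As a concrete illustration of screening adverse reaction reports, the sketch below computes the proportional reporting ratio (PRR), a disproportionality measure commonly used in pharmacovigilance. The PRR is brought in here as an example and is not named in the text above; the function name and the report counts are hypothetical.

```python
def proportional_reporting_ratio(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports for the drug of interest that mention the reaction
    b: reports for the drug of interest that do not mention it
    c: reports for all other drugs that mention the reaction
    d: reports for all other drugs that do not mention it
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("counts do not allow a ratio to be computed")
    rate_drug = a / (a + b)      # reaction rate among reports for this drug
    rate_others = c / (c + d)    # reaction rate among reports for other drugs
    return rate_drug / rate_others

# Hypothetical counts, invented for illustration only.
prr = proportional_reporting_ratio(a=20, b=80, c=50, d=950)
print(f"PRR = {prr:.2f}")
```

With these made-up counts the reaction is reported about four times as often for the drug of interest as for other drugs, which would mark the pair as a signal worth investigating rather than as proof of causation.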

2.5 Targeted drug development: developing personalized medicines through the analysis of large datasets such as genomic data. This application examines the relationships among genetic variation, susceptibility to specific diseases, and response to particular drugs, and then takes the individual's genetic variation into account during drug development and dosing. Patients on the same medication regimen often have different outcomes, in part because of genetic variation. Accordingly, different drugs can be developed, or different dosing given, for different patients with the same disease.

2.6 Inspire R&D by exploring combinations of the chemical components and pharmacology of drugs. For example, in traditional Chinese medicine drug development, data mining can be used to analyze formulas and symptoms, explore the connections between them, and classify and analyze medicines by efficacy, category, and medicinal nature and flavor.

3. Virtual Drug Clinical Trial Analysis System

Nowadays, more and more clinical studies and drug clinical trials extract their data, after strict screening, from the big data generated by daily clinical work. As in the cases in 1.1 and 1.2 of this paper, so-called virtual drug clinical trials collect clinical data more widely and screen it rigorously from the huge volume of hospital electronic medical records according to a design prepared in advance. Although the method is virtual rather than traditional, this kind of drug clinical trial research offers wider sample representation, lower cost, higher efficiency, and richer research results. Virtual research methods can completely replace certain traditional drug clinical studies and can serve as a pilot or exploratory study for others, so that real drug clinical research can achieve more, faster, better, and more economically. We now look at how the Virtual Drug Clinical Trial Analysis System works.

3.1 The basic idea of virtual drug research

(1) Build a drug clinical trial data warehouse that fully integrates accumulated clinical data and drug application data.
(2) Design and select the observation group and control group samples for the drug clinical trial.
(3) Apply data mining techniques to explore the therapeutic effects of the drug on the disease and the side effects it produces.
(4) Apply statistical techniques to infer and evaluate the effects observed in the drug clinical trial.

3.2 Establishment of Drug Clinical Data Warehouse

There are two ways to build a drug clinical trial data warehouse. The first is to collect purpose-built data through a classic drug clinical trial design. Traditionally the data are recorded on paper forms, although specialized data entry software also exists; because collection follows the predefined design, it directly yields the drug clinical trial data. The second is to extract, transform, and load large volumes of historical clinical medication data from hospitals and to integrate them fully with other accumulated clinical data and drug application data, forming a data source from which drug clinical trial data can be generated. The volume of such sample data may be very large. The method we demonstrate later uses this second kind of data for "virtual" sample screening and analysis.
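The extract, transform, and load route can be sketched as follows. This is a minimal illustration, not the system described in this paper: the raw field names, the coding rules, and the table layout are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw exports from hospital systems with inconsistent coding.
RAW_ROWS = [
    {"pid": "A001", "sex": "male", "drug": " Drug_A ", "dose": "25"},
    {"pid": "A002", "sex": "F", "drug": "drug_a", "dose": "50"},
    {"pid": "B104", "sex": "Female", "drug": "DRUG_B", "dose": ""},
]

SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def transform(row):
    """Normalize one raw row; return None if a required field is unusable."""
    sex = SEX_CODES.get(row["sex"].strip().lower())
    dose_text = row["dose"].strip()
    if sex is None or not dose_text:
        return None
    return (row["pid"], sex, row["drug"].strip().lower(), float(dose_text))

def load(rows):
    """Load cleaned rows into an in-memory SQLite table standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE medication (pid TEXT, sex TEXT, drug TEXT, dose REAL)"
    )
    conn.executemany("INSERT INTO medication VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

# Extract (here: the in-memory list), transform, then load.
cleaned = [t for t in map(transform, RAW_ROWS) if t is not None]
warehouse = load(cleaned)
count = warehouse.execute(
    "SELECT COUNT(*) FROM medication WHERE drug = 'drug_a'"
).fetchone()[0]
print(count)  # 2
```

Rows with a missing dose are dropped here for simplicity; a production warehouse would more likely keep them and flag the gap for later review.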

3.3 Clinical Trial Sample Design

Many designs are available for clinical trial samples depending on the needs of the drug study, such as single-factor one-level, single-factor two-level, and single-factor multi-level designs, paired designs, block designs, and repeated measures designs. Here we take the two-factor block design as an example to introduce sample screening. The example is for demonstrating the method only and does not aim at strict medical rigor.

The disease in this study was atherosclerotic heart disease. The treatment factor was the drug applied, with three drugs: Betaloc, Novolin, and isosorbide nitrate. The blocking factor was age, divided into three age groups. The observation indicator was blood sodium. Our research design screens the data according to the "three elements and four principles". The "three elements" are the study subjects, the treatment factors, and the observation indicators. The "four principles" are randomization, control, replication, and balance. The dataset can be screened out according to the input conditions in Figure 1 below and then analyzed with statistical analysis tools.
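The screening step for a two-factor block design can be sketched as below. The age bands, field names, drug labels, and toy values are assumptions made for illustration; the text only states that age is divided into three groups and that three drugs are compared.

```python
from collections import defaultdict

# Assumed age bands; the text only says that age is split into three groups.
AGE_BLOCKS = [(0, 49, "under 50"), (50, 64, "50-64"), (65, 200, "65 and over")]

def age_block(age):
    """Map an age in whole years to its block label, or None if out of range."""
    for low, high, label in AGE_BLOCKS:
        if low <= age <= high:
            return label
    return None

def screen(records, diagnosis, drugs):
    """Keep records matching the diagnosis and one of the study drugs,
    then arrange the observed values into (drug, age block) cells."""
    cells = defaultdict(list)
    for rec in records:
        if rec["diagnosis"] != diagnosis or rec["drug"] not in drugs:
            continue
        block = age_block(rec["age"])
        if block is not None:
            cells[(rec["drug"], block)].append(rec["value"])
    return dict(cells)

# Toy records, invented for illustration; "value" stands for the observed indicator.
toy_records = [
    {"diagnosis": "atherosclerotic heart disease", "drug": "drug_a", "age": 45, "value": 140.0},
    {"diagnosis": "atherosclerotic heart disease", "drug": "drug_a", "age": 70, "value": 138.0},
    {"diagnosis": "atherosclerotic heart disease", "drug": "drug_b", "age": 58, "value": 141.0},
    {"diagnosis": "hypertension", "drug": "drug_a", "age": 60, "value": 139.0},
    {"diagnosis": "atherosclerotic heart disease", "drug": "drug_x", "age": 52, "value": 137.0},
]

cells = screen(toy_records, "atherosclerotic heart disease", {"drug_a", "drug_b", "drug_c"})
for (drug, block), values in sorted(cells.items()):
    print(drug, block, values)
```

This sketch covers only the filtering and blocking; because the data are retrospective, the randomization and balance principles would still need to be addressed separately, for example by sampling equal numbers of records from each cell.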

3.4 Drug Clinical Data Mining

Applying data mining technology can not only improve the utilization of drug clinical data but also uncover new positive and negative effects in the clinical use of drugs. Analyzing clinical trial data and electronic patient records with a variety of data mining methods can identify additional indications for drugs and discover unknown side effects, making it possible to reposition drugs or extend their use to other indications. Mining drug trial data may reveal unexpected results and greatly increase the benefit gained from the data.

In this example, we use data mining to study the effects of a drug on laboratory indicators. The positive and negative effects of a drug in clinical use can be explored by observing a patient's medical characteristics and physiological indicators before and after the drug is administered, and observing a range of relatively objective laboratory indicators is a necessary part of many drug studies. The following is a study of Betaloc in the treatment of coronary heart disease. We applied data mining techniques to analyze how changes in the blood concentration of Betaloc affect patients' laboratory indicators; Figure 2 below shows the results for some of these indicators.

The above results need to be discussed with clinical staff and drug researchers. After the various human factors and the objective influences of the business system have been stripped out, we may find previously unknown effects of Betaloc on patients' physiological indicators, some of which may be medically positive and some medically negative.

3.5 Statistical analysis design

The statistical analysis module of the virtual drug clinical trial analysis system contains statistical methods commonly used in drug development, such as the t-test, ANOVA, correlation analysis, regression analysis, and non-parametric tests. Its design follows standard statistical practice: the data are validated first, and the statistical method is then selected according to the validation results. Here we take the repeated measures design as an example.

The disease in this study is atherosclerotic heart disease, the treatment factor is the administration of Betaloc, and the observation indicator is blood potassium, which data mining had identified as being affected. We can use the module described in 3.3 to extract the screened samples for analysis, or select the required data and analyze it directly in this module. Two methods are available for repeated measures analysis, Hotelling's T2 test and ANOVA; the system provides both.

Some of the sample data are shown in Figure 3 below:

Here we examine only the output of the ANOVA method, shown in Figure 4 below.

From the P-values in the figure, we can see that the treatment factor (the drug Betaloc) has an effect on blood potassium, that measurement time has an effect on blood potassium, and that there is an interaction between the treatment factor and measurement time. This is consistent with the results we obtained by data mining.
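For readers unfamiliar with what an ANOVA test computes, the sketch below calculates the F statistic for the simplest case, a one-way ANOVA. It is deliberately simpler than the repeated measures analysis used in this section, which also models measurement time and its interaction with treatment; the function name and the toy values are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy measurements for three groups, invented for illustration only.
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_value:.2f}")
```

Turning F into a P-value requires the F distribution with (k - 1, N - k) degrees of freedom, which a statistics package would normally supply.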

4. Application of Data Mining to Traditional Chinese Medicine Research and Development

The content above used Western medicine as the example to illustrate virtual medical research based on data mining. In fact, data mining and virtual medical research are also well suited to research on traditional Chinese medicine. Through thousands of years of continuous exploration, accumulation, and validation, Chinese medicine has become a medical science with a complete theoretical system and a huge body of knowledge. We still need to apply modern knowledge to deepen our understanding of it, mine it, improve it, and apply it, so that it can be better integrated with modern science. Data mining is a powerful tool for exploring and explaining the mysteries of Chinese medicine.

Many institutions in China have also made some local attempts at data mining in Chinese medicine. We summarize these attempts as follows:
(1) text data mining of traditional Chinese medicine formulas;
(2) mining of "active ingredients", the monomers or chemical components that play a key pharmacological role;
(3) data mining and research on the rules governing Chinese medicine prescriptions;
(4) data mining of the relationship between the material basis of a prescription and its efficacy (syndromes and symptoms);
(5) mining of the relationship between the dosage of a prescription and its potency (the dose-effect relationship and its modeling);
(6) mining of the relationship between the theory of the medicinal properties of traditional Chinese medicines and their active ingredients;
(7) mining of correlations among the flavors of the medicines in a formula;
(8) mining of the implied similarities among similar diseases;
(9) mining and research on the similarities and differences among different prescriptions for the same kind of disease;
(10) data mining for the classification of and research on imprecisely defined diseases.