The advent of the era of big data
The development of the Internet, especially the mobile Internet, has accelerated the penetration of information technology into every aspect of the economy, society, and people's daily lives. According to some data, the average monthly traffic per Internet user worldwide was 1MB (megabyte) in 1998, 10MB in 2000, 100MB in 2003, and 1GB (1GB equals 1,024MB) in 2008, and is expected to reach 10GB in 2014. The time needed for total network-wide traffic to accumulate to 1EB (that is, 1 billion GB or 1,000PB) was one year in 2001, one month in 2004, and one week in 2007, and in 2013 it will take only one day; in other words, the information generated in a single day could fill 188 million DVDs. China has the most Internet users in the world, and the amount of data it generates every day is also among the largest. Taobao handles tens of millions of transactions every day, generates more than 50TB of data in a single day (1TB equals 1,000GB), and holds 40PB in storage (1PB equals 1,000TB). Baidu currently holds close to 1,000PB of data in total, stores nearly 1 trillion web pages, and handles approximately 6 billion search requests and dozens of petabytes of data every day. An 8Mbps (megabits per second) camera generates 3.6GB of data in an hour, and a city with hundreds of thousands of traffic and security cameras generates dozens of petabytes of data per month. Hospitals are also places where data generation is concentrated. A single patient's CT images now amount to tens of gigabytes, while the country records billions of outpatient visits a year, and this information must be preserved for a long time. In short, big data exists in all walks of life, and the era of big data is upon us.
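The camera figure above can be checked with a short calculation. The sketch below is only illustrative: it uses decimal units (1GB = 1,000MB) and a constant bit rate, and the function names and the `hours_per_day` parameter (hours of footage retained per camera per day) are assumptions introduced here, since the text does not say how much footage a city keeps.

```python
def camera_gb_per_hour(bitrate_mbps: float) -> float:
    """Data volume in GB produced in one hour at a constant bit rate."""
    megabits = bitrate_mbps * 3600   # 3,600 seconds in an hour
    megabytes = megabits / 8         # 8 bits per byte
    return megabytes / 1000          # decimal GB (assumption: 1GB = 1,000MB)


def fleet_pb_per_month(cameras: int, bitrate_mbps: float,
                       hours_per_day: float = 24.0, days: int = 30) -> float:
    """Monthly volume in PB for a camera fleet.

    hours_per_day is a hypothetical parameter: the hours of footage
    each camera retains per day. 24 means continuous recording.
    """
    gb = cameras * camera_gb_per_hour(bitrate_mbps) * hours_per_day * days
    return gb / 1_000_000            # 1PB = 1,000,000GB in decimal units


if __name__ == "__main__":
    # One 8Mbps camera for one hour: 8 * 3600 / 8 / 1000 GB
    print(camera_gb_per_hour(8))
    # Hypothetical city: 100,000 cameras retaining 3 hours of footage a day
    print(fleet_pb_per_month(100_000, 8, hours_per_day=3))
```

The monthly total scales linearly with the number of cameras, the bit rate, and the retained hours, so the result is very sensitive to the retention assumption.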
The information explosion did not begin today, but in recent years people have felt the momentum of big data more keenly. On the one hand, the number of Internet users keeps increasing; on the other, the number of connected devices, represented by the Internet of Things and home appliances, is growing even faster. In 2007, 500 million devices were connected worldwide, or 0.1 per capita; in 2013, 50 billion devices will be connected worldwide, or 70 per capita. As broadband spreads, per capita access bandwidth and traffic are also rising rapidly. New data generated worldwide is increasing by 40% a year, which means the total amount of information roughly doubles every two years, and this trend is set to continue. It is no longer unusual for a single dataset to reach tens of terabytes or even several petabytes, so large that its contents cannot be captured, managed, and processed within a tolerable time using conventional software tools.
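The link between 40% annual growth and a two-year doubling time follows from compound growth. A minimal sketch, assuming a constant annual growth rate (the function names are chosen here for illustration):

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_factor(annual_growth: float, years: float) -> float:
    """Total multiplication factor after the given number of years."""
    return (1 + annual_growth) ** years


if __name__ == "__main__":
    print(round(doubling_time_years(0.40), 2))  # a little over two years
    print(round(growth_factor(0.40, 2), 2))     # 1.4 * 1.4, just under 2
```

After two years at 40% growth the volume is 1.4 × 1.4 = 1.96 times the starting value, which is why "doubling every two years" is a close approximation rather than an exact figure.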
The larger the scale of the data, the harder it is to process, but the greater the value that may be gained from mining it, which is why big data is such a hot topic. First, big data reflects public sentiment and public opinion. The huge volume of data that netizens generate online records their thoughts, behaviors, and even emotions. It is a product of the deep integration of real society and cyberspace in the information age, and it is rich in meaning and in regular patterns. According to the China Internet Network Information Center, at the end of 2012 China had 564 million Internet users and 420 million mobile Internet users. By analyzing the relevant data, it is possible to understand the public's needs, demands, and opinions. Second, the information systems of enterprises and governments generate large amounts of data every day. According to a Symantec research report, the total amount of information stored by enterprises worldwide has reached 2.2ZB (1ZB equals 1,000EB), with an annual increase of 67%. Hospitals, schools, and banks also collect and store large amounts of information. Governments can deploy sensors and other sensing units to collect the information needed for environmental and social management. In 2011, the British journal Nature published a special issue stating that if big data can be organized and used more effectively, humanity will have more opportunities to bring the driving force of science and technology to bear on social development.
Big data applications
Big data technology can be applied in many industries. In macroeconomics, IBM Japan built an economic indicator forecasting system that searches Internet news for 480 items of economic data affecting the manufacturing industry and calculates a forecast of the purchasing managers' index. Researchers at Indiana University used a mood analysis tool provided by Google to classify nearly ten million Internet user messages into six moods and then predicted changes in the Dow Jones Industrial Average, with an accuracy of 87%. In manufacturing, Wall Street hedge funds analyze the sales of companies' products based on customer comments on shopping websites; some enterprises use big data analysis to manage procurement and keep inventory at reasonable levels, and analyze online data to understand customer demand and grasp market trends. According to some data, retailers worldwide lose $100 billion a year in sales due to blind purchasing, and data analytics can make a big difference here.
In agriculture, a climate company in Silicon Valley obtains decades of weather data from the U.S. National Weather Service and other databases, charts in detail how rainfall, temperature, and soil conditions correlate with crop yields in previous years, predicts a farm's output for the coming year, and sells personalized insurance to farmers. In the commercial field, Wal-Mart analyzes sales data to understand customers' shopping habits and identify products that sell well together, and it can also segment customers into groups to provide personalized service. In the financial sector, Wall Street's Derwent Capital Markets analyzes the messages of 340 million microblog accounts to gauge public sentiment, and decides whether to buy or sell a company's stock on the pattern that people buy stock when they are happy and sell when they are anxious. Based on the transactions of small and medium-sized enterprises on Taobao, Alibaba screens out those that are financially healthy and trustworthy and lends to them without collateral. It has lent more than 30 billion yuan, with a bad debt rate of just 0.3 percent.
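Finding products that sell well together, as in the retail example above, is often approached by counting how frequently pairs of items appear in the same basket. The sketch below is a generic illustration of that idea, not a description of any retailer's actual system; the baskets, the function names, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))            # ignore repeats within a basket
        counts.update(combinations(items, 2))  # every unordered pair once
    return counts


def top_pairs(baskets, min_support=0.5):
    """Return (pair, support) for pairs found in at least min_support of baskets."""
    n = len(baskets)
    result = [(pair, c / n) for pair, c in pair_counts(baskets).items()
              if c / n >= min_support]
    # Highest support first; ties broken alphabetically for a stable order
    return sorted(result, key=lambda entry: (-entry[1], entry[0]))


if __name__ == "__main__":
    # Hypothetical sales records, one list of items per transaction
    baskets = [
        ["bread", "milk"],
        ["bread", "butter", "milk"],
        ["milk", "eggs"],
        ["bread", "butter"],
    ]
    for pair, support in top_pairs(baskets):
        print(pair, support)
```

Counting every pair grows quadratically with basket size, so at the scale described in this article such counting would normally be distributed or restricted to frequent items first.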
In healthcare, the "Google Flu Trends" project analyzes the global spread of influenza and other diseases based on what Internet users search for; compared with the reports of the U.S. Centers for Disease Control and Prevention, its disease tracking is 97% accurate. Social networks give many patients with chronic diseases a platform to share their clinical symptoms and experiences, allowing doctors to obtain clinical outcome statistics that are not usually available in hospitals. Big data analysis of human genes makes personalized treatment possible. In public security management, mining mobile phone data makes it possible to analyze in real time the origins and movements of the floating population, as well as real-time traffic flow and congestion. Using SMS, Weibo, WeChat, and search engines, hot events can be collected, public opinion can be mined, and the sources of rumors can be tracked. The Massachusetts Institute of Technology (MIT) in the U.S. processed the calls, text messages, and spatial locations of more than 100,000 people's mobile phones to extract the spatial and temporal regularities of human behavior and make crime predictions. In scientific research, discovery based on data-intensive analysis has become the fourth paradigm after experimental, theoretical, and computational science, and materials genomics and synthetic biology based on big data analysis are emerging.
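One common way to quantify how closely a search-based estimate tracks official reports, as in the flu example above, is the Pearson correlation between the two series. The text does not say how the 97% figure was computed, so the sketch below is only an illustration of that general technique; the weekly values are hypothetical, not real surveillance data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical weekly values: search-based estimate vs. reported cases
    search_estimate = [1.2, 1.8, 2.9, 4.1, 3.6, 2.2]
    official_report = [1.0, 1.9, 3.1, 4.0, 3.3, 2.4]
    print(round(pearson(search_estimate, official_report), 3))
```

A value near 1 means the two series rise and fall together; it does not by itself show that the estimate matches the reported level.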
McKinsey & Company's 2011 report estimated that big data could generate a potential value of $300 billion a year if applied to healthcare in the U.S., and €250 billion a year if applied to public sector management in Europe; service providers using personal location data could yield a potential annual consumer surplus of $600 billion; and with big data analytics, retailers could increase operating profits by 60% and manufacturers could cut equipment assembly costs by 50%.
Challenges and Implications of Big Data Technology
At present, the use of big data technology still faces difficulties and challenges, which show up in the four stages of big data mining. The first is data collection. Data from networks, including the Internet of Things and institutional information systems, must be tagged with time and location, screened to remove false data, and gathered as far as possible from different sources and even in different structures; where necessary, it should be cross-checked against historical data to verify its comprehensiveness and credibility from multiple angles. The second is data storage. To achieve low cost, low energy consumption, and high reliability, redundant configuration, distributed storage, and cloud computing technologies are usually used. When storing, data should be classified according to certain rules, the volume stored should be reduced through filtering and de-duplication, and labels should be added to ease later retrieval. The third is data processing. Data in some industries involves hundreds of parameters, and its complexity lies not only in the data samples themselves but also in the dynamic interactions among multiple heterogeneous sources, multiple entities, and multiple spaces. This is difficult to describe and measure with traditional methods, and the processing is highly complex: multimedia data such as high-dimensional images must be reduced in dimension before being measured and processed, contextual correlation must be used for semantic analysis, and information must be synthesized from large amounts of dynamic and possibly ambiguous data to derive comprehensible content. The fourth is the visual presentation of results, which makes them more intuitive and easier to gain insight from.
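The collection and storage steps described above (time and location tags, filtering, de-duplication, and retrieval labels) can be sketched in a few lines. This is a minimal single-machine illustration under stated assumptions, not a production design: the `Record` fields, the `ingest` function, the keyword-based `keyword_labels` classifier, and the sample records are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    source: str         # where the data came from
    timestamp: str      # time tag, e.g. ISO 8601
    location: str       # location tag
    payload: str        # the content itself
    labels: tuple = ()  # retrieval labels added at storage time


def content_key(payload: str) -> str:
    """Hash of the normalized content, used to detect duplicates."""
    normalized = payload.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def keyword_labels(record: Record) -> tuple:
    """Hypothetical classifier: label a record by keywords in its payload."""
    rules = {"traffic": "traffic", "air": "environment"}
    words = set(record.payload.lower().split())
    return tuple(label for keyword, label in rules.items() if keyword in words)


def ingest(raw_records, classify=keyword_labels):
    """Filter untagged or empty records, drop duplicates, and attach labels."""
    seen = set()
    stored = []
    for rec in raw_records:
        # Filtering: skip records missing content or time/location tags
        if not rec.payload.strip() or not rec.timestamp or not rec.location:
            continue
        # De-duplication: keep only the first copy of identical content
        key = content_key(rec.payload)
        if key in seen:
            continue
        seen.add(key)
        stored.append(replace(rec, labels=classify(rec)))
    return stored


if __name__ == "__main__":
    raw = [
        Record("sensor-a", "2013-01-01T00:00:00", "31.2,121.5", "Traffic jam on ring road"),
        Record("weibo", "2013-01-01T00:05:00", "31.2,121.5", "traffic jam on ring road "),
        Record("sensor-b", "", "31.2,121.5", "Clear"),
        Record("sensor-c", "2013-01-01T00:10:00", "39.9,116.4", "Air quality normal"),
    ]
    for rec in ingest(raw):
        print(rec.source, rec.labels)
```

Hashing normalized content only catches exact duplicates after trimming and lowercasing; near-duplicates and cross-checks against historical data would need additional techniques.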
At present, despite great progress in computer intelligence, it can only analyze small-scale, structured or quasi-structured data and falls short of deep data mining, and existing data mining algorithms are difficult to generalize across industries.
The prospects for using big data technology are very bright. China is currently building a moderately prosperous society in all respects, and the tasks of industrialization, informatization, urbanization, and agricultural modernization are heavy. Building the next generation of information infrastructure, developing a modern information technology industry, improving the information security system, and promoting the extensive use of information network technology are what guarantee the synchronized development of these four. Big data analysis is of great significance for understanding the world and our national conditions in depth, grasping their laws, achieving scientific development, and making scientific decisions. We must recognize anew the important value of data.
To develop the gold mine of big data, we have a great deal of work to do. First, big data analysis needs the support of big data technology and products. Some information technology (IT) companies in developed countries moved early and, through increased development efforts, mergers, and other means, are striving to transform themselves into big data solution providers. Some foreign enterprises advertise free big data analysis, both to gain practice and to gather intelligence. Relying too heavily on foreign big data analysis technologies and platforms makes the risk of information leakage hard to avoid. Some everyday information may seem insignificant, but it can in fact be used to take the pulse of a country's economy and society. We therefore need big data technology and products that are independent and controllable. In March 2012 the U.S. government released the "Big Data Research and Development Initiative", another major science and technology deployment following the announcement of the "information superhighway" in 1993, and the federal government and several departments have allocated funds for big data development. There is a considerable gap between us and the developed countries, which makes national policy support all the more important.
China, with the world's largest population, will become the country that generates the most data, but we do not pay enough attention to preserving data and make little use of the data we store. In addition, some departments and organizations in China hold large amounts of data but are unwilling to share it with other departments, leading to incomplete information or duplicated investment. The government should break down data fragmentation and blockades through reform of institutions and mechanisms, should attach importance to opening up information, and should emphasize data mining. The U.S. federal government has established a unified open data portal that provides information services to society and encourages mining and use of the data. For example, it provides information on the relationship between weather and flight delays in different places, which pushes airlines to improve their on-time performance.
The mining and use of big data should be governed by law. The decision adopted by the National People's Congress late last year to strengthen the protection of online information is a good start, and an "information disclosure law" should be enacted as soon as possible to suit the arrival of the big data era. Many organizations and enterprises now hold large amounts of customer information. Data mining for the benefit of groups and society should be encouraged, while infringement of individual privacy must be prevented; data sharing should be advocated, while misuse of data must be prevented. The authority for and scope of data mining and use also need to be defined. The security of big data systems themselves deserves special attention as well: both technical security and the security of management systems must be addressed to prevent information from being damaged, tampered with, leaked, or stolen, and to protect the information security of citizens and the state.
The era of big data calls for innovative talent. Gartner predicts that big data will bring 4.4 million new IT jobs and tens of millions of non-IT jobs worldwide. McKinsey & Company predicts that by 2018 the United States will need 440,000 to 490,000 people with deep data analytics skills, a shortfall of 140,000 to 190,000, and will need 1.5 million managers who understand both their organization's needs and the technology and applications of big data, where the shortfall is far larger. China has abundant human resources, but innovative talent able to understand and apply big data is scarce.
Big data is a concentrated expression of the new generation of information technology, a service field strongly driven by applications, and an emerging industry with boundless potential. Its standards and industrial landscape have not yet taken shape, which gives China a valuable opportunity to achieve leapfrog development. We need to treat the development and use of big data as a strategic priority and as an effective lever for transforming the mode of economic growth, while taking care to plan scientifically and avoid a headlong rush.