What are the companies that do big data solutions in China?

With the advent of the "era of big data", enterprises are paying more and more attention to the role of data, and the value that data brings to enterprises keeps increasing. This document describes the opportunities and challenges of big data and presents a big data solution for organizations.

The first step is to figure out what big data actually is. It is not simply a large amount of data or massive data, but a gold mine of data with the 4V characteristics, and it brings both opportunities and challenges to our businesses.

The second step is to analyze, based on the characteristics of big data, what capabilities an enterprise big data platform should have in order to meet these challenges.

The third step, based on the requirements of the big data platform, is to propose a technical solution for enterprise big data and explain how the solution addresses the big data challenges.

Finally, I look at the current problems with big data applications and how they will develop in the future.

What is Big Data?

From the data point of view, big data is not simply "big and many"; it has the 4V characteristics. Simply put: large Volume, wide Variety, fast Velocity, and low Value density.

Large volume: According to a recent research report, global data volume is expected to skyrocket 44-fold to 35.2 ZB by 2020. When we speak of big data, an enterprise's data volume generally has to reach the petabyte level before it can be called big data.

Wide variety: Besides its sheer volume, big data includes both structured and unstructured data: emails, Word documents, pictures, audio, video, and other types that earlier relational databases could not handle.

Fast velocity: This refers to the speed of data generation and collection. With the growth of e-commerce, mobile office, wearable devices, the Internet of Things, smart communities, and so on, data is now generated at the level of seconds, and enterprises require the ability to access data and make decisions in real time.

Low value: This refers to value density. The overall value of data keeps rising, but because data volume grows even faster, value density falls correspondingly; the valuable data is buried in a majority of worthless data, and enterprises need to dig business value out of the mass.

From the developer's point of view, big data differs from earlier database and data warehouse technologies; it represents a series of new technologies led by Hadoop and Spark.

The distinguishing features of these technologies are distributed processing and in-memory computing.

Distributed: Simply put, distributed means splitting a complex, time-consuming task into many small tasks and processing them in parallel. The tasks here include data collection, data storage, and data processing.
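As a rough, single-machine sketch of this split-and-parallelize idea (a real cluster such as Hadoop MapReduce or Spark distributes the same pattern across many nodes), the toy word-count below is illustrative; the data and function names are assumptions, not part of the original solution:

```python
# A minimal sketch of "split one big job into small parallel tasks",
# using Python's multiprocessing on a single machine.
from multiprocessing import Pool

def word_count(chunk):
    """Count words in one chunk of text (the 'map' step)."""
    return len(chunk.split())

if __name__ == "__main__":
    document = "big data splits one large job into many small tasks " * 1000
    # Split the large input into small, independent tasks.
    # (Approximate: chunk boundaries may fall mid-word in this toy example.)
    size = len(document) // 4
    chunks = [document[i * size:(i + 1) * size] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(word_count, chunks)  # process in parallel
    total = sum(partial_counts)  # combine the results (the 'reduce' step)
    print(total)
```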

In-memory computing: Essentially, the CPU reads data directly from memory rather than from the hard disk, then computes and analyzes it. In-memory computing is ideally suited to massive data volumes and to workloads that need real-time results. For example, it makes it possible to keep nearly all of an enterprise's finance, marketing, and sales data for the past ten years in memory at once.
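A minimal PySpark sketch of this idea follows: load a dataset once, cache it in cluster memory, then run repeated analyses without re-reading the disk. The file path and column names here are illustrative assumptions:

```python
# Cache a dataset in memory so later queries avoid disk reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical ten years of transaction records.
sales = spark.read.parquet("/data/sales_2010_2020.parquet")
sales.cache()  # keep the dataset in cluster memory after first use

# After the first action, both queries hit memory, not the hard disk.
sales.groupBy("year").sum("amount").show()
sales.filter(sales.region == "east").count()
```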

Data mining: The core of big data also includes data mining technology, which is closely linked to statistics. It falls roughly into four major categories: classification, clustering, prediction, and correlation, and it uses mathematical methods to extract latent patterns or knowledge from large amounts of incomplete and fuzzy data.
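To make one of the four categories concrete, here is a toy clustering sketch using scikit-learn's KMeans; the data points are made up for illustration and are not from the original text:

```python
# Clustering: group unlabeled points into natural clusters.
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (e.g. customer spend vs. visit count).
points = np.array([[1, 2], [1, 1], [2, 2], [9, 9], [8, 9], [9, 8]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # the discovered group centers
```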

Big Data Platform Requirements

The capabilities required of a big data platform fall into five aspects: data collection, data storage, data computation or processing, data mining, and data presentation.

Data collection: The platform needs the ability to collect both massive data and real-time data; this is the first step of data utilization.

Data storage: Matching the characteristics of big data, storage needs to be large-capacity, highly fault-tolerant, and efficient; this is the foundation of data utilization.

Data computation: The platform needs powerful, cheap, and fast data processing and computation ability: "powerful" answers big data's volume and variety, "cheap" answers its low value density, and "fast" answers its velocity. This is the key to whether big data can develop at all.

Data mining: The platform must be able to analyze and mine data value from every angle and dimension; only well-applied data mining transforms data into value. This is the core of data utilization.

Data presentation: Multi-channel, intuitive, and rich presentation forms are the outward face of the data; presentation is the highlight of a data application and the window through which users recognize its value.

These are the problems a big data platform needs to solve and the capabilities it must therefore possess.

Technical solutions

Following the data processing flow, the enterprise big data solution is divided into a data collection layer, data storage layer, data computing layer, data mining layer, and data presentation layer; each layer addresses key big data challenges. (In the original architecture diagram, the components marked in yellow are traditional data processing technologies.)

Data Acquisition Layer:

Data acquisition is divided into real-time acquisition and scheduled acquisition. Real-time acquisition uses tools such as Oracle GoldenGate to capture incremental data as it changes, ensuring timeliness; scheduled acquisition uses tools such as SAP Data Services to extract data at regular intervals, mainly for high-volume, non-real-time data. Distributed ETL tools such as Kettle and Sqoop are added to enrich and diversify the extraction services, while Kafka is added to integrate and process large volumes of real-time data.
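As a minimal sketch of the real-time ingestion path mentioned above, the snippet below consumes records with the kafka-python client; the broker address and the "orders" topic are assumptions for illustration only:

```python
# Consume real-time records from a Kafka topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                              # hypothetical topic name
    bootstrap_servers="broker1:9092",      # hypothetical broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:  # blocks, yielding records as they arrive
    record = message.value
    print(record)  # in practice: write to HDFS/HBase or hand off to Spark
```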

Data storage layer:

The storage layer adds a distributed file system, a distributed columnar database, an in-memory file system, an in-memory database, full-text search, and other modules on top of the traditional Oracle base. Among them, the distributed file system Ceph stores unstructured data, because it balances data distribution and parallelizes access well; the distributed file system HDFS stores the other, structured data, thanks to its excellent scalability and compatibility; and the columnar database HBase mainly stores massive data for specific computing and query services.
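A minimal sketch of the HBase usage described above, via the happybase client: rows are keyed for the query pattern, so one user's records can be scanned by prefix. The host, table, and column family names are illustrative assumptions:

```python
# Store and scan rows in HBase with a query-oriented row key design.
import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("user_events")          # hypothetical table

# Row key combines user id and timestamp for fast per-user lookups.
table.put(b"u1001-20240101", {b"d:action": b"login", b"d:ip": b"10.0.0.5"})

# Scan all events for one user via a row-key prefix.
for key, data in table.scan(row_prefix=b"u1001-"):
    print(key, data)
```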

Data Computing Layer:

The computing layer adopts standard SQL queries, full-text search, interactive analytics with Spark, real-time processing with Spark Streaming, offline batch processing, graph computation with GraphX, and other technologies to compute over structured data, unstructured data, real-time data, and high-volume data.
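As a minimal sketch of the standard-SQL path in this layer, the PySpark snippet below registers structured data as a view and queries it with ordinary SQL; the file path and schema are illustrative assumptions:

```python
# Run standard SQL over distributed data with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

orders = spark.read.json("/data/orders.json")  # hypothetical input file
orders.createOrReplaceTempView("orders")

# Standard SQL, executed by Spark's distributed engine.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").show()
```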

Advantages of Spark, the core in-memory computing engine:

Lightweight and fast processing.

Easy to use: Spark supports multiple languages.

Support for complex queries.

Real-time stream processing.

Can integrate with Hadoop and existing Hadoop data.

Can integrate with Hive (a sketch follows below).
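A minimal sketch of that Hive integration: with Hive support enabled, Spark can query existing Hive tables directly through the metastore. The table name here is an illustrative assumption:

```python
# Query a pre-existing Hive table with Spark's engine.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-demo")
    .enableHiveSupport()   # use the Hive metastore and warehouse
    .getOrCreate()
)

spark.sql("SELECT COUNT(*) FROM warehouse.daily_sales").show()
```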

Data Mining Layer: Analysis tools such as Spark MLlib, R, and Mahout build models and an algorithm library on top of the model analysis engine. The algorithm library trains models, generates model instances, and finally makes real-time and offline decisions based on those instances.
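As a minimal Spark MLlib sketch of that flow (train a model, obtain a fitted model instance, then score new data for decisions), the columns, labels, and paths below are illustrative assumptions:

```python
# Train a classifier with Spark MLlib, then score new records.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

df = spark.read.parquet("/data/labeled_customers.parquet")  # hypothetical
assembler = VectorAssembler(inputCols=["age", "spend"], outputCol="features")
train = assembler.transform(df)

# Train the model (the "model algorithm library" step)...
model = LogisticRegression(labelCol="churned").fit(train)

# ...then decide on fresh records using the fitted model instance.
new_data = assembler.transform(spark.read.parquet("/data/new_customers.parquet"))
model.transform(new_data).select("age", "spend", "prediction").show()
```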

Data Presentation Layer: Provides portals, data charts, e-mail, office software, and other forms of data analysis output, and supports presentation on large screens, desktop computers, mobile terminals, and so on.

Conclusion

As high-performance computing and the storage and management of massive data keep improving, the problems that technology can solve will cease to be problems. Three aspects will genuinely constrain, or become bottlenecks for, the development and application of big data:

First, the legitimacy of data collection and extraction: the trade-off between protecting data privacy and exploiting it.

When any enterprise or organization extracts private data from a crowd, users have the right to know, and using users' private data for commercial purposes requires their approval. At present, however, in China and around the world, the management questions of how user privacy should be protected, how business rules should be formulated, how violations of user privacy should be punished, and how legal norms should be drafted all lag behind the pace of big data's development. In the future, many big data businesses will wander in a gray area during their early stages; once commercial operation takes shape and begins to affect large numbers of consumers and companies, the relevant laws, regulations, and market norms will be forced to catch up.

It is foreseeable that, although the applications of big data technology can be unlimited, restrictions on data collection mean the data that can actually be used commercially and serve people is far smaller than what big data could theoretically collect and process, and the limited collection of data sources will in turn limit the commercial application of big data.

Second, the synergies of big data require enterprises at every link of the industry chain to strike a balance between competition and cooperation.

Big data, by the nature of its ecosystem, demands more cooperation among enterprises. Without a macro grasp of the whole industry chain, an individual company cannot understand, from its own isolated data, how the data at each link of the chain relates, and its judgment of and influence on consumers is very limited. In industries where information asymmetry is pronounced, such as banking and insurance, the need for data sharing among enterprises is more urgent. For example, banks and insurers often need an industry-wide shared database so that members can see individual users' credit histories, eliminating the information asymmetry between guarantor and consumer and making transactions smoother. In many cases, however, the businesses that need to share information are simultaneously competitors and partners; each must weigh the pros and cons before sharing data, so as not to surrender its competitive advantage. Moreover, when many businesses cooperate, they can easily form seller alliances that harm consumer interests and the fairness of competition.

The most imaginative direction for big data is to integrate data from different industries into an all-round, three-dimensional data map that seeks to understand and reshape user needs from a systemic perspective. But cross-industry data sharing must balance the interests of too many enterprises; without a neutral third-party organization to step in, coordinate the relationships among all participating enterprises, and set the rules for sharing and applying data, the usefulness of big data will be limited. The lack of an authoritative, neutral third party will keep big data from reaching its full potential.

Third, the interpretation and application of big data conclusions.

Big data can reveal, at the data level, possible correlations between variables, but how do those correlations translate into industry practice? How do you turn big data findings into actionable solutions? These questions require practitioners not only to interpret big data but also to understand the linkages among the elements of an industry's development. This link builds on big data technology but involves management and execution, and here the human factor becomes the key to success.

From the technical standpoint, the executive needs to understand big data technology and be able to interpret the conclusions of big data analysis; from the industry standpoint, the executive needs to understand the relationships among the industry's production processes and the possible correlations among its elements, and to map big data conclusions one by one onto the industry's concrete execution links; from the management standpoint, the executive needs to formulate an executable solution and ensure that it neither conflicts with the management process nor creates new problems while solving old ones. These demands require executives who are not only technically savvy but also excellent managers with systems thinking, able to view the relationship between big data and the industry from the perspective of a complex system. The scarcity of such talent will constrain the development of big data.