Interpretation of big data analysis methods and introduction to related tools

It must be understood that big data is no longer just about the data itself; what matters most is analyzing it. Only through analysis can we obtain intelligent, in-depth, and valuable information.

More and more applications involve big data, and its attributes, including volume, velocity, and variety, all point to its growing complexity. Analysis methods are therefore particularly important in the big data field; they can be said to be the decisive factor in whether the final information is valuable. With that in mind, what are the main ideas behind big data analysis methods?

Five basic aspects of big data analysis

Predictive Analytic Capabilities

Data mining allows analysts to better understand the data, and predictive analysis allows them to make forward-looking judgments based on the results of visual analysis and data mining.

Data Quality and Master Data Management

Data quality and master data management are established management best practices. Processing data through standardized processes and tools ensures predefined, high-quality analysis results.

Analytic Visualizations (visual analysis)

Whether it is a data analysis expert or an ordinary user, data visualization is the most basic requirement for data analysis tools. Visualization can display data intuitively, let the data speak for itself, and let the audience hear the results.

Semantic Engines

We know that the diversity of unstructured data brings new challenges to data analysis, and a series of tools is needed to parse, extract, and analyze it. Semantic engines need to be designed to intelligently extract information from "documents".

Data Mining Algorithms (data mining algorithms)

Visualization is for people, and data mining is for machines. Clustering, segmentation, outlier analysis, and other algorithms let us dig deep into the data and discover value. These algorithms must cope not only with the volume of big data but also with the speed at which it must be processed.

If big data is really the next important technological innovation, we'd better focus on the benefits that big data can bring us, not just the challenges.

Big data processing

Big data processing reflects three conceptual shifts of the data era: totality rather than sampling, efficiency rather than absolute accuracy, and correlation rather than causation. There are many specific methods for processing big data, but based on long-term practice the author has summarized a basic process that should help streamline the work. The entire process can be summarized in four steps: collection, import and preprocessing, statistics and analysis, and mining.

Collection

Big data collection refers to the use of multiple databases to receive data from clients; users can perform simple queries and processing through these databases. For example, e-commerce companies use traditional relational databases such as MySQL and Oracle to store each transaction record, and NoSQL databases such as Redis and MongoDB are also commonly used for collection.

In the collection process, the main feature and challenge is high concurrency, because thousands of users may be accessing and operating at the same time. Train ticket sales websites and Taobao, for example, see concurrent access in the millions at peak times, so a large number of databases must be deployed on the collection side to cope, and how to load balance and shard across these databases requires careful thought and design.
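As a rough sketch of the sharding idea, the routing logic might look like the following (the shard key, shard count, and connection strings are illustrative assumptions, not taken from the text):

```python
# Minimal sketch of hash-based sharding on the collection side.
# The shard key (user_id), shard count, and connection strings are
# illustrative assumptions, not a prescription from the article.

NUM_SHARDS = 4
SHARD_DSNS = [f"mysql://collector-db-{i}/orders" for i in range(NUM_SHARDS)]

def pick_shard(user_id: int) -> str:
    """Route a user's writes to one of several collection databases."""
    return SHARD_DSNS[hash(user_id) % NUM_SHARDS]

if __name__ == "__main__":
    for uid in (10001, 10002, 10003):
        print(uid, "->", pick_shard(uid))
```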

Statistics/Analysis

Statistics and analysis mainly use distributed databases or distributed computing clusters to perform ordinary analysis, classification, and summarization of the massive data stored in them, in order to meet most common analysis needs. For real-time requirements, EMC's Greenplum, Oracle's Exadata, and the MySQL-based columnar store Infobright are often used, while requirements based on batch processing or semi-structured data can use Hadoop. The main feature and challenge of the statistics and analysis step is the large volume of data involved, which consumes a great deal of system resources, especially I/O.
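As a toy illustration of this kind of classification and summarization (the column names and figures are invented, and pandas stands in for what would in practice be a query against Greenplum, Exadata, Infobright, or a Hadoop job):

```python
# Illustrative aggregation: classify transactions by region and summarize.
# Column names and values are invented; in production this would be a SQL
# query against a distributed database or a Hadoop job.
import pandas as pd

orders = pd.DataFrame({
    "region": ["east", "east", "west", "west", "north"],
    "amount": [120.0, 80.0, 200.0, 50.0, 90.0],
})

summary = orders.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)
```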

Import/preprocessing

Although the collection side itself has many databases, to analyze this massive data effectively it should still be imported from the front end into a centralized large-scale distributed database or distributed storage cluster, and some simple cleaning and preprocessing can be done during the import. Some users also use Twitter's Storm to perform streaming computation on the data as it is imported, to meet the real-time computing needs of certain businesses. The main characteristic and challenge of the import and preprocessing step is the volume of imported data, which often reaches hundreds of megabytes or even gigabytes per second.
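Storm itself is a JVM-based system; purely to illustrate the kind of simple cleaning and preprocessing done during import, here is a hedged Python sketch with invented field names and rules:

```python
# Sketch of per-record cleaning during import: drop malformed rows and
# normalize types. Field names, sample data, and rules are illustrative
# assumptions, not the article's actual pipeline.
import csv
import io

RAW = """user_id,amount,ts
10001,120.5,2012-05-01
,80.0,2012-05-01
10003,notanumber,2012-05-02
10004,55.0,2012-05-02
"""

def clean(records):
    for row in records:
        if not row["user_id"]:
            continue                      # drop rows without a key
        try:
            row["amount"] = float(row["amount"])
        except ValueError:
            continue                      # drop rows with bad numerics
        yield row

for rec in clean(csv.DictReader(io.StringIO(RAW))):
    print(rec)
```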

Mining

Unlike the preceding statistics and analysis step, data mining generally has no preset theme; it mainly applies various algorithms to the existing data to achieve a predictive effect and thereby satisfy higher-level analysis requirements. Typical algorithms include K-Means for clustering, SVM for statistical learning, and Naive Bayes for classification, and the main tools used include Hadoop's Mahout. The characteristic and challenge of this step is that the mining algorithms are complex and the data volumes and computation involved are large; in addition, commonly used data mining algorithms are mostly single-threaded.
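The tool named here is Hadoop's Mahout; purely as a single-machine illustration of what K-Means clustering does, the following sketch uses scikit-learn on synthetic data (an assumption for illustration, not the article's setup):

```python
# Tiny K-Means example on synthetic 2-D points; scikit-learn is used only
# for illustration -- the article's tool of choice is Hadoop's Mahout.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic clusters around (0, 0) and (5, 5).
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("centers:\n", model.cluster_centers_)
print("first 10 labels:", model.labels_[:10])
```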

Detailed explanation of big data analysis tools, IBM, HP and Microsoft tools are on the list

Last year, IBM announced the acquisition of data analysis company Netezza for US$1.7 billion; EMC, after acquiring data warehouse software vendor Greenplum, went on to acquire clustered NAS manufacturer Isilon; Teradata acquired Aster Data; and HP subsequently acquired the real-time analytics platform Vertica. These acquisitions all point to the same target market: big data. Yes, the era of big data has arrived, and everyone is gearing up to seize the market opportunity.

The most dazzling star here is Hadoop, which has been recognized as the new generation of big data processing platform; EMC, IBM, Informatica, Microsoft and Oracle have all invested in it. For big data, the most important thing is to analyze it and find the valuable data that helps companies make better business decisions. Next, let's take a look at the following tools for big data analysis.

EMC Greenplum Unified Analytics Platform (UAP)

Greenplum was acquired by EMC in 2010. The EMC Greenplum Unified Analytics Platform (UAP) is a single software platform on which data teams and analytics teams can seamlessly share information and collaborate on analysis, without having to work in separate silos or move data between them. Accordingly, UAP includes the EMC Greenplum relational database, the EMC Greenplum HD Hadoop distribution, and EMC Greenplum Chorus.

The hardware developed by EMC for big data is the modular EMC Data Computing Appliance (DCA), which can run and expand Greenplum relational database and Greenplum HD nodes in one device.

DCA provides a shared Command Center interface that allows administrators to monitor, manage, and configure Greenplum Database and Hadoop system performance and capacity. As the Hadoop platform matures, analytics capabilities are expected to increase dramatically.

IBM combines BigInsights and BigCloud

A few years ago, IBM began experimenting with Hadoop in its laboratories, and last year it folded the related products and services into the commercial IBM InfoSphere BigInsights. In May of last year IBM launched a cloud version of InfoSphere BigInsights, enabling any user within an organization to perform big data analysis. The BigInsights software in the cloud can analyze structured and unstructured data in databases, allowing decision makers to quickly turn insights into action.

In October, IBM went on to make BigInsights and BigSheets available as a service through its SmartCloud Enterprise infrastructure. The service comes in basic and enterprise editions; a major selling point is that customers can learn and try out big data processing and analysis capabilities without purchasing supporting hardware or hiring IT expertise. According to IBM, customers can set up a Hadoop cluster and transfer data into it in less than 30 minutes, with data processing charges starting at 60 cents per cluster per hour.

Informatica 9.1: Turning big data challenges into big opportunities

Informatica went a step further last October with the launch of HParser, a data transformation environment optimized for Hadoop. According to Informatica, the software supports flexible and efficient processing of any file format in Hadoop, giving Hadoop developers out-of-the-box parsing capabilities for complex and diverse data sources, including logs, documents, binary data and hierarchical data, as well as numerous industry-standard formats (such as NACHA in banking, SWIFT in payments, FIX in financial data, and ACORD in insurance). Just as in-database processing technology accelerates various analysis methods, Informatica is adding parsing code to Hadoop to take advantage of all this processing power, and will soon add other data processing code as well.

Informatica HParser is the latest addition to the Informatica B2B Data Exchange family of products and the Informatica platform, designed to meet the growing demand for extracting business value from massive amounts of unstructured data. Last year, Informatica successfully launched the innovative Informatica 9.1 for Big Data, which is the world's first unified data integration platform built specifically for big data.

Oracle Big Data Appliance

Oracle's Big Data Appliance integrated system includes Apache Hadoop and Cloudera Manager, Cloudera's Hadoop system management software, along with support services. Oracle positions the Big Data Appliance alongside its other engineered systems, including Exadata, Exalogic, and the Exalytics In-Memory Machine.

Oracle Big Data Appliance is an integrated software and hardware system that incorporates Cloudera's Distribution Including Apache Hadoop, Cloudera Manager, and an open source distribution of R. The appliance runs the Oracle Linux operating system and ships with Oracle NoSQL Database Community Edition and the Oracle HotSpot Java Virtual Machine. A full-rack configuration provides 864 GB of memory, 216 CPU cores, 648 TB of raw storage, and 40 Gb/s InfiniBand connectivity. The Big Data Appliance sells for US$450,000, plus an annual fee for hardware and software support.

Oracle Big Data Appliance competes with the EMC Data Computing Appliance. IBM has also launched the InfoSphere BigInsights data analysis software platform, and Microsoft announced a Hadoop-based large-scale data processing platform for SQL Server 2012.

Detailed introduction to statistical analysis methods and statistical software

What are the statistical analysis methods? Below we elaborate on them and introduce some commonly used statistical analysis software.

1. Indicator comparative analysis method

The indicator comparative analysis method, also known as the comparative analysis method, is the most commonly used of the statistical analysis methods. It reflects differences and changes in the quantity of things by comparing related indicators. Only through comparison can we distinguish: looking at an indicator in isolation only describes certain quantitative characteristics of the whole and supports no firm conclusion. Once compared, for example with other countries or other units, with historical data, or with a plan, judgments and evaluations can be made about scale, level, and speed.

The indicator comparative analysis method can be divided into static comparison and dynamic comparison. Static comparison is the comparison of indicators across different populations under the same time conditions, such as comparisons between different departments, different regions, or different countries, also called horizontal comparison; dynamic comparison is the comparison of indicator values in different periods for the same population, also called vertical comparison. These two methods can be used alone or in combination. When conducting comparative analysis, total indicators, relative indicators, or average indicators can be used alone or in combination. The comparison results can be expressed as relative numbers, such as percentages, multiples, or coefficients, or as the absolute difference together with the corresponding percentage points (every 1% is one percentage point), that is, by subtracting the compared indicators.
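As a tiny worked example of the difference between a relative number and percentage points (the figures are invented):

```python
# Worked example: comparing this year's growth rate with last year's.
# Figures are invented for illustration.
last_year_rate = 0.08    # 8% growth last year
this_year_rate = 0.095   # 9.5% growth this year

ratio = this_year_rate / last_year_rate                   # relative number (multiple)
diff_in_points = (this_year_rate - last_year_rate) * 100  # percentage points

print(f"this year is {ratio:.2f} times last year's rate")
print(f"an increase of {diff_in_points:.1f} percentage points")
```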

2. Grouping analysis method

The indicator comparative analysis method compares totals, but the units that make up a statistical population have a variety of characteristics, so there are many differences between units within the same population. Statistical analysis must not only analyze the overall quantitative characteristics and relationships, but also conduct in-depth group analysis within the population. The grouping analysis method divides the study population into several parts according to one or several signs, in line with the purpose of the analysis, then organizes, observes, and analyzes them to reveal their internal connections and regularities.

The key issues in the statistical grouping method are the correct selection of the grouping sign and the delineation of the boundaries of each group.
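A minimal sketch of grouping by a single sign, here an invented income variable with assumed group boundaries:

```python
# Grouping analysis sketch: split units into groups by one characteristic
# and summarize each group. The variable, boundaries, and data are assumed.
import pandas as pd

households = pd.DataFrame({"income": [12, 18, 25, 31, 45, 52, 68, 90]})  # thousands

bins = [0, 20, 40, 60, 100]                 # group boundaries (assumed)
labels = ["low", "lower-middle", "upper-middle", "high"]
households["group"] = pd.cut(households["income"], bins=bins, labels=labels)

print(households.groupby("group", observed=False)["income"].agg(["count", "mean"]))
```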

3. Time series and dynamic analysis method

A time series is the set of values of the same indicator as it changes and develops over time, arranged in chronological order; it is also called a dynamic series. It can reflect the development and change of social and economic phenomena. By compiling and analyzing time series, the laws of dynamic change can be found, providing a basis for predicting future development trends. Time series can be divided into absolute time series, relative time series, and average time series.

Time series speed indicators.

Speed indicators that can be calculated based on an absolute time series include development speed, growth speed, average development speed, and average growth speed.
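With an invented absolute series, these speed indicators can be computed as follows (a minimal sketch; the average development speed is the geometric mean of the period-on-period speeds):

```python
# Speed indicators from an absolute time series (figures invented).
# development speed         = current level / previous level
# growth speed              = development speed - 1
# average development speed = geometric mean of the period-on-period speeds
values = [100.0, 108.0, 119.0, 131.0]  # e.g. annual output of a region

dev_speed = [values[i] / values[i - 1] for i in range(1, len(values))]
growth_speed = [d - 1 for d in dev_speed]
avg_dev_speed = (values[-1] / values[0]) ** (1 / (len(values) - 1))

print("development speeds:", [round(d, 3) for d in dev_speed])
print("growth speeds:", [round(g, 3) for g in growth_speed])
print("average development speed:", round(avg_dev_speed, 3))
print("average growth speed:", round(avg_dev_speed - 1, 3))
```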

Dynamic analysis method. In statistical analysis, it is difficult to make a judgment if there is only an isolated indicator value for a period. If a time series is compiled, dynamic analysis can be performed to reflect the changing rules of its development level and speed.

When performing dynamic analysis, pay attention to the comparability of the indicators in the series: the scope of the population, the indicator calculation method, the prices used, and the units of measurement should all be consistent. Time intervals should generally be equal, but different intervals can be adopted depending on the research purpose, for example by historical period. To eliminate the incomparability of indicator values caused by different time intervals, annual averages and average annual development speeds can be used to compile the dynamic series. In addition, many comprehensive indicators in statistics use value terms to reflect physical totals, such as gross domestic product, gross industrial output value, and total retail sales of consumer goods. When calculating development speeds across years, the influence of price changes must be removed in order to correctly reflect changes in physical quantity; that is, the value of the same products in different years must be converted to comparable prices (for example using constant prices or a price index adjustment) before comparison.
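Eliminating price changes simply means deflating nominal values with a price index before comparing; a minimal sketch with invented figures:

```python
# Deflating nominal output to comparable (constant) prices.
# Nominal values and the price index are invented for illustration.
nominal = {2010: 1000.0, 2011: 1120.0}       # current-price output
price_index = {2010: 1.00, 2011: 1.05}       # 2010 = base year

real = {year: nominal[year] / price_index[year] for year in nominal}
real_growth = real[2011] / real[2010] - 1

print("real output:", real)
print(f"real growth: {real_growth:.1%}")     # growth with the price effect removed
```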

To observe the trajectory of fluctuations in China's economic development, the annual growth rates of GDP can be compiled into a time series and plotted as a curve to give an intuitive picture.

4. Index analysis method

An index is a relative number that reflects changes in social and economic phenomena; the term has both broad and narrow senses. Depending on the scope of study, indices can be divided into individual indices, class indices, and total indices.

The functions of an index: first, it can comprehensively reflect the direction and degree of overall quantitative change in complex socio-economic phenomena; second, it can analyze the extent to which the total change in a socio-economic phenomenon is affected by changes in each factor, which is the factor analysis method. The method works as follows: using the quantitative relationships in an index system, and assuming the other factors remain unchanged, observe the impact of a change in one factor on the total change.

Factor analysis using indices. Factor analysis decomposes the research object into its factors and treats the object as the result of changes in each factor; by analyzing each factor, it measures the degree to which each contributes to the total change in the object. Depending on the statistical indicator studied, factor analysis can be divided into factor analysis of changes in total indicators and factor analysis of changes in average indicators.
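As a hedged sketch of such an index-based factor decomposition (the prices, quantities, and the particular weighting below are assumptions for illustration, not taken from the text):

```python
# Factor analysis via an index system (figures invented).
# Total value index = quantity index (at base prices) * price index (at current quantities).
p0, q0 = [10.0, 4.0], [100, 200]   # base-period prices and quantities
p1, q1 = [11.0, 5.0], [110, 190]   # current-period prices and quantities

def total(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

value_index = total(p1, q1) / total(p0, q0)
quantity_index = total(p0, q1) / total(p0, q0)   # prices held at base period
price_index = total(p1, q1) / total(p0, q1)      # quantities held at current period

print(f"value index    : {value_index:.3f}")
print(f"quantity index : {quantity_index:.3f}")
print(f"price index    : {price_index:.3f}")
print("check:", round(quantity_index * price_index, 3) == round(value_index, 3))
```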

5. Balance analysis method

Balance analysis is a method of studying the interdependent relationships between quantitative changes in social and economic phenomena. It arranges the two sides of a unity of opposites item by item according to their constituent elements, giving an overall picture and making it easy to observe the balance between them as a whole. Balance relationships exist widely in economic life, from the operation of the national macroeconomy down to personal income and expenditure. There are many kinds of balance sheets, such as the fiscal balance sheet, the labor balance sheet, the energy balance sheet, the balance of international payments, and the input-output table. The functions of balance analysis are: first, to reflect the quantitative balance of social and economic phenomena and to analyze whether the various proportional relationships are compatible; second, to reveal imbalancing factors and development potential; third, to use balance relationships to estimate unknown individual indicators from known indicators.

6. Comprehensive evaluation analysis

Socioeconomic phenomena are often complicated: the state of socioeconomic operation is the combined result of many factors, and each factor changes in a different direction and to a different degree. For example, evaluating macroeconomic performance involves production, distribution, circulation, and consumption; evaluating enterprise economic efficiency involves the rational use of people, finance, and materials as well as market sales. If only a single indicator is used, it is difficult to make an appropriate evaluation.

Comprehensive evaluation includes four steps, illustrated by the sketch after this list:

1. Determine the evaluation indicator system, which is the basis and premise of the comprehensive evaluation. Pay attention to the comprehensiveness and systematic nature of the indicator system.

2. Collect the data and convert indicator values measured in different units to a common measure. Methods such as relativization, functional transformation, and standardization can be used.

3. Determine the weight of each indicator to ensure the evaluation is scientific. Based on each indicator's status and its degree of influence on the whole, different indicators need to be assigned different weights.

4. Summarize the indicators, calculate the comprehensive score, and make a comprehensive evaluation based on this.
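A compact sketch of these four steps, using invented indicators and data, min-max standardization, and assumed weights:

```python
# Comprehensive evaluation sketch: standardize indicators, weight them,
# and sum into a composite score. Indicators, data, and weights are assumed.
indicators = {
    "output_growth":   [0.06, 0.09, 0.04],  # one value per evaluated unit
    "profit_margin":   [0.12, 0.08, 0.15],
    "energy_per_unit": [1.8, 2.2, 1.5],     # lower is better
}
weights = {"output_growth": 0.4, "profit_margin": 0.4, "energy_per_unit": 0.2}
lower_is_better = {"energy_per_unit"}

def standardize(values, invert):
    """Min-max scale to [0, 1]; flip the scale for lower-is-better indicators."""
    lo, hi = min(values), max(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1 - s for s in scaled] if invert else scaled

n_units = 3
scores = [0.0] * n_units
for name, values in indicators.items():
    std = standardize(values, name in lower_is_better)
    for i in range(n_units):
        scores[i] += weights[name] * std[i]

print("composite scores:", [round(s, 3) for s in scores])
```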

7. Prosperity Analysis

Economic fluctuations exist objectively and are difficult for any country to avoid entirely. How to avoid major economic fluctuations and maintain stable development has always been an important issue for governments and economists in macroeconomic regulation and decision-making, and prosperity analysis emerged and developed to meet this need. Prosperity (business climate) analysis is a form of comprehensive evaluation analysis and can be divided into macroeconomic climate analysis and enterprise climate survey analysis.

Macroeconomic climate analysis. The National Bureau of Statistics began to establish a monitoring indicator system and evaluation method in the late 1980s; after more than ten years of continuous improvement, a system has been formed that provides regular climate analysis reports and serves as a barometer and alarm for the state of macroeconomic operation, helping the State Council and relevant departments take macro-control measures in a timely manner and prevent large economic swings through frequent small adjustments.

Enterprise climate survey analysis. It uses a sampling survey of large and medium-sized enterprises across the country, with a questionnaire that asks enterprise leaders for their judgments and expectations about relevant conditions. The content falls into two categories: judgments and expectations about the overall macro economy, and judgments and expectations about the enterprise's own operating conditions, such as product orders, raw material purchases, prices, inventories, employment, market demand, and fixed asset investment.

8. Forecast Analysis

Both macroeconomic and microeconomic decision-making require not only an understanding of what has actually happened in the economy, but also foresight into what will happen in the future. Predicting the future from the known past and present is predictive analysis.

Statistical forecasting is quantitative forecasting; it is based mainly on data analysis, combined with qualitative analysis. Statistical forecasting methods fall roughly into two categories: one relies on the dependence between an indicator's time series and time, which belongs to time series analysis; the other relies on causal relationships between indicators, which belongs to regression analysis.

Forecasting analysis methods include regression analysis, the moving average method, exponential smoothing, periodic (seasonal) variation analysis, and random variation analysis. More complex predictive analysis requires building an econometric model, and there are many methods for estimating the model's parameters.
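As a minimal illustration of two of these methods on an invented series (the window size and smoothing constant are assumptions):

```python
# Moving-average and simple exponential smoothing forecasts (data invented).
series = [112.0, 118.0, 121.0, 130.0, 128.0, 135.0, 141.0]

# 3-period moving average forecast for the next point.
window = 3
ma_forecast = sum(series[-window:]) / window

# Simple exponential smoothing with an assumed smoothing constant alpha.
alpha = 0.3
level = series[0]
for x in series[1:]:
    level = alpha * x + (1 - alpha) * level
es_forecast = level

print(f"moving-average forecast : {ma_forecast:.1f}")
print(f"exponential smoothing   : {es_forecast:.1f}")
```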