Statistical methods: principal component analysis (PCA) in practice

This article covers one of the most commonly used statistical methods for dimensionality reduction: principal component analysis (PCA). Using eight indicators that feed into a comprehensive evaluation of 31 cities, it applies PCA to determine the weight of each indicator, and walks through the procedure in two practical ways: with SPSS and with Python.

The idea of principal component analysis is to transform the original variables, by linear combination (a rotation of the coordinate axes), into a smaller number of linearly uncorrelated variables. The new variables retain most of the information in the original ones, which is how the dimension is reduced. However, every original variable contributes to each new component with some coefficient, and those coefficients have no standard scale of measurement, so the components are relatively hard to interpret.
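The transformation described above can be sketched with NumPy alone. This is a minimal illustration, not the procedure used later in the article: the 31 x 8 data set is synthetic (generated from two hypothetical hidden factors), and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 31 rows (cities) x 8 columns (indicators),
# driven by 2 hidden factors plus noise. Not the article's data.
latent = rng.normal(size=(31, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(31, 8))

# Center the data and eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each new variable is a linear combination of the original columns.
scores = Xc @ eigvecs

# The new variables are mutually uncorrelated: their covariance
# matrix is diagonal, with the eigenvalues on the diagonal.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
print("cumulative variance share:", np.cumsum(eigvals) / eigvals.sum())
```

Keeping only the first few columns of `scores` is the dimension reduction step; the cumulative variance share shows how much information those columns retain.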

In practice, if the variables differ widely in scale or spread, the data should be standardized first. Standardization, however, erases part of the original differences in dispersion between the variables, so whether to standardize depends on the actual use case.
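The effect of scale can be seen in a small sketch. Two synthetic, correlated variables are generated, one of them measured in units 1000 times larger; the helper `first_component` is a hypothetical name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated synthetic variables; the second is on a much larger scale.
base = rng.normal(size=(31, 1))
small = base + 0.5 * rng.normal(size=(31, 1))
large = 1000 * (base + 0.5 * rng.normal(size=(31, 1)))
X = np.hstack([small, large])


def first_component(data):
    """Return the direction of largest variance (unit eigenvector)."""
    cov = np.cov(data - data.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]  # eigh sorts ascending, so the last is the largest


# Without standardization the large-scale variable dominates.
raw_pc1 = first_component(X)

# After z-score standardization both variables contribute equally.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_pc1 = first_component(Z)

print("first component, raw data:         ", raw_pc1)
print("first component, standardized data:", std_pc1)
```

The standardized result treats both variables alike, which is exactly the loss of dispersion information mentioned above: the fact that one variable varied far more than the other is no longer visible to PCA.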

Principal component analysis does not require the data to be normally distributed. It relies mainly on a linear transformation, which is one reason it is so widely applicable. By combining and simplifying the original variables, it lets the weight of each indicator be determined objectively and avoids the arbitrariness of subjective judgment. It follows from the idea of principal components, however, that the method is mainly suited to data whose variables are strongly correlated. If the correlations in the original data are weak, PCA will not reduce the dimension effectively. Some information is also lost after dimensionality reduction.
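The dependence on correlation can be checked numerically. The sketch below compares two synthetic data sets, one with strongly correlated columns and one with independent columns; a sample size of 500 is used (rather than 31) only to keep the comparison stable, and `variance_share` is a hypothetical helper.

```python
import numpy as np


def variance_share(data):
    """Share of total variance carried by each principal component,
    computed from the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return eigvals / eigvals.sum()


rng = np.random.default_rng(2)
n, p = 500, 8

# Strongly correlated columns: one shared factor plus a little noise.
common = rng.normal(size=(n, 1))
strong = common + 0.3 * rng.normal(size=(n, p))

# Weakly correlated columns: independent noise only.
weak = rng.normal(size=(n, p))

strong_share = variance_share(strong)
weak_share = variance_share(weak)

print("strong correlation, first component share:", strong_share[0])
print("weak correlation, first component share:  ", weak_share[0])
```

With strong correlation a single component carries most of the variance; with weak correlation the variance is spread almost evenly, so dropping components discards a comparable amount of information each time.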

The data are statistics for 31 major cities in the country on eight indicators: food, clothing, housing, household equipment, transportation and communication, education and entertainment, health care, and other.

Note: the data have no practical meaning and are used only for learning the analysis process.

Note on Bartlett's test of sphericity: it tests whether the data are suitable for principal component analysis. The null hypothesis is that the variables are mutually uncorrelated. KMO measures the degree of suitability for principal component analysis.
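Both checks can also be computed outside SPSS. The sketch below uses the standard textbook formulas (Bartlett's chi-square statistic from the determinant of the correlation matrix, and the overall KMO from correlations and partial correlations) with NumPy and SciPy. The function names `bartlett_sphericity` and `kmo` are hypothetical, and the data are synthetic.

```python
import numpy as np
from scipy import stats


def bartlett_sphericity(data):
    """Bartlett's test: H0 is that the correlation matrix is the identity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, dof, p_value


def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial correlations
    mask = ~np.eye(corr.shape[0], dtype=bool)  # off-diagonal entries only
    r2 = (corr[mask] ** 2).sum()
    q2 = (partial[mask] ** 2).sum()
    return r2 / (r2 + q2)


# Synthetic, strongly correlated data: 31 rows x 8 columns.
rng = np.random.default_rng(3)
common = rng.normal(size=(31, 1))
X = common + 0.5 * rng.normal(size=(31, 8))

statistic, dof, p_value = bartlett_sphericity(X)
kmo_value = kmo(X)
print("Bartlett chi-square:", statistic, "dof:", dof, "p-value:", p_value)
print("KMO:", kmo_value)
```

A small p-value rejects the hypothesis of uncorrelated variables, and a KMO closer to 1 indicates data better suited to PCA; a commonly quoted rule of thumb treats values below about 0.5 as unsuitable.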

View the eigenvalue (characteristic root) of each principal component, together with its variance and share of the total variance.
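The same quantities can be reproduced with NumPy as a rough cross-check. This sketch assumes the analysis is run on the correlation matrix (that is, on standardized data) and uses synthetic data; the "eigenvalue greater than 1" retention rule shown at the end is one common convention, not necessarily the one applied in the article.

```python
import numpy as np

# Synthetic stand-in data: 31 rows x 8 columns from 2 hidden factors.
rng = np.random.default_rng(4)
latent = rng.normal(size=(31, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.5 * rng.normal(size=(31, 8))

# Eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Share of total variance and cumulative share per component.
share = eigvals / eigvals.sum()
cumulative = np.cumsum(share)

print("component  eigenvalue  share   cumulative")
for i, (ev, s, c) in enumerate(zip(eigvals, share, cumulative), start=1):
    print(f"{i:9d}  {ev:10.4f}  {s:6.2%}  {c:9.2%}")

# One common retention rule (an assumption here): keep eigenvalues > 1.
n_keep = int((eigvals > 1).sum())
print("components with eigenvalue > 1:", n_keep)
```

For a correlation matrix the eigenvalues sum to the number of variables, so an eigenvalue above 1 means the component explains more variance than a single standardized indicator.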

This step mainly examines the loading of each indicator on the components.
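As a sketch of what a loading is, the example below computes loadings as eigenvectors of the correlation matrix scaled by the square root of their eigenvalues, and confirms that each loading equals the correlation between a standardized indicator and a component score. The data are synthetic, and this scaling convention is an assumption for the illustration.

```python
import numpy as np

# Synthetic stand-in data: 31 rows x 8 columns from 2 hidden factors.
rng = np.random.default_rng(5)
latent = rng.normal(size=(31, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.5 * rng.normal(size=(31, 8))

# Standardize, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: rows are indicators, columns are components.
loadings = eigvecs * np.sqrt(eigvals)

# Cross-check: correlation between each indicator and each component score.
scores = Z @ eigvecs
check = np.corrcoef(np.hstack([Z, scores]), rowvar=False)[:8, 8:]

print("loadings on the first two components:")
print(np.round(loadings[:, :2], 3))
```

An indicator with a large absolute loading on a component is strongly represented by that component, which is what makes the loading table useful for deriving indicator weights.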

Calculate each city's score using the weights obtained in the previous step:

Indicator = ∑ Di × Wi (Di is the original value of indicator i, and Wi is the weight of that indicator)
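The weighted sum can be written as a single matrix-vector product. In this sketch both the indicator values `D` and the weights `W` are made-up placeholders (the weights are not derived from any PCA run); only the arithmetic of the formula is being illustrated.

```python
import numpy as np

rng = np.random.default_rng(6)

# Placeholder indicator values: 31 cities x 8 indicators (synthetic).
D = rng.uniform(100, 1000, size=(31, 8))

# Placeholder weights for the 8 indicators, chosen to sum to 1.
W = np.array([0.20, 0.10, 0.15, 0.10, 0.15, 0.10, 0.10, 0.10])

# Score of each city: sum over indicators of value * weight.
scores = D @ W

# Rank cities from highest to lowest score (indices into D's rows).
ranking = np.argsort(-scores)

print("top 5 city indices:", ranking[:5])
print("their scores:      ", np.round(scores[ranking[:5]], 2))
```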

We use the machine learning library Scikit-learn to run PCA, which performs the matrix transformation based on the covariance of the data.
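A minimal Scikit-learn sketch is shown here for orientation. It is not the article's original script: the data are synthetic, standardizing with `StandardScaler` first is an assumption, and the 85% variance threshold is an arbitrary illustrative choice. Scikit-learn's `PCA` centers the data itself and works through a singular value decomposition, which yields the same components as an eigendecomposition of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 31 rows (cities) x 8 columns (indicators).
rng = np.random.default_rng(7)
latent = rng.normal(size=(31, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.5 * rng.normal(size=(31, 8))

# Standardize so that every indicator has zero mean and unit variance.
Z = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction (0.85 is an assumed threshold).
pca = PCA(n_components=0.85, svd_solver="full")
scores = pca.fit_transform(Z)

print("components kept:", pca.n_components_)
print("explained variance:", pca.explained_variance_)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("component coefficients (rows are components):")
print(np.round(pca.components_, 3))
```

Note that `pca.components_` holds unit-length eigenvectors rather than loadings scaled by the square root of the eigenvalue, and that the sign of a component is arbitrary, so its numbers may not match another tool's component matrix directly.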

Comparing the results of 3.1 and 3.2, some cities rank slightly differently under the two approaches. This reflects differences between the SPSS and Scikit-learn implementations. The focus of this article is on how to carry out principal component analysis in each of the two ways.

If you have questions, feel free to reply and discuss. If you need the source data, reply to request it.

Special note: the data in this article were randomly generated and have no validity; they are for technical learning only.