Missing value processing in data analysis

Without high-quality data there can be no high-quality mining results, and missing values are one of the most frequently encountered problems in data analysis. When the proportion of missing values is very small, the incomplete records can simply be discarded or handled manually. In real data, however, missing values often account for a considerable share of the records. In that case manual processing is very inefficient, while discarding the incomplete records loses a great deal of information and creates systematic differences between the incomplete and complete observations; analysing such data is likely to lead to wrong conclusions.

Causes of missing data

Real-world data are exceptionally heterogeneous, and missing attribute values are frequent or even inevitable. The reasons for missing data are many:

The information is temporarily unavailable. For example, in a medical database not all clinical test results for all patients can be obtained within a given time, so some attribute values remain empty.

The information was omitted. A value may have been left out because it was considered unimportant at the time of entry, forgotten, or misunderstood, or it may have been lost due to a failure of the data collection equipment, the storage or transmission media, or some human factor.

Some attributes are simply not applicable to some objects, i.e. the value does not exist for that object. For example, the name of the spouse of an unmarried person, or the fixed income of a child.

Some information is (or is perceived to be) unimportant, such as an attribute whose value is irrelevant in the given context.

The cost of obtaining this information is too great.

The system has strict real-time requirements, i.e. a judgment or decision must be made quickly, before this information can be obtained.

Missing values should be handled on a case-by-case basis. Why? Because a missing attribute does not always mean the data is absent; the missingness itself can carry information, so the value used to fill it should be chosen according to what the missingness may mean in each application scenario. The examples below illustrate this kind of case-by-case analysis (a short code sketch follows the list); opinions differ, so they are for reference only:

"Annual income": commodity recommendation scenarios filled with the average value, the minimum value filled with the borrowing amount scenarios;

"Behavioral point in time ": populate the plural;

"Price": populate the minimum value in the commodity recommendation scenario, populate the average value in the commodity matching scenario;

"Human lifespan": populate the insurance cost estimation scenario with the users may not have gone to college, it is more reasonable to fill it with positive infinity;

"Marital status": users who did not fill this item may be more sensitive to their privacy, and it should be set as a separate category, such as married 1, unmarried 0, unfilled -1.

Types of missingness

Before processing missing data it is essential to understand the mechanism and form of the missingness. Variables in a dataset that contain no missing values are called complete variables, and variables that contain missing values are called incomplete variables. In terms of the distribution of missingness, missing data can be categorized as missing completely at random, missing at random, and missing not at random.

Missing completely at random (MCAR): the missingness is completely random and does not depend on any complete or incomplete variable, so it does not affect the unbiasedness of the sample. An example is a missing home address.

Missing at random (MAR): the missingness is not completely random; it depends on other, complete variables. For example, whether financial data are missing may be related to the size of the business.

Missing not at random (MNAR): the missingness is related to the value of the incomplete variable itself. For example, people with high incomes may be unwilling to report household income.

For data that are missing at random or not at random, deleting records is not appropriate. Values that are missing at random can be estimated from the known variables, whereas there is no good solution for data missing not at random.

Note: for classification problems, the class proportions among the samples with missing values can be compared with the class proportions of the overall dataset.
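
A short sketch of that comparison with pandas; the DataFrame, the feature "income" and the class column "label" are hypothetical:

    import numpy as np
    import pandas as pd

    # hypothetical example data
    df = pd.DataFrame({
        "income": [30000, np.nan, 52000, np.nan, 41000, 67000],
        "label":  ["bad", "good", "good", "bad", "good", "good"],
    })

    overall = df["label"].value_counts(normalize=True)
    among_missing = df.loc[df["income"].isna(), "label"].value_counts(normalize=True)
    # a large gap between the two distributions suggests the missingness is informative
    print(pd.concat({"overall": overall, "missing_income": among_missing}, axis=1))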

Necessity of missing value handling

Missing data is a complex problem in many research areas. For data mining, the presence of missing values has the following effects:

The system loses a significant amount of useful information;

The uncertainty exhibited in the system is more pronounced and the deterministic components embedded in the system are more difficult to grasp;

Data containing null values can throw the mining process into disarray and lead to unreliable output.

Data mining algorithms themselves are mostly designed to avoid overfitting the models they build to the data, a property that makes it difficult for the algorithms alone to handle incomplete data well. Missing values therefore need to be derived or filled in by dedicated methods in order to narrow the gap between data mining algorithms and real-world applications.

Analysis and Comparison of Methods for Dealing with Missing Values

There are three main categories of methods for dealing with incomplete datasets: deletion of tuples, data completion, and no processing.

Deleting tuples

That is, objects (tuples, records) with missing attribute values are deleted so as to obtain a complete information table. This method is simple and easy to apply; it is effective when an object is missing several attribute values and the deleted objects are few compared with the size of the initial dataset, and it is often used when the class label is missing.

However, this method has significant limitations. It trades a reduction of historical data for completeness of information, discarding a great deal of information hidden in the deleted objects. When the initial dataset contains few objects, removing even a small number of them can seriously affect the objectivity of the information and the correctness of the results; and when the proportion of missing data is large, especially when the missing data are not randomly distributed, this method may bias the data and lead to erroneous conclusions.

Note: deleting tuples, or simply dropping the feature column, can sometimes lead to a degradation in performance.
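
As a rough illustration, both variants can be done in one line with pandas; the DataFrame below is hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, np.nan, 31], "income": [3.2, 4.1, np.nan]})

    rows_dropped = df.dropna()          # delete tuples (rows) with any missing value
    cols_dropped = df.dropna(axis=1)    # drop entire feature columns containing missing values
    thresholded  = df.dropna(thresh=2)  # keep only rows with at least 2 non-missing values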

Data Completion

This type of method completes the information table by filling the null values with certain values, usually based on statistical principles: a missing value is filled in according to the distribution of the values taken by the remaining objects in the initial dataset. The following filling methods are commonly used in data mining:

Filling manually

Since the user knows the data best, this method produces the least data deviation and is probably the most effective. In general, however, it is time-consuming and infeasible when the dataset is large and many values are missing.

Treating Missing Attribute values as Special values

Null values are treated as a special attribute value, distinct from any other value; for example, all null values are filled with "unknown". This effectively creates another, artificial concept and can lead to serious data deviation, so it is generally not recommended.
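
For completeness, the one-line pandas version of this fill, with a hypothetical attribute:

    import pandas as pd

    s = pd.Series(["red", None, "blue"], name="colour")   # hypothetical attribute
    s_filled = s.fillna("unknown")                         # nulls become the special value "unknown"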

Mean/Mode Completer

Categorizes the attributes in the initial dataset into numeric and non-numeric attributes for separate processing.

If the missing value is numeric, it is filled with the average of the values that the attribute takes in all other objects;

If the missing value is non-numeric, then following the statistical notion of the mode, it is filled with the value that the attribute takes most often in all other objects (i.e. the most frequent value). A related method is the Conditional Mean Completer: the values used to compute the mean (or mode) are taken not from all objects in the dataset, but only from the objects whose decision (class) attribute value is the same as that of the object being filled.

Both completion methods start from the same idea, namely to replace the missing attribute value with the most probable value; they differ only in the details. Compared with the other methods, they use most of the information in the existing data to infer the missing values.
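
A minimal pandas sketch of mean, mode, and class-conditional mean filling; the column names are hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "age":   [22, np.nan, 35, 41, np.nan],
        "city":  ["NY", "NY", None, "LA", "NY"],
        "label": ["good", "good", "bad", "bad", "good"],
    })

    # mean completer for a numeric attribute
    df["age_mean"] = df["age"].fillna(df["age"].mean())

    # mode completer for a non-numeric attribute
    df["city_mode"] = df["city"].fillna(df["city"].mode()[0])

    # conditional mean completer: average within the same class label only
    df["age_cond"] = df["age"].fillna(df.groupby("label")["age"].transform("mean"))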

Hot deck imputation, or proximity filling

For an object containing null values, hot deck imputation finds the most similar object among the complete records and fills the nulls with that similar object's values. Different problems may use different criteria of similarity. The method is conceptually simple and exploits the relationships between records to estimate the nulls; its drawback is that the similarity criterion is hard to define and involves many subjective factors.
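
A very small hot deck sketch in pandas, where the similarity criterion (absolute distance on one observed feature) is an assumption made only for illustration:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "height": [170, 160, 182, 175],
        "weight": [65, 52, np.nan, 74],
    })

    donors = df.dropna()                      # complete records act as donors
    for i in df.index[df["weight"].isna()]:
        # find the donor closest on the observed feature and copy its value
        nearest = (donors["height"] - df.loc[i, "height"]).abs().idxmin()
        df.loc[i, "weight"] = donors.loc[nearest, "weight"]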

K-nearest neighbors (KNN)

The K-nearest neighbor method first determines the K samples closest to the sample with missing data, based on Euclidean distance or correlation analysis, and then estimates the missing value of that sample as a weighted combination of the values taken by those K neighbors.
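
scikit-learn ships a KNN-based imputer that implements essentially this idea; a minimal sketch (the array below is made up):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [5.0, 6.0],
                  [4.0, 5.0]])

    # each missing entry becomes the distance-weighted mean of its K nearest neighbors
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    X_filled = imputer.fit_transform(X)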

Assigning All Possible values of the Attribute

Each missing attribute value is filled by trying every possible value of that attribute, which can give a fairly good completion effect. However, when the amount of data is large or many attribute values are missing, the computational cost is high and the number of combinations to test becomes very large.

Combinatorial Completer

Every possible value of the missing attribute is tried, and the one that gives the best result in the final attribute reduction is chosen as the fill value. This is a data filling method oriented towards approximation; it can give good approximation results, but it is computationally expensive when the dataset is large or many attribute values are missing.

Regression

Regression equations are built on the complete part of the dataset. For an object containing null values, the known attribute values are substituted into the equation to estimate the unknown attribute value, and the estimate is used as the fill value. This can lead to biased estimates when the variables are not linearly related.
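
A sketch of regression filling with scikit-learn, assuming a single numeric column to fill and using the remaining complete columns as predictors (all names hypothetical):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "area":  [50, 80, 120, 65, 90],
        "rooms": [1, 2, 3, 2, np.nan],
        "price": [150, 240, 360, 200, 270],
    })

    known = df[df["rooms"].notna()]
    unknown = df[df["rooms"].isna()]

    # fit on complete rows, then predict the missing attribute from the known attributes
    model = LinearRegression().fit(known[["area", "price"]], known["rooms"])
    df.loc[unknown.index, "rooms"] = model.predict(unknown[["area", "price"]])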

Expectation maximization (EM)

The EM algorithm is an iterative algorithm for computing maximum likelihood estimates or posterior distributions from incomplete data. Each iteration alternates between two steps: the E step (Expectation step), which computes the conditional expectation of the complete-data log-likelihood given the observed data and the parameter estimates from the previous iteration; and the M step (Maximization step), which maximizes that expected log-likelihood to determine new parameter values for the next iteration. The algorithm alternates between the E and M steps until convergence, i.e. it stops when the change in the parameters between two iterations falls below a pre-specified threshold. The method may get stuck in local optima, convergence can be slow, and it is computationally complex.
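
A compact sketch of EM imputation under the assumption that the data follow a multivariate normal distribution (and that no row is entirely missing); the function and variable names are my own, not from the text:

    import numpy as np

    def em_impute(X, max_iter=100, tol=1e-6):
        # X: 2-D float array with np.nan marking missing entries
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        miss = np.isnan(X)
        mu = np.nanmean(X, axis=0)                       # initial mean from observed values
        X_f = np.where(miss, mu, X)                      # rough initial fill
        Sigma = np.cov(X_f, rowvar=False) + 1e-6 * np.eye(p)

        for _ in range(max_iter):
            mu_old = mu.copy()
            S = np.zeros((p, p))                         # accumulates E[x x^T] over rows
            for i in range(n):
                m, o = miss[i], ~miss[i]
                if m.any():
                    reg = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
                    # E step: conditional mean of the missing block given the observed block
                    X_f[i, m] = mu[m] + reg @ (X_f[i, o] - mu[o])
                    # the conditional covariance contributes to the expected second moment
                    S[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - reg @ Sigma[np.ix_(o, m)]
                S += np.outer(X_f[i], X_f[i])
            # M step: re-estimate the parameters from the expected sufficient statistics
            mu = X_f.mean(axis=0)
            Sigma = S / n - np.outer(mu, mu)
            if np.linalg.norm(mu - mu_old) < tol:        # stop when the parameters stabilize
                break
        return X_f, mu, Sigma

    X_filled, mu_hat, Sigma_hat = em_impute(np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0]]))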

Multiple Imputation (MI)

Multiple Imputation methods are divided into three steps:

First, a set of plausible fill values is generated for each missing value; these values reflect the uncertainty of the non-response model. Each value is used in turn to fill the missing entries, producing several complete datasets.

Each set of filled data is statistically analyzed using statistical methods specific to the complete data set.

The results from the filled datasets are combined to produce a final statistical inference that takes the uncertainty caused by the filling into account. The method treats the missing entries as random samples, so the resulting inference reflects the uncertainty in the missing values. It is also computationally complex. A rough sketch follows.
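
One way to approximate this procedure in Python is scikit-learn's experimental IterativeImputer with posterior sampling: draw several imputed datasets with different random seeds and pool the statistic of interest. This is only a sketch of the idea, not a full MI implementation (e.g. Rubin's pooling rules are reduced here to a mean and standard deviation across draws):

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
    from sklearn.impute import IterativeImputer

    X = np.array([[7.0, 2.0, 3.0],
                  [4.0, np.nan, 6.0],
                  [10.0, 5.0, 9.0],
                  [8.0, 8.0, np.nan]])

    estimates = []
    for seed in range(5):                                  # several imputed datasets
        imp = IterativeImputer(sample_posterior=True, random_state=seed)
        X_filled = imp.fit_transform(X)
        estimates.append(X_filled[:, 1].mean())           # statistic of interest: mean of column 1

    # pooled estimate plus between-imputation spread as a rough uncertainty measure
    print(np.mean(estimates), np.std(estimates))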

Method C4.5

Missing values are filled by exploiting relationships between attributes. The method looks for the two attributes with the greatest correlation; the one with no missing values is called the proxy attribute and the other the original attribute, and the proxy attribute is used to decide the missing values of the original attribute. This rule-induction approach can only handle nominal attributes with a small number of distinct values.
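
A much-simplified sketch of the proxy-attribute idea in pandas: for each value of a complete nominal attribute, take the most frequent value of the incomplete attribute among the complete rows and use it to fill. All names are illustrative, and this is not the actual C4.5 procedure:

    import pandas as pd

    df = pd.DataFrame({
        "weather":  ["sunny", "sunny", "rainy", "rainy", "sunny"],   # proxy attribute (complete)
        "activity": ["walk", "walk", "stay", None, None],            # original attribute (incomplete)
    })

    # for each proxy value, the mode of the original attribute among non-missing rows
    rule = df.dropna().groupby("weather")["activity"].agg(lambda s: s.mode()[0])
    df["activity"] = df["activity"].fillna(df["weather"].map(rule))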

Among the statistically based methods, tuple deletion and mean filling perform worse than hot deck filling, expectation maximization, and multiple imputation; regression is one of the better methods but still not as good as hot deck and EM, and EM lacks the uncertainty component that MI contains. It is worth noting that these methods deal directly with the estimation of model parameters rather than with predicting the missing values themselves. They suit unsupervised learning problems, but the same cannot be said for supervised learning: you can remove objects containing nulls and train on the complete data, but you cannot ignore objects containing nulls at prediction time. In addition, C4.5 and the all-possible-values methods give fairly good completion results, whereas manual filling and special-value filling are generally not recommended.

Non-processing

Filling only replaces the unknown value with our subjective estimate, which is not necessarily consistent with the objective facts; by completing incomplete information we more or less change the original information system. Moreover, incorrectly filled null values often introduce new noise into the data and give erroneous mining results. Therefore, in many cases we still want to process the information system while keeping the original information unchanged.

Methods that do not process missing values and perform data mining directly on data containing null values include Bayesian networks and artificial neural networks, among others.

Bayesian networks provide a natural way to represent causal information between variables and are used to discover potential relationships in the data. In such a network, nodes represent variables and directed edges represent dependencies between variables. Bayesian networks are only suitable when there is some domain knowledge, or at least a reasonably clear picture of the dependencies between variables. Otherwise, learning the network structure directly from the data is not only complex (the cost grows exponentially with the number of variables) and expensive to maintain, but also introduces many parameters to estimate, which adds variance to the system and hurts its prediction accuracy.

Artificial neural networks can effectively deal with missing values, but the research of artificial neural networks in this area is yet to be further developed in depth.

One solution on Zhihu:

Map variables to a higher-dimensional space. For example, gender with the values male, female, and missing is mapped to three binary variables: is-male, is-female, and is-missing. Continuous variables can be handled in the same way. For example, the CTR prediction models of Google and Baidu pre-process all variables this way, reaching hundreds of millions of dimensions. The advantage is that the full information of the raw data is retained, missing values need no special treatment, and issues such as linear inseparability need not be considered. The disadvantage is that the amount of computation is much higher.

Moreover, the approach only works well when the sample size is very large; otherwise the data become too sparse and the results are poor.
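
A sketch of this mapping with pandas, where dummy_na=True adds the "is missing" dimension (the column name is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"gender": ["male", None, "female", "male"]})

    # gender becomes three 0/1 indicator columns: male, female, and missing
    expanded = pd.get_dummies(df, columns=["gender"], dummy_na=True)
    print(expanded)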

Summary

Most data mining systems handle missing data in the data preprocessing stage, before mining, using the first two categories of methods. There is no single method for handling null values that is appropriate for every problem. Whichever filling method is used, the influence of subjective factors on the original system cannot be avoided, and completing the system is not feasible when there are too many null values. In theory, Bayesian methods take everything into account, but a fully Bayesian analysis is only feasible when the dataset is small or certain conditions (e.g. a multivariate normal distribution) are met. The application of artificial neural network methods in data mining is also still limited at this stage. It is worth mentioning that handling the incompleteness of data with imprecise information has been widely studied; the theories underlying representations of incomplete data mainly include credibility theory, probability theory, fuzzy set theory, possibility theory, and Dempster-Shafer evidence theory.