Application of anti-fraud data mining techniques in the health insurance industry

I. Project Background

Recent news reports describe users who, after seemingly normal purchases or withdrawals, discover that their cards have been compromised. These are fraudulent transactions. Fraudulent transactions are a hazard in many industries, including banking, insurance, and securities, and they cause considerable losses and threats to people's finances and daily lives. The problem is worldwide. Developed countries have implemented strong information management systems that use data mining and artificial intelligence to help detect, identify, and evaluate fraudulent transactions, effectively improving anti-fraud technology.

CRISP-DM, the Cross-Industry Standard Process for Data Mining (shown below), is by far the most popular reference model for the data mining process. The links between the phases shown in the figure are cyclical and only approximate, and the process itself is not the focus; the key is that the results of data mining can ultimately be embedded in business processes to improve business efficiency and effectiveness.

CRISP-DM fits very well with SPSS Modeler, developed by SPSS, which supports three major statistical methodologies (rigorous design, quasi-experimental research, and intelligence-oriented analysis) and is one of the world's most outstanding statistical software packages. Here we use SPSS Modeler 18 as the modeling tool. Non-real health insurance industry data (policyholder information, medical institution information, claim information, and medical diagnosis and treatment information tables) serve as internal business data, and non-real microfinance data serve as a third-party customer data source. With these we build and analyze data mining models for discovering fraudulent transactions, and we believe the approach is a useful reference for other industries as well.


In the business understanding phase of CRISP-DM, the enterprise first assesses its resources, requirements, risks, and costs and benefits in order to determine the data mining objectives.

Based on a review of the business, the risk analysis of medical insurance fraud is as follows:

1) Forms of domestic medical insurance fraud

The main forms are: impersonation (i.e., falsifying eligibility for medical treatment); falsifying the cause of treatment (recasting conditions that medical insurance does not pay for, such as car accidents, work-related injuries, fights and assaults, or suicides, as conditions that it does pay for); exaggerating losses; falsifying bills; falsifying medical documents; nominal hospitalization (occupying a bed on paper only); and fabricating hospitalization records, outpatient special-disease records, and other materials in order to defraud the insurer.

2) The subject of fraud

Under the "third-party payment" system, medical personnel and the insured may conspire to defraud the insurance organization.

There are three main parties: the insured, the medical organization, and the insurance company. Fraud can originate with the insured or with the medical organization. Based on these business characteristics, the data mining objectives and ideas are organized in the following directions:

Data anomaly detection;

Classification study of policyholders, using user profiles combined with external data to predict fraud scores for existing and potential customers;

Classification study of the medical institution information;

Medical claims detection.

Disclaimer: Given the limited space, this is a general overview; specific ideas and algorithms will be covered in future articles.

II. Data and Model Analysis

2.1 Data anomaly detection

Many data anomalies can be judged directly from business logic and experience. For example, a large increase in a customer's claim frequency or claim amount over a period of time, or an abnormal relationship between the amount a policyholder pays and that policyholder's medical costs, can be treated as suspected fraud. This process is not demonstrated technically here.

Benford's law and anomaly detection are anomaly monitoring methods that are widely used in industries such as auditing and securities. Anomaly detection is the discovery of objects that differ from the majority of objects, that is, the discovery of outliers. We can use several anomaly detection methods at the same time to improve the hit rate in detecting fraudulent transactions. Benford's law is an interesting law about the distribution of the first digit in large collections of numbers: the larger the first digit, the less frequently it occurs. We start with cluster modeling, using healthcare organization number, payment amount, and number of claims as input variables:

From this we can derive a report of suspected fraud for organizations with more than 50 claims and a cluster distance greater than 0.2:

"Healthcare organization number: 10083642887, healthcare organization subcategory: psychology, number of claims: 58"

"Healthcare organization number: 10085843968, healthcare organization subcategory: med trans, number of claims: 71"
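The flagging rule behind such a report can be sketched in a few lines of Python. This is only an illustration, not the SPSS Modeler stream: it assumes the two thresholds refer to the number of claims and to the distance between a record and its nearest cluster center in min-max scaled space. The organization IDs, amounts, and cluster centers are hypothetical.

```python
import math

def min_max_scale(rows):
    """Scale every column to [0, 1] so no single variable dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def flag_suspects(org_ids, features, centers, claim_index=1,
                  min_claims=50, min_distance=0.2):
    """Return IDs with more than `min_claims` claims that also lie farther than
    `min_distance` from the nearest cluster center (in scaled space)."""
    scaled = min_max_scale(features)
    suspects = []
    for org_id, raw, point in zip(org_ids, features, scaled):
        distance = min(math.dist(point, center) for center in centers)
        if raw[claim_index] > min_claims and distance > min_distance:
            suspects.append(org_id)
    return suspects

# Hypothetical data: (payment amount, number of claims) per organization.
ORG_IDS = ["A", "B", "C", "D", "E", "F"]
FEATURES = [(1000, 5), (1200, 8), (5000, 30), (5200, 28), (10000, 58), (1100, 6)]
# Hypothetical cluster centers in scaled space, e.g. taken from a fitted clustering model.
CENTERS = [(0.01, 0.02), (0.45, 0.45)]

print(flag_suspects(ORG_IDS, FEATURES, CENTERS))
```

Scaling before measuring distance is a design choice made here so that payment amounts, which are numerically much larger than claim counts, do not dominate the result.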

To broaden the search for anomalies, Anomaly modeling, a specialized anomaly detection method, was used:

A list of suspected fraudulent enrollees with an anomaly deviation index of greater than 1.5 and an Anomaly flag of "T" was obtained as shown in the following table:

The table also shows, for each record, the 3 most important factors that caused it to be considered an outlier, together with their influence indices. It is easy to see that DIAG (diagnosis), Procedure (treatment procedure), and MEDcode (medical measure) are important factors behind the suspected fraud.

After the fraud department has completed the audit, it is possible to compare the hit rate of the two algorithms.
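Benford's law, mentioned earlier in this section, lends itself to a short worked example. Under the law the expected share of leading digit d is log10(1 + 1/d), so the digit 1 leads about 30.1% of the time and the digit 9 about 4.6%. The sketch below compares observed first-digit frequencies with these expectations. The function names, the mean-absolute-deviation measure, and the sample amounts are illustrative assumptions and are not part of the original analysis.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit shares under Benford's law: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation always starts with the leading digit, e.g. '3.40...e+02'.
    return int(f"{abs(x):.15e}"[0])

def first_digit_freq(amounts):
    """Observed share of each leading digit 1..9 (zeros are skipped)."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def benford_deviation(amounts):
    """Mean absolute gap between observed and expected first-digit shares."""
    expected = benford_expected()
    observed = first_digit_freq(amounts)
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9

# Hypothetical amounts: powers of 2 are known to follow Benford's law closely,
# while a batch of amounts that all start with 9 clearly does not.
NATURAL = [2 ** k for k in range(1, 101)]
SUSPICIOUS = [900 + i for i in range(100)]

print(round(benford_deviation(NATURAL), 4), round(benford_deviation(SUSPICIOUS), 4))
```

A large deviation is only a prompt for review, since legitimate data with a narrow value range (fixed fees, capped reimbursements) can also depart from the law.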

2.2 Policyholder fraud analysis

This includes cluster migration, fraud scoring, and user profiling.

2.2.1 Cluster Migration of Customers

Generally speaking, over a relatively short period the behavioral pattern of an organization or an individual is fairly stable and does not change much. If a clustering segmentation of the insured shows customers changing segment within a year, or even half a year, a suspected fraud report can be submitted. Cluster modeling selects a few key input variables (with reference to the RFM model), such as the amount paid, the number of payments, and the insurance terms and conditions. Building cluster models for the first year and the second year separately and marking the customers whose cluster changes yields the list of suspected fraud.

Cluster analysis of customers often reveals a few clusters with very few records. These are usually ignored in marketing activities, but in fraud detection they are groups with abnormal behavior that deserve attention.
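A minimal sketch of the migration check, assuming the cluster labels of the two periods are comparable (for example, because the same fitted segmentation is applied to both years). The customer IDs and labels are hypothetical.

```python
def cluster_migrations(labels_year1, labels_year2):
    """Return the customers whose cluster label differs between two periods.

    Both arguments map customer ID -> cluster label. Customers present in only
    one period are ignored, because no migration can be observed for them.
    """
    common = labels_year1.keys() & labels_year2.keys()
    return sorted(cid for cid in common if labels_year1[cid] != labels_year2[cid])

# Hypothetical cluster assignments for two consecutive years.
YEAR1 = {"c1": 0, "c2": 1, "c3": 2}
YEAR2 = {"c1": 0, "c2": 2, "c3": 2, "c4": 1}

print(cluster_migrations(YEAR1, YEAR2))
```

If the two years are clustered independently, the label numbers are arbitrary and need to be aligned (for instance by matching cluster centers) before a comparison like this is meaningful.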

2.2.2 Fraud scoring: single classifiers and Ensemble Learning

Personal credit systems are very mature in developed countries, and the familiar banking industry applies them in areas such as credit approval, credit line determination, and anti-fraud. In the U.S. banking industry, an annual credit card volume of $800 billion reportedly incurs losses of only about $100 million, on the order of 0.01% to 0.02% of the total, which owes a great deal to its mature data mining technology.

Fraud scoring can be divided into three main steps: variable transformation, fitting a logistic regression model, and score transformation. The sample is randomly divided into two parts: one part is used to build the model and the other part is used to test it. Binning a variable does lose some information, but because the model has to serve the business, it matters that binned variables are easier for business people to use and understand.

The input to the logistic regression model is the WOE (weight of evidence) value of each binned variable. The WOE value is calculated as: WOE = ln(percentage of good customers / percentage of bad customers) * 100.

The variable transformation consists of the following steps:

1) Elimination of redundant variables (among variables with a large mutual correlation coefficient, retain only one);

2) Binning of continuous variables and category merging for discrete variables;

3) Calculation of the IV and WOE values; to enhance predictive ability, try to keep the variables whose IV value is greater than or equal to 0.02 and less than or equal to 0.5.
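The WOE and IV calculations in the steps above can be illustrated with a small Python sketch. It uses the WOE formula given earlier (scaled by 100) and the usual definition of information value, IV = sum over bins of (good share - bad share) * ln(good share / bad share). The bin labels and counts are hypothetical, and the screening bounds are left as parameters instead of being fixed in the code.

```python
import math

def woe_iv(bins):
    """Compute WOE per bin and the variable's IV.

    `bins` maps a bin label to (number of good customers, number of bad customers).
    Every bin must contain at least one good and one bad customer; otherwise the
    logarithm is undefined and the bin should be merged with a neighbor first.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    woe = {}
    iv = 0.0
    for label, (good, bad) in bins.items():
        good_share = good / total_good   # share of all good customers in this bin
        bad_share = bad / total_bad      # share of all bad customers in this bin
        log_ratio = math.log(good_share / bad_share)
        woe[label] = log_ratio * 100     # WOE scaled by 100, as in the formula
        iv += (good_share - bad_share) * log_ratio
    return woe, iv

def screen_by_iv(iv_by_variable, lower, upper):
    """Keep the variables whose IV lies within [lower, upper]."""
    return [name for name, iv in iv_by_variable.items() if lower <= iv <= upper]

# Hypothetical age bins: (good customers, bad customers).
AGE_BINS = {"<30": (100, 40), "30-50": (300, 40), ">50": (600, 20)}

woe, iv = woe_iv(AGE_BINS)
print({label: round(value, 1) for label, value in woe.items()}, round(iv, 4))
```

A positive WOE means the bin holds a larger share of good customers than of bad ones, so in this made-up example the oldest bin looks safest and the youngest looks riskiest.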

The figure above shows part of the variable transformation data flow and its output. As the first output table shows, credit card data, as a discrete variable, can also be regrouped into categories based on its calculated default rate.

After the logistic regression model is fitted with the stepwise method, the regression coefficients are transformed into scores using statistical methods. This score transformation step involves a business-driven scaling process and is not described in detail here. The prediction model can be tested with the ROC curve, the K-S statistic, and other methods. Because a scorecard test needs to show in which score segment good and bad customers are best separated, the K-S statistic is chosen here:

In general, a model with KS > 0.2 can be considered to have relatively good predictive accuracy.
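For reference, the K-S statistic is the largest gap between the cumulative score distributions of bad and good customers. The sketch below computes it directly from scores and labels; the label convention (1 for fraud, 0 for good) and the sample values are assumptions made for illustration.

```python
def ks_statistic(scores, labels):
    """Maximum gap between the cumulative score distributions of bad and good cases.

    `labels` uses 1 for bad (fraud) and 0 for good; both classes must be present.
    """
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every record tied at this score before comparing the two CDFs.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

# Hypothetical model scores (higher = more likely fraud) and true labels.
SCORES = [0.1, 0.4, 0.35, 0.8]
LABELS = [0, 0, 1, 1]

print(ks_statistic(SCORES, LABELS))
```

The score at which the gap peaks is the segment where the model separates the two groups best, which is why this measure suits a scorecard test.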

Logistic regression is one of the basic, commonly used single-classifier algorithms. The same problem can also be modeled with a C5.0 decision tree.

The C5.0 model yields 8 rules for when a customer commits fraud. From these rules we can learn a number of salient features that precede a fraudulent transaction, so that signs of fraud in a customer can be detected and precautions taken early. Rule 1, for example, shows that customers under 27 years of age with a credit card type of "check" and a nationality of Greece or Yugoslavia are one of the high-risk customer groups for fraudulent transactions.

Single classifiers, although widely used in the past, have obvious shortcomings. In recent years the U.S. banking industry has widely adopted the family of tree-based algorithms. Two main types of ensemble learning have seen extensive application, Boosting-based and Bagging-based, and among the newer methods are gradient boosted tree algorithms. These ensemble methods mitigate the problem of interdependence between variables, their predictive power has steadily improved, and they are widely applicable. They have proved very effective in anti-fraud and some other fields and deserve the attention of practitioners.

The main idea of the Boosting algorithm is that, over T iterations, the samples misclassified in each iteration are given a larger resampling weight so that the next iteration pays more attention to them. The weak classifiers trained in this way are then combined by weighted voting into a final classifier, which improves on the accuracy of the weak classification algorithm. We use boosting with 50 decision tree iterations:

Modeling and results:
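The SPSS Modeler stream and its output are not reproduced here. As an independent illustration of the reweighting idea described above, the sketch below implements AdaBoost with one-level decision trees (stumps) in plain Python. It is a swapped-in technique, not C5.0 boosting: it reweights samples instead of resampling them, and the toy data set is hypothetical. Only the number of rounds, 50, is taken from the text.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    """One-level tree: predict `polarity` above the threshold, the opposite below."""
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, weights):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost(X, y, rounds=50):
    """Train `rounds` stumps; labels in `y` must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this weak learner
        ensemble.append((alpha, feature, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for row, label, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all stumps."""
    score = sum(a * stump_predict(row, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical one-feature data that no single stump can classify correctly.
X = [[1], [2], [3], [4], [5], [6]]
Y = [1, 1, -1, -1, 1, 1]

model = adaboost(X, Y, rounds=50)
print([predict(model, row) for row in X])
```

The toy data is chosen so that every individual stump makes mistakes, which makes the benefit of combining many weak learners visible.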

2.2.3 User Profiling

User profiling has been a hot topic in recent years. To trace customers back to their origins and gain a more intuitive understanding of the customer base, and to help the marketing department carry out precision marketing, companies use internal and external (third-party) data to build large-scale data warehouse systems, which become a core asset of the company. User labels usually fall into several major systems: demographics, social group characteristics, financial business characteristics, personal interests and hobbies, and so on. By studying user profiles and building these labeling systems, we can get to know a customer in minutes.

Generally speaking, banks have abundant transaction data, personal attribute data, consumption data, credit data, and customer data, so their need for user profiling is greater and they adopted it earlier. At present, much information such as social interests and hobbies comes from third-party sources. Insurance products have long cycles and the rate at which insurance customers buy further insurance products is very high, so user profiling will be a necessary process there as well.

According to business experience and ensemble algorithm theory (when a data set is large, it can be divided into subsets that are trained separately and then combined into one classifier), for the customer data of large companies such as banks and telecommunications operators we can first classify customers by value (long-tail theory) into high and low tiers. We can then build possibly different types of models for high-value customers, low- and medium-value customers, and so on, to achieve a better classification effect. For each of the many different marketing needs, the first step is to select a subset of label features from the huge customer labeling system and then compute label impact factors by assigning label weights with LR (a ranking model) or similar methods. The top-ranking labels form the portrait of the target users that business personnel need, and they also make it possible to give the marketing department a more accurate marketing customer list, greatly improving business efficiency.

Assuming that the results of the anomaly detection performed at the beginning are true, we add a "fraud: yes/no" customer attribute to the policyholder information table and label each record according to those results. We then build a k-Means model, output the fraud rate of each cluster, and view the results report:

From the output, we can focus on the feature labels of the clusters with a higher proportion of fraud. SPSS Modeler lets us view and compare cluster features directly; the model features of cluster 7, for example, are described below. In this way we can recognize a stranger's fraudulent transactions within a minute.
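The per-cluster fraud rate itself is a simple grouped ratio. A minimal sketch, assuming each policyholder already has a cluster label from the clustering model and a yes/no fraud flag; the labels and flags below are hypothetical.

```python
from collections import defaultdict

def fraud_rate_by_cluster(cluster_labels, fraud_flags):
    """Return {cluster label: share of records flagged as fraud}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for cluster, is_fraud in zip(cluster_labels, fraud_flags):
        totals[cluster] += 1
        if is_fraud:
            frauds[cluster] += 1
    return {cluster: frauds[cluster] / totals[cluster] for cluster in totals}

def riskiest_clusters(rates, top=1):
    """Cluster labels ordered from the highest fraud rate down."""
    return sorted(rates, key=rates.get, reverse=True)[:top]

# Hypothetical cluster assignments and fraud flags for seven policyholders.
CLUSTERS = [1, 1, 2, 2, 2, 7, 7]
FRAUD = [False, False, False, True, False, True, True]

rates = fraud_rate_by_cluster(CLUSTERS, FRAUD)
print(rates, riskiest_clusters(rates))
```

Clusters with very few records can show extreme rates by chance, so the record count is worth checking alongside the rate.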

2.3 Classification of medical institutions

Medical institutions can likewise be classified first with cluster migration analysis (the same method as the cluster migration analysis of the insured). Abroad, anti-fraud technology has been integrated into the management processes of institutions and has achieved good results.

2.4 Detection of Medical Claims

Medical services are delivered by the medical organizations, and detecting fraud through manual review is difficult and costly. Combining the concept and experience of clinical pathways with data mining to build models that automatically identify the characteristic sequence of each specific medical service, such as a course of radiotherapy or the extent of chemotherapy treatment, is significant progress for fraud detection in the health insurance industry. More in-depth research and application has also begun in China.

III. Summary