Enhance Fraud Detection in Insurance Claims| Analysis with Excel

A Linear Regression Approach for Insurance Claims and Fraud Detection Using Excel

Unlock the power of Excel for precise fraud detection in insurance claims with our comprehensive linear regression model. This skilful approach not only identifies key variables such as claim amount and annual income but also optimizes investigation thresholds, showcasing the efficiency and cost-effectiveness of an Excel-driven solution in safeguarding against fraudulent activities. Embrace the future of fraud detection with precision and ease using Excel.

Problem Description:

Fraudulent insurance claims pose a significant challenge, consuming valuable resources and costing companies substantial financial losses. This paper addresses this issue through a statistical fraud detection framework applied to insurance claims using linear regression using Excel assignment. The model, employing multiple linear regression, analyzes customer data, including variables such as annual income, marital status, age, and more, to identify potential fraud instances.

Data Selection: A random sample of 8,000 data points was selected from a pool of approximately 9,000 for model development.

Model Development: Variable selection is crucial in building an effective linear regression model. After exploring various variables against fraudulent frequency, the following were chosen:

Claim Amount

Chart 1: An analysis of the average claim amount used in the regression model

Employment status

Chart 2: Employment status vs. fraud detection

Age

Chart 3: Age vs. fraud detection

Annual income

Fraud	Average of Annual Income ( $ )
0	41,017.28
1	23,802.24
Grand Total	37507.57

Table 1: The average annual income vs. fraud detection

Monthly Premium
Marital Status (Single)

Chart 4: Marital status vs. fraud detection

Model Results:

The multiple regression model explained approximately 19.4% of the variation in fraud using the selected regressors. The model was statistically significant (F = 275.34, p < 0.001), indicating its efficacy in predicting fraud likelihood. While marital status (single) and employment status were statistically insignificant, they were retained in the model due to similar adjusted R-squared values.

Prediction of Fraud:

The likelihood of fraud was predicted using the estimated equation, provided below. A higher value signifies a greater likelihood of a fraudulent claim.

y ̂=0.697-0.012×Age+0.0003×Montly Premium-0.0000098×Claim Amount-0.000024×Annual Income+0.002×I(Single)+0.063×I(Unemployed)+0.039×I(Retired)

Investigation Decision Rule:

The threshold for fraud investigation was optimized through a cost function. The model suggests a threshold of 0.25, resulting in 3,350 investigations and a total cost of $4.94 million to the firm.

Threshold	Investigated?	Non-Fraud Investigated	Fraud Not Investigated	Cost
0.1	5785	4195	41	$ 5,990,000
0.15	5106	3555	80	$ 5,506,000
0.2	4231	2783	183	$ 5,146,000
0.25	3350	2037	318	$ 4,940,000
0.3	2463	1360	528	$ 5,103,000
0.35	1731	846	746	$ 5,461,000
0.4	1076	432	987	$ 6,011,000
0.45	619	218	1230	$ 6,769,000
0.5	344	100	1387	$ 7,279,000
0.55	112	15	1534	$ 7,782,000
0.6	0	0	1631	$ 8,155,000

Table 2: The cost of investigation of fraud on the firm

Model Evaluation:

Comparing the model-based recommendation to a random decision model:

Total Investigated: 3,350
Investigation Cost: $3.35 million
Fraud Cost: $4.775 million
Total Cost: $8.125 million
Cost from Regression: $4.94 million
Money Saved: $3.185 million

The regression model's cost was $3.185 million lower, demonstrating a 64.5% cost reduction compared to the random decision model.

Areas of Improvement:

Consider using logistic regression for binary response variables.
Explore penalized methods or best subset selection for more optimal variable selection.
Incorporate higher-order terms to capture additional variation in the data.