Curious about data science approaches to real-life cases, I started my CAS in Big Data Analytics in 2019. Today I’d like to share the management summary of my Certificate of Advanced Studies, which I received in 2020. I completed my studies in Big Data Analytics at the information technology department of Lucerne University of Applied Sciences with a thesis on customer churn, supported by a leading Swiss health insurance company. Because the thesis is confidential, I cannot blog about specific content, but I want to share some thoughts on the process and some interesting insights.
First of all, students were asked to find supporting companies and work on real-life problems with the companies’ databases. We decided to tackle customer churn and used the company’s CRM, combined with invoices, freely available market research information and publicly available price trend data for the health insurance sector in Switzerland. The resulting data set covered several years, was strictly anonymized, carried a label “churn” (yes/no) and contained hundreds of raw features. The goal was to help marketing better understand the process of customer churn. We identified groups of churners who take different paths from client to churned customer and gave recommendations for further modeling within these groups.
Photo by Chris Liverani on Unsplash
Insurance companies in Switzerland hardly differ from each other in terms of the legally required benefits of basic health insurance. It is therefore very easy for clients to obtain comparative information about different insurance providers and to change their own insurer. In this highly competitive environment, with a central annual date for changing basic insurance, insurers generate a lot of movement through large-scale marketing campaigns. Since the benefits are the same, insurers are compared mainly on premiums and customer service. Every autumn, the topic of changing insurers is taken up again in the media. The discussion centers on price differences, and the reasons for changing insurers mostly come down to what a customer can save on premiums. Since basic insurance and supplementary insurance do not have to be with the same insurer, existing supplementary insurance policies are no obstacle to a change.
Backed by these business insights, we started our exploratory data analysis to better understand the features and their correlations with churn. We saw that two ranges of the customer age feature are strongly associated with higher churn. We also found some evidence that other sociodemographic information from the CRM explains customer churn quite well. After some visualizations we dug deep into SQL queries and R scripts to engineer the most important features. We had to deal with imbalanced data, since our leading insurance company is not facing a high rate of customer churn. From an academic paper we knew that logistic regression and random forest predicted customer churn best at a Dutch insurance company [https://pure.tue.nl/ws/portalfiles/portal/47019808]. On our data, random forest performed much better, so we chose to concentrate on it and skip logistic regression. After some iterations of hyperparameter tuning we reached, on the test data, a final accuracy of 0.86, precision of 0.81, sensitivity of 0.76, specificity of 0.91 and AUC of 0.94.
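For readers less familiar with these metrics, here is a minimal Python sketch of how accuracy, precision, sensitivity and specificity follow from the four counts of a confusion matrix, and why accuracy alone can mislead when churners are rare. It is not the code from the thesis (which used SQL and R). The function names and all counts are hypothetical and chosen only for illustration. AUC is left out because it needs the predicted probabilities, not just the counts.

```python
from typing import Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute binary classification metrics from confusion-matrix counts.

    "Positive" means churn = yes:
      tp: churners predicted as churners
      fp: loyal customers predicted as churners
      tn: loyal customers predicted as loyal
      fn: churners predicted as loyal
    """
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": safe_ratio(tp, tp + fp),
        "sensitivity": safe_ratio(tp, tp + fn),  # also called recall
        "specificity": safe_ratio(tn, tn + fp),
    }


# Hypothetical test set (NOT the thesis data): 1000 customers, 50 churners.
# A useless model that never predicts churn.
always_no = classification_metrics(tp=0, fp=0, tn=950, fn=50)
# A hypothetical model that finds 38 of the 50 churners with 19 false alarms.
useful = classification_metrics(tp=38, fp=19, tn=931, fn=12)

for name, metrics in [("never predicts churn", always_no),
                      ("hypothetical model", useful)]:
    print(name, metrics)
```

In this made-up example the model that never predicts churn still reaches an accuracy of 0.95, while its sensitivity is 0 and its precision is undefined. That is why, on imbalanced churn data, it makes sense to report precision, sensitivity and specificity next to accuracy.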
Photo by Alexander Sinn on Unsplash
Further steps should cover modeling with XGBoost and training very specific models for smaller groups, leading to individual marketing concepts to control customer churn. Our results also showed some interesting cases for cluster analysis, which could help marketing better address their customers at different stages of the customer life cycle.