SCORISTA took part in the Chinese modeling competition “2020 Xiamen International Bank Digital Finance Cup Marketing Modeling Contest”. After the first stage, Scorista ranked 126th (Pic. 1). This article describes the competition task and our solution in detail.
Pic. 1
Relevance and task
As technology has advanced, banks have built up many points of contact with customers, both offline and online, to serve their everyday needs for transactions and other operations across channels. While serving a large number of customers, banks need to understand customer needs more comprehensively and accurately. To keep the business profitable, a bank has to study customer churn, predict changes in clients' assets, plan marketing activities in advance or in time, and reduce the loss of bank funds.
The organizers provided a real business scenario as the modeling object: customer behavior data and information about customer assets. In the initial stage, contestants had to demonstrate their ability to select data. Semi-finalists have to provide modeling results together with appropriate marketing solutions that fully reflect the value of data analysis.
Data
1) General overview of the data. The data is divided into two data sets: a training set (x_train.rar, y_train.rar) and a test set (x_test.rar). x_train.rar contains the features of the training set, and y_train.rar contains the target variable for the training sample. The training set consists of two data samples covering 2 quarters. x_test.rar contains the features of the test set; its variables are consistent with those of the training set. The goal is to train a model on the training set and predict the target for the test set.
2) Description of the table and data fields. The training set mostly consists of sample data for the 3rd and 4th quarters, and the test set consists of sample data for the 1st quarter.
a) aum_m (Y) represents asset data at the end of month Y
Field | Meaning |
cust_no | Unique user id |
X1 | Structured deposit balance at the end of the month |
X2 | Term deposit balance at the end of the month |
X3 | Current account balance at the end of the month |
X4 | Balance of finances at the end of the month |
X5 | Fund balance at the end of the month |
X6 | Asset management balance at the end of the month |
X7 | Credit balance at the end of the month |
X8 | Balance of the deposit certificate at the end of the month |
b) behavior_m (Y) represents behavior data for month Y
Field | Meaning |
cust_no | Unique user id |
B1 | Number of entries to mobile banking, online banking for Y-month |
B2 | Number of monthly receipts for Y-month |
B3 | Amount of receipts for Y-month |
B4 | Number of monthly transfers for Y-month |
B5 | Amount of transfers for Y-month |
B6 | Most recent time of transaction only for March, June, September and December |
B7 | Number of actions on the account for the quarter, only for March, June, September and December |
c) big_event_Q (Z) represents important historical customer data for quarter Z
Field | Meaning |
cust_no | Unique user id |
E1 | The date of account opening |
E2 | Date of online banking account opening |
E3 | The date of mobile app account opening |
E4 | Date of first Internet Bank registration |
E5 | Date of first login into mobile app |
E6 | Date of the first active demand deposit |
E7 | Start date of the first term deposit |
E8 | Date of the first credit |
E9 | First date of overdue |
E10 | Date of the first financial transaction |
E11 | Date of the first transfer between the personal bank account and the stock account of the securities company (Bank-Securities Account Transfer) |
E12 | Date of the first transfer at the counter |
E13 | Date of the first transfer via Internet banking |
E14 | Date of the first transfer via the mobile app |
E15 | The maximum amount of money transferred to another bank |
E16 | Date of the maximum amount transfer to another bank |
E17 | Maximum amount of money transferred from another bank |
E18 | Date of the maximum amount transfer from another bank |
d) cunkuan_m (Y) represents the deposit data for the Y-th month
Field | Meaning |
cust_no | Unique user id |
C1 | Amount of products on deposit in the current month |
C2 | Number of products on deposit in the current month |
e) cust_avli_Q (Z) represents active customers for quarter Z; it contains only cust_no
f) cust_info_q (Z) represents customer information for quarter Z
Field | Meaning |
cust_no | Unique user id |
I1 | Gender |
I2 | Age |
I3 | Customer level |
I4 | Mark of the bank's staff |
I5 | Career description |
I6 | Bank credit customer mark |
I7 | Number of bank products |
I8 | Astrological description |
I9 | Customer deposit |
I10 | Description of academic history |
I11 | Annual family income |
I12 | Description of the working industry |
I13 | Description of the marriage status |
I14 | Job description |
I15 | Customer's QR code mark from the receipt |
I16 | VIP client mark |
I17 | Mark of the Internet bank client |
I18 | Mobile banking client mark |
I19 | SMS customer |
I20 | Mark of WeChat payment client |
The ranking for the initial stage was determined by the Kappa value on the test set.
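For readers unfamiliar with the metric, here is a minimal sketch of how a Kappa score can be computed. It assumes the organizers meant the standard unweighted Cohen's kappa, which the contest description above does not state explicitly; the function name `cohen_kappa` and the toy labels are our own illustration, not competition data.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: share of samples where the prediction matches the label.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Agreement expected by chance, from the marginal class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_expected == 1.0:
        # Both labelings use a single identical class: kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical three-class labels, for illustration only.
y_true = [-1, 0, 1, 1, 0, -1, 1, 0]
y_pred = [-1, 0, 1, 0, 0, 1, 1, 0]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 4))
```

In this toy example 6 of 8 predictions are correct (accuracy 0.75), but the chance-corrected score is lower, which is why Kappa values look modest next to accuracy figures.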
Description of the SCORISTA team's solution
One of the most important stages of the machine learning process is creating the variables used to train models. The initial data was very diverse: social data (gender, age, etc.), financial data (account balances, credits, etc.), and customer activity data (dates and numbers of transfers, other actions). We first created more than 1000 variables, then selected about 200 of them for further model building.
The competition task did not require a precisely interpretable final model, so using logistic regression was optional. Still, we made several attempts with logistic regression. The final model built on logistic regression scored about 0.4 Kappa, which is quite high for models used in real-life situations (the accuracy is about 0.7). Nevertheless, the competition called for other methods (Pic. 2).
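To give a feel for what such variables can look like, here is a small sketch using pandas. It is not our actual feature pipeline: the frame names (`aum_m8`, `aum_m9`, `behavior_m9`), the toy values, the helper `build_features` and the derived columns are all hypothetical, and only three of the balance fields are used for brevity. Column names follow the field tables above.

```python
import pandas as pd

# Hypothetical toy rows for two customers; real data has many more fields.
aum_m8 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [0.0, 100.0],
                       "X2": [50.0, 0.0], "X3": [10.0, 20.0]})
aum_m9 = pd.DataFrame({"cust_no": ["a", "b"], "X1": [0.0, 150.0],
                       "X2": [40.0, 0.0], "X3": [30.0, 5.0]})
behavior_m9 = pd.DataFrame({"cust_no": ["a"], "B3": [900.0], "B5": [300.0]})

# Assumption: treat these three balances as the customer's assets.
ASSET_COLS = ["X1", "X2", "X3"]


def build_features(aum_prev, aum_last, behavior_last):
    """Return one row of engineered features per customer in aum_last."""
    last = aum_last[["cust_no"]].copy()
    last["total_balance"] = aum_last[ASSET_COLS].sum(axis=1)

    prev = aum_prev[["cust_no"]].copy()
    prev["total_balance_prev"] = aum_prev[ASSET_COLS].sum(axis=1)

    # Left joins keep every customer even when a table has no row for them.
    feats = last.merge(prev, on="cust_no", how="left")
    feats = feats.merge(behavior_last, on="cust_no", how="left")

    # Month-over-month change in total assets.
    feats["balance_change"] = feats["total_balance"] - feats["total_balance_prev"]
    # Receipts minus transfers; customers with no behavior row get 0.
    feats["net_flow"] = feats["B3"].fillna(0) - feats["B5"].fillna(0)
    return feats


features = build_features(aum_m8, aum_m9, behavior_m9)
print(features)
```

Repeating this pattern over every month, table and field combination is how a feature set grows into hundreds of candidate variables that then need to be filtered.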
№ | Model description | Kappa |
1 | The best interpretable model (based on logistic regression) | 0.40 |
2 | Best single gradient boosting model (based on CatBoost), without class calibration | 0.42 |
3 | Best single gradient boosting model (based on CatBoost), with class calibration | 0.46 |
4 | Best ensemble (6 gradient boosting models), without class calibration | 0.43 |
5 | Best ensemble (6 gradient boosting models), with calibration by class | 0.47 |
Pic. 2
Since the data was heterogeneous, we decided to use models based on gradient boosting. Such models can be interpreted only at a qualitative level through libraries such as shap, so they are not used in real-world tasks where models must be strictly interpretable (for example, in credit scoring). The final model was built as follows: the class probabilities from 6 models based on different types of gradient boosting (CatBoost, XGBoost, LightGBM) were multiplied and then calibrated against the class distribution in the original data. As a result, we reached a Kappa of 0.47 (Pic. 2).
For our team, this was a new and useful experience in analyzing and solving a non-standard problem. You can also download the competition data and try to build your own solution. Good luck!
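The blending step can be sketched roughly as follows. This is an illustration under assumptions, not our production code: the helper names `blend_by_product` and `calibrate_to_prior`, the toy probabilities and the class prior are hypothetical, and the calibration shown (reweighting each class by the ratio of its share in the training data to its average predicted share) is only one plausible reading of "calibrated against the class distribution".

```python
import numpy as np


def blend_by_product(prob_list):
    """Multiply per-class probabilities from several models, then renormalize rows."""
    stacked = np.stack(prob_list)  # shape: (n_models, n_samples, n_classes)
    # Summing logs equals taking the product and avoids numeric underflow.
    log_prod = np.log(np.clip(stacked, 1e-12, 1.0)).sum(axis=0)
    log_prod -= log_prod.max(axis=1, keepdims=True)
    blended = np.exp(log_prod)
    return blended / blended.sum(axis=1, keepdims=True)


def calibrate_to_prior(probs, class_prior):
    """Reweight class columns toward class_prior (assumed calibration scheme)."""
    class_prior = np.asarray(class_prior, dtype=float)
    mean_pred = probs.mean(axis=0)      # average predicted share per class
    weights = class_prior / mean_pred   # boost under-predicted classes
    adjusted = probs * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)


# Hypothetical outputs of two models for three customers and three classes.
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]])
model_b = np.array([[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]])

blended = blend_by_product([model_a, model_b])
# Hypothetical class shares in the training labels.
calibrated = calibrate_to_prior(blended, [0.2, 0.6, 0.2])
predicted_class = calibrated.argmax(axis=1)
print(predicted_class)
```

Multiplying probabilities rewards classes on which all models agree, while the reweighting shifts borderline predictions toward classes that the blend predicts less often than they occur in the training data.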
Additional materials: