SOLUTION OF MODELING CONTEST TASK

24.12.2020

Prepared by: SCORISTA press service

SCORISTA took part in the Chinese modeling competition “2020 Xiamen International Bank Digital Finance Cup Marketing Modeling Contest”. After the first stage, SCORISTA ranked 126th (Pic. 1). In this article we describe the competition task and our solution in detail.

Pic. 1

Relevance and task

As technology has advanced, banks have built up a variety of points of contact with customers, both offline and online, to meet their daily needs for transactions and operations across different channels. While serving a large number of customers, banks need to understand customer needs more comprehensively and accurately. To keep the business profitable, a bank needs to investigate customer churn, predict changes in clients' assets, organize marketing activities in advance or in time, and reduce the outflow of bank funds.

An actual business scenario, with data on customer behavior and customer assets, was provided as the modeling object. At the initial stage, the contestants had to show the ability to select data. Semi-final contestants will have to provide modeling results as well as appropriate marketing solutions that fully reflect the value of data analysis.

Data

1) General overview of the data. The data is divided into a training set and a test set, supplied as three archives: x_train.rar, y_train.rar and x_test.rar. x_train.rar contains the features of the training set, and y_train.rar contains the target variable for the training sample. The training set consists of two data samples, one for each of 2 quarters. x_test.rar contains the features of the test set; its variables are the same as in the training set. The goal of the modeling is to train a model on the training set and make predictions for the test set.

2) Description of the table and data fields. The training set mostly consists of sample data for the 3rd and 4th quarters, and the test set consists of sample data for the 1st quarter.

a) aum_m (Y) contains asset data as of the end of month Y

Field	Meaning
cust_no	Unique user id
X1	Structured deposit balance at the end of the month
X2	Term deposit balance at the end of the month
X3	Current account balance at the end of the month
X4	Balance of finances at the end of the month
X5	Fund balance at the end of the month
X6	Asset management balance at the end of the month
X7	Credit balance at the end of the month
X8	Balance of the deposit certificate at the end of the month

b) behavior_m (Y) represents behavior data for month Y

Field	Meaning
cust_no	Unique user id
B1	Number of logins to mobile banking and online banking in month Y
B2	Number of monthly receipts for Y-month
B3	Amount of receipts for Y-month
B4	Number of monthly transfers for Y-month
B5	Amount of transfers for Y-month
B6	Time of the most recent transaction (only for March, June, September and December)
B7	Number of actions on the account for the quarter (only for March, June, September and December)

c) big_event_Q (Z) represents the customer's important historical data for quarter Z

Field	Meaning
cust_no	Unique user id
E1	The date of account opening
E2	Date of online banking account opening
E3	The date of mobile app account opening
E4	Date of first Internet Bank registration
E5	Date of first login into mobile app
E6	Date of the first active demand deposit
E7	Start date of the first term deposit
E8	Date of the first credit
E9	Date of the first overdue payment
E10	Date of the first financial transaction
E11	Date of the first transfer between the personal bank account and the stock account of the securities company (Bank-Securities Account Transfer)
E12	Date of the first transfer at the counter
E13	Date of the first transfer via Internet banking
E14	Date of the first transfer via the mobile app
E15	Maximum amount of money transferred to another bank
E16	Date of the maximum amount transfer to another bank
E17	Maximum amount of money transferred from another bank
E18	Date of the maximum amount transfer from another bank

d) cunkuan_m (Y) represents the deposit data for the Y-th month

Field	Meaning
cust_no	Unique user id
C1	Amount of products on deposit in the current month
C2	Number of products on deposit in the current month

e) cust_avli_Q (Z) lists the active customers for quarter Z; it contains only cust_no

f) cust_info_q (Z) represents customer information for quarter Z

Field	Meaning
cust_no	Unique user id
I1	Gender
I2	Age
I3	Customer level
I4	Mark of the bank's staff
I5	Career description
I6	Bank credit customer mark
I7	Number of bank products
I8	Astrological description
I9	Customer deposit
I10	Description of academic history
I11	Annual family income
I12	Description of the working industry
I13	Description of the marriage status
I14	Job description
I15	Customer's QR code mark from the receipt
I16	VIP client mark
I17	Mark of the Internet bank client
I18	Mobile banking client mark
I19	SMS customer
I20	Mark of WeChat payment client

The ranking for the initial stage was determined by the Kappa value on the test set.
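To make the metric concrete, here is a minimal sketch of how a Kappa score can be computed. It assumes the standard unweighted Cohen's kappa; the article does not state which variant the organizers used, and the label values in the usage note are purely illustrative.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: share of samples where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Agreement expected by chance, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if p_e == 1.0:
        # Both sequences use one identical label; kappa is formally
        # undefined here, and this sketch treats it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

For example, cohen_kappa([1, 0, -1, 1, 0, -1], [1, 0, -1, 0, 0, 1]) has an observed agreement of 4/6 and a chance agreement of 1/3, so Kappa is lower than the raw accuracy because the metric discounts matches that would occur by chance.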

Description of the SCORISTA team's solution

One of the most important stages of the machine learning process is creating the variables used to train models. The initial data was very diverse: social data (gender, age, etc.), financial data (account balances, credits, etc.), and data on customer activity (dates and number of transfers, other actions). We first created more than 1000 variables and then selected about 200 of them for further model building.
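As an illustration of what creating variables from the monthly tables might look like, the sketch below aggregates month-end balances into per-customer features. The function name, the chosen aggregates (mean, last value, change over the period) and the sample values are assumptions made for this example; the article does not list the variables the team actually built.

```python
from statistics import mean


def quarter_features(monthly_rows, fields=("X1", "X2", "X3")):
    """Aggregate per-month rows into one feature dict per customer.

    monthly_rows: list of dicts in chronological order, each holding
    'cust_no' and numeric balance fields named as in the aum_m tables.
    """
    # Group the monthly rows by customer, preserving month order.
    by_customer = {}
    for row in monthly_rows:
        by_customer.setdefault(row["cust_no"], []).append(row)

    features = {}
    for cust_no, rows in by_customer.items():
        feats = {}
        for field in fields:
            values = [r[field] for r in rows]
            feats[field + "_mean"] = mean(values)              # average level
            feats[field + "_last"] = values[-1]                # latest balance
            feats[field + "_delta"] = values[-1] - values[0]   # change over period
        features[cust_no] = feats
    return features


# Hypothetical month-end current account balances (X3) for two customers.
rows = [
    {"cust_no": "a", "X3": 100.0}, {"cust_no": "b", "X3": 0.0},
    {"cust_no": "a", "X3": 150.0}, {"cust_no": "b", "X3": 0.0},
    {"cust_no": "a", "X3": 50.0},  {"cust_no": "b", "X3": 30.0},
]
example = quarter_features(rows, fields=("X3",))
```

Applying the same pattern to every balance, behavior and deposit field, and to several aggregates, is one way a feature set can grow to the order of a thousand variables before selection.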

The competition task did not require a strictly interpretable final model, so using logistic regression was optional. Nevertheless, we made several attempts with logistic regression. The best model built on logistic regression scored about 0.4 Kappa, which is quite high for models used in real-life situations (the accuracy is about 0.7). For the competition, however, it was necessary to use other methods (Pic. 2).

№	Model description	Kappa
1	Best interpretable model (based on logistic regression)	0.40
2	Best single gradient boosting model (based on CatBoost), without class calibration	0.42
3	Best single gradient boosting model (based on CatBoost), with class calibration	0.46
4	Best ensemble (6 gradient boosting models), without class calibration	0.43
5	Best ensemble (6 gradient boosting models), with class calibration	0.47

Pic. 2

Since the data was heterogeneous, we decided to use models based on gradient boosting. Such models can be interpreted only at a qualitative level, through libraries such as shap, so they are not used in real-world tasks where models need to be strictly interpretable (for example, in credit scoring). The final model was built as follows: the class probabilities from 6 models based on different gradient boosting implementations (CatBoost, XGBoost, LightGBM) were multiplied together, and the result was calibrated against the class distribution in the original data. As a result, we obtained a Kappa of 0.47 (Pic. 2).
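The combination step can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the article does not describe the exact calibration procedure, so the prior-matching rule below (one weight per class, chosen so that the average score of each class tracks its share in the training data) is an assumption, as are the function names.

```python
from math import prod


def combine(model_probs):
    """Multiply per-class probabilities across models and renormalize.

    model_probs[m][i][k]: probability from model m, sample i, class k.
    """
    combined = []
    for i in range(len(model_probs[0])):
        # zip(*...) groups the probabilities of each class across models.
        scores = [prod(col) for col in zip(*(m[i] for m in model_probs))]
        total = sum(scores)
        if total == 0:
            # Every class was ruled out by some model: fall back to uniform.
            combined.append([1.0 / len(scores)] * len(scores))
        else:
            combined.append([s / total for s in scores])
    return combined


def calibrate_to_prior(combined, class_prior):
    """Reweight class scores toward class_prior and return argmax labels.

    Assumed rule: weight_k = prior_k / (average combined score of class k).
    """
    n = len(combined)
    n_classes = len(class_prior)
    avg = [sum(row[k] for row in combined) / n for k in range(n_classes)]
    weights = [class_prior[k] / avg[k] if avg[k] > 0 else 0.0
               for k in range(n_classes)]
    preds = []
    for row in combined:
        scores = [row[k] * weights[k] for k in range(n_classes)]
        preds.append(max(range(n_classes), key=scores.__getitem__))
    return preds


# Two hypothetical models, two samples, two classes.
model_a = [[0.6, 0.4], [0.3, 0.7]]
model_b = [[0.5, 0.5], [0.4, 0.6]]
probs = combine([model_a, model_b])
balanced = calibrate_to_prior(probs, [0.5, 0.5])
skewed = calibrate_to_prior(probs, [0.1, 0.9])
```

With a balanced prior the first sample stays in class 0, while a prior skewed toward class 1 moves it to class 1. A shift of this kind in the predicted class shares is what a class calibration step changes, and the table above shows calibration raising Kappa for both the single model and the ensemble.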

For our team, analyzing and solving a problem that was non-standard for us was a new and useful experience. You can also download the competition data and try to build your own solution. Good luck!

Additional materials:

Targets y_train_3
x_test
x_train

