SOLUTION OF MODELING CONTEST TASK

24.12.2020
Prepared by: SCORISTA press service

SCORISTA took part in the Chinese modeling competition “2020 Xiamen International Bank Digital Finance Cup Marketing Modeling Contest”. After the first stage, Scorista ranked 126th (Pic. 1). In this article we describe the competition task and our solution in detail.

Pic. 1

Relevance and task

As technology advances, banks have built up a variety of points of contact with customers, both offline and online, to serve their everyday needs for transactions and operations across different channels. While serving a large number of customers, banks need to understand those customers' needs more comprehensively and accurately. To keep the business profitable, a bank has to study customer churn, predict changes in clients' assets, run marketing activities in advance or in time, and reduce the outflow of funds.

The organizers provided a real business scenario, with customer behavior data and information about customer assets, as the modeling object. In the initial stage, contestants had to demonstrate their ability to select the relevant data. Semi-finalists then have to submit modeling results together with appropriate marketing solutions that fully reflect the value of data analysis.

Data

1) General overview of the data. The data comes in three archives: x_train.rar, y_train.rar and x_test.rar. x_train.rar contains the features of the training set, and y_train.rar contains the target variable for the training sample. The training set consists of two data samples covering two quarters. x_test.rar contains the features of the test set; its variables match those of the training set. The goal is to train a model on the training set and predict the target for the test set.

2) Description of the tables and data fields. The training set consists mostly of sample data for the 3rd and 4th quarters, and the test set consists of sample data for the 1st quarter.

a) aum_m (Y) contains the month-end asset balances for month Y

| Field | Meaning |
| --- | --- |
| cust_no | Unique user id |
| X1 | Structured deposit balance at the end of the month |
| X2 | Term deposit balance at the end of the month |
| X3 | Current account balance at the end of the month |
| X4 | Balance of finances at the end of the month |
| X5 | Fund balance at the end of the month |
| X6 | Asset management balance at the end of the month |
| X7 | Credit balance at the end of the month |
| X8 | Balance of the deposit certificate at the end of the month |

b) behavior_m (Y) represents behavior data for month Y

| Field | Meaning |
| --- | --- |
| cust_no | Unique user id |
| B1 | Number of entries to mobile banking, online banking for Y-month |
| B2 | Number of monthly receipts for Y-month |
| B3 | Amount of receipts for Y-month |
| B4 | Number of monthly transfers for Y-month |
| B5 | Amount of transfers for Y-month |
| B6 | Most recent time of transaction, only for March, June, September and December |
| B7 | Number of actions on the account for the quarter, only for March, June, September and December |

c) big_event_Q (Z) represents important historical customer data for quarter Z

| Field | Meaning |
| --- | --- |
| cust_no | Unique user id |
| E1 | The date of account opening |
| E2 | Date of online banking account opening |
| E3 | The date of mobile app account opening |
| E4 | Date of first Internet Bank registration |
| E5 | Date of first login into mobile app |
| E6 | Date of the first active demand deposit |
| E7 | Start date of the first term deposit |
| E8 | Date of the first credit |
| E9 | First date of overdue |
| E10 | Date of the first financial transaction |
| E11 | Date of the first transfer between the personal bank account and the stock account of the securities company (Bank-Securities Account Transfer) |
| E12 | Date of the first transfer at the counter |
| E13 | Date of the first transfer via Internet banking |
| E14 | Date of the first transfer via the mobile app |
| E15 | The maximum amount of money transferred to another bank |
| E16 | Date of the maximum amount transfer to another bank |
| E17 | Maximum amount of money transferred from another bank |
| E18 | Date of the maximum amount transfer from another bank |

d) cunkuan_m (Y) represents the deposit data for the Y-th month

| Field | Meaning |
| --- | --- |
| cust_no | Unique user id |
| C1 | Amount of products on deposit in the current month |
| C2 | Number of products on deposit in the current month |

e) cust_avli_Q (Z) lists the active customers for quarter Z; it contains only the cust_no field

f) cust_info_q (Z) represents customer information for quarter Z

| Field | Meaning |
| --- | --- |
| cust_no | Unique user id |
| I1 | Gender |
| I2 | Age |
| I3 | Customer level |
| I4 | Mark of the bank's staff |
| I5 | Career description |
| I6 | Bank credit customer mark |
| I7 | Number of bank products |
| I8 | Astrological description |
| I9 | Customer deposit |
| I10 | Description of academic history |
| I11 | Annual family income |
| I12 | Description of the working industry |
| I13 | Description of the marriage status |
| I14 | Job description |
| I15 | Customer's QR code mark from the receipt |
| I16 | VIP client mark |
| I17 | Mark of the Internet bank client |
| I18 | Mobile banking client mark |
| I19 | SMS customer |
| I20 | Mark of WeChat payment client |

The ranking for the initial stage was determined by the Kappa value on the test set.
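To make the metric concrete, here is a minimal sketch of how a Kappa score can be computed. It assumes the contest used the standard (unweighted) Cohen's kappa, which the task description does not state explicitly; the function name `cohen_kappa` and the class labels 0, 1, 2 are illustrative only.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: share of samples where the prediction equals the label.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: what two independent labelings with the same
    # class frequencies would agree on by accident.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case (one class everywhere); a convention chosen here
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for illustration; these are not contest data.
y_true = [0, 1, 2, 2, 1, 2, 0, 2]
y_pred = [0, 1, 2, 1, 1, 2, 2, 2]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 2))
```

Unlike plain accuracy, this score drops toward zero for a model that only reproduces the class frequencies, which is presumably why it suits a task with unevenly sized classes.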

Description of the SCORISTA team's solution

One of the most important stages of the machine learning process is creating the variables used to train models. The source data was very diverse: social data (gender, age, etc.), financial data (account balances, credits, etc.) and customer activity data (dates and numbers of transfers, other actions). We started by creating more than 1000 variables and then selected about 200 of them for model building.
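As an illustration of what creating variables from this data can look like, the sketch below derives a few asset features from the monthly aum_m tables. It is not the set of variables we actually built: the feature names (`aum_last`, `aum_mean`, `aum_change`, `aum_growth`), the in-memory dict layout and the decision to leave the credit balance X7 out of the asset total are all assumptions made for the example.

```python
# Balance fields summed into "total assets". Leaving out X7 (credit balance)
# is an assumption of this sketch, not something the task prescribes.
ASSET_FIELDS = ("X1", "X2", "X3", "X4", "X5", "X6", "X8")


def asset_features(monthly_rows):
    """Features for one customer from month-end rows, oldest month first."""
    totals = [sum(row.get(f, 0.0) for f in ASSET_FIELDS) for row in monthly_rows]
    first, last = totals[0], totals[-1]
    return {
        "aum_last": last,                        # assets in the latest month
        "aum_mean": sum(totals) / len(totals),   # average over the period
        "aum_change": last - first,              # absolute change over the period
        "aum_growth": (last - first) / first if first > 0 else 0.0,
    }


def build_features(tables):
    """tables: list of {cust_no: row} dicts, one per month, oldest first."""
    features = {}
    for cust_no in tables[-1]:  # customers present in the latest month
        rows = [table.get(cust_no, {}) for table in tables]
        features[cust_no] = asset_features(rows)
    return features


# Hypothetical rows in the shape of aum_m7, aum_m8, aum_m9.
m7 = {"c1": {"X2": 1000.0, "X3": 200.0}, "c2": {"X3": 50.0}}
m8 = {"c1": {"X2": 1000.0, "X3": 500.0}, "c2": {"X3": 50.0}}
m9 = {"c1": {"X2": 1000.0, "X3": 800.0}, "c2": {"X3": 0.0}}
features = build_features([m7, m8, m9])
print(features["c1"])
```

Repeating the same pattern (levels, averages, changes, ratios) across the behavior, deposit and event tables is one way the number of candidate variables can quickly grow into the hundreds.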

The competition task did not require a strictly interpretable final model, so using logistic regression was optional. We nevertheless ran several experiments with it. The final logistic regression model scored about 0.4 Kappa (with an accuracy of about 0.7), which is quite high for models used in real-life situations. For the competition, however, other methods were needed (Pic. 2).
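As a rough consistency check, and assuming the metric is the standard Cohen's kappa, the two figures quoted for logistic regression can be related through the observed agreement p_o (the accuracy) and the agreement expected by chance p_e. Both inputs are approximate, so the result is only indicative:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
\quad\Longrightarrow\quad
p_e = \frac{p_o - \kappa}{1 - \kappa}
\approx \frac{0.7 - 0.4}{1 - 0.4} = 0.5
```

Under that assumption, roughly half of the answers would match by chance alone, which is why an accuracy of about 0.7 corresponds to a Kappa of only about 0.4.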

| No. | Model description | Kappa |
| --- | --- | --- |
| 1 | The best interpretable model (based on logistic regression) | 0.40 |
| 2 | Best single gradient boosting model (based on CatBoost), without class calibration | 0.42 |
| 3 | Best single gradient boosting model (based on CatBoost), with class calibration | 0.46 |
| 4 | Best ensemble (6 gradient boosting models), without class calibration | 0.43 |
| 5 | Best ensemble (6 gradient boosting models), with class calibration | 0.47 |

Pic. 2

Since the data was heterogeneous, we decided to use models based on gradient boosting. Such models can be interpreted only qualitatively, with libraries such as shap, so they are not used in real-world tasks that require strictly interpretable models (for example, credit scoring). The final model was built as follows: the class probabilities of 6 models based on different gradient boosting implementations (CatBoost, XGBoost, LightGBM) were multiplied together and then calibrated against the class distribution in the original data. As a result, we got a Kappa of 0.47 (Pic. 2).
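The ensembling step can be sketched as follows. The multiplication of class probabilities follows the description above. The calibration rule is only one plausible reading of "calibrated against the class distribution": each class probability is rescaled so that the average predicted share of a class moves toward its share in the training data. The function names, the toy probabilities and the priors are hypothetical, and all probabilities are assumed to be strictly positive.

```python
from math import prod


def multiply_probabilities(model_probs):
    """model_probs[m][i][c]: probability of class c for sample i from model m."""
    combined = []
    for i in range(len(model_probs[0])):
        # Element-wise product across models, then renormalize the row to sum to 1.
        row = [prod(col) for col in zip(*(probs[i] for probs in model_probs))]
        total = sum(row)
        combined.append([v / total for v in row])
    return combined


def calibrate_to_priors(probs, class_priors):
    """Rescale class probabilities toward the given class shares (assumed rule)."""
    n_classes = len(class_priors)
    # Average predicted probability of each class over the scored samples.
    mean_pred = [sum(row[c] for row in probs) / len(probs) for c in range(n_classes)]
    calibrated = []
    for row in probs:
        scaled = [row[c] * class_priors[c] / mean_pred[c] for c in range(n_classes)]
        total = sum(scaled)
        calibrated.append([v / total for v in scaled])
    return calibrated


def predict_classes(probs):
    """Index of the most probable class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Two toy models, three samples, three classes (hypothetical numbers).
model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
model_b = [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
combined = multiply_probabilities([model_a, model_b])
calibrated = calibrate_to_priors(combined, class_priors=[0.2, 0.6, 0.2])
raw_classes = predict_classes(combined)
calibrated_classes = predict_classes(calibrated)
print(raw_classes, calibrated_classes)
```

In this toy run the calibration changes some of the predicted classes, because the assumed priors favor the middle class. A shift of this kind in the decision boundaries is consistent with the gap between the uncalibrated and calibrated rows in Pic. 2, although the exact rule used in the contest solution may differ.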

For our team, this was a new and useful experience in analyzing and solving a non-standard problem. You can also download the competition data and try to build your own solution. Good luck!

Additional materials:

Targets y_train_3 
x_test 
x_train