Monthly Dacon: Credit Card User Default Prediction AI Competition

This notebook uses a random forest to predict credit card default. I want to show that you can build a model in R quickly and easily. You can see the details of this competition here.

First, load the required packages with library.
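
The library calls for this step are not shown here. A plausible set, assuming the tidyverse, tidymodels and skimr packages are installed (plus ranger as the model engine), would be:

```r
# Assumed package set, inferred from the functions used later in this notebook.
library(tidyverse)   # readr::read_csv, dplyr, purrr::map_df, tidyr::pivot_longer
library(tidymodels)  # rsample, recipes, parsnip, workflows, yardstick
library(skimr)       # skim()
# ranger does not need to be attached: parsnip calls it through
# set_engine('ranger'), but the package must be installed.
```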


Read the data with readr::read_csv, then chain mutate_if to convert character columns to factors.

df <- read_csv("train.csv") %>% mutate_if(is.character, as.factor)
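
mutate_if still works, but dplyr 1.0.0 and later mark it as superseded in favour of across. As a sketch on a small made-up tibble (toy and its values are hypothetical, not the competition data), the same conversion can be written as:

```r
library(dplyr)

# Hypothetical toy data standing in for train.csv
toy <- tibble(
  gender = c("F", "M", "F"),
  income_total = c(202500, 247500, 450000)
)

# Convert every character column to a factor and leave the others untouched
toy_fct <- toy %>% mutate(across(where(is.character), as.factor))

str(toy_fct)
```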

skimr::skim prints a detailed summary of a data frame.

-- Data Summary ------------------------
Name                       df    
Number of rows             26457 
Number of columns          20    
Column type frequency:           
  factor                   8     
  numeric                  12    
Group variables            None  

-- Variable type: factor -------------------------------------------------------
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 gender                0         1     FALSE          2 F: 17697, M: 8760
2 car                   0         1     FALSE          2 N: 16410, Y: 10047
3 reality               0         1     FALSE          2 Y: 17830, N: 8627
4 income_type           0         1     FALSE          5 Wor: 13645, Com: 6202, Pen: 4449, Sta: 2154
5 edu_type              0         1     FALSE          5 Sec: 17995, Hig: 7162, Inc: 1020, Low: 257
6 family_type           0         1     FALSE          5 Mar: 18196, Sin: 3496, Civ: 2123, Sep: 1539
7 house_type            0         1     FALSE          6 Hou: 23653, Wit: 1257, Mun: 818, Ren: 429
8 occyp_type         8171         0.691 FALSE         18 Lab: 4512, Cor: 2646, Sal: 2539, Man: 2167

-- Variable type: numeric ------------------------------------------------------
   skim_variable n_missing complete_rate        mean         sd     p0    p25
 1 index                 0             1  13228        7638.         0   6614
 2 child_num             0             1      0.429       0.747      0      0
 3 income_total          0             1 187307.     101878.     27000 121500
 4 DAYS_BIRTH            0             1 -15958.       4202.    -25152 -19431
 5 DAYS_EMPLOYED         0             1  59069.     137475.    -15713  -3153
 6 FLAG_MOBIL            0             1      1           0          1      1
 7 work_phone            0             1      0.225       0.417      0      0
 8 phone                 0             1      0.294       0.456      0      0
 9 email                 0             1      0.0913      0.288      0      0
10 family_size           0             1      2.20        0.917      1      2
11 begin_month           0             1    -26.1        16.6      -60    -39
12 credit                0             1      1.52        0.702      0      1
      p50    p75    p100 hist
 1  13228  19842   26456 ▇▇▇▇▇
 2      0      1      19 ▇▁▁▁▁
 3 157500 225000 1575000 ▇▁▁▁▁
 4 -15547 -12446   -7705 ▃▆▇▇▅
 5  -1539   -407  365243 ▇▁▁▁▂
 6      1      1       1 ▁▁▇▁▁
 7      0      0       1 ▇▁▁▁▂
 8      0      1       1 ▇▁▁▁▃
 9      0      0       1 ▇▁▁▁▁
10      2      3      20 ▇▁▁▁▁
11    -24    -12       0 ▅▆▆▇▇
12      2      2       2 ▂▁▃▁▇

Split the training data into a training set and a validation set. Holding out a validation set guards against data leakage and lets you assess prediction accuracy on data the model has not seen.

split = df  %>% initial_split(prop=0.75, strata='credit')
tr = split  %>% training()
vl = split  %>% testing()
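
The split is random, so results change from run to run unless a seed is set first. The sketch below uses a made-up data frame (toy, its class sizes, and the seed value 42 are assumptions for illustration) to show that a stratified split keeps the class shares of credit roughly equal in both parts:

```r
library(rsample)

set.seed(42)   # arbitrary seed, only to make the split reproducible

# Hypothetical data: three classes in 20% / 30% / 50% proportions
toy <- data.frame(
  x = rnorm(300),
  credit = factor(rep(c("0", "1", "2"), times = c(60, 90, 150)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = credit)
toy_tr <- training(toy_split)
toy_vl <- testing(toy_split)

# Class shares should stay close to 0.2 / 0.3 / 0.5 in both sets
round(prop.table(table(toy_tr$credit)), 2)
round(prop.table(table(toy_vl$credit)), 2)
```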

Record your preprocessing steps with a recipe. The specific steps are below.

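rec <- tr %>% 
  recipe(credit ~.) %>% 
  # treat the outcome as a class label; skip=TRUE leaves new data untouched at predict time
  step_mutate(credit = as.factor(credit), skip=TRUE) %>% 
  # convert day counts into years, flipping the sign
  step_mutate(yrs_birth = -ceiling(DAYS_BIRTH/365),
              yrs_employed = -ceiling(DAYS_EMPLOYED/365)) %>% 
  # drop the row id and the raw day columns
  step_rm(index, DAYS_BIRTH, DAYS_EMPLOYED) %>% 
  # assign missing occyp_type values to a new "unknown" level
  step_unknown(occyp_type) %>% 
  # encode factor predictors as integers
  step_integer(all_nominal(), -all_outcomes()) %>% 
  # subtract each predictor's mean
  step_center(all_predictors(), -all_outcomes())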
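
To see what the day-to-year step_mutate does, here is the same arithmetic in base R on three values taken from the data summary above: the mean and minimum of DAYS_BIRTH and the maximum of DAYS_EMPLOYED. The helper name to_years is made up for this illustration.

```r
# Same formula as in the recipe: divide by 365, round up, and flip the sign.
to_years <- function(days) -ceiling(days / 365)

to_years(-15958)   # mean DAYS_BIRTH: -43.72 rounds up to -43, so the result is 43
to_years(-25152)   # minimum DAYS_BIRTH: -68.91 rounds up to -68, so the result is 68
to_years(365243)   # maximum DAYS_EMPLOYED: 1000.67 rounds up to 1001, so the result is -1001
```

The positive DAYS_EMPLOYED value 365243 turns into -1001 years, which accounts for the large negative minimum of yrs_employed in the summary further down.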


      role #variables
   outcome          1
 predictor         19

Training data contained 26457 data points and 8171 incomplete rows. 


Variables removed index [trained]
Variable mutation for ~factor(credit) [trained]
Variable mutation for ~factor(gender), ~factor(car), ~factor(rea... [trained]
Variable mutation for ~ifelse(DAYS_EMPLOYED > 0, 0, DAYS_EMPLOYED) [trained]
Unknown factor level assignment for occyp_type [trained]
Yeo-Johnson transformation on DAYS_BIRTH, DAYS_EMPLOYED, begin_month [trained]
Centering and scaling for child_num, family_size [trained]
Log transformation on income_total [trained]

You can view the prepped data with juice.

rec_tr <- rec  %>% prep(tr)  %>% juice()
A tibble: 6 × 19
gender car reality child_num income_total income_type edu_type family_type house_type FLAG_MOBIL work_phone phone email occyp_type family_size begin_month credit yrs_birth yrs_employed
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
-0.3319726 0.6210563 0.3278399 0.5708094 -29379.44 -2.400867 0.9036891 -0.370376 -0.2791049 0 -0.2271948 -0.2958371 -0.09081746 -10.8347445 0.8033968 -33.948191 0 -11.192924 165.8533
0.6680274 0.6210563 0.3278399 1.5708094 -29379.44 1.599133 0.9036891 -0.370376 3.7208951 0 -0.2271948 0.7041629 -0.09081746 -6.8347445 1.8033968 -33.948191 0 -8.192924 171.8533
0.6680274 0.6210563 0.3278399 -0.4291906 -51879.44 -2.400867 -2.0963109 -0.370376 -0.2791049 0 0.7728052 -0.2958371 0.90918254 -2.8347445 -0.1966032 8.051809 0 -11.192924 169.8533
-0.3319726 -0.3789437 0.3278399 -0.4291906 -51879.44 1.599133 -2.0963109 -0.370376 -0.2791049 0 -0.2271948 -0.2958371 -0.09081746 -2.8347445 -0.1966032 24.051809 0 -7.192924 164.8533
0.6680274 0.6210563 0.3278399 0.5708094 83120.56 1.599133 0.9036891 -0.370376 -0.2791049 0 -0.2271948 0.7041629 -0.09081746 -2.8347445 0.8033968 4.051809 0 -4.192924 164.8533
0.6680274 -0.3789437 -0.6721601 0.5708094 83120.56 1.599133 0.9036891 -0.370376 -0.2791049 0 0.7728052 -0.2958371 -0.09081746 -0.8347445 0.8033968 -8.948191 0 -16.192924 161.8533

As the summary below shows, all features are now numeric. Each also has a mean of 0 (up to floating-point error) because of the step_center step above.

-- Data Summary ------------------------
Name                       rec_tr
Number of rows             19842 
Number of columns          19    
Column type frequency:           
  factor                   1     
  numeric                  18    
Group variables            None  

-- Variable type: factor -------------------------------------------------------
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 credit                0             1 FALSE          3 2: 12726, 1: 4700, 0: 2416

-- Variable type: numeric ------------------------------------------------------
   skim_variable n_missing complete_rate      mean         sd           p0
 1 gender                0             1 -1.03e-16      0.471      -0.332 
 2 car                   0             1  8.70e-17      0.485      -0.379 
 3 reality               0             1  9.21e-17      0.469      -0.672 
 4 child_num             0             1 -2.68e-17      0.743      -0.429 
 5 income_total          0             1 -8.67e-12 101083.    -159879.    
 6 income_type           0             1 -3.82e-17      1.74       -2.40  
 7 edu_type              0             1  3.28e-16      1.34       -3.10  
 8 family_type           0             1  4.62e-17      0.950      -1.37  
 9 house_type            0             1 -1.01e-16      0.943      -1.28  
10 FLAG_MOBIL            0             1  0             0           0     
11 work_phone            0             1 -6.31e-18      0.419      -0.227 
12 phone                 0             1  4.65e-18      0.456      -0.296 
13 email                 0             1 -9.23e-18      0.287      -0.0908
14 occyp_type            0             1 -7.37e-16      5.99      -10.8   
15 family_size           0             1  5.77e-17      0.914      -1.20  
16 begin_month           0             1  7.42e-16     16.5       -33.9   
17 yrs_birth             0             1 -1.24e-15     11.5       -22.2   
18 yrs_employed          0             1  7.96e-15    375.       -840.    
           p25         p50        p75        p100 hist
 1     -0.332      -0.332      0.668        0.668 ▇▁▁▁▃
 2     -0.379      -0.379      0.621        0.621 ▇▁▁▁▅
 3     -0.672       0.328      0.328        0.328 ▃▁▁▁▇
 4     -0.429      -0.429      0.571       13.6   ▇▁▁▁▁
 5 -65379.     -29379.     38121.     1388121.    ▇▁▁▁▁
 6     -1.40        1.60       1.60         1.60  ▃▂▁▁▇
 7     -2.10        0.904      0.904        0.904 ▁▃▁▁▇
 8     -0.370      -0.370     -0.370        2.63  ▁▇▁▂▁
 9     -0.279      -0.279     -0.279        3.72  ▇▁▁▁▁
10      0           0          0            0     ▁▁▇▁▁
11     -0.227      -0.227     -0.227        0.773 ▇▁▁▁▂
12     -0.296      -0.296      0.704        0.704 ▇▁▁▁▃
13     -0.0908     -0.0908    -0.0908       0.909 ▇▁▁▁▁
14     -4.83       -0.835      7.17         7.17  ▅▂▆▃▇
15     -0.197      -0.197      0.803       12.8   ▇▁▁▁▁
16    -12.9         2.05      14.1         26.1   ▅▆▆▇▇
17     -9.19       -1.19       9.81        24.8   ▅▇▇▆▃
18    162.        165.       169.         204.    ▂▁▁▁▇

rec_tr %>% 
  map_df(~sum(is.na(.))) %>% 
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "na_count") %>% 
  filter(na_count > 0)
A tibble: 0 × 2
variable na_count
<chr> <int>

No column has missing values. Good to go!
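
For a quick check without reshaping, base R's colSums(is.na(...)) gives the same per-column counts. A small sketch on a made-up data frame (toy_na is hypothetical):

```r
# Hypothetical data frame with one missing value in column b
toy_na <- data.frame(a = c(1, 2, 3), b = c("x", NA, "z"))

# Number of missing values per column
na_count <- colSums(is.na(toy_na))

# Keep only the columns that contain missing values
na_count[na_count > 0]
```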

cores = parallel::detectCores()-1

ranger supports parallel processing. Pass the number of threads through its num.threads parameter.

m = rand_forest(trees=100)  %>% 
    set_engine('ranger', num.threads=cores)  %>% 
    set_mode('classification')
wf = workflow()  %>% 
    add_model(m)  %>% 
    add_recipe(rec)

workflows::workflow is a convenient way to bundle your model and recipe in a single object. It lets you try many models with less risk of mistakes.

fit_wf = wf  %>% fit(data=tr)
preds = predict(fit_wf, vl, type='prob')
t = bind_cols(preds, vl$credit)
colnames(t) = c('0','1','2','y_true')
t = t  %>% mutate(y_true = as.factor(y_true))

Log loss is the evaluation metric for this competition. It measures how well the predicted class probabilities match the true classes in a multiclass problem; lower is better.

mn_log_loss(t, `0`:`2`, truth='y_true')
A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
mn_log_loss multiclass 0.722948
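
For reference, multiclass log loss is the mean of the negative log of the probability assigned to the true class. A hand computation on a made-up three-row example (probs and truth are hypothetical, not competition predictions) looks like this in base R:

```r
# Hypothetical predicted probabilities for classes 0, 1, 2 (each row sums to 1)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.2, 0.5, 0.3),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("0", "1", "2")))
truth <- c("0", "2", "2")   # hypothetical true classes

# Pick the probability given to the true class in each row
p_true <- probs[cbind(seq_along(truth), match(truth, colnames(probs)))]

# Mean negative log-likelihood; lower is better
log_loss <- -mean(log(p_true))
log_loss
```

A model that predicts a uniform 1/3 for every class scores log(3), about 1.0986, which makes a handy baseline to compare against.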

You are now ready to make your first submission! Load the test set, make predictions with predict, and write them to a CSV file.

ts <- read_csv("test.csv") %>% mutate_if(is.character, as.factor)
preds = predict(fit_wf, ts, type='prob')
submission <- bind_cols(index = ts$index, preds)
colnames(submission) <- c("index", 0, 1, 2)
write_csv(submission, "ss.csv")