[Kaggle] Ubiquant Market Prediction, Financial Data Prediction - Part 3

2022. 4. 3. 11:15



The Ubiquant Market Prediction competition

 

https://www.kaggle.com/code/miingkang/ml-from-the-beginning-to-the-end-for-newbies?scriptVersionId=91431811

 

ML from the beginning to the end (For newbies๐Ÿข)


[Original Kaggle kernel] If it helped, please hit Upvote >_<

* This post covers the final stage: model building, training, and submission.

 

 



Seventh. Modeling & Training -๐Ÿ—ก

 

The final stage is modeling and training.

LightGBM was chosen as the model.

 

import lightgbm

# LightGBM Dataset wrappers around the train/validation splits from the previous part
train_ds = lightgbm.Dataset(train_x, label=train_target)
val_ds = lightgbm.Dataset(test_x, label=test_target)

params = {'learning_rate': 0.01,
          'max_depth': 5,
          'objective': 'regression',
          'metric': 'mse',           # l2, i.e. mean squared error
          'is_training_metric': True,
          'num_leaves': 144}

# Train for 85 boosting rounds, evaluating on the validation set each round
model = lightgbm.train(params, train_ds, 85, valid_sets=val_ds)

 

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 6.336616 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77010
[LightGBM] [Info] Number of data points in the train set: 2513128, number of used features: 302
[LightGBM] [Info] Start training from score -0.020952
[1]	valid_0's l2: 0.842425
[2]	valid_0's l2: 0.842277
[3]	valid_0's l2: 0.842132
[4]	valid_0's l2: 0.84199
[5]	valid_0's l2: 0.841848
[6]	valid_0's l2: 0.841708
[7]	valid_0's l2: 0.841571
[8]	valid_0's l2: 0.84144
[9]	valid_0's l2: 0.841309
[10]	valid_0's l2: 0.84118
[11]	valid_0's l2: 0.841053
[12]	valid_0's l2: 0.840929
[13]	valid_0's l2: 0.84081
[14]	valid_0's l2: 0.84069
[15]	valid_0's l2: 0.840566
[16]	valid_0's l2: 0.840451
[17]	valid_0's l2: 0.840337
[18]	valid_0's l2: 0.840221
[19]	valid_0's l2: 0.84011
[20]	valid_0's l2: 0.840003
[21]	valid_0's l2: 0.83989
[22]	valid_0's l2: 0.839781
[23]	valid_0's l2: 0.839675
[24]	valid_0's l2: 0.839571
[25]	valid_0's l2: 0.839468
[26]	valid_0's l2: 0.839368
[27]	valid_0's l2: 0.839269
[28]	valid_0's l2: 0.839171
[29]	valid_0's l2: 0.839074
[30]	valid_0's l2: 0.838981
[31]	valid_0's l2: 0.838889
[32]	valid_0's l2: 0.838794
[33]	valid_0's l2: 0.838699
[34]	valid_0's l2: 0.838613
[35]	valid_0's l2: 0.838527
[36]	valid_0's l2: 0.838437
[37]	valid_0's l2: 0.838353
[38]	valid_0's l2: 0.838269
[39]	valid_0's l2: 0.838182
[40]	valid_0's l2: 0.838103
[41]	valid_0's l2: 0.838018
[42]	valid_0's l2: 0.837939
[43]	valid_0's l2: 0.837863
[44]	valid_0's l2: 0.837781
[45]	valid_0's l2: 0.837707
[46]	valid_0's l2: 0.837624
[47]	valid_0's l2: 0.837545
[48]	valid_0's l2: 0.837464
[49]	valid_0's l2: 0.837396
[50]	valid_0's l2: 0.83732
[51]	valid_0's l2: 0.837251
[52]	valid_0's l2: 0.837187
[53]	valid_0's l2: 0.83711
[54]	valid_0's l2: 0.837037
[55]	valid_0's l2: 0.836963
[56]	valid_0's l2: 0.836899
[57]	valid_0's l2: 0.836829
[58]	valid_0's l2: 0.836768
[59]	valid_0's l2: 0.836699
[60]	valid_0's l2: 0.83663
[61]	valid_0's l2: 0.836567
[62]	valid_0's l2: 0.836504
[63]	valid_0's l2: 0.836444
[64]	valid_0's l2: 0.836385
[65]	valid_0's l2: 0.836325
[66]	valid_0's l2: 0.836268
[67]	valid_0's l2: 0.836206
[68]	valid_0's l2: 0.836151
[69]	valid_0's l2: 0.836095
[70]	valid_0's l2: 0.836042
[71]	valid_0's l2: 0.835975
[72]	valid_0's l2: 0.835921
[73]	valid_0's l2: 0.835867
[74]	valid_0's l2: 0.835802
[75]	valid_0's l2: 0.835753
[76]	valid_0's l2: 0.8357
[77]	valid_0's l2: 0.835637
[78]	valid_0's l2: 0.835589
[79]	valid_0's l2: 0.835536
[80]	valid_0's l2: 0.835475
[81]	valid_0's l2: 0.83543
[82]	valid_0's l2: 0.835379
[83]	valid_0's l2: 0.835319
[84]	valid_0's l2: 0.835277
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[85]	valid_0's l2: 0.835227

 

 

from sklearn.metrics import mean_squared_error

# Score the trained model on the held-out split
prediction = model.predict(test_x)
mse = mean_squared_error(test_target, prediction)
print(f'model mse is {mse}')

 

model mse is 0.8352271303733023
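As an aside, MSE is only a proxy here: the competition is scored on the mean per-time_id Pearson correlation between predictions and targets. A minimal sketch of a global Pearson check with scipy, reusing prediction and test_target from the cell above (a rough approximation of the leaderboard metric, not the exact per-time_id mean):

from scipy import stats

# Global Pearson correlation between predictions and held-out targets;
# the leaderboard averages this per time_id, so this is only an approximation
pearson = stats.pearsonr(prediction, test_target.values)[0]
print(f'model pearson is {pearson}')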

 


1) KFold

 

KFold is a cross-validation technique.
The training data is split into folds, commonly 5, and the model is trained and validated on rotating subsets, which gives a more robust performance estimate. (A toy sketch of the index splits follows below.)

The model trained on each fold is appended to the models list.

 

%%time
import gc
from scipy import stats
from sklearn.model_selection import KFold

params = {'learning_rate': 0.01,
          'max_depth': 5,
          'objective': 'regression',
          'metric': 'mse',
          'is_training_metric': True,
          'num_leaves': 144}
kfold = KFold(n_splits=5)
models = []

for train_indices, valid_indices in kfold.split(train_x):
    # Slice out this fold's split without overwriting train_x itself
    fold_x, val_x = train_x.iloc[train_indices], train_x.iloc[valid_indices]
    fold_y, val_y = train_target.iloc[train_indices], train_target.iloc[valid_indices]
    train_ds = lightgbm.Dataset(fold_x, label=fold_y)
    val_ds = lightgbm.Dataset(val_x, label=val_y)
    # (the keras EarlyStopping callback from the original notebook has no effect
    #  on a LightGBM booster; see the early-stopping sketch after the log below)
    model = lightgbm.train(params, train_ds, 100, valid_sets=val_ds)
    models.append(model)
    print('finish')
    pearson_score = stats.pearsonr(model.predict(val_x).ravel(), val_y.values)[0]
    print('Pearson:', pearson_score)
    # Free this fold's memory before the next one
    del fold_x, val_x, fold_y, val_y, train_ds, val_ds
    gc.collect()
    break  # train only the first fold to stay within Kaggle's RAM; remove to use all 5

 

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 3.963272 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77010
[LightGBM] [Info] Number of data points in the train set: 2010502, number of used features: 302
[LightGBM] [Info] Start training from score -0.021015
[1]	valid_0's l2: 0.84607
[2]	valid_0's l2: 0.845918
[3]	valid_0's l2: 0.845769
[4]	valid_0's l2: 0.845623
[5]	valid_0's l2: 0.84548
[6]	valid_0's l2: 0.845339
[7]	valid_0's l2: 0.845203
[8]	valid_0's l2: 0.845067
[9]	valid_0's l2: 0.844936
[10]	valid_0's l2: 0.8448
[11]	valid_0's l2: 0.844668
[12]	valid_0's l2: 0.844543
[13]	valid_0's l2: 0.844412
[14]	valid_0's l2: 0.844286
[15]	valid_0's l2: 0.844164
[16]	valid_0's l2: 0.844043
[17]	valid_0's l2: 0.843924
[18]	valid_0's l2: 0.843805
[19]	valid_0's l2: 0.843685
[20]	valid_0's l2: 0.843569
[21]	valid_0's l2: 0.843456
[22]	valid_0's l2: 0.843345
[23]	valid_0's l2: 0.843235
[24]	valid_0's l2: 0.843124
[25]	valid_0's l2: 0.843017
[26]	valid_0's l2: 0.84291
[27]	valid_0's l2: 0.842806
[28]	valid_0's l2: 0.842703
[29]	valid_0's l2: 0.8426
[30]	valid_0's l2: 0.842502
[31]	valid_0's l2: 0.842404
[32]	valid_0's l2: 0.842298
[33]	valid_0's l2: 0.842191
[34]	valid_0's l2: 0.842096
[35]	valid_0's l2: 0.841995
[36]	valid_0's l2: 0.841907
[37]	valid_0's l2: 0.84181
[38]	valid_0's l2: 0.841714
[39]	valid_0's l2: 0.841625
[40]	valid_0's l2: 0.841533
[41]	valid_0's l2: 0.841449
[42]	valid_0's l2: 0.841358
[43]	valid_0's l2: 0.841267
[44]	valid_0's l2: 0.841183
[45]	valid_0's l2: 0.841096
[46]	valid_0's l2: 0.841016
[47]	valid_0's l2: 0.840939
[48]	valid_0's l2: 0.840855
[49]	valid_0's l2: 0.840777
[50]	valid_0's l2: 0.840694
[51]	valid_0's l2: 0.840619
[52]	valid_0's l2: 0.84054
[53]	valid_0's l2: 0.840467
[54]	valid_0's l2: 0.840393
[55]	valid_0's l2: 0.840317
[56]	valid_0's l2: 0.840253
[57]	valid_0's l2: 0.840181
[58]	valid_0's l2: 0.840107
[59]	valid_0's l2: 0.840043
[60]	valid_0's l2: 0.839971
[61]	valid_0's l2: 0.839902
[62]	valid_0's l2: 0.839843
[63]	valid_0's l2: 0.839777
[64]	valid_0's l2: 0.839709
[65]	valid_0's l2: 0.839648
[66]	valid_0's l2: 0.839583
[67]	valid_0's l2: 0.839517
[68]	valid_0's l2: 0.839454
[69]	valid_0's l2: 0.839397
[70]	valid_0's l2: 0.839337
[71]	valid_0's l2: 0.839278
[72]	valid_0's l2: 0.839218
[73]	valid_0's l2: 0.839163
[74]	valid_0's l2: 0.839089
[75]	valid_0's l2: 0.839033
[76]	valid_0's l2: 0.838981
[77]	valid_0's l2: 0.838912
[78]	valid_0's l2: 0.838856
[79]	valid_0's l2: 0.838788
[80]	valid_0's l2: 0.838739
[81]	valid_0's l2: 0.838682
[82]	valid_0's l2: 0.838618
[83]	valid_0's l2: 0.83857
[84]	valid_0's l2: 0.838515
[85]	valid_0's l2: 0.83845
[86]	valid_0's l2: 0.838404
[87]	valid_0's l2: 0.838342
[88]	valid_0's l2: 0.83829
[89]	valid_0's l2: 0.838246
[90]	valid_0's l2: 0.838182
[91]	valid_0's l2: 0.838135
[92]	valid_0's l2: 0.838093
[93]	valid_0's l2: 0.838038
[94]	valid_0's l2: 0.837988
[95]	valid_0's l2: 0.837942
[96]	valid_0's l2: 0.837885
[97]	valid_0's l2: 0.837835
[98]	valid_0's l2: 0.83778
[99]	valid_0's l2: 0.837738
[100]	valid_0's l2: 0.837694
finish
Pearson: 0.11830862796319644
CPU times: user 8min 46s, sys: 5.83 s, total: 8min 51s
Wall time: 2min 25s
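As noted in the code comment above, keras callbacks do nothing for a LightGBM booster; LightGBM ships its own early stopping. A minimal sketch, assuming lightgbm >= 3.3 where the lightgbm.early_stopping callback exists (older versions pass early_stopping_rounds to train instead), reusing the train_ds/val_ds names from the fold loop:

# Stop boosting once validation l2 fails to improve for 10 consecutive rounds
model = lightgbm.train(
    params,
    train_ds,
    num_boost_round=1000,
    valid_sets=[val_ds],
    callbacks=[lightgbm.early_stopping(stopping_rounds=10)],
)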

 

 

 

 

 

Eighth. Tuning -๐ŸŽน

 

 

Ideally, the hyperparameters passed to the model should be tuned.

Tuning was attempted with GridSearchCV, but Kaggle's RAM does not allow it, so the code is left commented out below; a lighter-weight alternative is sketched after it.

 

 

# from sklearn.model_selection import GridSearchCV
# from lightgbm import LGBMRegressor
# LGB = LGBMRegressor()

# lgb_param_grid = {
#     'learning_rate': [1, 0.1, 0.01, 0.001],
#     'n_estimators': [50, 100, 200, 500, 1000, 5000],
#     'max_depth': [15, 20, 25],
#     'num_leaves': [50, 100, 200],
#     'min_split_gain': [0.3, 0.4],
# }
# gsLGB = GridSearchCV(LGB, param_grid=lgb_param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs=4, verbose=1)
# gsLGB.fit(train_x, train_target)
# LGB_best = gsLGB.best_estimator_

# print('Best hyperparameters: ', gsLGB.best_params_)
# print('Best CV score: {:.4f}'.format(gsLGB.best_score_))
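One workaround, not from the original kernel: when the full grid doesn't fit in Kaggle's RAM, RandomizedSearchCV on a subsample of the training rows tries a fixed number of random parameter draws instead of every combination. A sketch under the assumption that train_x and train_target share the same index; the subsample size and grid values are illustrative:

from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMRegressor

# Hypothetical 200k-row subsample to keep memory usage manageable
sample_x = train_x.sample(n=200_000, random_state=42)
sample_y = train_target.loc[sample_x.index]

rs = RandomizedSearchCV(
    LGBMRegressor(),
    param_distributions={
        'learning_rate': [0.1, 0.01, 0.001],
        'n_estimators': [100, 200, 500],
        'max_depth': [5, 15, 25],
        'num_leaves': [50, 100, 200],
    },
    n_iter=10,  # only 10 random combinations instead of the full grid
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
rs.fit(sample_x, sample_y)
print('Best params:', rs.best_params_)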

 

 

 

Submission -โ›ท

 

The inference() function uses every model trained via KFold for prediction and averages their outputs. (Because the loop above breaks after the first fold, models contains a single model here, so the mean reduces to that model's prediction.)

 

import numpy as np

def inference(models, ds):
    # Average the predictions of all fold models (a simple ensemble)
    y_preds = []
    for model in models:
        y_pred = model.predict(ds)
        y_preds.append(y_pred)
    return np.mean(y_preds, axis=0)
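For a quick offline sanity check before touching the API, the function can be run on the held-out split from earlier (hypothetical usage, mirroring the MSE cell above):

offline_preds = inference(models, test_x)
print(mean_squared_error(test_target, offline_preds))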

 

import ubiquant
env = ubiquant.make_env()
iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    # Rebuild time_id from row_id ("<time_id>_<investment_id>") so the test
    # features match the columns the models were trained on
    time_df = test_df.row_id.str.split('_').str[0].astype(int)
    test_df.drop(['row_id'], axis=1, inplace=True)
    test_df['time_id'] = time_df
    sample_prediction_df['target'] = inference(models, test_df)
    env.predict(sample_prediction_df)

 

This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.

 

 

์œ„์˜ ๋ฌธ๊ตฌ๊ฐ€ ๋œจ๋ฉด, ์ œ์ถœ์ด ์™„๋ฃŒ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ˆ˜๊ณ ํ•˜์…จ์Šต๋‹ˆ๋‹ค.