[Original Kaggle kernel] If this post helped, please hit Upvote >_<
* This post walks through the final stage: building the model, training it, and submitting.
Seventh. Modeling & Training
The final step is modeling and training.
We chose LightGBM as the model.
import lightgbm

# Build LightGBM datasets from the train/holdout splits prepared earlier in the series
train_ds = lightgbm.Dataset(train_x, label=train_target)
val_ds = lightgbm.Dataset(test_x, label=test_target)

params = {'learning_rate': 0.01,
          'max_depth': 5,
          'objective': 'regression',
          'metric': 'mse',  # alias for l2
          'is_training_metric': True,
          'num_leaves': 144}

# Train for 85 boosting rounds, logging the validation l2 each round
model = lightgbm.train(params, train_ds, num_boost_round=85, valid_sets=[val_ds])
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 6.336616 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77010
[LightGBM] [Info] Number of data points in the train set: 2513128, number of used features: 302
[LightGBM] [Info] Start training from score -0.020952
[1] valid_0's l2: 0.842425
[2] valid_0's l2: 0.842277
[3] valid_0's l2: 0.842132
[4] valid_0's l2: 0.84199
[5] valid_0's l2: 0.841848
[6] valid_0's l2: 0.841708
[7] valid_0's l2: 0.841571
[8] valid_0's l2: 0.84144
[9] valid_0's l2: 0.841309
[10] valid_0's l2: 0.84118
[11] valid_0's l2: 0.841053
[12] valid_0's l2: 0.840929
[13] valid_0's l2: 0.84081
[14] valid_0's l2: 0.84069
[15] valid_0's l2: 0.840566
[16] valid_0's l2: 0.840451
[17] valid_0's l2: 0.840337
[18] valid_0's l2: 0.840221
[19] valid_0's l2: 0.84011
[20] valid_0's l2: 0.840003
[21] valid_0's l2: 0.83989
[22] valid_0's l2: 0.839781
[23] valid_0's l2: 0.839675
[24] valid_0's l2: 0.839571
[25] valid_0's l2: 0.839468
[26] valid_0's l2: 0.839368
[27] valid_0's l2: 0.839269
[28] valid_0's l2: 0.839171
[29] valid_0's l2: 0.839074
[30] valid_0's l2: 0.838981
[31] valid_0's l2: 0.838889
[32] valid_0's l2: 0.838794
[33] valid_0's l2: 0.838699
[34] valid_0's l2: 0.838613
[35] valid_0's l2: 0.838527
[36] valid_0's l2: 0.838437
[37] valid_0's l2: 0.838353
[38] valid_0's l2: 0.838269
[39] valid_0's l2: 0.838182
[40] valid_0's l2: 0.838103
[41] valid_0's l2: 0.838018
[42] valid_0's l2: 0.837939
[43] valid_0's l2: 0.837863
[44] valid_0's l2: 0.837781
[45] valid_0's l2: 0.837707
[46] valid_0's l2: 0.837624
[47] valid_0's l2: 0.837545
[48] valid_0's l2: 0.837464
[49] valid_0's l2: 0.837396
[50] valid_0's l2: 0.83732
[51] valid_0's l2: 0.837251
[52] valid_0's l2: 0.837187
[53] valid_0's l2: 0.83711
[54] valid_0's l2: 0.837037
[55] valid_0's l2: 0.836963
[56] valid_0's l2: 0.836899
[57] valid_0's l2: 0.836829
[58] valid_0's l2: 0.836768
[59] valid_0's l2: 0.836699
[60] valid_0's l2: 0.83663
[61] valid_0's l2: 0.836567
[62] valid_0's l2: 0.836504
[63] valid_0's l2: 0.836444
[64] valid_0's l2: 0.836385
[65] valid_0's l2: 0.836325
[66] valid_0's l2: 0.836268
[67] valid_0's l2: 0.836206
[68] valid_0's l2: 0.836151
[69] valid_0's l2: 0.836095
[70] valid_0's l2: 0.836042
[71] valid_0's l2: 0.835975
[72] valid_0's l2: 0.835921
[73] valid_0's l2: 0.835867
[74] valid_0's l2: 0.835802
[75] valid_0's l2: 0.835753
[76] valid_0's l2: 0.8357
[77] valid_0's l2: 0.835637
[78] valid_0's l2: 0.835589
[79] valid_0's l2: 0.835536
[80] valid_0's l2: 0.835475
[81] valid_0's l2: 0.83543
[82] valid_0's l2: 0.835379
[83] valid_0's l2: 0.835319
[84] valid_0's l2: 0.835277
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[85] valid_0's l2: 0.835227
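The warning at the top of the log suggests setting force_col_wise=true up front to skip the multi-threading auto-detection overhead. A minimal sketch of applying that hint (not in the original kernel):

# Optional, per the LightGBM warning above: skip col-wise/row-wise auto-detection
params['force_col_wise'] = True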
from sklearn.metrics import mean_squared_error

# Evaluate the trained model on the holdout set
prediction = model.predict(test_x)
mse = mean_squared_error(test_target, prediction)
print(f'model mse is {mse}')
model mse is 0.8352271303733023
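MSE is the objective we optimize, but the competition leaderboard is scored with a Pearson-correlation-based metric, which is why the KFold loop below reports stats.pearsonr. For reference, a minimal sketch of the same check on this holdout set (not in the original kernel):

from scipy import stats

# Pearson correlation between holdout predictions and true targets
pearson = stats.pearsonr(prediction.ravel(), test_target.values)[0]
print(f'model pearson is {pearson}')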
1) KFold
KFold is a cross-validation method.
The training data is split into five folds, and the model is trained and validated on a different fold each round to raise performance.
Each fold's trained model is appended to the models list.
%%time
from sklearn.model_selection import KFold
from scipy import stats
import gc

params = {'learning_rate': 0.01,
          'max_depth': 5,
          'objective': 'regression',
          'metric': 'mse',
          'is_training_metric': True,
          'num_leaves': 144}

kfold = KFold(n_splits=5)
models = []
for train_indices, valid_indices in kfold.split(train_x):
    # Fold-local names keep the full train_x intact for later folds
    fold_train_x, fold_val_x = train_x.iloc[train_indices], train_x.iloc[valid_indices]
    fold_train_y, fold_val_y = train_target.iloc[train_indices], train_target.iloc[valid_indices]
    train_ds = lightgbm.Dataset(fold_train_x, label=fold_train_y)
    val_ds = lightgbm.Dataset(fold_val_x, label=fold_val_y)

    model = lightgbm.train(params, train_ds, num_boost_round=100, valid_sets=[val_ds])
    models.append(model)
    print('finish')

    # Score the fold on the competition-style Pearson correlation
    pearson_score = stats.pearsonr(model.predict(fold_val_x).ravel(), fold_val_y.values)[0]
    print('Pearson:', pearson_score)

    # Free fold-level objects to stay inside Kaggle's RAM limit
    del fold_train_x, fold_val_x, fold_train_y, fold_val_y, train_ds, val_ds
    gc.collect()
    break  # train only the first fold here to save time and memory
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 3.963272 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77010
[LightGBM] [Info] Number of data points in the train set: 2010502, number of used features: 302
[LightGBM] [Info] Start training from score -0.021015
[1] valid_0's l2: 0.84607
[2] valid_0's l2: 0.845918
[3] valid_0's l2: 0.845769
[4] valid_0's l2: 0.845623
[5] valid_0's l2: 0.84548
[6] valid_0's l2: 0.845339
[7] valid_0's l2: 0.845203
[8] valid_0's l2: 0.845067
[9] valid_0's l2: 0.844936
[10] valid_0's l2: 0.8448
[11] valid_0's l2: 0.844668
[12] valid_0's l2: 0.844543
[13] valid_0's l2: 0.844412
[14] valid_0's l2: 0.844286
[15] valid_0's l2: 0.844164
[16] valid_0's l2: 0.844043
[17] valid_0's l2: 0.843924
[18] valid_0's l2: 0.843805
[19] valid_0's l2: 0.843685
[20] valid_0's l2: 0.843569
[21] valid_0's l2: 0.843456
[22] valid_0's l2: 0.843345
[23] valid_0's l2: 0.843235
[24] valid_0's l2: 0.843124
[25] valid_0's l2: 0.843017
[26] valid_0's l2: 0.84291
[27] valid_0's l2: 0.842806
[28] valid_0's l2: 0.842703
[29] valid_0's l2: 0.8426
[30] valid_0's l2: 0.842502
[31] valid_0's l2: 0.842404
[32] valid_0's l2: 0.842298
[33] valid_0's l2: 0.842191
[34] valid_0's l2: 0.842096
[35] valid_0's l2: 0.841995
[36] valid_0's l2: 0.841907
[37] valid_0's l2: 0.84181
[38] valid_0's l2: 0.841714
[39] valid_0's l2: 0.841625
[40] valid_0's l2: 0.841533
[41] valid_0's l2: 0.841449
[42] valid_0's l2: 0.841358
[43] valid_0's l2: 0.841267
[44] valid_0's l2: 0.841183
[45] valid_0's l2: 0.841096
[46] valid_0's l2: 0.841016
[47] valid_0's l2: 0.840939
[48] valid_0's l2: 0.840855
[49] valid_0's l2: 0.840777
[50] valid_0's l2: 0.840694
[51] valid_0's l2: 0.840619
[52] valid_0's l2: 0.84054
[53] valid_0's l2: 0.840467
[54] valid_0's l2: 0.840393
[55] valid_0's l2: 0.840317
[56] valid_0's l2: 0.840253
[57] valid_0's l2: 0.840181
[58] valid_0's l2: 0.840107
[59] valid_0's l2: 0.840043
[60] valid_0's l2: 0.839971
[61] valid_0's l2: 0.839902
[62] valid_0's l2: 0.839843
[63] valid_0's l2: 0.839777
[64] valid_0's l2: 0.839709
[65] valid_0's l2: 0.839648
[66] valid_0's l2: 0.839583
[67] valid_0's l2: 0.839517
[68] valid_0's l2: 0.839454
[69] valid_0's l2: 0.839397
[70] valid_0's l2: 0.839337
[71] valid_0's l2: 0.839278
[72] valid_0's l2: 0.839218
[73] valid_0's l2: 0.839163
[74] valid_0's l2: 0.839089
[75] valid_0's l2: 0.839033
[76] valid_0's l2: 0.838981
[77] valid_0's l2: 0.838912
[78] valid_0's l2: 0.838856
[79] valid_0's l2: 0.838788
[80] valid_0's l2: 0.838739
[81] valid_0's l2: 0.838682
[82] valid_0's l2: 0.838618
[83] valid_0's l2: 0.83857
[84] valid_0's l2: 0.838515
[85] valid_0's l2: 0.83845
[86] valid_0's l2: 0.838404
[87] valid_0's l2: 0.838342
[88] valid_0's l2: 0.83829
[89] valid_0's l2: 0.838246
[90] valid_0's l2: 0.838182
[91] valid_0's l2: 0.838135
[92] valid_0's l2: 0.838093
[93] valid_0's l2: 0.838038
[94] valid_0's l2: 0.837988
[95] valid_0's l2: 0.837942
[96] valid_0's l2: 0.837885
[97] valid_0's l2: 0.837835
[98] valid_0's l2: 0.83778
[99] valid_0's l2: 0.837738
[100] valid_0's l2: 0.837694
finish
Pearson: 0.11830862796319644
CPU times: user 8min 46s, sys: 5.83 s, total: 8min 51s
Wall time: 2min 25s
Eighth. Tuning
Next, the hyperparameters fed to the model should be tuned.
We tried tuning with GridSearchCV, but Kaggle's RAM does not allow it, so the code below is left commented out.
# from sklearn.model_selection import GridSearchCV
# from lightgbm import LGBMRegressor
# LGB = LGBMRegressor()
# lgb_param_grid = {
#     'learning_rate': [1, 0.1, 0.01, 0.001],
#     'n_estimators': [50, 100, 200, 500, 1000, 5000],
#     'max_depth': [15, 20, 25],
#     'num_leaves': [50, 100, 200],  # the grid's duplicate num_leaves key was dropped
#     'min_split_gain': [0.3, 0.4],
# }
# gsLGB = GridSearchCV(LGB, param_grid=lgb_param_grid, cv=5, scoring="neg_mean_squared_error", n_jobs=4, verbose=1)
# gsLGB.fit(train_x, train_target)
# LGB_best = gsLGB.best_estimator_
# print('Best hyperparameters: ', gsLGB.best_params_)
# print('Best score: {:.4f}'.format(gsLGB.best_score_))
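A hypothetical workaround (not in the original kernel) is to search a random subset of the grid on a subsample of the data with RandomizedSearchCV; the 200,000-row sample size below is an assumption, not something tested here.

from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMRegressor

# Tune on a random subsample so the search fits in Kaggle's RAM (assumed sample size)
sample_x = train_x.sample(n=200_000, random_state=42)
sample_y = train_target.loc[sample_x.index]
search = RandomizedSearchCV(
    LGBMRegressor(),
    param_distributions={'learning_rate': [0.1, 0.01, 0.001],
                         'num_leaves': [50, 100, 200],
                         'max_depth': [5, 15, 25]},
    n_iter=10, cv=3, scoring='neg_mean_squared_error', n_jobs=4, random_state=42)
search.fit(sample_x, sample_y)
print('Best hyperparameters:', search.best_params_)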
Submission
The inference() function uses every model trained through KFold and averages their predictions.
import numpy as np

def inference(models, ds):
    # Average the predictions of all fold models
    y_preds = []
    for model in models:
        y_pred = model.predict(ds)
        y_preds.append(y_pred)
    return np.mean(y_preds, axis=0)
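As a quick sanity check (not in the original kernel), the ensemble can be scored on the same holdout split used earlier:

# Hypothetical check: ensemble MSE on the holdout split
ensemble_mse = mean_squared_error(test_target, inference(models, test_x))
print(f'ensemble mse is {ensemble_mse}')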
import ubiquant

# Kaggle's time-series API feeds the test set batch by batch
env = ubiquant.make_env()
iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    # row_id is "timeid_investmentid"; recover time_id so the features match training
    time_df = test_df.row_id.str.split('_').str[0].astype(int)
    test_df.drop(['row_id'], axis=1, inplace=True)
    test_df['time_id'] = time_df
    # Ensemble prediction over all KFold models, handed back to the API
    sample_prediction_df['target'] = inference(models, test_df)
    env.predict(sample_prediction_df)
This version of the API is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
Once the message above appears, the submission is complete. Great work!