[Kaggle] Ubiquant Market Prediction, Financial Data Prediction - Part 2

2022. 4. 3. 11:08 · 🧪 Data Science/Kaggle

 

Ubiquant Market Prediction competition

 

 

 

https://www.kaggle.com/code/miingkang/ml-from-the-beginning-to-the-end-for-newbies?scriptVersionId=91431811

 

ML from the beginning to the end (For newbies🐢)


[Original Kaggle kernel] If you found it helpful, please give it an upvote >_<

* This post covers simple EDA and feature engineering. If you want to see the ML training, head to the next post!

 

 

 

 

 

Fifth. EDA & Visualization - 📊

 

ubiquant = train_set.copy()

 

 

1) Check time_id

 

To get a sense of how time_id is distributed, I grouped the time_id values by investment_id.

For each investment_id, I applied count(), mean(), and std() to its time_id values and plotted each result as a histogram.

 

# Number of time_id rows recorded for each investment_id
time_count = ubiquant['time_id'].groupby(ubiquant['investment_id']).count()
time_count.plot(kind='hist', bins=25, grid=True, title='time_count')
plt.show()

# Average time_id per investment_id (roughly where in the timeline it sits)
time_mean = ubiquant['time_id'].groupby(ubiquant['investment_id']).mean()
time_mean.plot(kind='hist', bins=25, grid=True, title='time_mean')
plt.show()

# Spread of time_id per investment_id
time_std = ubiquant['time_id'].groupby(ubiquant['investment_id']).std()
time_std.plot(kind='hist', bins=25, grid=True, title='time_std')
plt.show()

# Free the intermediate Series to save memory
del time_count
del time_mean
del time_std
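
As a small illustration of the grouping above, here is a minimal sketch on a made-up toy frame (the values are not from the competition data). It computes the same three statistics in one call with agg, which may be easier to read than three separate groupby calls.

```python
import pandas as pd

# Toy stand-in for the competition data (values are invented for illustration).
toy = pd.DataFrame({
    "investment_id": [1, 1, 1, 2, 2, 3],
    "time_id":       [0, 1, 2, 0, 2, 5],
})

# One row per investment_id: how many time steps it has, their mean, and their spread.
time_stats = toy.groupby("investment_id")["time_id"].agg(["count", "mean", "std"])
print(time_stats)
```

Note that an investment_id with a single row gets NaN for std, so such IDs drop out of the std histogram.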

 

 

 

 

 

2) Scatter plot

 

 

Here is a quick scatter-matrix plot of a few columns.

 

from pandas.plotting import scatter_matrix

attri = ['investment_id', 'time_id', 'f_0', 'f_1']
scatter_matrix(ubiquant[attri], figsize = (12,8))

 

 

 

 

 

3) Check Outlier

 

 

Before training, I looked for outliers.

After grouping by 'investment_id', I measured the count and the mean of the target for each group.

 

 

investment_count = ubiquant.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Count of investment by target")
plt.show()

investment_mean = ubiquant.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Mean of investment by target")
plt.show()

ax = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color = 'blue')
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()
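
To inspect the extreme groups directly rather than only through the plots, one option is to put the count and mean side by side and sort by absolute mean. This is a minimal sketch on an invented toy frame; the column names observations, mean_target, and abs_mean are my own labels, not part of the original kernel.

```python
import pandas as pd

# Toy data: investment 30 has a single observation with a large target.
toy = pd.DataFrame({
    "investment_id": [10, 10, 10, 20, 20, 30],
    "target":        [0.1, -0.1, 0.0, 0.2, 0.0, 2.0],
})

# Count and mean of target per investment_id in one table.
summary = (
    toy.groupby("investment_id")["target"]
       .agg(observations="count", mean_target="mean")
)

# Rank groups by how far their mean target sits from zero.
summary["abs_mean"] = summary["mean_target"].abs()
summary = summary.sort_values("abs_mean", ascending=False)
print(summary)
```

Groups with few observations tend to show up at the top of this table, which is worth keeping in mind before treating them as outliers.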

 

 

 

 

 

 

 

Sixth. Feature Engineering - 🛠

 

 

1) Make label

 

 

train_x = train_set.drop(['target', 'row_id'], axis=1).copy()
train_target = train_set['target'].copy()
display(train_x.head())
train_target.head()

 

1290902    1.552734
2284606    8.234375
2070141   -1.006836
2284188   -0.957031
901119     0.775391
Name: target, dtype: float16

 

 

 

 

2) Remove outlier

 

 

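Remove the outliers that showed up in the plots. Here, any investment_id whose mean target has an absolute value of 0.15 or more is treated as an outlier.
That said, if model performance drops after training on the filtered data, you should go back and check whether these points really are outliers.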

 

 

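# Step 2.
# Mean target per investment_id, as a two-column frame (investment_id, mean)
outlier_id = investment_mean.reset_index(name='mean')
# Despite the name, this holds the IDs to keep: those with |mean| < 0.15
outlier_id = outlier_id[abs(outlier_id['mean']) < 0.15]
outlier_id = outlier_id['investment_id'].tolist()

# Keep only rows whose investment_id passed the filter (outlier IDs are dropped)
remove_df = train_set[train_set['investment_id'].isin(outlier_id)].copy()
remove_df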
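
The filtering step can also be wrapped in a small helper so the threshold is easy to vary when checking whether the removed points really are outliers. This is a sketch under my own naming: keep_inlier_investments and its threshold parameter are hypothetical, and the toy data is invented.

```python
import pandas as pd

def keep_inlier_investments(df, threshold=0.15):
    """Return rows whose investment_id has |mean target| below the threshold."""
    mean_by_id = df.groupby("investment_id")["target"].mean()
    keep_ids = mean_by_id[mean_by_id.abs() < threshold].index
    return df[df["investment_id"].isin(keep_ids)].copy()

# Toy data: investment 30 has a mean target far from zero.
toy = pd.DataFrame({
    "investment_id": [10, 10, 10, 20, 20, 30],
    "target":        [0.1, -0.1, 0.0, 0.2, 0.0, 2.0],
})

filtered = keep_inlier_investments(toy, threshold=0.15)
print(f"kept {len(filtered)} of {len(toy)} rows")
```

Printing how many rows survive makes it easier to compare thresholds against validation performance later.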

 

 

After removing the outliers, I redraw the plots.

In the earlier plots, the outliers stretched the axes over regions that carry little information. With them removed, the plots zoom in on the main body of the distribution (see the second and third plots).

 

# Step 3.
investment_count = remove_df.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Count of investment by target")
plt.show()

investment_mean = remove_df.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color = 'blue', alpha = 0.4)
plt.title("Mean of investment by target")
plt.show()

ax = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color = 'blue')
ax.ax_joint.set_xlabel('observations')
ax.ax_joint.set_ylabel('mean target')
plt.show()

 

 

 

 

 

 

3) Scaling & Simple pipeline
However, f_0 through f_299 appear to be on a similar scale, so scaling is not needed here.

 

 

Normally, scaling is applied so that training on the variables goes more smoothly.

Scaling means bringing the variables onto similar units. The data in this competition appears to be scaled already, so I only wrote the code and did not actually run it.

 

# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler

# f_num_pipeline = Pipeline([
#     ('std_scaler', StandardScaler())
# ])

# ubi_f_pipe = f_num_pipeline.fit_transform(train_set[features])

 

del train_set
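
One way to back up the "already scaled" observation is to look at the mean and standard deviation of each feature before deciding to skip the scaler. The sketch below uses randomly generated toy columns, one of them deliberately unscaled; the 0.5 tolerances are arbitrary assumptions, not values from the original kernel.

```python
import numpy as np
import pandas as pd

# Toy features: f_0 and f_1 are roughly standard normal, f_2 is deliberately unscaled.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["f_0", "f_1", "f_2"])
toy["f_2"] = toy["f_2"] * 50 + 10

# Per-feature mean and std, with one row per feature.
stats = toy.agg(["mean", "std"]).T

# Flag features whose mean is far from 0 or whose std is far from 1.
far_mean = stats["mean"].abs() > 0.5
far_std = (stats["std"] - 1).abs() > 0.5
needs_scaling = stats[far_mean | far_std]
print(needs_scaling.index.tolist())
```

If this list comes back empty on the real features, skipping StandardScaler is a reasonable choice.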