2022. 4. 3. 11:08
[Original Kaggle kernel] If this was helpful, please upvote >_<
* This post walks through a simple EDA and feature engineering. If you want to see the ML training, head to the next post!
Fifth. EDA & Visualization
ubiquant = train_set.copy()
1) Check time_id
To understand how time_id is distributed, I grouped the time_id values by investment_id.
For each investment_id, I applied count() / mean() / std() to its time_id values and plotted the results as histograms.
time_count = ubiquant['time_id'].groupby(ubiquant['investment_id']).count()
time_count.plot(kind='hist', bins=25, grid=True, title='time_count')
plt.show()
time_mean = ubiquant['time_id'].groupby(ubiquant['investment_id']).mean()
time_mean.plot(kind='hist', bins=25, grid=True, title='time_mean')
plt.show()
time_std = ubiquant['time_id'].groupby(ubiquant['investment_id']).std()
time_std.plot(kind='hist', bins=25, grid=True, title='time_std')
plt.show()
del time_count
del time_mean
del time_std
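The three groupby calls above can also be collapsed into a single pass. Below is a minimal sketch on a made-up toy frame (the `toy` data and the `time_stats` name are my own placeholders, not from the kernel), assuming pandas is available:

```python
import pandas as pd

# Toy stand-in for the competition data (values are invented for illustration).
toy = pd.DataFrame({
    "investment_id": [0, 0, 0, 1, 1, 2],
    "time_id":       [0, 1, 2, 0, 2, 5],
})

# One groupby pass yields count, mean and std of time_id per investment_id.
time_stats = toy.groupby("investment_id")["time_id"].agg(["count", "mean", "std"])
print(time_stats)
```

An investment_id with a single row gets NaN for std, since the sample standard deviation needs at least two points. On the real data, something like `time_stats.hist(bins=25)` should draw all three histograms at once.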
2) Scatter plot
Next, a quick look at a few columns with a scatter plot matrix.
from pandas.plotting import scatter_matrix
attri = ['investment_id', 'time_id', 'f_0', 'f_1']
scatter_matrix(ubiquant[attri], figsize=(12, 8))
plt.show()
3) Check Outlier
Before training, I looked for outliers.
After grouping by 'investment_id', I measured the count and the mean of the target for each group.
# Number of target observations per investment_id
investment_count = ubiquant.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color='blue', alpha=0.4, ax=ax)
ax.set_title("Count of investment by target")
plt.show()
# Mean target per investment_id
investment_mean = ubiquant.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color='blue', alpha=0.4, ax=ax)
ax.set_title("Mean of investment by target")
plt.show()
# jointplot returns a JointGrid (not an Axes), so name it accordingly
g = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color='blue')
g.ax_joint.set_xlabel('observations')
g.ax_joint.set_ylabel('mean target')
plt.show()
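If you want a number to go with the joint plot, the count and mean can be put side by side and their correlation computed. This is only a sketch on invented toy data; `per_asset`, `n_obs` and `mean_target` are names I made up for illustration:

```python
import pandas as pd

# Toy stand-in: investment 0 has many observations and a small mean,
# investment 2 has a single extreme observation (values are invented).
toy = pd.DataFrame({
    "investment_id": [0, 0, 0, 0, 1, 1, 2],
    "target":        [0.1, -0.1, 0.2, 0.0, 0.5, 0.7, 3.0],
})

# Count and mean of target per investment_id in one table.
per_asset = toy.groupby("investment_id")["target"].agg(
    n_obs="count", mean_target="mean"
)

# Pearson correlation between number of observations and mean target.
corr = per_asset["n_obs"].corr(per_asset["mean_target"])
print(per_asset)
print(corr)
```

In this toy case the correlation is negative because the investments with few rows have the largest means, which is the kind of pattern the joint plot is meant to reveal.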
Sixth. Feature Engineering
1) Make label
train_x = train_set.drop(['target', 'row_id'], axis=1).copy()
train_target = train_set['target'].copy()
display(train_x.head())
train_target.head()
1290902 1.552734
2284606 8.234375
2070141 -1.006836
2284188 -0.957031
901119 0.775391
Name: target, dtype: float16
2) Remove outlier
Here I remove the outliers that showed up in the graphs.
However, if performance drops after training, I should go back and check whether they really are outliers.
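# Step 2. Compute each investment_id's mean target and select the IDs with |mean| < 0.15.
# Note: despite its name, outlier_id ends up holding the IDs to keep, not the outliers.
outlier_id = investment_mean.reset_index(name='mean')
outlier_id = outlier_id[abs(outlier_id['mean']) < 0.15]
outlier_id = outlier_id['investment_id'].tolist()
# Keep only the rows of the selected IDs, which drops the outlier investments
remove_df = train_set[train_set['investment_id'].isin(outlier_id)].copy()
remove_df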
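The same filter can be wrapped in a small helper that also reports what was dropped, which makes it easier to revisit the threshold later. This is a rough sketch, not the kernel's code: `keep_small_mean_ids` is a hypothetical name, the toy data is invented, and the 0.15 default simply mirrors the cutoff used above.

```python
import pandas as pd

def keep_small_mean_ids(df, threshold=0.15):
    """Keep rows whose investment_id has |mean target| below threshold."""
    # transform("mean") broadcasts each group's mean back onto its rows.
    mean_by_id = df.groupby("investment_id")["target"].transform("mean")
    return df[mean_by_id.abs() < threshold].copy()

# Toy stand-in for train_set (values are invented for illustration).
toy = pd.DataFrame({
    "investment_id": [0, 0, 0, 0, 1, 1, 2],
    "target":        [0.1, -0.1, 0.2, 0.0, 0.5, 0.7, 3.0],
})

filtered = keep_small_mean_ids(toy)
dropped_ids = sorted(
    int(i) for i in set(toy["investment_id"]) - set(filtered["investment_id"])
)
print(f"kept {len(filtered)} of {len(toy)} rows; dropped ids: {dropped_ids}")
```

Printing the number of dropped rows and IDs gives a quick sanity check on how aggressive the cutoff is before committing to it.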
After removing the outliers, I draw the graphs again.
In the earlier graphs, the outliers stretched the axes so that the plots covered regions we don't need. Now the main body of the distribution is shown at a larger scale (see the 2nd and 3rd graphs).
# Step 3. Redraw the same plots on the filtered data
investment_count = remove_df.groupby(['investment_id'])['target'].count()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_count.plot.hist(bins=60, color='blue', alpha=0.4, ax=ax)
ax.set_title("Count of investment by target")
plt.show()
investment_mean = remove_df.groupby(['investment_id'])['target'].mean()
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
investment_mean.plot.hist(bins=60, color='blue', alpha=0.4, ax=ax)
ax.set_title("Mean of investment by target")
plt.show()
# jointplot returns a JointGrid (not an Axes), so name it accordingly
g = sns.jointplot(x=investment_count, y=investment_mean, kind='reg',
                  height=8, color='blue')
g.ax_joint.set_xlabel('observations')
g.ax_joint.set_ylabel('mean target')
plt.show()
3) Scaling & Simple pipeline
Normally, the variables are scaled so that training goes more smoothly.
Scaling means bringing the variables onto similar units. However, f_0 ~ f_299 in this competition's data appear to be on a similar scale already, so I only wrote the code and did not actually run it.
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import StandardScaler
# f_num_pipeline = Pipeline([
# ('std_scaler', StandardScaler())
# ])
# ubi_f_pipe = f_num_pipeline.fit_transform(train_set[features])
del train_set
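One way to back up the "already scaled" impression is to look at each feature's mean and standard deviation directly. The sketch below uses synthetic data; the `f_unscaled` column and the 0.1 tolerances are arbitrary assumptions of mine, chosen only to show what a flagged feature looks like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: three roughly standardized features plus one that is not.
toy = pd.DataFrame(rng.standard_normal((10_000, 3)), columns=["f_0", "f_1", "f_2"])
toy["f_unscaled"] = 50 + 10 * rng.standard_normal(10_000)

# Per-feature mean and std, one row per feature.
scale_summary = toy.agg(["mean", "std"]).T

# Flag features whose mean is far from 0 or whose std is far from 1
# (the 0.1 tolerances are arbitrary and only for illustration).
needs_scaling = (scale_summary["mean"].abs() > 0.1) | (
    (scale_summary["std"] - 1).abs() > 0.1
)
print(scale_summary)
print(needs_scaling)
```

If nothing is flagged on the real features, skipping StandardScaler is a reasonable call. Note that the competition columns are stored as float16 here, so casting to float32 before aggregating is probably safer.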