[Analytics] ๋ฏธ๊ตญ ํƒ์‹œ๋กœ ์ƒ์กดํ•˜๋Š” ๋ฐฉ๋ฒ• (2) feat. TLC

2022. 5. 25. 19:38ใ†๐Ÿงช Data Science/Analytics

๋‹น์‹ ์€ NEWYORK์—์„œ ํƒ์‹œ ๊ธฐ์‚ฌ๋กœ ์ผํ•˜๊ณ  ์žˆ๋‹ค.

ํƒ์‹œ๋Š” Yellow Taxi์ด๋‹ค.

ํƒ์‹œ ๊ธฐ์‚ฌ๋กœ ์ƒ์กดํ•˜๊ธฐ ์œ„ํ•ด์„  ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋…ธ๋ ฅ์ด ํ•„์š”ํ•˜๋‹ค.

 

 

1) ๊ทผ๋ฌด ์‹œ๊ฐ„ ๋™์•ˆ, ์†๋‹˜์„ ์ตœ๋Œ€ํ•œ ๋งŽ์ด ํƒœ์šธ ๊ฒƒ (์†๋‹˜)

2) ๋งŽ์€ ์š”๊ธˆ + ํ’๋ถ€ํ•œ ํŒ (์š”๊ธˆ)

3) ํ˜„๊ธˆ ๊ฒฐ์ œ๋Š” ์†Œ๋“์— ํฌํ•จ์‹œํ‚ค์ง€ ์•Š๊ณ  ์Šฌ์ฉํ•˜๊ธฐ (์„ธ๊ธˆ ์ ˆ์•ฝ)

 

 

์ฒซ ๋ฒˆ์งธ ๋…ธ๋ ฅ์€ ์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๋‹ค๋ค˜๋‹ค.

์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„  ๋งŽ์€ ์š”๊ธˆ๊ณผ ํŒ์„ ๋ฐ›๋Š” ๋ฐฉ๋ฒ•, ํ˜„๊ธˆ ๊ฒฐ์ œํ•˜๋Š” ์†๋‹˜์„ ๋ฐ›๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๊ฒ ๋‹ค.

 

 

[Data]

๋ฐ์ดํ„ฐ๋Š” SparkSQL ํฌ์ŠคํŒ…์—์„œ ์ „ ์ฒ˜๋ฆฌํ•œ Yellow Taxi ๋ฐ์ดํ„ฐ(cleaned)๋ฅผ ์‚ฌ์šฉํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค.

NEWYORK Yellow Taxi์˜ ์šดํ–‰์„ 2021.01~2021.07๊นŒ์ง€ ๋ชจ๋‘ ๋ชจ์•„๋†“์€ ๋ฐ์ดํ„ฐ๋‹ค.

[Yellow Taxi Data: https://mengu.tistory.com/50]

 

[SparkSQL] ํƒ์‹œ ๋ฐ์ดํ„ฐ ๋‹ค์šด/์ „์ฒ˜๋ฆฌ/๋ถ„์„ feat. TLC

์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๊ณต๋ถ€ํ•œ SparkSQL ์ง€์‹์„ ๋ฐ”ํƒ•์œผ๋กœ, ์‹ค์ œ Taxi ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒ˜๋ฆฌํ•ด๋ณด์ž. * ์ „์ฒ˜๋ฆฌ๋ž€? ์ด์ƒ์น˜ ์ œ๊ฑฐ, ๊ทธ๋ฃนํ™” ๋“ฑ ๋ฐ์ดํ„ฐ ๋ถ„์„์ด ์šฉ์ดํ•˜๋„๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ˜•ํ•˜๋Š” ๊ณผ์ •์„ ๋งํ•œ๋‹ค. TLC Trip Recor

mengu.tistory.com

 

 

 

 

Basic Setting

 

from matplotlib import font_manager, rc
font_path = 'C:\\WINDOWS\\Fonts\\HBATANG.TTF'
font = font_manager.FontProperties(fname=font_path).get_name()
rc('font', family=font)

import os
import findspark
findspark.init(os.environ.get("SPARK_HOME"))
import pyspark
from pyspark import SparkConf, SparkContext
import pandas as pd
import faulthandler
faulthandler.enable()
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName("taxi-analysis").getOrCreate()

# ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š” ํŒŒ์ผ
zone_data = "C:/DE study/data-engineering/01-spark/data/taxi_zone_lookup.csv"
trip_files = "C:/DE study/data-engineering/01-spark/data/trips/*"

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
trips_df = spark.read.csv(f"file:///{trip_files}", inferSchema = True, header = True)
zone_df = spark.read.csv(f"file:///{zone_data}", inferSchema = True, header = True)

# ๋ฐ์ดํ„ฐ ์Šคํ‚ค๋งˆ
trips_df.printSchema()
zone_df.printSchema()

# ๋ฐ์ดํ„ฐ createOrReplaceTempView()
trips_df.createOrReplaceTempView("trips")
zone_df.createOrReplaceTempView("zone")


# ์“ธ๋งŒํ•œ column๋งŒ ๋ชจ์•„๋†“์ž.
query = '''
SELECT
    t.VendorID,
    TO_DATE(t.tpep_pickup_datetime) AS pickup_date,
    HOUR(t.tpep_pickup_datetime) AS pickup_time,
    TO_DATE(t.tpep_dropoff_datetime) AS dropoff_date,
    HOUR(t.tpep_dropoff_datetime) AS dropoff_time,
    t.passenger_count,
    t.trip_distance,
    t.payment_type,
    t.fare_amount,
    t.tip_amount,
    t.tolls_amount,
    t.total_amount,
    pz.Zone as pzone,
    dz.Zone as dzone
FROM
    trips t
    LEFT JOIN 
        zone pz
    ON
        t.PULocationID == pz.LocationID
    LEFT JOIN 
        zone dz
    ON
        t.DOLocationID == dz.LocationID
'''
taxi_df = spark.sql(query)
taxi_df.createOrReplaceTempView("taxi")


spark.sql('select * from taxi').show(5)


+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
|VendorID|pickup_date|pickup_time|dropoff_date|dropoff_time|passenger_count|trip_distance|payment_type|fare_amount|tip_amount|tolls_amount|total_amount|            pzone|         dzone|
+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
|       2| 2021-03-01|          0|  2021-03-01|           0|              1|          0.0|           2|        3.0|       0.0|         0.0|         4.3|               NV|            NV|
|       2| 2021-03-01|          0|  2021-03-01|           0|              1|          0.0|           2|        2.5|       0.0|         0.0|         3.8|   Manhattanville|Manhattanville|
|       2| 2021-03-01|          0|  2021-03-01|           0|              1|          0.0|           2|        3.5|       0.0|         0.0|         4.8|   Manhattanville|Manhattanville|
|       1| 2021-03-01|          0|  2021-03-01|           0|              0|         16.5|           1|       51.0|     11.65|        6.12|       70.07|LaGuardia Airport|            NA|
|       2| 2021-03-01|          0|  2021-03-01|           0|              1|         1.13|           1|        5.5|      1.86|         0.0|       11.16|     East Chelsea|            NV|
+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
only showing top 5 rows



# cleaning -------------------------------------------------------------------

# data clearning
# fare, tip, tolls ์— ๋Œ€ํ•œ ๊ฒƒ์€ ๋‚˜์ค‘์— df ๋งŒ๋“ค๋•Œ ์ „์ฒ˜๋ฆฌํ•ด์ฃผ์ž.
query = '''
SELECT
    *
FROM
    taxi t
WHERE
    t.total_amount < 2000
    AND t.total_amount > 0
    AND t.trip_distance < 100
    AND t.passenger_count < 6
    AND t.pickup_date >= '2021-01-01'
    AND t.pickup_date < '2021-08-01'
    AND t.dropoff_date >= '2021-01-01'
    AND t.dropoff_date < '2021-08-03'
'''

df_c = spark.sql(query)
df_c.createOrReplaceTempView('cleaned')

 

 

 

 

 

 

 

์ „๋žต 2. ๋งŽ์€ ์š”๊ธˆ & ๋งŽ์€ ํŒ


๋งŽ์€ ์†๋‹˜์„ ๋ฐ›์œผ๋ฉด์„œ, ์š”๊ธˆ๊ณผ ํŒ๊นŒ์ง€ ๋งŽ์ด ๋ฒŒ์–ด๋“ค์ผ ์ˆ˜ ์žˆ๋‹ค๋ฉด ์ˆ˜์ž…์ด ์ ์ ํ•  ๊ฒƒ์ด๋‹ค.
์–ด๋–ค ์ƒํ™ฉ์—์„œ ์š”๊ธˆ๊ณผ ํŒ์ด ๋งŽ์ด ๋‚˜์˜ฌ๊นŒ? ์š”๊ธˆ๊ณผ ํŒ์ด ๋†’์€ ๊ฒฝ์šฐ๋ฅผ ๋ถ„์„ํ•ด๋ณด๊ณ , ๋”๋‚˜์•„๊ฐ€ ์ผ๋ถ€ ์นผ๋Ÿผ ๋งŒ์œผ๋กœ ์š”๊ธˆ๊ณผ ํŒ์„ ์˜ˆ์ธกํ•ด๋ณด์ž. ์†๋‹˜์ด ์ฝœํ–ˆ์„ ๋•Œ, ์š”๊ธˆ๊ณผ ํŒ์ด ๋งŽ์ด ๋‚˜์˜ฌ ๊ฒƒ์œผ๋กœ ์˜ˆ์ธกํ•˜๋ฉด ์ฝœ์„ ์ˆ˜๋ฝํ•˜๋Š” API๋ฅผ ๋งŒ๋“ค์–ด๋ณด์ž.

 

 

 

 

(1) ์š”๊ธˆ & ํŒ ๋ถ„์„

 

 

A. ์š”์ผ๊ณผ fare, distance, tip์˜ ๊ด€๊ณ„

๋จผ์ € ์š”์ผ๋ณ„๋กœ fare, distance, tip์˜ ํ‰๊ท ์„ ๋น„๊ตํ–ˆ๋‹ค.

์›”ํ™”์ˆ˜๋ชฉ๊ธˆํ† ์ผ์€ ๊ฑฐ๋ฆฌ, ์š”๊ธˆ, ํŒ์— ์˜ํ–ฅ์„ ์คฌ์„๊นŒ?

 

 

# ์š”์ผ๋ณ„ fare, distance, tip ํ‰๊ท 
query = '''

SELECT
    DATE_FORMAT(c.pickup_date, 'EEEE') AS day_of_week,
    MEAN(c.trip_distance) AS distance_mean,
    MEAN(c.fare_amount) AS fare_mean,
    MEAN(c.tip_amount) AS tip_mean
FROM
    cleaned c
GROUP BY
    day_of_week
'''

weekday_df = spark.sql(query).toPandas()


# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ

for i in ['distance_mean', 'fare_mean', 'tip_mean']:
    fig, ax = plt.subplots(figsize=(16,6))
    plt.title(i, fontsize = 30)
    sns.barplot(
        x = 'day_of_week',
        y = i,
        data = weekday_df
    )

 

 

์š”์ผ๋ณ„ ๊ฑฐ๋ฆฌ๋Š” ์›”์š”์ผ -> ์ผ์š”์ผ๋กœ ํ–ฅํ• ์ˆ˜๋ก ์ปค์ง„๋‹ค. ์ฃผ๋ง์— ์ถœํ‡ด๊ทผ๋ณด๋‹จ ์ƒ‰๋‹ค๋ฅธ ๊ณณ์— ๋†€๋Ÿฌ ๊ฐ€๊ฑฐ๋‚˜ ๋จผ ๊ณณ์— ๋ณผ ์ผ์„ ๊ฐ€๋Š” ์ผ์ด ๋” ๋งŽ์•„์„œ ์ด๋Ÿฐ ์–‘์ƒ์ด ๋‚˜์˜จ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.

 

 

 

fare๋Š” ์š”์ผ๋ณ„๋กœ ํฌ๊ฒŒ ์ฐจ์ด๊ฐ€ ๋‚˜๋ณด์ด์ง€ ์•Š๋Š”๋‹ค. fare๊ฐ€ distance์ฒ˜๋Ÿผ ์›”์š”์ผ -> ์ผ์š”์ผ๋กœ ์ปค์ง€๋Š” ์–‘์ƒ์„ ๋ ์ง€๋งŒ, distance๊ฐ€ ์ปค์กŒ๊ธฐ์— ๋‹น์—ฐํ•œ ์ˆ˜์ˆœ์ด๋‹ค. ์š”์ผ ์ž์ฒด๊ฐ€ fare์— ํฐ ์˜ํ–ฅ์„ ์ฃผ๋Š” ๊ฒƒ ๊ฐ™์ง€๋Š” ์•Š๋‹ค.. 

 

 

 

fare์— ๋ฐ˜ํ•ด, ํŒ์€ ์œ ์˜๋ฏธํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์• ์ดˆ์— ํŒ์€ ๋ณด๋„ˆ์Šค์˜ ๋Š๋‚Œ์ด๊ธฐ์— 1๋‹ฌ๋Ÿฌ๋งŒ ๋ฐ›์•„๋„ ๊ฐœ๊ฟ€์ด๋‹ค. ์ผ์š”์ผ์— ์‚ฌ๋žŒ๋“ค์ด ๊ธฐ๋ถ„์ด ์ข‹์€์ง€ tip์„ ๋” ๋งŽ์ด ์ค€๋‹ค. ์–ด๋ผ? ์•ž์—์„œ ๊ฑฐ๋ฆฌ๊ฐ€ ์ผ์š”์ผ๋กœ ๊ฐˆ์ˆ˜๋ก ๋Š˜์–ด๋‚ฌ๋‹ค. ํŒ๋„ ์›”์š”์ผ -> ์ผ์š”์ผ๋กœ ๊ฐˆ์ˆ˜๋ก ์กฐ๊ธˆ์”ฉ ๋Š˜์–ด๋‚˜๋Š” ์–‘์ƒ์ธ๋ฐ, ๊ทธ๋ ‡๋‹ค๋ฉด tip๊ณผ distance๊ฐ€ ๊ด€๊ณ„์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹๊นŒ?

 

 

 

 

 

 

 

B. Distance์™€ Fare, Tip ๊ณผ์˜ ๊ด€๊ณ„

 

๊ฑฐ๋ฆฌ์™€ ์š”๊ธˆ ๋ฐ ํŒ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ๋ฐํžˆ๊ธฐ ์œ„ํ•ด ์‚ฐํฌ๋„ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ธ๋‹ค. ์ด์–ด ํ”ผ์–ด์Šจ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ํ›„, ํžˆํŠธ๋งต์„ ๊ทธ๋ ธ๋‹ค.

 

 

query = '''
SELECT
    c.trip_distance,
    c.fare_amount,
    c.tip_amount
FROM
    cleaned c
WHERE
    c.fare_amount > 0
    AND c.fare_amount < 100
    AND c.tip_amount > 0
    AND c.tip_amount < 100
'''

distance_df = spark.sql(query).toPandas()

for i in ['fare_amount', 'tip_amount']:
    fig, ax = plt.subplots(figsize=(16,6))
    plt.title(i, fontsize = 30)
    sns.scatterplot(
        x = 'trip_distance',
        y = i,
        data = distance_df
    )
    
fig, ax = plt.subplots(figsize=(16,6))
plt.title("Person Correlation of Features", fontsize = 30)
sns.heatmap(distance_df.corr(), cmap=colormap, annot=True)

 

 

 

[๊ฑฐ๋ฆฌ์™€ ์š”๊ธˆ ๋ฐ์ดํ„ฐ์˜ ์‚ฐํฌ๋„]

 

์š”๊ธˆ์€ ์• ์ดˆ์— ๊ฑฐ๋ฆฌ์— ๋น„๋ก€ํ•จ์œผ๋กœ ๊ทธ๋ž˜ํ”„๋Š” ๋‹น์—ฐํ•œ ๊ฒฐ๊ณผ๋‹ค.

 

 

 

 

 

[๊ฑฐ๋ฆฌ์™€ ํŒ ๋ฐ์ดํ„ฐ์˜ ์‚ฐํฌ๋„]

 

ํŒ์€ ์š”๊ธˆ๋งŒํผ์ด๋‚˜ ๊ฑฐ๋ฆฌ์™€ ๋น„๋ก€ํ•˜์ง„ ์•Š๋‹ค.

๊ฑฐ๋ฆฌ๊ฐ€ 10 mile ์ดํ•˜์ธ ๊ฒฝ์šฐ, ํŒ์ด ๋ฒ”์œ„๊ฐ€ ๊ฑฐ์˜ ๋น„์Šทํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ 10 mile์ด ๋„˜์–ด๊ฐˆ์ˆ˜๋ก ํŒ๋„ ์กฐ๊ธˆ์”ฉ ์ฆ๊ฐ€ํ•จ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

 

 

[๊ฑฐ๋ฆฌ, ์š”๊ธˆ, ํŒ์˜ ์ƒ๊ด€๊ด€๊ณ„ ํžˆํŠธ๋งต]

 

๊ฑฐ๋ฆฌ & ํŒ์˜ ์ƒ๊ด€๊ด€๊ณ„ : 0.78

๊ฑฐ๋ฆฌ & ์š”๊ธˆ์˜ ์ƒ๊ด€๊ด€๊ณ„ : 0.95

์š”๊ธˆ & ํŒ์˜ ์ƒ๊ด€๊ด€๊ณ„ : 0.81 

์ด๋“ค์€ ๋งค์šฐ ์ค‘์š”ํ•œ ๊ฒฐ๊ณผ๋‹ค. ํŠนํžˆ ์š”๊ธˆ์ด ์ปค์งˆ ์ˆ˜๋ก ํŒ๋„ ์ปค์ง„๋‹ค๋Š” ๊ฒƒ์€, ๊ณง ๊ฑฐ๋ฆฌ๊ฐ€ ๋ฉ€์–ด์ง€๋ฉด ํŒ๋„ ์ปค์งˆ ํ™•๋ฅ ์ด ํผ์„ ์˜๋ฏธํ•œ๋‹ค.

 

 

 

 

 

 

C. ์žฅ์†Œ๋ณ„ Fare, Tip ๊ด€๊ณ„ ๋ถ„์„

 

์žฅ์†Œ์— ๋”ฐ๋ผ์„œ ์š”๊ธˆ๊ณผ ํŒ์ด ๋‹ฌ๋ผ์งˆ๊นŒ? ํ™•์ธํ•œ ๊ฒฐ๊ณผ, ์žฅ์†Œ์— ๋”ฐ๋ผ ์š”๊ธˆ๊ณผ ํŒ์˜ ํ‰๊ท ์ด ์ฒœ์ฐจ๋งŒ๋ณ„๋กœ ๋‚˜๋‰จ์„ ํ™•์ธํ–ˆ๋‹ค. ๋ฐ์ดํ„ฐ์— ํฌํ•จ๋œ ์žฅ์†Œ๊ฐ€ ์›Œ๋‚™ ๋งŽ์•„์„œ, Top 5์œผ๋กœ ์š”๊ธˆ๊ณผ ํŒ์ด ๋†’์€ ์žฅ์†Œ๋ฅผ ์ถ”๋ ธ๋‹ค.

 

# pzone
query = '''

SELECT
    c.pzone,
    MEAN(c.trip_distance) AS distance_mean,
    MEAN(c.fare_amount) AS fare_mean,
    MEAN(c.tip_amount) AS tip_mean
FROM
    cleaned c
GROUP BY
    c.pzone
ORDER BY
    fare_mean desc
LIMIT 5
'''

pzone_df = spark.sql(query).toPandas()


# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
fig, ax = plt.subplots(figsize=(16,6))
plt.title('Top 5 pzone in fare_mean', fontsize = 30)
sns.barplot(
    x = 'pzone',
    y = 'fare_mean',
    data = pzone_df
)


--------

# pzone
query = '''

SELECT
    c.pzone,
    MEAN(c.trip_distance) AS distance_mean,
    MEAN(c.fare_amount) AS fare_mean,
    MEAN(c.tip_amount) AS tip_mean
FROM
    cleaned c
GROUP BY
    c.pzone
ORDER BY
    tip_mean desc
LIMIT 5
'''

pzone_df = spark.sql(query).toPandas()


# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
fig, ax = plt.subplots(figsize=(16,6))
plt.title('Top 5 pzone in tip_mean', fontsize = 30)
sns.barplot(
    x = 'pzone',
    y = 'tip_mean',
    data = pzone_df
)

-------

# dzone
query = '''

SELECT
    c.dzone,
    MEAN(c.trip_distance) AS distance_mean,
    MEAN(c.fare_amount) AS fare_mean,
    MEAN(c.tip_amount) AS tip_mean
FROM
    cleaned c
GROUP BY
    c.dzone
ORDER BY
    fare_mean desc
LIMIT 5
'''

dzone_df = spark.sql(query).toPandas()


# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
fig, ax = plt.subplots(figsize=(16,6))
plt.title('Top 5 dzone in fare_mean', fontsize = 30)
sns.barplot(
    x = 'dzone',
    y = 'fare_mean',
    data = dzone_df
)

--------

# dzone
query = '''

SELECT
    c.dzone,
    MEAN(c.trip_distance) AS distance_mean,
    MEAN(c.fare_amount) AS fare_mean,
    MEAN(c.tip_amount) AS tip_mean
FROM
    cleaned c
GROUP BY
    c.dzone
ORDER BY
    tip_mean desc
LIMIT 5
'''

dzone_df = spark.sql(query).toPandas()


# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
fig, ax = plt.subplots(figsize=(16,6))
plt.title('Top 5 dzone in tip_mean', fontsize = 30)
sns.barplot(
    x = 'dzone',
    y = 'tip_mean',
    data = dzone_df
)

 

 

 

 

[Pickup zone : ์š”๊ธˆ TOP 5]

์Šน์ฐจ ์ง€์—ญ ์ค‘ ์š”๊ธˆ Top 5๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

1์œ„. Charleston/Tottenville

2์œ„. Prince's Bay/Annadale

3์œ„. Great Kills Park

4์œ„. Rossville/Woodrow

5์œ„. Arden Heights

 

 

 

 

[Pickup zone : ํŒ TOP 5]

์Šน์ฐจ ์ง€์—ญ ์ค‘ ํŒ์ด ๊ฐ€์žฅ ๋งŽ์€ Top 5์ด๋‹ค. ์š”๊ธˆ์ด ๋งŽ์ด ๋‚˜์˜ค๋Š” ์ง€์—ญ๊ณผ ๋‹ค์†Œ ๋‹ค๋ฆ„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

1์œ„. Oakwood

2์œ„. Newark Airport

3์œ„. NA

4์œ„. Port Richmond

5์œ„. Flushing Meadows-Corona Park

 

 

 

 

[Dropoff zone : ์š”๊ธˆ TOP 5]

ํ•˜์ฐจ ์ง€์—ญ ์ค‘ ์š”๊ธˆ์ด Top 5์ธ ๊ณณ๋“ค์ด๋‹ค. ์Šน์ฐจ ์ง€์—ญ Top 5์™€ ๊ฝค ๋น„์Šทํ•˜๋‹ค!

 

1์œ„. Charleston/Tottenville

2์œ„. Prince's Bay/Annadale

3์œ„. Rossville/Woodrow

4์œ„. Arden Heights

5์œ„. NA

 

 

 

[Dropoff zone : ํŒ TOP 5]

ํ•˜์ฐจ ์ง€์—ญ ์ค‘ ํŒ์ด Top 5์ธ ๊ณณ๋“ค์ด๋‹ค. ์Šน์ฐจ ์ง€์—ญ ํŒ Top 5์™€ ๊ฑฐ์˜ ๋น„์Šทํ•˜๋ฉฐ, JFK Airport๋งŒ ๋‹ค๋ฅด๋‹ค.

 

1์œ„. Oakwood

2์œ„. NA

3์œ„. Oakwood

4์œ„. JFK Airport

5์œ„. Prince's Bay/Annadale

 

 

 

์ด ๋ชจ๋‘๋ฅผ ๋‰ด์š• ์ง€๋„์— ๋‚˜ํƒ€๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

 

 

 

์ด์ „ ํฌ์ŠคํŒ…์—์„œ ์•Œ์•„๋ณธ ์†๋‹˜์ด ๋งŽ์€ ์ง€์—ญ์€ ๋‰ด์—ญ ๋ฒˆํ™”๊ฐ€์˜€๋‹ค.

ํ•˜์ง€๋งŒ ์š”๊ธˆ, Tip์ด ๋งŽ์€ ์ง€์—ญ์€ ๊ทธ๋ณด๋‹ค ๋ฉ€์งํžˆ ๋–จ์–ด์ง„ ๊ณณ๋“ค์ด๋‹ค. ์ด๊ณณ๋“ค์€ ๋ณดํ†ต ๋ฒˆํ™”๊ฐ€์™€ ๋–จ์–ด์ ธ ์žˆ๊ธฐ์—, ์Šน์ฐจํ•˜๋ฉด ๋จผ ๊ณณ์œผ๋กœ ๊ฐ€๊ธฐ์— ์š”๊ธˆ์„ ๋งŽ์ด ๋‚ด์•ผ ํ•œ๋‹ค. ๋˜ ์ด๊ณณ๋“ค์— ํ•˜์ฐจํ•˜๋ ค๋ฉด ๋จผ ๊ณณ์—์„œ ์™€์•ผ ํ•˜๊ธฐ์— ๋˜ ์š”๊ธˆ์„ ๋งŽ์ด ๋‚ด์•ผ ํ•œ๋‹ค. ์š”๊ธˆ๋„ ๋งŽ๊ณ  ๊ฑฐ๋ฆฌ๋„ ํฌ๋‹ˆ ํŒ๋„ ์ž์—ฐ์Šค๋ ˆ ์ปค์ง€๋Š” ๊ฒƒ์ด๋‹ค. ์•„๋ฌด๋ž˜๋„ ๋งˆ๋ƒฅ ํŒ๊ณผ ์š”๊ธˆ์ด ๋งŽ๋‹ค๊ณ  ์ข‹์•„ํ•ด์„œ๋Š” ์•ˆ๋˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค. ์š”๊ธˆ, ํŒ์„ ์ข€ ๋” ๋งŽ์ด ์–ป์œผ๋ ค๋‹ค ์‚ฌ๋žŒ์ด ์—†๋Š” ๊ณณ์œผ๋กœ ๊ฐ€์„œ, ๋‹ค์Œ ์†๋‹˜์„ ๋ชป ๋ฐ›์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

 

 

 

 

 

D. Fare, Tip ์˜ˆ์ธกํ•˜๊ธฐ

 

Fare, Tip๊ณผ ๋‹ค๋ฅธ ๋ณ€์ˆ˜๋“ค๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋Š” ์–ด๋Š ์ •๋„ ํŒŒ์•…ํ–ˆ๋‹ค. ์ด์   ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ํ†ตํ•ด ์•„์˜ˆ ์˜ˆ์ธกํ•˜๋Š” ๋ด‡์„ ๋งŒ๋“ค์–ด๋ณด์ž. ์•Œ์•„์„œ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•ด์„œ ์˜ˆ์ธกํ•ด ์ค„ ํ…Œ๋‹ˆ ๊ธฐ์‚ฌ๋“ค์—๊ฒ ์ด๋ณด๋‹ค ์ˆ˜์›”ํ•œ ์˜์‚ฌ๊ฒฐ์ • ๋„์šฐ๋ฏธ๊ฐ€ ์—†์„ ๊ฒƒ์ด๋‹ค. 

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •๊ณผ ๋ชจ๋ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ๊ณผ์ •์€ ์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๋‹ค๋ค˜์—ˆ๋‹ค. ์ฝ”๋“œ๊ฐ€ ์ดํ•ด๋˜์ง€ ์•Š์„ ๊ฒฝ์šฐ, ์ด์ „ ํฌ์ŠคํŒ…์—์„œ ๋ณด๊ณ  ์˜ค๊ธธ ๋ฐ”๋ž€๋‹ค.

 

 

Fare ์˜ˆ์ธก ๋ณ€์ˆ˜ : ์Šน์ฐจ ์žฅ์†Œ, ํ•˜์ฐจ ์žฅ์†Œ, ์š”์ผ, ์Šน๊ฐ ์ˆ˜, ๊ฑฐ๋ฆฌ, ์Šน์ฐจ ์‹œ๊ฐ„

Tip ์˜ˆ์ธก ๋ณ€์ˆ˜ : ์Šน์ฐจ ์žฅ์†Œ, ํ•˜์ฐจ ์žฅ์†Œ, ์š”์ผ, ์Šน๊ฐ ์ˆ˜, ๊ฑฐ๋ฆฌ, ์Šน์ฐจ ์‹œ๊ฐ„, ์ง€๋ถˆ ๋ฐฉ๋ฒ•

 

 

# fare ์˜ˆ์ธก ๋ชจ๋ธ ------------------------------------------------------------

# ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ
transform_stages = stages
pipeline = Pipeline(stages = transform_stages)
fitted_transformer = pipeline.fit(train_df)

vtrain_df = fitted_transformer.transform(train_df)
vtest_df = fitted_transformer.transform(test_df)


# ๋ชจ๋ธ๋ง ๋ฐ ํ•™์Šต
model = lr.fit(vtrain_df)


# R2
model.summary.r2

๊ฒฐ๊ณผ : 0.8121402678393606


# ๋ชจ๋ธ ์ €์žฅ
model_dir = "C:/DE study/data-engineering/01-spark/data/model"
model.save(model_dir)




# tip ์˜ˆ์ธก ๋ชจ๋ธ ------------------------------------------------------------

# ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ
transform_stages = stages
pipeline = Pipeline(stages = transform_stages)
fitted_transformer = pipeline.fit(train_df)

vtrain_df = fitted_transformer.transform(train_df)
vtest_df = fitted_transformer.transform(test_df)


# ๋ชจ๋ธ๋ง ๋ฐ ํ•™์Šต
model2 = lr.fit(vtrain_df)


# R2
model2.summary.r2

๊ฒฐ๊ณผ : 0.46363365869062256


# ๋ชจ๋ธ ์ €์žฅ
model_dir = "C:/DE study/data-engineering/01-spark/data/model"
model2.save(model_dir)

 

ํŒ ์˜ˆ์ธก ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ์ข€ ๋‚ฎ๋‹ค... ์ผ๋‹จ ์ฃผ์–ด์ง„ ์นผ๋Ÿผ์—์„œ ์ตœ์„ ์œผ๋กœ ๋ณด์ด๋ฏ€๋กœ, ๋„˜์•„๊ฐ€์ž!

 

 

 

 

 

E. Fare, ํŒ ์˜ˆ์ธกํ•ด์„œ ์ฝœ ์กฐ์–ธํ•ด์ฃผ๋Š” API

 

๋Œ€๋ง์˜ Fare, Tip ์˜ˆ์ธก API๋‹ค. ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ๊นŒ์ง€ ๊ตณ์ด ์˜ฌ๋ฆฌ์ง„ ์•Š์•˜๋‹ค. ์•ž์—์„œ ์„ค์ •ํ•œ ๋ณ€์ˆ˜์— ๋งž์ถฐ ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ–ˆ๊ณ , ์ด์— ๋งž๋Š” ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์™€์„œ Fare, Tip์„ ์˜ˆ์ธกํ•œ๋‹ค.

 

 

from pyspark.sql.types import StructType,StructField, StringType, IntegerType

def fare_tip_API(pickup_id, dropoff_id, day_of_week, passenger, distance, pickup_time, p_type):
    columns = ['pickup_location_id', 'dropoff_location_id', 'day_of_week',
              'passenger_count', 'trip_distance', 'pickup_time', 'p_type']
    data = [(pickup_id, dropoff_id, day_of_week, passenger, distance, pickup_time, p_type)]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    
    df_t = fitted_transformer.transform(df)
    df_t_2 = fitted_transformer_2.transform(df)
    fare_prediction = lr_model.transform(df_t)
    tip_prediction = lr_model2.transform(df_t_2)
    fare_prediction = fare_prediction.select(fare_prediction.prediction)
    fare = fare_prediction.rdd.flatMap(list).first()
    tip_prediction = tip_prediction.select(tip_prediction.prediction)
    tip = tip_prediction.rdd.flatMap(list).first()
    return fare, tip

 

 

 

 

์‹œ๋‚˜๋ฆฌ์˜ค ์ƒํ™ฉ

 

๋”ฐ๋œปํ•œ ์ฃผ๋ง, ์ผ์š”์ผ ์•„์นจ 6์‹œ์— ๋‹น์‹ ์€ ํƒ์‹œ๋ฅผ ์šด์ „ํ•˜๊ณ  ์žˆ๋‹ค.

์ฝœ์ด ๋“ค์–ด์™”๋‹ค.

์Šน์ฐจ ์žฅ์†Œ๋Š” Queens Astoria(7) ์ด๊ณ , ํ•˜์ฐจ ์žฅ์†Œ๋Š” Brooklyn Bay Ridge(15)์ด๋‹ค. ๊ฑฐ๋ฆฌ๋Š” ๋Œ€๋žต 17 mile. ์Šน๊ฐ์€ ๋‘ ๋ช…์ด๋‹ค. ๊ณ„์‚ฐ์€ ํ˜„๊ธˆ์œผ๋กœ ํ•˜๊ฒ ๋‹ค๊ณ  ํ•œ๋‹ค. ์ง€๊ธˆ ๋‹น์žฅ ์š”๊ธˆ๊ณผ ํŒ์„ ์˜ˆ์ธกํ•˜๋ผ!

 

 

fare, tip = fare_tip_API(7, 15, 0, 2, 17, 6, "Cash")


print(f'์˜ˆ์ƒ ์š”๊ธˆ์€ {fare} ์ž…๋‹ˆ๋‹ค.')
print(f'์˜ˆ์ƒ ํŒ์€ {tip} ์ž…๋‹ˆ๋‹ค.')


์˜ˆ์ƒ ์š”๊ธˆ์€ 42.24078328599208 ์ž…๋‹ˆ๋‹ค.
์˜ˆ์ƒ ํŒ์€ 3.6536566893473803 ์ž…๋‹ˆ๋‹ค.

 

์˜ˆ์ƒ ์š”๊ธˆ : 42.2 ๋‹ฌ๋Ÿฌ

์˜ˆ์ƒ ํŒ: 3.65 ๋‹ฌ๋Ÿฌ

 

์ฝœ์„ ๋ฐ›์„ ๊ฒƒ์ธ์ง€ ๋ง ๊ฒƒ์ธ์ง€๋Š” ๋‹น์‹ ์˜ ๋ชซ์ด๋‹ค!

 

 

 

 

 

์ „๋žต 3. ํ˜„๊ธˆ ๊ฒฐ์ œ 


๋Œ€๋ง์˜ ๋งˆ์ง€๋ง‰ ๊ผผ์ˆ˜๋‹ค. ์†๋‹˜์„ ํƒœ์›Œ์„œ ๋ˆ์„ ์—ด์‹ฌํžˆ ๋ฒŒ์—ˆ๋‹ค๋ฉด, ์„ธ๊ธˆ์„ ์ตœ๋Œ€ํ•œ ์•ˆ๊ปด์„œ ์‚ด์•„๋‚จ์•„์•ผ ํ•œ๋‹ค.
* ๋ณธ ๋ฐœ์–ธ์€ ๊ธ€์“ด์ด์˜ ์‚ถ์˜ ํƒœ๋„์™€๋Š” ์—ฐ๊ด€์žˆ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

์†๋‹˜์ด ํ˜„๊ธˆ ๊ฒฐ์ œ๋ฅผ ํ•  ๊ฒฝ์šฐ, ์„ธ๊ธˆ์„ ํ”ผํ•ด๊ฐˆ ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ƒฅ ์†Œ๋“ ์‹ ๊ณ ๋ฅผ ์•ˆ ํ•˜๋ฉด ๊ทธ๋งŒ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

๋”ฐ๋ผ์„œ ์–ด๋–ค ์žฅ์†Œ, ์–ด๋–ค ์†๋‹˜์„ ํƒœ์šธ ๋•Œ ํ˜„๊ธˆ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๋Š”์ง€ ๋ถ„์„ํ•ด๋ณด์ž.

 

 

 

[ํ˜„๊ธˆ๊ณผ ์‹ ์šฉ์นด๋“œ ๊ฒฐ์ œ์˜ ๋นˆ๋„์ˆ˜]

 

์—ญ์‹œ ์š”์ฆ˜์€ ์‹ ์šฉ์นด๋“œ๋ฅผ ๋งŽ์ด ์“ด๋‹ค.

 

 

 

[ํ˜„๊ธˆ๊ณผ ์‹ ์šฉ์นด๋“œ์˜ ํŒ ํ‰๊ท ]

 

 

 

 

์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์ผ์ด ๋ฐœ์ƒํ–ˆ๋‹ค. ํ˜„๊ธˆ์œผ๋กœ ํŒ์„ ์ค€ ๊ฒฝ์šฐ๊ฐ€ ๊ฑฐ์˜ ์—†๋‹ค๊ณ  ๋‚˜์™”๋‹ค. ์ ˆ๋Œ€ ์žˆ์„ ์ˆ˜๊ฐ€ ์—†๋Š” ์ผ์ด๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์‹œ ํ™•์ธํ–ˆ๋Š”๋ฐ ๊ณ„์† ๋˜‘๊ฐ™์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”๋‹ค. ํ•œ ๊ฐ€์ง€ ์œ ์ถ”ํ•œ ๊ฒƒ์€,

 

 

ํŒ๋„ ์†Œ๋“์— ํฌํ•จ๋˜๊ธฐ ๋•Œ๋ฌธ์—, ํ˜„๊ธˆ ๊ฒฐ์ œ๋ฅผ ํ•œ ๊ฒฝ์šฐ ๊ทธ๋ƒฅ ํŒ์„ ๊ฟ€๊บฝํ•œ ๊ฒƒ ๊ฐ™๋‹ค. ์–ด์ฉŒ๋ฉด ๋‹น์—ฐํ•œ ๊ฒฐ๊ณผ๋ผ์„œ ๋‚ฉ๋“์ด ๋œ๋‹ค. ์ด์— ๋”ฐ๋ผ ํ˜„๊ธˆ&์นด๋“œ์— ๋”ฐ๋ฅธ ํŒ ํฌ๊ธฐ์˜ ๋ณ€ํ™”๋Š” ์•Œ ์ˆ˜ ์—†๊ฒŒ ๋˜์—ˆ๋‹ค. ํŒ ๋ง๊ณ  ์š”๊ธˆ, ๊ฑฐ๋ฆฌ ์ธก๋ฉด์€ ๋น„์Šทํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”๋‹ค.

 

 

 

 

 

 

์ง€๊ธˆ๊นŒ์ง€ ํƒ์‹œ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ, ์–ด๋–ป๊ฒŒ ํ•˜๋ฉด ํƒ์‹œ๋กœ ๋ˆ์„ ๋” ๋ฒŒ ์ˆ˜ ์žˆ์„์ง€ ๊ณ ์ฐฐํ•ด ๋ณด์•˜๋‹ค. ์œ„์˜ ๋ถ„์„๋“ค์„ ๊ณฑ์”น์–ด๋ณด๋ฉฐ ๋‹น์‹ ์˜ ํƒ์‹œ ์ „๋žต์„ ์ƒˆ๋กญ๊ฒŒ ์„ธ์›Œ๋ณด๊ธธ ๋ฐ”๋ž€๋‹ค!

์—ฌ๊ธฐ๊นŒ์ง€ ์˜ค๋Š๋ผ ๊ณ ์ƒํ•˜์…จ์Šต๋‹ˆ๋‹ค.