2022. 5. 10. 11:32 · Data Engineering/Apache Spark
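Building on the SparkSQL knowledge from the previous posts, let's preprocess real taxi data.
* What is preprocessing? It is the process of transforming data so that it is easier to analyze, for example by removing outliers or grouping records.
First, download the data from TLC Trip Record Data. TLC (the New York City Taxi and Limousine Commission) publishes taxi trip records for New York City, which makes it a very useful site.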
[https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page]
"This post is a write-up of notes taken while following a Fast Campus lecture."
Data Download
1) Open the main page
2) Click and download the Yellow Taxi Trip Records (CSV) for January through July 2021
3) Download the Taxi Zone Lookup Table (CSV)
4) Create a data folder in your Spark working directory (C:\<your working path>\data), then create a trips folder inside the data folder
5) Put the Lookup Table (CSV) file in the data folder and the Yellow Taxi Trip Records (CSV) files in the trips folder
C:\<your working path>\data
C:\<your working path>\data\trips
The files for preprocessing are now in place. Let's set up Spark and start preprocessing the data.
Basic Settings
# Set the matplotlib font
from matplotlib import font_manager, rc
font_path = '<path to the font you want to use>'
font = font_manager.FontProperties(fname=font_path).get_name()
rc('font', family=font)
import os
import findspark
findspark.init(os.environ.get("SPARK_HOME"))
import pyspark
from pyspark import SparkConf, SparkContext
import pandas as pd
import faulthandler
faulthandler.enable()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName("taxi-analysis").getOrCreate()
# Locations of the data files
# 'trips/*' loads every file inside the trips folder.
zone_data = "C:/<your Spark working path>/data/taxi_zone_lookup.csv"
trip_files = "C:/<your Spark working path>/data/trips/*"
# Load the data
trips_df = spark.read.csv(f"file:///{trip_files}", inferSchema = True, header = True)
zone_df = spark.read.csv(f"file:///{zone_data}", inferSchema = True, header = True)
Inspecting the schema and applying createOrReplaceTempView()
# Data schema
trips_df.printSchema()
zone_df.printSchema()
root
|-- VendorID: integer (nullable = true)
|-- tpep_pickup_datetime: string (nullable = true)
|-- tpep_dropoff_datetime: string (nullable = true)
|-- passenger_count: integer (nullable = true)
|-- trip_distance: double (nullable = true)
|-- RatecodeID: integer (nullable = true)
|-- store_and_fwd_flag: string (nullable = true)
|-- PULocationID: integer (nullable = true)
|-- DOLocationID: integer (nullable = true)
|-- payment_type: integer (nullable = true)
|-- fare_amount: double (nullable = true)
|-- extra: double (nullable = true)
|-- mta_tax: double (nullable = true)
|-- tip_amount: double (nullable = true)
|-- tolls_amount: double (nullable = true)
|-- improvement_surcharge: double (nullable = true)
|-- total_amount: double (nullable = true)
|-- congestion_surcharge: double (nullable = true)
root
|-- LocationID: integer (nullable = true)
|-- Borough: string (nullable = true)
|-- Zone: string (nullable = true)
|-- service_zone: string (nullable = true)
----
# Register temp views with createOrReplaceTempView()
trips_df.createOrReplaceTempView("trips")
zone_df.createOrReplaceTempView("zone")
[Column descriptions for trips_df]
1. VendorID: code indicating the TPEP provider that supplied the record
2. tpep_pickup_datetime: pickup date and time
3. tpep_dropoff_datetime: drop-off date and time
4. passenger_count: number of passengers
5. trip_distance: distance (miles)
6. PULocationID: pickup location (ID)
7. DOLocationID: drop-off location (ID)
8. payment_type: payment method
9. fare_amount: fare
10. extra: extra charges
11. tip_amount: tip
12. tolls_amount: tolls
13. total_amount: total cost
[Column descriptions for zone_df]
1. LocationID: zone ID
2. Borough: borough (the broader district)
3. Zone: zone name
Data preprocessing
1) Collecting only the useful columns
Only the columns needed for analysis are kept, such as pickup and drop-off time, location, payment method, passenger count, and total cost. The rest are dropped.
# Collect only the useful columns.
query = '''
SELECT
t.VendorID,
TO_DATE(t.tpep_pickup_datetime) AS pickup_date,
HOUR(t.tpep_pickup_datetime) AS pickup_time,
TO_DATE(t.tpep_dropoff_datetime) AS dropoff_date,
HOUR(t.tpep_dropoff_datetime) AS dropoff_time,
t.passenger_count,
t.trip_distance,
t.payment_type,
t.fare_amount,
t.tip_amount,
t.tolls_amount,
t.total_amount,
pz.Zone as pzone,
dz.Zone as dzone
FROM
trips t
LEFT JOIN
zone pz
ON
t.PULocationID == pz.LocationID
LEFT JOIN
zone dz
ON
t.DOLocationID == dz.LocationID
'''
taxi_df = spark.sql(query)
taxi_df.createOrReplaceTempView("taxi")
spark.sql('select * from taxi').show(5)
+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
|VendorID|pickup_date|pickup_time|dropoff_date|dropoff_time|passenger_count|trip_distance|payment_type|fare_amount|tip_amount|tolls_amount|total_amount| pzone| dzone|
+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
| 2| 2021-03-01| 0| 2021-03-01| 0| 1| 0.0| 2| 3.0| 0.0| 0.0| 4.3| NV| NV|
| 2| 2021-03-01| 0| 2021-03-01| 0| 1| 0.0| 2| 2.5| 0.0| 0.0| 3.8| Manhattanville|Manhattanville|
| 2| 2021-03-01| 0| 2021-03-01| 0| 1| 0.0| 2| 3.5| 0.0| 0.0| 4.8| Manhattanville|Manhattanville|
| 1| 2021-03-01| 0| 2021-03-01| 0| 0| 16.5| 1| 51.0| 11.65| 6.12| 70.07|LaGuardia Airport| NA|
| 2| 2021-03-01| 0| 2021-03-01| 0| 1| 1.13| 1| 5.5| 1.86| 0.0| 11.16| East Chelsea| NV|
+--------+-----------+-----------+------------+------------+---------------+-------------+------------+-----------+----------+------------+------------+-----------------+--------------+
only showing top 5 rows
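The query above joins the zone table twice, once for the pickup ID and once for the drop-off ID. A minimal pure-Python sketch of that LEFT JOIN behavior follows. The tiny `zone` and `trips` tables and their IDs are made up for illustration; the point is that a trip with no matching zone is kept and simply gets an empty zone name.

```python
# Hypothetical miniature lookup table: LocationID -> Zone (IDs and names are made up).
zone = {1: "Zone A", 2: "Zone B"}

# Hypothetical trips; location ID 99 has no entry in the lookup table.
trips = [
    {"PULocationID": 1, "DOLocationID": 2, "total_amount": 11.16},
    {"PULocationID": 2, "DOLocationID": 99, "total_amount": 4.3},
]

# A LEFT JOIN keeps every trip; dict.get() returns None when no zone matches,
# which plays the role of NULL in the SQL result.
joined = [
    {
        "total_amount": t["total_amount"],
        "pzone": zone.get(t["PULocationID"]),
        "dzone": zone.get(t["DOLocationID"]),
    }
    for t in trips
]

for row in joined:
    print(row)
```

An INNER JOIN would have dropped the second trip instead, which is why LEFT JOIN is the safer choice before the data has been cleaned.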
As even the top five rows show, the data is fairly dirty: trip_distance is 0 in several rows and a drop-off zone shows up as NA. Using it for analysis as is would most likely lead to strange conclusions, so let's clean it up.
2) Let's find the dirty parts of the data.
# step 1. Only the 2021.01 through 2021.07 files were downloaded, so everything outside that range gets removed.
# Why is there 2002 data here...?
spark.sql('select pickup_date from taxi order by pickup_date').show()
+-----------+
|pickup_date|
+-----------+
| 2002-12-31|
| 2003-01-05|
| 2004-04-04|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
| 2008-12-31|
+-----------+
# step 2 - look for outliers across the columns
# Could a yellow taxi really carry 9 passengers?
for i in ['passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']:
    print(f'Summary of {i} -----------------------------------')
    taxi_df.select(i).describe().show()
Summary of passenger_count -----------------------------------
+-------+------------------+
|summary| passenger_count|
+-------+------------------+
| count| 14166672|
| mean|1.4253783104458126|
| stddev| 1.04432704905968|
| min| 0|
| max| 9|
+-------+------------------+
Summary of trip_distance -----------------------------------
+-------+-----------------+
|summary| trip_distance|
+-------+-----------------+
| count| 15000700|
| mean|6.628629402627818|
| stddev|671.7293482115828|
| min| 0.0|
| max| 332541.19|
+-------+-----------------+
Summary of fare_amount -----------------------------------
+-------+------------------+
|summary| fare_amount|
+-------+------------------+
| count| 15000700|
| mean| 12.89269334830367|
| stddev|145.54843567115813|
| min| -643.5|
| max| 398466.38|
+-------+------------------+
Summary of tip_amount -----------------------------------
+-------+-----------------+
|summary| tip_amount|
+-------+-----------------+
| count| 15000700|
| mean|2.146797558780939|
| stddev|2.610914434555077|
| min| -333.32|
| max| 1140.44|
+-------+-----------------+
Summary of tolls_amount -----------------------------------
+-------+-------------------+
|summary| tolls_amount|
+-------+-------------------+
| count| 15000700|
| mean|0.31795104561765897|
| stddev| 1.6542914124457562|
| min| -38.02|
| max| 956.55|
+-------+-------------------+
Summary of total_amount -----------------------------------
+-------+-----------------+
|summary| total_amount|
+-------+-----------------+
| count| 15000700|
| mean|18.75545205708744|
| stddev|145.7442452805979|
| min| -647.8|
| max| 398469.2|
+-------+-----------------+
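One detail in the tables above is that passenger_count has a lower count (14,166,672) than the other columns (15,000,700). That is what you would expect if the count row only includes non-null values. The sketch below imitates that summary in plain Python; the sample values are hypothetical and `None` stands in for a missing entry.

```python
import statistics


def describe(values):
    """Summarize a column like the tables above, skipping missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),                 # non-missing values only
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),   # sample standard deviation
        "min": min(present),
        "max": max(present),
    }


# Hypothetical passenger_count values with two missing entries.
passenger_count = [1, 2, None, 1, 0, 9, None, 1]
summary = describe(passenger_count)
print(summary)
```

Eight rows go in, but the count is 6, so a gap between column counts is a quick hint that a column contains missing values.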
A closer look reveals some entertaining values. The maximum trip_distance is 332,541 miles (about 535,173 km). The Earth's circumference is roughly 40,000 km, so this taxi would have circled the planet about 13 times. That is impossible, so the record has to go. Plenty of other records need to be removed as well. Go through the summaries carefully and think for yourself about which values need preprocessing.
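The sanity check on the maximum distance is simple arithmetic, sketched here. The mile-to-kilometre factor is the standard definition; the Earth circumference of 40,075 km is an approximate equatorial figure used only for scale.

```python
MILES_TO_KM = 1.609344            # definition of the international mile
EARTH_CIRCUMFERENCE_KM = 40075    # approximate equatorial circumference

max_trip_miles = 332541.19        # max trip_distance from the summary output
max_trip_km = max_trip_miles * MILES_TO_KM
laps = max_trip_km / EARTH_CIRCUMFERENCE_KM

print(f"{max_trip_km:,.0f} km, roughly {laps:.1f} laps around the Earth")
```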
passenger_count: 9 is too many. Limit to 5.
trip_distance: must be greater than 0; cap at 100 miles.
fare_amount: must be greater than 0; cap at 100.
tip_amount: must be greater than 0; cap at 50.
tolls_amount: must be greater than 0; cap at 10.
total_amount: must be greater than 0; cap at 1000.
3) Data Cleaning
# data cleaning
# fare, tip, and tolls will be preprocessed later, when building the df.
query = '''
SELECT
*
FROM
taxi t
WHERE
t.total_amount < 2000
AND t.total_amount > 0
AND t.trip_distance < 100
AND t.passenger_count < 6
AND t.pickup_date >= '2021-01-01'
AND t.pickup_date < '2021-08-01'
AND t.dropoff_date >= '2021-01-01'
AND t.dropoff_date < '2021-08-03'
'''
df_c = spark.sql(query)
df_c.createOrReplaceTempView('cleaned')
The data has now been preprocessed with a SQL statement.
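The pickup_date conditions in the query form a half-open range: the start date is included and the end date is excluded. A small sketch of the same check in plain Python is shown below. The first two sample dates appear in the earlier output, and the rest are hypothetical boundary cases.

```python
from datetime import date

START = date(2021, 1, 1)   # inclusive lower bound (pickup_date >= '2021-01-01')
END = date(2021, 8, 1)     # exclusive upper bound (pickup_date < '2021-08-01')


def in_download_window(day):
    """Return True when the date falls inside [START, END)."""
    return START <= day < END


pickup_dates = [
    date(2002, 12, 31),    # stray record seen in the raw data
    date(2008, 12, 31),    # stray record seen in the raw data
    date(2021, 1, 1),      # first day of the window (kept)
    date(2021, 7, 31),     # last day of the window (kept)
    date(2021, 8, 1),      # first day after the window (dropped)
]
kept = [d for d in pickup_dates if in_download_window(d)]
print(kept)
```

Writing the upper bound as "less than the first day of the next month" avoids having to remember how many days the last month has.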
Let's check that the preprocessing worked as intended.
# Check the preprocessing
spark.sql('SELECT pickup_date FROM cleaned ORDER BY pickup_date').show()
+-----------+
|pickup_date|
+-----------+
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
| 2021-01-01|
+-----------+
The data now starts from 2021-01-01.
Let's check the summary statistics for the other columns.
# step 2 - find and remove outliers across the columns
for i in ['passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'tolls_amount', 'total_amount']:
    print(f'Summary of {i} -----------------------------------')
    df_c.select(i).describe().show()
Summary of passenger_count -----------------------------------
+-------+------------------+
|summary| passenger_count|
+-------+------------------+
| count| 13855011|
| mean|1.3471861552473687|
| stddev|0.8633321134239649|
| min| 0|
| max| 5|
+-------+------------------+
Summary of trip_distance -----------------------------------
+-------+------------------+
|summary| trip_distance|
+-------+------------------+
| count| 13855011|
| mean|2.8443242809406217|
| stddev|3.6296305328849514|
| min| 0.0|
| max| 99.96|
+-------+------------------+
Summary of fare_amount -----------------------------------
+-------+------------------+
|summary| fare_amount|
+-------+------------------+
| count| 13855011|
| mean|12.173092585057173|
| stddev|10.921311212928543|
| min| -0.8|
| max| 1320.0|
+-------+------------------+
Summary of tip_amount -----------------------------------
+-------+------------------+
|summary| tip_amount|
+-------+------------------+
| count| 13855011|
| mean| 2.189352297880318|
| stddev|2.5818332665863704|
| min| 0.0|
| max| 700.0|
+-------+------------------+
Summary of tolls_amount -----------------------------------
+-------+-------------------+
|summary| tolls_amount|
+-------+-------------------+
| count| 13855011|
| mean|0.27130274598821397|
| stddev| 1.540763209706977|
| min| 0.0|
| max| 956.55|
+-------+-------------------+
Summary of total_amount -----------------------------------
+-------+------------------+
|summary| total_amount|
+-------+------------------+
| count| 13855011|
| mean| 18.08681097854774|
| stddev|13.216750985022102|
| min| 0.01|
| max| 1320.8|
+-------+------------------+
Columns that were not filtered directly, such as fare and tolls, still contain extreme values (tolls_amount still has a max of 956.55), but passenger_count and trip_distance have clearly changed. Now let's run a simple analysis on the cleaned data.
4) Simple analysis 1. Counting trips by day of week
DATE_FORMAT() in SQL converts the date into a day-of-week value. The rows are then grouped by day of week and the trips are counted.
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Count trips by day of week
query = '''
SELECT
DATE_FORMAT(c.pickup_date, 'EEEE') AS day_of_week,
COUNT(*) AS trips
FROM
cleaned c
GROUP BY
day_of_week
'''
weekday_df = spark.sql(query).toPandas()
# Draw the chart
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(
x = 'day_of_week',
y = 'trips',
data = weekday_df
)
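A GROUP BY without ORDER BY gives no guarantee about row order, so the bars may not appear from Monday to Sunday. The sketch below shows, in plain Python, counting by weekday name and then arranging the result in calendar order. The sample dates are hypothetical and `WEEKDAYS` is a helper list introduced here for illustration.

```python
from collections import Counter
from datetime import date

# Fixed English weekday names, indexed the same way as date.weekday() (Monday = 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical pickup dates.
pickup_dates = [date(2021, 1, 1), date(2021, 1, 2),
                date(2021, 1, 4), date(2021, 1, 8)]

# Count trips per weekday name, like GROUP BY day_of_week with COUNT(*).
counts = Counter(WEEKDAYS[d.weekday()] for d in pickup_dates)

# Arrange the counts in calendar order rather than arrival order.
ordered = [(day, counts[day]) for day in WEEKDAYS if day in counts]
print(ordered)
```

With seaborn, the same `WEEKDAYS` list could be passed to the `order` parameter of `sns.barplot` to fix the bar order.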
5) Simple analysis 2. Payment method statistics
The payment code is converted into a payment method string, and the rows are grouped by payment method to count the trips.
# Trip counts by payment method
payment_type_to_string = {
1: "Credit Card",
2: "Cash",
3: "No Charge",
4: "Dispute",
5: "Unknown",
6: "Voided Trip",
}
# Define the UDF
def parse_payment_type(payment_type):
    return payment_type_to_string[payment_type]
spark.udf.register('parse_payment_type', parse_payment_type)
# Write the query
query = '''
SELECT
parse_payment_type(payment_type) as payment,
COUNT(*) AS trips
FROM
cleaned c
GROUP BY
payment_type
'''
df_payment = spark.sql(query).toPandas()
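Indexing the dictionary directly raises a KeyError if a payment code outside the mapping ever appears, and inside a UDF that would fail the whole query. A more defensive variant, sketched here in plain Python, falls back to a label instead. The `safe_parse_payment_type` name and the "Unmapped" label are assumptions for illustration, not part of the original lecture code.

```python
payment_type_to_string = {
    1: "Credit Card",
    2: "Cash",
    3: "No Charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided Trip",
}


def safe_parse_payment_type(payment_type):
    """Map a payment code to its label; never raise on an unexpected code."""
    # dict.get() returns the fallback when the key is missing (including None).
    return payment_type_to_string.get(payment_type, f"Unmapped ({payment_type})")


print(safe_parse_payment_type(1))
print(safe_parse_payment_type(0))
print(safe_parse_payment_type(None))
```

The function could be registered with `spark.udf.register` in the same way as the original, so the query itself would not need to change.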
----
display(df_payment)
payment trips
0 Credit Card 10534838
1 No Charge 59181
2 Dispute 23856
3 Cash 3237135
4 Unknown 1
----
# Draw the chart by payment method
fig, ax = plt.subplots(figsize=(16,6))
plt.title('payment', fontsize = 30)
sns.barplot(
x = 'payment',
y = 'trips',
data = df_payment
)
So far we have preprocessed the TLC Trip Record Data and run a simple analysis.
I hope you get comfortable with SparkSQL by exploring the data on your own.
A more detailed analysis of the TLC data will be posted in the Analytics section.
Thanks for reading.