DataFrame
A DataFrame is the core abstraction for the structured data handled by SparkSQL.
It keeps the strengths of RDDs (lazy execution, distribution, immutability) while being structured,
which makes automatic query optimization possible.
It can be read from and converted to formats such as CSV, JSON, and Hive.
Let's dig into DataFrames properly.
"๋ณธ ํฌ์คํ ์ ํจ์คํธ์บ ํผ์ค์ ๊ฐ์๋ฅผ ๋ฃ๊ณ , ์ ๋ฆฌํ ์๋ฃ์์ ๋ฐํ๋๋ค."
Basic Setup
import os
import findspark
findspark.init(os.environ.get("SPARK_HOME"))  # point findspark at the local Spark installation
import pyspark
from pyspark import SparkConf, SparkContext
import pandas as pd
import faulthandler
faulthandler.enable()  # dump a Python traceback if the process crashes hard
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName("dataframe").getOrCreate()
# Data
stocks = [
('Google', 'GOOGL', 'USA', 2984, 'USD'),
('Netflix', 'NFLX', 'USA', 645, 'USD'),
('Amazon', 'AMZN', 'USA', 3518, 'USD'),
('Tesla', 'TSLA', 'USA', 1222, 'USD'),
('Tencent', '0700', 'Hong Kong', 483, 'HKD'),
('Toyota', '7203', 'Japan', 2006, 'JPY'),
('Samsung', '005930', 'Korea', 70600, 'KRW'),
('Kakao', '035720', 'Korea', 125000, 'KRW'),
]
# Schema
stockSchema = ['name', 'ticker', 'country', 'price', 'currency']
# createDataFrame()
df = spark.createDataFrame(data = stocks, schema=stockSchema)
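A DataFrame created this way can also round-trip through the file formats mentioned above. A minimal sketch, assuming a writable /tmp directory (the paths are illustrative, not from the original setup):
# write the stocks DataFrame out, then read it back (illustrative paths)
df.write.mode('overwrite').option('header', True).csv('/tmp/stocks_csv')
df_from_csv = spark.read.csv('/tmp/stocks_csv', header=True, inferSchema=True)
df.write.mode('overwrite').json('/tmp/stocks_json')
df_from_json = spark.read.json('/tmp/stocks_json')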
Basic DataFrame Functions in SparkSQL
1. df.dtypes
Shows the DataFrame's columns and their data types.
df.dtypes
[('name', 'string'),
('ticker', 'string'),
('country', 'string'),
('price', 'bigint'),
('currency', 'string')]
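Since dtypes returns plain Python tuples, it can be used to pick out columns by type; a small sketch:
# collect the names of all non-string columns (here, just 'price')
numeric_cols = [name for name, dtype in df.dtypes if dtype != 'string']
print(numeric_cols)  # ['price']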
2. df.show()
Displays the DataFrame's rows in tabular form (the first 20 rows by default).
df.show()
+-------+------+---------+------+--------+
| name|ticker| country| price|currency|
+-------+------+---------+------+--------+
| Google| GOOGL| USA| 2984| USD|
|Netflix| NFLX| USA| 645| USD|
| Amazon| AMZN| USA| 3518| USD|
| Tesla| TSLA| USA| 1222| USD|
|Tencent| 0700|Hong Kong| 483| HKD|
| Toyota| 7203| Japan| 2006| JPY|
|Samsung|005930| Korea| 70600| KRW|
| Kakao|035720| Korea|125000| KRW|
+-------+------+---------+------+--------+
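show() also accepts arguments that control how much is printed; for example:
df.show(3)               # only the first 3 rows
df.show(truncate=False)  # print full cell values instead of truncating at 20 characters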
3. df.printSchema()
Prints the DataFrame's schema as a tree.
df.printSchema()
root
|-- name: string (nullable = true)
|-- ticker: string (nullable = true)
|-- country: string (nullable = true)
|-- price: long (nullable = true)
|-- currency: string (nullable = true)
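The list-of-names schema used above makes Spark infer each column's type. The schema can instead be declared explicitly with StructType; a minimal sketch:
from pyspark.sql.types import StructType, StructField, StringType, LongType
explicitSchema = StructType([
    StructField('name', StringType(), True),
    StructField('ticker', StringType(), True),
    StructField('country', StringType(), True),
    StructField('price', LongType(), True),
    StructField('currency', StringType(), True),
])
df_typed = spark.createDataFrame(data=stocks, schema=explicitSchema)
df_typed.printSchema()  # price is long by declaration rather than inference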
4. df.select()
Extracts the desired columns or data from the DataFrame.
df.select('name', 'currency').collect()
[Row(name='Google', currency='USD'),
Row(name='Netflix', currency='USD'),
Row(name='Amazon', currency='USD'),
Row(name='Tesla', currency='USD'),
Row(name='Tencent', currency='HKD'),
Row(name='Toyota', currency='JPY'),
Row(name='Samsung', currency='KRW'),
Row(name='Kakao', currency='KRW')]
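select() accepts column expressions as well as plain names, so derived columns can be computed on the fly; for example:
from pyspark.sql import functions as F
# derive a new column from price; alias() gives it a readable name
df.select('name', (F.col('price') * 2).alias('double_price')).show()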
5. df.agg()
Short for aggregate: an operation that combines rows into summary values.
df.agg({'price':'mean'}).collect()
[Row(avg(price)=25807.25)]
----
# import the SQL functions module and use it for aggregation
from pyspark.sql import functions as F
df.agg(F.count(df.currency)).collect()
[Row(count(currency)=8)]
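Several aggregates can also be computed in a single pass by passing multiple expressions:
df.agg(F.min(df.price), F.max(df.price), F.avg(df.price)).collect()
# [Row(min(price)=483, max(price)=125000, avg(price)=25807.25)]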
6. df.groupBy()
Groups the data by the specified column(s).
# group by currency, then pick the maximum price within each group
df.groupBy('currency').agg({'price':'max'}).collect()
[Row(currency='KRW', max(price)=125000),
Row(currency='JPY', max(price)=2006),
Row(currency='HKD', max(price)=483),
Row(currency='USD', max(price)=3518)]
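groupBy() likewise accepts several aggregate expressions at once; for example:
# per-currency count plus price range
df.groupBy('currency').agg(
    F.count('*').alias('n'),
    F.min('price').alias('min_price'),
    F.max('price').alias('max_price'),
).show()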
7. df.join()
An operation that joins the DataFrame with other data; the examples below use a second DataFrame, df_earning, sketched next.
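The construction of df_earning is not shown in this post; here is a minimal sketch reconstructed from the join output that follows (EPS kept as strings to match the quoted values):
# earnings-per-share data; values taken from the join output below
earnings = [
    ('Google', '27.99', 'USD'),
    ('Netflix', '2.56', 'USD'),
    ('Amazon', '6.12', 'USD'),
    ('Tesla', '1.86', 'USD'),
    ('Tencent', '11.01', 'HKD'),
    ('Toyota', '224.82', 'JPY'),
    ('Samsung', '1780.0', 'KRW'),
    ('Kakao', '705.0', 'KRW'),
]
earningsSchema = ['name', 'EPS', 'currency']
df_earning = spark.createDataFrame(data=earnings, schema=earningsSchema)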
# join the two DataFrames on the name column
df.join(df_earning, 'name').collect()
[Row(name='Amazon', ticker='AMZN', country='USA', price=3518, currency='USD', EPS='6.12', currency='USD'),
Row(name='Google', ticker='GOOGL', country='USA', price=2984, currency='USD', EPS='27.99', currency='USD'),
Row(name='Kakao', ticker='035720', country='Korea', price=125000, currency='KRW', EPS='705.0', currency='KRW'),
Row(name='Netflix', ticker='NFLX', country='USA', price=645, currency='USD', EPS='2.56', currency='USD'),
Row(name='Samsung', ticker='005930', country='Korea', price=70600, currency='KRW', EPS='1780.0', currency='KRW'),
Row(name='Tencent', ticker='0700', country='Hong Kong', price=483, currency='HKD', EPS='11.01', currency='HKD'),
Row(name='Tesla', ticker='TSLA', country='USA', price=1222, currency='USD', EPS='1.86', currency='USD'),
Row(name='Toyota', ticker='7203', country='Japan', price=2006, currency='JPY', EPS='224.82', currency='JPY')]
----
df.join(df_earning, 'name').select(df.name, df_earning.EPS).collect()
[Row(name='Amazon', EPS='6.12'),
Row(name='Google', EPS='27.99'),
Row(name='Kakao', EPS='705.0'),
Row(name='Netflix', EPS='2.56'),
Row(name='Samsung', EPS='1780.0'),
Row(name='Tencent', EPS='11.01'),
Row(name='Tesla', EPS='1.86'),
Row(name='Toyota', EPS='224.82')]
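join() performs an inner join by default; other join types can be requested through the how parameter:
# keep every row of df even when there is no matching earnings row
df.join(df_earning, on='name', how='left').show()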
8. Putting It All Together
# filter on EPS with a where() clause, then sort the rows with orderBy()
df.join(df_earning, 'name').where(df_earning.EPS > 5).orderBy(df.price).collect()
[Row(name='Tencent', ticker='0700', country='Hong Kong', price=483, currency='HKD', EPS='11.01', currency='HKD'),
Row(name='Toyota', ticker='7203', country='Japan', price=2006, currency='JPY', EPS='224.82', currency='JPY'),
Row(name='Google', ticker='GOOGL', country='USA', price=2984, currency='USD', EPS='27.99', currency='USD'),
Row(name='Amazon', ticker='AMZN', country='USA', price=3518, currency='USD', EPS='6.12', currency='USD'),
Row(name='Samsung', ticker='005930', country='Korea', price=70600, currency='KRW', EPS='1780.0', currency='KRW'),
Row(name='Kakao', ticker='035720', country='Korea', price=125000, currency='KRW', EPS='705.0', currency='KRW')]
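orderBy() sorts in ascending order by default; descending order takes an explicit flag:
# most expensive stocks first
df.join(df_earning, 'name').where(df_earning.EPS > 5).orderBy(df.price.desc()).collect()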
----
# group by country, then summarize the data with the average price
df.join(df_earning, 'name').groupBy(df.country).agg(F.mean(df.price)).collect()
[Row(country='Hong Kong', avg(price)=483.0),
Row(country='USA', avg(price)=2092.25),
Row(country='Japan', avg(price)=2006.0),
Row(country='Korea', avg(price)=97800.0)]
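Since EPS was stored as strings, it needs a cast before it can be averaged the same way; a sketch:
# cast EPS to double, then average it per country
df.join(df_earning, 'name') \
  .groupBy(df.country) \
  .agg(F.avg(F.col('EPS').cast('double')).alias('avg_eps')) \
  .collect()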
We have now seen how to work with DataFrames in SparkSQL.
In the next post, we will look at user-defined functions (UDFs).
Thanks for reading.