ALS, Alternating Least Squares
SparkML supports ALS (Alternating Least Squares) as its recommendation algorithm.
Let's load movie rating data and try out the ALS model directly in Spark.
[ALS Concept]
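As a quick refresher on the idea (this is the standard matrix-factorization formulation, summarized here rather than taken from the linked post): ALS approximates the sparse user-item rating matrix with low-rank user factors $u_u$ and item factors $v_i$ by minimizing

$$\min_{U, V} \sum_{(u, i) \in \text{observed}} \left( r_{ui} - u_u^{\top} v_i \right)^2 + \lambda \Big( \sum_u \lVert u_u \rVert^2 + \sum_i \lVert v_i \rVert^2 \Big)$$

It alternates between the two factor matrices: hold the item factors fixed and solve an ordinary least-squares problem for each user vector, then hold the user factors fixed and solve for each item vector, and repeat. In Spark's ALS, $\lambda$ corresponds to `regParam` and the dimension of the factor vectors is `rank`.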
Basic Settings
from matplotlib import font_manager, rc
# register a Korean font so matplotlib can render Hangul labels
font_path = 'C:\\WINDOWS\\Fonts\\HBATANG.TTF'
font = font_manager.FontProperties(fname=font_path).get_name()
rc('font', family=font)
import os
import findspark
findspark.init(os.environ.get("SPARK_HOME"))
import pyspark
from pyspark import SparkConf, SparkContext
import pandas as pd
import faulthandler
faulthandler.enable()
from pyspark.sql import SparkSession
# give the driver and executors enough memory for the 25M-rating dataset
MAX_MEMORY = "5g"
spark = SparkSession.builder.master('local').appName("movie-recommendation")\
    .config("spark.executor.memory", MAX_MEMORY)\
    .config("spark.driver.memory", MAX_MEMORY).getOrCreate()
Downloading the movie rating data
(1) Go to the MovieLens site.
https://grouplens.org/datasets/movielens/25m/
(2) Click the zip link below the black box to download the data.
(3) Extract the ml-25m folder to a path of your choice, e.g. C:\your-path\ml-25m.
Using ALS
(1) Load the data
ratings_file = 'C:/your-path/ml-25m/ratings.csv'
ratings_df = spark.read.csv(f"file:///{ratings_file}", inferSchema=True, header=True)
(2) Keep only the needed columns and check the data
ratings_df.show()
# the timestamp column is not needed, so drop it
ratings_df = ratings_df.select(['userId', 'movieId', 'rating'])
+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
| 1| 296| 5.0|1147880044|
| 1| 306| 3.5|1147868817|
| 1| 307| 5.0|1147868828|
| 1| 665| 5.0|1147878820|
| 1| 899| 3.5|1147868510|
| 1| 1088| 4.0|1147868495|
| 1| 1175| 3.5|1147868826|
| 1| 1217| 3.5|1147878326|
| 1| 1237| 5.0|1147868839|
| 1| 1250| 4.0|1147868414|
| 1| 1260| 3.5|1147877857|
| 1| 1653| 4.0|1147868097|
| 1| 2011| 2.5|1147868079|
| 1| 2012| 2.5|1147868068|
| 1| 2068| 2.5|1147869044|
| 1| 2161| 3.5|1147868609|
| 1| 2351| 4.5|1147877957|
| 1| 2573| 4.0|1147878923|
| 1| 2632| 5.0|1147878248|
| 1| 2692| 5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows
----
ratings_df.printSchema()
root
|-- userId: integer (nullable = true)
|-- movieId: integer (nullable = true)
|-- rating: double (nullable = true)
The data is organized as movie IDs and their ratings per userId. Since no user has watched every movie, there will be movies in this dataset that user 1 has not seen.
The dataset contains 59,047 distinct movies in total.
ratings_df.createOrReplaceTempView('ratings_df')
query = '''
SELECT
COUNT(DISTINCT movieId) as movie_count
FROM
ratings_df
'''
spark.sql(query).show()
+-----------+
|movie_count|
+-----------+
| 59047|
+-----------+
The first user (userId 1) has rated a total of 70 movies.
query = '''
SELECT
COUNT(*) as 1_count
FROM
ratings_df
WHERE
userId == 1
'''
spark.sql(query).show()
+-------+
|1_count|
+-------+
| 70|
+-------+
Our goal in this post is to use the ALS algorithm to recommend, from among the movies user 1 has not seen, the ones they are most likely to enjoy.
(3) Rating statistics
# rating statistics
ratings_df.select('rating').describe().show()
+-------+------------------+
|summary| rating|
+-------+------------------+
| count| 25000095|
| mean| 3.533854451353085|
| stddev|1.0607439611423535|
| min| 0.5|
| max| 5.0|
+-------+------------------+
(4) Splitting into train and test data
# split into train and test sets
train_df, test_df = ratings_df.randomSplit([0.8, 0.2])
print(f'train_df count: {train_df.count()}')
print(f'test_df count: {test_df.count()}')
train_df count: 19999764
test_df count: 5000331
(5) Build and train the model
from pyspark.ml.recommendation import ALS
# recommendation algorithm
als = ALS(
    maxIter=5,
    regParam=0.1,
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
    # how to handle users/items that were not seen during training
    coldStartStrategy='drop'
)
'userCol' takes the user ID column.
'itemCol' takes the item ID column.
'ratingCol' takes the rating column.
Just knowing the concept of ALS and how to call it is enough to put the model to work!
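Besides the three column parameters, ALS exposes a few other hyperparameters that are worth knowing. Here is the same constructor with the remaining knobs spelled out at their default values (illustrative only, not part of the original run; the behavior is identical to the model above):

# illustrative only: other hyperparameters that pyspark.ml's ALS exposes
als = ALS(
    rank=10,                  # number of latent factors per user/item (default)
    maxIter=5,                # number of alternating optimization passes
    regParam=0.1,             # L2 regularization strength (lambda)
    nonnegative=False,        # constrain factors to be non-negative if True
    implicitPrefs=False,      # switch to the implicit-feedback variant
    userCol='userId',
    itemCol='movieId',
    ratingCol='rating',
    coldStartStrategy='drop'
)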
# fit the model
model = als.fit(train_df)
(6) Inference and performance check
# inference
prediction = model.transform(test_df)
prediction.show()
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
| 1| 1088| 4.0| 2.5297563|
| 3| 175197| 3.5| 2.874376|
| 4| 1580| 4.5| 3.097906|
| 9| 1088| 5.0| 3.9108486|
| 12| 8638| 4.0| 3.8486152|
| 13| 3175| 4.0| 3.6777968|
| 20| 1580| 4.0| 4.033978|
| 23| 1959| 5.0| 3.9067807|
| 30| 3175| 4.5| 3.8995147|
| 31| 8638| 2.0| 2.5974488|
| 41| 1580| 4.0| 3.5832882|
| 41| 2366| 3.0| 3.150863|
| 57| 1580| 3.0| 3.604269|
| 58| 6658| 5.0| 3.2883408|
| 63| 68135| 4.0| 3.1181765|
| 70| 1580| 3.0| 3.017872|
| 72| 1591| 2.0| 2.358852|
| 75| 1088| 3.5| 3.070664|
| 75| 1959| 2.0| 3.243212|
| 80| 1342| 2.0| 2.482828|
+------+-------+------+----------+
only showing top 20 rows
----
prediction.select('rating', 'prediction').describe().show()
+-------+------------------+-----------------+
|summary| rating| prediction|
+-------+------------------+-----------------+
| count| 4996844| 4996844|
| mean| 3.534328668255403|3.399862379635884|
| stddev|1.0605845133909824| 0.63975911880358|
| min| 0.5| -2.4304335|
| max| 5.0| 6.9266715|
+-------+------------------+-----------------+
Comparing the statistics of the actual ratings and the predictions, the means are encouragingly close. However, the standard deviations differ, and the minimum prediction is negative, so there is clearly room for improvement.
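One quick way to address the out-of-range predictions (a minimal sketch, not part of the original post) is to clip them to the valid 0.5-5.0 MovieLens rating range before evaluating:

from pyspark.sql import functions as F
# clip predictions into the valid rating range [0.5, 5.0]
clipped = prediction.withColumn(
    'prediction',
    F.when(F.col('prediction') < 0.5, 0.5)
     .when(F.col('prediction') > 5.0, 5.0)
     .otherwise(F.col('prediction'))
)
clipped.select('rating', 'prediction').describe().show()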
# evaluation
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
# RMSE
rmse = evaluator.evaluate(prediction)
rmse
0.8134903395643102
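An RMSE of roughly 0.81 leaves room for improvement. One common next step, sketched here with an illustrative (untuned) grid rather than anything from the original run, is to search over rank and regParam with cross-validation:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# illustrative grid; on the full 25M ratings this is expensive, so consider a sample first
param_grid = ParamGridBuilder()\
    .addGrid(als.rank, [10, 50])\
    .addGrid(als.regParam, [0.05, 0.1])\
    .build()
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel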
(7) Top 5 recommendations for every user
# top-5 recommendations for every user
model.recommendForAllUsers(5).show()
+------+--------------------+
|userId| recommendations|
+------+--------------------+
| 1|[{202231, 5.65169...|
| 3|[{194434, 6.35166...|
| 4|[{194434, 6.07157...|
| 5|[{194434, 6.13072...|
| 6|[{162436, 6.21624...|
| 7|[{185645, 5.45231...|
| 8|[{194434, 5.93970...|
| 9|[{185645, 6.51190...|
| 10|[{194434, 6.12319...|
| 12|[{194434, 5.63328...|
| 13|[{194434, 6.47734...|
| 15|[{194434, 6.69569...|
| 16|[{194434, 6.56789...|
| 17|[{199187, 6.23324...|
| 19|[{194434, 5.82160...|
| 20|[{194434, 6.83623...|
| 21|[{194434, 6.51367...|
| 22|[{185645, 7.19147...|
| 23|[{194434, 6.35103...|
| 24|[{203086, 6.53243...|
+------+--------------------+
only showing top 20 rows
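The recommendations column is an array of (movieId, rating) structs. If a flat table is easier to work with downstream, the array can be exploded into one row per recommendation (a small sketch, not part of the original post):

from pyspark.sql import functions as F
# one row per (user, recommended movie) pair
flat_recs = model.recommendForAllUsers(5)\
    .select('userId', F.explode('recommendations').alias('rec'))\
    .select('userId', 'rec.movieId', 'rec.rating')
flat_recs.show(5)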
(8) Recommending the top 3 users for every item
# recommend the top 3 users for each item
model.recommendForAllItems(3).show()
+-------+--------------------+
|movieId| recommendations|
+-------+--------------------+
| 1|[{18230, 5.483052...|
| 2|[{87426, 5.260374...|
| 3|[{87426, 5.19888}...|
| 4|[{52924, 4.736040...|
| 5|[{52924, 4.899741...|
| 6|[{156252, 5.42339...|
| 7|[{10417, 5.039373...|
| 8|[{87426, 5.294695...|
| 9|[{87426, 5.268173...|
| 10|[{87426, 5.282108...|
| 11|[{10417, 5.357073...|
| 12|[{87426, 5.257176...|
| 13|[{108346, 5.20422...|
| 14|[{105801, 4.85644...|
| 15|[{87426, 5.336809...|
| 16|[{96740, 5.368758...|
| 17|[{58248, 5.512191...|
| 18|[{87426, 5.399836...|
| 19|[{87426, 5.231883...|
| 20|[{87426, 5.285846...|
+-------+--------------------+
only showing top 20 rows
(9) Building a recommendation API for a specific user
Let's build an API that, given a user, returns that user's top recommended movies.
First, let's build the 'userId' input that will go into the API and look at the shape of the 'recommendations' data that comes back out.
from pyspark.sql.types import IntegerType
# check what the userId input will look like
user_list = [65, 78, 93]
user_df = spark.createDataFrame(user_list, IntegerType()).toDF('userId')
user_df.show()
+------+
|userId|
+------+
| 65|
| 78|
| 93|
+------+
----
# check what the recommendations look like
user_recommend = model.recommendForUserSubset(user_df, 5)
# extract the top-5 recommendations for the first user
movies_list = user_recommend.collect()[0].recommendations
recs_df = spark.createDataFrame(movies_list)
recs_df.show()
+-------+-----------------+
|movieId| rating|
+-------+-----------------+
| 205277|6.759361743927002|
| 159761|6.485182762145996|
| 169606|6.246584415435791|
| 137363|5.943302154541016|
| 203633|5.908850193023682|
+-------+-----------------+
We have recommended movie IDs to the user, but we still do not know which movies they are. Let's join the movie IDs with their titles and genres.
movies_file = 'C:/your-path/ml-25m/movies.csv'
movies_df = spark.read.csv(f"file:///{movies_file}", inferSchema=True, header=True)
movies_df.show()
+-------+--------------------+--------------------+
|movieId| title| genres|
+-------+--------------------+--------------------+
| 1| Toy Story (1995)|Adventure|Animati...|
| 2| Jumanji (1995)|Adventure|Childre...|
| 3|Grumpier Old Men ...| Comedy|Romance|
| 4|Waiting to Exhale...|Comedy|Drama|Romance|
| 5|Father of the Bri...| Comedy|
| 6| Heat (1995)|Action|Crime|Thri...|
| 7| Sabrina (1995)| Comedy|Romance|
| 8| Tom and Huck (1995)| Adventure|Children|
| 9| Sudden Death (1995)| Action|
| 10| GoldenEye (1995)|Action|Adventure|...|
| 11|American Presiden...|Comedy|Drama|Romance|
| 12|Dracula: Dead and...| Comedy|Horror|
| 13| Balto (1995)|Adventure|Animati...|
| 14| Nixon (1995)| Drama|
| 15|Cutthroat Island ...|Action|Adventure|...|
| 16| Casino (1995)| Crime|Drama|
| 17|Sense and Sensibi...| Drama|Romance|
| 18| Four Rooms (1995)| Comedy|
| 19|Ace Ventura: When...| Comedy|
| 20| Money Train (1995)|Action|Comedy|Cri...|
+-------+--------------------+--------------------+
only showing top 20 rows
Now that we have movieIds in the recommendations, let's write a query that joins them to the movie titles and genres.
recs_df.createOrReplaceTempView('recommendations')
movies_df.createOrReplaceTempView('movies')
# SQL query
query = '''
Select
*
From
recommendations r
Join movies m
On r.movieId = m.movieId
ORDER BY
rating desc
'''
recommendation_movies = spark.sql(query)
recommendation_movies.show()
+-------+-----------------+-------+--------------------+--------------------+
|movieId| rating|movieId| title| genres|
+-------+-----------------+-------+--------------------+--------------------+
| 205277|6.759361743927002| 205277| Inside Out (1991)|Comedy|Drama|Romance|
| 159761|6.485182762145996| 159761| Loot (1970)| Comedy|Crime|
| 169606|6.246584415435791| 169606|Dara O'Briain Cro...| Comedy|
| 137363|5.943302154541016| 137363|The Mother Of Inv...| Comedy|
| 203633|5.908850193023682| 203633| The Bribe (2018)| Comedy|Crime|
+-------+-----------------+-------+--------------------+--------------------+
API
# in practice it is convenient to wrap all of this in a single function
query = '''
Select
*
From
recommendations r
Join movies m
On r.movieId = m.movieId
ORDER BY
rating desc
'''
def get_recommendation(user_id, num_recs):
    # take the userId as input
    user_df = spark.createDataFrame([user_id], IntegerType()).toDF('userId')
    # generate recommendations for that user
    user_recs_df = model.recommendForUserSubset(user_df, num_recs)
    # turn the recommendations into a dataframe
    recs_list = user_recs_df.collect()[0].recommendations
    recs_df = spark.createDataFrame(recs_list)
    recs_df.createOrReplaceTempView('recommendations')
    movies_df.createOrReplaceTempView('movies')
    # join the recommendations with the movie title/genre data via the SQL query
    recommend_movies = spark.sql(query)
    return recommend_movies
# recommend 10 movies to the user with userId 456
recs = get_recommendation(456, 10)
# convert to pandas
recs.toPandas()
Scenario: 'Let us recommend a movie for you!'
userId 100 logs in to Netflix.
While the loading screen is up, the network hands the backend developer a mission:
'Recommend 5 movies that userId 100 would like. And do it nicely!'
The backend developer was just going to run the code above, but worried that shipping a raw pandas DataFrame around might cause trouble, so they decided to write a new function.
def recommendation(user_id, num_recs):
    recs = get_recommendation(user_id, num_recs)
    r = recs.toPandas()
    # pull out the list of titles
    title_list = list(r['title'])
    # pull out the list of genres
    genre_list = list(r['genres'])
    # print a friendly recommendation message for the user
    for i in range(0, num_recs):
        print(f'Recommendation #{i+1}: {title_list[i]} / {genre_list[i]}')
Now, whenever userId 100 shows up, we just call the function below.
recommendation(100, 5)
Recommendation #1: Adrenaline (1990) / (no genres listed)
Recommendation #2: Truth and Justice (2019) / Drama
Recommendation #3: School of Babel (2014) / Documentary
Recommendation #4: National Theatre Live: One Man, Two Guvnors (2011) / Comedy
Recommendation #5: Les Luthiers: El Grosso Concerto (2001) / (no genres listed)
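In a real service you would not want to refit ALS on every request. A minimal sketch of persisting the trained model and loading it back in the serving process (the path below is illustrative):

from pyspark.ml.recommendation import ALSModel
# save once after training
model.write().overwrite().save('C:/your-path/als-movie-model')
# later, in the serving process, load it back without retraining
serving_model = ALSModel.load('C:/your-path/als-movie-model')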
That wraps up using the ALS recommendation algorithm in Spark.
I hope this was helpful, and thank you for following along.