[SparkML] ALS, ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ™œ์šฉํ•˜๊ธฐ

2022. 5. 23. 00:17ใ†๐Ÿ›  Data Engineering/Apache Spark

 

ALS, Alternating Least Squares

SparkML์€ ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ ALS๋ฅผ ์ง€์›ํ•œ๋‹ค.

์˜ํ™” ํ‰์  ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™€์„œ ์ง์ ‘ ALS ๋ชจ๋ธ์„ Spark์—์„œ ์‚ฌ์šฉํ•ด๋ณด์ž.

 

 

[ALS ๊ฐœ๋…]

https://mengu.tistory.com/60

 

[์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜] ALS ๊ฐœ๋…, Basic ํ•˜๊ฒŒ feat. ์ฝ”๋“œ X

Alternating Least Squares ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜๋กœ, ๊ต๋Œ€ ์ตœ์†Œ ์ œ๊ณฑ๋ฒ•์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆฐ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„  ๊ฐ„๋‹จํ•œ ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐœ๋…๋“ค์„ ์‚ดํŽด๋ณด๊ณ , ALS์˜ ์žฅ์ ์„ ์•Œ์•„๋ณด์ž. ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ : ์‚ฌ์šฉ์ž๊ฐ€

mengu.tistory.com

 

 

 

 

Basic Settings

 

from matplotlib import font_manager, rc
font_path = 'C:\\WINDOWS\\Fonts\\HBATANG.TTF'
font = font_manager.FontProperties(fname=font_path).get_name()
rc('font', family=font)

import os
import findspark
findspark.init(os.environ.get("SPARK_HOME"))
import pyspark
from pyspark import SparkConf, SparkContext
import pandas as pd
import faulthandler
faulthandler.enable()
from pyspark.sql import SparkSession

MAX_MEMORY = "5g"
spark = SparkSession.builder.master('local').appName("movie-recommedation")\
.config("spark.executer.memory", MAX_MEMORY)\
.config("spark.driver.memory", MAX_MEMORY).getOrCreate()

 

 

 

 

 

 

์˜ํ™” ํ‰์  ๋ฐ์ดํ„ฐ ๋ฐ›์•„์˜ค๊ธฐ


 

 

(1) ์‚ฌ์ดํŠธ์— ์ ‘์†ํ•ด์ค๋‹ˆ๋‹ค.

 

https://grouplens.org/datasets/movielens/25m/

 

MovieLens 25M Dataset

MovieLens 25M movie ratings. Stable benchmark dataset. 25 million ratings and one million tag applications applied to 62,000 movies by 162,000 users. Includes tag genome data with 15 million releva…

grouplens.org

 

 

 

 

(2) ๊ฒ€์€ ์‚ฌ๊ฐํ˜• ์•ˆ์— ์žˆ๋Š” Zip ๋งํฌ๋ฅผ ๋ˆŒ๋Ÿฌ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œํ•ด์ค๋‹ˆ๋‹ค.

 

 

 

 

(3) C:\์›ํ•˜๋Š” ์ €์žฅ ๊ฒฝ๋กœ\ml-25m ์ž์‹ ์ด ์›ํ•˜๋Š” ๊ฒฝ๋กœ์— ํด๋”๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

 

 

 

 

 

 

 

ALS ์‚ฌ์šฉํ•˜๊ธฐ


 

(1) ๋ฐ์ดํ„ฐ ๋กœ๋“œ

 

ratings_file = 'C:/์ž์‹ ์ด ์„ค์ •ํ•œ ํด๋” ๊ฒฝ๋กœ/ml-25m/ratings.csv'
ratings_df = spark.read.csv(f"file:///{ratings_file}", inferSchema=True, header=True)

 

 

 

 

(2) ํ•„์š”ํ•œ ์นผ๋Ÿผ ๋‚จ๊ธฐ๊ธฐ ๋ฐ ๋ฐ์ดํ„ฐ ํ™•์ธ

 

ratings_df.show()

# timestamp ํ•„์š”์—†์œผ๋‹ˆ๊นŒ ์—†์• ๊ธฐ
ratings_df = ratings_df.select(['userId', 'movieId', 'rating'])


+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|    296|   5.0|1147880044|
|     1|    306|   3.5|1147868817|
|     1|    307|   5.0|1147868828|
|     1|    665|   5.0|1147878820|
|     1|    899|   3.5|1147868510|
|     1|   1088|   4.0|1147868495|
|     1|   1175|   3.5|1147868826|
|     1|   1217|   3.5|1147878326|
|     1|   1237|   5.0|1147868839|
|     1|   1250|   4.0|1147868414|
|     1|   1260|   3.5|1147877857|
|     1|   1653|   4.0|1147868097|
|     1|   2011|   2.5|1147868079|
|     1|   2012|   2.5|1147868068|
|     1|   2068|   2.5|1147869044|
|     1|   2161|   3.5|1147868609|
|     1|   2351|   4.5|1147877957|
|     1|   2573|   4.0|1147878923|
|     1|   2632|   5.0|1147878248|
|     1|   2692|   5.0|1147869100|
+------+-------+------+----------+
only showing top 20 rows



----

ratings_df.printSchema()


root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)

 

userId ๋ณ„๋กœ ์˜ํ™” ์ฝ”๋“œ์™€ ๊ทธ์— ๋”ฐ๋ฅธ ํ‰์ ์ด ์ •๋ฆฌ๋˜์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ์ด๋‹ค. ๋ชจ๋“  ์œ ์ €๊ฐ€ ์˜ํ™”๋ฅผ ๋ณธ ๊ฒƒ์€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์—, 1๋ฒˆ ์œ ์ €๊ฐ€ ์•ˆ ๋ณธ ์˜ํ™”๋Š” ์•„์˜ˆ ๋ฐ์ดํ„ฐ์— ์—†์„ ๊ฒƒ์ด๋‹ค.

 

 

๋ฐ์ดํ„ฐ์— ์˜ฌ๋ผ์˜จ ์˜ํ™”์˜ ๊ฐœ์ˆ˜๋Š” ์ด 59047๊ฐœ์ด๋‹ค.

 

ratings_df.createOrReplaceTempView('ratings_df')

query = '''
SELECT
    COUNT(DISTINCT movieId) as movie_count
FROM
    ratings_df
'''

spark.sql(query).show()


+-----------+
|movie_count|
+-----------+
|      59047|
+-----------+

 

 

์ฒซ ๋ฒˆ์งธ ์œ ์ €๊ฐ€ ํ‰์ ์„ ๋งค๊ธด ์˜ํ™”๋Š” ์ด 70๊ฐœ๋‹ค.

 

query = '''
SELECT
    COUNT(*) as 1_count
FROM
    ratings_df
WHERE
    userId == 1
'''

spark.sql(query).show()


+-------+
|1_count|
+-------+
|     70|
+-------+

 

 

์šฐ๋ฆฐ ALS ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ†ตํ•ด์„œ '์•ˆ ๋ณธ ์˜ํ™”' ์ค‘ 1๋ฒˆ ์œ ์ €๊ฐ€ ๊ฐ€์žฅ ์ข‹์•„ํ•  ๊ฒƒ ๊ฐ™์€ ์˜ํ™”๋ฅผ ์ถ”์ฒœํ•˜๋Š” ๊ฒƒ์ด ํฌ์ŠคํŒ…์˜ ๋ชฉํ‘œ๋‹ค. 

 

 

 

 

(3) ํ‰์  ํ†ต๊ณ„

 

# ํ‰์  ํ†ต๊ณ„
ratings_df.select('rating').describe().show()


+-------+------------------+
|summary|            rating|
+-------+------------------+
|  count|          25000095|
|   mean| 3.533854451353085|
| stddev|1.0607439611423535|
|    min|               0.5|
|    max|               5.0|
+-------+------------------+

 

 

 

 

 

(4) Train, Test ๋ฐ์ดํ„ฐ ๋‚˜๋ˆ„๊ธฐ

 

# train, test set ๋‚˜๋ˆ„๊ธฐ
train_df, test_df = ratings_df.randomSplit([0.8,0.2])

print(f'train_df ์˜ ๊ธธ์ด : {train_df.count()}')
print(f'test_df ์˜ ๊ธธ์ด : {test_df.count()}')


train_df ์˜ ๊ธธ์ด : 19999764
test_df ์˜ ๊ธธ์ด : 5000331

 

 

 

 

 

(5) ๋ชจ๋ธ ๊ตฌ์ถ• ๋ฐ ํ›ˆ๋ จ

 

from pyspark.ml.recommendation import ALS


# ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜
als = ALS(
    maxIter = 5,
    regParam=0.1,
    userCol='userId',
    itemCol = 'movieId',
    ratingCol = 'rating',
    # ํ•™์Šตํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฐ์ดํ„ฐ ๋งŒ๋‚ฌ์„ ๋•Œ ์–ด๋–ป๊ฒŒ ๋Œ€์ฒ˜ํ•  ๊ฒƒ์ธ๊ฐ€ ์„ค์ •.
    coldStartStrategy='drop'
)

 

'userCol' ์—๋Š” ์œ ์ € ์•„์ด๋”” ์นผ๋Ÿผ์„ ์ž…๋ ฅํ•œ๋‹ค.

'itemCol' ์—๋Š” ์•„์ดํ…œ ์•„์ด๋”” ์นผ๋Ÿผ์„ ์ž…๋ ฅํ•œ๋‹ค.

'ratingCol' ์—๋Š” ํ‰์  ์นผ๋Ÿผ์„ ์ž…๋ ฅํ•œ๋‹ค.

 

๋‹จ์ˆœํžˆ ALS์˜ ๊ฐœ๋…๊ณผ ์‚ฌ์šฉ ๋ฐฉ๋ฒ•๋งŒ ์•Œ๊ณ  ์žˆ์–ด๋„ ๋ชจ๋ธ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค!

 

 

# ๋ชจ๋ธ fit()
model = als.fit(train_df)

 

 

 

 

 

(6) ์ถ”๋ก  ๋ฐ ์„ฑ๋Šฅ ํ™•์ธ

 

# ์ถ”๋ก 
prediction = model.transform(test_df)
prediction.show()


+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     1|   1088|   4.0| 2.5297563|
|     3| 175197|   3.5|  2.874376|
|     4|   1580|   4.5|  3.097906|
|     9|   1088|   5.0| 3.9108486|
|    12|   8638|   4.0| 3.8486152|
|    13|   3175|   4.0| 3.6777968|
|    20|   1580|   4.0|  4.033978|
|    23|   1959|   5.0| 3.9067807|
|    30|   3175|   4.5| 3.8995147|
|    31|   8638|   2.0| 2.5974488|
|    41|   1580|   4.0| 3.5832882|
|    41|   2366|   3.0|  3.150863|
|    57|   1580|   3.0|  3.604269|
|    58|   6658|   5.0| 3.2883408|
|    63|  68135|   4.0| 3.1181765|
|    70|   1580|   3.0|  3.017872|
|    72|   1591|   2.0|  2.358852|
|    75|   1088|   3.5|  3.070664|
|    75|   1959|   2.0|  3.243212|
|    80|   1342|   2.0|  2.482828|
+------+-------+------+----------+
only showing top 20 rows


----

prediction.select('rating', 'prediction').describe().show()


+-------+------------------+-----------------+
|summary|            rating|       prediction|
+-------+------------------+-----------------+
|  count|           4996844|          4996844|
|   mean| 3.534328668255403|3.399862379635884|
| stddev|1.0605845133909824| 0.63975911880358|
|    min|               0.5|       -2.4304335|
|    max|               5.0|        6.9266715|
+-------+------------------+-----------------+

 

์‹ค์ œ rating๊ณผ prediction ํ†ต๊ณ„ ๊ฐ’์„ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ, mean ๊ฐ’์ด ์œ ์˜๋ฏธํ•˜๊ฒŒ ๋น„์Šทํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ํ‘œ์ค€ ํŽธ์ฐจ๊ฐ€ ์ฐจ์ด๋‚˜๊ณ , min ๊ฐ’์ด ๋งˆ์ด๋„ˆ์Šค๊ฐ€ ์žˆ๋‹ค๋Š” ์ ์—์„œ ๋ณด์™„์ด ํ•„์š”ํ•ด ๋ณด์ธ๋‹ค.

 

 

 

# ํ‰๊ฐ€ํ•˜๊ธฐ
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating',predictionCol='prediction')


# RMSE
rmse = evaluator.evaluate(prediction)
rmse


0.8134903395643102

 

 

 

 

(7) ์œ ์ €๋งˆ๋‹ค Top 5 recommendation ํ•ด์ฃผ๊ธฐ

 

# ์œ ์ €๋งˆ๋‹ค Top5 recommendation ํ•ด์ฃผ๊ธฐ
model.recommendForAllUsers(5).show()


+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|     1|[{202231, 5.65169...|
|     3|[{194434, 6.35166...|
|     4|[{194434, 6.07157...|
|     5|[{194434, 6.13072...|
|     6|[{162436, 6.21624...|
|     7|[{185645, 5.45231...|
|     8|[{194434, 5.93970...|
|     9|[{185645, 6.51190...|
|    10|[{194434, 6.12319...|
|    12|[{194434, 5.63328...|
|    13|[{194434, 6.47734...|
|    15|[{194434, 6.69569...|
|    16|[{194434, 6.56789...|
|    17|[{199187, 6.23324...|
|    19|[{194434, 5.82160...|
|    20|[{194434, 6.83623...|
|    21|[{194434, 6.51367...|
|    22|[{185645, 7.19147...|
|    23|[{194434, 6.35103...|
|    24|[{203086, 6.53243...|
+------+--------------------+
only showing top 20 rows

 

 

 

 

(8) ์•„์ดํ…œ๋งˆ๋‹ค Top 5 ์œ ์ € ์ถ”์ฒœํ•ด์ฃผ๊ธฐ

 

# ์•„์ดํ…œ๋งˆ๋‹ค Top3 User ์ถ”์ฒœํ•ด์ฃผ๊ธฐ
model.recommendForAllItems(3).show()


+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|      1|[{18230, 5.483052...|
|      2|[{87426, 5.260374...|
|      3|[{87426, 5.19888}...|
|      4|[{52924, 4.736040...|
|      5|[{52924, 4.899741...|
|      6|[{156252, 5.42339...|
|      7|[{10417, 5.039373...|
|      8|[{87426, 5.294695...|
|      9|[{87426, 5.268173...|
|     10|[{87426, 5.282108...|
|     11|[{10417, 5.357073...|
|     12|[{87426, 5.257176...|
|     13|[{108346, 5.20422...|
|     14|[{105801, 4.85644...|
|     15|[{87426, 5.336809...|
|     16|[{96740, 5.368758...|
|     17|[{58248, 5.512191...|
|     18|[{87426, 5.399836...|
|     19|[{87426, 5.231883...|
|     20|[{87426, 5.285846...|
+-------+--------------------+
only showing top 20 rows

 

 

 

 

 

(9) ํŠน์ • ์œ ์ €๋ฅผ ์œ„ํ•œ ์ถ”์ฒœ API ๋งŒ๋“ค๊ธฐ

 

์œ ์ €๋ฅผ ์ž…๋ ฅํ•˜๋ฉด Top 3 Moive๋ฅผ ์ถ”์ฒœํ•ด์ฃผ๋Š” API๋ฅผ ๋งŒ๋“ค์–ด ๋ณด์ž.

๋จผ์ € API์— ๋“ค์–ด๊ฐˆ 'userId' ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ์งํ•ด๋ณด๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜ฌ 'recommendation' ๋ฐ์ดํ„ฐ ํ˜•์‹์„ ์‚ดํŽด๋ณด์ž.

 

from pyspark.sql.types import IntegerType


# id ๋ฐ์ดํ„ฐ๊ฐ€ ์–ด๋–ป๊ฒŒ ์ œ๊ณต๋ ์ง€ ํ™•์ธ
user_list = [65, 78, 93]
user_df = spark.createDataFrame(user_list, IntegerType()).toDF('userId')

user_df.show()


+------+
|userId|
+------+
|    65|
|    78|
|    93|
+------+

----

# recommend๊ฐ€ ์–ด๋–ป๊ฒŒ ๋‚˜์˜ฌ์ง€ ํ™•์ธ
user_recommend = model.recommendForUserSubset(user_df, 5)

# ์ฒซ๋ฒˆ์งธ ์œ ์ €์— ๋Œ€ํ•œ Top 5 recommendation ์ถ”์ถœ
movies_list = user_recommend.collect()[0].recommendations
recs_df = spark.createDataFrame(movies_list)
recs_df.show()


+-------+-----------------+
|movieId|           rating|
+-------+-----------------+
| 205277|6.759361743927002|
| 159761|6.485182762145996|
| 169606|6.246584415435791|
| 137363|5.943302154541016|
| 203633|5.908850193023682|
+-------+-----------------+

 

 

์˜ํ™” ID๋ฅผ ์œ ์ €์—๊ฒŒ ์ถ”์ฒœํ•ด๋ดค์ž, ์–ด๋–ค ์˜ํ™”์ธ์ง€ ์•Œ์ง€ ๋ชปํ•œ๋‹ค. ์˜ํ™” ID๋ฅผ ์ œ๋ชฉ, ์žฅ๋ฅด์™€ ์—ฐ๊ฒฐ์‹œ์ผœ์ฃผ์ž.

 

movies_file = 'C:/ml-25m ํด๋” ๊ฒฝ๋กœ/movies.csv'
movies_df = spark.read.csv(f"file:///{movies_file}", inferSchema=True, header=True)

movies_df.show()


+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sense and Sensibi...|       Drama|Romance|
|     18|   Four Rooms (1995)|              Comedy|
|     19|Ace Ventura: When...|              Comedy|
|     20|  Money Train (1995)|Action|Comedy|Cri...|
+-------+--------------------+--------------------+
only showing top 20 rows

 

 

movieId์™€ recommendation์ด ๋‚˜์™”์„ ๋•Œ, ์˜ํ™” ์ œ๋ชฉ๊ณผ ์žฅ๋ฅด๋ฅผ ์—ฐ๊ฒฐํ•ด์ฃผ๋Š” ์ฟผ๋ฆฌ๋ฅผ ์งœ ๋ณด์ž.

 

recs_df.createOrReplaceTempView('recommendations')
movies_df.createOrReplaceTempView('movies')

# SQL ์ฟผ๋ฆฌ
query = '''
Select
    *
From
    recommendations r
    Join movies m
    On r.movieId = m.movieId
ORDER BY
    rating desc
'''

recommendation_movies = spark.sql(query)
recommendation_movies.show()


+-------+-----------------+-------+--------------------+--------------------+
|movieId|           rating|movieId|               title|              genres|
+-------+-----------------+-------+--------------------+--------------------+
| 205277|6.759361743927002| 205277|   Inside Out (1991)|Comedy|Drama|Romance|
| 159761|6.485182762145996| 159761|         Loot (1970)|        Comedy|Crime|
| 169606|6.246584415435791| 169606|Dara O'Briain Cro...|              Comedy|
| 137363|5.943302154541016| 137363|The Mother Of Inv...|              Comedy|
| 203633|5.908850193023682| 203633|    The Bribe (2018)|        Comedy|Crime|
+-------+-----------------+-------+--------------------+--------------------+

 

 

 

API

 

# ์‹ค์ œ๋กœ ์ด์šฉํ•  ๋•Œ๋Š” ํ•˜๋‚˜์˜ func์•ˆ์—์„œ ์ˆ˜ํ–‰
# ํŽธํ•˜๋‹ค.

query = '''
Select
    *
From
    recommendations r
    Join movies m
    On r.movieId = m.movieId
ORDER BY
    rating desc
'''

def get_recommendation(user_id, num_recs):

	# userid๋ฅผ ์ž…๋ ฅ ๋ฐ›๋Š”๋‹ค.
    user_df = spark.createDataFrame([user_id], IntegerType()).toDF('userId')
    
    # userid๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ recommendation ํ•˜๊ธฐ
    user_recs_df = model.recommendForUserSubset(user_df, num_recs)
    
    # recommendation์„ dataframe์œผ๋กœ
    recs_list = user_recs_df.collect()[0].recommendations
    recs_df = spark.createDataFrame(recs_list)
    recs_df.createOrReplaceTempView('recommendations')
    movies_df.createOrReplaceTempView('movies')
    
    # SQL ๋ฌธ์„ ํ†ตํ•ด์„œ recommendation๊ณผ ์˜ํ™” ์ œ๋ชฉ/์žฅ๋ฅด ๋ฐ์ดํ„ฐ๋ฅผ ํ•ฉ์ณค๋‹ค.
    recommend_movies = spark.sql(query)
    return recommend_movies

 

 

# userId๊ฐ€ 456๋ฒˆ์ธ ์‚ฌ๋žŒ์—๊ฒŒ 10๊ฐœ์˜ ์˜ํ™”๋ฅผ ์ถ”์ฒœํ•ด์ฃผ์ž.
recs = get_recommendation(456, 10)

# pandas ๋ณ€ํ™˜
recs.toPandas()

 

 

 

 

 

 

์‹œ๋‚˜๋ฆฌ์˜ค '์˜ํ™”๋ฅผ ์ถ”์ฒœํ•ด๋“œ๋ฆฝ๋‹ˆ๋‹ค!!'


 

userId 100๋ฒˆ์ด ๋‚ดํด๋ฆญํŠธ์— ์ ‘์†ํ–ˆ๋‹ค.

๋ ๋ง. ํ™ˆ ํ™”๋ฉด์ด ๋œจ๋Š” ๊ฐ€์šด๋ฐ, ๋„คํŠธ์›Œํฌ๋Š” ๋ฐฑ์•ค๋“œ์— ๋ฏธ์…˜์„ ๋‚ด๋ฆฐ๋‹ค.

 

'userId 100๋ฒˆ์ด ์ข‹์•„ํ•  ๋งŒํ•œ ์˜ํ™”๋ฅผ 5๊ฐœ ์ถ”์ฒœํ•ด์ค˜๋ผ! ์นœ์ ˆํ•˜๊ฒŒ!'

 

๋ฐฑ์•ค๋“œ๋Š” ๊ทธ๋ƒฅ ์œ„์˜ ์ฝ”๋“œ๋ฅผ ๋Œ๋ฆฌ๋ ค ํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ pandas dataframe์œผ๋กœ ์ „๋‹ฌํ–ˆ๋‹ค๊ฐ„ ์š•๋จน์„ ๊ฒƒ ๊ฐ™์€ ๋ถ„์œ„๊ธฐ์—ฌ์„œ ์ƒˆ๋กญ๊ฒŒ ํ•จ์ˆ˜๋ฅผ ์งœ๊ธฐ๋กœ ํ–ˆ๋‹ค. 

 

 

def recommendation(user_id, num_recs):
    recs = get_recommendation(user_id, num_recs)
    r = recs.toPandas()
    
    # ์ œ๋ชฉ ๋ฆฌ์ŠคํŠธ ๋ฐ›์•„์˜ค๊ธฐ
    title_list = list(r['title'])
    
    # ์žฅ๋ฅด ๋ฆฌ์ŠคํŠธ ๋ฐ›์•„์˜ค๊ธฐ
    genre_list = list(r['genres'])
    
    # ์•Œ๋งž์€ ๋งจํŠธ๋ฅผ ์จ์„œ ์œ ์ €์—๊ฒŒ ์ถ”์ฒœํ•ด์ฃผ๊ธฐ
    for i in range(0, num_recs):
        print(f'{i+1}๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” {title_list[i]}/{genre_list[i]}์ž…๋‹ˆ๋‹ค.')

 

 

 

์ด์ œ userId 100๋ฒˆ์ด ์˜ค๋ฉด ๋‹ค์Œ ํ•จ์ˆ˜๋ฅผ ์“ฐ๋ฉด ๋œ๋‹ค.

 

recommendation(100, 5)


1๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” Adrenaline (1990)/(no genres listed)์ž…๋‹ˆ๋‹ค.
2๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” Truth and Justice (2019)/Drama์ž…๋‹ˆ๋‹ค.
3๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” School of Babel (2014)/Documentary์ž…๋‹ˆ๋‹ค.
4๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” National Theatre Live: One Man, Two Guvnors (2011)/Comedy์ž…๋‹ˆ๋‹ค.
5๋ฒˆ์งธ ์ถ”์ฒœ๋“œ๋ฆด ์˜ํ™”๋Š” Les Luthiers: El Grosso Concerto (2001)/(no genres listed)์ž…๋‹ˆ๋‹ค.

 

 

 

 

 

์ด๋ ‡๊ฒŒ ํ•ด์„œ Spark์—์„œ ์ถ”์ฒœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ALS๋ฅผ ํ™œ์šฉํ•ด๋ดค๋‹ค.

๋„์›€์ด ๋˜์—ˆ๊ธธ ๋ฐ”๋ผ๋ฉฐ, ๋”ฐ๋ผ์˜ค๋Š๋ผ ๊ณ ์ƒํ•˜์…จ์Šต๋‹ˆ๋‹ค.