principle:
projects:
Building a Movie Recommendation Service
Matrix factorization algorithm implemented by hand
load necessary packages
import pandas as pd
load movie data and rating data
movies=pd.read_csv("movies.csv")
Movie IDs are not contiguous, so build a movie dictionary that uses the actual movie ID as the key and the line number as the movie's index in the NumPy matrix.
def build_movies_dict(movies_file):
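The mapping described above can be sketched as follows. This is a minimal illustration, not the original function body: the name `build_movie_index`, the made-up sample rows, and the assumption that the file has a header row with `movieId` in the first column are all hypothetical.

```python
import csv
import io

def build_movie_index(lines):
    """Map each actual movieId (key) to its 0-based line position (value)."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row (assumed: movieId,title,genres)
    return {int(row[0]): idx for idx, row in enumerate(reader)}

# Made-up sample rows; note the gaps between the movie ids.
sample = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Animation\n"
    "5,Heat (1995),Action\n"
    '10,"American President, The (1995)",Comedy\n'
)
movie_index = build_movie_index(sample)
print(movie_index)  # {1: 0, 5: 1, 10: 2}
```

Using `csv.reader` rather than splitting on commas keeps quoted titles that contain commas in one field.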
Each line of the input file represents one tag applied to one movie by one user and has the following format: userId,movieId,tag,timestamp. Make sure you know the number of users and items in your dataset. The function returns the sparse matrix as a NumPy array.
def read_data(input_file,movies_dict):
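A rough sketch of how such a file might be turned into a user-by-movie array is shown below. It is not the original `read_data` body. It assumes that the third field is a numeric rating, that user IDs are contiguous and start at 1, and that 0 marks "not rated"; the name `read_ratings` and the sample rows are made up for illustration.

```python
import csv
import io
import numpy as np

def read_ratings(lines, movie_index, n_users):
    """Build a dense (n_users x n_movies) array; 0 means 'not rated'."""
    X = np.zeros((n_users, len(movie_index)))
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    for user_id, movie_id, value, _timestamp in reader:
        # Assumption: user ids are contiguous and start at 1.
        X[int(user_id) - 1, movie_index[int(movie_id)]] = float(value)
    return X

# Made-up sample rows in the userId,movieId,<value>,timestamp layout.
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,10,3.5,964981247\n"
    "2,5,5.0,964982224\n"
)
X = read_ratings(ratings_csv, {1: 0, 5: 1, 10: 2}, n_users=2)
print(X)
```

The dictionary argument is what translates the non-contiguous movie IDs into column positions.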
matrix factorization implementation
X is the user-movie rating matrix, which is very sparse. P and Q are the user-feature matrix and the movie-feature matrix, respectively. The objective is to use gradient descent to find P and Q such that $P \times Q$ approximates X.
def matrix_factorization(X,P,Q,K,steps,alpha,beta):
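To make the gradient descent idea concrete, here is a minimal full-batch sketch. It is not the original function body. It assumes that P has shape (users, K), that Q has shape (K, movies), that zeros in X mean "unrated", that `alpha` is the learning rate, and that `beta` is an L2 regularization weight. The names `factorize_sketch` and `observed_rmse` and the toy matrix are hypothetical.

```python
import numpy as np

def factorize_sketch(X, P, Q, steps=5000, alpha=0.005, beta=0.02):
    """Fit X ~ P @ Q on the observed (non-zero) entries by gradient descent."""
    P, Q = P.copy(), Q.copy()
    observed = X > 0
    for _ in range(steps):
        E = observed * (X - P @ Q)             # error on observed entries only
        P_step = alpha * (E @ Q.T - beta * P)  # descent step for P
        Q_step = alpha * (P.T @ E - beta * Q)  # descent step for Q
        P += P_step
        Q += Q_step
    return P, Q

def observed_rmse(X, P, Q):
    """RMSE between X and P @ Q over the observed entries."""
    observed = X > 0
    return float(np.sqrt(np.mean((X - P @ Q)[observed] ** 2)))

# Toy rating matrix (made up); 0 means the user has not rated the movie.
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
rng = np.random.default_rng(0)
K = 2
P0 = rng.random((X.shape[0], K))
Q0 = rng.random((K, X.shape[1]))
P, Q = factorize_sketch(X, P0, Q0)
before, after = observed_rmse(X, P0, Q0), observed_rmse(X, P, Q)
print(before, "->", after)
```

Masking the error with `observed` is the important design choice: without it, the model would be trained to predict 0 for every unrated movie.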
main function
def main(X,K):
if __name__ == '__main__':
recommend movies for users who have rated some of the movies
Recommend the top 50 movies for each user, chosen from the movies he or she has not rated. This is implemented separately from model building because, once the model is built, it can be reused many times.
def dict_with_user_unrated_movies(rating_file,movie_mapping_id):
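Given trained factors, the recommendation step might look like the sketch below. It assumes the same conventions as the factorization (P is users by K, Q is K by movies, 0 in X means unrated). The function name `top_n_unrated` and the tiny arrays are hypothetical and only illustrate the ranking logic.

```python
import numpy as np

def top_n_unrated(X, P, Q, user, n=50):
    """Column indices of the n best-scoring movies the user has not rated."""
    rated = X[user] > 0
    scores = P[user] @ Q                       # predicted rating per movie
    scores = np.where(rated, -np.inf, scores)  # exclude already-rated movies
    order = np.argsort(-scores)                # highest score first
    n_unrated = int((~rated).sum())
    return order[:min(n, n_unrated)].tolist()

# Made-up factors and ratings for two users and four movies.
X = np.array([[5.0, 0.0, 0.0, 1.0],
              [0.0, 3.0, 0.0, 0.0]])
P = np.array([[1.0, 0.0],
              [0.0, 1.0]])
Q = np.array([[5.0, 2.0, 4.0, 1.0],
              [1.0, 3.0, 0.5, 2.0]])
first_user = top_n_unrated(X, P, Q, user=0)        # [2, 1]
second_user = top_n_unrated(X, P, Q, user=1, n=2)  # [3, 0]
print(first_user, second_user)
```

The returned values are matrix column indices, so they still need to be mapped back to actual movie IDs through the movie dictionary.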
Using Apache Spark to facilitate computing
load data and convert it to RDD
movies = sc.textFile("/FileStore/tables/movies.csv")
data cleaning
ratings=ratings.map(lambda x:x.split(","))
have a glimpse at ratings data
ratings.take(2)
remove the header line
ratings=ratings.filter(lambda x:"userId" not in x).map(lambda x:(x[0],x[1],x[2]))
split the data into training set, validation set and test set
training_RDD, validation_RDD, test_RDD = ratings.randomSplit([6, 2, 2])
use the alternating least squares (ALS) algorithm to calculate the coefficients of the matrix factorization
from pyspark.mllib.recommendation import ALS
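For intuition about what alternating least squares does, here is a small NumPy sketch of the general idea: fix one factor matrix, solve a ridge regression for each row of the other, then swap. It is not Spark's implementation and makes no claim about how `ALS.train` regularizes or parallelizes. The function name `als_sketch`, the parameter `lam`, and the toy matrix are hypothetical.

```python
import numpy as np

def als_sketch(X, rank, iterations=10, lam=0.1, seed=0):
    """Alternating least squares on the observed (non-zero) entries of X."""
    rng = np.random.default_rng(seed)
    n_users, n_items = X.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    observed = X > 0
    reg = lam * np.eye(rank)
    for _ in range(iterations):
        # Fix V and solve a small ridge regression for each user.
        for u in range(n_users):
            Vu = V[observed[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ X[u, observed[u]])
        # Fix U and solve a small ridge regression for each item.
        for i in range(n_items):
            Ui = U[observed[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ X[observed[:, i], i])
    return U, V

# Toy rating matrix (made up); 0 means the user has not rated the movie.
X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
U, V = als_sketch(X, rank=2, iterations=20)
mask = X > 0
fit_rmse = float(np.sqrt(np.mean((X - U @ V.T)[mask] ** 2)))
print(fit_rmse)
```

Each inner solve is an ordinary least-squares problem with a closed-form solution, which is why ALS parallelizes well across users and items.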
For rank 8 the RMSE is 0.9382743636404246
For rank 12 the RMSE is 0.9389290854027168
The best model was trained with rank 4
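The RMSE reported above compares predicted and actual ratings for the same (user, movie) pairs. A plain-Python sketch of that computation, using dictionaries keyed by (user, movie) in place of joined RDDs, could look like this. The function name `rmse` and the sample values are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """RMSE over the (user, movie) keys present in both dicts."""
    common = actual.keys() & predicted.keys()
    if not common:
        raise ValueError("no overlapping (user, movie) pairs")
    sq_err = sum((actual[k] - predicted[k]) ** 2 for k in common)
    return math.sqrt(sq_err / len(common))

# Made-up ratings; (2, 30) has a prediction but no actual rating, so it is ignored.
actual = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}
predicted = {(1, 10): 3.0, (1, 20): 3.0, (2, 10): 4.0, (2, 30): 2.5}
score = rmse(actual, predicted)
print(score)
```

Restricting the computation to the shared keys mirrors what a join on (user, movie) does: pairs without both a prediction and a true rating do not contribute.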
what the prediction looks like
predictions.take(3)
[((44, 3272), 3.9103419004701716),
((618, 7184), 2.9601086695162566),
((264, 52328), 3.3141828739969803)]
test model performance on test data set
model = ALS.train(training_RDD, best_rank, iterations=iterations,
recommend movies for a specified user, here the user with id=2
# Recommend movies that the user has not rated yet; user ID 2 is used as an example.
join movie information
predictRes=predictRes.map(lambda x:(str(x[0]),(x[1]))).join(movies)
[('27706',
(3.761019669094141,
("Lemony Snicket's A Series of Unfortunate Events (2004)",
'Adventure|Children|Comedy|Fantasy'))),
('37240', (2.4144329602095578, ('Why We Fight (2005)', 'Documentary'))),
('45183',
(4.492474207135306,
('Protector The (a.k.a. Warrior King) (Tom yum goong) (2005)',
'Action|Comedy|Crime|Thriller')))]
reformat the data and take the 25 movies with the highest predicted ratings
predictRes=predictRes.map(lambda x:(x[1][1][0],x[1][0]))
[('Paulie (1998)', 6.010185255833662),
('Sweet Land (2005)', 5.791408996053072),
('Stir Crazy (1980)', 5.76573694136966),
('Westerner The (1940)', 5.549436023244587),
('Battlestar Galactica (2003)', 5.533985629970289),
('Truly Madly Deeply (1991)', 5.483704154666285),
('Auntie Mame (1958)', 5.380268674819099),
('School Daze (1988)', 5.363982694591471),
('Bugs Bunny / Road Runner Movie The (a.k.a. The Great American Chase) (1979)',
5.358809646115851),
('Cashback (2006)', 5.358809646115851),
('Memoirs of a Geisha (2005)', 5.317983563349337),
('About Time (2013)', 5.316004836486595),
('Vicious Kind The (2009)', 5.285894518338154),
('Spread (2009)', 5.285894518338154),
('New Rose Hotel (1998)', 5.27282052920912),
('Cocaine Cowboys (2006)', 5.264844765867043),
("Empire of the Wolves (L'empire des loups) (2005)", 5.242122655555558),
('Repo! The Genetic Opera (2008)', 5.242122655555558),
('Visitors The (Visiteurs Les) (1993)', 5.24178791470284),
('Aguirre: The Wrath of God (Aguirre der Zorn Gottes) (1972)',
5.23498039535383),
('Bloody Sunday (2002)', 5.2286529970954305),
('Mrs. Miniver (1942)', 5.216295450453579),
('Chaos (2001)', 5.212268024919439),
("Twelve O'Clock High (1949)", 5.207334792061248),
('American Pimp (1999)', 5.207111214164087)]