This repository contains a modern TensorFlow implementation of the 2010 paper "Application of a Gaussian, Missing-Data Model to Product Recommendation". The paper's author was a member of "The Ensemble", a team in the 2009 $1 million Netflix Prize competition.
To install all libraries/dependencies used in this project, run
pip3 install -r requirements.txt
Due to GitHub's file size limit, `Y_full.csv` is split into smaller parts in `./data/Y_full`. To merge the parts back into one file and place the merged file in `./data/`, run
cd ./data/Y_full
cat part* > Y_full.csv
mv Y_full.csv ..
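For environments without `cat` (for example, a plain Windows shell), the same merge can be sketched in Python. This is a minimal sketch, not part of the repository: the function name `merge_parts` and its parameters are hypothetical, and it assumes the part files sort into the correct order by name, which is also what the shell glob `part*` relies on.

```python
import shutil
from pathlib import Path


def merge_parts(parts_dir, output_path, pattern="part*"):
    """Concatenate all files matching `pattern` in `parts_dir` into `output_path`.

    Hypothetical helper, not part of the repository. Parts are joined in
    sorted filename order. Returns the number of parts merged.
    """
    parts = sorted(Path(parts_dir).glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files matching {pattern!r} in {parts_dir}")
    with open(output_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                # Stream each part so the full file never has to fit in memory.
                shutil.copyfileobj(chunk, merged)
    return len(parts)


# Example call matching the shell commands above (run from the repository root):
# merge_parts("./data/Y_full", "./data/Y_full.csv")
```

The files are opened in binary mode so the bytes are copied unchanged, which mirrors what `cat` does.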
Number of Movies (k) | Number of Users (n) | Data Preprocessing Time |
---|---|---|
100 | 137328 | ~1.5 min |
17700 | 480189 | ~40 min |
The initial experiments were run on the 100 movies with the greatest number of observed ratings. The results we obtained match those reported in the paper:
Algorithm | Iterations to Converge | Runtime | RMSE | Initial Estimate of R |
---|---|---|---|---|
EM | 26 | ~39 min | 0.9170 | |
McMichael | 35 | ~38 min | 0.9170 | |
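The RMSE column is the root-mean-square error between predicted and actual ratings. As a reminder of the metric, here is a minimal pure-Python sketch. The function name `rmse` is hypothetical, and the exact evaluation set used in the experiments (for example, which ratings are held out) is not specified in this README, so the input pairs are an assumption.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) rating pairs.

    Hypothetical helper for illustration; only observed ratings should be
    passed in, since missing entries have no actual value to compare against.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    squared_error = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared_error / len(pairs))


# Three observed ratings with their predictions: squared errors are 1, 0, 1.
example = [(4, 3), (2, 2), (5, 4)]
score = rmse(example)  # sqrt(2 / 3)
```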
For full-scale experiments:
Algorithm | Iterations to Converge | Runtime | RMSE | Initial Estimate of R |
---|---|---|---|---|
McMichael |
The preprocessed data and the initial experiments' results were obtained on a 2.2 GHz 6-core Intel Core i7 processor with 32 GB of RAM. The full-scale experiments' results were obtained on an `ml.p3.8xlarge` notebook instance on Amazon Web Services. Note that `tf.function()` is used extensively (to construct callables that execute static TensorFlow graphs) to accelerate computation in the initial experiments, where all tensors are dense.
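To illustrate the `tf.function` pattern on dense tensors, here is a minimal sketch. It is not taken from the repository: `dense_covariance` is a hypothetical function that computes the sample covariance (dividing by n) of the columns of a dense matrix, chosen only because it is a small, fully dense computation.

```python
import tensorflow as tf


@tf.function
def dense_covariance(y):
    """Sample covariance (divided by n) of the columns of a dense (n, k) tensor.

    Hypothetical example. The first call traces the Python body into a
    TensorFlow graph; later calls with the same input signature reuse it.
    """
    n = tf.cast(tf.shape(y)[0], y.dtype)
    centered = y - tf.reduce_mean(y, axis=0, keepdims=True)
    return tf.matmul(centered, centered, transpose_a=True) / n


# Two users (rows) and two movies (columns), all entries observed.
y = tf.constant([[1.0, 2.0], [3.0, 6.0]])
cov = dense_covariance(y)
```

Because the decorated function is traced once per input signature, repeated calls inside an iterative algorithm avoid most of the Python overhead of eager execution.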
For more background on the Netflix Prize: