Apache Spark is an open source cluster computing framework based on resilient distributed datasets (RDDs). Because the framework is distributed, it scales efficiently and is well suited to machine learning on large datasets, such as those used for music recommendation systems, which often include millions of users and millions of songs. Databricks is a company founded by the creators of Apache Spark that offers users access to big data processing using Spark.
I will use Spark to predict how many times a user listens to a song. These predictions form the basis of the music recommendation system. Once a model performs well on a test dataset, I will apply it to the songs a user has not yet listened to and recommend those songs in descending order of predicted plays.
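The ranking step described above can be sketched in plain Python. This is a minimal illustration, not the code from the analysis: the function name `recommend`, the song identifiers, and the predicted play counts are all hypothetical, and in the actual project this step would presumably run on Spark rather than on an in-memory dictionary.

```python
from typing import Dict, List, Set


def recommend(predicted_plays: Dict[str, float],
              listened: Set[str],
              k: int = 10) -> List[str]:
    """Return up to k songs the user has not heard, highest predicted plays first."""
    # Keep only songs the user has not listened to yet.
    candidates = [(song, plays) for song, plays in predicted_plays.items()
                  if song not in listened]
    # Sort by predicted plays, descending; break ties by song id so the
    # result is deterministic.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [song for song, _ in candidates[:k]]


# Hypothetical predictions for one user (made-up numbers, not model output).
example_predictions = {"song_a": 3.2, "song_b": 0.4, "song_c": 5.1, "song_d": 1.7}
example_listened = {"song_c"}
top_two = recommend(example_predictions, example_listened, k=2)
```

Here `song_c` has the highest predicted count but is excluded because the user has already listened to it, so the recommendation list starts with the next highest unheard song.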
I use an alternating least squares (ALS) model to predict how many times a given user listens to a song. The exploratory data analysis shows that many songs have been listened to only once, so I compare models that include and exclude those songs. While the model that includes them returns a slightly lower RMSE on the test dataset, the model that excludes them may give better recommendations, since its training set consists of songs that may be more reflective of users' listening profiles (users have listened to them more than once).
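The two ingredients of this comparison, filtering out single listens and scoring predictions by RMSE, can be sketched as follows. This is an illustrative sketch in plain Python, not the Spark code from the analysis. It assumes the data is a list of (user, song, play count) rows and that "listened to once" means a row with a play count of 1; the helper names and the example numbers are hypothetical.

```python
import math
from typing import List, Tuple

# Assumed row layout: (user_id, song_id, play_count).
Triplet = Tuple[str, str, int]


def drop_single_listens(triplets: List[Triplet]) -> List[Triplet]:
    """Keep only rows whose play count is greater than one."""
    return [row for row in triplets if row[2] > 1]


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean squared error between two equal-length, non-empty lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up rows and predictions, for illustration only.
example_triplets = [
    ("u1", "s1", 1),
    ("u1", "s2", 4),
    ("u2", "s1", 2),
    ("u2", "s3", 1),
]
filtered = drop_single_listens(example_triplets)

# Errors are 1 and -2, so the RMSE is sqrt((1 + 4) / 2) = sqrt(2.5).
example_rmse = rmse([4.0, 2.0], [3.0, 4.0])
```

Training one model on `example_triplets` and another on `filtered`, then computing `rmse` on each model's test predictions, mirrors the include-versus-exclude comparison described above.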
Importantly, because Spark scales efficiently, this model can be extended to many more users. For example, Spotify has around 30 million monthly active users, which is only about an order of magnitude more than the number of users in this dataset.
The analysis can be found on Databricks here.