Apache Spark is an open source cluster computing framework built around resilient distributed datasets (RDDs). Because the framework is distributed, it scales efficiently and is well suited to machine learning on large datasets, such as those used for music recommendation systems, which often include millions of users and millions of songs. Databricks is a company founded by the creators of Apache Spark that offers users access to big data processing using Spark.
I will use Spark to predict the year of a song's release from audio features of the song, such as its tempo or duration. This will form the basis of the music recommendation system. The data comes from a subset of the Million Song Dataset from the UCI Machine Learning Repository.
I use linear regression to predict the year of a song's release from audio data about the song and achieve an RMSE of around 15.124 years on the test dataset. The lowest RMSE is achieved using linear regression without interaction terms between the features, which suggests that interactions between the features add little predictive power. The RMSE obtained beats the baseline by almost 7 years, which is pretty good for simple linear regression.
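To make the evaluation terms above concrete, here is a minimal sketch in plain Python (not the Spark API, and not the code from the analysis itself). It assumes that the baseline always predicts the mean release year of the training set and that "interaction" means appending pairwise products of the features, squares included; both are assumptions, since the text does not spell them out. The function names and the toy years are invented for illustration.

```python
import math
from itertools import combinations_with_replacement


def rmse(predictions, labels):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def baseline_predictions(train_labels, n):
    """Assumed baseline: always predict the mean training label."""
    mean_year = sum(train_labels) / len(train_labels)
    return [mean_year] * n


def add_interactions(features):
    """Append all pairwise products (squares included) to a feature vector.

    Assumption: this is one common way to build interaction features;
    the original analysis may construct them differently.
    """
    products = [a * b for a, b in combinations_with_replacement(features, 2)]
    return list(features) + products


# Toy release years, invented purely for illustration.
train_years = [1990, 2000, 2010]
test_years = [1995, 2005]

baseline = baseline_predictions(train_years, len(test_years))
print(rmse(baseline, test_years))      # 5.0
print(add_interactions([2.0, 3.0]))    # [2.0, 3.0, 4.0, 6.0, 9.0]
```

Because RMSE is in the same units as the label, a model that "beats the baseline by almost 7 years" is one whose RMSE is almost 7 years lower than the value computed for the baseline predictions.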
The analysis can be found on Databricks here.