The goal is to classify music streaming users based on their listening histories. I hypothesize that there are two types of listener: those who search for artists and listen to their songs, and those who listen to playlists. Users may not cluster into two distinct categories, but may instead fall on a spectrum between the two. Knowing where users fall on this spectrum can help optimize advertisements that keep users streaming music.
Playlists are composed of many songs by many different artists. As such, users who listen to many playlists will have a listening history that contains many different artists, with a low number of plays for each artist. I call them playlist listeners. Alternatively, users who like to have complete control over what they listen to are likely to choose an artist, listen to an album, then rinse and repeat. I call them album listeners. Their total listening history is likely to have a small number of artists and a large number of plays per artist. Every user can be described as a combination of these two fundamental listening profiles.
The user listening histories come from the Million Song Dataset (MSD), which provides about 3 GB of listening triplets for 1 million users in the form: User ID, Song ID, Number of Plays. To get the artist name for each song, I use the metadata in the SQL database on the MSD website, which comes from The Echo Nest (now taken over by Spotify). Since the metadata identifies songs by Track ID, which is different from Song ID, the tables need to be joined via a SQL database that links the Song and Track IDs.
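The join described above can be sketched as follows. This is a minimal Python illustration using an in-memory SQLite database, not the original R code; the table names (`triplets`, `song_track`, `track_meta`) and column names are hypothetical and do not necessarily match the MSD files.

```python
import sqlite3


def artist_plays(triplets, song_track, track_meta):
    """Return (user_id, artist_name, plays) rows, summed per user and artist.

    triplets:   iterable of (user_id, song_id, plays)
    song_track: iterable of (song_id, track_id)       -- hypothetical link table
    track_meta: iterable of (track_id, artist_name)   -- hypothetical metadata table
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE triplets (user_id TEXT, song_id TEXT, plays INTEGER)")
    con.execute("CREATE TABLE song_track (song_id TEXT, track_id TEXT)")
    con.execute("CREATE TABLE track_meta (track_id TEXT, artist_name TEXT)")
    con.executemany("INSERT INTO triplets VALUES (?, ?, ?)", triplets)
    con.executemany("INSERT INTO song_track VALUES (?, ?)", song_track)
    con.executemany("INSERT INTO track_meta VALUES (?, ?)", track_meta)

    # Song ID -> Track ID -> artist name, then total plays per user and artist.
    # Note: a song linked to several tracks would have its plays counted once
    # per track; this sketch assumes one track per song.
    rows = con.execute(
        """
        SELECT t.user_id, m.artist_name, SUM(t.plays) AS plays
        FROM triplets AS t
        JOIN song_track AS s ON s.song_id = t.song_id
        JOIN track_meta AS m ON m.track_id = s.track_id
        GROUP BY t.user_id, m.artist_name
        ORDER BY t.user_id, m.artist_name
        """
    ).fetchall()
    con.close()
    return rows
```

The result has one row per user and artist, which is the shape needed to count artists and average plays per artist for each user.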
Data analysis is performed in R, and the complete code can be found on GitHub here.
We can look at the number of artists in a user's listening history versus the mean number of plays per artist.
From the 2D histogram we can see that a larger number of users fall in the lower region of the graph, which indicates a larger number of total artists and a smaller number of mean plays per artist. More users therefore lie on the playlist listening part of the spectrum.
To quantify this I use a Play Index, defined with the natural logarithm:
\(\text{Play Index} =\log(\frac{\text{Total Artists}}{\text{Mean Artist Plays}})\)
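The play index for a single user can be computed as in the sketch below. It is written in Python for illustration (the original analysis is in R), and the function name and the dictionary input format are assumptions made here.

```python
import math


def play_index(plays_by_artist):
    """Play index for one user: log(total artists / mean plays per artist).

    plays_by_artist: dict mapping artist name -> total plays of that artist
    (a hypothetical input format chosen for this sketch).
    """
    if not plays_by_artist:
        raise ValueError("user has no listening history")
    total_artists = len(plays_by_artist)
    mean_artist_plays = sum(plays_by_artist.values()) / total_artists
    # Natural logarithm of the artists-to-mean-plays ratio.
    return math.log(total_artists / mean_artist_plays)
```

A user with many artists and few plays each gets a large positive index, while a user with a single heavily played artist gets a large negative one.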
We can plot the distribution of the play index. A larger play index corresponds to more playlist listening, and a smaller play index corresponds to more album listening.
We can see that the distribution is negatively skewed, with the mean (red line) lying at 1.6916. This corresponds to a ratio \(\frac{\text{Total Artists}}{\text{Mean Artist Plays}}\) of 5.428181, i.e., if a user has listened to around 100 songs, the history might contain about 25 artists, with each artist having been played 4 or 5 times.
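As a rough check of these figures, assume that "100 songs" means 100 total plays \(N\), and use the fact that total plays equal the number of artists \(A\) times the mean plays per artist \(m\). The symbols \(N\), \(A\), \(m\) and \(r\) are introduced here only for illustration.
\(r = \frac{A}{m} = e^{1.6916} \approx 5.43, \quad N = A \cdot m\)
\(A = \sqrt{rN} = \sqrt{5.43 \times 100} \approx 23.3, \quad m = \sqrt{N/r} = \sqrt{100/5.43} \approx 4.3\)
Under that assumption, a user at the mean play index with 100 plays would have about 23 artists at a little over 4 plays each, in line with the rounded figures quoted above.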
One user has listened only to William Fitzsimmons, and to just 10 songs from one album, a total of 741 times. This user is the epitome of an album listener. At the other end of the spectrum, one user has listened to 883 unique artists, with a mean of 2.08 plays per artist. This user is a playlist listener.
The skewness of the distribution is -0.8136788, which indicates that there are more hardcore album listeners than there are hardcore playlist listeners. This means that very few users listen to completely different playlists with no common artists all the time. Rather, the majority of users' listening profiles are a combination of playlist and album listening, though they could also be listening to similar playlists with common artists, or listening to the same playlist multiple times, which would produce a similar listening profile.
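For reference, skewness can be computed from the second and third central moments, as in the Python sketch below. The estimator used in the original R analysis is not stated, so this moment-based version is an assumption and small-sample variants may give slightly different values.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5.

    A negative result means the distribution has a longer tail on the left,
    which for the play index is the album listening side.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5
```

Applied to the list of per-user play indices, a value below zero would reflect a tail stretching toward low play indices.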
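To compute the play index more accurately, the data would need to record, along with the number of plays of each song, whether the song was played as part of a playlist. With Playlist Song Plays taken as the fraction of a user's plays that came from playlists (the expression is only defined for a value strictly between 0 and 1), the play index would be modified to: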
\(\text{Play Index} =\log(\frac{\text{Playlist Song Plays}}{1-\text{Playlist Song Plays}})\)
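The modified index has the form of a log-odds, which only makes sense if Playlist Song Plays is a fraction between 0 and 1. The Python sketch below makes that assumption explicit; the function name and its two count arguments are hypothetical, since the MSD triplets do not record whether a play came from a playlist.

```python
import math


def playlist_play_index(playlist_plays, total_plays):
    """Modified play index: log(p / (1 - p)), with p the playlist fraction.

    playlist_plays: number of plays that came from playlists (hypothetical field)
    total_plays:    total number of plays for the user
    """
    if total_plays <= 0:
        raise ValueError("total_plays must be positive")
    p = playlist_plays / total_plays
    # The log-odds is undefined for users who only, or never, use playlists.
    if not 0 < p < 1:
        raise ValueError("playlist fraction must be strictly between 0 and 1")
    return math.log(p / (1 - p))
```

With this definition an even split between playlist and non-playlist plays gives an index of zero, positive values indicate mostly playlist listening, and negative values indicate mostly album listening.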
In conclusion, a user's play index is an easily quantifiable number with which to characterize the user's listening behavior. In the dataset of over a million users and a million songs, most users are on the playlist listening end of the spectrum, as opposed to the album listening end. By identifying a user's listening profile, user-specific advertising can be optimized. For example, if a user falls more on the playlist side of the listening spectrum, characterized by a large play index, suggesting different playlists is likely to be more successful than it would be for an album listener. Conversely, suggesting new artists, or new albums from artists, would be more successful with users who fall more on the album listening end of the spectrum.