A good playlist title is essential for letting users know what to expect from a playlist before listening. In my experience, users do not spend long deciding whether to listen to a playlist; I'd estimate a couple of seconds or less. The playlist title is therefore extremely important in determining a playlist's total number of plays: if users do not like the title, for whatever reason, they won't listen to the playlist.
To determine the impact of a playlist title on its success, I will use Spotify's great API to obtain pertinent information on a variety of playlists, including their titles. To measure a playlist's success I will use its number of followers, under the assumption that this is proportional to the total number of plays. I will use text analytics to look at the types of words (i.e., verbs, nouns, adjectives) that make up playlist titles, and perform regression to determine what kinds of words make up the titles of popular playlists.
Though I hypothesize that the title of the playlist is an indicator of playlist success, I want to also acknowledge other factors of success that are not considered in this analysis. Such factors include: the songs the playlist is composed of; the position of the playlist on the page (those at the top of the list are likely to get chosen more); the playlist image; the playlist description; the number of current followers; amongst others.
The goal of this analysis is to come up with a series of actions a copywriter could use to create playlist names, based on the previous success of others.
A wordcloud from playlist titles.
Though this may not directly help to solve the problem, I always find it important to get a general feel for the data to help generate key insights, not just for this problem, but as a potential starting point for future problems. This may also help find any issues and discrepancies in the data, such as null values or misclassified categories that may appear.
Data analysis is performed in Python, and the complete code can be found on GitHub here.
We can look at the mean number of followers for each category:
We can see that the "top lists" and "pop" categories dominate, indicating that many users listen to generally popular music as opposed to specific genres.
We can also look at the mean number of followers for a given playlist owner:
We see that Spotify rules, which makes sense as we are using the Spotify API, and searching through their data. It would not be unreasonable to assume that they are pushing the success of their playlists through advertising and placement at the tops of search lists.
We can also see that Spanish-speaking countries, and playlists that seem to be composed of Spanish-language songs, occupy many of the top spots in the list, indicating their relative success in Latin America.
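The per-category and per-owner means discussed above could be computed along these lines. This is only a sketch using the standard library: the records, the field names (`category`, `followers`) and the helper `mean_followers_by` are made up for illustration and are not taken from the original analysis code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names and values are invented for illustration.
playlists = [
    {"category": "pop", "followers": 1200},
    {"category": "pop", "followers": 800},
    {"category": "rock", "followers": 300},
    {"category": "rock", "followers": 500},
    {"category": "jazz", "followers": 150},
]

def mean_followers_by(records, key):
    """Return {group: mean followers}, ordered from largest to smallest mean."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["followers"])
    means = {group: mean(values) for group, values in groups.items()}
    return dict(sorted(means.items(), key=lambda kv: kv[1], reverse=True))

category_means = mean_followers_by(playlists, "category")
```

Passing a different key (for example a hypothetical `"owner"` field) would give the same summary per playlist owner.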
We can look at the number of characters in a title versus the number of followers of a playlist. On top of this we can add a trend line using linear regression.
The p-value for the relation is 4.72975e-05, indicating that the trend is statistically significant. The slope has a magnitude of 13178.879 and the trend is downward, so for every character added you may expect a decrease of over 13,000 followers. So keep the title short and sweet.
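To make the trend line concrete, here is a minimal least-squares fit of followers against title length, written out by hand. The titles and follower counts are invented, and `fit_line` is a hypothetical helper; the post does not say which routine produced the reported slope and p-value. In practice `scipy.stats.linregress` returns the slope together with a p-value.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up titles and follower counts, for illustration only.
titles = ["Hits", "Top Hits", "The Top Hits Now", "All The Top Hits Right Now"]
followers = [900, 700, 400, 100]

title_lengths = [len(t) for t in titles]  # number of characters per title
slope, intercept = fit_line(title_lengths, followers)
```

A negative `slope` here corresponds to the downward trend described above: longer titles are associated with fewer followers.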
The outlier is a playlist named 'Today's Top Hits', with almost 10 million followers (9,946,355), and it may be one of the reasons the 'top lists' category has such a large mean number of followers. If we look at Spotify's web player, we can see that this playlist is right at the top, as well as featured on the homepage. The number of followers there also agrees roughly with the Spotify API call (I wanted to do a quick sanity check because 10 million followers!!!). It is unclear, though, whether being at the top of the list generates more followers or the number of followers determines placement in the list. What we can see is that placement in the list is not simply in order of number of followers, but the playlist with the most followers is on the left. I'm sure this is on purpose somehow.
To determine the best combination of word types I use the Natural Language Toolkit (NLTK), since it has a large number of specialised functions that are useful for this application. For each playlist title I tokenize the title, separating it into individual words, phrases and symbols. Next, each token is classified by its type of word; for example, "big" is classified as an adjective. I sum the frequency of each type of word and put the result into a pandas dataframe, in which each row is an observation (in this case a playlist title) and each column is a feature (a type of word). The entries of the dataframe are non-negative integers that represent the frequency of each word type in each playlist title.
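The counting step can be sketched as follows. To keep the example runnable without NLTK's tokenizer and tagger data, the tagged titles are hand-written stand-ins for what `nltk.pos_tag(nltk.word_tokenize(title))` would return; the tags shown are my own guesses, not real tagger output, and `tag_counts` and `build_feature_rows` are hypothetical helper names.

```python
from collections import Counter

def tag_counts(tagged_title):
    """Count how often each part-of-speech tag occurs in one tagged title."""
    return Counter(tag for _token, tag in tagged_title)

def build_feature_rows(tagged_titles):
    """Return (sorted tag names, one row of tag counts per title)."""
    counts = [tag_counts(t) for t in tagged_titles]
    columns = sorted(set().union(*counts))
    rows = [[c.get(col, 0) for col in columns] for c in counts]
    return columns, rows

# Hand-tagged stand-ins (assumed tags). With NLTK and its data installed,
# each entry would instead come from nltk.pos_tag(nltk.word_tokenize(title)).
tagged_titles = [
    [("Big", "JJ"), ("Country", "NN"), ("Hits", "NNS")],
    [("Today", "NN"), ("'s", "POS"), ("Top", "JJ"), ("Hits", "NNS")],
]

columns, rows = build_feature_rows(tagged_titles)
```

The result maps directly onto the dataframe described above, for example via `pandas.DataFrame(rows, columns=columns)`.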
The natural language toolkit can separate the words into a number of different types. These types of words will be features of the regression model. Examples of the types of words are shown below:
Word Type | Examples |
---|---|
Conjunction | & 'n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet |
Numeral | zero two 78-degrees eighty-four IX '60s .025 fifteen 271,124 dozen quintillion |
Determiner | all an another any both del each either every half la many much nary neither no some such that the them these this those |
Genitive Marker | 's |
Existential there | there |
Foreign word | gemeinschaft je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis |
Preposition or conjunction | astride among uppon whether out inside pro despite on by throughout below beside |
Adjective or numeral, ordinal | third ill-mannered pre-war regrettable calamitous first separable ectoplasmic multi-disciplinary |
Adjective, comparative | bleaker braver breezier briefer brighter brisker broader bumper busier calmer cuter |
Adjective, superlative | calmest cheapest choicest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest |
Modal auxiliary | can cannot could couldn't dare may might must need ought shall should shouldn't will would |
Noun, common, singular or mass | common-carrier humour falloff slick wind hyena override subhumanity machinist |
Noun, proper, singular | Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Meltex Liverpool |
Noun, proper, plural | Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Apocrypha |
Noun, common, plural | undergraduates scotches designs clubs fragrances averages subjectivists apprehensions muses factory-jobs |
Pre-determiner | all both half many quite such sure this |
Pronoun, personal | hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us |
Pronoun, possessive | her his mine my our ours their thy your |
Adverb | occasionally prominently technologically magisterially predominately swiftly fiscally |
Adverb, comparative | further gloomier grander graver greater grimmer harder harsher healthier heavier longer |
Adverb, superlative | best biggest bluntest earliest farthest first furthest hardest heartiest highest largest |
Particle | aboard about across along apart around aside at away back before behind by crop down ever fast for forth from go high i.e. in into just later low more off on open out |
Symbol | % & ' '' ''. ) ). * + ,. < = > @ |
"to" as preposition | to |
Interjection | Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen huh howdy uh dammit |
Verb, base form | ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe |
Verb, past tense | dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed |
Verb, present participle or gerund | telegraphing stirring focusing angering judging stalling lactating hankerin' alleging veering |
Verb, past participle | multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated |
Verb, present tense, not 3rd person singular | predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain |
Verb, present tense, 3rd person singular | bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes |
WH-determiner | that what whatever which whichever |
WH-pronoun | that what whatever whatsoever which who whom whosoever |
WH-pronoun, possessive | whose |
WH-adverb | how however whence whenever where whereby whereever wherein whereof why |
We can look at the frequency of word types in the playlist titles. Only word types with counts greater than 10 are included.
We can see that the most common word types in the playlist titles are nouns, followed by adjectives. Examples of the most common type, nouns, are 'session', 'soul', 'party', 'soultronic', 'country', 'gold', 'chillin', 'dirt', 'road', 'coffeehouse'. Some of these words are made up and so get classified as nouns. Colloquial words like 'chillin' are also classified as nouns when they should perhaps be verbs, and 'gold' may be intended as an adjective rather than a noun. A more colloquial corpus may therefore be more suitable for classifying these words, since nouns seem to be a catch-all.
I use multivariate linear regression to determine the coefficient (slope) for each type of word in a title. I use linear regression since my goal is to create actionable results that are easily interpretable.
To perform the regression I use an 80/20 train-test split, in which I fit the model on the training dataset and evaluate it on the test dataset. I use the residual sum of squares between the predicted and actual number of followers in the test set as the score for the regression model. With simple linear regression using ordinary least squares (OLS) I get a residual of 2.32439e+11, which I will use as a baseline against which to compare other models. OLS minimizes the function:
\(\min_w \lVert Xw-y\rVert_2^2\)
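As an illustration of the baseline procedure (80/20 split, OLS fit, residual sum of squares on the held-out rows), here is a small NumPy sketch. The data is synthetic and noiseless, the helper names are hypothetical, and the seeded shuffle is my assumption about how the split might be done, not something the post specifies.

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Shuffle the rows and split them 80/20 into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    train, test = order[:cut], order[cut:]
    return X[train], X[test], y[train], y[test]

def fit_ols(X, y):
    """Least-squares weights for y ~ X w (X should include a bias column)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rss(X, y, w):
    """Residual sum of squares between predictions X w and targets y."""
    return float(np.sum((X @ w - y) ** 2))

# Synthetic stand-in data: 50 "titles", 3 count features plus a bias column.
rng = np.random.default_rng(42)
counts = rng.integers(0, 4, size=(50, 3)).astype(float)
X = np.hstack([counts, np.ones((50, 1))])
true_w = np.array([2.0, -1.0, 0.5, 10.0])
y = X @ true_w  # noiseless, so OLS should recover true_w

X_train, X_test, y_train, y_test = train_test_split_80_20(X, y)
w_ols = fit_ols(X_train, y_train)
test_rss = rss(X_test, y_test, w_ols)
```

On real follower counts the test residual would of course be far from zero; the value reported above plays the role of `test_rss`.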
Ridge regression addresses some of the problems of OLS, namely that OLS is highly sensitive to randomness in the data, which can produce high variance and thus poor performance on the test dataset. Ridge regression imposes a penalty on the size of the coefficients, and minimizes the penalized residual sum of squares as follows:
\(\min_w \lVert Xw-y\rVert_2^2 + \alpha\lVert w\rVert^2_2\)
where \(X\) is the training dataset, \(w\) is the vector of coefficients to be solved for, \(y\) is the number of followers, and \(\alpha\) is the penalty or regularization parameter.
We can vary \(\alpha\) and observe the change in training and test scores and choose the alpha that provides the best score on the test dataset.
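A sweep over \(\alpha\) might look like the following sketch, which uses the closed-form ridge solution \(w = (X^TX + \alpha I)^{-1}X^Ty\) for the objective above. The data is synthetic, the grid of alphas is arbitrary, and the intercept is omitted for brevity; none of these choices come from the original analysis.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge weights: solve (X^T X + alpha I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def rss(X, y, w):
    """Residual sum of squares between predictions X w and targets y."""
    return float(np.sum((X @ w - y) ** 2))

def sweep_alpha(X_train, y_train, X_test, y_test, alphas):
    """Return (best alpha, its test RSS, test RSS for every alpha)."""
    scores = [rss(X_test, y_test, fit_ridge(X_train, y_train, a)) for a in alphas]
    best = int(np.argmin(scores))
    return alphas[best], scores[best], scores

# Synthetic data: 60 rows, 5 features, two of which have no effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=60)
X_train, X_test = X[:48], X[48:]
y_train, y_test = y[:48], y[48:]

alphas = np.logspace(-3, 3, 13)  # arbitrary grid from 0.001 to 1000
best_alpha, best_rss, scores = sweep_alpha(X_train, y_train, X_test, y_test, alphas)
```

Plotting `scores` against `alphas` (and the equivalent training scores) gives the kind of curve used to pick the penalty.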
Here the alpha that provides the best score is 96.1725, producing a residual sum of squares of 1.57414e+11. This is better than OLS linear regression, but we can see that the model is quite underfitted.
Lasso regression may perform a little better, since it is better suited to estimating sparse coefficients, and our training and test datasets are quite sparse. This method generally gives solutions with fewer non-zero coefficients, which may be useful in our case. The model is trained similarly to ridge regression, with an alpha parameter, according to:
\(\min_w \frac{1}{2n_{samples}}\lVert Xw-y\rVert^2_2 + \alpha\lVert w\rVert_1\)
We again vary the \(\alpha\) parameter and observe the change in training and test scores and choose the alpha that provides the best score on the test dataset.
Here we find that the model actually scores better on the test dataset than on the training dataset, which may point to low correlation between the features and the number of followers. Also the alpha parameter is very large, indicating a very large penalty on the coefficients, i.e., all the features have very little impact on the number of followers of the playlist.
The residual obtained, 1.57449e+11, is lower than the OLS baseline (though marginally higher than the 1.57414e+11 obtained with ridge), yet if the coefficients are very small it does not provide much insight into what types of words are effective in creating popular playlist titles.
Elastic net is a combination of lasso and ridge that allows for learning a sparse model in which few of the weights are non-zero, while still maintaining the regularization properties of ridge. We control the convex combination of the two penalties using a ratio parameter, \(\rho = \alpha_1/(\alpha_1+\alpha_2)\), where \(\alpha_1\) is the penalty weight from lasso and \(\alpha_2\) that from ridge, with \(\alpha = \alpha_1+\alpha_2\). Elastic net minimizes the function:
\(\min_w \frac{1}{2n_{samples}}\lVert Xw-y\rVert^2_2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert^2_2\)
Now we vary both \(\alpha\) and the mixing ratio \(\rho\), and find the combination of the two that provides the best score on the test dataset.
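For readers who want to see how such an objective can be minimized, below is a small coordinate-descent sketch for the elastic net objective with an L1 term \(\alpha\rho\lVert w\rVert_1\) and a ridge term \(\frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2\); setting \(\rho = 1\) gives lasso. It is an illustration on synthetic data without an intercept, with a fixed iteration count chosen arbitrarily, and is not the solver used for the results in this post. scikit-learn's `ElasticNet` minimizes an objective of this form, with `l1_ratio` playing the role of \(\rho\).

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def fit_elastic_net(X, y, alpha, rho, n_iter=500):
    """Cyclic coordinate descent for the elastic net objective (no intercept)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0) / n_samples
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(n_features):
            # Remove feature j's contribution, then refit that single weight.
            residual += X[:, j] * w[j]
            z = X[:, j] @ residual / n_samples
            denom = col_sq[j] + alpha * (1 - rho)
            w[j] = soft_threshold(z, alpha * rho) / denom if denom > 0 else 0.0
            residual -= X[:, j] * w[j]
    return w

# Synthetic data: 60 rows, 5 features, two of which have no effect.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=60)

w_lasso = fit_elastic_net(X, y, alpha=0.5, rho=1.0)  # pure L1 penalty
w_mixed = fit_elastic_net(X, y, alpha=0.5, rho=0.5)  # half L1, half L2
```

A grid search over `alpha` and `rho` then amounts to calling `fit_elastic_net` for each pair and keeping the pair with the lowest test residual.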
We can see that the score can be optimised over the value of \(\alpha\) and the ratio \(\rho\). The optimised ratio obtained was 0.894737, with \(\alpha\) being 887.2, which intuitively is mostly lasso regression with a bit of ridge, and the residual is 1.57383e+11, the lowest so far.
Now that we have an optimized model, we can look at the coefficients to see how the different types of words affect the number of followers. We are most interested in the coefficients with the highest magnitudes, since they have the highest leverage: adding or removing these words will have the most impact on the number of followers. Only those with a coefficient magnitude above 100 are shown.
From these results we can directly make actionable recommendations. Below are the five most influential actions you can take to increase the number of followers, with examples.
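Filtering and ranking coefficients by magnitude is a one-liner in practice; a sketch is shown here. The five large values are the coefficients reported in this post's recommendations table, while the feature names and the small `interjection` value are made up for illustration.

```python
def rank_by_magnitude(coefficients, threshold=100.0):
    """Keep coefficients with |value| > threshold, largest magnitude first."""
    kept = {name: c for name, c in coefficients.items() if abs(c) > threshold}
    return sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True)

# Feature names are hypothetical labels for the model's columns.
coefficients = {
    "adjective": 934.17,
    "n_characters": -9848.3,
    "determiner": -631.74,
    "genitive_marker": 1493.0,
    "n_words": -1270.2,
    "interjection": 12.0,  # made-up small value to show the filter
}

ranked = rank_by_magnitude(coefficients)
```

The sign of each retained coefficient then tells you whether to add or remove that feature.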
Action | Coefficient | Example |
---|---|---|
Reduce the number of characters | -9848.3 | "The Hottest Tracks of the Year" → "Hottest Songs This Year!" |
Add in genitive marker | 1493.0 | "The Most Metal Songs This Month" → "This Month's Metalest Songs" |
Reduce the number of words | -1270.2 | "The Best Country Hits of All Time" → "Country's All Time Best" |
Add in adjective | 934.17 | "100 Songs to Shower To" → "Ultimate Sing-along Essentials" |
Remove determiners | -631.74 | "The All Time Greatest Marching Band Anthems" → "Marching Bands' Best" |
In conclusion, multivariate linear regression has been performed on playlist titles taken from Spotify's API. The results of the regression have identified features of popular playlist titles that can be used to create titles for future playlists, such as the addition of genitive markers and the reduction of the number of characters. Overall the data suggests that titles should be short and sweet, with little redundancy. Users want to spend as little time as possible choosing a playlist and more time listening to it, so the quicker you can help them reach their decision, the better.