The Australian radio station Triple J hosts an annual vote for the “Hottest 100” songs released in that year. Listeners and music fans around the world cast votes, each naming 10 songs. This series of analyses works towards a predictive model that determines two outcomes: whether a song will appear in the Hottest 100 and, if so, where it will place in the ranking.
This first analysis begins by determining the relationship between the number of times a song was played on Triple J's radio station in the year under scrutiny and the binary outcome of whether or not it featured in the top 100 songs in the same year.
We first collect the play data and Hottest 100 charts for the years 2009-2012 and prepare them for analysis. We then analyse the years 2010, 2011 and 2012 using a linear model, which demonstrates a weak positive correlation with an adjusted R² value of 0.2458 across the 3 years combined. A digression is included to validate our source data and identify potential gaps.
Triple J play data
The included data file q1_play_timestamps.tsv came from the website JPlay, which takes its data directly from Triple J broadcasts. Each row is one play of a song on the station, with the time stamp, song title, artist and JPlay's unique IDs for the song and artist. These data were originally housed in an SQL database, where I added the “first_year” column, calculated as the year the song first appeared in the table, identified by SongID.
SELECT p.timestamp, p.artist, p.song, p.artistid, p.songid,
       (SELECT YEAR(MIN(p1.timestamp)) FROM Play p1 WHERE p1.songid = p.songid) AS first_year
FROM Play p
ORDER BY p.timestamp;
Hottest 100 Charts
The Hottest 100 charts in h100_ranks.tsv were copied directly from the Triple J website's records for 2010, 2011 and 2012. The charts were copied into Excel, cross-referenced with the song records, and updated so that each chart entry points to the matching entry in the JPlay-sourced data.
While this is a fairly mundane piece of work, it's essential for getting the data into a usable format. We stripped the data down to include only songs released in 2010, 2011 and 2012. The mechanism for doing this was to identify the first time a song was played within this data set. This is not perfect, but it does strip most data outside the scope of interest. Some noise is expected to remain in the data, though this technique will not remove any valid observations of interest.
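# Keep plays of songs first seen in 2010-2012, with only the columns we need
playLog <- rawPlays[rawPlays$first_year > 2009 & rawPlays$first_year < 2013,
                    c("posixts", "songid", "first_year")]
# Keep only plays made in the same calendar year the song first appeared
# (POSIXlt stores the year as an offset from 1900)
playLog <- playLog[playLog$first_year == playLog$posixts$year + 1900, ]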
Exploring 2010 data
In order to generate some visualisations of the data, we have limited it to songs first played in 2010 and therefore eligible for the 2010 Hottest 100 competition.
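The per-song table used in the rest of this post (counts.2010, with Freq, inh100 and rank columns) is not built on the page. The following is a minimal sketch of one way it could be derived, assuming a chart table with year, songid and rank columns. The toy values and the layout of h100 are my own stand-ins, not the real files.

```r
# Toy stand-ins for the real data; all values are made up for illustration.
playLog <- data.frame(
  songid     = c(101L, 101L, 101L, 102L, 103L, 103L),
  first_year = c(2010, 2010, 2010, 2010, 2011, 2011)
)
# Hypothetical layout for the chart file: one row per charting song.
h100 <- data.frame(year = 2010, songid = 101L, rank = 7L)

# Count plays per song for songs first played in 2010.
plays.2010  <- playLog[playLog$first_year == 2010, ]
counts.2010 <- as.data.frame(table(songid = plays.2010$songid),
                             stringsAsFactors = FALSE)
counts.2010$songid <- as.integer(counts.2010$songid)

# Attach the chart rank (NA when the song did not chart) and a logical flag.
ranks.2010  <- h100[h100$year == 2010, c("songid", "rank")]
counts.2010 <- merge(counts.2010, ranks.2010, by = "songid", all.x = TRUE)
counts.2010$inh100 <- !is.na(counts.2010$rank)

print(counts.2010)
```

Using a left join (all.x = TRUE) keeps every played song in the table, so songs that never charted stay in as negative cases.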
Figure 1 depicts the total number of plays over the year for each song first played in 2010. Data points in red were in the Hottest 100; points in black were not.
Points of interest in this figure are the red points very close to the X axis. There is one right on the X axis on the far right side of the chart and a few others fairly close. It does appear that there is a higher density of red in the upper half of this chart.
Most significantly, there is a very heavy bias towards songs played infrequently (i.e. along the X axis). This leads us to investigate the data set further to check its integrity.
Digression: Data validation
Are songs played only once considered valid?
It appears there are a lot of songs played only once. Let's confirm this hypothesis.
Figure 2 shows a histogram of the number of songs at each play count, for songs played fewer than ten times in 2010.
It appears that one-off plays are extremely common. Given that Triple J is an indie radio station, this is not entirely surprising and is not necessarily an anomaly. Songs played only once account for 61.9% of the songs first played in 2010. Three of these songs made it to the Hottest 100, which seems unusual but confirms that we should not discount this data.
counts.2010[counts.2010$Freq == 1 & counts.2010$inh100, ]
##      songid Freq inh100 rank
## 6733  34507    1   TRUE   95
## 7015  34974    1   TRUE   21
## 7016  35042    1   TRUE   26
We must investigate the data further to ensure nothing is missing. This validation will not be repeated in future analyses unless it turns up new discoveries not covered here.
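As a hedged illustration of how the one-off figures above can be computed, the sketch below uses a toy counts table (made-up values, same column names as counts.2010). It separates the share of songs played once from the share of plays those songs account for, since the two are easy to confuse.

```r
# Toy per-song counts; the numbers are invented for illustration only.
counts <- data.frame(
  Freq   = c(1, 1, 1, 2, 50),
  inh100 = c(FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Share of songs that were played exactly once.
share.songs <- mean(counts$Freq == 1)

# Share of all plays contributed by those one-off songs (much smaller).
share.plays <- sum(counts$Freq[counts$Freq == 1]) / sum(counts$Freq)

# Songs per play count for low counts, i.e. the data behind a histogram.
low.counts <- table(counts$Freq[counts$Freq < 10])

# One-off songs that still charted.
one.off.hits <- counts[counts$Freq == 1 & counts$inh100, ]
```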
Have we got data for the entire year?
Figure 3 shows the number of songs played for each day of 2010. The highest point (the green mark early in the year) is Australia Day 2010. This is the day that the Hottest 100 results for the previous year are announced & aired. There are 365 points here, meaning we have data for every day of the year.
It's fairly clear that we have a pattern of 2 days per week that are consistently lower in plays than the rest of the week.
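The daily coverage check can be sketched as follows, assuming the time stamps are available as a date-time vector. The three time stamps are toy values and the use of UTC is an assumption made to keep the example portable.

```r
# Toy time stamps standing in for the play log (UTC assumed for portability).
ts <- as.POSIXct(c("2010-01-26 09:00:00",
                   "2010-01-26 10:00:00",
                   "2010-01-27 09:00:00"), tz = "UTC")

# Number of plays on each calendar day that appears in the log.
per.day <- table(format(ts, "%Y-%m-%d"))

# Every calendar day of 2010, to compare against the days we actually have.
all.days <- seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day")
missing.days <- all.days[!(format(all.days) %in% names(per.day))]

# With the real data, length(missing.days) == 0 would mean full coverage.
length(all.days)
length(missing.days)
```

Grouping the same per-day counts by weekdays(as.Date(names(per.day))) is one way to confirm which two days of the week run low.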
Does time of day have an impact on weekend plays?
Figure 4 shows the number of songs played over the year by hour of the day, segmented by day of week.
This investigation shows that there is an abnormally low number of observations of songs played between 1am and 6am on Saturday and Sunday mornings. This corresponds to the Mix-up, House Party and other electronic/dance music sessions usually aired late on Friday and Saturday nights. Both our data and the upstream provider JPlay are missing these records. It is assumed that Triple J does not provide play lists or song titles during these periods.
Because of this regular anomaly, it is anticipated that songs that would usually air most frequently in these windows will have deflated total play counts. One song played only once during the year that made it to number 21 in the 2010 Hottest 100 was Skrillex's Scary Monsters & Nice Sprites, another was Crystal Castles - Not In Love, ranking 26th. The third was not electronic, being Grouplove's Naked Kids, released in November 2010 and ranking at 95th place.
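The weekend early-morning gap described above comes from tabulating plays by weekday and hour. A minimal sketch, assuming POSIXlt time stamps like the posixts column; the three time stamps are toy values.

```r
# Toy time stamps: a Saturday 2am play, a Saturday 2pm play, a Monday 2am play.
ts <- as.POSIXlt(c("2010-03-06 02:15:00",
                   "2010-03-06 14:00:00",
                   "2010-03-08 02:30:00"), tz = "UTC")

# Cross-tabulate plays by weekday (0 = Sunday ... 6 = Saturday) and hour.
by.slot <- table(weekday = ts$wday, hour = ts$hour)

# Plays logged between 1am and 6am on Saturday or Sunday mornings.
weekend.early <- sum(ts$wday %in% c(0, 6) & ts$hour >= 1 & ts$hour < 6)

print(by.slot)
```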
Using a standard linear regression model to analyse the impact that total annual play count has on the binary outcome of placing in the Hottest 100 chart for 2010 gives us this summary.
##
## Call:
## lm(formula = counts.2010$inh100 ~ counts.2010$Freq)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4219  0.0013  0.0038  0.0038  1.0038
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -0.006337   0.001254   -5.05  4.4e-07 ***
## counts.2010$Freq  0.002504   0.000049   51.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0997 on 7015 degrees of freedom
## Multiple R-squared:  0.271,  Adjusted R-squared:  0.271
## F-statistic: 2.61e+03 on 1 and 7015 DF,  p-value: <2e-16
If we treat the outcome as a probability, this model suggests that a song played 402 times or more is guaranteed to place in the Hottest 100, though the R² value of 0.271 suggests this is a weak relationship. The residuals tend to sit closer to the fitted line when play count is at the extremes, suggesting the confidence in this model is higher when songs are played either very frequently or very infrequently. This appears to reflect intuition.
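The 402-play figure follows from setting the fitted line equal to 1 and solving for play count. The sketch below does that arithmetic with the coefficients reported in the 2010 summary; the variable names are my own.

```r
# Coefficients as reported in the 2010 model summary.
intercept <- -0.006337
slope     <-  0.002504

# Play count at which the fitted value reaches 1 (about 401.9).
threshold <- (1 - intercept) / slope

# Smallest whole number of plays with a fitted value of at least 1.
min.plays <- ceiling(threshold)

# Fitted value for a song played once: slightly below zero, a reminder
# that a straight line is only a rough stand-in for a probability.
fitted.one <- intercept + slope * 1

print(c(threshold = threshold, min.plays = min.plays, fitted.one = fitted.one))
```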
##
## Call:
## lm(formula = counts.2011$inh100 ~ counts.2011$Freq)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4667 -0.0015  0.0036  0.0036  1.0036
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -6.17e-03   1.53e-03   -4.04  5.4e-05 ***
## counts.2011$Freq  2.56e-03   5.73e-05   44.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.111 on 6012 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.248
## F-statistic: 1.99e+03 on 1 and 6012 DF,  p-value: <2e-16
The 2011 data show a similar result, with the slope slightly higher at 0.002558 but the model fitting slightly worse, with an R² of 0.249.
##
## Call:
## lm(formula = counts.2012$inh100 ~ counts.2012$Freq)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3817 -0.0017  0.0028  0.0050  0.9893
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -7.26e-03   1.85e-03   -3.93  8.6e-05 ***
## counts.2012$Freq  2.25e-03   6.09e-05   36.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.12 on 4889 degrees of freedom
## Multiple R-squared:  0.218,  Adjusted R-squared:  0.218
## F-statistic: 1.36e+03 on 1 and 4889 DF,  p-value: <2e-16
2012 follows the trend, with a slightly lower slope of 0.002248 and an R² of 0.218.
Combined 2010-2012 data
##
## Call:
## lm(formula = counts.all$inh100 ~ counts.all$Freq)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4437 -0.0008  0.0041  0.0041  1.0041
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -6.52e-03   8.69e-04    -7.5  6.5e-14 ***
## counts.all$Freq  2.43e-03   3.18e-05    76.4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.109 on 17920 degrees of freedom
## Multiple R-squared:  0.246,  Adjusted R-squared:  0.246
## F-statistic: 5.84e+03 on 1 and 17920 DF,  p-value: <2e-16
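The degrees of freedom in the summaries are consistent with counts.all being the three yearly tables stacked together (7,017 + 6,014 + 4,891 = 17,922 songs). A minimal sketch of that step on toy data, assuming each yearly table has Freq and inh100 columns; the values are invented.

```r
# Sanity check on the reported numbers: residual df + 2 = number of songs.
stopifnot(sum(c(7015, 6012, 4889) + 2) == 17920 + 2)

# Toy yearly tables with the same columns as the real ones.
counts.2010 <- data.frame(Freq = c(1, 2, 150, 200),
                          inh100 = c(FALSE, FALSE, TRUE, TRUE))
counts.2011 <- data.frame(Freq = c(1, 5, 120, 180),
                          inh100 = c(FALSE, FALSE, FALSE, TRUE))
counts.2012 <- data.frame(Freq = c(1, 3, 90, 210),
                          inh100 = c(FALSE, FALSE, FALSE, TRUE))

# Stack the years and fit the same kind of linear model on a 0/1 outcome.
counts.all <- rbind(counts.2010, counts.2011, counts.2012)
fit <- lm(as.numeric(inh100) ~ Freq, data = counts.all)

slope  <- coef(fit)[["Freq"]]
adj.r2 <- summary(fit)$adj.r.squared
```

Passing data = counts.all keeps the coefficient name short (Freq rather than counts.all$Freq) and lets predict() be used on new data later.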
Putting the linear model aside for a moment, there appears to be a cut-off point: every song played more than 185 times earned a spot in the charts. This trend appears to begin at around 150 plays. I expect to see this again in a future decision tree analysis.
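A cut-off like this can be read straight from the data as the highest play count among songs that missed the chart. The sketch below shows that on a toy table, then backs a similar number out of the combined summary, on the assumption that the most negative residual (-0.4437) belongs to a non-charting song. The coefficients are rounded, so the result is approximate.

```r
# Toy combined table; values are invented for illustration.
counts.all <- data.frame(
  Freq   = c(1, 40, 185, 190, 300),
  inh100 = c(FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Highest play count among songs that did not chart.
max.miss <- max(counts.all$Freq[!counts.all$inh100])

# Every song played more often than that made the chart.
all.above.charted <- all(counts.all$inh100[counts.all$Freq > max.miss])

# From the reported combined model: a non-charting song has residual
# 0 - (intercept + slope * Freq), so Freq = (-residual - intercept) / slope.
intercept    <- -6.52e-03
slope        <-  2.43e-03
min.residual <- -0.4437
implied.plays <- (-min.residual - intercept) / slope  # roughly 185
```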
Summary & Future Work
The number of times a song is played on the Triple J radio station does have an impact on its likelihood of appearing in the Hottest 100, though the correlation is weak. Despite the weak correlation, songs with a very high frequency of plays are highly likely to at least place in the Hottest 100.
These results may be influenced by a number of external factors not captured here. A song may be played heavily on another station but not on Triple J, gathering “external” voters. A song may be played heavily during the period where Triple J do not record individual songs. A song may be released at the end of the year, gathering a relatively low number of plays but still enjoying popular appeal.
This analysis highlights that annual total play count is a weak indicator and it is recommended that it be used either in combination with other factors or that the play count be aggregated over a different period, or both. A future analysis should be done on play counts over the period of the year to investigate the suggestion that songs released closer to the voting period would have a higher chance of making the top 100.
Thanks for getting this far in my first data analysis. I plan on doing a heap more of these to ultimately try and build a more accurate model of what makes up a Hottest 100 song. I'd love to hear any feedback you have; I'm particularly interested in feedback on this analysis or the data. Any suggestions for future analyses or factors can be made on my website's index for this series.