Data & beer; Beer & data.

H100.1: Analysing the impact of annual play count on chart presence

Posted: April 21st, 2013 | Filed under: Blog

Authored by Mark McDonald (i@ii.net), April 2013. Download the data & source files here.

Abstract

The Australian radio station Triple J hosts an annual vote for the “Hottest 100” songs released in that year. Listeners and music fans around the world each vote for 10 songs. This series of analyses works towards a predictive model of two outcomes: whether a song will appear in the Hottest 100 and, if so, where it will rank.

This first analysis examines the relationship between the number of times a song was played on Triple J in the year under scrutiny and the binary outcome of whether or not it featured in the top 100 songs for the same year.

We first collect the play data and Hottest 100 charts for the years 2009-2012 and prepare it for analysis. We then analyse the years 2010, 2011 and 2012 using a linear model to demonstrate a weak positive correlation with an adjusted R2 value of 0.2458 across the 3 years combined. A digression is included to validate our source data, identifying potential gaps.

Data Collection

Triple J play data

The included data source q1_play_timestamps.tsv was sourced from the website JPlay, which sources its data directly from Triple J broadcasts. The observations are a row for each song played on the station, including the time stamp, song title, artist and JPlay's unique IDs for the song & artist. These data were originally housed in an SQL database, where I included the “first_year” column calculated as the year the song first appeared in the table, identified by SongID.

-- first_year: the year each song (by songid) first appears in the Play table
SELECT p.timestamp, p.artist, p.song, p.artistid, p.songid,
  (SELECT Year(Min(p1.timestamp)) FROM Play p1 WHERE p1.songid = p.songid) AS first_year
FROM Play p
ORDER BY p.timestamp;
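For readers working from a flat file rather than the database, the same first_year column can be derived in R. This is only a minimal sketch: the toy `plays` data frame, its values and the `timestamp` column name are assumptions made for illustration, not the original data.

```r
# Toy play log (hypothetical values, for illustration only).
plays <- data.frame(
  timestamp = as.POSIXct(c("2009-12-30 10:00:00", "2010-01-02 09:00:00",
                           "2010-03-05 14:30:00", "2011-06-01 08:15:00"),
                         tz = "UTC"),
  songid = c(1, 1, 2, 2)
)

# Calendar year of each play.
plays$year <- as.integer(format(plays$timestamp, "%Y"))

# first_year: the earliest year in which each songid appears,
# mirroring the correlated subquery in the SQL above.
plays$first_year <- ave(plays$year, plays$songid, FUN = min)

print(plays)
```

Here `ave()` returns one value per row, so the result lines up with the play log in the same way the subquery attaches first_year to every play.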

Hottest 100 Charts

The Hottest 100 charts in h100_ranks.tsv were copied directly from the Triple J website's record for 2010, 2011 & 2012. The charts were copied into Excel & cross-referenced with the song records and updated to point to the matching entry according to the JPlay sourced data.

Data Preparation

While this is a fairly mundane piece of work, it's essential for getting the data into a usable format. We stripped the data down to include only songs released in 2010, 2011 & 2012. The mechanism for doing this was to identify the first time a song was played within this data set and treat that year as its release year. This is not perfect, but it does strip most data outside the scope of interest. It is expected that a certain amount of noise still remains in the data, though this technique will not remove any valid observations of interest.

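# Keep only plays of songs first heard in 2010, 2011 or 2012.
playLog <- rawPlays[rawPlays$first_year > 2009 & rawPlays$first_year < 2013,
    c("posixts", "songid", "first_year")]
# Keep only plays made during each song's first year
# (posixts is POSIXlt, whose $year field counts years since 1900).
playLog <- playLog[playLog$first_year == playLog$posixts$year + 1900, ]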
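To make the effect of this filter concrete, here is a small sketch on made-up data. The `rawPlays` rows below are hypothetical, and the timestamps are stored as POSIXct and converted with `as.POSIXlt()` at the point of use, which is an assumption that differs slightly from the original column type.

```r
# Hypothetical play log: song 10 predates the window, song 20 is first
# played late in 2010 and again in 2011, song 30 is first played in 2012.
rawPlays <- data.frame(
  posixts = as.POSIXct(c("2009-11-01 12:00:00", "2010-02-01 12:00:00",
                         "2010-12-20 12:00:00", "2011-01-05 12:00:00",
                         "2012-07-07 12:00:00"), tz = "UTC"),
  songid = c(10, 10, 20, 20, 30),
  first_year = c(2009, 2009, 2010, 2010, 2012)
)

# Step 1: keep songs whose first play falls in 2010-2012.
in_scope <- rawPlays$first_year > 2009 & rawPlays$first_year < 2013
playLog <- rawPlays[in_scope, c("posixts", "songid", "first_year")]

# Step 2: keep only plays that happened in the song's first year.
play_year <- as.POSIXlt(playLog$posixts)$year + 1900
playLog <- playLog[playLog$first_year == play_year, ]

print(playLog)
```

In this toy example song 10 is dropped entirely, and song 20 keeps only its December 2010 play: a song first aired late in a year ends up with a low count for that year.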

Exploring 2010 data

In order to generate some visualisations of the data, we have limited it to songs first played in 2010 and therefore eligible for the 2010 Hottest 100 competition.

Figure 1 depicts the number of times each song first played in 2010 was played over the entire year. Data points in red were in the hottest 100 songs, points in black were not.

[Figure 1: play count over 2010 for each song first played in 2010; red points placed in the Hottest 100, black points did not]

Points of interest in this figure are the red points very close to the X axis. There is one right on the X axis on the far right side of the chart and a few others fairly close. It does appear that there is a higher density of red in the upper half of this chart.

Most significantly, there is a very heavy skew towards songs played infrequently (i.e. along the X axis). This leads us to investigate the data set further to ensure its integrity.

Digression: Data validation

Are songs played only once considered valid?

It appears there are a lot of songs played only once. Let's confirm this hypothesis.

Figure 2 shows a histogram depicting the frequency of songs that are played under ten times in the year 2010.

[Figure 2: histogram of songs played fewer than ten times in 2010]

It appears that one-off plays are extremely common. Given that Triple J is an indie radio station, this is not entirely surprising and is not necessarily an anomaly. Songs played only once account for 61.9% of the songs first played in 2010. Three of these songs made it to the Hottest 100, which seems unusual but confirms that we should not discount this data.

counts.2010[counts.2010$Freq == 1 & counts.2010$inh100, ]
##      songid Freq inh100 rank
## 6733  34507    1   TRUE   95
## 7015  34974    1   TRUE   21
## 7016  35042    1   TRUE   26
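A per-song table like counts.2010 can be built by tabulating the play log and joining the chart ranks. The sketch below is an assumption about how such a table might be produced, not the original code; the `playLog` and `h100` values are invented.

```r
# Hypothetical inputs: one row per play, and a tiny "chart".
playLog <- data.frame(songid = c(1, 1, 1, 2, 3, 4, 4, 5))
h100 <- data.frame(songid = c(1, 3), rank = c(12, 95))

# Plays per song.
counts <- as.data.frame(table(songid = playLog$songid),
                        stringsAsFactors = FALSE)
counts$songid <- as.numeric(counts$songid)

# Attach chart rank where present and flag chart membership.
counts <- merge(counts, h100, by = "songid", all.x = TRUE)
counts$inh100 <- !is.na(counts$rank)

# Share of songs played exactly once, and their share of all plays.
share_of_songs <- mean(counts$Freq == 1)
share_of_plays <- sum(counts$Freq == 1) / sum(counts$Freq)

# Songs played once that still made the chart.
once_in_chart <- counts[counts$Freq == 1 & counts$inh100, ]
print(once_in_chart)
```

Reporting both shares is deliberate: the proportion of songs played once and the proportion of plays they represent can differ a great deal.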

We must investigate the data further to ensure nothing is missing. This validation will not be repeated in future analyses unless it leads to new discoveries not covered here.

Have we got data for the entire year?

Figure 3 shows the number of songs played for each day of 2010. The highest point (the green mark early in the year) is Australia Day 2010. This is the day that the Hottest 100 results for the previous year are announced & aired. There are 365 points here, meaning we have data for every day of the year.
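The coverage check can also be done without a plot by comparing the dates present in the log against a full calendar. A minimal sketch, using made-up dates with one day deliberately removed so that the check has something to find:

```r
# Every day of 2010.
all_days <- seq(as.Date("2010-01-01"), as.Date("2010-12-31"), by = "day")

# Hypothetical play dates: day 100 is missing, day 1 has two plays.
play_dates <- c(all_days[-100], all_days[1])

# Number of plays recorded on each day that appears in the log.
plays_per_day <- table(play_dates)

# Calendar days with no plays at all.
missing_days <- all_days[!(all_days %in% play_dates)]

length(plays_per_day)  # distinct days covered
print(missing_days)
```

With real data, an empty `missing_days` corresponds to the 365 points seen in Figure 3.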

[Figure 3: number of songs played on each day of 2010]

It's fairly clear that we have a pattern of 2 days per week that are consistently lower in plays than the rest of the week.

Does time of day have an impact on weekend plays?

Figure 4 shows the number of songs played over the year by hour of the day, segmented by day of week.

[Figure 4: number of songs played in 2010 by hour of day, segmented by day of week]

This investigation shows that there is an abnormally low number of observations of songs played between 1am and 6am on Saturday and Sunday mornings. This corresponds to the Mix-up, House Party & other electronic/dance music sessions usually aired late Friday & Saturday nights. Both our data and the upstream provider JPlay are missing these records. It is assumed that Triple J does not provide play lists or song titles during these periods.
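The hour-by-weekday breakdown behind Figure 4 can be tabulated directly from the timestamps. This is a sketch on a handful of invented timestamps; the variable names are hypothetical.

```r
# Hypothetical play timestamps.
ts <- as.POSIXct(c("2010-03-05 23:30:00",   # Friday night
                   "2010-03-06 02:10:00",   # Saturday, small hours
                   "2010-03-06 14:00:00",   # Saturday afternoon
                   "2010-03-08 02:20:00",   # Monday, small hours
                   "2010-03-08 02:45:00"),  # Monday, small hours
                 tz = "UTC")
lt <- as.POSIXlt(ts)

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday).
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekday <- factor(day_names[lt$wday + 1], levels = day_names)

# Plays by hour of day (rows) and day of week (columns).
by_hour_day <- table(hour = lt$hour, weekday = weekday)
print(by_hour_day)

# Plays falling in the 1am-6am window on Saturday or Sunday.
overnight_weekend <- lt$hour >= 1 & lt$hour < 6 & lt$wday %in% c(0, 6)
sum(overnight_weekend)
```

Using the numeric `wday` field rather than `weekdays()` keeps the labels independent of the system locale.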

Because of this regular anomaly, it is anticipated that songs that would usually air most frequently in these windows will have deflated total play counts. One song played only once during the year that made it to number 21 in the 2010 Hottest 100 was Skrillex's Scary Monsters & Nice Sprites, another was Crystal Castles - Not In Love, ranking 26th. The third was not electronic, being Grouplove's Naked Kids, released in November 2010 and ranking at 95th place.

Modelling

2010 data

Using a standard linear regression model to analyse the impact that total annual play count has on the binary outcome of placing in the Hottest 100 chart for 2010 gives us this summary.

## 
## Call:
## lm(formula = counts.2010$inh100 ~ counts.2010$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4219  0.0013  0.0038  0.0038  1.0038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.006337   0.001254   -5.05  4.4e-07 ***
## counts.2010$Freq  0.002504   0.000049   51.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.0997 on 7015 degrees of freedom
## Multiple R-squared: 0.271,   Adjusted R-squared: 0.271 
## F-statistic: 2.61e+03 on 1 and 7015 DF,  p-value: <2e-16

If we treat the fitted value as a probability, this model suggests that a song played 402 times or more is guaranteed to place in the Hottest 100, though the R2 value of 0.271 indicates a weak relationship. The residuals tend to sit closer to the fitted line when the play count is at the extremes, suggesting the model is more reliable for songs played either very frequently or very infrequently. This matches intuition.
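The 402 figure follows from the fitted line: it is the smallest whole play count at which intercept + slope × plays reaches 1. The sketch below reproduces that arithmetic from the coefficients printed in the 2010 summary; the variable names are my own.

```r
# Coefficients from the 2010 model summary above.
intercept <- -0.006337
slope     <- 0.002504

# Smallest whole number of plays with a fitted value of at least 1.
plays_needed <- ceiling((1 - intercept) / slope)
plays_needed

# Fitted value for a song played exactly once.
fitted_one_play <- intercept + slope * 1
fitted_one_play
```

The fitted value for a single play comes out slightly below zero, a reminder that a straight line fitted to a TRUE/FALSE outcome is not constrained to the 0 to 1 range, so reading it as a probability is only approximate.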

[Figure: plot for the 2010 model]

2011 data

## 
## Call:
## lm(formula = counts.2011$inh100 ~ counts.2011$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4667 -0.0015  0.0036  0.0036  1.0036 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.17e-03   1.53e-03   -4.04  5.4e-05 ***
## counts.2011$Freq  2.56e-03   5.73e-05   44.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.111 on 6012 degrees of freedom
## Multiple R-squared: 0.249,   Adjusted R-squared: 0.248 
## F-statistic: 1.99e+03 on 1 and 6012 DF,  p-value: <2e-16

2011 data shows a similar result, with the slope being slightly higher at 0.002558 but the model fitting slightly worse, with R2 of 0.249.

[Figure: plot for the 2011 model]

2012 data

## 
## Call:
## lm(formula = counts.2012$inh100 ~ counts.2012$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3817 -0.0017  0.0028  0.0050  0.9893 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.26e-03   1.85e-03   -3.93  8.6e-05 ***
## counts.2012$Freq  2.25e-03   6.09e-05   36.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.12 on 4889 degrees of freedom
## Multiple R-squared: 0.218,   Adjusted R-squared: 0.218 
## F-statistic: 1.36e+03 on 1 and 4889 DF,  p-value: <2e-16

2012 follows the trend, with a slope of 0.002248 and R2 of 0.218.

[Figure: plot for the 2012 model]

Combined 2010-2012 data

## 
## Call:
## lm(formula = counts.all$inh100 ~ counts.all$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4437 -0.0008  0.0041  0.0041  1.0041 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.52e-03   8.69e-04    -7.5  6.5e-14 ***
## counts.all$Freq  2.43e-03   3.18e-05    76.4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.109 on 17920 degrees of freedom
## Multiple R-squared: 0.246,   Adjusted R-squared: 0.246 
## F-statistic: 5.84e+03 on 1 and 17920 DF,  p-value: <2e-16

[Figure: plot for the combined 2010-2012 model]

Putting the linear model aside for a moment, there appears to be a cut-off point: songs played more than 185 times all hold a spot in the charts. This trend appears to begin at around 150 plays. I expect to see this again in a future decision tree analysis.
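A cut-off like this can be read off the counts table rather than the plot. The following is a sketch on invented numbers, assuming a data frame with the same Freq and inh100 columns as the counts tables used earlier; the thresholds tried are arbitrary.

```r
# Hypothetical per-song counts (values made up for illustration).
counts <- data.frame(
  Freq   = c(1, 5, 40, 120, 150, 160, 186, 200, 240),
  inh100 = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Highest play count among songs that missed the chart:
# every song played more often than this placed.
cutoff <- max(counts$Freq[!counts$inh100])
cutoff

# Share of songs in the chart at or above some candidate thresholds.
thresholds <- c(50, 100, 150, 185)
hit_rate <- sapply(thresholds,
                   function(t) mean(counts$inh100[counts$Freq >= t]))
data.frame(thresholds, hit_rate)
```

Scanning the hit rate across thresholds is essentially the single split a decision tree would search for on this variable.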

Summary & Future Work

The number of times a song is played on the Triple J radio station does have an impact on its likelihood of appearing in the Hottest 100, though the correlation is weak. Despite the weak correlation it can be noted that songs with a very high frequency of plays are highly likely to at least place in the Hottest 100.

These results may be influenced by a number of external factors not captured here. A song may be played heavily on another station but not on Triple J, gathering “external” voters. A song may be played heavily during the period where Triple J do not record individual songs. A song may be released at the end of the year, gathering a relatively low number of plays but still enjoying popular appeal.

This analysis highlights that annual total play count is a weak indicator and it is recommended that it be used either in combination with other factors or that the play count be aggregated over a different period, or both. A future analysis should be done on play counts over the period of the year to investigate the suggestion that songs released closer to the voting period would have a higher chance of making the top 100.

Author's Note

Thanks for getting this far in my first data analysis. I plan on doing a heap more of these to ultimately try and build a more accurate model of what makes up a Hottest 100 song. I'd love to hear any feedback you have; I'm particularly interested in feedback on this analysis or the data. Any suggestions for future analyses or factors can be made on my website's index for this series.


One Comment on “H100.1: Analysing the impact of annual play count on chart presence”

  1. Tom H said at 5:28 pm on August 19th, 2014:

    Hi Mark,

    I’ve been wanting to work on a project like this for a long time and was very happy to stumble across your data and analysis.

    I have spent the last couple of weeks sifting through and trying to improve on what you’ve done. So far I have been concentrating on the 2010 data and identified a few small fixes:
    1. The songs you have identified with play counts of 1 are actually eligible for 2011 but were just first played in 2010 and hence were captured by your data filtering.
    2. A similar problem, some songs were played in 2009 which were eligible in 2010 and therefore excluded in your counts.2010 data frame, I have manually added these which slightly improved the linear model fit.
    3. I have also added data for the hottest #101-200 (which is available for every year since 2010 at least) which has also improved the lm fit and I think will help a lot in further analysis.
    4. Many of the songs can be eliminated from the analysis due to not actually being eligible i.e. older songs which were played once or twice during the year. This could possibly be done by cross referencing the counts data with an official voting list (which are available). For example there are 7031 unique songs in the counts.2010 data, but only just over 1000 on the voting list.

    I’ve done some cleaning of the data and code and have plenty more ideas, although I am still only a beginner so I am learning as I go.

    If you’d like to see what I’ve done please just reply to my email address and I’ll get back to you. I’d also really like to know how you initially got the JPlay data as I’d love to have a go at predicting this year’s Hottest 100.

    Judging by your website it seems you are quite busy with a few projects, but if you are still interested in this project I would love to work together or just bounce some ideas back and forth.

    Best regards,
    Tom

