Data & beer; Beer & data.

H100.1: Analysing the impact of annual play count on chart presence

Posted: April 21st, 2013 | Filed under: Blog

Authored by Mark McDonald (i@ii.net), April 2013. Download the data & source files here.

Abstract

The Australian radio station Triple J hosts an annual vote for the “Hottest 100” songs released in that year. Votes are cast by listeners and music fans around the world, with each vote naming 10 songs. This series of analyses works towards a predictive model for 2 outcomes: whether a song will appear in the Hottest 100 and, if so, where it will rank.

This first analysis begins by determining the relationship between the number of times a song was played on Triple J in the year under scrutiny and the binary outcome of whether or not it featured in the Hottest 100 for the same year.

We first collect the play data and Hottest 100 charts for the years 2009-2012 and prepare them for analysis. We then analyse the years 2010, 2011 and 2012 using a linear model, which shows a weak positive correlation with an adjusted R² value of 0.2458 across the 3 years combined. A digression validates our source data and identifies potential gaps.

Data Collection

Triple J play data

The included data source q1_play_timestamps.tsv was sourced from the website JPlay, which sources its data directly from Triple J broadcasts. Each observation is a row for one song played on the station, including the timestamp, song title, artist and JPlay's unique IDs for the song & artist. These data were originally housed in an SQL database, where I added the “first_year” column, calculated as the year the song first appeared in the table, identified by SongID.

SELECT p.timestamp, p.artist, p.song, p.artistid, p.songid, 
  (SELECT Year(Min(p1.timestamp)) FROM Play p1 WHERE p.songid = p1.songid) AS first_year
FROM Play p 
ORDER BY p.timestamp;
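
The same first_year column could also be derived in R instead of SQL. This is only a minimal sketch, assuming a play log with one timestamp and one song ID per row; the data frame name `plays` and the toy rows are made up for illustration and are not taken from the JPlay data.

```r
# Toy play log (hypothetical rows, not real JPlay data).
plays <- data.frame(
  timestamp = as.POSIXct(c("2009-12-30 10:00:00", "2010-01-02 09:30:00",
                           "2010-06-01 14:00:00", "2011-03-05 08:15:00"),
                         tz = "UTC"),
  songid = c(1, 1, 2, 2)
)

# Calendar year of each play.
plays$year <- as.integer(format(plays$timestamp, "%Y"))

# ave() applies min() within each songid group and returns one value per row,
# which mirrors the correlated subquery in the SQL.
plays$first_year <- ave(plays$year, plays$songid, FUN = min)
```

Song 1 is first seen in 2009 and song 2 in 2010, so every row for a song carries that song's first year.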

Hottest 100 Charts

The Hottest 100 charts in h100_ranks.tsv were copied directly from the Triple J website's records for 2010, 2011 & 2012. The charts were copied into Excel, cross-referenced with the song records, and each entry was updated to point to the matching song in the JPlay-sourced data.

Data Preparation

While this is a fairly mundane piece of work, it's essential for getting the data into a usable format. We stripped the data down to include only songs released in 2010, 2011 & 2012. Release year is approximated by the year a song was first played within this data set. This is not perfect, but it strips most data outside the scope of interest. Some noise is expected to remain in the data, though this technique will not remove any valid observations of interest.

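# Keep songs first played in 2010-2012, with only the columns we need.
playLog <- rawPlays[rawPlays$first_year > 2009 & rawPlays$first_year < 2013, 
    c("posixts", "songid", "first_year")]
# Keep only plays made during the song's first year.
# posixts is a POSIXlt value, whose $year field counts from 1900.
playLog <- playLog[playLog$first_year == playLog$posixts$year + 1900, ]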
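
The later sections work with a per-song table (counts.2010 and friends) holding songid, Freq, inh100 and rank. The post does not show how that table was built, so the following is only a sketch of one way it could be done; the `h100` lookup and the toy values are hypothetical.

```r
# Hypothetical toy inputs: one row per play, and a chart lookup.
playLog <- data.frame(songid = c(101, 101, 101, 102, 103, 103))
h100 <- data.frame(songid = c(101, 103), rank = c(5, 95))

# One row per song with its play count; table() names the count column "Freq".
counts <- as.data.frame(table(songid = playLog$songid), stringsAsFactors = FALSE)
counts$songid <- as.numeric(counts$songid)

# Left join so songs that missed the chart are kept, with rank = NA.
counts <- merge(counts, h100, by = "songid", all.x = TRUE)
counts$inh100 <- !is.na(counts$rank)
```

The left join matters here: an inner join would silently drop every song that did not chart, which is most of the data.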

Exploring 2010 data

In order to generate some visualisations of the data, we have limited it to songs first played in 2010 and therefore eligible for the 2010 Hottest 100 competition.

Figure 1 depicts the number of times each song first played in 2010 was played over the entire year. Data points in red were in the hottest 100 songs, points in black were not.

[Figure 1: play count over 2010 for each song first played in 2010; Hottest 100 songs in red, all others in black]

Points of interest in this figure are the red points very close to the X axis. There is one right on the X axis on the far right side of the chart and a few others fairly close. It does appear that there is a higher density of red in the upper half of this chart.

Most significantly, there is a very heavy bias towards songs played infrequently (i.e. along the X axis). This leads us to investigate the data set further to check its integrity.

Digression: Data validation

Are songs played only once considered valid?

It appears there are a lot of songs played only once. Let's confirm this hypothesis.

Figure 2 shows a histogram depicting the frequency of songs that are played under ten times in the year 2010.

[Figure 2: histogram of songs played fewer than ten times in 2010]

It appears that one-off plays are extremely common. Given that Triple J is an indie radio station, this is not entirely surprising and is not necessarily an anomaly. Songs played only once account for 61.9% of the songs first played in 2010. Three of these songs made it into the Hottest 100, which seems unusual but confirms that we should not discard these data.

counts.2010[counts.2010$Freq == 1 & counts.2010$inh100, ]
##      songid Freq inh100 rank
## 6733  34507    1   TRUE   95
## 7015  34974    1   TRUE   21
## 7016  35042    1   TRUE   26
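
The share of one-off songs and the counts behind a histogram like Figure 2 can be computed from the Freq column alone. A minimal sketch, using a made-up vector of play counts in place of the real counts.2010$Freq:

```r
# Hypothetical play counts per song (not the real 2010 data).
freq <- c(1, 1, 1, 2, 5, 40, 1, 3)

# Share of songs played exactly once.
share_once <- mean(freq == 1)

# Frequency table for songs played fewer than ten times, as in a histogram.
low_counts <- table(freq[freq < 10])
```

Taking the mean of a logical vector gives the proportion of TRUE values, so `share_once` is a share of songs, not a share of plays.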

We must investigate the data further to ensure nothing is missing. This validation will not be repeated in future analyses unless it turns up discoveries not covered here.

Have we got data for the entire year?

Figure 3 shows the number of songs played for each day of 2010. The highest point (the green mark early in the year) is Australia Day 2010. This is the day that the Hottest 100 results for the previous year are announced & aired. There are 365 points here, meaning we have data for every day of the year.

[Figure 3: number of songs played on each day of 2010]

It's fairly clear that we have a pattern of 2 days per week that are consistently lower in plays than the rest of the week.
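
Both checks, full-year coverage and the weekly pattern, come down to counting plays per calendar day and then grouping those counts by day of week. A rough sketch on made-up timestamps (two weeks, one play every six hours, UTC assumed for simplicity):

```r
# Hypothetical timestamps covering two weeks, one play every 6 hours.
ts <- seq(as.POSIXct("2010-01-04 00:00:00", tz = "UTC"),
          by = "6 hours", length.out = 56)

# Number of plays on each calendar day.
per_day <- table(as.Date(ts))

# Number of distinct days covered; 365 would indicate a full (non-leap) year.
n_days <- length(per_day)

# Average plays per day, grouped by day of week (0 = Sunday ... 6 = Saturday).
weekday <- as.POSIXlt(names(per_day), tz = "UTC")$wday
mean_by_weekday <- tapply(as.vector(per_day), weekday, mean)
```

On the real data, two consistently low entries in `mean_by_weekday` would correspond to the two quiet days per week seen in Figure 3.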

Does time of day have an impact on weekend plays?

Figure 4 shows the number of songs played over the year by hour of the day, segmented by day of week.

[Figure 4: number of songs played by hour of day, segmented by day of week]

This investigation shows an abnormally low number of observations of songs played between 1am and 6am on Saturday and Sunday mornings. This corresponds to the Mix-up, House Party & other electronic/dance music sessions usually aired late Friday & Saturday nights. Both our data and the upstream provider JPlay are missing these records. It is assumed that Triple J does not provide play lists or song titles during these periods.

Because of this regular anomaly, it is anticipated that songs that would usually air most frequently in these windows will have deflated total play counts. One song played only once during the year that made it to number 21 in the 2010 Hottest 100 was Skrillex's Scary Monsters & Nice Sprites, another was Crystal Castles - Not In Love, ranking 26th. The third was not electronic, being Grouplove's Naked Kids, released in November 2010 and ranking at 95th place.
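
A gap like the weekend early-morning one can be located by cross-tabulating plays by hour and day of week and looking for empty or near-empty cells. The sketch below fabricates one week of hourly timestamps and deliberately removes the Saturday and Sunday 1am-6am slots to imitate the gap; none of it is real play data.

```r
# Hypothetical timestamps (UTC for simplicity): hourly plays for one week,
# with the Saturday and Sunday 01:00-05:59 slots removed to mimic the gap.
ts <- seq(as.POSIXct("2010-01-04 00:00:00", tz = "UTC"),
          by = "hour", length.out = 7 * 24)
lt <- as.POSIXlt(ts)
is_gap <- lt$wday %in% c(0, 6) & lt$hour >= 1 & lt$hour < 6
lt <- lt[!is_gap]

# Rows are hours 0-23, columns are days of week (0 = Sunday).
by_hour_day <- table(hour = lt$hour, wday = lt$wday)

# Hour/day cells with no plays at all.
missing <- which(by_hour_day == 0, arr.ind = TRUE)
```

With real data the cells are unlikely to be exactly zero, so a threshold relative to the weekday counts for the same hour would probably be more useful than `== 0`.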

Modelling

2010 data

Using a standard linear regression model to analyse the impact that total annual play count has on the binary outcome of placing in the Hottest 100 chart for 2010 gives us this summary.

## 
## Call:
## lm(formula = counts.2010$inh100 ~ counts.2010$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4219  0.0013  0.0038  0.0038  1.0038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.006337   0.001254   -5.05  4.4e-07 ***
## counts.2010$Freq  0.002504   0.000049   51.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.0997 on 7015 degrees of freedom
## Multiple R-squared: 0.271,   Adjusted R-squared: 0.271 
## F-statistic: 2.61e+03 on 1 and 7015 DF,  p-value: <2e-16

If we treat the fitted value as a probability, this model suggests that a song played 402 times or more is guaranteed to place in the Hottest 100, though the R² value of 0.271 indicates a weak relationship. The residuals are smaller when play count is at the extremes, suggesting that confidence in this model is higher for songs played either very frequently or very infrequently. This matches intuition.
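
The 402 figure comes from solving intercept + slope × plays = 1 for plays and rounding up. A short worked check, using the two coefficients from the 2010 summary above (the variable names are just for illustration):

```r
# Coefficients copied from the 2010 summary above.
intercept <- -0.006337
slope     <-  0.002504

# Smallest whole number of plays at which the fitted value reaches 1.
plays_for_one <- ceiling((1 - intercept) / slope)

# Fitted value for a song played once; note that it is below zero,
# which a true probability cannot be.
fitted_once <- intercept + slope * 1
```

The negative fitted value for one-off songs is a reminder that a straight line is only a rough stand-in for a probability here.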

[Plot: 2010 data]

2011 data

## 
## Call:
## lm(formula = counts.2011$inh100 ~ counts.2011$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4667 -0.0015  0.0036  0.0036  1.0036 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.17e-03   1.53e-03   -4.04  5.4e-05 ***
## counts.2011$Freq  2.56e-03   5.73e-05   44.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.111 on 6012 degrees of freedom
## Multiple R-squared: 0.249,   Adjusted R-squared: 0.248 
## F-statistic: 1.99e+03 on 1 and 6012 DF,  p-value: <2e-16

2011 data shows a similar result, with the slope slightly higher at 0.002558 but the fit slightly weaker, with an R² of 0.249.

[Plot: 2011 data]

2012 data

## 
## Call:
## lm(formula = counts.2012$inh100 ~ counts.2012$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3817 -0.0017  0.0028  0.0050  0.9893 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.26e-03   1.85e-03   -3.93  8.6e-05 ***
## counts.2012$Freq  2.25e-03   6.09e-05   36.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.12 on 4889 degrees of freedom
## Multiple R-squared: 0.218,   Adjusted R-squared: 0.218 
## F-statistic: 1.36e+03 on 1 and 4889 DF,  p-value: <2e-16

2012 is similar again, with a lower slope of 0.002248 and a weaker fit, with an R² of 0.218.

[Plot: 2012 data]

Combined 2010-2012 data

## 
## Call:
## lm(formula = counts.all$inh100 ~ counts.all$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4437 -0.0008  0.0041  0.0041  1.0041 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.52e-03   8.69e-04    -7.5  6.5e-14 ***
## counts.all$Freq  2.43e-03   3.18e-05    76.4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.109 on 17920 degrees of freedom
## Multiple R-squared: 0.246,   Adjusted R-squared: 0.246 
## F-statistic: 5.84e+03 on 1 and 17920 DF,  p-value: <2e-16

[Plot: combined 2010-2012 data]

Putting the linear model aside for a moment, it appears that a cut-off point exists where a song played more than 185 times implies a spot in the charts. This trend appears to begin at around 150 plays. I expect to see this again in a future decision tree analysis.
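
One way to test a cut-off like this is to compute, for a given threshold, the share of songs above it that made the chart. A minimal sketch with a made-up table standing in for counts.all; the helper `hit_rate_above` is hypothetical and the values are chosen only to illustrate the idea.

```r
# Hypothetical per-song table (made-up values, not the real counts.all).
counts <- data.frame(
  Freq   = c(1, 3, 40, 120, 160, 190, 210, 250),
  inh100 = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Share of songs in the chart among those played more than `cutoff` times.
hit_rate_above <- function(df, cutoff) {
  above <- df[df$Freq > cutoff, ]
  mean(above$inh100)
}

rate_150 <- hit_rate_above(counts, 150)
rate_185 <- hit_rate_above(counts, 185)
```

Running the same function over a range of cut-offs on the real data would show where the hit rate starts to climb and where it reaches 1.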

Summary & Future Work

The number of times a song is played on the Triple J radio station does have an impact on its likelihood of appearing in the Hottest 100, though the correlation is weak. Despite the weak correlation it can be noted that songs with a very high frequency of plays are highly likely to at least place in the Hottest 100.

These results may be influenced by a number of external factors not captured here. A song may be played heavily on another station but not on Triple J, gathering “external” voters. A song may be played heavily during the period where Triple J do not record individual songs. A song may be released at the end of the year, gathering a relatively low number of plays but still enjoying popular appeal.

This analysis highlights that annual total play count is a weak indicator and it is recommended that it be used either in combination with other factors or that the play count be aggregated over a different period, or both. A future analysis should be done on play counts over the period of the year to investigate the suggestion that songs released closer to the voting period would have a higher chance of making the top 100.

Author's Note

Thanks for getting this far in my first data analysis. I plan on doing a heap more of these to ultimately try and build a more accurate model of what makes up a Hottest 100 song. I'd love to hear any feedback you have; I'm particularly interested in feedback on this analysis or the data. Any suggestions for future analyses or factors can be made on my website's index for this series.


Triple J Hottest 100 Prediction

Posted: April 21st, 2013 | Filed under: Projects

At the end of 2012 I set myself a goal to try & predict the 2012 Hottest 100.  Little did I realise what a daunting, complex task I had set myself.  After screwing around with a pile of hand-rolled scripts, 3 virtual machines, about a dozen SSIS packages, around 6 SSAS prediction models, a PostgreSQL database, a SQL Server OLTP DB, a SQL Server OLAP DB and more knowledge of the Fuzzy Lookup transform than I will ever admit to, I decided to start over with simpler, incremental goals.

Eventually I still want to predict a Hottest 100 chart but to get there I’m starting with smaller data sets and asking simpler questions.  This page will be my diary.

Analyses

H100.1: Is there a correlation between play counts in the year and Hottest 100 success?

H100.2: Is there a correlation between play counts during certain times of the year and Hottest 100 success? (in progress)

H100.3: Is there a correlation between artist nationality & Hottest 100 success? (planned)

Future Work

Pick some of the more useful factors and build a more complex model (clustering & decision trees is about all I could do at the moment).

More Data

  • YouTube Plays
  • Social Media mentions
  • Try and extract genre from MusicBrainz
  • Play data from ARIA & other non-JJJ sources

Technical Stuff

Code

I’m doing this using the R statistical programming language and the RStudio IDE.  I’ll include all of my exploratory files as well as the clean write-ups, R markdown files and data when I upload something.  It’s a pretty cool tool & I’m fairly new at this stage.  If you know a better way to do something I’ve done, please tell me!

Maths

I’m trying to predict 2 outcomes from my data: a binary outcome defining whether the song is in the top 100 or not and, given that a song IS in the top 100, its rank.  I’m hoping to generate a probability model for the former and a continuous variable for the latter.  This will mean prediction takes two steps, or two models, but it makes the problem easier to solve.  In theory the body of eligible songs is ranked by JJJ from 1 to N, but everything between 101 and N is hidden from us.
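
The two-step idea could be sketched roughly as below. This is not a model from this series: logistic regression for step one and a plain linear fit for step two are assumptions picked for illustration, and the toy data are made up.

```r
# Hypothetical toy data: play counts, chart membership, and rank where charted.
songs <- data.frame(
  Freq   = c(1, 2, 5, 20, 50, 80, 120, 150, 200, 260),
  inh100 = c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1),
  rank   = c(NA, NA, NA, NA, 90, NA, 60, NA, 30, 10)
)

# Step 1: probability of charting. Logistic regression is one common choice
# (an assumption here, not the method used in H100.1).
step1 <- glm(inh100 ~ Freq, data = songs, family = binomial)
p_chart <- predict(step1, type = "response")

# Step 2: rank, fitted only on the songs that charted.
charted <- songs[songs$inh100 == 1, ]
step2 <- lm(rank ~ Freq, data = charted)
```

Unlike the straight-line fit, the step-one predictions always fall strictly between 0 and 1, which is what a probability model needs.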