At the end of 2012 I set myself a goal to try & predict the 2012 Hottest 100. Little did I realise what a daunting, complex task I had set myself. After screwing around with a pile of hand-rolled scripts, 3 virtual machines, about a dozen SSIS packages, around 6 SSAS prediction models, a PostgreSQL database, a SQL Server OLTP DB, a SQL Server OLAP DB and more knowledge of the Fuzzy Lookup transform than I will ever admit to, I decided to start over with simpler, incremental goals.
Eventually I still want to predict a Hottest 100 chart, but to get there I’m starting with smaller data sets and asking simpler questions. This page will be my diary.
H100.2: Is there a correlation between play counts during certain times of the year and Hottest 100 success? (in progress)
Pick some of the more useful factors and build a more complex model (clustering & decision trees are about all I could do at the moment).
- YouTube Plays
- Social Media mentions
- Try and extract genre from MusicBrainz
- Play data from ARIA & other non-JJJ sources
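For the H100.2 question above, a rough sketch of the kind of check I have in mind: correlate play counts in each period of the year against a 0/1 "made the chart" flag. It's written in Python rather than R purely for illustration, the quarterly buckets and field names (`plays`, `in_chart`) are my own assumptions, and the numbers are invented toy data, not real JJJ plays.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_by_period(songs):
    """Correlate plays in each period with the 0/1 chart flag.

    With a binary flag this is the point-biserial correlation,
    which is just Pearson's r applied to a 0/1 variable.
    """
    flags = [s["in_chart"] for s in songs]
    periods = sorted(songs[0]["plays"])
    return {p: pearson([s["plays"][p] for s in songs], flags) for p in periods}

# Invented toy data for illustration only (NOT real play counts).
toy_songs = [
    {"plays": {"Q1": 5,  "Q2": 9,  "Q3": 20, "Q4": 30}, "in_chart": 1},
    {"plays": {"Q1": 12, "Q2": 3,  "Q3": 2,  "Q4": 1},  "in_chart": 0},
    {"plays": {"Q1": 0,  "Q2": 4,  "Q3": 15, "Q4": 25}, "in_chart": 1},
    {"plays": {"Q1": 8,  "Q2": 8,  "Q3": 6,  "Q4": 5},  "in_chart": 0},
    {"plays": {"Q1": 2,  "Q2": 10, "Q3": 3,  "Q4": 0},  "in_chart": 0},
]

if __name__ == "__main__":
    for period, r in correlation_by_period(toy_songs).items():
        print(period, round(r, 2))
```

A value near +1 or -1 for a period would suggest plays in that part of the year carry signal; a value near 0 would suggest they don't, at least not linearly.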
I’m doing this using the R statistical programming language and the RStudio IDE. I’ll include all of my exploratory files as well as the clean write-ups, R markdown files and data when I upload something. It’s a pretty cool tool & I’m fairly new at this stage. If you know a better way to do something I’ve done, please tell me!
I’m trying to predict 2 outcomes from my data: a binary outcome defining whether the song is in the top 100 or not and, given that a song IS in the top 100, its rank. I’m hoping to generate a probability model for the former and a continuous variable for the latter. This will mean prediction takes two steps, or two models, but it makes the problem easier to solve. In theory the body of eligible songs is ranked by JJJ from 1 to N, but everything between 101 and N is hidden from us.
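To make the two-step idea concrete, here is a minimal sketch of how the two target variables could be built and how a prediction would flow through both steps. It's in Python rather than R just for illustration. The function names, the `threshold` parameter and the stub models are all hypothetical placeholders, not the models I'll actually fit.

```python
def build_targets(eligible, chart):
    """Return one row per eligible song with both outcome variables.

    `chart` maps song title -> published rank (1..100). Songs that are
    not in the chart get rank None, because ranks 101..N are hidden.
    """
    rows = []
    for title in eligible:
        rank = chart.get(title)
        rows.append({
            "title": title,
            "in_top100": int(rank is not None),  # target for step 1
            "rank": rank,                        # target for step 2
        })
    return rows

def predict_two_step(features, p_model, rank_model, threshold=0.5):
    """Step 1: probability of charting. Step 2: rank, only if step 1 passes.

    `p_model` and `rank_model` are any callables taking a feature dict;
    `threshold` is an assumed cut-off, not a tuned value.
    """
    p = p_model(features)
    rank = rank_model(features) if p >= threshold else None
    return p, rank

if __name__ == "__main__":
    # Hypothetical titles and stub models, for illustration only.
    rows = build_targets(["Song A", "Song B"], {"Song A": 3})
    print(rows)
    print(predict_two_step({"plays": 10}, lambda f: 0.9, lambda f: 12.0))
```

The rank model would only ever be trained on rows where `in_top100` is 1, since those are the only songs with a visible rank.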