Data & beer; Beer & data.

Enabling HTTP request / response logging in the Google APIs Client Library for Java

Posted: June 3rd, 2014 | Filed under: Blog

This one gets me every time I touch this library, so I'm posting my notes for future reference.

Logging can be enabled by creating a custom HttpRequestInitializer and HttpResponseInterceptor, but that approach consumes the InputStream, so you'll likely see getContent() return null or hit IOException "Stream closed" errors.

However, the client library uses Java's standard java.util.logging framework, which can be configured to write to the console / stderr with these steps.

  • Create a file called, say, 'logging.properties' and store it somewhere accessible to your project (I just drop it in the project root for debugging).
  • Paste these lines into the file:
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=CONFIG
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
  • Run your app with this extra VM argument, either by adding it to the command line (java -cp … -Djava.util.logging… MyClass) or to the "VM options" field in your IDE.
-Djava.util.logging.config.file=logging.properties

Et voila, logs!
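If you'd rather not manage a properties file, the same loggers can be turned up programmatically. Here's a minimal sketch, assuming the library routes its HTTP logging through the logger named after com.google.api.client.http.HttpTransport; note the static reference, since java.util.logging may otherwise garbage-collect the logger along with your level setting.

import com.google.api.client.http.HttpTransport;
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class HttpLogging {
  // Static reference keeps the logger (and its level) from being garbage-collected.
  private static final Logger HTTP_LOGGER =
      Logger.getLogger(HttpTransport.class.getName());

  public static void enable() {
    HTTP_LOGGER.setLevel(Level.CONFIG);
    // ConsoleHandler defaults to INFO, so lower its level too.
    ConsoleHandler handler = new ConsoleHandler();
    handler.setLevel(Level.CONFIG);
    HTTP_LOGGER.addHandler(handler);
  }
}

Call HttpLogging.enable() before making any requests and the logs should appear on stderr, no VM arguments required.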


Ingress Portal Attack Simulation in R

Posted: July 14th, 2013 | Filed under: Blog

I've put together a pretty basic R script that simulates portal attacks in Ingress. Ultimately I want to use it to derive optimal resonator placement strategies for specific scenarios, but for now it just produces cool simulation pics.

Grab the code on Github. I’d love a pull request if you have something useful to add. Details on how to use it are provided in the README.

Scenarios

Basic Examples

Here are some sample scenarios in which the attacker fires 3 level 8 bursters using 2 different firing strategies. The attacker is firing upon what I call a "lazily loaded" portal filled by a single L8 player: an L8 res in the EAST slot, an L7 in NE, L6s in N & NW, L5s in W & SW, and L4s in S & SE.
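For concreteness, here's that layout as a named R vector (purely illustrative; the script's actual input format is described in the README):

# "Lazy" fill by a single L8 agent: resonator level per compass slot.
lazy_fill <- c(E = 8, NE = 7, N = 6, NW = 6, W = 5, SW = 5, S = 4, SE = 4)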

It’s kinda cool to see that the “fire in the center” strategy for L5 and higher bursters isn’t necessarily the best.

Fire straight from the center.

Fire closer to the more powerful resonators.

Nothing fired, just the lazy loading scenario prior to an attack.

Street-Side Portal

I've often wondered if I could optimise resonator placement for a portal that sits on a street, prone to drive-bys. The idea that came to mind was to put the heavier resonators facing the street (absorbing damage) but also bring them closer to the center, keeping them off the street itself. In these examples I've assumed traffic flows from west to east, but it doesn't make a huge difference.

Here’s the “streetside” resonator placement, no attack.

Here’s an attack on the street-side setup. Ouch.

Here’s a comparative attack on a standard “lazy” fill. There’s 27% more damage done in the street-side “optimised” scenario.


H100.1: Analysing the impact of annual play count on chart presence

Posted: April 21st, 2013 | Filed under: Blog

Authored by Mark McDonald (i@ii.net), April 2013. Download the data & source files here.

Abstract

The Australian radio station Triple J hosts an annual vote for the "Hottest 100" songs released that year. Votes are cast by listeners and music fans around the world, with 10 songs per vote. This series of analyses works towards a predictive model for 2 outcomes: whether a song will be present in the Hottest 100 and, if so, where it will be placed in the ranking.

This first analysis begins by determining the relationship between the number of times a song was played on Triple J in the year under scrutiny and the binary outcome of whether or not it featured in the top 100 songs for that same year.

We first collect the play data and Hottest 100 charts for the years 2009-2012 and prepare them for analysis. We then analyse the years 2010, 2011 and 2012 using a linear model, demonstrating a weak positive correlation with an adjusted R2 value of 0.2458 across the 3 years combined. A digression is included to validate our source data, identifying potential gaps.

Data Collection

Triple J play data

The included data source q1_play_timestamps.tsv was sourced from the website JPlay, which captures its data directly from Triple J broadcasts. Each observation is one song played on the station, including the timestamp, song title, artist and JPlay's unique IDs for the song & artist. These data were originally housed in an SQL database, where I added the "first_year" column, calculated as the year the song (identified by songid) first appeared in the table.

SELECT p.timestamp, p.artist, p.song, p.artistid, p.songid, 
  (SELECT Year(Min(p1.timestamp)) FROM Play p1 WHERE p.songid = p1.songid) AS first_year
FROM Play p 
ORDER BY p.timestamp;

Hottest 100 Charts

The Hottest 100 charts in h100_ranks.tsv were copied directly from the Triple J website's records for 2010, 2011 & 2012, then cross-referenced in Excel and updated to point to the matching song entries in the JPlay-sourced data.

Data Preparation

While this is a fairly mundane piece of work, it's essential for getting the data into a usable format. We stripped the data down to include only songs released in 2010, 2011 & 2012, taking the year a song was first played within this data set as its release year. This is not perfect, but it strips most data outside the scope of interest. A certain amount of noise will remain, though the technique does not remove any valid observations of interest.

# Keep plays of songs first heard in 2010-2012, and only the columns we need.
playLog <- rawPlays[rawPlays$first_year > 2009 & rawPlays$first_year < 2013, 
    c("posixts", "songid", "first_year")]
# Keep only plays that occurred in the song's first year
# (posixts is a POSIXlt, so $year counts years since 1900).
playLog <- playLog[playLog$first_year == playLog$posixts$year + 1900, ]

Exploring 2010 data

In order to generate some visualisations of the data, we have limited it to songs first played in 2010 and therefore eligible for the 2010 Hottest 100 competition.

Figure 1 depicts how many times each song first played in 2010 was played over the entire year. Data points in red placed in the Hottest 100; points in black did not.

[Figure 1: 2010 play count per song; red = placed in the Hottest 100]

Points of interest in this figure are the red points very close to the X axis: there is one right on the axis at the far right of the chart, and a few others fairly close. There also appears to be a higher density of red in the upper half of the chart.

Most significantly, there is a very heavy bias towards songs played infrequently (i.e. along the X axis). This leads us to investigate the data set further, ensuring its integrity.

Digression: Data validation

Are songs played only once considered valid?

It appears there are a lot of songs played only once. Let's confirm this hypothesis.

Figure 2 is a histogram of play counts for songs played fewer than ten times in 2010.

[Figure 2: histogram of play counts for songs played fewer than ten times in 2010]

It appears that one-off plays are extremely common. Given that Triple J is an indie radio station, this is not entirely surprising and not necessarily an anomaly. Songs played only once account for 61.9% of total plays in 2010. Three of these songs made it into the Hottest 100, which seems unusual but confirms that we should not discount this data.

counts.2010[counts.2010$Freq == 1 & counts.2010$inh100, ]
##      songid Freq inh100 rank
## 6733  34507    1   TRUE   95
## 7015  34974    1   TRUE   21
## 7016  35042    1   TRUE   26

We must investigate the data further to ensure nothing is missing. This digression will not be repeated in future analyses unless it turns up new discoveries not covered here.

Have we got data for the entire year?

Figure 3 shows the number of songs played on each day of 2010. The highest point (the green mark early in the year) is Australia Day 2010, the day the previous year's Hottest 100 results are announced & aired. There are 365 points here, meaning we have data for every day of the year.

[Figure 3: songs played per day of 2010; green = Australia Day]

It's fairly clear that we have a pattern of 2 days per week that are consistently lower in plays than the rest of the week.

Does time of day have an impact on weekend plays?

Figure 4 shows the number of songs played over the year by hour of the day, segmented by day of week.

[Figure 4: plays by hour of day, segmented by day of week]

This investigation shows an abnormally low number of observed song plays between 1am and 6am on Saturday and Sunday mornings. This corresponds to the Mix-up, House Party & other electronic/dance music sessions usually aired late Friday & Saturday nights. Both our data and the upstream provider JPlay are missing these records. It is assumed that Triple J does not provide play lists or song titles during these periods.

Because of this regular anomaly, it is anticipated that songs that would usually air most frequently in these windows will have deflated total play counts. Of the three songs played only once in 2010 that still charted, two fit this profile: Skrillex's Scary Monsters & Nice Sprites reached number 21 and Crystal Castles' Not In Love ranked 26th. The third, Grouplove's Naked Kids, was not electronic; released in November 2010, it ranked in 95th place.

Modelling

2010 data

Using a standard linear regression model to analyse the impact of total annual play count on the binary outcome of placing in the 2010 Hottest 100 chart gives us this summary.
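For reference, the summary below comes from an ordinary least squares fit of the in-chart indicator against play count:

summary(lm(counts.2010$inh100 ~ counts.2010$Freq))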

## 
## Call:
## lm(formula = counts.2010$inh100 ~ counts.2010$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4219  0.0013  0.0038  0.0038  1.0038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.006337   0.001254   -5.05  4.4e-07 ***
## counts.2010$Freq  0.002504   0.000049   51.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.0997 on 7015 degrees of freedom
## Multiple R-squared: 0.271,   Adjusted R-squared: 0.271 
## F-statistic: 2.61e+03 on 1 and 7015 DF,  p-value: <2e-16

If we treat the fitted value as a probability, this model suggests that once a song is played 402 times or more it is guaranteed to place in the Hottest 100, though the R2 value of 0.271 indicates this is a weak relationship. The residuals sit closer to the fitted line at the extremes of play count, suggesting confidence in this model is higher for songs played either very frequently or very infrequently. This appears to reflect intuition.
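The 402 figure simply solves the fitted line for a predicted value of 1, using the coefficients above:

# Play count at which the fitted probability reaches 1:
(1 - (-0.006337)) / 0.002504
## [1] 401.8918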

[Figure: 2010 model fit over the data]

2011 data

## 
## Call:
## lm(formula = counts.2011$inh100 ~ counts.2011$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4667 -0.0015  0.0036  0.0036  1.0036 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.17e-03   1.53e-03   -4.04  5.4e-05 ***
## counts.2011$Freq  2.56e-03   5.73e-05   44.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.111 on 6012 degrees of freedom
## Multiple R-squared: 0.249,   Adjusted R-squared: 0.248 
## F-statistic: 1.99e+03 on 1 and 6012 DF,  p-value: <2e-16

2011 data shows a similar result, with a slightly lower slope of 0.002558 and a slightly worse fit, with an R2 of 0.249 (versus 0.271 in 2010).

[Figure: 2011 model fit over the data]

2012 data

## 
## Call:
## lm(formula = counts.2012$inh100 ~ counts.2012$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3817 -0.0017  0.0028  0.0050  0.9893 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.26e-03   1.85e-03   -3.93  8.6e-05 ***
## counts.2012$Freq  2.25e-03   6.09e-05   36.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.12 on 4889 degrees of freedom
## Multiple R-squared: 0.218,   Adjusted R-squared: 0.218 
## F-statistic: 1.36e+03 on 1 and 4889 DF,  p-value: <2e-16

2012 follows the trend, with a slope of 0.002248 and R2 of 0.218.

[Figure: 2012 model fit over the data]

Combined 2010-2012 data

## 
## Call:
## lm(formula = counts.all$inh100 ~ counts.all$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4437 -0.0008  0.0041  0.0041  1.0041 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.52e-03   8.69e-04    -7.5  6.5e-14 ***
## counts.all$Freq  2.43e-03   3.18e-05    76.4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.109 on 17920 degrees of freedom
## Multiple R-squared: 0.246,   Adjusted R-squared: 0.246 
## F-statistic: 5.84e+03 on 1 and 17920 DF,  p-value: <2e-16

[Figure: combined 2010-2012 model fit over the data]

Putting the linear model aside for a moment, there appears to be a cut-off point: a song played more than 185 times implies a spot in the charts, with the trend beginning at around 150 plays. I expect to see this again in a future decision tree analysis.

Summary & Future Work

The number of times a song is played on Triple J does have an impact on its likelihood of appearing in the Hottest 100, though the correlation is weak. Despite that weak correlation, songs with a very high play count are highly likely to at least place in the Hottest 100.

These results may be influenced by a number of external factors not captured here. A song may be played heavily on another station but not on Triple J, gathering “external” voters. A song may be played heavily during the period where Triple J do not record individual songs. A song may be released at the end of the year, gathering a relatively low number of plays but still enjoying popular appeal.

This analysis highlights that annual total play count is a weak indicator on its own; it should be combined with other factors, aggregated over a different period, or both. A future analysis should examine play counts over the course of the year to test the suggestion that songs released closer to the voting period have a higher chance of making the top 100.

Author's Note

Thanks for getting this far in my first data analysis. I plan on doing a heap more of these to ultimately try and build a more accurate model of what makes up a Hottest 100 song. I'd love to hear any feedback you have; I'm particularly interested in feedback on this analysis or the data. Any suggestions for future analyses or factors can be made on my website's index for this series.


Triple J Hottest 100 Prediction

Posted: April 21st, 2013 | Filed under: Projects

At the end of 2012 I set myself a goal to try & predict the 2012 Hottest 100.  Little did I realise what a daunting, complex task I had set myself.  After screwing around with a pile of hand-rolled scripts, 3 virtual machines, about a dozen SSIS packages, around 6 SSAS prediction models, a PostgreSQL database, a SQL Server OLTP DB, a SQL Server OLAP DB and more knowledge of the Fuzzy Lookup transform than I will ever admit to, I decided to start over with simpler, incremental goals.

Eventually I still want to predict a Hottest 100 chart but to get there I’m starting with smaller data sets and asking simpler questions.  This page will be my diary.

Analyses

H100.1: Is there a correlation between play counts in the year and Hottest 100 success?

H100.2: Is there a correlation between play counts during certain times of the year and Hottest 100 success? (in progress)

H100.3: Is there a correlation between artist nationality & Hottest 100 success? (planned)

Future Work

Pick some of the more useful factors and build a more complex model (clustering & decision trees are about all I could manage at the moment).

More Data

  • YouTube Plays
  • Social Media mentions
  • Try and extract genre from MusicBrainz
  • Play data from ARIA & other non-JJJ sources

Technical Stuff

Code

I'm doing this using the R statistical programming language and the RStudio IDE.  I'll include all of my exploratory files as well as the clean write-ups, R markdown files and data whenever I upload something.  It's a pretty cool tool & I'm fairly new to it at this stage.  If you know a better way to do something I've done, please tell me!

Maths

I'm trying to predict 2 outcomes from my data: a binary outcome defining whether the song is in the top 100 or not, and, given that a song IS in the top 100, its rank.  I'm hoping to build a probability model for the former and a continuous variable for the latter.  This means prediction takes two steps (two models), but it makes the problem easier to solve.  In theory the body of eligible songs is ranked by JJJ from 1 to N; everything between 101 and N is hidden from us.  A sketch of the two-step structure follows.
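Here is a minimal sketch of that two-step structure in R, reusing the column names from the H100.1 data (the real feature set is still to be chosen, so treat this as a shape, not the eventual model):

# Step 1: probability of charting (logistic regression on play count, say).
chart.model <- glm(inh100 ~ Freq, data = counts.all, family = binomial)

# Step 2: rank, fitted only on songs that actually charted.
rank.model <- lm(rank ~ Freq, data = counts.all[counts.all$inh100, ])

# Prediction chains the two: P(in top 100), then a rank conditional on charting.
p.chart <- predict(chart.model, newdata = counts.all, type = "response")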


Samsung Galaxy Camera review

Posted: January 25th, 2013 | Filed under: Blog

So Weezer played in town & I managed to score a couple of tickets.  My Dear Employer also happened to start selling the new Samsung Android camera (EK-GC100).  I thought I'd test the camera out at the gig & see what it could do (full disclosure: I requested a camera to borrow & test out; I was not paid or asked to do so).

We were seated at the back of the venue, directly in line with the stage but many, many metres away.  Couple that with my not-so-steady hand & a lot of the photos turned out quite blurry.  I've kept & published them to give an idea of the ratio of good-to-terrible shots.

Here’s a few shots for comparison.  It’s not a particularly fair “review” since these are all dark concert shots, but you get the idea.  None of these pictures have been post-processed, other than what the device did through its presets.

From our seats, no zoom.

Weezer

Almost full zoom.

Weezer

Full zoom (21x optical).

Weezer

It does some great video, too.  The big disappointment from concert filming was the huge audio clipping & distortion.  We were miles back, so I suspect this is not going to be a camera to use at any concert; that said, the picture quality of the video was great.

Outside of these dark concert shots, the camera has some pretty cool features for photos of people.  Back in about 2006 I saw an IEEE magazine article talking about the “next greatest thing” where they had camera prototypes that would take a series of shots, perform facial recognition and let you choose which face shot to use for each detected person in the photo.  This camera can do that! If I can hang on to it a little longer I’ll grab some shots and share some samples.

The other feature worth mentioning is that this is a full Android device.  The camera takes a micro-SIM and has WiFi, so you can use it on the go.  Being a Samsung, it comes pre-installed with Dropbox to keep your pics in The Cloud™, as well as Instagram and some pretty fun photo manipulation apps that will probably never get used.

I'm a Flickr user and keep my phone pics backed up using Photo Mule (it appears to be missing from Play these days, but I keep the APK here).  The Flickr app is also unsupported on this device – a huge bummer, but I installed it from the APK.  Once past these annoyances, the rest of the important photo apps came next: Add A Cat & Rage Face Photo.

The Android OS interface is snappy and keeps a "Camera" icon on each of the home screens; handy for flicking back to Camera mode at short notice.  The battery life is OK – it is by no means excellent, but considering what the thing is doing, it lasts a decent while.  I was taking photos or videos, with the camera running for about 3-4 hours (including regular passes to friends to play), and had 54% remaining after the gig.

I am by no means a photo nut and this camera is not for photo nuts.  It is a networked camera, of medium complexity, for people looking for a point-and-click device.  At $439 (from Kogan, not my employer) I would buy one for myself.


JSON APIs in Excel

Posted: August 12th, 2012 | Filed under: Projects

My wife & I were talking about moving to a new place, and one potential candidate came up that didn't seem to serve my #1 priority of spending less time in transit.  It was closer to our nearest freeway, but at an exit further away than our current place.  (Pro tip: skip the article and just download the code right here).

She seemed pretty keen on the place but I couldn’t quite justify a new place that didn’t save me any time.  I figured there are a few places I visit on a pretty regular basis, so I needed some way to work out if, on average, I’d be spending less time travelling.

A quick scan of the Google Maps API showed that I can pretty easily pull out the distance & duration between two addresses (passed as simple strings).  Next, to get the data into a form that can be manipulated easily.

Excel 2010 provides a few ways of importing data; the Google API supports XML, so I checked that first.  Unfortunately the XML import isn't live – it's only a once-off import – and I want to be able to update a value and have it update my sheet straight away.

Excel also supports a "Web Query" for importing data, which is pretty cool: you give it a URL and it will take a <table/> out of a web page and bring it in for you to manipulate.  Cool, but not helpful here.

Since it didn't look like Excel could do it natively using its default tools, I got my hands dirty.  This question on Stack Overflow had the goods: a class library in VBA that pulls a string data feed from a URL, and a link to this project that provides a VBA class for parsing JSON string data.

With those two modules neatly embedded in my sheet, all it took was a few lines in a custom function that retrieves the JSON API result and parses it down to the single value I need.  Since I don't want to get black-listed from the API, I popped a (very) simple cache in place.  Don't get too confident though: it only caches to memory, so the sheet will still re-fetch every calculation each time you load it.

' In-memory cache of distances; requires a reference to Microsoft Scripting
' Runtime (Tools > References in the VBA editor).
Dim DistCache As New Scripting.Dictionary

' Returns the driving distance in metres between two addresses, via the
' Google Maps Directions API.
Function CalculateDistance(startAddress As String, endAddress As String)
  Dim key As String
  key = startAddress & "|" & endAddress

  If DistCache.Exists(key) Then
    ' Cache hit: skip the web request.
    v = DistCache(key)
  Else
    ' SyncWebRequest is the VBA class from the Stack Overflow answer above.
    Dim request As New SyncWebRequest
    request.AjaxGet ("http://maps.googleapis.com/maps/api/directions/json?origin=" & startAddress & "&destination=" & endAddress & "&sensor=false")
    Dim json As String
    json = request.Response

    ' JSONLib is the JSON parser class linked above.
    Dim parser As New JSONLib
    Set result = parser.parse(json)
    ' Drill down to routes(1).legs(1).distance.value (collections are 1-indexed).
    Set routes = result("routes")
    Set route = routes(1)
    Set legs = route("legs")
    Set leg = legs(1)
    Set dist = leg("distance")
    v = dist("value")
    DistCache(key) = v
  End If

  CalculateDistance = v
End Function

I saved the spreadsheet (with non-personal data) – if you want to check it out you can download it here.  Remember you will need to enable macros before it will work.  To check out the source, hit the "Developer" tab and click "Visual Basic".  It looks like this place will save me a heap of time after all – now that the hard bit is done we just have to buy it :)


Alot of Fun

Posted: May 28th, 2012 | Filed under: Projects

Being ever frustrated with the world's misspelling of the words 'a lot', and inspired by Hyperbole and a Half's clever coping mechanisms, I thought I'd enhance the web with a Greasemonkey script that shows our friend, the Alot, on screen whenever his name is mentioned.

Here is the little critter:

[Image: the Alot]

You can install the plugin yourself from the page on userscripts.org.


Hehpic

Posted: May 1st, 2011 | Filed under: Projects

For a while now I’ve been working on an image-sharing site called hehpic.  It’s a work in progress & still has some fairly rough edges but it’s online & available for all to use.

Future development:

  • Browser extensions for easy sharing
  • Public API (there’s a private API now but it’s changing too frequently to be released)
  • More search features
  • Sex up the browsing interface
    • Tag combos (works now, just has to be done manually, use a comma)
    • Paging/Searching/Tag counts
  • Dedicated hosting environment (I love my Dreamhost but it’s probably not going to scale well – I’ll deal with that when/if it happens)

Main issues:

  • Feature poor
  • IE rendering is awful

Check it out.  Upload something.  Make an account.  Subscribe to some feeds.  Comment here.


Link Shrink 0.2

Posted: May 27th, 2010 | Filed under: Projects

Just a quick one this time: Link Shrink 0.2 is in the market.  Should hit the "100 downloads" mark soon, woo :)

Changelog:

  • Links can now be shared out through the clipboard (i.e. copy & paste)
  • App handles already shortened URLs gracefully
  • Slightly more sensible logic around the API credential validation

Github. Market.


Link Shrink 0.1

Posted: May 18th, 2010 | Filed under: Projects

I’ve just put together a 0.000000001 version of my bit.ly URL shortening tool for Android, called “Link Shrink”.

It integrates fairly tightly into the Android OS, at the moment only through the "Share Page" actions (technically speaking: android.intent.action.SEND + android.intent.category.DEFAULT + text/plain, copied straight from the Android browser).  So when browsing (or doing anything else that supports URL sharing) you can hit share, pick Link Shrink, and it will generate a bit.ly URL and send it back through the Share Page "intent" so you can pass it on to SMS, email, delicious or whatever tickles your fancy.  The re-share step is sketched below.
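For the curious, the re-share looks roughly like this (a sketch, not the app's actual source; the reshare helper and shortUrl parameter are hypothetical):

import android.app.Activity;
import android.content.Intent;

public class ShareExample extends Activity {
  // Called once the bit.ly API call has returned the shortened URL.
  private void reshare(String shortUrl) {
    Intent share = new Intent(Intent.ACTION_SEND);
    share.setType("text/plain");
    share.putExtra(Intent.EXTRA_TEXT, shortUrl);
    // Fire the same SEND intent back out so any share-capable app can receive it.
    startActivity(Intent.createChooser(share, "Share shortened URL"));
  }
}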

Since it’s only a 0.1, it only supports the basics.

  • URL shortening using bit.ly (and only bit.ly)
  • Users can optionally provide their own bit.ly login & API key to add URLs to their account
  • The only option for what to do with the shortened URL is re-sharing

There are plans for the future though and I’m kinda sweet on this tiny little app so it’ll probably be sooner rather than later.

  • UI improvements
  • Better error handling (e.g. detect when attempting to shorten an already shortened URL)
  • More shortening services (if anyone has preferences please let me know, I only use bit.ly)
  • More things to do with shortened URLs, such as:
    • Provide text boxes with long & short URLs for copy & paste
    • Send directly to the clipboard
    • Open bit.ly stats page
    • Prompt user every time for one of the above
  • Better configuration screen to support the above
  • Your idea here!

Ultimately though, the most important feature (URL shortening) is done & working fairly solidly on the emulator & my phone so I think I’ll let it loose on the market.  Please whack a comment in or email me if you have any feedback, problems, ideas or suggestions.

As always, the source is on Github, plus you can get the latest APK from my Dropbox, plus it’s on the market (scan the barcode below or click here if you’re on an Android phone).

Link Shrink QR code