Data & beer; Beer & data.

Enabling HTTP request / response logging in the Google APIs Client Library for Java

Posted: June 3rd, 2014 | Filed under: Blog

This one gets me every time I touch this library, so I’m posting my notes for future reference.

Logging can be enabled by creating a custom HttpRequestInitializer and HttpResponseInterceptor, but that will consume the InputStream, and you’ll likely see getContent() return null or IOException “Stream closed” errors.

However, the client library logs through Java’s standard java.util.logging framework, which can be configured to write to the console / stderr with these steps.

  • Create a file called, say, ‘logging.properties’ and store it somewhere accessible to your project (I just dump it in the project root for debugging).
  • Paste these lines into the file:
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level=CONFIG
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
# The library logs HTTP traffic at CONFIG level, so raise its logger too,
# otherwise those records never reach the handler.
com.google.api.client.http.level=CONFIG
  • Run your app with this extra VM argument, either by adding it to your command line (java -cp … -Djava.util.logging… MyClass) or to the “VM options” field in your IDE.
-Djava.util.logging.config.file=logging.properties

Et voilà, logs!


Ingress Portal Attack Simulation in R

Posted: July 14th, 2013 | Filed under: Blog

I’ve put together a pretty basic R script that generates portal attack simulations for Ingress. Ultimately I want to use it to derive optimal resonator placement strategies for specific scenarios, but for now it just produces cool simulation pics.

Grab the code on Github. I’d love a pull request if you have something useful to add. Details on how to use it are provided in the README.

Scenarios

Basic Examples

Here are some sample scenarios where the attacker fires 3 level-8 bursters using 2 different firing strategies. The attacker is firing upon what I call a “lazily loaded” portal filled by a single L8 player: an L8 res in the EAST slot, an L7 in NE, L6s in N & NW, L5s in W & SW, and L4s in S & SE.

It’s kinda cool to see that the “fire in the center” strategy for L5 and higher bursters isn’t necessarily the best.

Fire straight from the center.

Fire closer to the more powerful resonators.

Nothing fired, just the lazy loading scenario prior to an attack.
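
To make the mechanics concrete, here’s a minimal sketch of the simulation idea in R. This is not the actual Github script: the resonator energies, the 40 m deployment distance, and the burster damage and linear falloff are all assumed round numbers standing in for the real game mechanics.

# "Lazily loaded" portal: resonator energy by compass slot (values assumed).
reso.energy <- c(E = 6000, NE = 5000, N = 4000, NW = 4000,
                 W = 3000, SW = 3000, S = 2500, SE = 2500)

# Resonators assumed to sit 40 m from the portal at the 8 compass points.
angles <- c(E = 0, NE = 45, N = 90, NW = 135, W = 180, SW = 225, S = 270, SE = 315)
reso.xy <- cbind(x = 40 * cos(angles * pi / 180),
                 y = 40 * sin(angles * pi / 180))

# Toy damage model: full damage at the firing point, falling linearly to
# zero at an assumed 55 m radius.
burster.damage <- function(x, y, max.dmg = 2700, radius = 55) {
  d <- sqrt((reso.xy[, "x"] - x)^2 + (reso.xy[, "y"] - y)^2)
  pmax(0, max.dmg * (1 - d / radius))
}

# Total damage dealt (capped at each resonator's remaining energy):
# firing at the center vs. firing on top of the L8 resonator in the E slot.
sum(pmin(burster.damage(0, 0), reso.energy))
sum(pmin(burster.damage(40, 0), reso.energy))

Looping that comparison over a grid of firing points, with real damage curves instead of the toy falloff, is essentially what the simulation pics show.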

Street-Side Portal

I’ve often wondered if I could optimise resonator placement for a portal that sits on a street, prone to drive-bys. The idea that came to mind was to put the heavier resonators facing the street (absorbing damage) but also bring them closer to the center, keeping them off the street itself. In these examples I’ve assumed traffic flows from west to east, but it doesn’t make a huge difference.

Here’s the “streetside” resonator placement, no attack.

Here’s an attack on the street-side setup. Ouch.

Here’s a comparative attack on a standard “lazy” fill. There’s 27% more damage done in the street-side “optimised” scenario.


H100.1: Analysing the impact of annual play count on chart presence

Posted: April 21st, 2013 | Filed under: Blog

Authored by Mark McDonald (i@ii.net), April 2013. Download the data & source files here.

Abstract

The Australian radio station Triple J hosts an annual vote for the “Hottest 100” songs released in that year. Votes are cast by listeners and music fans around the world, with each vote nominating 10 songs. This series of analyses works towards a predictive model for two outcomes: whether a song will be present in the Hottest 100 and, if so, where it will place in the ranking.

This first analysis begins by determining the relationship between the number of times a song was played on Triple J in the year under scrutiny and the binary outcome of whether or not it featured in the top 100 songs for the same year.

We first collect the play data and Hottest 100 charts for the years 2009-2012 and prepare them for analysis. We then analyse the years 2010, 2011 and 2012 using a linear model, demonstrating a weak positive correlation with an adjusted R² value of 0.2458 across the 3 years combined. A digression is included to validate our source data, identifying potential gaps.

Data Collection

Triple J play data

The included data source q1_play_timestamps.tsv was obtained from the website JPlay, which sources its data directly from Triple J broadcasts. Each observation is a row for one song played on the station, including the timestamp, song title, artist, and JPlay's unique IDs for the song & artist. These data were originally housed in an SQL database, where I added the “first_year” column, calculated as the year the song (identified by songid) first appeared in the table.

SELECT p.timestamp, p.artist, p.song, p.artistid, p.songid, 
  (SELECT Year(Min(p1.timestamp)) FROM Play p1 WHERE p.songid = p1.songid) AS first_year
FROM Play p 
ORDER BY p.timestamp;

Hottest 100 Charts

The Hottest 100 charts in h100_ranks.tsv were copied directly from the Triple J website's records for 2010, 2011 & 2012 into Excel, then cross-referenced with the song records and updated to point to the matching entries in the JPlay-sourced data.

Data Preparation

While this is a fairly mundane piece of work, it's essential for getting the data into a usable format. We stripped the data down to include only songs released in 2010, 2011 & 2012, using the first time a song was played within this data set as a proxy for its release year. This is not perfect, but it strips most data outside the scope of interest. A certain amount of noise is expected to remain, though this technique will not remove any valid observations of interest.

# Keep only songs first played in 2010-2012, retaining the columns we need.
playLog <- rawPlays[rawPlays$first_year > 2009 & rawPlays$first_year < 2013, 
    c("posixts", "songid", "first_year")]
# Keep only plays that occurred in each song's first year
# (posixts is POSIXlt, so $year counts years since 1900).
playLog <- playLog[playLog$first_year == playLog$posixts$year + 1900, ]

Exploring 2010 data

In order to generate some visualisations of the data, we have limited it to songs first played in 2010 and therefore eligible for the 2010 Hottest 100 competition.
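
The per-song counts used below are built from that play log. Here's a sketch of the aggregation step; the chart data frame h100.2010 and its columns are assumed names, while table() supplies the "Freq" count column seen in the output further down.

# One row per song first played in 2010, with its total play count.
plays.2010 <- playLog[playLog$first_year == 2010, ]
counts.2010 <- as.data.frame(table(songid = plays.2010$songid))
# Flag songs that placed in the 2010 Hottest 100 and attach their rank
# (chart data frame and column names assumed).
counts.2010$inh100 <- counts.2010$songid %in% h100.2010$songid
counts.2010$rank <- h100.2010$rank[match(counts.2010$songid, h100.2010$songid)]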

Figure 1 depicts the number of times each song first played in 2010 was played over the entire year. Data points in red placed in the Hottest 100; points in black did not.

[Figure 1: total 2010 plays per song; red points placed in the Hottest 100]
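
A guess at the call behind this figure, with the axes assumed to be song index against total plays:

# One point per song; red if it placed in the 2010 Hottest 100.
plot(counts.2010$Freq, pch = 20,
     col = ifelse(counts.2010$inh100, "red", "black"),
     xlab = "Song", ylab = "Plays in 2010")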

Points of interest in this figure are the red points very close to the x-axis: there is one right on the axis at the far right of the chart and a few others fairly close. There does appear to be a higher density of red in the upper half of the chart.

Most significantly, there is a very heavy bias towards songs played infrequently (i.e. along the x-axis). This leads us to investigate the data set further, ensuring its integrity.

Digression: Data validation

Are songs played only once considered valid?

It appears there are a lot of songs played only once. Let's confirm this hypothesis.

Figure 2 shows a histogram of play counts for songs played fewer than ten times in 2010.

[Figure 2: histogram of play counts for songs played fewer than ten times in 2010]
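
Something like the following reproduces it (bin boundaries assumed):

# Distribution of per-song play counts below ten.
hist(counts.2010$Freq[counts.2010$Freq < 10], breaks = 0:9,
     main = "Songs played fewer than 10 times (2010)", xlab = "Plays in 2010")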

It appears that one-off plays are extremely common. Given that Triple J is an indie radio station, this is not entirely surprising and not necessarily an anomaly. Songs played only once account for 61.9% of the distinct songs played in 2010. Three of these songs made it into the Hottest 100, which seems unusual but confirms that we should not discount this data.

counts.2010[counts.2010$Freq == 1 & counts.2010$inh100, ]
##      songid Freq inh100 rank
## 6733  34507    1   TRUE   95
## 7015  34974    1   TRUE   21
## 7016  35042    1   TRUE   26

We must investigate the data further to ensure nothing is missing. This digression will not be repeated in future analyses unless it turns up new discoveries not covered here.

Have we got data for the entire year?

Figure 3 shows the number of songs played for each day of 2010. The highest point (the green mark early in the year) is Australia Day 2010. This is the day that the Hottest 100 results for the previous year are announced & aired. There are 365 points here, meaning we have data for every day of the year.

[Figure 3: songs played per day of 2010; the green mark is Australia Day]
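
A sketch of the daily aggregation, re-using plays.2010 from earlier (posixts is POSIXlt, so as.Date() applies directly):

# Count the songs played on each calendar day of 2010.
daily <- table(as.Date(plays.2010$posixts))
plot(as.Date(names(daily)), as.vector(daily), pch = 20,
     xlab = "Date", ylab = "Songs played")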

It's fairly clear that we have a pattern of 2 days per week that are consistently lower in plays than the rest of the week.

Does time of day have an impact on weekend plays?

Figure 4 shows the number of songs played over the year by hour of the day, segmented by day of week.

[Figure 4: songs played by hour of day, segmented by day of week]
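
And a sketch of the hour-by-weekday split (one line per day of week; assumes every hour of the day appears at least once in the data):

# 24 x 7 matrix of play counts: hour of day by day of week.
byhour <- table(hour = plays.2010$posixts$hour,
                day = weekdays(plays.2010$posixts))
matplot(0:23, byhour, type = "l", lty = 1,
        xlab = "Hour of day", ylab = "Songs played")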

This investigation shows an abnormally low number of songs played between 1am and 6am on Saturday and Sunday mornings. This corresponds to the Mix-up, House Party & other electronic/dance music sessions usually aired late Friday & Saturday nights. Both our data and the upstream provider JPlay are missing these records; it is assumed that Triple J does not provide play lists or song titles during these periods.

Because of this regular gap, it is anticipated that songs that would usually air most frequently in these windows will have deflated total play counts. One song played only once during the year made it to number 21 in the 2010 Hottest 100: Skrillex's Scary Monsters & Nice Sprites. Another was Crystal Castles' Not In Love, ranking 26th. The third was not electronic: Grouplove's Naked Kids, released in November 2010 and ranking 95th.

Modelling

2010 data

Using a standard linear regression model to analyse the impact that total annual play count has on the binary outcome of placing in the Hottest 100 chart for 2010 gives us this summary.

## 
## Call:
## lm(formula = counts.2010$inh100 ~ counts.2010$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4219  0.0013  0.0038  0.0038  1.0038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.006337   0.001254   -5.05  4.4e-07 ***
## counts.2010$Freq  0.002504   0.000049   51.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.0997 on 7015 degrees of freedom
## Multiple R-squared: 0.271,   Adjusted R-squared: 0.271 
## F-statistic: 2.61e+03 on 1 and 7015 DF,  p-value: <2e-16

If we treat the outcome as a probability, this model suggests that a song played 402 times or more is guaranteed to place in the Hottest 100, though the R² value of 0.271 indicates this is a weak relationship. The residuals sit closer to the model when play count is at either extreme, suggesting confidence in the model is higher for songs played either very frequently or very infrequently. This reflects intuition.
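
That 402-play figure falls straight out of the coefficients above; as a quick check (same counts.2010 data frame, via the formula interface):

# The fitted value reaches 1 when Freq = (1 - intercept) / slope.
fit.2010 <- lm(inh100 ~ Freq, data = counts.2010)
(1 - coef(fit.2010)[1]) / coef(fit.2010)[2]   # about 402 plays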

[Figure: fitted 2010 model over the play-count data]

2011 data

## 
## Call:
## lm(formula = counts.2011$inh100 ~ counts.2011$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4667 -0.0015  0.0036  0.0036  1.0036 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.17e-03   1.53e-03   -4.04  5.4e-05 ***
## counts.2011$Freq  2.56e-03   5.73e-05   44.60  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.111 on 6012 degrees of freedom
## Multiple R-squared: 0.249,   Adjusted R-squared: 0.248 
## F-statistic: 1.99e+03 on 1 and 6012 DF,  p-value: <2e-16

2011 data shows a similar result, with the slope slightly higher at 0.002558 and the model fitting slightly worse, with an R² of 0.249.

[Figure: fitted 2011 model]

2012 data

## 
## Call:
## lm(formula = counts.2012$inh100 ~ counts.2012$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3817 -0.0017  0.0028  0.0050  0.9893 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -7.26e-03   1.85e-03   -3.93  8.6e-05 ***
## counts.2012$Freq  2.25e-03   6.09e-05   36.92  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.12 on 4889 degrees of freedom
## Multiple R-squared: 0.218,   Adjusted R-squared: 0.218 
## F-statistic: 1.36e+03 on 1 and 4889 DF,  p-value: <2e-16

2012 follows the trend, with a slope of 0.002248 and an R² of 0.218.

[Figure: fitted 2012 model]

Combined 2010-2012 data

## 
## Call:
## lm(formula = counts.all$inh100 ~ counts.all$Freq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4437 -0.0008  0.0041  0.0041  1.0041 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.52e-03   8.69e-04    -7.5  6.5e-14 ***
## counts.all$Freq  2.43e-03   3.18e-05    76.4  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.109 on 17920 degrees of freedom
## Multiple R-squared: 0.246,   Adjusted R-squared: 0.246 
## F-statistic: 5.84e+03 on 1 and 17920 DF,  p-value: <2e-16

[Figure: fitted model over the combined 2010-2012 data]

Putting the linear model aside for a moment, there appears to be a cut-off point: in this data set, a song played more than 185 times always earned a spot in the charts, with the trend beginning at around 150 plays. I expect to see this again in a future decision tree analysis.
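
One way to pin that cut-off down empirically is the most-played song that missed the chart (same assumed columns as above):

# Highest play count among songs that did NOT place; any song played more
# than this always made the charts in 2010-2012.
max(counts.all$Freq[!counts.all$inh100])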

Summary & Future Work

The number of times a song is played on Triple J does have an impact on its likelihood of appearing in the Hottest 100, though the correlation is weak. Even so, songs with a very high play count are highly likely to at least place in the Hottest 100.

These results may be influenced by a number of external factors not captured here. A song may be played heavily on another station but not on Triple J, gathering “external” voters. A song may be played heavily during the periods for which Triple J does not publish individual songs. A song may be released at the end of the year, gathering a relatively low number of plays but still enjoying popular appeal.

This analysis highlights that annual total play count is a weak indicator on its own; it should be combined with other factors, aggregated over a different period, or both. A future analysis should examine play counts over the course of the year to investigate the suggestion that songs released closer to the voting period have a higher chance of making the top 100.

Author's Note

Thanks for getting this far in my first data analysis. I plan on doing a heap more of these to ultimately try and build a more accurate model of what makes up a Hottest 100 song. I'd love to hear any feedback you have; I'm particularly interested in feedback on this analysis or the data. Any suggestions for future analyses or factors can be made on my website's index for this series.


Samsung Galaxy Camera review

Posted: January 25th, 2013 | Filed under: Blog

So Weezer played in town & I managed to score a couple of tickets. My Dear Employer also happened to start selling the new Samsung Android cameras (EK-GC100). I thought I’d test the camera out at the gig & see what it could do (full disclosure: I requested a camera to borrow & test out; I was not paid or asked to do so).

We were seated at the back of the venue, directly in line with the stage but many, many meters away.  Couple that with my not-so-steady hand & a lot of the photos turned out quite blurry.  I’ve kept & published them to give an idea of the ratio of good-to-terrible shots.

Here’s a few shots for comparison.  It’s not a particularly fair “review” since these are all dark concert shots, but you get the idea.  None of these pictures have been post-processed, other than what the device did through its presets.

From our seats, no zoom.


Almost full zoom.


Full zoom (21x optical).


It does some great video, too. The big disappointment from concert filming was the heavy audio clipping & distortion; we were miles back, so I suspect this is not a camera to use at any concert. The picture itself, though, was great.

Outside of these dark concert shots, the camera has some pretty cool features for photos of people. Back in about 2006 I saw an IEEE magazine article about the “next greatest thing”: camera prototypes that would take a series of shots, perform facial recognition, and let you choose which face shot to use for each detected person in the photo. This camera can do that! If I can hang on to it a little longer, I’ll grab some shots and share some samples.

The other feature worth mentioning is that this is a full Android device.  The camera takes a micro-SIM and has WiFi, so you can use it on the go.  Being a Samsung, it comes pre-installed with Dropbox to keep your pics in The Cloud™, as well as Instagram and some pretty fun photo manipulation apps that will probably never get used.

I’m a Flickr user and keep my phone pics backed up using Photo Mule (it appears to be missing from Play these days, but I keep the APK here). The Flickr app is also unsupported on this device – a huge bummer, but I installed it from the APK. Once past these annoyances, the rest of the important photo apps were next: Add A Cat & Rage Face Photo.

The Android OS interface is snappy and keeps a “Camera” icon on each of the home screens; handy for flicking back to Camera mode at short notice. The battery life is OK – by no means excellent, but considering what the thing is doing, it lasts a decent while. I had it running for about 3-4 hours taking photos and videos, with regular passes to friends to play, and had 54% remaining after the gig.

I am by no means a photo nut and this camera is not for photo nuts. It is a networked camera, of medium complexity, for people looking for a point-and-shoot device. At $439 (from Kogan, not my employer) I would buy one for myself.