I'm a sucker for data and automated data analysis. I don't care if it's a huge unruly database, CVS/SVN commit info, a code base, stock market feeds, site performance data, whatever – I always believe that if you can just write a clever enough program, you can discover something useful from it.
However, my quest for programs to discover this “useful thing” is like, to borrow an image from The West Wing, Charlie Brown kicking the football while Lucy holds it – it always looks promising right up to the last minute, when the inevitable happens.
The thing is, just like Charlie Brown, I'm sure I'll be able to do it next time.
This time it was the Perl Survey 2007 – I was sure that I could construct a program that, on its own, would automatically discover something truly startling about the Perl community if I just approached it from the right direction.
I decided to limit myself to the boolean fields in the survey, making a field goal certain.
So I wrote a program that identified all the boolean fields, or at least those that looked like them. Then it went about constructing conditional propositions for permutations of the fields, such as:
P->Q
!P->Q
P->!Q
!P->!Q
Where P and Q are both boolean fields in the data. These conditionals were then evaluated against the response data and a measure of truthiness (you've got to love hacks that combine formal logic, Perl and Stephen Colbert) was calculated.
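As a rough illustration of the idea, here is a minimal Python sketch of that first pass. It is not the original program: the field names and toy responses are made up, and scoring each conditional as a material implication averaged over every response is only one plausible reading of the "truthiness" measure described above.

```python
from itertools import permutations

# Hypothetical toy data: each response maps a boolean field name to True/False.
responses = [
    {"posted": True,  "subscribed": True},
    {"posted": False, "subscribed": True},
    {"posted": False, "subscribed": False},
    {"posted": True,  "subscribed": True},
]

def propositions(fields):
    """Yield (p, negate_p, q, negate_q) for every ordered pair of distinct fields."""
    for p, q in permutations(fields, 2):
        for negate_p in (False, True):
            for negate_q in (False, True):
                yield p, negate_p, q, negate_q

def implies(lhs, rhs):
    """Material implication: false only when lhs is true and rhs is false."""
    return (not lhs) or rhs

def truthiness_all(responses, p, negate_p, q, negate_q):
    """Assumed measure: fraction of ALL responses for which the conditional holds."""
    # `value != negate` flips the value when the term is negated.
    hits = sum(implies(r[p] != negate_p, r[q] != negate_q) for r in responses)
    return hits / len(responses)

def label(name, negated):
    """Render a term the way the post writes it, e.g. '!posted'."""
    return ("!" if negated else "") + name

scores = {
    f"{label(p, neg_p)}->{label(q, neg_q)}": truthiness_all(responses, p, neg_p, q, neg_q)
    for p, neg_p, q, neg_q in propositions(["posted", "subscribed"])
}

# Print the propositions from most to least "truthy".
for rule, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{rule}: {score:.3f}")
```

With two fields this yields eight propositions (four per ordered pair), which is why the number of conditionals grows so quickly on a real survey.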
I'm pretty sure there is a better, or maybe more formal, algorithmic or mathematical approach to this. However, a quick Google and a trip to the bookcase didn't reveal anything immediately useful, although I find that half of the time, in algorithms and maths especially, you need to solve a problem the naïve way to properly appreciate and understand the clever way, or even to recognise it as the solution you need.
In the first version of the program I discovered two things. First, a lot of very truthy things were really, really boring, e.g. if you haven't helped with Perl 5 there is a good chance you haven't helped with Perl 6. Second, there is a certain amount of psychology in how we value rules: I believe we value the “if P then whatever” type, value the “if !P then whatever” type less, and especially do not value “if !P then !Q”.
Also, my first measure of truthiness, which averaged over all the responses, just didn't feel right, and I moved to averaging it over the times when P, or rather the left hand side (LHS), was true. For reasons I'm not sure I can explain, this approach just felt better and gave better results.
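One possible explanation, assuming version 1 scored each conditional as a material implication (the post doesn't say): an implication counts as true whenever its LHS is false, so a rule with a rare LHS scores highly over all responses no matter what happens when the LHS actually holds. Averaging only over the LHS-true responses removes those vacuous hits. The Python sketch below contrasts the two measures on made-up data; the function names and the toy responses are hypothetical.

```python
# Hypothetical toy data: P is rare, and the one time it holds, Q does not.
responses = [{"P": True, "Q": False}] + [{"P": False, "Q": False}] * 9

def truthiness_all(responses, p, q):
    """Version-1-style score (assumed): share of ALL responses where p -> q holds."""
    return sum((not r[p]) or r[q] for r in responses) / len(responses)

def truthiness_given_lhs(responses, p, q):
    """Version-2-style score: share of responses with p true where q is also true."""
    lhs_true = [r for r in responses if r[p]]
    if not lhs_true:
        return None  # the rule never fires, so there is nothing to score
    return sum(r[q] for r in lhs_true) / len(lhs_true)

score_all = truthiness_all(responses, "P", "Q")
score_lhs = truthiness_given_lhs(responses, "P", "Q")
print(score_all, score_lhs)
```

Here the first measure rates "P->Q" at 0.9 even though Q never follows P, while the second rates it at 0.0. The second measure is the same quantity that association rule mining calls the confidence of a rule.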
So I cracked on with version 2, despite seeing Lucy's hand on the ball starting to twitch on version 1.
So with the improvements in mind, I managed to adjust it in just a few minutes. And the results were astonishing!....-ly boring. Here's the top 5...
If 'Posted to Perl Mongers list ' Then 'Subscribed to Perl Mongers list ' (With 0.995 probability)
If 'Posted to other list ' Then 'Subscribed to other list ' (With 0.990 probability)
If 'Attended conference (non-local) ' Then 'Attended conference ' (With 0.970 probability)
If 'Attended Perl Mongers (non-local)' Then 'Attended Perl Mongers ' (With 0.965 probability)
If 'Contributed to Perl 5 ' Then 'Subscribed to other list ' (With 0.882 probability)
Not exactly breathtaking stuff, and there will be no prizes for what other list they are likely to be subscribed to if they have contributed to Perl 5.
Anyway, to get something useful out of the list I ended up hand editing it myself, so here is my clumsy/poor attempt at interesting things from the Perl Survey 2007 (pay attention to the probabilities, as the lower ones are more interesting).
If 'Contributed to Perl 5 ' Then 'Contributed to CPAN ' (With 0.780 probability)
If 'Presented at conference ' Then 'Subscribed to other list ' (With 0.778 probability)
If 'Contributed to Perl 6 ' Then 'Provided feedback ' (With 0.701 probability)
If 'Contributed to Perl 5 ' Then 'Attended Perl Mongers ' (With 0.630 probability)
If 'Attended conference ' Then 'Attended Perl Mongers ' (With 0.629 probability)
If 'Subscribed to other list ' Then 'Attended conference (non-local) ' (With 0.112 probability)
If 'Led other projects ' Then 'Contributed to Perl 5 ' (With 0.111 probability)
If 'Contributed to CPAN ' Then 'Contributed to Perl 6 ' (With 0.101 probability)
If 'Attended conference ' Then 'Contributed to Perl 5 ' (With 0.099 probability)
If 'Perlmonks ' Then 'Contributed to Perl 6 ' (With 0.089 probability)
If 'Attended Perl Mongers ' Then 'Contributed to Perl 6 ' (With 0.077 probability)
If 'Posted to Perl Mongers list ' Then 'Contributed to Perl 6 ' (With 0.072 probability)
To be honest, I think Lucy has won once again. However, if you are interested in the raw output, have a look at v1_out.txt or v2_out.txt. And do take some heart from the performance figures: even with version 1's brute force/clumsy technique, the program still managed to analyse over 5 million conditional proposition/data points in 20 seconds. Maybe some day we'll kick that football.