Can we get an updated Oyster Card data set?

drolfe · August 27, 2018, 10:18am

Folks,

Your 2009 5% Oyster card sample is a really good real world data set, in that it has the anomalies and gotchas that occur in the real world, such as:

Journeys that end before they start (presumably the clock in a ticket machine is off)
Journeys that take zero or one minute and begin and end in the same place (presumably someone using the station as a shortcut)

… and so on.
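For anyone who wants to pick these cases out of the sample, here is a minimal sketch of how the two anomalies above could be flagged. The `Journey` record and its field names are hypothetical (the real sample's columns may well be named differently), and the one-minute threshold simply mirrors the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Journey:
    # Hypothetical record layout; not the actual column names of the sample.
    start_station: str
    end_station: str
    start_time: datetime
    end_time: datetime


def classify(journey: Journey) -> str:
    """Return 'negative_duration', 'same_station_short', or 'ok'."""
    duration = journey.end_time - journey.start_time
    # Exit is stamped before entry: most likely clock skew between machines.
    if duration < timedelta(0):
        return "negative_duration"
    # Same station in and out within a minute: possibly a shortcut.
    if (journey.start_station == journey.end_station
            and duration <= timedelta(minutes=1)):
        return "same_station_short"
    return "ok"


# Placeholder timestamps and station names, for illustration only.
t0 = datetime(2009, 11, 2, 8, 30)
print(classify(Journey("A", "A", t0, t0 - timedelta(minutes=1))))  # negative_duration
print(classify(Journey("A", "A", t0, t0)))                         # same_station_short
print(classify(Journey("A", "B", t0, t0 + timedelta(minutes=25)))) # ok
```

Checking for a negative duration first means a same-station journey that ends before it starts is counted as a clock problem rather than as a shortcut.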

Is there any chance you guys could produce a bigger, more up to date data set?

David Rolfe

theochapple · October 10, 2018, 10:51am

Hi @drolfe
We have a more up to date set of data here (it includes contactless taps as well as Oyster). We are working on a release of daily aggregated taps data. We’ll announce it on the forum when it’s ready so watch this space!
Thanks
Theo

drolfe · October 12, 2018, 10:34am

Theo,

Thank you for your reply - what we’re actually looking for is a large, anonymized, unaggregated data set that relates to real-world human activity. The 2009 Oyster card data is good, but in an ideal world we’d get more data - ideally 100%.

Why would we want it? Well, I work for a company called voltdb.com, who make an OLTP database used for handling hundreds of thousands of transactions per second. A ‘pet peeve’ of mine is that most test applications use ridiculously simple random generators to mimic traffic, when anyone who works in this space knows that real-world traffic will have a steady background level of weirdness.

The Oyster data set has great examples of this, in that we see people starting a tube journey at station ‘A’ and then finishing at station ‘A’ a minute before they started. This is the kind of thing which gives many OLTP systems nervous breakdowns. My guess is it’s caused by people using the station as a shortcut, and using two machines where the clocks are slightly out, but I’d be curious as to what you think…
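To make the "background level of weirdness" concrete, here is a rough sketch of a test-traffic generator that mixes a configurable share of that anomaly into otherwise plain random journeys. Everything in it is an assumption for illustration: the station names, the base date, the duration range and the `weird_rate` parameter are made up, and it is a stand-in for experimenting, not a substitute for real data.

```python
import random
from datetime import datetime, timedelta

STATIONS = ["A", "B", "C", "D"]  # placeholder station names


def synthetic_journeys(n, weird_rate, seed=0):
    """Yield n tuples of (origin, destination, start_time, end_time).

    weird_rate is the probability that a journey starts and ends at the
    same station with the exit stamped one minute before the entry.
    """
    rng = random.Random(seed)          # seeded so test runs are repeatable
    base = datetime(2009, 11, 1, 6, 0) # arbitrary start of the service day
    for _ in range(n):
        start = base + timedelta(minutes=rng.randrange(18 * 60))
        if rng.random() < weird_rate:
            station = rng.choice(STATIONS)
            # Simulated clock skew between the entry and exit machines.
            yield (station, station, start, start - timedelta(minutes=1))
        else:
            origin, dest = rng.sample(STATIONS, 2)
            # Ordinary journey lasting 5 to 60 minutes (assumed range).
            yield (origin, dest, start,
                   start + timedelta(minutes=rng.randint(5, 60)))


for row in synthetic_journeys(5, weird_rate=0.2, seed=42):
    print(row)
```

Setting `weird_rate` to 0.0 gives the "ridiculously simple" traffic described above, while a small non-zero value exercises the negative-duration path on every test run.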

In any case - I’m asking for more data out of my own personal curiosity, and I entirely understand if that’s a hard thing for you to prioritise…

DR

harry · October 21, 2018, 4:35am

That would work out expensive if done on National Rail, where a minimum fare of about £1.90, equal to the cost of travelling to the nearest station, is imposed. It would be a costly shortcut to take on a daily basis, but might be useful for those with a Freedom Pass.

In a similar vein, I’ve seen stoppoint data showing buses predicted to arrive at the tenth stop on a route a minute or two before arriving at the ninth stop, but that’s probably due to one prediction being generated live and the other being delivered from cached data that could be 5 seconds older.

theochapple · October 22, 2018, 2:22pm

Hi @drolfe
Yes, unfortunately releasing and maintaining 100% coverage would be a huge undertaking for us. There would have to be a very strong use case for us to be able to justify the cost and effort to do that.