Where can I get a full week's Oyster data?


I work for VoltDB (www.voltdb.com), who develop applications in the 100K TPS range.

I’ve developed a dashboard demo using VoltDB and ChartIO (picture attached) and used the publicly available data set (https://data.london.gov.uk/dataset/oyster-card-journey-information), but it only has 5% of the data for a week. I’ve hacked past that by inserting each record 20 times, but would still love to work with ‘better’ data.
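A minimal sketch of that padding workaround, for anyone wanting to reproduce it. The column names, the `pad_sample` helper and the `copy_id` field are assumptions made for illustration; they are not the actual layout of the published file.

```python
import csv
import io

def pad_sample(rows, factor=20):
    """Yield each record `factor` times, tagging every copy so rows stay distinct."""
    for row in rows:
        for copy_id in range(factor):
            yield {**row, "copy_id": copy_id}

# Two made-up journeys standing in for the 5% sample (hypothetical columns).
sample_csv = io.StringIO(
    "entry_station,exit_station,entry_minute\n"
    "Bank,Oxford Circus,510\n"
    "Brixton,Victoria,545\n"
)
padded = list(pad_sample(csv.DictReader(sample_csv)))
print(len(padded))  # 2 records x 20 copies
```

Padding like this restores the volume of a full week, but every copy carries identical stations and timings, so the variety of a real 100% data set is still missing.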

So my question is this: if I wanted to get my hands on an entire week’s data, who would I ask?

Currently my dashboard looks like this:



We also have a zip file which contains a two-week sample from October 2015, including station entries by day and in 30-minute intervals, and the same data for station exits. It also contains journey times and bus journeys. Read more here: https://blog.tfl.gov.uk/2015/12/09/is-customer-flow-data-useful-to-developers/. Let us know what you think of the data - we are always keen to refine and improve our data sets based on feedback from the open data community.


It’s unquestionably useful - one of the interesting things you encounter when working with Tube data is that you know where people enter and leave the system, but what happens in between is a bit of a mystery. Having data on how loaded different routes are would be useful.

One interesting analogy I see is with network traffic management - unlike in a ‘normal’ network, where the packet’s path is determined by some kind of central controlling authority, in the TfL universe the ‘packets’ are sentient and make their own decisions about routing! The analogy gets more interesting when you consider that Tube ‘bandwidth’ is a wasting asset just like normal bandwidth - an empty seat on a train represents a lost opportunity for someone to travel somewhere.

Let’s say I have a conventional WAN with 3 nodes - London, Dublin and New York - all of which are connected. If the London-Dublin link gets saturated but the others aren’t, the most efficient way to use the network would be to route new traffic from London to Dublin via New York instead of having it join the queue for London->Dublin. I would argue that the Tube network is the same in many ways - the optimal route for a given journey might change and become - at times - counterintuitive.
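To make the WAN example concrete, here is a minimal sketch of load-aware routing using Dijkstra's algorithm, where each link costs its base latency plus its current queueing delay. The `best_route` function and all the millisecond figures are made up for illustration; they are not measurements of any real network.

```python
import heapq

def best_route(links, source, target):
    """Return (total_cost, path) for the cheapest route, or None if unreachable.

    Each link costs its base latency plus its current queueing delay.
    """
    graph = {}
    for (a, b), (latency, queue_delay) in links.items():
        cost = latency + queue_delay
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))  # links are bidirectional

    frontier = [(0, source, [source])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(
                    frontier, (cost + link_cost, neighbour, path + [neighbour])
                )
    return None

# Illustrative numbers only, in milliseconds: (base latency, queueing delay).
quiet = {
    ("London", "Dublin"): (10, 0),
    ("London", "New York"): (70, 0),
    ("New York", "Dublin"): (75, 0),
}
saturated = dict(quiet)
saturated[("London", "Dublin")] = (10, 200)  # direct link is now congested

print(best_route(quiet, "London", "Dublin"))
print(best_route(saturated, "London", "Dublin"))
```

With these assumed figures the quiet network picks the direct link, while the saturated one prefers the detour via New York because 70 + 75 is less than 10 + 200. A Tube version would swap latency for journey time and queueing delay for some measure of crowding.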

So to return to the question of how useful this data is: you could plausibly use it to build an app that lets users trade time for comfort by offering them different - and possibly weird - routes that use spare capacity instead of forcing people to play ‘sardines’ on trains.

Question: How accurate and predictable are load factors for trains? And once such an app is unleashed and gains acceptance, how fast could it render a historical data set useless?

Question: How close is TfL to being able to provide train loading information in real time? As a developer, if I have live data and last week’s data for the same day and time, I can make predictions about loading with a straight face. Otherwise I’m guessing…

As for the full week’s Oyster data: here at VoltDB we love large data sets with real-world data. The problem with fake data is that you’re setting yourself up for what I call “Kung Fu Villain Syndrome” - nice, well-behaved data that doesn’t represent the nastiness of the real world. We’ve found the 5% set very useful, but would much prefer a 100% set, even for a day…


David Rolfe


Thanks for your feedback, David! The data on typical usage is historic by nature, and therefore correct for the period of time it represents (in the case of the data we released, weekdays in November 2015). We refresh the data annually to take account of changing demand and usage patterns across the network. We understand that issues may arise if developers present the typical data as being wholly accurate at a given point in time (which we’re not suggesting they do – this data provides the baseline ‘typical’ network conditions). It’s also clear that, to deliver real customer value, we need to supplement the baseline data with near real-time data reflecting current conditions on the network. We’re working towards addressing this need in the future. Look out for updates here and on our blog.