Bus working timetable update seems to be a mess

mjcarchive · November 16, 2021, 11:18am

A very quick look suggests that (a) new material, such as the 414 curtailed to Marble Arch, has not been uploaded and (b) a lot of old material has been reloaded, possibly overwriting some or all of last week’s new files (e.g. the 188).

I say “a quick look” because I have a few pints waiting for me very shortly, so it may be some time until I can take a proper and reliable look but I thought it worth giving a heads up.

I also wonder if it is connected in some way to the Countdown/iBus issues (though surely not the bike issue) identified in other threads, as that may well have involved new schedules not being loaded.

briantist · November 16, 2021, 2:06pm

@mjcarchive I’m not having problems with the data being read from the output (either from here or from https://data.bus-data.dft.gov.uk/timetable/download/compliant_bulk_archive), so I’m wondering if I should be running tests on this data.

I must admit that I don’t care about out of date data as it just gets filtered out by the import code.

If I were to write automated tests, what should I be coding for?

mjcarchive · November 16, 2021, 6:30pm

Brian

I’m not sure what you are asking me! I can see a handful of TfL files in the huge zip file that comes from the link you give. These are Datastore style XML files. The Datastore update overnight tonight may be fine for all that I know.

In this context I am using the PDF files in the Data Bucket at
http://bus.data.tfl.gov.uk/
These files are shown as updated on 16th November (even if they are in practice unchanged). My testing is macro-aided brute force. Download the full set, see what is in this week and not last week. Then, check whether any of the new files have been seen before (that is, prior to last week). For the last few months this last check has shown no reincarnations (discounting a handful of cases where it oscillates between, say, sSa and SSa).

This week I find 133 files which were not there last week but are direct replacements, plus 18 which are new route/daytype combinations.

I then find 133 files which have reappeared this week, overwriting newer material. Routes involved are 31 32 35 40 47 101 120 176 188 207 313 329 405D 427 456 467 487 626 653 683 688 C11 G1 N1 N28 N31 N207 VN19 UL19 UL23 UL48 UL69 X68, all of which were updated last week.

For good measure, 38 other files have been deleted, largely seasonal specials, probably loaded for the first time last week.
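
For anyone wanting to automate this kind of check, the week-on-week comparison described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: each weekly download is reduced to a set of file names, the function name `compare_snapshots` and the example file names are made up, and the real process is macro-aided rather than written like this.

```python
def compare_snapshots(this_week, last_week, earlier):
    """Classify file names by comparing this week's set with earlier ones.

    this_week, last_week: iterables of file names in each weekly download.
    earlier: iterable of every file name seen before last week.
    (All three inputs are assumptions about how the data is held.)
    """
    this_week, last_week, earlier = set(this_week), set(last_week), set(earlier)
    appeared = this_week - last_week  # present now, absent last week
    return {
        "new": sorted(appeared - earlier),         # never seen before
        "reappeared": sorted(appeared & earlier),  # old files that have come back
        "deleted": sorted(last_week - this_week),  # dropped since last week
    }


if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    report = compare_snapshots(
        this_week={"188_v1.pdf", "414_v1.pdf", "31_v3.pdf"},
        last_week={"188_v2.pdf", "414_v1.pdf", "N999_xmas.pdf"},
        earlier={"188_v1.pdf", "31_v2.pdf"},
    )
    for label, names in report.items():
        print(label, names)
```

A non-empty "reappeared" list is the warning sign here: it means files already superseded in an earlier week have come back.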

briantist · November 17, 2021, 6:42am

@mjcarchive Thanks. Given all the problems you report I think it might be worth checking the “PDF files in the Data Bucket” against the Transxchange format dumps.

Out of interest, what do you do with PDF files as they’re not actually useful for machine processing?

B

mjcarchive · November 17, 2021, 11:32am

@briantist
The TransXchange files normally appear in advance of the change, which makes sense given that Journey Planner users could be planning a journey two weeks hence. The 414 file, for example, has been there for a couple of weeks. So I know the WTT currently there does not match. I’m not sure in any case whether the basic course is the same, or - if it is - at what point they might diverge.

Oh yes, what do I do with the PDFs? I make them accessible on a website, London Bus Timetable Graveyard, along with whatever earlier versions have been made available by TfL in the past. The FOI team has actually directed those inquiring about old schedules to this site so I think it is quite useful to them that it exists. Without it they would have continued to get a never ending stream of requests for old schedules.

The website has links to the TfL site for current timetables and elsewhere for older ones. Trouble right now is that the latest WTT update seems to have overwritten the previous week’s new files with older ones, so the links to the current WTTs from my site don’t currently do what they say on the can. I could do the same as for the older ones but I should not have to rewrite the code to cope with that kind of system failure at the TfL end!

Having got the PDFs, I thought it would be “interesting” to see how much subsequent processing was needed to turn them into good old-fashioned timetables in Excel. First step - OCR new files (they are not image PDFs) using a bulk process. Second step - write VBA code to reorganise the data, pick up stop names and so on. Then, having produced workable Excel files, compare with the previous version to see what has changed (in a lot of cases the actual times have not). Some of that feeds into someone else’s timetable website, which I know is quite widely used, though I know there are more important (and earlier) inputs into that site.

I skated over the first (OCR) step as that only works approximately in a lot of cases. However I was able to write VBA code which identified and overcame the most common OCR errors. A few need manual intervention - perhaps 1 in 40 or so - but it is usually obvious what has to be done.
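
To give a flavour of what correcting common OCR errors can look like, here is a minimal sketch in Python rather than VBA. It is not the actual code: the substitution table (letter O read for zero, and so on), the assumption that times are four-digit HHMM values with hours 00 to 23, and the name `clean_time` are all illustrative guesses.

```python
import re

# Assumed letter-for-digit confusions; the real list of common errors may differ.
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def clean_time(token):
    """Return a repaired HHMM string, or None if the token needs manual review."""
    candidate = token.strip().translate(OCR_FIXES)
    if not re.fullmatch(r"\d{4}", candidate):
        return None  # still not four digits after substitution
    hours, minutes = int(candidate[:2]), int(candidate[2:])
    if hours > 23 or minutes > 59:
        return None  # four digits, but not a plausible clock time
    return candidate


if __name__ == "__main__":
    for raw in ["O7l5", "12S9", "2575", "abcd"]:
        print(raw, clean_time(raw))
```

Anything that comes back as `None` is left for manual intervention instead of being guessed at.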

All very sad, of course. Each step provided a significant but not insuperable challenge. I carried on because I did not want to be beaten! Like Hillary and Everest, I did it because it is there!

One final point worth mentioning. For the oldest schedules, TfL made available a huge zip file of WTTs in XML form, going as far back as 2006 in some cases. These were relatively easy to read into Excel, without errors. I ended up producing PDF versions for the website.

The vast majority of this is automated. It’s not like processing machine readable live feeds but a lot can be done with older technology sometimes.

Michael

mjcarchive · November 17, 2021, 4:45pm

@jamesevans @neamanshafiq
Nobody from TfL has responded to this yet (though that doesn’t mean nobody has read it).

As far as I can see, the 3am update on Tuesday morning has reversed the flow of time and replaced the 9th November set with the 2nd November set. Quite a clever trick but not actually very helpful.

Is correction in hand please?

neamanshafiq · November 18, 2021, 4:20pm

@mjcarchive Apologies for the delay. The latest schedules have now been uploaded.

mjcarchive · November 18, 2021, 7:13pm

Thanks, @neamanshafiq, just seen this.

It would be nice to know what actually went wrong in the first place but I doubt anyone is going to tell me!

Michael