JP TransXchange data changes over time

john_wr · June 3, 2021, 11:28am

Hello,

We are converting TfL TxC data (provided as journey-planner-timetables.zip on the TfL website) into GTFS and trying to keep the trip_ids stable from one version of the TxC data to the next. To do that, we generate each trip_id as a hash of the trip's most important data: stop sequence, schedule, calendar days, exceptions and operation start/end dates.
Because TfL quite frequently changes the start date of otherwise identical trips in the updated static TxC data, our trip_ids change as well. The problem is that the change of the operation start date is sometimes the only change made to the data over a period of time (e.g. 7 days).
What is the reason for this?

Example data examined:
Tram route presented in the file tfl_63-TR-_-y05-14.xml
We compared the data of 20.05.2021 and 02.06.2021
We compared these 2 files and looked at some other cases as well. In roughly 9 cases out of 10 the only change is the operating period start date.

Thanks in advance,
John

mjcarchive · June 4, 2021, 11:10am

Interesting topic. The first question, I suppose, is why files which are basically unchanged get recreated every time, but as I sometimes do that myself, I think I know why! It is more efficient just to recreate the lot than to mess around identifying the files which actually need changing. Because metadata (like CreationDateTime) are written to the file, the new files always look different, so file comparison tools, duplicate spotters and clever spidering software can’t be used to home in on the files with substantive changes. You have to look at each of the thousand files, obviously with some sort of automated process. If anyone knows a better way, please post it!
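One possible automated process, as a rough sketch only: fingerprint each file after removing the parts that are known to change on every export, then compare fingerprints between releases. The attribute names in `VOLATILE_ATTRS`, the function name `content_fingerprint` and the `ignore_operating_start` flag are assumptions made for illustration; check them against what actually varies in your own files.

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed set of root attributes that change on every export.
VOLATILE_ATTRS = {"CreationDateTime", "ModificationDateTime", "RevisionNumber"}


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def content_fingerprint(xml_text: str, ignore_operating_start: bool = True) -> str:
    """Hash a TxC document with the volatile parts neutralised."""
    root = ET.fromstring(xml_text)
    for name in VOLATILE_ATTRS:
        root.attrib.pop(name, None)
    if ignore_operating_start:
        # Blank only the StartDate that sits directly under OperatingPeriod,
        # so other date ranges in the file still count as real changes.
        for element in root.iter():
            if _local(element.tag) == "OperatingPeriod":
                for child in element:
                    if _local(child.tag) == "StartDate":
                        child.text = ""
    # Canonical XML, so indentation and attribute order do not affect the hash.
    canonical = ET.canonicalize(ET.tostring(root, encoding="unicode"), strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two releases of the same file with equal fingerprints can then be treated as unchanged; anything else still needs a proper look.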

As an aside, the Working Timetable file set also seems to be recreated from scratch each week, but no new metadata (or anything else) are introduced into the files. Despite that, the size of a recreated file is occasionally slightly different from that of the file it replaced!

IIRC the StartDate was not changed every time in the first few years of the availability of the zip file. I don’t know why this was changed to weekly revision. Given that the StartDate in the file is uninformative (it is almost never the date on which the service actually started in that form), do you actually need to use it in generating the hash code at all?
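To make the suggestion concrete, here is a minimal sketch of a trip hash that simply never sees the start date. The function name `stable_trip_id`, its parameters and the 16-character truncation are hypothetical choices for illustration, not anything taken from John's converter.

```python
import hashlib
import json


def stable_trip_id(stop_sequence, departure_times, operating_days,
                   exception_dates, end_date=None):
    """Build a trip_id from the trip's substance, leaving out the start date."""
    key = {
        "stops": list(stop_sequence),         # order matters, so no sorting
        "times": list(departure_times),       # one time per stop, same order
        "days": sorted(operating_days),       # order is irrelevant, so sort
        "exceptions": sorted(exception_dates),
        "end": end_date,
    }
    # Compact JSON with sorted keys gives a deterministic byte string to hash.
    payload = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:16]
```

The start date would still go into the GTFS calendar data as usual; it just would not take part in the identifier.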

Michael

harry · June 17, 2021, 10:05am

Just thinking aloud, but do those files have eTags?

As I understand it, the eTag would normally be generated from the contents of the file, and therefore two identical files that differ only by creation date should still have the same eTag.

mjcarchive · June 17, 2021, 11:00am

Harry

eTags are new to me but I can’t find anything that looks like one in the XML files.

The other issue is that the contents of the file nearly always change, as the start date for the service is usually moved on a week even if nothing else has changed.

Michael

harry · June 17, 2021, 11:44am

I wouldn’t expect to see the eTag inside the file. In fact, if you think about it, that wouldn’t be possible. The eTag has to change whenever the file contents change, however slight – but putting the eTag in the file would change the file contents, and therefore invalidate the eTag.

They’re typically used with downloadable files, in which case the eTag is delivered as a file header alongside the file. Browsers store the eTag alongside the file in the file cache and if you subsequently request the same file again, the browser will send the old eTag as part of the request. If the old eTag matches the current eTag, then the server responds with a quick “file unchanged” message instead of the complete file. So, the browser delivers the file from its cache, instead of downloading it again. Mostly you don’t see it happening, unless you want to do something different with it.

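But, as you say, the start date for the service is usually moved on a week even if nothing else has changed.

Doing that defeats the eTag process anyway, because it’s an actual content change and therefore will change the eTag every time.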
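For reference, the conditional request described above looks roughly like this in code. It is a sketch using Python's standard library only; `fetch_if_changed` and its `opener` parameter are names made up for this example, and whether the server hosting the zip sends an ETag at all is an assumption that would need checking.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, cached_etag=None, opener=urllib.request.urlopen):
    """Return (body, etag); body is None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url)
    if cached_etag:
        # Ask the server to send the file only if its ETag differs from ours.
        request.add_header("If-None-Match", cached_etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an error; it means our cached copy is current.
            return None, cached_etag
        raise
```

You would keep the returned ETag next to the downloaded file and pass it back in as `cached_etag` on the next run. As noted, this only saves a download when the server's ETag really does stay the same between releases.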