Improving our Open Data - better DX

JakubGasiorowski · October 28, 2022, 10:30am

We are working on improving our Open Data and we are currently looking at data documentation.

We want to remove barriers to our data and improve:

Usability
Findability
Credibility

I can see this forum is a great source of information and people generally help each other with problems we sometimes couldn’t cover in a document, but as an initial study, I want to open a discussion on this forum and ask:

Are you generally satisfied with the level of information about our Open Data? (Praises and/or topical rants are very welcome)
What is the ONE thing you would like to see or get improved about our data documentation? (e.g. a database of ALL datasets in one place, more detailed information about APIs / Static data, better DX, improvements to API platform etc.)

You are the data consumer, so I’d like to hear your thoughts.

Thanks!

Jakub

briantist · October 28, 2022, 11:02am

@JakubGasiorowski Just for clarification: do you mean just the occasional data sets or the live api.tfl.gov.uk as well?

JakubGasiorowski · October 28, 2022, 11:30am

Hi Brian, I was asking about both:

Static Data in S3 buckets (and elsewhere)
Live data in https://api-portal.tfl.gov.uk/ and other feeds (Countdown etc)

Is there anything in particular, documentation-wise, that you wish was improved?

briantist · October 28, 2022, 12:24pm

OK, I don’t really like coming on here to moan about things; however, the documentation isn’t great.

I was just writing a multi-headed (TfL, Google and NRE) journey planner that you can see at RTI

I ended up writing this class to work out what the actual TfL journey planner parameters are, because they’re not quite as documented (noSolidStairs, noEscalators, noElevators, stepFreeToVehicle, stepFreeToPlatform)

I would suggest some form of automated test that picks up the parameters from the documentation and then runs a test suite to find out where what is written differs from what is right.

class TfLJourneyPlannerSearchParameters
{
    // public bool $nationalSearch = true;
    public string $via;
    public string $date = "yyyyMMdd";
    public string $time = "HHmm";
    public string $timeIs = "departing";
    // "leastinterchange" | "leasttime" | "leastwalking"
    public string $journeyPreference = "leasttime";
    // https://api.tfl.gov.uk/Journey/Meta/Modes has a correct list of modes
    public string $mode = "bus,overground,tube,coach,dlr,tram,walking,cycle";
    // "noSolidStairs,noEscalators,noElevators,stepFreeToVehicle,stepFreeToPlatform"
    public string $accessibilityPreference = "";
    public string $fromName;
    public string $toName;
    public string $viaName;
    public string $maxTransferMinutes;
    public string $maxWalkingMinutes;
    // "slow" | "average" | "fast"
    public string $walkingSpeed = "average";
    // "allTheWay" | "leaveAtStation" | "takeOnTransport" | "cycleHire"
    public string $cyclePreference;
    // "TripFirst" | "TripLast"
    public string $adjustment;
    // public bool $alternativeCycle = false;
    // public bool $alternativeWalking = false;
    // public bool $applyHtmlMarkup = false;
    // public bool $useMultiModalCall = false;
    // public bool $walkingOptimization = false;
    // public bool $taxiOnlyTrip = false;
    // public bool $routeBetweenEntrances = false;
    // "easy,moderate,fast"
    public string $bikeProficiency;

    const DATEFORMAT = "Y-m-d\TH:i:s.uP";

    // Splits a timestamp in DATEFORMAT into the separate date and time parameters.
    public function setDateTime(string $input)
    {
        $now = DateTime::createFromFormat(self::DATEFORMAT, $input);
        if ($now === false) {
            // createFromFormat() returns false when the input does not match the format
            throw new InvalidArgumentException("Expected a date in the format " . self::DATEFORMAT);
        }
        $this->date = $now->format("Ymd");
        $this->time = $now->format("Hi");
    }
}

As I said above, the TfL systems are generally exceptionally reliable and the documentation is more accurate than most.
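
As a rough illustration of the test idea above, here is a minimal Python sketch that compares the values the documentation lists for a parameter with the values the API turns out to accept. The function name and both sample lists are made up for illustration. Collecting the two lists (for example by reading the published API description and by probing the live API) is deliberately left out.

```python
def compare_allowed_values(documented, accepted):
    """Report where the documented values and the accepted values disagree."""
    documented, accepted = set(documented), set(accepted)
    return {
        "documented_but_rejected": sorted(documented - accepted),
        "accepted_but_undocumented": sorted(accepted - documented),
    }


# Hypothetical data, for illustration only: not taken from the real documentation.
documented_modes = ["bus", "tube", "dlr", "noEscalators"]
accepted_modes = ["bus", "tube", "dlr", "tram"]

report = compare_allowed_values(documented_modes, accepted_modes)
for problem, values in report.items():
    print(problem, values)
```

Run as part of a regular test suite, an empty report for every parameter would mean the documentation and the behaviour agree.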

briantist · October 28, 2022, 3:54pm

Oh yes, the other thing I find unhelpful is that, despite my metre-wide 4k screen, I can’t see anything but ellipses in the menu at APIs: Details - Transport for London - API.

briantist · October 29, 2022, 8:08am

@JakubGasiorowski Another thing that winds me up with the TfL API is all the UNDOCUMENTED “$type” fields in the returned data. They serve no purpose at all: not only have they never been part of any description of the returned data set, but the data is JSON, so the types are already hard-coded into the format.

https://api.tfl.gov.uk/line/mode/national-rail/status for example

First, I called a TfL API so I’m going to be getting Tfl.Api.Presentation.Entities and the others all replicate the section of the data in them. All they do is more than double the size of the JSON return!

I would bet £100 that no-one has ever used these $type things in real code.

So $type is redundant undocumented bloat.

(Sorry, you did ask)
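
For anyone who wants to drop these fields on the client side, a minimal Python sketch follows. It walks a decoded JSON value and removes every "$type" key. The sample payload is invented for illustration and is not a real API response; only the key name "$type" and the Tfl.Api.Presentation.Entities prefix come from the discussion above.

```python
import json


def strip_type_keys(value):
    """Return a copy of a decoded JSON value with every "$type" key removed."""
    if isinstance(value, dict):
        return {k: strip_type_keys(v) for k, v in value.items() if k != "$type"}
    if isinstance(value, list):
        return [strip_type_keys(item) for item in value]
    return value


# Invented sample payload: the field names other than "$type" are placeholders.
raw = """
[{"$type": "Tfl.Api.Presentation.Entities.Example",
  "id": "example-line",
  "statuses": [{"$type": "Tfl.Api.Presentation.Entities.Example",
                "severity": 10}]}]
"""

cleaned = strip_type_keys(json.loads(raw))
print(json.dumps(cleaned))
```

This only saves memory and clutter after download; the extra bytes are still sent over the wire unless the API itself stops emitting them.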

harry · October 29, 2022, 11:39am

I’ll start with a question, which has always puzzled me.

We have these categories:

And then we have the Unified API which appears to contain all the same categories over again:

I’ve always assumed that everything except the Unified API was historical duplicates that are only needed for maintaining old code, but am I missing anything if I stick exclusively to the Unified API?

Next, that second image reminded me … it’s not what any other user would (normally) see when they go to a page like APIs: Details - Transport for London - API.

Normally you would get this:

I found the truncated display completely and totally unusable, because often the important information distinguishing two similar APIs is beyond the truncated area.

Sometimes they’re even totally indistinguishable:

I converted it to a scrolling column with some personal css, which probably does a number of other things too, most probably changing column widths and making them scroll independently, but I don’t remember exactly what:

.operation-name { white-space: normal; }
.nav { overflow-y: auto; display: block; max-height: 50vh; }
.ycqdvmhchp { overflow-y: auto; display: block; max-height: 75vh; }
.ycqdvmhchp { max-width: 1080px; }
.ijdrsuhpjk { max-width: 400px; overflow-y: scroll; height: 70vh}
.table-preset-head { max-width: 200px; }
.table-preset-head { max-width: 170px; }
.table-preset { overflow-x: auto; }
.meatiytmag { max-width: 2000px; }
.collapsible-container { max-height: 800px; }
.text-truncate { white-space: normal !important; }

I don’t suggest you use that css directly, because it might not suit everybody. But if you have a way of adding it (I use firefox with the stylus addon, which lets you easily turn user css on and off as and when required) it may give you some ideas to consider incorporating.

Some other suggestions

A user is likely to want to know what a valid value might be. The placeholder value isn’t helpful but the middle column provides the answer:

It would be much more useful if

the placeholder said value, eg victoria or N133 and/or
the description came immediately underneath it

So, something like:

Note that I’ve added a link, which could sensibly execute the appropriate API call to display the possible IDs in a collapsible list

Likewise:

Here there is a relatively small list (20-30) of possible modes so although the placeholder could display value, eg tube,dlr it should probably be possible to create it as a multiselectable dropdown by executing https://api.tfl.gov.uk/Line/Meta/Modes.

Most people probably wouldn’t need it in both places. So I’d suggest making their position optional:

Of lesser importance, but perhaps interesting:

I can see that cache control would be relevant and guess what it will do. But the ability to add extra values suggests there ought to be a list of other possible headers that TFL recognises, and the supported values.

Thank you for giving us the opportunity to submit our ideas.

I presume the actual functionality of the API itself isn’t your department, but it has major failings. One of them is that it is often totally impossible for a freedom pass user to get a journey plan between certain destinations within the freedom pass area that doesn’t use invalid trains for part of the journey. It would presumably be possible to add a Freedom Pass parameter to the JP API (the same way as step free access filtering works) and add an FP flag to each record of the timetable information to ensure they aren’t given unsuitable journeys. But TFL has been silently ignoring that requirement for more than 5 years.

Likewise, I presume the user interface for the journey planner is nothing to do with you, but there really needs to be a similar consultation on that. Some of its user interface features are not just inconvenient but positively user-hostile. I’ve raised some of them here:

Buses that disappear from predictions API – which is a much bigger problem than I’ve so far reported, and makes journey planner, the journey planner API and the countdown displays at bus stops all totally unusable in the Chingford area. And probably many other places too.

Disambiguation Errors is an irritation. Frequently circumventable if you know what you are doing. But the average person shouldn’t be expected to do that. I note I even said in June 2018 in that thread “I’m probably wasting my time reporting another bug, because nobody seems to be fixing them.” Clearly I was, because they’re still not fixed. Even today, journey planner still thinks Glyn Road, Enfield is somewhere near Cricklewood instead of being in Enfield.

Unnecesary disambiguation in Via field – yet more user hostility, where in some cases JP requires disambiguation for stations like Liverpool Street and Stratford that the user has already disambiguated by picking them from JP’s own dropdown list. All much more recent, but also not fixed.

harry · October 29, 2022, 11:40am

Interesting you should say that. We obviously think alike.

See my user CSS in Section 2 of the previous post. Sadly the colour scheme here makes the HR divider almost invisible, but I’ve now added numbers to the sections so it should be findable.

Or you could add some more sensible css to the page and find it much easier:

hr { color: red; background-color: red; height: 2px; }

mjcarchive · October 29, 2022, 11:47pm

As a user for the most part of static sources, I would like to see better guidance on where to find them and on the updating cycle. I would also like to see all static sources made available through one of the data buckets, not least because it helps the user see when a file has been updated (or at least recreated). That is not the case at the moment - for example the bus stop files are not there.

I have also pointed out in the past that the set of bus spider maps in the data bucket is out of sync with those available from the maps section of the TfL website, with obsolete editions in their droves. The point is that a one-off data dump is fine when it first happens but if it is not maintained subsequently it becomes misleading.

arturs · October 30, 2022, 9:45am

@harry It is in fact the other way around. The Unified API option is a legacy version of the documentation moved over from the old website, and you’re advised against using it. The actual separated categories are what you should be using.

Quote from docs homepage in regards to the “Unified API” category:

“This API document is a legacy from our last portal. All the other APIs are bundled here as a single ungrouped OpenAPI document. You are advised to use the other APIs for documentation.”

harry · October 30, 2022, 10:37am

Thanks for clarifying. That’s certainly not mentioned in APIs: Details - Transport for London - API and I think the legacy information should not be shown by default. But maybe have an option on the page that allows it to be shown on request. Or move it to a separate page and have a link.

But actually that reveals a bigger problem.

If I select unified API, I can search for a keyword (eg stop) and it will be found in all categories
If I don’t select unified API (or any other category) and use the search box for the same search, nothing at all is found.
Apparently I’ve got to guess which three categories match my stop keyword, and do three separate searches to find all of them. That makes no sense, especially as the legacy version worked better, and I guess ought to be precisely the sort of thing that @JakubGasiorowski wants to know about here.

JakubGasiorowski · October 31, 2022, 9:58am

Hello,
Thanks @briantist, @harry, @arturs, @mjcarchive for taking the time to list things that bother you and that you think could be improved.

This is an early stage of our Open Data Audit, where we gather all of the requirements and suggestions. I’ll try to provide some answers in due course as well.

I can’t promise that everything suggested could be fixed or improved, but we’ll try to make our Data and Platform as user friendly as it can be.

If you have ever had any issues using or finding our Open Data (systematic problems, not bugs), please keep adding your suggestions here!

Also, if you have an example from other platforms of how data documentation or the platform itself could look, please share that as well. Thanks!

Cheers,
Jakub

briantist · October 31, 2022, 10:25am

@JakubGasiorowski

Another thing that might help others to find the information is to give the https://api-portal.tfl.gov.uk/ a sitemap.xml so that Google can search it: Build and Submit a Sitemap | Google Search Central | Documentation | Google Developers

mjcarchive · November 3, 2022, 12:00am

One more comment, if I may. File naming conventions should be followed for files that are going to be made visible to external users, as a standardised format makes it easier to handle them.

I commented some years ago that the convention for the bus spider maps was that there was no convention - dates might be included, hyphens might be present, “a4” might be in the file name and they could all change between editions of the maps for the same area!

[Edited to make clear that this section refers to the first line of each file, not the file name itself; the old format bus WTTs used the Title metadata field to provide this sort of additional info; the new format ones not doing this was distinctly unhelpful though it did mean I learnt some new tricks]

When the new format bus WTTs first appeared the first line was in a standardised format, e.g.
Schedule_1-62283-FrHo-LC-1.pdf

A few weeks in, I started seeing first lines like
Schedule_105-62693-spSa-LU-1-20221029.pdf,
the date sometimes looking like a start date but sometimes with no obvious interpretation.

In the last few weeks, I see
Schedule_9-62594-Su-LU- Copy from 60778
Schedule_422-62582-MFHo-SK-1 - ST copied 58338
Schedule_243-62664-Fr-MN-1-20221015 - ST copied 62613

so three different variants telling you the WTT has been copied across.

None of this is the end of the world but even my untidy mind says that defining a convention and sticking to it is better than letting a thousand naming ideas bloom!
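
To show how much the variants above complicate parsing, here is a minimal Python sketch of a single pattern that copes with all five first lines quoted in this post. The group names are my own guesses at what each part represents (the thread does not say what the five-digit number or the two-letter code mean), and the pattern is only known to fit these five examples, not every first line in the wild.

```python
import re

FIRST_LINE = re.compile(r"""
    ^Schedule_
    (?P<route>[A-Za-z0-9]+)-     # route number such as 105
    (?P<number>\d+)-             # five-digit number, meaning not stated
    (?P<day_type>[A-Za-z]+)-     # FrHo, spSa, Su, MFHo ...
    (?P<code>[A-Z]+)-            # LC, LU, SK, MN ...
    (?P<sequence>\d+)?           # counter, missing in one variant
    (?:-(?P<date>\d{8}))?        # optional eight-digit date
    (?:\.pdf)?                   # extension, only present in some variants
    (?:\s*-?\s*(?:ST\s+copied|Copy\s+from)\s+(?P<copied_from>\d+))?
    \s*$
""", re.VERBOSE)


def parse_first_line(line):
    """Return the named parts of a first line, or None if it does not fit."""
    match = FIRST_LINE.match(line.strip())
    return match.groupdict() if match else None


examples = [
    "Schedule_1-62283-FrHo-LC-1.pdf",
    "Schedule_105-62693-spSa-LU-1-20221029.pdf",
    "Schedule_9-62594-Su-LU- Copy from 60778",
    "Schedule_422-62582-MFHo-SK-1 - ST copied 58338",
    "Schedule_243-62664-Fr-MN-1-20221015 - ST copied 62613",
]
for example in examples:
    print(parse_first_line(example))
```

With one agreed convention, most of the optional groups in that pattern would simply disappear.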

[Edited - this bit DOES refer to the actual file names]

Of course any naming convention must be thought through well enough to avoid duplicates. As I have said on another thread, this is not the case for bus WTTs for the 143, N29 and now also the 481, with the result that the most important parts of those routes are not available for some days of the week.

briantist · November 3, 2022, 7:38am

@mjcarchive Whilst wholly agreeing with your point, isn’t the convention over the internet to use a MIME header?

Because, of course, once you stick these in a ZIP file, the headers are lost - ZIPs, GZIPs and TARs all assume the convention that the file extension denotes the content type.

Which is why PHP needs to have PHP: image_type_to_mime_type - Manual and PHP: exif_imagetype - Manual !

mjcarchive · November 3, 2022, 8:20am

Brian

Sorry, that’s beyond my level of understanding!

I haven’t come across files within a ZIP file which are not the type that the extension says they are. The issue for me is whether I can determine the basics of the content from the file name, metadata, data bucket record or whatever. For the bus WTT filenames I can tell the route and day type. I can’t tell whether the file is new because all files are recreated weekly (often with a slightly different file size!). I used to be able to get Service Change Number from the Title metadata field of each PDF but this is now left unfilled.

What I have to do is open each file in Powershell using Mutool to look at and extract the first line, which I then parse in Excel. I have edited my previous post because I realised that I had described the smorgasbord of descriptions as file names when I meant first lines - well, I was typing it quite late. I have had to edit the parsing of this first line three or four times as new variants of the first line emerge. So not fatal to what I was doing but annoying as it should not have been necessary.

Of course for this sort of working document users have to accept that it is what the organisation needs itself that comes first. The plea really is that if it is known that those documents are also to be made publicly available then there should be consistency of form of both file name and the contents.

briantist · November 3, 2022, 9:10am

@mjcarchive

If you want to work out if the contents of a file are new without comparing them line by line then use either a crc32(), md5() or sha1() function on the contents. These generate short values that can be compared to find out if a file has changed because a single tiny change to a file creates a considerable difference in the value.

For the type of files you are using, I would personally consider a crc32() to be a simple way to do this as it just generates a 32-bit integer which is easy to code. You could, for example, just append the crc32() value to your local copy of the file and that way new versions are easy to spot!

briantist · November 3, 2022, 9:11am

For some reason when I try to post a sample code, I get an error from the forum.

mjcarchive · November 3, 2022, 12:39pm

Just as an aside, some 80-90% of the bus WTT PDFs this week had a slightly different file size to the ones available the previous week, even though there is no reason in the vast majority of cases why the contents should have changed. This did happen with the old formats but typically for a lot less than 5% of the files - few enough for me to actually check that the contents were unchanged, which I do not attempt for the new format.

I’ve rationalised that the difference in the percentages is because the new format file sizes are considerably larger, thus a very minor difference is more likely to show up in the file size but I have no real idea whether this makes sense or why the differences occur in the first place.

Presumably crc32() would also show that there is a difference in these situations.

briantist · November 3, 2022, 4:29pm

Yes, that’s how it works.
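
The checksum comparison discussed above can be sketched in a few lines of Python, whose standard library offers the same family of functions under zlib.crc32 and hashlib. The function names and the two byte strings are invented for illustration; in practice the bytes would be read from last week's and this week's copy of a file.

```python
import hashlib
import zlib


def fingerprints(data: bytes) -> dict:
    """Return short fingerprints of the raw bytes of a file."""
    return {
        "crc32": format(zlib.crc32(data), "08x"),  # 32-bit value as 8 hex digits
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def has_changed(old: bytes, new: bytes) -> bool:
    """True when the two byte strings differ according to their fingerprints."""
    return fingerprints(old) != fingerprints(new)


# Invented stand-ins for two weekly copies of the same timetable file.
last_week = b"timetable body ... generated 2022-10-27"
this_week = b"timetable body ... generated 2022-11-03"

print(has_changed(last_week, last_week))
print(has_changed(last_week, this_week))
```

Note that this compares bytes, not meaning. A PDF that is regenerated every week with any byte-level difference will register as changed even if the timetable itself is identical, so for those files the comparison would have to be made on the extracted text instead.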