Cambrian line 20 Oct 2017: loss of ERTMS speed restrictions. RAIB report released

Elecman · 28 Mar 2018

Jonny said:
How long would a SPATE be signposted for?

For the duration of the weekly operating notice that published the original restriction.

daikilo · 28 Mar 2018

rebmcr said:
Perfectly reasonable, but making a tightly-scoped "sync verification" module a mandatory part of the normal operating sequence would at least cause a right-side failure during a desync, no matter what unexpected scenario caused it.

I would go a step further: no reboot/restart should ever lead to a situation where functionality has been lost without it being apparent. As for the theory that complication makes it impossible, this is rubbish or alarming; every software sequence should be an ordered step and none should ever be missed without a fault signature and probably restart failure message.

OneOffDave · 29 Mar 2018

daikilo said:
I would go a step further: no reboot/restart should ever lead to a situation where functionality has been lost without it being apparent. As for the theory that complication makes it impossible, this is rubbish or alarming; every software sequence should be an ordered step and none should ever be missed without a fault signature and probably restart failure message.

For complex, tightly coupled systems it is impossible to model every possible interaction in the system. It's not just about the software but how the software interacts with the real world and what impacts that has. If you feel that experts in the field are spouting rubbish, why not get your own research published in peer-reviewed journals?

daikilo · 29 Mar 2018

OneOffDave said:
For complex, tightly coupled systems it is impossible to model every possible interaction in the system. It's not just about the software but how the software interacts with the real world and what impacts that has. If you feel that experts in the field are spouting rubbish, why not get your own research published in peer-reviewed journals?

I should have been more specific in that I was referring to safety-critical rail signalling software, the subject of this thread, and also to the specific case of start-up or reboot. If complication is added to the point where safe working cannot be ensured then steps should be simplified even if it then takes longer to operate.

OneOffDave · 29 Mar 2018

daikilo said:
I should have been more specific in that I was referring to safety-critical rail signalling software, the subject of this thread, and also to the specific case of start-up or reboot. If complication is added to the point where safe working cannot be ensured then steps should be simplified even if it then takes longer to operate.

Yes, I agree there should be some method of checking that volatile information survives update and upgrade processes.
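As a rough sketch of what such a check could look like (purely illustrative: the function names, the record layout and the idea of keeping a digest in separate non-volatile storage are all my own assumptions, not anything taken from the RAIB report or the actual system), you could fingerprint the restriction data before the restart and refuse to return to service if the reloaded data does not match:

```python
import hashlib
import json
from typing import Dict, List


def restriction_digest(restrictions: List[Dict]) -> str:
    """Return an order-independent SHA-256 fingerprint of the restriction list."""
    # Sort the records by their canonical JSON form so that list order does not matter.
    ordered = sorted(restrictions, key=lambda r: json.dumps(r, sort_keys=True))
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_after_restart(digest_before: str, restored: List[Dict]) -> None:
    """Raise if the data reloaded after a restart differs from what was held before."""
    if restriction_digest(restored) != digest_before:
        # Hypothetical "right-side" reaction: stop and shout rather than carry on.
        raise RuntimeError("restriction data changed across restart; do not return to service")


# Hypothetical example records (made-up locations and speeds).
before = [
    {"location": "A", "speed_mph": 30},
    {"location": "B", "speed_mph": 20},
]
saved = restriction_digest(before)   # assumed to be stored somewhere that survives the restart

verify_after_restart(saved, list(reversed(before)))  # same data, different order: passes
try:
    verify_after_restart(saved, [])  # restrictions lost during restart
except RuntimeError as err:
    print("FAULT:", err)
```

The point is only that an empty or partial reload becomes a loud failure instead of a silent one; how the digest is stored and who acts on the fault are separate questions.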

dtaylor84 · 29 Mar 2018

daikilo said:
every software sequence should be an ordered step and none should ever be missed without a fault signature and probably restart failure message.

This is just a string of words without meaning.

YorkshireBear · 29 Mar 2018

dtaylor84 said:
This is just a string of words without meaning.

No it is not.

It is quite clear what it means. Each sequence should happen in order, and if any of them fails or does not happen it should be obvious via either a fault signature or a restart failure message. That would have prevented this incident.
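To make that concrete, here is a minimal sketch of the idea. The step names are made up and this is in no way how the real ERTMS software is structured; it only assumes that each start-up step can report whether it completed:

```python
from typing import Callable, List, Tuple


class StartupFault(Exception):
    """Raised when a start-up step fails or does not positively confirm completion."""


def run_startup(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run each step in order and stop at the first one that does not confirm success."""
    completed: List[str] = []
    for name, step in steps:
        try:
            ok = step()
        except Exception as exc:
            raise StartupFault(f"step '{name}' raised {exc!r}; completed so far: {completed}") from exc
        # Anything other than an explicit True (including None) counts as a failure,
        # so a step that silently does nothing cannot be mistaken for success.
        if ok is not True:
            raise StartupFault(f"step '{name}' did not confirm completion; completed so far: {completed}")
        completed.append(name)
    return completed


# Hypothetical restart sequence in which the second step quietly fails.
steps = [
    ("load_configuration", lambda: True),
    ("reload_speed_restrictions", lambda: False),
    ("enable_movement_authorities", lambda: True),
]

try:
    run_startup(steps)
except StartupFault as fault:
    print("RESTART FAILED:", fault)
```

Here the exception message plays the part of the "fault signature" (which step, and what had completed before it), and the refusal to carry on is the restart failure.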

dtaylor84 · 29 Mar 2018

OK, that clears up the "ordered step" part.

I'm still not really sure what set of "software sequences" this argument applies to, or what it means for something to "be obvious via a fault signature" (other than by reporting an error message, which is mentioned separately and is presumably something distinct.)

carriageline · 4 Apr 2018

I imagine that no one would have thought it was possible, hence why it happened!

The signallers are given a list of speed restrictions on a display. That’s programmed to be updated by the RBC/SCT/whatever the Cambrian use for speed restrictions. For some reason it wasn’t.

It’s not like this was something that was just not thought about. Ok yes, the system should be more robust. But how can you work out every possible fault if it hasn’t happened yet?

IIRC in the RAIB preliminary statement, it said the manufacturer hadn’t even found out why it happened.

nickswift99 · 4 Apr 2018

There are relatively new approaches to risk management that ought to apply here but were almost certainly too new for this rollout.

A systems approach will enable you to identify previously unknown/unexpected faults. STAMP is an example, for which academic papers can be found here http://sunnyday.mit.edu/

daikilo · 4 Apr 2018

carriageline said:
I imagine that no one would have thought it was possible, hence why it happened!

The signallers are given a list of speed restrictions on a display. That’s programmed to be updated by the RBC/SCT/whatever the Cambrian use for speed restrictions. For some reason it wasn’t.

It’s not like this was something that was just not thought about. Ok yes, the system should be more robust. But how can you work out every possible fault if it hasn’t happened yet?

IIRC in the RAIB preliminary statement, it said the manufacturer hadn’t even found out why it happened.

The railway signalling system has been considered fail-safe for over a century. It had weaknesses like fog and snow but these weren't hidden. No "fault" should be hidden and no-one should ever be forced to say "it failed-unsafe and we don't know why". One could argue that the whole system should have been shut down in case other hidden failures had also occurred during that reboot.

Dieseldriver · 4 Apr 2018

daikilo said:
The railway signalling system has been considered fail-safe for over a century. It had weaknesses like fog and snow but these weren't hidden. No "fault" should be hidden and no-one should ever be forced to say "it failed-unsafe and we don't know why". One could argue that the whole system should have been shut down in case other hidden failures had also occurred during that reboot.

100% agree. From a driver's perspective we rely implicitly on signal aspects, safety systems/indications and signage (as well as our own extensive knowledge, which can only be so much). A modern system behaving in this way is actually pretty worrying and suggests that the system in use on the Cambrian is unreliable for the safe running of trains.
This time it was relating to a Temporary Speed Restriction but how are we to trust this system given that its primary function is to stop trains bumping into each other at high speeds?

Dave1987 · 4 Apr 2018

carriageline said:
I imagine that no one would have thought it was possible, hence why it happened!

The signallers are given a list of speed restrictions on a display. That’s programmed to be updated by the RBC/SCT/whatever the Cambrian use for speed restrictions. For some reason it wasn’t.

It’s not like this was something that was just not thought about. Ok yes, the system should be more robust. But how can you work out every possible fault if it hasn’t happened yet?

IIRC in the RAIB preliminary statement, it said the manufacturer hadn’t even found out why it happened.

I find it incredibly worrying that they don't know why it happened. When people's lives are at risk it's absolutely not acceptable to say "this fault has never happened before so how can we have put things in place to stop it happening". If it doesn't 'fail safe' like everything else on the railway currently does then it's very, very concerning.

HSTEd · 4 Apr 2018

Dave1987 said:
I find it incredibly worrying that they don't know why it happened. When people's lives are at risk it's absolutely not acceptable to say "this fault has never happened before so how can we have put things in place to stop it happening". If it doesn't 'fail safe' like everything else on the railway currently does then it's very, very concerning.

Everything on the railway does not always fail safe.

Occasional Wrong Side failures are a fact of life.
The important thing is to work out why this failure happened and remove the vulnerability.

Dave1987 · 4 Apr 2018

HSTEd said:
Everything on the railway does not always fail safe.

Occasional Wrong Side failures are a fact of life.
The important thing is to work out why this failure happened and remove the vulnerability.

I would like you to cite an example of something on the railway that does not fail safe. You clearly know of something, or else you would not make statements like that.

Imagine this failure had happened with a train operating under ATO where drivers' route knowledge had been cut to the bone like some are proposing and it had been over some dodgy track. There you have the perfect recipe for a huge accident. Things like this show the weaknesses of systems like this. I have a fair amount of experience with coding and know that you can have bugs in a system that lie unseen for years until they rear their ugly heads.

HSTEd · 4 Apr 2018

Dave1987 said:
I would like you to cite an example of something on the railway that does not fail safe. You clearly know of something, or else you would not make statements like that.

Well, the obvious example is Clapham Junction in '88.
Fail Safe systems are designed to fail safe, but like all engineered systems they occasionally fail to perform their designed function.

Dave1987 said:
Imagine this failure had happened with a train operating under ATO where drivers' route knowledge had been cut to the bone like some are proposing and it had been over some dodgy track. There you have the perfect recipe for a huge accident. Things like this show the weaknesses of systems like this. I have a fair amount of experience with coding and know that you can have bugs in a system that lie unseen for years until they rear their ugly heads.

The driver would have been over the route dozens of times under ATO control anyway, and it is likely he would have noticed something was wrong before the accident anyway - as the train failed to brake in the manner that it normally did.

ComUtoR · 4 Apr 2018

HSTEd said:
The driver would have been over the route dozens of times under ATO control anyway, and it is likely he would have noticed something was wrong before the accident anyway - as the train failed to brake in the manner that it normally did.

I can't speak for the specifics but there are plenty of routes that I rarely go over and it is very easy to go 6 months without going over a specific route. It can also be a case where a Driver goes over a route for the first time since signing it etc. etc.

Dave1987 · 4 Apr 2018

HSTEd said:
Well, the obvious example is Clapham Junction in '88.
Fail Safe systems are designed to fail safe, but like all engineered systems they occasionally fail to perform their designed function.

Well I actually thought you were going to quote an incident that had happened in the last decade that I had not heard about.

HSTEd said:
The driver would have been over the route dozens of times under ATO control anyway, and it is likely he would have noticed something was wrong before the accident anyway - as the train failed to brake in the manner that it normally did.

Do you understand how TSRs and ESRs work? You are talking about one TSR that had been in for a very long time that the driver knew about. What if this was for a 20mph TSR over a bit of dodgy track that had only come in the previous day and the driver had been on holiday? You could end up with a train doing line speed through a severe speed restriction which is extremely dangerous. This kind of thing is the prime reason there will be a driver at the front with full route knowledge and full training.

HSTEd · 4 Apr 2018

Dave1987 said:
Well I actually thought you were going to quote an incident that had happened in the last decade that I had not heard about.

Well there may have been one, but Clapham Junction was merely the first example that came to mind of how any engineered system will inevitably fail eventually.

Dave1987 said:
Do you understand how TSRs and ESRs work? You are talking about one TSR that had been in for a very long time that the driver knew about. What if this was for a 20mph TSR over a bit of dodgy track that had only come in the previous day and the driver had been on holiday? You could end up with a train doing line speed through a severe speed restriction which is extremely dangerous. This kind of thing is the prime reason there will be a driver at the front with full route knowledge and full training.

How does full route knowledge protect against that? If they haven't been told about the TSR, how on earth are they going to divine it from their route knowledge?
You could provide the driver with a list at the start of the shift of all the extant TSRs, and have a track mileage counter visible to the driver in the cab.

DY444 · 4 Apr 2018

Dave1987 said:
I would like you to cite an example of something on the railway that does not fail safe. You clearly know of something, or else you would not make statements like that.

Imagine this failure had happened with a train operating under ATO where drivers' route knowledge had been cut to the bone like some are proposing and it had been over some dodgy track. There you have the perfect recipe for a huge accident. Things like this show the weaknesses of systems like this. I have a fair amount of experience with coding and know that you can have bugs in a system that lie unseen for years until they rear their ugly heads.

There have been incidents where systems which were thought to be fail safe turned out not to be. One I can think of was on the Washington Metro, where a track circuit module failed in such a way that it failed to detect a train, resulting in a fatal collision. I can think of others in the UK which were less serious, but they have happened very occasionally.

carriageline · 4 Apr 2018

Wrong Side Failures are still an occurrence (i.e. one every 12 months?)

It’s mostly signals showing aspects they shouldn’t, or track circuits not occupying when they should. It happens.

Llanigraham · 4 Apr 2018

ComUtoR said:
I can't speak for the specifics but there are plenty of routes that I rarely go over and it is very easy to go 6 months without going over a specific route. It can also be a case where a Driver goes over a route for the first time since signing it etc. etc.

But not on the Cambrian! They are up and down it day in, day out.

Llanigraham · 4 Apr 2018

carriageline said:
Wrong Side Failures are still an occurrence (i.e. one every 12 months?)

It’s mostly signals showing aspects they shouldn’t, or track circuits not occupying when they should. It happens.

Quite!!
And for a more recent example, I cite Moreton on Lugg.

bramling · 4 Apr 2018

HSTEd said:
The driver would have been over the route dozens of times under ATO control anyway, and it is likely he would have noticed something was wrong before the accident anyway - as the train failed to brake in the manner that it normally did.

This statement is extremely naive.

Firstly I love the use of the word "likely". The railway doesn't do things based on what's "likely" to happen (or not happen).

Secondly there's absolutely no guarantee at all that the driver would have been over the route many times at all - it could for example be his first trip back after a lengthy period of leave.

Also it's well known that with ATO systems drivers are less likely to react to things as it takes time for them to re-focus.

Wilts Wanderer · 4 Apr 2018

For an up-to-date example of a wrong side failure, look at the VTEC HST that had an external door open unexpectedly at 125mph a few days ago.

Systems should be designed to fail safe, but not all failure-prone objects on the railway are a system. Engineering is as much about good judgement as it is about compliance with rules and standards. This is where the modern railway and Network Rail frighten me. It is increasingly all about compliance and less about common sense and critical judgement.

Chris M · 4 Apr 2018

Dave1987 said:
Well I actually thought you were going to quote an incident that had happened in the last decade that I had not heard about.

Waterloo?
Cardiff East Junction?
Watford tunnel?
Broad Oak level crossing, Kent?

That's just from RAIB reports published in 2017.

Bald Rick · 4 Apr 2018

carriageline said:
Wrong Side Failures are still an occurrence (i.e. one every 12 months?)

It’s mostly signals showing aspects they shouldn’t, or track circuits not occupying when they should. It happens.

Signalling Wrong Siders are much more frequent than that. Mostly TCs showing clear when occupied (usually rail or wheel contamination), but AWS bell vice horn is quite common also, and signals showing a less restrictive aspect than they should have, or a ‘wrong’ junction indicator, are not unknown. Much more rarely points throwing the wrong way or similar - the Waterloo derailment in August was one of these.

In any case such events are risk scored, and those with a score over 50 are the ones to be really worried about. There were 102 in 2015/16. https://www.networkrail.co.uk/who-w...rformance/infrastructure-wrong-side-failures/

cjmillsnun · 4 Apr 2018

HSTEd said:
Well there may have been one, but Clapham Junction was merely the first example that came to mind of how any engineered system will inevitably fail eventually.

CLJ was not an engineered system that failed. It was human error. No ifs and buts. That failure was caused by rogue wires not being cut back after changes to the system.

Bald Rick · 4 Apr 2018

cjmillsnun said:
CLJ was not an engineered system that failed. It was human error. No ifs and buts. That failure was caused by rogue wires not being cut back after changes to the system.

It was still a wrong side failure.

The human error was somebody not doing their job properly, by not completing a wiring task, and it not being properly checked.

Change the word ‘wiring’ for ‘software’, and you have a possible cause of the ETCS failure.

cjmillsnun · 4 Apr 2018

Bald Rick said:
It was still a wrong side failure.

The human error was somebody not doing their job properly, by not completing a wiring task, and it not being properly checked.

Change the word ‘wiring’ for ‘software’, and you have a possible cause of the ETCS failure.

No arguments that it was a wrong side failure.

A wire count is a much simpler task than deciphering millions of lines of code but you are correct that one simple error can cause a dangerous situation.
