Telstra hauls in Cisco, Ericsson, Juniper to explain TITSUPs

Australia's dominant carrier Telstra has hauled three of its key vendors into the principal's office for a dressing down, following network outages that had the Australian incumbent giving mobile users “free data days” to stem the outrage. Major outages in the mobile network happened in February and March, and the carrier's …

  1. JeffyPoooh
    Pint

    "...with them all trying to re-register at once, the network collapsed."

    There's an algorithm for that.

    It might need handset updates (likely impossible), or cunning misdirection by the operator.
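
    One widely used "algorithm for that" is randomised (jittered) exponential back-off, which spreads re-registration attempts out so a recovering network isn't hit by every handset at once. A minimal sketch of the idea, with a hypothetical attempt_register() callback standing in for whatever the real handset stack does:

```python
import random
import time

def reregister_with_backoff(attempt_register, base=2.0, cap=300.0, max_attempts=10):
    """Retry registration with "full jitter" exponential back-off.

    attempt_register is a hypothetical callable returning True on success;
    the randomised delay stops millions of handsets retrying in lockstep.
    """
    for attempt in range(max_attempts):
        if attempt_register():
            return True                                   # network accepted the registration
        delay = random.uniform(0.0, min(cap, base * (2 ** attempt)))
        time.sleep(delay)                                 # wait a random slice of a growing window
    return False                                          # give up until the next trigger
```

    Whether that lives in the handset (hence the "handset updates") or is approximated network-side (the "cunning misdirection", e.g. staggering which attach requests a cell accepts) is exactly the operator's problem.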

  2. A Non e-mouse Silver badge
    FAIL

    15% increase causes overload?

    How can a 15% increase in load overload a system? How hard are they sweating those assets? Surely they should be running an N+1 setup (AT LEAST!) and so should be capable of running with one node down.

    From that statistic alone, I'd be blaming Telstra for failing to design their network properly, not their suppliers for supplying shoddy equipment.
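
    As a back-of-envelope check of that argument (all numbers invented), an N+1 design only rides out a failure if the whole load still fits on the surviving nodes, i.e. each box carries enough headroom to absorb a failed neighbour's share:

```python
def survives_single_failure(total_load, nodes, per_node_capacity):
    """True if the remaining nodes can still carry the whole load with one node down."""
    return total_load <= (nodes - 1) * per_node_capacity

# Invented example: 4 nodes rated at 35 units each, normally carrying 90 units in total.
print(survives_single_failure(90, 4, 35))         # True: 3 * 35 = 105 >= 90
print(survives_single_failure(90 * 1.15, 4, 35))  # True: 105 >= 103.5, so +15% still fits
```

    On that reading a 15% rise in ordinary load shouldn't trouble a sanely dimensioned network, which is the commenter's point; the replies below argue the real problem was the shape of the load, not its size.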

    1. frank ly

      Re: 15% increase causes overload?

      Is there anybody in Telstra (or similar) who knows how to design a network like that? I'd have thought they'd subcontract it out to groups who do have the experience.

    2. Richard Jones 1
      Flame

      Re: 15% increase causes overload?

      No, that is not a correct reading as I see things. Most handsets are registered on the system as they are switched on. They remain online until they are switched off, go flat or move out of range. The system will be dimensioned on the basis of expected traffic demands and expected new service request demands, plus any other relevant factors such as normal surges. Suddenly removing 15% of connections and then having all of them automatically try to reconnect is a very different thing. That is not 'normal' traffic; it is a very specific type of traffic, all focussed on very specific resources: those associated with control of the network, not the traffic on the network.

      Yes, duplicated resources running in parallel through very well integrated and designed networks vastly increase network security. Two gateways, each with say 60% of nominal maximum busy-hour load capacity, can provide a very high level of service continuity, and with three-or-more-way diversity even better security can be obtained, provided that the load-sharing algorithm does not itself eat up capacity. Cutting off 15% of connections and then suddenly and instantly dumping them back on the network would likely lead to something rather more than a 100% increase in control traffic on the network; note, control traffic, not paying customer traffic. Failures then lead to retries, and so traffic snowballs. Such cut-overs are usually managed by careful planning to avoid such out-of-control surges.
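
      To put rough, entirely invented numbers on that: a toy model in which the dropped handsets all try to re-attach at once, and anything the control plane cannot serve comes straight back as a retry, shows how losing 15% of connections turns into a many-fold signalling surge rather than a 15% one:

```python
def reconnect_storm(dropped, normal_rate, capacity, ticks=10):
    """Per-tick control-plane demand when `dropped` handsets re-attach on top of
    normal signalling; unserved attach attempts come back as retries next tick."""
    backlog, demand = dropped, []
    for _ in range(ticks):
        offered = normal_rate + backlog        # normal attaches plus the retry backlog
        served = min(offered, capacity)        # the control plane can only do so much
        backlog = offered - served             # failures snowball into retries
        demand.append(offered)
    return demand

# Invented figures: 1,000 attaches per tick normally, capacity 1,500 per tick,
# 10,000 handsets dropped and re-attaching.
print(reconnect_storm(10_000, 1_000, 1_500))   # first tick offers 11x normal signalling load
```

      In this example the backlog drains at only 500 attempts per tick, which is the snowballing described above.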

      Back in the day, when I worked on such changes, we were very careful to allow new traffic to come on stream at quiet times and via tightly controlled steps. I did once decide to simulate a full-power, full-demand restart, which made for an interesting experiment. There were a number of fuses at different distances from where I stood, and I was able to enjoy the sound of them all popping, with slight delays as the sound reached my ears. We hastily checked the fuses and made sure traffic came on stream in a less sudden rush. That system had no more problems.

      1. JetSetJim

        Re: 15% increase causes overload?

        "No, that is not a correct reading as I see things. Most handsets are registered on the system as they are switched on. They remain online until they are switched off, go flat or move out of range. The system will be dimensioned on the basis of expected traffic demands and expected new service request demands, plus any other relevant factors such as normal surges. Suddenly removing 15% of connections and then having all of them automatically try to reconnect is a very different thing. That is not 'normal' traffic; it is a very specific type of traffic, all focussed on very specific resources: those associated with control of the network, not the traffic on the network."

        Indeed, the registration process for a mobile involves it talking to the HLR and exchanging some handshakes, potentially with the HLR generating and delivering new authentication keys back to the user. If this is what was rebooted, it's entirely possible that it fell over trying to generate authentication keys for too many people at once. It's not designed to cope with everyone attempting it at the same time, and there's no mechanism for telling the mobiles to back off and wait a bit before attempting to register with the system, because (drum roll) there's no way of talking to a mobile that isn't registered. As mentioned, you dimension for normal peak behaviour (everyone switching phones on over time in the morning) plus a fudge factor. This won't help if the thing falls over in the middle of the day - you normally do upgrades on such a box in the depths of the night on your idlest day, when the impact is minimised.
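
        To illustrate why the registration path is the expensive one (a toy model only: HMAC stands in for the real SIM algorithms, and the subscriber identity and keys are invented), each attach can force the HLR/AuC to mint a fresh authentication vector, one cryptographic challenge per subscriber, so a mass re-attach concentrates all of that work on a single box:

```python
import hashlib
import hmac
import os

class ToyAuC:
    """Toy authentication centre: one shared secret (Ki) per subscriber."""
    def __init__(self, subscriber_keys):
        self.keys = subscriber_keys                       # IMSI -> secret key Ki (bytes)

    def auth_vector(self, imsi):
        rand = os.urandom(16)                             # fresh random challenge
        xres = hmac.new(self.keys[imsi], rand, hashlib.sha256).digest()[:8]
        return rand, xres                                 # challenge plus expected response

def attach(auc, imsi, sim_response):
    """Network side of one registration: challenge the SIM, compare responses."""
    rand, xres = auc.auth_vector(imsi)
    return sim_response(rand) == xres

# One subscriber attaching; multiply by a few million for the restart scenario.
ki = os.urandom(16)
auc = ToyAuC({"001011234567890": ki})
print(attach(auc, "001011234567890",
             lambda rand: hmac.new(ki, rand, hashlib.sha256).digest()[:8]))   # True
```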

  3. gnufrontier

    Tell the truth now

    They probably have a list of excuses they trot out. Who knows what the real story is on this. If they are having a come-to-Jesus meeting with the big three, it's got to be more than what they are saying.

    1. Colin Tree

      Re: Tell the truth now

      They have a book of excuses and a book of random numbers to select an excuse.

      They probably made the capable engineers and technicians redundant years ago.

      The truth might expose them to liability, or, worse, name calling.

  4. Kitschcamp

    And don't forget cable internet

    A pretty much Adelaide-wide cable internet outage is now in its fifth day with no sign of a fix. Routing for everything except their speed test is going via Perth and back to Adelaide again. Their service status admits the fault, and it is their fault, but the morons on non-support refuse to admit any problem even though Telstra have already admitted it...

  5. Tempest
    Happy

    Problems with Cisco and Juniper? Should have bought the best - Huawei! - GCHQ approved!

    Blame Obama - he went around the world bad-mouthing Chinese products and the Australians fell for it hook, line and sinker.

    Meanwhile, back in Blighty, Huawei sets up a GCHQ-monitored lab and everything is fine. Except the 5 Eyes group have to crack yet another piece of hardware.

  6. Somone Unimportant

    And let's not forget the patch that Telstra loaded onto their core 4G network switches on Australia Day, which resulted in ACKs not being sent during some VoLTE-to-SIP calls. It took them two weeks to confirm the issue and another two weeks to fix it, during which time we had thousands of one-way-audio calls.
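
    For anyone who doesn't live in SIP: a missing ACK is so disruptive because the answering side keeps retransmitting its 200 OK and eventually abandons the call, even though media may already be flowing in one direction. A rough sketch of that timer behaviour, a toy state machine rather than a real SIP stack or anything Telstra actually runs:

```python
import time

T1 = 0.5   # default SIP retransmission interval, seconds
T2 = 4.0   # cap on the retransmission interval

def await_ack(ack_received, send_200_ok, hang_up, timeout=64 * T1):
    """Retransmit the 200 OK until an ACK arrives or roughly 32 seconds pass."""
    deadline = time.monotonic() + timeout
    interval = T1
    send_200_ok()                              # answer the INVITE
    while time.monotonic() < deadline:
        if ack_received():
            return True                        # handshake completed, call is established cleanly
        time.sleep(interval)
        send_200_ok()                          # no ACK yet: retransmit the answer
        interval = min(interval * 2, T2)       # back off, up to the cap
    hang_up()                                  # never ACKed: give up and tear the call down
    return False
```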

    Clients thought it was our system at fault, and so did Telstra at first. We lost clients because of this...

    The cleverer they become, the harder they fall.

  7. jdundon

    Sounds like a case of "when all else fails, blame the hardware vendors" - likely a people and process issue. Another possibility might be bad actors somehow infiltrating their network?
