The CBC.ca site crash explained

As many of you noticed, the CBC.ca web site was down for a good chunk of yesterday. A scaled-down version of the site was in its place from 9:30 a.m. ET until the early evening. Turns out, the reason was a major data storage “fallover” involving all redundancy systems. Ouch.

Computers aren’t perfect. Sometimes they crash. Unfortunately, when they crash on highly trafficked web sites, a lot of people notice.

“What you couldn’t see was that the CBCNews.ca staff continued to write and update dozens more stories all day long behind the scenes,” Jonathan Dube, CBC News’ director of digital media, told me. “[They did] extra work to enter the stories into two publishing systems, because we wanted to make sure that as soon as the site came back up, it was completely up-to-date — and that’s exactly what happened.”


19 Responses to “The CBC.ca site crash explained”

    Kev says:

    That wasn’t a fun day by any stretch of the imagination. But the news team definitely deserve kudos both for putting in the extra effort to duplicate top stories in the interim solution and for the calmness and maturity they displayed in general.

    CBC.ca’s UI team also deserve an awful lot of credit. They banged out an MT-based interim solution in record time, and by the time regular service was restored they were on the verge of finishing a second interim solution that would have brought back full news publishing ability (albeit in a stripped-down wrapper).

    It sucks when outages happen, but I’m never more impressed with the people I work with than when I see how they react to these kinds of situations. Hopefully I won’t get to see it again for a while though!



    Hugh Thompson - Publisher Digital Home says:

    Nice to see the site back in action!



    jody says:

    WAY too vague for us techy folks! “The cleaning lady tripped over the power cord” would’ve been better!



    derek fong says:

    “Turns out, the reason was a major data storage “fallover” involving all redundancy systems… Computers aren’t perfect. Sometimes they crash.”

    Without knowing all the facts about what happened, it seems to me that if the failover (redundant) systems failed, then that is attributable to human error not computer failure. Redundant systems are supposed to be there to safeguard against failure / loss of service / alien invasion, but if the redundant (independent?) systems also fail, then that says to me that either the redundancy systems were not properly planned out or that someone in the data centre did something they weren’t supposed to (like erasing all the files on a server which then propagated to all of the redundant systems).

    However you cut it, it looks like human error to me.



    Emily G says:

    Well, the crash was annoying, but at least it was the only time I ever remember it happening.



    Steve Billinger says:

    Hi

    An enormous group of Production, Operations, IT and external vendor teams worked calmly and professionally - given the circumstances! - firstly, to keep CBC viewers informed of the issue and progress; secondly, to get our content areas (not just News but Sports and Factual and A&E and Kids and Docs and Radio… who were all ready to go!) back up as quickly as possible; and thirdly, to ensure that this won’t happen again.

    Stuff happens in the Internet world - it’s complex and sometimes unrewarding - like love, or following the Leafs! The best thing you can have to combat the stress of it all is a fantastic team of people dedicated to getting the CBC back in the homes, mobiles, PCs, iPods and radios of the audience we know depends on us!

    Personal apologies from me to CBC audience and content teams and thanks to same for their patience.

    Like those old vinyl LP liner notes say - “…and big thanks to those without whom this fix would never have happened - you know who you are!!”


    Steve Billinger is Executive Director for Digital Programming and Business Development (apologies to him and Steve Pratt, whom my under-caffeinated brain mixed up!)



    The Qaz says:

    I know the interwebs are new media and don’t have to follow the standards of traditional media, but there’s something to be said about doing things right. I mean if CBC TV went down nationwide for the majority of the day it would be completely unacceptable. But cause the web is run on a computer, it just gets shrugged off as a computer boo-boo. New technologies, new problems. Just learn to live with it I guess.



    Dan Misener dot com - CBC.ca looks better when it’s broken? says:

    [...] cbc.ca crashed yesterday. For a long while, this “Temporary Site” of news headlines was all you could [...]



    Yeah, Right says:

    Wow, so the CBCNews.ca staff worked all day long updating stories. I guess the CBCSports.ca staff just took the day off, Tod?


    The text you’re complaining about is a quotation from Jonathan Dube (guess you missed the quote marks?). If you have a beef about the omission of CBC Sports, feel free to take it up with him.

    When will you realize that CBC.ca is more than just the News people?

    By the way, love how another massive failure of CBC.ca’s backup systems was spun into a positive story. Two plus two is indeed five.



    CBCer says:

    Steve Billinger recently said that on average there are more people on cbc.ca than the main network (TV).

    So ask yourself this…would the main network go off the air for 9 hours?

    This is the second major outage in the last year and a half, perhaps John Dube and many others should be looking at their backup/recovery systems instead of congratulating themselves.



    Sean says:

    “Turns out, the reason was a major data storage “fallover””. Fallover, what the heck is that? Did someone knock your disk arrays on their side?

    I’ve worked in corporate computer services for over 12 years and have never heard that term. Failover, yes. Fallover, no. But if the systems had failed over there wouldn’t have been an outage.



    Peter Brandon says:

    Tod, what’s a disk “fallover”? You are risking your street cred as a techno nerd. I’ve heard of disk “crashes”. I’ve had disk mirrors that tried to fail over by mistake and then only got part way. In my 25+ years of IT work, I never heard of a disk “fallover”. It must be a new term.

    Seriously, it looks to me that all DNS links were redirected to your mobility server, not a “scaled-down version of the site”. At least, I got redirected to http://www.cbc.ca/mobile/ for much of that post-9:30 ET period. Posting a “technical difficulties” message on that site would have helped.

    Actually, I am glad that the CBC is not spending big bucks on expensive RAID-5 disk arrays mirrored on geographically separate sites. It could easily cost as much as, say, supporting a small orchestra. The web service is not essential in the event of a natural disaster like your broadcast radio is. It is a second class service and I think that the taxpayer is well-served to support it as such. Just don’t expect us in the classical music fan base to get excited about classical content moving to a second-class web-based service in the fall.

    Peter Brandon,
    Edmonton


    Hi Peter… good question. The CBC’s official channels wouldn’t provide me with any details about the crash; the text you see is taken from an internal email that was leaked. And I’ve never heard of that term either.



    Kev says:

    Tod, what’s a disk “fallover”? You are risking your street cred as a techno nerd. I’ve heard of disk “crashes”. I’ve had disk mirrors that tried to fail over by mistake and then only got part way. In my 25+ years of IT work, I never heard of a disk “fallover”. It must be a new term.

    This is what you get when non-technical managers play Broken Telephone.

    Our website mostly lives on a Bluearc NAS. It’s a solid, dependable piece of kit, with redundant PSUs and heads. Both of these components have failed before (as is to be expected in normal operation), and the failover time is on the order of seconds. However, due to budget constraints, we only have the one.

    On Thursday, it looked like it had gotten into some bizarre state where servers were seeing different versions of volumes, and disk usage stats didn’t make sense at all. It took some time to figure out what was going on, and in the end it turned out that it wasn’t a NAS issue. (The conflicting volume info across the server farm was a red herring, and was caused by a longstanding problem that we’ve only recently got the resources to start fixing.)

    Bluearc gave us great onsite support, and our storage admin is one of the best sysadmins I’ve worked with in my (admittedly less than 25+yr) career, so that took less time than it might have. In the meantime the rest of us worked assuming the worst, as to do otherwise would have extremely negative consequences if the NAS had in fact failed.

    In a better-funded world, we would have replicated storage, the apps writing to the NAS would have more resources for maintenance, and this kind of incident would also be a glitch on the order of a few seconds or minutes. But we don’t live in that kind of world - and being even more strapped for cash than TV or radio, CBC.ca is the canary in the coalmine when it comes to seeing the effects of the chronic underfunding of the CBC in general.

    You may think it’s a second-class service, and in terms of some people’s attitude to it you may be right. But when it comes to online news, I completely disagree. I think it’s vital both on a daily basis and in emergencies, and needs to be funded and supported appropriately. You have a team of people delivering a world-class service already. If they had the resources commensurate with the quality of that service, they could make it even better, and deliver it to you with bulletproof availability.

    Or you could just take the whole thing as an excuse to obsess about R2 content changes in a completely unrelated thread. Your choice.


    Thanks Kevin. :)



    sneekz says:

    Coverup.

    Human error, which might just be a happy reason, considering the last failure was a massive mess of tech failures.

    In the early days of tv, channels went off the air for hours as well. Next step - improve the downtime.

    But that would require spending money.



    Fagstein » Canada.com struggles under load says:

    [...] a few days after the CBC.ca network went down, Canada.com was out for most of the day yesterday, making the websites of 10 Canadian daily [...]



    Steve Billinger - Exec Director - DP&BD says:

    Hello all,

    A couple of things - first - I don’t run CBC Radio 3 - I am Executive Director for Digital Programming and Business Development. Steve Pratt is the Radio 3 man and a good one too!

    We are still looking at the issues with regard to the problem. Without question, all internal teams (including Sports but also the “unsung” heroes of Audience Relations, Marketing etc.) and external vendors (yes - BlueArc, once that was identified as a potential area) worked very hard to identify and rectify the problem. A proper post-mortem will be conducted and risk mitigation procedures will follow. No cover up and no “rush to blame” either.

    While I think the “cleaning lady tripped over the power cord” is an amusing explanation, and would probably have been seriously considered, it’s not the right one. It does, however, highlight the sticky issue of “human error” - in that case, who is “to blame”: the cleaning lady for tripping, the site designer for putting the cord there or the system for “failover”?

    The entire team, both internal and external, who work on CBC.ca care passionately about it and clearly do not see it as a second class of service. To say otherwise is a disservice to everyone who works on CBC.ca - probably people that the commenter works with. This platform is complex and not yet “robust” but we are striving to make it so. In addition, there are lots of “owners”, so tracking down changes and therefore solutions is difficult.

    If you don’t believe TV has service disruptions then you obviously weren’t waiting 30 minutes or more earlier this month to watch Chelsea vs Man United battle it out for the League Championship - the most anticipated match of the year. And TV has been on for 50 years plus.

    I’m concentrating on the following to ensure we work closely together to provide a good service - this is not just a technical problem.

    1) How best to transmit a “network crash screen” with some flexibility in its wording to provide the audience with feedback as to what’s happening.
    2) Coordination with Marketing and Audience Relations for the same reason.
    3) Coordination with Content team leads to allow them to provide alternative service.
    4) Stand-by publishing tools (e.g. the Movable Type blog tool) & how best to implement them.
    5) Possibility of further redundant systems both on and off site.
    6) Coordination with IT and External vendors.
    7) Risk mitigation.

    Thanks for comments - as always questions can be sent directly to me at steve.billinger@cbc.ca or call me at 416 205-7182.



    Peter Brandon says:

    Thanks in particular, Kev and Steve. It sounds like Kev is in the IT shop, particularly since he uses terms like “NAS” (Network Attached Storage), and PSU (Power Supply Units) without defining them for any non-technical folks in the audience for this blog. When I used the term DNS (Domain Name System), I did that too :)

    I was lucky enough to go through only one major storage outage and the ensuing post-mortem in my career, and that was enough. As in your case, my disk vendor and the IT team worked their butts off to troubleshoot and fix the problem, and the user community did extra work to get around it. Unlike your problem, mine never became public. BlueArc would be really motivated to hustle for you because their business is based on a reputation for reliability. In our case, we worked through our technical plan fairly well, and we had our share of things we could have done better. The business folks were a bit unprepared to help us prioritize which application to bring up first when we came back. I am glad to see that you are working on your communications plan. As with your team, there was maybe a bit too much rejoicing from our team when the service came back that trickled out to the user community. I truly sympathize with your plight and regret any extra pressure my comments might have caused from senior management during your post-mortem.

    Aside from the shot about R2, I was really serious that it was OK to have this outage and it is OK to have second class service on the web.

    I really hope that your first redundancy planning priority would go to the Emergency Broadcast System. If this does not work, people may die. If another tornado hits Edmonton, I’d use my transistor radio to find out what to do next, so my local Radio 1 had better have lots of redundancy. Web and wireless web should be well behind, if only because my high-speed modem would not work in a power outage and I may not want to clutter up the cell frequencies.

    Over my career, I used to grumble about penny-pinching from my senior management about IT infrastructure. I also used to deal with what I called the “Project Hero/Support Scum” syndrome - keeping the web delivery reliable probably gets taken for granted by senior management, while launching a new site might gain a promotion for somebody. But, as an audience member, even if the CBC got their increase from $33 to $40 per capita, I would want that money to go to other things ahead of expensive web redundancy.

    Like most web users, I have reduced service expectations. For example, some of my BBC podcast subscriptions are irregular and I don’t complain. I am used to seeing other web sites going down, like Revenue Canada’s web filing service. It has been raised in other forums that web services exclude those who don’t have a computer, let alone high-speed internet. Compare the number of postings to this blog on this incident with the number posted to items on my pet subject. To me, the “canary in the coal mine” is the steady decline in Canadian classical concert coverage or the fewer and fewer truly original Ideas episodes every year, not a drop in your annual web availability from 99.999% to 98.5%.

    I do not mean to imply that the people supporting the service are second-rate. But, it is and should be a second tier service.

    Regards,
    Peter



    C Keigher says:

    Explains the weirdness when I was browsing via WAP that day. I was almost mortified to find that the page was completely impossible to navigate via my mobile phone.



    Steve Billinger - Exec Director - Digital Programming & Business Development says:

    Hi

    You’re right - the mobile experience is not great. Too many people assume that the http://www.cbc.ca on a computer is the ONLY way to experience our great content.

    We’re working very hard on fixing it and rationalizing the entire experience for viewers who prefer to get CBC.ca through RSS, Widgets, desktop tickers, mobiles, etc.

    Would love to hear comments about your experience and ideas for improving directly at steve.billinger@cbc.ca

    thx - Steve