Sonic Status March

Site Wide Sendmail Upgrades: We are busy…

March 31, 2003

Mon Mar 31 13:05:55 PST 2003 — Site Wide Sendmail Upgrades: We are busy rebuilding sendmail for all of our servers to patch the second severe sendmail exploit released this month. Most of our internal servers, as well as bolt, have been updated. Some users may have had trouble sending mail from bolt while the new binary was being installed and its configuration was being tuned. The updates will be completed within an hour on the rest of our servers. -Kelsey and Russ

PacBell DSL services restored.

March 30, 2003

Sun Mar 30 21:56:52 PST 2003 — PacBell DSL services restored. The SMS-1800 DSL termination hardware failed, and additionally failed to reboot on its own. Actually, it took quite a bit of coaxing to get it to boot properly; its FE (Forwarding Engine) failed to initialize until we removed one of its redundant power supplies that had been quietly complaining for the past few days. We replaced the power supply with a spare and all appears to be well. Please note: our LATA9 DSL subscribers were not affected by this outage; they are terminated on an SMS 500 colocated in Stockton. -Kelsey and Nathan

PacBell DSL outage – The router that supports

March 30, 2003

Sun Mar 30 20:30:02 PST 2003 — PacBell DSL outage – The router that supports our Pacific Bell DSL customers rebooted about 25 minutes ago, and has not come up properly. We are currently looking into the situation, and will restore service/update this space ASAP. – Eli, Kelsey, Scott

We are down a PRI (23 modems) on one of our…

March 27, 2003

Thu Mar 27 21:25:32 PST 2003 — We are down a PRI (23 modems) on one of our access servers that handles 522-1001, 522-1002 as well as some other currently unpublished numbers. We’ve received some reports of busies and are working to get the apparently failed line card back online. At this moment, we have plenty of free capacity at our other POPs. We expect the busies to be resolved shortly, as we are going into off-peak hours and should have the failed card back up soon. -Kelsey and Russ.

Focal and SBC are still working on fixing the

March 26, 2003

Wed Mar 26 11:49:51 PST 2003 — Focal and SBC are still working on fixing the outage. In the meantime, we were finally successful in getting a tunnel operating to our router at Focal and have restored services to the SF POP via this tunnel. Quality of service will be noticeably degraded by packet loss and latency until the T3 is restored. -Kelsey and Nathan

Update: SBC and Focal have restored all of our services. At this time all of our voice lines are back in operation and SF POP traffic is now properly routing via the T3. -Kelsey, Nathan, Jared, Eli and Chris.

Problems with the SF POP persist, and our…

March 26, 2003

Wed Mar 26 10:20:52 PST 2003 — Problems with the SF POP persist, and our Tech Support and Office phone lines are temporarily unreachable. System status is otherwise normal, and we hope to have phone service restored quickly. – Eli, Support

Update 11:23:24 PST 2003 — Focal has issued a 2.5 hour ETR to restore both Voice service to the office, and Data connectivity to the SF POP. In the meantime, we do still have Livehelp available; check out livehelp.sonic.net:8000/html/customer_login.html (We have lots of Reps standing by, believe me!) – Eli, Support

The T3 backhaul to our SF pop is down, this…

March 26, 2003

Wed Mar 26 08:33:47 PST 2003 — The T3 backhaul to our SF POP is down; this has effectively isolated this POP from the rest of our network. Customers trying to dial up to SF numbers will not be able to authenticate, and customers currently online will find that they are unable to contact our DNS servers. Our UUNet transit is also picked up at the SF POP, so at this time all transit traffic is being handled by our other transit providers. We’ll have an ETA shortly. -Kelsey and Nathan.

We have been informed by Focal that there has been a fiber cut, and SBC has told us there has been a ‘major’ system failure of their equipment at 650 Townsend. Whichever is the case, we expect that SBC techs are already on site working to resolve the problem. Meanwhile, this outage has also affected all of our voice lines; tech support and all of our other voice lines are not available. -Kelsey, Nathan, Jared, Matt and John.

Intermittent busies on 9811 numbers.

March 23, 2003

Sun Mar 23 12:03:18 PST 2003 — Intermittent busies on 9811 numbers. A bad PRI card in our dial-up gear has caused anyone dialing a number ending in 9811 to get occasional busy signals. We are working with our provider to disable the PRI until we can replace the card. -Matt, Kavan and John F.

Etherswitch failure.

March 18, 2003

Tue Mar 18 10:36:17 PST 2003 — Etherswitch failure. Currently web and mail services are unreachable due to an etherswitch failure. We are in the process of replacing it. –Kelsey, Nathan and John

The etherswitch has been replaced and the mail system is slowly recovering. It will probably take a couple of hours before things are back to normal, due to dealing with the backlog. –Kelsey, Nathan and John (and Zeke!)

Mail server load problems.

March 18, 2003

Tue Mar 18 13:28:43 PST 2003 — Mail server load problems. The failure of the switch this morning triggered a cyclical failure of our already taxed mail cluster. We’ve been busy all morning doing what we can to ensure smooth mail delivery and have, at this point, restored most services. Customers may experience some delays in delivery of email; however, no email has been or will be lost. Customers may also still encounter some difficulties sending or retrieving mail while we continue to work on the servers to improve the situation.

To increase NFS server performance, we’ve added three more disks to the NetApp volume that handles mail. These additional spindles have already helped reduce mail server load. We’ve also added a secondary MX server to handle mail for sonic.net when our cluster is offline; this keeps the mail under our control and ensures more rapid delivery of delayed mail. -Kelsey, Nathan, Zeke, et al.
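
For the curious, here is a rough sketch of how a sending mail server generally chooses between a primary and a secondary MX: it tries hosts in order of ascending preference value and falls back to the next one when a host is unreachable. This is only an illustration of the general MX fallback idea, not our actual configuration; the hostnames, the preference values, and the `is_reachable` check are all made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# (preference, hostname); a lower preference value is tried first.
MxRecord = Tuple[int, str]


def delivery_order(records: List[MxRecord]) -> List[str]:
    """Return hostnames sorted by ascending MX preference.

    Ties are broken by hostname here only to keep the example
    deterministic; real senders typically randomize among equals.
    """
    return [host for _, host in sorted(records)]


def pick_mail_host(
    records: List[MxRecord],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Pick the most-preferred host that currently accepts connections."""
    for host in delivery_order(records):
        if is_reachable(host):
            return host
    return None  # nothing reachable: the sender queues and retries later


# Hypothetical records: a primary cluster and a secondary MX.
records = [
    (20, "mx-secondary.example.net"),
    (10, "mail-cluster.example.net"),
]

# Simulate the primary cluster being offline.
cluster_down = lambda host: host != "mail-cluster.example.net"
chosen = pick_mail_host(records, cluster_down)
```

With the primary simulated as down, `chosen` is the secondary host, which holds the mail until the cluster can accept it again.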