UPS Failure Redux

May 22, 2015 – 6:05 pm by Kelsey

First, we’d like to clarify the extent of the problems caused by the UPS failure and subsequent dropping of load in the datacenter.  This had no impact on any residential or enterprise connectivity services, including Legacy DSL, Fusion and Fusion FTTN.  The UPS that failed was the smallest of the three UPSes in Santa Rosa and we had been working to migrate load from it.  As such, fewer than 20 customers in total lost some or all of their power circuits, some of which may have been part of redundant A/B circuits.  Some colo customers lost connectivity because several distribution switches lost power.  Most Sonic services, including POP, IMAP and webmail, were not affected or only saw a brief outage as single PSU equipment rebooted and/or clusters converged as load shifted to systems that were unaffected.  The only public service that had lingering issues was our webhosting cluster, which required a little manual attention to come back online.

The outage was ultimately caused by a physical failure of the maintenance bypass switch in the bypass cabinet for the PDU we were moving: one of the phases in the switch stuck and/or didn’t close correctly.  In hindsight, it is unfortunate that we chose to operate the switch in the first place, as it wasn’t strictly the simplest way to migrate the load.  The last power failure in the datacenter was in October 2004, when the same UPS failed.

In the coming weeks, we will schedule the migration off of the temporary feeds that were put in place.  This final move is significantly easier to execute and has an exceedingly low likelihood of causing any service interruptions.

-Kelsey, Russ, and the rest of System and Network Operations.

 

Intermittent DNS Failure

May 22, 2015 – 10:39 am by williamt

Between 5:00 pm yesterday and 9:00 am today, customers may have experienced intermittent DNS failures or slower than normal name resolution. At 9:00 am this morning we noticed a configuration failure on one of our name server clusters. We immediately disabled the cluster which allowed traffic to flow over to our other redundant cluster. We have since addressed the issue and restored the cluster to working service. We are currently investigating our monitoring procedures to identify why this issue wasn’t detected earlier and to make sure it doesn’t happen again. We apologize for any inconvenience this may have caused.

– William & Kelsey
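
As an illustration of the monitoring gap described in the DNS post above: a failure on one cluster can be hidden when checks only go through a shared address that the healthy cluster can still answer. The sketch below is not our actual monitoring. It is a minimal example, assuming each name server can be queried directly over UDP, and the function names (build_query, response_ok, probe, failing_servers) are hypothetical.

```python
import socket
import struct


def build_query(name: str, query_id: int) -> bytes:
    """Encode a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.strip(".").split(".")
    )
    # Question: encoded name, root label, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def response_ok(response: bytes, query_id: int) -> bool:
    """True if the reply matches our ID, is a response, has RCODE 0 and an answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = (flags & 0x8000) != 0
    rcode = flags & 0x000F
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def probe(server: str, name: str, timeout: float = 2.0, query_id: int = 0x1234) -> bool:
    """Query one server directly; a timeout or socket error counts as a failure."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name, query_id), (server, 53))
            response, _ = sock.recvfrom(512)
    except OSError:
        return False
    return response_ok(response, query_id)


def failing_servers(servers, name, probe_fn=probe):
    """Return the servers that did not answer correctly, checked one by one."""
    return [server for server in servers if not probe_fn(server, name)]
```

A check like failing_servers(["192.0.2.1", "192.0.2.2"], "example.com") would be run on a schedule, alerting whenever the returned list is non-empty. The addresses here are documentation placeholders, not real name servers.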

UPS Failure in Santa Rosa Datacenter

May 22, 2015 – 9:45 am by Kelsey

One of the three UPSes that handle load in our Santa Rosa datacenter failed early this morning and tripped into bypass.  Unfortunately, the internal failure is significant and involves at least the primary IGBTs.  We are exploring our repair options, but the most likely outcome is that we will accelerate the planned decommissioning of this UPS and the migration of its associated PDU to one of our other two UPSes.  This is something that we had planned on completing at some point in the next six to twelve months but have not yet scheduled or scripted.  It is a relatively straightforward procedure but must be executed with great care to ensure both the safety of our workers and that live load in the datacenter is not dropped.  Updates will be posted as needed.

Current status: Our standby generator is currently running to enable the ATS to transfer load without interruption in the event that our primary PG&E power feed drops.

Update: Friday 14:00, we have electricians on site placing the cable to move the PDU from the failed UPS to one of our other UPSes.  We plan to complete the migration as soon as the cable is staged and ready to go.  Once the cable is placed, the new target UPS will be placed into maintenance bypass.  This allows us to transition the PDU from the old bypassed UPS to the new UPS without dropping its load.  Once the cable is terminated and the breaker on the target UPS is closed, the old breaker can be opened, completing the transition.  At this point, the target UPS will be restarted.

Update: Friday 15:05, we’re beginning the bypass procedure now.

Update: Friday 15:15, unfortunately, load on the PDU was dropped momentarily, but we are continuing to complete the migration.  Power was lost to several of our single PSU systems but most affected services have already been restored.  More information forthcoming.

-Kelsey and Russ

Service Impacting Network Maintenance – Business Park Customers in Santa Rosa – 5/20/15

May 19, 2015 – 7:43 am by Tim Jackson

This maintenance is now complete.

Tomorrow night (5/20/2015) beginning at 11:59PM PDT, we will be performing a software upgrade of routers serving business park customers in Santa Rosa. The customer outage is expected to last 15-20 minutes.

-Tim J.

Non-Intrusive Network Maintenance – 5/20/15

May 19, 2015 – 7:42 am by Tim Jackson

This maintenance is now complete.

Tomorrow night (5/20/2015) beginning at 11:59PM PDT, we will be performing software upgrades on core routers in the Bay Area. No customer impact is expected, as traffic will be re-routed during the maintenance.

-Tim J.

Intermittent Performance – Legacy DSL

May 16, 2015 – 8:48 pm by rabrown

Update (9:28AM): The cause of the performance issues has been located and a workaround put into place.

We are currently investigating an issue causing intermittent performance and connectivity problems for some legacy DSL customers. We will update this message once we have more information regarding the situation.

– Robbie and the NOC

Legacy DSL Outage – Salinas

May 15, 2015 – 8:13 pm by Tim Jackson

Customers in the Salinas area on legacy DSL experienced an outage of approximately 20 minutes that is now resolved.

-Tim J.

Fusion/FlexLink Intrusive Maintenance

May 13, 2015 – 2:20 pm by Brandon Gile

Beginning tonight at midnight I will be performing intrusive maintenance on equipment serving Fusion and FlexLink customers in Anaheim.  Expected downtime is around 1 hour.

-Brandon

 

Maintenance is taking a little longer than expected; we are extending the estimated downtime by another hour.

-Brandon

 

Maintenance is now complete. Thank you.

-Brandon

Server Maintenance

May 11, 2015 – 6:07 pm by joemuller

Tonight, May 11th, at 11:59pm, Operations staff will be performing minor work on the server cluster which runs the Forums, Member Tools and Webmail. These sites may be unavailable for a short period as each server is taken offline in turn. We expect the work to take no more than 1 hour.

– Dexter, Joe, and Sysops Team

Voicemail & MemberTools Outage

May 8, 2015 – 8:36 am by Kelsey

A regular automatic system update applied at 04:56AM this morning broke the Fusion Voicemail application services, including the internal RPC services used by MemberTools to check the status of a user’s Voicemail account.   The simple fix was applied at 8:28AM once the issues were brought to our attention.  During this period, calls to Voicemail resulted in a fast-busy and MemberTools logins would return a blank page.  Unfortunately, our monitoring did pick up the failure but the alert wasn’t recognized as a service-impacting issue.  -Kelsey