- Network and IT managers deal with horror stories, and the nightmares that come with them, on a daily basis.
- Unfortunately, matters have not changed. Network and IT managers continue to encounter strange incidents regularly.
- A few of those incidents are the result of poor planning outside the domain of the IT department. Take, for example, the case of the poorly placed data center kill switch, also known as the haunted red button.
One of the network engineers reported that they had a large data center with a few AS/400s running multiple applications across Canada. The server room was state-of-the-art, with central cooling, proper cable management, alarms, fire suppression, cameras, and an emergency power shutdown button.
Unfortunately, this button was at ‘butt’ height and had no protective cover. One administrator killed power to the entire room, including the AS/400s, shutting down all the enterprise applications across the country. The individual had a very humiliating week trying to recover all the information and get the systems back up and running.
Untested Software & Systems
Another problem, and one that may occur more frequently, is a UPS that fails to come on when power is cut. This is usually the result of haste: the UPS is deployed without being properly tested. These kinds of problems are not limited to hardware.
Same Problem With Cloud Outages
Most of the top cloud outages of the last two years have resulted from updates or other software changes gone wrong. Cloudflare had an outage of approximately one hour that affected several companies and sites, caused by a change to its network configuration.
Google Cloud had a two-hour outage due to a change to the Traffic Director code that processes configuration updates. The code change assumed that the configuration data format migration had been fully completed, when in fact it had not.
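The kind of assumption described above can be illustrated with a minimal sketch. This is not Google's code: the function name `parse_config`, the `format_version` field, and both record layouts are hypothetical, invented only to show one defensive option, which is to check which format each record is in instead of assuming every record has already been migrated.

```python
# Hypothetical illustration only: the field names and version numbers are
# assumptions made for this sketch, not taken from any real system.

def parse_config(record: dict) -> dict:
    """Normalize a config record, accepting both the old and the new format."""
    # Records written before the migration carry no version field,
    # so a missing field is treated as the old format (version 1).
    version = record.get("format_version", 1)

    if version == 1:
        # Old format (assumed): a single "host:port" string.
        host, port = record["endpoint"].rsplit(":", 1)
        return {"host": host, "port": int(port)}

    if version == 2:
        # New format (assumed): separate host and port fields.
        return {"host": record["host"], "port": int(record["port"])}

    # Fail loudly on anything unexpected instead of guessing.
    raise ValueError(f"unsupported config format version: {version!r}")
```

With a check like this, a record that was missed by the migration is still parsed correctly, and a record in an unknown format produces an explicit error that can be caught in testing before the change reaches production.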
Amazon Web Services (AWS) had a five-hour outage on the East Coast because of a glitch in automated software that led to “unexpected behavior” that “overwhelmed” AWS networking devices.