Clackamas County provides a large array of critical services to the citizens, businesses and other public agencies of the County, including public safety, health, emergency services, roads, communications and many others. It is critical that these services be available during both normal periods of service and emergencies. Technology Services (TS) supports almost all the technology needs for these critical services, including phones, data, systems and applications. As with any professional technology services operation, TS has multiple layers of disaster recovery methodologies as part of the overall County Business Continuation Plan (BCP), into which Disaster Avoidance (DA) is built. However, as TS coordinated the County’s next generation of Disaster Recovery (DR) and Continuation of Operations Planning (COOP), it became clear that a new approach to the overall architecture of technical services was required, one that combined service performance with an integrated, holistic design incorporating Disaster Avoidance at all levels.
The problem is obvious to any agency trying to develop a Disaster Recovery plan: there are just too many possible event scenarios to plan for, or to fund. Some of the issues Clackamas County had to take into account:
- Too many disaster scenarios, at multiple levels of impact, to plan or design for. (Major events for Clackamas County include storms, flooding, earthquakes and an active volcano.)
- Insufficient funds to build any design that attempted 100% availability.
- Requirement to maintain performance of service within limited funding while balancing needs for DA.
- Wide range of required services, locations, operational times, etc.
- Limited staffing to be available at all possible locations and times required for full coverage.
- Given the numerous bridges and waterways, the likelihood of communications loss in an earthquake or flood is high (which makes remote locations such as cloud services more difficult to leverage).
- The large array of services and applications required to be available at any level of event, compounded by the County agencies’ perception that all services are critical, makes prioritization of services and resources difficult.
DR by DA
Bottom line, while having a solid BCP is critical, planning for every possible contingency is both impossible and inefficient. Taking the principle of “The best disaster recovery plan is to avoid the disaster” to the next level, the County initiated a large ongoing program to avert as many technology related disasters as possible by designing disaster avoidance into all levels of technical infrastructure and services. As with disaster recovery, the range of disaster avoidance options is too great to plan for all possibilities. Therefore TS analyzed the range of critical services and the most likely events to develop a “Pyramid of Events” that categorizes events into levels of impact. This allows TS and the County to design disaster avoidance around these levels, from events with a high degree of probability and minimal service impact up to those for which there is almost no probability of available services, and all levels in between. Using this pyramid, TS has designed disaster avoidance into the overall technical design to cover as much of the pyramid as possible, with estimated times of recovery. This allows TS to design reliable services and budget for them, and allows the County to build its BCP around which events it can expect some level of technical services for, and which events it should not plan on services for.
While disaster avoidance is a key part of all critical technical design, TS took this beyond a secondary consideration. Disaster avoidance is a primary design element at all levels of TS service and infrastructure. Even at the initial business analysis of any requested service or application we ask ourselves, "how do we keep this service available?" This has led to changes in the architecture of all services, standardization of many of the systems, adoption of several new technologies such as virtualization, and an ongoing philosophy of disaster avoidance as a service mandate. With this design premise adopted, recent technology implementations and upgrades in the County have applied disaster avoidance at all levels of design. This includes a recent total redesign of the County’s network infrastructure, the move to virtual servers and storage, upgrades to operational centers, completion of the County fiber infrastructure, changes in all equipment procurement, the design of all applications, system monitoring, and analysis of other options and services to further enhance the reliability and availability of County technical services. This has already had a significant impact on the reliability of County services. Events ranging from a simple component failure to equipment failure have become almost non-impacting to services. Events of larger scope, such as a communications failure or a building loss, are no longer catastrophic because services are designed to utilize redundant options. Even large events such as earthquakes are expected to have quicker recovery of services, as this process has designed, and continues to design, new ways to efficiently provide alternative service solutions.
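As a rough illustration of how a pyramid of events can drive planning, the sketch below models each level as a small record and asks which events planners may assume some technical service for. The example events come from this article, but the level numbers, likelihood labels, the cut-off level, and the names `EventLevel`, `covered_events` and `can_expect_service` are all assumptions made for illustration, not the County's actual pyramid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventLevel:
    level: int         # 1 = base of the pyramid: most likely, least impact
    example: str       # representative event at this level
    likelihood: str    # rough probability band (illustrative label only)


# Hypothetical pyramid: events are named in the article, but the level
# numbers and likelihood labels are assumptions for this sketch.
PYRAMID = [
    EventLevel(1, "component failure", "high"),
    EventLevel(2, "equipment failure", "medium"),
    EventLevel(3, "communications failure", "low"),
    EventLevel(4, "building loss", "low"),
    EventLevel(5, "major earthquake", "very low"),
]


def covered_events(pyramid, highest_designed_level):
    """Events at or below the highest level the design is built to absorb."""
    return [e.example for e in pyramid if e.level <= highest_designed_level]


def can_expect_service(event, pyramid, highest_designed_level):
    """True if BCP planners may assume some technical service for this event."""
    return event in covered_events(pyramid, highest_designed_level)


if __name__ == "__main__":
    # Assume, hypothetically, that the design covers levels 1 through 4.
    print(covered_events(PYRAMID, 4))
    print(can_expect_service("major earthquake", PYRAMID, 4))
```

Keeping the cut-off level as a parameter reflects the point that the covered portion of the pyramid is a budgeting decision: raising it costs more, and everything above it is what the BCP should not count on.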
Disaster avoidance, as well as overall Business Continuation Planning, is an ongoing process that TS will continue to evolve as the County becomes even more dependent on critical services and technology provides for new potential solutions. As this process expands and grows, some of the next steps already in development include:
- Leveraging other facilities that are more distant as additional redundant pathways.
- Partnerships with other agencies for co-locations of services, both for equipment and staff if needed, are already in progress.
- Continued design of applications and services to employ disaster avoidance based architecture.
- Further research into ways to utilize cloud services for a greater role in disaster avoidance, recovery and even performance balancing, as well as to contain costs when planning for worst case scenarios.
- Continued research into cost effective options to reinforce current designs, such as satellite communications, contracted professional services as standby backup to critical staff, and more.
- Continued education of both TS and other agency staff in the design of disaster avoidance in all levels of County operations, not just the technical services.
Overall Design
- Dual Operation Centers, with a third offsite data center under development
- Basic load balancing design between centers to maintain performance, costs and DA
- Most critical systems are redundant (storage, server farm, network, etc.)
- The Clackamas Broadband Express Fiber plant is utilized to provide dedicated, secure dark fiber connectivity between all buildings, many on redundant pathways
- Network design with multiple connections between operations and key building MDF rooms in an overall star design with redundant connectivity and switches
- All primary equipment is “dual homed” with multiple network interfaces to the network on separate pathways
- Multiple ISP connections to separate carriers (including redundant firewalls, IDS, internet routers) which have separate pathways to regional POPs
- Phone systems configured for rapid load routing to other phone servers
- All operation centers have generator support, multiple UPS and multiple circuits
- Equipment has multiple power supplies with separate circuits and UPS connections
- All operations have N+1 HVAC units to maintain cooling and humidity requirements
- All operation centers / MDF have multiple levels of environmental monitoring / alerts
- Staff shifts have been expanded for greater coverage
- For off-hours, 24/7/365 on call is utilized for critical services (server, network, phone)
- An afterhours emergency call center has been contracted
- Every night (twice on weekends), on call staff check the status of critical systems to ensure services are available
- Move of most servers to multiple Virtual Server farms
- All systems and equipment have dual components such as power supplies, network interfaces, etc.
- Redundant and/or RAID based local drives
- Redundant multi-tier SAN arrays with multiple network pathways
- Use of snapshots
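The redundancy choices listed above can be reasoned about with standard availability arithmetic. The sketch below is a minimal illustration, assuming component failures are independent (which a shared building, circuit or pathway would violate, and which is why the separate pathways and circuits matter). The function names and the sample figures are assumptions, not measured County values.

```python
def parallel_availability(single, copies):
    """Availability of redundant copies where any one working copy suffices.

    Assumes failures are independent; shared pathways or power break this.
    """
    return 1 - (1 - single) ** copies


def series_availability(parts):
    """Availability of a chain in which every part must be up."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


if __name__ == "__main__":
    # Illustrative figure only: a single component that is up 99% of the time.
    single = 0.99
    dual = parallel_availability(single, 2)

    # A service that needs power, network and storage, each dual-redundant.
    service = series_availability([dual, dual, dual])
    print(f"single: {single:.4f}  dual: {dual:.6f}  service chain: {service:.6f}")
```

Under these assumptions, doubling a 99% component brings that layer to roughly 99.99%, while chaining layers in series pulls the total back down. That is why redundancy is applied at every layer rather than only at the most visible one.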
With the implementation of disaster avoidance designs and policy, services have already improved and BCP planning has incorporated the new service level expectations:
- Service availability and reliability have gone up, to near 99.9% scheduled availability
- Ability to leverage redundant systems for reduced downtime due to upgrades & maintenance
- Scheduled downtimes for service are far fewer and better coordinated
- Most services are available 24/7/365 (except scheduled downtimes) for greater utilization, telecommuting, critical service support and expanded County service options
- Standardized equipment and approach to service, with reduced cost due to standardized parts
- Reduced costs from lost productivity and customer support
- Overall improvement in service levels and availability
- County planning is more coordinated with disaster avoidance as part of the overall process
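To put the availability figure above in concrete terms, an availability percentage can be converted into a downtime budget. The sketch below assumes a full 24/7 year as the measurement period; since the figure quoted is scheduled availability, the real period would exclude scheduled downtimes. The function name and the default period are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Unscheduled downtime budget implied by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1 - availability) * period_hours


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        hours = allowed_downtime_hours(target)
        print(f"{target:.2%} availability allows about {hours:.2f} hours down per year")
```

At 99.9%, the budget works out to a little under nine hours per year, which gives planners a tangible number to weigh against the cost of each additional layer of redundancy.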
To support the technology needs of the County while incorporating DA design, TS utilizes the following:
- Virtual Server Farms are on HP Blade servers running VMware vSphere
- Storage is primarily multi-tier utilizing EMC and Nimble
- Network is primarily Cisco and Juniper with overall multi-path Star design between buildings
- Active monitoring / alerting utilizes OpenNMS, Splunk, LanDesk and Microsoft System Center
- Several types of active operational monitoring / alerting covering power, HVAC, Humidity & security
- Phone servers utilize Siemens systems