





- Overview
- How Year2000 Compares to Other Planning Events
- Year2000 is scheduled, Y2K failures are not.
- What will actually happen is a complete unknown.
- Potential complexity can be overwhelming
- Fundamentally, Year2000 is a software problem,
- Failures will occur across the end of the century.
- Year2000 may be incorrectly cited as the cause of a failure.
- Y2K - many small, isolated failures spread over an extended period rather
than one big failure or event.
- Impact of Y2K will be distributed rather than focused at one location or
area.
- Remediation of internal systems does not obviate the risk of Y2K
problems.
- Problems could occur internally or with utilities, communications
services or business partners.
- Year2000 problems will require different solutions than just switching to
a duplicate system.
- Year2000 is a business issue rather than a technology/hardware issue
- Loss of public confidence about Year2000 issues may engender client
actions that cause problems of their own.
- Call Center volumes could escalate.
- Unusual or excessive business activity could occur.
- There could be excessive or unusual requests for account statements.
- Automated information systems could experience excessive traffic.
- Banks could experience unusual cash demands.
- Inability of a business partner to process may have impact.
- Contingency Planning as a Business Imperative
- Consumers (and most businesses) normally assume that everything works.
- Failure is inherent in everything -- materials fatigue, people make
errors, programs have bugs.
- The past is not necessarily prologue.
- New is not necessarily better.
- Due to the high cost of failure, some industries consider contingency
planning a normal part of business.
- A General Approach to Contingency Planning
- An essential precursor -- understand how the business works.
- What are the core business processes?
- What is the chain of relationships between suppliers, process and
consumers which enable the process to produce value?
- What are the flows of information, material and control within those
processes?
- Know what is important -- both internally and externally.
- Starting Assumption - Focus on Problem Sources or Solutions?
- A simple scenario planning methodology
- Establish a contingency planning working group
- Enumerate key sources and services
- Utilities
- Electricity
- Water
- Natural Gas
- Sewer Services
- Trash Removal
- Services
- Street access
- Vehicle Fuel
- Public Transportation
- Local Telephone Service
- Long Distance Telephone Service
- Internet Access
- Local Access
- Web Servers
- Email Servers
- Internet Name Service (DNS)
- Wide area routing
- Mail Delivery
- Package Delivery
- Business Services
- Market Data
- Market Access for Trading
- Clearing and Settlement Services
- Trading partner EDI access
- Internal Services
- Telephone PBX
- Voice Mail
- EMail Services
- File Servers
- Application Hosts
- Database Servers
- Internal Networks
- Document business processes and services.
- Develop failure scenarios
- Some possible ways that a service could be affected:
- Unavailable
- Available intermittently
- Available but wrong
- Pruning Strategy - Selective Focus
- Develop A Business Response Portfolio
- Probable Event Horizon
- Scenario detection approach
- Response Validity Timeframes
- Fallback strategy if limits exceeded
- Plan for return to normalcy
- Test Strategy
- Do not ignore routine failures
- Identify and implement impact mitigation responses
- Develop Situation Management Strategies
- War Room
- Situation Management Team - Standing
- Situation Management Team - Event-Driven
- Business Support and Escalation Processes
- Issues to Consider
- Event Horizon
- Event Duration
- Multiplicity of Events
- Track Everything
- Assign Priorities or Relative Impacts
- Manage the Situation
- Afterwards, Review What Was Learned
- Impact to Communications and Dispatch Mechanisms
- Technical vs Business Problem Management
- And Hope for the Best
- Overview
With the impending turn of the century and the discovery that some previously desirable
technology shortcuts had undesirable side-effects, there has been considerable interest in
both remediation and contingency planning. Remediation for Year2000 problems, the
systematic examination of internal systems and correction of inherent date problems,
should be just about complete. Development of business plans to address the greater than
usual uncertainties over the next few months should be underway. These business plans, or
contingency plans, detail at some level the actions to be taken should an abnormal event
scenario begin to unfold. While the focus will likely be scenarios around the 'year2000'
problem, it should be noted that contingency planning is in reality a normal part of the
management process. As business processes become more interconnected as a result of
e-commerce, risk management and the establishment of contingency plans will become
widespread.
- How Year2000 Compares to Other Planning Events
- Year2000 is scheduled, Y2K failures are not.
Unlike the calendar transition from 1999 to 2000, the event horizon for Year2000 failures
is not a single, focused point but a broad band that stretches back many years and can be
expected to extend well into the future. A Cap Gemini America executive survey reported by
CNN August 28, 199 ("Early Year 2000 glitches provide sneak preview") found
that 40% of firms surveyed had already encountered Year2000 failures. Date-related
problems which now would be classified as Y2K have been encountered for many years. For
example, a number of Canadian brokerage firms encountered the as-yet-unnamed Y2K problem
when a widely used bond calculator failed to evaluate 10-year bonds back in 1990.
The difficulty lies in the many different ways that a date problem could impact the
function of a system. It would be very convenient if the system just shut down. But most
likely the application would compute an invalid value causing incorrect decisions to be
recommended or (worse) pass bad data on to the next user. Total system failure is usually
fairly obvious and easy to detect. Data corruption problems may not be easy to detect --
particularly with the widespread attitude that if an application executed then what it did
must be correct (it was tested once, wasn't it?).
The event horizon for a Y2K problem in an individual application or system will be
specific to how the date is being used. If the date procedure is calculating future events
such as when a product lot expires or the number of days interest paid before maturity
then the failures should have occurred in the past. If the date procedure is evaluating
past events such as the age of an outstanding invoice or the expiration of an
interest-free grace period then the failures should occur in the future, sometime after
the calendar transition. There is even the chance that a calculation will only fail for a
while -- and heal itself when both dates it is working with are from the same century.
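The different event horizons can be illustrated with a minimal Python sketch. It assumes the common shortcut of storing years as two digits; the function names and sample years are hypothetical and are not taken from any particular system.

```python
# Illustrative only: the function names and sample years are hypothetical.

def maturity_yy(issue_yy: int, term_years: int) -> int:
    """Maturity year kept in two digits, so it wraps after 99."""
    return (issue_yy + term_years) % 100


def invoice_age_years(invoice_yy: int, today_yy: int) -> int:
    """Invoice age computed by subtracting two-digit years."""
    return today_yy - invoice_yy


# Forward-looking: a 10-year bond issued in 1990 appears to mature in year "00",
# which compares as earlier than its issue year "90". This fails well before 2000.
bond_matures_after_issue = maturity_yy(90, 10) > 90

# Backward-looking: an invoice dated 1999 and examined in 2000 gets a negative age.
# This fails only after the calendar transition.
age_across_century = invoice_age_years(99, 0)

# Self-healing: once both dates fall in the same century the arithmetic is right again.
age_same_century = invoice_age_years(0, 1)

print(bond_matures_after_issue, age_across_century, age_same_century)
```

The forward-looking comparison is wrong years ahead of the transition, the backward-looking subtraction goes wrong only after it, and the same subtraction is correct again once both years are post-2000.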
- What will actually happen is a complete unknown.
- Potential complexity can be overwhelming
A detailed examination of the webs of dependencies between and within businesses can be
overwhelming. It is suggested that initial analysis and planning be performed at a high
level and selectively expanded where appropriate. Bottom-up analysis from a technologist's
perspective is likely to be less productive than top-down from the perspective of the
senior management team.
- Fundamentally, Year2000 is a software problem,
so limited precedent can be derived from software industry experience creating and
maintaining programs (see Capers Jones article "Probability of Year2000 Damages"
on the Year2000 archive at www.year2000.com/archive/proby2k.html). Analysis of software
defects and the efficacy of their removal suggests that problems can occur even with
'remediated' systems.
- Failures will occur across the end of the century.
The year2000 transition has not in any way suspended normal, routine failures. Furnaces
will still go out, power will fail in some localities, computers will disgorge garbage --
hardware will continue to break and software will continue to encounter 'will never
happen' problems. And none of these failures will have anything to do with the Y2K bug.
- Year2000 may be incorrectly cited as the cause of a failure.
There have been a number of widely reported incidents described as Year2000 failures which
in fact had completely unrelated causes (see CNN Aug. 16 article "Londoners suffer
Y2K power outage" and the later "Eclectic Electrical Supply" analysis by
Peter de Jager on www.year2000.com). This syndrome can be expected to become more common
as the millennium approaches.
- Y2K - many small, isolated failures spread over an extended
period rather than one big failure or event.
TV has suggested that multiple, small failures could cascade into a major event (as
demonstrated by historical major blackouts). The electric utilities are pretty terrified
of this and have made plans to isolate areas just in case [or should have].
- Impact of Y2K will be distributed rather than focused at one
location or area.
- Remediation of internal systems does not obviate the risk of
Y2K problems.
The remediation process only addresses the systems under local control -- the systems of
suppliers and other business partners may not have been remediated.
- Problems could occur internally or with utilities,
communications services or business partners.
- Year2000 problems will require different solutions than just
switching to a duplicate system.
Failing over to a backup system would not work if the problem were in the program logic or
data feeds. Contingency service providers have been worried about firms declaring
disasters due to Y2K problems and then tying up valuable recovery centre resources
duplicating those problems in the backup systems.
- Year2000 is a business issue rather than a technology/hardware
issue
- Loss of public confidence about Year2000 issues may engender
client actions that cause problems of their own.
- Call Center volumes could escalate.
- Unusual or excessive business activity could occur.
- There could be excessive or unusual requests for account
statements.
- Automated information systems could experience excessive
traffic.
- Banks could experience unusual cash demands.
- Inability of a business partner to process may have impact.
- Contingency Planning as a Business Imperative
- Consumers (and most businesses) normally assume that
everything works.
It is more comfortable to assume that things just work the way they are supposed to. This
assumption needs to be tested regularly for self-preservation. How often becomes an issue
of survival.
- Failure is inherent in everything -- materials fatigue, people
make errors, programs have bugs.
- The past is not necessarily prologue.
A problem-free past is predictive only to the extent that nothing has changed and the
environment is stable and predictable. This is rarely true.
- New is not necessarily better.
The cynic would say that nothing new ever works. What he really means to say is that new
things have rarely been tested sufficiently to weed out either design flaws or
implementation flaws -- components are subject to infant mortality, programs may have bugs
in significant places, human processes can be shaky. From a statistical point of view the
likelihood of failure is often thought of as a 'bathtub' curve - high at the start,
falling to a low value through most of the product's life, rising at the end as things wear
out.
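The bathtub shape can be sketched numerically. The model below is only one convenient way to draw such a curve: it assumes a decaying early-failure term, a constant background rate and a growing wear-out term, with hypothetical parameter values chosen purely for illustration.

```python
import math


def bathtub_hazard(t, infant=0.5, infant_decay=2.0, baseline=0.02,
                   wear=0.001, wear_power=3.0):
    """Illustrative failure rate at age t (arbitrary units, assumed parameters)."""
    early = infant * math.exp(-infant_decay * t)   # infant mortality, fades quickly
    late = wear * t ** wear_power                  # wear-out, grows with age
    return early + baseline + late


# Sample the curve at the start, the middle and the end of a product's life.
for age in (0, 3, 10):
    print(age, round(bathtub_hazard(age), 3))
```

With these assumed parameters the rate is high at age 0, lowest in mid-life and high again at age 10, which is the pattern described above.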
- Due to the high cost of failure, some industries consider
contingency planning a normal part of business.
Risk management and contingency planning have traditionally been the province of military
and government planners. But as businesses become more connected with each other and their
ultimate customers they become increasingly exposed to internal and external failures.
Risk management will need to be considered around every controlled change and contingency
plans developed to minimize the impact of these inevitable disruptions.
- A General Approach to Contingency Planning
- An essential precursor -- understand how the business works.
The Value-Chain metaphor can be useful when looking at the business as a whole and its
constituent processes. What this suggests is that business value is created not just by
the processes within a business but through transportation and communications links that
tie those processes to both suppliers and consumers.
- What are the core business processes?
Know what business functions and processes are most important to the health and
profitability of the firm. Watch out for low profile activity that provides essential
input into more critical functions -- if withholding this contribution compromises the
viability of more 'critical' functions then it should be considered critical as well.
Understanding how long a particular function can be shut down is helpful in establishing
its criticality. Be aware, though, that processes which are executed at intervals as part
of the business cycle should not be discounted -- yearend close is critical when it is
time to close. For this reason, some processes may be tied to the business calendar to
determine their criticality.
- What is the chain of relationships between suppliers, process
and consumers which enable the process to produce value?
- What are the flows of information, material and control within
those processes?
All business processes consume resources of some type without which their function would
be impossible. These inputs are transformed (or perhaps just redirected) by the process
and the results propagated to subsequent consumers. Decision-making information is
required from a variety of sources internal and external to manage and direct the process.
Knowing these items is essential in being able to estimate the impact of Y2K degradation
or failures.
- Know what is important -- both internally and externally.
Being connected is a reality of the global business environment. Problems experienced by
business partners can easily affect other businesses -- particularly if they are linked by
electronic information exchanges or other tightly coupled services. But not everything
will be affected uniformly -- internal processes could fail but have minimal impact if
their output could be deferred. Similarly, not all computer clocks and calendars are
important -- consider how much it would really affect things if the calendar on a computer
were set back to a pre-2000 date to avoid Y2K problems.
- Starting Assumption - Focus on Problem Sources or Solutions?
Year2000 is distinguished from most business problems in that the uncertainty around what
could go wrong is exceptionally high -- and the time frame in which business contingency
plans are needed is very close and fixed. It will be helpful to decide up front if the
planning approach seeks to identify what could fail or if it is assumed that key things
will fail and plan accordingly.
- A simple scenario planning methodology
What is outlined here is a simple approach to developing contingency plans. More detailed
information can be found by consulting the web sites listed in 'Directory of Related
Links'. The MITRE Contingency Planning site is a personal favorite. The SIA Year2000
contingency planning working committee reports are also helpful -- particularly for
financial service firms.
- Establish a contingency planning working group
A cross-functional working group should be established to review failure scenarios and
develop contingency plans. This group should have representatives from each major business
area including technology. It cannot be stressed enough that contingency planning is a
business management exercise, not a technical function. Contingency plans encapsulate
decisions about the strategy and tactics for a business unit to cope with unexpected
changes that impact its ability to function. The technologists can develop possible
failure scenarios, but only the management team is likely to be able to evaluate potential
impacts quickly. A cross-functional group is recommended because of the potential for
indirect effects across the organization.
- Enumerate key sources and services
This list contains some suggestions that could be considered. It may be impossible to
anticipate what could affect specific services -- but perhaps less difficult to predict
that some services could be affected. A list such as this would be developed for rounds of
'what-if' scenario planning.
- Utilities
- Electricity
- Water
- Natural Gas
- Sewer Services
- Trash Removal
- Services
- Street access
- Vehicle Fuel
- Public Transportation
- Local Telephone Service
- Long Distance Telephone Service
- Internet Access
- Local Access
- Web Servers
- Email Servers
- Internet Name Service (DNS)
- Wide area routing
- Mail Delivery
- Package Delivery
- Business Services
- Market Data
- Market Access for Trading
- Clearing and Settlement Services
- Trading partner EDI access
- Internal Services
- Telephone PBX
- Voice Mail
- EMail Services
- File Servers
- Application Hosts
- Database Servers
- Internal Networks
- Document business processes and services.
Write down what could be affected -- a spreadsheet arraying business functions against
input services is a useful planning tool.
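For those who prefer a script to a spreadsheet, the same grid can be sketched in a few lines of Python. The business functions and their dependencies shown here are hypothetical placeholders; the service names are taken from the list above.

```python
# Hypothetical business functions mapped to the services each one relies on.
DEPENDENCIES = {
    "Order entry": {"Electricity", "Local Telephone Service", "Database Servers"},
    "Trade settlement": {"Electricity", "Clearing and Settlement Services",
                         "Internal Networks"},
    "Customer statements": {"Electricity", "Mail Delivery", "Application Hosts"},
}


def dependency_matrix(deps):
    """Return (services, rows): one row per function, 'X' where it needs a service."""
    services = sorted(set().union(*deps.values()))
    rows = {
        function: ["X" if service in needed else "" for service in services]
        for function, needed in deps.items()
    }
    return services, rows


def affected_functions(failed_service, deps):
    """List the business functions that depend on a failed service."""
    return sorted(f for f, needed in deps.items() if failed_service in needed)


services, rows = dependency_matrix(DEPENDENCIES)
print(affected_functions("Mail Delivery", DEPENDENCIES))
```

Reading down a column of the matrix shows every function exposed to a single service failure, which is the starting point for the 'what-if' rounds.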
- Develop failure scenarios
by examining the inputs and processes at a high level and imagining what could go wrong
(or better, assume that an essential process stops or produces wrong information). Then
examine the business consequences of that failure. This is a business exercise looking at
the overall function of business units, not a technical exercise looking only at computer
systems. Assign a likelihood of occurrence in simple terms (like low, medium, high) and a
severity of impact (like low, medium or severe). Determining economic impact is useful,
particularly when discussing scenarios with senior management, but is not always easy to
evaluate.
- Some possible ways that a service could be affected:
- Unavailable
This is the most widely anticipated kind of Y2K interruption.
- Available intermittently
Intermittent availability is normal for many services even without Y2K factors. But it can
be problematic if unexpected (as with electric power going off and on multiple times over
a short period).
- Available but wrong
A service that continues to work but contains incorrect information is perhaps the most
insidious kind of Y2K problem. There is nothing unique about bad data -- particularly in
that very few firms validate inputs at the source where errors can be contained. Instead,
bad data is normally allowed to propagate deep into other systems to fester until
discovered later. Data quality is suggested to be a very likely Y2K problem.
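Containing bad data at the source can be as simple as a plausibility check on incoming dates. The sketch below is one possible approach, not a prescribed one: the record layout, the `settle_date` field name and the acceptance window are all assumptions made for illustration.

```python
from datetime import date


def split_by_plausible_date(records, earliest, latest):
    """Separate records whose date lies in the accepted window from the rest."""
    accepted, quarantined = [], []
    for record in records:
        # 'settle_date' is a hypothetical field name used only for this example.
        if earliest <= record["settle_date"] <= latest:
            accepted.append(record)
        else:
            quarantined.append(record)
    return accepted, quarantined


incoming = [
    {"id": 1, "settle_date": date(2000, 1, 3)},
    {"id": 2, "settle_date": date(1900, 1, 3)},  # a "00" year read as 1900
]
accepted, quarantined = split_by_plausible_date(
    incoming, earliest=date(1999, 1, 1), latest=date(2001, 12, 31)
)
```

Quarantined records are held for review rather than passed downstream, so an error is caught where it enters instead of after it has propagated.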
- Pruning Strategy - Selective Focus
It will be helpful to categorize the evaluated scenarios and focus on developing
contingencies in a predetermined order -- most severe impact to least severe, for example.
Low impact scenarios can be excluded initially. It is suggested that scenarios that are
considered to be unlikely but severe should not be excluded -- although ordering severe
scenarios by likelihood of occurrence may be valid. It is important to keep in mind that
what will happen is unknown, so guessing that a particular scenario has a low probability
of occurrence could be wishful thinking.
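The rating and pruning steps can be expressed as a short Python sketch. It assumes the simple scales suggested above (low/medium/high likelihood, low/medium/severe impact); the scenario names are hypothetical.

```python
from dataclasses import dataclass

LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
SEVERITY = {"low": 1, "medium": 2, "severe": 3}


@dataclass
class Scenario:
    name: str
    likelihood: str
    severity: str


def prioritize(scenarios):
    """Set aside low-impact scenarios, then order by severity, then likelihood."""
    kept = [s for s in scenarios if s.severity != "low"]
    return sorted(
        kept,
        key=lambda s: (SEVERITY[s.severity], LIKELIHOOD[s.likelihood]),
        reverse=True,
    )


# Hypothetical scenarios for illustration only.
scenarios = [
    Scenario("Market data feed wrong", "medium", "severe"),
    Scenario("Regional power outage", "low", "severe"),
    Scenario("Voice mail unavailable", "high", "low"),
    Scenario("EDI link intermittent", "high", "medium"),
]
for s in prioritize(scenarios):
    print(s.severity, s.likelihood, s.name)
```

Note that the unlikely but severe power outage stays on the list and ranks above the likely but moderate EDI problem; only the low-impact scenario is set aside for now.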
- Develop A Business Response Portfolio
For each scenario a business strategy should be developed. Non-technical alternatives
should be explored where possible -- manual processing, selective shutdowns, alternate
processing approaches. A number of factors should be considered for each business
strategy:
- Probable Event Horizon
It is helpful to identify when certain problems are likely to occur -- some may be very
specific. With Y2K problems this is expected to be around certain dates - Jan 1, 2000,
February 29, 2000, etc.
- Scenario detection approach
The contingency plan should contain some ideas around how problems would be detected that
would suggest a particular scenario was unfolding -- power outages should be easy, data
quality problems may not be.
- Response Validity Timeframes
Contingency business strategies are usually workable for limited periods. It is useful to
identify those limits during the planning process.
- Fallback strategy if limits exceeded
A second tier contingency response should be considered. This alternative would be
executed if the time limits ascertained for the primary contingency response were
exceeded. This could include decisions to shut down certain services, reroute orders to an
alternate location, etc. -- these choices will be unique to the individual business.
- Plan for return to normalcy
The individual contingency plan should consider the return to normal processing. For
example, if the contingent response were to accept orders using paper documents rather
than an electronic order system, consideration should be given to whether to transcribe
the paper orders to the electronic sales history files at the end. And how to coordinate
order numbers and update inventories to rebalance the system.
- Test Strategy
Key to validating a potential business response is testing whether it addresses the
problem in an acceptable and cost-effective manner. For any given situation there are many
potential solutions -- only some of which will work and of those only some which are
affordable. At the very minimum it is helpful to conduct a walkthrough exercise to simulate
the impact of the alternative. Participants should represent all involved areas -- not
just the immediate affected function. The walkthrough should attempt to identify areas of
conflict or capacity -- what works in one area may have undesirable impacts elsewhere.
- Do not ignore routine failures
The intrinsic reliability of an application or service should not be ignored in developing
contingency plans. Even though Y2K may not have an effect, failures could still occur and
impact the business.
- Identify and implement impact mitigation responses
Scenario analysis may identify vulnerabilities that can be offset by preemptive changes.
Customer service PCs in a Call Centre might be vulnerable to power interruptions even
though the main servers were on UPS/APS -- moving some or all to protected power would
mitigate their vulnerability.
- Develop Situation Management Strategies
The appropriate situation management strategy should be considered as part of the
contingency planning process. Notification processes to decision makers and necessary
decision response times should be considered in formulating the situation management
approach. Unreliable notification mechanisms might dictate that decision makers be kept
close at hand during probable trouble periods. Relaxed response time requirements could
suggest that a negative response approach be used for coordination.
- War Room
A situation room for managing problems should be prepared prior to the critical periods.
Key reference materials - maps, system documentation, diagrams, staff and vendor lists,
etc. should be collected. White boards are useful for displaying status messages and
notices in a clear but easily changeable manner. Access to the room in the event of
problems should be considered - a war room on the top floor of a high rise might not be
useful in the event of a power failure.
- Situation Management Team - Standing
Many larger organizations are adopting the approach of a standing situation management
team located in a war room with rotating teams on duty across the entire perceived
vulnerability period. One challenge of a standing team is maintaining vigilance --
particularly if the war room team is a key part of the situation detection process. This
approach is well-suited to environments that cover large geographical areas or have
particularly critical response requirements.
- Situation Management Team - Event-Driven
An alternate approach is to constitute the situation management team in the event problems
are detected. One issue that should be considered is how event detection will occur and
through what mechanisms will notification be delivered. Pager dispatch may not work if the
system at fault is the local phone network. One alternative to consider is negative
notification - distributing an 'all is well' heartbeat during the critical periods.
Failure to receive the heartbeat would be considered a problem notification.
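The negative notification idea can be sketched as a small watchdog. This is a minimal illustration only: the class name and the 15-minute timeout are assumptions, and how heartbeats are actually delivered (phone, fax, radio, network) is left open.

```python
import time


class HeartbeatMonitor:
    """Treat silence as the alarm: no 'all is well' within the timeout means trouble."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._last_seen = clock()

    def record_heartbeat(self):
        """Call whenever an 'all is well' message arrives."""
        self._last_seen = self._clock()

    def is_overdue(self):
        """True when the heartbeat has been silent longer than the timeout."""
        return self._clock() - self._last_seen > self.timeout_seconds


# Hypothetical usage: expect an 'all is well' at least every 15 minutes.
monitor = HeartbeatMonitor(timeout_seconds=15 * 60)
monitor.record_heartbeat()
if monitor.is_overdue():
    print("No heartbeat received: convene the situation management team")
```

The design choice is that the monitor needs nothing from the failing site in order to raise the alarm, which is what makes the approach robust when the notification channel itself is the casualty.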
- Business Support and Escalation Processes
The criteria and decision path to escalate problems within the organization and to
external vendors should be considered in advance. This needs to be documented (see
'Developing a Support Strategy'). Computerized problem management systems are very helpful
to encapsulate this kind of information but should not be relied upon exclusively. After
all, the problem management system could fail as well. External vendors should be
contacted to ascertain their resource availability plans during critical periods. Staff
should be consulted to ensure that key people are not on holiday during critical periods.
Some firms are requiring that staff be on site or locally available.
- Issues to Consider
- Event Horizon
Be clear on when problems are expected to occur -- these will be unique to the systems
they are embedded in and how those systems are used. Remember that many Y2K problems will
not manifest themselves precisely at 12:01am, Jan 1, 2000. Some areas in financial
services were affected in 1990, for example. Some firmware date problems will only be seen
when the machine is restarted.
- Event Duration
Think through how long a problem could be extant before it affected the business; and how
long the contingency strategy could be followed before alternate approaches were needed.
And remember, people need rest to be effective -- plan for staff rotation.
- Multiplicity of Events
Do not be surprised if there are multiple events occurring. Anticipating and managing the
cross-impact of multiple events could be a real challenge - know when to change and adapt
strategies.
- Track Everything
Some problems will be routine failures, others may be Y2K-related. But all could affect
the business in one manner or another.
- Assign Priorities or Relative Impacts
Focus resources on problems that matter and can be solved.
- Manage the Situation
Maintain a policy of periodic situation reviews to ensure that resources are used
effectively over time and redeploy staff as status changes.
- Afterwards, Review What Was Learned
- Impact to Communications and Dispatch Mechanisms
Do not forget that the communications systems and message dispatch tools are also suspect.
Plan alternate means of communication (including changing physical proximity) to maintain
control in the event of problems.
- Technical vs Business Problem Management
And remember that the objective is to maintain the business -- with the technology in a
supporting role, not the other way around. Be clear and firm on deadlines and thresholds
and don't hesitate to invoke contingency if the technical situation fails to correct
itself in time. Technologists are optimists about one more fix that is sure to correct the
problem -- don't let their optimism jeopardize the business.
- And Hope for the Best