LHC Computing Grid Project
LCG Applications Area Risk Analysis
  LCG Applications Area
 

Risk summary

Key

Likelihood:
  1 - never expected to happen
  2 - could happen but very unlikely
  3 - could well happen at some point
  4 - will probably happen
Impact:
  1 - we can deal with it, no problem
  2 - a bit of a hassle but not too bad
  3 - clearly can be dealt with, but with significant effort
  4 - crisis
Overall risk:
  1 - very unlikely to have significant adverse impact on the project
  2 - unlikely to have significant adverse impact on the project
  3 - could have a significant adverse impact on the project
  4 - would be a crisis, and may happen
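
The rating key above maps naturally onto a small data structure. The following is a minimal sketch, not part of the project's tooling: the `Risk` class and `parse_rating` helper are hypothetical names, and the ranking rule (overall risk first, then impact, then likelihood) is an assumption made for illustration. The overall rating is stored as given rather than computed, since the ratings in this document do not follow a fixed formula. The sample ratings are copied from entries later in this document.

```python
from dataclasses import dataclass

VALID_LEVELS = (1, 2, 3, 4)  # every scale in the key runs from 1 to 4


@dataclass(frozen=True)
class Risk:
    """One entry of the risk summary (hypothetical representation)."""
    name: str
    likelihood: int
    impact: int
    overall: int

    def __post_init__(self):
        # Reject ratings that fall outside the 1-4 key.
        for label in ("likelihood", "impact", "overall"):
            value = getattr(self, label)
            if value not in VALID_LEVELS:
                raise ValueError(f"{label} must be 1-4, got {value}")


def parse_rating(name, rating):
    """Parse a 'likelihood/impact/overall' string such as '3/4/4'."""
    likelihood, impact, overall = (int(part) for part in rating.split("/"))
    return Risk(name, likelihood, impact, overall)


risks = [
    parse_rating("Deployment failure", "1/3/1"),
    parse_rating("Licensing limitations", "3/4/4"),
    parse_rating("Missing personnel", "3/3/3"),
    parse_rating("Duplication of work", "4/1/1"),
]

# Assumed ranking rule: highest overall risk first, ties broken by impact,
# then by likelihood.
ranked = sorted(risks, key=lambda r: (-r.overall, -r.impact, -r.likelihood))

for risk in ranked:
    print(f"{risk.overall}  {risk.likelihood}/{risk.impact}  {risk.name}")
```

Ranking in this way puts the entries needing the most attention at the top of a periodic review.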

Risks, in no particular order

==== Inadequate third party software
Inadequacy of third party software in terms of functionality,
performance or timely availability of required new features.
== Likelihood/impact/overall risk: 3/2/1
We have to anticipate that this could happen, hence we
are very careful in selecting third party software, and
if it fails nonetheless, we have an architecture that
promotes ease of replacement.
== Current process for managing risk: 
Establish a thorough process for vetting third party software
against strict acceptance requirements for quality,
functionality, support, future prospects etc.
Also, follow component architecture of the LCG applications
blueprint by which specific tools and implementations
can be replaced by alternatives.
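
As an illustration of what replaceability means in practice, the following minimal sketch shows client code that depends only on an abstract interface, so that one implementation can be swapped for another without changing the client. It is not taken from the LCG blueprint: the `Serializer` interface, the two implementations and the `REGISTRY` lookup are hypothetical names chosen only because they can be built from the Python standard library.

```python
import json
import pickle
from abc import ABC, abstractmethod


class Serializer(ABC):
    """Hypothetical component interface; clients depend only on this."""

    @abstractmethod
    def dumps(self, obj) -> bytes:
        ...

    @abstractmethod
    def loads(self, data: bytes):
        ...


class JsonSerializer(Serializer):
    """One implementation, standing in for a third party tool."""

    def dumps(self, obj) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def loads(self, data: bytes):
        return json.loads(data.decode("utf-8"))


class PickleSerializer(Serializer):
    """An alternative implementation of the same interface."""

    def dumps(self, obj) -> bytes:
        return pickle.dumps(obj)

    def loads(self, data: bytes):
        return pickle.loads(data)


# Replacing an implementation is a one-line change to this table.
REGISTRY = {"json": JsonSerializer, "pickle": PickleSerializer}


def make_serializer(name: str) -> Serializer:
    return REGISTRY[name]()


def round_trip(serializer: Serializer, record):
    """Client code: written against the interface, not an implementation."""
    return serializer.loads(serializer.dumps(record))


record = {"run": 42, "events": [1, 2, 3]}
results = {name: round_trip(make_serializer(name), record) for name in REGISTRY}
```

The point of the design is that `round_trip` never names a concrete class, so an inadequate implementation can be retired by registering a replacement.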
== Future options for managing risk: 
Do not adopt third party software without applying the
process.
== Crisis strategy: 
Identify and integrate a replacement with the required
functionality, again following the acceptance process.
== Actions required: 
An acceptance process and policy are now in development.

==== Third party software no longer available
Third party software used by LCG software ceases to be
available.
== Likelihood/impact/overall risk: 3/2/1
We have to anticipate that this could happen, so we
have an architecture that promotes ease of replacement.
However, an important criterion for adoption of a
third party product in the first place is that it 
have a broad user community, and generally that it be
open source, making its abandonment more improbable.
== Current process for managing risk: 
Follow component architecture of the LCG applications
blueprint by which specific tools and implementations
can be replaced by alternatives.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Identify and integrate a replacement with the required
functionality, following the acceptance process. If
necessary, assess whether the third party product 
that is no longer externally supported can be
internally supported (unlikely, and not really in
our general strategy).
== Actions required: 
None.

==== Software obsolescence
Technology and/or language evolution renders LCG software
obsolete.
== Likelihood/impact/overall risk: 4/2/1
We know this will eventually happen, and accommodating
this smoothly is a principal driver for our architecture
and software design approaches.
== Current process for managing risk: 
Follow component architecture of the LCG applications
blueprint by which specific tools and implementations
can be replaced by alternatives.
Adhere to the requirement of the blueprint that language
evolution be supported by designing with the potential
for evolution in mind, and assessing the evolvability
of our software and architecture -- specifically, using
Java as the currently most likely language target.
== Future options for managing risk: 
Monitor trends in technology and language evolution.
== Crisis strategy: 
Retire obsolescent components, replacing them with new
ones. The blueprint architecture is designed to keep
the difficulty and disruptiveness of this low.
== Actions required: 
None.

==== Deployment failure
LCG software deployment on distributed testbed and/or 
production facilities fails
== Likelihood/impact/overall risk: 1/3/1
This is highly improbable because of the frequent
incremental testing and deployment we do in progressively
larger scale settings. By the time we are deploying on
a distributed system, we should not be doing anything
we haven't done before.
== Current process for managing risk: 
Convenient deployability of our software is an architectural
requirement of the LCG applications software blueprint
which we follow. We develop and test in a very tight
iteration loop to expose problems early. New releases
of our software are rapidly picked up and exercised in
testbeds and (soon) real-world settings in the 
experiments. Our release schedules are set so that we
are ready to deploy test setups early on testbeds
and production environments, so that they are exercised
and problems are fixed before real deployment is required.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Address the problem preventing deployment with top
priority, and incorporate testing for the problem
in future regression testing.
== Actions required: 
Those described above.

==== Platform incompatibility
LCG software fails to function on important deployment
platforms
== Likelihood/impact/overall risk: 1/2/1
This is highly improbable because deployment platforms
are always included in our software releases.
== Current process for managing risk: 
Applications area software is built for the platforms
that the experiments specify as required for deployment.
So long as that list is accurate, our software must
function on the deployment platforms before it is
approved for release.
Further, we will build and test our software on a
range of other platforms which may in the future
become deployment platforms, so that future platform
migration does not catch us off guard with platform
or compiler incompatibilities in our software.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Address the problem resulting in failure with top
priority, and incorporate testing for the problem
in future regression testing.
== Actions required: 
As described above.

==== Experiment discord
Inability to sustain sufficient agreement between experiments
to continue a common software development project
== Likelihood/impact/overall risk: 2/3/2
Possible, but based on the experience so far, the experiments
are determined to see common software development succeed.
== Current process for managing risk: 
Involve all experiments closely in oversight, requirements
gathering, management, software development, and deployment
and testing so that any emerging problems are identified
and addressed early. This is done through the structure
and operation of the SC2, RTAGs, PEB, Architects Forum,
and the openness of the applications area software development
effort.
Be flexible in adapting the requirements and functionality
to the evolving needs of the experiments.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Be as flexible as possible, and escalate through the
management to address the problem at an effective level.
If it is unresolvable, cancel the project.
== Actions required: 
As described above.

==== Experiment departure
Software development effort is lost when a project fails to meet
objectives, resulting in one or more experiments abandoning the
project to develop in-house solutions.
== Likelihood/impact/overall risk: 2/3/1
Possible, but based on the experience so far, the experiments
are determined to see common software development succeed.
== Current process for managing risk: 
Involve all experiments closely in oversight, requirements
gathering, management, software development, and deployment
and testing so that any emerging problems are identified
and addressed early. This is done through the structure
and operation of the SC2, RTAGs, PEB, Architects Forum,
and the openness of the applications area software development
effort.
Be flexible in adapting the requirements and functionality
to the evolving needs of the experiments.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Be as flexible as possible, and escalate through the
management to address the problem at an effective level.
If it is unresolvable, reassess and rescope the project
in light of the changed manpower profile and user base.
== Actions required: 
As described above.

==== Integration failure
Failure to integrate software components into a coherent
architecture and framework
== Likelihood/impact/overall risk: 1/2/1
Highly unlikely because the most basic precept of the
architecture blueprint we are following is to design
in the context of a coherent architecture and framework.
== Current process for managing risk: 
Follow the architecture blueprint which prescribes a
coherent architecture and framework. Test coherence
and architectural consistency continuously by using
components together as soon as and wherever possible.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Fix the failure with top priority and prevent recurrence
with reassessment and revision of design, development
and integration testing practices.
== Actions required: 
Follow the blueprint.

==== Requirements/functionality mismatch
Mismatch of delivered product functionality to experiment
requirements
== Likelihood/impact/overall risk: 2/2/1
Highly unlikely except via inadequacy of communication.
== Current process for managing risk: 
Experiments are centrally involved in all levels of the
project (oversight, requirements, management, execution,
testing and validation) to avoid this. Perhaps the
greatest risk would come from inadequate communication
between an experiment as a whole and those members
providing the communication and involvement with the
LCG. To avoid (or expose) this we strive for early
take-up of LCG software in the experiments so that it
is deployed in real-world settings to end users,
where deficiencies will be exposed.
== Future options for managing risk: 
Expand formalized software validation efforts involving
the experiment user communities.
== Crisis strategy: 
Incorporate required software revisions or extensions
with high priority into the work plan. Work with the
experiments to develop interim workarounds.
== Actions required: 
As described.

==== Product take-up failure
Take-up of products fails in one or more of the experiments
expecting to use LCG software
== Likelihood/impact/overall risk: 2/2/2
Unlikely unless there is a communication failure as in
the previous risk, or unless there is a strategic
redirection in the experiment that makes a planned
LCG product inapplicable.
== Current process for managing risk: 
As for requirements/functionality mismatch. The close
involvement of the experiment architects and software
leadership in the applications area should prevent
unanticipated strategic redirections that would have
this effect.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Assess whether changes in the product that are practically
realizable can make take-up possible. 
== Actions required: 
As described.

==== Licensing limitations
Software license constraints imposed by LCG funding agreements will
prevent us from fully exploiting open source software.
== Likelihood/impact/overall risk: 3/4/4
At present, this looks like a real risk. If it happens it is
a disaster.
== Current process for managing risk: 
The applications area cannot function if we cannot use
(in particular) GPL-licensed software. We depend on
high quality, robust, extremely widely used open source
products which would take major development efforts over
years to match.
We are addressing this by working out an arrangement by which
our licensing is compatible with using GPL-licensed software.
== Future options for managing risk: 
Should not be a risk for much longer.
== Crisis strategy: 
Underway -- get a GPL-compatible license in place.
== Actions required: 

==== Loss of ROOT team
The small ROOT development team evaporates or is otherwise impaired.
== Likelihood/impact/overall risk: 2/3/2
The smallness of the core ROOT team makes this a non-negligible
risk, despite the demonstrated commitment of the team over
many years.
== Current process for managing risk: 
The ROOT team at CERN has been properly recognized, supported and
consolidated in a well defined EP/SFT section.
While the core ROOT team is very small, its user community is huge
and includes many who are deeply familiar with ROOT and are
contributing to its development. If the core ROOT
team were impaired, the whole HENP community would be highly
motivated to identify and pool resources to repair the team.
If ROOT or a part thereof were nonetheless rendered unusable, we
would exploit the implementation-neutral component architecture of
applications area software to make the introduction of an
alternative implementation tractable. Considerable development
work would likely be required, however, to recover the needed
functionality.
== Future options for managing risk: 
The ROOT team is working to lessen the risk in areas characterized
by very high complexity understood by few, for example the
CINT system and associated dictionary. We are developing a plan
by which ROOT, CINT and LCG software will use a common dictionary
in the future.
== Crisis strategy: 
As described.
== Actions required: 
As described.

==== Missing personnel
Promised personnel commitments fail to materialize or fall short
== Likelihood/impact/overall risk: 3/3/3
This could well happen.
== Current process for managing risk: 
Scope deliverables and timelines to what can be achieved with
identifiable manpower. Prioritize deliverables so that they
can be descoped in an orderly way if expected manpower does
not materialize.
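
The orderly descoping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a planning tool used by the project: the `Deliverable` class, the effort units and all numbers are hypothetical, and the rule applied (keep deliverables in strict priority order until one no longer fits, then descope everything below it) is only one possible reading of "orderly".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deliverable:
    name: str
    priority: int   # 1 = highest priority (assumed convention)
    effort: float   # estimated effort in hypothetical person-years


def scope(deliverables, available_effort):
    """Split deliverables into (kept, descoped) for the available effort.

    Deliverables are taken in strict priority order. The first one that
    does not fit, and everything of lower priority, is descoped.
    """
    kept, descoped = [], []
    remaining = available_effort
    cut = False
    for item in sorted(deliverables, key=lambda d: d.priority):
        if not cut and item.effort <= remaining:
            kept.append(item)
            remaining -= item.effort
        else:
            cut = True
            descoped.append(item)
    return kept, descoped


plan = [
    Deliverable("deliverable-C", priority=3, effort=1.0),
    Deliverable("deliverable-A", priority=1, effort=2.0),
    Deliverable("deliverable-D", priority=4, effort=0.5),
    Deliverable("deliverable-B", priority=2, effort=1.5),
]

# With the expected manpower everything fits.
kept_full, descoped_full = scope(plan, available_effort=5.0)

# If only part of the manpower materializes, the lowest priorities go first.
kept_short, descoped_short = scope(plan, available_effort=3.5)
```

A strict cut-off keeps the outcome predictable: a low-priority item is never retained at the expense of a higher-priority one merely because it happens to be small.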
== Future options for managing risk: 
Same.
== Crisis strategy: 
Descope deliverables according to priorities and stretch
timelines where possible.
== Actions required: 
As described.

==== Staffing continuity failure
Continuity failure in LCG staffing (e.g. late or inadequate phase 2
support) reduces software development and support effort.
== Likelihood/impact/overall risk: 3/3/3
Given the present horizon on LCG funding, pending a clear
plan for phase 2, this is a real risk, and potentially
a very serious one.
== Current process for managing risk: 
Scope projects to deliver full software products by the end of
LCG phase 1. Support and incremental development requirements
will remain, however, and must be addressed by phase 2 support.
Involve experiments closely in the development efforts so that
product expertise for further development and support resides
within the experiments.
== Future options for managing risk: 
Clarify phase 2 funding and staffing.
== Crisis strategy: 
Rely to a greater extent on the experiments for product support
and incremental development. This is, however, at odds with the
severely constricted manpower levels currently found in the
experiments.
== Actions required: 
As described.

==== Software delivery failure by work teams
A work package or institute team is unable to deliver promised
software.
== Likelihood/impact/overall risk: 4/3/2
Could well happen.
== Current process for managing risk: 
Develop and sustain development teams of diverse capability
and expertise so that shortfalls can be made up elsewhere with
reassigned manpower. We have already seen this happen to some
degree. It is an expected part of software management.
== Future options for managing risk: 
Same.
== Crisis strategy: 
As described.
== Actions required: 
As described.

==== Duplication of work
Software is developed which amounts to a duplication of
other work or is rendered irrelevant by other work or
changing requirements
== Likelihood/impact/overall risk: 4/1/1
This we know happens on occasion because some development
work is exploratory, investigating alternative approaches
which have potential, but ultimately the most effective
solution is chosen.
== Current process for managing risk: 
Develop specific, well-defined work plans and ensure they
are reviewed broadly and thoroughly so that real and
possible duplications are identified and understood,
and either agreed on as exploratory work or removed
from the program. The mechanisms and processes to achieve
this exist and are followed: work plan development,
review and approval take place using project management,
the Architects Forum, the PEB and the SC2.
== Future options for managing risk: 
Same.
== Crisis strategy: 
Terminate redundant work when its cost is determined to
exceed its potential value.
== Actions required: 
As described.

==== Grid middleware/infrastructure failure
Grid middleware projects and/or the distributed computing
infrastructure fail to deliver distributed components and
capability that meet requirements.
== Likelihood/impact/overall risk: 3/3/3
This could well happen because of the immature and (in some
cases) R&D nature of grid middleware development.
== Current process for managing risk: 
Alternate solutions for the short term are developed where
this risk is appreciable, removing grid middleware from
the critical path. An example is the MySQL implementation of
the POOL file catalog that was done pending the maturation
of grid replica location services. Once the grid middleware
is off the critical path, it can be iteratively tested and
ultimately incorporated when it is robust and proven.
== Future options for managing risk: 
Same.
== Crisis strategy: 
If a grid middleware component fails, back off to an interim
solution again until the middleware problem is solved.
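
Backing off to an interim solution can be pictured with the following minimal sketch. It is not POOL code: `DictCatalog`, `FallbackCatalog`, `CatalogUnavailable` and the file names are hypothetical, and an in-memory dictionary stands in both for the grid service and for the interim catalog. The sketch also assumes an automatic, per-request fallback, whereas the text above describes only the general strategy.

```python
class CatalogUnavailable(Exception):
    """Raised when a catalog backend cannot serve a request."""


class DictCatalog:
    """In-memory stand-in for a file catalog backend (hypothetical)."""

    def __init__(self, entries, available=True):
        self._entries = dict(entries)
        self.available = available

    def lookup(self, logical_name):
        if not self.available:
            raise CatalogUnavailable("backend is not responding")
        return self._entries[logical_name]


class FallbackCatalog:
    """Try the primary backend first, then back off to the interim one."""

    def __init__(self, primary, interim):
        self.primary = primary
        self.interim = interim

    def lookup(self, logical_name):
        try:
            return self.primary.lookup(logical_name)
        except CatalogUnavailable:
            # The middleware-backed catalog failed: use the interim solution.
            return self.interim.lookup(logical_name)


# Hypothetical entries; the locations are illustrative only.
grid = DictCatalog({"run42.root": "grid://site-a/run42.root"})
interim = DictCatalog({"run42.root": "file:///data/run42.root"})
catalog = FallbackCatalog(grid, interim)

first = catalog.lookup("run42.root")    # served by the primary backend
grid.available = False                  # simulate a middleware failure
second = catalog.lookup("run42.root")   # served by the interim backend
```

Because callers see only `FallbackCatalog.lookup`, the middleware can be reintroduced later without changes to client code.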
== Actions required: 
As-necessary development of simple interim solutions as
described.

==== POOL data storage performance failure
POOL event data storage fails to meet the performance and scalability
requirements of the experiments.
== Likelihood/impact/overall risk: 1/3/1
Highly unlikely because of the ongoing (re)assessment of
requirements and the incorporation of those requirements
into POOL release acceptance tests, and because of the
use of ROOT I/O -- a tool of proven capability in production
experiments -- as the foundation of POOL event data
storage.
== Current process for managing risk: 
The experiments provide performance and scalability requirements
to POOL which are used as the basis of measurable acceptance
tests incorporated into POOL testing and release procedures.
POOL releases must meet the specified requirements. The
frequent release schedule, and the ongoing reassessment of
requirements in the experiments, ensures that any problems
with scalability and performance will show up early.
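
A measurable acceptance test of this kind can be sketched as a comparison of measured values against required minimums. The sketch below is illustrative only and does not reproduce the POOL release procedure: the `Requirement` class, the metric names and every number are hypothetical, and the measurements are supplied by hand rather than taken from a real benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    metric: str      # name of the measured quantity
    minimum: float   # smallest acceptable value


def failed_requirements(requirements, measured):
    """Return the metrics that are missing or below their minimum."""
    failures = []
    for req in requirements:
        value = measured.get(req.metric)
        if value is None or value < req.minimum:
            failures.append(req.metric)
    return failures


def release_accepted(requirements, measured):
    """A release is accepted only if every requirement is met."""
    return not failed_requirements(requirements, measured)


# Hypothetical requirements, as an experiment might state them.
requirements = [
    Requirement("write_mb_per_s", 50.0),
    Requirement("read_mb_per_s", 100.0),
    Requirement("max_files_in_catalog", 1_000_000),
]

# Hypothetical measurements for two candidate releases.
good_release = {
    "write_mb_per_s": 62.0,
    "read_mb_per_s": 140.0,
    "max_files_in_catalog": 2_500_000,
}
slow_release = {
    "write_mb_per_s": 41.0,
    "read_mb_per_s": 140.0,
}
```

Treating a missing measurement as a failure means a release cannot pass simply because a test was not run.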
== Future options for managing risk: 
Same.
== Crisis strategy: 
The strategy described should ensure that any failure surfaces
early enough to be addressed by priority development work and
solved before the problem reaches crisis proportions.
== Actions required: 
As described.

==== POOL catalog management performance failure
POOL file catalog management fails to meet the performance and
scalability requirements of the experiments.
== Likelihood/impact/overall risk: 2/3/2
The strategy here involves use of replica location services
provided by the grid projects, which carries some inherent
risk because of the immaturity of these products. 
== Current process for managing risk: 
As for 'Grid middleware/infrastructure failure'.
== Future options for managing risk: 
The long term risk here is small because Oracle, known to
provide the needed performance, is available and can be
used directly by POOL if necessary to meet the performance
and scalability requirements where they are most severe,
at the Tier 0 (CERN), and elsewhere if necessary.
== Crisis strategy: 
Rely more directly on proven RDBMS tools rather than grid
middleware intermediaries if necessary.
== Actions required: 
As described.

==== Overlooking a risk
One of the more prominent risks is the risk of overlooking
a risk.
== Likelihood/impact/overall risk: 4/2/1
While some risk is likely to be overlooked, it is much
less likely that a major risk will be missed, leaving a
major failure mode unconsidered.
== Current process for managing risk: 
Have the risk assessment widely reviewed. This assessment
has been reviewed by the Architects Forum, and this risk
is the one that was added as an outcome of the review.
== Future options for managing risk: 
Post and periodically review the risk assessment.
== Crisis strategy: 
A crisis can arise if an overlooked major risk occurs.
The crisis could be of any nature but would probably be
a technical deficiency or major bug requiring manpower
redirection and possible schedule stretch.
== Actions required: 
Necessary actions of risk review have been taken.


Contact: T. Wenaus