| LHC Computing Grid Project | |
| LCG Applications Area Risk Analysis | |
| LCG Applications Area | |
| Risk summary |
| Key |
Likelihood:
1 - never expected to happen
2 - could happen but very unlikely
3 - could well happen at some point
4 - will probably happen

Impact:
1 - we can deal with it, no problem
2 - a bit of a hassle but not too bad
3 - clearly can be dealt with, but with significant effort
4 - crisis

Overall risk:
1 - very unlikely to have significant adverse impact on the project
2 - unlikely to have significant adverse impact on the project
3 - could have a significant adverse impact on the project
4 - would be a crisis, and may happen
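Each risk below carries a rating written as three numbers, likelihood/impact/overall risk (for example 3/4/4). As an illustration only, and not part of the original analysis, the following minimal Python sketch decodes such a triple against the key. The names `LIKELIHOOD`, `IMPACT`, `OVERALL` and `describe_rating` are hypothetical. The document gives no formula for deriving the overall value from the other two, so the sketch treats it as an independently assigned number.

```python
# Illustrative sketch: decode a "likelihood/impact/overall" rating using the key.
# All names here are hypothetical; the wording is copied from the key.

LIKELIHOOD = {
    1: "never expected to happen",
    2: "could happen but very unlikely",
    3: "could well happen at some point",
    4: "will probably happen",
}

IMPACT = {
    1: "we can deal with it, no problem",
    2: "a bit of a hassle but not too bad",
    3: "clearly can be dealt with, but with significant effort",
    4: "crisis",
}

OVERALL = {
    1: "very unlikely to have significant adverse impact on the project",
    2: "unlikely to have significant adverse impact on the project",
    3: "could have a significant adverse impact on the project",
    4: "would be a crisis, and may happen",
}


def describe_rating(triple: str) -> dict[str, str]:
    """Translate a rating such as "3/4/4" into the key's wording."""
    parts = triple.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated values: {triple!r}")
    # int() raises ValueError on non-numeric input.
    likelihood, impact, overall = (int(p) for p in parts)
    try:
        return {
            "likelihood": LIKELIHOOD[likelihood],
            "impact": IMPACT[impact],
            "overall": OVERALL[overall],
        }
    except KeyError as exc:
        # Any value outside the 1-4 scale is rejected.
        raise ValueError(f"rating out of range 1-4: {triple!r}") from exc


if __name__ == "__main__":
    for name, text in describe_rating("3/4/4").items():
        print(f"{name}: {text}")
```

Keeping the three scales as separate lookup tables mirrors the key: the overall value is looked up on its own scale rather than computed from likelihood and impact.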
| Risks, not in a specific order |
==== Inadequate third party software
Inadequacy of third party software in terms of functionality, performance or timely availability of required new features.
== Likelihood/impact/overall risk: 3/2/1
We have to anticipate that this could happen, so we select third party software very carefully; if it fails nonetheless, we have an architecture that promotes ease of replacement.
== Current process for managing risk:
Establish a thorough process for vetting third party software against strict acceptance requirements for quality, functionality, support, future prospects etc. Also, follow the component architecture of the LCG applications blueprint, by which specific tools and implementations can be replaced by alternatives.
== Future options for managing risk:
Do not adopt third party software without applying the process.
== Crisis strategy:
Identify and integrate a replacement with the required functionality, again following the acceptance process.
== Actions required:
The acceptance process and policy are now in development.

==== Third party software no longer available
Third party software used by LCG software ceases to be available.
== Likelihood/impact/overall risk: 3/2/1
We have to anticipate that this could happen, so we have an architecture that promotes ease of replacement. However, an important criterion for adopting a third party product in the first place is that it have a broad user community, and generally that it be open source, which makes its abandonment less likely.
== Current process for managing risk:
Follow the component architecture of the LCG applications blueprint, by which specific tools and implementations can be replaced by alternatives.
== Future options for managing risk:
Same.
== Crisis strategy:
Identify and integrate a replacement with the required functionality, following the acceptance process.
If necessary, assess whether the third party product that is no longer externally supported can be supported internally (unlikely, and not really in our general strategy).
== Actions required:
None.

==== Software obsolescence
Technology and/or language evolution renders LCG software obsolete.
== Likelihood/impact/overall risk: 4/2/1
We know this will eventually happen, and accommodating it smoothly is a principal driver for our architecture and software design approaches.
== Current process for managing risk:
Follow the component architecture of the LCG applications blueprint, by which specific tools and implementations can be replaced by alternatives. Adhere to the requirement of the blueprint that language evolution be supported by designing with the potential for evolution in mind, and assessing the evolvability of our software and architecture -- specifically, using Java as the currently most likely language target.
== Future options for managing risk:
Monitor trends in technology and language evolution.
== Crisis strategy:
Retire obsolescent components, replacing them with new ones. The blueprint architecture is designed to keep the difficulty and disruptiveness of this low.
== Actions required:
None.

==== Deployment failure
LCG software deployment on distributed testbed and/or production facilities fails.
== Likelihood/impact/overall risk: 1/3/1
This is highly improbable because of the frequent incremental testing and deployment we do in progressively larger scale settings. By the time we are deploying on a distributed system, we should not be doing anything we haven't done before.
== Current process for managing risk:
Convenient deployability of our software is an architectural requirement of the LCG applications software blueprint, which we follow. We develop and test in a very tight iteration loop to expose problems early. New releases of our software are rapidly picked up and exercised in testbeds and (soon) real-world settings in the experiments.
Our release schedules are set so that we are ready to deploy test setups early on testbeds and production environments, so that they are exercised and problems are fixed before real deployment is required.
== Future options for managing risk:
Same.
== Crisis strategy:
Address the problem preventing deployment with top priority, and incorporate testing for the problem in future regression testing.
== Actions required:
Those described above.

==== Platform incompatibility
LCG software fails to function on important deployment platforms.
== Likelihood/impact/overall risk: 1/2/1
This is highly improbable because deployment platforms are always included in our software releases.
== Current process for managing risk:
Applications area software is built for use on the platforms that the experiments specify as required for deployment. So long as we have an accurate list, our software will be required to function on the deployment platforms in order to be approved for release. Further, we will build and test our software on a range of other platforms which may in the future become deployment platforms, so that future platform migration does not catch us off guard with platform or compiler incompatibilities in our software.
== Future options for managing risk:
Same.
== Crisis strategy:
Address the problem resulting in failure with top priority, and incorporate testing for the problem in future regression testing.
== Actions required:
As described above.

==== Experiment discord
Inability to sustain sufficient agreement between experiments to continue a common software development project.
== Likelihood/impact/overall risk: 2/3/2
Possible, but based on the experience so far, the experiments are determined to see common software development succeed.
== Current process for managing risk:
Involve all experiments closely in oversight, requirements gathering, management, software development, and deployment and testing so that any emerging problems are identified and addressed early. This is done through the structure and operation of the SC2, RTAGs, PEB, Architects Forum, and the openness of the applications area software development effort. Be flexible in adapting the requirements and functionality to the evolving needs of the experiments.
== Future options for managing risk:
Same.
== Crisis strategy:
Be as flexible as possible, and escalate through the management to address the problem at an effective level. If it is unresolvable, cancel the project.
== Actions required:
As described above.

==== Experiment departure
Software development effort is lost when a project fails to meet objectives, resulting in one or more experiments abandoning the project to develop in-house solutions.
== Likelihood/impact/overall risk: 2/3/1
Possible, but based on the experience so far, the experiments are determined to see common software development succeed.
== Current process for managing risk:
Involve all experiments closely in oversight, requirements gathering, management, software development, and deployment and testing so that any emerging problems are identified and addressed early. This is done through the structure and operation of the SC2, RTAGs, PEB, Architects Forum, and the openness of the applications area software development effort. Be flexible in adapting the requirements and functionality to the evolving needs of the experiments.
== Future options for managing risk:
Same.
== Crisis strategy:
Be as flexible as possible, and escalate through the management to address the problem at an effective level. If it is unresolvable, reassess and rescope the project in light of the changed manpower profile and user base.
== Actions required:
As described above.

==== Integration failure
Failure to integrate software components into a coherent architecture and framework.
== Likelihood/impact/overall risk: 1/2/1
Highly unlikely because the most basic precept of the architecture blueprint we are following is to design in the context of a coherent architecture and framework.
== Current process for managing risk:
Follow the architecture blueprint, which prescribes a coherent architecture and framework. Test coherence and architectural consistency continuously by using components together as soon as and wherever possible.
== Future options for managing risk:
Same.
== Crisis strategy:
Fix the failure with top priority and prevent recurrence by reassessing and revising design, development and integration testing practices.
== Actions required:
Follow the blueprint.

==== Requirements/functionality mismatch
Mismatch of delivered product functionality to experiment requirements.
== Likelihood/impact/overall risk: 2/2/1
Highly unlikely except through inadequate communication.
== Current process for managing risk:
Experiments are centrally involved at all levels of the project (oversight, requirements, management, execution, testing and validation) to avoid this. Perhaps the greatest risk would come from inadequate communication between an experiment as a whole and those members providing the communication and involvement with the LCG. To avoid (or expose) this we strive for early take-up of LCG software in the experiments so that it is deployed in real-world settings to end users, where deficiencies will be exposed.
== Future options for managing risk:
Expand formalized software validation efforts involving the experiment user communities.
== Crisis strategy:
Incorporate required software revisions or extensions with high priority into the work plan. Work with the experiments to develop interim workarounds.
== Actions required:
As described.

==== Product take-up failure
Take-up of products fails in one or more of the experiments expecting to use LCG software.
== Likelihood/impact/overall risk: 2/2/2
Unlikely unless there is a communication failure as in the previous risk, or unless there is a strategic redirection in the experiment that makes a planned LCG product inapplicable.
== Current process for managing risk:
As for requirements/functionality mismatch. The close involvement of the experiment architects and software leadership in the applications area should prevent unanticipated strategic redirections that would have this effect.
== Future options for managing risk:
Same.
== Crisis strategy:
Assess whether changes in the product that are practically realizable can make take-up possible.
== Actions required:
As described.

==== Licensing limitations
Software license constraints imposed by LCG funding agreements will prevent us from fully exploiting open source software.
== Likelihood/impact/overall risk: 3/4/4
At present, this looks like a real risk. If it happens it is a disaster.
== Current process for managing risk:
The applications area cannot function if we cannot use (in particular) GPL-licensed software. We depend on high quality, robust, extremely widely used open source products which would take major development efforts over years to match. We are addressing this by working out an arrangement by which our licensing is compatible with using GPL-licensed software.
== Future options for managing risk:
Should not be a risk for much longer.
== Crisis strategy:
Underway -- get a GPL-compatible license in place.
== Actions required:

==== Loss of ROOT team
The small ROOT development team evaporates or is otherwise impaired.
== Likelihood/impact/overall risk: 2/3/2
The smallness of the core ROOT team makes this a non-negligible risk, despite the demonstrated commitment of the team over many years.
== Current process for managing risk:
The ROOT team at CERN has been properly recognized, supported and consolidated in a well defined EP/SFT section. While the core ROOT team is very small, its user community is huge and includes many who are deeply familiar with ROOT and contribute to its development. If the core ROOT team were impaired, the whole HENP community would be highly motivated to identify and pool resources to repair the team. If ROOT or a part thereof were nonetheless rendered unusable, we would exploit the implementation-neutral component architecture of applications area software to make the introduction of an alternative implementation tractable. Considerable development work would likely be required, however, to recover the needed functionality.
== Future options for managing risk:
The ROOT team is working to lessen the risk in areas of very high complexity that are understood by few, for example the CINT system and associated dictionary. We are developing a plan by which ROOT, CINT and LCG software will use a common dictionary in the future.
== Crisis strategy:
As described.
== Actions required:
As described.

==== Missing personnel
Promised personnel commitments fail to materialize or fall short.
== Likelihood/impact/overall risk: 3/3/3
This could well happen.
== Current process for managing risk:
Scope deliverables and timelines to what can be achieved with identifiable manpower. Prioritize deliverables so that they can be descoped in an orderly way if expected manpower does not materialize.
== Future options for managing risk:
Same.
== Crisis strategy:
Descope deliverables according to priorities and stretch timelines where possible.
== Actions required:
As described.

==== Staffing continuity failure
Continuity failure in LCG staffing (e.g. late or inadequate phase 2 support) reduces software development and support effort.
== Likelihood/impact/overall risk: 3/3/3
Given the present horizon on LCG funding, pending a clear plan for phase 2, this is a real risk, and potentially a very serious one.
== Current process for managing risk:
Scope projects to deliver full software products by the end of LCG phase 1. Support and incremental development requirements will remain, however, and must be addressed by phase 2 support. Involve experiments closely in the development efforts so that product expertise for further development and support resides within the experiments.
== Future options for managing risk:
Clarify phase 2 funding and staffing.
== Crisis strategy:
Rely to a greater extent on the experiments for product support and incremental development, although this is at odds with the severely constrained manpower levels currently found in the experiments.
== Actions required:
As described.

==== Software delivery failure by work teams
A work package or institute team is unable to deliver promised software.
== Likelihood/impact/overall risk: 4/3/2
Could well happen.
== Current process for managing risk:
Develop and sustain development teams of diverse capability and expertise so that shortfalls can be made up elsewhere with reassigned manpower. We have already seen this happen to some degree. It is an expected part of software management.
== Future options for managing risk:
Same.
== Crisis strategy:
As described.
== Actions required:
As described.

==== Duplication of work
Software is developed which duplicates other work or is rendered irrelevant by other work or changing requirements.
== Likelihood/impact/overall risk: 4/1/1
We know this happens on occasion because some development work is exploratory, investigating alternative approaches which have potential, with the most effective solution ultimately being chosen.
== Current process for managing risk:
Develop specific, well-defined work plans and ensure they are reviewed broadly and thoroughly so that real and possible duplications are identified and understood, and either agreed on as exploratory work or removed from the program. The mechanisms and processes to achieve this exist and are followed: work plan development, review and approval take place using project management, the Architects Forum, the PEB and the SC2.
== Future options for managing risk:
Same.
== Crisis strategy:
Terminate redundant work when its cost is determined to exceed its potential value.
== Actions required:
As described.

==== Grid middleware/infrastructure failure
Grid middleware projects and/or the distributed computing infrastructure fail to deliver distributed components and capability that meet requirements.
== Likelihood/impact/overall risk: 3/3/3
This can well happen because of the immature and (in some cases) R&D nature of grid middleware development.
== Current process for managing risk:
Alternate solutions for the short term are developed where this risk is appreciable, removing grid middleware from the critical path. An example is the MySQL implementation of the POOL file catalog that was done pending the maturation of grid replica location services. Once the grid middleware is off the critical path, it can be iteratively tested and ultimately incorporated when it is robust and proven.
== Future options for managing risk:
Same.
== Crisis strategy:
If a grid middleware component fails, back off to an interim solution again until the middleware problem is solved.
== Actions required:
As-necessary development of simple interim solutions as described.

==== POOL data storage performance failure
POOL event data storage fails to meet the performance and scalability requirements of the experiments.
== Likelihood/impact/overall risk: 1/3/1
Highly unlikely because of the ongoing (re)assessment of requirements and the incorporation of those requirements into POOL release acceptance tests, and because of the use of ROOT I/O -- a tool of proven capability in production experiments -- as the foundation of POOL event data storage.
== Current process for managing risk:
The experiments provide performance and scalability requirements to POOL, which are used as the basis of measurable acceptance tests incorporated into POOL testing and release procedures. POOL releases must meet the specified requirements. The frequent release schedule and the ongoing reassessment of requirements in the experiments ensure that any problems with scalability and performance will show up early.
== Future options for managing risk:
Same.
== Crisis strategy:
The strategy described should ensure that a failure occurs early enough that it can be addressed by priority development work and solved before the problem reaches crisis proportions.
== Actions required:
As described.

==== POOL catalog management performance failure
POOL file catalog management fails to meet the performance and scalability requirements of the experiments.
== Likelihood/impact/overall risk: 2/3/2
The strategy here involves use of replica location services provided by the grid projects, which carries some inherent risk because of the immaturity of these products.
== Current process for managing risk:
As for 'Grid middleware/infrastructure failure'.
== Future options for managing risk:
The long term risk here is small because Oracle, known to provide the needed performance, is available and can be used directly by POOL if necessary to meet the performance and scalability requirements where they are most severe, at the Tier 0 (CERN), and elsewhere if necessary.
== Crisis strategy:
Rely more directly on proven RDBMS tools rather than grid middleware intermediaries if necessary.
== Actions required:
As described.

==== Overlooking a risk
One of the more prominent risks is the risk of overlooking a risk.
== Likelihood/impact/overall risk: 4/2/1
While this is likely to happen, it is much less likely that a major risk will be overlooked, leaving a major failure mode unconsidered.
== Current process for managing risk:
Have the risk assessment widely reviewed. This assessment has been reviewed by the Architects Forum, and this risk is the one that was added as an outcome of the review.
== Future options for managing risk:
Post and periodically review the risk assessment.
== Crisis strategy:
A crisis can arise if an overlooked major risk occurs. The crisis could be of any nature but would probably be a technical deficiency or major bug requiring manpower redirection and a possible schedule stretch.
== Actions required:
Necessary actions of risk review have been taken.
Contact: T. Wenaus