
Applying GQM in an industrial software factory

Alfonso Fuggetta, Luigi Lavazza, Sandro Morasca, Stefano Cinti, Giandomenico Oldano, Elena Orazi
Politecnico di Milano and CEFRIEL Via Fucini, 2 - 20133 Milano, Italy Phone: +39-2-239541 E-mail: {fuggetta|lavazza|morasca}@elet.polimi.it

Digital Equipment S.p.A. V.le F. Testi, 280/6 - 20126 Milano (Italy) Phone: +39-2-6618-1 e-mail: {stefano.cinti|giandomenico.oldano|elana.orazi}@digital.com

Abstract

Goal/Question/Metric (GQM) is a paradigm for the systematic definition, establishment, and exploitation of measurement programs supporting the quantitative evaluation of software processes and products. Although GQM is a quite well-known method, detailed guidelines for establishing a GQM program in an industrial environment are still limited. Also, there are few reported experiences on the application of GQM to industrial cases. Finally, the technological support for GQM is still inadequate. This paper describes the experience we have gained in applying GQM at Digital laboratories in Italy. The procedures, experiences, and technology that have been employed in this study are largely reusable by other industrial organizations willing to introduce a GQM-based measurement program in their development environments.

Categories: D.2.2 [Software Engineering] Tools and Techniques - Computer-aided software engineering (CASE); D.2.8 [Software Engineering] Metrics - Performance measures; D.2.9 [Software Engineering] Management - Productivity; Software quality assurance (SQA)

This work was partly supported by ESSI Project n. 10358 CEMP - Customized Establishment of Measurement Programmes, funded by the European Union.

1. INTRODUCTION

Software is becoming increasingly complex and critical: the quality of software products is the major concern of both software producers and users. It has been argued that the quality of software products heavily relies on the quality of the process used to conceive, design, develop, deploy, and maintain them. This observation has drawn increasing attention to methods and techniques for the evaluation of software development process quality and maturity. Significant examples of these efforts are the Capability Maturity Model (CMM) of the Software Engineering Institute [1] and the Quality Improvement Paradigm (QIP) of the University of Maryland [2]. The ultimate goal of these process-oriented methodologies is to support the assessment of the process, its improvement, and, consequently, the development and deployment of better software products.

One of the most important factors for enabling and supporting any improvement initiative is the availability of effective software metrics that make it possible to quantitatively evaluate the quality level of products and development processes [3]. Reliable metrics provide evidence of improvements, allow cost-benefit analysis, and provide the basis for decision making. The selection and exploitation of metrics for a specific evaluation problem are complex processes. Methodological and technological aids are needed to guide and support the work of the software engineer, to suggest which steps to accomplish and how to structure and manage the knowledge that is progressively constructed on the development process and software products. We will refer to these activities with the term measurement process, to distinguish it from the development process that is being observed and studied. The objective of the measurement process is to define and operate a measurement program, i.e., a context-specific set of metrics, and the related guidelines and procedures that must be followed to collect and analyze them. GQM [2] [4] is a method to guide the definition and exploitation of a goal-driven measurement program.

This paper reports experiences and results derived from the application of GQM to an industrial software process at the Digital Software Engineering Center in Gallarate (Italy). The study has been conducted with the support of the European Union, as part of the ESSI project CEMP (Customized Establishment of Measurement Programs). The main objectives of the CEMP project were to assess the practical applicability of GQM in industrial environments, and to evaluate the cost-effectiveness of establishing a measurement program based on GQM. The study was carried out by three different industrial software development organizations (namely, Bosch, Digital Italia, and Schlumberger), in order to take into account cultural differences and approaches, and the variations in costs and effort spent across three different companies operating in four countries. The role of the academic partners, University of Kaiserslautern and CEFRIEL, was centered on providing support to the software development organizations and pursuing the dissemination of results.

In this paper, we show a number of empirical results and lessons learned concerning the introduction of a systematic measurement program such as GQM in an industrial

setting. These include experiences, insights, and guidelines for the introduction of a GQM-based measurement program, as well as an assessment of the related costs. The focus and goal of this paper are not the results of the application of GQM in terms of impact on the organization's bottom line, productivity, or quality. Instead, the goal was to gain experience in assessing the problems and costs of establishing a measurement program.

Consistently, the paper is organized as follows. Section 2 concisely presents the main characteristics of QIP and GQM. Section 3 presents a summary of related experiences and puts our work into perspective, by briefly pointing out the main results that have been achieved. Section 4 presents the software development process used at Digital and the context where the work has been conducted. Section 5 reports on the application of GQM in Digital. Section 6 presents a detailed evaluation of the work done and of the main results. Finally, Section 7 draws some conclusions and outlines directions for future work. The appendices report details of the Digital process, the GQM plan definition, a tool we developed to support the definition of the GQM plan, and the GQM data collection forms.

2. THE METHODOLOGY

The Quality Improvement Paradigm (QIP) is a framework for guiding and supporting the improvement of software processes and products. GQM is one of the main components of QIP, since it provides a method for defining and exploiting the measurement program that makes the evaluation of improvements possible [4] [2]. GQM has been proposed and applied as a systematic technique for developing a measurement program for software processes and products. GQM is based on the idea that measurement should be goal-oriented, i.e., all data collection in a measurement program should be based on an explicitly documented rationale.

2.1 The GQM Process

Establishing a GQM-based measurement program, performing the measurements, and collecting and analyzing data is a complex process. GQM suggests a specific process to guide software engineers in creating and operating the measurement program. The structure of the GQM process reflects the phases of QIP, since the creation and operation of the measurement program are essential components of any improvement activity. Figure 1 shows the seven main activities of this process:

- Prestudy. This activity aims at collecting information on the context where the measurement activity has to be carried out. One must identify preconditions and constraints, strategic objectives of the company, existing experiences and data, and characteristics of the organization, product, and market.
- Identification of GQM goals. Based on the description of the context, a set of goals for the improvement activity is defined, and the goals are ranked according to their relevance and importance to the organization's strategy.

- Production of the GQM plan. The identified goals are used as the starting point to create the GQM plan, i.e., a structured document where each goal is associated with the set of metrics needed to achieve it.
- Production of the measurement plan. The measurement plan is the operational counterpart of the GQM plan: the latter indicates what data we may want to collect, while the former indicates how the data collection activity has to be carried out.
- Collection and validation of data. This activity aims at collecting and evaluating process and product data, according to the GQM and measurement plans.
- Analysis of data. Collected data are analyzed to understand and evaluate the level of accomplishment of the different goals that were initially selected.
- Packaging of experiences. The experiences gained in conducting the measurement activity are packaged so that they can be reused in future projects.
[Figure 1 is a flow diagram: each activity (Prestudy; Identification of GQM goals; Production of the GQM plan; Production of the measurement plan; Collection and validation of data; Analysis of data; Packaging of experiences) produces an artifact (context characterization, GQM goals, GQM plan, measurement plan, validated data, results of the evaluation, experience packages).]

Figure 1: The GQM process.

For illustration purposes, we represented the GQM process as a linear sequence of activities. In practice, loops are possible (and in some cases needed) to refine and revise the outcomes of previously completed activities according to new information that has been identified in the meantime. Also, the experience packages obtained at the end of a specific execution of the GQM process are used in subsequent executions of the GQM process.
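To keep track of where a measurement program stands, the activities listed above and the artifacts shown in Figure 1 can be written down as a simple checklist. The sketch below is purely illustrative: the pairing of activities and artifacts follows Figure 1, while the data structure itself is an assumption of ours, not part of the GQM definition.

```python
# Illustrative checklist only: the seven GQM process activities and the
# artifact each one produces, as depicted in Figure 1.
GQM_PROCESS = [
    ("Prestudy",                           "context characterization"),
    ("Identification of GQM goals",        "GQM goals"),
    ("Production of the GQM plan",         "GQM plan"),
    ("Production of the measurement plan", "measurement plan"),
    ("Collection and validation of data",  "validated data"),
    ("Analysis of data",                   "results of the evaluation"),
    ("Packaging of experiences",           "experience packages"),
]

for step, (activity, artifact) in enumerate(GQM_PROCESS, start=1):
    print(f"{step}. {activity} -> {artifact}")
```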

2.2 The GQM plan

The most important concept and product of the GQM paradigm is the GQM plan, produced to define the set of metrics used to reach the organizational goals. The GQM plan is produced through hierarchical refinements. The goals selected in Step 2 of the GQM process constitute the top level of the GQM plan. Goals are defined in terms of the following entities:

- Object of study: the part of reality that is being observed and studied.
- Purpose: the motivation for studying the object.
- Quality focus: the object characteristics that are considered in the study.
- Viewpoint: the person or group of people interested in studying the object.
- Environment: the application context where the study is carried out.

Each goal is associated with an Abstraction Sheet (Level 2 of the GQM plan), which is composed of four parts:

- Quality focus: provides additional details on the object characteristics to study.
- Variation factors: specify process and product characteristics that may affect the quality focus.
- Baseline hypotheses: characterize the current status of the object of study with respect to the quality focus. They describe the initial beliefs of the observer concerning the quality focus described above.
- Impact on baseline hypotheses: describes how the variation factors are expected to affect the current state of the object of study.

The abstraction sheet is an extension of the original GQM plan structure that has been proposed to address some weaknesses of the approach (see Sections 3 and 6.1.2). From the abstraction sheet, a set of questions is derived (Level 3 of the GQM plan). These questions must be answered in order to understand if and how goals have been reached. Questions are a more detailed view of the abstraction sheet. Finally, a set of metrics is derived from each question (Level 4 of the GQM plan). These metrics are used to collect the data that will be used to answer the questions that have been raised. A significant part of a GQM plan is presented in Appendix 2.
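To make the four levels concrete, the following sketch models the goal template, the abstraction sheet, the questions, and the metrics as plain data classes, and instantiates a simplified rendering of Goal 1 from Section 5.1. The class layout and field names are an illustrative assumption, not the structure used by the GQM tool described later in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Metric:                 # Level 4: data to be collected
    name: str

@dataclass
class Question:               # Level 3: must be answered to assess the goal
    text: str
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class AbstractionSheet:       # Level 2: bridge between the goal and the questions
    quality_focus: List[str] = field(default_factory=list)
    variation_factors: List[str] = field(default_factory=list)
    baseline_hypotheses: List[str] = field(default_factory=list)
    impact_on_hypotheses: List[str] = field(default_factory=list)

@dataclass
class Goal:                   # Level 1: the GQM goal template
    object_of_study: str
    purpose: str
    quality_focus: str
    viewpoint: str
    environment: str
    abstraction_sheet: Optional[AbstractionSheet] = None
    questions: List[Question] = field(default_factory=list)

# A simplified, hypothetical rendering of Goal 1 (see Section 5.1):
goal1 = Goal(
    object_of_study="design and qualification phases of the development process",
    purpose="evaluating failure detection effectiveness",
    quality_focus="failure detection",
    viewpoint="management and development team",
    environment="DEC FUSE 2.0 development",
    abstraction_sheet=AbstractionSheet(
        baseline_hypotheses=["field test detects few failures relative to its cost"]),
    questions=[Question("How many failures are detected in each phase?",
                        [Metric("failures detected per phase")])],
)
```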

3. RELATED WORK

To appreciate the contribution of this paper, we summarize some significant related work in the area. Section 3.1 introduces basic concepts of empirical software engineering, and compares the work described in this paper with the basic categories of empirical studies. Section 3.2 presents some general experiences on the establishment of measurement programs. Section 3.3 discusses experiences and criticisms specifically related to GQM. Section 3.4 relates the work discussed in the paper with the body of knowledge and experiences presented in this section.

3.1 Empirical Software Engineering

Empirical software engineering is one of the most challenging and critical areas of modern software engineering. The goal of empirical software engineering is to build a credible empirical base that is of value to both professional developers and researchers [6]. As argued in [6], empirical software engineering inherits most of the methodological approaches and techniques of the social sciences [7], since its goal is to observe complex social settings [8], i.e., contexts where the interaction among humans is the critical factor that determines the quality and effectiveness of the results being produced. In particular, the empirical work is accomplished through the execution of empirical studies, i.e., observations of specific settings, with the purpose of collecting and deriving useful information on their behavior. Empirical studies can be classified into three categories, according to the increasing degree of confidence in the results of the study [9]:

1. Anecdotal studies
2. Case studies
3. Experiments

With respect to the above classification, the work presented in this paper can be considered a meta case-study, since we have accomplished a controlled and instrumented observation of a single, complex social setting. We use the term meta since the goal of the study was not directly focused on the evaluation of some development activity or technique used in the software development process observed. Rather, it was centered on the observation of the process followed to create a measurement program that can then be used to study the software development process. Consistently, the contribution of this paper refers to this meta-goal and not to the evaluation of a specific software engineering method, technique, or tool. For completeness and better evaluation of the benefits obtained with the introduction of a measurement program in an industrial setting, we also report results of the case study, i.e., the results we obtained by observing the software process being studied.

3.2 Establishing a Measurement Program

There are several contributions related to the establishment of measurement programs. Here, we present a few significant works to emphasize the most important issues that have been addressed in the past years. References [10] and [11] present general considerations on the role of metrics in software engineering research and on the methodological and scientific bases for defining and collecting data. These works argue about the adoption of elements of measurement theory in software engineering. The critical point is the degree of applicability of the measurement theory assumptions and principles in the software engineering domain. While the former contribution strongly supports the adoption of such principles, the latter argues that these principles should not be followed blindly, but common sense, combined with rigor, should always prevail; overly strict adherence

to these theoretical principles in practical situations may sometimes constitute a hindrance to the development of the discipline.

Reference [12] presents the establishment of a measurement program at Hewlett-Packard. The authors argue that consensus must be built at the management level around the establishment and exploitation of a measurement program. This is a precondition for the committed participation of the members of the organization, to sustain the costs related to the creation of the measurement program, and to take full advantage of the results of the measurement activity. A related important point is also raised in [14]. The author argues that it is necessary to link the establishment of a measurement program to the maturity level of an organization. Metrics are welcome only when they are clearly needed and easy to collect and understand.

Another significant contribution is represented by the Software Measurement Guidebook developed at the Software Engineering Laboratory [13]. It contains guidelines and suggestions on how to establish a measurement program. Even though strongly influenced by QIP and GQM (see also Section 3.3), it also contains general comments that are applicable even when these techniques are not adopted:

1. Data collection should not be the dominant element of process improvement; application of the measures is the goal.
2. The focus of a measurement program must be self-improvement, not external comparison.
3. Measurement data are fallible, inconsistent, and incomplete.
4. The capability to qualify a process or product with measurement data is limited by the abilities of the analysts.
5. Personnel treat measurement as an annoyance, not a significant threat.
6. Automation of measurement has limits.

3.3 Using GQM

GQM was initially conceived several years ago. An early version of the approach was presented in 1984 [15]. Even if GQM is quite well known, the amount of knowledge and experience on applying it is still limited. In particular, we envisaged a series of problems that can be summarized as follows:

- There are relatively few published experiences on the usage of GQM in industrial cases (see [16] as an example).
- The method is not yet defined in a fully precise and detailed way. Moreover, it is subject to changes and modifications. Thus, its application may be influenced and conditioned by the availability of GQM experts.
- There are few documented experiences on the evaluation of the cost of establishing a GQM-based measurement program.

- There is no specific technology that can be exploited to support the GQM-based measurement process, for instance, to collect and store information on the goals and characteristics of the development process being studied.

Some of these points have been discussed by David Card in [17]. He argues that GQM is really better used as a brainstorming technique. The critical aspects that Card emphasizes are basically the following:

1. The method is not repeatable. Two groups of people from the same organization, starting with the same goals, will arrive at different sets of questions and metrics.
2. The method is nonterminating. It is difficult to know when to stop posing questions and defining relevant metrics.
3. The method is not practical. Some of the questions that result from a GQM exercise can't be answered unless an organization changes the way it does business to make the necessary metrics available.

These criticisms have been addressed in [18] and [19]. The former argues that GQM has evolved over the years into a model-based approach. The approach now includes a goal-generation template that can be used to guide the definition of the GQM hierarchy (as illustrated in Section 2). Also, the author argues that the model is no less repeatable than any design methodology. The latter contribution argues that there are several heuristics that complement the basic GQM principles. The resulting set of guidelines is better than pure brainstorming. In spite of this argumentation, Card still responds that GQM is not a stand-alone technique.

An experience on using GQM that relates to the previous point is presented in [20]. It argues that there is a conceptual gap between the definition of the GQM goals and the selection of a useful set of metrics. In particular, the top-down approach suggested by GQM should be complemented by a bottom-up procedure for assessing and exploiting available raw data. From the raw data, questions can be generated and used to define intermediate sub-goals. In turn, these sub-goals can be used to refine the top-level goals. This intertwining of top-down and bottom-up approaches can be instrumental in filling the gap.

3.4 Contributions of this Paper

This paper presents an experience in establishing a measurement program using the GQM approach. Therefore, the paper shares some topics and issues with the papers discussed in Section 3.2 and some of those cited in Section 3.3. In summary, the main contributions of the paper are:

1. Experiences and suggestions related to the application of a detailed definition of the GQM process in an industrial software factory.
2. A detailed analysis of the cost of establishing a measurement program according to the GQM process.

These contributions are instrumental in establishing a baseline of experiences and data that can be used to increase the degree of guidance and support offered to users of the GQM approach.

4. THE CONTEXT: SOFTWARE DEVELOPMENT AT DIGITAL ITALIA

A measurement program must be designed and implemented to help a specific software development organization address its improvement objectives. The design of a measurement program must be based on a characterization of a variety of factors, such as the structure and management style of the organization, the roles of people, the goal of the measurement activity and its relationship with the company mission and objectives, the characteristics of products and processes, and the practices and technology used in the organization. These and other factors define the context where the design of the measurement program is going to take place. In this section we briefly outline the context of the case study, i.e., the development of FUSE 2.0 at the Digital Engineering Center in Gallarate.

4.1 The Process and the Product

The center in Gallarate is part of the Corporate Engineering Division, which is in charge of the development of all software products for Digital lines of computers2. Digital's software life cycle is essentially a variation of the waterfall approach. It is composed of six sequential phases:

- Phase 0 - Strategy and requirements collection. This phase is basically centered on the preparation of the product feasibility study. It aims at identifying market opportunities, product requirements, and alternative/viable solutions.
- Phase 1 - Analysis and planning. The goal of this phase is to develop the product specifications and the corresponding development plans.
- Phase 2 - Design and implementation. In this phase, the product is designed, coded, and tested (initial internal test).
- Phase 3 - Qualification. In this phase, the product is thoroughly tested with respect to the requirements established during Phase 1.
- Phase 4 - Production, distribution and service. This phase deals with the deployment and support of the product at the customers' sites.
- Phase 5 - Retirement. The product is no longer produced and distributed, but product assistance is still maintained to support clients with maintenance contracts.

2 Two years after the completion of this study, the organization of the engineering center in Gallarate was changed as part of an effort to move engineering groups closer to customers. The center has been moved to Milan and part of the development groups have joined the corporate product/market divisions.

A thorough analysis of the software process phases is beyond the scope of this paper. Appendix 1 reports a detailed description of Phases 2 and 3, which were the target of the measurement program developed in the CEMP project. Here, we just report the definition of baselevel. A baselevel is an intermediate release of the product for the purpose of testing. Baselevels are produced at fixed dates indicated in the development plan (generally every two weeks). The purposes of baselevels are to verify the consistency of the product with respect to requirements, identify failures, and indicate the trend of the development and testing process. The testing activity carried out on baselevels provides feedback to the development activity. At the end of the development, the final baselevel is released and used to prepare the Field Test Kit.

The project that was selected for the case study is DEC FUSE, version 2.0. DEC FUSE is an integrated software development environment, based on a selective broadcast mechanism [21].

4.2 Goals: Reliability and Reusability

Digital's development sites have been involved in a joint software metrics program since 1990, in order to monitor and foster technical excellence and time-to-profit results. In particular, the objective of the program is to support Digital software development managers in achieving the following goals:

- Reduce the number of defects in released products, by improving software quality assurance in all phases of the software development process.
- Identify opportunities and strategies for increasing the reuse of software artifacts.
- Improve the ability to control cost and schedule overruns.
- Facilitate and support compliance with ISO 9000 standards.

Therefore, the development process has been instrumented with several measuring points. The collected information is stored in a local repository under the control of each development manager. Examples of collected data are: planned vs. actual completion dates; business performance; perceived quality (determined through the number of assistance requests); number of faults fixed by the maintenance groups. In the context of this program, Digital developed a software system called Problem Tracking Tool (PTT) for storing failure information.

When Digital Italia's management decided to take further initiatives to control and reduce defect density and increase the amount of reused code in delivered products, it became quite clear that the set of metrics collected at Digital laboratories was not sufficient. Digital Italia's management selected the GQM paradigm as the systematic and structured approach for the definition, collection, and analysis of metrics.


5. HISTORY OF THE STUDY

The GQM-based measurement program was defined by a team of professionals that will be referred to with the term GQM team. The group of developers of FUSE will be identified with the term project team. The GQM team was composed of CEFRIEL consultants and selected members of the project team.

5.1 The GQM Plan for FUSE

The GQM team defined the following goals:

- Goal 1: Analyze the design and qualification phases of the development process for the purpose of evaluating failure detection effectiveness from the viewpoint of the management and the development team. In particular, the hypothesis that originated this goal is that field test (the way it was carried out) provided little benefit with respect to its (high) cost.
- Goal 2: Analyze the development process for the purpose of evaluating the correlation between failures and originating faults3 from the viewpoint of the development team. The accomplishment of this goal was expected to provide data on the distribution of faults in components4, the distribution of fault origins across the development phases, etc.
- Goal 3: Analyze the development process for the purpose of better understanding the fault removal activities from the viewpoint of the project leader. The purpose of this goal is to highlight the characteristics of faults and fault correction actions (e.g., in which phase each fault originated, why, by whom, how critical it was, how much it cost to fix it, etc.). Note that this goal shares several features with Goal 2; thus several questions are related to both goals.
- Goal 4: Analyze the delivered product for the purpose of evaluating its components with respect to the distribution of faults and failures from the viewpoint of the development team. This goal aims at understanding possible differences among the components of DEC FUSE that could affect the components' faultiness. Since the components were developed at two different sites, possible differences in the process are also considered.
- Goal 5: Analyze the product source code for the purpose of evaluating its actual level of reuse, reusability, and the knowledge needed for reuse from the viewpoint of the development team. This goal aims at understanding how much code is being reused, how difficult it is to reuse it, and how much code could possibly be reused (and at what price).
3 We adopt the terminology proposed by the IEEE: a failure is the observable problem that the user/tester perceives, while a fault is the defect that originated the incorrect behavior of the program.

4 In this project the term component is used to identify a tool of the FUSE environment. In other cases a different interpretation of the same term is possible (e.g., component could indicate a piece of software at a finer granularity).


These goals were all considered extremely important by Digital's management. For space reasons, we will discuss only the first goal, whose complete definition is reported in Appendix 2. The abstraction sheet for Goal 1 contains, among others, the following information:

- The quality focus is centered on the characterization of failure detection events. This involves the description of attributes like the phase in which the failure was detected, the component responsible for the failure, the criticality5 of the failure, the effort dedicated to the correction of the failure, etc.
- Variation factors include the way testing is carried out, the characteristics of the software (size, complexity, etc.), and the knowledge of the application domain.

Goal 1 has originated over 30 questions (described in detail in Appendix 2), which have in turn originated about 50 metrics. It can be noticed that the concepts and metrics identified in the GQM plan are rather trivial. This should not appear surprising or disappointing:

- The value of GQM does not reside in facilitating the creation of new metrics. Actually, the number of potentially collectable metrics is huge, and it is quite difficult to identify those that can really be instrumental in better understanding and improving the process. Indeed, the value of GQM (and of any other method of this kind) must be evaluated with respect to its ability to identify a significant and minimal set of metrics that are clearly related to the goals of the measurement activity.
- The value of a measurement program is not related only to its ability to discover new and unexpected findings. In many cases, the people working in the process do have an intuitive understanding of the problems that have to be addressed. What they often miss is solid evidence that can support and justify the definition and accomplishment of specific (and possibly expensive) process improvement initiatives.

The scope of Goal 1 is restricted to testing, although it is well known that earlier error elimination is more cost-effective. The emphasis on testing is due to Digital's needs and constraints: in a different context GQM could have been used to address other improvement strategies (e.g., new design tools, design and code inspections, static code analyzers, etc.).

5.2 The Measurement Plan

Once metrics have been selected, the next step in establishing a measurement program consists of defining the measurement plan, i.e., the set of data collection procedures. The measurement plan must specify how information has to be collected (e.g., manually

5 Failures are classified in five priority levels, ranging from 1 (showstopper) to 5 (suggestion). Priority levels are also called criticality in the rest of the paper.


vs. automatically), which support should be used (e.g., paper forms vs. software tools), who is requested to provide the information and when, who is supposed to perform validation, etc. Again, we will consider only Goal 1; similar considerations apply to the other goals.

As a first step, we defined a set of data collection forms, i.e., conceptual representations of the data to be collected. They are used as a starting point for the design of the data collection database, and to support the actual data collection activity, which is based on different/partial implementations of these forms (paper forms, electronic forms, interviews, ...). The forms contain both subjective questions (e.g., how well do developers know the product?) as well as objective questions (e.g., how many developers worked on a given component?). Three forms were created:

- The Product Form collects general information on the product. The project leader fills in a single copy of this form. The objective is to represent the features of the product as a whole (e.g., product size, total development effort, ...).
- The Component Form collects information on each component of the product. It contains questions concerning the code of the component (size, complexity, reused code, documentation, interfaces with other components, etc.), the developers, and the effort employed for the development. The project leader fills in one copy of the form for each component of the product. The objective is to characterize each component with respect to the baseline hypotheses and variation factors. For example, if a baseline hypothesis says that components developed from scratch contain up to 300% more bugs than new versions of existing components, it is clearly necessary to classify each component as developed from scratch or as a new version.
- The Problem Report Form collects data concerning failures and faults. A copy of this form is compiled by the maintainer every time a fault is corrected. In this form, some questions are used to classify failures, by identifying the component where they were found, the phase in which they were introduced, and their cause. Other questions are related to the techniques used to detect and remove faults. In particular, this form contains questions concerning the tools, the quality of the available documentation, and the effort employed.

Note that the forms are filled in over a time period that can be quite long. Some questions can be answered at the very beginning of the project, while others can be answered only at the end of the development, or even some time after the marketing of the product. A form collects all the information related to a particular topic (in this case the product components). The same form may thus span several goals, i.e., it can collect data related to questions that refer to different goals.

GQM emphasizes the difference between objective measures and subjective measures. The reason for stressing this difference is the underlying hypothesis that only objective measures can be collected automatically, since subjective measures require human


agents to perform evaluations or express opinions. Actually, we discovered that another, perhaps more relevant, difference is the age of data. Old data (i.e., data that have been collected before the definition of a GQM plan) usually need to undergo a requalification process that can only be carried out by human agents. In our study it was necessary to perform the following activities (a rough sketch of this re-qualification is given at the end of this subsection):

- Extract data from the company database. The useful information is rarely readily available: often it is scattered throughout the company database (sometimes in several databases), or it is mixed with other data that are not relevant as far as the GQM data are concerned.
- Interpret data. The retrieved data are only the starting point that makes it possible to reconstruct the desired information. Often, this step must be accomplished by some experienced developer.
- Validate data. Many data points must be discarded for a variety of reasons. For instance, they are duplicated or were generated under improper conditions.
- Enrich data. The collected data may happen to be inconsistent with the GQM plan. In this case the missing or erroneous information has to be reconstructed (usually resorting to people's knowledge and experience), or provided through a new data collection activity.

In pursuing the first goal related to FUSE, we basically reused the data contained in Digital's database, and we actually had to carry out the information revitalization actions described above. This effort also produced some benefits: we revealed the shortcomings and inadequacy of the current problem tracking data representation method. Although the measurement activity addressed the correct set of process and product characteristics (e.g., faults), the data describing these characteristics were generally incomplete or poorly structured. For example, there was no link between a fault description and the localization of the code affected by the corrective action. The project management realized the situation quite soon, and decided to improve the problem tracking database by modeling the collected data according to the forms described above.

Subjective measures were initially collected by means of interviews. This approach ensures that the interviewed people correctly understand the questions, and that they actually dedicate the required attention to answering them. It was also possible to illustrate and sometimes discuss the evaluation criteria, in order to get consistent answers.
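The following fragment gives a rough, purely illustrative picture of the validate and enrich steps applied to legacy problem reports; the record layout and field names are invented for the example and do not correspond to the actual PTT data.

```python
# Purely illustrative: re-qualifying legacy problem reports before they enter
# the GQM database. Field names ("report_id", "component", ...) are invented.
def requalify(records, known_components, known_phases):
    valid, discarded = [], []
    seen = set()
    for rec in records:
        key = (rec.get("report_id"), rec.get("component"))
        if key in seen:                          # validate: drop duplicates
            discarded.append(rec)
            continue
        seen.add(key)
        if rec.get("component") not in known_components:
            discarded.append(rec)                # validate: improper conditions
            continue
        if rec.get("detection_phase") not in known_phases:
            # enrich: missing information has to be reconstructed, typically by
            # asking an experienced developer; here we only flag it for follow-up
            rec["detection_phase"] = "TO BE RECONSTRUCTED"
        valid.append(rec)
    return valid, discarded

legacy = [
    {"report_id": 1, "component": "editor",  "detection_phase": "Qualification"},
    {"report_id": 1, "component": "editor",  "detection_phase": "Qualification"},
    {"report_id": 2, "component": "builder", "detection_phase": None},
]
ok, dropped = requalify(legacy, {"editor", "builder"}, {"Development", "Qualification"})
```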

5.3 The Data Collection Environment

When the project started, the only facility for storing process measures was the company archive of problem reports (PTT). This repository was assessed as not suitable for supporting the GQM activity. Therefore, a data collection environment centered on a relational DBMS was developed (Figure 2). Data are collected through paper forms, or produced directly by STW (the tool used by the testing team [22]), or submitted by the developers through e-mail messages. The latter procedure is mainly used to report failure detection or fault removal. Finally, the database receives data extracted from the company database.
[Figure 2 is a diagram of the data collection environment: paper forms, STW files, and e-mail messages feed a GQM-oriented GUI for GQM data input; a filter extracts relevant data from the company database; everything converges into the GQM database, which supports GQM data browsing and analysis.]

Figure 2: The Data Collection Environment.

During the execution of the case study, the GQM plans grew into quite intricate diagrams, including a few hundred metrics, some shared by up to three different questions. It was clear that keeping this amount of meta-data under control without automated support was a time-consuming and error-prone task. In particular, we realized that the execution of the GQM process could be significantly facilitated by making some support available to relieve the team from the burden of many clerical tasks, such as maintaining the definition of GQM goals, generating the DB schema from the GQM plan, and maintaining the relationships between the GQM plan and the collected data. For this reason, a tool was developed in parallel with the manual execution of the GQM process. The tool was then used to support an additional part of the study not discussed in this paper. A brief description of the GQM tool is given in Appendix 3. The main requirements and characteristics of the tool can be summarized as follows:

- The production and maintenance of large GQM plans and data collection forms is a complex activity, because there are many dependencies among different parts of the documents. Therefore, the tool was designed to provide support for creating, updating, and displaying GQM plans and forms. In addition, it is able to perform consistency checks, and it is also provided with reporting features.
- Goals tend to express qualities of a given product or process. They are therefore partly reusable in goals concerning the same product or process. For instance, the reliability goals described in Section 5.1 can be applied to other Digital projects. Since the ability to reuse (fragments of) existing GQM plans would significantly reduce


the cost of developing new GQM plans, the tool supports these reuse operations in a flexible and simple way.
- An important phase of the GQM process comprises the analysis and interpretation of collected data. These activities proceed bottom-up: the collected data are analyzed and aggregated to derive answers to the questions and to evaluate the degree of accomplishment of the related goals. Note that data are first analyzed by the GQM team, and then presented to the development team in the so-called feedback sessions [23], in order to evaluate the results and provide a uniform interpretation of the data. To support these activities, the tool maintains the links between the GQM plan (i.e., the metrics) and the corresponding collected data, and makes it possible to visualize/draw data using several chart formats.

As a side effect, the availability of a tool made the GQM process more flexible, by making it possible to incrementally formalize goals even before they were recognized as relevant, and to easily change and fine-tune the plan.
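A minimal sketch of how the GQM database could mirror the three data collection forms of Section 5.2 is shown below; the tables, column names, and query are assumptions made for illustration, not the schema actually used in the case study.

```python
import sqlite3

# Hypothetical schema mirroring the Product, Component, and Problem Report forms
# (table and column names are invented for this sketch).
schema = """
CREATE TABLE product (
    name TEXT PRIMARY KEY,
    total_size_kloc REAL,
    total_effort_pd REAL            -- person-days
);
CREATE TABLE component (
    name TEXT PRIMARY KEY,
    product TEXT REFERENCES product(name),
    size_kloc REAL,
    reused_code_pct REAL,
    from_scratch INTEGER            -- 1 = developed from scratch, 0 = new version
);
CREATE TABLE problem_report (
    id INTEGER PRIMARY KEY,
    component TEXT REFERENCES component(name),
    detection_phase TEXT,           -- e.g. Development, Qualification, Field Test
    origin_phase TEXT,
    priority INTEGER CHECK (priority BETWEEN 1 AND 5),
    correction_effort_pd REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
conn.execute("INSERT INTO product VALUES ('FUSE 2.0', NULL, NULL)")
conn.execute("INSERT INTO component VALUES ('editor', 'FUSE 2.0', NULL, NULL, 0)")
conn.execute(
    "INSERT INTO problem_report VALUES (1, 'editor', 'Qualification', 'Design', 3, 0.5)")

# The kind of query used to answer a Goal 1 question: how many failures
# were detected in each phase, broken down by priority?
for row in conn.execute(
        "SELECT detection_phase, priority, COUNT(*) "
        "FROM problem_report GROUP BY detection_phase, priority"):
    print(row)
```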

6. RESULTS AND EVALUATION OF THE CASE STUDY

The main results that have been achieved by applying GQM at Digital Italia can be summarized as follows:

1. A better comprehension and definition of the GQM process has been achieved. We contributed within the CEMP project to the creation of a comprehensive user guide that has been recently released by STTI [23].
2. A detailed evaluation of the costs related to the establishment of a GQM-based measurement program has been performed. The costs of the Digital case study have been compared with those of the other two GQM users (Bosch and Schlumberger) to discover similarities and differences (e.g., in the distribution of effort among different phases of the GQM process) [24] [25].

As for the Digital Italia software development process, several changes have been introduced. These changes were deemed significant by both the management and the developers. In particular, a better comprehension of the effectiveness of the qualification activities has been achieved, and consistent improvement initiatives have been accomplished.

6.1 The GQM Process

6.1.1 Structure of the GQM process

GQM has been criticized for its lack of structure and guidance. It has been considered a brainstorming technique, rather than a comprehensive method to guide the measurement process (see Section 3.3). The experience gained in the past years and, specifically, in the CEMP project has made it possible to create a detailed and improved description of the GQM process [23]. In this section we provide a few examples that illustrate this point.


[Figure 3 is a data-flow diagram of the prestudy (sub)process: starting from the existing measurement program documentation and experience, it identifies available inputs, preconditions and constraints; characterizes the organization and identifies organizational improvement goals; identifies and characterizes candidate application projects; and selects the project and identifies the project goals (using the project plan as an additional input).]

Figure 3: The prestudy (sub)process.

As a first example of a GQM process activity, let us consider the prestudy (sub)process (see Figure 3). In the existing GQM literature, the identification of improvement goals was considered a preparatory activity for the identification of candidate projects. The application of GQM in the CEMP project revealed that the definition of the improvement goals cannot strictly precede the characterization of the candidate projects. Indeed, the discussion of project details enriches the knowledge about the organization and its needs (see the timeline in Figure 4).
[Figure 4 shows the prestudy activities (identify inputs; characterize organization; identify improvement goals; identify and characterize candidates; select projects; identify project goals) overlapping along a timeline rather than occurring in strict sequence.]

Figure 4. Timeline for the prestudy (sub)process.


As a second example, let us consider the identification of GQM goals (see Figure 5).
[Figure 5 is a data-flow diagram of the goal identification (sub)process: using the existing measurement program documentation and experience, the description of the environment, the organization and project characterizations, the organizational improvement goals, and the project goals, the activities "specify measurement goals informally", "specify GQM goals", and "rank and select GQM goals" produce the informal measurement goals, the list of candidate GQM goals, and the selected GQM goals.]

Figure 5: Identification of GQM goals.

In a previous version of the GQM process model, the sequence of activities of this process was structured as follows:

1. describe informally all the measurement goals;
2. rank and select goals;
3. formalize the selected goals as GQM goals.

The case study carried out at Digital showed that:

- It is difficult and error-prone to base the goal ranking and selection on an informal definition.
- The formalization of goals requires that they are fully understood. This is instrumental in evaluating their feasibility, in reusing parts of other goals and existing measures, etc. Note that a complete formalization is not required, since, in general, it is sufficient to precisely define only the goal and the abstraction sheet.

A typical timeline for the goal identification process is represented in Figure 6. The process is composed of three activities, two of which (specify measurement goals informally and specify GQM goals) are iterated until it is possible to rank and select the most interesting goals (the figure shows two iterations of this cycle).



Figure 6: Timeline for the identification of GQM goals.

6.1.2 Structure of the GQM plan

As discussed in Section 3.3, it is often difficult to identify reasonable questions starting from a general definition of the measurement goals. In the CEMP project, we have successfully tested an extension of the original GQM template for goal generation. This extension (i.e., the abstraction sheet, see Section 2) was defined as the result of previous experiences in using GQM. It aims at bridging the gap between the goals and the questions, by summarizing for each goal the quality focus, the variation factors, the baseline hypotheses, and the expected impact of the variation factors on the baseline hypotheses (for a detailed example see Appendix 2). Abstraction sheets have turned out to be very effective in capturing relevant process/product information related to the goal being analyzed. They make it possible to integrate the goal definition with a range of information that can provide useful hints for the selection of questions. For instance, baseline hypotheses, i.e., what people believe is happening in a specific context, can easily be transformed into questions.
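As a concrete illustration, the fragment below pairs a (paraphrased) baseline hypothesis of Goal 1 with the question it suggests and the metrics needed to answer that question; the notation is an assumption made for the example, not part of the GQM plan itself.

```python
# Illustrative notation only: one (paraphrased) Goal 1 baseline hypothesis,
# the question it suggests, and the metrics needed to answer that question.
derivation = {
    "baseline_hypothesis": "Field test detects few failures relative to its cost",
    "question": "How many failures are detected in each phase "
                "(development, qualification, internal and external field test)?",
    "metrics": [
        "number of failures detected per phase",
        "testing effort per phase (person-days)",
    ],
}
```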
[Figure 7 sketches three goals, each with its abstraction sheet, refined into five questions that are in turn refined into four metrics; some questions are shared by more than one goal and some metrics answer more than one question.]

Figure 7: A typical GQM plan.

We have been able to confirm in our study that a GQM plan is represented by a hierarchy, not necessarily a tree. In particular, the same question may arise from the refinement of different goals, and the same metric may be used to answer several


questions, as depicted in Figure 7. This allows for the reuse of questions and metrics when introducing new goals into an existing GQM plan. In other words, the introduction of new goals in a GQM plan is in general not accompanied by the introduction of a whole new set of questions and metrics.

6.1.3 Creation of a GQM plan

The definition of the GQM-based measurement program at Digital Italia's laboratories benefited from some practices and situations that helped reduce its complexity and increase its effectiveness.

Understanding the process

The GQM team observed that they had just a qualitative knowledge of the Digital development process. In particular, the details of the process were not formally described. We realized that, to support the refinement and detailed characterization of the measurement goals, we needed to refine and enrich the high-level process model presented in Appendix 1. Based on this experience, we argue that the development of the GQM and measurement plans must be based on a comprehensive knowledge of the process details. Clearly, the level of refinement that is needed varies depending on the specific (part of the) plan being defined. However, it is certainly necessary to have a clear picture of the process structure. This is consistent with the observation reported in [14], according to which only simple metrics related to effort and project duration can be collected when the maturity of the process is low.

Setting the goals

The goals of the case study were quite clear from the very beginning, the objective being to understand how the software development process could be modified in order to improve the deployed products with respect to reliability and reusability. The qualities in focus were modeled in very simple terms: product reliability is defined as the number of defects perceived by the users; product reusability is defined as the possibility to use already existing code, and is intended as a way to decrease the cost of development. The detailed definition of goals was then carried out through a series of brainstorming and revision sessions. It is interesting to outline how these sessions were carried out:

- The GQM team based the detailed definition of several goals on qualitative comments made by the project team. For instance, Digital's management wanted to test the hypothesis that the external field test provided little contribution to the detection of product defects. This hypothesis was based on the project team members' observations, but the quantity and quality of the available data supporting this hypothesis were not considered sufficient to demonstrate the existence (and the real extent) of the problem, and to justify at the management level


specific modifications of the process. The first goal was thus oriented towards the verification and explanation of this phenomenon.
- Another principle that was constantly applied was the verification of known (bad) practices. For instance, it is well known that errors are often corrected in the code without updating the corresponding design documents: this possibility (and other similar ones) was carefully considered in the definition of goals.
- Goal definition was verified both internally (by the GQM team) and externally (by independent GQM experts and by the rest of the project team).

Staffing

The GQM team was composed of professionals with different expertise. Therefore, the team had a good knowledge of the different topics that were relevant to the case study. In particular, the GQM team included:

- A project team leader (i.e., a Digital person in charge of a software development activity).
- Two project team members.
- A GQM expert (from CEFRIEL).
- Two experts in process assessment and improvement (from CEFRIEL). They had already worked with Digital and knew fairly well the characteristics of Digital's process and products.

It is, therefore, important to emphasize the following enabling factors:

1. The whole team had a good knowledge of the Digital organization, worldwide and at different levels of management.
2. The whole team, except for the GQM expert, already knew many of the projects carried out at the engineering center in Gallarate.
3. The project team members were in charge of testing the product being studied (i.e., FUSE); thus they had a comprehensive knowledge of its functions and structure.
4. The GQM team was actively supported by the rest of the project team and by the site manager.

Executing the GQM process

Here we report some hints that can help make the GQM process easier to execute.

- Exploiting prior knowledge of the process. The prestudy phase can be simpler than it appears, provided that the members of the GQM team have a comprehensive knowledge of the organization, the development process, and the ongoing improvement initiatives of the company.
- Classification and selection of goals. We found that an early formalization of goals in the GQM plan is beneficial. While neither difficult nor time-consuming, early formalization makes it possible to achieve two advantages:


  - Discard irrelevant goals as early as possible. Starting from an informal definition, one can formulate goals that are impossible to achieve, repetitions of existing goals, or goals that are not sound or not consistent with other goals. Formalization highlights the weaknesses of some goals, making it possible to discard them quickly.
  - Select and rank goals on a clear and sound basis. The abstraction sheet and the metrics associated with each goal make it easy to assess the goal's relevance with respect to the strategic improvement objectives.
- Recording the rationale of the choices. We found it useful to carefully record the rationale of the main choices that were made during the process (e.g., why a goal was introduced/discarded, why metrics were defined in a given way, etc.). This helps verify and keep under control the consistency of the plans with respect to the general strategic goals. Moreover, the traceability of the plan with respect to the actual development process is guaranteed.
- Involvement of developers. This makes it possible to gain a deep insight into the process, and to develop a valid interpretation of results.
- Use of negative results. It is clearly possible, and perfectly acceptable, that part of the GQM plan fails: a goal is not reached, an initial hypothesis is negated, a question cannot be answered, or a metric cannot be collected. These negative results (along with the causes of the failures) must be carefully recorded, since they are instrumental in increasing the knowledge of the development process, and can thus prevent the repetition of errors.
- Feasibility assessment. The critical aspect in using GQM is to create plans that are consistent, complete, realistic, and actually geared towards the achievement of the improvement objectives. Consistency, completeness, and adequacy are achievable by means of review sessions. Conversely, it is more difficult to verify whether a plan is realistic (i.e., whether it is not too expensive, and whether it can be carried out with the existing resources and within a reasonable schedule). Most important, the GQM plan must be composed of metrics that can actually be collected (i.e., they correspond to observable phenomena), that can provide enough data points, that are not too noisy, etc. Therefore, it is necessary to check the plan against the development process. This operation requires a comprehensive knowledge of the development process, and has to be carried out very carefully. Also, it guarantees that the GQM plan can be pursued effectively, avoiding errors and reducing the number of iterations required in the GQM process to define the measurement program. It can be noted that feasibility analysis shares some goals with the feedback sessions. However, the latter are carried out later in the GQM process (after data collection and analysis) and can only discover errors in the data collection


activity, while feasibility analysis prevents mistakes in the definition of the GQM plan.

6.2 Cost of the Case Study

One of the relevant goals of the CEMP project was the assessment of the cost of establishing a GQM program. The effort (expressed in person-days) for the execution of the GQM process at Digital is illustrated in Figure 8. Note that the illustrated costs cover the execution of the whole plan, comprising the five goals described in Section 5.1. The participants are classified as project team (Digital's people working on the experiment), GQM consultants (including the process improvement experts), and junior researchers. The latter played the same role as the GQM and process consultants: their effort is shown separately because their cost was lower than that of the experts.
[Figure 8 is a bar chart of effort (person-days, scale 0-80) for Training, Preliminary study, Identification of objectives, GQM plan, Environment set-up, and Data collection, broken down by project team, GQM consultants, and junior researchers.]

Figure 8: Distribution of the costs of the case study.

The classification of costs reported in Figure 8 reflects the structure of the official cost statement of the CEMP project. Although this cost breakdown was defined before the final definition of the GQM process was produced, it was decided to continue collecting data according to this schema, since the mapping to the new definition of the GQM process is relatively straightforward. In particular, the following considerations apply:

- Although not strictly part of the GQM process, training of project team members is often necessary, in order to enable project members to participate in the GQM process effectively. Note that in this experiment the junior people supporting the GQM and process experts had to undergo the same training as the project members.
- The preliminary study and the identification of the strategic objectives correspond to the GQM phases described in Section 2.1.
- The GQM plan cost accounts for both the definition of the Goal/Question/Metric DAG and the planning of activities.


- Environment set-up is not explicitly mentioned in the GQM process model. Nevertheless, it is a necessary preparatory activity that involves the development of electronic forms, data conversion utilities, filters, etc. The effort needed to develop the GQM CASE tool (see Appendix 3) was not considered.
- The cost for data collection and analysis also includes the packaging of experiences.

The overall cost of establishing and running the GQM process was a reasonably small fraction of the development cost6. It is difficult to assess the benefits brought by the improvement actions that followed the experiment, because we could not observe such actions. However, the first improvement action, concerning the reorganization of testing (see Section 6.3.2), made it possible to decrease the cost of testing by 30%.

A comparison of the cost figures related to the case studies carried out within the CEMP project by Digital, Bosch, and Schlumberger is reported in [25]. In particular, the following two figures complement the considerations on costs reported above:

1. The total average effort to introduce GQM-based measurement is about one person-year (1/3 from the project team and the rest from the GQM team).
2. The project overhead due to measurement (i.e., the fraction of time spent by developers to collect metrics) is about 2%.

The effort spent by Digital was greater than the average cost measured in the CEMP project, even without considering training and environment set-up. The difference is explained by the number of GQM goals defined for the FUSE project (which was double that of the other case studies within CEMP) and by the additional cost that was required to verify, clean, and interpret existing data in order to achieve Goal 1.

Economy of scale

Establishing a GQM-based measurement program involves the iteration of activities whose cost decreases as the experience in using the method increases. Moreover, the definition and establishment of a measurement program can be accomplished by reusing parts of existing programs:

- Parts of a GQM plan (e.g., questions and metrics) can be reused in new plans. In particular, a metric can often be used to answer different questions, possibly belonging to different goals. Different metrics can share the same set of elementary collected data.
- When a new GQM plan is defined, GQM plans formerly defined for similar goals within different projects can be used as starting points. This is particularly effective when both projects are carried out in the same organization.

These opportunities to cut the cost of running the GQM process were fully exploited with the help of the GQM tool.

⁶ The value of this fraction is not reported for confidentiality reasons.


The experiment reported here was replicated on a different project (concerning the Datatrieve product) within the same development environment. In this new scenario, it was possible to reuse experience, GQM plan definitions, and software tools, decreasing the cost of executing the GQM process by 40%. Figure 9 compares the costs (expressed in person-days) of the two experiments. It is possible to note that:
- The infrastructure supporting the GQM process was entirely reused; therefore, the cost of setting up the environment was dramatically reduced.
- The development of the GQM plan benefited from the increased efficiency of the team and from the reuse of available items: its cost was 20% of that of the first experience.
- The main cost factor (about 65% of the whole cost) was data collection, partly because the direct involvement of developers was required to interpret existing data.
[Figure 9: bar chart (person-days, 0-140) comparing FUSE and Datatrieve costs for Pre-Study, Identification of objectives, GQM plan, Environment set-up, and Data collection.]

Figure 9. Costs of establishing GQM in two projects.

6.3 Benefits for the Digital Engineering Center
This section should not be read as the novel contribution of this paper; rather, it is a concise presentation of the results achieved by Digital Italia that the Digital Italia management considered important and relevant.

6.3.1 Main results
The first goal is oriented to the evaluation of failure detection and correction rates with respect to time, location, origin, etc. Data concerning about 2200 failures were collected: Figure 10 reports the distribution of these failures per priority and detection phase.


[Figure 10: bar chart (0-600 failures) per detection phase (Inherited from earlier versions, Development, Qualification, Internal Field Test, External Field Test, After release), broken down by priority (1. Showstopper, 2. High priority, 3. Medium priority, 4. Low priority, 5. Suggestion).]

Figure 10. Distribution of failures per priority and detection phase.

The same data are also presented in Figure 11, to highlight the distribution of failures per priority level and per detection phase.
[Figure 11 data — detection phase: Qualification 50%, Development 30%, Inherited from earlier versions 12%, After release 5%, External Field Test 2%, Internal Field Test 1%; priority: 3. Medium priority 53%, 2. High priority 20%, 4. Low priority 15%, 1. Showstopper 7%, 5. Suggestion 5%.]
Figure 11. Distribution of failures per priority (left) and detection phase (right).

These data enabled the GQM team to make the following observations:
- The development phase produces the highest number of Priority 1 failures. This was expected, since in this phase the product was in an unstable state.
- The qualification phase detects the highest number of failures. This was expected as well, since testing was done by experienced people with the help of good tools and reasonable schedules and resources.
- Most failures are classified at Priority 3. Although the shape of the distribution was expected, the very high number of Priority 3 failures is probably due to an inadequate classification of failures. In fact, by classifying a failure as a Priority 3 problem, one neither suggests that the failure is severe (as Priority 2 would imply) nor that it is of little importance (as Priority 4 would indicate). A redefinition of the priority classification into only four levels would probably help correct this situation, by forcing software developers to provide a more convincing and precise characterization of failures.


- The field test (both internal and external) provides a very small contribution (less than 5%) to the detection of failures.

The objective demonstration that field test provides a very small contribution to failure detection suggested important design changes to Digital's software process. These changes result in significant economic benefits. In fact, the field test activity is quite expensive, and the resources (people and funds) that are currently allocated to this activity can be used more effectively. For instance, the external field test sites could be restricted to a small selected group of users that is better supported and motivated to actually use the product in real operating conditions. Note that, in the case of field test, the results reported by the GQM study contradict other well-known situations (e.g., Microsoft's successful beta test programs). This can be easily explained by the differences in the nature of products and users, and in the size of the installation bases.

6.3.2 Other results
In addition to the main findings described above, further analysis of the data collected for Goal 1 unveiled interesting facets of the development process. In particular, we were able to observe several characteristics of the baselevels (the definition of baselevels is given in Section 4.1). The development of the product was carried out in 18 baselevels, the first 7 involving just design and coding, while the correction of faults was carried out along with the development of new features starting from baselevel 8. We observed the situation one and three months after the release of the product. The following data are illustrated in the rest of this section:
- Distribution of failures per priority and baselevel (Figure 12).
- Distribution of failures fixed/deferred per baselevel (Figure 13). Note that deferred failures are taken into account only once: if a failure has been deferred for several baselevels, it is counted only in the baselevel in which it was first noticed.
- Distribution of the testing effort, expressed in person-days, per baselevel (left part of Figure 14).
- Productivity of testing per baselevel, expressed as the number of discovered failures per person-day of testing effort (right part of Figure 14).
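As an illustration only, the following sketch shows how per-baselevel figures of this kind can be derived from raw failure and effort records; the record layout and the values are hypothetical and do not come from the FUSE data set.

```python
from collections import Counter

# Hypothetical raw records (the real study collected about 2200 failures).
failures = [
    {"id": "F001", "priority": 3, "baselevel": "B08"},
    {"id": "F002", "priority": 1, "baselevel": "B08"},
    {"id": "F003", "priority": 3, "baselevel": "B09"},
]
testing_effort_pd = {"B08": 12.0, "B09": 9.5}   # testing effort in person-days

# Distribution of failures per priority and baselevel (as in Figure 12).
per_priority_and_baselevel = Counter(
    (f["priority"], f["baselevel"]) for f in failures
)

# Testing productivity per baselevel: detected failures per person-day (as in Figure 14).
detected = Counter(f["baselevel"] for f in failures)
productivity = {bl: detected[bl] / effort for bl, effort in testing_effort_pd.items()}

print(per_priority_and_baselevel)   # Counter({(3, 'B08'): 1, (1, 'B08'): 1, (3, 'B09'): 1})
print(productivity)                 # {'B08': 0.166..., 'B09': 0.105...}
```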


[Figure 12: bar chart (0-350 failures) per baselevel (Inherited, B01-B17, FB, M01, M03), broken down by priority (1. Showstopper, 2. High priority, 3. Medium priority, 4. Low priority, 5. Suggestion).]

Figure 12. Failures per priority and baselevel.

By analyzing these data, the GQM team was able to make several observations:
- The data represent quite well the history of the project: during the first seven baselevels, practically no new software was released, so all the values are low. It can also be observed that, as soon as the production team began to release new software, the testing effort was greatly increased.
- The diagram reporting the ratio of fixed failures to the problems deferred to a following baselevel indicates that the production team's capacity for fixing failures was exploited to the limit. This turned out not to be a problem, as all the faults were corrected by the end of the project.
[Figure 13: bar chart (0-300 failures) of fixed vs. deferred failures per baselevel (Inherited, B1-B17, BF, M1, M3).]

Figure 13. Failures fixed/deferred per baselevel.


- Priority 1 failures appeared just after the release of new software (i.e., after baselevel 8), but were almost eliminated within a few baselevels: only a few Priority 1 failures appeared after baselevel 13. This shows the ability of the testing team to discover the most severe problems in a timely fashion, and the ability of the production team to effectively solve them.
- When the project was approaching its conclusion, the number of new failures decreased. However, the testing effort was unchanged, causing the productivity of testing activities to decrease and to reach very low levels at the end of the project.
[Figure 14: two bar charts per baselevel (B1-B17, BF): testing effort in person-days (left) and detected failures per person-day of testing (right); y-axis scales 0-7 and 0-70.]

Figure 14. Testing: effort in person-days (left) and detected failures per person-day (right), per baselevel.

This last observation, combined with the evaluation of the diagram reporting the average correction time per failure (not reported in this paper), suggested a modification of the testing process. Software changes used to be tested immediately by the developers, but the problem was classified as closed only at the completion of the baselevel, i.e., when the official test had confirmed the results of the developers' test. It was quite clear that the rationalization of the testing practice could result in significant savings, since it makes it possible to reduce the duplication of effort and to obtain a reliable evaluation of the effectiveness and performance of the test process.

Another change in the process concerns the release of software components. In practice, for products like FUSE, made of a set of fairly independent components, the indication deriving from data analysis was to schedule the release of components in an incremental way. This maximizes the efficiency of testing because, at every baselevel, the testing activity can concentrate on the new code, which is presumably more fault-prone than the code belonging to previous baselevels, which has already been tested at least once.

6.3.3 Summary of the benefits for the company
Digital Italia has substantially improved the existing measurement program. Moreover, it has learned how to establish and/or extend a GQM measurement program in a systematic and coherent way. More specifically, the benefits can be summarized as follows:
- Better data collection practices (i.e., a coherent and goal-oriented measurement program). Measures are linked to the questions they must contribute to answer. Comprehensive queries for retrieving data are defined. The regular structure of the database allows easy statistical computations.


- Better data management facilities (a relational database, whose schema is adaptable through the GQM tool as the GQM plan evolves).
- Better interpretation of data. The involvement of the production team in the definition of the plan and in the feedback sessions guarantees a reliable interpretation of data. Moreover, the coherence of the metrics defined in the plan with the process model guarantees their unambiguous and precise interpretation.
- Better motivation for data collection. The development team has been regularly involved in the GQM process. They understand the importance of the measurement program, and they appreciate the potential (and actual) benefits of this activity.
- Better usage of existing data. Once a rigorous framework for validation and interpretation is provided, existing data can be exploited and effectively used.
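To make the data-management point concrete, the sketch below shows a minimal relational layout of the kind such a measurement program relies on. The table, column names, and sample rows are illustrative assumptions rather than the schema actually generated by the GQM tool; only the EFT-00054/User example is borrowed from Appendix 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE failure (
        failure_id      TEXT PRIMARY KEY,
        priority        INTEGER,   -- 1 = showstopper ... 5 = suggestion
        detection_phase TEXT,      -- e.g. 'Qualification', 'External Field Test'
        baselevel       TEXT,
        origin          TEXT,      -- e.g. 'User', 'Software problem'
        fix_effort_pd   REAL       -- effort to fix, in person-days
    )
""")
conn.executemany(
    "INSERT INTO failure VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("EFT-00054", 3, "External Field Test", "B12", "User", 0.5),
        ("QUA-00120", 2, "Qualification", "B10", "Software problem", 2.0),
    ],
)

# Example of the kind of reusable query mentioned above:
# distribution of failures per detection phase.
for phase, count in conn.execute(
        "SELECT detection_phase, COUNT(*) FROM failure GROUP BY detection_phase"):
    print(phase, count)
```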

7. CONCLUSIONS
Software process improvement must be based on quantitative evaluation of process and product characteristics. This assumption has guided our work, which has been centered on the evaluation and application of the GQM paradigm in an industrial software factory. The main insights and lessons learned can be summarized as follows:
- GQM has been criticized because of its lack of structure and guidance. The experiences carried out in the past few years and, specifically, in the CEMP project have made it possible to create a detailed and improved description of the GQM process.
- Abstraction sheets have turned out to be very effective in capturing relevant process/product information related to the goal being analyzed. They make it possible to at least partially fill the conceptual gap between goals and questions.
- The application of GQM to different projects within the same organization allows the reuse of parts of GQM plans (goals, abstraction sheets, questions, and metrics) concerning common process and product features. The reuse of these parts of GQM plans may yield substantial savings in terms of effort and time.
- The development of the GQM and measurement plans should be based on a comprehensive knowledge of the process details, i.e., it should be undertaken when a significant level of maturity has been reached.
- Goal setting has to be based on an intuitive and simple representation of the known problems of the process being studied.
- Staffing is a critical enabling factor. The GQM team must have a good knowledge of the organization being studied at different levels of management.

In executing the GQM process, it is important to follow a set of guidelines:


1. Exploiting prior knowledge of the process. The pre-study phase can be simpler than it appears, provided that the members of the GQM team have a comprehensive knowledge of the organization, development process, and ongoing improvement initiatives of the company.
2. Classification and selection of goals. We found that an early formalization of goals in the GQM plan is beneficial to discard irrelevant goals as early as possible and to select and rank goals on a clear and sound basis.
3. Recording the rationale of decisions. We found it useful to carefully record the rationale of the main decisions that were made during the process.
4. Involvement of developers. This makes it possible to gain a deep insight into the process, and allows a careful validation of results.
5. Use of negative results. Negative results (along with the causes of failures) must be carefully recorded, since they are instrumental in increasing the knowledge of the development process, and can thus prevent the repetition of errors.
6. Feasibility assessment. The critical aspect in using GQM is to create plans that are consistent, complete, realistic, and actually geared towards the achievement of the improvement objectives.

The overall cost of establishing and running the GQM process was a reasonably small fraction of the development cost. The cost decreases significantly and quite rapidly as experience is gained.

Overall, we have been able to achieve two objectives:
a) The experiences and information on the problems, costs, and required skills related to the creation of a measurement program have been packaged and can now be used to start up new measurement programs.
b) The Digital Engineering Center has established an effective measurement program that has made it possible to evaluate several process improvement hypotheses and identify appropriate improvement strategies.

Acknowledgment
The authors would like to thank the following people who, in different ways, have been instrumental in pursuing the activities described in this paper: Dieter Rombach, for his guidance and advice; Barbara Hoisl, for her support in the development of the measurement program; the partners of the CEMP project (Bosch, Schlumberger, and the Software Technology Transfer Initiative of the University of Kaiserslautern), for many useful discussions; the management of Digital Italy, and particularly Paolo Rivera, for their support; and Luca Baratto, Marco Grigoletti, Cristiano Gusmeroli, and Luca Panigada, for their work on the development of the GQM tool.


References
1. Software Engineering Institute, The Capability Maturity Model: Guidelines for Improving the Software Process, Addison-Wesley, 1995.
2. V. Basili and D. Rombach, The TAME project: towards improvement-oriented software environments, IEEE Transactions on Software Engineering, vol. 14, no. 6, pp. 758-773, 1988.
3. N. Fenton, Software Metrics: A Rigorous Approach, Chapman & Hall, 1991.
4. V. Basili, G. Caldiera, and D. Rombach, Goal/Question/Metric Paradigm, in Encyclopedia of Software Engineering, vol. 1, J. C. Marciniak, Ed., John Wiley & Sons, 1994, pp. 528-532.
5. V. Basili, The Experience Factory and its relationship to other improvement paradigms, in Proceedings of the European Software Engineering Conference 1993 (ESEC-93), Garmisch (Germany), 1993.
6. L. Votta, A. Porter, and D. Perry, Experimental Software Engineering: a report on the state of the art, in Proceedings of the 17th International Conference on Software Engineering (ICSE 17), Seattle, WA (USA), 1995.
7. C. M. Judd, E. R. Smith, and L. H. Kidder, Research Methods in Social Relations, Sixth ed., Fort Worth, TX (USA): Holt, Rinehart and Winston, Inc., 1991.
8. J. Lofland and L. H. Lofland, Analyzing Social Settings, Third ed., Wadsworth Publishing Company, 1995.
9. L. Votta and M. L. Zajac, Design process improvement case study using process waiver data, in Proceedings of the Fifth European Software Engineering Conference (ESEC 95), Sitges (Spain), 1995.
10. N. Fenton, Software measurement: a necessary scientific basis, IEEE Transactions on Software Engineering, vol. 20, no. 3, pp. 199-206, 1994.
11. L. Briand, K. El Emam, and S. Morasca, On the application of measurement theory in software engineering, Empirical Software Engineering: An International Journal, vol. 1, no. 1, 1996.
12. R. B. Grady and D. L. Caswell, Software Metrics: Establishing a Company-Wide Program, Prentice-Hall, 1987.
13. Software Engineering Laboratory, Software Measurement Guidebook, NASA, Goddard Space Flight Center, Greenbelt (MD), SEL-94-102, June 1995.
14. S. L. Pfleeger, Lessons learned in building a corporate metrics program, IEEE Software, vol. 10, no. 3, pp. 67-74, 1993.
15. V. Basili and D. Weiss, A methodology for collecting valid software engineering data, IEEE Transactions on Software Engineering, vol. SE-10, no. 6, pp. 728-738, 1984.
16. V. Basili, M. K. Daskalantonakis, and R. H. Yacobellis, Technology transfer at Motorola, IEEE Software, vol. 11, no. 2, pp. 70-76, 1994.


17. D. N. Card, What makes for effective measurement?, IEEE Software, vol. 10, no. 6, pp. 94-95, 1993.
18. V. Basili, GQM approach has evolved to include models, IEEE Software, vol. 11, no. 1, p. 8, 1994.
19. D. Weiss, GQM plus heuristics better than brainstorming, IEEE Software, vol. 11, no. 1, pp. 8-9, 1994.
20. K. K. Hefner, An experienced-based optimization of the Goal/Question/Metric paradigm, in Proceedings of the California Software Symposium, Irvine, CA (USA), 1995.
21. S. Reiss, Connecting tools using message passing in the FIELD environment, IEEE Software, vol. 7, no. 4, pp. 57-66, 1990.
22. Software Research Inc., STW User Guide.
23. C. Gresse, B. Hoisl, and J. Wüst, A process model for planning GQM-based measurement, Software Technology Transfer Initiative, University of Kaiserslautern, Department of Computer Science, D-67653 Kaiserslautern (Germany), Technical Report STTI-95-04-E, October 1995.
24. R. van Solingen, F. van Latum, M. Oivo, and E. Berghout, Application of software measurement at Schlumberger RPS, in Proceedings of the Sixth European Software Cost Modeling Conference (ESCOM), Paris (France), 1995.
25. C. Gresse, D. Rombach, and G. Ruhe, Tutorial: A practical approach for building GQM-based measurement programs - Lessons learned from three industrial case studies, in Proceedings of the Tenth Brazilian Symposium on Software Engineering, São Carlos (Brazil), 1996.
26. Integration Definition for Function Modeling (IDEF0), Federal Information Processing Standards Publication, December 1993.


APPENDIX 1: Digital's Design and Qualification phases
The design and implementation phase (alias Phase 2) is depicted in Figure 15, using the IDEF0 notation [26] (enabling conditions and triggering events are omitted).
[Figure 15: IDEF0 diagram of Phase 2, with activities Architectural Design and Documentation, Test Case Development, Component Development, Test Execution, and Preparation of the Field Test Kit; artifacts include Specifications, Design Document, Baselevel, Test Cases, Test Results, Final Baselevel, Documentation for Field Test, and Field Test Kit; the Project Leader and Product Leader act as controls.]

Figure 15. Design and implementation.

The Qualification phase (alias Phase 3) is represented in Figure 16. It aims at checking the product behavior by means of:
- testing in the development environment;
- Internal Field Test (IFT): the product is used by Digital personnel not belonging to the development team;
- External Field Test (EFT): the product is released to a limited number of selected customers in order to get feedback from a representative set of users.
[Figure 16: IDEF0 diagram of Phase 3, with activities Testing, Internal Field Test (performed by Digital personnel), External Field Test (performed by selected customers), Correction, Testing and Final Correction, Approval (by the Product Manager and Review Board), and Release of the Master Kit; artifacts include the Field Test Kit, Failure Reports, Sanity Kit, Master Kit, and the marketable product; the Project Leader and Product Leader act as controls.]
Figure 16. Phase 3 - Qualification.


Before the product is released, a Sanity Kit and a Master Kit are produced. The latter is used for a limited period in order to verify its compliance with the company standard.

APPENDIX 2: Detailed definition of Goal 1
The detailed description of Goal 1 is reported below. Metrics are omitted for space reasons.

Goal
Analyze the design/development and qualification/delivery phases of the development process for the purpose of evaluating failure detection effectiveness (quality focus) from the viewpoint of the management and the development team for FUSE 2.0.

Quality focus
Characterization of failure detection events, each described by the following information:
- Phase and time when it occurred.
- Component where the failure was located.
- Cause of the failure (user error, software problem, hardware failure, documentation mistake, etc.).
The failure itself is characterized by the following attributes:
- Failure criticality.
- User priority (the importance of the problem from the user's viewpoint).
- Cost to fix (in terms of effort and time).

Variation factors
The process of detecting failures is affected by:
- Testing methods, in terms of techniques, tools, and level of coverage.
- Testing team characteristics, in particular experience in testing, knowledge of the problem, size of the team, etc.
- The characteristics of the product itself, which may depend on attributes like platform, size, complexity, quality of specification, and required reliability.
Other variation factors are related to the knowledge of the domain. The understandability of the requirements and the degree of confidence in the results of the tests were considered as important influencing factors.

Baseline hypotheses
The development team was asked to estimate:
- the number of failures per phase (internal testing, external field testing, series production);
- the number of failures per component;


- the distribution of failures per criticality;
- the effort to fix the faults in source code causing failures of Priority 1 or 2.
These estimates are confidential.

Impact on baseline hypotheses
The assumptions we have made are basically the following:
- Better testing methods increase the percentage of failures detected in internal and external field testing, and decrease the percentage of failures detected in series production.
- Better testing team performance decreases the number of expected failures per component in series production and their criticality.
- The more complex the implemented function, the more difficult it is to test it thoroughly and the more easily a failure may remain undetected.

Questions and metrics
Questions concerning the process (in particular the conformance of the measured activity with respect to the reference process model):
Q1 What leads to exposing failures?
Q2 How accurate were the tests?
Q3 What is the distribution of tests over each component and its functionality?
Q4 What is the importance of testing each functionality?
Q5 How much effort does it take to test each component and functionality?
Q6 What are the size, experience, and knowledge of the testing team?

Questions concerning the process (in particular the domain conformance). The following questions aim at understanding whether the testing team had the required level of knowledge and expertise (Q8), the operating conditions under which it worked (Q7), and the quality of the work done. A new testing session could be started if the questions reveal a low confidence in the results of testing.
Q7 How understandable are the requirements to the tester?
Q8 What is the knowledge of the testing team of the application domain?
Q9 How confident is the tester that the result is correct?

Questions on the quality focus (failures):
Q10 How many failures occur in each phase?
Q11 How many failures escaped each testing phase over the total number of failures?
Q12 What is the failure time distribution (in each phase)?
Q13 What is the failure distribution with respect to detection effort in Phase 2?
Q14 To what extent can failures be traced back to component and functionality?


Q15 What is the time and effort distribution of failures for each component and functionality they can be traced to?
Q16 What is the origin of a failure?
Q17 What is the distribution of failures with respect to their origin?
Q18 What is the distribution of failures with respect to their origin for each component and for each functionality?
Q19 What is the distribution of the effort to fix failures with respect to their origin?
Q20 What is the ability of testing techniques to expose failures with respect to their origin?

The following set of questions addresses an important attribute of the quality focus, namely the failure criticality (from the viewpoint of the user and the developer).
Q21 What is the criticality of each failure from the user's viewpoint?
Q22 What is the criticality of each failure from the corporation's viewpoint?
Q23 What is the time distribution of failures with respect to their criticality?
Q24 What is the distribution of the effort for exposing failures with respect to their criticality?
Q25 What is the distribution of the effort to fix failures with respect to their criticality?
Q26 What is the distribution of failures over components and their functionality with respect to their criticality?
Q27 What is the ability of testing techniques to expose failures with respect to their criticality?

Questions concerning the product (logical and physical attributes) provide insight into the components' features that could affect the generation of faults.
Q28 What are the differences across the platforms?
Q29 What is the size of the entire product, of each component, and of its functionality?
Q30 What is the complexity of each component, of its functionality, and of the entire product?
Q31 What is the reliability of each component and of the entire product?
Q32 What is the quality of the specifications for each component?

The following feedback questions are instrumental to the definition of an improvement strategy that reflects the analysis of the metrics described above. They try to formulate process improvement proposals, generally by analyzing the truth of the baseline hypotheses. For example, Q34 assumes that increasing the effort dedicated to development and testing could cause big savings in the following phases.
Q33 Does the test method need to be refined or modified?
Q34 Would it be useful to provide more resources to testing in Phase 2?
Q35 Does a better development organization lead to fewer failures?
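As a minimal sketch, the goal template used above can also be captured as structured data, which makes goals easier to compare, rank, and reuse across projects. The class and field names below are illustrative assumptions, not part of the GQM tool; only the facet values are taken from the goal definition in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GQMGoal:
    object_of_study: str
    purpose: str
    quality_focus: str
    viewpoint: str
    context: str

    def statement(self) -> str:
        # Render the goal in the standard GQM template form.
        return (f"Analyze {self.object_of_study} for the purpose of {self.purpose} "
                f"with respect to {self.quality_focus} from the viewpoint of "
                f"{self.viewpoint} in the context of {self.context}.")


goal_1 = GQMGoal(
    object_of_study="the design/development and qualification/delivery phases",
    purpose="evaluation",
    quality_focus="failure detection effectiveness",
    viewpoint="the management and the development team",
    context="FUSE 2.0",
)
print(goal_1.statement())
```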

APPENDIX 3: A quick tour of the GQM tool


The GQM tool was developed for the Microsoft Windows environment (a new version for Windows 95 is being developed). The tool is available via anonymous ftp from ftp://ftp.cefriel.it/pub/Settore2/gqm-tool. Some of the features of the tool are highlighted in Figure 17.

Editing
The upper left window (labeled A) reports the explosion of the currently selected goal (G_FUSE_fail_fault). There are several quality foci associated with the goal. In particular, QF_Fail/Component is open and we can see that it has two associated questions; the second one (Q_Fail/CompCritic) is refined into two metrics: M_Fail/FuncPriority and M_Fail/CompPriority, the latter being currently selected. The lower left window (labeled B) reports information concerning the selected metric M_Fail/CompPriority.

Data visualization
The lower right window (labeled C) displays data associated with the selected metric (in this case, the Failure_origin metric). In particular, the distribution per failure priority and per origin is given in bar-chart form. Data are also available in textual form (in the upper left scrollable area of the window).

Data browsing
The upper right window (labeled D) displays a list of tuples, each corresponding to a data point of the selected metric. The attributes of each data point are visible (e.g., the instance labeled EFT-00054 of the Failure_origin metric has the attribute Failure origin equal to User).


Figure 17. A screen shot of the GQM tool.

