
DHPC Technical Report DHPC-073

Commodity Cluster Computing for Computational Chemistry


K.A. Hawick (1), D.A. Grove (1), P.D. Coddington (1), M.A. Buntine (2)

(1) Department of Computer Science, University of Adelaide, Adelaide, SA 5005, Australia
(2) Department of Chemistry, University of Adelaide, Adelaide, SA 5005, Australia

khawick@cs.adelaide.edu.au, mark.buntine@adelaide.edu.au

21 January 2000

Keywords: Computational chemistry, Beowulf cluster, cluster computing, parallel computing, performance benchmarks

Abstract:
Access to high-performance computing power remains crucial for many computational chemistry problems. Unfortunately, traditional supercomputers or cluster computing solutions from commercial vendors remain very expensive, even for entry level configurations, and are therefore often beyond the reach of many small to medium-sized research groups and universities. Clusters of networked commodity computers provide an alternative computing platform that can offer substantially better price/performance than commercial supercomputers. We have constructed a networked PC cluster, or Beowulf, dedicated to computational chemistry problems using standard ab initio molecular orbital software packages such as Gaussian and GAMESS-US. This paper introduces the concept of Beowulf computing clusters and outlines the requirements for running the ab initio software packages used by computational chemists at the University of Adelaide. We describe the economic and performance trade-offs and design choices made in constructing the Beowulf system, including the choice of processors, networking, storage systems, operating system and job queuing software. Other issues such as throughput, scalability, software support, maintenance, and future trends are also discussed. We present some benchmark results for the Gaussian 98 and GAMESS-US programs, in order to compare the processor performance (and price/performance) with other computing platforms. We also analyse the efficiency and scalability of the parallel versions of these programs on a commodity Beowulf cluster. We believe that the Beowulf cluster we have constructed offers the best price/performance ratio for our computational chemistry applications, and that commodity clusters can now provide dedicated supercomputer performance within the budget of most university departments.

1. Introduction
In recent years computational chemistry, particularly in the area of electronic structure determination, has seen an explosive growth in the number of researchers using theoretical methods to aid both in interpreting experimental results and in designing new experimental strategies. The growth in uptake of computational chemistry has been driven, in large part, by the advent of easy-to-use graphical "front ends" such as GaussView [1] and Spartan [2]. Even non-commercial packages such as MOLDEN [3] have opened up the advantages of computational chemistry to a whole new class of chemists who are not necessarily experts in the nuances of quantum chemistry or in computer operating systems. At the same time, the hardware development cycle has become extremely rapid, with vendors continuing to double microprocessor performance every 12 to 18 months [4]. Improvements in the ease of use of computational chemistry software, coupled with ever-increasing microprocessor speed and data storage options, have allowed chemists to apply chemical theory to molecular systems of unprecedented size and at levels of theory that were simply intractable only a few years ago. Nonetheless, the prediction of physical and chemical properties of interest remains limited by the availability of sufficient, affordable computing capacity.

Access to High-Performance Computing (HPC) power remains crucial for many computational chemistry problems. Unfortunately, traditional supercomputers or cluster computing solutions from commercial vendors continue to be very expensive and therefore beyond the reach of many small to medium-sized research groups and universities. They also become obsolete very quickly, due to the continual rapid increase in processor performance. With a limited computing budget, it is crucial to get the best price/performance from a computing system. The dramatic improvements in the performance of commodity PCs over the past few years have made clusters of commodity computers, now commonly known as Beowulf clusters [5a,5b,5c], an attractive option for low-cost, high-performance computing. It has been recognised for some time that commodity processors could be very useful for computational chemistry applications [6,7], and most chemistry programs have now been ported to commodity systems, making Beowulf clusters a viable platform for computational chemistry.

In an effort to significantly reduce the cost of HPC for computational chemistry applications at the University of Adelaide, we have designed and constructed a Beowulf cluster, known as Perseus, that is dedicated to computational chemistry. We initially developed a prototype system of 16 dual Pentium PCs, and the benchmark results presented in this paper were obtained on this machine. We have now constructed the full Beowulf cluster, which consists of 116 dual Pentium PCs (i.e. 232 Pentium processors) connected by a commodity Fast Ethernet network. Our system cost approximately US$200,000, which is considerably less than a traditional HPC system of comparable performance from a commercial vendor, providing around an order of magnitude better price/performance ratio. In this paper we introduce the Beowulf concept of using commodity computing components to build high-performance computers, and explain why this approach is well suited to computational chemistry applications.
We describe the prototype Beowulf system that we have constructed and the design decisions that were made to provide the best price/performance ratio for chemistry applications. We provide performance benchmarks that we have undertaken using the Gaussian 98 [8] and GAMESS-US [9a] suites of programs, and discuss how we can maintain efficient use of the machine when it is scaled up to hundreds of processors. In addition, we present some insight into the price/performance benefits of Beowulf clusters for computational chemistry applications by comparing the performance of the Beowulf cluster with other traditional high-performance computers. We believe that Beowulf clusters offer a viable high-performance computing option to researchers who do not have access to expensive commercial supercomputing resources. The results of our performance benchmarking will be of interest to chemists who are looking to obtain maximum computing power for a smaller capital outlay.

2. Beowulf Clusters
There has been substantial interest in recent years in building high-performance multiprocessor computers using off-the-shelf desktop computers connected together with standard network technologies such as Ethernet. Networked clusters of computers can be used for processing multiple sequential jobs, and can also support parallel programming using message passing software such as the Message Passing Interface (MPI) [10] and Parallel Virtual Machine (PVM) [11]. This general concept of cluster computing [5b,12,13,14a,14b] can provide a much more cost-effective compute resource than traditional supercomputers, by taking advantage of commodity components. Clusters are also easier to maintain, at least in terms of hardware, and can be upgraded cheaply and incrementally, since commodity nodes can easily be added or replaced. Cluster management software packages [13] such as DQS [15] and Condor [16a,16b] provide queuing software to run batches of single processor jobs on individual nodes of clusters, and some can also schedule parallel programs that use many processors. Considerable work is being invested worldwide into software tools and technologies for cluster computing [14a,14b]. The tools and methodologies that were invented for parallel computing and supercomputing are being adapted and developed to cope with the communications latency and bandwidth properties of compute clusters.

Networking technologies have now advanced to the stage where individual desktop workstations are often connected with the same type of network technology as might be used for a dedicated rack-mounted cluster. Clusters are often created from existing networks of workstations or personal computers sitting on people's desks, with a cluster management system used to place jobs on machines that are idle [13,16b]. However, in this paper we will concentrate solely on clusters that are specifically constructed and used as dedicated compute servers for computational chemistry applications. It is also feasible to combine the two approaches, with the pool of compute nodes consisting of a combination of one or more dedicated compute clusters and a group of non-dedicated desktop machines that may only be available at certain times.

The concept of cluster computing has been around for many years. Initially, clusters consisted of networks of high-end Unix workstations from companies such as DEC, IBM, Sun, Hewlett-Packard, and latterly also SGI, usually connected by standard 10 Mbit/sec Ethernet. During the 1990s, clusters were built using proprietary high-speed interconnects with much lower latency and higher bandwidth in order to improve the performance of parallel computing applications. This has been the basis for many of the current commercial high-performance parallel computers, such as the IBM SP series, which is a cluster of IBM RS/6000 workstations connected by a proprietary high-speed network. The problem with these clusters is that they use high-end workstations and high-end networks, both of which are very expensive, so the benefits of commodity pricing do not filter through to the end users of these systems.
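To make the message-passing model referred to above concrete, the following minimal sketch distributes a trivial computation across cluster processes and combines the partial results. It uses the mpi4py Python bindings purely for brevity; the chemistry packages discussed in this paper call MPI (or PVM and Linda) from Fortran and C, and the script name in the comment is only illustrative.

    # Minimal illustration of the message-passing model used on clusters.
    # Run with, for example:  mpirun -np 4 python partial_sums.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's id, 0 .. size-1
    size = comm.Get_size()          # number of cooperating processes

    # Each process computes a partial result (here, a trivial partial sum) ...
    local = sum(range(rank, 1000, size))

    # ... and the partial results are combined across the network.
    total = comm.allreduce(local, op=MPI.SUM)

    if rank == 0:
        print(f"{size} processes computed total = {total}")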

Commodity clusters have become very popular over the last few years because of the dramatic performance improvements in PC processors, to the point where they are now comparable to high-end Unix workstations for scientific computations. While commodity networks are still orders of magnitude slower than the proprietary high-speed networks used in commercial parallel computers, they have also improved greatly over the past few years, with 100 Mbit/sec Fast Ethernet now being the norm and Gigabit Ethernet becoming available. The other important driver for PC clusters has been the development of the Linux operating system, a freely available Unix implementation that was originally developed for PCs and has now been ported to a range of architectures, providing a reasonably standardised platform for program development. The development of Linux meant that the programming environment, compilers, cluster management systems, and parallel computing software used on Unix clusters could be used on clusters of commodity PCs. The NASA Beowulf project [5a,5b,5c] was the first attempt to develop a cheap high-performance computer by using commodity PC hardware connected by standard Ethernet networks and running Linux and other freely available software. Conceived in 1994 at NASA Goddard Space Flight Center as a cheap image processing system, this project demonstrated that a commodity cluster could provide an excellent price/performance ratio for a range of scientific applications. Beowulf clusters have proven to be very popular with academic and research organisations, and there are now many hundreds of Beowulfs in use worldwide. Many of these organisations have joined to form the Beowulf consortium [5d], who actively share information and software for Beowulf systems. Most organisations have constructed their own Beowulf systems from off-the-shelf components. However, for those groups that lack the desire or expertise to build one themselves, it is now possible to buy Beowulf systems from a number of hardware vendors and systems integration companies [5a]. A number of Beowulf machines have been constructed for use by computational chemists. Most are small systems of 8 to 32 processors, for dedicated use by chemistry research groups. Computational chemists are also using large Beowulf machines that are beginning to be provided as shared facilities at universities and other research organisations. Recently some large Beowulf clusters have been built for dedicated use by computational chemists. The LoBoS (Lots of Boxes on Shelves) [17] supercomputers are two PC clusters at the Molecular Graphics and Simulation Laboratory of the National Institutes of Health, the largest (LoBoS2) consisting of 100 PCs with dual 450 MHz Pentium II processors (i.e. 200 processors in total) connected by Fast Ethernet and Gigabit networks and running the Linux operating system. The LoBoS clusters are used primarily for parallel computational chemistry programs such as CHARMM, GAMESS and AMBER applied to simulations of biological systems. COBALT (Computers On Benches All Linked Together) [18a,18b] at the Department of Chemistry of the University of Calgary is a cluster of 94 Compaq Alpha Personal Workstations with 500 MHz Alpha 21164 processors, connected by Fast Ethernet and running under Digital Unix (now known as Compaq Tru64). It is used to run Gaussian as well as some density functional codes. 
Perseus [19], the Beowulf cluster we have built at the University of Adelaide, is slightly larger than LoBoS2, with 116 dual Pentium PCs. 100 of the nodes are dual 500 MHz Pentium III PCs with 10 Gbyte disks, while the remaining 16 nodes consist of 8 dual 400 MHz Pentium II and 8 dual 450 MHz Pentium III PCs with 4 Gbyte disks, which formed our initial prototype system. Each node has at least 256 MBytes of RAM. Perseus, LoBoS2 and COBALT all have similar peak speeds of around 100 GFlops, at a cost around an order of magnitude less than comparable commercial supercomputers. For the last few years, the trend in high-performance computing has been a move away from
supercomputers that use special-purpose, high-end, high-cost components, in favour of clusters of commodity workstations that are now provided by all the major workstation vendors (IBM, Sun, SGI, Compaq, etc). Beowulf-style clusters using less expensive, mass-market commodity PC components offer an even cheaper option to these commercial clusters, and a substantially better price/performance ratio. Beowulf clusters should continue to enjoy rapid improvements in performance and price/performance in the foreseeable future, and are likely to make up an increasing proportion of the high-performance computing market in the future. Current trends in commodity cluster computing technologies are that processor clock speeds will continue to improve and that an approximate balance or equilibrium between network delivery speeds and processing capability may be maintained. Balance between the processor performance and the network and memory data access rates is a key issue for the design of any computer system. Current commodity technologies provide a reasonable balance so that processors can be kept supplied with an adequate amount of data, either from memory or from the network, in order to be well utilised and not kept idle waiting for data to process. Indeed, in the case of Gigabit Ethernet it is likely that current generation PC bus architectures may not be able to keep up with the data delivery speeds. Improvements in the latency of commodity networks would be very useful, although this is also limited by current bus architectures. The next important development may be the introduction of faster memory bus standards and technologies, so that processors can communicate effectively with their memory systems and support processors or peripherals at sustained speeds that will not impact on processing and network data delivery.

3. System Requirements for Computational Chemistry


In order to design and construct a cluster that will provide optimal price/performance for computational chemistry applications, a clear understanding of the usage requirements of the system is needed. This varies for different research groups and different applications, however the requirements we have used to design our system are fairly typical for computational chemists. High-performance computers are used by computational chemists at the University of Adelaide primarily to determine molecular structure. The most heavily used program is the Gaussian [8] package for electronic structure determination. Other packages such as GAMESS [9a,20], CHARMM [21] and Spartan [2] are also used. At any given time, there are typically many users, each of whom may be running several different jobs. Each job typically takes several days or even weeks on a single processor to produce a final result. Prior to the construction of the Beowulf cluster, the main compute resource for these jobs had been a 20-node SGI Power Challenge, which was shared with other university research groups. The Power Challenge had been considerably oversubscribed just with the demand from computational chemistry users. Our goal in building the Beowulf cluster was to provide the computational chemists with a much greater compute resource that was dedicated to their needs, which in turn would free up the Power Challenge to provide more compute cycles for other users. All of the major computational chemistry software packages have parallel implementations available. Gaussian can run in parallel on shared memory machines (such as the Power Challenge), and can also run on distributed memory machines (such as a Beowulf cluster) with the use of the Linda [22a] parallel programming environment. GAMESS-US [9a,9b] and GAMESS-UK [20] both have parallel versions that use either MPI or their own message passing libraries, and a parallel version of CHARMM is available that uses PVM. Both PVM and MPI are standard message
passing libraries that are available for virtually all parallel computers, including Beowulf clusters. When programs are run in parallel across multiple processors, there is always an overhead that reduces the efficiency and speedup to less than the ideal case, for which the program would run N times faster on N processors [23a,23b]. This overhead may be due to a number of factors, such as the cost of communications between the processors, an imbalance in sharing the workload between the processors, or some part of the program that must be executed sequentially. The overhead increases (i.e. the efficiency of the parallel program decreases) as the number of processors is increased, and eventually there will be a point at which the program will not run any faster by adding more processors. The details of how the performance scales with changes in the number of processors depends on factors such as how well the various parts of the program can be parallelised, the size of the input data, and the performance of the parallel computer being used. Each of the computational chemistry packages consists of a number of different program components and algorithms, some of which parallelise very efficiently, some that give only a limited speedup for a small number of processors, and some that will only run sequentially. In order to determine how the cluster can be most effectively used, we have performed some benchmarks to measure how the speedup and efficiency of various chemistry programs scales with the number of processors. The results of our benchmark studies are presented in section 5. The basic usage requirement that we are aiming to address is where there are many users wanting to run many computational chemistry jobs simultaneously. In this situation, the goal should not be to maximise the performance (i.e. minimise the run-time) of each individual job by using as many processors as possible, but rather to maximise the throughput [16a] of the compute resource, that is, to provide the most efficient utilisation of the whole machine across all the jobs, by minimizing the number of wasted cycles. If the number of jobs to be run is greater than the number of processors in the computer, then the most efficient strategy (i.e. maximum throughput) is to take advantage of "job-level" parallelism, and run each job on a separate processor. This avoids the overheads inherent in parallel programs, and guarantees that each processor is being used at maximum efficiency. This had been the usual situation when using the Power Challenge for running multiple Gaussian jobs. If the number of jobs is less than the number of processors, which is quite likely on large machines with hundreds of processors, then the best strategy is obviously to run some (or all) of the jobs in parallel, in order to use all the available processors. There is some scope for parallelism in jobs that use the chemistry program packages described above, although the efficiency of the parallelism will depend on the particular application and program used. In order to maximise throughput, the number of processors used for each job should be kept small enough that the efficiency remains high. For example, a program may get a speedup of 12 on 16 processors (75% efficiency), and a speedup of 32 on 64 processors (50% efficiency). For one job, it is clearly faster to use more processors. 
However in terms of maximising throughput and the efficient usage of the compute resource, it is better to simultaneously run 4 independent jobs that each use 16 processors, rather than running 4 jobs one after another using all 64 processors for each job. The scalability of parallel programs can be improved by using a superior network interconnect between the processors, but this is more expensive. Network infrastructure is characterised by the bandwidth or speed of data transfer between cluster nodes and also by the latency or start up cost of sending a zero-sized message. Parallel algorithms depend on high bandwidth and low latency for their success. Since the main goal of our machine is high throughput and optimal price/performance rather than optimal performance for a single job, good scalability is not such a crucial issue,
although it is certainly desirable. Traditional supercomputers like the Power Challenge utilise memory bus infrastructure to interconnect processors with very high bandwidth and low latency. This is unnecessary for our usage requirements, since running many independent Gaussian jobs does not utilise the high-speed networks or fast shared memory systems that make such computers so expensive. A Beowulf cluster is far more cost effective for our situation, which just requires high throughput. However, we would still like to be able to run some jobs in parallel and get reasonable speedup on modest numbers of processors. There may also be exceptional cases where very large jobs, or high priority jobs, may need to use large numbers of processors in order to get acceptable turnaround times. In the following sections we discuss these price/performance trade-offs and how they affected the design of our machine, and present some performance benchmarks and an analysis of the scalability of various computational chemistry codes on a Beowulf cluster. Our plan is to attempt to maintain a job mix of a reasonably large number of sequential jobs and a smaller number of parallel jobs, that can be managed to provide the best throughput utilisation of the whole cluster. In the future, we plan to consider the computational balance of the nodes of our cluster and to upgrade to either a larger number of nodes, or (possibly fewer) more powerful nodes, as user requirements and parallelism opportunities dictate.
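The throughput argument above can be made concrete with a small worked example. The sketch below uses the illustrative speedup figures quoted earlier (a speedup of 12 on 16 processors and 32 on 64 processors); these are hypothetical numbers chosen for illustration, not measurements from our cluster.

    # Illustrative throughput comparison for four identical jobs on 64 processors.
    def efficiency(speedup, procs):
        """Parallel efficiency = speedup / number of processors."""
        return speedup / procs

    T = 100.0  # single-processor run time of one job (hours, arbitrary)

    # Scenario 1: run the 4 jobs back to back, each on all 64 processors
    # (assumed speedup 32, i.e. 50% efficiency).
    total_serial_on_64 = 4 * (T / 32)      # 12.5 hours until all jobs finish

    # Scenario 2: run the 4 jobs simultaneously, each on 16 processors
    # (assumed speedup 12, i.e. 75% efficiency).
    total_parallel_on_16 = T / 12          # ~8.3 hours; all four finish together

    print(f"4 jobs, 64 procs each, back to back  : {total_serial_on_64:.1f} h")
    print(f"4 jobs, 16 procs each, simultaneously: {total_parallel_on_16:.1f} h")
    print(f"Efficiency at 16 procs: {efficiency(12, 16):.0%}, at 64 procs: {efficiency(32, 64):.0%}")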

4. Designing a Beowulf Cluster


One of the most difficult tasks in designing and commissioning a Beowulf cluster is considering the price/performance benefits from the multitude of different possible configuration options [24]. There are four crucial hardware parameters to choose in the design of a Beowulf cluster for a particular application, such as computational chemistry. These are the type of processor(s) to use in the nodes, the amount of memory installed in each node, the amount and type of disk installed in each node, and the network infrastructure that is used to connect the nodes. The level of demand on each of these resources depends heavily on the methods used by the particular application in question. As the balance points between the price/performance of computational resources, storage resources, and communication resources have changed over the years, many different algorithms have been developed to take advantage of the "cheapest" resource in the system. Broadly speaking, computational chemistry algorithms can be classed as a combination of conventional or direct computation, and sequential or parallel operation. Each of these makes use of compute, storage, and communication resources in different ways. Conventional computation saves to disk many of the intermediate results of the calculation that will be required regularly. This avoids unnecessary re-computation, and even with the relative slowness of disk storage, this is often faster than direct recalculation of these results when they are required. Unfortunately, for large problems the storage requirements can become considerable. Often there is simply not enough disk space available to store all of these intermediate results. The size of the store required for a typical computation is proportional to the number of atoms in the simulation, and to the fourth power of the number of basis set functions used. This size can be reduced significantly through the use of symmetry in the calculation. When there is insufficient storage space, direct computation methods must be used, which recalculate all intermediate values each time they are needed. Direct computation is often significantly slower than conventional methods with current hardware.
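As a rough indication of why conventional calculations exhaust disk space, the following sketch estimates two-electron integral storage under the fourth-power scaling described above. The eight-fold permutational symmetry factor and the 8 bytes per double-precision value are standard assumptions introduced here for illustration; real packages store integrals in packed batches with labels, so the actual figures differ somewhat.

    # Rough estimate of two-electron integral storage for a conventional SCF run,
    # assuming ~N^4/8 unique (ij|kl) integrals and 8 bytes per value.
    def integral_storage_gbytes(n_basis):
        n_unique = n_basis**4 / 8.0        # approximate number of unique integrals
        bytes_total = n_unique * 8.0       # 8 bytes per double, ignoring index labels
        return bytes_total / 1024**3

    for n in (100, 200, 400):
        print(f"{n:4d} basis functions -> ~{integral_storage_gbytes(n):8.1f} GBytes")
    # ~0.1 GBytes at 100 functions, ~1.5 at 200, ~24 at 400: disk fills up quickly.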

Calculations that run in parallel can use either distributed storage or replicated storage. Replicated storage requires that every node has its own copy of the entire data set, and the network is only used to communicate shared values and synchronise the calculation. This clearly limits the size of calculations to the same size that could be achieved on a single node; however, this method can achieve good speedups even with a slow network. Distributed storage can only be used effectively with fast networks. It allows the entire data set required for the calculation to be split into chunks, with each node controlling some portion of the data. Far larger problems can be tackled using distributed storage, since the memory required is shared between multiple processors. Reasonable speedups can be achieved as long as the algorithm is careful to arrange the data so that most of the information required by a node is available in local memory.

There are trade-offs in choosing conventional or direct methods when using parallel operation. Parallel chemistry programs may use one or the other based on the type of calculation, the number of processors to be used, and the amount of memory available. Replicated storage can provide efficient parallelism using either conventional or direct computation, as long as there is enough memory and disk space on each node. For very large problems that are limited by the amount of memory available, distributed storage can be used. In that case, direct computation is usually more efficient than conventional computation, since recomputing values as needed is usually faster than transferring large amounts of stored data between the nodes.

It is clear that choosing the appropriate hardware for a Beowulf depends heavily on the types of problems that will be tackled. Guest [25] makes a good distinction between "grand challenge" class problems and "throughput problems". While computational chemistry certainly does have grand challenge problems which require very large jobs that can only feasibly be run on large numbers of processors, there are many situations where job throughput is the more important criterion. In our case, throughput is definitely the primary objective, although some support for parallel processing is also required. Many other organisations building clusters for different applications have opted for the "grand challenge" approach, using more expensive high-speed networks to enable good scalability of parallel jobs. Where the primary objective is throughput, it is more effective to purchase more nodes (processing power) and use cheaper commodity networking such as Fast Ethernet. However, it is important to note that our benchmark results in section 5 show that reasonable scalability over multiple processors can still be achieved with commodity networks for many problems using standard chemistry packages such as GAMESS-US.

In addition to the choice of hardware, there are other design choices that need to be addressed in building a Beowulf system. Issues such as the choice of operating system, the software used for cluster management and job scheduling, and the installation and maintenance of the system, are also very important.

4.1. Processor Architecture

Modern Beowulf systems are built from commodity desktop computers, usually PCs with Intel Pentium processors or workstations with Compaq/DEC Alpha processors.
Other platforms and processors may sometimes be used for various reasons, but PCs and Alpha workstations are generally the most cost effective for scientific applications, since they provide the best price/performance for floating point operations. Alpha processors can deliver considerably higher floating point performance than similar clock speed Pentiums [26], however they demand a significant price premium. Machines equipped with commodity Pentium processors (or their clones from companies such as AMD) have a cost benefit due to mass market forces.

Another factor that makes Intel-based machines more attractive is the introduction of symmetric multiprocessing (SMP) capability in commodity chipsets and motherboards. This allows several processors to share the same resources in a single machine. SMP machines support shared memory parallel processing, which is used in the parallel version of Gaussian, for example. Although dual processor PC motherboards cost slightly more than single processor motherboards, the total cost per processor is reduced considerably by purchasing machines with two processors, since there is no need to pay for an extra case, power supply, network card, etc. For compute-bound applications multiprocessor PCs are ideal, however for I/O-bound applications it is important to consider such a choice with extreme care. SMP machines share the same I/O bus, which may not be able to handle the bandwidth required to support different applications running on each of the processors on the machine. There is also overhead in sharing access to the disk. We measured this SMP overhead on a dual processor Pentium II PC by running various Gaussian jobs on one processor, and then running two single processor jobs simultaneously on the same PC. A suite of test jobs (test001, test028, test094, test155, test194, test296, and test302 in the Gaussian 98 distribution) using only a single processor took 665 seconds, whereas each of the jobs using both processors took 730 seconds, indicating an overhead of about 10%. Previously published results gave an overhead of around 20% for Gaussian 94 on test job 302 [7]. We found a 25% overhead for this test using Gaussian 98, which took 66.5 seconds for a single job and 83.0 seconds for two simultaneous jobs on a dual processor node. We also found that Gaussian 98 was slower than Gaussian 94 for these small test jobs, perhaps because it has been optimised for larger molecules, for which it is significantly faster. Even with 10-25% overheads, dual processor machines still provide better price/performance than two single processor machines. SMP PCs are also available with 4 or 8 processors, however they have a significant price premium. Also, they are likely to have an even higher overhead from sharing the I/O bus between multiple processes, which will also limit the scalability of some parallel applications. Multi-processor SMP workstations using Alpha processors have recently become available. Given current pricings they do not appear to offer a price/performance benefit over single processor machines, although this may change in the future. Alpha processors give much higher performance than comparable Pentium processors in basic floating point benchmarks such as SPECfp95 [26]. However benchmarks on a variety of chemistry programs show that Pentiums do much better in comparison to Alphas than would be expected from the SPECfp95 rating [27], and other benchmarks show Alpha 21164 processors as being less than 50% faster than a comparable Pentium processor [7] for Gaussian. Newer Alpha 21264 processors give about twice the SPECfp95 rating as the 21164, but machines using these processors are currently much more than twice the price. Thus dual Pentium PCs offer the best price/performance for our application requirements, given that they currently cost less than single processor Alpha machines. The performance of most computational chemistry calculations scales almost linearly with the clock speed of current generation microprocessors. 
However the price of processors increases greatly for the highest clock speeds, since there is a substantial price premium for the fastest available processors. There is typically a "sweet spot" in the price/performance ratio, with the most cost effective processors usually being those just behind the current top-of-the-line models. This is the optimal choice for a Beowulf cluster where the goal is to maximise price/performance and throughput of the overall system. Although cheaper low-end processors may appear to have a better price/performance ratio than these penultimate processors, it is important to remember that the processors are only part of the total cost of a node.
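The SMP overhead figures quoted above follow directly from the measured wall-clock times, as the short calculation below shows; the final line indicates the throughput gain a dual-processor node still offers over a single-processor node despite that overhead.

    # SMP overhead implied by the measurements above: time for a job run alone on
    # one processor versus the time for each of two identical jobs run
    # simultaneously (one per processor) on the same dual-processor node.
    def smp_overhead(t_single, t_shared):
        return t_shared / t_single - 1.0

    suite   = smp_overhead(665.0, 730.0)   # Gaussian 98 test suite, ~10%
    test302 = smp_overhead(66.5, 83.0)     # Gaussian 98 test302 alone, ~25%

    print(f"Test suite overhead: {suite:.0%}")
    print(f"test302 overhead   : {test302:.0%}")

    # Two simultaneous jobs on a dual node deliver 2/(1+overhead) times the
    # throughput of one single-processor node.
    print(f"Dual-node throughput vs one single-processor node: {2/(1+suite):.2f}x")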

Memory is a non-trivial part of the overall node cost. We have chosen to provide 256 MBytes of memory for most of the nodes, which is adequate for virtually all of our simulations. However, memory can be a limiting factor in some applications with very large molecules, so we have provided 1 GByte of memory in some of the nodes. One of the limitations of commodity PCs is that most motherboards only support up to 1 GByte of RAM, although this is adequate for all but the biggest problems. Jobs requiring more memory can be tackled using multiple machines working in parallel, and because of the large problem size these are likely to scale well.

4.2. Network Infrastructure

There are a number of technologies available for linking together a set of compute nodes in a Beowulf cluster. These include commodity networks such as conventional Ethernet running at 10 Mbit/s and Fast Ethernet running at 100 Mbit/s, as well as less common technologies such as Myrinet, Scalable Coherent Interconnect (SCI), and Virtual Interface Architecture (VIA), which are much faster (higher bandwidth and lower latency) and therefore more expensive. A final class of options involves technologies more usually associated with supercomputers, such as the High Performance Parallel Interface (HiPPI).

Machines using conventional or Fast Ethernet can be connected using either hubs or switches, in either a flat or a hierarchical topology. Hubs share the bandwidth among all communicating nodes, whereas switches provide a dedicated high bandwidth backplane that allows pairs of nodes to communicate independently at the full speed rating (100 Mbit/s for Fast Ethernet). Switches typically have slightly higher latencies than hubs, which is a potential limitation for parallel programs. However, most parallel applications must go through the operating system kernel in order to send messages, giving a software latency typically of the order of 100 microseconds, which will dominate the latency from the switch itself. Switches therefore provide a more attractive technical solution but are significantly more expensive than hubs. We experimented with hub technology for our very early prototype machine [24,28], but recent decreases in the price of switch technology and the fact that switched Ethernet provides dedicated (point-to-point) and duplex (two-way simultaneous) connectivity make switches a much more attractive option.

Gigabit Ethernet is now becoming available, and we are presently experimenting with this technology. It is still quite expensive, although it is expected to drop in price and eventually approach the current price of Fast Ethernet. Gigabit Ethernet has similar latencies to Fast Ethernet, and although it is rated at nominally 10 times the bandwidth, it is unlikely that affordable PC nodes will achieve this full bandwidth rating until faster bus technologies become available. In any case, Gigabit switches do not currently provide the scalability to allow full bandwidth between all nodes in large clusters. It is therefore not at present a cost-effective alternative to Fast Ethernet for point-to-point connections to each node. However, we are considering using it for connecting the disk server nodes of our system, to prevent these nodes and their connections becoming a bottleneck to the switch infrastructure.

Myrinet [29] has been the most commonly used high-end (non-commodity) networking technology for Beowulf clusters that require better communications performance than Fast Ethernet.
It provides 1.28Gbit/s (full duplex bandwidth), but at a typical cost of around US$1000-1500 per network card and US$4-6K for a 16-port switch, is very expensive. Giganet [30] has recently announced their cLAN range of high-performance "Cluster Area Network" hardware, which implements the Virtual Interface Architecture (VIA) standard. The bandwidth and latency are similar to Myrinet, for a similar (perhaps slightly cheaper) overall cost. In technological terms, SCI provides the best performance with bandwidths up to 3Gbit/s, but is even more expensive than Myrinet and the additional bandwidth is not useable by current commodity PC buses. Scalability of the network
infrastructure becomes more complex for large clusters where maintaining the full rating of the backplane becomes technologically more difficult (and necessarily more expensive). Note that for our Beowulf cluster which uses dual Pentium PC nodes, the cost per node of using Myrinet or Giganet would have been about the same as the cost of the compute node itself! In other words, using these non-commodity high-speed networking technologies roughly halves the number of compute nodes that can be used in constructing a Beowulf cluster on a fixed budget. Clearly this is not the most cost-effective solution for a cluster that aims to maximise throughput performance, although it may be the best approach for "grand challenge" class problems. It is likely that a number of new products implementing VIA will become available in the near future, which should bring down the prices of these high-end networks to the point where they will eventually become commodity items. The choice of network architecture is perhaps the most difficult aspect of a cluster design to change, and therefore requires careful consideration. Upgrade routes will typically involve complete replacement of the switching infrastructure and network cards in each node, although commodity units such as PC network cards can at least be recycled in other systems. A number of research groups, including the original Beowulf project group at NASA, have investigated the effects of various network combinations, topologies and switching technologies. The original Beowulf project used channel bonding software to make use of two conventional Ethernet card interfaces and double the bandwidth available [31]. This approach yields good performance but necessarily involves special kernel modifications and is perhaps more difficult to maintain than using a single card. Other researchers have considered using combinations of switch and torus configurations to achieve a better compromise bandwidth than is possible from a flat switched structure. We believe that for our purposes, the simpler stackable and essentially flat switch topology as shown in Figure 1 provides adequate performance and is economically optimal. Inexpensive commodity switches for connecting relatively small systems up to 24 nodes are widely available, and provide a good solution for Fast Ethernet speeds. Multiple switches are needed for higher numbers of nodes, and some increase in inter-switch bandwidth is required to maintain the full data rates between arbitrary pairs of nodes. Commodity products to link switches are becoming available for this purpose, and we have used stackable Intel 510T switches with a Matrix module to provide full Fast Ethernet bandwidth (100 Mbit/s) between each node. Figure 2 shows a rear view of one of the shelves, showing the switches and the network cabling for part of the machine. Each of the Intel switches has 24 ports but can also hold an additional 4-port module. The Matrix module can support up to 7 of these 28-port switches (196 nodes) at full bandwidth. Figure 1 shows how two 4-port modules can be used to connect 128 nodes using just 5 switches. We are investigating whether 1000Mbit/s (Gigabit Ethernet) is required for the critical communications link between the front-end and the switch fabric, which may be a potential bottleneck. The "balance point" for a typical parallel application is a useful concept for judging whether processing nodes are adequately networked for supporting parallel processing. 
Some years ago the T800 transputer was thought to be well balanced, having a processing capacity of the order of 1 MFlop/s and a channel bandwidth of about 1 Mbit/s. This ratio is approximately matched by present commodity networks (Fast Ethernet at 100 Mbit/s) and processing nodes (400-500 MHz Intel Pentium processors, which may perform at 100-200 MFlop/s for typical applications). However, processor speeds are doubling roughly every 18 months (Moore's Law) and it is likely that the economically optimal point for processing power will soon exceed the ability of Fast Ethernet technology to feed processors with data for typical parallel applications. From this analysis we conclude that Fast Ethernet technology is still the most economic choice for a Beowulf system, particularly for our requirement of high throughput for many simultaneous jobs, which do not need high-end network infrastructure. Computational chemistry applications are often compute-bound rather than communications-bound, so that Fast Ethernet can even provide reasonable speedups for some parallel applications, as shown in section 5.

Figure 1. Network infrastructure for the Perseus Beowulf. Each of the Beowulf nodes is connected via 100Base Fast Ethernet to an Intel 510T switch. The switches are connected through a stack interface to a Matrix module containing a crossbar switch that can support up to 196 nodes at full bandwidth. The configuration shown here is for up to 128 nodes. Each Beowulf node has two Pentium processors (labelled P in the diagram) with 256 MBytes of RAM and 4 or 10 GBytes of local disk.
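The "balance point" argument above can be summarised as a simple ratio of sustained floating-point rate to network bandwidth. The sketch below uses the approximate figures quoted in the text (1 MFlop/s and 1 Mbit/s for the T800, 100-200 MFlop/s and 100 Mbit/s for a Pentium node on Fast Ethernet); the Gigabit entry uses the nominal link rate, which current PC buses cannot sustain in practice.

    # Crude compute/communication balance ratio (MFlop/s per Mbit/s) for the
    # systems discussed above. Figures are the approximate ones quoted in the
    # text, not measurements.
    systems = {
        "T800 transputer + links":            (1.0,   1.0),     # MFlop/s, Mbit/s
        "Pentium II/III + Fast Ethernet":     (150.0, 100.0),   # ~100-200 MFlop/s typical
        "Pentium II/III + Gigabit (nominal)": (150.0, 1000.0),  # bus-limited in practice
    }

    for name, (mflops, mbits) in systems.items():
        print(f"{name:36s} balance ~ {mflops/mbits:5.2f} MFlop/s per Mbit/s")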

4.3. Disk Configuration

A number of Beowulf clusters have been built using diskless compute nodes. This is a convenient design when the machine is targeted at compute-intensive jobs for which networked file system (NFS) cross-mounted disks and the switch interconnect are adequate. The advantage of the diskless node approach is that the nodes are very simple to maintain and do not have file systems that can become easily damaged if the nodes are shut down incorrectly. The disadvantage is the additional overhead of accessing data across the network from a file server. If there are a lot of file transactions, the file server and the network could potentially become overloaded, which would affect the performance of parallel jobs and jobs with substantial amounts of I/O.
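A back-of-the-envelope comparison illustrates the I/O penalty of the diskless approach for conventional calculations with large scratch files. The sustained transfer rates below are assumptions chosen for illustration (a late-1990s EIDE disk and a 100 Mbit/s link shared by several busy nodes), not measurements from our system.

    # Time to stream a large SCF scratch file from local disk versus over NFS
    # on shared Fast Ethernet (illustrative assumed rates only).
    scratch_gbytes = 2.0                       # size of integral/scratch file
    local_disk_mbytes_s = 15.0                 # assumed sustained EIDE read rate
    nfs_mbytes_s = 100.0 / 8 / 4               # 100 Mbit/s link shared by ~4 busy nodes

    t_local = scratch_gbytes * 1024 / local_disk_mbytes_s
    t_nfs   = scratch_gbytes * 1024 / nfs_mbytes_s

    print(f"Local EIDE disk              : {t_local/60:5.1f} minutes per pass")
    print(f"NFS over shared Fast Ethernet: {t_nfs/60:5.1f} minutes per pass")
    # The gap is repeated at every SCF iteration that re-reads the file.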

Figure 2. Rear view of part of the Perseus Beowulf showing the Intel 510T switches (top shelf), connected by the blue (high-bandwidth) cables, and a number of the dual Pentium compute nodes connected to the switches by the light grey Ethernet cables.

Alternatively, Beowulf nodes can have the full operating system kernel and software installation on their own hard disk, which may reduce network congestion for program loading and other system activities. The disk space for users and applications to work in can either be cross-mounted from a central server or can be a mix of local and central disk space. It may be useful to give each node some local disk space for temporary calculations. Nodes with local disks are fairly easy to design and set up. It is possible to boot them from floppy disks or boot ROMs which can automatically load the full OS and any other required software from a file server.

PC hardware provides the option of cheaper EIDE hard disk drives, or more expensive SCSI disk drives. SCSI disk subsystems provide greater throughput than EIDE disks in situations where many competing processes are accessing the disk. However, EIDE disks provide almost the same level of performance where dedicated disk access is required, especially for the large, contiguous scratch files that are typically associated with running conventional computational chemistry. For this reason, we have chosen to provide our compute nodes with EIDE disks holding the operating system software, swap space and temporary work space. As we discussed above, computational chemistry packages can achieve considerable performance improvements by avoiding re-computation of integrals and basis functions, storing these intermediate results on disk instead. It is therefore advantageous to provide compute nodes with a generous amount of disk space for temporary data.

A SCSI disk subsystem is used on the front-end node, which serves home directories and program binaries to the compute nodes. This file server needs to handle multiple simultaneous disk access requests, and should therefore use the best technology available to avoid becoming a bottleneck for the system. Backups are performed over the network to other resources.

4.4. Operating System

The choice of operating system for cluster systems at present includes Linux, Solaris, FreeBSD, Digital Unix, and Windows/NT. Of these, Linux is currently by far the most popular, with a large software and developer base worldwide. We eliminated Windows because of the relative immaturity or unavailability of free software for cluster computing. Likewise, FreeBSD does not at present have the breadth of software availability that Linux enjoys. Although Digital Unix has good support for cluster systems and is perhaps a better solution for Beowulfs using Alpha processors, it was excluded because of its additional cost. For our target system we primarily considered running Linux or Solaris on each node. Both of these are available at the cost of media only and run well on x86-based machines. We have chosen Linux, although we continue to experiment with Solaris on x86 processors.

We used the Redhat Linux installation package for configuring our Beowulf system. In general any of the Linux releases can be used, although some appear to be better packaged and easier to configure than others. It is important to understand which of the many default services should be activated or deactivated from the defaults given under a release. There are security implications in running some insecure services when they are not properly configured, so we have customised the Linux install to remove services that are not necessary for a dedicated cluster environment. This information is discussed quite well in the plethora of books now available for Linux. Our Beowulf uses the common network structure of a single front-end host which bridges between the private internal subnet used to network the compute nodes and the Internet available on a conventional subnet. This means that if the front-end is properly configured, the security risk from improper configuration of the Beowulf nodes is reduced.

We have experimented with GNU Fortran compilers as well as proprietary compilers from Absoft and from the Portland Group (PGI). These represent the only system software costs for our cluster. Generally the PGI compiler has proven to be the highest performing Fortran compiler for our purposes. Other utilities and management software, for example our queuing system Condor, are already available as part of the commonly bundled Linux releases or as public domain software on the Web.

4.5. Cluster Management Software

For any reasonably sized cluster, it becomes far too difficult to manually organise jobs to run on remote nodes. Special queuing, scheduling, monitoring and systems management software is required to make efficient use of a large cluster of machines. Several cluster management systems [13] are available for queuing and scheduling jobs on clusters of networked computers, including Beowulf clusters. Commercial systems include Codine and LSF, and there are also a number of freely available packages such as Condor, NQS, DQS, and PBS. They all provide queuing software to allow users to easily submit batch jobs to run on the cluster, a scheduler to organise for each job to be run on appropriate compute resources in the cluster, and a loader to start the job running on the specified node. Some systems also provide basic services for monitoring the status and usage of the cluster.
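As an illustration of how a batch of single-processor jobs is typically described to a queuing system such as Condor, the sketch below generates one submit description per Gaussian input file. The wrapper script run_g98.sh is hypothetical, and the exact submit keywords and policy expressions should be checked against the documentation for the installed Condor version.

    # Generate a simple Condor-style submit description for each Gaussian input
    # deck, then hand them to the queuing system with condor_submit.
    inputs = ["job01.com", "job02.com", "job03.com"]   # Gaussian input decks

    template = """universe   = vanilla
    executable = run_g98.sh
    arguments  = {inp}
    output     = {base}.out
    error      = {base}.err
    log        = {base}.log
    queue
    """

    for inp in inputs:
        base = inp.rsplit(".", 1)[0]
        with open(f"{base}.submit", "w") as f:
            f.write(template.format(inp=inp, base=base))
        print(f"wrote {base}.submit  (submit with: condor_submit {base}.submit)")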
Although many of these software packages perform similar tasks, they vary in their ability to support advanced functionality such as cycle stealing, checkpointing, job migration, resource optimisation, job prioritisation, heterogeneous computers, distributed resource ownership, multiple network interfaces, and scheduling of parallel programs. Since our Beowulf is a dedicated machine that is primarily used for high throughput of single processor jobs, we require only basic batch queuing and scheduling from our cluster management system. Commercial cluster management systems such as Codine and LSF are fairly expensive, and their advanced capabilities are overkill for our situation. The main constraint in selecting a cluster management system is that our Beowulf is based around a private subnet, which means that the system must support multiple network interfaces. The only free software package providing this support at the time of evaluation was Condor [16a], which is a popular and well-regarded system. We therefore chose Condor for our cluster management software.

Condor is a sophisticated queuing and scheduling system. Although it is primarily designed to steal idle cycles from a pool of user workstations, we found it also has ideal support for a dedicated cluster. Part of the appeal of Condor is the high degree of flexibility and customisation that is possible. Condor allows the definition of quite detailed policies for users, queues, and resources, which allows efficient use of the machine, as well as providing a means to prioritise use of the resources. This is very useful for a large facility which has many users from several different research groups all wanting their fair share of the overall potential of the machine. Condor provides all of the advanced features of cluster management systems mentioned above, with the exception that it currently only allows efficient scheduling of parallel programs developed using PVM. This is the only limitation of Condor for our purposes, and it means that Condor cannot be used effectively for scheduling parallel versions of Gaussian or GAMESS-US on the cluster. Because of this, we have had to divide the machine into two logical sections, one for running Gaussian jobs under Condor, and another for manually running parallel jobs such as GAMESS-US. Condor currently has some problems in dealing with SMPs, which makes it difficult to efficiently schedule Gaussian jobs that use both processors on a node. However, it is expected that this will be fixed in future versions.

There are two other noteworthy software endeavours which aim to remove the necessity for specialist queuing and scheduling software. The MOSIX [32] and BPROC [33] projects are developing software to enable an entire cluster of machines to appear as one logical machine, or single system image. This would allow the operating system to oversee the scheduling of processing, and to users the machine would appear as a single computer with the processing capability of the entire cluster. Unfortunately these software products are still in development, and are unsuitable for our situation because they do not currently provide efficient support for jobs involving large amounts of I/O, which is necessary for many computational chemistry applications.

4.6. Other Issues

The main problems in designing a Beowulf system are in selecting from the myriad of choices available for the hardware and systems software, which we have outlined above. In this section we discuss some of the more important secondary issues involved in constructing a practical Beowulf system. Figure 3 shows the full Perseus Beowulf system, which is spread over six racks, each with 4 shelves. This framework was specially constructed for our system by departmental technicians and is built from steel box girders and wooden shelves with large holes cut to allow easy cable reticulation. This allows network and power to reach each node case as well as providing relatively easy maintenance access. Some experimentation was required prior to committing to the full Beowulf rack configuration.
We decided from the outset that it would be valuable to keep the individual compute nodes as normal PCs with their own case and power supply. This is more expensive than housing the motherboards in a conventional computer rack, but the additional cost is less than US$100 per node and it makes maintenance (by replacement) and housekeeping considerably easier. It also means that the nodes can be easily resold for use as desktop machines when they are decommissioned.

Figure 3. The full Perseus Beowulf cluster, which occupies several custom-built shelving racks. The racks are constructed from steel box girder with wooden shelves.

Modern switched-mode power supplies are relatively efficient and the amount of electricity consumed over that from a centralised (specialised) power supply is marginal. Powering the cluster up at once can put high loads on circuit breakers, and we have routed each rack of our system through separate electrical power runs. Heat output is clearly a concern with such a large number of nodes involved. Our system is housed in a machine room that was originally designed for housing 1970s mainframes and consequently we have adequate chilled air capacity. We are still considering the maintenance (personnel) effort and life cycle (upgrade) costs for our cluster. For university environments, where recurrent funding is not always easy to obtain, the Beowulf solution with an ad hoc approach to maintenance by replacement seems more desirable than the normal heavy recurring maintenance payments required by conventional HPC systems. A number of systems integration companies now sell Beowulf clusters as a fully installed package, only requiring the purchaser to specify the components to be used. This is a good solution for research groups who do not want to invest the time and effort involved in the construction of a Beowulf system. However, we adopted a do-it-yourself approach to the design and construction of our Beowulf cluster. This involved: design and specification of the required configuration of the PCs; design of the network architecture; setting up some high-end PCs as front ends for the cluster; design and construction of racks and power supplies for the PCs; a bulk purchase of commodity PCs configured to our specifications; purchase of off-the-shelf commodity network components; installation of the PCs in the racks; connection of the network fabric; installation of the operating system on each machine; installation of other required software (Condor, MPI, Gaussian, GAMESS-US, etc).

We also tested the design by initially building a small prototype cluster. In general we do not believe the design, construction and maintenance of a Beowulf system is beyond the technical abilities and resources commonly available in university Chemistry and Computer Science departments and campus computing services.

5. Performance Benchmarks
It has proven quite difficult to maintain a coherent set of performance benchmarks for computational chemistry applications. Efforts in this area have been hampered by the combinatorial explosion in computational techniques and the variety and capabilities of high-performance computers. Several attempts have been made to benchmark standard chemistry packages on various compute platforms, but very few up-to-date, broad-ranging comparisons are available in the literature. However, there have been a limited number of surveys conducted which do provide overview results for a range of machines [4,7,25,27]. These benchmark results show that for chemistry packages such as Gaussian and GAMESS, Pentium processors in commodity PCs provide similar performance (and thus significantly better price/performance) to the processors used in high-end workstations and commercial parallel computers. For example, benchmarks using a suite of Gaussian test problems show that a 400 MHz Pentium II processor outperforms the 195 MHz R1000 processors used in the SGI Power Challenge that was previously used for running our computational chemistry problems [7]. The main advantage of commercial parallel computers over commodity Beowulf clusters is that they provide a high-end communications network that provides better efficiency and scalability for parallel programs [25]. Although our Beowulf cluster will mostly be used to provide high throughput for many sequential jobs, we are also interested in running some large jobs in parallel. We therefore ran a number of different benchmark tests in order to explore the efficiency and scalability of parallel versions of chemistry packages such as Gaussian and GAMESS on a commodity Beowulf cluster. This provides an indication of the number of processors that can be used for particular types of computations while still maintaining efficient use of these resources, given that we are attempting to maximise throughput of the machine. In this section we investigate the effectiveness of a commodity Beowulf cluster running parallel versions of GAMESS-US (PC-Unix release) and Gaussian 98 (rev A.7) on several benchmark molecular problems. The results are from our prototype cluster with each node consisting of dual 400 MHz Pentium II processors, 256 MBytes of memory, and a 4 GByte disk. The nodes are connected by Fast Ethernet using an Intel 510T switch. The machines were running Redhat Linux version 6.0 with the 2.2.5-15smp kernel and glibc 2.1. We used the PGI Fortran compiler pghpf 3.0-1 that is provided with Gaussian 98. We have focussed primarily on the parallel version of GAMESS-US, since this is designed to run efficiently on distributed memory machines using standard MPI message passing libraries (for more information, see Windus et al. [9b] and the documentation in the GAMESS-US distribution [9a]). The parallel implementation of Gaussian has been targeted mainly at shared memory machines such as the SGI Power Challenge, and can only be run on a distributed memory Beowulf cluster using the Linda parallel programming environment [22a]. This has substantial overheads, and even on commercial machines with high-end communications networks, such as the IBM SP2, Gaussian using Linda does not give good efficiency for more than about 8-16 nodes [22b].

It is difficult to provide a clear comparison of the performance of codes running on parallel machines, since these codes are often customised for the particular machine being considered. Even obtaining objective wall-clock times for specific problems is difficult, simply due to the large number of variables that can be adjusted for a particular chemical problem. Indeed, the algorithms used in parallel computational chemistry are continually being enhanced, requiring comparisons between various versions of the same code. Another difficulty is the rapid pace of change in high-performance computing, which makes it difficult to obtain up-to-date results.

One starting point towards objective benchmark comparisons has been the use of standard molecular structures and basis sets as test cases. For example, the EMSL ab initio benchmarks [4] provide several test cases in order to allow computational chemists to make decisions about which hardware and software platforms will be effective for their needs. We have chosen to explore the performance characteristics of two test cases included in the EMSL benchmarks: the ab initio Restricted Hartree-Fock (RHF) single-point energy calculations for ethylene (CH2=CH2) using the Pople-type 6-311+G**(6d) basis set; and 18-crown-6 (C12H24O6) using the 6-31G**(6d) basis set. At the time these tests were first conceived (several years ago), ethylene (with high D2h symmetry) was considered a relatively small problem that could be run on workstation-class machines, while the crown ether (with lower C1 symmetry) was considered a larger problem requiring a serious parallel machine to tackle effectively. Nowadays, ethylene represents an almost trivial problem in terms of molecular size, while the crown ether is more typical of the size of molecules subject to theoretical investigation [34a,34b,34c,34d,34e].

We have supplemented the ethylene and 18-crown-6 cases from the EMSL benchmarking results with two benchmark cases of our own: benzene (C6H6) and morphine. The benzene and morphine calculations were performed at both the Self-Consistent Field (SCF) Restricted Hartree-Fock (RHF) and Second-Order Møller-Plesset (MP2) levels of theory. In both cases, the calculations were performed using input file molecular geometries with C1 symmetry. Most calculations were performed using the standard Pople-type 6-31G(d,p) basis set, although the effects of using a larger basis set were explored by performing some calculations with the significantly more flexible 6-311++G(3d,2p) basis set. Results from the larger basis set are not presented because they exhibit only the expected increase in computation time, while displaying no significant differences in the trends observed with the smaller basis set.

Figure 4. Ethylene. Carbon atoms are shown in brown, while the hydrogen atoms are shown in white.

Figure 5. 18-crown-6. Carbon atoms are shown in brown, oxygen atoms in red, while the hydrogen atoms are shown
in white.

Figure 6. Benzene, C6H6. Carbon atoms are shown in brown, while the hydrogen atoms are shown in white.

Figure 7. Morphine. Carbon atoms are shown in brown, oxygen atoms in red, nitrogen in blue, while the hydrogen
atoms are shown in white.

We have put considerable effort into exploring the performance characteristics of the Beowulf cluster, and so calculations at the SCF wavefunction convergence stage have been performed with both the "direct" (that is, the two-electron integrals are recomputed as needed) and "conventional" (that is, the two-electron integrals are stored on disk and read at each SCF iteration) methods. For benzene and morphine, we have undertaken single-point energy, gradient (the first derivative of the energy with respect to the molecular co-ordinates, corresponding to the negative of the forces acting on atoms within a molecule), and hessian (the second derivative of the energy with respect to the molecular co-ordinates, corresponding to a matrix of force constants within the molecule) calculations in most cases.

The EMSL benchmark results show that a shared memory parallel implementation of Gaussian 92 (Rev C) scales very well on the SGI Power Challenge [4]. While we do not expect Gaussian to scale well on our distributed memory Beowulf cluster, each node of the cluster is a two-processor shared memory machine, which should provide good speedups for the parallel version of Gaussian. In Table 1 we present selected wall-clock timing data (in seconds) for calculations using parallel versions of Gaussian 98 and GAMESS-US on one dual-processor node of our Beowulf cluster. Communication between the two processors on a node's motherboard is significantly faster than communication over the inter-node Fast Ethernet network.

Table 1 shows that GAMESS-US gives better parallel speedup and efficiency than Gaussian 98 for all the benchmark problems, although Gaussian has faster completion times for the largest molecules. Direct computation gives better speedups than conventional computation for both packages, with Gaussian showing no speedup at all for conventional methods. Speedups improve as the size of the molecules increases, with GAMESS-US showing almost perfect speedup for the largest molecules. The scalability of Gaussian is much more strongly dependent on the size of the molecular problem, and it shows quite good speedups for large problems where the direct SCF routine is used. Using dual-Pentium PCs in our Beowulf therefore allows us to make efficient, if limited, use of the parallel implementation of Gaussian, which can be very useful for computations involving large molecules.

Table 1. Selected wall-clock timing data (in seconds) for ab initio calculations using the parallel versions of Gaussian 98 and GAMESS-US on one dual-Pentium node of our Beowulf cluster. Ideally, each two-processor calculation would exhibit a speedup of 2.00 with respect to the corresponding one-processor calculation.

                                 -------- Gaussian 98 --------      --------- GAMESS-US ---------
                                 1 Proc      2 Procs   Speedup      1 Proc       2 Procs   Speedup
Ethylene RHF Conventional          13.3         13.7      0.97         4.6           3.0      1.53
Ethylene RHF Direct                28.4         20.9      1.36        19.9          10.7      1.86
Ethylene MP2 Conventional          19.3         19.6      0.99         9.4           5.4      1.74
Ethylene MP2 Direct                33.8         27.5      1.23        24.9          13.2      1.89
Benzene RHF Conventional          141.8        146.2      0.97        73.5          40.8      1.80
Benzene RHF Direct                135.8         78.6      1.73      4777.7        2433.6      1.96
Benzene MP2 Conventional          251.1        250.1      1.00       214.1         112.1      1.91
Benzene MP2 Direct                340.0        238.2      1.43       389.5         198.2      1.97
Crown Ether RHF Direct           3430.2       1914.3      1.79      7306.5        3811.7      1.92
                             (57.2 min)   (31.9 min)            (121.8 min)    (63.5 min)
Morphine RHF Direct              5261.8       2871.5      1.83     15284.5        7790.8      1.96
                             (87.7 min)   (47.9 min)            (254.7 min)   (129.9 min)
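As a concrete illustration of how the speedup column in Table 1 is derived, and how the corresponding parallel efficiency on a two-processor node would be computed, the following short Python sketch reproduces the arithmetic for one entry of the table. The numbers are taken directly from Table 1; the function names are ours and are used for illustration only.

    def speedup(t_serial, t_parallel):
        """Ratio of one-processor to p-processor wall-clock time."""
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, num_procs):
        """Speedup divided by the number of processors used."""
        return speedup(t_serial, t_parallel) / num_procs

    # Gaussian 98, ethylene RHF direct SCF (wall-clock seconds from Table 1).
    t1, t2 = 28.4, 20.9
    print(round(speedup(t1, t2), 2))        # 1.36, as listed in Table 1
    print(round(efficiency(t1, t2, 2), 2))  # 0.68 on the dual-processor node

An ideal two-processor run would give a speedup of 2.00 and an efficiency of 1.00; the GAMESS-US direct calculations in Table 1 approach this for the larger molecules.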

Figure 8 shows the GAMESS-US wall-clock computation times for the RHF/6-31G(d,p) energy, gradient and hessian calculations of benzene using both conventional and direct SCF techniques on our Beowulf cluster. Each calculation took 9 SCF cycles for the wavefunction to converge. We also show some results for the larger molecule, morphine, which requires almost 100 times as much computation. We were unable to use conventional methods for morphine because of a lack of disk space. Using direct methods, the energy and gradient calculations took 13 SCF iterations to converge. The nodes did not have enough memory to perform the hessian calculations for morphine, which highlights a shortcoming of the replicated data approach for large molecules.

Figure 8 shows that, for up to 20 processors, the conventional technique significantly outperforms the direct method for both the energy and gradient calculations. Conversely, for the hessian calculations, direct computation rivals the conventional integral storage approach. This result is significant because it means that leaner nodes in our cluster, with less disk storage capacity, can be used to perform the more computationally expensive hessian calculations using direct methods without any time-to-solution penalty.

Figure 8. Comparison of wall-clock calculation times for benzene RHF/6-31G(d,p) single-point energy, gradient and hessian calculations using both "direct" and "conventional" SCF routines with the GAMESS-US ab initio molecular orbital program. Results are also presented for direct energy and gradient calculations for morphine.

In Figure 9 we present the wall-clock computation times for the MP2/6-31G(d,p) energy, gradient and hessian calculations of benzene using both conventional and direct SCF techniques on our Beowulf cluster. Each calculation took 9 SCF cycles for the wavefunction to converge. Note that the MP2 calculations take much longer than the corresponding RHF calculations, since they compute the RHF result and then an additional, computationally intensive second-order correction. As with the results generated using RHF/6-31G(d,p) theory, this plot shows that, for up to 20 processors, the conventional technique significantly outperforms the direct computation method for both the energy and gradient calculations. Interestingly, the conventional approach is also slightly faster than the direct computation method for the hessian calculations. Due to a lack of memory, we were unable to perform MP2 calculations for morphine on fewer than 16 processors, and we do not include these results here.

Figure 9. Comparison of wall-clock calculation times for our benchmark benzene MP2/6-31G(d,p) single-point energy, gradient and hessian calculations using GAMESS-US. Results are presented for both "direct" and "conventional" SCF routines.

Figure 10 shows speedup curves for several benzene calculations using the conventional SCF method on the Beowulf cluster. The RHF calculations do not scale particularly well, although they still give reasonable efficiencies up to around 10-15 processors. In contrast, the much more computationally intensive MP2 calculations give very good speedups. At 20 processors, only the RHF hessian calculation is beginning to show diminishing returns from the addition of more processors. A strange feature of these results is the apparent advantage in running the MP2 calculations on a number of processors that is a multiple of five. This observation is repeatable, and is possibly related to how the arrays are distributed for the MP2 calculations and to the size of the message buffers.

Figure 10. Comparison of "conventional" SCF speedups for benzene RHF/6-31G(d,p) and MP2/6-31G(d,p) single-point energy, gradient and hessian calculations using GAMESS-US. The solid line represents the ideal speedup for a parallel computer.

Figure 11 shows the speedup curves for several benzene and morphine calculations using the direct SCF method on the Beowulf cluster. Overall, the speedups are much better than for the conventional methods shown in Figure 10, as might be expected since the direct methods require more computation and have significantly reduced I/O requirements. At 20 processors, the parallel efficiencies are over 75% for all but the benzene RHF hessian calculation, which scales no better than the conventional calculation, with an efficiency of less than 50%. In contrast, the MP2 hessian calculation shows very good speedups. As with the conventional SCF calculations shown in Figure 10, small improvements in the efficiency of the MP2 calculations are observed when using multiples of five processors.

Figure 11. Comparison of "direct" SCF speedups for benzene and morphine RHF/6-31G(d,p) and MP2/6-31G(d,p) single-point energy, gradient and hessian calculations using GAMESS-US. The solid line represents the ideal speedup for a parallel computer.

Comparing the wall-clock times for the benzene RHF and MP2 runs with the 6-31G(d,p) basis set (Figures 8 and 9) shows that the higher level MP2 theory takes longer than the RHF calculation, as expected. However, as the number of processors increases, the MP2 calculations close the performance gap between the two methods. Figures 10 and 11 show the superior efficiency of the MP2 calculations more clearly. This is partly because the MP2 problem should have better scalability, since it requires much more computation, but also because of the extra effort that has gone into efficiently parallelising the MP2 calculations in GAMESS-US. We conclude that more effective use of the processing power is made when running the better-parallelised MP2 code within GAMESS-US.

Each single-point MP2 calculation involves the following sequence of computational steps:
1) Initial calculation setup.
2) Initial molecular orbital selection.
3) One-electron integral evaluation.
4) Two-electron integral evaluation.
5) RHF SCF wavefunction convergence routine.
6) Distributed data MP2 calculation.
7) Final molecular property evaluation.

A closer look at the individual computation times for each of these phases of the benzene conventional MP2/6-31G(d,p) calculation is shown in Figure 12. Such a timing breakdown is a useful analysis tool for evaluating the parallelisation efficiency of each phase of the overall ab initio calculation. There are four sequential computational phases: the initial calculation setup; the initial molecular orbital selection; the one-electron integral evaluation; and the final property evaluation. As more processors are used on the problem, the time taken for these sequential sections of the computation remains essentially unchanged, while the time for the parallelised sections of the code continues to decrease. As the time taken by the parallel sections approaches the time taken by the sequential sections, the parallel efficiency will start to fall off rapidly. Figure 12 illustrates that, for an RHF calculation on benzene, parallelisation inefficiencies should begin to become apparent at around 20 processors. An MP2 calculation should show better scalability, since the time taken for the MP2 section of the program is much greater than that for the sequential sections, even on 20 processors. Amdahl's Law [23a,23b] quantifies this observation, and allows us to compute (for a given number of processors) the maximum possible speedup for a problem in which a particular fraction of the computation is performed sequentially.
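Amdahl's Law itself is simple to state and evaluate. The short Python sketch below shows the bound it places on speedup when a fraction of the single-processor run time is inherently sequential; the 5% serial fraction used here is purely illustrative and is not measured from our calculations.

    def amdahl_speedup(serial_fraction, num_procs):
        """Upper bound on speedup when serial_fraction of the work cannot be parallelised."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_procs)

    # Illustrative only: if 5% of the single-processor time were sequential,
    # 20 processors could achieve a speedup of at most about 10.3 (51% efficiency),
    # however well the remaining 95% of the work is parallelised.
    for p in (2, 10, 20, 100):
        print(p, round(amdahl_speedup(0.05, p), 1))   # 1.9, 6.9, 10.3, 16.8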

Figure 12. Breakdown of the computation times observed for the individual phases of the MP2/6-31G(d,p)
single-point energy conventional calculation for benzene.

Figure 13 illustrates the speedup breakdown for the two-electron integral evaluation, RHF SCF wavefunction convergence and distributed data MP2 phases of the overall single-point benzene MP2/6-31G(d,p) calculation. We see that the two-electron integral evaluation phase of the calculation parallelises extremely well, giving essentially perfect speedup up to 20 processors. The RHF SCF wavefunction convergence routine scales rather poorly, giving only around 50% efficiency at 20 processors, while the MP2 phase of the calculation scales much better, with around 75% efficiency at 20 processors. Note that the speedup of the overall calculation is not limited by the sequential parts of the computation (Amdahl's Law), but rather by overheads in the parallelisation of the RHF and MP2 sections of the calculation. Figure 13 also shows the improved performance achieved with multiples of 5 processors on our cluster. The two-electron integral evaluation even shows "superlinear" speedup, which may be due to the increased use of cache memory in storing the distributed data.
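The same reasoning can be extended to the phase breakdown shown in Figure 13: the overall speedup follows from the one-processor time of each phase and the speedup achieved by that phase, so a poorly scaling parallel phase (here the RHF SCF convergence) can limit the total even when no phase is strictly sequential. The Python sketch below uses invented phase times purely to illustrate the calculation; it is not a fit to our measured data.

    def overall_speedup(phase_times, phase_speedups):
        """Combine one-processor phase times (seconds) and per-phase speedups
        into the speedup of the complete calculation."""
        t_one_proc = sum(phase_times)
        t_parallel = sum(t / s for t, s in zip(phase_times, phase_speedups))
        return t_one_proc / t_parallel

    # Hypothetical 20-processor example: integral evaluation scales perfectly,
    # the RHF SCF phase reaches only 50% efficiency (speedup 10) and the MP2
    # phase 75% (speedup 15).
    times    = [100.0, 300.0, 600.0]   # two-electron integrals, RHF SCF, MP2
    speedups = [20.0, 10.0, 15.0]
    print(round(overall_speedup(times, speedups), 1))  # 13.3

Even with no sequential phases at all, the poorly scaling RHF phase pulls the overall efficiency in this hypothetical example down to about 67% on 20 processors, which mirrors the behaviour seen in Figure 13.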

Figure 13. Breakdown of the parallelisation speedups observed for individual phases of the MP2/6-31G(d,p)
single-point energy calculation of benzene.

6. Conclusions
We have designed and built a Beowulf cluster dedicated to providing maximum throughput for many simultaneous computational chemistry jobs using standard ab initio molecular structure packages such as Gaussian and GAMESS-US. Based on a number of chemistry benchmarks presented or referenced here, we find that commodity PCs with dual Pentium processors offer the best price/performance ratio for these applications, and that inexpensive commodity Fast Ethernet provides adequate network connectivity. The Beowulf cluster we have constructed provides a peak performance of around 100 GFlops at a fraction of the cost of a comparable commercial supercomputer. This dedicated cluster outperforms the supercomputer previously used by our chemistry research group, which was shared with many other users; this situation is likely to apply to many other organisations.

Our Beowulf cluster was designed specifically to provide high throughput for many sequential (single-processor) jobs; however, we also run some jobs in parallel in order to gain the benefits of more memory and faster turn-around times. We have therefore explored the scalability of parallel chemistry applications on the Beowulf cluster. The parallel version of GAMESS-US performs well on Beowulf clusters for most types of calculations, particularly for direct computation methods and MP2 calculations, and its speedup and efficiency improve for larger molecules. Gaussian 98 is not designed for distributed memory clusters; however, it gives reasonable speedups for direct computation methods using the two processors on each Beowulf node in SMP mode.

It is difficult to compare benchmark results for our Beowulf cluster with results reported for chemistry benchmarks on other high-performance parallel computers. This is mainly due to the lack of standard benchmarks that specify precisely the same application problems - in this case the configuration of the chemical system to be simulated, the algorithm used, and the basis set and integrals. Also, computers are continually improving, so results quickly become outdated. However, a number of benchmark tests have shown that the single-processor performance of the Beowulf is competitive with that of commercial parallel machines, and usually much better in terms of price/performance. Commercial systems have the advantage of faster communications networks that can provide better scalability for parallel applications. However, the commodity Fast Ethernet of the Beowulf cluster gives reasonable speedups for GAMESS-US up to around 20 processors (and very good speedups for large problems), which is perfectly adequate for our situation, in which many jobs will be run simultaneously.

We believe commodity Beowulf clusters can meet the performance needs of many computational chemistry applications, and offer significant price/performance advantages. Another advantage of commodity clusters is that they can be sized to fit any budget, so research groups or university departments can now afford their own dedicated high-performance computer. The Beowulf architecture and the possible upgrade routes for commodity clusters will likely offer an economically attractive platform for computational chemistry for at least the next several years. The trend is for both processor and network technologies to improve in performance and decrease in price; it will be important for memory buses to continue to improve as well. The ultimate test of success for our system will be to see how practical it is to maintain and operate over the coming years, and how much valuable computational chemistry can be carried out on it.

Acknowledgments
This work was supported by The University of Adelaide and an Australian Research Council (ARC) Research Infrastructure Equipment and Facilities Program (RIEFP) Grant. The system we describe here was constructed for the Adelaide Computational Chemistry Consortium by the Distributed and High Performance Computing Group in the Department of Computer Science at the University of Adelaide. We thank B. Hyde-Parker, J.D. Borkent, F.A. Vaughan and Andrew Naismith for technical assistance in setting up the shelving racks, electrical system and network cabling for our Beowulf, and F.A. Vaughan for discussions on system design.

References
[1] Gaussian, Inc., GaussView: An Affordable, Full-Featured Graphical User Interface for Gaussian, http://www.gaussian.com/gvbroc.htm.
[2] Wavefunction, Inc., Spartan, http://www.wavefun.com/software/software.html.
[3] G. Schaftenaar, MOLDEN: a pre- and post-processing program of molecular and electronic structure, http://www.caos.kun.nl/~schaft/molden/molden.html.
[4] D. Feller, The EMSL Ab Initio Methods Benchmark Report: A Measure of Hardware and Software Performance in the Area of Electronic Structure Methods, Pacific Northwest National Laboratory technical report PNNL-10481 (Version 3.0), June 1997. http://www.emsl.pnl.gov:2080/docs/tms/abinitio/cover.html.
[5a] Center of Excellence in Space Data and Information Sciences (CESDIS), NASA Goddard Space Flight Center, Beowulf Project at CESDIS, http://www.beowulf.org/.
[5b] Donald J. Becker et al., BEOWULF: A Parallel Workstation for Scientific Computation, Proc. Int. Conf. on Parallel Processing, Aug 1995, Vol 1, 11-14.
[5c] Daniel Ridge et al., Beowulf: Harnessing the Power of Parallelism in a Pile-of-PCs, Proc. IEEE Aerospace, 1997.
[5d] Beowulf Project, Beowulf Consortium, http://www.beowulf.org/consortium.html.
[6] J. Tirado-Rives and W.L. Jorgensen, Viability of Molecular Modeling with Pentium-based PCs, Journal of Computational Chemistry, 1996, 17:11, 1385-86.
[7] M.C. Nicklaus, R.W. Williams, B. Bienfait, E.S. Billings and M. Hodoscek, Computational Chemistry on Commodity-Type Computers, Journal of Chemical Information and Computer Sciences, 1998, 38:5, 893-905.
[8] Gaussian, Inc., Gaussian, http://www.gaussian.com/.
[9a] Gordon Research Group, Iowa State University, GAMESS-US, http://www.msg.ameslab.gov/GAMESS/GAMESS.html.
[9b] T.L. Windus, M.W. Schmidt and M.S. Gordon, Parallel Processing With the Ab Initio Program GAMESS, in Toward Teraflop Computing and New Grand Challenge Applications, R.J. Kalia and P. Vashishta, Eds., Nova Science Publishers (New York), 1995.

[10] Argonne National Laboratory, The Message Passing Interface (MPI) Standard, http://www.mcs.anl.gov/mpi/.
[11] Oak Ridge National Laboratory, PVM: Parallel Virtual Machine, http://www.epm.ornl.gov/pvm/pvm_home.html.
[12] T.E. Anderson et al., A Case for NOW (Networks of Workstations), IEEE Micro, pp. 54-64, February 1995.
[13] Mark A. Baker, Geoffrey C. Fox and Hon W. Yau, Cluster Management Software, The NHSE Review, 1996 Volume, First Issue, http://nhse.cs.rice.edu/NHSEreview/CMS/.
[14a] R. Buyya, editor, High Performance Cluster Computing, Volume 1: Architectures and Systems, Prentice Hall, 1999.
[14b] R. Buyya, editor, High Performance Cluster Computing, Volume 2: Programming and Applications, Prentice Hall, 1999.
[15] Supercomputer Research Institute, Florida State University, DQS - Distributed Queueing System, http://www.scri.fsu.edu/~pasko/dqs.html.
[16a] The Condor Project, University of Wisconsin, Condor: High Throughput Computing, http://www.cs.wisc.edu/condor/.
[16b] M. Litzkow, M. Livny and M.W. Mutka, Condor - A Hunter of Idle Workstations, Proc. of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 104-111.
[17] Molecular Graphics and Simulation Laboratory, National Institutes of Health, NIH LoBoS Supercomputer, http://www.lobos.nih.gov/.
[18a] Department of Chemistry, University of Calgary, COBALT cluster, http://www.cobalt.chem.ucalgary.ca/.
[18b] S. Patchkovskii, R. Schmid and T. Ziegler, Chemical supercomputing on the cheap: 94 Gflops computer system at CDN$3680/gigaflop, Poster presented at the 82nd Conference of the Canadian Society for Chemistry, Toronto, May 1999. http://www.cobalt.chem.ucalgary.ca/cobalt/poster/.
[19] Distributed and High-Performance Computing Group, University of Adelaide, Perseus: A Beowulf for computational chemistry, http://dhpc.adelaide.edu.au/projects/beowulf/perseus.html.
[20] Computing for Science Ltd, GAMESS-UK home page, http://www.dl.ac.uk/CFS/.
[21] M. Karplus, Chemistry at HARvard Macromolecular Mechanics (CHARMM), http://yuri.harvard.edu/charmm/charmm.html.

[22a] Scientific Computing Associates, Linda, http://www.sca.com/linda.html.
[22b] Gaussian, Inc. and Scientific Computing Associates, Highly Efficient Parallel Computation of Very Accurate Chemical Structures and Properties with Gaussian and the Linda Parallel Execution Environment, http://www.gaussian.com/wp_linda.htm.
[23a] Ian Foster, Designing and Building Parallel Programs, Addison Wesley, 1995.
[23b] R.W. Hockney and C.R. Jesshope, Parallel Computers 2, Adam Hilger, 1988.
[24] K.A. Hawick, D.A. Grove and F.A. Vaughan, Beowulf - A New Hope for Parallel Computing?, Proc. 6th IDEA Workshop, Rutherglen, Victoria, Jan 1999. DHPC Technical Report DHPC-061, http://dhpc.adelaide.edu.au/reports/061/abs-061.html.
[25] M.F. Guest, P. Sherwood and J.A. Nichols, Massive Parallelism: The Hardware for Computational Chemistry?, http://www.dl.ac.uk/CFS/parallel/MPP/mpp.html.
[26] Standard Performance Evaluation Corporation, SPEC Benchmarks, http://www.specbench.org/.
[27] M.F. Guest, Performance of Various Computers in Computational Chemistry, Proc. of the Daresbury Machine Evaluation Workshop, November 1996. Updated version available at http://www.dl.ac.uk/CFS/benchmarks/compchem.html, June 1999.
[28] Distributed and High-Performance Computing Group, University of Adelaide, DHPC Beowulf Projects, http://dhpc.adelaide.edu.au/projects/beowulf.
[29] Myricom, Myrinet Products, http://www.myri.com/myrinet/.
[30] Giganet, Giganet cLAN, http://www.giganet.com/.
[31] Thomas Sterling, Donald J. Becker, Daniel Savarese, Michael R. Berry and Chance Reschke, Achieving a Balanced Low-Cost Architecture for Mass Storage Management through Multiple Fast Ethernet Channels on the Beowulf Parallel Workstation, Proc. International Parallel Processing Symposium (IPPS 96), April 1996.
[32] MOSIX R&D Group, The Hebrew University of Jerusalem, MOSIX: Scalable Cluster Computing for Linux, http://www.mosix.cs.huji.ac.il/.
[33] Center of Excellence in Space Data and Information Sciences (CESDIS), NASA Goddard Space Flight Center, BPROC: Beowulf Distributed Process Space, http://www.beowulf.org/software/bproc.html.

[34a] M.A. Buntine, V.J. Hall, F.J. Kosovel and E.R.T. Tiekink, The Influence of Crystal Packing on Molecular Geometry: A Crystallographic and Theoretical Investigation of Selected Diorganotin Systems, J. Phys. Chem. A, 102, 2472-2482 (1998).
[34b] S.L. Whitbread, P. Valente, M.A. Buntine, P. Clements, S.F. Lincoln and K.P. Wainwright, Diastereomeric D-1,4,7,10-Tetrakis((R)-2-hydroxy-2-phenylethyl)-1,4,7,10-tetraaza-cyclododecane and its Alkali Metal Complex Ions. A Potentiometric Titration, Nuclear Magnetic Resonance and Molecular Orbital Study, J. Am. Chem. Soc., 120, 2862-2869 (1998).
[34c] E.R.T. Tiekink, V.J. Hall and M.A. Buntine, An examination of the influence of crystal structure on molecular structure. The crystal and molecular structures of some diorganotin-chloro-(N,N-dialkyldithiocarbamate)s, R2Sn(S2CNR2)Cl, R = Me, tBu, Ph, Cy; R2 = (Et)2, (Et, Cy) and (Cy)2: a comparison between solid state and theoretical structures, Z. Kristallogr., 214, 124-134 (1999).
[34d] E.R.T. Tiekink, V.J. Hall, M.A. Buntine and J. Hook, Examination of the effect of crystal packing forces on geometric parameters: a combined crystallographic and theoretical study of 2,2'-bipyridyl adducts of R2SnCl2, Z. Kristallogr., 215, 23-33 (2000).
[34e] M.W. Heaven, G.M. Stewart, M.A. Buntine and G.F. Metha, Neutral tantalum-cluster carbides: A multi-photon ionisation and density functional theory study, J. Phys. Chem. A, in press (2000).
