Tutor: Leonidas Kapsokalivas
Tuesdays: 13:00 Lecture, 14:00 Lecture, 15:00 Tutorial
Web Page: www.dcs.kcl.ac.uk/teaching/units/material/7ccspda/

Literature
• J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, 1992.
• F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees and Hypercubes, Morgan Kaufmann, 1991.
• V. Kumar et al., Introduction to Parallel Computing, Benjamin/Cummings, 2nd edition, 2003.

Why do we study parallel algorithms?
• The speed of serial computers cannot increase forever.
• Price/megaflop rises sharply for serial computers above a certain performance level.
• Finding better solutions faster is needed for ever larger problems, e.g., weather forecasting, image processing, ...
• Fast and cheap PC clusters have overtaken specially designed parallel supercomputers.
• The human brain is massively parallel and for some tasks still outperforms current technology.

Outline
• Part I: Introduction: Parallel Models, Performance of Parallel Algorithms, Communication Complexity.
• Part II: Basic Techniques: Trees, Pointer Jumping, Divide and Conquer, Partitioning, Pipelining.
• Part III: Searching, Merging, Sorting; Graph Algorithms; String Matching and Pattern Analysis.
• If time permits: Selected Arithmetic Computations.

Prerequisites
You should have a good understanding of elementary data structures and of basic techniques for designing and analysing sequential algorithms (e.g., calculating their run-time complexity). There are many references; one is: Cormen et al., Introduction to Algorithms, 2nd edition, MIT Press, 2001.

Introduction
The bounds on the resources (e.g., time and space) required by a sequential algorithm are measured as a function of the input size.
• Worst-case analysis of algorithms: the maximum amount of a resource required by any instance of size $n$.
• Bounds are expressed asymptotically using the following standard notation:
  – $T(n) = O(f(n))$: there exist positive constants $c$ and $n_0$ such that $T(n) \le c f(n)$ for all $n \ge n_0$.
  – $T(n) = \Omega(f(n))$: there exist positive constants $c$ and $n_0$ such that $T(n) \ge c f(n)$ for all $n \ge n_0$.
  – $T(n) = \Theta(f(n))$: both $T(n) = O(f(n))$ and $T(n) = \Omega(f(n))$ hold.
• The running time of a sequential algorithm is estimated by the number of basic operations required.
• Uniform cost criterion: one unit of time is charged for reading from and writing to memory, and for basic arithmetic and logic operations (adding, subtracting, comparing or multiplying two numbers, logical OR or AND of two words). The cost does not depend on the word size.

Speedup and Efficiency
Let $P$ be a computational problem and $n$ its input size.

Definition: The speedup obtained by a parallel algorithm $A$ using $p > 1$ processors is
$$S_p(n) = \frac{T(n)}{T_p(n)},$$
where $T(n)$ denotes the best known sequential running time for $P$ and $A$ solves $P$ in $T_p(n)$ steps. Note that $S_p(n) \le p$; therefore, we would like to design algorithms that achieve a speedup close to $p$.

Definition: The efficiency obtained by a parallel algorithm $A$ using $p$ processors is
$$E_p(n) = \frac{T_1(n)}{p\,T_p(n)},$$
where $T_1(n)$ denotes the running time of $A$ when $p = 1$. Note that $T_1(n)$ is not necessarily the same as $T(n)$. $E_p(n)$ indicates the effective utilisation of the $p$ processors relative to $A$. If $T_1(n) = T(n)$, then $E_p(n) = S_p(n)/p$. Clearly $E_p(n) \le 1$; stated as a percentage, efficiency is at most 100%.
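These two definitions translate directly into code; Example 1 below applies them. A minimal sketch in Python (the function names are my own, not part of the course material):

    def speedup(T_seq, T_p):
        """S_p(n) = T(n) / T_p(n): best sequential time over parallel time."""
        return T_seq / T_p

    def efficiency(T_1, p, T_p):
        """E_p(n) = T_1(n) / (p * T_p(n)): utilisation of the p processors."""
        return T_1 / (p * T_p)

    # For Example 1 below (assuming T_1(n) = T(n)):
    # speedup(35000, 2500)        -> 14.0
    # efficiency(35000, 60, 2500) -> 0.2333...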
Example 1: Suppose the best known sequential algorithm solves a problem $P$ with input size $n = 100$ in 35000 steps. A parallel algorithm $A$ uses 60 processors to solve $P$ in 2500 steps. What is the speedup obtained by $A$, and how efficient is $A$?

With $n = 100$, $T(100) = 35000$, $p = 60$ and $T_{60}(100) = 2500$:
$$S_{60}(100) = \frac{35000}{2500} = 14, \qquad E = \frac{14}{60} \approx 0.23.$$
Is $A$ a good algorithm? Note that we assume $T_1(n) = T(n)$. What happens if we have more processors available?

Limitations on the Running Time
There exists a limiting bound on the running time, denoted $T_\infty(n)$, beyond which the algorithm cannot run any faster, no matter what the number of processors. For any value of $p$:
$$T_p(n) \ge T_\infty(n), \qquad E_p(n) \le \frac{T_1(n)}{p\,T_\infty(n)}.$$
Therefore, the efficiency of an algorithm degrades quickly as $p$ grows beyond
$$p \ge \frac{T_1(n)}{T_\infty(n)}.$$

Example 2: Suppose that for problem $P$ the best known sequential algorithm runs in $O(n)$ steps, where $n$ is the input size. A parallel algorithm $A$ uses $p$ processors to solve $P$ in $O(\log n + n/p)$ steps. What is the speedup obtained by $A$?
$$S_p(n) = O\!\left(\frac{n}{\log n + n/p}\right).$$
If $p \le n/\log n$, then $\log n \le n/p$ and therefore
$$S_p(n) = O\!\left(\frac{n}{2n/p}\right) = O(p).$$

Example 2 (cont.): What happens if $p = n$?
$$S_p(n) = O\!\left(\frac{n}{\log n + 1}\right) = O\!\left(\frac{n}{\log n}\right).$$
This is certainly not good, especially as we need $n$ processors!

Conclusion: Algorithm $A$ has good speedup and efficiency for up to $n/\log n$ processors, but not for more.

Our aim is to develop parallel algorithms that can provably achieve the best possible speedup. Therefore, our model of parallel computation must allow the mathematical derivation of an estimate of the running time $T_p(n)$ and the establishment of lower bounds on the best possible speedup for a given problem.

Models of Parallel Computation
• The classic model for serial computation is the Turing machine.
• A more realistic but still idealised model is the Random Access Machine (RAM), which has been used successfully to predict the performance of sequential algorithms.
• Modelling parallel computation is considerably more challenging, even if we assume unlimited memory and unit cost: the many interconnected processors form a new dimension.
• We need a suitable framework for presenting and analysing parallel algorithms which is:
  – Simple enough to describe parallel algorithms easily, and to analyse performance measures such as speed, communication and memory utilisation.
  – General enough that it does not rely on a particular class of architectures, i.e., is as hardware-independent as possible.
  – Implementable enough that parallel algorithms developed for the model can easily be implemented on parallel computers.
  – Accurate enough that the analysis performed captures the actual performance of the algorithms on parallel computers.

Categorisation of Parallel Architectures
Flynn's taxonomy, introduced in 1966:

SISD (Single Instruction stream, Single Data stream): a standard von Neumann serial computer.

SIMD (Single Instruction stream, Multiple Data streams): multiple processors, possibly using different data streams, execute the same instruction synchronously at each time step (or are switched off). Examples: Illiac IV, MasPar MP-1, MasPar MP-2, Thinking Machines CM-1.

MISD (Multiple Instruction streams, Single Data stream): multiple processors, using the same data, execute possibly different instructions. (This is not commonly used.)

MIMD (Multiple Instruction streams, Multiple Data streams): multiple processors, possibly using different data streams, execute possibly different instructions. There is no central control unit; the processors operate autonomously and, usually, asynchronously. SPMD (Single Program, Multiple Data) is often used to design MIMD software. Examples: Beowulf clusters (e.g., a fast Ethernet network of PCs running Linux), Myrinet, ASCI Red. See http://www.top500.org for the current fastest computers.
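To make the SPMD style concrete, here is a minimal sketch using Python's multiprocessing module (my own illustration, not from the lecture material): every process runs the same program, but each acts on its own rank and slice of the data.

    from multiprocessing import Process

    def worker(rank, chunk):
        # Same program in every process; behaviour depends only on rank and data.
        print(f"process {rank}: partial sum = {sum(chunk)}")

    if __name__ == "__main__":
        data = list(range(12))
        p = 4
        chunks = [data[i::p] for i in range(p)]   # partition the data among p processes
        procs = [Process(target=worker, args=(r, chunks[r])) for r in range(p)]
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()

Because the processes run asynchronously, the four partial sums may be printed in any order, which is exactly the "no central control unit" behaviour described above.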
Address-Space Organisation
• Message-passing
  – Processors are connected via an interconnection network.
  – Each processor has local memory that is accessible only by that processor.
  – Processors interact by passing messages via the network.
  – Referred to as a distributed-memory or private-memory architecture.
  – MIMD message-passing computers are referred to as multicomputers.
• Shared-Address-Space
  – Hardware support for read and write access by all processors to a shared address space.
  – Processors interact by modifying data objects stored in the shared address space.
  – MIMD shared-address-space computers are called multiprocessors.

Shared-memory parallel computers are shared-address-space computers with a shared memory that is equally accessible to all processors via an interconnection network.
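The essential contrast between the two organisations is how processors interact. A minimal sketch of both interaction styles in Python (an illustration with names of my own choosing, not course material): explicit message passing over a channel versus direct writes to a shared variable.

    from multiprocessing import Process, Pipe, Value

    def msg_worker(conn):
        # Message passing: the only interaction is an explicit send.
        conn.send("hello from a private address space")
        conn.close()

    def shm_worker(counter):
        # Shared address space: interact by modifying a shared data object.
        with counter.get_lock():
            counter.value += 1

    if __name__ == "__main__":
        parent_end, child_end = Pipe()
        sender = Process(target=msg_worker, args=(child_end,))
        sender.start()
        print(parent_end.recv())              # receive the message
        sender.join()

        counter = Value("i", 0)               # an int placed in shared memory
        writers = [Process(target=shm_worker, args=(counter,)) for _ in range(4)]
        for w in writers:
            w.start()
        for w in writers:
            w.join()
        print("shared counter:", counter.value)   # -> 4

Note that the shared-memory style needs a lock to serialise the concurrent writes, while the message-passing style needs no lock because no memory is shared.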