
HPC definition: large-volume numerical calculations done with fast and efficient algorithms; a large part of the work is improving the algorithms for better efficiency.

Goals: to look at how parallel machines operate. Lectures are at a basic level. Intel tells us the architecture of the machine, so we write our programs according to that architecture. Highly scalable algorithms: as I increase the number of nodes, the time should scale; we check the scalability. We also modify the computing model to take into account the latest developments: GPUs, and MIC processors from Intel (accelerators, processor arrays), Intel's answer to GPU computing. Codes for complex problems: climate prediction, protein folding, structure prediction.

Sequential versus parallel: these lectures fall on the parallel side. Sequential examples: C, Java, FFT, BLAST, bubble sort. When we talk of programming we are serially minded; c = a + b is how we think of programming. But we do other things in parallel. The sequential main idea is steps in perfect order. The parallel methodology: launch a set of tasks that communicate to make progress. Example: sort 500 answer papers by making 5 equal piles, having them sorted by 5 people, and merging them together. Communication becomes important.

Data parallelism versus functional parallelism: team work requires communication. Data parallel: for i = 0,99: a[i] = b[i] + c[i]. A multithreaded program can split the iterations into sets and send them to different processors. Functional parallelism: for i = 0,99: a[i] = b[i] + c[i]

y[i] = c[i]^2 + 2*b[i]. The two loops are independent computations and can run at the same time. Most problems have both data parallelism and functional parallelism, e.g. multithreading on one and message passing on the other. These are different ways of parallelizing; ways of computing, not of programming.

The oldest parallel paradigm is array and vector computing on vector computers; MIC or accelerator cards are similar to it. Each processor has its own memory. Why it did not stay popular: if you have a vector of 120 elements, communication between processors requires high bandwidth, and the splitting had to be programmed explicitly. The worst problem: it needed special hardware design, the community using HPC was small, and the special hardware was not produced in large volume. So finally it came down to cost and scalability issues.

The main paradigms today are shared versus distributed memory programming. Shared memory: all CPUs access the same memory; if I give an address, it points to the same memory location for every CPU. Today we talk about quad-core and multicore chips, where one memory is shared. With multi-node cluster architectures we have both shared and distributed memory. A multiple-CPU computer with shared memory is called a multiprocessor. In a uniform memory access (UMA) machine, also called a symmetric multiprocessor (SMP), the distance from every CPU to memory is the same; in a distributed multiprocessor it is not (asymmetric multiprocessing). Cache memory: the cache takes a copy of primary memory and keeps it close to the CPU; it is faster. Every CPU fetches the element it has to work on and keeps it in its cache. This leads to the problem of cache coherence.

Each cache can hold its own copy of the same memory location. When one processor modifies it, all the caches should see the same value; you also need processor synchronization. Cache coherence and cache misses arise because each processor thinks the memory location is the same. How this is overcome: the moment the variable is changed (not at the programming level, but below it), its copies are invalidated in all other caches; the next access there is a cache miss and the data is fetched again. Too many cache misses will slow the program down. The computer writes in chunks, so the entire block gets invalidated. Just increasing clock speed and bus speed is not enough; you need mechanisms for memory management.

Distributed multiprocessing, NUMA (non-uniform memory access): the advantage is that you can build it with commodity processors.

Interconnection networks in distributed-memory architectures. Shared medium: CPUs connected through some medium such as Ethernet; everything is broadcast, with a tag attached so a message reaches CPU 2 but not CPU 3. We do not use this. What is used is a switched medium: point-to-point messaging. There are different interconnection topologies; the best way to connect is one where the distance from one processor to another is minimum: 2-D mesh, hypertree, fat tree. The fat tree is used in the existing clusters at IIT M and will be used in future ones too. The link thickness, i.e. the bandwidth, keeps increasing as you go up the tree. It is very popular because you can keep scaling up.

Simple parallel programming example: sorting the numbers in a large array. Divide it into 5 pieces. Each piece is sorted by an independent sequential algorithm and left within its region. The resulting sorted pieces are merged by reordering among adjacent parts.

Some points on work breakdown:
1) We need a parallel algorithm.
2) We should always prefer simple, intuitive breakdowns.
3) Highly optimized sequential algorithms are usually not easy to parallelize; we have to redo them.
4) Breaking up the work often involves some pre- or post-processing.
5) Think of cost in terms of time.

Fine versus large (coarse) grain parallelism, and its relationship to communication: if a code takes 1 hour, breaking it in two may give 35 minutes per part; but breaking it again and again will not keep halving the time. After a certain point there is no further advantage, only the problems of communication overhead and cache misses.
