
COMPUTER ARCHITECTURE

ASSIGNMENT 1

BY: SHIVAM SHARMA
NUID: 001790680
Email: sharma.s@husky.neu.edu


Summary:
Both of the given benchmarks were tested thoroughly on the three micro-architectural platforms provided. The pass counts were adjusted after experimentation and turned out to differ between platforms. Each benchmark was compiled with GCC on Linux and with cc on the Sun platform, at the available optimization levels. The assembly listings and flat profiles were then studied, and conclusions were drawn for the given problems.
Problem 1:
(a)
Yes, Dhrystone ran slower the very first time it was executed on a system, but on running it several times afterwards the slowdown was not observed.
The reason is that when the benchmark is run a second time on a particular system, the caches already hold instructions and data from the previous run, so fewer accesses go all the way to main memory and the run completes in less time. This is temporal locality at work: recently used code and data are likely to be reused soon and are served from cache.
However, on running the program again at a much later time the pattern was not reproducible, because by then the caches had been refilled by my earlier runs and by the other users sharing the machines.

(b)
Dhrystone is a benchmark that measures integer computing performance on a particular system.
The given benchmark, Dhrystone, was tried on three micro-architectural platforms available in the College of Engineering laboratory: x86-64, x86-32, and SunOS.
Dhrystone was compiled with the GCC and cc compilers and analysed for its performance on each architecture, and an appropriate pass count was chosen based on the experiments: different loop sizes were tested, and a suitable size was settled on for each machine.

Testing was done with the following pass counts, with roughly 15-25 runs for each variant of the code. The following results were obtained on the x86-64 machine (Nano).
x86-64 machine
(i) 100,000 loops -

For all runs:
Dhrystone time for 100000 passes = 0
Floating point exception
(ii) 10,00,000 loops (10^6) -

For the majority of runs we observed:
Dhrystone time for 1000000 passes = 0
Floating point exception

For only 2 runs we got:
Dhrystone time for 1000000 passes = 1
This machine benchmarks at 1000000 dhrystones/second

Seeing the above results, the loop count again needed to be increased: the system was finishing the runs too fast for any analysis. We therefore increased the loop count by another factor of 10.
(iii) 100,00,000 loops (10^7) -
Passes: 0,1,1,0,1,0,0,1,1,0,1,0,1,0,1,0,1,1,0,1,1,0,1,0,1,0,0,1,1,0
Runs#: R1, R4, R6, R7, R10, R12, R14, R16, R19, R22, R24, R26, R27, R30
For time = 0:
Dhrystone time for 10000000 passes = 0
Floating point exception
For time = 1:
Dhrystone time for 10000000 passes = 1
This machine benchmarks at 10000000 dhrystones/second
A similar result to the previous loop count, so we increase the loop size again.

(iv) 10,00,00,000 loops (10^8)

For all runs R1-R25:
Dhrystone time for 100000000 passes = 6
This machine benchmarks at 16666666 dhrystones/second

This pass count is still insufficient, so we increase it again by a factor of 10.

(v) 100,00,00,000 loops (10^9)

R1-R2:   Dhrystone time for 1000000000 passes = 60
         This machine benchmarks at 16666666 dhrystones/second
R3-R5:   Dhrystone time for 1000000000 passes = 61
         This machine benchmarks at 16393442 dhrystones/second
R6-R8:   Dhrystone time for 1000000000 passes = 60
         This machine benchmarks at 16666666 dhrystones/second
R9:      Dhrystone time for 1000000000 passes = 61
         This machine benchmarks at 16393442 dhrystones/second
R10:     Dhrystone time for 1000000000 passes = 59
         This machine benchmarks at 16949152 dhrystones/second
R11:     Dhrystone time for 1000000000 passes = 61
         This machine benchmarks at 16393442 dhrystones/second
R12-R13: Dhrystone time for 1000000000 passes = 59
         This machine benchmarks at 16949152 dhrystones/second
R14-R15: Dhrystone time for 1000000000 passes = 60
         This machine benchmarks at 16666666 dhrystones/second
We now know the right loop count lies between 10^8 and 10^9. Further testing led us to select 35,00,00,000 (3.5x10^8) passes. With this pass count fixed in the benchmark, we then applied the optimisation levels.
(vi) 35,00,00,000 loops
(a) First we run the benchmark without optimisation on the Nano machine (x86-64). The results are as follows:

Dhrystone was run 15 times, compiled with GCC and no optimization switch. The Dhrystone time for 350000000 passes came out as 23 and 24 seconds, and the machine benchmarks at 15217391 and 14583333 dhrystones/second respectively.
We then used the cc compiler, as specified in the problem, again without optimization switches.

We get similar results, with a difference in just one run.

(b) We run with the -O1 optimisation switch.

There is a drastic difference between the non-optimised and the -O1 optimised code: the Dhrystone time dropped by almost 12 s. In the majority of runs the Dhrystone time was 11 s, i.e. 31818181 dhrystones/second.
Now we generate the assembly for the non-optimised code and the -O1 optimised code and compare them.

Without Optimization

-O1 Optimization

The performance improvement can be attributed to the removal of redundant code, which shrinks the generated code and makes it more efficient; register usage also drops sharply after optimisation.
The comparison tool used above is WinMerge: the grey parts mark operations absent from the other file, yellow marks operations that differ between the two files, and white marks operations common to both.
(c) We apply the -O2 optimisation level.

The Dhrystone time and the dhrystone rate improve further.

Without Optimization

-O2 Optimization

Analysing the assembly shows a further reduction in code size, with a reasonable trade-off maintained between debug information and code performance. During profiling we saw some functions omitted from the profile; these turned out to have been inlined into their callers. Register spills were further minimised, and more work is done with cheap logical operations (for example zeroing a register with xor instead of loading a zero from memory). Additional optimisation flags enabled at this level make the code more efficient still.

(d) We apply the -O3 optimisation level. The results were so aggressively optimised that the code's behaviour changed, so -O3 was run only to check the runtime, which came out as shown below.

Similar results were obtained on the Sun machines; the benchmark was checked thoroughly on the Sun platform as well.

Part C.
(i)
Attached here is the flat profile of the benchmark, obtained using gprof without optimisation.

From the profile we observe that Func1 and Proc7 are the most frequently called, and therefore the most frequently executed, functions.

(ii)
Func1 takes 3.52 seconds, 14.54% of the total running time, and Proc7 takes 1.80 seconds, 7.43% of the total.
(iii)
Attached below is the profile of the -O2 optimised code.

Comparing the optimised profile with the non-optimised one, we can observe that the self seconds of the corresponding functions have decreased drastically.
The optimization changes the percentages because, as the profile above shows, some functions are missing from the optimised profile: they have been inlined into other functions. Their time is therefore redistributed among the remaining entries, changing every function's percentage.

(iv)
The given Dhrystone benchmark was run on Solaris compiled with both gcc and cc.
Better results were observed with the cc compiler: it automatically includes the useful libraries and performs the required tweaks, and since cc is made by Sun, Solaris has likely been tuned for it.
Problem 2:
The Linpack benchmark was run on the x86-64 Nano machine; the picture below shows the results.

Without optimisation we obtain 998.009950 MFLOPS.

Applying the -O1 optimisation level gives 1013.131313 MFLOPS, and -O2 gives a marked improvement at 3714.814815 MFLOPS.

We then profile both the unoptimised and the -O2 optimised code and get the following results.

Without Optimisation

-O2 level optimisation

Now checking the assembly for both cases: below is the screenshot of the assembly comparison for Linpack without optimisation and with -O2 level optimisation.
Analysing the assembly for both shows that at the -O2 level the code is substantially reduced. Some functions that previously created redundant call overhead are now inlined into their callers, which improves performance; this can also be seen in the flat profiles above, where some entries are absent from the optimised profile. The optimised code also makes use of logical instructions such as xor (for example to zero a register) in place of heavier memory loads, and register spills are substantially reduced, leading to better performance. Several optimisation passes and flags associated with the -O2 level are incorporated.

Without Optimization

-O2 Optimization

Problem 3:
I chose the Whetstone benchmark as the third benchmark and ran it on the x86-64 ISA.
The Whetstone benchmark evaluates floating point arithmetic performance. It is a synthetic benchmark designed to mimic typical Algol 60 programs.
The benchmark was downloaded from the following link:
http://www.netlib.org/benchmark/whetstone.c
First we set the loop count to 10^5 and got the following results:

Then, after increasing the loop count to 10^6, we got:

The graph below gives an idea of which systems run Whetstone most efficiently. From the graph, the best results are obtained on a Core i7-3960X system, which shows the highest GFLOPS figure.

Problem 4:
Linux is the more popular platform compared to Solaris, and even with its perpetually-beta kernel and semi-debugged GUI it beats Solaris here.
The first difference appears when the x86-64 ISA machine produces faster results than the SPARC ISA machine. Running Dhrystone on x86 with 35,00,00,000 passes took 23 seconds of Dhrystone time, whereas executing Dhrystone with the same pass count on SPARC would take far too long to be practical: on Solaris we get 25 seconds of Dhrystone time for only 1,00,00,000 passes. This is largely due to the difference in word size.
We compile Dhrystone with the same pass count for both SPARC and x86-64 and generate the assembly for both, then take the Proc0 assembly code from each ISA for comparison.

SPARC

X86-64Bit

From the compiled assembly above, we can see that x86-64 uses movq, operating on quadword (64-bit) values, whereas SPARC uses mov instructions on 32-bit registers.
Furthermore, SPARC makes constant use of the SETHI instruction (to build large constants) and frequent use of logical operations (xor, or) in place of instructions that would take more space to encode and execute.
x86, with fewer architectural registers, saves registers across thread switches and handles interrupts faster than SPARC.
In addition, x86 frequently uses instructions that operate directly on memory operands, while SPARC, as a load/store architecture, must move values between memory and registers explicitly.
On the whole, comparing the full generated assembly, the code size for x86 is smaller than for SPARC.

Problem 5:
Benchmark suites are collections of different benchmarks. Their advantage is that the weakness or narrowness of one benchmark can be compensated by another, giving a more effective overall performance estimate.
(i)
STAMP (Stanford Transactional Applications for Multi-Processing): a benchmark suite for analysis and testing of transactional memory applications.
PARSEC (Princeton Application Repository for Shared-Memory Computers): in addition to high performance computing applications, this suite includes emerging workloads such as recognition, mining, and synthesis. It is designed to assess shared-memory programs on chip multiprocessors: memory speeds, SMT performance, out-of-order execution, and performance scaling.
(ii)
STAMP
STAMP includes eight applications and thirty variants of input parameters and data sets, in order to represent several application domains and cover a wide range of transactional execution cases.
Applications:
- Bayes (machine learning)
- Genome (bioinformatics)
- Intruder (security), etc.
PARSEC
Performance scaling is measured while increasing the number of threads, identifying the bottlenecks and classifying them as hardware or software.
PARSEC relies on the following:
- inclusion of multithreaded applications
- emerging workloads
- diversity in representing usage of multiprocessors
- implementation of cutting-edge techniques
- research scope
Its working methodology covers the following characteristics:
- parallelization
- working sets and locality
- communication-to-computation ratio and sharing
- off-chip traffic
By analysing PARSEC, future characteristics of chip multiprocessors were predicted and analysed to support the performance of coming applications. This was tested efficiently in one of the papers on the following applications:
- video games
- virtual worlds
- coverage analysis
Example PARSEC benchmarks: blackscholes (financial analysis), bodytrack (computer vision), canneal (engineering), etc.

(iii)
STAMP benchmark suite citations:
(1) W. Ruan, Y. Liu, and M. Spear, "STAMP Need Not Be Considered Harmful," Lehigh University, 2014.
(2) F. Zyulkyarov, O. Unsal, A. Cristal, and E. Ayguade, "WormBench: A Configurable Workload for Evaluating Transactional Memory Systems," Barcelona Supercomputing Center.

PARSEC benchmark suite citations:
(1) J. Lira, C. Molina, and A. Gonzalez, "Analysis of Non-Uniform Cache Architecture Policies for Chip-Multiprocessors Using the Parsec Benchmark Suite."
(2) C. Bienia, S. Kumar, and K. Li, "PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors."
(3) C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications."
(iv)
One improvement would be removing trivial annotations from the benchmark suite's functions, which can cause inefficiency, and replacing them with more efficient ones where required. For example, as described in one of the papers cited above, removing the TM_PURE annotation from a function, and likewise removing TM_RESTART, makes the code more efficient; other annotations already present in the benchmark can be modified and put to use in their place.
