
Cover Sheet

Fall 2015
CmpEn 431 Final Project
Team Member Names: Robert Gobao, David Sun-Chu

                 IPC       Best Execution Time
Base mcf         0.3186    0.7847 ms
Base milc        0.4234    0.5905 ms
Best mcf         1.9188    0.2085 ms
Best milc        1.5167    0.2637 ms

Best execution time mcf issue width and datapath type:  4-Way Dynamic
Best execution time milc issue width and datapath type: 4-Way Dynamic

Overall best execution times:
Best Integer GM (4 benchmarks):        0.26695 ms
Best Floating point GM (2 benchmarks): 0.2451 ms

Best execution time GM Integer issue width and datapath type:        4-Way Dynamic
Best execution time GM Floating point issue width and datapath type: 4-Way Dynamic

Introduction
Finding the best design for a processor is rarely easy, but with a solid understanding of the subject and a clear plan of attack the task becomes far less daunting. Our end goal was blunt: produce the lowest execution time for the integer and floating point benchmarks, measured as two geometric means. To achieve this goal, we set up a general plan and adapted it as needed. Essentially, we decided to isolate variables in order to determine what works best for each particular setting. For that reason we tested the memory hierarchy, branch prediction, and datapath types separately, since they are the most important parts of the processor. Within those units, we also ran separate tests on components such as the L1 cache and L2 cache to extract as much performance as possible from each independent part of the processor. First we focus on instructions per cycle, then transition to execution time and determine which datapath gives us the smallest value. Lastly, we go back through and verify that our memory hierarchy schemes hold true for the issue width and datapath type we ultimately choose.
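
As a quick illustration of the figure of merit we optimized for, the short sketch below (our own helper written outside the simulator, with placeholder benchmark times rather than measured values) shows how a geometric mean of per-benchmark execution times is computed.

from math import prod

def geometric_mean(times_ms):
    # Geometric mean of a list of execution times in milliseconds.
    return prod(times_ms) ** (1.0 / len(times_ms))

int_times_ms = [0.21, 0.25, 0.30, 0.33]   # four integer benchmarks (hypothetical times)
fp_times_ms = [0.21, 0.26]                # two floating point benchmarks (hypothetical times)

print(f"Integer GM:        {geometric_mean(int_times_ms):.5f} ms")
print(f"Floating point GM: {geometric_mean(fp_times_ms):.5f} ms")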
Memory Hierarchy - L1 Cache
In the design of an L1 cache, the main objective should be to reduce the time to hit in the cache. To achieve this, you would typically build a smaller, direct-mapped cache with smaller blocks. In our experiments, we started by testing how block size affects sim_IPC for both the integer and floating point benchmarks. For these experiments we used the baseline settings while varying only the L1 cache properties.

Figure: L1 Cache - Block Size Tests. Two charts plot sim_IPC against block size (8, 16, 32, and 64 bytes) for 8 KB, 16 KB, and 32 KB L1 caches, one chart for the integer benchmarks and one for the floating point benchmarks.

From our experiments, the best block size appeared to be 64 bytes for integer and 32 bytes for floating point. These results were intriguing because the best block sizes were larger than we had assumed, which at first seemed to conflict with the theory behind a good L1 cache. The theory, however, is rather vague about what counts as "smaller," so the best block sizes in our experiments can still reasonably be considered small blocks.
Once we established a block size that provided sufficient performance for the benchmarks, we decided to test associativity and replacement policies. We never tested the FIFO replacement policy, since it exploits temporal locality poorly and did not seem to be a suitable choice.

Figure: L1 Cache - Associativity/Replacement Tests (Integer and Floating Point Benchmarks). Both charts compare sim_IPC for five 16 KB configurations: 32 B direct mapped RR, 64 B 2-way RR, 64 B 4-way LRU, 32 B 2-way LRU, and 64 B 2-way LRU. For the integer benchmarks sim_IPC ranged from 0.3646 to 0.3927, with the 64 B 2-way LRU configuration on top; for the floating point benchmarks it ranged from 0.3805 to 0.4065, with the 32 B direct-mapped RR configuration on top.

After running the experiments, we had a clear-cut winner for both the integer and floating point benchmarks: for integer it was a 16 KB, 64 B, 2-way associative LRU cache, and for floating point it was a 16 KB, 32 B, direct-mapped RR cache. Our results somewhat aligned with the theoretical "good" L1 cache by being small and, in one case, direct mapped; however, the theory did not always trump what actually occurred in experimentation.
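
For reference, the sketch below applies the standard cache-geometry arithmetic (our own illustration, not simulator output) to the two winning L1 configurations, showing how many sets each has and how the address splits into index and block-offset bits.

def cache_geometry(size_bytes, block_bytes, assoc):
    # Number of sets and the index/offset bit widths for a set-associative cache.
    sets = size_bytes // (block_bytes * assoc)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = sets.bit_length() - 1           # log2(number of sets)
    return sets, index_bits, offset_bits

# Best integer L1: 16 KB, 64 B blocks, 2-way set associative
print(cache_geometry(16 * 1024, 64, 2))   # -> (128, 7, 6)
# Best floating point L1 (from these baseline tests): 16 KB, 32 B blocks, direct mapped
print(cache_geometry(16 * 1024, 32, 1))   # -> (512, 9, 5)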
Memory Hierarchy - L2 Cache
The memory hierarchy is a huge limiting factor in computer architecture. Our plan for testing the L2 cache was to keep the blocks large and the overall size relatively large in order to minimize the miss rate. Our starting point for these experiments was taken from the cache slides, chapter 5B, slide 15:

Chapter 5B, slide 15, from Professor Mary Jane Irwin's CMPEN 431 cache PowerPoint

From this slide, we noticed that larger caches contribute to smaller miss rates. One thing we took into consideration was that once block size becomes a significant fraction of the cache size, the miss rate actually starts to increase. So we started our testing with the biggest cache and the smallest block sizes that kept reasonable performance.

Figure: 1024 KB L2 cache with 128 B blocks - integer and floating point geometric mean sim_IPC plotted against associativity (1, 2, 4, 8, and 16 ways); values ranged from roughly 0.27 to 0.36 across the two series.

In order to decrease the miss rate, we decided to use large L2 caches with very large blocks so as to exploit spatial locality. Because our assignment varied the L2 latency with its associativity, we needed to find out whether the associativity affected IPC more than the latency that came with it. Through various runs, we found that the 1 MB cache with 2-way associativity performed the best on both the integer and floating point benchmarks.

Figure: 512 KB L2 cache with 128 B blocks - integer and floating point geometric mean sim_IPC plotted against associativity (1, 2, 4, 8, and 16 ways); values ranged from roughly 0.27 to 0.36 across the two series.

The next task was to determine whether a larger cache could overcome the latency associated with its size. We kept the block size at 128 B because larger blocks better exploit spatial locality. Since we already had the 1 MB cache statistics, we tested the 512 KB cache to see if any increase in IPC would come from it. To our surprise, the 512 KB, 2-way associative cache came out with a higher IPC on both the integer and floating point benchmarks than the 1 MB, 2-way cache tested before. Because of this, we decided to use a 512 KB, 2-way associative cache when testing our later optimized designs.
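
As a sanity check on the capacities quoted in this section, the helper below (our own sketch, assuming SimpleScalar's <name>:<nsets>:<bsize>:<assoc>:<repl> cache string format) converts a configuration string into a total size; the 512 KB, 2-way string is our inferred parameterization of the cache chosen here.

def decode_cache(cfg):
    # Total capacity implied by a SimpleScalar-style cache configuration string.
    name, nsets, bsize, assoc, repl = cfg.split(":")
    size_kb = int(nsets) * int(bsize) * int(assoc) / 1024
    return name, size_kb, int(assoc), repl

print(decode_cache("ul2:2048:128:2:l"))   # -> ('ul2', 512.0, 2, 'l'), the 512 KB 2-way LRU cache
print(decode_cache("ul2:4096:128:2:l"))   # -> ('ul2', 1024.0, 2, 'l'), a 1 MB 2-way LRU cache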
Memory Hierarchy - TLBs
From what we learned in lecture, the primary goal of a good TLB is to reduce the number of page faults. Page faults are very costly in terms of performance because they involve bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and then resuming the program, which wastes a huge amount of time. To help avoid these misses, we decided to maximize the number of entries in the iTLB and dTLB and make them as associative as possible.

                                          Int sim_IPC   Fp sim_IPC
Max Entries, Max Associativity, LRU       0.2981        0.3377
Max Entries, Lesser Associativity, LRU    0.2981        0.3377
Baseline Settings                         0.2979        0.3776

Our TLB tests did not yield much of a performance gain in terms of IPC for either the integer or the floating point benchmarks. This disappointed us because, in theory, we should have seen a noticeable improvement. We will still use the maximum number of entries and maximum associativity in our TLBs, though, since they should matter more with out-of-order datapaths.
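
To put "max entries" in perspective, the sketch below (our own helper, reusing the simulator's <name>:<nsets>:<pagesize>:<assoc>:<repl> TLB format) compares the entry count and reach of the baseline iTLB with the maxed-out TLBs we settled on.

def tlb_reach(cfg):
    # Entry count and reach (entries x page size) of a SimpleScalar-style TLB string.
    name, nsets, page_bytes, assoc, repl = cfg.split(":")
    entries = int(nsets) * int(assoc)
    reach_kb = entries * int(page_bytes) / 1024
    return name, entries, reach_kb

print(tlb_reach("itlb:16:4096:4:r"))    # baseline iTLB: 64 entries, 256 KB reach
print(tlb_reach("itlb:1:4096:256:l"))   # maxed iTLB: 256 entries, 1024 KB reach
print(tlb_reach("dtlb:1:4096:512:l"))   # maxed dTLB: 512 entries, 2048 KB reach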

Branch Prediction
In order to determine which branch predictor we wanted to use, we tested every predictor SimpleScalar has to offer. Because branch prediction is a huge limiting factor in modern processors and drastically affects out-of-order machines, we wanted to make sure we tested all possible predictors extensively.

Figure: Branch Prediction with Baseline Configuration - geometric mean sim_IPC for the integer and floating point benchmarks across the taken, not-taken, bimodal, 2-level, combined, and combined-with-2048-entry-meta-table predictors. The static taken/not-taken predictors sat near 0.298 (integer) and 0.338 (floating point), while the dynamic predictors clustered near 0.36 (integer) and 0.40 (floating point), with the combined predictor on top.

In order to isolate the performance of each predictor, we tested each one with the baseline configuration and picked the best-performing scheme based on the geometric means of the integer and floating point benchmarks. From our class experiments we had already tested taken, not taken, and bimodal and determined that bimodal outperformed the others, so we had a good idea of what should perform well. We also tested the combined and 2-level predictors that SimpleScalar offers to see if we could squeeze out any more performance. From our data, we found that the combined predictor outperformed the others by about 0.001 IPC. The combined predictor includes both a 2-level and a bimodal predictor, which seems like far more hardware and chip area than the performance increase justifies, but since we are not limited in area or hardware, we decided to go with the combined predictor in our final optimized design. To squeeze out even more performance, we also tried increasing the meta-table size from 1024 entries to 2048, but we did not see any noticeable improvement.
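
For context, the toy sketch below shows the 2-bit saturating counters behind the bimodal component of the combined predictor. It is a generic illustration of the scheme, not SimpleScalar's implementation, and the table size, branch address, and outcome history are arbitrary.

class Bimodal:
    # A table of 2-bit saturating counters indexed by (part of) the branch PC.
    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries          # start weakly not-taken

    def predict(self, pc):
        return self.table[(pc >> 2) % self.entries] >= 2   # predict taken if counter >= 2

    def update(self, pc, taken):
        i = (pc >> 2) % self.entries
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

bp = Bimodal()
for outcome in [True, True, False, True]:   # hypothetical outcomes of one branch
    print("predicted taken:", bp.predict(0x400100), "actual taken:", outcome)
    bp.update(0x400100, outcome)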

Datapath Experiments
In order to determine the best datapath, we used the previously optimized memory hierarchy and
branch predictor with all else held at baseline.

Figure: Datapath Execution Time with Optimized L1, Branch Predictor, and TLBs - integer and floating point geometric mean execution times (ms) for the 2-way static, 4-way static, 2-way dynamic, 4-way dynamic, and 8-way dynamic datapaths. Execution times fell from roughly 0.62 ms on the static datapaths to roughly 0.28-0.34 ms on the dynamic ones, with the 4-way dynamic datapath giving the lowest values.

The purpose of these experiments was to determine which datapath had the best trade-off between clock cycle time and IPC. The execution time formula is:

Execution Time = (Instruction Count / sim_IPC) × Clock Cycle Time

In terms of performance (a smaller execution time is better), execution time is inversely related to IPC and directly related to clock cycle time. So even though the dynamic 8-way has a much higher IPC than the others, its longer clock cycle time may offset that increase in IPC and produce a higher execution time.
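
The sketch below works through this trade-off numerically. The instruction count matches our -max:inst setting, but the IPC and clock-cycle-time pairs are hypothetical placeholders chosen only to show how a higher-IPC, slower-clock design can still lose on execution time.

def execution_time_ms(instructions, ipc, cycle_time_ns):
    # Execution time = (instructions / IPC) x clock cycle time.
    cycles = instructions / ipc
    return cycles * cycle_time_ns / 1e6     # ns -> ms

INSTRUCTIONS = 2_500_000                    # -max:inst from our runs
for label, ipc, cycle_ns in [("4-way dynamic", 1.9, 0.16),   # illustrative numbers only
                             ("8-way dynamic", 2.2, 0.20)]:
    print(label, f"{execution_time_ms(INSTRUCTIONS, ipc, cycle_ns):.4f} ms")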
From our homework in class, we had already determined that the dynamic 8-way datapath produced the best IPC; however, while observing the execution times for this project, we found that its clock cycle time was just too high for the dynamic 8-way to come out on top. From our graph, we found that the dynamic 4-way outperformed both the dynamic 8-way and the dynamic 2-way, so we decided to use the dynamic 4-way in our final configuration.
Mid-Project Change of Focus
After we finished testing the issue widths and datapath types, we decided to focus on execution time rather than IPC, since the whole purpose of the project is to produce the best possible execution times for the integer and floating point benchmarks. This change of mindset required us to backtrack slightly and revisit the memory hierarchy to verify that we really do have the best memory schemes for our respective machines. But before revisiting the memory hierarchy, we first took a look at some of the less important settings in SimpleScalar to see if they improve the execution time at all.
Miscellaneous Settings
In order to be thorough in finding the best designs for the integer and floating point benchmarks, we decided to investigate some of the more obscure SimpleScalar settings that we had not looked at this semester, with the goal of further decreasing execution time for either the integer or the floating point machine.
The first settings we played with were the numbers of functional units: integer ALUs and multipliers and floating point ALUs and multipliers. After running some quick tests, it was apparent that these settings (-res:ialu, -res:imult, -res:fpalu, -res:fpmult) did not really decrease execution time as we had anticipated, and in some cases the execution time actually got worse.
Next, we analyzed the effect of memory bus width on our machines. The default bus width was 8 bytes. Naturally, we predicted that we should maximize this number so we can send more data across the bus at once.

Effect of Memory Bus Width

                          Int ExTime (ms)   Fp ExTime (ms)
Memory Bus Width: 8 B     0.31516           0.2841
Memory Bus Width: 16 B    0.2764            0.2629
Not only was our prediction correct, we also noticed a significant drop in execution time for both the integer and floating point machines, which was great to see.
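
The sketch below shows why the wider bus helps, assuming the usual "first chunk plus per-remaining-chunk" latency model suggested by -mem:lat 90 10; the arithmetic is our own back-of-the-envelope estimate, not simulator output.

def block_fill_cycles(block_bytes, bus_width_bytes, first=90, inter=10):
    # Cycles to bring one cache block in from memory over a bus of the given width.
    chunks = -(-block_bytes // bus_width_bytes)   # ceiling division
    return first + (chunks - 1) * inter

# Filling one 128 B L2 block from main memory:
print(block_fill_cycles(128, 8))    # 240 cycles with an 8-byte bus
print(block_fill_cycles(128, 16))   # 160 cycles with a 16-byte bus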
The final obscure setting we looked at was fetch speed. Fetch speed can be set to 1, 2, 3, or 4, so we ran brief tests to find an appropriate value.

                   Int ExTime (ms)   Fp ExTime (ms)
Fetch Speed = 2    0.276             0.2638
Fetch Speed = 3    0.2765            0.2628
Fetch Speed = 4    0.2764            0.2629

Interestingly, fetch speed had only a minor effect on the results for both the integer and floating point benchmarks. However, in a project such as this we need every fraction of a millisecond we can get, so testing fetch speed proved useful: the integer machine did best with a fetch speed of 2, while the floating point machine needed a fetch speed of 3 to extract the most performance.
Memory Hierarchy Revisited - L1 Cache
To verify that our best L1 cache was indeed the best for the 4-way dynamic superscalar machine, we ran quick tests to see whether the L1 caches chosen for the integer and floating point benchmarks still gave us the best execution time. For the integer benchmarks, the scheme we found earlier thankfully held true. However, for the floating point benchmarks our worst fears came to fruition: the cache scheme we had found to be optimal was no longer giving us the best execution time on the dynamic datapath. This confused us, because it meant the memory hierarchy depended on other factors we had failed to account for, which was totally unexpected. Even with this setback, we pressed on and ran another round of L1 cache tests in hopes of decreasing the execution time for the floating point benchmarks.

Figure: L1 Cache - Verification/Optimization - integer and floating point execution times (ms) for 16 KB 2-way, 16 KB 4-way, 32 KB 4-way, 64 KB 2-way, and 64 KB direct-mapped L1 caches (all with 64-byte blocks and LRU replacement).

Our integer tests confirmed that we had found the best scheme in our initial L1 cache experiments. Even though floating point baffled us, we ultimately determined that a 64 KB, 64 B, direct-mapped LRU cache was best for the floating point benchmarks. Throughout the retesting of the L1 cache, we kept the block size at 64 bytes because it allows an ifqsize of 8, which was required to ensure the performance of our machines was the best it could possibly be. We even ran some quick side experiments varying the ifqsize on the dynamic machine, and we noticed that no matter how good the L1 cache was, ifqsize had a more profound effect on execution time.
Memory Hierarchy Revisited - L2 Cache
Determining which L2 cache performed best with the baseline configuration was a good starting point and gave us a good idea of what the final configuration should use. Unfortunately, as we finished up our optimized configurations, we realized that different cache sizes and associativities influenced the integer and floating point execution times differently. We went through and tested all associativities of the 1 MB and 512 KB caches with 128 B blocks on our optimized integer and floating point configurations. We found that the 1 MB, 2-way L2 cache improved the floating point benchmarks, while a 1 MB, 4-way L2 improved the integer benchmarks. At this point in our experiments we were trying to squeeze the last bit of performance out of these designs in order to get the best possible execution times, and we managed to decrease our execution times by 0.05 ms by changing the L2 cache.
Memory Hierarchy Revisited - TLBs
When we ran the TLB tests earlier, we noticed very little performance gain from setting the TLBs to the maximum number of entries and associativity. To verify that we were on the right track, we ran a few brief TLB tests on our dynamic machines and, to our surprise, noticed the execution time going up as we made the TLBs smaller and less associative, exactly what theory says should happen. Our initial TLB settings were the best after all; we just needed to test them on a dynamic datapath to see a visible change.
Conclusion
Upon receiving this assignment, we quickly realized that we could be stuck in the lab for hours on end checking every possible combination of memory hierarchy and datapath. To combat this, we initially planned to isolate each component, leaving everything else at the baseline configuration while varying that component's settings based on what we learned in class. We broke the assignment down into parts: L1 and L2 cache, TLBs, branch prediction, datapath, and miscellaneous settings. In order to isolate components, we researched each of them and noted any predetermined dependencies. After we figured out which component settings were best in the baseline experiments, we used those settings in our optimized configuration. After running benchmarks on our optimized machines, we realized that there were interdependencies between the cache and TLB settings and the integer and floating point benchmarks: some caches were better suited to the integer benchmarks while others were better for floating point. For this reason, we decided to fork our experiment and optimize the integer and floating point benchmarks separately. After revisiting caching for each design, we obtained execution times of 0.26695 ms and 0.2451 ms for the integer and floating point benchmarks respectively.

Baseline Machine
-fastfwd 500000
-max:inst 2500000
-fetch:ifqsize 1
-fetch:speed 1
-fetch:mplat 3
-bpred nottaken
-bpred:ras 0
-decode:width 1
-issue:width 1
-issue:inorder true
-issue:wrongpath false
-ruu:size 4
-lsq:size 2
-res:ialu 1
-res:imult 1
-res:fpalu 1
-res:fpmult 1
-res:memport 2
-cache:il1 il1:1024:8:1:r
-cache:dl1 dl1:1024:8:1:r
-cache:il2 dl2
-cache:dl2 ul2:2048:16:2:r
-tlb:lat 30
-tlb:itlb itlb:16:4096:4:r
-tlb:dtlb dtlb:32:4096:4:r
-cache:dl1lat 1
-cache:il1lat 1
-cache:dl2lat 5
-cache:il2lat 5
-mem:lat 90 10
-mem:width 8
-redir:sim sim1.out

Best Integer Machine (mcf)


-fastfwd 500000
-max:inst 2500000
-fetch:ifqsize 8
-fetch:speed 2
-fetch:mplat 3
-bpred comb
-bpred:comb 2048
-bpred:ras 16
-bpred:btb 1024 4
-decode:width 4
-issue:width 4
-issue:inorder false
-issue:wrongpath true
-ruu:size 32
-lsq:size 16
-res:ialu 4
-res:imult 4
-res:fpalu 4
-res:fpmult 4
-res:memport 2
-cache:il1 il1:128:64:2:l
-cache:dl1 dl1:128:64:2:l
-cache:il2 dl2
-cache:dl2 ul2:2048:128:4:l
-tlb:lat 30
-tlb:itlb itlb:1:4096:256:l
-tlb:dtlb dtlb:1:4096:512:l
-cache:dl1lat 3
-cache:il1lat 3
-cache:dl2lat 10
-cache:il2lat 10
-mem:lat 90 10
-mem:width 16
-redir:sim sim1.out

Best Floating Point Machine (milc)


-fastfwd 500000
-max:inst 2500000
-fetch:ifqsize 8
-fetch:speed 3
-fetch:mplat 3
-bpred comb
-bpred:comb 2048
-bpred:ras 16
-bpred:btb 1024 4
-decode:width 4
-issue:width 4
-issue:inorder false
-issue:wrongpath true
-ruu:size 32
-lsq:size 16
-res:ialu 4
-res:imult 4
-res:fpalu 4
-res:fpmult 4
-res:memport 2
-cache:il1 il1:1024:64:1:l
-cache:dl1 dl1:1024:64:1:l
-cache:il2 dl2
-cache:dl2 ul2:4096:128:2:l
-tlb:lat 30
-tlb:itlb itlb:1:4096:256:l
-tlb:dtlb dtlb:1:4096:512:l
-cache:dl1lat 4
-cache:il1lat 4
-cache:dl2lat 9
-cache:il2lat 9
-mem:lat 90 10
-mem:width 16
-redir:sim sim1.out
