
Blazing Saddles: Getting the Performance Out of VCS


Gregg D. Lahti
Corrent Corporation

Tim Schneider
Synopsys Corporation

Gopal Varshney
Corrent Corporation


High Noon in ASICville


It's two weeks before your chip tape-out deadline:
A new bug is found
You must fix the bug and re-run your regression simulations across the compute ranch
And still make the tape-out date
Or you may be the next gateslinger in the manager's layoff sights at High Noon
Are you SURE you're getting the most performance from VCS?


VCS Usage Model


Consider your VCS usage
Large designs yield a large number of tests
At Corrent, over 2000 total tests for 3M-gate and 8M-gate ASICs
75% of an engineer's time is spent debugging the design in specific areas

VCS needs to be utilized in two modes:
Debugging mode, where extra visibility into the simulation is required
Regression mode, where performance is required


VCS Debugging Mode


Useful for point problems
Dumping signal state consumes VCS resources
Slows down simulation, especially with lots of I/O to disk
Usually includes some debugger like Debussy


VCS Regression Mode

Optimize for speed!


Usually many tests run in batch mode
Just verifying pass/fail operation of design
Debug only the tests that fail, re-running them in debugging mode with signal state saving turned on

Items That Kill VCS Performance

VCS performance can be hampered by command-line switches used without discretion:

-I                Interactive mode
-PP               Post-processing of the dump file
+cli              Turns on command-line interactive mode
+acc+2            Used for backwards-compatible PLI calls
Lack of -Mupdate  Loss of saved-state compile info
-P [library]      Compiles in libraries that may not be required

Items That Kill VCS Performance (cont)

Coding styles can kill VCS performance

Delay loops or unneeded timing assignments kill performance, as VCS cannot optimize the execution:

    always @(posedge clk or negedge reset_n) begin
      if (~reset_n) q <= #0 0;
      else          q <= #1 d;
    end // always

#0 and #1 delays can increase simulation time by as much as 200%, with an average increase of 30-50%! [1]

[1] "Verilog Nonblocking Assignments With Delays, Myths and Mysteries," Cliff Cummings, Boston SNUG 2002 paper.
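Before cleaning by hand, it can help to list where these delays live. The following is a minimal sketch, not a VCS feature: a small Python scan for nonblocking assignments that carry a #0 or #1 intra-assignment delay. The names DELAY_RE and find_unit_delays are hypothetical, and the regular expression only covers the simple form shown on this slide (it will miss delays given as parameters or macros).

```python
import re

# Matches a nonblocking assignment with a literal #0 or #1 delay, e.g. "q <= #1 d;".
# "#10" and "#1.5" are deliberately not matched.
DELAY_RE = re.compile(r"<=\s*#\s*([01])\b(?!\.)")

def find_unit_delays(verilog_source):
    """Return (line_number, delay, stripped_line) for each #0/#1 hit."""
    hits = []
    for lineno, line in enumerate(verilog_source.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing line comments
        for match in DELAY_RE.finditer(code):
            hits.append((lineno, int(match.group(1)), line.strip()))
    return hits

# The flip-flop from this slide, used as sample input.
SAMPLE = """\
always @(posedge clk or negedge reset_n) begin
  if (~reset_n) q <= #0 0;
  else          q <= #1 d;
end // always
"""

for lineno, delay, text in find_unit_delays(SAMPLE):
    print(f"line {lineno}: #{delay} delay in '{text}'")
```

In practice the same function could be pointed at each RTL file in the regression list; whether every reported delay is safe to remove is still a design decision.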

Items That Kill VCS Performance (cont)

Use of initial and always blocks

The following construct is legal Verilog but may cause VCS to lock up in an infinite loop (the block has no timing control, so simulation time never advances):

    always begin
      mysig = 1;
    end // always

Mixing blocking and nonblocking assignments may cause poor VCS simulation speed, and bad logic if coded incorrectly. Not a good idea to mix these!

Items That Kill VCS Performance (cont)

PLI and .tab files
Disk & network I/O
Optimize your simulations by running on local disk
Network filesystems can be 10X slower!
Bribe your sysadmin for /tmp space, then run as follows:
    -o /tmp/siv -Mdir=/tmp/csrc

Items That Improve VCS Performance

Command line switches


+rad Verilog pre-processor that optimizes code

Coding styles
Remove #0 and #1 delays
Separate sequential items from combinatorial processes

In-line C instead of PLI calls


Use the Direct Kernel Interface (DKI)
Use the DKI for Debussy
+vcsd along with the proper vcsd .tab file
http://www.solvnet.synopsys.com/retrieve/900611.html


Cleaning the pli.tab File


PLI tab files can be optimized!
Look for the following in your .tab files:
    acc:rw,cbka: *

The * signifies that every signal in your design gets read/write access and visibility by the debugger.
Streamline this, as you probably don't need EVERY signal!

Debussy ships with a very unoptimized pli.tab file. Example line:
    $fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%*

Replacing the %* with %TASK improves performance by as much as 15-20%
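To find such entries across several .tab files, a quick scan can help. This is a rough sketch only: it assumes the wildcard always appears in one of the two forms quoted on this slide (":*" or ":%*" at the end of an acc specification), and the names WILDCARD_RE and wildcard_entries are hypothetical.

```python
import re

# Flags acc specifications whose scope is a bare wildcard (":*" or ":%*").
WILDCARD_RE = re.compile(r"acc\S*?:\s*%?\*(?=\s|$)")

def wildcard_entries(tab_text):
    """Return (line_number, stripped_line) for each wildcard acc entry."""
    hits = []
    for lineno, line in enumerate(tab_text.splitlines(), start=1):
        if WILDCARD_RE.search(line):
            hits.append((lineno, line.strip()))
    return hits

# Line 1 is the Debussy example from this slide; line 2 is the %TASK version.
SAMPLE_TAB = """\
$fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%*
$fsdbDumpvars check=plicompileDumpvars call=plitaskDumpvars misc=plimiscFSDB acc=read,callback_all:%TASK
"""

for lineno, text in wildcard_entries(SAMPLE_TAB):
    print(f"line {lineno} grants design-wide access: {text}")
```

Only the first sample line is reported; the %TASK line is left alone.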

Cleaning the pli.tab File (cont)

If your home-grown PLIs and debugging PLIs are still dragging down simulation, use the +vcs+learn+pli flag.
Run the simulation with this flag turned on
VCS figures out which PLI calls are utilized in the design
VCS generates a new pli.tab file to be used

Useful for simulation speed improvements; results vary based on PLI usage
Caveat emptor:
If you change your PLI interface or usage, you need to re-run with the flag or risk an incorrect or failing simulation!


Profiling Simulations
Profile your simulation to see where the time is spent
Useful to see whether code, PLI, or a library is causing the bottleneck
Easy with VCS 5.2 and later:
Use +prof in the command-line compile script
VCS creates a vcs.prof output file
Read the file to see where the time is spent


Profiling Simulations (cont)


Output of vcs.prof log:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 976.180 seconds
======================================================================
                            TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 0.23
VCD                 0.99
KERNEL              7.76
DESIGN             91.02
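When profiling many tests, reading each vcs.prof by eye gets tedious. Here is a minimal sketch of pulling the total time and the top-level split out of a profile. The exact whitespace and line layout of a real vcs.prof are assumed from this slide, and parse_top_level is a hypothetical helper, not part of VCS.

```python
import re

# Top-level view as shown on this slide (layout assumed).
SAMPLE_PROF = """\
// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 976.180 seconds
======================================================================
TOP LEVEL VIEW
======================================================================
TYPE %Totaltime
----------------------------------------------------------------------
PLI 0.23
VCD 0.99
KERNEL 7.76
DESIGN 91.02
"""

TIME_RE = re.compile(r"Simulation Time:\s*([\d.]+)\s*seconds")
ROW_RE = re.compile(r"^(PLI|VCD|KERNEL|DESIGN)\s+([\d.]+)$")

def parse_top_level(text):
    """Return (total_seconds, {type: percent}) from a profile dump."""
    total = None
    shares = {}
    for line in text.splitlines():
        time_match = TIME_RE.search(line)
        if time_match:
            total = float(time_match.group(1))
            continue
        row_match = ROW_RE.match(line.strip())
        if row_match:
            shares[row_match.group(1)] = float(row_match.group(2))
    return total, shares

total, shares = parse_top_level(SAMPLE_PROF)
# Largest consumer first, with the percentage converted to seconds.
for kind, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{kind:<7}{pct:6.2f}%  ~{total * pct / 100:7.1f} s")
```

Converting percentages to seconds makes it easier to compare runs whose total times differ a lot.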


Profiling Simulations (cont)


=====================================================================
                             MODULE VIEW
=====================================================================
Module(index)          %Totaltime  No of Instances  Definition
---------------------------------------------------------------------
delaychain        (1)       67.75               56  ../top/rtl/delaychain.v:15.
dll_delay_line    (2)        7.16                2  ../rtl/ddrctlr/rtl/dll_delay_line.v:21.
ckrst             (3)        2.47                1  ../top/rtl/ckrst.v:13.
INVDL             (4)        1.25             8431  /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
hurricane_tb      (5)        1.23                1  ../tb/hurricane_tb.v:31.
pdisp             (6)        1.16                8  ../rtl/hurricane/rtl/pdisp.v:33.
dll_mux           (7)        0.96             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
dll_buf           (8)        0.56             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
spsram_1536x32    (9)        0.54                8  /projects/clibs/rams_nobist_M4one/0.15vst_2.0/Verilog_fix/spsram_1536x32.v:8.


Profiling Simulations (cont)


Ouch! VCS spends 67% of its time in delaychain.v
We used this module to hand-tune the clock trees from the clock generation block (to get around anemic Apollo clock tree insertion issues).
The delaychain.v code looks like this:

    module delaychain (
      sigin,
      sigout
    );
      input          sigin;
      output [120:0] sigout;
      wire   [120:0] sigout;

      BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
      INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
      // ...
      INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
    endmodule


Profiling Simulations (cont)


Compiling and running with +nospecify gives us better performance:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 149.660 seconds
======================================================================
                            TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 1.65
VCD                 0.02
KERNEL             11.86
DESIGN             86.46


Profiling Simulations (cont)


Better run, but still spending time in delaychain.v

=======================================================================
                              MODULE VIEW
=======================================================================
Module(index)          %Totaltime  No of Instances  Definition
-----------------------------------------------------------------------
delaychain        (1)       17.24               56  ../top/rtl/delaychain.v:15.
hurricane_tb      (2)        8.43                1  ../tb/hurricane_tb_nodump.v:31.
dll_mux           (3)        5.06             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
INVDL             (4)        4.12             8431  /projects/clibs/umc/0.15vst/tapeoutkit/stdcell/UMCL15U210T2_2.2/Verilog_simulation_models/INVDL.v:32.
pdisp             (5)        3.83                8  ../rtl/hurricane/rtl/pdisp.v:33.
dll_buf           (6)        3.10             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
rctl              (7)        2.35                8  ../rtl/hurricane/rtl/rctl.v:32.
dll_delay_element (8)        2.24             1720  ../rtl/ddrctlr/rtl/dll_delay_element.v:20.


Profiling Simulations (cont)


Change the Verilog: `ifdef the instanced gates so they are compiled in only for synthesis and gate-level simulation

    module delaychain (
      sigin,
      sigout
    );
      input          sigin;
      output [120:0] sigout;
      wire   [120:0] sigout;

    `ifdef SYNTH_DELAYCHAIN
      // Conditionally compiled in
      BUFD1 buf_u000 (.A(sigin), .Z(sigout[0]));
      INVDL inv_u001 (.A(sigin), .Z(sigout[1]));
      // ...
      INVDL inv_u120 (.A(sigout[119]), .Z(sigout[120]));
    `else
      // RTL version, VCS can optimize this!
      assign sigout = { sigin, {60{!sigin, sigin}} };
    `endif
    endmodule
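A quick sanity check that the one-line RTL concatenation drives the same values as the instanced chain can be done outside the simulator. The sketch below is a zero-delay Python model, not Verilog. It assumes the instances elided on the slide follow the visible pattern (each INVDL inverts the previous sigout bit); gate_chain and rtl_concat are hypothetical names.

```python
WIDTH = 121  # sigout[120:0]

def gate_chain(sigin):
    """Model of the instanced version; returns bits indexed [0]..[120]."""
    out = [0] * WIDTH
    out[0] = sigin             # BUFD1 buf_u000 passes sigin through
    out[1] = 1 - sigin         # INVDL inv_u001 is driven from sigin
    for i in range(2, WIDTH):  # assumed: each later INVDL inverts sigout[i-1]
        out[i] = 1 - out[i - 1]
    return out

def rtl_concat(sigin):
    """Model of { sigin, {60{!sigin, sigin}} }; returns bits [0]..[120]."""
    msb_first = [sigin] + [1 - sigin, sigin] * 60
    return msb_first[::-1]     # a Verilog concatenation lists the MSB first

for value in (0, 1):
    assert gate_chain(value) == rtl_concat(value)
print("gate chain and RTL concatenation agree for sigin = 0 and 1")
```

The model ignores gate delays on purpose: the RTL branch is only meant for functional simulation, while the instanced branch keeps the real delays for synthesis and gate-level runs.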


Profiling Simulations (cont)


Compiling and running with +nospecify and the new `ifdef'ed delaychain.v provides much better performance:

// Synopsys VCS 6.2R12
// Simulation profile: vcs.prof
// Simulation Time: 124.410 seconds
======================================================================
                            TOP LEVEL VIEW
======================================================================
TYPE          %Totaltime
----------------------------------------------------------------------
PLI                 1.15
VCD                 0.02
KERNEL             13.34
DESIGN             85.50
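Putting the three profiled run times side by side makes the gain from each step explicit. This is a small arithmetic sketch using only the Simulation Time values reported in the three vcs.prof logs; the run labels and the speedup helper are illustrative.

```python
# (label, simulation time in seconds) from the three vcs.prof logs
runs = [
    ("first profile", 976.180),
    ("+nospecify", 149.660),
    ("+nospecify and `ifdef'ed delaychain.v", 124.410),
]
first = runs[0][1]

def speedup(before, after):
    """How many times faster the 'after' run is than the 'before' run."""
    return before / after

for label, seconds in runs:
    print(f"{label:<40} {seconds:8.3f} s  {speedup(first, seconds):5.2f}x")
```

Most of the gain on this test comes from +nospecify; the `ifdef'ed delaychain.v adds a smaller step on top.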


Profiling Simulations (cont)


delaychain.v is no longer at the top of the list of CPU hogs:

=======================================================================
                              MODULE VIEW
=======================================================================
Module(index)          %Totaltime  No of Instances  Definition
-----------------------------------------------------------------------
hurricane_tb      (1)        6.79                1  ../tb/hurricane_tb_nodump.v:31.
dll_mux           (2)        5.90             1720  ../rtl/ddrctlr/rtl/dll_mux.v:21.
pdisp             (3)        4.92                8  ../rtl/hurricane/rtl/pdisp.v:33.
rctl              (4)        3.76                8  ../rtl/hurricane/rtl/rctl.v:32.
dll_buf           (5)        3.66             1732  ../rtl/ddrctlr/rtl/dll_buf.v:21.
xaux_regs         (6)        2.93                8  ../rtl/hurricane/rtl/xaux_regs.v:249.
dll_delay_element (7)        2.63             1720  ../rtl/ddrctlr/rtl/dll_delay_element.v:20.
delaychain        (8)        2.08               56  ../top/rtl/delaychain.v:15.
tdc_cdb           (9)        1.67                1  ../rtl/tdc/rtl/tdc_cdb.v:16.
spsram_1536x32    (10)       1.32                8  /projects/clibs/rams_nobist_M4one/0.15vst_2.0/verilog_fix/spsram_1536x32.v:8.


Speeding Gate Simulations


Gate-level, back-annotated simulations are really slow, but useful
Checks for real-world conditions that STA may have missed
Useful for boot/power-up testing (does your chip come out of reset?)
Ensures that the layout netlist works as specified

Good place for simulation speed improvements!

VCS has two switches that optimize gate-level simulation:
+timopt
+memopt

Speeding Gate Simulations (cont)

The +timopt flag:
Optimization based on clock signals and sequential devices in the design
Useful since +rad can't optimize SDF-annotated designs
Used as +timopt+time, where time is the smallest clock period in the design
VCS generates a configuration file that shows further optimizations that can be done by hand
In one Corrent design, 32% of the design was optimized using +timopt
Speed improvement varies; our test case measured a 20% improvement

Speeding Gate Simulations (cont)

+memopt can compress memory structures during compile
Useful if gate sims don't fit into the memory footprint (e.g. the Linux ~3GB process size limitation)
May not always work: the process can still overflow the process-size limit
Use +memopt+2 to spawn a second child process for compilation

On Linux, bump the process size limit from the generic 3GB to 3.7GB:
Edit /usr/src/linux-2.4/include/asm-i386/page.h
Change 0xC0000000 to 0xEC000000 and rebuild the kernel
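Where the 3GB and 3.7GB figures come from can be checked with a little arithmetic. The sketch below assumes the constant being edited is the address at which 32-bit x86 Linux splits user space from kernel space, so everything below it is available to a process; user_space_gib is a hypothetical helper.

```python
GIB = 2 ** 30  # bytes per GiB

def user_space_gib(split_address):
    """Per-process virtual address space below the split, in GiB."""
    return split_address / GIB

print(user_space_gib(0xC0000000))  # 3.0
print(user_space_gib(0xEC000000))  # 3.6875
```

The trade-off is that the kernel's own share of the 4GB address space shrinks from 1GB to about 0.31GB.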


Performance Results
Corrent increased RTL-based simulation performance by over 6X!
Average speed increase measured over 15 different simulation runs of a 15M gate RTL design
Incremental changes measured between flag settings
Profiling was essential!

Corrent sped up gate-level simulations by an average of 22% using +timopt
Some simulations were able to fit into 3.7GB of process size with +memopt


Performance Results (cont)


A: Baseline script
B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compile-in Debussy PLI and other debugging PLIs
E: With +nospecify switch

Test Name                        A        B        C        D        E
ahb_cfg_test1               30.770    7.340    7.360    6.650    5.940
ahb_cfg_test2               13.130    4.370    4.380    4.170    3.740
ahb_cfg_test_incr          186.830   33.860   33.780   29.530   25.740
ahb_cfg_wr_rd               67.610   13.980   13.940   12.400   10.980
ahb_ddr_bw                 115.140   22.310   22.250   18.970   16.680
ahb_ddr_test1               61.570   13.000   12.940   11.260    9.890
ahb_ddr_test_incr          103.870   21.420   21.270   18.510   16.320
ahb_ddr_wr_rd              100.100   20.520   20.340   17.400   15.400
ahb_memctl_test1           217.370   34.560   34.330   29.610   25.220
ahb_memctl_test_sdram      131.330   23.970   23.900   21.170   18.420
gmi_mission_test1          416.290   76.550   76.510   67.790   59.320
gmi_mission_test2          416.880   76.840   76.940   67.890   59.560
gmi_mission_test3          416.340   76.850   76.630   67.670   59.470
gmi_pause_test1             76.250   15.730   15.700   13.920   12.560
gmi_ser_test                97.630   18.230   18.180   16.160   14.140

Speed Increase over (A)          -     533%     534%     608%     693%
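The slide does not say how the bottom-row percentages were derived. One plausible reading, sketched here, is the ratio of total run time in column A to the total in each other column; truncated to a whole percent, that ratio matches the slide's figures. The TIMES dictionary and total_time_ratio are illustrative names, and the values are copied from the slide's table in row order.

```python
# Per-test run times for configurations A..E, in the slide's row order.
TIMES = {
    "A": [30.770, 13.130, 186.830, 67.610, 115.140, 61.570, 103.870, 100.100,
          217.370, 131.330, 416.290, 416.880, 416.340, 76.250, 97.630],
    "B": [7.340, 4.370, 33.860, 13.980, 22.310, 13.000, 21.420, 20.520,
          34.560, 23.970, 76.550, 76.840, 76.850, 15.730, 18.230],
    "C": [7.360, 4.380, 33.780, 13.940, 22.250, 12.940, 21.270, 20.340,
          34.330, 23.900, 76.510, 76.940, 76.630, 15.700, 18.180],
    "D": [6.650, 4.170, 29.530, 12.400, 18.970, 11.260, 18.510, 17.400,
          29.610, 21.170, 67.790, 67.890, 67.670, 13.920, 16.160],
    "E": [5.940, 3.740, 25.740, 10.980, 16.680, 9.890, 16.320, 15.400,
          25.220, 18.420, 59.320, 59.560, 59.470, 12.560, 14.140],
}

def total_time_ratio(baseline, candidate):
    """Total run time of the baseline column divided by the candidate's."""
    return sum(TIMES[baseline]) / sum(TIMES[candidate])

for column in "BCDE":
    ratio = total_time_ratio("A", column)
    print(f"{column}: {ratio:.2f}x faster than A ({int(ratio * 100)}%)")
```

Using totals rather than an average of per-test ratios weights the long gmi_mission tests most heavily, which is also what matters for regression wall-clock time.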


Performance Results (cont)


Command line switches alone:
B: Removal of instanced gate-level delay chains
C: Remove +acc+2, -I, and -PP switches
D: Remove compile-in Debussy PLI and other debugging PLIs
E: With +nospecify switch

Test Name                        B        C        D        E
ahb_cfg_test1                7.340    7.360    6.650    5.940
ahb_cfg_test2                4.370    4.380    4.170    3.740
ahb_cfg_test_incr           33.860   33.780   29.530   25.740
ahb_cfg_wr_rd               13.980   13.940   12.400   10.980
ahb_ddr_bw                  22.310   22.250   18.970   16.680
ahb_ddr_test1               13.000   12.940   11.260    9.890
ahb_ddr_test_incr           21.420   21.270   18.510   16.320
ahb_ddr_wr_rd               20.520   20.340   17.400   15.400
ahb_memctl_test1            34.560   34.330   29.610   25.220
ahb_memctl_test_sdram       23.970   23.900   21.170   18.420
gmi_mission_test1           76.550   76.510   67.790   59.320
gmi_mission_test2           76.840   76.940   67.890   59.560
gmi_mission_test3           76.850   76.630   67.670   59.470
gmi_pause_test1             15.730   15.700   13.920   12.560
gmi_ser_test                18.230   18.180   16.160   14.140

Speed Increase over (B)          -       0%      14%      30%


Summary
Clean the simulation scripts by removing the following:
-I
+acc+2
-P [library] (unused PLI calls)
-PP

Add these flags to the simulation scripts:
-Mupdate -o csrc (use local disk)
+rad
+nospecify
+nbaopt (if required)


Summary (cont)
Clean your Verilog of #0 and #1 delays
Optimize your pli.tab files
Profile your simulation! Nasty time-sinks can be resolved!


References
VCS 5.0 and 6.0 User Guides, Synopsys Corporation, 2002.
"Test Benches: The Dark Side of IP Reuse," Gregg D. Lahti, San Jose SNUG 2000 paper. http://gateslinger.com/chiphead.htm or http://www.synopsys.com/news/pubs/snug/snug00/lahti_final.pdf
"Verilog Nonblocking Assignments With Delays, Myths and Mysteries," Cliff Cummings, Boston SNUG 2002 paper. http://www.sunburst-design.com/papers/
ESNUG posts: 380 item 11, 383 item 9, 387 item 16. http://deepchip.com/esnug.html
Solvnet: http://solvnet.synopsys.com

Special thanks to Mark Warren for the fruit basket and review of the paper!
