
FUNCTIONAL VERIFICATION WHITEPAPER
www.mentor.com
QUESTA SIM PERFORMANCE HANDBOOK
Mentor Graphics Company Confidential
2012 Mentor Graphics Corporation
All Rights Reserved
Note: This document contains information that is confidential
and proprietary to Mentor Graphics Corporation. This
information is supplied for identification, maintenance,
evaluation, engineering, and inspection purposes only, and
shall not be duplicated or disclosed without prior written
permission from an authorized representative of Mentor
Graphics. This document and any other confidential
information shall not be released to any third party without a
valid Confidential Information Exchange agreement signed by
the third party and an authorized Mentor Graphics
representative. In accepting this document, the recipient
agrees to make every reasonable effort to prevent the
unauthorized use of this information.
Questa Sim performance handbook
www.mentor.com
Mentor Graphics Internal Use Only
TABLE OF CONTENTS
1. How to use this handbook
2. Performance tips for simulating with Questa
2.1. vlog/vcom/sccom compilation
2.1.1. General guidelines for compilation
2.1.2. Incremental compilation
2.1.3. Distributed compile with sccom
2.1.4. Optimizing SystemC
2.1.5. Specific compile-time optimizations for VHDL designs
2.1.6. Precompiled libraries
2.2. vopt optimization engine
2.2.1. Optimizations when using PLI
2.2.2. Inlining percentage
2.2.3. Optimizing cells
2.2.4. Floating parameters/generics
2.2.5. Parallel code generation in vopt
2.2.6. Pre-optimized Design Unit (PDU)
2.2.7. Tuning vopt performance
2.3. Simulation performance
2.3.1. Simulator resolution
2.3.2. WLF logging
2.3.3. FSDB logging
2.3.4. Additional guidelines on simulation runtime performance
2.4. Simulating Verilog and SystemVerilog designs
2.4.1. RTL simulations
2.4.1.1. Modeling memory
2.4.1.2. Simulating with code coverage
2.4.2. Gate-level simulations
2.4.2.1. Compiled SDF
2.4.2.2. Timing simulations
2.4.3. Simulating assertions
2.4.4. ATPG test simulations
2.5. Simulating VHDL designs
2.6. Multi-core simulations
3. Analyzing Questa Sim performance reports and other utility tools
3.1. Understanding simstats report
3.2. Understanding write report -l output
3.3. Profiling 64-bit binaries and designs with PLI/DPI
3.4. Understanding profile reports
3.5. Understanding capacity reports
3.6. Tools to debug run-to-run variability
3.7. Performance data collection for QVIP
4. Performance-aware coding techniques
4.1. Verilog and SystemVerilog coding styles for improving performance
4.2. VHDL coding styles for improving performance
4.3. Performance-aware coding of SystemVerilog testbenches with verification methodologies (UVM/OVM)
4.3.1. The evil of UVM auto-config
4.3.2. More tips: A list of SV coding styles that affect performance and memory
4.4. Writing performance-aware constraints
4.4.1. Solver and memory
4.4.2. Solver and performance
4.5. Performance-aware coverage collection
4.5.1. Code coverage
4.5.2. Functional coverage
4.6. Writing performance-aware assertions and cover directives
1. HOW TO USE THIS HANDBOOK
This handbook is a collection of suggestions intended to help customers get optimal performance from the Questa
simulator. It is broken into three broad sections: performance tips for simulation, a performance-analysis section,
and performance-aware coding techniques. Each section contains multiple sub-sections so that readers can
quickly navigate to the appropriate topic.
The contents of this document are current as of October 2012 and Questa Sim 10.1c; any specific change in
behavior in older, supported versions or in newer versions is explicitly mentioned with a version number.
This handbook was developed through the combined expertise of the Questa Verification Technologist team. As
customer coding techniques and the Questa simulator continue to evolve, so will this document. At the beginning
of any performance engagement with a customer, please get the latest version from the collateral index. If you
have any input you would like to provide, please email vt_questa@mentor.com.
2. PERFORMANCE TIPS FOR SIMULATING WITH QUESTA
2.1 vlog/vcom/sccom compilation
2.1.1 General guidelines for compilation
a. Do not compile with -novopt. Compiling with -novopt causes assembly code to be generated at compile time;
if optimization (vopt) is performed afterwards, this generated assembly code is discarded and the work is wasted.
b. When compiling into a single work library, avoid calling vlog/vcom multiple times (many commands). Instead,
make a file list and pass it to vlog/vcom with the -f switch.
c. Do not use the -sv switch with vlog if the design is pure Verilog.
d. Compilation is I/O intensive. File system performance, network traffic, and machine load are big factors in
compilation throughput.
If you have a large number of files, copy them to a local disk or /tmp prior to compiling and simulating.
Another approach for really large designs that involve millions of file I/O operations during compilation and
optimization is to create the work library in the /tmp area of the (grid) machine used and copy the
compressed work library to the network location after the build completes. When you simulate, start by
copying the compressed work library to the /tmp area of the grid machine, untar it, run the simulation,
and remove the libraries from /tmp.
An important tip is to keep the paths in the modelsim.ini file relative.
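The /tmp flow above can be sketched as follows (the paths and archive commands are illustrative assumptions, not a documented flow):

```shell
# Build on the grid machine's local /tmp, then publish the library
cd /tmp/build_area
vlib work
vlog -f compile_list.f
tar czf work.tgz work
cp work.tgz /net/project/libs/

# Simulate: restore the library locally, run, clean up
mkdir -p /tmp/sim_area && cd /tmp/sim_area
cp /net/project/libs/work.tgz . && tar xzf work.tgz
vsim -c top_opt -do "run -all; quit"
cd / && rm -rf /tmp/sim_area
```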
e. If the compilations are intended for batch mode only (for example, regressions), you can use the
-nodbgsym switch to prevent generation of the debug symbol database, which is only useful for some GUI-
based debugging operations such as source annotation and textual dataflow.
f. Use +cover instead of -cover. It is more powerful and flexible, and often yields better performance. It is also
suggested to enable coverage collection through +cover in the vopt phase rather than in the compile phase.
This allows creation of different versions of the optimized image, with and without coverage, without the
need to recompile.
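For example (file-list and design-unit names hypothetical), one compile can feed both a coverage and a non-coverage optimized image when +cover is deferred to the vopt phase:

```shell
vlog -f compile_list.f                 # compile once, with no coverage options
vopt top -o top_opt                    # optimized image without coverage
vopt +cover=bcesf top -o top_opt_cov   # optimized image with coverage enabled
```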
g. Avoid using the lower optimization control switches (vlog/vcom -O0/-O1) during the compilation phase. The
vlog/vcom options that have equivalent vopt options are more specific: they are used to specify non-default
values of the option for particular design units, while the default value is passed to vopt. The optimization
control switches passed to vlog/vcom will affect how the design units are optimized during the vopt phase.
h. Avoid using +acc settings on vcom or vlog. These settings are difficult to detect during optimization and can
significantly impact performance.
2.1.2 Incremental compilation
The default incremental compilation allows faster compilation turn-around time. Compile with -noincr only when a
fresh compilation of the entire design is required.
Note: Small changes to one part of the design may cause the whole design to be recompiled, even with
incremental compilation, if the change affects the rest of the design.
A more creative use of incremental compilation is to compile independent blocks of the design into
different work libraries. The compile scripts can be written such that when portions of the design change, only the
corresponding work libraries are touched for recompilation.
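A sketch of the per-block library approach (block, library, and file-list names hypothetical):

```shell
vlib lib_blockA
vmap lib_blockA ./lib_blockA
vlog -work lib_blockA -f blockA_files.f   # re-run only when block A changes
vlib lib_blockB
vmap lib_blockB ./lib_blockB
vlog -work lib_blockB -f blockB_files.f   # re-run only when block B changes
```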
Memoization:
See this Wikipedia entry for a good starter: http://en.wikipedia.org/wiki/Memoization.
It may be possible to set up a compilation flow based on memoization to greatly speed up incremental
compilations. The technique works well only in a fully disciplined compilation flow in which nothing other than the
source affects the compilation. There are serious concerns with using the technique, and it is not recommended, in
a setup where environment variables, command-line arguments, compilation units, library paths, incdirs, and
macro definitions play a major part in the flow.
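To make the idea concrete, here is a minimal, tool-agnostic sketch (Python, purely illustrative) of memoizing a compile step on a fingerprint of the source contents. As noted above, this is only sound when the fingerprint captures every input that affects compilation:

```python
import hashlib

def fingerprint(path):
    # Hash the file contents; in a real flow the key would also have to
    # fold in command-line arguments, environment, incdirs and macros.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

class MemoizedCompiler:
    """Run compile_fn only when a source file's fingerprint changes."""
    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.cache = {}       # fingerprint -> compiled artifact
        self.compiles = 0     # how many real compilations were performed

    def compile(self, path):
        key = fingerprint(path)
        if key not in self.cache:
            self.compiles += 1
            self.cache[key] = self.compile_fn(path)
        return self.cache[key]
```

Compiling the same unchanged file twice performs only one real compilation; editing the file changes the key and triggers a recompile.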
2.1.3 Distributed compile with sccom
sccom supports distributed compilation of C/C++ files using the MPICH library starting with Questa Sim 10.0. The
distributed compilation feature can be enabled with the undocumented switch sccom -distributed <hostsFile>,
where <hostsFile> contains the list of hostnames and the number of processes to be run on each host.
Note: The user has to set up a ring of MPI hosts by starting the mpd daemon on each of the hosts that will be used
for running distributed sccom children.
Benchmarks have shown a performance improvement of 4x-5.5x with distributed sccom over the non-distributed
version.
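As a sketch, assuming the common MPICH machinefile convention of hostname:process-count (host names hypothetical):

```shell
# hosts.txt - one line per host: hostname:processes
grid-node01:4
grid-node02:4
grid-node03:2
```

followed by sccom -distributed hosts.txt together with the usual sccom arguments.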
2.1.4 Optimizing SystemC
To optimize SystemC, gcc optimization switches can be used with sccom, as in sccom -O2 or sccom -O3. However,
note that compile times may increase when using these options.
Note: Using the visibility switch sccom -g reduces performance. The debug switch -g can be used with the -O
switches, but this leads to limited debug capability.
2.1.5 Specific compile-time optimizations for VHDL designs
Most VHDL designs are optimized for performance with the default compiler options. Some designs with a number
of for loops or a large number of arrays may simulate faster if you use additional compiler arguments to vcom. The
-O5 option enables additional compiler optimizations, especially for loops. The -nocheck arguments eliminate
checks for out-of-bounds scalar assignments or out-of-bounds accesses to arrays. These arguments are summarized
below:

-O5 - Enable additional compiler optimizations, especially for for loops
-nocheck - Disable range and index checks
-noindexcheck - Disable index checks
-norangecheck - Disable run-time range checks

2.1.6 Precompiled libraries
External IP blocks or large legacy blocks that do not change during the design development phase can be
precompiled as a locked work library in a common workspace and linked during optimization and simulation from
the work areas of different engineers. This locked library can be created once for every major release of Questa Sim
(or refreshed), and both 32-bit and 64-bit versions can be precompiled.

2.2 vopt optimization engine
It is important to understand that preserving visibility through vopt +acc options has a very negative impact on
performance, both in terms of simulation runtime and memory footprint. The various +acc options preserve access
to the objects specified by the settings and hence prevent a number of optimizations from being applied.
In fact, using a global vopt +acc is almost equivalent to simulating with -novopt. Users should never use -novopt or
a global +acc setting to resolve a visibility problem.
For optimal performance vopt options should contain only the minimal set of accesses required for successful
simulation (generated using the learn flow, for example) and should not contain any other +acc options.
To explicitly preserve visibility, use a reduced set of +acc options, restricting them by specification subset, depth,
module name, instance name, etc.
Tip: The switches +noacc and +nocover that are available from Questa Sim 10.1 can be used to disable access or
coverage collection on objects or specific regions of the design.
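For example (instance paths hypothetical; the exact selector syntax is version-dependent, see the +acc documentation), visibility can be limited to the block under debug while an IP core stays fully optimized:

```shell
vopt top -o top_opt +acc=rn+/top/dut/blk_under_test. +noacc+/top/dut/ip_core.
```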
2.2.1 Optimizations when using PLI
In the 2-step flow, optimizations are disabled automatically when PLI is used, in order to generate the visibility the
PLI requires.
Tip: Use the PLI learn flow to generate the required visibility and turn off automatic +acc application using vsim
-no_autoacc. In a competitive situation it is often easier to find a pre-existing TAB file from another tool and use it
directly.
An optimized image created by vopt in a 3-step flow will not be touched during elaboration, even in the presence
of PLI. Any visibility required in such a case should be provided as part of the vopt stage.
When a .tab file is required for simulation, the visibility required is generally the regions specified in the TAB file. Use
this file with both vopt and vsim and optimize the rest of the design. Ensure that the TAB file does not ask to
preserve complete visibility through the acc=r:* option; this causes +acc to be applied during optimization.
2.2.2 Inlining percentage
A higher inlining percentage translates to better performance and a smaller memory footprint in many cases. Inlining
is not an optimization by itself, but when instances are inlined, a number of optimizations kick in, which
in turn give better runtime performance and memory footprint.
The inlining number can be found towards the end of the write report -l output, for example:

Module instances: 865 (697 inlined, 25465 table entries)

Note: Hierarchical references and +acc=p, +acc=n, or +acc=f can prevent or reduce inlining and hence disable a
number of optimizations that can be applied only to inlined instances.
The undocumented switch vopt -inlineVerbose prints detailed messages on the different inlining decisions made
into the optimization log. The output from this switch can be used to check why the inlining number is
poor and can help make decisions on changing the inline factor.
Tip: The undocumented switch vopt -inlineFactor=<n> changes the inlining number. 0 indicates no inlining,
and values in increasing powers of 2 can sometimes result in a higher inlining percentage, with the default inline
factor being 128. However, be aware that an indiscriminate increase of -inlineFactor can result in an explosion in
native code size and can result in memory allocation errors. Use the undocumented command
mti_kcmd codestats transcript after elaboration to keep an eye on the code expansion.
A high inline factor is more useful for design units closer to the leaves of the design hierarchy. If the inlining value
must be changed, it is recommended to use a PDU to isolate such areas rather than changing the value for the
global vopt.
2.2.3 Optimizing cells
Non-empty specify blocks with a specify path determine whether module instances are optimized using gate-level
cell optimizations. Please see section 2.4.2 to learn how to use write cell_report to determine whether a cell is
optimized in the design.
Tip: Use the undocumented switch vopt -autoco to automatically identify module instances as cell candidates if
they do not contain specify blocks. The switch works well for UDP-based cells.
There are two undocumented switches that may help optimize more module instances as cells under certain
conditions. Note that these switches are not to be used in customer flows without factory knowledge:

vopt +forcecellopt - forces the aggressive cell optimization algorithm on every module instance
that does not qualify as a cell when used together with -autoco; note
that this option may cause the vopt process time to increase.
vopt +inlinecui - allows module instantiations to be inlined into a cell that is being
optimized even when +acc=p is being used (allowing cell
optimizations); it is useful for optimizing cells that do not optimize
because they instantiate small modules without timing, similar to a UDP.

Typically, UDP-based gates are better as optimized cells, and in general, module instances that belong to a
synthesized netlist fall into this category, especially in functional (non-timing or zero-delay) simulations.
Assignment-based gates fare better as RTL modules with inlining.
When working with an existing customer you may find usage of the -fast switch on the vlog command line. The
-fast switch is deprecated and should not be used in customer flows; vopt is the desired gate-level
optimized flow. Please contact the factory via vt_questa@mentor.com if you encounter customers using -fast in
their design flows for any reason.
2.2.4 Floating parameters/generics
No optimization is done on floated parameters and generics and their associated logic. This can cause a big hit to
performance, especially when parameters are floated everywhere in the design.
Avoid using the generic vopt +floatparameters or vopt +floatparameters+top. (note the trailing period, which
makes the option recursive) for floating all parameters in design units recursively. Float only the required
parameters by specifying the parameter name or specifying instance names with +floatparameters.
Tip: If too many parameters need to be floated, it is suggested to create an optimized design for each generic or
parameter value you need to simulate. It is also recommended to create a PDU to insulate versions with floating
parameters.
2.2.5 Parallel code generation in vopt
The parallel code generation feature of vopt is on by default starting with the Questa Sim 10.1 release for Verilog/
SystemVerilog designs, and starting with Questa Sim 10.2 for VHDL designs or designs with major VHDL portions.
The vopt engine automatically determines the number of parallel jobs to run at once, based on the number of
cores on the host machine. This number can be changed by the user through the switch vopt -j <n>. Running on
hosts with at least 8 cores is good; 16 cores is probably optimal.
Tip: Optimization runs that spend the majority of their time in code generation are the ideal candidates for the best
performance throughput from this feature.
The code generation time can be found in the vopt log when the undocumented option -opt=tphases is added to
vopt. However, note that adding -opt=summary to vopt will not produce any output when parallel code
generation is turned on.
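As a simple sketch (design-unit name hypothetical), the automatic job count can be overridden on a well-provisioned host:

```shell
vopt top -o top_opt -j 16
```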
2.2.6 Pre-optimized Design Unit (PDU)
The PDU flow optimizes portions of the design while allowing other portions to be modified or recompiled.
Tip: Use the PDU flow to maximize throughput for designs with many different testbenches and an unchanged
DUT or for multiple different configurations of the same DUT, such as fully optimized configuration and a debug
configuration with full or partial visibility.
The difference between this flow and a standard optimized flow is that vopt is run twice. The first run creates the
PDU (usually the DUT). The optional second run of vopt optimizes the testbench and loads the previously
optimized DUT. In some cases where the TB is very small compared to the pre-optimized DUT, vopt may not be
required for the TB.
For the first simulation, there are no time savings. However, for the next simulation, simply compile the next test
and launch the simulator. The simulator will sub-invoke the optimization engine on the testbench (or simply load
the TB in -novopt mode) and load the pre-optimized DUT. In this second simulation, the time to optimize the DUT
is saved. For a large gate-level design, this can be substantial. The greater the number of runs, the greater the total
throughput of the runs.
To ignore the PDU object, delete the object from the library (vdel) or instruct the simulator to ignore it using vsim
-pduignore.
Note: Creating a PDU (black-box) does not improve simulation run-time compared to the standard vopt flow. It
helps reduce optimization time by reusing optimized portions of the design.
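A sketch of the two-run flow described above (design-unit names hypothetical; exact PDU-creation switches depend on the Questa Sim version):

```shell
vopt dut_top -o dut_opt     # run 1: create the pre-optimized DUT (PDU), once
vlog test_n.sv              # per test: recompile only the testbench
vopt tb_top -o tb_opt       # run 2 (optional): optimize the TB, linking dut_opt
vsim tb_opt                 # the DUT optimization time is not paid again
```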
2.2.7 Tuning vopt performance
It has been found that there can be considerable reduction in the vopt wall-clock time if the compiled work
libraries are available in the local scratch space of the machine used for simulation.
Tip: In Questa 10.0, use the vopt -nodbgsym switch to prevent the creation of debug symbol library files while
creating the optimized image. The switch can help improve vopt performance as well as reduce the disk footprint of
the optimized image. However, note that certain debug features such as source code annotation and textual
dataflow depend on the debug symbol libraries, and using the switch prevents them from working. This is
particularly effective on grids with slow servers.
Note that this switch is a compile-only option starting with Questa Sim 10.1 and has no impact on vopt.
2.3 Simulation performance
2.3.1 Simulator resolution
The resolution of the simulation can have a major impact on the runtime; for example, a design that can be
simulated with a 1ns resolution can be orders of magnitude slower when simulated at 1fs resolution. Look in the
design unit section of write report -l to determine the time scale in effect for each design unit.
When the simulator resolution is not explicitly provided through the vsim -t override option, the minimum of the
time-precision values of all the applicable timescales is used as the resolution. The timescales can be provided in the HDL
code or through the vlog/vopt -timescale options. When the vsim timescale override is not provided, the timescale
in effect is determined in the following order of precedence:
a. Explicit timescale directives that are in effect during compilation.
b. The vlog -timescale value, for design units that do not have HDL directives.
c. The vopt -timescale value, for all remaining design units that have neither HDL directives nor a vlog value.
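For example, if the design is known to be safe at a coarser resolution, the override can be given explicitly at vsim (design-unit name hypothetical):

```shell
vsim -t 1ns top_opt
```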
2.3.2 WLF logging
The impact of logging signals on the performance of a test depends on two main factors: the signals preserved for
logging and the actual logging operation. When signals in the design are preserved by using +acc settings, several
optimizations are disabled, and this can have a cascading effect on the optimization level of the design.
Several opportunities to speed up the design are lost by preserving visibility to signals through global settings such
as +acc, +acc=npr, etc. During the development phase of the design, it is suggested that visibility is limited to only
the block or region under development for the various debug operations. Third-party IP blocks and pre-verified
blocks of code should be optimized fully, and only the block under test should be kept open for logging/debugging.
This can be achieved either through the PDU flow or by applying the +noacc option to the blocks that are not
required for debugging.
WLF threading improves simulation performance when writing lots of logged data to the WLF file. It takes advantage
of multithreading technology and is supported on multi-core Linux and Solaris machines. However, it should also be
noted that, due to the overhead involved, logging signals can still make the simulation slower than running with no
signals logged at all.
Note: Multi-threaded WLF logging may report the simstats cpu time as higher than the wall-clock time. In such
cases, the wall-clock time gives a better representation of the actual simulation time.
During the debug phase, it is recommended to log only the signals or blocks of interest in place of logging every
signal in the design. This can help reduce the impact of logging on performance and still provide the required
signals for debugging.
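For example (hierarchy path hypothetical), a DO-file command that logs recursively only under the block of interest, rather than logging the whole design:

```shell
log -r /tb/dut/blk_under_test/*
```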
In regression mode, the tests should have minimum visibility of signals required for successful simulation of design.
The minimal set of +acc settings required for external accesses such as PLI, DO file, etc. can be obtained from the
PLI learn flow which is documented in detail in the Questa Sim manuals. This ensures a design that has the
maximum available optimization level applied on it and hence maximum throughput can be realized.
Typically there will be no need for logging signals during a regression run and it is suggested to not log any signals
for optimal performance.
2.3.3 FSDB logging
Logging signals with Novas/Verdi and generating an FSDB is known to cause a performance slowdown due to the
increased overhead of logging. For the same design with the same set of visibility, FSDB logging is expected to
increase simulation runtime by 10-15%. Anything over this should be investigated, and the profiler is a good starting
point to figure out how much time is spent under the FSDB PLI calls.
It has been found that using the latest version of Verdi can sometimes help in improving Questa Sim runtime
performance.
Tip: Use the environment variable setting setenv SKIP_CELL_INSTANCE 1 to prevent FSDB logging of cell
internals. In a large gate-level design, setting this environment variable can help with faster logging and a smaller
FSDB file size.
During evaluations or competitive replacements, it is important to make sure that Questa Sim and the competitor
simulator(s) are logging approximately the same amount of signals. The size of the FSDB file generated is an early
indicator of any discrepancies.
2.3.4 Additional guidelines on simulation runtime performance
a. Have adequate physical memory to run the process. Swapping to virtual memory can significantly impact
performance of any run.
b. Restrict the use of 64-bit OS version of Questa Sim to those jobs that require more than 4GB of memory to
run. 64-bit OS versions consume approximately 30% more memory and are approximately 30% to 2x slower
than 32-bit versions of the same OS.
c. When there are no design and TB changes or vsim option changes between many simulation runs of a
regression suite, it is best to use the Questa Sim elaboration file flow if the elaboration time of a single test is
significant. It can provide elaboration time savings when there are a large number of simulations in the suite.
d. Elaborating an optimized image is generally faster than elaborating a non-optimized image. The simulator
loads fewer objects across the network when an optimized image with good inlining is elaborated.
e. The simulator GUI adds memory and runtime overhead. Unless you are interactively debugging a design,
execute the designs from shell by using vsim -c.
f. Run the simulations in the coarsest resolution possible. For example, do not run in ps mode if ns resolution is
functional. Some design knowledge is usually needed to judge whether it is safe to use a coarser resolution
without encountering rounding issues.
g. Use self-checking testbenches to eliminate the need for file I/O. This improves performance.
h. Compile and run designs from a local drive/disk whenever possible. Network traffic can significantly slow
down processes that require large amounts of file I/O. If you have a large number of files, copy them to a
local disk prior to compiling and simulating.
i. Monitor the load of the machine. A machine with multiple jobs competing for CPU and memory resources
impacts wall clock run time. Also multi-cpu machines must compete for the same memory interface and can
impact the run time of a job.
j. A lot of messages can make a simulation slower. Use vsim -msgmode tran to output elaboration and runtime
messages only to the transcript and not to the WLF file. This helps improve performance, but the messages are not
available in the Message Viewer post-simulation.

When creating environments such as OVM, turn off messaging for better performance:

set_report_severity_action_hier(OVM_INFO, OVM_NO_ACTION);
set_report_severity_action_hier(OVM_WARNING, OVM_NO_ACTION);

k. Logging signals can cause a slowdown in performance and should be approached cautiously.
2.4 Simulating Verilog and SystemVerilog designs
2.4.1 RTL simulations
2.4.1.1 Modeling memory
Tip: Use sparse memories when the number of accessed addresses is low compared to the total number of
addresses in a memory declaration.
Example:
Consider the following memory declaration:

reg [31:0] mem1 [0:1000000];

This declaration needs 8MB of physical memory with a 4-state type. Using sparse memory modeling, this number
reduces to 125KB with 1000 accesses.
Note: Memory accesses will be slightly slower with sparse memories.
2.4.1.2 Simulating with code coverage
Simulations with code coverage turned on are expected to be slower than simulations without code coverage. The
degree of slowdown depends on the design, on optimizations, and on coverage options.
A few general rules of thumb:
If only statement and branch coverage are on, the overhead should be less than 20%.
If expression coverage is on and there are a lot of assignments with complex expressions on the RHS,
especially xor expressions, it can be 2-5x.
If toggle coverage is turned on everywhere, a lot of optimizations are turned off and the slowdown can easily
be 2-4x or more.
Note: Even when no coverage is collected during simulation, the -coveropt and +cover options on the vlog/vopt
commands will slow down simulations, since they disable optimizations.
Coverage exclusions:
There can be a huge performance penalty when a large number of coverage exclusions are applied. The best
approach from the performance perspective is to use pragma exclusions embedded in the source code. There is
virtually zero overhead involved with this flow and the gains can be enormous but the drawback is that you have
to touch the design files.
A few alternate suggestions for writing more efficient coverage exclusions:
It is more efficient to exclude a whole vector at once than to exclude it bit by bit.
It may be possible to use wildcards or other techniques to combine the different exclusion commands into a
smaller but equivalent set.
It has also been found that multiple coverage exclusions applied from a DO file generally take less time to complete
than copy-paste of the exclusion commands in live, interactive mode. The slowdown with copy-paste of exclusions
in the GUI command prompt is due to the time taken to format transcript text for each Tcl command.
2.4.2 Gate-level simulations
By default, timing-ready cells are optimized by Questa Sim, and the ratio of optimized cells to the total number of
cells in the design is an important factor that can affect simulation performance. Cells automatically qualify for
specific cell optimizations, but the presence of -novopt or +acc or +acc=c causes all cell
optimizations to be turned off, so no optimizations can be done. In addition, gate-level optimizations are turned
off by the vopt +check<CODE> option, which can result in lower performance.
Better performance and a low memory footprint require the high-instance-count gates to be optimized. Use the
reports generated by write cell_report and write report -l to determine the level of optimization.
Tip: Use the messages generated by vopt -debugCellOpt to diagnose cells that don't optimize.
The built-in diagnostic tools -debugCellOpt and write cell_report are very useful for figuring out cell optimization
issues. The cell report prints out the list of all instances of cells in the design and how many were optimized or
not optimized. This is a good starting point to find all non-optimized cells in the design and look for high-instance-
count non-optimized cells (which cause the most impact on performance). The optimization log with -debugCellOpt
provides further insight into the cell optimization decisions, often printing out reasons for not optimizing a
particular cell.
It is very important to diagnose cell optimization issues with -debugCellOpt as early as possible in the cell library
development stage. It can help catch a number of modeling issues that may be difficult to overcome later in the
design cycle.
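A sketch of the diagnosis steps described above (top-level and file names hypothetical; see the reference manual for exact write cell_report arguments):

```shell
vopt -debugCellOpt top -o top_opt    # reasons for non-optimized cells go to the vopt log
vsim -c top_opt
VSIM> write cell_report cells.rpt    # lists optimized vs. non-optimized cell instances
```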
2.4.2.1 Compiled SDF
SDF files used repetitively without modification should be compiled using sdfcom for significant optimization time
savings. When sdfcom is run implicitly by vopt, the SDF is parsed and visibility is preserved in those cases where
interconnect delays are annotated to objects that will become inlined by vopt. This is not a problem for modules
with specify blocks. However, some large designs may have interconnect delays being annotated to modules that
do not have specify blocks. This case is not an issue when using compiled SDF.
When vopt is run, sdfcom will be run implicitly if vopt detects one of the following:
$sdf_annotate
-sdfmin / -sdfmax / -sdftyp
2.4.2.2 Timing simulations
Due to how Questa's optimized evaluators and schedulers work, all cell timing should be done through specify blocks.
Do not use delayed timing statements.
Avoid using distributed delays. Use `ifdef blocks where these values must be used.
Use +delay_mode_path during vlog or vopt.
All blocking and nonblocking assignments should be zero delay.
Keep the default path delay values as small as possible.
Note: The threshold argument of the $width check should not be used in modeling. It cannot be overridden
during SDF annotation, as it is not defined in the SDF standard, and odd side effects can result from its use.
Timing checks:
Avoid using a high-activity signal, such as a clock, in a timing check condition.
Remove timing checks from cells using vopt +notimingchecks if these checks are not required, to improve
performance and reduce memory usage. In a 3-step flow, the +notimingchecks switch should be provided to
vopt.
2.4.3 Simulating assertions
The assertion-related switches are mainly for debugging assertions; they are intended to give the user access to all
the debug features and hence are not meant for best performance. For this reason, most of the assertion debug
features require the use of the -assertdebug switch.
The -nosva/-nopsl switches are intended to run a simulation as if there were no SVA/PSL assert or cover directives
in the design. These switches can be used to easily remove such constructs from any simulation without having to
re-compile or edit the user code.
Tip: Use of +acc=a was intended for logging only the signals in assertions, so as to limit the need for +acc on the
full design.
2.4.4 ATPG test simulations
The most common way to accelerate ATPG testing is by a technique called broadside-load.
The idea of ATPG test in manufacturing is this:
a. Initialize all registers in the design by shifting in a multi-thousand-bit vector, one clock per bit.
b. Apply a real (non-scan) clock to propagate register values through combinational logic and capture the results
in registers.
c. Read the state of all registers by shifting out a multi-thousand-bit vector, one clock per bit.
Simplistic simulation solutions emulate this strategy exactly. For a 10,000-register chip, it takes 20,001 clocks to
simulate each scan vector. That is expensive.
The broadside-load solution directly initializes the nets and registers in an out-of-band, parallel fashion, for
example by using the force command, testbench hierarchical references, or a DPI/PLI C application. This saves the
20,000 shift clocks in the example above and can lead to orders-of-magnitude performance improvement for ATPG
simulations.
Alternate approaches include simulating one or two representative vectors serially and using broadside-load for the
remaining ones. It can also help to rewrite the PLI models that provide the scan chain vectors as higher-performance
DPI models.
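The broadside-load idea can be sketched in a few lines of testbench SystemVerilog. All instance paths and signal names below (dut, scan_state) are hypothetical, and a real design would deposit per-chain or per-flop values rather than one packed vector:

```systemverilog
module tb;
  logic clk = 0;
  logic [9999:0] stimulus_vec, captured_vec;
  // A DUT instance is assumed here, e.g.:  my_chip dut (.clk(clk) /* ... */);

  initial begin
    // Out-of-band parallel initialize: one hierarchical write replaces
    // 10,000 serial shift-in clocks.
    dut.scan_state = stimulus_vec;
    @(posedge clk);                  // single real (non-scan) capture clock
    captured_vec = dut.scan_state;   // parallel read-back instead of shift-out
  end
endmodule
```

The same deposit can be done from the vsim prompt with the force command or from a DPI/PLI C application, as noted above.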
2.5 Simulating VHDL designs
The undocumented switch vopt -vhdebug=all is a diagnostic tool that dumps VHDL optimization information. In
addition, there are some key areas to look out for while diagnosing a slowdown in a VHDL-intensive design.
a. During simulation, look out for two or three widely separated delay resolutions, for example ns and fs, or ns
and ms.
b. Look out for a very fine simulator resolution, such as fs, being used.
c. Look out for the clock suppression percentage using the vopt -vhdebug=all option. Anything below 70-80% implies
that it would be useful to capture the unique flop coding styles and report them.
d. Look out for the inlining percentage, using the vopt -vhdebug=all option. Anything below 60-70% means there is
potential for improvement.
e. If the number of deltas taken per simulation cycle to propagate all values and stabilize is very large (~50+), it
is a good indication of potential performance improvements.
f. Avoid any coverage and/or full +acc flags, because they severely affect VHDL performance.
g. It helps to look at the underlying RTL/gate-level libraries. We have observed that Verilog libraries turn out to be
faster. Key reasons are that a) when the DUT is in Verilog, using Verilog libraries removes mixed-language
boundaries, and b) often the Verilog libraries are not exactly equivalent to the VHDL ones (which are more
pessimistic) but still suffice for user requirements.
h. Look out for Synopsys DW libraries (accelerated ones are reported with -vhdebug=all), VHDL fixed/float
package usages, and IEEE libraries from different sources (e.g., ieee_proposed).
i. Logging fewer (minimal) signals will improve simulator performance. Logging variables in VHDL can be slower,
and care must be taken before logging them, as they can be expensive.
j. Mixed-language designs: We have observed cases where the clock tree buffers are written in Verilog and the
primary design is VHDL. This creates many mixed-language boundaries on high-activity signals. It helps
to bring the buffer instances into the same language to remove the mixed-language boundary.
k. FSDB dumping and Debug API accesses can be slow.
l. Make sure the design was not compiled (vcom) with low optimization level (-O1) switches.
2.6 Multi-core simulations
Questa MC2 (Multi-core multi-computer) simulations help increase simulator performance by partitioning a design
to run on multiple cores (or systems). The MC2 manual and the DVT slide library contain detailed information on
design qualifications, MC2 flow and debugging MC2 simulations in case of simulation mismatch.
Note: For MC2 simulations, the wall clock time is more meaningful than the individual cpu times for each partition.
The CPUs do more work overall, but distributing it between cores reduces the total turn-around time. Cpu time
reductions are not expected; any observed reduction could be misleading and accidental.
3. ANALYZING QUESTA SIM PERFORMANCE REPORTS AND
OTHER UTILITY TOOLS
3.1 Understanding simstats report
The simstats command reports performance related statistics about the design elaboration and simulation. The
statistics measure the simulation kernel process (vsimk) for a single invocation of vsim.
The elaboration statistics are measured once at the end of elaboration. The simulation memory statistics are
measured at the time you invoke simstats. The simulation time statistics are updated at the end of each run
command.
Tip: There may be performance degradation if the value of "elab working set" (or "working set") exceeds the actual
memory size "elab memory" (or "memory"). The memory number is the amount of memory that the OS has allocated
for the vsimk process, while the working set is the amount of memory in use for the current simulation.
Always consider the cpu time for performance analysis and fine-tuning. The cpu time is the actual cumulative
processor time for the vsimk process, that is, the amount of time the cpu spends actually running the vsimk
process. It can differ from the wall clock time if the vsimk process is swapped out for another process.
Note: The cpu time can sometimes exceed the value of the wall clock time. This can happen when another cpu
core (or thread) writes data into the WLF file. In such cases, the wall clock time gives a better idea of the actual
simulation time.
3.2 Understanding write report -l output
The time resolution is an important factor affecting performance. It is indicated at the top of the report. If you see a
resolution much finer than expected, search the report for the modules contributing to the finer
resolution and confirm whether the design can function correctly at a coarser resolution.
The bottom of the report file contains a summary of the design. Some of the things to watch for are:
Number of instances and number of instances inlined.
Small number of optimized cells (cells that are not fast cells).
Large Memories that are not being modeled as sparse.
Number of unaccelerated continuous assignments.
Large numbers of parameters, tasks and functions, external references, etc.
A pure RTL design that contains large numbers of UDPs, timing checks, and path delays.
The design unit section is also useful in the following ways:
Indicates how frequently each module is instantiated. Typically, rewrites done to a module with many
instances will have a larger impact on performance.
Shows the time resolution each module requested, allowing quick identification of modules requesting a fine
resolution (such as fs).
3.3 Profiling 64-bit binaries and designs with PLI/DPI
Profiling is now supported on the Linux x86_64 platform starting with the 6.6d release and is done the same way as
on the 32-bit Linux platform.
It is not uncommon to see the PROFILER_STACK_UNWIND_ANOMALY entry in profile reports from 64-bit Linux
with PLI/DPI code; it means that the profiling system calls made by Questa SIM could not successfully map
the stack back to functions in those cases.
It is important to compile the PLI/DPI code with stack frames so that the profiler can unwind the stack correctly
(gcc/g++ -fPIC), to make sure the debug symbols are visible (-g), and to keep the code running fast (-O).
For profiling a design running on Windows, add these two switches to the link command so that symbols
are added to the .dll and the profiler can use them:
/DEBUG /DEBUGTYPE:COFF
3.4 Understanding profile reports
The DVT slide library contains many presentations and collateral on the profiler windows and how to interpret
profile data.
The NoContext item is a bucket for all samples that could not be mapped to a particular region of HDL or DPI code.
Very high values of this item compared to the rest of the items should be investigated. The profile summary
command in Questa Sim 10.2 generates an easy-to-understand list of the buckets that the individual samples fall
into, to help narrow down a specific entity in the design.
Note that when the profile miss-count exceeds 20% (indicated by a warning message at the end of profiling), redo
the profile with the profile option keep_unknown command. This option is turned on by default from version 10.1a.
Factory assistance may be needed to understand the reports when unknowns are involved, but design hot-spots
can be found during the first pass of the profile database from the Design Units profile reports.
3.5 Understanding capacity reports
Capacity reports contain summary information about classes, queues, dynamic and associative arrays, covergroups,
assertions, Verilog memories, calls to randomize(), etc., along with the current timestep, the peak value, and its
corresponding timestep.
Note: Adding vsim -capacity to collect fine-grained capacity data can affect performance greatly,
even if the data is not written out.
Tip: One source of memory leaks can be dynamic objects that are constructed continually without being
destroyed or garbage collected. Such objects can be identified when the peak timestep equals the current timestep
and the peak value keeps growing as the simulation advances.
You can also produce a solver report with write report solver (at the vsim prompt), which provides a summary of all
of the randomize() call sites encountered, along with some information about memory usage.
It is imperative to generate capacity reports at time 0 and at the end of simulation.
3.6 Tools to debug run-to-run variability
Here is a list of tools and utilities that can be used to debug and track down run-to-run variability problems
(typically seen in overloaded grid environments). Detailed information on the usage of these utilities can be found
in their man pages or on the Internet.
Note: Some of these features may not exist on your machine/OS, or may not produce the desired effect.
top - Use M (sort by memory usage), P (sort by processor usage), 1 (list each processor's load vs. the average)
free - Accurate memory usage (unlike top)
cat /proc/cpuinfo - Gives information about each core; the current CPU MHz values can also be monitored
taskset/numactl - Retrieve or set a process's CPU affinity; e.g., taskset -c 7 <vsim>
iostat - used for monitoring system input/output device loading by observing the time the devices are active
in relation to their average transfer rates
vmstat - reports information about processes, memory, paging, block IO, traps, and cpu activity
mpstat - report (on the screen) processor-related statistics
cat /etc/issue - OS type and version
less /var/log/messages - Look for errors/warnings about overheating, etc.
less /var/log/cron - Look for cron jobs that may be running in the background
cat /proc/loadavg - System load over 1 min, 5 min, 15 min
Questa Sim performance handbook
www. mentor. com
18 [ 26]
Mentor Gr aphi cs I nter nal Use Onl y
cd /sys/devices/system/cpu - Look around in this directory and below to see how governors based on usage/
thermal properties are set up
lscpu - Shows number of cores, CPUs, threads, cache info
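A few of these checks can be combined into a quick pre-run snapshot of the host. The /proc paths are standard Linux; availability may vary by distribution:

```shell
# Snapshot of host load and CPU count before launching a long simulation.
load=$(cut -d' ' -f1 /proc/loadavg)          # 1-minute load average
ncpu=$(grep -c '^processor' /proc/cpuinfo)   # logical CPU count
echo "load=$load ncpu=$ncpu"
```

If the 1-minute load approaches or exceeds the CPU count, run-to-run variability on that host should be expected.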
3.7 Performance data collection for QVIP (formerly MVC)
This section describes the steps to generate required reports for a performance issue with QVIP.
a. Enable External Method logging at the beginning of the run by using the command:
questa_mvc_do_cmd {hist record external} or
questa_mvc_do_cmd {hist record external /full/instance/name}

This dumps one .ext_log text file per VIP instance that records all activity on that interface instance and
can be replayed independently of the other VIP interfaces or of the system. There is no knowledge of the
design in the file.

Note: It is best to use the command without the instance name, unless the user definitely
knows which instance is causing the problem.
b. Run the Questa profiler.
c. Run these two commands at the end of the simulation run:
questa_mvc_show and questa_mvc_show PERFS

They write transaction counters to the transcript and can help in debugging performance problems.
Send the following files to the factory:
The .ext_log files for the MVC instances
The transcript with the output from the show commands above
Questa profiler database
4. PERFORMANCE-AWARE CODING TECHNIQUES
4.1 Verilog and SystemVerilog coding styles for improving performance
Certain Verilog coding styles are notorious from a performance point of view. The following list introduces a few
frequently encountered coding styles that affect performance and provides alternative suggestions for achieving
better performance.
a. Avoid too many triggers on an always block, as this can result in unnecessary executions. Instead, break up a
large block that triggers on many events into smaller blocks that trigger less often.
b. Avoid pausing and restarting a thread of execution. For example:

repeat (100) @(posedge clk);

Instead, consider rewriting it as:

#(CLOCK_PERIOD_T * 100 - 1);
@(posedge clk);
c. Avoid too many assignments for complex Boolean calculations. For complex Boolean expressions, a single
assignment is more efficient than a collection of assignments:

assign A = (B^C) | (D&E); // saves three gates

d. Avoid repeated calculation, which can be expensive. Use parameters instead:

parameter HALF_CLOCK_T = CLOCK_PERIOD_T/2;

repeat (100)
begin
  #HALF_CLOCK_T B = C;
  #HALF_CLOCK_T A = B;
end

e. Do not use multiple continuous assignments to the same net within a cell. Do not rely on net resolution when
dealing with a cell's internal nets, and do not use an output port to directly drive internal logic. Although
this is legal Verilog, a mistake in connection can lead to unexpected results. Always use a temporary internal
net and buffer it to the output. Also, there should not be a direct path from an input to an output; use a
continuous assignment or a primitive instead.
4.2 VHDL coding styles for improving performance
Certain VHDL coding styles are notorious from a performance point of view. The following list introduces a few
frequently encountered coding styles that affect performance and provides alternative suggestions for achieving
better performance.
a. Avoid large arrays of signals. Instead, use variables or shared variables if possible.
b. Avoid math operations on std_logic_vectors. These operations are accelerated, but it is still faster to use
integers.
c. Avoid VHDL gate-level simulations. A Verilog gate-level netlist with a VHDL testbench is faster than an
all-VHDL simulation; use Verilog gate-level cells instead of VHDL VITAL models for Altera (or Xilinx) if
necessary. VHDL is typically slower than Verilog because std_logic has nine states instead of Verilog's four.
The VHDL library flops (unlike the Verilog ones) are pessimistic in nature, as they cater to meta states such
as 'U' and 'X'; since these are high-activity processes, they slow down VHDL simulations.
d. Avoid using processes with wait statements. For example:

process begin
  while not suspend loop
    clk <= '0';
    wait for 100 ns;
    clk <= '1';
    wait for 100 ns;
  end loop;
  wait;
end process;

This piece of code can be rewritten using a concurrent assignment:

clk <= not clk after 100 ns;

e. Initialize constants in loops rather than statically defining them where they are declared.
f. Treat vectors as atomic and avoid assigning to individual bits. Simulation can then treat a vector as one object
rather than as n bits (a scalared vector). Many operations on whole vectors are faster than the same
operations on scalared vectors.
g. Avoid having too many signals in a sensitivity list. Instead, break up processes into multiple processes with
separate, minimal sensitivity lists.
h. Avoid conversion functions at port map boundaries, as they can hinder boundary optimizations.
i. Serial assignments like out <= to_ux01(in1); in a datapath insert deltas in VHDL simulations and
can cause slowdowns.
j. Model memories using variables instead of signals to improve runtime and memory performance.
k. Avoid complex flop/latch coding styles; pessimistic flop coding styles written to handle 'X' and 'U' values
result in low identification of clocked processes.
l. Code to reduce the number of simulation events.
m. Make use of variables instead of signals wherever possible.
n. Code that causes events to be scheduled and cancelled far in the future is detrimental to simulator performance
(refer to the Questa user guide for details).
o. Avoid Tcl-based testbenches that use force commands to apply test vectors; this can cause significant
slowdown compared to a compiled testbench.
p. Use the IEEE libraries packaged with the Questa install tree instead of external, user-defined, or proposed ones.
q. Prefer integer and bit/bit_vector types (instead of std_logic_vector) for arithmetic computations.
4.3 Performance-aware coding of SystemVerilog Testbench and with Verification
Methodology (UVM/OVM)
A detailed description of SV and UVM coding guidelines for improving performance is published on the
Verification Academy (https://verificationacademy.com/uvm-ovm/SV/PerformanceGuidelines) and as part of the
Methodology Cookbook (http://uvmdoc.mentorg.com/mc/UVM/Performance_Guidelines).
4.3.1 The evil of UVM auto-config
The following is an important section regarding the use of UVM auto-configuration and its effect on performance.
This has recently come up at a high-profile customer, resulting in site visits by UVM and Questa experts.
Note: The Questa Methodology team strongly encourages pro-active avoidance of UVM auto-config.
Auto-config is a mechanism that is enabled by default, and it will do automatic get_configs for any fields you've
declared with `uvm_field macros, whether you want that or not. However, even if you dutifully avoid use of the
`uvm_field macros, auto-config will still incur a heavy performance penalty.
Auto-config is implemented in a method, apply_config_settings, implemented in uvm_component. This method is
called by the uvm_component base class's build_phase(). (For more information, go to the UVM Reference (HTML) and
read up on uvm_component::apply_config_settings.) The method queries the resource database to return a list of all
resources that are visible to that component. This entails traversing the entire resource db and using regular-
expression matching against that component's full hierarchical name. Multiply this by 100, 500, 1000 components
or more, and you get noticeable performance degradation. In many cases, disabling auto-config reduced the
number of resource db lookups and helped reduce the build time significantly.
You can avoid auto-config in your components by not calling uvm_component::build_phase() as follows:
Questa Sim performance handbook
www. mentor. com
21 [ 26]
Mentor Gr aphi cs I nter nal Use Onl y
If the component directly extends uvm_component, implement build_phase() and do not call
super.build_phase().
If the component extends another user-defined class that does not implement build_phase, do as above, i.e.,
implement build_phase() and do not call super.build_phase().
If the component extends another user-defined class that does implement build_phase, call super.build_phase()
if you need the base-class functionality. If that base class calls super.build_phase in uvm_component,
you will not be able to avoid auto-config for this component.
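A minimal sketch of the first case is shown below. The class and configuration field names are hypothetical, not from the text:

```systemverilog
class my_scoreboard extends uvm_component;
  `uvm_component_utils(my_scoreboard)

  int unsigned num_masters;

  function new(string name, uvm_component parent);
    super.new(name, parent);
  endfunction

  // Deliberately do NOT call super.build_phase(phase): this skips
  // uvm_component::apply_config_settings() and its resource-db scan.
  function void build_phase(uvm_phase phase);
    if (!uvm_config_db#(int unsigned)::get(this, "", "num_masters", num_masters))
      num_masters = 1;  // explicit get with a default, instead of auto-config
  endfunction
endclass
```

The explicit uvm_config_db::get call fetches only the fields this component actually needs, rather than scanning the entire resource database.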
Unfortunately, without modifying UVM, you cannot prevent TLM ports from invoking auto-config. Their internal
implementation instantiates a component (merely to give the port hierarchy), whose build_phase() unconditionally
invokes auto-config.
Mantis http://www.eda.org/svdb/view.php?id=4065 has been filed to fix this in the UVM. Until it's fixed, please be
aware that auto-config is to be pro-actively avoided.
4.3.2 More tips: A list of SV coding styles that affect performance and memory
a. Avoid heavy use of associative arrays with string-type keys; integer or class-handle keys are much more
efficient.
b. In general, having class-based activity unconditionally on every clock (or time step) is a bad idea; try to have TB
mechanisms that are aware of other activity, or even triggers, in order to perform class-based activity.
c. For testbench code, using a virtual interface rather than a hierarchical reference is generally slower; unless you
must have the abstraction, use direct references when possible.
d. Using tasks, functions, parameter values, etc. from a package is much more efficient than making a hierarchical
reference into a different module.
e. Avoid `includes of source; use package imports. A `include can cause redundant code to exist in your design,
which can slow down compilation, optimization, and simulation. Import guarantees sharing of package
elements.
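Item e above can be sketched as follows; the package and member names are hypothetical:

```systemverilog
// One shared copy of the declarations, compiled once...
package tb_types_pkg;
  parameter int DATA_W = 32;
  function automatic bit parity(input logic [DATA_W-1:0] d);
    return ^d;  // reduction XOR of the word
  endfunction
endpackage

// ...imported where needed, instead of `include-ing the source text
// into every file that uses it.
module par_checker(input logic [31:0] d, output bit p);
  import tb_types_pkg::*;
  assign p = parity(d);
endmodule
```

Every importer shares the single compiled package, whereas a `include would replicate the text into each compilation unit.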
4.4 Writing performance-aware constraints
Random number generation and constraint solving are related but separate functionalities. The process of
constraint solving is not random, but picking a solution requires making random decisions. Any bug fix or
performance tweak between Questa Sim versions has the potential to change the organization of solutions, and
even if the RNG stays the same, the choices may be different. This can result in changes in the behavior of tests or
can affect performance in some cases.
The vsim -solverev <version> switch will attempt to preserve the stability of the constraint solver between minor
versions by disabling optimizations that affect random stability, but it will not disable bug fixes. Also note that this
switch cannot be used to get the same behavior as previous major releases.
Tip: The ACT solver is best for arithmetic constraints and those with a very large constraint set, while the BDD solver
is best for logical operators. Questa SIM automatically chooses the solver engine to use based on the constraint, but
this can be overridden through the vsim -solveengine bdd|act switch.
4.4.1 Solver and memory (capacity)
The default/expected memory usage for a call to randomize() can vary wildly based upon the specific random
variable and constraint combination presented to the solver.
Questa Sim performance handbook
www. mentor. com
22 [ 26]
Mentor Gr aphi cs I nter nal Use Onl y
There is currently no way to limit the amount of memory the solver can/will consume on a per-randomize basis.
There are environment variables that can be used to tune the sizes of the solver caches, but this won't give a user
enough control to limit the resources available to a single randomize() call. There is one modelsim.ini variable,
SolveGraphMaxSize, that can be used to limit the size of the BDD graphs generated by the BDD solver engine, but
it doesn't necessarily translate into bytes. The default SolveGraphMaxSize of 10000 roughly translates into a data
structure that consumes ~1 GB of memory.
In general, the ACT solver uses significantly less memory than the BDD solver (but the ACT solver may not be able
to solve all of the same randomize scenarios that can be solved by the BDD solver, and vice versa).
4.4.2 Solver and performance
If you run the performance profiler and look at the ranked or call-tree reports, you will see the filename/line number
of the randomize() calls that are performance hotspots. However, there is no way to differentiate between
constraints that are expensive and constraints that are not.
Tip: Do not use concatenation for constraining a large bit vector by equality. For example:

rand bit [127:0] foo;
rand bit [7:0] a, b, c, d, e, ...;
constraint bar {
  foo == {a, b, c, d, ...};
}

The concatenation equalities in the constraint can be rewritten as a list of equality constraints:

constraint bar {
  foo[127:120] == a;
  foo[119:112] == b;
  foo[111:104] == c;
  ...
}

4.5 Performance-aware coverage collection
4.5.1 Code coverage
Part of the expense in code coverage collection comes from having to de-optimize. This is necessary for visibility,
and also to some degree to keep the code coverage numbers the same (for example, clock optimizations might
reduce the number of times a statement is executed). Toggle coverage is particularly expensive due to the number
of objects that can be collected on any given design.
Tip: Collect code coverage only for the portions of the design actually required; the PDU flow may be used for this.
Turning it on globally and indiscriminately can slow things down greatly. Use +nocover to disable code coverage
collection for portions of the design.
4.5.2 Functional coverage
For functional coverage there are three golden rules:
Use the slowest possible sampling clock on SV covergroups; this makes a BIG difference!
Tip: The undocumented optimization switch vopt -cvgoptinferclocks can be used if the user is unable to
change the source code. The switch assumes the sampling clock is fast and replaces it with the data signals. Note
that the switch can result in minor side effects on coverage collection.
Be judicious while constructing large crosses; seemingly innocuous lines of code can create large numbers of
cross bins. Remember that crosses are multiplicative (A x B x C); that is, if A has 3 bins, B has 400 bins, and C
has 200 bins, the cross has 3 x 400 x 200 = 240,000 bins. While the cross bins are empty, the problem won't be
obvious, since the bins won't exist; but assuming the goal is 100% coverage, eventually all the bins must exist.
Avoid collecting functional coverage on classes for which large numbers of instances will be created at run
time. Functional coverage retains information for every instance, and this will continue to use memory
whenever a new instance is created.
Tip: If the signals in the sensitivity list of a SV covergroup are from deep down in the design hierarchy, it is a good
idea to assign each signal to a temp and use the temp in the covergroup.
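The temp-signal tip, combined with the slow-sampling rule above, might look like this sketch; the hierarchical paths and signal names are hypothetical:

```systemverilog
module cov_wrap;
  // Copy the deep hierarchical signal to a local temp once...
  logic [7:0] burst_len_t;
  assign burst_len_t = tb.dut.core0.bus_if.burst_len;  // illustrative path

  // ...and sample on a slow, meaningful event rather than the raw clock.
  covergroup burst_cg @(posedge tb.dut.core0.bus_if.xfer_done);
    coverpoint burst_len_t;
  endgroup

  burst_cg cg_inst = new();
endmodule
```

Sampling on xfer_done instead of the system clock means the covergroup evaluates once per transaction rather than once per cycle.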
If a simulation is run on a grid machine, saving the UCDB file in the local disk space of the grid machine can give
better runtime performance. The UCDB can be copied over to the project work area after the simulation finishes, to
prevent network I/O traffic from affecting performance throughput.
4.6 Writing performance-aware assertions and cover directives
Writing assertions, like writing RTL code, requires significant experience to know which constructs are
more efficient than others in terms of simulation performance, just as significant RTL coding experience is
required to learn what structures a synthesis tool infers from the RTL code. This section provides a few key
guidelines to follow, as well as a few key things to avoid that aren't obvious to a person new to writing assertions.
The reader is assumed to have a basic understanding of assertion terminology, sequence operators, and syntax;
providing that background is beyond the scope of this document.
a. Properties referencing multiple clocks are more expensive than properties referencing a single clock.
b. Avoid long or infinite time ranges. For example, if there is an effective timeout in a sequence, make the time
range for the timeout as short as possible given the overall functional/timing requirements of the design.
Using an infinite time range in an assertion means that it is never able to fail. Take the example of a simple
handshake protocol, which requires an acknowledgment for every ready indication, shown below:

property handshake_check;
  @(posedge clk) rdy |-> ##[1:$] acpt ##1 !acpt;
endproperty
assert property(handshake_check);

A much more simulation-efficient way of expressing the concept of "eventually" with assertions uses the goto
operator, shown below:

property handshake_check;
  @(posedge clk) rdy |-> acpt[->1] ##1 !acpt;
endproperty
assert property(handshake_check);

Within a sequence, the use of a large or unbounded time range can severely impact simulation performance,
because a separate thread is spawned for each possibility in the legal range. For example, the sequence:

(a ##1 b[*1:8000] ##5 c ##1 d)
can result in 8000 separate threads of the form:

(a ##1 b[*1] ##5 c ##1 d);
(a ##1 b[*2] ##5 c ##1 d);
...
(a ##1 b[*8000] ##5 c ##1 d);

c. Use system functions like $rose and $fell to avoid inadvertently spawning a new thread, or several new
threads, each cycle. In the example below...

(!a[*0:$] ##1 a) |-> b;

... a thread will be started at every clock edge to check whether a is not true. A better way to write this is:

$rose(a) |-> b;

d. Use a qualifying condition when repetitively checking ([->n]) for multiple occurrences of a condition in the
antecedent expression of an assertion. Writing assertions often involves the need to check for multiple
occurrences of an expression before additional expressions are evaluated. In the example below, the
intent is to check for 48 (non-consecutive) occurrences of signal a, and on the 48th time signal a is true, signal b
is required to also be true.

a[->48] |-> b;

When re-written in its equivalent form below, the above property is extremely expensive in terms of
spawning new threads. Threads grow at nearly an exponential rate, since a brand new thread is started
each and every cycle signal a is not true, and previously started threads in turn spawn new threads on each
subsequent cycle due to the unbounded time range when signal a is false.

(!a[*0:$] ##1 a)[*48] |-> b;

e. Be very careful when using the non-consecutive [=n] operator on the left-hand side of an implication, and in
general. Consider the property below:

property p3;
  @(posedge clk) a ##1 d[=2] ##1 c |-> ##1 e;
endproperty
assert property (p3);

Most people would incorrectly interpret this property as: a, followed by 2 non-consecutive occurrences of d,
followed at least 1 cycle later by c, which is then followed one cycle later by e, at which time the property should
pass. However, as written, the property allows both c and e to assert after the second occurrence of d but
does not pass until the third occurrence of d, which could be some time after e. This is completely unexpected
behavior, and most people would believe an assertion bug has been found. However, the behavior is correct,
because d[=2] is equivalent to (!d[*0:$] ##1 d)[*2] ##1 !d[*0:$]. It is the trailing !d[*0:$] which keeps a thread
from the left-hand side (LHS) of the implication alive until the third occurrence of d. In order for a property with
an implication to pass, all threads started from both the left-hand side (LHS) and right-hand side (RHS) of the
implication must complete. In this example, the threads from the LHS don't complete until signal d occurs a
Questa Sim performance handbook
www. mentor. com
25 [ 26]
Mentor Gr aphi cs I nter nal Use Onl y
third time; even if all threads from the RHS have already completed. To avoid this behavior the d[=2] can be
replaced by d[->2] to get the intended behavior.
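Applying that replacement, a corrected sketch of the property (reusing the signal names from the example above; the property name is hypothetical) would be:

```systemverilog
property p3_fixed;  // hypothetical name for the corrected property
  @(posedge clk) a ##1 d[->2] ##1 c |-> ##1 e;
endproperty
assert property (p3_fixed);
```

With d[->2], the antecedent match ends exactly at the second occurrence of d, so the LHS threads complete without waiting for a third occurrence of d.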
f. Specify behaviors accurately. Take the SVA sequence below:

    sequence easy;
      @(posedge clk) a ##1 b ##1 c;
    endsequence

This sequence, named easy, appears to be straightforward, and it is. It states that a is followed by b, which is
followed by c (each with a delay of a single clock cycle). However, if the correct behavior requires signals a and
b either to remain asserted or to deassert in the next cycle, then this simple sequence will not check for the
expected behavior. The following modification would:

    a ##1 a & b ##1 a & b & c;

OR (depending on the expected behavior):

    a ##1 !a & b ##1 !a & !b & c;

Another example: suppose a sequence is needed that says a happens at a clock edge, followed by b in 4 to 8
clock cycles, followed by c. It can be written as:

    sequence s1;
      @(posedge clk) a ##[4:8] b ##1 c;
    endsequence

This accurately represents the requirement above. In most cases, however, the requirement is that when a
asserts, it is to be followed by b asserting in 4 to 8 clock cycles, and the first time b asserts within the [4:8]
cycle range it should be followed by c. That is represented by:

    sequence s2;
      @(posedge clk) a ##1 !b[*3:7] ##1 b ##1 c;
    endsequence

The difference between sequences s1 and s2 is that in s2, c has to follow the first occurrence of b in the [4:8]
range, whereas in s1, c can follow any occurrence of b in that range. In most cases the requirement is that of s2.

g. Use cover property in place of cover sequence when needed. Consider the cover directive below:

    cover sequence (@(posedge clk)
      dll_state == DL_INACTIVE [*1:$] ##1 dll_state == DL_INIT
      [*1:$] ##1 dll_state == DL_ACTIVE);

A thread is started at every clock edge as long as dll_state is DL_INACTIVE, which really makes no sense. A
better way to write this is to use a cover property statement.
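A possible cover property form is sketched below. This is an assumption about the intended coverage point (the INIT-to-ACTIVE traversal); the exact rewrite depends on what the original cover was meant to count. Anchoring the attempt on the state transition with $rose means a single thread is started per traversal instead of one per cycle spent in DL_INACTIVE:

```systemverilog
// Hypothetical rewrite: start one attempt when the FSM enters DL_INIT,
// then cover reaching DL_ACTIVE.
cover property (@(posedge clk)
  $rose(dll_state == DL_INIT) ##1
  (dll_state == DL_INIT)[*0:$] ##1 dll_state == DL_ACTIVE);
```

Unlike cover sequence, which tracks every match of every thread, cover property needs only the first match per attempt, which further reduces the bookkeeping cost.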
h. Be careful what you do in an assertion pass statement. SVA assertions have an action block which contains an
assertion pass statement as well as an assertion failure statement. If an assertion has a pass statement, the
pass statement is executed on both real and vacuous passes. Unless you care about vacuous passes,
use the assertion control task $assertvacuousoff to turn off execution of pass action blocks for
vacuous passes.
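A minimal usage sketch, assuming the call is placed in an initial block of a testbench top module (the module name is hypothetical; the argument 0 applies the control throughout the hierarchy below):

```systemverilog
module tb_top;  // hypothetical testbench top
  // ... DUT instance, clocks, assertions ...

  // Suppress pass-action execution for vacuous passes design-wide.
  initial $assertvacuousoff(0);
endmodule
```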
i. Take reset conditions into account. You don't want to see false failures from an assertion failing because
the design is not yet initialized, or because a reset occurs during operation.
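One common way to account for reset is the disable iff clause, sketched below with hypothetical signal names (active-low reset rst_n):

```systemverilog
property p_req_ack;  // hypothetical handshake check
  @(posedge clk) disable iff (!rst_n)  // cancel any in-flight attempt while in reset
    req |-> ##[1:4] ack;
endproperty
assert property (p_req_ack);
```

With disable iff, attempts started before or during reset are abandoned rather than reported as failures.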

Questa Sim performance handbook
2012 Mentor Graphics Corporation, all rights reserved. This document contains information that is proprietary to Mentor Graphics Corporation and may
be duplicated in whole or in part by the original recipient for internal business purposes only, provided that this entire notice appears in all copies.
In accepting this document, the recipient agrees to make every reasonable effort to prevent unauthorized use of this information. All trademarks
mentioned in this document are the trademarks of their respective owners.
For the latest product information, call us or visit: www.mentor.com
MGC 12-12 TECH10900-w