Multi Processor

2014 15th International Microprocessor Test and Verification Workshop
A case study of multiprocessor bugs found using RIS generators and memory usage techniques
Deepak Venkatesan and Pradeep Nagarajan
Systems and Software Group
ARM Embedded Technologies Pvt. Ltd.
Bangalore, India
Email: deepak.venkatesan@arm.com, pradeep.nagarajan@arm.com
1550-4093/15 $31.00 © 2015 IEEE
DOI 10.1109/MTV.2014.28

Abstract: Random Instruction Sequence (RIS) generators are widely used for functional verification of processors. In multiprocessor systems, these RIS tools can be used to generate instructions that act upon data shared between the multiple processors in the system. In addition to the random instruction sequences, adding semi-directed functional sequences and deterministic instruction patterns to the tool, orthogonal to the randomly generated instructions, increases the chances of reaching the desired corner cases faster. With carefully crafted address reuse and virtual memory mapping, data inconsistency issues and arbitration bugs in the processors can be hit faster. This paper presents a case study of two bugs that were found in cache-coherent multiprocessor systems using an RIS tool and semi-directed sequences.

Keywords: multiprocessor verification, random instruction sequence, RIS generator, virtual address, cache invalidation request.

I. INTRODUCTION
With the time-to-market for SoCs shrinking over the years, the process of verifying complex multiprocessor systems has become increasingly challenging. In particular, for multiprocessor systems with hardware-managed data coherence between the processor caches, the complexity of functional verification is quite large [1][2]. In addition to stress-testing individual processors in the multiprocessor system, it is also imperative to create scenarios which stress-test the data coherence mechanisms between the processors. Since the state space for functional verification involving interaction between processors is huge, random instruction sequence (RIS) generators mark an important step in the verification process [3][4].

It is necessary that the right address combinations and virtual memory reuse methods are applied to the random instruction sequences [5]. This is especially important in multiprocessor systems, where the way data are shared between the participating processors in the RIS generator should increase the stress on the data coherence mechanisms used in the system.

The subsequent sections of this paper give an overview of one of the multiprocessor RIS generators used in ARM®, and of the address usage and virtual memory reuse methods implemented in the generator. Two multiprocessor bugs that were found by the RIS generator, and the features of the generator which aided in the discovery of these bugs, are also described in this paper.

II. RANDOM INSTRUCTION SEQUENCE (RIS) GENERATORS
In the case of a uniprocessor system, Random Instruction Sequence (RIS) generators produce sequences of random instructions based on user-supplied weight file inputs. Depending on the functional area being verified, these weight files can be modified to generate a specific subset of the whole instruction set. The RIS generators help in covering multiple combinations of instruction sequences in a relatively short span of time.

Similarly, in the case of multiprocessor systems, the RIS generators produce sequences of instructions for each of the processors in the system. However, there are additional considerations in the case of multiprocessor systems. The addresses and data being acted upon by a processor, and the virtual to physical memory mapping, could be unique to that processor, or they could be shared between multiple processors. These considerations define the kind of traffic that would be generated, and the kind of events that would be triggered in a multiprocessor system.

III. ARM RIS GENERATOR
ARM makes use of multiple RIS generators to verify processors. Among these, this paper describes a multiprocessor-aware RIS generator that was used to find the bugs discussed in this paper. This RIS generator is a highly configurable tool which can take user inputs for instruction weights, and generate a random set of instructions for each processor in the system.

A. User-supplied Inputs
The RIS generator takes user inputs in the form of configuration files. The configuration files are parsed by the generator, and the information is used in instruction sequence generation. There are three different kinds of configuration information that can be passed by the user to the RIS generator.

1) The weights for the different instructions and instruction groups. The final random instruction sequence has instructions distributed based on these initial weights supplied by the user.

2) The memory and system constraints for the generator. In this piece of configuration information, the details such as
memory chunks available for the generator to use, distribution of cacheability attributes across the memory chunks, the system memory map, the RIS generator knobs, virtual memory organization, etc. are supplied by the user.

3) Semi-directed user-supplied code. These are semi-directed subroutines which are targeted at a specific processor functionality [6], such as cache line evictors. Such subroutines are called funcs. The funcs are used to create specific scenarios in the processors, or to make use of the addresses used by the RIS Generator to create background irritator traffic.

[Figure: Instruction Weights, Memory and System Constraints, and Semi-directed Subroutines (‘Funcs’) are inputs to the Random Instruction Sequence (RIS) Generator, which produces the generated instruction sequence output for each processor as the Final Test Stimulus.]
Fig. 1. The flow describing the working of ARM RIS Generator

B. Address generation in the RIS generator
For multiprocessor systems with hardware-managed data coherence, creating repeated migration of cache lines between the different processor caches is a widely used method of stress testing [7]. To enable this, the addresses used by the instructions need to be generated in a way that data are shared between the participating processors. At the same time, when multiple processors can write to a shared memory address, the final value at the given address is not deterministic [8].

Considering these, the RIS Generator divides the memory allocated to the instruction sequences into two types: private memory regions and common memory regions.

Private memory is a region in memory, specific to each processor, to which only that processor will write throughout the test. Common memory, in contrast, is a region to which any of the processors in a multiprocessor system can write throughout the test. Demarcating memory this way ensures that the data in the private memory of each processor remains deterministic, and can be checked.

The organization of private and common memory regions for a multiprocessor system with two component processors is shown in Fig. 2. Reads, writes and system operations are shown in the figure as representative memory operations. These could be triggered by the different random instructions picked out from the instruction set to act upon these regions.

One private region per processor, and one common region for all processors in the system, together form a region group. The number of region groups is configurable in the RIS generator. This number is decided based on the number of processors present in the multiprocessor system, the amount of memory available in the testbench, and the traffic load and patterns that are expected in the system.

[Figure: Three memory regions, each targeted by R, W and SO operations: Private Memory – Processor 0 (labelled Processor 0), Private Memory – Processor 1 (labelled Processor 1), and Common (Shared) Memory (labelled Both Processor 0 and Processor 1). Multiple such memory units are used by the RIS generator. R-Reads; W-Writes; SO-System Operations.]
Fig. 2. Shared and Non-shared memory organization in the RIS Generator

C. Result Checking Mechanism
For a group of random instruction sequences, the final result of executing the instructions is not known during the generation of the sequences. As a result, comparison of the final data with a golden set of values is not inherently possible. The instructions could be fed to a software model of the processor, and the processor state could be compared with the software model after the execution of each instruction in a lock-step mode [9]. However, this would slow down the test run-time, and would also restrict the platforms on which the RIS generated tests could be run. Considering these limitations, the RIS generator employs a two pass consistency check mechanism of result checking.

Since only the data in private memory regions are deterministic, self-checking in the RIS generator is restricted to private memory regions.

In the two pass consistency check mechanism in the RIS generator, each test sequence is run twice. The results at the end of each of the two runs are compared. If there is a mismatch between the results of the two runs, an error is flagged, and the test is terminated for further debug. If there is no mismatch, the next set of random instruction sequences generated is picked up, and the process is continued.

The mechanism of two pass consistency check done in the RIS generator is shown in Fig. 3. At the end of RUN 1, the contents of all general purpose registers, floating point registers, program counter, and the private memory regions are backed up in a part of the testbench memory. Similarly, at the end of RUN 2, all of the above contents are backed up in memory. Now, a compare subroutine is called, which compares each register and memory location value between RUN 1 and RUN
2. If there is a mismatch between RUN 1 and RUN 2 values, an error is flagged.

[Figure: Flow on Processor 0. Instruction Sequence 1, RUN 1; save contents of non-shared Processor-0 memory and processor registers to backup memory; Instruction Sequence 1, RUN 2; compare contents of non-shared Processor-0 memory and processor registers with saved data from RUN 1; print TEST PASS or TEST FAIL on Processor-0.]
Fig. 3. The flow used by the two pass consistency check mechanism implemented in the RIS generator

Using the two pass consistency check mechanism, the RIS generator is able to find issues where there is a functional difference between RUN 1 and RUN 2. These are usually timing related bugs, and bugs which depend on the state of the processor, cache hits and misses, and translation lookaside buffer (TLB) contents. By adding additional irritation to the random instruction stream, in the way of virtual address reuse and change of page attributes, the chances of finding such timing-dependent bugs can be increased.

One of the limitations of this approach is that only bugs that occur with changes in the processor states and timing of events are discovered. This method of consistency check is not able to find bugs which manifest with the same signature in both RUN 1 and RUN 2.

The two pass consistency check mechanism of result checking is repeated independently for each processor that is part of the multiprocessor system, and the PASS or FAIL result of each of the processors is printed separately.

IV. CASE STUDY I
This section describes the RIS Generator requirements and processor conditions that led to the discovery of a bug in a multiprocessor system with two participating processors.

A. RIS Generator Requirements
This bug is a scenario which requires at least two processors. The bug was found using the virtual address reuse methods used in the RIS Generator and by varying the cacheability attributes of the memory page in between two groups of instruction sequences.

The RIS Generator produces random instruction sequences in the form of small sized units called ‘tests’. Each test uses its own virtual memory address (VA) mapped to a physical memory address (PA) available in the testbench. The virtual address is unique for a given test, i.e., the virtual address used by one test is not available for use by any other test at the same time. Once a test completes, the virtual address that was used by the test becomes free, and it can subsequently be reused by another test, which could run on any of the available processors in the multiprocessor system.

The cache and TLB contents which were populated during a test are not cleared upon the completion of the test. Instead, they are allowed to remain, since the virtual and physical addresses are not used by any other test simultaneously. Only when the RIS Generator is requested to reuse the same virtual address later for a new test are the cache and TLB contents for the given virtual address cleared.

Based on the above virtual memory usage model, two instruction sequences are generated by the RIS Generator: Test 1 and Test 2, as shown in Fig. 4.

                         Test 1 runs on Processor 0    Test 2 runs on Processor 1
Page Translation Map     ‘A’                           ‘B’
Virtual Address          VA1                           VA1
Physical Address         PA1                           PA1
Memory Attribute         cacheable                     non-cacheable

Fig. 4. The memory organization for two instruction sequences in Processor-0 and Processor-1. The virtual and physical addresses used by both instruction sequences are the same, but the cacheability attributes of the memory differ. Processor-0 and Processor-1 make use of the same virtual address VA1 at different times during the test.

Test 1 and Test 2 run at different times on two different processors, Processor 0 and Processor 1. Test 1, which runs first on Processor 0, has a memory page attribute of cacheable. Test 2, which runs at a later time on Processor 1, has a memory page attribute of non-cacheable. Using this premise, the sequence of events that led to the discovery of the bug is shown in Fig. 5.
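
The virtual memory usage model described in this section can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the ARM tool's implementation: the class names `Processor` and `VaPool`, the methods `start_test` and `end_test`, and the string attributes are all hypothetical, and caches and TLBs are reduced to plain sets and dictionaries. The sketch shows only the policy stated above: a VA is exclusive while a test runs, nothing is cleared when the test completes, and the old cache and TLB state is cleared on every processor only when the VA is requested again.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy processor state: a TLB and a physically tagged cache."""
    name: str
    tlb: dict = field(default_factory=dict)   # VA -> (PA, attribute)
    cache: set = field(default_factory=set)   # PAs with a resident line


class VaPool:
    """Hypothetical model of the VA reuse policy (not the ARM tool's API)."""

    def __init__(self, processors):
        self.processors = processors
        self.in_use = {}   # VA -> PA for tests that are currently running
        self.stale = {}    # freed VA -> PA whose cache/TLB state remains

    def start_test(self, cpu, va, pa, attr):
        if va in self.in_use:
            raise ValueError(f"{va} is already used by another test")
        if va in self.stale:
            # The VA is being reused: only now is the old state cleared,
            # and the invalidation reaches every processor (broadcast).
            old_pa = self.stale.pop(va)
            for p in self.processors:
                p.tlb.pop(va, None)
                p.cache.discard(old_pa)
        self.in_use[va] = pa
        cpu.tlb[va] = (pa, attr)
        if attr == "cacheable":
            cpu.cache.add(pa)   # cacheable accesses allocate a line

    def end_test(self, va):
        # Completion frees the VA but leaves cache and TLB contents alone.
        self.stale[va] = self.in_use.pop(va)


p0, p1 = Processor("Processor 0"), Processor("Processor 1")
pool = VaPool([p0, p1])

pool.start_test(p0, "VA1", "PA1", "cacheable")       # Test 1, map 'A'
pool.end_test("VA1")
state_after_test1 = ("PA1" in p0.cache, "VA1" in p0.tlb)

pool.start_test(p1, "VA1", "PA1", "non-cacheable")   # Test 2, map 'B'
state_during_test2 = ("PA1" in p0.cache, "VA1" in p0.tlb, p1.tlb["VA1"])
print(state_after_test1, state_during_test2)
```

In this correct model, Processor 0 no longer holds a line for PA1 once Test 2 has started. A design in which the broadcast invalidation is lost would leave that stale line in place, which is the kind of situation this usage model is meant to provoke.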
B. Sequence of events in the processor
Processor 0 runs Test 1 with page translation mapping ‘A’, as shown in Fig. 4. Since page translation mapping ‘A’ is cacheable, cache entries are created for the data. Similarly, the page translation from VA1 to PA1 is stored in the TLB. At the end of Test 1, the cache and TLB contents are not cleared.

At a later time, Processor 1 requests to use the same VA1 to PA1 mapping to run Test 2. The RIS Generator changes the page translation map from ‘A’ to ‘B’. In this process, the page attributes become non-cacheable. Subsequently, the TLB entries pointing to page translation mapping ‘A’ are cleared, and the cache entries that were allocated with page translation mapping ‘A’ are invalidated. The cache and TLB entry invalidate operations are executed in Processor 1, but they are also broadcast to Processor 0, so that any previous entries in Processor 0 can be removed.

In this multiprocessor system, the caches are physically tagged. Processor 1 invalidates the physical memory address from its cache. On receiving the cache invalidation broadcast, Processor 0 also has to invalidate the physical memory address from its cache.

Subsequently, Processor 1 goes ahead and runs Test 2 using page translation mapping ‘B’. In a normal case, both Test 1 and Test 2, which run on different processors, would go through the two pass consistency check mechanism in the RIS Generator and indicate a TEST PASS.

However, a bug in the broadcast path of the processor prevents this from happening. When the cache invalidate command is broadcast from Processor 1 to Processor 0, a bug in the broadcast path of Processor 0 causes the processor to fail to invalidate the stale cache entry for PA1 from its cache. Processor 1 assumes that any stray cache entries for PA1 with the old page translation mapping have been removed from the multiprocessor system, when in reality the cache entry for PA1 still remains in Processor 0’s cache. Processor 1 executes Test 2 with page translation mapping ‘B’, where the memory page is marked with the attribute non-cacheable. All data written by Processor 1 in Test 2 goes directly to the external memory.

While Test 2 is being executed in Processor 1, the stray cache entry for PA1 which remained in the cache of Processor 0 gets evicted from the cache naturally, as more data is moved into Processor 0’s cache. The evicted cache line gets written to the main memory, corrupting the memory entries being used by Test 2 running on Processor 1.

At the end of Test 2, when the two pass consistency check is done on Processor 1, an error is flagged. This is the result of the data that got overwritten by the stray cache line evicted out of Processor 0’s cache. In this way, the processor bug is found by the RIS Generator.

[Figure: Sequence of events across Processor 0 and Processor 1.
1. Initially, the Translation Lookaside Buffer (TLB) has mapping ‘A’.
2. Execute Instruction Sequence 1 using VA1 with page mapping A.
3. Instruction Sequence 1 passed. Address PA1 stays in Processor 0 cache after Instruction Sequence 1.
4. Change VA1 page mapping: Mapping A → Mapping B.
5. Invalidate TLB entry for mapping ‘A’ in Processor 1, and broadcast it to Processor 0. TLB invalidate broadcast received.
6. Invalidate cache entries with PA1 in Processor 1. Broadcast cache invalidation to Processor 0. Cache invalidate broadcast received, but failed to invalidate! This is the bug!
7. Execute Instruction Sequence 2 using VA1 with page mapping B.
8. Eventually, PA1 gets evicted from Processor 0 cache and overwrites contents of external memory.
9. Instruction Sequence 2 failed, because Processor 0 overwrites Processor 1’s data. Bug caught!]
Fig. 5. Case Study I: Processor 0 executes Instruction Sequence 1, and Processor 1 executes Instruction Sequence 2. Both use the same virtual address at different times with different page mappings. So, the failure of a broadcast cache invalidation request is caught by the data check.

C. Summary of Case Study I
From the description of events which led to the discovery of the processor bug, the following observations can be made.

• In addition to random instruction sequences, the reuse of virtual memory addresses in the generator led to the conditions required for the bug to manifest.

• The virtual memory address management method, where the cache and TLB entries are not cleared as soon as they lose context, is also an important requirement that led to the discovery of this bug.
• The bug in the processor was not a direct functional issue that manifests every time a cache broadcast is done. Instead, it depends on the exact timing conditions of events occurring inside the processor. As a result, the bug does not show up in RUN 1 of the test, but occurs only in RUN 2. This helped in finding the bug using the two pass data consistency check method used by the RIS Generator.

V. CASE STUDY II
This section describes a multiprocessor bug that was found using semi-directed subroutines along with random instruction sequences generated by the RIS Generator.

A. RIS Generator Requirements
The random instruction sequences generated by the RIS Generator are useful in covering the instruction set space of the processor. However, for creating specific processor-centric traffic patterns, additional instrumentation is necessary. For example, the random instruction sequences are not very helpful for creating natural cache line evictions from the processor caches. Instead, a directed subroutine which can create streaming memory accesses, which in turn start evicting cache lines from the processor caches, becomes necessary.

In the ARM RIS Generator, such directed subroutines are called funcs. Since the bug which is analyzed in this section needs a stream of cache line evictions to happen from the processor caches, the RIS Generator makes use of a func to create this scenario.

Cache Index Evictor is a func used by the RIS Generator. The objective of the func is to create a stream of cache line evictions from the processor caches naturally, using read or write instructions.

The processor cache is organized into n cache ways. Each cache line can be allocated in one of the cache ways. If all the ways in which a new cache line could be allocated are full, the processor evicts a cache line from one of the cache ways and allocates the new cache line. If the cache is 4-way set associative, five read requests that bring in cache lines from external memory to the cache are sufficient to cause the first cache line eviction. All five of these read requests have to be to addresses that differ by the size of one cache way.

Let us consider a physical memory address PA, which is the same address as the base address of the memory region being used by the RIS Generator to generate random instruction sequences. The RIS Generator provides APIs to the funcs to extract the addresses being used by the generator. As shown in Fig. 6, the cache index evictor func starts by making a read request to physical memory address PA. This brings the cache line containing the address PA into the cache. Subsequently, the func makes read requests to physical memory addresses PA+m, PA+2m, PA+3m, PA+4m, and so on. Considering a 4-way set associative cache, starting from the read request to PA+4m, cache lines start to get evicted from the processor cache. All subsequent read requests from this point onwards result in a continuous stream of cache lines being evicted from the processor cache.

Memory Irritator is another func used by the RIS Generator. This func reads the base address of the memory regions used by the RIS Generator to generate random instruction sequences. Using this base address, the func creates dummy memory transactions originating from the processor running the func. These transactions can be configured to be chosen from among a wide range of transactions as desired by the user, based on the scenario that is being targeted in the RIS Generator.

In the case of this bug, the memory irritator gets the base address of the memory region used by the RIS Generator, which is PA. Then, the func generates read requests and cache line clean and invalidate requests continuously on the base address PA. This creates a background noise of requests originating from the processor running the memory irritator func.

B. Sequence of events in the processor
This bug requires that at least two processors are present in the multiprocessor system. The first processor, Processor 0, is scheduled to run the cache index evictor func. The second processor, Processor 1, is scheduled to run the memory irritator func. These funcs are run in addition to the random instruction sequences generated by the RIS Generator. This gives a variation in timing.

The sequence of events which led to the bug is shown in Fig. 6. Both Processor 0 and Processor 1 make use of the same physical address PA. Due to the hardware-managed data coherence implemented in the multiprocessor system, the read requests to address PA created in Processor 1 by the memory irritator func result in cache snoop requests being sent from Processor 1 to the Processor 0 cache for the address PA. Similarly, the cache clean and invalidate requests on address PA are also broadcast from Processor 1 to Processor 0.

From the perspective of Processor 0, a stream of cache lines is getting evicted around address PA, while a continuous stream of read requests and cache clean and invalidate requests is received on the same address PA.

The multiprocessor system implements an arbiter which arbitrates between the different memory requests and cache maintenance operation requests coming from the different processors in the system. Due to arbitration priorities and slot allocation order, the external read requests and the cache clean and invalidate requests occupy all the slots in the arbiter. As a result, a read request to address PA+xm originating from Processor 0 is kept pending forever.

It must be noted that the address PA, on which broadcast requests are being sent from Processor 1, and the address PA+xm, which is an address generated on Processor 0 by the cache index evictor func, share the same cache index.

As the request from Processor 0 is kept pending forever, a deadlock detection mechanism in the testbench is triggered. This indicates a lock-up in the processor from which the processor cannot recover without being reset. This led to the discovery of this deadlock bug.
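
The access pattern of the cache index evictor described in Section V-A can be checked with a short Python sketch. This is an illustrative model, not the func itself: the cache geometry (4 ways, 64-byte lines, 128 sets), the base address, the LRU replacement policy, and the helper names `evictor_addresses`, `set_index` and `simulate_reads` are all assumptions made for the example, since the paper does not state them. The sketch shows that reads to PA, PA+m, PA+2m, ... all fall on one cache index, and that in a 4-way cache the fifth such read causes the first eviction.

```python
def evictor_addresses(pa, way_size, count):
    """Addresses PA, PA+m, PA+2m, ... read by the evictor (m = way size)."""
    return [pa + k * way_size for k in range(count)]


def set_index(addr, line_size, num_sets):
    """Cache index (set number) that a physical address maps to."""
    return (addr // line_size) % num_sets


def simulate_reads(addresses, ways, line_size, num_sets):
    """Replay reads on an empty set-associative cache with LRU replacement.

    Returns a list of (address_read, address_evicted) pairs.
    """
    sets = {}        # set index -> resident line numbers, least recent first
    evictions = []
    for addr in addresses:
        line = addr // line_size
        resident = sets.setdefault(line % num_sets, [])
        if line in resident:
            resident.remove(line)            # hit: refresh recency
        elif len(resident) == ways:
            victim = resident.pop(0)         # set full: evict the LRU line
            evictions.append((addr, victim * line_size))
        resident.append(line)
    return evictions


# Assumed geometry: 4 ways, 64-byte lines, 128 sets, so m = 64 * 128 bytes.
WAYS, LINE_SIZE, NUM_SETS = 4, 64, 128
M = LINE_SIZE * NUM_SETS
PA = 0x8000_0000   # hypothetical base address of the RIS memory region

addresses = evictor_addresses(PA, M, count=7)
indices = {set_index(a, LINE_SIZE, NUM_SETS) for a in addresses}
evictions = simulate_reads(addresses, WAYS, LINE_SIZE, NUM_SETS)

print("distinct cache indices touched:", len(indices))
for read_addr, victim in evictions:
    print(f"read {read_addr:#x} evicts {victim:#x}")
```

Because every address in the stream shares the cache index of PA, requests arriving from another processor on PA contend with the evictor's own reads to PA+xm for the same set, which is the condition the text describes.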
[Figure: Processor 0 runs the Cache Line Evictor ‘func’ and Processor 1 runs the Memory Irritator, both on the common address ‘PA’ generated by the RIS generator (keep looping y times). Cache Line Evictor: issue reads from addresses PA + m, PA + 2m, PA + 3m, ... (where ‘m’ is the size of one cache way); a stream of cache lines keeps getting evicted continuously. Memory Irritator: issue read from address PA, plus cache line clean and invalidate of address PA; the read requests and cache line clean and invalidate requests on address PA are broadcast to Processor 0 (continuous PA requests). Due to continuous requests on PA, reads to PA + xm are blocked, and the processor locks up. This is a bug!]
Fig. 6. Case Study II: A continuous stream of requests on Address ‘A’ can block a different address ‘B’, where B and A belong to the same cache ‘set’, but are separated by multiples of ‘way’ size

C. Summary of Case Study II
From the description of events which led to the discovery of the processor bug, the following observations can be made.

• The directed subroutines, cache index evictor and memory irritator, are essential in addition to the random instruction sequences to find this bug.

• Even though the memory irritator only adds background noise to the transactions going on in the multiprocessor system, it has helped in finding this case of lock-up.

• This is a case of a timing related bug, where the exact sequence of processor events needs to be triggered by the memory request traffic to lead to the conditions that manifest the bug.

VI. CONCLUSION
As demonstrated through the case study of two multiprocessor bugs found by the ARM RIS Generator, random instruction sequences aid in discovering the category of bugs which are timing sensitive and require a precise sequence of events in the processor, which is difficult to predict. For stress testing multiprocessor systems with hardware-managed data coherence between the processor caches, creating migration of cache lines between the processor caches by using shared memory is an industry standard method. Additionally, the methods of managing shared memory resources, and the virtual memory reuse techniques used by the ARM RIS Generator, are instrumental in finding such rare bugs.

Similarly, user-defined inputs in the form of weights and directed subroutines also play an important role in steering the random instruction sequence in a desired direction, to achieve the goal of finding the rare bugs in a relatively short span of time. Since the time available for verification has been reducing over the years, the memory techniques described in this paper directly help improve the quality of verification and shorten the time-to-market for processors.

VII. REFERENCES
[1] Logan, C.A., "Directions in multiprocessor verification," Conference Proceedings of the 1995 IEEE Fourteenth Annual International Phoenix Conference on Computers and Communications, pp. 29-33, 28-31 Mar. 1995
[2] Stenstrom, P., "A survey of cache coherence schemes for multiprocessors," Computer, vol. 23, no. 6, pp. 12-24, June 1990
[3] Wood, D.A.; Gibson, G.A.; Katz, R.H., "Verifying a multiprocessor cache controller using random test generation," IEEE Design & Test of Computers, vol. 7, no. 4, pp. 13-25, Aug. 1990
[4] Yingbiao Yao; Jianwu Zhang; Bin Wang; Qingdong Yao, "A Pseudo-Random Program Generator for Processor Functional Verification," International Symposium on Integrated Circuits (ISIC '07), pp. 441-444, 26-28 Sept. 2007
[5] Adir, A.; Almog, E.; Fournier, L.; Marcus, E.; Rimon, M.; Vinov, M.; Ziv, A., "Genesys-Pro: innovations in test program generation for functional processor verification," IEEE Design & Test of Computers, vol. 21, no. 2, pp. 84-93, Mar.-Apr. 2004
[6] Qin, X.; Mishra, P., "Automated generation of directed tests for transition coverage in cache coherence protocols," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012, pp. 3-8, 12-16 March 2012
[7] Adir, A.; Shurek, G., "Generating concurrent test-programs with collisions for multi-processor verification," Seventh IEEE International High-Level Design Validation and Test Workshop, 2002, pp. 77-82, 27-29 Oct. 2002
[8] Becky Cavanaugh, Melanie Typaldos, "Random Test Generation for Multi-Processor Systems," Obsidian Software Inc., 2008
[9] Ilya Wagner, "An Effective Verification Solution for Modern Microprocessors," University of Michigan, 2008
