Executive Summary
This case study is an example of a project where CriticalBlue's engineering skills and
technology were brought in to improve product performance. CriticalBlue's technology
enabled the rapid identification of optimization opportunities which were then exploited to
provide quick, tangible performance gains, making a new product more attractive to
consumers.
Introduction
The Web browser is a key application on Android-based devices. This is reflected by the wide
range of available benchmarking websites and their prominence in consumer reviews for new
Android products.
This case study describes some of the optimizations made during a customer project where
Browsermark benchmark scores for a new platform were lower than those measured on
competitors' products.
In Android 4.1 Jelly Bean, the Android Web Browser is based on the WebKit layout engine, a
heavily optimized open source library. Even though most high-level optimizations have
already been implemented, opportunities for point optimization still exist in key functions. In
particular further architectural optimization is possible, where we focus on matching the
software more closely to its target platform, delivering measurable benchmark gains.
Since the WebKit code base is very large, it is impractical to manually change the code at all
possible optimization points; instead, in this case study we used our Prism technology to
capture and analyze runtime behavior, enabling us to direct engineering effort where it made
the greatest impact. We focused on just two optimizations, both targeting the Advanced
DOM Search section of the Browsermark suite.
Trace capture flow using Prism tools on the target before transferring the results to the host
The resulting trace allowed our engineers to understand how the benchmark stressed not
only the WebKit software implementation but also the processor architecture. This insight
allowed us to identify and ultimately rectify the performance bottlenecks on this platform.
Unfortunately, it is difficult to identify problematic branches in advance. Often they will not
become apparent until runtime, and then only when processing certain data. In this project,
Prism's runtime analysis of WebKit identified branch instructions with high misprediction
rates while running Browsermark.
In the screenshot above, we can see that Prism compiled a list of source locations with
associated performance statistics, including a misprediction count for each branch. The
greatest cause of misprediction stalls for this run was found at line 66 of
MarkupAccumulator.cpp, in the appendCharacterReplacingEntities()
function. The question was: why was this causing a problem?
The answer came from analyzing the execution counts to the left of each source line. In the
screenshot above, Line 66 is a for loop test that was executed 1.21 million times. The
preceding line was executed over 257 thousand times, so on average the loop iterated
almost five times per entry. Looking at the definition for the five-element array entityMaps
(as shown below), this is not surprising, given that the break on line 71 was so rarely executed.
Loops present a particular problem for branch prediction, since the same branch instruction
will always be predicted wrongly at least once when the loop exits. This is not normally a
significant cost, but here the loop was entered very frequently and completed only a handful of
iterations each time, so the exit mispredictions accumulated into substantial stalls.
This code could be optimized by reducing the number of times the program flow entered the
loop. Since the test of the if statement in the loop body rarely passed anyway, hoisting it
outside of the loop should improve performance.
Understanding the functionality of the loop in its context further supported this approach.
The loop iterated over a character buffer (content[]), looking for matches contained in the
entityMaps array. Since the characters in entityMaps appeared in content[] only
infrequently, it was possible to construct a 'fast reject' test condition that would be more
predictable.
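The transformation can be sketched as follows. The declarations here are simplified stand-ins for illustration, not WebKit's exact code from MarkupAccumulator.cpp:

```cpp
// Hypothetical sketch of the fast-reject transformation. EntityDescription,
// entityMaps, entityForSlow, and entityForFast are assumed names, not
// WebKit's actual declarations.
#include <cstddef>

struct EntityDescription {
    char entity;
    const char* reference;
};

static const EntityDescription entityMaps[] = {
    { '&', "&amp;" }, { '<', "&lt;" }, { '>', "&gt;" },
    { '"', "&quot;" }, { '\xA0', "&nbsp;" },
};

// Original pattern: every character enters the loop, which usually runs to
// completion and mispredicts the exit branch.
const char* entityForSlow(char c) {
    for (std::size_t i = 0; i < sizeof(entityMaps) / sizeof(entityMaps[0]); ++i) {
        if (entityMaps[i].entity == c)
            return entityMaps[i].reference;
    }
    return nullptr;
}

// Optimized pattern: a cheap, highly biased 'fast reject' test keeps the
// vast majority of characters out of the loop entirely.
const char* entityForFast(char c) {
    if (c != '&' && c != '<' && c != '>' && c != '"' && c != '\xA0')
        return nullptr; // rarely-false branch: easy to predict
    return entityForSlow(c);
}
```

Because the fast-reject branch almost always goes the same way, the predictor learns it quickly, and the unpredictable loop-exit branch executes only for the rare characters that actually need replacing.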
In the optimized code above, the if condition will rarely be true, so it can be predicted
easily by a Branch Target Address Cache, a structure common in most application processor
architectures. Indeed, rerunning the analysis verified that this optimization dramatically
reduced the total stall cycles in the function, as shown below.
On embedded platforms, main memory is often slower than that found on desktop systems,
in an effort to save power and space. Therefore, it is vital for optimal performance to use the
L1 and L2 cache memories on the processor as efficiently as possible.
Prism's analysis of WebKit, while running the Browsermark DOM Search, highlighted
the cache performance issues shown in the screenshot below. Here Prism identified that the
method WebCore::DynamicNodeList::item(unsigned int) was responsible for
34% of the data loaded into the cache; however, 81% of that data was never actually used.
Since the cache was being filled with useless data, the miss rate increased as useful data was
unnecessarily evicted. The overall effect was to cause more processor stalls as data was
fetched from main memory, which led to a longer runtime and a lower benchmark score.
The JavaScript fragment below produced the sort of runtime behavior we have just discussed.
The call to getElementsByTagName caused the creation of a DynamicNodeList object
and the aNodeList[i] operation called the item method.
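The original fragment was shown as a screenshot; the sketch below is a hypothetical reconstruction of that kind of loop, with illustrative identifiers rather than the benchmark's actual code:

```javascript
// Hypothetical reconstruction: indexed access into a live NodeList. In a
// browser, doc would be the global document object; getElementsByTagName
// returns a live DynamicNodeList in WebKit.
function countMatching(doc, tagName) {
    var aNodeList = doc.getElementsByTagName(tagName); // creates DynamicNodeList
    var count = 0;
    for (var i = 0; i < aNodeList.length; i++) {
        if (aNodeList[i])   // each aNodeList[i] invokes the item() method
            count++;
    }
    return count;
}
```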
As seen in the screenshot above, the accesses causing memory traffic were into a Node
object defined in Node.h. After we examined the implementation of the methods involved,
this turned out to be unsurprising: whenever a new item was requested from the list, the
implementation traversed the DOM tree node by node until it reached the requested offset.
The complexity of the Node class meant that the data being accessed was widely spread in
memory, so we were accessing data with poor spatial locality. This explained the poor cache
performance that we detected while analyzing the code. As new data was accessed, the cache
loaded in an entire block of data from the surrounding memory locations, under the mistaken
assumption that nearby data would be accessed soon.
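A toy example (not WebKit code) illustrates the effect: summing integers stored contiguously touches far fewer cache lines than summing the same integers reached through scattered heap records.

```cpp
// Toy illustration of spatial locality. Values packed contiguously share
// cache lines; pointer-chased records spread across the heap each pull in
// a cache line that is mostly unused padding.
#include <vector>

struct SpreadOut {
    int value;
    char unrelated[60]; // stands in for the many other fields of a large object
};

long long sumContiguous(const std::vector<int>& values) {
    long long total = 0;
    for (int v : values)
        total += v; // sequential accesses: one cache line serves many values
    return total;
}

long long sumScattered(const std::vector<SpreadOut*>& records) {
    long long total = 0;
    for (const SpreadOut* r : records)
        total += r->value; // each access loads a line of mostly padding
    return total;
}
```

Both functions return the same answer; only the memory traffic differs, which is exactly the difference Prism's cache statistics exposed.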
Once the cause of the issue was identified, the following optimization was implemented. The
original code contained a hint about how to proceed, as shown in the screenshot below.
Execution counts indicated that last item caching was having little impact
Here we see that caching of the previously returned node was implemented (lines 137 to 140)
so that repeated accesses to the same item would not trigger a DOM traversal. However, the
execution counts showed that over 88% of calls to the method missed the cache. An
improvement here would make a significant difference.
The optimization shown in the box below allowed for more aggressive caching of nodes, by
holding the entire NodeList in the Vector structure allItems. The vector was filled with all
matching nodes in a single traversal of the DOM, performed the first time an item was requested:
if (m_caches->useAllItemsCache) {
    if (m_caches->allItems.empty()) {
        Node* n = m_rootNode->firstChild();
        while (n) {
            if (n->isElementNode() && nodeMatches(static_cast<Element*>(n))) {
                m_caches->allItems.push_back(n);
            }
            n = n->traverseNextNode(m_rootNode.get());
        }
    }
    return m_caches->allItems[offset];
}
The lookup into a Vector object is fast by design, since its tightly-packed structure has good
spatial locality. The potential downside is the cost of populating the list in the first place. This
is amortized if the list is iterated multiple times; even for a single iteration over the entire
list, it should still be cheaper than the original approach.
Looking forward, an additional danger lies in the live nature of the NodeList data. If the
underlying DOM is modified, the vector should be flushed and rebuilt from scratch.
Fortunately, the infrastructure for implementing this invalidation is already in place, so
correctness is straightforward to maintain; performance, however, could still drop dramatically
in the worst case. This can be further mitigated by tracking the number of item lookups between
vector flushes, and by reverting to the original scheme above a certain threshold.
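A minimal sketch of that mitigation follows; the class and member names are assumptions for illustration, not WebKit's actual cache fields:

```cpp
// Hypothetical sketch: count item lookups between cache flushes and fall
// back to the original uncached scheme when the cache is being invalidated
// faster than it can pay for its rebuild cost.
#include <vector>

class CachedNodeList {
public:
    explicit CachedNodeList(unsigned threshold) : m_threshold(threshold) {}

    // Called whenever the underlying DOM is modified.
    void invalidateCache() {
        // If the cache saw too few lookups since the last flush, revert to
        // the original per-item traversal scheme.
        if (m_useAllItemsCache && m_lookupsSinceFlush < m_threshold)
            m_useAllItemsCache = false;
        m_allItems.clear();
        m_lookupsSinceFlush = 0;
    }

    // Called at the top of item(): decide which lookup path to take.
    bool useAllItemsCache() {
        ++m_lookupsSinceFlush;
        return m_useAllItemsCache;
    }

private:
    std::vector<void*> m_allItems; // stands in for Vector<Node*> allItems
    unsigned m_threshold;
    unsigned m_lookupsSinceFlush = 0;
    bool m_useAllItemsCache = true;
};
```

With a workload that rarely mutates the DOM (like the DOM Search benchmark) the counter stays high and the vector cache remains active; a mutation-heavy workload trips the threshold and degrades gracefully to the original behavior.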
When rerun with the optimized code, the DOM search score increased by 16%.
Results
When both optimizations discussed in the previous sections were integrated into the WebKit
code base, the DOM Advanced search score increased by 22% over the original code.
Thanks to the detailed analysis provided by Prism, engineering effort was focused where it
could make the greatest impact; as a result, this performance increase was obtained by
modifying only ~60 lines of code in the original source.
In addition to delivering better benchmark scores to the customer, this project was used as a
training exercise to educate the customer's engineers on how to investigate performance
bottlenecks and devise improvement strategies.
Before and after results for the DOM Advanced Search Benchmark