Embedded WebKit Case Study

Executive Summary
This case study is an example of a project where CriticalBlue's engineering skills and
technology were brought in to improve product performance. CriticalBlue's technology
enabled the rapid identification of optimization opportunities which were then exploited to
provide quick, tangible performance gains, making a new product more attractive to
consumers.

Introduction
The Web browser is a key application on Android-based devices. This is reflected by the wide
range of available benchmarking websites and their prominence in consumer reviews for new
Android products.

This case study describes some of the optimizations made during a customer project where
Browsermark benchmark scores for a new platform were lower than those measured on
competitors' products.

Browsermark (browsermark.rightware.com) consists of a number of sub-tests that stress
different aspects of web browser functionality, with an emphasis on JavaScript execution and
on the performance of various APIs provided by the browser, including DOM and WebGL.
(Note: this project was based on Browsermark 1.0.)

In Android 4.1 Jelly Bean, the Android Web Browser is based on the WebKit layout engine, a
heavily optimized open source library. Even though most high-level optimizations have
already been implemented, opportunities for point optimization still exist in key functions. In
particular, further architectural optimization is possible, where we focus on matching the
software more closely to its target platform, delivering measurable benchmark gains.

Since the WebKit code base is very large, it is impractical to manually change the code at all
possible optimization points; instead, in this case study we used our Prism technology to
capture and analyze runtime behavior, enabling us to direct engineering effort where it made
the greatest impact. We focused on just two optimizations, both targeting the Advanced
DOM Search section of the Browsermark suite.

2013 Issue 001 Page 1 of 9


Embedded WebKit Optimization

Analysis with Prism Tools


We used CriticalBlue's Prism dynamic analysis technology to capture the behavior of the
underlying WebKit engine within the Android Browser. Traces were captured on the device as
Browsermark ran, and then analyzed with Prism on a host workstation.

Trace capture flow using Prism tools on the target before transferring the results to the host

The resulting trace allowed our engineers to understand how the benchmark stressed not
only the WebKit software implementation but also the processor architecture. This insight
allowed us to identify and ultimately rectify the performance bottlenecks on this platform.

Processor Pipeline Optimization


Fast application processors rely on deep execution pipelines to support high clock
frequencies. To keep these pipelines full, branch prediction is used to support the speculative
issuing of instructions before the condition for the branch is resolved. In normal conditions
performance remains high, but if frequent branch mispredictions occur this will result in a
substantial performance penalty since partly executed speculative instructions must be
flushed from the pipeline and correct instructions must be issued instead.

Unfortunately, it is difficult to identify problematic branches in advance. Often they will not
become apparent until runtime, and then only when processing certain data. In this project,
Prism's runtime analysis of WebKit identified branch instructions with high misprediction
rates while running Browsermark.

Branch misprediction highlighted in the source code (screenshot annotations: list of branch
mispredictions; code responsible for stall; execution count shows loop iterating ~5 times on
average)

In the screenshot above, we can see that Prism compiled a list of source locations with
associated performance statistics, including a misprediction count for any branches. The
greatest cause of misprediction stalls for this run was found at line 66 of
MarkupAccumulator.cpp, in the appendCharacterReplacingEntities()
function. The question was: why was this causing a problem?

The answer came from analyzing the execution counts to the left of each source line. In the
screenshot above, line 66 is a for loop test that was executed 1.21 million times. The
preceding line was executed over 257 thousand times, so the loop went around almost 5 times
on average. Looking at the definition of the five-element array entityMaps (as shown below),
this is not surprising, given that the break on line 71 was so rarely executed.

static const EntityDescription entityMaps[] = {
    { '&', ampReference, EntityAmp },
    { '<', ltReference, EntityLt },
    { '>', gtReference, EntityGt },
    { '"', quotReference, EntityQuot },
    { noBreakSpace, nbspReference, EntityNbsp }
};

Declaration of the array used by the inner loop

Loops present a particular problem for branch prediction, since the same branch instruction
will always be predicted wrongly at least once when the loop exits. This is not normally an
issue since high-iteration loops will amortize the cost. Here, however, we had a
maximum of five iterations, which was too few to amortize the cost of the misprediction at
loop exit.

This code could be optimized by reducing the number of times the program flow entered the
loop. Since the test of the if statement in the loop kernel rarely passed anyway, hoisting it
outside of the loop should improve performance.

Understanding the functionality of the loop in its context further supported this approach.
The loop iterated over a character buffer (content[]), looking for matches contained in the
entityMaps array. Since the characters in entityMaps appeared in content[] only
infrequently, it was possible to construct a 'fast reject' test condition that would be more
predictable.

for (size_t i = 0; i < length; ++i) {
    // Fast reject if content[i] is not an element of entityMaps
    if ((content[i] == noBreakSpace) || (content[i] <= '>')) {
        for (size_t m = 0; m < WTF_ARRAY_LENGTH(entityMaps); ++m) {
            // Original loop kernel
        }
    }
}
Optimized source code

In the optimized code above, the if condition will rarely be true, so the branch can be
predicted easily with a Branch Target Address Cache, which is common in most application
processor architectures. Indeed, rerunning the analysis verified that this optimization
dramatically reduced the total stall cycles in the function, as shown below.

Performance statistics before and after optimization

This optimization improved the Browsermark DOM search score by 6%.

Data Layout Optimization


The runtime of applications such as Web browsers, which build large data structures, is very
dependent on the performance of the memory sub-system, as large amounts of data are
frequently accessed.

On embedded platforms, main memory is often slower than that found on desktop systems,
in an effort to save power and space. Therefore, it is vital for optimal performance to use the
L1 and L2 cache memories on the processor as efficiently as possible.

The Prism analysis of WebKit, while running the Browsermark DOM Search, highlighted
the cache performance issues shown in the screenshot below. Here Prism identified that the
method WebCore::DynamicNodeList::item(unsigned int) was responsible for
34% of the data loaded into the cache; however, 81% of that data was never actually used.

Since the cache was being filled with useless data, the miss rate increased as useful data was
unnecessarily evicted. The overall effect was to cause more processor stalls as data was
fetched from main memory, which led to a longer runtime and a lower benchmark score.

The WebCore::DynamicNodeList::item method was called on the Document Object
Model (DOM) by JavaScript code executing in the browser.

Cache efficiency statistics for DynamicNodeList::item

The JavaScript fragment below produced the sort of runtime behavior we have just discussed.
The call to getElementsByTagName caused the creation of a DynamicNodeList object
and the aNodeList[i] operation called the item method.

var aNodeList = document.getElementsByTagName("h1");

for (var i = 0; i < aNodeList.length; i++) {
    aNodeList[i].style.color = "red";
}

JavaScript making use of a NodeList object provided by the DOM

The key point is that NodeList, as returned by document.getElementsByTagName(),
is a live object, in that the contents of the list can change dynamically. As we shall see
later, this explains the design of the original implementation.

As seen in the screenshot above, the accesses causing memory traffic were into a Node
object defined in Node.h. After we examined the implementation of the methods involved,
this turned out to be unsurprising: whenever a new item was requested from the
NodeList, it caused a traversal of the DOM tree as the next suitable element was searched
for. As noted earlier, this is in general necessary: the content of the list may have changed
since the last lookup, so a new traversal is required. In this case study, during the search,
data members on each DOM tree node object were checked to find a match, and this caused
the data to be loaded into the cache.

The complexity of the Node class meant that the data being accessed was widely spread in
memory, so we were accessing data with poor spatial locality. This explained the poor cache
performance that we detected while analyzing the code. As new data was accessed, the cache
loaded in an entire block of data from the surrounding memory locations, under the mistaken
assumption that nearby data would be accessed soon.

Once the cause of the issue was identified, the following optimization was implemented. The
original code contained a hint about how to proceed, as shown in the screenshot below.

Execution counts indicated that last item caching was having little impact

Here we see that caching of the previously returned node was implemented (lines 137 to 140)
so that repeated accesses to the same item would not trigger a DOM traversal. However, the
execution counts showed that over 88% of calls to the method missed the cache. An
improvement here would make a significant difference.

The optimization shown in the box below allowed for more aggressive caching of nodes, by
holding the entire NodeList in the Vector structure allItems. The vector was filled with
all of the matching nodes when the NodeList was created. From that point, each call to
item simply returned the corresponding vector lookup.

if (m_caches->useAllItemsCache) {
    if (m_caches->allItems.empty()) {
        Node* n = m_rootNode->firstChild();
        while (n) {
            if (n->isElementNode() && nodeMatches(static_cast<Element*>(n))) {
                m_caches->allItems.push_back(n);
            }
            n = n->traverseNextNode(m_rootNode.get());
        }
    }
    return m_caches->allItems[offset];
}

Part of the optimized item() implementation

The lookup into a Vector object is fast by design, since its tightly-packed structure has good
spatial locality. The potential downside is the cost of populating the list in the first place. This
is amortized if the list is iterated multiple times. Even for a single iteration over the entire list,
it should still be cheaper than the original approach.

Looking forward, an additional danger lies in the live nature of the NodeList data. If the
underlying DOM is modified, the vector should be flushed and rebuilt from scratch.
Fortunately, the infrastructure for implementing this is already in place, so functionality is
straightforward to maintain. However, performance could still drop dramatically in the
worst case. This can be further mitigated by tracking the number of item lookups between
vector flushes, and by reverting to the original scheme above a certain threshold.

When rerun with the optimized code, the DOM search score increased by 16%.

Results
When both optimizations discussed in the previous sections were integrated into the WebKit
code base, the DOM Advanced Search score increased by 22% over the original code.

Thanks to the detailed analysis provided by Prism, engineering effort was focused where it
could make the greatest impact; as a result, this performance increase was obtained by
modifying only ~60 lines of code in the original source.

In addition to delivering better benchmark scores to the customer, this project was used as a
training exercise to educate the customer's engineers on how to investigate performance
bottlenecks and figure out improvement strategies.

Before and after results for the DOM Advanced Search Benchmark
