Detecting Mobile Malicious Webpages in Real Time

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2016.2575828, IEEE
Transactions on Mobile Computing
Detecting Mobile Malicious Webpages

in Real Time
Chaitrali Amrutkar, Young Seuk Kim, and Patrick Traynor, Senior Member, IEEE
Abstract—Mobile specific webpages differ significantly from their desktop counterparts in content, layout and functionality. Accordingly,
existing techniques to detect malicious websites are unlikely to work for such webpages. In this paper, we design and implement kAYO,
a mechanism that distinguishes between malicious and benign mobile webpages. kAYO makes this determination based on static
features of a webpage ranging from the number of iframes to the presence of known fraudulent phone numbers. First, we experimentally
demonstrate the need for mobile specific techniques and then identify a range of new static features that highly correlate with mobile
malicious webpages. We then apply kAYO to a dataset of over 350,000 known benign and malicious mobile webpages and demonstrate
90% accuracy in classification. Moreover, we discover, characterize and report a number of webpages missed by Google Safe Browsing
and VirusTotal, but detected by kAYO. Finally, we build a browser extension using kAYO to protect users from malicious mobile websites
in real-time. In doing so, we provide the first static analysis technique to detect malicious mobile webpages.
Index Terms—Mobile security, webpages, web browsers, machine learning.
1 I NTRODUCTION pages. kAYO uses static features of mobile webpages

Mobile devices are increasingly being used to access derived from their HTML and JavaScript content, URL
the web. However, in spite of significant advances in and advanced mobile specific capabilities. We first exper-
processor power and bandwidth, the browsing experi- imentally demonstrate that the distributions of identical
ence on mobile devices is considerably different. These static features when extracted from desktop and mobile
differences can largely be attributed to the dramatic webpages vary dramatically. We then collect over 350,000
reduction of screen size, which impacts the content, mobile benign and malicious webpages over a period of
functionality and layout of mobile webpages. three months. We then use a binomial classification tech-
Content, functionality and layout have regularly been nique to develop a model for kAYO to provide 90% ac-
used to perform static analysis to determine malicious- curacy and 89% true positive rate. kAYO’s performance
ness in the desktop space [20], [37], [51]. Features such matches or exceeds that of existing static techniques used
as the frequency of iframes and the number of redirec- in the desktop space. kAYO also detects a number of
tions have traditionally served as strong indicators of malicious mobile webpages not precisely detected by
malicious intent. Due to the significant changes made existing techniques such as VirusTotal and Google Safe
to accommodate mobile devices, such assertions may Browsing. Finally, we discuss the limitations of existing
no longer be true. For example, whereas such behavior tools to detect mobile malicious webpages and build a
would be flagged as suspicious in the desktop setting, browser extension based on kAYO that provides real-
many popular benign mobile webpages require multiple time feedback to mobile browser users.
redirections before users gain access to content. Previous We make the following contributions:
techniques also fail to consider mobile specific webpage • Experimentally demonstrate the differences in the
elements such as calls to mobile APIs. For instance, links “security features” of desktop and mobile web-
that spawn the phone’s dialer (and the reputation of the pages: We experimentally demonstrate that the dis-
number itself) can provide strong evidence of the intent tributions of static features used in existing tech-
of the page. New tools are therefore necessary to identify niques (e.g., the number of redirections) are different
malicious pages in the mobile web. when measured on mobile and desktop webpages.
In this paper, we present kAYO1 , a fast and reliable Moreover, we illustrate that certain features are in-
static analysis technique to detect malicious mobile web- versely correlated or unrelated to or non-indicative
to a webpage being malicious when extracted from
• Chaitrali Amrutkar is with Google. This work was completed while she each space. The results of our experiments demon-
was a graduate student at the Georgia Tech Information Security Center strate the need for mobile specific techniques for
(GTISC), Georgia Institute of Technology, Atlanta, GA 30332, USA. E-
mail:chaitrali.amrutkar@gmail.com detecting malicious webpages.
• Young Seuk Kim is with PWC. He completed this work at GTISC, • Design and implement a classifier for malicious
Georgia Institute of Technology, Atlanta, GA 30332, USA. E-mail: and benign mobile webpages: We collect over
ykim320@gmail.com
• Patrick Traynor is with the Florida Institute for Cyber Security, University 350,000 benign and malicious mobile webpages. We
of Florida, Gainesville, FL 32611, USA. E-mail:traynor@cise.ufl.edu then identify new static features from these web-
1. Our technique is called “kAYO” (knockout in boxing terminology) pages that distinguish between mobile benign and
1536-1233because
(c) 2016 IEEE. Personalout
it knocks use ismalicious
permitted, but republication/redistribution
mobile webpages. requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html
malicious webpages. kAYO provides 90% accuracy for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMC.2016.2575828, IEEE
2
in classification and shows improvement of two or- have focused on websites built for desktop browsers
ders of magnitude in the speed of feature extraction in the past. Mobile browsers have been shown to dif-
over similar existing techniques. We further empir- fer from their desktop counterparts in terms of se-
ically demonstrate the significance of kAYO’s fea- curity [13], [14]. Although differences in mobile and
tures. Finally, we also identify 173 mobile webpages desktop websites have been observed before [19], it is
implementing cross-channel attacks, which attempt unclear how these differences impact security. Further-
to induce mobile users to call numbers associated more, the threats on mobile and desktop websites are
with known fraud campaigns. somewhat different [26]. Static analysis techniques using
• Implement a browser extension based on kAYO: features of desktop webpages have been primarily stud-
To the best of our knowledge kAYO is the first ied for drive-by-downloads on desktop websites [20],
technique that detects mobile specific malicious [51], whereas, the biggest threat on the mobile web
webpages by static analysis. Existing tools such as at present is believed to be phishing [18]. Efforts in
Google Safe Browsing are not enabled on the mobile mitigating phishing attacks on desktop websites include
versions of browsers, thereby precluding mobile isolating browser applications of different trust level [29],
users. Moreover, the mobile specific design of kAYO email filtering [28], using content-based features [55],
enables detection of malicious mobile webpages [59] and blacklists [38]. The best-known non-proprietary
missed by existing techniques. Finally, our survey content-based approach to detect phishing webpages is
of existing extensions on Firefox desktop browser Cantina [59]. Cantina suffers from performance problems
suggests that there is a paucity of tools that help due to the time lag involved in querying the Google
users identify mobile malicious webpages. To fill search engine. Moreover, Cantina does not work well
this void, we build a Firefox mobile browser ex- on webpages written in languages other than English.
tension using kAYO, which informs users about the Finally, existing techniques do not account for new mo-
maliciousness of the webpages they intend to visit bile threats such as known fraud phone numbers that
in real-time. We plan to make the extension publicly attempt to trigger the dialer on the phone. Consequently,
available post publication. whether existing static analysis techniques to detect
We note that we define maliciousness broadly, as is malicious desktop websites will work well on mobile
done in the prior literature on the static detection in the websites is yet to be explored.
desktop space [20], [37], [51]. However, because drive- Mobile application security: Significant work has been
by-downloads are not at all common in the mobile space done in the past few years on the security of mobile
at the time of writing, the overwhelming majority of applications. Static feature extraction, especially with
detected pages are related to phishing. respect to permissions, has been one of the most im-
portant early areas of research [21], [23], [25], [27]. Such
2 R ELATED W ORK techniques have led to dramatically more rapid detection
Content-based and in-depth inspection techniques to of malicious applications across a range of marketplaces.
detect malicious websites: Dynamic approaches using DNS based approaches to detect malicious domains:
virtual machines [45], [51] and honeyclient systems [32], A popular approach in detecting malicious activity on
[42], [46] provide deeper visibility into the behavior of the web is by leveraging distinguishing features between
a webpage. Therefore, such systems have a very low malicious and benign DNS usage. Both passive DNS
false positive rate and are more accurate. However, monitoring [15], [17], [49], [58] and active DNS probing
downloading and executing each webpage impacts per- methods [31], [33] have been used to identify malicious
formance and hinders scalability of dynamic approaches. domains. While some of these efforts focused solely
This performance penalty can be avoided by using static on detecting fast flux service networks [31], [47], [48],
approaches. Static approaches rely on the structural and [54], another [17] can also detect domains implementing
lexical properties of a webpage and do not execute the phishing and drive-by-downloads. DNS based mech-
content of the webpage. One such technique of detecting anisms do not provide deeper understanding of the
malicious URLs is using statistical methods for URL specific activity implemented by a webpage or domain.
classification based on a URL’s lexical and host-based
properties [28], [30], [35], [39]. However, URL-based 3 M OTIVATION
techniques usually suffer from high false positive rates. Static analysis techniques to detect malicious websites of-
Using HTML and JavaScript features extracted from a ten use features of a webpage such as HTML, JavaScript
webpage in addition to URL classification helps address and characteristics of the URL. Usually, these features
this drawback and provides better results [20], [41], are fed to machine learning techniques to classify benign
[55], [59]. Static approaches avoid performance penalty and malicious webpages. These techniques are predi-
of dynamic approaches. Additionally, using fast and cated on the assumption that the features are distributed
reliable static approaches to detect benign webpages can differently across benign and malicious webpages. Ac-
avoid expensive in-depth analysis of all webpages. cordingly, any changes in the distribution of static fea-
Differences between mobile and desktop websites: tures in benign and/or malicious webpages impacts
1536-1233All
(c) 2016 IEEE. Personal
these use is permitted,
approaches forbutmalicious
republication/redistribution
webpage requires IEEE permission.
detection See http://www.ieee.org/publications_standards/publications/rights/index.html
classification results of static analysis techniques. for While
more information.
1
Desktop 0.14 Desktop
Mobile Mobile
Fraction of total webpages

0.8 0.12
0.1
0.6
0.08
0.4 0.06
0.04
0.2
0.02
0 0
0 1 2 3 4 5 6 7 8 0 10 20 30 40 50
iframes per webpage Javascript per webpage
(a) (b)
1 1
Desktop Desktop
Mobile Mobile

0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
0 1 2 3 4 5 6 7 8 0 2 4 6 8 10 12
Redirections per webpage IP addresses returned per URL
(c) (d)
Fig. 1: Normalized density curves of static features. There is substantial difference between the distributions of the
number of (a) iframes, (b) Javascript and (c) redirections when measured on mobile and desktop versions of the
same websites, whereas, the distribution of the number of (d) IP addresses is similar.
successful, these static analysis techniques have been in mobile and desktop webpages.
used exclusively for desktop webpages [20], [51], [59]. 2) Infrastructure: Website providers use JavaScript or
Mobile websites are significantly different from their user agent strings to identify and then redirect mobile
desktop counterparts in content, functionality and lay- users to a mobile specific version. Figure 1 (c) shows the
out. Consequently, existing tools using static features normalized density of the number of redirection steps
to detect malicious desktop webpages are unlikely to taken by the desktop and mobile versions of the top
work for mobile webpages. We explain four factors that 10,000 websites on Alexa before landing on the final
motivate building separate static analysis techniques to URL3 . Even the most popular mobile websites show
detect malicious mobile webpages. multiple redirects, which has traditionally been a prop-
1) Differences in content: Mobile websites are often erty of desktop websites hosting malware [51]. However,
simpler than their desktop counterparts. Therefore, the multiple redirects does not necessarily indicate bad be-
distribution of content-based static features (such as the havior for mobile websites due to the characteristics of
number of JavaScripts) on mobile webpages differs from their hosting infrastructure.
that of desktop webpages. For example, Figure 1 (a) We note that not all static features used in existing
and Figure 1 (b) show the normalized density of the techniques differ when measured on mobile and desktop
number of iframes and the number of Javascript found in webpages. For example, the number of IP addresses re-
mobile2 and the corresponding desktop versions of the turned by DNS servers for mobile and desktop versions
top-level webpage of the 10,000 most popular websites of the same sites are comparable. Mobile websites appear
from Alexa [11]. Approximately 90% of mobile webpages to share their hosting infrastructure with the correspond-
do not have any iframes, whereas the corresponding ing desktop websites [36]. We used seven public (Google,
desktop webpages have multiple iframes. Desktop web- OpenDNS, UltraDNS, Norton, DynDNS, Level3, and
pages have more Javascripts than mobile webpages. Scrubit) DNS servers to obtain the IP addresses returned
Due to the simplicity of mobile webpages, the majority in the DNS A records of mobile and corresponding
of other content related static features used in exist- desktop URLs of Alexa top 10,000 websites. As seen
ing techniques including, the number of images, page in Figure 1 (d), the distributions of the number of IP
length, the number of hidden elements, and the number addresses returned by the seven DNS servers are similar
of elements with a small area also differ in magnitude
3. We use the term final URL to denote the URL that is rendered in
2. We describe the method used to define and identify mobile the browser after redirections (if any) from the seed URL. The final
1536-1233webpages
(c) 2016 IEEE.
inPersonal
detail use
in isSection
permitted,4.2.
but republication/redistribution requires IEEE permission.
URLSee http://www.ieee.org/publications_standards/publications/rights/index.html
may change based on the browser’s user agent. for more information.
4
for mobile and desktop websites. predictions about its potential maliciousness. We first
3) Impact of screen size: The screen size of a mobile discuss the feature set used in kAYO followed by the
phone is significantly smaller that that of a desktop collection process of the dataset.
computer. Therefore, a mobile user only sees a part of
the URL of a webpage. Intuitively, the author of a mobile 4.1 kAYO Feature Set
phishing webpage may only need to include misleading
A webpage has several components including HTML
words at the beginning of the URL and a short URL
and JavaScript code, images, the URL, and the header.
might suffice to trick a user.
Mobile specific webpages also access applications run-
4) Mobile specific functionality: Mobile websites
ning on a user’s device using web APIs (e.g., the dialer).
enable access to a user’s personal information and
We extract structural, lexical and quantitative properties
advanced capabilities of mobile devices through web
of such components to generate kAYO’s feature set. We
APIs. Existing static analysis techniques do not consider
focus on extracting mobile relevant features that take
these mobile specific functionalities in their feature set.
minimal extraction time. Our hypothesis is that such
We argue and later demonstrate that accounting for
features are strong indicators of whether a webpage has
the mobile specific functionalities helps identify new
been built for assisting a user in their web browsing
threats specific to the mobile web. For example, the
experience or for malicious purposes.
presence of a known ‘bank’ fraud number on a website
Our feature set consists of 44 features, 11 of which are
might indicate that the webpage is a phishing webpage
new and not previously identified or used. We describe
imitating the same bank [16].
these new features in detail. A subset of features in kAYO
have been used by other authors in static inspection of
Limitations of existing techniques: These discrepancies
desktop webpages in the past.5 However, it is important
between mobile and desktop webpages demand inves-
to note that these features in mobile webpages and
tigation. Existing static analysis techniques and tools for
desktop webpages differ in magnitude (e.g., number of
detecting malicious webpages are focused on desktop
iframes) and show varying correlation with the nature
webpages. Therefore, they are unable to detect mobile specific
of the webpage (i.e., malicious/benign).
threats with high accuracy.4 Secondly, several webpages
We divide kAYO’s 44 features into four classes: mobile
built specifically for mobile, return empty pages when
specific-, JavaScript, HTML and URL features. To the best
rendered in a desktop browser. Thus, even existing
of our knowledge, we are the first to use these mobile
dynamic analysis techniques that execute websites in
specific features, and do not claim novelty on using
desktop browsers on virtual machines, are ineffective
subsets of other previously identified features. Table 1
on such mobile websites. Finally, signature based tools
summarizes the 8 mobile, 10 JavaScript, 14 HTML and 12
such as Google Safe Browsing currently only work with
URL features. We empirically illustrate the effectiveness
desktop browsers. We manually visited five mobile spe-
of each of the features in Section 5.2.
cific known malicious webpages collected from Phish-
Tank [6], from the Google Chrome mobile browser. We
4.1.1 Mobile specific features
observed that these webpages are flagged as malicious
on the Chrome desktop browser, but not on the Chrome We collect eight mobile specific features to capture
mobile browser whose users are the real targets of the the advanced capabilities of mobile webpages. Mobile
mobile malicious webpages. Although enabling Google websites enable access to personal data from a user’s
Safe Browsing in mobile Chrome is an engineering effort, phone, an experience not offered by desktop websites.
we argue and later demonstrate that a mobile specific For example, mobile web APIs such as tel: and sms:
static technique can also detect new threats previously spawn the dialer and the SMS applications respectively
unseen by such services. on a mobile device. In order to characterize the behavior
Goals: Considering the limitations of existing tech- of mobile API calls, we extracted the number of API
niques, the goals of this work are three-fold. First, calls tel:, sms:, smsto:, mms: and mmsto: from each
identifying relevant static features from mobile specific mobile webpage. We further extracted the target phone
webpages in the wild. Second, implementing a fast numbers from these API calls. We ran the commercially
and reliable static analysis technique to detect malicious available Pindrop Security Phone Reputation System
mobile webpages in real-time. And finally, developing (PRS) [7] on each phone number. Based on the results of
a mobile browser extension that will inspect mobile the PRS, we gave the score of 1/0 (known fraud/benign)
webpages in real-time and provide feedback to the user. to each phone number scraped from the mobile API
calls, and added the score as a feature in kAYO. We only
extracted phone numbers with API prefixes that could
4 M ETHODOLOGY trigger an application installed on a user’s phone. We
Our objective is to design and develop a technique to
identify mobile specific malicious webpages in real-time. 5. A subset of kAYO’s features was selected from prior literature on
desktop webpage classification using static features. Only the features
We extract static features from a webpage and make deemed to be the most critical and definitively applicable to mobile
webpages as shown by manual analysis, were selected based on the
1536-1233 (c)
4. 2016
We IEEE. Personal usethis
demonstrate is permitted, but republication/redistribution
experimentally in Section 5. requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html
authors’ experience and knowledge of the area. for more information.
5
Total # of
Category Features
features
Mobile specific # of API calls to tel:, sms:, smsto:, mms:, mmsto:, geolocation; # of apk, # of ipa 8
presence of JS, noscript, internal JS, external JS, embedded JS;
JavaScript 10
# of JS, noscript, internal JS, external JS, embedded JS
presence of internal links, external links, images; # of internal links, external links, images
# of cookies from header, secure and HTTPOnly cookies,
HTML 14
presence of redirections and iframes, # of redirects and iframes,
whether webpage served over SSL, % of white spaces in the HTMl content
# of misleading words in the URL such as login and bank , length of URL
# of forward slashes and question marks, digits, dots, hyphens and underscores,
URL 12
# of equal signs and ampersand, subdomains, two letter subdomains, semicolons,
presence of subdomain, % of digits in hostname
Total: 44
TABLE 1: The 44 features of kAYO from four categories. The significance of both new mobile and prior features
res is evaluated in Section 5.2.
did not consider phone number strings simply listed on Webpages generally include three types of JavaScript:
webpages without an API prefix. internal, external and embedded. An internal JavaScript
We argue that due to the popularity of application is one hosted on the same domain as that of its parent
markets such as Google play and iTunes, a website host- webpage, whereas, an external JavaScript’s domain is
ing its own mobile application binary (e.g., .apk or .ipa different from its host’s domain. Since mobile web-
files) possibly suggests bad behavior. If we found more pages are often simpler than desktop webpages and
than a threshold (in the few hundreds) of apk/ipa files phishing is the biggest threat on mobile webpages at
on the same webpage, we assumed that the webpage present, we expect that benign webpages will include
was a third-party app store (of which there are many) more external JavaScript for advertisements and analyt-
and was unlikely to be malicious. ics purposes, whereas malicious webpages will have a
lower number of external JavaScript. Accordingly, we
4.1.2 JavaScript features determine whether a webpage holds external and inter-
nal JavaScript, and then extract the number of internal
JavaScript enables client-side user interaction, asyn-
and external JavaScript from a webpage. Unlike internal
chronous communication with servers, and modification
and external JavaScript, embedded JavaScript code is
of the DOM objects of webpages on the fly. We extract
contained in the webpage. If the number of lines of
10 features that capture the JavaScript relevant static
JavaScript is relatively small, a webpage with embedded
behavior of a webpage, two of which are new. All the
JavaScript loads faster than pages that must reference
features are faster to extract than the features based on
external code. This is because, as the web browser loads
JavaScript deobfuscation.
the page and encounters the reference to the external
JavaScript found on malicious webpages can be obfus-
code, it must make a separate request to the web server
cated. Instead of deobfuscating every JavaScript, we ex-
to fetch the code. Webpages built for performance often
tract simple JavaScript related features from a webpage.
use a number of embedded JavaScript. Performance is
The primary reason in choosing this approach is that
critical in the mobile web since it impacts revenue and
a large number of benign webpages include potentially
user interest [50]. Therefore, we determine whether a
dangerous JavaScript code as shown by Yue et al. [57].
webpage hosts embedded JavaScript and then calculate
For example, 44.4% of the top 6,805 websites from Alexa
the number of embedded JavaScript in a webpage. Our
use the potentially dangerous eval function. These ob-
hypothesis is that on average, benign webpages will
servations invalidate the assumption made in existing
have more embedded JavaScript. Finally, we determine
techniques [20]; that potentially dangerous JavaScript
whether JavaScript is present at all on a webpage, and
keywords are more frequently used in malicious web-
measure the total number of JavaScript on the webpage
pages. Secondly, external JavaScript can be very large,
including embedded, internal and external. Note that we
sometimes of the order of a few megabytes. Our goal is
believe that this feature is indicative, but not an alarm by
to build a real-time browser extension based on kAYO.
itself as malicious pages could also seek to gain revenue
Accordingly, we avoided using features that would slow
from advertisements.
down the feature extraction process.
We argue that benign webpage writers take effort to
provide good user experience, whereas the goal for ma- 4.1.3 HTML features
licious webpage authors is to trick users into performing We extract 14 features in total from the HTML code
unintentional actions with minimal effort. We therefore of each webpage. Popular webpages include a number
examine whether a webpage has noscript content and of images, and internal and external HTML links for
measure the number of noscript. Intuitively, a benign better user experience. For example, the top-level page
webpage writer will have more noscript in the code to of m.cnn.com includes links to other news articles pub-
ensure good experience even for a security savvy user. lished by CNN (internal HTML links), advertisements
1536-1233We
(c) 2016
add IEEE. Personaltwo
these use is newly
permitted, but republication/redistribution
identified features requires IEEE
to our permission.
set. forSee a
http://www.ieee.org/publications_standards/publications/rights/index.html
local restaurant (external HTML link) andforimages more information.
6
related to the latest breaking news. Accordingly, we first Mobile Webpage Indicators
determine whether a webpage has any images, internal Top Level Domain .mobi
and external HTML links. We then extract the number of m., mobile., touch., 3g., sp.,
Subdomain
internal links, external links and images from a webpage s., mini., mobileweb., t.
/mobile, /mobileweb, /m, /mobi,
as features of kAYO. URL Path Prefix
/?m=1, /mobil, /m home
Malicious webpages (especially those implementing
drive-by-downloads and clickjacking) include links to TABLE 2: Indicators of mobile specific webpages ex-
bad content in iframes [51]. Recall that the distribution of tracted by manual analysis of the top-level mobile and
iframes on mobile webpages is different as compared to desktop webpages of the 1,000 most popular websites on
that on desktop webpages. However, we do not rule out Alexa. We identified one top-level domain (TLD), nine
the possibility of a mobile malicious webpage including subdomains and seven URL path prefixes.
malicious content in iframes and consider the presence
and number of iframes in a webpage as features in
as features. Our URL feature set also contains the
kAYO. Past research also shows that malicious websites
number of semicolons, equal signs and ampersand
take several redirections before leading the user to the
symbols, hyphens and underscores, forward slashes and
target webpage to avoid DNS based detection [51]. Recall
question marks. Interested readers are referred to prior
that mobile webpages generally take at least one or more
literature [28], [34], [39] for details on the importance of
redirections since both desktop and mobile versions
these URL features.
of the webpage share hosting infrastructure. Therefore,
we determine whether a webpage was redirected and
Note that the HTML, JavaScript and URL features are
then measure the number of redirections the user ex-
not specific to mobile and can be used for analyzing
periences before landing on the final URL. Finally, we
desktop webpages as well. However, the mobile features
extract other features such as the percentage of white
derived from mobile applications such as dialer and SMS
spaces in the HTML content, the number of cookies
do not apply to desktop webpages.
from the header, the number of secure and HTTPOnly
cookies, and whether the webpage is served over an
SSL connection. Readers are encouraged to refer to prior 4.2 Data Collection
literature [20], [41], [59] for more information on the Our data gathering process included accumulating
usefulness of these HTML features. labeled benign and malicious mobile specific webpages.
First, we describe an experiment that identifies and
4.1.4 URL features defines ‘mobile specific webpages’. We then conduct
Structural and lexical properties of a URL have been the data collection process over three months in 2013.
used to differentiate between malicious and benign web- We use these crawls specifically because they are close
pages. However, using only URL features for such dif- to the publication of the related work, making them as
ferentiation leads to a high false positive rate. We extract close to equivalent as possible.
12 URL features in total.
Authors of phishing webpages often exploit the fa- Identification of mobile specific webpages: We crawled
miliarity of users to a webpage [22] by including words the top-level webpage of the 1,000 most popular websites
in the URL that can mislead a user into believing that from Alexa.com [11] using the Android mobile and
the phishing webpage is the legitimate webpage. Words desktop Internet Explorer (IE) browsers. We used
such as login and bank are commonly used in the URL Android mobile version 4.0 and IE desktop version 9.0
of the login webpage for benign websites that are highly for Windows 7. We then manually analyzed each pair
prone to imitation. Only a part of the URL is visible to of final URLs for the same seed URL when crawled
the user of a mobile phone due to the small screen [13]. from each browser. Before classifying a URL as mobile
Therefore, intuitively, the author of a phishing webpage specific, we confirmed that the final URLs for desktop
will include misleading words at the beginning of the and mobile were different for the same seed URL. We
URL. We consider the presence of such words in the also compared the contents of each pair of desktop and
URL as a new feature in kAYO. mobile webpages, and ensured that the two contents
A significant number of phishing domain names are were different. We ignored all the seed URLs that led to
simply IP addresses of machines hosting them [28], an identical final URL when crawled from the desktop
[40]. Therefore, we calculated the number of digits in and the mobile browser. Our analysis identified nine
a URL and the percentage of digits in the hostname. subdomains (e.g., m.) and seven URL path prefixes
Phishing webpage developers usually create a number (e.g., /mobile) in the URLs of popular websites to
of subdomains to include deceptive keywords such as represent their mobile specific webpages. Additionally,
paypal as a subdomain. This might increase the length we considered all URLs with the ‘.mobi’ Top Level
of phishing URLs [40]. Therefore, we include the length Domain (TLD) to be mobile sites [12]. We defined
of a URL, whether the URL contains a subdomain, a mobile specific webpage as one containing any of
1536-1233the
number of usesubdomains,
is permitted, but republication/redistribution
and the number requiresof
IEEEdots
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html
these 17 mobile indicators in the URL and showing for more information.
7
cious/benign) of webpages in our dataset by using
VirusTotal [9] and Google Safe Browsing [10]. The
1200 Google Safe Browsing tool performs both static and
# of mobile webpages/10k URLs
1000
dynamic analysis on webpages [51]. It first discards
benign webpages identified using static analysis and
800
then performs dynamic analysis on the webpages tagged
600 as malicious. VirusTotal queries 41 different malware de-
tection tools based on dynamic analysis, crowd sourcing
400
and signatures. To be conservative, we labeled a URL
200 as malicious only when Google Safe Browsing tagged
0 a URL as malicious, or four or more tools queried
0 200000 400000 600000 800000 1e+06
by VirusTotal labeled the URL as malicious. We also
Alexa rank
performed manual inspection if necessary. For example,
the URLs from PhishTank are crowdsourced, and Google
Fig. 2: Number of mobile specific websites found in
Safe Browsing and VirusTotal do not detect all valid
every 10,000 websites in the Alexa top 1M.
URLs from PhishTank as malicious. We manually visited
such URLs to ensure that they are phishing webpages.
differences in content from the corresponding desktop Our final dataset consisted of 349,137 benign URLs and
webpage. Table 2 summarizes the mobile indicators. 5,231 malicious URLs. We used this dataset to train
kAYO’s model. We note that we waited for a number
Building the dataset: To generate training data for our of months to determine if many of our pages were ever
model, we statically crawled the top-level webpage of classified by this engine, so as to give other detection
the top 1,000,000 most popular websites from Alexa tools time to discover the candidate sites [43]. Finally,
from an Android mobile browser. We then extracted the we used the lifetime of pages that were clear scams
mobile specific webpages using the algorithm described (e.g., banking pages that disappeared within 24 hours)
above. Figure 2 shows the number of top-level mobile to judge a small subset of pages.
specific webpages found in the dataset. 1,244 out of the
first 10,000 most popular websites offer a mobile specific 5 I MPLEMENTATION AND E VALUATION
version and 763 maintain mobile specific webpages in We describe the machine learning techniques we consid-
the 10,000-20,000 range. From 20,000 onwards upto one ered to tackle the problem of classifying mobile specific
million, the number of mobile specific webpages found webpages as malicious or benign. We then discuss the
using our algorithm is largely constant. We observed strengths and weaknesses of each classification tech-
that 485 out of the top one million Alexa websites have nique, and the process for selecting the best model
the ‘.mobi’ TLD. Using the 17 mobile indicators defined for kAYO. We build and evaluate our chosen model
in Table 2, we collected 53,638 mobile specific URLs for accuracy, false positive rate and true positive rate.
at the top-level by statically crawling each website in Finally, we compare kAYO to existing techniques and
Alexa from an Android mobile browser. We then crawled empirically demonstrate the significance of kAYO’s fea-
each of the 53,638 mobile specific websites two levels tures. We note that where automated analysis is possible,
deep. Interestingly, we found links to several non-mobile we use our full datasets; however, as is commonly done
URLs on the mobile specific webpages. We discarded in the research community, we use randomly selected
non-mobile webpages and were left with 295,512 mobile subsets of our data when extensive manual analysis and
URLs at depth two. In total, we derived 349,150 mobile verification is required.
URLs from the Alexa one million websites.
Gathering data for malicious mobile URLs was chal-
lenging since the mobile web is still evolving and 5.1 Model Selection and Implementation
new threats are emerging. We monitored several public We treated the problem of detecting malicious webpages
blacklists [2], [3], [5] continually for three months and as a binary classification problem. We considered each
extracted mobile specific URLs from the blacklists. We known benign mobile webpage as a negative sample
set up a continuous feed from two public blacklists and each known malicious mobile webpage as a pos-
and crawled newly uploaded malicious URLs every itive sample. We considered a wide range of popular
two seconds. We also monitored PhishTank’s [6] online binary classification techniques in machine learning, but
dataset for mobile specific phishing URLs. After moni- for space discuss three popular options: Support Vector
toring these sources for three months, we gathered data Machines (SVM), naı̈ve Bayes and logistic regression.
from 531 top-level and 4,681 depth two mobile specific Support Vector Machines (SVM) is a popular binary
malicious URLs. Note that our dataset also contains classifier. However, it works well only on a few thousand
mobile URLs that were submitted to the blacklists before samples of data. Due to the scaling problem of SVMs
2013, but were live at the time of crawling. and our large dataset, SVM was not the best choice for
1536-1233 (c)We
2016 IEEE. Personal use is permitted,
established ground but republication/redistribution
truth of the labels requires IEEE permission.
(mali- See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
kAYO.
8
Factor Cantina [56], [59] kAYO
Mobile web
Designed to detect Phishing
1 threats
Detects pages written in English-only Any language
Avg. feature extraction time 2.82 sec 0.016 sec
0.8 Evaluation set size
200 34914
(# of webpages)
True positive rate
True positive rate 97% 89%

0.6
False positive rate 6% 8%
Depends on
External dependencies None
0.4 Google search
Detects pages missed by
No Yes
Google Safe Browsing?
0.2
TABLE 3: Comparison of kAYO with Cantina, a tech-
0 nique using static webpage features to detect phish-
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
False positive rate
ing webpages in real-time. kAYO’s evaluation set size
Fig. 3: The final ROC curve for kAYO’s logistic regression is over two orders of magnitude larger than that of
model with regularization. Cantina. Moreover, kAYO’s feature extraction process is
two orders of magnitude faster than Cantina. Cantina’s
functionality is dependent on external tools unlike kAYO
Naı̈ve Bayes is generally used when the values of dif- and Cantina works well only on webpages written in the
ferent features are mutually independent. Many features English language. kAYO does not have these drawbacks.
that we extracted were mutually dependent. For exam-
ple, the number of scripts in a webpage was depen-
dent on the number of internal, external and embedded mean of all false positive rates6 and false negative rates7
JavaScript in the webpage, which were three other fea- output from every cross-validation step, and found the
tures of our model. Since the assumptions required for best threshold for differentiating between malicious and
optimal performance of naı̈ve Bayes did not hold for our benign data. Figure 3 shows the final ROC curve.
dataset, we could not use the naı̈ve Bayes classifier. kAYO provided 91% true positive rate and 7% false
Logistic Regression is a scalable classification technique positive rate on the cross-validation set. We used the
and makes no assumption about the distribution of best parameters obtained from the training and cross-
values of the features. Therefore, this technique was the validation steps to test the 10% labeled dataset set aside.
best fit for our dataset. We used the binomial variation Our test set shows 90% accuracy, 8% false positive rate
of logistic regression to model kAYO and employed and 89% true positive rate. We believe that this rate
`1-regularization to avoid overfitting of the data. is equivalent to that of the desktop-specific schemes,
which use dramatically smaller datasets than our own,
We used the scrapy [8] web scraping framework to as accuracy falls significantly when datasets increase in
crawl the collected mobile URLs. We then built a parser size. We also anticipate that the false positive rate on
for extracting features discussed in Section 4.1 from each the test set would be lower than what was found using
input webpage dynamically. The crawler and feature the labeled samples because kAYO detected a number
extraction scripts were implemented in Python. We used of malicious mobile URLs in the wild that we hand
logistic regression on the extracted features for train- verified, and were not detected by tools that we used for
ing and testing. We programmed the logistic regression establishing ground truth of our datatset. More details
model in the numerical computing language Octave [1]. on examples and in-depth analysis of mobile malicious
We tested the model on a machine with quad core 3.4 URLs detected by kAYO in the wild can be found in
GHz Intel Core i7 processor and 16 GB memory. Section 7.
Comparison with existing static techniques:8 We have
identified and used 11 new mobile-relevant features
previously not studied. We note that none of the ex-
5.2 Evaluation
isting techniques account for mobile specific features
Our dataset contained 349,137 benign URLs and 5,231 considered in kAYO. The non-commercial static analysis
malicious URLs. We divided our dataset into three sub- technique closest to kAYO is Cantina [59]. It detects
sets: training, cross-validation and test. We first ran- phishing webpages in real-time using static features of
domly shuffled the data and set aside 10% of the data webpages. We compare kAYO with the methodology,
as the test set. The remaining 90% of data was used for
training and 10-fold cross-validation. 6. False positive rate is equal to (1 - recall) or (1 - sensitivity).
For each validation round we calculated the accuracy, 7. False negative rate is also known as precision or specificity.
8. To the best of our knowledge, kAYO is the first technique that
the false positive rate and the true positive rate on the uses static features of webpages to detect malicious mobile pages.
validation set. We further used `1-regularization to avoid Therefore, we compare against existing desktop techniques. We also
overfitting. We varied the regularization parameter from could not secure access to the code or software of these related desktop
techniques from the respective authors upon request. Thus, our only
0 to 1,000 in the intervals of 10 and chose the best option was to base kAYO’s comparison on the results discussed in the
1536-1233parameter.
We thenuse is permitted,
plotted but republication/redistribution
a ROC curve byrequires IEEE permission.
taking the related See http://www.ieee.org/publications_standards/publications/rights/index.html
research papers of existing techniques. for more information.
9
Tech- Designed for Tested False Evalua-
nique Enviro- on Pos tion set
Threat
nment rate size 100
Train
Desktop Drive by Drive by Test
[20] 9.9 15000
Downloads download only
Malicious Drive by 80
[53] Desktop 13.7 15000
JavaScript download only
Drive by
[39] Desktop Spam URLs 14.8 15000
download only
Accuracy (%)
60
Union of Drive by, Drive by
Desktop 17.1 15000
[39], [53] malicious JS, download only
[24], [37] spam URLs
40
Existing Existing mobile
kAYO Mobile 8.1 34914
mobile web web threats
threats
20
TABLE 4: Comparison of kAYO with five existing static
analysis techniques that detect malicious desktop web- 0
Ex1: Previous features Ex2: kAYO’s features
pages. kAYO provides the lowest false positive rate on
an evaluation set twice as large as the one used by other Fig. 4: Ex1: Results of a model trained on desktop
techniques. kAYO also considers mobile web threats, webpages using desktop features studied in earlier tech-
whereas, the other techniques are focused on detecting niques and then tested on mobile webpages. Ex2: Results
desktop web threats. of a model trained on mobile webpages by adding mo-
bile specific features to the feature set and tested on mo-
Time in sec Considers bile webpages. Ex1 shows that a model trained on desk-
Technique Feature Classi- mobile top pages using features from related work performs
extraction fication webpages? poorly when applied to mobile webpages. However,
[20] 3.06 0.24 7 when a model is trained with the same static features
[53] 0.15 0.034 7 and additional mobile specific features exclusively on a
[39] 3.56 0.020 7
mobile datatset, the results of testing on a mobile dataset
Union of
N/A N/A 7 improve significantly as seen in Ex2.
[24], [37], [39], [53]
kAYO 0.016 0.0002 3 We also compared kAYO’s performance with existing
TABLE 5: Comparison of kAYO with five existing static static analysis tools that detect non-phishing attacks.
analysis techniques that detect malicious desktop web- The closest non-commercial tool to kAYO based on the
pages. kAYO’s feature extraction process is 10 times diversity of features and the scale of the evaluation set
faster than the fastest existing technique [53] and classi- is Prophiler [20]. Prophiler detects drive-by-downloads
fication time is 100 times faster than the fastest existing on desktop webpages. We compare kAYO’s performance
technique [39]. kAYO is the only technique that considers with the performance numbers of existing static tech-
mobile specific features of webpages. niques described by Canali et al. [20]. Canali et al.
performed an analysis of 15,000 webpages consisting
of about 5,000 known webpages launching drive-by-
speed and performance of Cantina given in related downloads. The contenders of the comparison were
research papers [56], [59]. Table 3 summarizes the com- then existing tools detecting malicious JavaScript [24],
parisons. Cantina provides better true positive rate and [37], [53], drive-by-downloads [20] and spam URLs [39].
comparable false positive rate against kAYO. However, Table 4 and Table 5 show the comparison of performance
there are several drawbacks to Cantina. First, Cantina’s of kAYO with each of these techniques. kAYO provides
depends on the results of Google’s search engine. More- the lowest false positive rate over an evaluation set twice
over, Cantina assumes that every webpage not ranked as large as the one used by other techniques as shown in
by Google is malicious. We argue that this is a strong Table 4. Moreover, kAYO’s feature extraction process is
assumption and might lead to a high false positive 10 times faster than the fastest existing technique [53]
rate. Additionally, this methodology prevents Cantina and classification process is 100 times faster than the
from analyzing webpages not visited by Google’s Safe fastest existing technique [39]. Finally, all the existing
Browsing kAYO does not depend on any external tools techniques are focused on desktop threats, whereas,
and can detect malicious webpages missed by Google. kAYO focuses on mobile specific threats. Accordingly,
Second, kAYO’s feature extraction process is over two had we been able to run these tools over our dataset,
orders of magnitude faster than Cantina. On an average, they would have performed more poorly.
kAYO takes 0.016 seconds to extract the features of a Need for mobile specific techniques: Because neither
webpage and Cantina takes 2.82 seconds. We argue that Cantina nor Prophiler were made available to us, we
this improvement in the speed of analyzing webpages performed an experiment to demonstrate the need for
makes kAYO more usable than Cantina in real-time. new mobile specific models. Intuitively, due to the dis-
Finally, Cantina works only on webpages written in parity in the same static features when measured on mo-
English due to its heuristic features whereas kAYO can bile and desktop webpages (as discussed in Section 3),
1536-1233work
(c) 2016with
IEEE. Personal use is permitted,
webpages writtenbut republication/redistribution
in any language. requires IEEE permission.
andSee http://www.ieee.org/publications_standards/publications/rights/index.html
the emergence of new mobile specific features, for more information.
a
10
model trained on desktop webpages will not generate
precise results for mobile webpages. Note that we are
0.4
not making claims about the exact performance of each
system against our dataset; rather, we are attempting
Pearson Correlation Coefficient (PCC)

to demonstrate (in the absence of either being made 0.2
available and in good faith) that previously published
techniques not considering the changes and new features
0
identified in this work perform significantly worse than
our own when analyzing malicious mobile webpages.
For this experiment, we created a training dataset of -0.2
desktop webpages and a test datatset of mobile web-

pages. We statically crawled Alexa top 10,000 webpages -0.4
to the two links deep using the desktop Internet Explorer
Newly identified features
browser version 9.0 for Windows 7. We obtained the Features adopted from other techniques
-0.6
desktop malicious webpages by monitoring public black- 0 5 10 15 20 25 30 35 40 45
lists [2], [3], [5] and crawling live URLs two links deep. Feature Number (44 in total)
We verified ground truth of these URLs using Google
Safe Browsing and VirusTotal. We randomly shuffled Fig. 5: The Pearson Coefficient Correlation (PCC) of each
the webpages and chose 10,000 webpages while keeping of the features extracted in kAYO with the label (mali-
the proportion of benign and malicious webpages in the cious/benign), found using our evaluation dataset. Each
dataset equivalent to the mobile dataset described in point corresponds to the correlation of a feature with the
Section 4.2. We then created a test dataset of 1000 mobile label. In total there are 44 points corresponding to the 44
benign and malicious webpages by randomly selecting features of kAYO, including the newly identified features
URLs from the larger dataset described in Section 4.2. and the ones adopted from existing techniques. The Y
We extracted 33 out of the 44 static features in kAYO value of each point depicts the predictive power of the
from each webpage in the desktop and mobile datasets. corresponding feature i.e. PCC. The greater the absolute
We disregarded the 11 new mobile features used in value of the PCC of a feature, the better predictive power
kAYO and instead focused our analysis on the 33 fea- of the feature. Note that all the PCC values are non-
tures previously used in similar desktop static tech- zero implying that every feature in kAYO’s feature set is
niques. We note that the goal of this experiment is not significant and impacts the result of classification.
to extract all desktop relevant features used earlier, but Proprietary techniques: We note that Google’s Safe
demonstrate that a model trained on features extracted Browsing tool is proprietary and no information can be
from desktop webpages does not perform well when retrieved about its performance. Moreover, Google Safe
applied to mobile webpages. We believe that these 33 Browsing is not enabled on mobile browsers at present.
features accurately represent the static features used Therefore, even if Google’s tool identifies a mobile
in earlier techniques to detect malicious desktop web- webpage as bad, Chrome and Firefox mobile browser
pages [20], [24], [28], [37], [39], [53], [56], [59]. users do not benefit from this information unlike the
We used logistic regression with regularization to train desktop users of these browsers. Even though enabling
a model on the desktop webpage dataset and tested Safe Browsing on mobile is an engineering effort, we
the model on the mobile dataset. Figure 4 shows the later demonstrate the importance of employing a mobile
results of our experiments. Ex1 shows that using 33 specific technique such as kAYO.
features, we achieved 77% accuracy in training on desk- Significance of kAYO’s feature set: It is important
top webpages. However, when the parameters obtained to observe that kAYO’s feature set has been carefully
from this model were applied to the mobile dataset, created to ensure relevance to mobile webpages and neg-
the accuracy reduced significantly to 40%. The difference ligible extraction time. We experimentally demonstrate
between the accuracy of the training and testing dataset is the significance of kAYO’s features using the Pearson
the important comparison metric in this experiment as it product-moment Correlation Coefficient (PCC). PCC is a
demonstrates the inability of previous desktop-only models to measure of the linear dependence between two variables
accurately characterize mobile webpages. Ex2 simply shows giving a value between +1 and −1 inclusive [52]. In other
kAYO’s results (discussed in Section 5.2) of training words, PCC provides information about the predictive
and testing on mobile webpages considering mobile power of a feature over the classification result. The
specific features. Both training (91%) and testing (90%) larger the absolute value of the PCC of a feature, the
dataset accuracies improve notably. More importantly, more its predictive power. For example, a feature with
the accuracies of the training and testing datasets in Ex2 PCC -0.6 is a better predictor of whether a webpage is
are comparable unlike those in Ex1. These results con- malicious than a feature with PCC 0.21. It is important to
firm our intuition that mobile specific static techniques note that identifying features with very high PCC values
are necessary - without using the new mobile features, is extremely difficult given the hundreds of different
previously proposed techniques perform poorly.
1536-1233 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html
components of webpages and the diversity of threats. for more information.
11
We find the PCC between each feature in kAYO’s existing services (e.g., pages including spam phone num-
feature set and the label (benign/malicious), from bers). Second, building an extension allows immediate
the test set used for evaluation. Intuitively, if kAYO’s use of our technique. We discuss other potential avenues
features are significant, then the absolute value of the of adopting kAYO in Section 7.3.
PCC of each feature with the label must be non-zero. We developed a browser extension using kAYO for
Figure 5 shows the plot of the PCC of each of the 44 Firefox Mobile9 , which informs users about the mali-
features of kAYO with the label. The circles show the ciousness of the webpages they intend to visit. Our
PCC of the newly identified features of kAYO and the Xs goal was to build an extension that runs in real-time.
depict the PCC of features adopted from earlier works. Therefore, instead of running the feature extraction pro-
As seen in the Figure, all the PCC values are non-zero, cess in a mobile browser, we outsourced the processing
implying that every feature in kAYO is significant. intensive functions to a backend server. Figure 6 shows
the architecture of the extension. User enters the URL
Comparison with existing browser tools: Browser ex- he wants to visit in the extension toolbar. The extension
tensions and plugins help protect users from visiting then opens a socket and sends the URL and user agent
malicious websites. The most prevalent threat on the mo- information to kAYO’s backend server over HTTPS. The
bile web at present is phishing. Therefore, we surveyed server crawls the mobile URL and extracts static features
the most popular anti-phishing Firefox desktop exten- from the webpage. This feature set is input to kAYO’s
sions for comparison with kAYO. These 33 extensions trained model, which classifies the webpage as malicious
were selected by searching for the keyword ‘phishing’ or benign. The output is then sent back to the user’s
on the Firefox extension store. Most of the extensions browser in real-time. If the URL is benign according to
were certificate verifiers, password protectors or file kAYO, the extension renders the intended webpage in
protectors. We did not find any extensions performing the browser automatically. Otherwise, a warning mes-
content-based static analysis. We disregarded extensions sage is shown to the user recommending them not to
that were built only for one specific website (e.g., FB visit the URL.
Phishing Protector and LibertyGuard) or were no longer Users of the extension will browse both mobile specific
supported (e.g. Nophish). We then chose the top five ex- and desktop webpages since not all websites offer a
tensions (Anti Phishing 1.0, DontPhishMe, Netcraft Tool- mobile specific version. Recall that being a mobile spe-
bar, PhishTank SiteChecker and Phish Tester) based on cific technique, kAYO does not perform well on desktop
the number of users for further analysis. We randomly webpages. Consequently, processing all pages of interest
selected a set of 10 known malicious URLs from our through kAYO might output incorrect results for desktop
dataset and queried each tool with the URLs. PhishTank webpages. To address this problem, the backend server
SiteChecker simply queried PhishTank and returned the first detects whether the intended webpage is mobile
result, detecting three of the 10 URLs. Netcraft detected specific using the same method explained in Section 4.2.
three out of the 10 URLs as well, two of which were The webpage is processed by kAYO only if it is mobile.
also detected by Phish Advisor. Anti Phishing 1.0 de- The desktop webpages are analyzed using Google Safe
tected one URL. Phish Tester and DontPhishMe did not Browsing. Note that any other existing technique for de-
generate any results. tecting desktop malicious webpages can be used instead
We also tested the freely available trial version of the of Google Safe Browsing.
Lookout safe browsing tool [4]. Lookout is one of the We performed manual analysis of 100 randomly se-
most popular security applications available for mobile lected URLs (90 benign and 10 malicious) from our test
devices. This tool protects users of the Android mobile dataset and measured the performance of kAYO in real-
and the Chrome mobile browsers from phishing scams time. On an average, an output was rendered in 829
and malicious links on the mobile web. We browsed the ms on average from the time the user entered a URL
same 10 known malicious URLs from both the Android in kAYO’s toolbar. We argue that the good performance
mobile and Chrome mobile browser on a device running is due to careful selection of quickly extractable features
the Android 4.0 operating system. We were presented with and lower complexity of mobile webpages as compared
alerts for only two out of the 10 URLs by Lookout, while kAYO to desktop webpages. The maximum delay in result gen-
detected eight out of the 10 webpages. eration was seen in scraping the input webpage from its
Given the paucity of a working extension to detect respective server. Caching already scraped webpages can
different threats on mobile webpages, and the unavail- reduce this delay, as we demonstrated experimentally,
ability of signature-based tools such as Google Safe by an average of 85%. Figure 7 shows a screen shot of
Browsing for mobile browsers, we developed a mobile our browser extension at work. We plan to make the
browser extension using kAYO. extension available publicly post publication.
6 B ROWSER E XTENSION
Building a browser extension based on kAYO adds value
9. Firefox Mobile is one of the very few mainstream mobile browsers
for two reasons. First, the mobile specific design of kAYO that support browser extensions. Similar extensions can easily be
1536-1233enables
(c) 2016 IEEE. Personal use isof
detection permitted,
newbutthreats
previouslyrequires IEEE permission.
unseen See http://www.ieee.org/publications_standards/publications/rights/index.html
by developed on other mobile browsers once supported. for more information.
12
2) URL=m.example.com, 3) HTTP request
User agent string m.example.com
kAYO server Internet
8) Output 4) HTTP response

1) User enters kAYO's toolbar: 5) Is page mobile?
Benign/malicious
URL m.example.com
Yes No
6) mobile
page
Firefox Mobile 7) Output Extract 7) Output
Browser features 6) desktop
page
Google Safe
kAYO's Model
Browsing
Fig. 6: Architecture of the mobile browser extension based on kAYO. User enters the URL he wants to visit in the
extension toolbar and receives a response in real-time from our backend server about the maliciousness of the URL.
If the URL is benign according to kAYO, the page of interest is rendered in the browser. Otherwise, the user is
shown a warning message to not visit the URL.
(a) (b)
kAYO
extension
toolbar
Input parameters:
URL and user agent
Warning
generated
by kAYO
Crawl input
webpage
Proceeding
options
Result
(c) (d)
Fig. 7: (a): Chrome desktop browser informing the user of a potentially malicious webpage. The webpage is a
known mobile phishing webpage. (b): The same webpage when rendered on the Chrome mobile browser, whose
users are the real targets, does not provide any warning. (c): kAYO extension running on the Firefox mobile browser
detects the webpage as malicious and warns the user. (d): Screenshot of command line processing at the extension
server.
7 D ISCUSSION evaluating kAYO. However, such dynamic analysis tech-

niques execute webpages on desktop browsers running
kAYO detected a number of malicious webpages in the on virtual machines and miss mobile specific threats. To
wild that were not found by existing techniques. We validate our intuition, we performed manual analysis on
investigate these webpages in detail and then describe webpages that were identified as malicious by kAYO,
the limitations and future work of kAYO. but were tagged as benign by Google Safe Browsing
and VirusTotal. Performing manual in-depth analysis for
all webpages classified as malicious by kAYO was not
7.1 Investigating False Positives feasible. Therefore, we chose a random subset of 100
We used Google Safe Browsing and VirusTotal for es- URLs from the false positives obtained by running kAYO
1536-1233tablishing
ground use is truth
permitted,of
but our
dataset for training requires IEEEand onSeethe
permission. test dataset in Section 5.2.
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
13
We verified each of the 100 URLs by visiting them given the large set of webpages collected, we believe that
manually from an Android mobile browser version 4.0. our dataset is a representative cross section. Finally, the
We found 10 URLs to be suspicious. These 10 URLs focus of this work was on mobile webpages designed
contained survey pages to win iPads or Visa gift cards, for phones. We defer the analysis of webpages built for
uncommon online electronic equipment stores and stores tablets to future work.
selling health-related products. Most URLs did not have kAYO’s features reflect current trends in mobile ma-
a Google page rank. One particular webpage prompted licious webpages. The potential of bad activity in the
a user to download a binary file masquerading as a flash mobile web could increase yet further over time. kAYO’s
update. We downloaded and found the binary file to be feature set and model will need to be updated, according
malicious by querying VirusTotal. Another webpage had to the new threats faced by the mobile web in the
a known bank fraud phone number prefixed with the future. However, such updates are necessary in all static
tel: API. Two out of the 10 suspicious URLs were also techniques that aim to detect new threats.
marked as suspicious by the Lookout safe browsing tool. In-depth dynamic analysis of webpages may provide
We have reported these 10 webpages to PhishTank. Five additional important details. However, because such ap-
out of the 10 URLs that we submitted have already been proaches incur significantly higher costs, this approach
validated by PhishTank and marked as malicious. All 10 conflicts with our design goal of creating a real-time
URLs went offline within one week of submission. This detector. Accordingly, we leave the significant challenge
further strengthens our intuition since phishing URLs of efficient operation of such tools to future work.
are usually short lived [31], [44]. We note that PhishTank Using signature based blacklist approaches such as
might not validate some of the URLs we submitted. This Google Safe Browsing might improve the performance
is because, PhishTank’s validation process is based on of kAYO’s browser extension. A blacklist can be synchro-
crowdsourcing and threats such as known bank fraud nized with kAYO’s extension server and enforced locally.
numbers on a website might not be detected without Although such techniques might reduce the average
the availability of tools such as Pindrop PRS [7]. delay in page rendering, they will also preclude from
protection against webpages that change dynamically
7.2 Cross-channel threats defeating kAYO’s goal of real-time evaluation. We plan
to investigate performance enhancing designs that pre-
We found 173 unique mobile webpages in our dataset serve real-time evaluation in future work.
(including training and testing) that hosted API prefixed
known fraudulent phone numbers and were all tagged
as benign by Google Safe Browsing and VirusTotal.
These numbers are associated with a number of known 8 C ONCLUSION
financial fraud campaigns against a number of different Mobile webpages are significantly different than their
major US-based institutions) according to our queries to desktop counterparts in content, functionality and lay-
the Pindrop Security PRS [7]. These results show that out. Therefore, existing techniques using static features
adversaries have begun to exploit such cross-channels of desktop webpages to detect malicious behavior do not
(e.g., create a phishing webpage and include a fraud work well for mobile specific pages. We designed and
phone number) to attack mobile users. Moreover, these developed a fast and reliable static analysis technique
experiments suggest that the false positive rate of kAYO called kAYO that detects mobile malicious webpages.
might be lower in reality, given that mechanisms fail to classify kAYO makes these detections by measuring 44 mobile
such pages as malicious. We intend to conduct a further relevant features from webpages, out of which 11 are
analysis of such attacks in our future work. newly identified mobile specific features. kAYO provides
90% accuracy in classification, and detects a number
of malicious mobile webpages in the wild that are not
7.3 Limitations and Future Work
detected by existing techniques such as Google Safe
The expected concerns of kAYO are similar to those of Browsing and VirusTotal. Finally, we build a browser
existing malicious website detection tools using static extension using kAYO that provides real-time feedback
analysis. Evasion by mimicking the features we consider to users. We conclude that kAYO detects new mobile
to be good indicators of a legitimate webpage can be specific threats such as websites hosting known fraud
used to defeat kAYO. However, our comprehensive set numbers and takes the first step towards identifying new
of features makes it harder to evade kAYO, as seen from security challenges in the modern mobile web.
our evaluation over a large dataset.
We statically crawled the top million websites of
Alexa. Therefore, we did not collect webpages that use ACKNOWLEDGMENTS
JavaScript to detect and redirect to the mobile webpage. The authors would like to thank EldoS for access to their
We have also missed the mobile webpages represented driver framework and VirusTotal for providing malware
by ways other than the ones used by the top 1,000 samples. This work was supported in part by the US
websites. We do not make any claims about gathering all National Science Foundation under grant numbers CNS-
1536-1233mobile
(c) 2016 IEEE. Personal use isfrom
webpages permitted, but republication/redistribution
Alexa top one million. requires IEEE permission.
However, See http://www.ieee.org/publications_standards/publications/rights/index.html
1464087 and CNS-1464088. Any opinions, findings, for more
andinformation.
14
conclusions or recommendations expressed in this mate- [32] A. Ikinci, T. Holz, and F. Freiling. Monkey-spider: Detecting malicious
websites with low-interaction honeyclients. In Proceedings of Sicherheit,
rial are those of the authors and do not necessarily reflect Schutz und Zuverlassigkeit, 2008.
the views of the NSF. [33] L. Invernizzi, S. Benvenuti, M. Cova, P. M. Comparetti, C. Kruegel, and
G. Vigna. Evilseed: A guided approach to finding malicious web pages. In
Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012.
R EFERENCES [34] P. Kolari, T. Finin, and A. Joshi. Svms for the blogosphere: Blog identi-
fication and splog detection. In Proceedings of AAAI Spring Symposium on
[1] Gnu octave: high-level interpreted language. http://www.gnu.org/ Computational Approaches to Analysing Weblogs, 2006.
software/octave/. [35] A. Le, A. Markopoulou, and M. Faloutsos. Phishdef: Url names say it all.
[2] hphosts, a community managed hosts file. http://hphosts.gt500.org/hosts. In Proceedings of IEEE International Conference on Computer Communications
txt. (INFOCOM), 2011.
[3] Joewein.de LLC blacklist. http://www.joewein.net/dl/bl/dom-bl-base. [36] C. Lever, M. Antonakakis, B. Reaves, P. Traynor, and W. Lee. The core of
txt. the matter: Analyzing malicious traffic in cellular carriers. In Proceedings
[4] Lookout. https://play.google.com/store/apps/details?hl=en&id=com. of Network and Distributed System Security Symposium (NDSS), 2013.
lookout. [37] P. Likarish, E. Jung, and I. Jo. Obfuscated malicious javascript detection
[5] Malware Domains List. http://mirror1.malwaredomains.com/files/ using classification techniques. In Proceedings of Malicious and Unwanted
domains.txt. Software (MALWARE), 2009.
[6] Phishtank. http://www.phishtank.com/. [38] C. Ludl, S. Mcallister, E. Kirda, and C. Kruegel. On the effectiveness of
[7] Pindrop phone reputation service. http://pindropsecurity.com/ techniques to detect phishing sites. In Proceedings of the 4th International
phone-fraud-solutions/phone reputation service prs/. Conference on Detection of Intrusions and Malware, and Vulnerability Assessment
[8] Scrapy — an open source web scraping framework for python. http:// (DIMVA), 2007.
scrapy.org/. [39] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Beyond blacklists: Learning
[9] VirusTotal. https://www.virustotal.com/en/. to detect malicious web sites from suspicious URLs. In Proceedings of the
[10] Google developers: Safe Browsing API. https://developers.google.com/ SIGKDD Conference, 2009.
safe-browsing/, 2012. [40] D. K. McGrath and M. Gupta. Behind phishing: an examination of phisher
[11] Alexa, the web information company. http://www.alexa.com/topsites, modi operandi. In Proceedings of the 1st Usenix Workshop on Large-Scale
2013. Exploits and Emergent Threats (LEET), 2008.
[12] dotmobi. internet made mobile. anywhere, any device. http://dotmobi. [41] E. Medvet, E. Kirda, and C. Kruegel. Visual-similarity-based phishing
com/, 2013. detection. In Proceedings of International Conference on Security and Privacy
[13] C. Amrutkar, K. Singh, A. Verma, and P. Traynor. VulnerableMe: Measuring in Communication Netowrks (SecureComm), 2008.
systemic weaknesses in mobile browser security. In Proceedings of the [42] Y. min Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and
International Conference on Information Systems Security (ICISS), 2012. S. King. Automated web patrol with strider honeymonkeys: Finding web
[14] C. Amrutkar, P. Traynor, and P. C. van Oorschot. Measuring SSL indicators sites that exploit browser vulnerabilities. In Proceedings of the Networking
on mobile browsers: Extended life, or end of the road? In Proceedings of the and Distributed Systems Security (NDSS), 2006.
Information Security Conference (ISC), 2012. [43] A. Mohaisen and O. Alrawi. AV-Meter: An Evaluation of Antivirus Scans
[15] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster. Building and Labels. In Proceedings of the 4th International Conference on Detection of
a dynamic reputation system for DNS. In Proceedings of the 19th USENIX Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2014.
Conference on Security (SECURITY), 2010. [44] T. Moore, R. Clayton, and H. Stern. Temporal correlations between spam
[16] V. A. Balasubramaniyan, A. Poonawalla, M. Ahamad, M. T. Hunter, and and phishing websites. In Proceedings of the 2nd USENIX conference on Large-
P. Traynor. Pindr0p: using single-ended audio features to determine call scale exploits and emergent threats: botnets, spyware, worms, and more (LEET),
provenance. In Proceedings of the 17th ACM Conference on Computer and 2009.
Communications Security (CCS), 2010. [45] A. Moshchuk, T. Bragin, S. D. Gribble, and H. M. Levy. A crawler-based
[17] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi. EXPOSURE : Finding study of spyware on the web. In Proceedings of Network and Distributed
malicious domains using passive DNS analysis. In Proceedings of the 18th System Security Symposium (NDSS), 2006.
Annual Network and Distributed System Security Symposium (NDSS), 2011. [46] J. Nazario. Phoneyc: a virtual client honeypot. In Proceedings of the
[18] M. Boodaei. Mobile users three times more vulnerable to phishing attacks. 2nd USENIX conference on Large-scale Exploits and Emergent Threats: botnets,
http://www.trusteer.com/blog/mobile-users-three- spyware, worms, and more (LEET), 2009.
times-more-vulnerable-to-phishing-attacks, 2011. [47] J. Nazario and T. Holz. As the net churns: Fast-flux botnet observations. In
[19] M. Butkiewicz, Z. Wu, S. Li, P. Murali, V. Hristidis, H. V. Madhyastha, Proceedings of 3rd International Conference on Malicious and Unwanted Software
and V. Sekar. Enabling the transition to the mobile web with websieve. In (MALWARE), 2008.
Proceedings of the 14th Workshop on Mobile Computing Systems and Applications [48] E. Passerini, R. Paleari, L. Martignoni, and D. Bruschi. Fluxor: Detecting
(HotMobile), 2013. and monitoring fast-flux service networks. In Proceedings of the 5th Inter-
[20] D. Canali, M. Cova, G. Vigna, and C. Kruegel. Prophiler: a fast filter for national Conference on Detection of Intrusions and Malware, and Vulnerability
the large-scale detection of malicious web pages. In Proceedings of the 20th Assessment (DIMVA), 2008.
International Conference on World Wide Web (WWW), 2011. [49] R. Perdisci, I. Corona, D. Dagon, and W. Lee. Detecting malicious flux
[21] S. Chakradeo, B. Reaves, P. Traynor, and W. Enck. MAST: Triage for Market- service networks through passive analysis of recursive DNS traces. In
scale Mobile Malware Analysis. In Proceedings of the ACM Conference on Proceedings of Annual Computer Security Applications Conference (ACSAC),
Security and Privacy in Wireless and Mobile Networks (WiSec), 2013. 2009.
[22] R. Dhamija, J. D. Tygar, and M. Hearst. Why phishing works. In Proceedings [50] G. Podjarny. Mobile web performance op-
of the SIGCHI Conference on Human Factors in Computing Systems, 2006. timization. http://www.slideshare.net/blazeio/
[23] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri. A study of Android mobile-web-performance-optimization-tips-and-tricks.
application security. In Proceedings of the 20th USENIX Security Symposium, [51] N. Provos, P. Mavrommatis, M. A. Rajab, and F. Monrose. All your
2011. iframes point to us. In Proceedings of the 17th USENIX conference on Security
[24] B. Feinstein and D. Peck. Caffeine monkey: Automated collection, detection (SECURITY), 2008.
and analysis of malicious javascript. In Proceedings of the Black Hat Security [52] L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation
Conference, 2007. coefficient. The American Statistician, 1988.
[25] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner. Android permissions [53] C. Seifert, I. Welch, and P. Komisarczuk. Identification of malicious web
demystified. In Proceedings of the 18th ACM conference on Computer and pages with static heuristics. In Telecommunication Networks and Applications
communications security, 2011. Conference, 2008.
[26] A. P. Felt and D. Wagner. Phishing on mobile devices. In Web 2.0 Security [54] F. Weimer. Passive DNS replication. 2005.
and Privacy (W2SP), 2011. [55] C. Whittaker, B. Ryner, and M. Nazif. Large-scale automatic classification
[27] A. P. Felt, H. J. Wang, A. Moshchuk, S. Hanna, and E. Chin. Permission of phishing pages. In Proceedings of the Networking and Distributed Systems
re-delegation: attacks and defenses. In Proceedings of the 20th USENIX Security (NDSS), 2010.
conference on Security, 2011. [56] G. Xiang, J. Hong, C. P. Rose, and L. Cranor. Cantina+: A feature-rich
[28] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In machine learning framework for detecting phishing web sites. ACM
Proceedings of the 16th International Conference on World Wide Web (WWW), Transactions on Information and System Security (TISSEC), 14(2), Sept. 2011.
2007. [57] C. Yue and H. Wang. Characterizing insecure javascript practices on the
[29] S. Gajek, A.-R. Sadeghi, C. Stüble, and M. Winandy. Compartmented web. In Proceedings of the 18th international conference on World Wide Web
security for browsers or how to thwart a phisher with trusted computing. (WWW), 2009.
In Second International Conference on Availability, Reliability and Security [58] B. Zdrnja, N. Brownlee, and D. Wessels. Passive monitoring of DNS
(ARES), 2007. anomalies. In Proceedings of the 4th International Conference on Detection of
[30] S. Garera, N. Provos, M. Chew, and A. D. Rubin. A framework for detection Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2007.
and measurement of phishing attacks. In Proceedings of the ACM workshop [59] Y. Zhang, J. I. Hong, and L. F. Cranor. Cantina: a content-based approach
on recurring malcode, 2007. to detecting phishing web sites. In Proceedings of the 16th international
[31] T. Holz, C. Gorecki, K. Rieck, and F. C. Freiling. Measuring and detecting conference on World Wide Web (WWW), 2007.
fast-flux service networks. In Proceedings of the Network and Distributed
1536-1233 (c) 2016 IEEE.Security
System PersonalSymposium
use is permitted, but republication/redistribution
(NDSS), 2008. requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Detecting Mobile Malicious Webpages in Real Time

Enviado por

Dados do documento

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Detecting Mobile Malicious Webpages in Real Time

Enviado por

Direitos autorais:

Formatos disponíveis

This article has been accepted for publication in a future issue of this journal, but has not been

Detecting Mobile Malicious Webpages

Index Terms—Mobile security, webpages, web browsers, machine learning.

1 I NTRODUCTION pages. kAYO uses static features of mobile webpages

Fraction of total webpages

Fraction of total webpages

True positive rate 97% 89%

Pearson Correlation Coefficient (PCC)

desktop webpages and a test datatset of mobile web-

8) Output 4) HTTP response

7 D ISCUSSION evaluating kAYO. However, such dynamic analysis tech-

Você também pode gostar