
SoftwareX 10 (2019) 100305

Contents lists available at ScienceDirect

SoftwareX
journal homepage: www.elsevier.com/locate/softx

Original software publication

NewsCompare - A novel application for detecting news influence in a country✩

Cristian Pop a, Alexandru Popa a,b,∗
a University of Bucharest, Romania
b National Institute of Research and Development in Informatics, Romania

Article info

Article history:
Received 20 May 2019
Received in revised form 22 July 2019
Accepted 31 July 2019

Keywords:
Web crawler
Web scraper
Text similarity
Social networks

Abstract

We present a new application, developed mostly from scratch, serving as a fast and efficient web crawler, with added network visualization and content analysis tools. It can be used to perform experimental research in a number of fields, including web graph analysis, basic text comparison or even testing out sociological theories, depending on the user’s inclination. A use case is provided, where the application is applied to Romanian news websites, from which several interesting observations can be drawn. The application itself and its code are released under a GPL license, and can be used by other researchers as-is (for use cases similar to our own), or expanded upon by interested developers.

© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Code metadata

Current code version: v1
Permanent link to code/repository used for this code version: https://github.com/ElsevierSoftwareX/SOFTX_2019_172
Code Ocean compute capsule: (not specified)
Legal Code License: GNU General Public License v3
Code versioning system used: git
Software code languages, tools, and services used: Java, PostgreSQL, Javascript
Compilation requirements, operating environments & dependencies: JDK 8+, npm
If available, link to developer documentation/manual: github.com/buxomant/NewsCompareBackend/blob/master/README.md
Support email for questions: cristi90ro@gmail.com

Software metadata

Current software version: 1.0
Permanent link to executables of this version: http://newscompare.tech
Legal Software License: GNU General Public License v3
Computing platforms/Operating Systems: web-based
Installation requirements & dependencies: (not specified)
If available, link to user manual (if formally published, include a reference to the publication in the reference list): github.com/buxomant/NewsCompareBackend/blob/master/README.md
Support email for questions: cristi90ro@gmail.com

✩ A detailed version of this manuscript is available on arXiv (Pop and Popa, 2019) [14].
∗ Corresponding author at: University of Bucharest, Romania.
E-mail addresses: cristi90ro@gmail.com (C. Pop), alexandru.popa@fmi.unibuc.ro (A. Popa).

https://doi.org/10.1016/j.softx.2019.100305
2352-7110/© 2019 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Motivation and significance

The topic of fake news has been in the collective consciousness for some time now, due to its alleged impact on swaying public opinion on important issues, going so far as to potentially influence election results [1] in some cases. We find entire articles devoted to studying its impact [2], and methods of detection [3]. While some of the conclusions in these articles may be merely tentative, there are still some hard-to-dispute facts we can start using as a basis. For instance, we know that more than two thirds of Americans report getting at least some of their news on social media according to a Pew Research study [4] from 2017. Worldwide, 48% of people surveyed reported believing a fake news story was real before finding out it was fake, according to an Ipsos report [5]. Interestingly, the same report finds that 63% of people are confident in their own ability to identify fake news, while only 41% are confident that the average person can do the same. Are people in general overly confident about themselves, or too cynical about others? Hard to say, but nevertheless an interesting idea to explore.

Enterprising research out there has already found insightful characteristics of fake news, with one paper going so far as to draw parallels between fake news and satire [6]. This could not be easily done without the appropriate technology to gather large quantities of data and to analyze it in new and creative ways. Taken to its logical conclusion, such research could eventually lead to heuristic algorithms able to detect and filter out fake news, a monumental breakthrough in and of itself. Our application aims primarily to simplify this gathering of data (for non-technical users), while also providing some tentative analysis tools to serve as a stepping stone in setting up even more complex tools. We have found there to be a relative scarcity of off-the-shelf solutions for web crawling (even fewer with integrated content analysis), and the products that do exist are either non-free or are written in interpreted languages (e.g. Ruby/Python), with comparatively slower performance. Our implementation is written nearly from scratch, and contains a host of low-level performance optimizations based on the current web crawling guidelines and literature. It feels relevant to note that frameworks such as Apache Nutch [7] do exist, which allow for developing web crawling solutions not quite at such a low level. However, we felt our implementation provides opportunity for far more in-depth customization and optimization. Being able to crawl and analyze 100+ websites’ worth of content in the space of an hour enables quick turnaround time and boosts the scope of potential research applications.

One potential application lies in examining how news sources disseminate their content, how this fits into their respective ecosystem, and how they continuously adapt in order to keep up a working business model; these are all intriguing subjects in their own right. We know from existing research that newspaper publishers are aggressively trying to expand into the digital realm, going as far as adopting a “digital first” approach, but the data shows they are still heavily reliant on print in terms of revenue [8]. Exclusively online news outlets on the other hand do not have the luxury of print to fall back on, so we expect them to make that much more of an effort in establishing a foothold in the online market to draw revenue from. This is actually supported by some of our findings; see Section 4.1 for a specific example.

Our application can also be used as a starting point for any number of future research endeavors, particularly in conjunction with other, tangentially related pieces of work. For instance, anyone interested in the implications of security within social networks and social engineering could easily start from existing research in this area [9–11]. It should not take much imagination to concoct experiments using our fast, efficient web crawler and content analyzer, while pointing it at any given grouping of websites, social network or otherwise. For instance, it can be used to create a “map” of a limited section of the internet, visualizing it as a directed graph. Or it can just as easily find a pattern of text similarities between a number of websites, with results grouped into convenient slices of time. This last factor could potentially help make the case of whether a website is drawing inspiration from others, or even which website is the likely origin of a particular piece of text (acting as an audit of sorts). With some tweaking, the text comparison component can also be co-opted into serving as a plagiarism detector.

2. Software description

2.1. Software architecture

The application is composed of two distinct components, as seen in the diagram below, with the source code of both made freely available on GitHub [12,13]:

[Figure: architecture diagram of NewsCompare. Back-end: web crawler, content indexer, similarity finder. API. Front-end: statistics, website list, website graph, similar content.]

The back-end does most of the heavy lifting, on every iteration performing the following steps in sequence:

1. Check the database for websites that have not been visited in the previous X hours (where X can be configured, default is 1).
2. Visit these websites, downloading the text content in its entirety.
3. Scan the text content for hyperlinks to visit recursively (either sub-pages of the original site, or external links). External websites are ignored if not found in the list of websites of interest.
4. Exhaustively visit the sub-pages from the previous step, downloading the text content in its entirety.
5. Create an index over all the text content collected, for quick text searches.
6. Scan the index for text similarities above a certain threshold.

The back-end also provides a set of API endpoints which are required by the front-end to provide its own functionality:

• Display a searchable list of all the websites in the database.
• Allow websites to be manually toggled as being “of interest”, meaning they will start being picked up on the next web crawling iteration.
• Display statistics about the web crawler’s overall progress.
• Visualize the hyperlinks found in the most recent iteration as a directed graph.
• Display a searchable list of all the text similarities found in the most recent iteration.

For more technical details we refer to the full version of our manuscript published on arXiv [14]. A fully featured implementation for our use case is available online at http://www.newscompare.tech [15] for demonstration purposes.
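Step 5 of the back-end iteration builds an index over the collected text for quick searches. The source does not say which indexing technology is used, so the Python sketch below shows one common technique, an in-memory inverted index, purely as an assumption. The function names (tokenize, build_index, search) and the sample pages are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs whose text contains it.

    pages is a dict of {url: text content}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

With such a structure a word lookup costs one dictionary access instead of a scan over every stored page, which is the property the similarity scan in step 6 relies on.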

2.2. Software set-up

The back-end compiles to an executable JAR file, requiring Java runtime 8+ to run. Source code compilation will additionally require Java development kit 8+. Whether running from source code or from compiled binaries, the environment also requires a PostgreSQL database, which can be set up using the set of scripts provided with the project.

When compiling the back-end, a number of configuration options can be provided by tweaking values in the “application.properties” file. The same configuration file also contains a default user name and password to connect to the database in order to make further configuration changes (as well as directly view and alter content).

The front-end is a Javascript-based single-page app (SPA), which does not have any special requirements other than a modern web browser. We use “webpack” to bundle all the code files and other resources, and this resulting bundle can be simply copied over to the target web server. Configuration options can be passed to webpack by tweaking values inside a JSON file provided, but this is currently limited to specifying the address of the back-end.

More comprehensive setup steps can be found in the “readme” files included in the GitHub project repositories [12,13].

3. Impact

What we try to add to the existing body of work is effectively a new solution in the form of a fast, efficient, mostly automated application able to gather vast amounts of information about websites, in as generic a form as possible. Our aim is to have an information dump that is easy to gather, and greatly simplifies the work of future researchers who need large sample sizes to interpret and derive conclusions from, according to various specific use cases. Some of these use case ideas have already been at least tentatively explored in articles mentioned in the introduction. We are confident that a good deal of research endeavors would have benefited from the kind of data dump we can now provide, and yet more research can benefit from it going forward.

We put our application to the test on an individual use case (i.e. Romanian news websites), as part of a continuous development cycle. A good deal of effort has been made in ensuring the application has more than just a niche appeal about it, and that it can be run reliably for long stretches without much manual interference. However, we also expect (and welcome) any constructive criticism and bug reports that get us closer to a flawless product. Despite not coming from a sociology background, we try our hand at interpreting the results we get from our use case, at least to the extent that we are aware of what characteristics to look for (see Section 4.2 for more details).

4. Use case: an analysis on Romanian news websites

To test our application, we define a particular use case by restricting the web crawling and analysis to a limited geographical area. We have made this choice largely in the interest of a fast turnaround time, to be able to make quick, experimental changes to our algorithms and study their impact immediately. We avoid making hardcoded assumptions, so that any tools we use can be repurposed with a different scope in mind, large or small.

Romania is actually an interesting choice in this respect, boasting a number of surprising, confusing, or ultimately even paradoxical characteristics. We consider the country to be in a rather unique position with regard to the relationship of Romanians with their fulfillment of basic needs and wants, one of which is news and media consumption. For an unassuming mid-sized country on the geographical fringe of the European Union, it boasts the highest average peak internet speeds in the European region, and is ranked at number 10 worldwide, according to Akamai’s 2016 report [16]. Coupled with the generally affordable access plans (both wired and wireless), it is no wonder that adoption rates are on a continuous upwards trend, reaching 81% in 2018 among the 16–74 year old population, according to Eurostat data [17]. While this is still slightly below the EU average (89% in the same Eurostat data set), if we extrapolate from existing trends we could speculate that the adoption rate should reach the EU average in due course.

A recent report by Reuters Institute [18] states that 88% of Romanians get their news online, 82% from TV, 67% from social media and just 18% from print. We can therefore expect that online news websites hold significant sway in shaping public opinion, considering the significant section of the populace relying on them. A similar report from the year before [19] finds that “the Romanian news environment is defined by intense competition for television and online audiences, sustained by understaffed newsrooms that struggle for financial survival”.

4.1. Graph analysis

We start our study by directing the application towards the top news websites by monthly popularity [20]. From there, we get a record of all links encountered, which can be later visualized as the edges of a directed graph. What we end up with is a fairly large grouping of websites, centered around a smaller core of websites that we are actively interested in (i.e. Romanian-language news websites). The grouping itself is interesting: if we lay out a visual representation of the graph we can see that the links are not formed at random, and are more heavily weighted towards some websites. On one end, there are very few sites with a great number of links, and on the other end many sites with a very small number of links. A linear plot makes it harder to notice this fact, so we need to create a log–log plot to make it stand out more in our distribution. Our plots seem to line up with conclusions from existing research targeting internet topology [21], claiming that we should expect to see a surprisingly simple set of power-laws that describe concisely skewed distributions of graph properties such as the node outdegree and indegree.

For instance, we can see a distinctly non-random pattern if we look at the entire set of Romanian-language websites (not just news) that our crawler has visited at least once. The plots below display data points for roughly 65,000 Romanian websites found in this manner.

[Figure: Number of Romanian websites by outdegree. Log–log plot; x-axis: outdegree (10^0 to 10^4); y-axis: number of websites (10^0 to 10^4).]
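The degree distributions shown in these log–log plots can be derived from a plain list of crawled links. The Python sketch below counts how many websites have each outdegree and indegree, and estimates the slope of the distribution on log–log axes with an ordinary least-squares fit. This is a rough heuristic chosen for illustration, not a rigorous power-law estimator and not the authors' own method; the function names degree_histogram and loglog_slope are hypothetical.

```python
import math
from collections import Counter


def degree_histogram(edges):
    """Count how many nodes have each outdegree and each indegree.

    edges is an iterable of (source, target) pairs. Nodes with degree 0
    are left out because they cannot be placed on a logarithmic axis.
    """
    edges = list(edges)
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    return Counter(out_deg.values()), Counter(in_deg.values())


def loglog_slope(histogram):
    """Least-squares slope of log10(count) against log10(degree).

    histogram maps degree -> number of nodes with that degree.
    """
    if len(histogram) < 2:
        raise ValueError("need at least two distinct degrees")
    xs = [math.log10(d) for d in histogram]
    ys = [math.log10(histogram[d]) for d in histogram]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A roughly straight, downward-sloping line on log–log axes (a negative slope) is what a power-law degree distribution would look like; the points of either histogram can be passed to any plotting tool.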

[Figure: Number of Romanian websites by indegree. Log–log plot; x-axis: indegree (10^0 to 10^4); y-axis: number of websites (10^0 to 10^4).]

By restricting the graph to include only Romanian news sites (plus direct neighbors), we can still see a hint of the same pattern developing, but since the sample size is much smaller, outliers are more noticeable. In this graph we have 1404 nodes (of which 157 are news websites) and 2450 edges.

[Figure: Number of Romanian news websites by outdegree. Log–log plot; x-axis: outdegree (10^0 to 10^3); y-axis: number of websites (10^0 to 10^2).]

[Figure: Number of Romanian news websites by indegree. Log–log plot; x-axis: indegree (10^0 to 10^3); y-axis: number of websites (10^0 to 10^2).]

Table 1
Romanian websites graph properties.

Graph            Diameter   Average distance   Average degree   Clustering coefficient   Maximum degree
Total websites   15         7.80               3.871            0.362                    38198
News websites    8          3.08               1.687            0.543                    1874

The full data sets can be taken from CSV files stored on GitHub [22]. This file format can be plugged straight into graph visualization software such as Gephi [23], and potentially others with some tweaking. Since this graph of news websites effectively represents a social network, it exhibits all the standard properties of one, e.g., power law degree distribution, a high clustering coefficient, and a small diameter (relative to the number of nodes in the graph). We calculate these values for our graph (see Table 1), and provide the data points behind the plots on GitHub [22] for verification.

It feels relevant to note that the highest-degree node (by an overwhelming margin) corresponds to hotnews.ro, an online-only news outlet. We could interpret this as a clearly focused effort to increase their footprint in the only arena where they are competing, but potentially also as a sort of underdog mentality, trying to make a disproportionate effort to get to the top and hold their position. It is likely that this strategy is paying off, given that they are now considered one of the largest Romanian news websites, pulling in around 250,000–300,000 unique users daily, more than 3 million monthly unique visitors and around 30 million monthly page views, according to stats measured by the Romanian nonprofit organization BRAT (Romanian Joint Industry Committee for Print and Internet) [24].

Other online news outlets that started out as more traditional media companies, like television (e.g. stirileprotv.ro and antena3.ro) or print (e.g. libertatea.ro), appear to serve more as an extension of their main business, seeming to make little more than a token effort in establishing an online presence. The majority of outward links from these websites simply seem to promote other websites owned by the same parent company, while links from online-only outlets are a bit more varied and balanced.

4.2. Social analysis based on the data

On the front of content comparisons, we have some interesting results showing similarities between distinct news outlets. We can see many instances where near-identical articles are displayed on different websites, with these websites sharing the same media group parent company (this is to be expected). If we filter out these cases, we then see instances where the application pinpoints articles about the same event, or on the same topic, with a fair rate of accuracy. While it would not be enough to conclusively pinpoint plagiarism, it is certainly a potential step in that direction, if we follow up on these leads (manually, for the time being). Gathering this kind of historic data can also be used to paint a picture of what kind of articles each particular outlet is liable to pick up on, and whether we can notice any consistent groupings of websites emerging from there. A list of similar articles we have found over the course of running the application can be found in CSV format on GitHub [22].
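To make the idea of finding near-identical articles above a threshold more concrete, here is a minimal Python sketch based on word shingles and Jaccard similarity. The source does not state which similarity measure the application uses, so this technique, the default threshold of 0.5, the shingle size of 3 and the function names (shingles, jaccard, similar_pairs) are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(articles, threshold=0.5, size=3):
    """Return (url_1, url_2, score) for article pairs at or above the threshold.

    articles is a dict of {url: text content}.
    """
    sets = {url: shingles(text, size) for url, text in articles.items()}
    pairs = []
    for u1, u2 in combinations(sorted(sets), 2):
        score = jaccard(sets[u1], sets[u2])
        if score >= threshold:
            pairs.append((u1, u2, score))
    return pairs
```

Comparing every pair of articles is quadratic in the number of articles, so a sketch like this only suits small batches; pairs from websites owned by the same parent company could then be filtered out by comparing the hosts of the two URLs.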
To address fake news specifically, a recent report from Facebook [25] announces they have undertaken efforts to remove what they call “Coordinated Inauthentic Behavior”, i.e. pages that engage in manipulative behavior towards users on their platform. This is of particular interest to us, since some of these pages pose as Romanian news sources, which fits nicely into our use case. While the Facebook pages are no longer available, their associated websites are still alive and kicking: destanga.ro, perele.ro, antifakenews.ro, momentulzero.ro. These websites have not been discovered organically by our web crawler, despite it having seen around 96,000 distinct URLs thus far, which would indicate that there are no links pointing to them at all in our entire data set. Out of curiosity, we add them to our list of target websites to see what we can learn from them, if anything. What we find is that they are largely isolated nodes in our graph, having very few distinct outward links, all of them pointing only towards Google, Facebook, or Wordpress.

Our text comparison component was able to find only very few matches involving these 4 websites over several runs, all of them between momentulzero.ro and the ironically titled antifakenews.ro. This is a tentative indicator that at least some of these sites (labeled as misleading and manipulative by Facebook) are either coming from the same source or have the same goal in mind. Taking just a cursory glance at some of the articles served, we can see that they are quite short (around 500–1000 characters on average), and have no citations of any kind, even when alleging to use a direct quote from a particular person or institution (confirmed by our web crawler being unable to find any hyperlinks outside of social media). These are all good heuristic indicators that seem to support Facebook’s conclusions in this particular case, and potentially lead us to other examples in need of a closer look.

5. Conclusions

Throughout our inquiry, we manage to delve deep into the innards of our target websites, and glean some fairly intimate knowledge regarding their architecture and contents. Some of our expectations get challenged along the way, we might arrive at some surprising conclusions, and oftentimes the issues and challenges we come across can be particularly frustrating to get through, but still yield satisfying results. While we cannot expect groundbreaking results at every turn, or a “smoking gun” behind every corner, we trust that our application is capable of doing great things in capable hands. The amount of time saved by automating away cumbersome tasks empowers us to look at an increasingly larger picture, at a fine resolution. Sifting through this picture to find occasional nuggets of meaning can become a rewarding task in and of itself.

The current list of features and functionality included in the application is representative of the ideas we came up with, both on our own and by studying existing research, all while timeboxing the implementation to prevent “feature creep”, so that the project does not drag on for many months or even years. We expect most of the future work involved will be around adding new statistics, reports and visualizations to the front-end, making it more friendly to people coming from a non-tech background. Barring some unforeseen revolutionary idea, the resulting data set gathered by the back-end component should be generic enough to be molded to match most reporting needs.

As mentioned by Marres and Weltevrede [26], “it would be a mistake to approach scrapers as if they were stable, stand-alone machines: scrapers come in and fall out of use; they work, and then they no longer work”. We can certainly note that the stability of a particular piece of software is correlated with the amount of time spent bug-fixing, debugging and generally testing through use. To that end, making the app available to the public as a generic tool is probably the best way to find and fix the more glaring issues and omissions. After some growing pains, we expect to emerge with increasingly robust and battle-hardened versions of code, though some maintenance is likely to be required on a semi-regular basis, in case entirely breaking changes start to become widely adopted by target websites.

Declaration of competing interest

We wish to confirm that there are no known conflicts of interest associated with this publication.

Acknowledgment

This work was supported by project PN 19 37 04 01 “New solutions for complex problems in current ICT research fields based on modeling and optimization”, funded by the Romanian Core Program of the Ministry of Research and Innovation (MCI), Romania, 2019–2022.

References

[1] Allcott H, Gentzkow M. Social media and fake news in the 2016 election. J Econ Perspect 2017;31:211–36. http://dx.doi.org/10.1257/jep.31.2.211.
[2] Lazer DMJ, Baum M, Benkler Y, Berinsky AJ, Greenhill KM, Menczer F, Metzger MJ, Nyhan B, Pennycook G, Rothschild D, Schudson M, Sloman S, Sunstein C, Thorson EA, Watts DJ, Zittrain JL. The science of fake news. Science 2018;359:1094–6. http://dx.doi.org/10.1126/science.aao2998.
[3] Conroy NJ, Rubin VL, Chen Y. Automatic deception detection: Methods for finding fake news. In: Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community. American Society for Information Science; 2015, p. 82:1–4, URL http://dl.acm.org/citation.cfm?id=2857070.2857152.
[4] Shearer E, Gottfried J. News use across social media platforms 2017. 2017, URL http://www.journalism.org/2017/09/07/news-use-across-social-media-platforms-2017/. Online [Accessed 16 February 2019].
[5] Ipsos. Fake news – Ipsos perils of perception report. 2018, URL https://www.ipsos.com/en-au/fake-news-ipsos-perils-perception-report. Online [Accessed 16 February 2019].
[6] Horne BD, Adali S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. 2017, CoRR abs/1703.09398, arXiv:1703.09398.
[7] Apache Software Foundation. Apache Nutch. 2019, URL https://nutch.apache.org/. Online [Accessed 25 February 2019].
[8] Myllylahti M. Does digital bring home the bacon? San Diego: International Communication Association; 2017.
[9] Gupta B. [CFC] Computer and Cyber Security: Principles, Algorithm, Applications and Perspectives. 2017.
[10] Plageras A, Psannis K, Stergiou C, Wang H, Gupta BB. Efficient IoT-based sensor BIG data collection-processing and analysis in smart buildings. Future Gener Comput Syst 2017;82. http://dx.doi.org/10.1016/j.future.2017.09.082.
[11] Gupta B, Agrawal D, Yamaguchi S. Handbook of research on modern cryptographic solutions for computer and cyber security. 2016, http://dx.doi.org/10.4018/978-1-5225-0105-3.
[12] Pop C. Newscompare backend. 2019, URL https://github.com/buxomant/NewsCompareBackend. Online [Accessed 25 February 2019].
[13] Pop C. Newscompare frontend. 2019, URL https://github.com/buxomant/NewsCompareFrontend. Online [Accessed 25 February 2019].
[14] Pop C, Popa A. NewsCompare - a novel application for detecting news influence in a country. 2019, CoRR abs/1904.00712, arXiv:1904.00712.
[15] Pop C. Newscompare. 2019, URL http://newscompare.tech/. Online [Accessed 18 March 2019].
[16] Akamai Technologies. Q3 2016 State of the Internet – Connectivity Report. Tech. rep., Akamai Technologies, Inc; 2016, URL https://www.akamai.com/us/en/multimedia/documents/state-of-the-internet/q3-2016-state-of-the-internet-connectivity-report.pdf.
[17] Eurostat. Level of internet access - households %. 2019, URL https://ec.europa.eu/eurostat/tgm/graph.do?tab=graph&plugin=1&language=en&pcode=tin00134&toolbox=type. Online [Accessed 16 February 2019].
[18] Reuters Institute. Romania - Reuters institute digital news report 2018. 2019, URL http://www.digitalnewsreport.org/survey/2018/romania-2018/. Online [Accessed 25 February 2019].
[19] Reuters Institute. Romania - Reuters institute digital news report 2017. 2019, URL http://www.digitalnewsreport.org/survey/2017/romania-2017/. Online [Accessed 25 February 2019].
[20] trafic.ro. Top siteuri stiri/massmedia dupa vizitatori pe luna. 2019. URL http://www.trafic.ro/vizitatori/top-siteuri-stiri-massmedia/luna. Online [Accessed 16 February 2019].
[21] Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. SIGCOMM Comput Commun Rev 1999;29(4):251–62. http://dx.doi.org/10.1145/316194.316229, URL http://doi.acm.org/10.1145/316194.316229.

[22] Pop C. Newscompare result data sets. 2019, URL https://github.com/buxomant/NewsCompareBackend/tree/master/resources. Online [Accessed 25 February 2019].
[23] Gephi Consortium. Gephi. 2019. URL https://gephi.org/. Online [Accessed 25 February 2019].
[24] Studiul de Audienţă şi Trafic Internet (SATI). 2019. www.hotnews.ro. URL https://www.brat.ro/sati/site/hotnews-ro-1/. Online [Accessed 3 March 2019].
[25] Facebook. Removing Coordinated Inauthentic Behavior From the UK and Romania. 2019. URL https://newsroom.fb.com/news/2019/03/removing-cib-uk-and-romania/. Online [Accessed 25 February 2019].
[26] Marres N, Weltevrede E. Scraping the social: Issues in live research. J Cultural Econ 2012.
