Last week, the ICIJ publicly released data from its most recent year-long investigation into the
offshore industry, known as the Paradise Papers. In the last few weeks since the ICIJ announced their
investigation, we've seen many reports being published covering activities of companies like Nike,
Apple, and the Queen of England's estate, and connections of Russian investments to politicians like
Wilbur Ross and companies like Facebook and Twitter.
More than 13 million leaked documents, emails, and database records have been analyzed using text
analysis, full-text and faceted search, and, most interestingly to us, graph visualization and graph-
based search.
The International Consortium of Investigative Journalists (ICIJ) makes use of the Neo4j graph
database internally to aid their investigations. As the ICIJ says on their website:
The ICIJ has built a powerful search engine that sits atop Neo4j that allows for searching the Paradise
Papers dataset and has made this available to the public as a web application. However, releasing the
data as a Neo4j database enables much more powerful analysis of the data. Since Neo4j is an open-source
database, this means that everyone has access to the same powerful tools for making sense of
the data.
In a previous post, we showed how graph analysis and Cypher, the query language for graphs, can
be used to find connections in the Paradise Papers data. In this post, we show some
techniques for querying and analyzing the data in Neo4j, including how we can create data
visualizations to help us draw insight, and how we can use graph analysis to learn more about the
offshore finance industry.
22/11/2017 Paradise Papers: An In-Depth Graph Analysis - DZone Big Data
https://dzone.com/articles/paradise-papers-an-in-depth-graph-analysis?utm_medium=feed&utm_source=feedpress.me&utm_campaign=Feed:%
Graph Querying
For a more thorough overview of the data model and example queries, see our previous post here.
The nodes in the graph are the entities, and relationships connect them. We also store key-value
pair properties on both the nodes and relationships, such as names, addresses, and data provenance
attributes.
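To make the data model concrete, here is a minimal Python sketch of a property graph: nodes and relationships that each carry a dictionary of key-value properties. The class names, the `neighbors` helper, and all sample values are invented for illustration; this is not how Neo4j stores data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str  # e.g. "Officer", "Entity", "Address"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int  # node_id of the start node
    end: int  # node_id of the end node
    rel_type: str  # e.g. "OFFICER_OF"
    properties: dict = field(default_factory=dict)


def neighbors(node_id, relationships):
    """Return the ids of nodes connected to node_id, ignoring direction."""
    result = []
    for rel in relationships:
        if rel.start == node_id:
            result.append(rel.end)
        elif rel.end == node_id:
            result.append(rel.start)
    return result


# Toy data: every name and value below is made up.
officer = Node(1, "Officer", {"name": "Example Officer"})
entity = Node(2, "Entity", {"name": "Example Holdings Ltd.", "jurisdiction": "Bermuda"})
link = Relationship(officer.node_id, entity.node_id, "OFFICER_OF", {"source": "example"})
```

Calling `neighbors(1, [link])` returns the id of the connected entity, which is the kind of hop a graph query performs for every relationship it traverses.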
Graph visualization is a powerful way to explore data. For example, identifying highly connected
clusters of nodes can be done by visually examining the graph.
Exploratory Queries
We can also perform aggregations when we query for tabular data. Let's examine the overall size and
shape of the Paradise Papers dataset.
MATCH (n) RETURN labels(n) AS labels, COUNT(*) AS count ORDER BY count DESC

"labels"          "count"
["Officer"]       77012
["Address"]       59228
["Entity"]        24957
["Intermediary"]  2031
["Other"]         186
We can see that the data consists of information on over 77,000 officers (these are people or
companies who play a role in an offshore company) with connections to almost 25,000 offshore legal
entities, across more than 59,000 addresses. The addresses will become important to us later as we make use of
location data.
We can also count the number of the different types of relationships in the dataset:
"type(r)"             "COUNT(*)"
"OFFICER_OF"          221112
"REGISTERED_ADDRESS"  128311
"CONNECTED_TO"        10552
"INTERMEDIARY_OF"     4063
"SAME_NAME_AS"        416
"SAME_ID_AS"          2
And compute degree distribution, to give us an idea of how connected different pieces of the graph
are, on average:
"type"            "max"  "avg"  "stdev"
["Address"]       9268   2      59
["Intermediary"]  115    5      8
["Officer"]       2726   4      20
["Entity"]        312    11     13
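As an illustration of what a degree distribution summarizes, the following Python sketch computes the maximum, mean, and standard deviation of node degree per label on a tiny made-up graph. The function name and the toy data are assumptions, and it uses the population standard deviation, which may differ from what the original Cypher query computed.

```python
from collections import defaultdict
from statistics import mean, pstdev


def degree_stats(node_labels, edges):
    """node_labels: {node_id: label}; edges: iterable of (a, b) pairs.

    Returns {label: (max, avg, stdev)} of undirected node degree.
    """
    degree = {node_id: 0 for node_id in node_labels}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    by_label = defaultdict(list)
    for node_id, label in node_labels.items():
        by_label[label].append(degree[node_id])

    return {label: (max(d), mean(d), pstdev(d)) for label, d in by_label.items()}


# Toy graph: two officers, one entity, one address (all invented).
toy_labels = {1: "Officer", 2: "Officer", 3: "Entity", 4: "Address"}
toy_edges = [(1, 3), (2, 3), (1, 4)]
stats = degree_stats(toy_labels, toy_edges)
```

A large gap between the maximum and the average, as in the Address row above, points to a few hub nodes shared by very many others.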
I was curious to see if there were any indirect connections between two public figures who appear in
the Paradise Papers dataset: Rex Tillerson (the U.S. Secretary of State who had connections to a
Bermuda-based oil and gas company with operations in Yemen) and the Queen of England, whose
estate, it was reported, was an investor in a Bermuda-based company. We can easily query for such a
path using Cypher:
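MATCH p=shortestPath((rex:Officer)-[*]-(queen:Officer))
// Line 2 of this query is missing here; the name filter and both parameters are assumed placeholders.
WHERE rex.name = $rexName AND queen.name = $queenName
RETURN p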
This shows us a single shortest path connecting the Queen of England and Rex Tillerson. The path
goes through several offshore entities and officers with connections to these entities. If we adjust our
query slightly to include all shortest paths, we see that several of the officers in our path share
connections with many legal entities.
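MATCH p=allShortestPaths((rex:Officer)-[*]-(queen:Officer))
// Line 2 of this query is missing here; the name filter and both parameters are assumed placeholders.
WHERE rex.name = $rexName AND queen.name = $queenName
RETURN p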
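For readers who want to see the idea behind an all-shortest-paths search, here is a small Python sketch using breadth-first search on an unweighted, undirected graph. It is a stand-in for illustration only, not Neo4j's implementation, and the graph and node names are invented.

```python
from collections import deque


def all_shortest_paths(adjacency, source, target):
    """Return every shortest path from source to target.

    adjacency: {node: iterable of neighbor nodes} for an undirected graph.
    """
    if source == target:
        return [[source]]

    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # all predecessors of target have been processed by now
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # another equally short way in

    if target not in dist:
        return []

    def build(node):
        # Walk the parent links back to the source, collecting every route.
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in build(p)]

    return build(target)


# Toy graph: two officers linked through two different entities.
toy_graph = {
    "officer_a": ["entity_x", "entity_y"],
    "entity_x": ["officer_a", "officer_b"],
    "entity_y": ["officer_a", "officer_b"],
    "officer_b": ["entity_x", "entity_y", "entity_z"],
    "entity_z": ["officer_b"],
}
paths = all_shortest_paths(toy_graph, "officer_a", "officer_b")
```

Here `paths` holds two routes of equal length, one through each shared entity, which mirrors how several officers can sit on the shortest paths between two people of interest.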
A quick Google search reveals that these individuals are corporate services managers: individuals who
are paid to serve as directors of offshore entities to handle the administration of these entities.
Graph Algorithms
Querying the data using Cypher is useful for exploring the graph and answering questions that we
have, such as "What are all the offshore legal entities that Wilbur Ross is connected to?" But what if we
want to know which nodes are the most influential in the network, or which elements of the graph have
the highest transitive relevance?
We can easily run the PageRank centrality algorithm on the whole graph dataset using Cypher:
CALL algo.pageRank(null,null,{write:true,writeProperty:'pagerank_g'})
...and then query for the Entity nodes with the highest PageRank scores:
"entity"                                            "jurisdiction"    "pagerank"
"WORLDCARE LIMITED"                                 "Bermuda"         18.110508499999998
"Ferrous Resources Limited"                         "Isle of Man"     17.326935999999996
"American Contractors Insurance Group Ltd."         "Bermuda"         15.6201275
"Gulf Keystone Petroleum Limited"                   "Bermuda"         12.81925
"Warburg Pincus (Bermuda) Private Equity X, L.P."   "Bermuda"         12.312412
"Madagascar Oil Limited"                            "Bermuda"         11.611646499999999
"Coller International Partners IV-D, L.P."          "Cayman Islands"  11.394854
"Milestone Insurance Co., Ltd."                     "Bermuda"         11.224089
"CL Acquisition Holdings Limited"                   "Cayman Islands"  11.0752455
"Alpha and Omega Semiconductor Limited"             "Bermuda"         10.965910000000001
"Coller International Partners V-A, L.P."           "Cayman Islands"  10.8205005
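To show what PageRank measures, here is a minimal power-iteration sketch in Python. It is a textbook variant written for illustration, not the procedure Neo4j runs; the damping factor, the iteration count, and the toy graph are all assumptions. This version normalizes scores so they sum to one, so its values are not on the same scale as the scores listed above.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links: {node: [nodes it links to]}. Returns {node: score}.

    Scores sum to 1. Nodes without outgoing links spread their score evenly.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by dangling nodes is redistributed to every node.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: three officers pointing at two entities (all invented).
toy_links = {"o1": ["e1"], "o2": ["e1"], "o3": ["e1", "e2"]}
scores = pagerank(toy_links)
top_node = max(scores, key=scores.get)
```

In this toy graph `e1` ends up with the highest score because most officers point to it, which is the intuition behind ranking entities by incoming connections.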
Geo Analysis
The registered addresses of many of the officers and legal entities are available in the Paradise Papers
data. Using a service such as the Nominatim API or Google's geocoding API, we can perform a lookup
to turn these address strings into latitude and longitude points.
Once we have geocoded these addresses, we can use geographic analysis to find more insights into the
data. Neo4j has a JavaScript driver which makes it easy to build web applications that query Neo4j
using Cypher.
One visualization tool we can use is a heat map, where observations are represented as colors. More
intense colors mean more addresses in that area. Examining a heatmap of Paradise Papers addresses
shows a high concentration of addresses in the Atlantic, just off the coast of North America. Many of
these addresses are in Bermuda, a known offshore jurisdiction.
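The counting step behind a heat map can be sketched in a few lines of Python: group geocoded points into grid cells and count how many fall in each. The function name, the one-degree cell size, and the hand-picked approximate coordinates are assumptions for illustration and do not come from the dataset.

```python
import math
from collections import Counter


def heatmap_bins(points, cell_degrees=1.0):
    """Count (lat, lon) points per grid cell.

    Each cell is keyed by its south-west corner, in degrees.
    """
    counts = Counter()
    for lat, lon in points:
        cell = (
            math.floor(lat / cell_degrees) * cell_degrees,
            math.floor(lon / cell_degrees) * cell_degrees,
        )
        counts[cell] += 1
    return counts


# Approximate, hand-picked coordinates: two near Bermuda, one near Las Vegas.
toy_points = [(32.29, -64.78), (32.31, -64.75), (36.17, -115.14)]
bins = heatmap_bins(toy_points)
hottest_cell, hottest_count = bins.most_common(1)[0]
```

A mapping library would then color each cell by its count, so the cell containing the two Bermuda points would render more intensely than the single Las Vegas point.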
If we compare this heat map with a heat map of geocoded addresses from the Panama Papers dataset
(an earlier leak investigated by ICIJ), we can see we have quite a different geographic distribution of
addresses.
Instead of a large concentration in the Atlantic, we see a higher concentration in Asia and, to a lesser
degree, Europe. The Panama Papers leak has a high number of addresses in Singapore and Kuala
Lumpur.
Using the geocoded addresses, we can also interactively explore the Paradise Papers as a map.
Clicking on an address marker of interest issues a Cypher query to find the Officer and Entity nodes
connected to this address.
Exploring the ritzy suburbs of Las Vegas, we can see many addresses that show up in the Paradise
Papers. In fact, we easily stumble upon the casino magnate Sheldon G. Adelson, who, it was revealed,
has a connection to a Bermuda company he uses to register his casino's private jets, transferring tens
of millions of dollars to a tax-free jurisdiction.
Annotated map of geocoded addresses in Paradise Papers showing the registered address of Officer
nodes and connected legal entities and jurisdictions. Try it live.
Entity Jurisdictions
When looking at the implications of the structure of the offshore finance industry, one of the questions
investigative journalists try to answer is "Who are the enablers?" One aspect of finding enablers is to
look at the jurisdictions that make the offshore industry possible.
One can theorize about historical, legal, and economic reasons why some jurisdictions may be chosen
for citizens of certain countries, but data like the Paradise Papers are so important for gaining insight
into the offshore finance industry because much of this world is so secretive. Next, we examine some
of the jurisdiction information in the data.
MATCH (e:Entity)
// ...
RETURN *
We can see that Bermuda and the Cayman Islands far outnumber the other jurisdictions. This makes
sense given what we know about the main source of the data, which was a law firm with offices in
Bermuda (and many other countries).
We can extend our analysis to begin to answer the question, "Are there certain jurisdictions that
citizens of particular countries prefer?" or "What are the most popular offshore jurisdictions, by
country of residence of the beneficiary or officer?" We can begin to take a look at that answer by
creating a bipartite graph of Officer country and entity jurisdiction. We can visualize this data in a
chord diagram that shows us the relative distribution of flow through the bipartite graph.
MATCH (a:Address)--(o:Officer)--(e:Entity)
// ...
COUNT(*) AS num
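The aggregation that feeds a chord diagram amounts to counting (officer country, entity jurisdiction) pairs. The Python sketch below shows that step on invented rows; the function names and sample data are assumptions, and a real run would take its rows from the Cypher query result.

```python
from collections import Counter


def flow_counts(rows):
    """rows: iterable of (officer_country, jurisdiction) pairs, one per link."""
    return Counter(rows)


def jurisdictions_for(counts, country):
    """Jurisdictions used from one country, most frequent first."""
    selected = [(juris, num) for (c, juris), num in counts.items() if c == country]
    return sorted(selected, key=lambda item: (-item[1], item[0]))


# Toy rows: invented for illustration, not taken from the dataset.
toy_rows = [
    ("United States", "Bermuda"),
    ("United States", "Bermuda"),
    ("United States", "Cayman Islands"),
    ("United Kingdom", "Isle of Man"),
]
counts = flow_counts(toy_rows)
us_flows = jurisdictions_for(counts, "United States")
```

Each count becomes the width of one ribbon in the chord diagram, connecting a country on one side of the bipartite graph to a jurisdiction on the other.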
This diagram shows us that the United States is by far the most popular country for officers to give as
their registered address. Among officers with addresses in the US, Bermuda and the Cayman
Islands are the most popular offshore jurisdictions. This is not surprising, as we saw earlier that those
two jurisdictions are by far the most popular in the dataset.
This was an overview of the now-public Paradise Papers dataset released by ICIJ. ICIJ has released
the leaked data packaged as a Neo4j database to enable everyone to use the same open-source
software they use for making sense of the complex web of the offshore finance industry.
You can find the Paradise Papers dataset available on the Neo4j Sandbox and soon available for
download as a Neo4j database on the ICIJ website. We encourage you to explore the data and see what
insights you can find about the offshore finance industry.
As you explore the data, be sure to check out some of the great resources for learning Cypher and
graph databases. And if you like the work that the ICIJ is doing, remember that they are an
independent media organization and rely on your generous donations to operate.
You can find the code for generating all visualizations in this post on GitHub.
Editor's note: ICIJ has published this data with the following note: "There are legitimate uses for
offshore companies and trusts. We do not intend to suggest or imply that any people, companies or
other entities included in the ICIJ Offshore Leaks Database have broken the law or otherwise acted
improperly. Many people and entities have the same or similar names. We suggest you confirm the
identities of any individuals or entities located in the database based on addresses or other
identifiable information."
Topics: BIG DATA , DATA ANALYTICS , GRAPH ANALYTICS , PARADISE PAPERS , CYPHER , DATA VISUALIZATION ,
GRAPH QUERY
Published at DZone with permission of Michael Hunger, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Ethereum is one of the popular blockchain platforms that runs smart contracts. It is used by many
companies to create dapps (decentralized applications).
One possible solution could be to save blocks and transactions data in another database and perform
complicated queries there.
Kundera uses Web3j under the hood to fetch block data and the JPA layer to store the data.
Add dependency:
<dependency>
    <groupId>com.impetus.kundera.client</groupId>
    <artifactId>kundera-ethereum</artifactId>
    <version>${kundera.version}</version>
</dependency>
database.type=mongodb
database.host=localhost
database.port=27017
database.name=EthereumDB

## generate Block and Transaction tables
schema.auto.generate=true

## Drop existing tables
schema.drop.existing=true

## RPC HTTP end point or IPC socket file location or infura end point can be specified
ethereum.node.endpoint=http://localhost:8545/
Import Data
Import all the data starting from the genesis block:
// ...
importer.importUptoLatestBlock();
Import a specific range of blocks:
// ...
importer.importBlocks(BigInteger.valueOf(1000000), BigInteger.valueOf(2000000));
Query Data
We'll consider database-specific queries and JPA queries.
Database-Specific Queries
For example, let's find the top five miners with the number of blocks mined.
db.Block.aggregate([
    // ...
    { $limit : 5 }
]);
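The same "top five miners" question can be sketched in plain Python to show the group, sort, and limit steps such an aggregation performs. The `miner` field name and the sample blocks are assumptions for illustration; this is not the MongoDB pipeline from the article.

```python
from collections import Counter


def top_miners(blocks, limit=5):
    """blocks: iterable of dicts with a 'miner' key (field name assumed).

    Returns up to `limit` (miner, blocks_mined) pairs, most blocks first.
    """
    return Counter(block["miner"] for block in blocks).most_common(limit)


# Toy blocks: miner addresses are shortened, made-up placeholders.
toy_blocks = [
    {"number": 1, "miner": "0xaaa"},
    {"number": 2, "miner": "0xbbb"},
    {"number": 3, "miner": "0xaaa"},
    {"number": 4, "miner": "0xaaa"},
    {"number": 5, "miner": "0xbbb"},
    {"number": 6, "miner": "0xccc"},
]
leaders = top_miners(toy_blocks, limit=2)
```

Counting per miner corresponds to a grouping stage, ordering by count to a sort stage, and `limit` to the `$limit` stage shown above.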
JPA queries
Let's find gas and gasPrice for a particular user in a particular block.
// ...
List<Transaction> results = query.getResultList();
Topics: KUNDERA, ETHEREUM, BLOCKCHAIN, NOSQL, MONGODB, BIG DATA, DATA ANALYTICS