
22/11/2017 Paradise Papers: An In-Depth Graph Analysis - DZone Big Data


Paradise Papers: An In-Depth Graph Analysis
by Michael Hunger and William Lyon, Nov. 22, 17, Big Data Zone


Last week, the ICIJ publicly released data from its most recent year-long investigation into the
offshore industry, known as the Paradise Papers. In the last few weeks since the ICIJ announced their
investigation, we've seen many reports being published covering activities of companies like Nike,
Apple, and the Queen of England's estate, and connections of Russian investments to politicians like
Wilbur Ross and companies like Facebook and Twitter.

More than 13 million leaked documents, emails, and database records have been analyzed using text
analysis, full-text and faceted search, and, most interestingly to us, graph visualization and graph-
based search.

The International Consortium of Investigative Journalists (ICIJ) makes use of the Neo4j graph
database internally to aid their investigations. As the ICIJ says on their website:

Graph databases are the best way to explore the relationships between these people and entities; it's
much more intuitive to use for this purpose than a SQL database or other types of NoSQL databases.
In addition, graphs allow to understand these networks in a very intuitive way and easily discover
connections.

The ICIJ has built a powerful search engine that sits atop Neo4j that allows for searching the Paradise
Papers dataset and has made this available to the public as a web application. However, releasing the
data as a Neo4j database enables much more powerful analysis of the data. Since Neo4j is an
open-source database, this means that everyone has access to the same powerful tools for making
sense of the data.

In a previous post, we showed how graph analysis and Cypher, the query language for graphs, can
be used to query the data to find connections in the Paradise Papers data. In this post, we show some
techniques for querying and analyzing the data in Neo4j, including how we can create data
visualizations to help us draw insight and how we can use graph analysis to learn more about the
offshore finance industry.

Graph Querying
For a more thorough overview of the data model and example queries, see our previous post here.

The Data Model


The Paradise Papers dataset uses the property graph data model to represent data about offshore legal
entities, officers who may be beneficiaries or shareholders of the entities, and the intermediaries that
acted to create the legal entities.

The nodes in the graph are the entities, and relationships connect them. We also store key-value
pair properties on both the nodes and relationships, such as names, addresses, and data provenance
attributes.
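
To make the property graph model described above concrete, here is a minimal Python sketch of nodes and relationships that each carry key-value properties. The class names, the sample values, the `address` and `source` property keys, and the direction chosen for each relationship are illustrative assumptions, not the schema of the actual dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A graph node with one or more labels and key-value properties."""
    node_id: int
    labels: List[str]
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed, directed relationship that can carry its own properties."""
    rel_type: str
    start: int  # node_id of the start node
    end: int    # node_id of the end node
    properties: Dict[str, str] = field(default_factory=dict)


# Hypothetical sample data, not taken from the leak.
officer = Node(1, ["Officer"], {"name": "Example Officer"})
entity = Node(2, ["Entity"], {"name": "Example Holdings Ltd.",
                              "jurisdiction_description": "Bermuda"})
address = Node(3, ["Address"], {"address": "1 Example Street",
                                "countries": "United States"})

relationships = [
    Relationship("OFFICER_OF", officer.node_id, entity.node_id,
                 {"source": "example"}),
    Relationship("REGISTERED_ADDRESS", officer.node_id, address.node_id),
]


def neighbors(node_id: int, rels: List[Relationship]) -> List[int]:
    """Return ids of nodes directly connected to node_id, ignoring direction."""
    found = []
    for rel in rels:
        if rel.start == node_id:
            found.append(rel.end)
        elif rel.end == node_id:
            found.append(rel.start)
    return found


print(neighbors(officer.node_id, relationships))
```

Because both nodes and relationships hold their own property dictionaries, provenance information can sit on the connection itself rather than on either endpoint.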

Graph visualization is a powerful way to explore data. For example, identifying highly connected
clusters of nodes can be done by visually examining the graph.


Exploratory Queries
We can also perform aggregations when we query for tabular data. Let's examine the overall size and
shape of the Paradise Papers dataset.

How many nodes are there in the Paradise Papers dataset?

MATCH (n) RETURN labels(n) AS labels, COUNT(*) AS count ORDER BY count DESC

labels              count
["Officer"]         77012
["Address"]         59228
["Entity"]          24957
["Intermediary"]    2031
["Other"]           186

We can see that the data consists of information on over 77,000 officers (people or companies who
play a role in an offshore company) with connections to almost 25,000 offshore legal entities, across
more than 59,000 addresses. The addresses will become important to us later as we make use of
location data.

We can also count the number of the different types of relationships in the dataset:

MATCH ()-[r]->() RETURN type(r), COUNT(*) ORDER BY COUNT(*) DESC

type(r)                 COUNT(*)
"OFFICER_OF"            221112
"REGISTERED_ADDRESS"    128311
"CONNECTED_TO"          10552
"INTERMEDIARY_OF"       4063
"SAME_NAME_AS"          416
"SAME_ID_AS"            2

We can also compute the degree distribution per node label, which gives us an idea of how connected
the different parts of the graph are on average:

MATCH (n) WITH labels(n) AS type, SIZE( (n)--() ) AS degree
RETURN type, MAX(degree) AS max, ROUND(AVG(degree)) AS avg, ROUND(STDEV(degree)) AS stdev

type                max     avg   stdev
["Other"]           2891    44    236
["Address"]         9268    2     59
["Intermediary"]    115     5     8
["Officer"]         2726    4     20
["Entity"]          312     11    13
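
For readers who want to see what the degree query computes, the following Python sketch reproduces the same aggregation on a toy graph: count the relationships touching each node, then report the maximum, mean, and sample standard deviation per label. The node ids, labels, and edges are made up for illustration. Returning 0.0 for a label with a single node is a choice made in this sketch.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy data: node id -> label, plus an undirected edge list.
labels = {1: "Officer", 2: "Officer", 3: "Entity", 4: "Entity", 5: "Address"}
edges = [(1, 3), (1, 4), (2, 3), (1, 5), (2, 5)]


def degree_stats(labels, edges):
    """Return max, average, and sample standard deviation of degree per label."""
    # Every relationship adds one to the degree of both of its endpoints.
    degree = {node: 0 for node in labels}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    # Group the degrees by node label.
    by_label = defaultdict(list)
    for node, deg in degree.items():
        by_label[labels[node]].append(deg)

    stats = {}
    for label, degrees in by_label.items():
        # statistics.stdev needs at least two values; use 0.0 otherwise.
        spread = stdev(degrees) if len(degrees) > 1 else 0.0
        stats[label] = {"max": max(degrees), "avg": mean(degrees), "stdev": spread}
    return stats


stats = degree_stats(labels, edges)
for label, row in stats.items():
    print(label, row)
```

A large gap between the maximum and the average, as the Address and Other rows in the table show, points to a few hub nodes that many others share.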

The Shortest Path From the Queen of England to Rex Tillerson
One powerful feature of a graph database like Neo4j is the ability to query for paths of arbitrary
length. This allows us to find connections between nodes when we don't know what the connections
are, or even the length of the path.


I was curious to see if there were any indirect connections between two public figures who appear in
the Paradise Papers dataset: Rex Tillerson (the U.S. Secretary of State who had connections to a
Bermuda-based oil and gas company with operations in Yemen) and the Queen of England, whose
estate, it was reported, was an investor in a Bermuda-based company. We can easily query for such a
path using Cypher:

MATCH p=shortestPath((rex:Officer)-[*]-(queen:Officer))
WHERE rex.name = "Tillerson - Rex" AND queen.name = "The Duchy of Lancaster"
RETURN p

This shows us a single shortest path connecting the Queen of England and Rex Tillerson. The path
goes through several offshore entities and officers with connections to these entities. If we adjust our
query slightly to include all shortest paths, we see that several of the officers in our path share
connections with many legal entities.

MATCH p=allShortestPaths((rex:Officer)-[*]-(queen:Officer))
WHERE rex.name = "Tillerson - Rex" AND queen.name = "The Duchy of Lancaster"
RETURN p
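
As an illustration of what an all-shortest-paths search does conceptually, here is a small breadth-first search in Python over an undirected toy graph. It is a sketch of the general technique, not Neo4j's implementation, and the node names are placeholders rather than real officers or entities.

```python
from collections import deque


def all_shortest_paths(adj, start, goal):
    """Return every shortest path from start to goal in an unweighted graph."""
    if start == goal:
        return [[start]]

    dist = {start: 0}      # hop count from start to each discovered node
    parents = {start: []}  # predecessors that lie on some shortest path
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if node == goal:
            break  # every node one hop closer has already been expanded
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # another equally short route

    if goal not in dist:
        return []

    # Walk the parent links backwards to enumerate the paths.
    paths = []

    def build(node, suffix):
        if node == start:
            paths.append([start] + suffix)
            return
        for parent in parents[node]:
            build(parent, [node] + suffix)

    build(goal, [])
    return paths


# Hypothetical graph: two officers (A, B) linked through entities (E1, E2)
# that share two intermediate officers (M1, M2).
adj = {
    "A": ["E1"],
    "E1": ["A", "M1", "M2"],
    "M1": ["E1", "E2"],
    "M2": ["E1", "E2"],
    "E2": ["M1", "M2", "B"],
    "B": ["E2"],
}

for path in all_shortest_paths(adj, "A", "B"):
    print(" - ".join(path))
```

The two paths differ only in the middle node, which mirrors the observation that the same few officers sit on many of the shortest connections.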


A quick Google search reveals that these individuals are corporate services managers: individuals who
are paid to serve as directors of offshore entities to handle the administration of these entities.

Graph Algorithms
Querying the data using Cypher is useful for exploring the graph and answering questions that we
have, such as "What are all the offshore legal entities that Wilbur Ross is connected to?" But what if
we want to know which nodes are the most influential in the network, or which elements of the graph
have the highest transitive relevance?

We can easily run the PageRank centrality algorithm on the whole graph dataset using Cypher:

CALL algo.pageRank(null,null,{write:true,writeProperty:'pagerank_g'})

...and then query for the Entity nodes with the highest PageRank scores:

MATCH (e:Entity) WHERE exists(e.pagerank_g)
RETURN e.name AS entity, e.jurisdiction_description AS jurisdiction,
e.pagerank_g AS pagerank ORDER BY pagerank DESC LIMIT 15

entity                                               jurisdiction        pagerank
"WORLDCARE LIMITED"                                  "Bermuda"           18.110508499999998
"Ferrous Resources Limited"                          "Isle of Man"       17.326935999999996
"American Contractors Insurance Group Ltd."          "Bermuda"           15.6201275
"Gulf Keystone Petroleum Limited"                    "Bermuda"           12.81925
"Warburg Pincus (Bermuda) Private Equity X, L.P."    "Bermuda"           12.312412
"Madagascar Oil Limited"                             "Bermuda"           11.611646499999999
"Coller International Partners IV-D, L.P."           "Cayman Islands"    11.394854
"Milestone Insurance Co., Ltd."                      "Bermuda"           11.224089
"CL Acquisition Holdings Limited"                    "Cayman Islands"    11.0752455
"Alpha and Omega Semiconductor Limited"              "Bermuda"           10.965910000000001
"Coller International Partners V-A, L.P."            "Cayman Islands"    10.8205005
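
To give a feel for how a PageRank score arises, here is a textbook power-iteration sketch in Python. It is not the implementation behind `algo.pageRank`. It uses the normalized variant in which all scores sum to 1, so its numbers are not comparable to the values in the table above. The graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Normalized PageRank by power iteration.

    out_links maps each node to the list of nodes it points to.
    Every target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Each node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without out-links spreads its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three nodes point at "hub", which points back at "a".
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(graph)

for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 4))
```

Note that "a" ends up ranked well above "b" and "c" even though it has only one incoming link, because that link comes from the highly ranked hub. This is the transitive relevance the text refers to.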

Geo Analysis
The registered addresses of many of the officers and legal entities are available in the Paradise Papers
data. Using a service such as the Nominatim API or Google's geocoding API, we can perform a lookup
to turn these address strings into latitude and longitude points.

Once we have geocoded these addresses, we can use geographic analysis to find more insights into the
data. Neo4j has a JavaScript driver which makes it easy to build web applications that query Neo4j
using Cypher.

One visualization tool we can use is a heat map, where observations are represented as colors. More
intense colors mean more addresses in that area. Examining a heat map of Paradise Papers addresses
shows a high concentration of addresses in the Atlantic, just off the coast of North America. Many of
these addresses are in Bermuda, a known offshore jurisdiction.
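
The idea behind a heat map can be sketched in a few lines of Python: bucket the geocoded points into grid cells and count how many fall in each cell, so that a higher count maps to a more intense color. The coordinates below are rough, made-up points near Bermuda and Singapore, and the one-degree cell size is an arbitrary assumption. Real heat-map libraries typically use smoother kernels than this simple grid.

```python
import math
from collections import Counter


def bin_points(points, cell_size=1.0):
    """Count (lat, lon) points per grid cell of cell_size degrees."""
    counts = Counter()
    for lat, lon in points:
        # floor() keeps negative coordinates in the correct cell.
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts


# Hypothetical geocoded addresses: three near Bermuda, one near Singapore.
points = [
    (32.29, -64.78),
    (32.30, -64.79),
    (32.31, -64.75),
    (1.35, 103.82),
]

counts = bin_points(points)
for cell, count in counts.most_common():
    print(cell, count)
```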


Heat map of Paradise Papers geocoded addresses. Try it live.

If we compare this heat map with a heat map of geocoded addresses from the Panama Papers dataset
(an earlier leak investigated by ICIJ), we can see we have quite a different geographic distribution of
addresses.

Instead of a large concentration in the Atlantic, we see a higher concentration in Asia and, to a lesser
degree, Europe. The Panama Papers leak has a high number of addresses in Singapore and Kuala
Lumpur.

Heat map of Panama Papers geocoded addresses

Using the geocoded addresses, we can also interactively explore the Paradise Papers as a map.
Clicking on an address marker of interest issues a Cypher query to find the Officer and Entity nodes
connected to this address.

Exploring the ritzy suburbs of Las Vegas, we can see many addresses that show up in the Paradise
Papers. In fact, we easily stumble upon the casino magnate Sheldon G. Adelson, who, it was revealed,
has a connection to a Bermuda company he uses to register his casino's private jets, transferring tens
of millions of dollars to a tax-free jurisdiction.


Annotated map of geocoded addresses in Paradise Papers showing the registered address of Officer
nodes and connected legal entities and jurisdictions. Try it live.

Entity Jurisdictions
When looking at the implications of the structure of the offshore finance industry, one of the questions
investigative journalists try to answer is "Who are the enablers?" One aspect of finding enablers is to
look at the jurisdictions that make the offshore industry possible.

One can theorize about historical, legal, and economic reasons why some jurisdictions may be chosen
for citizens of certain countries, but data like the Paradise Papers are so important for gaining insight
into the offshore finance industry because much of this world is so secretive. Next, we examine some
of the jurisdiction information in the data.

MATCH (e:Entity)
WITH e.jurisdiction_description AS juris, COUNT(*) AS count
WHERE count > 20
RETURN *
ORDER BY count ASC


We can see that Bermuda and the Cayman Islands far outnumber the other jurisdictions. This makes
sense given what we know about the main source of the data, which was a law firm with offices in
Bermuda (and many other countries).

We can extend our analysis to begin to answer the question, "Are there certain jurisdictions that
citizens of particular countries prefer?" or "What are the most popular offshore jurisdictions, by
country of residence of the beneficiary or officer?" We can begin to take a look at that answer by
creating a bipartite graph of Officer country and entity jurisdiction. We can visualize this data in a
chord diagram that shows us the relative distribution of flow through the bipartite graph.

MATCH (a:Address)--(o:Officer)--(e:Entity)
WITH a.countries AS officer_country, e.jurisdiction_description AS juris,
COUNT(*) AS num
RETURN * ORDER BY num DESC


This diagram shows us that the United States is by far the most popular country for officers to give as
their registered address. Of those officers with addresses in the US, Bermuda and the Cayman
Islands are the most popular offshore jurisdictions. This is not surprising, as we saw earlier that those
two jurisdictions are by far the most popular in the dataset.
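
The bipartite aggregation behind the chord diagram amounts to counting (officer country, entity jurisdiction) pairs. The Python sketch below shows that step on a handful of invented rows. The pairs are placeholders and do not reflect the real distribution in the dataset.

```python
from collections import Counter

# Hypothetical (officer_country, entity_jurisdiction) pairs, one per
# Address-Officer-Entity path matched by a query like the one above.
rows = [
    ("United States", "Bermuda"),
    ("United States", "Bermuda"),
    ("United States", "Cayman Islands"),
    ("United Kingdom", "Bermuda"),
    ("United Kingdom", "Isle of Man"),
]

# Each distinct pair is one weighted edge of the bipartite graph.
flows = Counter(rows)


def top_jurisdictions(flows, country):
    """Jurisdictions used by officers from one country, largest count first."""
    picked = [(juris, num) for (c, juris), num in flows.items() if c == country]
    return sorted(picked, key=lambda item: item[1], reverse=True)


for (country, juris), num in flows.most_common():
    print(country, "->", juris, num)

print(top_jurisdictions(flows, "United States"))
```

Each weighted pair corresponds to one ribbon in a chord diagram, with the count determining the ribbon's width.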

What Can You Find?

This was an overview of the now-public Paradise Papers dataset released by ICIJ. ICIJ has released
the leaked data packaged as a Neo4j database to enable everyone to use the same open-source
software they use for making sense of the complex web of the offshore finance industry.

You can find the Paradise Papers dataset available on the Neo4j Sandbox and soon available for
download as a Neo4j database on the ICIJ website. We encourage you to explore the data and see what
insights you can find about the offshore finance industry.

As you explore the data, be sure to check out some of the great resources for learning Cypher and
graph databases. And if you like the work that the ICIJ is doing, remember that they are an
independent media organization and rely on your generous donations to operate.

You can find the code for generating all visualizations in this post on GitHub.

Editor's note: ICIJ has published this data with the following note: "There are legitimate uses for
offshore companies and trusts. We do not intend to suggest or imply that any people, companies or
other entities included in the ICIJ Offshore Leaks Database have broken the law or otherwise acted
improperly. Many people and entities have the same or similar names. We suggest you confirm the
identities of any individuals or entities located in the database based on addresses or other
identifiable information."


Topics: BIG DATA, DATA ANALYTICS, GRAPH ANALYTICS, PARADISE PAPERS, CYPHER, DATA VISUALIZATION,
GRAPH QUERY

Published at DZone with permission of Michael Hunger, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.


Introduction to Ethereum Data Importer
by Devender Yadav, Nov 22, 17, Big Data Zone


Ethereum is one of the popular blockchain platforms that runs smart contracts. It is used by many
companies to create dapps (decentralized applications).

Query Ethereum Public Data


Libraries like Web3j can be used to fetch blocks and transactions and perform basic operations. But
complex analytical queries, such as finding the top five miners in a particular block range, can't be
performed.

One possible solution could be to save blocks and transactions data in another database and perform
complicated queries there.

Kundera With Ethereum


The Kundera-Ethereum module can be used to import block data to any Kundera-supported database
and also perform JPA queries over it.

Kundera uses Web3j under the hood to fetch block data and the JPA layer to store the data.

Add dependency:

<dependency>
    <groupId>com.impetus.kundera.client</groupId>
    <artifactId>kundera-ethereum</artifactId>
    <version>${kundera.version}</version>
</dependency>

Define kundera-ethereum.properties file in the classpath.

Sample file to store data in MongoDB:

database.type=mongodb
database.host=localhost
database.port=27017
database.name=EthereumDB

## generate Block and Transaction tables
schema.auto.generate=true

## Drop existing tables
schema.drop.existing=true

## RPC HTTP end point or IPC socket file location or infura end point can be specified
ethereum.node.endpoint=http://localhost:8545/

All set! Start importing blocks now.

Import Data
Import all the data starting from the genesis block:

BlockchainImporter importer = BlockchainImporter.initialize();
importer.importUptoLatestBlock();

Import the data from block 1,000,000 to block 2,000,000:

BlockchainImporter importer = BlockchainImporter.initialize();
importer.importBlocks(BigInteger.valueOf(1000000), BigInteger.valueOf(2000000));

Query Data
We'll consider database-specific queries and JPA queries.

Database-Specific Queries
For example, let's find the top five miners with the number of blocks mined.

db.Block.aggregate([
    { $group : { _id: "$miner", numBlocksMined: { $sum: 1 } } },
    { $sort : { numBlocksMined: -1 } },
    { $limit : 5 }
]);
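
The aggregation pipeline above groups blocks by miner, sorts by the count, and keeps the first five. The Python sketch below mirrors those three stages on in-memory data and adds an optional block range filter. The block records, the `number` field, and the miner addresses are hypothetical stand-ins for imported data.

```python
from collections import Counter


def top_miners(blocks, start, end, limit=5):
    """Return (miner, blocks_mined) pairs for blocks numbered start..end inclusive."""
    # Group and count, like the $group stage with $sum: 1.
    counts = Counter(
        block["miner"] for block in blocks if start <= block["number"] <= end
    )
    # most_common sorts by count descending and truncates, like $sort + $limit.
    return counts.most_common(limit)


# Hypothetical block records; only the fields used here are included.
blocks = [
    {"number": 1, "miner": "0xaaa"},
    {"number": 2, "miner": "0xbbb"},
    {"number": 3, "miner": "0xaaa"},
    {"number": 4, "miner": "0xccc"},
    {"number": 5, "miner": "0xaaa"},
    {"number": 6, "miner": "0xbbb"},
]

print(top_miners(blocks, 1, 6, limit=2))
```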

JPA queries
Let's find gas and gasPrice for a particular user in a particular block.

Query query = em.createQuery(
    "Select t.gas,t.gasPrice from Transaction t where t.blockNumber='0x455a56' and ...
List<Transaction> results = query.getResultList();

For more details check Kundera with Ethereum Blockchain.


Topics: KUNDERA, ETHEREUM, BLOCKCHAIN, NOSQL, MONGODB, BIG DATA, DATA ANALYTICS

Opinions expressed by DZone contributors are their own.

https://dzone.com/articles/paradise-papers-an-in-depth-graph-analysis?utm_medium=feed&utm_source=feedpress.me&utm_campaign=Feed: 15/15
