Você está na página 1de 20

GGE5405/GGE6505 Introduction to Big Data & Data Science

Assignment 3
Graph data management with Neo4J and Cypher:
A case of bus transit network in Moncton.

Introduction
This assignment introduces you to big data management using Neo4j and the Cypher graph query
language in order to develop and query a graph database. You will implement a database in Neo4j
for a bus transit network in Moncton, New Brunswick. The graph data model is shown in Figure1.
The step-by-step instructions are provided with cypher codes, follow them strictly to develop
your graph database and at the end answer the following questions. More information about
Neo4j and Cypher is available here:

https://neo4j.com/docs/developer-manual/current/cypher/functions/
https://neo4j.com/docs/cypher-refcard/current/
http://www.lyonwj.com/2016/06/26/graph-of-thrones-neo4j-social-network-analysis/
https://neo4j.com/developer/cypher-query-language/
Figure 1: Graph Database Model for the Bus Transit Network in Moncton
Step 1:
Download, Install and Configure Neo4J
Download the latest version Neo4J from here, (Community Edition for individuals).
Installation is straight forward.
After Installation, launch your Neo4J community Edition.
Create a folder on your C or D drive where your database will be created and browse to it as your
database location.

Configure the Options:


Remove the comment signs (#) at the beginning of dbms.directories.import=import and
dbms. security., if they have not been removed already.

Save and Close the config window.


Start the database and click on the http-link to launch the browser interface.

Create another folder in your database folder called import and add all the csv files provided
for you into the folder.
The cypher codes provided below are to be run on the Neo4J browser
(http://localhost:7474/browser/) one-by-one.

Creating the Time-Tree


//Create Time Tree Indexes
CREATE INDEX ON :Year(yearid);
CREATE INDEX ON :Month(monthid);
CREATE INDEX ON :Day(dayid);
CREATE INDEX ON :Hour(hourid);
CREATE INDEX ON :Minute(minuteid);
CREATE INDEX ON :Second(secondid);

//Create the time-tree of the data


WITH range(2016, 2016) AS YEARS, range(6,6) AS MONTHS, range(7,10) as Days,
range(0,23) as Hours, range(0,59) as Minutes, range(0,59) as Seconds
FOREACH(year IN YEARS |
CREATE (y:Year {yearid: year})
FOREACH(month IN MONTHS |
CREATE (m:Month {monthid: month})
MERGE (y)-[:HAS_MONTH]->(m)
FOREACH(day IN Days |
CREATE (d:Day {dayid: day})
MERGE (m)-[:HAS_DAY]->(d)
FOREACH(hour IN Hours |
CREATE (h:Hour {hourid: hour})
MERGE (d)-[:HAS_HOUR]->(h)
FOREACH(minute IN Minutes |
CREATE (mm:Minute {minuteid: minute})
MERGE (h)-[:HAS_MINUTE]->(mm)
FOREACH(second IN Seconds |
CREATE (ss:Second {secondid: second})
MERGE (mm)-[:HAS_SECOND]->(ss)))))));

// Adding NEXT relationship from the days up to the Seconds

//Day
MATCH (year:Year)-[:HAS_MONTH]->(month)-[:HAS_DAY]->(day)
WITH year,month,day
ORDER BY year.yearid, month.monthid, day.dayid
WITH collect(day) as days
FOREACH(i in RANGE(0, length(days)-2) |
FOREACH(day1 in [days[i]] |
FOREACH(day2 in [days[i+1]] |
CREATE UNIQUE (day1)-[:NEXT_DAY]->(day2))));

// Hour
MATCH (d:Day)-[:HAS_HOUR]->(h:Hour)
WITH h, d
ORDER BY d.dayid, h.hourid
WITH collect(h) AS hours
FOREACH(i in RANGE(0, size(hours)-2) |
FOREACH(h1 in [hours[i]] |
FOREACH(h2 in [hours[i+1]] |
CREATE UNIQUE (h1)-[:NEXT_HOUR]->(h2))));
//Connect minutes Sequentially.
MATCH (d:Day)-[:HAS_HOUR]->(h:Hour)-[:HAS_MINUTE]->(mm:Minute)
WITH d, h, mm
ORDER BY d.dayid, h.hourid, mm.minuteid
WITH collect(mm) AS minutes
FOREACH(i in RANGE(0, size(minutes)-2) |
FOREACH(mm1 in [minutes[i]] |
FOREACH(mm2 in [minutes[i+1]] |
CREATE UNIQUE (mm1)-[:NEXT_MINUTE]->(mm2))));

//Connect seconds Sequentially. (This will take some hours)


MATCH (:Month {monthid:6})-[:HAS_DAY]->(d:Day)-[:HAS_HOUR]->(h:Hour)-[:HAS_MINUTE]-
>(mm:Minute)-[:HAS_SECOND]->(ss:Second)
WITH d, h, mm, ss
ORDER BY d.dayid, h.hourid, mm.minuteid, ss.secondid
WITH collect(ss) AS seconds
FOREACH(i in RANGE(0, size(seconds)-2) |
FOREACH(ss1 in [seconds[i]] |
FOREACH(ss2 in [seconds[i+1]] |
CREATE UNIQUE (ss1)-[:NEXT_SECOND]->(ss2))));

//
==========================================================
= END TIME TREE

//
==========================================================
= ADDING NODES
//Create the constrains
create constraint on (a:Agency) assert a.agencyID is unique;
create constraint on (r:Line) assert r.lineID is unique;
create constraint on (t:Trip) assert t.tripID is unique;
create constraint on (c:Calendar) assert c.cal_serviceID is unique;
create constraint on (s:BusStop) assert s.BusStopID is unique;
create constraint on (o:Origin) assert o.OriginID is unique;
create constraint on (d:Destination) assert d.DestinationID is unique;
create constraint on (m:Moves) assert m.MoveID is unique;
create constraint on (s:Stops) assert s.StopID is unique;

//add the agency

// The files should be at the /import folder at: Neo4j/Name_project/import

load csv with headers from


'file:///agency.csv' as csv
create (a:Agency {agencyID:csv.agency_id,agName:csv.agency_name, agUrl:csv.agency_url,
AgTimezone:csv.agency_timezone});

// add the Lines

load csv with headers from


'file:///routes.csv' as csv
match (a:Agency {agencyID: csv.agency_id})
create (a)-[:RUNS]->(r:Line {lineID: csv.route_id, lShortName: csv.route_short_name, lLongName:
csv.route_long_name, lType: toInt(csv.route_type)});

// add the trips (linked by RouteId (this info must or present on trips.txt) - this file must be built
manually)
load csv with headers from
'file:///line51.csv' as csv
match (r:Line {lineID: csv.busLine})
merge (r)<-[:IS_COMPOSED_OF]-(t:Trip {tripID: csv.trip_id, service_id: csv.service, date: csv.date,
starttime: csv.scheduledStart, endtime: csv.scheduledEnd});

// tip: use merge when loading equal data


//Loading Origin and Destination directly from the new Realtime data

load csv with headers from


'file:///line51.csv' as csv
match (t:Trip {tripID: csv.trip_id})
where csv.OrgnDest = 'Original'
create (t)-[:STARTS_AT]->(r:Origin {OriginID: csv.FID, tripID: csv.trip_id, latitude:
toFloat(csv.latitude), longitude: toFloat(csv.longitude),
date: csv.date, time: csv.time, sequence: toInt(csv.sequence)});

load csv with headers from


'file:///line51.csv' as csv
match (t:Trip {tripID: csv.trip_id})
where csv.OrgnDest = 'Destination'
create (t)-[:ENDS_AT]->(d:Destination {DestinationID: csv.FID, tripId: csv.trip_id, latitude:
toFloat(csv.latitude), longitude: toFloat(csv.longitude),
date: csv.date, time: csv.time, sequence: toInt(csv.sequence)});

//add the Bus stops (BusStops)

load csv with headers from


'file:///busstops.csv' as csv
create (bs:BusStop {BusStopID: csv.stop_id, sName: csv.stop_name, sLat: toFloat(csv.stop_lat), sLon:
toFloat(csv.stop_lon),
sParentStation: csv.parent_station, locType: csv.location_type});

//Add Households as an attribute of a BusStops

load csv with headers from


'file:///Civic_Address_PerStop.csv' as csv
MATCH (bs:BusStop{BusStopID: csv.UNIQUEID} )
SET bs.Households = csv.Count;

//Adding Streets

LOAD CSV WITH HEADERS FROM 'file:///Streets.csv' AS line


CREATE (:Streets {streetID: line.FID, streetName: line.STNAME, streetType: line.STTYPE,
streetNoSpace:line.NOSPACE});

CREATE INDEX ON :Streets(streetNoSpace);

//add Stops and Moves (linked to a Trip by the tripID and date)
load csv with headers from
'file:///Stops.csv' as csv
merge (s:Stops {StopID: csv.FID, tripID: csv.trip_id, latitude: csv.latitude, longitude: csv.longitude,
date: csv.date, time: csv.time,
sequence: toInt(csv.sequence), BusEvent: csv.BusEvent, BusEventPlace: csv.BusEventPlace,
ArvDep: csv.ArvDep});

load csv with headers from


'file:///Moves.csv' as csv
merge (m:Moves {MoveID: csv.FID, tripID: csv.trip_id, latitude: csv.latitude, longitude: csv.longitude,
date: csv.date, time: csv.time,
sequence: toInt(csv.sequence), BusEvent: csv.BusEvent, BusEventPlace: csv.BusEventPlace});
////////////////////////////////////

//CREATE INDEXES
////////////////////////////////////////

CREATE INDEX ON :Origin(OriginID);


CREATE INDEX ON :Destination(DestinationID);
CREATE INDEX ON :Moves(MoveID);
CREATE INDEX ON :Stops(StopsID);

CREATE INDEX ON :Stops(BusEventPlace);


CREATE INDEX ON :Stops(BusEvent);

CREATE INDEX ON :Moves(BusEventPlace);


CREATE INDEX ON :Moves(BusEvent);

// Create relationship with BusStops and streets

match (bs:BusStop), (st:Stops) where bs.BusStopID = st.BusEventPlace


merge (st)-[:STOPS_AT]->(bs);

match (st:Stops)-[r:STOPS_AT]->(bs:BusStop)
where st.ArvDep = 'Arrival'
set r.Arrivaltime = "Arrivaltime" + ":" + st.time;

match (st:Stops)-[r:STOPS_AT]->(bs:BusStop)
where st.ArvDep = 'Departure'
set r.Departuretime = "Departuretime" + ":" + st.time;
match (ss:Streets), (st:Stops) where ss.streetName = st.BusEventPlace
merge (st)-[:SUSPENSION_OF_MOVEMENT]->(ss);

match (ss:Streets), (st:Moves) where ss.streetName = st.BusEventPlace


merge (st)-[:MOVES_ON]->(ss);

match (bs:BusStop), (st:Moves) where bs.BusStopID = st.BusEventPlace


merge (st)-[:DID_NOT_STOP_AT]->(bs);

// Create Sequences of Stops and Moves with Origin and Destination (pay attention in every
situation!)

MATCH (st:Stops),(mv:Moves)
WHERE st.sequence = mv.sequence+1
AND st.tripID = mv.tripID
MERGE (mv)-[:NEXT]->(st);

////////////////////////////////////
MATCH (st:Stops),(mv:Moves)
WHERE mv.sequence = st.sequence+1
AND st.tripID = mv.tripID
MERGE (st)-[:NEXT]->(mv);

/////////////////////////////////

MATCH (st:Stops),(mv:Moves)
WHERE mv.sequence = st.sequence+1
AND st.tripID = mv.tripID
MERGE (st)-[:NEXT]->(mv);
//////////////////////////////////

MATCH (st:Stops),(st1:Stops)
WHERE st.sequence = st1.sequence+1
AND st.tripID = st1.tripID
MERGE (st1)-[:NEXT]->(st);

/////////////////////////////////////////////////

MATCH (mv:Moves),(mv1: Moves)


WHERE mv.sequence = mv1.sequence+1
AND mv.tripID = mv1.tripID
MERGE (mv1)-[:NEXT]->(mv);

///////////////////////////////////////

MATCH (or:Origin),(mv: Moves)


WHERE mv.sequence = or.sequence+1
AND mv.tripID = or.tripID
MERGE (or)-[:NEXT]->(mv)

//////////////////////////////////////////////

MATCH (or:Origin),(st:Stops)
WHERE st.sequence = or.sequence+1
AND st.tripID = or.tripID
MERGE (or)-[:NEXT]->(st);
//////////////////////////////////////////

MATCH (dt:Destination),(mv:Moves)
WHERE dt.sequence = mv.sequence+1
AND dt.tripId = mv.tripID
MERGE (mv)-[:NEXT]->(dt);

///////////////////////////////////////

MATCH (dt:Destination),(st:Stops)
WHERE dt.sequence = st.sequence+1
AND dt.tripId = st.tripID
MERGE (st)-[:NEXT]->(dt);

//Adding the hour, min and sec,year, month and day to link to the Time Tree

MERGE (t:Trip)
ON MATCH SET t.year = toInt(substring(t.date,6,4));
MERGE (t:Trip)
ON MATCH SET t.month = toInt(substring(t.date,0,2));
MERGE (t:Trip)
ON MATCH SET t.day = toInt(substring(t.date,3,2));

//On Origin, Destination, Moves and Stops


MERGE (or:Origin)
ON MATCH SET or.year = toInt(substring(or.date,6,4));
MERGE (or:Origin)
ON MATCH SET or.month = toInt(substring(or.date,0,2));
MERGE (or:Origin)
ON MATCH SET or.day = toInt(substring(or.date,3,2));
MERGE (or:Origin)
ON MATCH SET or.hour = toInt(substring(or.time,0,2));
MERGE (or:Origin)
ON MATCH SET or.minute = toInt(substring(or.time,3,2));
MERGE (or:Origin)
ON MATCH SET or.second = toInt(substring(or.time,6));

MERGE (dt:Destination)
ON MATCH SET dt.year = toInt(substring(dt.date,6,4));
MERGE (dt:Destination)
ON MATCH SET dt.month = toInt(substring(dt.date,0,2));
MERGE (dt:Destination)
ON MATCH SET dt.day = toInt(substring(dt.date,3.2));
MERGE (dt:Destination)
ON MATCH SET dt.hour = toInt(substring(dt.time,0,2));
MERGE (dt:Destination)
ON MATCH SET dt.minute = toInt(substring(dt.time,3,2));
MERGE (dt:Destination)
ON MATCH SET dt.second = toInt(substring(dt.time,6));

MERGE (mv:Moves)
ON MATCH SET mv.year = toInt(substring(mv.date,6,4));
MERGE (mv:Moves)
ON MATCH SET mv.month = toInt(substring(mv.date,0,2));
MERGE (mv:Moves)
ON MATCH SET mv.day = toInt(substring(mv.date,3,2));
MERGE (mv:Moves)
ON MATCH SET mv.hour = toInt(substring(mv.time,0,2));
MERGE (mv:Moves)
ON MATCH SET mv.minute = toInt(substring(mv.time,3,2));
MERGE (mv:Moves)
ON MATCH SET mv.second = toInt(substring(mv.time,6));

MERGE (st:Stops)
ON MATCH SET st.year = toInt(substring(st.date,6,4));
MERGE (st:Stops)
ON MATCH SET st.month = toInt(substring(st.date,0,2));
MERGE (st:Stops)
ON MATCH SET st.day = toInt(substring(st.date,3,2));
MERGE (st:Stops)
ON MATCH SET st.hour = toInt(substring(st.time,0,2));
MERGE (st:Stops)
ON MATCH SET st.minute = toInt(substring(st.time,3,2));
MERGE (st:Stops)
ON MATCH SET st.second = toInt(substring(st.time,6));

CREATE INDEX ON :Trip(year);


CREATE INDEX ON :Trip(month);
CREATE INDEX ON :Trip(day);

CREATE INDEX ON :Origin(year);


CREATE INDEX ON :Origin(month);
CREATE INDEX ON :Origin(day);
CREATE INDEX ON :Origin(hour);
CREATE INDEX ON :Origin(minute);
CREATE INDEX ON :Origin(second);

CREATE INDEX ON :Destination(year);


CREATE INDEX ON :Destination(month);
CREATE INDEX ON :Destination(day);
CREATE INDEX ON :Destination(hour);
CREATE INDEX ON :Destination(minute);
CREATE INDEX ON :Destination(second);

CREATE INDEX ON :Moves(year);


CREATE INDEX ON :Moves(month);
CREATE INDEX ON :Moves(day);
CREATE INDEX ON :Moves(hour);
CREATE INDEX ON :Moves(minute);
CREATE INDEX ON :Moves(second);

CREATE INDEX ON :Stops(year);


CREATE INDEX ON :Stops(month);
CREATE INDEX ON :Stops(day);
CREATE INDEX ON :Stops(hour);
CREATE INDEX ON :Stops(minute);
CREATE INDEX ON :Stops(second);
// Connecting to the Time Tree:
//Trip
MATCH (t:Trip) WITH t
MATCH (hh1:Year {yearid:t.year}) WITH t,hh1
MATCH (hh1)-[r1]->(mm1:Month {monthid:t.month}) WITH t,hh1,mm1
MATCH (mm1)-[r2]->(ss1:Day {dayid:t.day}) WITH t, hh1,mm1,ss1
CREATE (t)-[:HAPPENS_AT]->(ss1);

//Origin
MATCH (t:Origin) WITH t
MATCH (yy:Year {yearid:t.year}) WITH t,yy
MATCH (yy)-[r1]->(mm:Month {monthid:t.month}) WITH t,yy,mm
MATCH (mm)-[r2]->(dd:Day {dayid:t.day}) WITH t,yy,mm,dd
MATCH (dd)-[r3]->(hh:Hour {hourid:t.hour}) WITH t,yy,mm,dd, hh
MATCH (hh)-[r4]->(mm1:Minute {minuteid:t.minute}) WITH t,yy,mm,dd, hh, mm1
MATCH (mm1)-[r5]->(ss:Second {secondid:t.second}) WITH t,yy,mm,dd,hh,mm1,ss
CREATE (t)-[:HAPPENS_AT]->(ss);

//Destination
MATCH (t:Destination) WITH t
MATCH (yy:Year {yearid:t.year}) WITH t,yy
MATCH (yy)-[r1]->(mm:Month {monthid:t.month}) WITH t,yy,mm
MATCH (mm)-[r2]->(dd:Day {dayid:t.day}) WITH t,yy,mm,dd
MATCH (dd)-[r3]->(hh:Hour {hourid:t.hour}) WITH t,yy,mm,dd, hh
MATCH (hh)-[r4]->(mm1:Minute {minuteid:t.minute}) WITH t,yy,mm,dd, hh, mm1
MATCH (mm1)-[r5]->(ss:Second {secondid:t.second}) WITH t,yy,mm,dd,hh,mm1,ss
CREATE (t)-[:HAPPENS_AT]->(ss);
//Moves
MATCH (t:Moves) WITH t
MATCH (yy:Year {yearid:t.year}) WITH t,yy
MATCH (yy)-[r1]->(mm:Month {monthid:t.month}) WITH t,yy,mm
MATCH (mm)-[r2]->(dd:Day {dayid:t.day}) WITH t,yy,mm,dd
MATCH (dd)-[r3]->(hh:Hour {hourid:t.hour}) WITH t,yy,mm,dd, hh
MATCH (hh)-[r4]->(mm1:Minute {minuteid:t.minute}) WITH t,yy,mm,dd, hh, mm1
MATCH (mm1)-[r5]->(ss:Second {secondid:t.second}) WITH t,yy,mm,dd,hh,mm1,ss
CREATE (t)-[:HAPPENS_AT]->(ss);

//Stops ( This will take a while)


MATCH (t:Stops) WITH t
MATCH (yy:Year {yearid:t.year}) WITH t,yy
MATCH (yy)-[r1]->(mm:Month {monthid:t.month}) WITH t,yy,mm
MATCH (mm)-[r2]->(dd:Day {dayid:t.day}) WITH t,yy,mm,dd
MATCH (dd)-[r3]->(hh:Hour {hourid:t.hour}) WITH t,yy,mm,dd, hh
MATCH (hh)-[r4]->(mm1:Minute {minuteid:t.minute}) WITH t,yy,mm,dd, hh, mm1
MATCH (mm1)-[r5]->(ss:Second {secondid:t.second}) WITH t,yy,mm,dd,hh,mm1,ss
CREATE (t)-[:HAPPENS_AT]->(ss);
// END

These codes are also attached to this assignment as .cql file, you may decide to open it with an
editor and work from there.

QUESTIONS
Provide the cypher query code used and the results obtained for answering the following
questions:
1. How many trips are in the model?
2. How many nodes and relationships there are in the model?
3. Retrieve and show all the information from a single trip using all the entities (i.e. nodes
and relationships) (Hint: use connectivity of a single trip).
4. Retrieve and show the shortest trip in the database (Hint: use shortest path in cypher)
5. Retrieve and show the longest trip in the database (Hint: use shortest path in cypher)
6. Retrieve and show the shortest trip in the database at 8am, 4pm and 6pm respectively
(Hint: you will need to use the time tree)
7. Find the busiest street in the network (Hint: use Centrality or PageRank measure in
cypher)
8. Find the busiest bus stop in the network (Hint: use Centrality or PageRank measure in
cypher)

Você também pode gostar