ORIGINAL ARTICLE

WHAT DO RDF AND SPARQL BRING TO BIG DATA PROJECTS?


Bob DuCharme
TopQuadrant, Charlottesville, Virginia

Photo Credit, Erich Bremer: http://www.ebremer.com/nexus/2011-05-15

The Resource Description Framework (RDF), a W3C standard since 1999, describes a data model that can represent most known structured and semi-structured data formats. RDF's simplicity and flexibility (and its accompanying standards, such as the SPARQL query language and an optional schema language) provide a great infrastructure for addressing many of the issues that make big data different from traditional relational database management. Because of these features, both open-source efforts and offerings from commercial vendors such as IBM, Oracle, and Cray have found that RDF technology offers an excellent platform for taking an agile approach with large, dynamic aggregations of data that won't fit neatly into predefined tables.

Triples All the Way Down

RDF expresses data by using three-part statements called triples. These three parts are known as the subject, predicate, and object, but you can think of them as an instance identifier, a property name, and a property value. For example, if a parts inventory in my Foobar Company's relational database says that we have nine of part p1234 in stock, then a simplified triple representing this might be {p1234 inStock 9}.

A proper RDF version of this triple would be a little different, because subjects and predicates must be globally unique identifiers. After all, p1234 can mean different things in different contexts, and so can inStock; at a cooking website, it might refer to a soup ingredient. Being built on web standards, RDF uses URIs, or uniform resource identifiers, as globally unique names for subjects and predicates. These usually look like URLs (uniform resource locators, or web addresses), but because they're identifiers and not necessarily locators, their job is to provide unambiguous names. They take the form of URLs because, as with web addresses and Java package names, a domain name owner can control the naming conventions used with that domain name.

There are several syntaxes for expressing RDF data, and the Turtle format is gradually supplanting the RDF/XML format that was released as part of the original RDF standard. Turtle uses the XML namespace convention of letting a prefix stand in for a base URI, so that if the prefix fbv represents the URI http://foobarco.net/vocab/ then fbv:inStock means the same thing as http://foobarco.net/vocab/inStock. Using these conventions, we can declare two prefixes and then represent the triple about part p1234 in Turtle with the following lines, which include another triple about the same part's supplier:

    @prefix fbd: <http://foobarco.net/data/> .
    @prefix fbv: <http://foobarco.net/vocab/> .
    fbd:p1234 fbv:inStock "9" .
    fbd:p1234 fbv:supplier "Joe's Part Company" .

You can see that the object, or third part of a triple, need not be a URI. However, using one (or a prefixed name equivalent) has some advantages. This alternative version of the second triple above represents the part's supplier with a prefixed name instead of a text string:

    fbd:p1234 fbv:supplier fbd:s9483 .
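
To make the prefix mechanism concrete, here is a minimal Python sketch (an illustration only, not part of any RDF library) that expands a Turtle-style prefixed name into its full URI. The `PREFIXES` table and the `expand` helper are hypothetical names introduced for this example; real Turtle parsers also handle escapes, base URIs, and other details that this sketch ignores.

```python
# Hypothetical prefix table mirroring the two @prefix declarations above.
PREFIXES = {
    "fbd": "http://foobarco.net/data/",
    "fbv": "http://foobarco.net/vocab/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'fbv:inStock' into a full URI."""
    prefix, sep, local_name = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


# The triple whose object is a resource rather than a text string.
triple = tuple(expand(name) for name in ("fbd:p1234", "fbv:supplier", "fbd:s9483"))
print(triple)
```

Because every part of this triple expands to a full URI, the supplier becomes a resource that other triples can describe, rather than an opaque string.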

BIG DATA

MARCH 2013  DOI: 10.1089/big.2012.0004

If we represent the part supplier company with a URI or prefixed name, how do we know its actual name? The same way that we know anything in RDF: because, with a URI representing that company as a resource, we can attach all the data we want to it. Below, the seven triples following the two namespace declarations tell us a bit more about the part and its supplier:

    @prefix fbd: <http://foobarco.net/data/> .
    @prefix fbv: <http://foobarco.net/vocab/> .
    fbd:p1234 fbv:inStock "9" .
    fbd:p1234 fbv:name "Blue reverse flange" .
    fbd:p1234 fbv:supplier fbd:s9483 .
    fbd:s9483 fbv:name "Joe's Part Company" .
    fbd:s9483 fbv:homePage "http://www.joespartco.com" .
    fbd:s9483 fbv:contactName "Gina Smith" .
    fbd:s9483 fbv:contactEmail "gina.smith@joespartco.com" .
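
One way to picture "attaching data to a resource" is to hold the same seven triples as plain Python tuples and collect everything said about a given subject. This is a simplified sketch under stated assumptions: the `describe` helper is a hypothetical name, and plain strings stand in for both URIs and literals, a distinction that real RDF tools preserve.

```python
from collections import defaultdict

FBD = "http://foobarco.net/data/"
FBV = "http://foobarco.net/vocab/"

# The seven inventory triples as (subject, predicate, object) tuples.
TRIPLES = [
    (FBD + "p1234", FBV + "inStock", "9"),
    (FBD + "p1234", FBV + "name", "Blue reverse flange"),
    (FBD + "p1234", FBV + "supplier", FBD + "s9483"),
    (FBD + "s9483", FBV + "name", "Joe's Part Company"),
    (FBD + "s9483", FBV + "homePage", "http://www.joespartco.com"),
    (FBD + "s9483", FBV + "contactName", "Gina Smith"),
    (FBD + "s9483", FBV + "contactEmail", "gina.smith@joespartco.com"),
]


def describe(resource, triples):
    """Group every predicate/object pair recorded for one subject."""
    properties = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == resource:
            properties[predicate].append(obj)
    return dict(properties)


# Follow the supplier link from the part, then look up the supplier's name.
supplier = describe(FBD + "p1234", TRIPLES)[FBV + "supplier"][0]
print(describe(supplier, TRIPLES)[FBV + "name"])
```

Each property maps to a list because nothing in the data model prevents a resource from having several values for the same property.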

All of this data can be stored easily enough in a relational database. RDF, however, offers several important advantages:

• Virtually any RDF software can parse the lines shown above as a self-contained, working data file. You can declare properties if you want (doing so just takes one triple per property), and the RDF Schema (RDFS) standard lets you declare classes and relationships between properties and classes, but these are completely optional, and I didn't need to declare anything in my example. I just listed facts, one per triple. To add a new triple with a brand-new property, I can just do it; there's no need to modify a schema in advance so that my software won't choke on the unrecognized property name. The flexibility that we get from this lack of dependence on schemas is the first key to RDF's value.
• I could split the seven triples above into two or three or seven files and it wouldn't affect their collective meaning, which makes sharding of data collections easy. Multiple datasets can be combined into a usable whole with simple concatenation. This ability to aggregate triples from different sources (whether you had split up a large set for distributed storage or gathered together different, independently developed sets from different sources) is the second key to RDF's value, because along with the lack of a dependence on schemas, it makes integration of data from different sources nearly trivial.
• For this inventory dataset's property name URIs, I wouldn't use fbv:name for part and company names or fbv:homePage for the supplier company's web address. I know that popular shared vocabularies offer property names such as http://www.w3.org/2000/01/rdf-schema#label from the RDFS vocabulary (typically abbreviated rdfs:label) and http://xmlns.com/foaf/0.1/homepage (usually foaf:homepage) from the Friend of a Friend vocabulary. The sharing of vocabularies, both for broadly used properties like these and for specific domains such as biology and e-commerce, is the third key to RDF's value, because it makes it easier to find connections in data aggregated from disparate sources.

This storage of data in individual, potentially connected units has much in common with the approach of several NoSQL databases used in big data implementations: for example, column-oriented databases such as HBase and especially graph NoSQL databases such as Neo4J. (A set of RDF triples is, in fact, a graph database.) In particular, it's an efficient way to store sparse (though not necessarily small) datasets that don't store all the same properties for every instance of a given class. The most important feature that separates RDF from these other NoSQL options is its accompanying query language standard, SPARQL, which is how you find and use the connections between triples in aggregated data.

Querying RDF

Querying data with most big data tools means writing a program in Java, in a scripting language such as Ruby, Python, or JavaScript, or in a specialized query language designed around a single tool such as Neo4J's Cypher or Hadoop's Hive. SPARQL (SPARQL Protocol And RDF Query Language) is widely implemented and, although SQL-like in its syntax, even more uniform from one implementation to another than SQL. The following SPARQL query asks for all property names and values associated with the fbd:s9483 resource:

    PREFIX fbd: <http://foobarco.net/data/>
    SELECT ?property ?value
    WHERE {fbd:s9483 ?property ?value.}

The heart of any SPARQL query is the WHERE clause, which specifies the triples to pull out of the dataset. Various options for the rest of the query tell the SPARQL processor what to do with those triples, such as sorting, creating, or deleting triples. The above query's WHERE clause has a single triple pattern, which resembles a triple but may have variables substituted for any or all of the triple's three parts. The triple pattern above says that we're interested in triples that have fbd:s9483 as the subject and, because variables function as wildcards, anything at all in the triple's second and third parts. When a SPARQL engine finds triples that match this pattern (in this case, any triple with a subject of fbd:s9483), it will store the other parts of those triples in the corresponding variables. This query's SELECT clause tells the SPARQL engine to list the values that it stored in the ?property and ?value variables.
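
The triple-pattern matching described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how ARQ or any real SPARQL engine is implemented: the `match` and `is_variable` helpers are hypothetical, strings beginning with `?` play the role of variables, and plain strings stand in for both URIs and literals.

```python
FBD = "http://foobarco.net/data/"
FBV = "http://foobarco.net/vocab/"

# The seven inventory triples as (subject, predicate, object) tuples.
TRIPLES = [
    (FBD + "p1234", FBV + "inStock", "9"),
    (FBD + "p1234", FBV + "name", "Blue reverse flange"),
    (FBD + "p1234", FBV + "supplier", FBD + "s9483"),
    (FBD + "s9483", FBV + "name", "Joe's Part Company"),
    (FBD + "s9483", FBV + "homePage", "http://www.joespartco.com"),
    (FBD + "s9483", FBV + "contactName", "Gina Smith"),
    (FBD + "s9483", FBV + "contactEmail", "gina.smith@joespartco.com"),
]


def is_variable(term):
    """Treat any string that starts with '?' as a query variable."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples):
    """Return one {variable: value} dict per triple that fits the pattern."""
    solutions = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if is_variable(term):
                # A variable repeated within the pattern must bind consistently.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # A fixed term did not match this triple.
        else:
            solutions.append(bindings)
    return solutions


# Equivalent in spirit to: WHERE {fbd:s9483 ?property ?value.}
for row in match((FBD + "s9483", "?property", "?value"), TRIPLES):
    print(row["?property"], row["?value"])
```

The fixed subject filters the triples, and the two variables simply capture whatever appears in the remaining positions, which is the wildcard behavior the text describes.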
MARY ANN LIEBERT, INC.  VOL. 1 NO. 1  MARCH 2013 BIG DATA

The following shows the default output of running this query on the seven triples in the inventory data example with the open-source ARQ query engine; like most query engines, ARQ can also return the results in XML, CSV, JSON, and other formats:

    --------------------------------------------------------------------------
    | property                                 | value                       |
    ==========================================================================
    | <http://foobarco.net/vocab/contactEmail> | "gina.smith@joespartco.com" |
    | <http://foobarco.net/vocab/contactName>  | "Gina Smith"                |
    | <http://foobarco.net/vocab/homePage>     | "http://www.joespartco.com" |
    | <http://foobarco.net/vocab/name>         | "Joe's Part Company"        |
    --------------------------------------------------------------------------

The next query asks for the contact e-mail address of the supplier of the Blue reverse flange. It does this with three triple patterns in its WHERE clause: the first looks for the resource (which the SPARQL engine will store in the variable ?part) that has an fbv:name of "Blue reverse flange", and the second looks for the fbv:supplier value of that same resource, storing it in the ?supplier variable. The final triple pattern asks for the fbv:contactEmail value associated with the supplier that was identified by the second triple pattern, storing it in the ?flangeContactEmail variable. The SELECT clause is only interested in that one variable's value (or values: if that supplier has multiple fbv:contactEmail values, the query will find all of them).

    PREFIX fbd: <http://foobarco.net/data/>
    PREFIX fbv: <http://foobarco.net/vocab/>
    SELECT ?flangeContactEmail
    WHERE {
      ?part fbv:name "Blue reverse flange".
      ?part fbv:supplier ?supplier.
      ?supplier fbv:contactEmail ?flangeContactEmail.
    }

The use of the same variable in the object position of one triple pattern and the subject position of another (in this case, ?supplier) is one way that SPARQL can find connections between triples. Because RDF triples from different sources can be so easily aggregated, this ability to identify connections between different triples is one of the great benefits that SPARQL brings to big data applications, which are often looking for patterns among aggregations of disparate datasets.

One advantage of SPARQL over SQL is its ability to let you query the structure of an unfamiliar set of data. Relational database products give you commands to list databases, the tables in a database, and the names of a table's columns, but these commands are, if standardized at all, inconsistently supported: someone accustomed to MySQL must learn new ways to perform these tasks when working with relational databases from Oracle or IBM. To ask the question "what properties get used in this set of data?", the same brief SPARQL query works with any SPARQL processor and any RDF dataset:

    SELECT DISTINCT ?propertyName
    WHERE {?resourceID ?propertyName ?propertyValue}

This is only one of many SPARQL queries that let you look for implicit structure in a dataset, which can be very useful with data aggregated from different silos.

Once you've identified some implicit structure, RDFS offers various benefits to big data projects. Using this optional schema language to define structure descriptively as you find that structure (instead of describing it prescriptively before you load your first byte of data, as with most schema languages), you can then use that structure to build applications and infer new knowledge to act on. This provides the perfect compromise to the NoSQL debate over schema-based vs. schemaless data management: an optional schema language that you can add iteratively as data accumulates. The incremental use of RDFS and SPARQL together gives you an agile alternative to the expensive, time-consuming steps associated with planning a typical data warehousing project.

Querying Non-RDF Data

Gaining all of these benefits does not require migrating all your data to RDF. Open-source and commercial tools are available to dynamically represent relational data, spreadsheets, and other formats in the RDF data model so that you can query them with SPARQL. One such tool, the open-source D2RQ platform, lets you execute a single SPARQL query across different databases stored in different relational database management systems (for example, both SQL Server and MySQL) with no special preparation necessary for those databases other than granting read access to D2RQ. D2RQ makes the data available as a SPARQL endpoint, or a service that accepts SPARQL queries and returns the results using the SPARQL protocol (the P in SPARQL). Hundreds of public SPARQL endpoints exist, delivering data about life sciences, government, publishing, and more as part of the Linked Open Data Cloud.

By using SPARQL endpoint tools such as D2RQ to share data across internal silos, many enterprises are creating their own private Linked Data clouds behind their firewalls, building them on the HTTP infrastructures already in place for their HTML-based intranets. Because most SPARQL endpoints can return data in simple XML, JSON, and CSV formats, their integration into a service-oriented architecture already using such an infrastructure is usually pretty simple.
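
As a rough illustration of why that integration tends to be simple, the sketch below uses only Python's standard library to read a response in the W3C SPARQL Query Results JSON format. The sample response is hand-written for this example (it is not captured from a real endpoint), and the `rows_from_sparql_json` helper is a hypothetical name.

```python
import json

# Hand-written sample in the SPARQL Query Results JSON format; an assumption
# for illustration, not output captured from a real endpoint.
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["property", "value"]},
  "results": {"bindings": [
    {"property": {"type": "uri", "value": "http://foobarco.net/vocab/name"},
     "value": {"type": "literal", "value": "Joe's Part Company"}},
    {"property": {"type": "uri", "value": "http://foobarco.net/vocab/contactName"},
     "value": {"type": "literal", "value": "Gina Smith"}}
  ]}
}
"""


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    document = json.loads(text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable may be unbound in a given solution, so check before reading.
        rows.append({var: binding[var]["value"] for var in variables if var in binding})
    return rows


for row in rows_from_sparql_json(SAMPLE_RESPONSE):
    print(row["property"], "->", row["value"])
```

The same function would work on any SELECT result in this format, since the variable names come from the response's own `head` section instead of being hard-coded.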

RDF triples such as data retrieved from internal and public SPARQL endpoints can be stored in a specialized RDF database known as a triplestore. If you do choose to store retrieved RDF data natively in addition to working with dynamically generated RDF, you get all the benefits that an extract-transform-load system gives you, and implementations of the new SPARQL 1.1 specification let you update data and support a range of new functions that give you greater power to transform the data. (SPARQL 1.1's federated query capability also lets a single query retrieve data from multiple endpoints, another boon for applications working with diverse data sources.) With SPARQL endpoint middleware such as D2RQ shielding your application from the actual syntax and storage format of the datasets you're using, all you need is SPARQL to extract the remote data, transform it into a model more suitable to your applications if you wish, and load it into a local triplestore.

RDF, SPARQL, and Big Data

The uRiKA appliance from Cray's YarcData subsidiary (which, according to the product's homepage, was created to transform big data into meaningful information, in real time, by discovering unknown relationships; http://yarcdata.com/products.html) has used an RDF data model and supported SPARQL since it was first released. To target more of the big data market, Oracle Spatial has supported RDF and SPARQL since Oracle 10g, and more recently, IBM's DB2 has added this support. Big Data is often defined as data whose volume, variety, and velocity exceed the capabilities of traditional database tools, and while these companies certainly have technology for dealing with volume and velocity, the variety of data (and especially, of data structures) that big data projects often tackle requires an approach different from starting with normalized relational tables and then squeezing all of the data into them. They found this in RDF.

Because RDF technology is all built from public standards, offerings from more specialized vendors such as triplestores from AllegroGraph and Stardog and the TopBraid application platform from TopQuadrant can mix and match with Cray, IBM, and Oracle's offerings as well as with open-source tools to create applications that can start small and provide a basis for incremental growth up to trillions of triples. That's some pretty big data.

Author Disclosure Statement

The author is a Solution Architect for TopQuadrant.

Address correspondence to:
Bob DuCharme
TopQuadrant
2938 Old Via Road
Charlottesville, VA 22901
E-mail: bob@snee.com
