
Introduction to AVRO

Apache Avro is a language-neutral data serialization system. Avro is
schema-based: a language-independent schema is associated with its read and
write operations. Serialized data is always accompanied by the schema that
describes it. Avro writes the data in a compact binary format, which any
application that has the schema can deserialize.

Avro provides:

•Rich data structures.

•A compact, fast, binary data format.

•A container file, to store persistent data.

•Remote procedure call (RPC).

•Simple integration with dynamic languages. Code generation is not required to
read or write data files nor to use or implement RPC protocols. Code generation
is an optional optimization, only worth implementing for statically typed
languages.

Some Important Features of AVRO

Dynamic Access – No code generation is needed to access the data.

Untagged Data – Because the schema is present when data is read, values are
written without per-field tags, which allows better compression.

Platform Independent – Has libraries in Java, Scala, Python, Ruby, C and C#.

Compressible and Splittable – Complements parallel processing systems such as
MapReduce (MR) and Spark.

Schema Evolution – “Data models evolve over time”, and it’s important that
your data formats support your need to modify your data models. Schema
evolution allows you to add, modify, and in some cases delete attributes,
while at the same time providing backward and forward compatibility for
readers and writers.

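The compatibility idea behind schema evolution can be sketched in a few lines of Python. This is only an illustration, not the Avro library's resolution code: the function name resolve_record, the MISSING sentinel, and the email field are hypothetical. The sketch covers just two rules: a reader ignores fields it does not know, and fills in fields the writer did not have from the reader's default. It leaves out type promotion and aliases.

```python
MISSING = object()  # sentinel: tells "no default declared" apart from a default of None


def resolve_record(record, writer_fields, reader_fields):
    """Project a record written with writer_fields onto reader_fields.

    Each field is a dict with a "name" and, optionally, a "default".
    """
    written = {field["name"] for field in writer_fields}
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            # Field exists on both sides: keep the written value.
            result[name] = record[name]
        else:
            # Field is new to the reader: fall back to its default, if any.
            default = field.get("default", MISSING)
            if default is MISSING:
                raise ValueError(f"reader field {name!r} has no default")
            result[name] = default
    return result


# Version 1 of a hypothetical Employee schema, and version 2 with an added field.
v1_fields = [{"name": "name"}, {"name": "age"}]
v2_fields = [{"name": "name"}, {"name": "age"}, {"name": "email", "default": None}]

old_record = {"name": "Ann", "age": 30}
new_record = {"name": "Bob", "age": 41, "email": "bob@example.com"}

# Backward compatibility: a new reader (v2) reads data written with v1.
upgraded = resolve_record(old_record, v1_fields, v2_fields)

# Forward compatibility: an old reader (v1) reads data written with v2.
downgraded = resolve_record(new_record, v2_fields, v1_fields)

print(upgraded)
print(downgraded)
```

Adding a field without a default would make the first call raise an error, which is why added attributes normally carry a default value.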
AVRO Schema Types

Primitive Types
null: no value
boolean: a binary value
int: 32-bit signed integer
long: 64-bit signed integer
float: single precision (32-bit)
double: double precision (64-bit)
bytes: sequence of 8-bit unsigned bytes
string: unicode character sequence

Complex Types
Records
Enums
Arrays
Maps
Unions
Fixed
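
To make the type list concrete, here is a small sketch of one record schema that uses several primitive types and each of the complex types. It is written as a Python dict that mirrors Avro's JSON schema form. The Order record, its namespace and all field names are made up for illustration, and the helper kind_of is hypothetical, not part of any Avro library.

```python
import json

# Hypothetical record schema; the dict mirrors the JSON an Avro schema file would contain.
order_schema = {
    "type": "record",
    "name": "Order",
    "namespace": "example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "paid", "type": "boolean"},
        {"name": "total", "type": "double"},
        {"name": "status", "type": {"type": "enum", "name": "Status",
                                    "symbols": ["NEW", "SHIPPED", "CANCELLED"]}},
        {"name": "items", "type": {"type": "array", "items": "string"}},
        {"name": "attributes", "type": {"type": "map", "values": "int"}},
        # A union is written as a JSON array of the allowed types.
        {"name": "note", "type": ["null", "string"], "default": None},
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
    ],
}

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def kind_of(field_type):
    """Classify a field type as a primitive name or a complex-type kind."""
    if isinstance(field_type, str):
        return field_type if field_type in PRIMITIVES else "named reference"
    if isinstance(field_type, list):
        return "union"
    return field_type["type"]


kinds = {field["name"]: kind_of(field["type"]) for field in order_schema["fields"]}
schema_json = json.dumps(order_schema, indent=2)  # the text form of the schema
print(kinds)
```

The record itself is the outermost complex type; the enum, array, map, union and fixed types appear as the types of individual fields.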

Avro IDL
In addition to supporting JSON for type and protocol definitions, Avro includes
experimental support for an alternative interface description language (IDL) syntax
known as Avro IDL. Previously known as GenAvro, this format is designed to ease
adoption by users familiar with more traditional IDLs and programming languages,
with a syntax similar to that of C/C++ and other languages.

Creating Avro Schemas

The Avro schema is written in JavaScript Object Notation (JSON), a lightweight
text-based data interchange format. A schema takes one of the following forms −

A JSON string, naming a defined type
A JSON object, defining a new type
A JSON array, representing a union of types

Example − Consider a schema which defines a document, under the namespace
Tutorialspoint, with name Employee, having fields name and age.
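
Written out, the schema described here could look like the JSON below. It is embedded in a short Python snippet so that it can be parsed and inspected with the standard json module. The namespace, schema name and field names come from the description; the field types (string for name, int for age) are assumptions, since the text does not state them.

```python
import json

# Employee schema as JSON text. The field types are assumed, not given in the text.
employee_schema_json = """
{
  "type": "record",
  "namespace": "Tutorialspoint",
  "name": "Employee",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
"""

schema = json.loads(employee_schema_json)

# The full name of a schema is its namespace and its name joined by a dot.
full_name = schema["namespace"] + "." + schema["name"]
field_names = [field["name"] for field in schema["fields"]]

print(full_name)    # Tutorialspoint.Employee
print(field_names)  # ['name', 'age']
```
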
In this example, you can observe that there are four fields for each record −

type − This field appears at the document level as well as inside each entry of
the field named fields.
In case of the document, it shows the type of the document, generally a record
because there are multiple fields.
In case of a field, it describes the data type of that field.

namespace − This field describes the name of the namespace in which the object
resides.

name − This field appears at the document level as well as inside each entry of
the field named fields.
In case of the document, it describes the schema name. This schema name
together with the namespace uniquely identifies the schema within the store
(Namespace.schema name). In this example, the full name of the
schema is Tutorialspoint.Employee.
In case of a field, it describes the name of the field.

fields − This field holds a JSON array listing the fields of the record, each
with a name and a type.

General Working of Avro

To use Avro, you need to follow the given workflow −

Step 1 − Create schemas. Here you need to design an Avro schema according to your data.

Step 2 − Read the schemas into your program. This is done in one of two ways −
By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This
generates a class file corresponding to the schema.
By Using the Parsers Library − You can read the schema directly using the parsers library.

Step 3 − Serialize the data using the serialization API provided for Avro, which is found in
the package org.apache.avro.specific (for generated classes) or org.apache.avro.generic
(when the schema is parsed at run time).

Step 4 − Deserialize the data using the deserialization API provided for Avro, which is found in
the same packages, org.apache.avro.specific or org.apache.avro.generic.
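
The four steps can be sketched end to end without any library, to show what serialization and deserialization do for a simple record. This is a hand-rolled illustration and not the Avro API: the function names are hypothetical, only the string, int and long types are handled, and the Employee field types are assumed. It follows the Avro binary encoding rules for these types: integers are written as zig-zag variable-length values, strings as a length followed by UTF-8 bytes, and a record as its field values in schema order with no field names.

```python
import io
import json


def write_long(out, n):
    """Write an int/long as a zig-zag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small codes
    while z > 0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))  # low 7 bits, continuation bit set
        z >>= 7
    out.write(bytes([z]))


def read_long(inp):
    """Read a zig-zag, variable-length integer."""
    shift, z = 0, 0
    while True:
        byte = inp.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo the zig-zag mapping


def write_string(out, text):
    data = text.encode("utf-8")
    write_long(out, len(data))  # length prefix, then the raw bytes
    out.write(data)


def read_string(inp):
    return inp.read(read_long(inp)).decode("utf-8")


WRITERS = {"string": write_string, "int": write_long, "long": write_long}
READERS = {"string": read_string, "int": read_long, "long": read_long}


def serialize(schema, record):
    """Step 3: write field values in schema order, without field names."""
    out = io.BytesIO()
    for field in schema["fields"]:
        WRITERS[field["type"]](out, record[field["name"]])
    return out.getvalue()


def deserialize(schema, payload):
    """Step 4: read the values back, using the schema to name and type them."""
    inp = io.BytesIO(payload)
    return {field["name"]: READERS[field["type"]](inp) for field in schema["fields"]}


# Steps 1 and 2: create the schema and read it into the program.
schema = json.loads(
    '{"type": "record", "namespace": "Tutorialspoint", "name": "Employee",'
    ' "fields": [{"name": "name", "type": "string"}, {"name": "age", "type": "int"}]}'
)

payload = serialize(schema, {"name": "Ann", "age": 30})
restored = deserialize(schema, payload)

print(payload.hex())  # 06416e6e3c
print(restored)       # {'name': 'Ann', 'age': 30}
```

The record takes five bytes: one for the string length, three for "Ann", and one for the age. The payload cannot be decoded without the schema, which is why Avro keeps the schema available to every reader.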
