2/10/17, 20:30
This is the first in a series of posts on Data Engineering. If you like this
and want to know when the next post in the series is released, you can
subscribe at the bottom of the page.
From helping cars drive themselves to helping Facebook tag you in photos,
data science has attracted a lot of buzz recently. Data scientists have
become extremely sought after, and for good reason: a skilled data
scientist can add incredible value to a business.
https://www.dataquest.io/blog/what-is-a-data-engineer/ Page 1 of 11
What is Data Engineering? 2/10/17, 20:30
But a data scientist is only as good as the data they have access to. Most
companies store their data in a variety of formats across databases and text
files. This is where data engineers come in: they build pipelines that
transform that data into formats that data scientists can use. Data
engineers are just as important as data scientists, but are often less visible
because they work further from the end product of the analysis.
A good analogy is a race car builder vs. a race car driver. The driver gets the
excitement of speeding along a track, and the thrill of victory in front of a
crowd. But the builder gets the joy of tuning engines, experimenting with
different exhaust setups, and creating a powerful, robust machine. If
you're the type of person who likes building and tweaking systems, data
engineering might be right for you. In this post, we'll explore the day-to-day
work of a data engineer, and discuss the skills required for the role.
Data engineering is also a broad field, but any individual data engineer
doesn't need to know the whole spectrum of skills. In this section, we'll
sketch the broad outlines of data engineering, then walk through more
specific descriptions that illustrate specific data engineering roles.
A data engineer transforms data into a useful format for analysis. Imagine
that you're a data engineer working on a simple competitor to Uber called
Rebu. Your users have an app on their device through which they access
your service. They request a ride to a destination through your app, which
gets routed to a driver, who then picks them up and drops them off. After
the ride, they're charged, and have the option to rate their driver.
As you may expect, this kind of system will generate huge amounts of data.
You'll have a few different data stores:
- The database that backs your main app. This contains user and driver
  information.
- Server analytics logs
  - Server access logs. These contain one line per request made to
    the server from the app.
  - Server error logs. These contain all the server-side errors
    generated by your app.
- App analytics logs
  - App event logs. These contain information about what actions
    users and drivers took in the app. For example, you'd log when
    they clicked a button or updated their payment information.
  - App error logs. These contain information about errors in the
    app.
- Ride database. This contains information about a single ride for
  a user/driver pair, including status information on the ride.
- Customer service database. This contains information about customer
  interactions with customer service agents. It can include voice
  transcripts and email logs.
[Figure: Rebu's data stores, including the main database, the ride database, and access logs]
Let's say a data scientist wants to analyze a user's action history with your
service, and see what actions correlate with users who spend more. In
order to enable this analysis, you'll need to combine information
from the server access logs and the app event logs. You'll need to:
In order to solve this, you'll need to create a pipeline that can ingest mobile
app logs and server logs in real time, parse them, and attach them to a
specific user. You'll then need to store the parsed logs in a database, so
they can easily be queried by the API. You'll need to spin up several servers
behind a load balancer to process the incoming logs.
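To make the parse, attach, and store steps concrete, here is a minimal sketch in Python. It is not how Rebu (a made-up company) would necessarily do it: the two log formats, the field names, and the `user_events` table are all assumptions invented for illustration, and an in-process SQLite database stands in for whatever store the API would really query.

```python
import json
import sqlite3


def parse_app_event(line):
    """Parse one app event log line (assumed here to be a JSON object)."""
    record = json.loads(line)
    return {
        "user_id": record["user_id"],
        "ts": record["ts"],
        "source": "app",
        "detail": record["action"],
    }


def parse_access_line(line):
    """Parse one server access log line.

    Assumed format: "<timestamp> <user_id> <method> <path> <status>".
    """
    ts, user_id, method, path, status = line.split()
    return {
        "user_id": int(user_id),
        "ts": ts,
        "source": "server",
        "detail": f"{method} {path} {status}",
    }


def store_events(conn, events):
    """Store parsed events so they can be queried per user."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_events ("
        "user_id INTEGER, ts TEXT, source TEXT, detail TEXT)"
    )
    conn.executemany(
        "INSERT INTO user_events VALUES (:user_id, :ts, :source, :detail)",
        events,
    )
    conn.commit()


def history_for_user(conn, user_id):
    """Return one user's events from both log sources, oldest first."""
    rows = conn.execute(
        "SELECT ts, source, detail FROM user_events "
        "WHERE user_id = ? ORDER BY ts",
        (user_id,),
    )
    return rows.fetchall()
```

An API endpoint would then call something like `history_for_user(conn, 7)` to return the combined history for user 7. A real pipeline would read from a stream rather than from in-memory lines, but the shape of the work is the same.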
Most of the issues that you'll run into will be around reliability and
distributed systems. For example, if you have millions of devices to gather
logs from, and variable demand (in the morning, you get a ton of logs, but
not as many at midnight), you'll need a system that can automatically scale
your server count up and down.
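The core of such a scaling decision can be sketched in a few lines of Python. This is only an illustration of the idea: the function name, the per-server capacity, and the minimum and maximum bounds are all hypothetical, and in practice this logic usually lives in a cloud provider's autoscaling service rather than in your own code.

```python
import math


def desired_server_count(logs_per_second, capacity_per_server,
                         min_servers=2, max_servers=50):
    """Pick a server count for the current load, within fixed bounds.

    capacity_per_server is the (assumed) number of log lines one server
    can process per second.
    """
    needed = math.ceil(logs_per_second / capacity_per_server)
    # Never drop below the minimum (for reliability) or exceed the cap.
    return max(min_servers, min(max_servers, needed))
```

With a capacity of 1,000 log lines per second per server, a morning peak of 12,000 lines per second asks for 12 servers, while a midnight trickle of 300 lines per second falls back to the minimum of 2.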
- Storage: this involves storing the end results for fast retrieval.
- Access: you'll need to enable a tool or user to access the end results
  of the pipeline.
For a more complex example, imagine that a data scientist wants to build a
system that finds all rides that ended prematurely due to app or driver
issues. One way to do this is to look at the customer service database to see
which rides ended with issues, and analyze their language along with some
data about the ride.
Before the data scientist can do this, they need a way to match up the logs
in the customer service database with specific rides. As a data engineer,
you'll want to create an API endpoint that allows the data scientist to query
for all customer service messages related to a particular ride. In order to do
this, you'll need to:
- Create a system that pulls data from the ride database, and figures out
  information about the ride, such as how long it was, and whether the
  destination matched the user's initial request.
- Combine the computed statistics on each ride with user information,
  such as name and user id.
- Extract error information from the app and server analytics logs
  pertaining to the user during the time period of the ride.
- Find all customer service queries by a user.
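The four steps above can be sketched as small functions that are combined into one record per ride. Everything named here (the `Ride` fields, the dictionary keys, the function names) is an assumption made for illustration, and plain Python lists stand in for the real databases and logs; the language is incidental to the concept.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Ride:
    # Hypothetical shape of one row in the ride database.
    ride_id: int
    user_id: int
    start: datetime
    end: datetime
    requested_destination: str
    actual_destination: str


def ride_stats(ride):
    """Step 1: derive facts about a ride from its database row."""
    return {
        "ride_id": ride.ride_id,
        "duration_minutes": (ride.end - ride.start).total_seconds() / 60,
        "destination_matched": ride.requested_destination == ride.actual_destination,
    }


def with_user(stats, ride, users):
    """Step 2: attach user information to the computed statistics."""
    return {**stats, "user_id": ride.user_id, "user_name": users[ride.user_id]}


def errors_during_ride(ride, error_logs):
    """Step 3: keep this user's errors that occurred during the ride."""
    return [
        e for e in error_logs
        if e["user_id"] == ride.user_id and ride.start <= e["ts"] <= ride.end
    ]


def service_queries(ride, queries):
    """Step 4: find all customer service queries by the ride's user."""
    return [q for q in queries if q["user_id"] == ride.user_id]


def build_ride_record(ride, users, error_logs, queries):
    """Run all four steps and return one combined record for the ride."""
    record = with_user(ride_stats(ride), ride, users)
    record["errors"] = errors_during_ride(ride, error_logs)
    record["service_queries"] = service_queries(ride, queries)
    return record
```

Calling `build_ride_record` whenever a ride is added would keep the combined records current, which is the property the API needs.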
A skilled data engineer will be able to build a pipeline that performs each of
the above steps every time a new ride is added. This will ensure that the
data served by the API is always up to date, and that whatever analysis the
data scientist does is valid.
Note that we didn't mention any tools above. Although tools like Hadoop
and Spark and languages like Scala and Python are important to data
engineering, it's more important to understand the concepts well and know
how to build real-world systems. We'll continue this focus on concepts over
tools throughout this series on data engineering.
Although data engineers need to have the skills listed above, the day-to-day
work of a data engineer will vary depending on the type of company they work
for. Broadly, you can classify data engineers into a few categories:
- Generalist
- Pipeline-centric
- Database-centric
Generalist
Pipeline-centric
Database-centric
After Rebu takes over the world, a database-centric data engineer might
design an analytics database, then create scripts to pull information from
the main app database into the analytics database.
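A script of that kind might look like the following sketch, which copies only the rides that the analytics database has not seen yet. The table names (`rides`, `ride_facts`) and columns are hypothetical, and SQLite is used only so the example is self-contained; a production setup would more likely involve a dedicated analytics database and a scheduler.

```python
import sqlite3


def sync_rides(app_conn, analytics_conn):
    """Copy rides not yet present in the analytics database.

    Returns the number of rows copied in this run.
    """
    analytics_conn.execute(
        "CREATE TABLE IF NOT EXISTS ride_facts ("
        "ride_id INTEGER PRIMARY KEY, user_id INTEGER, fare_cents INTEGER)"
    )
    # Find the newest ride already copied (0 if the table is empty).
    last_id = analytics_conn.execute(
        "SELECT COALESCE(MAX(ride_id), 0) FROM ride_facts"
    ).fetchone()[0]
    # Pull only rides added to the app database since the last run.
    new_rows = app_conn.execute(
        "SELECT id, user_id, fare_cents FROM rides WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    analytics_conn.executemany(
        "INSERT INTO ride_facts VALUES (?, ?, ?)", new_rows
    )
    analytics_conn.commit()
    return len(new_rows)
```

Tracking the highest copied id keeps each run incremental, so the script can be re-run safely without duplicating rows. It assumes ride ids only ever increase and that existing rides are not modified after the fact.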
In this post, we covered data engineering and the skills needed to practice
it at a high level. If you're interested in architecting large-scale systems, or
working with huge amounts of data, then data engineering is a good field
for you. It can be very exciting to see your autoscaling data pipeline
suddenly handle a traffic spike, or get to work with machines that have
Vik Paruchuri