
Introduction

This Wiley Business e-book sampler includes selected materials from seven recently published titles in Wiley's Marketing, Tech, Design, Management, and Business lists. The material included for each selection is the book's full Table of Contents as well as a full sample chapter. To learn more, please visit the individual links below.

Predictive Analytics by Eric Siegel
Data Points by Nathan Yau
Infographics by Jason Lankow, Josh Ritchie, and Ross Crooks
Too Big to Ignore by Phil Simon
Optimize by Lee Odden
Measure What Matters by Katie D. Paine
Social Media Metrics by Jim Sterne

Visit Wiley Business at http://www.wiley.com/go/wileybiz and follow us on Twitter.com/WileyBiz and Facebook.com/WileyBiz for the latest business books and ebooks.

Contents

Foreword, Thomas H. Davenport  xiii

Preface  xv
What is the occupational hazard of predictive analytics?

Introduction: The Prediction Effect
How does predicting human behavior combat risk, fortify healthcare, toughen crime fighting, and boost sales? Why must a computer learn in order to predict? How can lousy predictions be extremely valuable? What makes data exceptionally exciting? How is data science like porn? Why shouldn't computers be called computers? Why do organizations predict when you will die?

Chapter 1: Liftoff! Prediction Takes Action (deployment)  17
How much guts does it take to deploy a predictive model into field operation, and what do you stand to gain? What happens when a man invests his entire life savings into his own predictive stock market trading system?

Chapter 2: With Power Comes Responsibility: Hewlett-Packard, Target, and the Police Deduce Your Secrets (ethics)  37
How do we safely harness a predictive machine that can foresee job resignation, pregnancy, and crime? Are civil liberties at risk? Why does one leading health insurance company predict policyholder death? An extended sidebar on fraud detection addresses the question: how does machine intelligence flip the meaning of fraud on its head?

Chapter 3: The Data Effect: A Glut at the End of the Rainbow (data)  67
We are up to our ears in data, but how much can this raw material really tell us? What actually makes it predictive? Does existing data go so far as to reveal the collective mood of the human populace? If yes, how does our emotional online chatter relate to the economy's ups and downs?

Chapter 4: The Machine That Learns: A Look Inside Chase's Prediction of Mortgage Risk (modeling)  103
What form of risk has the perfect disguise? How does prediction transform risk to opportunity? What should all businesses learn from insurance companies? Why does machine learning require art in addition to science? What kind of predictive model can be understood by everyone? How can we confidently trust a machine's predictions? Why couldn't prediction prevent the global financial crisis?

Chapter 5: The Ensemble Effect: Netflix, Crowdsourcing, and Supercharging Prediction (ensembles)  133
To crowdsource predictive analytics (outsource it to the public at large), a company launches its strategy, data, and research discoveries into the public spotlight. How can this possibly help the company compete? What key innovation in predictive analytics has crowdsourcing helped develop? Must supercharging predictive precision involve overwhelming complexity, or is there an elegant solution? Is there wisdom in nonhuman crowds?

Chapter 6: Watson and the Jeopardy! Challenge (question answering)  151
How does Watson, IBM's Jeopardy!-playing computer, work? Why does it need predictive modeling in order to answer questions, and what secret sauce empowers its high performance? How does the iPhone's Siri compare? Why is human language such a challenge for computers? Is artificial intelligence possible?

Chapter 7: Persuasion by the Numbers: How Telenor, U.S. Bank, and the Obama Campaign Engineered Influence (uplift)  187
What is the scientific key to persuasion? Why does some marketing fiercely backfire? Why is human behavior the wrong thing to predict? What should all businesses learn about persuasion from presidential campaigns? What voter predictions helped Obama win in 2012 more than the detection of swing voters? How could doctors kill fewer patients inadvertently? How is a person like a quantum particle? Riddle: What often happens to you that cannot be perceived, and that you can't even be sure has happened afterward, but that can be predicted in advance?

Afterword: Ten Predictions for the First Hour of 2020  218
Appendices
A. Five Effects of Prediction  221
B. Twenty-One Applications of Predictive Analytics  222
C. Prediction People: Cast of Characters  225
Notes  228
Acknowledgments  290
About the Author  292
Index  293
Buy This Book

CHAPTER 1

Liftoff! Prediction Takes Action

How much guts does it take to deploy a predictive model into field operation, and what do you stand to gain? What happens when a man invests his entire life savings into his own predictive stock market trading system? Launching predictive analytics means to act on its predictions, applying what's been learned, what's been discovered within data. It's a leap many take: you can't win if you don't play.

In the mid-1990s, an ambitious postdoc researcher couldn't stand to wait any longer. After consulting with his wife, he loaded their entire life savings into a stock market prediction system of his own design, a contraption he had developed moonlighting on the side. Like Dr. Henry Jekyll imbibing his own untested potion in the moonlight, the young Dr. John Elder unflinchingly pressed Go.

There is a scary moment every time new technology is launched. A spaceship lifting off may be the quintessential portrait of technological greatness and national prestige, but the image leaves out a small group of spouses terrified to the very point of psychological trauma. Astronauts are in essence stunt pilots, voluntarily strapping themselves in to serve as guinea pigs for a giant experiment, willing to sacrifice themselves in order to be part of history.

From grand challenges are born great achievements. We've taken strolls on our moon, and in more recent years a $10 million Grand Challenge prize was awarded to the first nongovernmental organization to develop a reusable manned spacecraft. Driverless cars have been unleashed ("Look, Ma, no hands!"). Fueled as well by millions of dollars in prize money, they navigate autonomously around the campuses of Google and BMW.

Replace the roar of rockets with the crunch of data, and the ambitions are no less far-reaching, boldly going not to space but to a new final frontier: predicting the future. This frontier is just as exciting to explore, yet less dangerous and uncomfortable (outer space is a vacuum, and vacuums totally suck). Millions in grand challenge prize money go toward averting the unnecessary hospitalization of each patient and predicting the idiosyncratic preferences of each individual consumer.


The TV quiz show Jeopardy! awarded $1.5 million in prize money for a face-off between man and machine that demonstrated dramatic progress in predicting the answers to questions (IBM invested a lot more than that to achieve this win, as detailed in Chapter 6). Organizations are literally keeping kids in school, keeping the lights on, and keeping crime down with predictive analytics (PA). And success is its own reward when analytics wins a political election, a baseball championship, or . . . did I mention managing a financial portfolio?

Black box trading, driving financial trading decisions automatically with a machine, is the holy grail of data-driven decision making. It's a black box into which current financial environmental conditions are fed, with buy/hold/sell decisions spit out the other end. It's black (i.e., opaque) because you don't care what's on the inside, as long as it makes good decisions. When working, it trumps any other conceivable business proposal in the world: Your computer is now a box that turns electricity into money.

And so with the launch of his stock trading system, John Elder took on his own personal grand challenge. Even if stock market prediction would represent a giant leap for mankind, this was no small step for John himself. It's an occasion worthy of mixing metaphors. Going for broke by putting all his eggs into one analytical basket, John was taking a healthy dose of his own medicine. Before continuing with the story of John's blast-off, let's establish how launching a predictive system works, not only for black box trading but across a multitude of applications.

Going Live
"Learning from data is virtually universally useful. Master it and you'll be welcomed nearly everywhere!"
John Elder

New groundbreaking stories of PA in action are pouring in. A few key ingredients have opened these floodgates:
 Wildly increasing loads of data.
 Cultural shifts as organizations learn to appreciate, embrace, and integrate predictive technology.
 Improved software solutions to deliver PA to organizations.

But this flood built up its potential in the first place simply because predictive technology boasts an inherent generality: there are just so many conceivable ways to make use of it. Want to come up with your own new innovative use for PA? You need only two ingredients.


EACH APPLICATION OF PA IS DEFINED BY:

1. What's predicted: The kind of behavior (i.e., action, event, or happening) to predict for each individual, stock, or other kind of element.
2. What's done about it: The decisions driven by prediction; the action taken by the organization in response to or informed by each prediction.

Given its open-ended nature, the list of application areas is so broad and the list of example stories is so long that it presents a minor data management challenge in and of itself! So I placed this big list (147 examples total) into nine tables in the center of this book. Take a flip through to get a feel for just how much is going on. That's the sexy part: it's the centerfold of this book. The Central Tables divulge cases of predicting: stock prices, risk, delinquencies, accidents, sales, donations, clicks, cancellations, health problems, hospital admissions, fraud, tax evasion, crime, malfunctions, oil flow, electricity outages, approvals for government benefits, thoughts, intention, answers, opinions, lies, grades, dropouts, friendship, romance, pregnancy, divorce, jobs, quitting, wins, votes, and more. The application areas are growing at a breakneck pace.

Within this long list, the quintessential application for business is the one covered in the Introduction for mass marketing:

PA APPLICATION: TARGETING DIRECT MARKETING

1. What's predicted: Which customers will respond to marketing contact.
2. What's done about it: Contact customers more likely to respond.

As we saw, this use of PA illustrates The Prediction Effect.

The Prediction Effect: A little prediction goes a long way.

Let's take a moment to see how straightforward it is to calculate the sheer value resulting from The Prediction Effect. Imagine you have a company with a mailing list of a million prospects. It costs $2 to mail to each one, and you have observed that one out of 100 of them will buy your product (i.e., 10,000 responses). You take your chances and mail to the entire list. If you profit $220 for each rare positive response, then you pocket:

Overall profit = Revenue − Cost = $220 × 10,000 responses − $2 × 1,000,000

Whip out your calculator: that's $200,000 profit. Are you happy yet? I didn't think so. If you are new to the arena of direct marketing (welcome!), you'll notice we're playing a kind of wild numbers game, amassing great waste, like one million monkeys chucking darts across a chasm in the general direction of a dartboard. As turn-of-the-century marketing pioneer John Wanamaker famously put it, "Half the money I spend on advertising is wasted; the trouble is I don't know which half." The bad news is that it's actually more than half; the good news is that PA can learn to do better.

A Faulty Oracle Everyone Loves

"The first step toward predicting the future is admitting you can't."
Stephen Dubner, Freakonomics Radio, March 30, 2011

"The prediction paradox: The more humility we have about our ability to make predictions, the more successful we can be in planning for the future."
Nate Silver, The Signal and the Noise: Why So Many Predictions Fail—but Some Don't

"Half of what we will teach you in medical school will, by the time you are done practicing, be proved wrong."
Dr. Mehmet Oz

Your resident oracle, PA, tells you which customers are most likely to respond. It earmarks a quarter of the entire list and says, "These folks are three times more likely to respond than average!" So now you have a short list of 250,000 customers of which 3 percent will respond: 7,500 responses. Oracle shmoracle! These predictions are seriously inaccurate; we still don't have strong confidence when contacting any one customer, given this measly 3 percent response rate. However, the overall IQ of your dart-throwing monkeys has taken a real boost. If you send mail to only this short list, then you profit:

Overall profit = Revenue − Cost = $220 × 7,500 responses − $2 × 250,000

That's $1,150,000 profit. You just improved your profit 5.75 times over by mailing to fewer people (and, in so doing, expending fewer trees). In particular, you predicted who wasn't worth contacting and simply left them alone. Thus you cut your costs by three-quarters, in exchange for losing only one-quarter of sales. That's a deal I'd take any day.

It's not hard to put a value on prediction. As you can see, even if predictions themselves are generated from sophisticated mathematics, it takes only simple arithmetic to roll up the plethora of predictions (some accurate, and others not so much) and reveal the aggregate bottom-line effect. This isn't just some abstract notion; The Prediction Effect means business.
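The two profit calculations above can be reproduced with a few lines of Python; the function name is mine, but every number comes from the text.

```python
# Mail-to-everyone baseline vs. the model-targeted short list (figures from the text).
MAIL_COST = 2.00           # cost to mail one prospect, in dollars
PROFIT_PER_RESPONSE = 220  # profit from each positive response, in dollars

def campaign_profit(n_mailed, n_responses):
    """Overall profit = revenue from responses minus mailing cost."""
    return PROFIT_PER_RESPONSE * n_responses - MAIL_COST * n_mailed

blanket = campaign_profit(1_000_000, 10_000)  # whole list, 1% respond
targeted = campaign_profit(250_000, 7_500)    # model's top quarter, 3% respond

print(blanket, targeted, targeted / blanket)  # 200000.0 1150000.0 5.75
```

The 5.75x improvement comes entirely from not mailing the three-quarters of the list the model scored as unlikely to respond.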


Predictive Protection
Thus, value has emerged from just a little predictive insight, a small prognostic nudge in the right direction. It's easy to draw an analogy to science fiction, where just a bit of supernatural foresight can go a long way. Nicolas Cage kicks some serious bad-guy butt in the movie Next, based on a story by Philip K. Dick. His weapon? Pure prognostication. He can see the future, but only two minutes ahead. It's enough prescience to do some damage. An unarmed civilian with a soft heart and the best of intentions, he winds up marching through something of a war zone, surrounded by a posse of heavily armed FBI agents who obey his every gesture. He sees the damage of every booby trap, sniper, and mean-faced grunt before it happens and so can command just the right moves for this Superhuman Risk-Aversion Team, avoiding one calamity after another.

In a way, deploying PA makes a Superhuman Risk-Aversion Team of the organization just the same. Every decision an organization makes, each step it takes, incurs risk. Imagine the protective benefit of foreseeing each pitfall so that it may be avoided: each criminal act, stock value decline, hospitalization, bad debt, traffic jam, high school dropout . . . and each ignored marketing brochure that was a waste to mail. Organizational risk management, traditionally the act of defending against singular, macro-level incidents like the crash of an aircraft or an economy, now broadens to fight a myriad of micro-level risks.

Hey, it's not all bad news. We win by foreseeing good behavior as well, since it often signals an opportunity to gain. The name of the game is Predict 'n' Pounce when it pops up on the radar that a customer is likely to buy, a stock value is likely to increase, a voter is likely to swing, or the apple of one's online dating eye is likely to reciprocate. A little glimpse into the future gives you power because it gives you options. In some cases the obvious decision is to act to change what may not be inevitable, be it crime, loss, or sickness. On the positive side, in the case of foreseeing demand, you act to exploit it. Either way, prediction serves to drive decisions. Let's turn to a real case, a $1 million example.

A Silent Revolution Worth a Million

When an organization goes live with PA, it unleashes a massive army, but it's an army of ants. These ants march out to the front lines of an organization's operations, the places where there's contact with the likes of customers, students, or patients, the people served by the organization. Within these interactions, the ant army, guided by predictions, improves millions of small decisions. The process goes largely unnoticed, under the radar . . . until someone bothers to look at how it's adding up. The improved decisions may each be ant-sized, relatively speaking, but there are so many that it comes to a powerful net effect.

In 2005, I was digging in the trenches, neck deep in data for a client who wanted more clicks on its website. To be precise, they wanted more clicks on their sponsors' ads. This was about the money: more clicks, more money. The site had gained tens of millions of users over the years, and within just several months' worth of tracking data that they handed me, there were 50 million rows of learning data, no small treasure trove from which to learn to predict . . . clicks.

Advertising is an inevitable part of media, be it print, television, or your online experience. Benjamin Franklin forgot to include it when he proclaimed, "Nothing can be said to be certain, except death and taxes." The flagship Internet behemoth Google credits ads as its greatest source of revenue. It's the same with Facebook. But on this website, ads told a slightly different story than usual, which further amplified the potential win of predicting user clicks. The client was a leading student grant and scholarship search service, with one in three college-bound high school seniors using it: an arcane niche, but just the one over which certain universities and military recruiters were drooling. One ad for a university included a strong pitch, naming itself "America's leader in creative education," and culminating with a button that begged to be clicked: "Yes, please have someone from the Art Institutes Admissions Office contact me!" And you won't be surprised to hear that creditors were also placing ads, at the ready to provide these students another source of funds: loans. The sponsors would pay up to $25 per lead, that is, for each would-be recruit. That's good compensation for one little click of the mouse. What's more, since the ads were largely relevant to the users, closely related to their purpose on the website, the response rates climbed up to an unusually high 5 percent.

So this little business, owned by a well-known online job-hunting firm, was earning well. Any small improvement meant real revenue. But improving ad selection is a serious challenge. At certain intervals, users were exposed to a full-page ad, selected from a pool of 291 options. The trick is selecting the best one for each user. The website currently selected which ad to show based simply on the revenue it generated on average, with no regard to the particular user. The universally strongest ad was always shown first. Although this tactic forsakes the possibility of matching ads to individual users, it's a formidable champion to unseat. Some sponsor ads, such as certain universities, paid such a high bounty per click, and were clicked so often, that showing any user a less powerful ad seemed like a crazy thing to consider, since doing so would risk losing currently established value.
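The champion strategy just described reduces to a one-liner: rank ads by their average revenue per impression, click rate times sponsor bounty, ignoring the individual user. A minimal sketch, with entirely made-up ad names and numbers:

```python
# Champion strategy sketch: always show the ad with the highest average
# revenue per impression (click rate x bounty), regardless of the viewer.
# All names and figures below are hypothetical illustrations.
ads = {
    "art_institute": {"click_rate": 0.05, "bounty": 25.0},
    "student_loan":  {"click_rate": 0.03, "bounty": 20.0},
    "recruiter":     {"click_rate": 0.04, "bounty": 10.0},
}

def expected_revenue(ad):
    """Average dollars earned each time this ad is displayed."""
    return ad["click_rate"] * ad["bounty"]

# The single universally strongest ad, shown to everyone.
champion = max(ads, key=lambda name: expected_revenue(ads[name]))
```

A personalized system must beat this baseline by swapping in a different ad only when the predicted lift for that particular user outweighs the champion's established average.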

The Perils of Personalization

By trusting predictions in order to customize for the individual, you take on risk. A predictive system boldly proclaims, "Even though ad A is so strong overall, for this particular user it is worth the risk of going with ad B." For this reason, most online ads are not personalized for the individual user. Even Google's AdWords, which allows you to place textual ads alongside search results and on other web pages at large, determines which ad to display by web page context, the ad's click rate, and the advertiser's bid (what it is willing to pay for a click). It is not determined by anything known or predicted about the particular viewer who is going to actually see the ad.

But weathering this risk carries us to a new frontier of customization. For business, it promises to personalize!, increase relevance!, and engage one-to-one marketing! The benefits reach beyond personalizing marketing treatment to customizing the individual treatment of patients and suspected criminals as well.

During a speech about satisfying our widely varying preferences in choice of spaghetti sauce (chunky? sweet? spicy?), Malcolm Gladwell said, "People . . . were looking for . . . universals, they were looking for one way to treat all of us[;] . . . all of science through the 19th century and much of the 20th was obsessed with universals. Psychologists, medical scientists, economists were all interested in finding out the rules that govern the way all of us behave. But that changed, right? What is the great revolution of science in the last 10, 15 years? It is the movement from the search for universals to the understanding of variability. Now in medical science we don't want to know . . . just how cancer works; we want to know how your cancer is different from my cancer."

From medical issues to consumer preferences, individualization trumps universals. And so it goes with ads:

PA APPLICATION: PREDICTIVE ADVERTISEMENT TARGETING

1. What's predicted: Which ad each customer is most likely to click.
2. What's done about it: Display the best ad (based on the likelihood of a click as well as the bounty paid by its sponsor).

I set up PA to perform ad targeting for my client, and the company launched it in a head-to-head, champion/challenger competition to the death. The loser would surely be relegated to the bin of second-class ideas that just don't make as much cash.

To prepare for this battle, we armed PA with powerful weaponry. The predictions were generated from machine learning across 50 million learning cases, each depicting a micro-lesson from history of the form, "User Mary was shown ad A and she did click it" (a positive case) or "User John was shown ad B and he did not click it" (a negative case). The learning technology employed to pick the best ad for each user was a Naive Bayes model. Reverend Thomas Bayes was an eighteenth-century mathematician, and the "Naive" part means that we take a very smart man's ideas and compromise them in a way that simplifies yet makes their application feasible, resulting in a practical method that's often considered good enough at prediction, and scales to the task at hand. I went with this method for its relative simplicity, since in fact I needed to generate 291 such models, one for each ad. Together, these models predict which ad a user is most likely to click on.
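To make the idea concrete, here is a toy sketch of one such per-ad Naive Bayes click model. The field names and the handful of training examples are invented for illustration; the real system trained 291 of these models on 50 million examples.

```python
from collections import defaultdict

class TinyNaiveBayes:
    """Minimal categorical Naive Bayes for one ad: P(click | user features).

    Illustrative sketch only; the production system described in the text
    used one model per ad (291 total), learned from 50 million cases.
    """
    def __init__(self):
        self.class_counts = {0: 0, 1: 0}                       # did-not-click / clicked
        self.feature_counts = {0: defaultdict(int), 1: defaultdict(int)}

    def fit(self, examples):
        # examples: list of (features_dict, clicked) pairs
        for features, clicked in examples:
            self.class_counts[clicked] += 1
            for name, value in features.items():
                self.feature_counts[clicked][(name, value)] += 1
        return self

    def p_click(self, features):
        # P(class) times the product of P(feature | class), naively assuming
        # feature independence, then normalized across the two classes.
        scores = {}
        total = sum(self.class_counts.values())
        for c in (0, 1):
            score = (self.class_counts[c] + 1) / (total + 2)    # Laplace smoothing
            for name, value in features.items():
                num = self.feature_counts[c][(name, value)] + 1
                den = self.class_counts[c] + 2
                score *= num / den
            scores[c] = score
        return scores[1] / (scores[0] + scores[1])

# Hypothetical micro-lessons from history: (user features, did the user click?)
history = [
    ({"year": "senior", "military_interest": "yes"}, 1),
    ({"year": "senior", "military_interest": "yes"}, 1),
    ({"year": "senior", "military_interest": "no"}, 0),
    ({"year": "junior", "military_interest": "no"}, 0),
    ({"year": "junior", "military_interest": "yes"}, 0),
]
model = TinyNaiveBayes().fit(history)
p = model.p_click({"year": "senior", "military_interest": "yes"})
```

With one such model per ad, the deployed system scores every candidate ad for the current user and shows the one with the best expected payoff.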

Deployment's Detours and Delays

As with a rocket ship, launching PA looks great on paper. You design and construct the technology, place it on the launch pad, and wait for the green light. But just when you're about to hit Go, the launch is scrubbed. Then delayed. Then scrubbed again. The Wright brothers and others, galvanized by the awesome promise of a newly discovered wing design that generates lift, endured an uncharted rocky road, faltering, floundering, and risking life and limb until all the kinks were out.

For ad targeting and other real-time PA deployments, predictions have got to zoom in at warp speed in order to provide value. Our online world tolerates no delay when it's time to choose which ad to display, determine whether to buy a stock, decide whether to authorize a credit card charge, recommend a movie, filter an e-mail for viruses, or answer a question on Jeopardy! A real-time PA solution must be directly integrated into operational systems, such as websites or credit card processing facilities. If you are newly integrating PA within an organization, this can be a significant project for the software engineers, who often have their hands full with maintenance tasks just to keep the business operating normally. Thus, the deployment phase of a PA project takes much more than simply receiving a nod from senior management to go live: it demands major construction.

By the time the programmers deployed my predictive ad selection system, the data over which I had tuned it was already about 11 months old. Were the facets of what had been learned still relevant almost one year later, or would prediction's power peter out?

In Flight
This is Major Tom to Ground Control
I'm stepping through the door
And I'm floating in a most peculiar way . . .
"Space Oddity" by David Bowie

Once launched, PA enters an eerie, silent waiting period, like you're floating in orbit and nothing is moving. But the fact is, in a low orbit around Earth you're actually screaming along at over 14,000 miles per hour. Unlike the drama of launching a rocket or erecting a skyscraper, the launch of predictive analytics is a relatively stealthy maneuver. It goes live, but daily activities exhibit no immediately apparent change. After the ad-targeting project's launch, if you checked out the website, it would show you an ad as usual, and you could wonder whether the system made any difference in this one choice. This is what computers do best. They hold the power to silently enact massive procedural changes that often go uncredited, since most aren't directly witnessed by any one person. But, under the surface, a sea change is in play, as if the entire ocean has been reconfigured. You actually notice the impact only when you examine an aggregated report.

In my client's deployment, predictive ad selection triumphed. The client conducted a head-to-head comparison, selecting ads for half the users with the existing champion system and the other half with the new predictive system, and reported that the new system generated at least 3.6 percent more revenue, which amounts to $1 million every 19 months, given how much moolah was already coming in. This was for the website's full-page ads only; many more (smaller) ads are embedded within functional web pages, which could potentially also be boosted with a similar PA project.

No new customers, no new sponsors, no changes to business contracts, no materials or computer hardware needed, no new full-time employees or ongoing efforts: solely an improvement to decision making was needed to generate cold, hard cash. In a well-oiled, established system like the one my client had, even a small improvement of 3.6 percent amounts to something substantial. The gains of an incremental tweak can be even more dramatic: In the insurance business, one company reports that PA saves almost $50 million annually by decreasing its loss ratio by half a percentage point. So how did these models predict each click?
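Before turning to that question, the two bottom-line figures just cited can be cross-checked with the same simple arithmetic used earlier. The baselines computed below are inferences from the stated numbers, not figures given in the text.

```python
# Cross-checking the reported gains (the baselines are inferences, not stated).
lift = 0.036        # reported revenue lift from predictive ad selection
gain = 1_000_000    # reported extra revenue, in dollars
months = 19         # period over which that gain accrued

# A 3.6% lift worth $1M over 19 months implies roughly $1.46M of baseline
# full-page-ad revenue per month.
implied_monthly_revenue = gain / (lift * months)

# Insurance example: saving ~$50M/year from a half-percentage-point drop in
# loss ratio implies on the order of $10 billion in annual premiums.
implied_annual_premiums = 50_000_000 / 0.005
```

Both checks illustrate the chapter's point: tiny percentage improvements compound into serious money only because the underlying decision volume is enormous.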

Elementary, My Dear: The Power of Observation

Just like Sherlock Holmes drawing conclusions by sizing up a suspect, prediction comes of astute observation: What's known about each individual provides a set of clues about what he or she may do next. The chance a user will click on a certain ad depends on all sorts of elements, including the individual's current school year, gender, and e-mail domain (hotmail, yahoo, gmail, etc.); the ratio of the individual's SAT written to math scores (is the user more a verbal person or more a math person?); and on and on.

In fact, this website collected a wealth of information about its users. To find out which grants and scholarships they're eligible for, users answer dozens of questions about their school performance, academic interests, extracurricular activities, prospective college majors, parents' degrees, and more. So the table of learning data was long (at 50 million examples) and was also wide, with each row holding all the information known about the user at the moment the person viewed an ad.

It can sound like a tall order: harnessing millions of examples in order to learn how to incorporate the various factoids known about each individual so that prediction is possible. But we can break this down into a couple of parts, and suddenly it gets much simpler. Let's start with the contraption that makes the predictions, the electronic Sherlock Holmes that knows how to consider all these factors and roll them up into a single prediction for the individual.
Predictive model: A mechanism that predicts a behavior of an individual, such as click, buy, lie, or die. It takes characteristics of the individual as input, and provides a predictive score as output. The higher the score, the more likely it is that the individual will exhibit the predicted behavior.

A predictive model (depicted throughout this book as a golden egg, albeit in black and white) scores an individual:

Characteristics of an Individual → Predictive Model → Predictive Score

A predictive model is the means by which the attributes of an individual are factored together for prediction. There are many ways to do this. One is to weigh each characteristic and then add them up; perhaps females boost their score by 33.4, Hotmail users decrease their score by 15.7, and so on. Each element counts toward or against the final score for that individual. This is called a linear model, generally considered quite simple and limited, although usually much better than nothing. Other models are composed of rules, like this real example:

IF the individual is still in high school AND expects to graduate college within three years AND indicates certain military interest AND has not been shown this ad yet THEN the probability of clicking on the ad for the Art Institute is 13.5 percent.
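The weighted-sum (linear model) approach described just before this rule can be sketched in a few lines. The two weights are the illustrative figures from the text; the baseline and field names are hypothetical.

```python
# Illustrative linear predictive model: a weighted sum of individual traits.
# The female and Hotmail weights are the text's examples; everything else
# (the baseline, the trait names) is made up for the sketch.
WEIGHTS = {
    "is_female": 33.4,
    "uses_hotmail": -15.7,
}

def linear_score(individual, weights=WEIGHTS, baseline=100.0):
    """Sum the weights of every characteristic this individual exhibits."""
    return baseline + sum(w for trait, w in weights.items() if individual.get(trait))

score = linear_score({"is_female": True, "uses_hotmail": True})  # 100 + 33.4 - 15.7
```

Each trait nudges the score up or down; a rule-based model like the one above instead carves out whole segments and assigns each a probability.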


This rule is a valuable find, since the overall probability of responding to the Art Institute's ad is only 2.7 percent, so we've identified a pocket of avid clickers, relatively speaking. It is interesting that those who have indicated a military interest are more likely to show interest in the Art Institute. We can speculate, but it's important not to assume there is a causal relationship. For example, it may be that people who complete more of their profile are just more likely to click in general, across all kinds of ads.

Various types of models compete to make the most accurate predictions. Models that combine a bunch of rules like the one just shown are, relatively speaking, on the simpler side. Alternatively, we can go more super-math on the prediction problem, employing complex formulas that predict more effectively but are almost impossible to understand by human eyes. But all predictive models share the same objective: They consider the various factors of an individual in order to derive a single predictive score for that individual. This score is then used to drive an organizational decision, guiding which action to take.

Before using a model, we've got to build it. Machine learning builds the predictive model:

Data → Machine Learning → Predictive Model

Machine learning crunches data to build the model, a brand-new prediction machine. The model is the product of this learning technology; it is itself the very thing that has been learned. For this reason, machine learning is also called predictive modeling, which is a more common term in the commercial world. If deferring to the older metaphorical term data mining, the predictive model is the unearthed gem. Predictive modeling generates the entire model from scratch. All the model's math or weights or rules are created automatically by the computer. The machine learning process is designed to accomplish this task, to mechanically develop new capabilities from data. This automation is the means by which PA builds its predictive power.

The hunter returns to the tribe, proudly displaying his kill. So, too, a data scientist posts her model on the bulletin board near the company ping-pong table. The hunter hands over the kill to the cook, and the data scientist cooks up her model, translates it to a standard computer language, and e-mails it to an engineer for integration. A well-fed tribe shows the love; a psyched executive issues a bonus. The tribe munches and the scientist crunches.
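The scoring idea described above can be sketched in a few lines. This is a toy illustration only: the rules, lift factors, and decision threshold below are invented for the example, not taken from the book. A handful of "learned" rules adjust a baseline response rate into a single score per individual, and that score drives the decision.

```python
# Toy predictive model: a few learned rules combine into one score per person.
# The 2.7 percent baseline comes from the text; the lift factors are hypothetical.
def predictive_score(person):
    """Return an estimated probability that this person clicks the ad."""
    score = 0.027  # overall response rate: 2.7 percent
    if person.get("military_interest"):
        score *= 2.5  # hypothetical lift learned from data
    if person.get("profile_complete"):
        score *= 1.5  # hypothetical lift for completed profiles
    return min(score, 1.0)

# The score directly drives an organizational decision: show the ad or not.
person = {"military_interest": True, "profile_complete": True}
score = predictive_score(person)
decision = "show ad" if score > 0.05 else "skip"
print(round(score, 5), decision)  # 0.10125 show ad
```

A real model would learn many such rules (or weights) from data rather than have them typed in by hand, but the shape is the same: factors in, one score out, decision taken.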

To Act Is to Decide
Knowing is not enough; we must act.
Johann Wolfgang von Goethe

Potatoes or rice? What to do with my life? I can't decide.
From the song "I Suck at Deciding" by Muffin[1] (1996)

Once you develop a model, don't pat yourself on the back just yet. Predictions don't help unless you do something about them. They're just thoughts, just ideas. They may be astute, brilliant gems that glimmer like the most polished of crystal balls, but hanging them on the wall gains you nothing and displays nerd narcissism: they just hang there and look smart. Unlike a report sitting dormant on the desk, PA leaps out of the lab and takes action. In this way, it stands above other forms of analysis, data science, and data mining. It desires deployment and loves to be launched, because in what it foretells, it mandates movement.

The predictive score for each individual directly informs the decision of what action to take with that individual. Doctors take a second look at patients predicted to be readmitted, and service agents contact customers predicted to cancel. Predictive scores issue imperatives to mail, call, offer a discount, recommend a product, show an ad, expend sales resources, audit, investigate, inspect for flaws, approve a loan, or buy a stock. By acting on the predictions produced by machine learning, the organization is now applying what's been learned, modifying its everyday operations for the better.

To make this point, we have mangled the English language. Proponents like to say that predictive analytics is actionable. Its output directly informs actions, commanding the organization about what to do next. But with this use of vocabulary, industry insiders have stolen the word actionable, which originally meant worthy of legal action (i.e., sue-able), and morphed it. This verbal assault comes about because people are so tired of seeing sharp-looking reports that provide only a vague, unsure sense of direction. With this word's new meaning established, "Your fly is unzipped" is actionable (it is clear what to do; you can and should take action to remedy), but "You're going bald" is not (there's no cure; nothing to be done). Better yet, "I predict you will buy these button-fly jeans and this snazzy hat" is actionable, to a salesperson.

Launching PA into action delivers a critical new edge in the competitive world of business. One sees massive commoditization taking place today, as the faces of corporations appear to blend together. They all seem to sell pretty much the same thing and act in pretty much the same ways. To stand above the crowd, where can a company turn? As Thomas Davenport and Jeanne Harris put it in Competing on Analytics: The New Science of Winning, "At a time when companies in many industries offer similar products and use comparable technology, high-performance business processes are among the last remaining points of differentiation." Enter predictive analytics. Survey results have in fact shown that a tougher competitive environment is by far the strongest reason why organizations adopt this technology. But while the launch of PA brings real change, so too can it wreak havoc by introducing new risk. With this in mind, we now return to John's story.

[1] A rock band that included the author's sister, Rachel.

A Perilous Launch
Ladies and gentlemen . . . from what was once an inarticulate mass of lifeless tissues, may I present a cultured, sophisticated man about town.
Dr. Freddy Frankenstein (Gene Wilder) in Mel Brooks's Young Frankenstein

Dr. John Elder bet it all on a predictive model. He concocted it in the lab, packed it into a black box, and unleashed it on the stock market. Some people make their own bed in which they must then passively lie. But John had climbed way up high to take a leap of faith. Diving off a mountaintop with newly constructed, experimental wings, he wondered how long it might take before he could be sure he was flying rather than crashing. The risks stared John in the face like a cracked mirror reflecting his own vulnerability. His and his wife's full retirement savings were in the hands of an experimental device, launched into oblivion and destined for one of the same two outcomes achieved by every rocket: glory or mission failure.

Discovering profitable market patterns that sustain is the mission of thousands of traders operating in what John points out to be a brutally competitive environment; doing so automatically with machine learning is the most challenging of ambitions, considered impossible by many. It doesn't help that a stock market scientist is completely on his own, since work in this area is shrouded in secrecy, leaving virtually no potential to learn from the successes and failures of others. Academics publish, marketers discuss, but quants hide away in their Bat Caves. What can look great on paper might be stricken with a weakness that destroys or an error that bankrupts. John puts it plainly: "Wall Street is the hardest data mining problem."


The evidence of danger was palpable, as John had recently uncovered a crippling flaw in an existing predictive trading system, personally escorting it to its grave. Opportunity had come knocking on the door of a small firm called Delta Financial in the form of a black box trading system purported to predict movements of the Standard & Poor's (S&P) 500 with 70 percent accuracy. Built by a proud scientist, the system promised to make millions, so stakeholders were flying around all dressed up in suits, actively lining up investors prepared to place a huge bet. Among potential early investors, Delta was leading the way for others, taking a central, influential role. The firm was known for investigating and championing cutting-edge approaches, weathering the risk inherent to innovation.

As a necessary precaution, Delta sought to empirically validate this system. The firm turned to John, who was consulting for them on the side while pursuing his doctorate at the University of Virginia in Charlottesville. John's work for Delta often involved inspecting, and sometimes debunking, black box trading systems. How do you prove a machine is broken if you're not allowed to look inside it? Healthy skepticism bolstered John's resolve, since the claimed 70 percent accuracy raised red flags as quite possibly too darn good to be true. But he was not granted access to the predictive model. With secrecy reigning supreme, the protocol for this type of audit dictated that John receive only the numerical results, along with a few adjectives that described its design: new, unique, powerful! With meager evidence, John sought to prove a crime he couldn't even be sure had been committed.

Before each launch, organizations establish confidence in PA by predicting the past (aka backtesting). The predictive model must prove itself on historical data before its deployment. Conducting a kind of simulated prediction, the model is evaluated across data from last week, last month, or last year. Feeding on input that could only have been known at a given time, the model spits out its prediction, which is then matched against what we now already know took place thereafter. Would the S&P 500 go down or up on March 21, 1991? If the model gets this retrospective question right, based only on data available by March 20, 1991 (the day just before), we have evidence the model works. These retrospective predictions, without the manner in which they had been derived, were all John had to work with.
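Backtesting of this kind can be sketched as a walk-forward loop. This is a minimal illustration under stated assumptions (daily prices in a list, a toy up/down prediction function), not Delta's actual audit protocol: at each day the model sees only data known by that day, and its call is then scored against what actually happened next.

```python
# Walk-forward backtest sketch: predict each next-day move using only
# data available up to "today," then score against the known outcome.
def backtest(history, predict):
    """history: daily prices; predict: fn(past_prices) -> +1 (up) or -1 (down)."""
    hits = total = 0
    for t in range(1, len(history) - 1):
        guess = predict(history[: t + 1])            # data known by day t only
        actual = 1 if history[t + 1] > history[t] else -1
        hits += guess == actual
        total += 1
    return hits / total  # retrospective accuracy

# A naive momentum rule: tomorrow continues today's direction.
momentum = lambda past: 1 if past[-1] >= past[-2] else -1
print(backtest([100, 101, 103, 102, 104, 105], momentum))  # 0.5
```

The slicing `history[: t + 1]` is the whole point: the model is never handed anything past day t, which is exactly the discipline the flawed system described next failed to maintain.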

Houston, We Have a Problem


Even the most elite of engineers commits the most mundane and costly of errors. In late 1998, NASA launched the Mars Climate Orbiter on a daunting nine-month trip to Mars, a mission that fewer than half the world's launched probes headed for that destination have completed successfully. This $327.6 million calamity crashed and burned indeed, due not to the flip of fate's coin, but rather a simple snafu. The spacecraft came too close to Mars and disintegrated in its atmosphere. The source of the navigational bungle? One system expected to receive information in metric units (newton-seconds), but a computer programmer for another system had it speak in English imperial units (pound-seconds). Oops.

John stared at a screen of numbers, wondering if anything was wrong and, if so, whether he could find it. From the long list of impressive, yet retrospective, predictions, he plainly saw the promise of huge profits that had everyone involved so excited. If he proved there was a flaw, vindication; if not, lingering uncertainty. The task at hand was to reverse engineer: Given the predictions the system generated, could he infer how it worked under the hood, essentially eking out the method in its madness? This was ironic, since all predictive modeling is a kind of reverse engineering to begin with. Machine learning starts with the data, an encoding of things that have happened, and attempts to uncover patterns that generated or explained the data in the first place. John was attempting to deduce what the other team had deduced. His guide? Informal hunches and ill-informed inferences, each of which could be pursued only by way of trial and error, testing each hypothetical mess-up he could dream up by programming it by hand and comparing it to the retrospective predictions he had been given.

His perseverance finally paid off: John uncovered a true flaw, thereby flinging back the curtain to expose a flustered Wizard of Oz. It turned out that the prediction engine committed the most sacrilegious of cheats by looking at the one thing it must not be permitted to see. It had looked at the future. The battery of impressive retrospective predictions weren't true predictions at all. Rather, they were based in part on a three-day average calculated across yesterday, today . . . and tomorrow. The scientists had probably intended to incorporate a three-day average leading up to today, but had inadvertently shifted the window by a day. Oops.
This crippling bug delivered the dead-certain prognosis that this predictive model would not perform well if deployed into the field. Any prediction it would generate today could not incorporate the very thing it was designed to foresee (tomorrow's stock prices) since, well, it isn't known yet. So, if foolishly deployed, its accuracy could never match the exaggerated performance falsely demonstrated across the historical data. John revealed this bug by reverse engineering it. On a hunch, he hand-crafted a method with the same type of bug, and showed that its predictions closely matched those of the trading system.

A predictive model will sink faster than the Titanic if you don't seal all its time leaks before launch. But this kind of leak from the future is common, if mundane. Although core to the very integrity of prediction, it's an easy mistake to make, given that each model is backtested over historical data for which prediction is not, strictly speaking, possible. The relative future is always readily available in the testing data, easy to inadvertently incorporate into the very model trying to predict it. Such temporal leaks achieve status as a commonly known gotcha among PA practitioners. If this were an episode of Star Trek, our beloved, hypomanic engineer Scotty would be screaming, "Captain, we're losing our temporal integrity!"

It was with no pleasure that John delivered the disappointing news to his client, Delta Financial: He had debunked the system, essentially exposing it as inadvertent fraud. High hopes were dashed as another fairy tale bit the dust, but gratitude quickly ensued as would-be investors realized they'd just dodged a bullet. The wannabe inventor of the system suffered dismay, but was better off knowing now; it would have hit the fan much harder postlaunch, possibly including prosecution for fraud, even if inadvertently committed. The project was aborted.
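The window-shift bug John uncovered is easy to reproduce in miniature. In this sketch (illustrative only, not the actual trading system's code), the legitimate feature averages the three days ending today, while the buggy shifted version quietly includes tomorrow's price, the very thing the model is supposed to predict:

```python
# A feature computed at day t must use only prices known by day t.
def trailing_3day_avg(prices, t):
    """Legitimate: average of days t-2, t-1, t (all in the past or present)."""
    return sum(prices[t - 2 : t + 1]) / 3

def leaky_3day_avg(prices, t):
    """Buggy: window shifted by one day, so it averages t-1, t, and t+1.
    Day t+1 is tomorrow, a time leak that inflates backtest accuracy."""
    return sum(prices[t - 1 : t + 2]) / 3

prices = [100.0, 102.0, 101.0, 105.0, 104.0]
t = 2  # "today"
print(trailing_3day_avg(prices, t))  # (100 + 102 + 101) / 3 = 101.0
print(leaky_3day_avg(prices, t))     # (102 + 101 + 105) / 3: sees day t+1
```

In a backtest both versions can be computed, because the "future" is sitting right there in the historical data; in deployment the leaky version simply cannot be computed today, which is why its backtest numbers could never be matched in the field.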

The Little Model That Could


Every new beginning comes from some other beginning's end.
From the song "Closing Time" by Semisonic

Even as the young practitioner he was, John was a go-to data man for entrepreneurs in black box trading. One such investor moved to Charlottesville, but only after John Elder, PhD, new doctorate degree in hand, had just relocated to Houston in order to continue his academic rite of passage with a postdoc research position at Rice University. He'd left quite an impression back in Charlottesville, though; people in the academic and commercial sectors alike referred the investor to John. Despite John's distance, the investor hired him to prepare, launch, and monitor a new black box mission remotely from Houston. It seemed as good a place as any for the project's Mission Control.

And so it was time for John to move beyond the low-risk role of evaluating other people's predictive systems and dare to build one of his own. Over several months, he and a small team of colleagues he'd pulled together built upon core insights from the investor and produced a new, promising black box trading model. John was champing at the bit to launch it and put it to the test. All the stars were aligned for liftoff except one: the money people didn't trust it yet.

There was good reason to believe in John. Having recently completed his doctorate degree, he was armed with a fresh, talented mind, yet had already gained an impressively wide range of data-crunching problem-solving experience. On the academic side, his PhD thesis had broken records among researchers as the most efficient way to optimize for a certain broad class of system engineering problems (machine learning is itself a kind of optimization problem). He had also taken on predicting the species of a bat from its echolocation signals (the chirps bats make for their radar).
And in the commercial world, John's pre-grad positions had dropped him right into the thick of machine learning systems that steer for aerospace flight and that detect cooling pipe cracks in nuclear reactors, not to mention projects for Delta Financial looking over the shoulders of other black box quants.

And now John's latest creation absolutely itched to be deployed. Backtesting against historical data, all indications whispered confident promises for what this thing could do once set in motion. As John puts it, "A slight pattern emerged from the overwhelming noise; we had stumbled across a persistent pricing inefficiency in a corner of the market, a small edge over the average investor, which appeared repeatable." Inefficiencies are what traders live for. A perfectly efficient market can't be played, but if you can identify the right imperfection, it's payday.

PA APPLICATION: BLACK BOX TRADING


1. What's predicted: Whether a stock will go up or down.
2. What's done about it: Buy stocks that will go up; sell those that will go down.

John could not get the green light. As he strove to convince the investor, cold feet prevailed. It appeared they were stuck in a circular stalemate. After all, this guy might not get past his jitters until seeing the system succeed, yet it couldn't succeed while stuck on the launch pad. The time was now, as each day marked lost opportunity.

After a disconcerting meeting that seemed to go nowhere, John went home and had a sit-down with his wife, Elizabeth. What supportive spouse could possibly resist the seduction of her beloved's ardent excitement and strong belief in his own abilities? She gave him the go-ahead to risk it all, a move that could threaten their very home.

But he still needed buy-in from one more party. Delivering his appeal to the client investor raised questions, concerns, and eyebrows. John wanted to launch with his own personal funds, which meant no risk whatsoever to the client, and would resolve any doubts by field-testing John's model. But this unorthodox step would be akin to the dubious choice to act as one's own defense attorney. When an individual is without great personal means, this kind of thing is often frowned upon. It conveys overconfident, foolish brashness. Even if the client wanted to truly believe, it would be another thing to expect the same from co-investors who hadn't gotten to know and trust John. But, with every launch, proponents gamble something fierce. John had set the rules for the game he'd chosen to play.

He received his answer from the investor: "Go for it!" This meant there was nothing to prevent moving forward. It could have also meant the investor was prepared to write off the project entirely, feeling there was nothing left to lose.

Houston, We Have Liftoff


Practitioners of PA often put their own professional lives a bit on the line to push forward, but this case was extreme. Like baseball's Billy Beane of the Oakland A's, who literally risked his entire career to deploy and field-test an analytical approach to team management, John risked everything he had. It was early 1994, and John's individual retirement account (IRA) amounted to little more than $40,000. He put it all in.


"Going live with black box trading is really exciting and really scary," says John. "It's a roller coaster that never stops." The coaster takes on all these thrilling ups and downs, but with a very real chance it could go off the rails. As with baseball, he points out, slumps aren't slumps at all; they're inevitable statistical certainties. Each one leaves you wondering: Is this falling feeling part of a safe ride, or is something broken? A key component to his system was a cleverly designed means to detect real quality, a measure of system integrity that revealed whether recent success had been truly deserved or had come about just due to dumb luck.

From the get-go, the predictive engine rocked. It increased John's assets at a rate of 40 percent per year, which meant that after two years his money had doubled. The client investor was quickly impressed and soon put in a couple of million dollars himself. A year later, the predictive model was managing a $20 million fund across a group of investors, and eventually the investment pool increased to a few hundred million dollars. With this much on tap, every win of the system was multiplicatively magnified. No question about it: All involved relished this fiesta, and the party raged on and on, continuing almost nine years, consistently outperforming the overall market all along. The system chugged, autonomously trading among a dozen market sectors such as technology, transportation, and healthcare. John says the system beat the market each year and exhibited only two-thirds its standard deviation, a home run as measured by risk-adjusted return.

But all good things must come to an end, and, just as John had talked his client up, he later had to talk him down. After nearly a decade, the key measure of system integrity began to decline. John was adamant that they were running on fumes, so with little ceremony the entire fund was wound down. The system was halted in time, before catastrophe could strike.
In the end, all the investors came out ahead.

A Passionate Scientist
The early success of this streak had quickly altered John's life. Once the project was cruising, he had begun supporting his rapidly growing family with ease. The project was taking only a couple of John's hours each day to monitor, tweak, and refresh what was a fundamentally stable, unchanging method within the black box. What's a man to do? Do you put your feet up and sip wine indefinitely, with the possible interruption of family trips to Disney World? After all, John had thus far always burned the candle at both ends out of financial necessity, with summer jobs during college, part-time work during graduate school, and this black box project, which itself had begun as a moonlighting gig during his postdoc. Or, do you follow the logical business imperative: pounce on your successes, using all your free bandwidth to find ways to do more of the same?


John's passion for the craft transcended these self-serving responses to his good fortune. That is to say, he contains the spirit of the geek. He jokes about the endless insatiability of his own appetite for the stimulation of fresh scientific challenges. He's addicted to tackling something new. There is but one antidote: a growing list of diverse projects. So, two years into the stock market project, he wrapped up his postdoc, packed up his family, and moved back to Charlottesville to start his own data mining company.

And so John launched Elder Research, now the largest pure-play predictive analytics services firm in North America. A narrow focus is key to the success of many businesses, but Elder Research's advantage is quite the opposite: its diversity. The company's portfolio reaches far beyond finance to include all major commercial sectors and many branches of government. John has also earned a top-echelon position in the industry. He chairs the major conferences, coauthors massive textbooks, takes cameos as a university professor, and served five years as a presidential appointee on a national security technology panel.

Launching Prediction into Inner Space


With stories like John's coming to light, organizations are jumping on the PA bandwagon. One such firm, a mammoth international organization, focuses the power of prediction introspectively upon itself, casting PA's keen gaze on its own employees. Read on to witness the windfall and the fallout when scientists dare to ask: Do people like being predicted?

Contents

Foreword v
Preface vii
Acknowledgments xi

PHASE 1: PLANNING
Chapter 1 Setting the Stage for an Optimized State of Mind 3
Chapter 2 Journey: Where Does Optimize and Socialize Fit in Your Company? 15
Chapter 3 Smart Marketing Requires Intelligence: Research, Audit, and Listen 25
Chapter 4 In It to Win It: Setting Objectives 39
Chapter 5 Roadmap to Success: Content Marketing Strategy 51

PHASE 2: IMPLEMENTATION
Chapter 6 Know Thy Customer: Personas 65
Chapter 7 Words Are Key to Customers: Keyword Research 75
Chapter 8 Attract, Engage, and Inspire: Building Your Content Plan 99
Chapter 9 Content Isn't King, It's the Kingdom: Creation and Curation 115
Chapter 10 If It Can Be Searched, It Can Be Optimized: Content Optimization 127
Chapter 11 Community Rules: Social Network Development (Don't Be Late to the Social Networking Party) 157
Chapter 12 Electrify Your Content: Promotion and Link Building 175
Chapter 13 Progress, Refinement, and Success: Measurement 195

PHASE 3: SCALE
Chapter 14 Optimize and Socialize: Processes and Training 211
Chapter 15 Are You Optimized? 225

About the Author 231
Notes 233
Index 239
Buy This Book

CHAPTER 1
Setting the Stage for an Optimized State of Mind

Several years ago my family established a tradition of celebrating the tenth birthday of each of our children by taking them on a trip to a city of their choosing somewhere in North America. My son Dominic picked New York City. While I travel to New York several times a year for business, I really had no idea what kid-friendly activities we could find for a five-day vacation in one of the world's greatest cities. Where did I go for advice and information?

Some people reading this book will think of a search engine like Google or Bing. For others, Facebook or Twitter will come to mind. Some might even know specific people they could e-mail for travel tips or specialty travel websites that focus on New York. What did I do? I used all of these ideas. I posted on Twitter that I would be bringing my son to New York for his tenth birthday and that we were looking for kid-friendly activities and places to see. Numerous suggestions were offered, and from them I made a list. Dominic and I used Google to research each destination and to find out details such as available activities, location, fees, and schedules. Based on what we found, we further refined our search phrases, which influenced follow-up questions posted on social networks. Some of the websites we found posted ratings from customers; others had links to blogs, photos on Flickr, and Facebook fan pages.

From our research conducted through a combination of search engines, social networking websites, and e-mail, we settled on our itinerary and had a fantastic time. We didn't stop there, though. As we explored the city, from Manhattan to the Bronx Zoo to Broadway to see The Lion King, I tweeted comments about our experiences and uploaded photos to Flickr, Twitter, and Facebook so the people who had made suggestions for our trip could see the impact they had on this once-in-a-lifetime experience. My social network experienced our adventures right along with us: interacting, sharing, and engaging from all over the world. The content and media I posted online became findable on Google and has undoubtedly provided helpful ideas to others who are looking for information on kid-friendly activities in New York for years to come.

Our experience in planning that trip to New York with content discovered through search and social media represents a fundamental change that's emerged in consumer behaviors for information discovery, consumption, and engagement. While search engines continue to represent the most popular method of finding specific information, the influence of social networking, shared social media, and the proliferation of platforms for individuals to publish content all intersect to create tremendous opportunities to better attract and engage customers.1 Recognizing the importance, relevance, and need to master each of these changing consumer preferences is essential for businesses to succeed online.
CONTENT MARKETING TRILOGY: DISCOVERY, CONSUMPTION, AND ENGAGEMENT

The web is flush with change and innovation. Gone are the days of linear information flow and incremental growth. Content flows in every direction through a variety of platforms, formats, and devices. The mass adoption of the social and mobile web has facilitated a revolution of information access, sharing, and publishing at a scale never before experienced. (See Figure 1.1.)

[Figure 1.1: Discovery, Consumption, and Engagement. Consumption: 79% of desktop Internet access is to socialize. Discovery: 48% of consumers are influenced to purchase by search and social media. Engagement: the #1 influencer on consumer brand decisions is search; online word of mouth is second.]

Access to information for discovery is most often associated with search engines. For people who have some idea of what they want or need, it's second nature to search and then sort through the results for the best answer. When my eight-year-old points to the Google Chrome browser icon on a computer desktop, she doesn't call it Google or Chrome. She calls it "the Internet," because it represents the interface she uses to search and connect with information online. For her, Internet access isn't thought of as anything special and certainly is not limited to a computer. Her perception of information transcends devices, whether a smartphone, an iPad, a PlayStation 3, an Apple TV, or a laptop. Just as she is growing up in a digital age where information access is ubiquitous, companies and their customers are growing up digitally and finding a wealth of opportunities to connect and engage.

While search plays an important part in how we connect with what we need online, the revolution occurring on the social web has had a global impact, from neighborhoods to entire nations. Recognizing the synergies of search and social media, plus the role they play with content marketing, will help businesses realize the impact on their ability to connect, engage, and grow revenue.


THE INTERSECTION OF SEARCH OPTIMIZATION AND SOCIAL MEDIA

Google handles a staggering 11 billion queries a month.2 But did you know Twitter delivers more than 350 billion tweets each day?3 Facebook has nearly 1 billion users, and Google Plus is likely to reach more than 100 million users in 2012.4,5 With two in three adults using social networks, social media is hot, but it is by no means mutually exclusive of search.6 The notion of search has expanded beyond Google and Bing, and marketers from companies of all sizes and industries must now consider other search channels, ranging from internal Facebook search to innovations such as Siri on the iPhone 4S, as opportunities for content creation, optimization, and social promotion.

The blur of all this change is an opportunity for brands and marketers to engage in an active marketing strategy that converges the disciplines of search, social media, content, and online public relations. To meet brand needs to engage customers in an always-on digital world, whether it's B2B or B2C, the convergence of marketing and public relations, search, and social media is inevitable.

Because there are so many information sources online, sales cycles are getting longer. Customers expect more than to be presented with features and benefits followed by a call to action. For marketers, more isn't always better. Relevance, timeliness, and ease of sharing are better. That means better content and visibility in all the places customers might be looking or might be influenced by. It also means a better experience with brand and consumer interactions. For example, searchers expect not only to find what they're looking for on a search engine, but also to interact with what they find through commenting, rating, and joining, as well as buying. Purchase is just the start of social engagement with the customer, which extends across a life cycle that takes the customer from prospect to evangelist.

Adaptive Internet marketing pays attention to those customer needs and creates a dynamic cycle of social and search interaction. Creating experiences that are easily discovered through search or social media, and continuously evaluating what works and what doesn't, helps to fuel the most critical aspects of an effective editorial, optimization, and social media marketing effort.


WHATEVER CAN BE SEARCHED CAN BE OPTIMIZED

Theres nothing static about Internet marketing, but the one constant we can all count on is the persistent effort by search engines to improve search quality and user experience. Such continuous improvements, including the Google Search, plus Your World implementation in late 2011, have signicantly affected how search engines interact with content ranging from discovery, indexation, sorting in search results, to what external signals are considered to determine authority. Its essential for results-oriented marketers to monitor both the frontand back-end landscapes of search to be proactive about what it will take to achieve and maintain a competitive advantage. Continuous efforts toward progressive search strategy for marketers are important, because we cannot rely on Google to send us Weather Reports every time an update is made. In 2007, Google and other search engines like Ask.com made some of the most signicant changes ever, affecting search results by including more sources such as Images, Maps, Books, Video, and News for certain queries.7 In an effort to capitalize on the opportunity for improved search visibility for the array of media types included in search results, concepts like Digital Asset Optimization came about.8 Fast-forward to 2011 and youll nd that search results have evolved from 10 blue links to situationally dependent mixed-media results that vary according to your geographic location, web history, social inuences, and social ratings like Google. At any given time there are from 50 to 200 different versions of Googles core algorithm in the wild, so the notion of optimizing for a consistently predictable direct cause and effect is long gone.9 The potential inuence of social media sites such as Twitter and Facebook with Google, Bing, and Yahoo! as link sources has changed what it means to build links for SEO and how we view whether PageRank is still important. (See Figure 1.2.) 
Social signals are rich sources of information for search engines, and old ways of link acquisition simply don't have the same effect in the same ways. As the world's most popular search engine, Google says its mission is to "Organize the world's information and make it universally accessible and useful." Marketers need to understand the opportunities to make information, including various types of digital assets, easy for

FIGURE 1.2 Channel of Discovery: Search and Social Networks (callouts: 11 billion Google queries per month; 800 million Facebook users; 350 billion tweets per day; 50 million Google+ users)

search engines to find, index, and sort in search results. Structured data in the form of markup, microformats, and rich snippets, as well as feeds and sitemaps, all play an increasingly important role in helping Google achieve this goal. At the same time, so does understanding the myriad data sources and file types that can be included in search results. By understanding these opportunities, search marketers can inventory their digital assets and deploy a better, more holistic SEO strategy that realizes the benefit of inclusion and visibility where customers are looking.

Increasingly, marketers are approaching search optimization holistically under the premise, "What can be searched can be optimized." That means more attention is being paid to the variety of reasons people search as well as the variety of reasons companies publish digital content. Content and SEO are perfect partners for making it easy to connect constituents and customers with brand content. In the past, SEO consultants have typically been left to deal with whatever content they could optimize and promote for link building. Now the practice of SEO involves content creation and curation as much as it does reworking existing content. When SEO practitioners examine the search results page of targeted keyword phrases on a regular basis,
review web analytics, and conduct social media monitoring, they can gain a deeper sense of what new sources and content types can be leveraged for better search visibility. Monitoring search results might show that the keyword terms being targeted may trigger different types of content. Certain search queries might be prone to triggering images and video, not just web pages. An understanding of the search results landscape for a target keyword phrase should be considered when allocating content creation and keyword optimization resources.

For many companies, it can be very difficult and complex to implement a holistic content marketing and search optimization program. Substantial changes may be necessary with content creation, approval, and publishing processes. But the upside is that a substantial increase in the diversity of content and media types indexed and linking to a company website will provide the kind of advantage standard SEO no longer offers. As long as there are search engines, and search functionality on websites, there will be some kind of optimization for improving marketing performance of content in search. Companies need to consider all the digital assets, content, and data they have to work with to give both search engines and customers the information they're looking for in the formats they'll respond to.
OPTIMIZE FOR CUSTOMERS

No doubt, you've searched Google or Bing and found web pages that were clearly optimized in the name of SEO. That kind of copy might help a page appear higher in search results but doesn't do much for readers once they click through. When I see those pages, it reminds me of the increasing importance of optimizing for customers and user experience versus the common overemphasis on search engines. Keep in mind, technical SEO and understanding how bots interact with servers and web pages are timeless best practices, but it just makes sense to write web copy that's more useful and a better reflection of what customers are looking for versus chasing the most popular keywords alone. (See Figure 1.3.)

I recall reading an SEO blog a long time ago that advised creating websites, copy, and links as though search engines didn't exist. That seems a bit naive, especially if you're in a competitive category. Creating,

FIGURE 1.3 Optimize For Customers (flow: Awareness > Interest > Consideration > Purchase, supported by Research Customer Segments; Preferences, Pain Points, Behaviors; Search & Social Data Sources; Keywords, Topics, Message; Topics, SEO Calendar, Repurpose; Content & Promotion Plan; Social & SEO Networking, Link Building; Optimize, Socialize, Promote)

optimizing, and promoting content based on customers' interests that leads them to a purchase makes the most out of both useful content and SEO best practices. Great SEO copywriting doesn't read as a list of keywords, but instead balances keyword usage with creative writing that appeals to the reader, thus educating, influencing, and inspiring action.

Consider the difference between these general SEO copywriting recommendations: Use the most popular keywords at the beginning of title tags, in on-page titles, body copy, anchor text, and image alt text in combination with attracting relevant keyword links from other websites so the pages rank high on Google. Higher-ranking web pages can result in more visitors and sales.

In comparison, try this advice, which is absent any explicit SEO lingo: Use the words that matter most to your customers in titles, links, and body copy to inform and inspire them to take action. Text used in titles should make it easy for readers to understand the topic of the page quickly, in the first few words. Text used to link from one page to another should give the readers an idea of what they'll find on the
destination page. A consistent approach to titling, labeling, and copy in web page text, image annotations, video descriptions, and links will create confidence for the reader in the subject matter and inspire sales.

Both recommendations should result in more focused and relevant content for search engines. But the focus in the first instance is only on keywords and search engines. The advice in the second instance is less SEO-specific, but emphasizes relevance from the customer point of view and at the same time is search engine-friendly. Marketers need to take a step back and review which audience and outcomes they're optimizing for: search engines and rankings, or customers and sales. What about all of the above?

OPTIMIZE FOR EXPERIENCES

My friend Bob Knorpp had a good piece about the fallacy of content in Advertising Age, "Why Marketers Should Break Free of the Digital Content Trap." He made some good points about companies going through the motions of creating and promoting content on social channels with motivations of retweets, likes, shares, and links instead of real engagement. I have to agree when he says, "Content alone is a dead end for ongoing engagement." Savvy online marketers don't see content as a shortsighted substitute for social strategy or simply as an SEO tactic, but as a proxy for creating better customer experiences. Content is the mechanism for storytelling, and if social and search optimization are also involved in a qualitative way to aid in discovery and sharing of those stories, then all the better.

In that Ad Age article, Bob also makes great points about the need to think of new ways to approach digital storytelling beyond single dimensions like videos that go viral and infographics that spread like wildfire on Twitter and Facebook. Engagement is indeed more than a click, a share, or a link. In the same way that many business bloggers and marketers approach online marketing with an egocentric perspective, promoting messages in which they try to persuade audiences instead of empathizing with customer needs and interests, many agencies that create content are more interested in creative self-expression than in experiences that are truly meaningful to customers.

In our hub/spoke and constellation publishing models for content marketing (covered more in-depth in Chapter 8), we emphasize
an understanding of customer needs and behaviors through persona development and attention to variances during the buying cycle. Those insights, combined with ongoing monitoring and engagement, drive content marketing strategy and the creative mix of content objects designed to help prospects have meaningful experiences with the brand. The content itself is made easier to discover in more relevant ways through search engine optimization and social media optimization. A socialize-and-optimize approach to content marketing increases the connections between consumers who are looking (i.e., searching) and discussing (social networking) topics of relevance to the brand solution.

I've said it before: Great content isn't great until it's discovered, consumed, and shared. Littering the social web with scheduled tweets, status updates, and blog posts alone is not engagement, and it certainly does not create the kind of experience that builds brand or motivates customers to buy, be loyal, or advocate.
ARE YOU READY TO BE OPTIMIZED?

From an overall marketing and customer engagement perspective, all content is not created equal. Any kind of content isn't appropriate in any kind of situation, despite what recent content marketing SEO advocates would have you believe. Since much of the focus of online marketing is on customer acquisition, many SEO efforts emphasize transaction or lead generation outcomes. That's what they're held accountable for. Unfortunately, search to purchase or social to purchase are not the only ways people interact with information online. Research before purchase and education and support afterward are also important.

Being in the brand-as-publisher business is better than not creating any content at all, but it's much more effective to be purposeful in content creation and marketing according to the full customer experience. Seeing content engagement opportunities holistically can provide a company more ways to initiate, maintain, and enhance customer relationships. For example, in the context of online marketing, there are many different touch points during the customer relationship. Using the buying cycle model of Awareness, Consideration, Purchase, Service, and Loyalty, marketers can best plan what kind of content may be most appropriate to engage customers according to their needs.


For a holistic editorial plan, here are a few types of content and methods of communication to consider:

Awareness
- Public relations
- Advertising
- Word of mouth
- Social media

Consideration
- Search marketing
- Advertising
- Social media
- Webinars
- Product and service reviews
- Blogs
- Direct response

Purchase
- Website
- Social commerce

Service
- Social media
- Social CRM
- Online messaging
- E-mail
- Search

Loyalty
- E-mail newsletter
- Webinars
- Blog
- Social network, forum community
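The stage-to-content mapping above lends itself to a simple lookup structure when building an editorial plan. The sketch below is purely illustrative; the structure and function name are not from the book:

```python
# Illustrative only: the buying-cycle content map above as a lookup table.
CONTENT_BY_STAGE = {
    "Awareness":     ["Public relations", "Advertising", "Word of mouth", "Social media"],
    "Consideration": ["Search marketing", "Advertising", "Social media", "Webinars",
                      "Product and service reviews", "Blogs", "Direct response"],
    "Purchase":      ["Website", "Social commerce"],
    "Service":       ["Social media", "Social CRM", "Online messaging", "E-mail", "Search"],
    "Loyalty":       ["E-mail newsletter", "Webinars", "Blog", "Social network, forum community"],
}

def content_options(stage: str) -> list:
    """Content types to consider for a given buying-cycle stage."""
    return CONTENT_BY_STAGE.get(stage, [])

print(content_options("Purchase"))  # ['Website', 'Social commerce']
```

A table like this makes it easy to audit whether a content calendar covers every stage of the relationship, not just acquisition.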


In the development of a content marketing strategy, there are numerous opportunities to be more relevant and effective. Planning content that's meaningful to the customers you're trying to engage, according to where they are in the buying cycle and overall customer relationship, provides greater efficiency in content creation and in the repurposing of content. Holistic content marketing and editorial planning also help make better use of tactics that transcend the relationship timeline, like SEO and social media. It's especially the case with holistic SEO that content producers can extend their reach and visibility to customers who are looking not just to buy, but to engage with brands in other ways.

By considering the content needs across the customer life cycle, not just acquisition or conversion, companies can become significantly more effective and efficient in their ability to connect relevant messages and stories with customers who are interested. The result: shorter sales cycles, better customer relationships, and more word of mouth.

Now we've set the stage for an optimized approach to search, social, and content marketing. Chapter 2 digs into where this mind-set fits for B2B, B2C, and small and large companies, as well as within specific business functional areas ranging from marketing to customer service. Let's get optimized!

ACTION ITEMS

1. Think about your current content, optimization, and social media marketing efforts. How could you start integrating those programs?
2. What areas of your content marketing could you start optimizing related to content discovery, consumption, and sharing?
3. Is your content optimization more focused on keywords or customers? Consider how you could begin to evaluate customer needs as inspiration for keyword research and content.
4. Look at your current social media content. Where might you begin to optimize for better social engagement and customer experience?
5. Identify the spectrum of content types used from the top of the buying cycle to customer support. Consider how you might optimize and socialize content more holistically.

SOCIAL MEDIA METRICS

How to Measure and Optimize Your Marketing Investment

Jim Sterne

Foreword by David Meerman Scott

The New Rules of Social Media Series

Contents

Foreword
Acknowledgments
Introduction: Getting Started – Understanding the Ground Rules
Chapter 1 Getting Focused – Identifying Goals
Chapter 2 Getting Attention – Reaching Your Audience
Chapter 3 Getting Respect – Identifying Influence
Chapter 4 Getting Emotional – Recognizing Sentiment
Chapter 5 Getting Response – Triggering Action
Chapter 6 Getting the Message – Hearing the Conversation
Chapter 7 Getting Results – Driving Business Outcomes
Chapter 8 Getting Buy-In – Convincing Your Colleagues
Chapter 9 Getting Ahead – Seeing the Future
Appendix: Resources
Index

CHAPTER 1

Getting Focused – Identifying Goals

"I know the price of success: dedication, hard work, and an unremitting devotion to the things you want to see happen."
Frank Lloyd Wright

"Give me a stock clerk with a goal and I'll give you a man who will make history. Give me a man with no goals and I'll give you a stock clerk."
J.C. Penney

Measuring for measurement's sake is a fool's errand. But we're all fools now and again. The human mind loves order. People think their iPod really knows how they feel by how it sequences songs just so when set to shuffle. People tell fortunes by reading the leaves in the bottom of a tea cup or from the succession of upturned tarot cards. When faced with absolute randomness, the human mind kicks into overdrive to find patterns. We enjoy spending quiet time on the grass finding animals in the clouds, and conspiracy theorists can find plots and schemes in random events.


In the same way, web marketers have attempted to divine significance from the rows of IP addresses, file names, byte counts, and time stamps in the log files of web servers from the very beginning. Over the years, with the advent of additional data collection technologies, we have proven that our conjectures and prognostications are valuable to business. Hypotheses can be scientifically tested to show that we understand and can influence onsite behavior by making specific changes to a web site and measuring the results. We can alter our prospective customers' behavior by altering our promotional efforts and persuasion techniques.

Measurement Is No Longer Optional


Katie Delahaye Paine is a PR maven who understands social media better than most. She's an insightful consultant and an engaging speaker, and one of her more popular PowerPoint presentations is available online at www.themeasurementstandard.com/issues/5-109/paine7stepssocial5-1-09.asp. It's called "7 Steps to Measurable Social Media Success." In step two, Katie advocates setting clear, measurable objectives. She says you need to know what problem you need to solve, you need to not do anything in social media if it doesn't add value, and she reminds us that you can't manage what you can't measure, so set measurable goals.

Whether money is tight or times are good, everybody is bent on improving their business performance based on metrics. You cannot continue to fly by the seat of your pants. Automated systems and navigational instrumentation are required on passenger planes, and your business deserves no less.


As the tools escalate in sophistication, there remains one truism that cannot be ignored. Regardless of the amount of data and the cleverness of analytics tools one has, we still need analysis. The sharpest analyst or most talented statistician in the world is stymied without data, to be sure. But without those brilliant minds cogitating about a given purpose, those tools and data can create pretty charts and graphs and not much else. The most frequent missing piece is a specific problem to solve.

Every analyst has been asked to describe the past, explain the present, and tell the future given a data warehouse full of bits and bytes and the assumed ability to interpret human intent. When faced with the question "Here's a bunch of data; what does it mean?" there are only two responses. The first is a tedious explanation of how the word data is the plural of datum and therefore the inquisitor's grammar is lacking. This approach is tiresome for the addressee and only fun for the analyst the first couple of times. The second response is "What problem are we solving for?" While this is an equally egregious mangling of the King's English, it is an integral part of the analytical vernacular. The question, while sounding just as haughty as the former grammar lesson, is critical. When getting into a taxi, one is expected to know and communicate one's destination.

Of course a statistician can groom a large data dump and find correlations between temperature, elevation, and the rate of change in barometric pressure. But he won't volunteer the critical answer of whether you should bring an umbrella unless you specifically ask, "Do you think it might rain?" The same is true of marketing, especially online marketing, where we are data rich and insight poor.


Measurement, Metrics, and Key Performance Indicators


There were 4,231 views and mentions of your viral marketing campaign on the first day. On hearing this, you might jump out of your chair, run down the hall, high-five the older members of your team, fist-bump the younger ones, and open a bottle of champagne. Alternatively, you might slump in your chair, hide from the rest of your team, and open a bottle of antidepressants.

Four thousand two hundred and thirty-one is a measurement. Without context, it is merely a number. When compared with your personal best, company expectations, or your competitors' efforts, that number becomes a metric. It is now indicative of value, importance, or a change in results. If that metric is central to the well-being of the organization, it might be considered a Key Performance Indicator (KPI). It might be worthy of daily e-mail updates, dashboard placement, and iPhone app notifications. To be a KPI, it must indicate how well your organization's goals are being served.

Therein lies the rub, the downfall of web measurement people everywhere: ill-defined objectives. Without context, your measurements are meaningless. Without specific business goals, your metrics are meaningless.
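To make the measurement-to-metric-to-KPI progression concrete, here is a rough sketch. The 4,231 figure is the chapter's example; the baseline, revenue numbers, and goal are invented purely for illustration:

```python
# Illustrative only: turning a raw measurement into a metric and a KPI.
# 4,231 is the chapter's campaign figure; the baseline and the revenue
# numbers are made up for the sake of the example.

measurement = 4231   # raw count: views and mentions on day one
baseline = 12000     # hypothetical: last campaign's day-one count

# A metric is the measurement placed in context.
metric = measurement / baseline
print(f"vs. last campaign: {metric:.0%} of day-one buzz")  # -> 35%

# A KPI ties a metric to an organizational goal.
revenue_goal = 50000  # hypothetical day-one revenue target, dollars
revenue = 14800       # hypothetical day-one revenue, dollars
kpi_met = revenue >= revenue_goal
print(f"day-one revenue goal met? {kpi_met}")  # -> False
```

The same raw number warrants champagne or antidepressants depending entirely on the context supplied by the baseline and the goal.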

Proceed Ye No Further if Ye Have No Goals


It is crucial to map out your specific business goals before embarking on a social media program. As Yogi Berra put it, "If you don't know where you are going, you will wind up somewhere else."


Companies that tout their success because they track the number of friends and followers will never compete effectively with those who track sales and profits gained from reaching out to their followers. You want a goal? Income's a great goal, but it's not alone.

THE BIG THREE BUSINESS GOALS


It's time to get very high-level. There are only three true business goals (Figure 1.1). They are all that matters in the long run. If the work you do does not result in an improvement to one or more of these Big Three Goals, then you are wasting your time, wasting money, spinning your wheels, alienating customers, and not helping the organization. You may be covering your backside and building your empire, but in the long run you will not ensure your status as an employee.

Your focus should always be on either increasing revenue, lowering costs, or improving customer satisfaction. Doing all three would be just fine.

Figure 1.1


There are many measurable elements that indicate whether you are improving on one or more of these Big Three Goals. You need to keep an eye on these critical factors because you are running your marketing programs in real time and can't wait for month-end or quarterly results to make adjustments along the way. "Are we there yet?" is the question asked from the backseat. "Are we still going in the right direction?" and "Is there anything in the way?" are asked from behind the wheel and lead to business and career success. You can always think of something to earn more, spend less, and make customers happier. If you can do all three at the same time, do please give me a call. You are headed for greatness, and I love a good case study.

Increased Revenue
Considered the easiest to measure, revenue is always tabulated in terms of cash. You raked it in or you didn't. You met the expected return on investment or you missed the mark. You brought in more this time than last time or you fell under the bus. A Mark, a Yen, a Buck or a Pound, they are very easy to tot up.

If the things you are measuring cannot be connected back to income, then you need to be very clear why you are taking the time to measure them. You can completely baffle your colleagues with analytics colloquialisms like sentiment volatility rate, pass-along engagement velocity, and uptake-to-captivation ratios. But as soon as you connect the dots to arrive at income, everybody knows what you are talking about and has a standard, consensual means of evaluating the righteousness of your social marketing programs. While income is always the pot of gold at the end of the rainbow, there is another consideration that cannot be ignored: the other side of the profit equation called Cost.


Lowered Costs
It's easy to bring in a million dollars; just sell two million one-dollar bills for 50 cents each. Clearly, your attention should be focused on profits. So, while coming up with new and innovative ways to make sales, don't forget to come up with new and innovative ways to lower expenses. If you can lower the cost of finding that pot of gold, then there's more net gold to go around. Customer service and market research are the obvious areas where social media can boost profits by lowering costs, but it's a fine balance. You must spend money to make money, but if you can show that social media is a less expensive way to measure public opinion, make friends, and influence people, then you can have a larger share of the budget next time around. Oh, and you get to keep your job, too.

Improved Customer Satisfaction


The great thing about improving customer satisfaction is that it raises revenue and lowers cost. Happier customers are more likely to buy again. It is cheaper to sell something to somebody already in your database than it is to have to beat the bushes to find new ones. So if customer satisfaction is a factor in income and expense, and if income and expense are simply part of the profit equation, then the only goal worth worrying about is profits, right?

Not so fast, Mr. Lay. Remember Kenneth Lay? The famous CEO, last seen being walked out of Enron headquarters wearing government-issued bracelets? It seems that when one focuses on profits alone, one steps in over one's head with unpleasant results. Happy customers are necessary to the life of a business, if you can just force yourself to look beyond the quarterly report. A 6- or 9- or 12-month window will verify
that a company with unhappy customers is not on a path to survival. Curiously, there is proof. A company's American Customer Satisfaction Index score has been shown to be predictive of both consumer spending and stock market growth, among other important indicators of economic growth (www.theacsi.org). Yes, that's right, happier customers make for increased stock prices.

THE FREE CALCULATOR DILEMMA


If you need a down-and-dirty answer to how much social media can raise revenue and lower costs, there are plenty of online tools like DragonSearch's Social Networking Media ROI Calculator (Figure 1.2). These calculations are ideal for those who have to fit into a specific budget and wish to manipulate the numbers until they do what they're told.

Figure 1.2

DragonSearch's Social Networking Media ROI Calculator lets you fudge the numbers till they'll play nice. (www.dragonsearchmarketing.com/social-media-roi-calculator.htm)


All spreadsheets can help in the same way. You can play what-if until the cows come home, or at least until the boss allocates another sliver of the budget for a Facebook app that might go viral. The solution is to get a clear handle on your goals and sweep up a handful of online metrics tools and services to prove your point. Show the actual results of your endeavors rather than hockey-stick conjectures. Just don't go overboard. That'll lead you to where the tool-price quandary pendulum swings too far the other way.

THE EXPENSIVE CALCULATOR DILEMMA


With a decent calculator, you can add, subtract, multiply, divide, and get on about your business. With a spreadsheet, you can play what-if scenarios until your keyboard wears out. With a customer relationship management system, a dynamic content management server, an integrated ad server, a recommendation engine, and a sentiment analysis system, you can deliver the right message to the right person at the right time. And, given a supercomputer the size of Deep Thought, you can calculate the answer to life, the universe, and everything, in about 7 million years. But should you? If that spreadsheet has enough horsepower for the project, you do not want to spend more than you can get in return.

Here's where calculating the ROI on ROI rears its ugly head. Suppose you used a million dollars' worth of the world's most sophisticated tools to increase sales by .002 percent. Not something you'd want in your LinkedIn profile. Unless, of course, you thereby delivered an additional $7.5 million in sales. Update LinkedIn, Facebook, and MySpace and reach for the champagne bottle. Kudos to you and well worth
the price for all those tools. If you don't happen to work for Walmart and your organization doesn't happen to sell $375 billion in goods every year, then your mileage, online self-aggrandizement, and bottle choice may vary.

The most important part of all analysis is whether and how the resulting metrics will be used. If you want to know how many of this compares with how many of that, you should know exactly why you want to know. How will you use that information? What business decisions will be made based on a movement of 5 to 10 percent in any one direction?

As an analyst you are inundated with requests for reports. You'll want to pay heed to Judah Phillips' advice in his post on the Web Analytics Demystified web site entitled "Thoughts on Prioritizing Web Analytics Work" (http://judah.webanalyticsdemystified.com/2008/10/thoughts-on-prioritizing-web-analytics-work.html), summarized here.

How do you prioritize requests for analysis? Answer these questions:
- Is revenue at risk? (Always seems to be #1.)
- Who's asking? (As Bob Page from Yahoo! likes to say: All metrics are political.)
- How difficult is the request? (Do some, but not all, of the easy ones right away.)
- Can it be self-serviced? (Let them count cake!)
- When is the analysis needed? (Between immediately and 7.5 million years from now.)
- Why is the analysis needed? (This is the gotcha! question.)
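The ROI-on-ROI arithmetic above, a .002 percent lift on Walmart-scale sales weighed against a million-dollar toolset, can be checked in a few lines. The dollar figures are the chapter's; the function itself is just an illustrative sketch:

```python
# Sketch of the "ROI on ROI" check: a tiny percentage lift is only worth
# an expensive toolset if the revenue base is large enough.

def incremental_revenue(annual_sales: float, lift_pct: float) -> float:
    """Extra revenue from a given percentage lift in sales."""
    return annual_sales * lift_pct / 100.0

tool_cost = 1_000_000            # the million-dollar toolset
annual_sales = 375_000_000_000   # ~$375 billion in yearly sales
lift = incremental_revenue(annual_sales, 0.002)  # a .002 percent lift

print(f"incremental revenue: ${lift:,.0f}")    # -> $7,500,000
print(f"worth the tools? {lift > tool_cost}")  # -> True
```

Run the same numbers on a $50 million business and the lift is $1,000, which makes the point: the tooling budget has to be judged against the revenue base it can plausibly move.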


Understanding Analysis
Are fat people lazy? Pat LaPointe from MarketingNPV asked that question in an article for MediaPost (www.mediapost.com/publications/?fa=Articles.showArticle&art_aid=110610). He went on to explain how hard it is to provide a specific answer to a non-specific question. In order to answer this loaded question, one would have to:

1. Define fat. Weight/height ratios, body mass index, and body fat content are all legitimate options, but a common definition would have to be agreed to before calculations can begin.
2. Define lazy. Same problem. Levels of exercise? Work habits? Overreliance on modern conveniences?
3. Define the standard of proof. Just how fat is fat and just how lazy is lazy?
4. Design a means of observing if the question is true. Conduct the research and collect the data.

LaPointe then turned to "Is our marketing cost-effective?" and illustrated that these definitions are even more wobbly. He warned that the wobbility of these terms is exactly where politics enters the effort. Yes, Bob, all metrics are indeed political. Therefore, you must first be certain that you know what you are trying to find out (what problem you are solving for) and then be certain that you and those around you agree on
your definitions of the terms you use to describe and solve that problem.

It's also useful to understand how the mind works. In the introduction of his book Psychology of Intelligence Analysis (Center for the Study of Intelligence, Central Intelligence Agency, 1999), Richards J. Heuer, Jr. wrote:

"People construct their own version of reality on the basis of information provided by the senses, but this sensory input is mediated by complex mental processes that determine which information is attended to, how it is organized, and the meaning attributed to it. What people perceive, how readily they perceive it, and how they process this information after receiving it are all strongly influenced by past experience, education, cultural values, role requirements, and organizational norms, as well as by the specifics of the information received."

For some very useful and practical advice on approaching an analysis project, I can highly recommend The Thinker's Guide to Analytical Thinking by Dr. Linda Elder and Dr. Richard Paul (The Foundation for Critical Thinking, 2006) (www.criticalthinking.org).

Marketing Analysis and Optimization from 30,000 Feet


"Divide each difficulty into as many parts as is feasible and necessary to resolve it."
René Descartes


Knowing your goals, the company goals, and the limits of your budget for gathering and quantifying data are the entrance fee. If you don't feel as if you have a handle on your goals and resources, read on, but first place a rather large yellow sticky where you can't miss it, reminding you to find out fast. If we consider the flow of the relationship between a company and a customer, we have a framework for addressing the metrics of each step in sequence. Optimizing your marketing is daunting, so you take it a step at a time.

Step one: Get their attention. You can't sell to somebody who has never heard of you. More on this in Chapter 2, Getting Attention – Reaching Your Audience.
Step two: Get them to like you. That's the subject of Chapter 4, Getting Emotional – Recognizing Sentiment.
Step three: Get them to interact. Chapter 5, Getting Response – Triggering Action.
Step four: Convince them to buy. Chapter 5, Getting Response – Triggering Action.

But none of that matters if you don't have goals. Start with an unremitting devotion to the things you want to see happen, divide each difficulty into as many parts as is feasible and necessary to resolve it, and you too can move up from stock clerk to history maker.

Contents

Foreword, by Larissa A. Grunig and James E. Grunig
Preface

Part 1  Not Your Father's Ruler

Chapter 1  You Can Now Measure Everything, but You Won't Survive Without the Metrics that Matter to Your Business
    Social Media Isn't about Media, It's about the Community in which You Do Business
    Measurement Is So Much More than Counting
    What Really Matters to Your Business?
    Why Measure at All?
        Data-Driven Decision Making Saves Time and Money
        It Helps Allocate Budget and Staff
        Gain a Better Understanding of the Competition
        Strategic Planning
        Measurement Gets Everyone to Agree on a Desired Outcome
        Measurement Reveals Strengths and Weaknesses
        Measurement Gives You Reasons to Say No
    Dispelling the Myths of Measurement
        Myth #1: Measurement = Punishment
        Myth #2: Measurement Will Only Create More Work for Me
        Myth #3: Measurement Is Expensive
        Myth #4: You Can't Measure the ROI, so Why Bother?
        Myth #5: Measurement Is Strictly Quantitative
        Myth #6: Measurement Is Something You Do When a Program Is Over
        Myth #7: I Know What's Happening: I Don't Need Research
    Measurement, the Great Opportunity: Where Are Most Companies in Terms of Measurement and Where Could They Be?

Chapter 2  How to Get Started
    10 Questions Every Communications Professional Must Be Able to Answer
        Question #1: What Are Your Objectives?
        Question #2: Who Are Your Program's Target Audience(s)?
        Question #3: What Is Important to Your Audiences?
        Question #4: What Motivates Them to Buy Your Products?
        Question #5: What Are Your Key Messages?
        Question #6: Who Influences Your Audience(s)?
        Question #7: How Do You Distribute Your Product or Service?
        Question #8: What Are You Going to Do with the Information You Get from Your Research?
        Question #9: What Other Departments or Areas Will Be Affected?
        Question #10: What Other Measurement Programs Are Currently Underway?
    How to Decide What to Measure: Success - Are We There Yet?
    Making the Budget Argument
    How to Ensure Accurate Data
        Bad Data Reason #1: Incomplete Assessment of Variables
        Bad Data Reason #2: Relevancy of Content
        Bad Data Reason #3: Commercial Services Omit Results
        Bad Data Reason #4: The (In)accuracy of Content Analysis
    A Simple Checklist to Ensure Accurate Results

Chapter 3  Seven Steps to the Perfect Measurement Program: How to Prove Your Results and Use Your Results to Improve
    Step 1: Define Your Goals and Objectives: Why Are You Launching This Plan or Pursuing This Strategy? What Is the R in the ROI That You Are Seeking to Measure?
    Step 2: Define Your Environment, Your Audiences, and Your Role in Influencing Them
    Step 3: Define Your Investment: What Will It Cost? What Is the I in ROI?
    Step 4: Determine Your Benchmarks
    Step 5: Define Your Key Performance Indicators: What Are the Metrics You Will Report With?
    Step 6: Select the Right Measurement Tool and Vendors and Collect Data
    Step 7: Turn Data into Action: Analyze Data, Draw Actionable Conclusions, and Make Recommendations
    Five Ways to Measure ROI
    How to Leverage Your Measurement Results to Get What You Want

Chapter 4  Yes, You Can Afford to Measure: Choosing the Right Measurement Tool for the Job
    How to Decide What Tool Is Right for You: The Right Tool Depends on the Job
    Tools to Determine What Your Marketplace Is Saying: Media Content Analysis
        Type of Media
        Visibility: Prominence + Dominance
        Tone
        Messages Communicated
        Sources Mentioned
        Conversation Type
    Tools to Determine What Your Marketplace Is Thinking: Opinion Research and Surveys
        Measuring Awareness
        Measuring Preference
        Measuring Relationships
        Measuring Engagement
    Tools to Determine What Your Marketplace Is Doing: Web Analytics and Behavioral Metrics
    What's It Really Going to Cost?
        Controlling the Cost of Media Content Analysis
        Controlling the Cost of Surveys
        Random Sample Your Content
        If You Have No Budget at All
    Qualitative versus Quantitative Research
        Focus Groups Provide Insight
        Surveys Provide Facts

Part 2  How to Measure What People Are Saying about You Online and Off

Chapter 5  How to Measure Marketing, Public Relations, and Advertising in a Social Media World
    The Three-Part Social Media Revolution
        Thought Shift #1: Redefine Now
        Thought Shift #2: Redefine PR, Advertising, Marketing, and Corporate Communications
        Thought Shift #3: Change How We Quantify Success
    The New Rules for PR and Social Media
        New Rule #1: You're Not in Control, and Never Have Been
        New Rule #2: There Is No Market for Your Message
        New Rule #3: It's about Reaching the Right Eyeballs, Not All the Eyeballs
        New Rule #4: It's Worse to Not Be Talked about at All
    Building the Perfect Online Measurement Program
    The Two Worlds of Social Media
    Measuring What You Can Control: Web Metrics and Engagement
        Level 1 Engagement: Lurking
        Level 2 Engagement: Casual
        Level 3 Engagement: Active
        Level 4 Engagement: Committed
        Level 5 Engagement: Loyalist
    Measuring What You Can't Control
        Step 1: Define the Goal
        Step 2: Identify Your Publics and Determine How Your Social Media Efforts Affect Them
        Step 3: Define Your Benchmarks
        Step 4: Determine the Specific KPIs by Which You Will Define Success
        Step 5: Select a Tool
        Step 6: Collect Data, Analyze Results, Make Recommendations, and Measure Again
    A Final Word on ROI and Comparing Social Media to Other Tactics
        What's Wrong with Advertising Value Equivalency?

Chapter 6  How to Use Numbers to Get Closer to Your Customers
    Listening, Learning, and Responding to the Marketplace
        Set Up and Refine Your Search Strings
        Review and Track the Results
        Verify Which Outlets Matter
    Determine What the Market Thinks of You and Your Competition: What Are Your Market Hot Buttons?
    Determine How You Are Positioned in the Marketplace versus the Competition, and Use That Knowledge to Gain Advantage
    Listening, Learning, and Responding to Your Customers
    Turning Feelings into Numbers and Metrics

Chapter 7  Measuring the Impact of Events, Sponsorships, and Speaking Engagements
    Why Events and Sponsorships?
    Use Data to Support Your Event Decisions
    Social Media Has Redefined the Concept of Events
    Events and the Relationships behind Brand Engagement: How Are People Involved with Your Brand?
    Seven Steps to Measure Sponsorships and Events
        Step 1: Define Your Objectives
            Sell Products
            Launch New Products
            Drive Affinity between Customers and the Brand
            Reach New Markets and Customers
        Step 2: Determine Your Measurable Criteria of Success
        Step 3: Decide Upon Your Benchmarks
        Step 4: Select a Measurement Tool
        Step 5: Define Your Specific Metrics
        Step 6: Choose a Measurement Tool
        Step 7: Analyze Your Results and Use Them to Make Your Events More Effective
    How to Calculate ROI for a Booth at an Event: Was It Worth the Time and Resources?

Chapter 8  How to Measure Influencers and Thought Leadership
    New Influencers, New Thought Leaders, New Relationships
    How to Build a Custom List of the Top 100 Influencers in Your Marketplace
        Step 1: Search for Blogs That Mention You or Your Marketplace Most Frequently
        Step 2: Verify That the Blogs and Bloggers Are Actually Important
    How to Measure Your Relationships with Your Influencers
        Step 1: Define Your Goals
        Step 2: Define Your Audience
        Step 3: Define Your Benchmark
        Step 4: Define Your Key Performance Indicators
        Step 5: Select Your Measurement Tool

Chapter 9  Measuring Relationships with Your Local Community
    Who Are Your Neighbors and Why Are They Important?
    How Do Good or Bad Relationships Influence Your Organization?
    Who and What Is Most Important to Measure?
    Seven Steps to Measuring Relationships with Your Communities and Neighbors
        Step 1: Agree upon Solid Measurable Goals That Are Tied to the Bottom Line
        Step 2: Define Your Publics
        Step 3: Who or What Are Your Benchmarks?
        Step 4: Set Your Audience Priorities: Who and What Is Most Important to Measure?
        Step 5: Choose Your Measurement Tools
            Relationship Surveys
            Local Media Analysis Is Critical
        Step 6: Analyze the Data
    When It Comes Up for a Vote, It's Too Late to Change Anything
    Fishing in the Talent Pool?
    Nonprofit Measures
    Government Can Plan, and Poll, Ahead
    Campus Opportunities

Chapter 10  Measuring What Your Employees Think
    If Employees Are So Connected, Why Is It So Hard to Communicate with Them?
    Seven Steps to Measuring What Employees Think, Say, and Do as a Result of Your Internal Communications
        Step 1: Understand the Environment and Where They Really Get Information
            How Are Messages Getting through to Employees, and What Are They?
            What Channels or Vehicles Do Employees Trust?
            What's Important to Them?
            What Do They Think about the Organization Today?
        Step 2: Agree on Clear, Measurable Goals
        Step 3: Select a Benchmark to Compare To
        Step 4: Define the Criteria of Success
        Step 5: Select Your Measurement Tools and Collect Data
            Message Analysis Tools
            Outcome Measurement Tools
        Step 6: Analyze and Take Action
            Use Surveys to Determine What Employees Think
            Make Changes to Improve Employee Relationships

Chapter 11  Threats to Your Reputation: How to Measure Crises
    Measuring What Is Being Said about You
    Measuring What People Believe about You
    Trust Is the Key to Building and Defending Your Reputation
        What Is Trust?
        BS Is More Damaging than Lies
    Measuring What People Do: Long-Term Effects and Follow-Up Research
    Seven Steps to Measure Crises and Trust
        Step 1: Define a Specific Desired Outcome from the Crisis
        Step 2: Define Your Audiences and What You Want Your Relationships to Be with Each One
        Step 3: Define Your Benchmark
        Step 4: Define Your Measurement Criteria
        Step 5: Select a Measurement Tool
        Step 6: Analyze Results, Glean Insight, and Make Actionable Recommendations
        Step 7: Make Changes and Measure Again

Chapter 12  Measuring Relationships with Salespeople, Channel Partners, and Franchisees
    Millions Spent on Sales Communications, but Does Any of It Work?
    The Problem: Mixed Messages, Mixed Objectives
    The Solution: Consistent Messages
    Measuring What Matters to Sales
    Other Measures of Success

Chapter 13  Measurement for Nonprofits
    Not Measuring Is Not an Option
    Measuring Relationships with Your Membership
        Step 1: Use Your Mission to Define Your Objectives
        Step 2: Identify and Prioritize Your Audiences
        Step 3: Establish a Benchmark
        Step 4: Pick Your Metrics
        Step 5: Pick a Measurement Tool
            Use Content Analysis to Measure Activity, Sentiment, and Messaging
            Use Surveys to Measure What People Think about You
        Step 6: Analyze Results and Make Changes
    Measuring Behavioral Change
    Measuring Results During a Crisis

Chapter 14  Measure What Matters in Higher Education: How to Get an A in Measurement
    University Flunks Measurement: Millions in Funding Lost and President Resigns
    Key Considerations: Multiple Audiences = Multiple Goals = Multiple Metrics
    Five Steps for Getting an A in Measurement
        Step 1: Identify and Prioritize Your Audiences
        Step 2: Define Your Objectives and Get Everyone on the Same Page
        Step 3: Establish a Benchmark
        Step 4: Pick a Measurement Tool and Collect Data
            Measure What the Media Is Saying about You
            Measure What People Think
            Measure Behavior
            Measure Social Media in the Academic Environment
        Step 5: Analyze the Data, Glean Insight, Make Changes, and Measure Again

Epilogue: Whither Measurement?

Appendix 1: The Grunig Relationship Survey
Appendix 2: Measurement Resources
Glossary
References
Index


CHAPTER 1

You Can Now Measure Everything, but You Won't Survive Without the Metrics that Matter to Your Business

"What is wanted is not the will to believe, but the will to find out, which is the exact opposite."
Bertrand Russell

Until recently, the attitude toward measurement in business has been: It's too expensive and too complicated, and really only applicable for major corporations. In the past decade, however, a confluence of circumstances has pushed measurement and metrics onto the priority lists of businesspeople everywhere.

First there was the Internet explosion. The Internet, and specifically social media, has been adopted by businesses worldwide in record-breaking time. It took 89 years for the telephone to reach the level of household penetration that Facebook reached in just five. As consumers increasingly research and purchase goods online, their behaviors, thoughts, and opinions have become easier to track and measure. At the same time, the proliferation of listening, analysis, and reporting tools has made such metrics affordable and accessible to every organization, from nonprofits to go-fast Internet start-ups.

Measure What Matters

Then there is the current global recession. In these hard times most every business is taking a hard look at what strategies, programs, and communications are working and not working. Today, if you're in business and want to survive, you will need to continuously measure and improve your processes and programs. Whether or not you are measuring, your competition very likely is, and as a result probably knows more about your business than you do.

This book will do much more than just teach you how to measure. It will teach you how to measure what you need to make the decisions that are crucial to your business. It used to be that he or she with the most data wins. But today nothing is cheaper and easier to come by than data, especially useless data. It's having the right data that counts.

While every program is different, all organizations have a core set of key publics with whom they need to build relationships, collectively known as the stakeholders. These include, among others: the media, employees, customers, distributors or sales force, the local community, industry influencers, financial analysts, and elected officials. Each stakeholder group requires slightly different measurement tools and slightly different metrics. That's why this book is organized around the stakeholders, each with its own chapter and its own procedures and advice. This book shows you how to measure business relationships with just about any key public that your job involves.

Social Media Isn't about Media, It's about the Community in which You Do Business
Most of what I advocate in this book wouldn't be possible or necessary without social media. We talk about social media as a shiny new object, as if it's some sort of new toy for business. In fact, social media has changed everything important to your business. From marketing and sales to employee and financial management, the social media revolution has forced all of us to rethink how we approach business, our marketplaces, and our customers.


Today, customers talk to and trust each other more than they do companies. They choose how they spend their time and money based on recommendations from people with similar tastes and profiles. They trust, and therefore prefer to do business with, companies that are open, honest, and authentic. Companies with which they have good relationships are more likely to be forgiven when they make a mistake. Thus, companies that listen carefully to their customers and respond to their needs will survive and prosper. Those who don't will be gone.

In order to succeed in this new era of easy and frequent conversations, it is critical that you continuously listen to and evaluate what your market is saying about you. Companies that do can promote themselves more efficiently, innovate more effectively, and operate more profitably.

Measurement Is So Much More than Counting

Before we get into the how-tos of measurement we need to be clear on our definitions. Everyone in business already has some form of accounting in place. All business owners know how to count inventory, the number of ads they place, or the number of stories in which they are mentioned. They count their customers, their sales, and generally count their profits. But counting is very different from measurement. Counting just adds things up and gets a total. Measurement takes those totals, analyzes what they mean, and uses that meaning to improve business practices.

Measurement of your processes and results, where you spend your time and money and what you get out of it, provides the data necessary to make sound decisions. It helps you set priorities, allocate resources, and make choices. Without it, hunches and gut feelings prevail. Without it, mistakes get made and no one learns from them.


What Really Matters to Your Business?

Only a handful of businesses, those that prosper, grow, and continuously improve, measure what matters. Most organizations, when asked what really matters to their business, would probably say "my customers" or "my employees." And they'd be partially correct. But it's not the number of customers and employees that matters, it's the relationships that your organization has with them that matters.

Good relationships lead to profits. With good relationships, prospects become customers and customers become loyal advocates for your company. Thanks to good relationships, employees stay, learn, grow, and contribute to their organizations. Poor relationships result in more expensive operations, fewer sales, less customer loyalty, more churn, higher legal fees, higher turnover rates, more expensive recruiting costs, and, ultimately, disadvantage in the marketplace.

In public relations, if you establish good relationships with reporters, bloggers, editors, and other key influencers, they'll trust your word, cut you slack in a crisis, and turn to you for your thoughts and opinions. A lack of good relationships with the media leads to crises escalating, omission from key stories, and less inclusion of your point of view in stories.

So what really matters is your relationships and the aggregated outcome of those relationships: your reputation. Today, if you're not measuring the health of your relationships, you won't be in business for very long. This book tells you how to measure those relationships and what to do with the data once you have it.

Why Measure at All?

When budgets are flush, there's a popular misconception that it doesn't much matter how you measure results, as long as there is a perfunctory number that shows up for your department every so often. But times aren't always flush. And the bean counters and stakeholders are getting more demanding.

Even when profits are rising, measurement saves time and money. The spectacular proliferation of social media, from Twitter to Facebook to YouTube and beyond, means the average businessperson is faced with a bewildering array of opportunities and obstacles. It's a new and rapidly changing world out there, and the most productive way to run your business is not obvious. The prudent and productive approach is to measure the results of all your efforts in a consistent manner and compare the results against a clearly articulated and predefined set of goals.

When I entered the field of corporate communications it was by way of journalism, and I had little practical knowledge of communications tactics and strategies. So I asked a lot of questions, such as, "Where do we get the most bang for the buck?" and "Which strategy results in the cheapest cost per message communicated?" At the time, no one had the answers at their fingertips, so I developed systems to get the data. And for more than two decades I have been refining those systems and developing new ones. You will read about them in detail in the following chapters.

Along the way I learned that measuring your success is not just another buzzword that follows Six Sigma, TQM, and paradigm shifts. It is a key strategic tool that helps you better manage your resources, your department, and your career. No matter what type or size of organization you are in, there are half a dozen advantages to setting up a measurement program. Here they are:

Data-Driven Decision Making Saves Time and Money

Making decisions based on data saves time and boosts your credibility. When faced with tough decisions, you'll seldom find boards of directors or CEOs relying on hunches or gut instinct. Chances are any decisions made at the highest levels will be made following extensive research.


So why should other business decisions be any different? How credible would your CFO be if he got up in front of the board and said, "I know we're making money because I see the checks coming in"? Just as the CFO relies on accounting data to give advice and make recommendations on financial issues, you need other data to decide where, when, and how to allocate resources in other departments, including HR, marketing, public affairs, communications, and sales support.

It Helps Allocate Budget and Staff

I once used a competitive media analysis to indicate the need for PR staff for a major semiconductor company. We analyzed this client's presence in key media and compared it to that of three competitors to determine who was earning the greatest share of ink. As it happened, over a two-year period there was very little difference between the competitors, with the four organizations equally matched in coverage each month. But at a certain point the client's results took a dive; all of a sudden its share of ink in the key trade media dropped to about two percent.

I presented the results and asked the audience, which included several managers, what had happened. The answer was: "That was when we reorganized and eliminated our PR effort." I replied by demonstrating that, in the months following the reorganization, the market had had about nine times more opportunities to see news about the competition's products than their own. That seemed to do the trick; the last time I was in touch, the PR staff had grown to about 10 people and their budget was increasing every year.

Gain a Better Understanding of the Competition

Your business or organization is always competing for something: sales, donations, search ranking results, share of conversations, share of wallet, or share of voice. So you need to know how you stack up against your peers and rivals. Measurement gives you insight into competitive strengths and weaknesses.

Strategic Planning

Deciding how to best allocate resources is arguably the most important responsibility of any manager. But without data you are forced to rely on gut instinct. And as accurate as your gut may be, it doesn't translate very well into numbers. What you need is data: data you can rely on to guide your decisions and to improve your programs.

Measurement Gets Everyone to Agree on a Desired Outcome

You can't decide what form your measurement program is going to take without an agreed-upon set of goals. This alone may be the best reason to start measuring. Putting everyone in a room and getting agreement on what a program is designed to achieve eliminates countless hours of blaming and bickering later if the project doesn't work. This is especially true with social media. Too often people will complain that marketing dollars spent on social media are unmeasurable, when, in fact, the real reason metrics don't exist is that no one ever articulated just what the social media program was designed to do. And "getting our feet wet in social media" is not a measurable goal, unless you're a duck.

Measurement Reveals Strengths and Weaknesses

Measurement isn't something you should do because you're forced to. It should be approached as an essential strategic tool to more effectively run your business. Deciding how to allocate the necessary resources and staff is easier if you know exactly what works and what doesn't, especially when it comes to social media.



One of my first experiences with measurement was at Lotus Development (now IBM Software). I was nearing the end of my first year there and wanted to determine what had worked and what hadn't. My primary role was to ensure the successful communication of key messages to the target audience. So we gathered the 2,400 or so articles that mentioned Lotus during the previous year and analyzed each one to determine whether it left a reader more or less likely to purchase Lotus software, and whether it contained one or more of the key messages our company was trying to communicate about itself. To do the analysis I hired 20-something college students who were in the market for software. I gave them careful instructions on how to read and analyze each article to determine if it left them more or less likely to buy Lotus software.

The results were very revealing. The $350,000 launch (with a major cocktail party) of a word processing product generated plenty of coverage, but very few of those articles contained our key messages. In fact, a $15,000 press tour was much more effective at getting key messages to our target audiences. The metric we used to measure success was cost per message communicated (CPMC), and the press tour delivered a CPMC of $0.02 compared to the party's CPMC of $1.50. (See Figure 1.1.) Based on this data, we immediately cut the planned $150,000 party out of the next product launch plan.

Even more revealing was our success in penetrating new markets. We were targeting software buyers with a product that required us to reach an entirely new audience that relied upon a distinct group of industry trade magazines. When we analyzed the results, we realized that this new group of journalists had not responded well to our pitch, and, in fact, their stories were only half as likely to contain key messages as was typical. I called a few of these journalists and tracked down the source of their problem, which turned out to be a member of my staff who wasn't responding in a timely manner. Through proper coaching I was ultimately able to repair the relationship.
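The CPMC arithmetic above is simple division: total program cost divided by the number of times a key message reached the audience. A minimal sketch in Python; the message counts below are hypothetical, back-figured from the CPMC figures quoted above, and are not from the book:

```python
def cpmc(cost: float, messages_communicated: int) -> float:
    """Cost per message communicated: total spend divided by the
    number of times a key message reached the target audience."""
    if messages_communicated <= 0:
        raise ValueError("need at least one message communicated")
    return cost / messages_communicated

# Hypothetical counts chosen only to reproduce the quoted CPMC figures.
press_tour = cpmc(15_000, 750_000)     # $0.02 per message
launch_party = cpmc(350_000, 233_333)  # roughly $1.50 per message

print(f"press tour:   ${press_tour:.2f} per message")
print(f"launch party: ${launch_party:.2f} per message")
```

The same division works at any scale, which is what makes CPMC useful for comparing tactics with very different budgets.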



Figure 1.1  Media analysis is one way to demonstrate what works and what doesn't.

Another example involves a client who had us compare the results of a press tour with those of a press conference to determine which was more effective. The results varied little in terms of quantity of coverage. In terms of quality, however, the press tour received nearly twice as much positive press and communicated almost twice as many messages, all for a fraction of the cost of the press conference.

Here's an example from the nonprofit world. In 2009, my company compared two programs undertaken by the USO. In one, newly elected President Obama stuffed Care packages on the White House lawn, and in another The Colbert Report visited Baghdad. Our metrics showed that the Colbert event generated about 10 times more publicity, but both events generated almost identical amounts of online donations, and at vastly different costs. The Obama event wasn't nearly as expensive or as interesting to the media, but it sent out a very effective message to the USO's mailing list: If you can't be there to support the USO in person, donate online.

Measurement Gives You Reasons to Say No

All too often, making decisions based on gut feeling rather than data leads to overworked staff with unclear priorities. If and when you are presented with demands that seem ill-timed, rushed, or just plain unwise, there is simply no good argument with which to say "No." However, if you have data on the results of previous programs, you frequently gain the leverage you need to turn down requests that will be a waste of time or resources.

One of my clients took saying "No" to an entirely new level. At the time she was a lowly researcher, but each month she'd look at her report and highlight all the worst performing programs. She'd take the data to her boss and point out the failures. She'd then move budget and resources out of the failing programs and redirect them to those that were working. In short order the department was operating so effectively that she was promoted to vice president.

Dispelling the Myths of Measurement

So if measurement is all that valuable and important, why isn't everyone already doing it? There are a number of bona fide reasons, such as lack of knowledge, lack of time, and lack of a clear strategy, but most of the so-called reasons people give stem from a few commonly held myths about measurement.



Myth #1: Measurement = Punishment

Measurement is too often seen as a way to check up on people or a department, and thus is too often used to justify the existence of a program or budget. People shy away from accountability, fearing that it will reveal flaws and weaknesses in the organization. However, in two decades in this business I have never seen anyone punished for being more accountable. No one I've worked with has ever been punished for showing how to make a program more efficient or for having clear and quantifiable ways to figure out what works or what doesn't. In fact, most people who institute measurement programs find that they get more promotions, bigger raises, and increased budgets because of their ability to identify strengths and weaknesses and allocate resources more intelligently.

A corollary to the Measurement = Punishment myth is, "I'm afraid I'll get bad news." Too many people are afraid of projects being cancelled because they aren't working, or that they will hear unpleasant things from the stakeholders. The truth is: If something isn't working, it's wasting money and resources. So why would you want to continue it? And if people are saying bad things about your organization, they'll keep on saying them anyway, even if you're not listening and participating in the conversation.

Myth #2: Measurement Will Only Create More Work for Me

In the overall scheme of things, measurement seems to many of us just one more thing in a long list of high-priority items. Too often it gets dropped to the bottom of the list because it seems like too much work. The reality is that once a measurement system is in place, it actually makes everything else much easier. Data at your fingertips helps you to better direct the resources you have, ensuring that they are having maximum impact. Data at your disposal means less time debating the merits of one tactic over another. Gut feelings can always be second-guessed, but data is much harder to argue with.



Myth #3: Measurement Is Expensive

The number one reason that people give for not measuring is that they can't afford it. The truth is, you cannot afford not to measure. Without measurement, you have no way of knowing if you are spending your budget effectively. A measurement system frequently pays for itself because it inevitably leads to increased efficiency.

One international client of ours called its PR agencies together and showed them the results of our $10,000 benchmark study of their PR. Based on those results, the agencies were given concise new objectives for directing specific messages to specific audiences. Six months later, communication of the company's key messages had risen by 245 percent, and sales had increased in each country that implemented the program. Think of that: a tool that could more than double the exposure of key messages to your target audience and increase sales, all for less than $10,000.

Myth #4: You Cant Measure the ROI, so Why Bother? ROI is an accounting term that means return on investment. To calculate the ROI of any project, you take the total amount of money saved or brought in and subtract from it the total budget amount invested, then divide it by the cost of investment. Thats the net return, and it is typically a measure of money saved, costs avoided, or revenue brought in. There are some programs for which establishing a reliable ROI gure is not an easy task, but it is generally doable. For example, suppose you institute a new communications effort designed to increase trust in the organization, and you spend $50,000 on social media, public relations efforts, and community outreach. Now suppose that, as a result, the next time you go before the city council for a zoning board variance your request breezes through in one meeting. Youve probably saved that $50,000 in legal fees alone. Or, suppose that you have to recall a product and your sales rebound in 3 months instead


of 12. Again, the net gains far outweigh the cost of your outreach program. Just because something isn't easy to measure, that's no reason not to measure it. Too often the people who are screaming loudest about not being able to demonstrate the ROI of their programs are those who simply don't want to try anything new. They are just using ROI as an excuse to say no.

Myth #5: Measurement Is Strictly Quantitative

Another myth claims that measurement primarily concerns quantifiable entities such as sales, leads, conversions, mentions, friends, or followers. The reality, however, is that the only type of measurement system that works combines both qualitative and quantitative data. If all you look at are sales, and not the relationships your organization has with its publics, you'll never be able to accurately understand why those sales go up or down. To really understand your successes (and failures) you need to measure what I call "revenuetionships": both the revenue you bring in as well as the relationships and reputation that you build with your publics.

Myth #6: Measurement Is Something You Do When a Program Is Over

Measurement is too frequently seen as an afterthought, a tool to gauge the efficiency of a program you have already completed. On the contrary, to be maximally effective, measurement should be in place at the start of a program.

Myth #7: I Know What's Happening: I Don't Need Research

I hear it all the time. And so often from managers who are generally quite effective, but who could be so much more effective if they only


understood the true value of measurement. Measurement provides the context and the rationale behind changes in your reputation, your relationships, and ultimately your P&L. Everyone has a formal accounting system to track and measure profit and loss. So why the reluctance to establish similar formal systems to track and evaluate other business processes like marketing and communications efforts, relationships, and reputation? Without such systems in place, you will never know why your sales rise and fall, or what you need to do to make them rise faster.
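The ROI arithmetic described under Myth #4 can be sketched in a few lines of Python. The dollar figures below are hypothetical, loosely echoing the $50,000 outreach example; they are illustrations, not data from the book:

```python
def roi(total_returned, total_invested):
    """Return on investment as a fraction:
    (money saved or brought in - amount invested) / amount invested."""
    return (total_returned - total_invested) / total_invested

# Hypothetical case: a $50,000 outreach program that avoids $50,000 in
# legal fees plus $25,000 in other costs has returned 50 cents of net
# gain for every dollar invested.
print(f"ROI: {roi(75_000, 50_000):.0%}")  # ROI: 50%
```

Expressed this way, a result of 0% is break-even (the zoning-variance example, where the avoided legal fees roughly equal the spend); anything positive is net return.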

Measurement, the Great Opportunity: Where Are Most Companies in Terms of Measurement and Where Could They Be?
Despite billions spent on marketing and communications, the percentage of companies that actually measure their marketing efforts is shockingly low. Study after study shows that most CEOs don't feel that they have adequate measures in place. And, despite or because of the rapid growth of data mining, the general consensus is that there's lots of data but very little insight. Part of the reason is that most organizations don't allocate sufficient resources to measurement. The annual Annenberg GAP study of common practices in public relations for 2010 reports that the average corporation spends just 4.5 percent of its marketing budget on evaluation (http://annenberg.usc.edu/News%20and%20Events/News/043010SCPRC.aspx). Another recent study found that 79 percent of organizations aren't measuring the ROI of their social media efforts at all (www.mzinga.com/company/newsdetail.asp?lang=en&newsID=252&strSection=company&strPage=news). In general, those organizations that measure do better than those that don't. Years of studies have shown that one of the key ingredients of excellence is the ability to measure what matters (www.cmocouncil.org/news/pr/2008/011408.asp. Also: http://


customerexperiencematrix.blogspot.com/2010/05/cmo-surveymeasurement-isnt-our-top.html. And: www.lenskold.com/content/articles/lenskold apr07.html). All the best performers on most of the Top 100 lists are what we call "measurement mavens": they invest in metrics and research on a regular basis and use the results to continually improve their operations. Most of the companies on Fortune's Most Admired list have had some form of formal marketing measurement in place for years. What this translates into for most businesses is a great opportunity. As long as the competition isn't measuring, you have a huge competitive advantage. You can listen to your customers (and theirs) and act on the issues and opportunities of your marketplace. All while your competition continues to operate in the dark. You'll respond faster, your relationships with your employees and customers will be better, and your reputation will be stronger. The results will show in your bottom line.

Contents

INTRODUCTION
A BRIEF HISTORY OF INFOGRAPHICS
THE PURPOSE OF THIS BOOK
WHAT THIS BOOK IS NOT
A NOTE ON TERMINOLOGY
HOW TO USE THIS BOOK

01 IMPORTANCE AND EFFICACY: WHY OUR BRAINS LOVE INFOGRAPHICS
VARIED PERSPECTIVES ON INFORMATION DESIGN: A BRIEF HISTORY
OBJECTIVES OF VISUALIZATION
APPEAL
COMPREHENSION
RETENTION

02 INFOGRAPHIC FORMATS: CHOOSING THE RIGHT VEHICLE FOR YOUR MESSAGE
STATIC INFOGRAPHICS
MOTION GRAPHICS
INTERACTIVE INFOGRAPHICS

03 THE VISUAL STORYTELLING SPECTRUM: AN OBJECTIVE APPROACH
UNDERSTANDING THE VISUAL STORYTELLING SPECTRUM

04 EDITORIAL INFOGRAPHICS
WHAT ARE EDITORIAL INFOGRAPHICS?
ORIGINS OF EDITORIAL INFOGRAPHICS
EDITORIAL INFOGRAPHIC PRODUCTION

05 CONTENT DISTRIBUTION: SHARING YOUR STORY
POSTING ON YOUR SITE
DISTRIBUTING YOUR CONTENT
PATIENCE PAYS DIVIDENDS

06 BRAND-CENTRIC INFOGRAPHICS
ABOUT US PAGES
PRODUCT INSTRUCTIONS
VISUAL PRESS RELEASES
PRESENTATION DESIGN
ANNUAL REPORTS

07 DATA VISUALIZATION INTERFACES
A CASE FOR VISUALIZATION IN USER INTERFACES
DASHBOARDS
VISUAL DATA HUBS

08 WHAT MAKES A GOOD INFOGRAPHIC?
UTILITY
SOUNDNESS
BEAUTY

09 INFORMATION DESIGN BEST PRACTICES
ILLUSTRATION
DATA VISUALIZATION

10 THE FUTURE OF INFOGRAPHICS
DEMOCRATIZED ACCESS TO CREATION TOOLS
SOCIALLY GENERATIVE VISUALIZATIONS
PROBLEM SOLVING
BECOMING A VISUAL COMPANY

FURTHER INFOGRAPHIC GOODNESS
THANK YOU
INDEX

CHAPTER 01
IMPORTANCE AND EFFICACY: WHY OUR BRAINS LOVE INFOGRAPHICS

[Chapter-opener sidebar: ESSENTIAL READING CHAPTERS 01, 02, 03, 08, 09, 10; figs. 1.1-1.17]

In De Architectura, Roman architect and engineer Vitruvius states that there are three standards to which all structures should adhere: soundness, utility, and beauty. In their paper "On the Role of Design in Information Visualization," authors Andrew Vande Moere and Helen Purchase point out that these standards can and should also be applied to information design and the various applications that serve this purpose. They state that a good visualization should be sound; that is, the design's form should be suitable for the information it depicts. It should be useful, enabling the viewer to derive meaning from it. And of course, as with all design, it should have aesthetic appeal that attracts the viewer's attention and provides a pleasing visual experience. This framework provides a solid basis that anyone can use to judge the value of a visualization. However, we will use a slightly different categorization for the purpose of discussing the positive effects of infographics. We will refer to beauty as appeal, and divide utility into the areas of comprehension and retention, as these are the three basic provisions of all effective verbal or visual communication methods:

1. Appeal: Communication should engage a voluntary audience.
2. Comprehension: Communication should effectively provide knowledge that enables a clear understanding of the information.
3. Retention: Communication should impart memorable knowledge.

We will address the need for sound design on a more practical level in Chapter 9 (Information Design Best Practices), when we discuss principles for the practice of information design.

Images and graphics should always look appealing and encourage viewers to engage with the content. It is important that we examine why this is the case and identify the primary elements that lead to this appeal. This is certainly the first and potentially most challenging step in conveying a message: getting the recipient to commit to hearing what you have to say. People have long accepted the notion that a picture can replace a thousand words and, similarly, that a simple graph can replace a table full of numbers. Basic visualization allows us to immediately comprehend a message by detecting notable patterns, trends, and outliers in the data. This chapter will look at how visualization achieves this feat so easily while other forms of communication fall short. Further, we'll determine how we can make those visualizations more memorable. The democratization of media, especially online, has given us a great variety of options that we can use to consume our news, videos, and funny pictures, and generally educate ourselves on myriad topics. However, the downside to this exponential increase in stimuli is that we tend to lose much of this knowledge shortly after we gain it. While no one should lament forgetting a mediocre LOLcat, it pays to be memorable, especially in the business world. Fortunately, connections have recently been made between the illustrative elements of graphics and the retention rates of the information displayed, and these connections can help us all figure out how to get people to remember our material. This chapter will also discuss the fact that information design lends itself to achieving these objectives, and will seek to understand exactly how and why it does this, based on the way our brains process information. We will not be getting into too much heavy science; rather, our main goal is to understand which elements of design help us reach our specific communication goals, and to leave behind those that do not.
For this we will lean heavily on several key works that have covered the science


of visualization exhaustively, most notably Colin Ware's thoughtfully written Information Visualization: Perception for Design. Finally, we will identify several of the divergent schools of thought. We've concluded that the differences in these approaches are largely rooted in a failure to recognize that varied objectives necessitate varied approaches to the practice. That is, a design whose primary objective is to give the viewer information for analysis cannot be considered, designed, or judged in the same way as one whose primary goal is to be appealing and entertaining while informing. We will discuss these varied approaches to each unique objective and elaborate on the practice in the applications chapters (3, 4, 6, and 7). We will then discuss how we can use these different methods to serve our three basic communication provisions: appeal, comprehension, and retention.

VARIED PERSPECTIVES ON INFORMATION DESIGN: A BRIEF HISTORY


Many a heated debate over the proper approach to information design is raging online nowadays, which seems to raise the question: Why all the conflict in the friendly field of pretty-picture creation? The debate surrounds just that: the role of aesthetics and decoration in the design of infographics. To understand the underlying tension, a bit of background is necessary. Science and publishing have used information design and visualization as communication tools for centuries. However, study and development in the field have mostly been dominated by academics and scientists, who are concerned primarily with understanding the most effective way to process and present information to aid viewers' analyses. These efforts are driven by loads of research, with highly theoretical consideration; when practical, the focus is on using software to process and visualize data sets. For years, only a select few (an educated, knowledgeable, and skilled group of individuals) discussed and practiced visualization in this sense. Then the Internet caught on. Around 2007, interest in infographics (mostly editorial in nature) began to grow on the web, as people shared old infographics like Napoleon's march on Moscow (Figure 1.1) and newer creations such as those published by GOOD Magazine (Figure 1.2). Suddenly, a whole new group of experts was praising, sharing, and critiquing (mostly critiquing) any infographic they could find. Since then, an impressive number of new infographics have been created as various industries and areas identified different applications for their use. One of the most common was to use editorial infographics for commercial marketing purposes. This new breed of visual took a bit of a different path, both in format and content. The long, skinny graphic, designed to fit within a blog's width, became ubiquitous and almost instantly synonymous with the term "infographic." These pieces used illustration and decoration much more than their traditional counterparts. And as with most marketing efforts, their goal was to use their content and design to attract attention, interest, and adoration for the company that produced them, making each brand a thought leader in its industry. This was quite a divergence from the traditionally stated purpose of the field, which was purely to use visual representation to aid in the processing and comprehension of data. As you can imagine, the new field of infographic designers often lacked knowledge of best practices for information design. In other words, people were winging it. As with any field experiencing this kind of growth, the overall quality of designs varies drastically, which has attracted criticism (read: utter disdain) from the academic and scientific visualization community. The Internet had usurped infographics.


Figure 1.1 Flow map of Napoleon's Russian Campaign of 1812. Charles Minard.


Figure 1.2 You Should Vote Because. Open, NY for GOOD.


Figure 1.3 Approaches to infographic design.

EXPLORATIVE
Characteristics: minimalist; only includes elements that represent data; seeks to communicate information in the most clear, concise manner.
Applications: academic research, science, business intelligence, data analysis.

NARRATIVE
Characteristics: illustrative; design-focused; seeks to appeal to the viewer with engaging visuals; informs and entertains.
Applications: publications, blogs, content marketing, sales and marketing materials.


The debate over what should be considered an infographic continues to this day, as people seek concrete definitions in an area that's constantly becoming more nebulous. Among the best-known and most-quoted voices in this area is Yale University statistics professor Edward Tufte, who has written some of the most acclaimed and comprehensive works on the topic of information design. Tufte has contributed much to its popular terminology by coining terms such as "chartjunk" (unnecessary graphic elements that do not communicate information) and developing the data-ink ratio, a measurement of the amount of information communicated in a graphic as it relates to the total number of visual elements in it. Tufte's thoughts on the topic represent the conservative end of the spectrum of approaches to infographic design (Figure 1.3), which is typical of those with an academic or scientific background. He argues that any graphic elements of a design that do not communicate specific information are superfluous and should be omitted. He believes that chartjunk such as unnecessary lines, labels, or decorative elements only distracts the viewer and distorts the data, thus detracting from the graphic's integrity and decreasing its value (Figure 1.4). Although Tufte does concede that decorative elements can help editorialize a topic in some instances, his teachings typically discourage their use.
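The data-ink ratio described above can be stated as a simple fraction. This is a minimal sketch with hypothetical ink counts, not Tufte's own notation:

```python
def data_ink_ratio(data_ink, total_ink):
    # Proportion of a graphic's ink devoted to displaying data;
    # a ratio of 1.0 means every mark carries information (no chartjunk).
    return data_ink / total_ink

# Hypothetical chart: 40 units of ink encode data, 60 are decoration.
print(data_ink_ratio(40, 100))  # 0.4
```

In Tufte's framing, redesigning a chart to erase non-data ink pushes this ratio toward 1.0.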

Figure 1.4 Example of explorative graphic approach using minimalist design.


The work and writing of British graphic designer Nigel Holmes characterize the opposite end of the spectrum, which supports the heavy use of illustration and decoration to embellish information design (Figure 1.5). Holmes is best known for his editorial explanation graphics in Time from 1978 to 1994. His work supports the notion that using illustration and visual metaphor to support and reinforce the topic makes a graphic appealing to viewers. Recent studies show that these decorative elements can also aid retention of the information presented, which we will examine later in the chapter.

So which is the correct approach? Both are. What people often overlook in these debates is the issue most central to any design: the objective. While Tufte and Holmes might want to represent the exact same data set, they would likely be doing it for very different reasons. Tufte would aim to show the information in the most neutral way possible, to encourage his audience to analyze it without bias. Conversely, Holmes's job is to editorialize the message in order to appeal to the viewer while communicating the value judgment he wants readers to take away. Tufte's communication is explorative; that is, it encourages the viewer to explore and extract his or her own insights. Holmes's, on the other hand, is narrative, and prescribes the intended conclusion to the viewer. The difference is inherent in their areas of work, as the objectives of science and research are much different from those of the publishing world. There's no need to establish a universal approach to govern all objectives; rather, different individuals and industries should develop best practices unique to each application's specific goal.


Figure 1.5 The Tipsy Turvy Republic of Alcohol. Nigel Holmes.



OBJECTIVES OF VISUALIZATION
Of course, we must first look at what each infographic is trying to achieve before we can establish the best practices for its application. By definition, all information graphics are aimed at communicating information. What varies is the purpose for doing so, and understanding this purpose is what determines a graphic's priorities. These priorities account for a necessary difference in approach to each design. For example, if an infographic is intended to communicate information in the most clear and unbiased manner possible, then the first priority for the designer is comprehension, then retention, followed by appeal (Figure 1.6). This is common in academic, scientific, and business intelligence applications, as these areas typically lack any agenda aside from conveying knowledge and having viewers comprehend it. Appeal is less necessary in this setting, as the viewer most typically needs the information and seeks it out as a result. Appeal is only useful when it keeps the viewer's attention and so enables further comprehension. Such a graphic typically would be used as a resource for information, which is why retention is also a secondary priority. If the viewer needs the information and it is a readily accessible resource, then he or she can revisit it as needed. There's no need for it to take up any more valuable brain space than necessary.

Figure 1.6 Infographic priorities by application. (Priority triangle with axes Appeal, Comprehension, and Retention; key: Academic/Scientific, Marketing, Editorial.)


However, a graphic created with a commercial interest in mind will have much different priorities. Brands primarily seek to get viewers' attention and eventually (hopefully) convert those viewers into paying customers. As evidenced by Super Bowl commercials, companies will go to almost any length to get this attention. The order of priorities for a commercial marketing graphic would be appeal, retention, and then comprehension. Brands are looking to catch viewers' attention and make a lasting impression, which usually means that viewers' comprehension of content is frequently the brand's last priority. The exception would be infographics that are more focused on the description of a product or service, such as a visual press release, since designers in these cases would want the viewer to clearly understand the material as it relates to the company's value proposition. However, being appealing enough that prospective customers will listen is always goal number one.

Publishers that create editorial infographics have a slightly different mix: appeal, comprehension, and retention. Since the appeal of a magazine's content is what will make it fly off the newsstand, it shares this top priority, improving sales, with companies in other industries. A publisher's survival is based solely upon its ability to spark readers' interest. The quality of content or graphics produced on a consistent basis helps drive this interest by making a strong impression on readers, and this is where comprehension comes into play. A publication's quality is based on the content it produces, which is intended to help readers understand a given topic. However, whether or not the reader can recall that topic with the same level of understanding one week later is of little importance to a publisher's bottom line. The common denominator between commercial and editorial interests is that both desire to compel the consumer to take a specific action.


APPEAL
In 2010, Google CEO Eric Schmidt famously stated that we now create more information in two days than we created from the dawn of man up until 2003. This staggering statistic obviously necessitates clarification of what constitutes information and its creation. Regardless, the message is clear and uncontested: humanity is creating and consuming far more information than ever before. As a result, it is increasingly difficult to get people's attention, since they're constantly bombarded with various stimuli throughout the day: material that ranges from breaking news to funny photos to Facebook updates. Marketers, salespeople, brand evangelists, and publishers must all figure out how to grab a slice of this attention, a task that is becoming more challenging by the day. How do you get people's attention, and keep it long enough to share your message with them? Due to the sheer volume of stuff out there, it's a formidable task to make yours stand out. How do you appeal to an audience in a world of information overload, in which people constantly face new inputs, options, and decisions? Ask the world's biggest company, Apple. With a cash reserve larger than the total valuation of all but fifty companies worldwide (as of early 2012), this organization surely must have some insight into what people like. In the battle for MP3 player dominance, the iPod came in early and overshadowed the competition. What was, and still is, the key differentiator between it and other products? The simple answer is design. While features such as OS compatibility, memory, and screen size certainly factor into the decision, the most outstanding difference between the iPod and its competitors is its impressive design. As Steve Jobs preached, good design not only garners additional appeal for an item; it can actually incite an emotional reaction. Few can deny the good feeling of pulling a new Apple product out of the box.

So how does this translate to best practices for information design? Our consumer culture is becoming increasingly design focused in areas that extend beyond graphics and consumer electronics, and that play a role in many other industries. Home products company IKEA, for example, has made clever furniture design mainstream. British mega-brand Virgin brought sexiness to the airline industry, with an interior design that looks more like a chic lounge than a mode of mass transit. Regardless of whether they can articulate it, or even know it, consumers connect with these brands because of design that continues to attract new fans and followers. The ever-growing media landscape makes it increasingly important to use great design to differentiate your brand from the crowd. Even if your goal is to present information for a purely analytical objective, that is, without any desired action from the reader, it is still beneficial to have aesthetic appeal.

DESIGN IS TO DATA AS CHEESE SAUCE IS TO BROCCOLI.

(That analogy is on the SAT, if you don't remember.) In other words, people need an added incentive to eat their vegetables, especially when those vegetables are as cold and dry as research studies and analytics reports. Presenting information by way of engaging visuals immediately attracts readers and entices them to dig deeper into the content. Possessing this appeal to your audience is not a "nice to have" for businesses; it is a "must have." You can't sell magazines if no one picks them up, and you can't sell products if you can't get potential customers' attention.


The modern marketer can learn a lot from Horace's quote in the introduction, and the notion that delighting people with your content is a must. Delight has become a necessity for building trust with your audience and capturing their attention often. We will discuss how to do this further in Chapter 3 (The Visual Storytelling Spectrum) and Chapter 4 (Editorial Infographics), in the sections pertaining to editorial infographics. For now, it is important to focus on the first step: how to get their attention in the first place.

Just what appeals to us when we become interested in consuming information? We are drawn to formats that we see as efficient, engaging, and entertaining (Figure 1.7). It's highly unlikely that someone would prefer to read a lengthy article rather than view a multimedia display presenting the same information. A diversity of media keeps our brains engaged in the material, and visualization enables us to digest it more efficiently and facilitates understanding.

Figure 1.7 Source: Reprinted with permission of THE ONION. Copyright © 2012, by ONION, INC. www.theonion.com.


Further, a recent study from the University of Saskatchewan suggests that viewers prefer a greater use of illustration in visual representations. When presented with both a simple chart and one that contained an illustration by the aforementioned Nigel Holmes (Figure 1.8), participants consistently opted for the Holmes version in a number of different areas (Figure 1.9). While this conclusion, that a more dynamic and stimulating visual is preferable to a plain one, seems somewhat obvious, it's important to consider in design approach. It's not enough to make your content visual; you must also make it visually interesting. You can do this by using representative iconography, illustrative metaphor, or relevant decorative framing mechanisms, all powerful tools for communicating your message. However, you always want to remember your objective. The appropriateness of decorative and illustrative elements will vary based on an infographic's application and use. For example, an editorial graphic in the Sunday newspaper on the topic of corporate profits could find great use in the illustration of a rotund executive sitting atop a throne of gold bullion. Shareholders, on the other hand, might not share the same appreciation for such a work of art if it adorned the pages of an annual report containing similar data. If used incorrectly, decorative elements have the potential to distract the viewer from the actual information, which detracts from the graphic's total value. Mastering this execution and finding the balance between appeal and clarity can be a nuanced process. We will discuss the proper use of illustration and decorative elements further in Chapter 9 (Information Design Best Practices), where we'll cover the principles and best practices of information design.

Figure 1.8 Illustrative Nigel Holmes graphic ("Monstrous Costs": total House and Senate campaign expenditures, in millions, 1972-82 est.) with simplified equivalent.

Figure 1.9 University of Saskatchewan study results. (Bar chart comparing response counts for the Holmes and plain chart versions across categories including most attractive, most enjoyed, easiest to remember, easiest to describe, and most accurate.)


COMPREHENSION
You often hear someone claim to be a visual learner, which simply means that they need to see something in order to understand it. Researchers have studied and modeled learning styles in a number of different ways over the past several decades, and the origins of this specific visual style of thinking can be traced to Neil Flemings VAK model. One of the most commonly known and quoted models of thinking, it states that when comprehending information, people learn best with one of three types of stimuli:

VISUAL / AUDITORY / KINESTHETIC OR TACTILE

Visual learners best comprehend information that is presented in pictures, diagrams, charts, and the like; auditory learners do best when hearing this information spoken; and tactile learners need to touch and learn by doing. While this theory is commonly accepted, it has been highly scrutinized in the scientific community, which posits that there is little to no evidence that any one preferred method of learning is actually more beneficial for comprehending and retaining information. Regardless of this ongoing debate, it is important to consider the media structure and channels through which people obtain information. It is less important to identify how people prefer to learn than to figure out how they are actually learning, and these experiences are occurring increasingly online today, a channel based primarily on visual display. The use of audio-only content on the web is relatively minimal outside of music sites, and until virtual reality is able to provide interactive, tactile experiences, the majority of information on the Internet will be communicated visually. Given that people are more likely to consume information visually, the value of using visuals in our communication, instead of just words, is truly significant. As Colin Ware states in Information Visualization: Perception for Design, "The human visual system is a pattern seeker of enormous power and subtlety. The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated, which is why the words 'understanding' and 'seeing' are synonymous" (p. xxi).


Ware goes on to state that we are able to acquire more information through our visual system than we do through all our other senses combined (p. 2). This is largely because visualizations contain certain characteristics called preattentive attributes, which our eyes perceive very quickly (within 250 milliseconds) and our brains process with impressive accuracy, without any active attention on our part. Force-feeding for the mind: how convenient! To use a common illustration of this concept, refer to Figure 1.10. Try to count the number of 7s in the number set. How long did that take? Now, try the same exercise with Figure 1.11. A color change makes recognition almost instant, since color is one of several preattentive attributes, displayed in Figure 1.12. All visualizations contain such attributes, and using them properly to convey information is the key to visual communication. Our brains are able to recognize and process many of these visual cues simultaneously through a process called preattentive processing. All this action precedes any cognitive attempt to focus on a specific area; rather, it is purely involuntary and simply proceeds wherever our eyes are pointed. These natural functions that result from the connection between the eyes and brain can be quite handy when we want to communicate with people who don't have a lot of time, or a long attention span. We know that we can use these visuals to attract people by appealing to them aesthetically, but we can also decrease the amount of time it takes them to comprehend the message by using these same tools.

That said, you can't tell a story through color alone, or craft compelling messaging using only shapes and symbols. So how do words factor into information design? Within the context of a society that speaks the same language, words, as compared to symbols, have a distinct advantage in terms of familiarity. No set of symbols is universally understood; rather, most are isolated to specific social or cultural settings. This necessitates a cost-versus-benefit analysis of using visualization instead of verbal communication. Symbols can take longer to interpret than language when conveying a concept to someone who is unfamiliar with them; in this case, communication should favor text descriptions. To someone who knows the symbols, however, comprehension is far easier; in this case, communication should rely more upon visualization methods. Ware provides a sound breakdown of the general value of each medium by explaining that images are better for spatial structures, location, and detail, whereas words are better for representing procedural information, logical conditions, and abstract verbal concepts (p. 304). The practical reality is that we don't need to choose between the two. The strongest visualizations are those that are supported by descriptions as well as narratives, especially in editorial applications. Using words in this way helps bring both personality and clarity to an infographic.


INFOGRAPHICS

Figure 1.10 Preattentive Processing Test 1.


Figure 1.11 Preattentive Processing Test 2.


Figure 1.12 Preattentive attributes. Form: orientation, size, shape, enclosure, line length, line width, curvature, added marks. Color: hue, intensity. Spatial position: 2-D position.

RETENTION
The third main benefit of using infographics in communication is their ability to help people retain information, as graphics extend the reach of our memory systems. Visualizations do this by instantly and constantly drawing upon nonvisual information that's stored in our long-term memory (Ware, p. 352). The human brain can recall familiar symbols, scenes, and patterns, allowing us to make rapid connections to already stored information and to quickly comprehend what we're seeing. This prompts the question: Which visualization methods best serve recall for the various types of memory? There are three main types of memory that relate to viewing images. Iconic memory is the snapshot of a scene that you retain for a brief instant after looking at something. It is stored for less than a second, unless it is analyzed and connected to something already stored in your brain (Sperling via Ware, p. 352). Long-term memory stores information from our experiences that we will retain for long periods of time, and upon which we draw in order to process new information. Long-term memory is further divided into three areas: episodic memory, semantic memory, and procedural memory. Episodic memory is the primary device for recalling images and scenes that we've experienced, and the feelings associated with those experiences. Semantic memory enables us to recall knowledge that has no specific context or experience associated with it, and could generally be considered the storage of common knowledge. Procedural memories are those that recall processes of doing, such as typing or tying a tie, which we access involuntarily without conscious thought. These memories often build on themselves, which is why you are able to recall that the "M" arm position comes after the "Y" when the Village People are played at a wedding reception.

Visual working memory lies in between iconic and long-term memory, and is most essential to processing visual information. When we see an object that requires further attention, we move it from iconic to visual working memory. Visual working memory then calls upon semantic memory (long-term, nonvisual) to understand its meaning. All this happens in about 100 milliseconds (Ware, p. 353). With our vision transmitting massive amounts of information into the brain, and the brain accessing its stored knowledge to provide context, we are able to understand much more quickly than with any other combination of sensory perception and processing. So what visual elements should be used to best ensure that individuals store this understanding for long-term recall? While academics have typically argued against using decorative elements in information design, claiming that they only serve to distract the viewer, this isn't always the case. A very interesting finding from a University of Saskatchewan study conducted by Scott Bateman and his colleagues in the Department of Computer Science uncovered that a more illustrative approach to design actually benefits information recall significantly. All participants were shown a set of alternating graphics, some plain and some in Holmes's illustrative style, such as that depicted in Figure 1.8. The researchers split the participants into two groups: half were part of an immediate recall group, and the other half were in the long-term recall group. After seeing all the graphics, the immediate recall group played a five-minute game to clear their visual and linguistic memory. They were then questioned regarding the information in each graphic. The long-term recall group was scheduled to come back for their recall session two to three weeks after the initial observation. Each participant had to answer questions about the graphic's subject, the categories displayed within it, and the general trend of the chart. They also had to describe whether there was a value judgment presented in the chart; that is, a perceived


opinion that the graphic's creator had presented. The immediate recall group showed no significant differences between Holmes's graphics and their plain counterparts in terms of how well they'd retained information about the subject, categories, or trends (Figure 1.13). Yet there was a significant difference in their identification of whether a value judgment had been presented. The long-term recall group, however, showed notable differences in their ability to recall information in all areas (Figure 1.14). The subjects, categories, trends, and value messages within Holmes's graphics stuck with viewers more prominently after two to three weeks.

Bateman et al. offer three possible explanations for the findings in this experiment:

1. The additional imagery enabled people to encode information more deeply, as there were more visual items to recall and draw upon from memory.

2. The variety of Holmes's style gave it a unique advantage in memorability over the plain graphics, which all had a similar look.

3. User preference (as described earlier in the Appeal section) provided a hidden factor: the participants' emotional responses to the graphics, combined with the imagery used, helped to solidify the images in their memories.

Figure 1.13 Results of immediate recall group: sum of scores for subject, categories, trend, and value message (Holmes vs. plain graphics).

Figure 1.14 Results of long-term recall group: sum of scores for subject, categories, trend, and value message (Holmes vs. plain graphics).


So what does all of this tell us about using infographics, particularly for commercial objectives? Graphics that contain visual embellishment beyond the information being displayed may be superior not only in terms of appeal, but also in their ability to ensure that viewers understand and retain your message, which is likely value-based. Appealing to someone not only aesthetically but also emotionally prompts a deeper connection with the information, which makes them more likely to remember it. While design style varies greatly and often cannot be categorized neatly, there are certain devices that we can use to facilitate understanding and retention. We refer to these collectively as illustrative design:

1. Visual Metaphor. We use this often at Column Five, and it works incredibly well when implemented effectively. You can do this by containing information within a framing mechanism that is indicative of your subject matter (Figure 1.15).

2. Symbols and Iconography. The success these achieve depends largely on cultural context. Your audience must universally understand your icons and symbols for them to be effective. When this is the case, they can provide a great communication shortcut by using visual elements in place of verbal explanation (Figure 1.16).

3. Decorative Framing. Using design elements that appeal to your target audience lets them connect with infographics on an emotional level, thereby deepening their interest in and retention of the information (Figure 1.17).

Illustrative design can also have negative effects, so it is important to determine when it might detract from rather than support your message. The main pitfall here is the designer's accidental or intentional distortion of the display of data. Illustrations should complement visualization elements, but never at the expense of misleading the viewer. Whether intentional or not, you always want to avoid altering accurate information representation.


Figure 1.15 Example of visual metaphor. Column Five for GOOD.


Figure 1.16 Example of use of symbols and iconography. Column Five for Microsoft.


Figure 1.17 Example of decorative framing. Column Five for GOOD.

Contents

Introduction ................................................................................................................. xi

Understanding Data

What Data Represents ..................................................................................................2 Variability ...................................................................................................................20 Uncertainty .................................................................................................................30 Context .......................................................................................................................35 Wrapping Up ...............................................................................................................41

Visualization: The Medium 43

Analysis and Exploration.............................................................................................45 Information Graphics and Presentation ......................................................................58 Entertainment.............................................................................................................69 Data Art ......................................................................................................................74 The Everyday ...............................................................................................................81 Wrapping Up ...............................................................................................................89

Representing Data 91

Visualization Components ..........................................................................................93 Putting It Together ....................................................................................................115 Wrapping Up .............................................................................................................132

Exploring Data Visually 135

Process .....................................................................................................................136 Visualizing Categorical Data ....................................................................................143 Visualizing Time Series Data ....................................................................................154 Visualizing Spatial Data ...........................................................................................165

Multiple Variables.....................................................................................................176 Distributions .............................................................................................................193 Wrapping Up .............................................................................................................199

Visualizing with Clarity 201

Visual Hierarchy........................................................................................................202 Readability ...............................................................................................................205 Highlighting..............................................................................................................221 Annotation ................................................................................................................228 Do the Math ..............................................................................................................236 Wrapping Up .............................................................................................................239

Designing for an Audience 241

Common Misconceptions ..........................................................................................242 Present Data to People .............................................................................................254 Things to Consider ....................................................................................................258 Putting It Together ....................................................................................................268 Wrapping Up .............................................................................................................273

Where to Go from Here 277

Visualization Tools ....................................................................................................278 Programming ............................................................................................................283 Illustration ................................................................................................................288 Statistics ..................................................................................................................289 Wrapping Up ............................................................................................................ 289 Index .........................................................................................................................291


Understanding Data

2 | CHAPTER 1: Understanding Data

When you ask people what data is, most reply with a vague description of something that resembles a spreadsheet or a bucket of numbers. The more technically savvy might mention databases or warehouses. However, this is just the format that the data comes in and how it is stored, and it doesn't say anything about what data is or what any particular dataset represents. It's an easy trap to fall into, because when you ask for data, you usually get a computer file, and it's hard to think of computer output as anything but just that. Look beyond the file, though, and you get something more meaningful.

WHAT DATA REPRESENTS


Data is more than numbers, and to visualize it, you must know what it represents. Data represents real life. It's a snapshot of the world in the same way that a photograph captures a small moment in time. Look at Figure 1-1. If you were to come across this photo, isolated from everything else, and I told you nothing about it, you wouldn't get much out of it. It's just another wedding photo. For me though, it's a happy moment during one of the best days of my life. That's my wife on the left, all dolled up, and me on the right, wearing something other than jeans and a T-shirt for a change. The

FIGURE 1-1 A single photo, a single data point


pastor who is marrying us is my wife's uncle, who added a personal touch to the ceremony, and the guy in the back is a family friend who took it upon himself to record as much as possible, even though we hired a photographer. The flowers and archway came from a local florist about an hour away from the venue, and the wedding took place during early summer in Los Angeles, California. That's a lot of information from just one picture, and it works the same with data. (For some, me included, pictures are data, too.) A single data point can have a who, what, when, where, and why attached to it, so it's easy for a digit to become more than a toss in a bucket. Extracting information from a data point isn't as easy as looking at a photo, though. You can guess what's going on in the photo, but when you make assumptions about data, such as how accurate it is or how it relates to its surroundings, you can end up with a skewed view of what your data actually represents. You need to look at everything around it, find context, and see what your dataset looks like as a whole. When you see the full picture, it's much easier to make better judgments about individual points. Imagine that I didn't tell you those things about my wedding photo. How could you find out more? What if you could see pictures that were taken before and after?

FIGURE 1-2 Grid of photos


Now you have more than just a moment in time. You have several moments, and together they represent the part of the wedding when my wife first walked out, the vows, and the tea-drinking ceremony with the parents and my grandma, which is customary for Chinese weddings. Like the first photo, each of these has its own story, such as my father-in-law welling up as he gave away his daughter, or how happy I felt when I walked down the aisle with my bride. Many of the photos captured moments that I didn't see from my point of view during the wedding, so I almost feel like an outsider looking in, which is probably how you feel. But the more I tell you about that day, the less obscure each point becomes. Still, these are snapshots, and you don't know what happened in between each photo. (Although you could guess.) For the complete story, you'd either need to be there or watch a video. Even then, you'd still see the ceremony only from a certain number of angles, because it's often not feasible to record every single thing. For example, there were about five minutes of confusion during the ceremony when we tried to light a candle but the wind kept blowing it out. We eventually ran out of matches, and the wedding planner went on a scramble to find something, but luckily one of our guests was a smoker, so he busted out his lighter. This set of photos doesn't capture that, though, because again, it's an abstraction of the real thing. This is where sampling comes in. It's often not possible to count or record everything because of cost or lack of manpower (or both), so you take bits and pieces, and then you look for patterns and connections to make an educated guess about what your data represents. The data is a simplification, an abstraction, of the real world. So when you visualize data, you visualize an abstraction of the world, or at least some tiny facet of it.
Visualization is an abstraction of data, so in the end you get an abstraction of an abstraction, which creates an interesting challenge. However, this is not to say that visualization obscures your view; far from it. Visualization can help detach your focus from the individual data points so that you can explore them from a different angle: to see the forest for the trees, so to speak. To keep running with this wedding photo example, Figure 1-3 uses the full wedding dataset, of which Figure 1-1 and Figure 1-2 were subsets. Each rectangle represents a photo from our wedding album; the rectangles are colored by the most common shade in each photo and organized by time.
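The construction behind a chart like this (reduce each photo to its most common color, then order the results by time) can be sketched in a few lines. The code below is a hypothetical illustration with made-up pixel data, not Yau's actual process; a real version would read the pixels from image files.

```python
# Sketch: summarize each "photo" by its dominant color, ordered by time.
from collections import Counter

def dominant_color(pixels):
    """Return the most common (r, g, b) tuple in a list of pixels."""
    return Counter(pixels).most_common(1)[0][0]

# Stand-in photos: a timestamp plus a list of RGB pixels (invented data).
photos = [
    {"time": "14:02", "pixels": [(255, 255, 255)] * 5 + [(0, 0, 0)] * 3},  # mostly white
    {"time": "14:05", "pixels": [(0, 0, 0)] * 6 + [(34, 139, 34)] * 2},    # mostly black
]

# One colored rectangle per photo, laid out along the time axis.
timeline = sorted(
    ((p["time"], dominant_color(p["pixels"])) for p in photos),
    key=lambda item: item[0],
)
for time, color in timeline:
    print(time, color)
```

Swapping `most_common(1)` for a coarser bucketing of similar shades would make the summary less sensitive to pixel-level noise.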


FIGURE 1-3 Colors in the wedding


With a time series layout, you can see the high points of the wedding, when our photographers snapped more shots, and the lulls, when only a few photos were taken. The peaks in the chart, of course, occur when there is something to take pictures of, such as when I first saw my wife in her dress or when the ceremony began. After the ceremony, we took the usual group photos with friends and family, so there was another spike at that point. Then there was food, and activity died down, especially when the photographers took a break a little before 4 o'clock. Things picked up again with typical wedding fanfare, and the day came to an end around 7 in the evening. My wife and I rode off into the sunset. In the grid layout, you might not see this pattern because of the linear presentation. Everything seems to happen with equal spacing, when actually most pictures were taken during the exciting parts. You also get a sense of the colors in the wedding at a glance: black for the suits, white for the wedding dress, coral for the flowers and bridesmaids, and green for the trees surrounding the outdoor wedding and reception. Do you get the detail that you would from the actual photos? No. But sometimes that level isn't necessary at first. Sometimes you need to see the overall patterns before you zoom in on the details. Sometimes you don't know that a single data point is worth a look until you see everything else and how it relates to the population. You don't need to stop here, though. Zoom out another level to focus only on the picture-taking volumes, and disregard the colors and individual photos, as shown in Figure 1-4. You've probably seen this layout before. It's a bar chart that shows the same highs and lows as Figure 1-3, but it has a different feel and provides a different message. The simple bar chart emphasizes picture-taking volumes over time via 15-minute windows, whereas Figure 1-3 still carries some of the photo album's sentiment.
The main thing to note is that all four of these views show the same data, or rather, they all represent my wedding day. Each graphic just represents the day differently, focusing on various facets of the wedding. Interpretation of the data changes based on the visual form it takes. With traditional data, you typically examine and explore from the bar chart side of the spectrum, but that doesn't mean you have to lose the sentiment of the individual data point, that single photo. Sometimes that means adding meaningful annotation that enables readers to interpret the data better; other times the message in the numbers is clear, gleaned from the visualization itself.
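The 15-minute binning behind the bar chart view is simple to express in code: map each photo's timestamp to the window it falls in, then count per window. The timestamps below are invented stand-ins, not the actual wedding data.

```python
# Sketch: bin photo timestamps into 15-minute windows and count per window.
from collections import Counter
from datetime import datetime

timestamps = ["13:58", "14:02", "14:07", "14:31", "16:45"]  # invented examples

def window(ts):
    """Map an HH:MM timestamp to the start of its 15-minute window."""
    t = datetime.strptime(ts, "%H:%M")
    return f"{t.hour:02d}:{(t.minute // 15) * 15:02d}"

counts = Counter(window(ts) for ts in timestamps)

# A crude text "bar chart" of photos per window.
for w in sorted(counts):
    print(w, "#" * counts[w])
```

Widening or narrowing the window size trades detail for smoothness, which is exactly the kind of choice that gives each view of the same data a different feel.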


FIGURE 1-4 Photos over time


The connection between data and what it represents is key to visualization that means something. It is key to thoughtful data analysis. It is key to a deeper understanding of your data. Computers do the bulk of the work to turn numbers into shapes and colors, but you must make the connection between data and real life, so that you, or the people you make graphics for, extract something of value. This connection is sometimes hard to see when you look at data on a large scale for thousands of strangers, but it's more obvious when you look at data for an individual. You can almost relate to that person, even if you've never met him or her. For example, Portland-based developer Aaron Parecki used his phone to collect 2.5 million GPS points over 3 years between 2008 and 2012, about one point every 2 to 6 seconds. Figure 1-5 is a map of these points, colored by year. As you'd expect, the map shows a grid of roads, with the areas Parecki frequented colored more brightly than others. His housing changed a few times, and you can see his travel patterns change over the years. Between 2008 and 2010, shown in blue, travel appears more dispersed; by 2012, in yellow, Parecki seems to stay in a couple of tighter pockets. Without more context it is hard to say anything more, because all you see is location, but to Parecki the data is more personal (like the single wedding photo is to me). It's the footprint of more than 3 years in a city, and because he has access to the raw logs, which have time attached to them, he could also make better decisions based on data, like when he should leave for work. What if there were more information attached to personal time and location data, though? What if, along with where you were, you also took notes during or after about what was going on at a given time? This is what artist Tim Clark did between 2010 and 2011 for his project Atlas of the Habitual.
Like Parecki, Clark recorded his location for 200 days with a GPS-enabled device, spanning approximately 2,000 miles in Bennington, Vermont. Clark then looked back on his location data, labeled specific trips and the people he spent time with, and broke it down by time of year. As shown in Figure 1-6, the atlas, with clickable categorizations and time frames, shows a 200-day footprint that reads like a personal journal. Select "Running errands" and the note reads, "Doing the everyday things from running to the grocery store all the way to driving 30 miles to the only bike shop in southern Vermont opened on Sundays." The traces stay around town, with the exception of two long ones that venture out. There is one entry titled "Reliving the breakup," in which Clark writes, "A long-term girlfriend and I broke up immediately before I moved. These are the times that I had a real difficult time coming to terms that I had to move on." Two small paths, one within the city limits and one outside, appear, and the data suddenly feels incredibly personal.


FIGURE 1-5 GPS traces collected by Aaron Parecki, http://aaronparecki.com

This is perhaps the appeal behind the Quantified Self movement, which aims to use technology to collect data about one's own activity and habits. Some people track their weight, what they eat and drink, and when they go to bed; their goal is usually to live healthier and longer. Others track a wider variety of metrics purely as a way to look in on themselves beyond what they see in the mirror; personal data collection becomes something like a journal for self-reflection at the end of the day.


FIGURE 1-6 Selected maps from Atlas of the Habitual by Tim Clark, http://www.tlclark.com/atlasofthehabitual/


Nicholas Felton is one of the more well-known people in this area for his annual reports on himself, which highlight both his design skills and his disciplined personal data collection. He keeps track of not just his location, but also who he spends time with, restaurants he eats at, movies he watches, books he reads, and an array of other things that he reveals each year. Figure 1-7 is a page out of Felton's 2010/2011 report. Felton designed his first annual report in 2005 and has done one every year since. Individually, they are beautiful to look at and hold, and they satisfy an odd craving for looking in on a stranger's life. What I find most interesting, though, is the evolution of his reports into something personal and the expanding richness of the data. Looking at his first report, as shown in Figure 1-8, you notice that it feels a lot like a design exercise: there are touches of Felton's personality embedded, but it is for the most part strictly about the numbers. Each year, though, the data feels less like a report and more like a diary. This is most obvious in the 2010 Annual Report. Felton's father passed away at the age of 81. Instead of summarizing his own year, Felton designed an annual report, as shown in Figure 1-9, that cataloged his father's life, based on calendars, slides, postcards, and other personal items. Again, although the person of focus might be a stranger, it's easy to find sentiment in the numbers. When you see work like this, it's easy to understand the value of personal data to an individual, and maybe, just maybe, it's not so crazy to collect tidbits about yourself. The data might not be useful to you right away, but it could be a decade from now, in the same way it's useful to stumble upon an old diary from when you were just a young one. There's value in remembering. In many ways you log bits of your life already if you use social sites like Twitter, Facebook, and foursquare.
A status update or a tweet is like a mini-snapshot of what you're doing at any given moment; a shared photo with a timestamp can mean a lot decades from now; and a check-in firmly places your digital bits in the physical world. You've seen how that data can be valuable to an individual. What if you look at the data from many individuals in aggregate? The United States Census Bureau collects the official counts of people living in the country every 10 years. The data is a valuable resource that helps officials allocate funds, and from census to census, the fluctuations in population help you see how people move around the country, changing the neighborhood

FIGURE 1-7 (following page) A page from 2010/2011 Annual Report by Nicholas Felton, http://feltron.com


composition, and how areas grow and shrink. In short, the data paints a picture of who lives in America. However, the data, collected and maintained by the government, can show only so much about individuals, and it's hard to grasp who the people actually are. What are their likes and dislikes? What kind of personality do they have? Are there major differences between neighboring cities and towns? Media artist Roger Luke DuBois took a different kind of census, via 19 million online dating profiles, in A More Perfect Union. When you join an online dating

FIGURE 1-8 Selected pages from 2005 Annual Report by Nicholas Felton, http://feltron.com


site, you first describe yourself: who you are, where you're from, and what you're interested in. After you uncomfortably fill out that information, and perhaps choose not to share a thing or two, you describe what your ideal mate is like. In the words of DuBois, in the latter, you tell the complete truth, and in the former, you lie. So when you aggregate people's online dating profiles, you get some combination of how people see themselves and how they want to be seen. In A More Perfect Union, DuBois categorized online dating profiles, digital encapsulations of hopes and dreams, by postal code, and then looked for the word that was most unique to each area. Using a tracing of a Rand McNally map, DuBois replaced each city name with the city's unique word and painted a different picture of the United States: a more recognizable and personal one. In Figure 1-10, around southern California, where they make the talkies, words such as "acting," "writer," and "entertainment" appear; in Washington, DC, shown in Figure 1-11, words like "bureaucrat," "partisan," and "democratic" appear. These mostly pertain to professions, but in some areas the words describe personal attributes, favorite things, and major events. In Louisiana, shown in Figure 1-12, "Cajun" and "curvy" pop out at you, as do "crawfish," "bourbon," and "gumbo," but in New Orleans the most unique word is "flood," a reflection of the effects of Hurricane Katrina in 2005. People are defined by common demographic data such as race, age, and gender, but they also identify themselves with what they like to do in their spare time, what has happened to them, and who they hang around with. The great thing about A More Perfect Union is that you can see that in the data on a countrywide scale. The same sentiment, where data points are recollections and reports are portraits and diaries, is seen in Felton's reports, Clark's atlas, and Parecki's GPS traces. Statisticians and developers call this analysis. Artists and designers call this storytelling.
For extracting information from data, though (to understand what's in the numbers), analysis and storytelling are one and the same. Just like what it represents, data can be complex, with variability and uncertainty, but consider it all in the right context, and it starts to make sense.
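DuBois's "most unique word" per area can be approximated with nothing more than word counts: score each word by how much more common it is in one area's profiles than in the corpus as a whole. A minimal sketch of that idea, where the word lists are invented stand-ins rather than DuBois's actual data:

```python
from collections import Counter

def most_unique_word(region_words, all_words, min_count=2):
    """Word whose share of a region's profiles most exceeds its share
    of the whole corpus -- a crude 'uniqueness' score."""
    region, overall = Counter(region_words), Counter(all_words)
    n_region, n_all = sum(region.values()), sum(overall.values())
    best_word, best_score = None, 0.0
    for word, count in region.items():
        if count < min_count:       # skip one-off words
            continue
        score = (count / n_region) / (overall[word] / n_all)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Invented profile words, standing in for real dating-profile text.
socal = ["acting"] * 5 + ["writer"] * 4 + ["music"] * 3 + ["beach"] * 2
dc = ["partisan"] * 4 + ["policy"] * 3 + ["music"] * 3 + ["beach"] * 2
corpus = socal + dc

print(most_unique_word(socal, corpus))  # acting
print(most_unique_word(dc, corpus))     # partisan
```

Real work along these lines usually uses a weighted score such as tf-idf, but the ratio above captures the same intuition: words that everyone uses (music, beach) score low, and words concentrated in one place score high.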

FIGURE 1-9 Selected pages from 2010 Annual Report by Nicholas Felton, http://feltron.com

CHAPTER 1: Understanding Data

FIGURE 1-10 California map from A More Perfect Union (2011) by R. Luke DuBois, courtesy of the artist and bitforms gallery, New York City, http://perfect.lukedubois.com

FIGURE 1-11 Washington, DC map from A More Perfect Union (2011)


FIGURE 1-12 Louisiana map from A More Perfect Union (2011)

VARIABILITY
In a small town in Germany, amateur photographer and full-time physicist Kristian Cvecek heads out into the forest at night with his camera. Using long-exposure photography, Cvecek captures the movements of fireflies as they prance between the trees. The insect, as shown in Figure 1-13, is tiny and barely noticeable during the day, but in the dark, it's hard to look elsewhere.


FIGURE 1-13 A firefly in the night by Kristian Cvecek, http://quit007.deviantart.com/

Although each moment in flight seems like a random point in space to an observer, a pattern emerges in Cvecek's photos, as shown in Figure 1-14. It's as if the fireflies move along the walking path and circle around the trees with a predetermined destination. There is randomness, though. You can guess where a firefly goes next based on its flight path, but how sure are you? A firefly can bolt left, right, up, and down at any moment, and that variability, which makes each flight unique, is what makes fireflies so fun to watch and the picture so beautiful. The path is what you care about. The end point, start point, and average position don't mean nearly as much.

With data, you can find patterns, trends, and cycles, but it's not always (rarely, actually) a smooth path from point A to point B. Total counts, means, and other aggregate measurements can be interesting, but they're only part of the story, whereas the fluctuations in the data might be the most interesting and important part.

Between 2001 and 2010, according to the National Highway Traffic Safety Administration, there were 363,839 fatal automobile crashes in the United States. No doubt this total count, over one-third of a million, carries weight because it represents even more lost lives than that. Place all focus on the one number, as in Figure 1-15, and it makes you think, or maybe even reflect on your own life. However, is there anything you can learn from the data, other than that you should drive safely?

The NHTSA provides data down to individual accidents, which includes when and where each occurred, so you can look closer. In Figure 1-16, every fatal crash in the contiguous United States between 2001 and 2010 is mapped. Each dot represents a crash. As you might expect, there is a higher concentration of accidents in large cities and along major highways; there are fewer accidents where there are fewer people and roads.


FIGURE 1-14 Path of a firefly by Kristian Cvecek, http://quit007.deviantart.com/

FIGURE 1-15 One aggregate


FIGURE 1-16 Everything mapped at once

Again, although not to be taken lightly, the map tells you more about the country's road network than it does the accidents. A look at crashes over time shifts focus to the events themselves. For example, Figure 1-17 shows the number of accidents per year, which tells a different story than the total in Figure 1-15. Accidents still occurred in the tens of thousands annually, but there was a significant decline from 2006 through 2010, and fatalities per 100 million vehicle miles traveled (not shown) also decreased.

Seasonal cycles become obvious at month-by-month granularity, as shown in Figure 1-18. Incidents peak during the summer months when people go on vacation and spend more time outside, whereas during the winter, fewer people drive, so there are fewer crashes. This happens every year. At the same time, you can still see the overall annual decline between 2006 and 2010.
FIGURE 1-17 Annual fatal accidents


However, there's variability when you compare specific months over the years. For example, in 2001, the most crashes occurred in August, and there was a small relative drop the following month. The same thing happened in 2002 through 2004. However, in 2005 through 2007, July had the most accidents. Then it was back to August in 2008 through 2010. On the other hand, February, the month with the fewest days, had the fewest accidents every year, with the exception of 2008. So there are seasonal variations and variation within the seasons.

Go down another level to daily crashes, as shown in Figure 1-19, and you see even higher variability, but it's not all noise. There still appears to be a pattern of peaks and valleys. Although it's harder to make out the seasonal patterns, you can see a weekly cycle with more accidents during the weekends than during the middle of the week. The peak day each week fluctuates between Friday, Saturday, and Sunday.

FIGURE 1-18 Monthly fatal accidents


FIGURE 1-19 Daily fatal accidents


But guess what: You can increase granularity to crashes by the hour. Figure 1-20 breaks it down. Each row represents a year, so each cell in the grid shows an hourly time series for the corresponding month. With the exception of a New Year's spike during the midnight hour, it's hard to make out patterns at this level because of the variability. Actually, the monthly chart is hard to interpret, too, if you don't know what you're looking for.

There are clear patterns, though, if you aggregate, as shown in Figure 1-21. Instead of showing values at every hour, day, or month, you can aggregate on specific time segments to explore the distributions. What was hard to discern, or looked like noise before, is easy to see here. There's a small bump in the morning when people commute to work, but most fatal crashes occur in the evening after work. As you saw in Figure 1-19, there are more crashes during the weekend, but summed up, it's more obvious. Finally, you can see the seasonal patterns, but more clearly, with a greater number of accidents during the summer than in the winter.

The main point is that there's value in looking at the data beyond the mean, median, or total because those measurements tell you only a small part of the story. A lot of the time, aggregates or values that just tell you where the middle of a distribution is hide the interesting details that you should actually focus on, for both decision making and storytelling. An outlier that stands out from the crowd could be something that you need to fix or pay special attention to. Maybe the changes over time are a signal that something good (or bad) is happening in your system. Cycles or regular occurrences could help you prepare for the future. However, sometimes it isn't helpful to see so much variability, in which case you can dial back the granularity for generalizations and distributions. You lose this information (the juicy bits) when you step too far away from the data.
Think of it this way: When you look back on your life, would you rather just remember what your days were like on average, or is it the highs and the lows that are most important? I bet it's some combination of the two.

FIGURE 1-20 Hourly fatal accidents


FIGURE 1-21 Accident distributions over time
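The aggregate-on-time-segments idea behind Figure 1-21 takes only a few lines to reproduce: group the same set of events by hour, weekday, or month and count each bucket. The sketch below uses evenly spaced, made-up timestamps rather than the NHTSA records, so it shows the mechanics, not the real distributions:

```python
from collections import Counter
from datetime import datetime, timedelta

# Invented timestamps standing in for crash records (not the NHTSA data):
# one event every 7 hours across 2010.
start = datetime(2010, 1, 1)
events = [start + timedelta(hours=7 * i) for i in range(1250)]

# The same events, aggregated on three different time segments.
by_hour = Counter(t.hour for t in events)               # time-of-day profile
by_weekday = Counter(t.strftime("%A") for t in events)  # weekly cycle
by_month = Counter(t.month for t in events)             # seasonal cycle

# Every event lands in exactly one bucket of each aggregation.
print(sum(by_hour.values()), len(by_hour), len(by_month))
```

With real data, plotting `by_hour` would show the commute and evening bumps, `by_weekday` the weekend rise, and `by_month` the summer peak; the point is that one set of raw timestamps supports all three views.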

UNCERTAINTY
A lot of data is estimates rather than absolute counts. An analyst considers the evidence (such as a sample) and makes an educated guess about a full population. That educated guess has uncertainty attached to it. You do this all the time in your day-to-day. You make a guess based on what you know, read, or what someone told you, and you can say with some (possibly rough) certainty that you're right. Are you absolutely positive, or are you basically clueless? It works the same with data.

Note: It's tempting to look at data as absolute truth, because we associate numbers with fact, but more often than not, data is an educated guess. Your goal is to use data that doesn't have large levels of uncertainty attached.

When I was a young lad, a recent engineering graduate with a statistics minor, I had a 9-month gap between college and graduate school. I took a few temporary jobs that paid a little more than minimum wage, and they were mind-numbingly boring, so naturally my mind wandered to more engaging things.


One day I thought to myself, "Hey, I have some statistics and probability know-how and a deck of cards. I'm going to become an expert blackjack player like those kids from MIT. Forget this stupid job. I'm gonna be rich!" And my 1-month obsession with blackjack began. (To save you the suspense, I didn't get rich, and it's not nearly as exciting as they make it look in the movies.)

In case you're unfamiliar with the game, here's a quick rundown. There's a dealer and a player. The dealer deals two cards to each (one of his is face down), and the goal is to get a card total as close to 21 as possible, without going over. You can choose to take additional cards (called a hit) or not (stay). In some cases, you can also split your hand of two cards, as if you're playing two separate hands; you can also double down, which means to double your bet. The more you bet, the more you can win. If you go over 21, to bust, you automatically lose, and if not, the dealer hits or stays, and whoever is closer to 21 wins. By design, the dealer has the advantage, but if you hit and stay when you're supposed to, you can decrease that advantage.

These rules are based on averages, but as anyone who has played blackjack can tell you, there is uncertainty in each hand of cards. You can still lose even when you make the right move. For example, imagine you are dealt a 5 and a 6 for a total of 11, and the dealer shows a 6. The right move is to double your bet, because it's impossible for you to bust with an additional card, and there's a decent chance of getting 21. There's also a good chance the dealer will bust with a 6 showing. So you double down, and you get a 3, for a total of 14. Ouch. That's not good. Your only hope is for a dealer bust. So he flips his hidden card, and it's a 10 for a total of 16. By rule, he has to hit, and it's a 5. Dealer total: 21. You lose. Had you not doubled down, you would have lost only half the money that you did playing the right way.
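You can put a number on that uncertainty by simulating the hand. The sketch below (my illustration, not the author's method) deals the 11-versus-6 double-down many times from a single shuffled deck; it is deliberately simplified: the dealer stands on all 17s, there are no splits or blackjack bonuses, and pushes are counted as non-wins.

```python
import random

# Card values: 2-10 at face value, J/Q/K as 10, ace as 11 (soft).
RANKS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 11]

def hand_total(cards):
    """Best blackjack total, demoting aces from 11 to 1 on a bust."""
    total, aces = sum(cards), cards.count(11)
    while total > 21 and aces:
        total -= 10
        aces -= 1
    return total

def double_on_11_win_rate(trials=20000, seed=7):
    """Estimate how often doubling a 5+6 against a dealer 6 wins.
    Single deck, dealer stands on all 17s; pushes count as non-wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        deck = 4 * RANKS
        for card in (5, 6, 6):   # player's 5 and 6, dealer's 6 up-card
            deck.remove(card)
        rng.shuffle(deck)
        player = hand_total([5, 6, deck.pop()])  # double down: one card only
        dealer = [6, deck.pop()]
        while hand_total(dealer) < 17:
            dealer.append(deck.pop())
        if hand_total(dealer) > 21 or player > hand_total(dealer):
            wins += 1
    return wins / trials

print(round(double_on_11_win_rate(), 3))
```

Run it and you win well over half the time, which is exactly the point of the story: the right move, repeated, comes out ahead, yet any single hand can still lose.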
But if it were that easy to win, the casino wouldn't bother putting the game on the floor. There's uncertainty in each hand because you are playing against distributions; or rather, you know only the approximate probabilities of drawing cards. You might have an idea of what cards are in the deck, but you can make only an educated guess about what card comes next.

Of course, uncertainty applies to things outside of cards, and it comes in a variety of forms. Take the weather, for example. How many times have you looked up the forecast for the

Note: If you count cards, or keep track of what's left in the deck, the probabilities change as you modify your bet based on your advantage, but uncertainty remains.


next day or for the next week as you pack for a trip, only to find, when the time comes, that the weather isn't how you expected it to be? What about the meter in cars that tells you how much farther you can drive with the current amount of fuel in your tank? I was running errands with my wife, and the meter said I could drive an estimated 16 more miles, but home was about 18 miles away. Dilemma. Instead of stopping at the nearest gas station, I drove toward the one nearest home, and the meter said I had zero miles left for about 2 miles, but we made it. (Good thing, because someone kept insisting that I would be the one to push the car.)

Weigh yourself more than once, and you might get different readings; typically though, breathing for a few seconds does not lead to weight loss or gain. The estimated battery life on your laptop can jump around by hour increments when only minutes have passed. The subway announcement says a train will arrive in 10 minutes, but it comes in 11; or a delivery is estimated to arrive on Monday, but it comes on Wednesday instead. When you have data that is a series of means and medians or a collection of estimates based on a sample population, you should always wonder about the uncertainty.

Note: Numbers seem concrete and absolute, but estimates carry uncertainty with them. Data is an abstraction of what it represents, and the level of exactness varies.

This is especially important when people base major decisions, which affect millions, on estimates, such as with national and global demographics. Program creation and funding is often based on these numbers, so even a small margin of error can make a big difference.

The United States Census Bureau releases data about the country on topics such as migration, poverty, and housing, which are estimates based on samples from the population. (This is different from the decennial census, which aims to count every person in the United States.) A margin of error is provided with each estimate, which means that the actual count or percentage is likely within a given range. For example, Figure 1-22 shows estimates about housing. The margin of error for total households is almost one-quarter of a million.

To put it differently, imagine you have a jar of gumballs that you can't see into, and you want to guess how many of each color there are. (Why do you care about gumball distribution? I don't know. Use your imagination. You're a gumball connoisseur who works for a gumball factory, and you bet your snotty statistician friend that every jar on your watch is uniformly distributed, so it's a matter of pride and cash.)


If you were to pour all the gumballs onto the table and count every one, you wouldn't have to guess because you would get the full tally. But say you can grab only a handful, and you have to guess the contents of the entire jar based on what you have in your hand. A larger handful would make it easier to guess because it's more likely a better representation of the entire jar. On the other side of the spectrum, you could take just one gumball out, and it'd be much harder to guess what else is in the jar. With one gumball, your margin of error would be high; with a large handful of gumballs, your margin of error would be lower; and if you counted all the gumballs, you would have zero margin of error. Apply that to millions of gumballs in thousands of differently sized jars, with different distributions and big and small handfuls, and estimation

FIGURE 1-22 Household estimates in 2010

FIGURE 1-23 Gumballs and margin of error


grows more complex. Then substitute the gumballs for people, the jars for towns, cities, and counties, and the handfuls for randomly distributed surveys, and a mean with a margin of error carries more weight.

According to Gallup, 48 percent of Americans disapproved of the job Barack Obama was doing from June 11 through 13 in 2012. However, there was a 3 percent margin of error, which means the difference between more than half and less than half of the country disapproving. Similarly, during election season, polls estimate which candidates lead, and if the margin of error is wide, the results can put more than one person in front, which kind of defeats the purpose of the poll.

Estimates get tricky when you rank people, places, and things, especially when you combine measurements (and create statistical models with multiple variables). Take education evaluation, for example, which is under constant scrutiny. Cities, schools, and teachers are often compared against one another, but what defines a good education or makes an entire city smart? Is it the percentage of high school students who graduate? The percentage of students who go to college? Is it the number of universities, libraries, and museums per capita? If it's all of this, is one count more important than another, or do you give all of them equal weight? Answers change depending on whom you ask, as do ratings.

Note: My hometown was ranked the dumbest city in America by a publication that shall go unnamed. The rankings were estimates, which were based on estimates with questionable uncertainty.

In 2011, the New York City Department of Education released Teacher Data Reports that tried to measure teaching quality. The reports were originally given only to schools and teachers but were later made publicly available in early 2012. The estimates took several factors into account, but one of the main ones was the change in test percentiles from the seventh to eighth grade. This is how seventh- and eighth-grade math teacher Carolyn Abbott became known as the worst math teacher in the city, placed in the 0th percentile. However, her seventh-grade students scored in the 98th percentile. What? Those students were predicted to score in the 97th percentile in the eighth grade, but they

FIGURE 1-24 Carolyn Abbott's rating compared to her students


instead scored in the 89th percentile, which, according to the statistical model, was not progress. Most would agree that students wouldn't earn the scores they did with a poor teacher. The challenge is that there's uncertainty and variability within teacher ratings. A rating represents a distribution of teachers, who are ranked based on estimates with uncertainty attached, but the ratings are treated as absolute. A general audience won't understand that concept, so it's your responsibility to understand it and communicate it clearly. When you don't consider what your data truly represents, it's easy to accidentally misinterpret it. Always take uncertainty and variability into account. This is also when context comes into play.

CONTEXT
Look up at the night sky, and the stars look like dots on a flat surface. The lack of visual depth makes the translation from sky to paper fairly straightforward, which makes it easier to imagine constellations. Just connect the dots. However, although you perceive stars to be the same distance away from you, they are actually varying light years away. If you could fly out beyond the stars, what would the constellations look like?

This is what Santiago Ortiz wondered as he visualized stars from a different perspective, as shown in Figure 1-25. The initial view places the stars in a global layout, the way you see them. You look at Earth beyond the stars, but as if they were an equal distance away from the planet. Zoom in, and you can see constellations how you would from the ground, bundled in a sleeping bag in the mountains, staring up at a clear sky. The perceived view is fun to see, but flip the switch to show actual distance, and it gets interesting. Stars transition, and the easy-to-distinguish constellations are practically unrecognizable. The data looks different from this new angle.

This is what context can do. It can completely change your perspective on a dataset, and it can help you decide what the numbers represent and how to interpret them. After you do know what the data is about, your understanding helps you find the fascinating bits, which leads to worthwhile visualization.


FIGURE 1-25 View of the Sky by Santiago Ortiz, http://moebio.com/exomap/viewsofthesky/2/

Without context, data is useless, and any visualization you create with it will also be useless. Using data without knowing anything about it, other than the values themselves, is like hearing an abridged quote secondhand and then citing it as a main discussion point in an essay. It might be okay, but you risk finding out later that the speaker meant the opposite of what you thought.


You have to know the who, what, when, where, why, and how (the metadata, or the data about the data) before you can know what the numbers are actually about.

Who: A quote in a major newspaper carries more weight than one from a celebrity gossip site that has a reputation for stretching the truth. Similarly, data from a reputable source typically implies better accuracy than a random online poll. For example, Gallup, which has measured public opinion since the 1930s, is more reliable than, say, someone (for example, me) experimenting with a small, one-off Twitter sample late at night during a short period of time. Whereas the former works to create samples representative of a region, there are unknowns with the latter. Speaking of which, in addition to who collected the data, who the data is about is also important. Going back to the gumballs, it's often not financially feasible to collect data about everyone or everything in a population. Most people don't have time to count and categorize a thousand gumballs, much less a million, so they sample. The key is to sample evenly across the population so that it is representative of the whole. Did the data collectors do that?

How: People often skip methodology because it tends to be complex and aimed at a technical audience, but it's worth getting to know the gist of how the data of interest was collected. If you're the one who collected the data, then you're good to go, but when you grab a dataset online, provided by someone you've never met, how will you know if it's any good? Do you trust it right away, or do you investigate? You don't have to know the exact statistical model behind every dataset, but look out for small samples, high margins of error, and unfit assumptions about the subjects, such as indices or rankings that incorporate spotty or unrelated information. Sometimes people generate indices to measure the quality of life in countries, and a metric like literacy is used as a factor.
However, a country might not have up-to-date information on literacy, so the data gatherer simply uses an estimate from a decade earlier. That's going to cause problems, because then the index works only under the assumption that the literacy rate one decade earlier is comparable to the present, which might not be (and probably isn't) the case.

What: Ultimately, you want to know what your data is about, but before you can do that, you should know what surrounds the numbers. Talk to subject experts, read papers, and study accompanying documentation.


In introductory statistics courses, you typically learn about analysis methods, such as hypothesis testing, regression, and modeling, in a vacuum, because the goal is to learn the math and concepts. But when you get to real-world data, the goal shifts to information gathering. You shift from "What is in the numbers?" to "What does the data represent in the world? Does it make sense? How does this relate to other data?" A major mistake is to treat every dataset the same and use the same canned methods and tools. Don't do that.

When: Most data is linked to time in some way, in that it might be a time series, or it's a snapshot from a specific period. In both cases, you have to know when the data was collected. An estimate made decades ago does not equate to one in the present. This seems obvious, but it's a common mistake to take old data and pass it off as new because it's what's available. Things change, people change, and places change, and so naturally, data changes.

Where: Things can change across cities, states, and countries just as they do over time. For example, it's best to avoid global generalizations when the data comes from only a few countries. The same logic applies to digital locations. Data from websites, such as Twitter or Facebook, encapsulates the behavior of its users and doesn't necessarily translate to the physical world. Although the gap between digital and physical continues to shrink, the space between is still evident. For example, an animated map that represented the history of the world based on geotagged Wikipedia entries showed popping dots for each entry in a geographic space. The end of the video is shown in Figure 1-26. The result is impressive, and there is a correlation to the real-life timeline for sure, but it's clear that because Wikipedia content is more prominent in English-speaking countries, the map shows more in those areas than anywhere else.

Why: Finally, you must know the reason data was collected, mostly as a sanity check for bias.
Sometimes data is collected, or even fabricated, to serve an agenda, and you should be wary of these cases. Government and elections might be the first things that come to mind, but so-called information graphics around the web, filled with keywords and published by sites trying to grab Google juice, have also grown up to be a common culprit. (I fell for these a couple of times in my early days of blogging for FlowingData, but I learned my lesson.) Learn all you can about your data before anything else, and your analysis and visualization will be better for it. You can then pass what you know on to readers.


FIGURE 1-26 A History of the World in 100 Seconds by Gareth Lloyd, http://data.ws/24a

However, just because you have data doesn't mean you should make a graphic and share it with the world. Context can help you add a dimension (a layer of information) to your data graphics, but sometimes it means it's better to hold back, because it's the right thing to do.

In 2010, Gawker Media, which runs large blogs like Lifehacker and Gizmodo, was hacked, and 1.3 million usernames and passwords were leaked. They were downloadable via BitTorrent. The passwords were encrypted, but the hackers cracked about 188,000 of them, which exposed more than 91,000 unique passwords. What would you do with that kind of data? The mean thing to do would be to highlight usernames with common (read: poor) passwords, or you could go so far as to create an application that guessed passwords, given a username. A different route might be to highlight just the common passwords, as shown in Figure 1-27. This offers some insight into the data without making it too easy to log in with someone else's account. It might also serve as a warning to others to change their passwords to something less obvious. You know, something with at least two symbols, a digit, and a mix of lowercase and uppercase letters. Password rules are ridiculous these days. But I digress.

FIGURE 1-27 Commonly used passwords in the Gawker hack
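Tallying only the common passwords, as in Figure 1-27, is a few lines of counting. The list below is made of invented stand-ins, not the Gawker data, and it deliberately touches no usernames:

```python
from collections import Counter

# Invented stand-ins for leaked passwords; no usernames involved.
leaked = ["123456", "password", "123456", "qwerty",
          "123456", "password", "letmein", "qwerty"]

# Show only aggregate counts, never who used which password.
for pw, count in Counter(leaked).most_common(3):
    print(pw, count)
```

Keeping the analysis at the level of `most_common` counts is exactly the privacy-preserving middle ground the text describes: the aggregate is informative, but no individual account is exposed.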


With data like the Gawker set, a deep analysis might be interesting, but it could also do more harm than good. In this case, data privacy is more important, so it's better to limit what you show and look at. Whether you should use data is not always clear-cut, though. Sometimes, the split between what's right and wrong can be gray, so it's up to you to make the call.

For example, on October 22, 2010, Wikileaks, an online organization that releases private documents and media from anonymous sources, released 391,832 United States Army field reports, now known as the Iraq War Logs. The reports recorded 66,081 civilian deaths out of 109,000 recorded deaths between 2004 and 2009. The leak exposed incidents of abuse and erroneous reporting, such as civilian deaths classified as enemy killed in action. On the other hand, it can seem unjustified to publish findings about classified data obtained through less than savory means. Maybe there should be a golden rule for data: Treat others' data the way you would want your data treated.

In the end, it comes back to what data represents. Data is an abstraction of real life, and real life can be complicated, but if you gather enough context, you can at least put forth a solid effort to make sense of it.

WRAPPING UP
Visualization is often thought of as an exercise in graphic design or a brute-force computer science problem, but the best work is always rooted in data. To visualize data, you must understand what it is, what it represents in the real world, and in what context you should interpret it. Data comes in different shapes and sizes, at various granularities, and with uncertainty attached, which means totals, averages, and medians are only a small part of what a data point is about. It twists. It turns. It fluctuates. It can be personal, and even poetic. As a result, you can find visualization in many forms.

Contents

List of Tables and Figures
Preface
Acknowledgments

Introduction This Ain't Your Father's Data
    Better Car Insurance through Data
    Potholes and General Road Hazards
    Recruiting and Retention
    How Big Is Big? The Size of Big Data
    Why Now? Explaining the Big Data Revolution
    Central Thesis of Book
    Who Should Read This Book?
    Plan of Attack
    Summary
    Notes

Chapter 1 Data 101 and the Data Deluge
    The Beginnings: Structured Data
    Structure This! Web 2.0 and the Arrival of Big Data
    The Composition of Data: Then and Now
    The Enterprise and the Brave New Big Data World
    The Current State of the Data Union
    Summary
    Notes

Chapter 2 Demystifying Big Data
    Characteristics of Big Data
    The Anti-Definition: What Big Data Is Not
    Summary
    Notes

Chapter 3 The Elements of Persuasion: Big Data Techniques
    The Big Overview
    Statistical Techniques and Methods
    Data Visualization
    Automation
    Semantics
    Predictive Analytics
    Big Data and the Gang of Four
    Limitations of Big Data
    Summary
    Notes

Chapter 4 Big Data Solutions
    Projects, Applications, and Platforms
    Other Data Storage Solutions
    Hardware Considerations
    Websites, Start-ups, and Web Services
    The Art and Science of Predictive Analytics
    Summary
    Notes

Chapter 5 Case Studies: The Big Rewards of Big Data
    Quantcast: A Small Big Data Company
    Explorys: The Human Case for Big Data
    NASA: How Contests, Gamification, and Open Innovation Enable Big Data
    Summary
    Notes

Chapter 6 Taking the Big Plunge
    Before Starting
    Starting the Journey
    Avoiding the Big Pitfalls
    Summary
    Notes

Chapter 7 Big Data: Big Issues and Big Problems
    Privacy: Big Data = Big Brother?
    Big Security Concerns
    Big, Pragmatic Issues
    Summary
    Notes

Chapter 8 Looking Forward: The Future of Big Data
    Predicting Pregnancy
    Big Data Is Here to Stay
    Big Data Will Evolve
    Projects and Movements
    Big Data Will Only Get Bigger . . . and Smarter
    The Internet of Things: The Move from Active to Passive Data Generation
    Big Data: No Longer a Big Luxury
    Stasis Is Not an Option
    Summary
    Notes

Final Thoughts
    Spreading the Big Data Gospel
    Notes

Selected Bibliography
About the Author
Index
Buy This Book

CHAPTER 1

Data 101 and the Data Deluge


Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.
Tim Berners-Lee

Today, data surrounds us at all times. We are living in what some have called the Data Deluge.1 Everything is data. There's even data about data, hence the term metadata. And data is anything but static; it's becoming bigger and more dynamic all the time. The notion of data is somewhat different and much more nuanced today than it was a decade ago, and it's certainly much larger.



TOO BIG TO IGNORE

Powerful statements like these might give many readers pause, scare some others, and conjure up images of The Matrix. That's understandable, but the sooner that executives and industry leaders realize this, the quicker they'll be able to harness the power of Big Data and see its benefits. As a starting point, we must explore the very concept of data in greater depth, and a little history is in order. If we want to understand where we are now and where we are going, we have to know how we got here.

This chapter discusses the evolution of data in the enterprise. It provides an overview of the types of data that organizations have at their disposal today. It answers questions like these: How did we arrive at the Big Data world? What does this new world look like? We have to answer questions like these before we can move up the food chain. Ultimately, we'll get to the big question: how can Big Data enable superior decision-making?

THE BEGINNINGS: STRUCTURED DATA


Make no mistake: corporate data existed well before anyone ever turned on a proper computer. Nor did the notion of data arrive only years later, when primitive accounting systems became commercially viable. So why weren't as many people talking about data thirty years ago? Simple: because very little of it was easily (read: electronically) available. Before computers became standard fixtures in offices, many companies paid employees via manual checks; bookkeepers manually kept accounting ledgers. The need for public companies to report their earnings on quarterly and annual bases did not start with the modern computer. Of course, thirty years ago, organizations struggled with this type of reporting because they lacked the automated systems that we take for granted today. While calculators helped, the actual precursor to proper enterprise systems was VisiCalc. Dan Bricklin invented the first spreadsheet program in the late 1970s, and Bob Frankston subsequently refined it. In the mid-1980s, user-facing or front-end applications like manufacturing resource planning (MRP) and enterprise resource planning (ERP) systems began to make inroads. At a high level, these systems had one goal: to automate standard business processes. To achieve this goal, enormous mainframe databases supported these systems. For the most part, these systems could only process structured data (i.e., orderly information relating to customers, employees, products, vendors, and the like). A simple example of this type of data is presented in Table 1.1.

Table 1.1 Simple Example of Structured Customer Master Data

CustomerID   CustomerName     ZipCode   ContactName
1001         Ballys           89109     Jon Anderson
1002         Bellagio         89109     Geddy Lee
1003         Wynn Casino      89109     Mike Mangini
1004         Borgata          08401     Steve Hogarth
1005         Caesars Palace   89109     Brian Morgan

Now a master customer table can only get so big. After all, even Amazon.com only serves 300 or 400 million customers, although its current internal systems can support many times that number. Tables get much longer (not wider) when they contain transactional data like employee paychecks, journal entries, or sales. For instance, consider Table 1.2. In Table 1.2, we see that many customers make multiple purchases from a company. For instance, I am an Amazon customer, and I buy at least one book, DVD, or CD per week. I have no doubt that each sale represents an individual record in a very long Amazon database table somewhere. (Amazon uses this data for two reasons: [1] to process my payments; and [2] to learn more about my purchasing habits and recommend products that, more often than not, I consider buying.)

Table 1.2 Simple Example of Transactional Sales Data

OrderNbr   CustomerID   ProductID   OrderDate   ShipDate
119988     1001         2112        1/3/13      1/6/13
119989     1002         1234        1/6/13      1/11/13
119990     1001         2112        1/6/13      1/9/13
119991     1004         778         1/6/13      1/12/13
119992     1004         999         1/7/13      1/15/13

Figure 1.1 Entity Relationship Diagram (ERD)

Customer: CustomerID, CustomerName, ContactName, ZipCode
Order: OrderNbr, CustomerID, ProductID, OrderDate, ShipDate
Product: ProductID, Description, CreationDate, CurrInventory, StockStatus

Things are orderly under a relational data model. All data is stored in proper tables, and each table is typically joined with at least one other. Each table is its own entity. An Entity Relationship Diagram (ERD) visually represents the relationships between and among tables. A simple example of an ERD is shown in Figure 1.1. Note that the ERD in Figure 1.1 is nothing like what you'd find behind the scenes in most large organizations. It's common for enterprise systems to contain thousands of individual tables (including some customized ones), although not every table in a commercial off-the-shelf (COTS) system contains data. Also, querying data from multiple tables requires JOIN statements. While you can theoretically query as many data sources and tables as you like (as long as they are properly joined), queries with a high number of huge tables tend to take a great deal of time to complete.* Queries improperly or inefficiently written can wreak havoc across an entire enterprise. Throughout the 1990s and early 2000s, more and more organizations deployed systems built upon this relational data model. They uprooted their legacy mainframe systems and supplanted them with contemporary enterprise applications. Importantly, these applications were powered by orderly and expensive relational databases like Oracle, SQL Server, and others. What's more, organizations typically converted their legacy data to these new systems by following a process called ETL (extract, transform, and load).**

* Trust me. I've written tens of thousands of queries in my day.
** We'll see in Chapter 4 that ETL isn't really beneficial in a world of Hadoop and NoSQL because much data is far less structured these days.
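The JOIN mechanics described above can be sketched with Python's built-in sqlite3 module and the sample data from Tables 1.1 and 1.2. This is only an illustrative toy, not any vendor's actual schema; the table and column names simply follow the tables in this chapter.

```python
import sqlite3

# Toy versions of Tables 1.1 and 1.2 in an in-memory relational database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT,
        ZipCode      TEXT,
        ContactName  TEXT
    );
    CREATE TABLE "Order" (
        OrderNbr   INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customer(CustomerID),
        ProductID  INTEGER,
        OrderDate  TEXT,
        ShipDate   TEXT
    );
    INSERT INTO Customer VALUES
        (1001, 'Ballys',   '89109', 'Jon Anderson'),
        (1002, 'Bellagio', '89109', 'Geddy Lee');
    INSERT INTO "Order" VALUES
        (119988, 1001, 2112, '1/3/13', '1/6/13'),
        (119989, 1002, 1234, '1/6/13', '1/11/13'),
        (119990, 1001, 2112, '1/6/13', '1/9/13');
""")

# The JOIN: each transactional order row is matched to its master customer row
rows = conn.execute("""
    SELECT o.OrderNbr, c.CustomerName, o.ProductID
    FROM "Order" AS o
    JOIN Customer AS c ON c.CustomerID = o.CustomerID
    ORDER BY o.OrderNbr
""").fetchall()

for row in rows:
    print(row)  # e.g., (119988, 'Ballys', 2112)
```

With two tiny tables the query is instantaneous; the chapter's point is that the same pattern, multiplied across thousands of large tables, is where long query times (and havoc) come from.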


Like their predecessors, ERP and CRM systems excelled at handling structured data, performing essential business functions like paying vendors and employees, and providing standard reports. With these systems, employees could enter, edit, and retrieve essential enterprise information. Corporate intranets, wikis, and knowledge bases represented early attempts to capture unstructured data, but most of this data was internal (read: generated by employees, not external entities). For the most part, intranets have not displaced email as the de facto killer app inside many large corporations. When asked about data, most people still only think of the structured kind mentioned in this section. "The relational model has dominated the data management industry since the 1980s," writes blogger Jim Harris on the Data Roundtable. That model "foster(s) the long-held belief that data has to be structured before it can be used, and that data should be managed following ACID (atomicity, consistency, isolation, durability) principles, structured primarily as tables and accessed using structured query language (SQL)."2 Harris is spot-on. The relational data model is still very important, but it is no longer the only game in town. It all depends on the type and source of data in question. Even in a Big Data world, transactional and structured data, and the relational databases behind them, are far from irrelevant. But organizations need to start leveraging new data sources and solutions.
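The ACID principles Harris mentions can be seen in miniature with Python's sqlite3, itself a small relational database. This sketch uses a hypothetical two-row ledger (invented for illustration, not from the text) to show atomicity: a failed transaction rolls back in its entirety.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (entry TEXT NOT NULL, amount REAL NOT NULL)")

# Atomicity: the two inserts below must succeed or fail as a unit
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO ledger VALUES ('debit', -100.0)")
        conn.execute("INSERT INTO ledger VALUES (NULL, 100.0)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the failed second insert undid the first one as well

count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(count)  # 0: the lone 'debit' row did not survive the rollback
```

This guarantee is exactly what makes relational systems so trustworthy for payroll, journal entries, and other transactional data, and it depends on the data fitting into tables in the first place.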

STRUCTURE THIS! WEB 2.0 AND THE ARRIVAL OF BIG DATA


While business information is as old as capitalism itself, the widespread use of corporate data is a relatively recent development. The last section demonstrated how, in the 1980s and 1990s, relational databases, ERP and CRM applications, business automation, and computers all helped popularize the contemporary notion of data. Over the past few decades, organizations have begun gradually spending more time, money, and effort managing their data, but these efforts have tended to be mostly internal in nature. That is, organizations have focused on the data generated by their own hands.


In or around 2005, that started to change as we entered Web 2.0, a.k.a. the social web. As a direct result, the volume, variety, and velocity of data rose exponentially, especially consumer-driven data that is, for the most part, external to the enterprise. The usual suspects include the rise of nascent social networks like Classmates.com, MySpace, Friendster, and then a little Harvard-specific site named The Facebook. Photo sharing began to go mainstream through sites like Flickr, eventually gobbled up by Yahoo! Blogging started to take off, as did microblogging sites like Twitter a few years later. More and more people began walking around with increasingly powerful smartphones that could record videos. Enter the citizen journalist. Sites like YouTube made video sharing easy and extremely popular, prompting Google to pay $1.65 billion for the company in 2006. Collectively, these sites, services, and advancements led to a proliferation of unstructured data, semi-structured data, and metadata, the majority of which was external to the enterprise. To be sure, many organizations have seen their structured and transactional data grow in velocity and volume. As recently as 2005, Information Management magazine estimated the largest data warehouse in the world at 100 terabytes (TB) in size. As of September 2011, Walmart, the world's largest retailer, logged one million customer transactions per hour and fed information into databases estimated at 2.5 petabytes in size.3 (I'll save you from having to do the math. This is 25 times as big.) For their part, companies like Amazon, Apple, and Google are generating, storing, and accessing much more data now than they did in 2005. This makes sense. As Facebook adds more users and features, Apple customers download more songs and apps, Google indexes more web pages, and Amazon sells more stuff, each generates more data. However, it's essential to note that Web 2.0 did not increase internal IM demands at every organization.

Consider a medium-sized regional hospital for a moment. (I've consulted at many in my career.) Let's say that, on a typical day, it receives 200 new patients. For all of its transformative power, the Internet did not cause that hospital's daily patients to quadruple. Hospitals only contain so many beds. Think of a hospital as an anti-Facebook, because it faces fairly strict limits with respect to scale. Hospitals don't benefit from network effects. (Of course, they can always expand their physical space, put in more beds, hire more employees, and the like. These activities will increase the amount of data the hospitals generate, but let's keep it simple in this example.)

Unstructured Data
We know from Tables 1.1 and 1.2 that structured data is relational, orderly, consistent, and easily stored in spreadsheets and database tables. Unstructured data is its inverse. It's big, nonrelational, messy, text-laden, and not easily represented in traditional tables. And unstructured data represents most of what we call Big Data. According to Clarabridge, a leader in sentiment and text analytics software, "Unstructured information accounts for more than 80 percent of all data in organizations."4 By some estimates, unstructured data is growing ten to fifty times faster than its structured counterpart. While everyone agrees on the growth of data, there's plenty of disagreement on the precise terminology we should be using. Some believe that unstructured data is in fact a contradiction in terms.5 And then there's Curt Monash, Ph.D., a leading analyst of and strategic advisor to the software industry. He defines polystructured data as data "with a structure that can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) but cannot be described in the concise, consistent form such highly structured systems require."6 Debating the technical merits of different definitions isn't terribly important for our purposes. This book uses the term unstructured data in lieu of polystructured data. It's just simpler, and it suits our purposes just fine.

Semi-Structured Data
The rise of the Internet and the web has led not only to a proliferation of structured and unstructured data. We've also seen dramatic increases in two other types of data: semi-structured data and metadata. Let's start with the former. As its name implies, semi-structured data contains characteristics of both its structured and unstructured counterparts. Examples include

- Extensible Markup Language (XML) and other markup languages
- Email
- Electronic Data Interchange (EDI), a particular set of standards for computer-to-computer exchange of information

Many people are using semi-structured data whether they realize it or not. And the same holds true for metadata, discussed next.
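To make the "semi-structured" label concrete, here is a small sketch using only Python's standard library. The XML fragment below is an invented order record (not from the text): its tags and attributes impose some structure, yet it also carries free text that no fixed table schema anticipates.

```python
import xml.etree.ElementTree as ET

# A hypothetical order record: tags and attributes give it *some* structure,
# but the free-text <note> fits no fixed column definition
doc = """<order nbr="119988">
  <customer id="1001">Ballys</customer>
  <note>Contact the buyer before shipping; the loading dock closes early.</note>
</order>"""

root = ET.fromstring(doc)
print(root.get("nbr"))             # structured: an attribute -> "119988"
print(root.find("customer").text)  # a tagged value -> "Ballys"
print(root.find("note").text)      # unstructured payload riding inside the markup
```

A relational table could hold the order number and customer name easily enough; it is the open-ended, tag-wrapped text that pushes a record like this into semi-structured territory.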

Metadata
"The information about the package is just as important as the package itself."
Fred Smith, Founder and CEO of FedEx, 1978

Now let's move on to metadata, a term that is increasingly entering the business vernacular. As the quote indicates, people like Fred Smith grasped its importance thirty-five years ago. The term metadata means, quite simply, data about data. We are often creating and using metadata whether we realize it or not. For instance, my favorite band is Rush, the Canadian power trio still churning out amazing music after nearly forty years. While I usually just enjoy the music at concerts, sing along, and air drum,* I occasionally take pictures. Let's say that when I get home, I upload them to Flickr, a popular photo-sharing site. (Flickr is one of myriad companies that extensively use tags and metadata. Many stock photo sites like iStockphoto and Shutterstock rely heavily upon metadata to make their content easily searchable. In fact, I can't think of a single major photo site that doesn't use tags.) So I can view these photos online, but what if I want other Rush fans to find them? What to do? For starters, I can create an album titled "Rush 2012 Las Vegas Photos." That's not a bad starting point, but what if someone wanted to see only recent pictures of the band's insanely talented drummer Neil Peart? After all, he's not in all of my pictures, and it seems silly to make people hunt and peck. The web has evolved, and so has search. No bother; Flickr has me covered. The site encourages me to tag pictures with as many descriptive labels as I like, even offering suggestions based upon similar photos
* To see what I mean, Google "Rush Car Commercial Fly By Night 2012."


and albums. In the end, my photos are more findable throughout the site for everyone, not just me. For instance, a user searching for Las Vegas concerts or rock drummers may well come across a photo of Peart in action in Las Vegas, whether she was initially looking for Rush or not. In this example, the tags are examples of metadata: they serve to describe the actual data (in this case, the photo and its contents). But these photos contain metadata whether I choose to tag them or not. When I upload each photo to Flickr, the site captures each photo's time and date and my username. Flickr also knows the size of each photo's file (its number of KBs or MBs). This is more metadata. And it gets better. Perhaps the photo contains a date stamp and GPS information from my camera or smartphone. Let's say that I'm lazy. I upload my Rush photos in mid-2013 and tag them incorrectly as "Cleveland, Ohio." Flickr knows that these photos were actually taken in November 2012 in Las Vegas and kindly makes some recommendations to me to improve the accuracy of my tags and description.* In the future, maybe Flickr will add facial recognition software so I won't have to tag anyone anymore. The site will learn that my future Rush photos will differ from those of a country-music-loving, conservative Republican who adores radio show host Rush Limbaugh. (For more on how tagging works and why it's so important, check out Everything Is Miscellaneous: The Power of the New Digital Disorder by David Weinberger.) Because of its extensive metadata, Flickr can quickly make natural associations among existing photos for its users. For instance, Flickr knows that one man's car is another's automobile. It also knows that maroon and mauve are just different shades of red and purple, respectively. This knowledge allows Flickr to provide more accurate and granular search results (see Figure 1.2).
If I want to find photos taken of Neil Peart from 6/01/2012 to 10/01/2012 in HD only with the tag of Clockwork Angels (the band's most recent studio album), I can easily perform that search. (Whether I'll see any results is another matter; the more specific the criteria, the less likely that I'll see any results.) The larger point is that, without metadata, searches like these just aren't possible.

* The site uses Exif data (short for Exchangeable Image File), a standard format for storing interchange information in digital photography image files using JPEG compression.

Figure 1.2 Flickr Search Options
Source: Flickr.com

Smartphones with GPS functionality make tagging location easier than ever, and just wait until augmented reality comes to your smartphone. If not the final frontier, the next logical step in tagging is facial recognition. To this end, in June 2012, Facebook acquired facial recognition startup Face.com for an undisclosed sum (rumored to be north of $100 million). The Tel Aviv, Israel-based startup offers application programming interfaces (APIs) for third-party developers to incorporate Face.com's facial-recognition software into their applications. The company has released two Facebook applications: Photo Finder, which lets people find untagged pictures of themselves and their Facebook friends, and Photo Tagger, which lets people automatically bulk-tag photos on Facebook.7 So from a data perspective, what does all of this mean? Several things. First, even an individual photo has plenty of metadata


associated with it, and that data is stored somewhere. Think about the billions of photos online, and you start to appreciate the amount of data involved in their storage and retrieval. Second, photos today are more complex because they capture more data than ever, and this trend is only intensifying. Let's get back to my Rush example. If I look at my concert pictures from last night, I'm sure that I'll find one with poor focus. Right now, there's not much that I can do about it; Photoshop can only do so much. But soon there might be hope for fuzzy pictures, thanks to the folks at the startup Lytro. Through light field technology, Lytro's cameras ultimately allow users to change the focus of a picture after the picture is taken.8 Plenoptic cameras such as Lytro's represent "a new type of camera that dramatically changes photography for the first time since the 1800s. [It's] not too far away from those 3D moving photographs in the Harry Potter movies."9
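The point that searches like these depend entirely on metadata can be shown with a toy sketch. The photo records below are hypothetical (the file names, dates, and tags are invented for illustration and imply nothing about Flickr's actual internals): the search never inspects the photos themselves, only the data about them.

```python
from datetime import date

# Hypothetical photo records: the image file is the data;
# everything else (date, place, tags) is metadata
photos = [
    {"file": "rush_001.jpg", "taken": date(2012, 11, 23), "place": "Las Vegas",
     "tags": {"Rush", "Neil Peart", "concert", "drummer"}},
    {"file": "rush_002.jpg", "taken": date(2012, 11, 23), "place": "Las Vegas",
     "tags": {"Rush", "concert"}},
    {"file": "ohio_017.jpg", "taken": date(2013, 6, 2), "place": "Cleveland",
     "tags": {"family"}},
]

def search(photos, tag=None, place=None):
    """Filter on metadata alone; the image bytes are never opened."""
    return [p["file"] for p in photos
            if (tag is None or tag in p["tags"])
            and (place is None or p["place"] == place)]

print(search(photos, tag="Neil Peart", place="Las Vegas"))  # ['rush_001.jpg']
print(search(photos, tag="concert"))  # ['rush_001.jpg', 'rush_002.jpg']
```

Strip out the tags and places and every query above returns nothing useful: the images are still there, but they are unfindable, which is precisely why metadata matters.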

THE COMPOSITION OF DATA: THEN AND NOW


Over the past decade, many organizations have continued to generate roughly the same amount of internal, structured, and transactional data as they did before the arrival of Web 2.0. For instance, quite a few haven't seen appreciable changes in the number of employee and vendor checks cut. They book roughly the same number of sales and generate a more or less stable number of financial transactions. The data world inside the organization in many cases has not changed significantly. However, this is in stark contrast to the data world outside, and around, the enterprise. It could not be more different. Consider Figure 1.3. As Figure 1.3 shows, there is now much more external and unstructured data than structured data, and there has been for a long time. At the same time, the amount of structured, transactional data has grown exponentially as well. The ostensible paradox can be explained quite simply: while the amount of structured data has grown fast, the amount of unstructured data has grown much faster. Analytics-as-a-service pioneer 1010data now hosts more than 5 trillion records (yes, trillion with a t) for its customers.10 Despite statistics like this, most data today is of the unstructured variety.

Figure 1.3 The Ratio of Structured to Unstructured Data, 2000 vs. 2013

As mentioned before, unstructured data has always existed, even though few people thought of it as data. So other than the terminology, what's different now? First, as mentioned earlier, there's just more unstructured data today than at any point in the past. Second, much of this unstructured data is digitized and available nearly instantly, especially if its owners want it to be.* Delays have evaporated in many cases, although many organizations cannot access real-time data. Be that as it may, by and large, today unstructured data doesn't need to be transcribed, scanned into computers, or read by document storage systems. Data is born digital, or at least it can be. (There are still plenty of hospitals and doctors' offices that refuse to embrace the digital world and electronic medical records.) Consider books, newspapers, and magazines for a moment, all good examples of unstructured data (both 20 years ago and now). For centuries, they were released only in physical formats. With the rise of the Internet, this is no longer the case. Most magazines and newspapers (sites) are available electronically, and print media has been dying for some time now. Most proper books (including this one) are available both in traditional and electronic formats. There is no time lag. In fact, some e-books and Kindle Singles are only available electronically.

* Even this isn't entirely true, as Julian Assange has proven.


THE CURRENT STATE OF THE DATA UNION


Unstructured data is more prevalent and bigger than ever. This does not change the fact that relatively few organizations have done very much with it. For the most part, organizations lamentably have turned a deaf ear to this kind of information, and continue to do so to this day. They have essentially ignored the gigantic amounts of unstructured or semi-structured data now generated by always-connected consumers and citizens. They treat data as a four-letter word. It's not hard to understand this reluctance. To this day, many organizations struggle to manage just their own transactional and structured data. Specific problems include the lack of master data, poor data quality and integrity, and no semblance of data governance. Far too many employees operate in a vacuum; they don't consider the implications of their actions on others, especially with regard to information management (IM). Creating business rules and running audit reports can only do so much. Based upon my nearly fifteen years of working in different IM capacities across the globe, I'd categorize most organizations' related efforts as shown in Figure 1.4. For every organization currently managing its data very well, many more are doing a poor job. Call it data dysfunction, and I'm far from the only one who has noticed this disturbing fact. As for why this is the case, the reasons vary, but I asked my friend Tony Fisher for his take on the matter. Fisher is the founder of DataFlux and the author

Figure 1.4 The Organizational Data Management Pyramid (levels: Excellent, Good, Average, Poor)


of The Data Asset: How Smart Companies Govern Their Data for Business Success. Fisher told me:

"The problem with data management in most organizations today is that they manage their data to support the needs of a specific application. While this may be beneficial in the context of any one application, it falls woefully short in supporting the needs of the entire enterprise. With more sources of data, more variety of data, and larger volumes of data, organizations will continue to struggle if they don't adopt a more contemporary and holistic mindset. They need to reorient themselves, aligning their data management practices with an organizational strategy instead of an application strategy."11

In other words, each department in an organization tends to emphasize its own data management and application needs. While this may seem to make sense for each department, on a broader level, this approach ultimately results in a great deal of organizational dysfunction. Many employees, departments, teams, groups, and organizations operate at a suboptimal level. Sadly, they often make routine and even strategic decisions without complete or accurate information. For instance, how can a VP of Sales make accurate sales forecasts when his organization lacks accurate master customer data? How can an HR Director make optimal recruiting decisions without knowing where her company's best and brightest come from? They can't, at least not easily. Far too many organizations struggle just trying to manage their structured data. (These are the ones too busy to even dabble with the other kinds of data discussed earlier in this chapter.) The results can be seen at individual employee levels. Because of poor data management, many employees continue to spend far too much time attempting to answer what should be relatively simple and straightforward business questions. Examples include
- HR: How many employees work here? Which skills do our employees have? Which ones are they lacking?
- Payroll: Are employees being paid accurately?
- Finance and Accounting: Which departments are exceeding their budgets?
- Sales: How many products do we sell? Which ones are selling better than others? Which customers are buying from us? How many customers in New York have bought from us in the past year?
- Supply Chain: What are our current inventory levels of key products and parts? When can we expect them to be replenished? Will we have enough inventory to meet current and future demand?
- Marketing: What's our company's market share? How has that changed from last year or last quarter?

For organizations with cultures that have embraced information-based decision-making, answers to questions like these can be derived quickly.

THE ENTERPRISE AND THE BRAVE NEW BIG DATA WORLD


The questions at the end of the previous section represent vast simplifications of many employees' jobs. I am certainly not implying that the entire responsibilities of employees and departments can be reduced to a few bullet points. The previous synopses only serve to illustrate the fact that bad data is downright confounding. Poor data management precludes many organizations and their employees from effectively blocking and tackling. Many cannot easily produce accurate financial reports, lists of customers, and the like. As a result, at these data-challenged companies, accurately answering broader and mission-critical questions like, How do we sell more products? is extremely difficult, if not impossible. On occasion, executives will try to address big-picture items like these by contracting external agencies or perhaps bringing in external consultants. As one of these folks, let me be blunt: we aren't magicians. While we hopefully bring unique perspectives, useful methodologies, and valuable skills to the table, we ultimately face the same data limitations as everyday employees. The pernicious effects of bad data have plagued organizations for decades, not to mention consultants like me. Inconsistent, duplicate, and incomplete data mystifies everyone. At a minimum, data issues


complicate the answering of both simple and more involved business questions. At worst, they make it impossible to address key issues. Now let's move to the other end of the spectrum. Consider the relatively few organizations that manage their data exceptionally well (Figure 1.4). I have found that, all else being equal, employees in companies at the top of the pyramid are more productive than those at the bottom. The former group can focus more of its energies on answering bigger, broader questions. They don't have to routinely massage or second-guess organizational data. For instance, it's easier for the head of HR to develop an effective succession plan when the organization knows exactly which employees have which skills. With impeccable customer and sales data, it's more feasible for the chief marketing officer (CMO) to optimize her company's marketing spend. Today, simple and traditional business questions still matter, as do their answers. The advent of Twitter and YouTube certainly did not obviate the need for organizations to effectively manage their structured and transactional data. In fact, doing so remains critical to running a successful enterprise. At the same time, though, Web 2.0 is a game-changer. We're not in Kansas anymore. It is no longer sufficient for organizations to focus exclusively on collecting, analyzing, and managing table-friendly data. Twitter, YouTube, Facebook, and their ilk only mean that, for any organization, structured data now only tells part of the story. Texts, tweets, social review sites like Yelp and Angie's List, Facebook likes, Google +1s, photos, blog posts, and viral videos collectively represent a new and important breed of cat. This data cannot be easily (if at all) stored, retrieved, and analyzed via standalone employee databases, large database tables, or often even traditional data warehouses.

The Data Disconnect


Today many organizations suffer a disconnect between new forms of data and old tools that handled, well, old types of data. Many employees cannot even begin to answer critical business questions made more complicated by the Big Data explosion. Examples include
- How do our customers feel about our products or customer service?
- What products would our customers consider buying?
- When is the best time of the year to launch a new product?
- What are people publicly saying about our latest commercial or brand?

Why the disconnect? Several reasons come to mind. First, let's discuss the elephant in the room. Many businesspeople don't think of tweets, blog posts, and Pinterest pins as data in the conventional sense, much less potentially valuable data. Why should they waste their time with such nonsense, especially when they continue to struggle with "real" data? Fewer than half of organizations currently collect and analyze data from social media, according to a recent IBM survey.12 To quote from Cool Hand Luke, "Some men you just can't reach." At least there's some good news: not everyone is in denial over Big Data. Some organizations and employees do get it, and this book will introduce you to many of them. Count among them the U.S. federal government. In March 2012, it formally recognized the power of, and need for, Big Data tools and programs.13 But there's no magic Big Data switch. Simply recognizing that Big Data matters does not mean that organizations can immediately take advantage of it, at least in any meaningful way. Many early Big Data zealots suffer from a different problem: they lack the requisite tools to effectively handle Big Data. When it comes to unstructured data, standard reports, ad hoc queries, and even many powerful data warehouses and business intelligence (BI) applications just don't cut it. They were simply not built to house, retrieve, and interpret Big Data. While these old stalwarts are far from dated, they cannot accommodate the vast technological changes of the past seven years, and the data generated by these changes. To paraphrase The Who, the old boss isn't the same as the new boss.

Big Tools and Big Opportunities


As the Chinese say, there is opportunity in chaos. While relatively recent, the rise of Big Data has hardly gone unnoticed. New applications and technologies allow organizations to take advantage of Big Data.


Equipped with these tools, organizations are deepening their understanding of essential business questions. But, as we'll see in this book, Big Data can do much, much more than answer even complex, predefined questions. Predictive analytics and sentiment analysis are not only providing insights into existing problems, but addressing unforeseen ones. In effect, they are suggesting new and important questions, as well as their answers. Through Big Data, organizations are identifying issues, trends, problems, and opportunities that human beings simply cannot. Unlocking the full power of Big Data is neither a weekend project nor a hackathon. To be successful here, organizations need to do far more than purchase an expensive new application and hire a team of consultants to deploy it. (In fact, Hadoop, one of today's most popular Big Data tools, is available for free download to anyone who wants it.) Rather, to succeed at Big Data, CXOs need to do several things:
- Recognize that the world has changed, and isn't changing back.
- Disavow themselves of antiquated mindsets.
- Realize that Big Data represents a big opportunity.
- Understand that existing tools like relational databases are insufficient to handle the explosion of unstructured data.
- Embrace new, Big Data-specific tools, and encourage employees to utilize and experiment with them.

The following chapter will make the compelling business case for organizations to embrace Big Data.

SUMMARY
This chapter has described the evolution of enterprise data and the arrival of the Data Deluge. It has distinguished among the different types of data: structured, semistructured, and unstructured. With respect to managing their structured data, most organizations in 2013 are doing only passable jobs at best. This squeaking by has rarely come quickly and easily. Today, there's a great deal more unstructured data than its structured equivalent (although there's still plenty of the latter). It's high time for organizations to do more with data beyond just keeping the lights on. There's a big opportunity with Big Data, but what exactly is it? Answering that question is the purpose of the next chapter, which characterizes Big Data.

NOTES

1. "The Data Deluge," February 25, 2012, www.economist.com/node/15579717, retrieved December 11, 2012.
2. Harris, Jim, "Data Management: The Next Generation," October 24, 2012, www.dataroundtable.com/?p=11582, retrieved December 11, 2012.
3. Rogers, Shawn, "Big Data Is Scaling BI and Analytics," September 1, 2011, www.information-management.com/issues/21_5/big-data-is-scaling-bi-and-analytics-10021093-1.html, retrieved December 11, 2012.
4. Grimes, Seth, "Unstructured Data and the 80 Percent Rule," copyright 2011, http://clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551, retrieved December 11, 2012.
5. Pascal, Fabian, "Unstructured Data: Why This Popular Term Is Really a Contradiction," September 19, 2012, www.allanalytics.com/author.asp?section_id=2386&doc_id=250980, retrieved December 11, 2012.
6. "What to Do About Unstructured Data," May 15, 2011, www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/, retrieved December 11, 2012.
7. Reisinger, Don, "Facebook Acquires Face.com for Undisclosed Sum," June 18, 2012, http://news.cnet.com/8301-1023_3-57455287-93/facebook-acquires-face.com-for-undisclosed-sum/, retrieved December 11, 2012.
8. Couts, Andrew, "Lytro: The Camera That Could Change Photography Forever," June 22, 2011, www.digitaltrends.com/photography/lytro-the-camera-that-could-change-photography-forever/, retrieved December 11, 2012.
9. Lacy, Sarah, "Lytro Launches to Transform Photography with $50M in Venture Funds (TCTV)," June 21, 2011, http://techcrunch.com/2011/06/21/lytro-launches-to-transform-photography-with-50m-in-venture-funds-tctv/, retrieved December 11, 2012.
10. Harris, Derrick, "Like Your Data Big? How About 5 Trillion Records?," January 4, 2012, http://gigaom.com/cloud/like-your-data-big-how-about-5-trillion-records/, retrieved December 11, 2012.
11. Personal conversation with Fisher, October 25, 2012.
12. Cohan, Peter, "Big Blue's Bet on Big Data," November 1, 2012, www.forbes.com/sites/petercohan/2012/11/01/big-blues-bet-on-big-data, retrieved December 11, 2012.
13. Kalil, Tom, "Big Data Is a Big Deal," March 29, 2012, www.whitehouse.gov/blog/2012/03/29/big-data-big-deal, retrieved December 11, 2012.
