Sponsored By:
E-Guide
Big-data management has become one of the most talked-about trends in the IT industry, as organizations look to cope with the challenges of storing large data sets and mining them for nuggets of information that can potentially provide significant competitive advantages. Complicating matters is the fact that a big-data installation might include both structured transactional data from in-house systems and unstructured information from a variety of sources, including system logs, call detail records and social media sites such as Facebook and Twitter.

Distributed computing's role in big-data management

For example, clickstream data lets companies track what people are doing on the Web, from PCs as well as mobile devices. "That produces huge amounts of data," said Tony Iams, a vice president and senior analyst at Ideas International, a Rye Brook, N.Y.-based IT research firm. The benefit, Iams said, is that organizations can use that data to create a potentially much more accurate picture of user behavior than ever before. But the data needs to be properly structured and managed to make that possible.

Jill Dyche, a partner at Baseline Consulting Group in Sherman Oaks, Calif., said classifying data is a key first step in the big-data management process. "When we talk about it with clients, we very quickly move on to data classification," Dyche said at the Pacific Northwest BI Summit 2011 in Grants Pass, Ore. "So they're not just forklifting data onto a data warehouse platform or data marts but really looking at what the data is and how it's used."

Often, one of the defining characteristics of big data is that it's too large for a standalone database server to process efficiently. In addition, nontransactional data types such as Web logs and social media interactions ("the other big data," in the words of Gartner Inc. analyst Merv Adrian) aren't always a good fit for traditional relational databases.
As a result, many user organizations engaged in big-data management employ a distributed computing, or scale-out, model, often built around open source technologies such as Hadoop, MapReduce and NoSQL data stores. The distributed approach has worked well for Catalina Marketing, according to Williams. "This whole idea of grid computing, or connecting standardized PC-type appliances and making them work in concert, made all the sense in the world," he said. That's really what
has allowed us to scale to the size that we are and to do that very cost-effectively and efficiently.

Another strategy that Williams put in place is holding a monthly user group meeting designed partly to help Catalina keep its data warehouse appliances performing at an optimum level. Williams said the meetings are critical because they allow the IT staff to see how the needs of business users, and the queries they're looking to run, change over time. "We work with them to understand how they operate, what they're running and what their analytics are showing," he said, adding that the process enabled his team to recognize that the existing data structure and query parameters "weren't optimized to accommodate what [the users] needed." The data structure has now been modified to accommodate new types of queries, Williams said.

Big-data challenges call for management oversight

For some organizations, one of the biggest challenges associated with managing and analyzing super-large data sets is finding valuable information that can yield business benefits and deciding what data can be jettisoned. For example, UPMC, a Pittsburgh-based health care network with 20-plus hospitals and more than 50,000 employees, has seen its data stores grow by leaps and bounds in recent years, largely because workers are afraid to delete any information, according to William Costantini, associate director of the company's integrated operations center.

"The biggest issue right now is [figuring out] what do you purge and what can't you purge, because everybody is afraid of liability and being sued," Costantini said. "Everybody is afraid to throw anything out or get rid of it. At the same time, everybody wants to be budget-conscious and keep the size down."

Adding to the big-data challenges facing organizations is the increasing popularity of data sandboxes that enable data analysts to explore and experiment on subsets of information,
typically outside of a data warehouse. Companies need to keep a close watch on sandboxes to make sure that they don't end up with inconsistent stovepipes of data, analysts said.

In addition, the databases and Hadoop installations used to store nontransactional forms of big data are often set up by application developers working independently of the IT department. "This is being done by people outside the usual IT focus, with different tools," Adrian said at the Pacific Northwest BI Summit. "Managed is probably too generous a term." Gartner's take, he added, is that organizations able to integrate those data types into a coherent information management infrastructure will outperform businesses that can't.
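The scale-out, MapReduce-style processing described above splits work into a map step that runs independently on each node, a shuffle step that groups intermediate results by key, and a reduce step that aggregates each group. A minimal single-process sketch in Python illustrates the idea; the clickstream records and function names here are hypothetical, not taken from any product mentioned in the article:

```python
from collections import defaultdict

# Hypothetical clickstream records: (user_id, page) pairs.
CLICKS = [
    ("u1", "/home"), ("u2", "/offers"), ("u1", "/offers"),
    ("u3", "/home"), ("u2", "/offers"),
]

def map_phase(records):
    """Map step: emit a (key, 1) pair for every page view."""
    for _, page in records:
        yield page, 1

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework would across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts for each page."""
    return {page: sum(counts) for page, counts in groups.items()}

page_views = reduce_phase(shuffle(map_phase(CLICKS)))
print(page_views)  # {'/home': 2, '/offers': 3}
```

In a real cluster, the map and reduce steps run in parallel on many commodity machines, which is what makes the approach scale so cost-effectively.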
data streams coming from a variety of sources, including the constant stream of chatter on social media venues like Facebook and Twitter, daily Web log activity, Global Positioning System location data and machine-generated data produced by barcode readers, radio frequency identification scans and sensors.

Stamford, Conn.-based Gartner Inc. estimates that worldwide information volume is growing at an annual clip of 59%, and while there are hurdles around dealing with the volume and variety of data, there are equally big challenges related to velocity, or how fast the data can be processed to deliver any benefit to the business. This whole notion of extreme data management has put a strain on traditional data warehouse and BI systems, which are not well suited to handle the massive volume and velocity requirements of so-called big data applications, both economically and in terms of performance.

"There is a paradigm shift in terms of analytics as to how you use this finer-grained data to come up with more accurate or new types of analyses," said David Menninger, vice president and research director for Ventana Research Inc., a San Ramon, Calif.-based consultancy specializing in data management strategies.

Big data analytics, bigger data processing

Key to the paradigm shift is a host of new technologies designed to address the volume, variety and velocity challenges around big data analytics. At the heart of the new movement is massively parallel processing (MPP) database technology, which automatically splits database workloads across multiple commodity servers and runs them in parallel to garner significant performance boosts when working across extremely large data sets. Building on this core architecture are columnar databases, which store data in columns as opposed to rows, serving to shrink the amount of storage space required while greatly accelerating how quickly a user can ask questions of data and get results.
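The row-versus-column distinction above can be sketched in a few lines of Python. The table and field names are hypothetical; the point is that an analytic aggregate over one column only has to touch that column in the columnar layout, rather than every field of every record:

```python
# Row store: each record kept together; a column scan touches every field.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 200.0},
]

# Column store: each column kept contiguously; an aggregate reads one column.
columns = {
    "order_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 80.0, 200.0],
}

# Row-oriented aggregate: every record is visited in full.
total_rows = sum(r["amount"] for r in rows)

# Column-oriented aggregate: only the "amount" column is scanned,
# which is why columnar engines answer such queries quickly and
# why same-typed, contiguous columns compress so well.
total_cols = sum(columns["amount"])

print(total_rows, total_cols)  # 400.0 400.0
```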
In-database analytics, built specifically for large-scale analytics and BI workloads, is yet another technology that companies are evaluating to efficiently and economically provide fast query response and deliver access to larger amounts of detailed data. The same goes for data warehouse
appliances, an integrated and packaged set of server, storage and database technology tuned to the needs of large-scale data management applications.

While these capabilities tend to play off core RDBMS technology, there's a relative newcomer garnering a significant amount of attention when it comes to big data analytics. Hadoop, with roots in the open source community, is one of the most widely heralded new platforms for managing big data, particularly the flood of unstructured data like text, social media feeds and video. Along with its core distributed file system, Hadoop ushers in new technologies, including the MapReduce framework for processing large data sets on computer clusters, the Cassandra scalable multi-master database and the Hive data warehouse, among other emerging projects.

Hadoop attracts attention for big data uses

According to a new Ventana Research benchmark report, titled "Hadoop and Information Management," more than half of the 163 survey respondents (54%) are currently using or evaluating Hadoop as part of big data initiatives, driven by their desire to analyze data at a greater level of detail and perform analytics that couldn't be done previously on large volumes of data. However, despite this early wave of enthusiasm, the reality is that Hadoop is still relatively immature, and there are only a handful of packaged analytics tools available today designed to shield users from the inherent complexities of the open source technology.

While new technologies present a challenge to data warehouse professionals, the real issue for firms looking to make a go of big data analytics is to get past the emphasis on big volume or risk missing out on the broader opportunity for business. "The volume aspect has been around for a long time, but it's really more about extreme data," said Yvonne Genovese, a vice president and distinguished analyst at Gartner.
The ability to turn data into information and information into action is where IT and data warehouse executives can really make their mark, she said.
About the Author: Beth Stackpole is a freelance writer who has been covering the intersection of technology and business for 25-plus years for a variety of trade and business publications and websites.
About ParAccel
ParAccel's analytic platform is built from the ground up to provide the highest performance for the widest variety of analytic workloads. It includes more than 500 advanced in-database functions along with dynamic analytic and data integration capabilities. ParAccel's software-only approach enables physical and virtual deployment, either on premises or in public/private clouds.