
Data Warehousing & Data Mining

- Sree Satya.K Rno:8 MBA

Data Warehousing
Definitions
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

A data warehouse is a copy of transaction data, specifically structured for query and analysis.

Data warehousing arises from an organization's need for reliable, consolidated, unique and integrated analysis and reporting of its data at different levels of aggregation.

A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated, which makes it much easier and more efficient to run queries over data that originally came from different sources. Making better business decisions quickly is key to succeeding in today's competitive marketplace. Understandably, organizations seeking to improve their decision making can be overwhelmed by the sheer volume and complexity of data available from their varied operational and production systems. Making this data available to a wide audience of business users is one of the most significant challenges for today's businesses. In response, Persys, Inc. has chosen the Microsoft SQL Server Data Warehousing Framework to build data warehouses and data marts.
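The idea of running one query over data that originally came from different sources can be sketched in a few lines. The following is a minimal illustration using Python's built-in sqlite3 module; the table names, source systems and sample rows are all invented for the example.

```python
import sqlite3

# A minimal sketch: consolidating rows from two hypothetical source systems
# (orders from an ERP, sales from a web shop) into one warehouse table,
# then querying the integrated data in a single pass.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging tables standing in for two heterogeneous operational sources.
cur.execute("CREATE TABLE erp_orders (product TEXT, amount REAL)")
cur.execute("CREATE TABLE web_sales (item TEXT, total REAL)")
cur.executemany("INSERT INTO erp_orders VALUES (?, ?)",
                [("widget", 120.0), ("gadget", 75.5)])
cur.executemany("INSERT INTO web_sales VALUES (?, ?)",
                [("widget", 40.0), ("gizmo", 19.9)])

# Integrate: map both sources onto one consolidated schema.
cur.execute("""
    CREATE TABLE warehouse_sales AS
    SELECT product AS product, amount AS revenue, 'erp' AS source
    FROM erp_orders
    UNION ALL
    SELECT item, total, 'web' FROM web_sales
""")

# One query now spans data that originated in different systems.
rows = cur.execute("""
    SELECT product, ROUND(SUM(revenue), 2) AS total_revenue
    FROM warehouse_sales GROUP BY product ORDER BY product
""").fetchall()
print(rows)  # revenue for "widget" is combined from both sources
```

In a real warehouse the staging tables would be loaded by an ETL process from the operational systems, but the principle is the same: once the data is integrated into a common schema, a single query replaces separate reports against each source.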

Microsoft implements two basic types of data warehouses: enterprise data warehouses and data marts. Each has its proponents, and each has its respective strengths and weaknesses.

Enterprise data warehouses: The enterprise data warehouse contains corporate-wide information integrated from multiple operational data sources for consolidated data analysis. Typically it is composed of several subject areas, such as customers, products and sales, and is used for both tactical and strategic decision making. An enterprise warehouse contains both detailed point-in-time data and summarized information, and can range from 50 gigabytes to more than one terabyte in total data size. Enterprise data warehouses can be very expensive and time-consuming to build and manage. They are usually created by centralized IS organizations from the top down.

Data marts: Data marts contain a subset of corporate-wide data that is built for use by an individual department or division of an organization. Unlike the enterprise warehouse, data marts are often built from the bottom up by departmental resources for a specific decision-support application or group of users. Data marts contain summarized and often detailed data about the subject area. The information in a data mart can be a subset of an enterprise warehouse (a dependent data mart) or, more likely, come directly from the operational data sources (an independent data mart).
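A dependent data mart is, in essence, a departmental subset carved out of the enterprise warehouse. The sketch below illustrates that relationship with sqlite3; the enterprise table, the region filter and the figures are hypothetical.

```python
import sqlite3

# Sketch of a dependent data mart: a department-level subset derived from
# an enterprise warehouse table. All names and figures are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE enterprise_sales
               (region TEXT, product TEXT, revenue REAL)""")
cur.executemany("INSERT INTO enterprise_sales VALUES (?, ?, ?)", [
    ("APAC", "widget", 100.0),
    ("EMEA", "widget", 250.0),
    ("APAC", "gadget", 80.0),
])

# Suppose one department only needs APAC figures, summarized by product:
# a dependent data mart built from the corporate-wide warehouse.
cur.execute("""
    CREATE TABLE apac_sales_mart AS
    SELECT product, SUM(revenue) AS revenue
    FROM enterprise_sales
    WHERE region = 'APAC'
    GROUP BY product
""")

mart = cur.execute(
    "SELECT product, revenue FROM apac_sales_mart ORDER BY product"
).fetchall()
print(mart)  # only the APAC subset, aggregated per product
```

An independent data mart would instead be loaded straight from the operational sources, skipping the enterprise warehouse entirely, which is faster to build but harder to keep consistent across departments.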

Steps involved in Data Warehousing Project Cycle


1. Requirement gathering
2. Physical environment setup
3. Data modeling
4. ETL
5. OLAP cube design
6. Front-end development
7. Report development
8. Performance tuning
9. Query optimization
10. Quality assurance
11. Rolling out to production
12. Production maintenance
13. Incremental enhancements

Sections of Data Warehousing


Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.

Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.

Deliverables: Typically, at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to clients.

Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them not so obvious. All of them are real.

Data Warehousing (Update)


The parallel data warehouse (PDW) edition of Microsoft's SQL Server 2008 R2 solution will soon see the light of day, appearing on HP's hardware next month. The new HP Enterprise Data Warehouse Appliance will be available sometime in December, Microsoft announced on Tuesday at the PASS (Professional Association for SQL Server) Summit event, which is being held this week in Seattle. The product will use Microsoft's SQL Server 2008 R2 Parallel Data Warehouse edition, formerly known by its "Madison" code name.

Data Mining
Definitions
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.

Data mining consists of five major elements:


1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data using application software.
5. Present the data in a useful format, such as a graph or table.
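The five elements above can be traced end to end in a toy pipeline. The following sketch uses Python and sqlite3; the source records, table layout and output format are invented for illustration only.

```python
import sqlite3

# 1. Extract raw transaction data (here, an in-memory stand-in for a
#    source system; prices arrive as strings, as is common in raw feeds).
raw = [("2024-01-03", "widget", "12.50"),
       ("2024-01-03", "gadget", "7.00"),
       ("2024-01-04", "widget", "12.50")]

# 2. Transform (parse the prices) and load into the warehouse store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(d, p, float(v)) for d, p, v in raw])

# 3. Provide access and 4. analyze: aggregate revenue per product.
summary = conn.execute("""
    SELECT product, SUM(price) FROM sales
    GROUP BY product ORDER BY product
""").fetchall()

# 5. Present the result in a useful format: a simple text table.
for product, revenue in summary:
    print(f"{product:<10} {revenue:8.2f}")
```

A production system would replace each step with heavier machinery (an ETL tool, a multidimensional OLAP store, a reporting front end), but the flow from raw transactions to a presented summary is the same.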

Challenges of Data Mining


- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

Data Mining Structure


The Mining Structure tab of Data Mining Designer displays a mining structure that you select in Solution Explorer. You can use this tab to modify mining structures that you create with the Data Mining Wizard. For more information about mining structures and other data mining objects, see Mining Structures (Analysis Services - Data Mining).

Mining Structure Panes

The Mining Structure tab is divided into the following panes: the Tree View pane and the Data Source View pane.

Behind the solution: St.George Bank


Teradata Warehouse powered by: Teradata Database V2R6.1; 4-node 5400 server and 1-node 5400 server disaster recovery system
Users: 280 ad hoc users
Storage: 4.7TB of data in the Group Data Warehouse (GDW); 1.19TB of space for disaster recovery
Operating System: UNIX MP-RAS
Teradata Utilities: Teradata Load Utilities
Tools/Applications: Products from Business Objects and SAS Institute

St.George Bank is Australia's fifth-largest retail bank and one of the top 20 publicly listed companies in Australia. Over the past five years, St.George Bank has experienced exceptional growth in revenue and profitability while increasing customer satisfaction and its share price. The bank's Group Information Systems (GIS) team manages St.George's data warehouse and business intelligence (BI) platform.

The bank created a GDW architecture to support a new 360GB EDW, one that included data from nearly all areas of St.George. The bank became more data-driven, using the data warehouse to support new initiatives, provide business enablers, and support its strategy and governance requirements. "Once we had that building block in place, we could do our regulatory and credit reporting because we know the numbers are complete and accurate," says Carter. "These APRA figures specify our assets, liabilities and risks to the regulators. That's the confidence we have in our [data] warehouse."

Further, St.George analysts are using the GDW to dig down into the data and deliver more granular levels of analysis. For example, one team is trying to determine the most effective number of times to contact a customer with a credit card offer. By tracking how many times a customer is contacted, after which contact the customer decides to accept the card, and how the customer rates in the use of the card, the team is working to understand which customers are most profitable and least risky to acquire.

Technical benefits of the Teradata Warehouse


Efficient use of resources: The multi-value compression feature provided with the Teradata Database V2R5 release helped St.George Bank save 10% of the space on its data warehouse when space was at a premium. "We saved over 120GB when our warehouse was at 1TB, so that was quite a savings," says Damian Plueckhahn, senior software advisor, project leader of the infrastructure team and database administrator (DBA). "We were able to keep running an extra six months on the same platform, which was nearing the end of its life, as we were preparing for our next upgrade."

Heavy-duty data processing, with no performance loss: Partitioned primary indexes helped the infrastructure team extend the transaction table from one year to three years. Being able to store this additional information was well received by users.

Safe environment for innovation: The Teradata Warehouse provides St.George Bank with a safe environment to try innovative ideas and applications. Other operational systems require a more stringent testing environment before new applications can go live. With Teradata, any mistake can be corrected without damage to the bank's data.

Manageability: Despite the thousands of jobs and tables in the St.George Group Data Warehouse (GDW) and its hundreds of users, it is managed by only one DBA. The bank was able to handle Australian Prudential Regulation Authority (APRA) compliance with two businesspeople and one IT person; other banks often have as many as 15 people on the project.
