Escolar Documentos
Profissional Documentos
Cultura Documentos
Topic: Study of tools and techniques for advanced data analysis, business analytics, business intelligence and its applications in the financial domain
A. Introduction to Analytics
Data analytics is a fast growing field and there are many tools available in the market to serve the needs of organizations. The range of analytical software goes from relatively simple statistical tools in spreadsheets (ex-MS Excel, SPSS) to statistical software packages (ex-KXEN, Statistica) to sophisticated business intelligence suites (ex-SAS, Oracle, SAP, IBM among the big players). Open source tools like R and Weka are also gaining popularity. Besides these, companies develop in-house tools designed for specific purposes.
MS Excel: Almost every business user has access to MS Office suite and Excel. Excel is an excellent reporting and dash boarding tool. For most business projects, even if you run the heavy statistical analysis on different software but you will still end up using Excel for the reporting and presentation of results. While most people are aware of its excellent reporting and graphing abilities, excel can be a powerful analytic tool in the hands of an experienced user. Latest versions of Excel can handle tables with up to 1 million rows making it a powerful yet versatile tool.
IBM SPSS: With some major acquisitions in the last few years, IBM has suddenly become a major player in the analytics software market. While COGNOS is largely considered as a business intelligence tool, SPSS and IBM modeler (previously SPSS Clementine) are big players in this market. SPSS is more popular in the market research field than the business analytics field. IBM modeler is comparable to SAS e-miner both in terms of features as well as pricing and hence has the same pros and cons.
SAS: SAS is the 5000 pound gorilla of the analytics world and claims to be the largest independent vendor in the business intelligence market. It is the most commonly used software in the Indian analytics market despite its monopolistic pricing. SAS software has wide ranging capabilities from data management to advanced analytics.
SPSS Modeler (Clementine): SPSS Modeler is a data mining software tool by SPSS Inc., an IBM company. It was originally named SPSS Clementine. This tool has an intuitive GUI and its point-andclick modelling capabilities are very comprehensive.
Statistica: is a statistics and analytics software package developed by StatSoft. It provides data analysis, data management, data mining, and data visualization procedures. Statistica supports a wide variety of analytic techniques and is capable of meeting most needs of the business users. The GUI is not the most user-friendly and it may take a little more time to learn than some tools but it is a competitively priced product that is value for money.
Salford systems: provides a host of predictive analytics and data mining tools for businesses. The company specialises in classification and regression tree algorithms. Its MARS algorithm was
originally developed by world-renowned Stanford statistician and physicist, Jerome Friedman. The software is easy to use and learn.
KXEN: is one of the few companies that is driving automated analytics. Their products, largely based on algorithms developed by the Russian mathematician Vladimir Vapnik, are easy to use, fast and can work with large amounts of data. Some users may not like the fact that KXEN works like a black box and in most cases, it is difficult to understand and explain the results.
Angoss: Like Salford systems, Angoss has developed its products around classification and regression decision tree algorithms. The advantage of this is that the tools are easy to learn and use, and the results easy to understand and explain. The GUI is very user friendly and a lot of features have been added over the years to make this a powerful tool.
MATLAB: is a statistical computing software developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms and creation of user interfaces. There are many add-on toolboxes that extend MATLAB to specific areas of functionality, such as statistics, finance, image processing, bioinformatics, etc. Matlab is not a free software. However, there are clones like Octave and Scilab which are free and have similar functionality. Open Source Software
R: R is a programming language and software environment for statistical computing and graphics. The R language is an open source tool and is widely used by the academia. For business users, the programming language does represent a hurdle. However, there are many GUIs available that can sit on R and enhance its user-friendliness.
Weka: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software, developed at the University of Waikato, New Zealand. Weka, along with R, is amongst the most popular open source software used by the business community. The software is written in the Java language and contains a GUI for interacting with data files and producing visual results and graphs.
Both Excel and SPSS have a similar feel, with pull-down menus, a host of built-in statistical functions and a spreadsheet format for easy data entry. However, SPSS is specifically built for statistics and surpasses Excel in many ways, including:
Faster and easier basic function access like descriptive statistics (i.e. mean, standard deviation or median). While Excel does have built-in functions, SPSS has these basic statistics elements in pull down menus. Wider variety of graphs and charts in SPSS. Excel does have a wide range of basic charts, but if you want to create complex graphs like contingency tables, this is much easier in SPSS with the pulldown menus. Easier to find statistical tests. While Excel does have a wide range of statistical tests built-in, the pull-down menus in SPSS make for faster access.
Due to all the above advantages, SPSS would be studied in detail and its applications in the Financial Domain.
D. Introduction to SPSS
1. Opening a Data File
To open a data file: From the menus choose: File > Open > Data...
The data file is displayed in the Data Editor. In the Data Editor, if you put the mouse cursor on a variable name (the column headings), a more descriptive variable label is displayed (if a label has been defined for that variable).
By default, the actual data values are displayed. To display labels: From the menus choose: View > Value Labels
2. Running an Analysis
If you have any add-on options, the Analyze menu contains a list of reporting and statistical analysis categories. We will start by creating a simple frequency table (table of counts). This example requires the Statistics Base option.
From the menus choose: Analyze > Descriptive Statistics > Frequencies...
The Frequencies dialog box is displayed. An icon next to each variable provides information about data type and level of measurement.
3. Viewing Results
Results are displayed in the Viewer window.
You can quickly go to any item in the Viewer by selecting it in the outline pane. For example, you could click Income category in thousands [inccat].
The frequency table for income categories is displayed. This frequency table shows the number and percentage of people in each income category.
4. Creating Charts
Although some statistical procedures can create charts, you can also use the Graphs menu to create charts. For example, you can create a chart that shows the relationship between wireless telephone service and PDA (personal digital assistant) ownership.
From the menus choose: Graphs > Chart Builder... Click the Gallery tab (if it is not selected). Click Bar (if it is not selected). Drag the Clustered Bar icon onto the canvas, which is the large area above the Gallery. Scroll down the Variables list, right-click Wireless service [wireless], and then choose Nominal as its measurement level. Drag the Wireless service [wireless] variable to the x axis. Right-click Owns PDA [ownpda] and choose Nominal as its measurement level. Drag the Owns PDA [ownpda] variable to the cluster drop zone in the upper right corner of the canvas.
The bar chart is displayed in the Viewer. You can edit charts and tables by double-clicking them in the contents pane of the Viewer window, and you can copy and paste your results into other applications. Those topics will be covered later.
E. Reading Data
1. Basic Structure of IBM SPSS Statistics Data Files
IBM SPSS Statistics data files are organized by cases (rows) and variables (columns). Cases represent individual respondents to a survey. Variables represent responses to each question asked in the survey.
The data now appear in the Data Editor, with the column headings used as variable names. Since variable names can't contain spaces, the spaces from the original column headings have been removed. For example, Marital status in the Excel file becomes the variable Maritalstatus. The original column heading is retained as a variable label.
Any database that uses ODBC (Open Database Connectivity) drivers can be read directly after the drivers are installed. ODBC drivers for many database formats are supplied on the installation CD. Additional drivers can be obtained from third-party vendors. One of the most common database applications, Microsoft Access, is discussed in this example.
If you do not want to import all cases, you can import a subset of cases (for example, males older than 30), or you can import a random sample of cases from the data source. For large data sources, you may want to limit the number of cases to a small, representative sample to reduce the processing time. Field names are used to create variable names. If necessary, the names are converted to valid variable names. The original field names are preserved as variable labels. You can also change the variable names before importing the database.
The Text Import Wizard guides you through the process of defining how the specified text file should be interpreted.
3. Defining Data
In addition to defining data types, you can also define descriptive variable labels and value labels for variable names and data values. These descriptive labels are used in statistical reports and charts.
Type 0 in the Value field. Type Single in the Label field. Click Add to add this label to the list. Type 1 in the Value field, and type Married in the Label field. Click Add, and then click OK to save your changes and return to the Data Editor.
Type F in the Value field, and then type Female in the Label field. Type M in the Value field, and type Male in the Label field. Because string values are case sensitive, you should be consistent. A lowercase m is not the same as an uppercase M.