
To develop a new approach to the visualization of multivariate datasets based on the scatterplot concept.

Faisal Salem Alsrheed Student Number: 200363485 MSc Computing and Management 2007/2008

Summary
There is a growing need for techniques that make it easier for people to discover relationships and hidden facts in large multivariate datasets. This project develops a new approach to the visualization of multivariate datasets based on the scatterplot concept. The Prototyping Model is followed as the methodology. A 3D scatterplot tool is designed and implemented, and five example datasets are used to demonstrate it. Two evaluation methods, a user evaluation and a case study, are then used to compare the prototype with an existing approach, namely scatterplot matrices in xmdvtool. Finally, the conclusion of the evaluation is drawn.

Acknowledgements
I would like to thank the following people for their advice and support during my work on the project: Professor Ken Brodlie, my project supervisor, for his help and guidance. Mr. Tony Jenkins, for his help with Python. Dr Vania Gatzeva Dimitrova, for her advice and feedback. Most of all, I would like to thank my family for their support and love.

Table of Contents
Summary
Acknowledgements
1. Understanding the problem
1.1 Introduction
1.2 The problem statement
1.3 Motivation
1.4 Related work
1.5 Background Reading
2. Preparation of Solution
2.1 Project methodology
2.2 Project schedule
3. Delivery of solution
3.1 Analysis
3.2 Identify Use-Case
3.3 Designing
3.4 Implementation
3.5 Demonstration
3.6 Achieving the minimum requirements
4. Evaluation
4.1 Evaluation Criteria
4.2 User evaluation
4.3 Case Study
4.4 Evaluation Conclusion
Further Work and Conclusion
References
Appendix A: Personal Reflection
Appendix B: Source Code Guide
Appendix C: Gantt chart for initial project plan
Appendix D: Gantt chart for revised project plan
Appendix E: Questionnaire for user evaluation
Appendix F: The Screening Questionnaire
Appendix G: The questionnaire scenario
Appendix H: The Interim Project Report

1. Understanding the problem


1.1 Introduction

1.1.1 Overall Project Aim

To develop a new approach to the visualization of multivariate datasets based on the scatterplot concept.

1.1.2 Project Objectives

To review the main techniques for multivariate visualization, to gain a general overview, and to review scatterplots in more detail.
To design a novel scatterplot approach that can be used for multivariate data.
To implement and evaluate a prototype of this approach.

1.1.3 Minimum Requirement

To design and implement a scatterplot tool that can visualize multivariate data with a large number (>10) of variates.
To demonstrate its application to 2 example datasets.
To compare this with an existing approach, such as scatterplot matrices in xmdvtool.

1.2 The problem statement

Large amounts of multivariate data are generated every day in application areas ranging from financial markets to medical and scientific research. Such large amounts of multivariate data are difficult for humans to analyze and explore directly [1].

For this reason, there is a growing need for techniques that make it easier for humans to discover relationships and hidden facts in large multivariate datasets [2].

1.3 Motivation

Visualization and exploration of a large multivariate dataset is a difficult problem to solve. Xie Z [3] regards it as a major challenge facing the visualization community.

There are several reasons why the visualization and exploration of large multivariate data is a difficult and challenging problem:

Large multivariate data sets are becoming increasingly common [3].
The majority of existing techniques lose their effectiveness when large multivariate data sets are visualized [1].
The number of screen pixels is limited [4].

The challenge of finding a suitable technique for the visualization and exploration of large multivariate data has been the general motivation behind this project.

1.4 Related work

In the past three decades, many visualization tools and techniques have been developed to visually explore multivariate data sets [1][5].

Wong and Bergeron [7] created a "multiresolution" display using "wavelet approximations". Their approach reduces the size of the data by merging the nearest points. Hierarchical structures are created using the wavelet transform, so that different levels of detail can be viewed interactively [1]. However, their approach works only for data sets with a natural order, for instance time-series data [5].

Another approach is suggested by Wegman and Luo [6]. Their approach allows the characteristics of the dataset to reveal themselves. They suggest over-plotting translucent data points or lines, so that sparse areas fade away whilst dense areas appear highlighted [5]. The disadvantage of this technique is that it depends on overlapping points or lines to reveal clusters. Clusters that do not contain overlapping points will not be visually highlighted [6][1][5].
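The effect of over-plotting translucent points can be illustrated with a small calculation. The following Python sketch assumes standard alpha blending, in which each point drawn with opacity alpha hides that fraction of whatever lies beneath it. This blending rule and the function name accumulated_opacity are illustrative assumptions and are not taken from [6].

```python
def accumulated_opacity(alpha, n):
    """Opacity seen at a pixel after n overlapping points are drawn,
    each with opacity alpha (0 = invisible, 1 = fully opaque).

    Assumes standard alpha blending: every layer lets a fraction
    (1 - alpha) of the background through, so n layers let
    (1 - alpha) ** n through.
    """
    return 1.0 - (1.0 - alpha) ** n
```

With an opacity of 0.1, a single point is barely visible, whereas fifty overlapping points are almost fully opaque. This is why dense areas stand out and isolated points fade, and also why a cluster whose points do not overlap is not highlighted.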

Keim et al. [8] developed a technique called "recursive pattern" for visualizing large amounts of multivariate data on a typical computer screen, based on recursive layout patterns. One limitation of their technique is that the number of records that can be visualized depends on the size of the screen area [8][5]. In addition, because their technique uses only one pixel per data value, it is not easy to convey the interaction between variables [8][5].

1.5 Background Reading

1.5.1 Information visualization


Information visualization, as Spence [9] defined it, "is the process of turning abstract data into a visual shape easily understood by the user, making it possible for him/her to generate new knowledge about the relationship between the data".

One of the reasons that information visualization is a key technique for the analysis and exploration of large multivariate data sets is, as Xie Z et al [2] mentioned, that it takes advantage of the great power and recognition ability of the human visual system.

1.5.2 Multivariate visualization

"Multivariate means that every data object has multiple attributes or parameters", as Prohaska et al [10] defined it. Multivariate data is also called n-dimensional or multidimensional [11].

In the past three decades, many multivariate visualization techniques have been developed, for example glyphs, parallel coordinates and scatterplot matrices.

1.5.3 Glyph

A glyph is "a graphical object consisting of one or more components, the attributes of which (position, orientation, size, shape, colour, transparency, etc.) are determined by one data point", as Kraus et al [12] defined it.

The glyph technique has some limitations:
The mapping can make it difficult to understand relationships between dimensions [11].
Relationships between non-adjacent components are hard to observe [11].
Screen space and resolution are limited. Visualizing too many glyphs at once leads to either overlaps or very small glyphs [11].
A large number of data dimensions can make it hard to distinguish individual dimensions [11].

1.5.4 Parallel coordinates

Parallel coordinates is a visualization technique invented in the 1980s [13]. In the parallel coordinates technique, as Fua et al [13] explained, "each data dimension is represented as a (horizontal or) vertical axis, and the N axes are organized as uniformly spaced lines. A data element in an N-dimensional space is mapped to a polyline that traverses across all of the axes crossing each axis at a position proportional to its value for that dimension".
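The mapping described in the quotation can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name to_polyline, the use of per-dimension minima and maxima for scaling, and the axis_spacing and axis_height parameters are hypothetical and are not taken from [13].

```python
def to_polyline(record, minima, maxima, axis_spacing=1.0, axis_height=1.0):
    """Map one N-dimensional record to the vertices of its polyline.

    Axis i is a vertical line at x = i * axis_spacing. The record crosses
    it at a height proportional to its value within [minima[i], maxima[i]].
    """
    vertices = []
    for i, (value, low, high) in enumerate(zip(record, minima, maxima)):
        span = high - low
        # A dimension with a single distinct value is drawn at mid-height.
        t = (value - low) / span if span else 0.5
        vertices.append((i * axis_spacing, t * axis_height))
    return vertices
```

Because every record produces one such polyline across all N axes, a large dataset quickly fills the space between the axes with overlapping lines.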

The parallel coordinates technique has some limitations:

Because every point creates a line, many points can lead to a mass of overlapping lines, especially with large datasets [14][15].

The relationships between non-adjacent dimensions are not easy to observe [14][15].

The number of dimensions is limited by the horizontal resolution of the computer monitor [14][15].

It is not easy to find structure or clusters, because the axes get closer together when large data is visualized [14][15].

1.5.5 Scatterplots

At a high level, a taxonomy for data visualization was proposed by Buja et al in 1996. The taxonomy divides data visualization into two categories: rendering and manipulation [16].

The rendering category is separated into three subtypes: scatterplots, traces and glyphs [16].

Most of the popular graphical devices used these days to present quantitative data, such as pie charts, bar charts and line graphs, were invented by William Playfair (1759-1823) [17]. All of these were basically one-dimensional (1D) [17].

The next important invention, and the first truly two-dimensional (2D) form, is the scatterplot. Friendly and Denis [17] considered the scatterplot the most flexible, polymorphic and useful invention in the entire history of statistical graphics, because it offers advantages over earlier graphic forms, for instance the ability to distinguish clusters, trends, patterns and relationships in a blur of points [17][18].

The scatterplot is one of the oldest and most frequently used techniques for multivariate data [19]. Scatterplots have been used since the early 1800s [20]. Friendly and Denis [17] defined a scatterplot as "a plot of two variables, x and y, measured independently to produce bivariate pairs (xi, yi), and are displayed as individual points on a coordinate grid typically defined by horizontal and vertical axes, where there is no necessary functional relations between x and y".

Scatterplots in 2D and 3D are useful visualization tools, used in many applications [21]. A scatterplot can show abstract data dimensions very efficiently, and provide a basic image of an object if fed with the correct data [22].

2D scatterplots are a well-known visualization tool for unstructured data [22]. A 2D scatterplot shows each data point in a dataset as one point on a plane whose axes correspond to two of the data dimensions [22]. If the data set has more than two dimensions, only two of them are used. Extra dimensions can be shown by using additional visual attributes, for instance colour, size or glyphs [22]. In a large dataset, many points may fall on the same pixel, and it is sometimes hard to tell how many. This makes it hard to judge the true distribution of the data from a scatterplot [22].
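As a hedged illustration of showing an extra dimension through colour, the following Python sketch converts the value of an additional variable into an RGB triple on a blue-to-red ramp. The function name value_to_rgb and the choice of ramp are assumptions made for illustration only. Each colour component lies in the range 0 to 1.

```python
def value_to_rgb(value, low, high):
    """Map a value in [low, high] to an (r, g, b) colour.

    The smallest value is drawn pure blue, the largest pure red,
    and values in between blend linearly from blue to red.
    """
    if high == low:
        t = 0.5  # constant dimension: use the middle of the ramp
    else:
        t = (value - low) / (high - low)
    t = max(0.0, min(1.0, t))  # clamp values that lie outside the range
    return (t, 0.0, 1.0 - t)
```

A point could then be positioned by two or three of its dimensions and coloured by a further one, at the cost of the reader having to interpret the colour scale.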

3D scatterplots can solve only a part of this difficulty. There is an extra dimension in which structures can be separated, and the overplotting problem will be reduced [22][23].
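The overplotting problem, and the way an extra dimension reduces it, can be sketched by counting how many points fall into each grid cell (a pixel in 2D, a small cube in 3D). The function name cell_occupancy, the grid resolution and the assumption that coordinates are already scaled to the range 0 to 1 are all illustrative.

```python
from collections import Counter


def cell_occupancy(points, resolution):
    """Count how many points fall into each cell of a regular grid.

    points: tuples of coordinates, each already scaled to [0, 1].
    resolution: number of cells along every axis.
    Works for 2D points (pixels) and 3D points (small cubes) alike.
    """
    counts = Counter()
    for point in points:
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        cell = tuple(min(int(c * resolution), resolution - 1) for c in point)
        counts[cell] += 1
    return counts


# Two points that differ only in their third coordinate.
points_3d = [(0.15, 0.15, 0.15), (0.15, 0.15, 0.95)]
points_2d = [p[:2] for p in points_3d]  # projection onto the X-Y plane
```

In the 2D projection both points share one cell, so they are drawn on top of each other. In 3D they occupy different cells and can be told apart, although points that coincide in all three plotted dimensions would still overlap.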


1.5.6 Python and VPython


Fangohr [24] defined Python as "an interpreted, interactive, object-oriented programming language. It is often compared to Tcl, Perl, Scheme or Java and combines remarkable power with very clear syntax".

Stephen et al [25] defined Visual Python (VPython) as "a 3-D graphics system that is an extension of the Python language. Its main usage has been in the area of demonstration of physical systems in Physics, Chemistry, and Engineering."

VPython was written in 2000 by Dave Scherer, under the supervision of Bruce Sherwood and Ruth Chabay. It was released under the GNU public license and is now available from the VPython website at http://www.vpython.org [25].

The VPython modules are written in C++ and built to fit into the Python environment. VPython is imported as a Python module and gives the user the ability to model three-dimensional scenes [25]. Rendering in VPython uses OpenGL, but it has only a limited dependence upon OpenGL [25].

2. Preparation of Solution
2.1 Project methodology.
The first stage in this research is to conduct a literature review, initially focusing on topics relating to the visualization of multivariate datasets. The second stage is to focus more closely on the visualization of multivariate datasets based on the scatterplot concept. The first and second stages provide background knowledge about the visualization software and techniques that have been developed to visually explore multivariate datasets.

The third stage is to follow one of the systems development methods (SDM), namely the Prototyping Model.

The Prototyping Model is suitable for this kind of project, as Roger [28] suggested, for the following reasons:
1. The system requirements are not clearly defined in advance.
2. A fundamentally new system will be created.
3. The developer is not confident in the system architecture.

One of the major advantages of using the Prototyping Model is that it allows the developer to start with requirements that are not clearly defined [28].

At the end of this stage, the prototype is ready for evaluation.

The fourth stage is to evaluate the prototype by using two methods:
1. User evaluation.
2. Case study.

At the end of this stage, the conclusion of the evaluation is drawn.

2.2 Project schedule.


2.2.1 Initial project plan:

Once the project's objectives had been confirmed, a project plan was drawn up to make sure the tasks would be done on time and in the correct order. According to the project's objectives, the tasks that need to be done are the following:

1. Planning.
2. Analysis.
3. Designing.
4. The first stage of implementation.
5. The second stage of implementation.
6. Testing.
7. Evaluation.

Please see Appendix C, Gantt chart for initial project plan. The objectives and the time of each task were estimated as follows:

Planning.
Time: From 01-02-2008 to 08-03-2008.
Objectives: Understanding the problem by identifying relevant literature and reviewing background reading. Finding out more about related work and the available visualization software and techniques that have been developed to visually explore multivariate datasets. Identifying the project methodology. Making the project plan and task schedule.

Analysis.
Time: From 08-03-2008 to 05-04-2008.
Objectives: Carrying out the technical feasibility study, to find a suitable technology and programming language to solve the problem. Identifying the possible use cases.

Designing.
Time: From 05-04-2008 to 05-05-2008.
Objective: Designing the user interface.

The first stage of implementation.
Time: From 19-05-2008 to 29-06-2008.
Objective: Writing Python code to visualize the dataset by using a 3D graphics module called Visual.

The second stage of implementation.
Time: From 29-06-2008 to 06-07-2008.
Objective: Writing Python code to build a graphical user interface (GUI).

Testing.
Time: From 06-07-2008 to 03-08-2008.
Objective: Testing the code to uncover and correct any errors in the software.

Evaluation.
Time: From 03-08-2008 to 17-08-2008.
Objectives: Specifying a set of criteria for evaluation. Carrying out a user evaluation.

Writing the report.
Time: From 08-06-2008 to 26-08-2008.
Objective: Completing the report on time.

2.2.2 Revised project plan:

Some changes were made to the initial project plan. Some tasks needed to be done faster, while others needed more time. Moreover, in order to manage the big tasks more easily, some of them were divided into smaller tasks.

Please see Appendix D, Gantt chart for revised project plan.

The objectives and the time of each task were estimated as follows:

Planning.
Time: From 06-02-2008 to 18-03-2008.
Objectives: Understanding the problem by identifying relevant literature and reviewing background reading. Finding out more about related work and the available visualization software and techniques that have been developed to visually explore multivariate datasets. Identifying the project methodology. Making the project plan and task schedule.

Analysis.
Time: From 18-03-2008 to 02-04-2008.
Objectives: Carrying out the technical feasibility study, to find a suitable technology and programming language to solve the problem. Identifying the possible use cases.

Designing.
Time: From 02-04-2008 to 05-05-2008.
Objective: Designing the user interface.

The first stage of implementation.
Time: From 05-05-2008 to 20-05-2008.
Objectives: Downloading and installing Python. Downloading and installing VPython. Converting the '.okc' dataset format to the '.csv' dataset format by using Microsoft Excel. Writing Python code to read the '.csv' dataset file. Writing Python code to visualize the dataset by using a 3D graphics module called Visual.

The second stage of implementation.
Time: From 20-05-2008 to 10-06-2008.
Objectives: Writing Python code to build a graphical user interface (GUI). Writing Python code to allow users to save their comments in the log file. Writing Python code to allow users to customize the scatterplot. Configuring the interaction features (zooming and rotation) to give the user better views.

Testing.
Time: From 10-06-2008 to 01-08-2008.
Objectives: Testing the code to uncover and correct any errors in the software. Making sure that Python, VPython, the dataset, the log file and the graphical user interface (GUI) work together smoothly without any problems.

Evaluation.
Time: From 01-08-2008 to 20-08-2008.
Objectives: Specifying a set of criteria for evaluation. Planning the user evaluation. Creating the user evaluation questionnaire. Selecting users for the evaluation. Planning the evaluation day. Analysing the evaluation results. Drawing the evaluation conclusion.

Writing the report.
Time: From 08-06-2008 to 27-08-2008.
Objectives: Writing the report. Finding someone to proofread the report. Checking the report format. Checking the references. Printing the report. Photocopying the report for the second copy. Burning the software to a CD for submission.

3. Delivery of solution
3.1 Analysis.
3.1.1 Technical Feasibility study

The prototype is built using the Python programming language. Python was chosen for the following reasons:

Object-oriented: Python is an object-oriented programming language, so the Python code in this project can easily be reused in any further development project.

Cross-platform: Python is a cross-platform programming language, so this software can run on virtually every major platform in use these days.

3D graphics: Python supports a 3D graphics module called Visual.

Internet scripting: Python comes with standard internet modules that allow Python code to run on both server and client, so in the future this software could be run as an internet application.

3.2 Identify Use-Case.

Figure 1: Use-Case

This is the use case model for the prototype. The use cases are explained as follows:

User:
1. The prototype will give the user the ability to visualize a scatterplot. This involves the user being able to select the X, Y and Z axes and to customize the scatterplot.

2. The prototype will give the user the ability to send a comment about the scatterplot to a log file.

3. The prototype will give the user the ability to view comments from the log file.

4. The prototype will give the user the ability to clear the comments from the log file.
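The three comment-related use cases can be sketched with plain file operations. The following Python code is a minimal sketch and not the prototype's actual source: the function names, the one-comment-per-line file layout and the log file path are assumptions made for illustration.

```python
import os


def send_comment(log_path, comment):
    """Append one comment to the log file (use case 2)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(comment.strip() + "\n")


def view_comments(log_path):
    """Return all stored comments, oldest first (use case 3)."""
    if not os.path.exists(log_path):
        return []  # nothing has been logged yet
    with open(log_path, encoding="utf-8") as log:
        return [line.rstrip("\n") for line in log if line.strip()]


def clear_comments(log_path):
    """Empty the log file while keeping the file itself (use case 4)."""
    open(log_path, "w", encoding="utf-8").close()
```

Opening the file in append mode means that sending a comment never overwrites earlier ones, while clearing simply truncates the file.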


3.3 Designing
3.3.1 Designing the user interface.
The next frame is the software's user interface. The user interface is designed to help the user complete a task in a logical series of steps. The frame allows the user to do the following:
To select the X, Y and Z axes.
To visualize the scatterplot.
To customize the scatterplot.
To send a comment.
To view comments.
To clear the comments.

3.3.2 Frame design

The following figure is the frame design.

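Figure 2: Frame design. The frame contains three pull-down lists (Choose the X Axis, Choose the Y Axis, Choose the Z Axis); a View group with the Show BOX, Show Data and Show X,Y,Z Axes check buttons; a Size spheres group with the Large, Medium and Small radio buttons; a Visualize button; and a Comment area with the Send Your Comment, View Comment History and Clear Comment History buttons.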

The following figure is a screenshot of the frame after it was implemented.

Figure 3 : User-interface Screen shot

3.3.3 State table

The following table is the state table of the frame.

Component        Title in source code   Visibility   Active   Text
pull-down list   Menu1                  Yes          Yes      Choose the x Axis
pull-down list   Menu2                  Yes          Yes      Choose the y Axis
pull-down list   Menu3                  Yes          Yes      Choose the Z Axis
Check button     showbox                Yes          No       Show BOX
Check button     Showdata               Yes          No       Show Data
Check button     showxyz                Yes          No       Show X,Y,Z Axes
Radio button     Large                  Yes          No       Large
Radio button     Medium                 Yes          No       Medium
Radio button     Small                  Yes          No       Small
Button           Visualize              Yes          Yes      visualize
Text Entry       Comment                Yes          Yes      -
Text box         comcom_txt             Yes          Yes      -
Button           display                Yes          Yes      Send Your Comment
Button           displayv               Yes          Yes      View Comment History
Button           displayc               Yes          Yes      Clear Comment History

3.4 Implementation
The software is implemented in two stages:

3.4.1 The first stage of implementation.


The first stage is done by completing the following steps:
1. Downloading and installing Python.
2. Downloading and installing VPython.
3. Converting the '.okc' dataset format to the '.csv' dataset format by using Microsoft Excel.
4. Writing Python code to read the '.csv' dataset file.
5. Writing Python code to visualize the dataset by using a 3D graphics module called Visual.
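Step 4, together with the selection of three dimensions for the X, Y and Z axes, might look like the following Python sketch. It is not the prototype's source code. It assumes that the '.csv' file has one header row with the dimension names and that every remaining value is numeric; the function names read_dataset and select_axes are hypothetical, and the sample rows are illustrative.

```python
import csv


def read_dataset(lines):
    """Parse CSV text into (header, rows).

    lines: any iterable of CSV lines, for example an open file.
    Assumes the first row holds the dimension names and all
    other values can be converted to float.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader if row]
    return header, rows


def select_axes(header, rows, x_name, y_name, z_name):
    """Return one (x, y, z) tuple per record for the three chosen dimensions."""
    ix, iy, iz = (header.index(name) for name in (x_name, y_name, z_name))
    return [(row[ix], row[iy], row[iz]) for row in rows]


# Illustrative sample in the assumed layout.
sample = [
    "Sepal_Length,Sepal_Width,Petal_Length",
    "5.1,3.5,1.4",
    "4.9,3.0,1.4",
]
header, rows = read_dataset(sample)
points = select_axes(header, rows, "Petal_Length", "Sepal_Length", "Sepal_Width")
```

Each resulting (x, y, z) tuple could then be passed to the 3D graphics module as the position of one sphere. A dataset containing non-numeric columns would need additional handling that this sketch does not attempt.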

3.4.2 The second stage of implementation.


The second stage is done by completing the following steps:

Writing Python code to build a graphical user interface (GUI).
Writing Python code to allow users to save their comments in the log file.
Writing Python code to allow users to customize the scatterplot.
Configuring the interaction features (zooming and rotation) to give the user better views.

3.5 Demonstration
The following five datasets are used to demonstrate the software's application: 1. Aaup Dataset. 2. Iris Dataset. 3. Netperf Dataset. 4. Webstats Dataset. 5. Out5d Dataset.

3.5.1 Aaup dataset

The following table gives more information about the dataset.

Name: AAUP
Dimensions: 14
Records: 1161
Description: Faculty salary data for the 1993-1994 school year.
Dimension description:
Type: (I, IIA, or IIB)
fp_sal: Average salary - full professors
ac_sal: Average salary - associate professors
at_sal: Average salary - assistant professors
to_sal: Average salary - all ranks
fp_com: Average compensation - full professors
ac_com: Average compensation - associate professors
at_com: Average compensation - assistant professors
to_com: Average compensation - all ranks
fp_#: Number of full professors
ac_#: Number of associate professors
at_#: Number of assistant professors
in_#: Number of instructors
to_#: Number of faculty - all ranks
Source: http://lib.stat.cmu.edu/datasets/colleges/ http://davis.wpi.edu/~xmdv/datasets/aaup.html

The following figure demonstrates the software's application by using the "aaup" dataset. The X axis is ac_sal: Average salary - associate professors. The Y axis is fp_com: Average compensation - full professors. The Z axis is ac_com: Average compensation - associate professors.

Figure 4: Visualizing Aaup dataset

3.5.2 Iris dataset.

The following table gives more information about the dataset.

Name: Iris
Dimensions: 4
Records: 150
Description: The Iris dataset consists of 150 samples from three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor).
Dimension description:
Petal_Width: Petal Width
Petal_Length: Petal Length
Sepal_Width: Sepal Width
Sepal_Length: Sepal Length
Source: http://lib.stat.cmu.edu/DASL/Datafiles/Fisher'sIris.html http://lib.stat.cmu.edu/DASL/Stories/Fisher'sIrises.html http://davis.wpi.edu/~xmdv/datasets/iris.html

The following figure demonstrates the software's application by using the "Iris" dataset. The X axis is Sepal_Length: Sepal Length. The Y axis is Sepal_Width: Sepal Width. The Z axis is Petal_Length: Petal Length.

Figure 5: Visualizing Iris dataset

3.5.3 Netperf Dataset


The following table gives more information about the dataset.

Name: Netperf
Dimensions: 6
Records: 179
Description: The dataset is taken from the Wireless Multimedia Streaming Lab. It describes the relationship between the application, network and link layer measurements of streaming video over a network.
Dimension description:
Signal_Strength (Physical)
RTT (Network) (Round Trip Time)
Lost_Rate (Network)
Bandwidth (Network)
Throughput1 (TCP)
Framerate (Application)
Source: http://davis.wpi.edu/~xmdv/datasets/netperf.html

The following figure demonstrates the software's application by using the "Netperf" dataset. The X axis is Signal_Strength. The Y axis is Round Trip Time. The Z axis is Framerate.

Figure 6: Visualizing Netperf dataset

3.5.4 Webstats Dataset.

The following table gives more information about the dataset.

Name: Webstats
Dimensions: 22
Records: 300
Description: The dataset is web statistics data.
Dimension list: Conn_s1.0, Conn_b1.0, Conn_s1.1, Conn_b1.1, Time_s1.0, Time_b1.0, Time_s1.1, Time_b1.1, retrieved, Size, mean_conntime, stdev_conntime, mean_reqtime, stdev_reqtime, servers, objects, size_ob1, Time_addit, ser_1.1, Burst_1.1, getall, http_vers
Source: http://davis.wpi.edu/~xmdv/datasets/webstats.html

The following figure demonstrates the software's application by using the "Webstats" dataset. The X axis is Conn_s1.1. The Y axis is Time_b1.0. The Z axis is Size.

Figure 7: Visualizing Webstats dataset

3.5.5 Out5d Dataset.


The following table gives more information about the dataset.

Name: Out5d
Dimensions: 5
Records: 16384
Description: Five-dimensional remotely sensed data, gathered from a 128*128 grid. The data covers an area in Australia.
Dimension description:
Spot: satellite
Mag: magnetics
Potas: potassium
Thor: thorium
Uran: uranium
Source: http://davis.wpi.edu/~xmdv/datasets/out5d.html

The following figure demonstrates the software's application by using the "Out5d" dataset. The X axis is Spot. The Y axis is Mag. The Z axis is Potas.

Figure 8: Visualizing Out5d dataset

3.6 Achieving the minimum requirements


All three minimum requirements have been achieved and, moreover, exceeded.

3.6.1 The first requirement:


The first requirement has been achieved. Moreover, it has been exceeded by producing a better-quality product. A scatterplot tool has been designed and implemented. It can visualize multivariate data with more than 10 variates (the Webstats example has 22) and large datasets (the Out5d example has 16384 records). In addition, an easy-to-use graphical user interface (GUI) has been built and interaction features (zooming and rotation) have been configured. The user can also save comments in the log file and customize the scatterplot.

3.6.2 The second requirement:


The second requirement has been achieved. Moreover, it has been exceeded by demonstrating the software's application to more than two example datasets: five example datasets are used for the demonstration.

3.6.3 The third requirement:


The third requirement has been achieved. Moreover, it has been exceeded by using two methods of evaluation to compare the prototype with an existing approach, namely scatterplot matrices in xmdvtool.

4. Evaluation

In this chapter, the prototype will be evaluated. First, a set of criteria will be specified. After that, two methods of evaluation will be used to evaluate the prototype: user evaluation and case study.

4.1 Evaluation Criteria


A set of criteria is needed in order to evaluate the success of the approach. These criteria cover a variety of aspects of the prototype. The following set of criteria is used to evaluate the prototype:

4.1.1 Functionality.

Functionality covers and evaluates the following aspects:
Showing the relationship between 2 variates.
Showing the relationship between 3 variates.
Showing the relationship between more than 3 variates.

The functionality criterion is chosen because it helps to find out how good the prototype is at showing the relationship between two or more variates. This is the most important criterion in this evaluation, because it evaluates the functionality of the new approach that this project develops.

4.1.2 Usability.

Usability covers and evaluates the following aspects:
How easy it is to use the prototype.
Whether the steps to complete a task follow a logical series.

The usability criterion is chosen because it helps to find out whether the prototype is easy or difficult to use. In this prototype, the user interface is designed to let users complete a task in a logical series of steps. For this reason, there is a need to find out whether the user interface design actually helps users to do so.

4.1.3 User interaction.

User interaction covers and evaluates the following aspects:
Using the zooming feature.
Using the rotation feature.
Customizing the scatterplot.

The user interaction criterion is chosen because it helps to find out whether the rotation and zooming features are helpful. It also helps to find out whether users find customizing the scatterplot helpful.

4.1.4 Capabilities.

Capabilities covers and evaluates the following aspects:
Response time for operations.
How reliable the prototype is.

The prototype creates a real-time 3D scene graph every time the user visualizes a scatterplot. For this reason, there is a need to evaluate the response time for operations, to find out whether there is any delay. In addition, the capabilities criterion is chosen because it helps to find out how reliable the prototype is.
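Response time could also be measured directly rather than only judged by users. The following Python sketch shows one possible way, using the standard library's high-resolution timer; the helper name timed is hypothetical and is not part of the prototype.

```python
import time


def timed(operation, *args, **kwargs):
    """Run operation(*args, **kwargs) and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time an arbitrary operation (here, summing a short list).
result, elapsed = timed(sum, [1, 2, 3])
```

Wrapping the function that builds the 3D scene in such a helper would give a response time for each dataset, which could complement the questionnaire scores.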

4.1.5 Learning.

Learning covers and evaluates the following aspects:
How easy it is to learn how to use the prototype.
The time needed to learn to use the prototype.

The learning criterion is chosen because it helps to find out whether it is easy or difficult to learn how to use the prototype.

4.2 User evaluation.

The user evaluation method is used to evaluate this prototype. It also helps to compare the prototype with Xmdvtool.

There are many methods for gathering data from users. In this project the questionnaire method is used, for the following reasons:
Saving time: There is no need to interview each user separately.
Standardization: The questionnaire offers a standardized data-gathering procedure. Human errors are reduced by using a well-constructed questionnaire.
Privacy: The feeling of anonymity encourages users to answer a questionnaire more honestly [26][27].

4.2.1 Creating the questionnaire.


The first page of the questionnaire is the cover letter. It explains to the users the purpose of this questionnaire. Also, it will remind them that the results will be confidential. In addition, the cover letter contains a clear set of Instructions explaining how to complete the questionnaire. The second page contains the questions. See Appendix E for the questionnaire.


4.2.2 Selection of users.


In order to qualify and select the users for the evaluation, a screening questionnaire is provided. The screening questionnaire contains two simple questions that help to establish the background of the users. Some of the screening questionnaires were administered over the phone and others via email. See Appendix F for the screening questionnaire.

The users who complete the questionnaire are a mixture of users who have experience of working with Xmdvtool and users who have no experience of working with Xmdvtool at all. The experienced users help more in evaluating the software in terms of functionality, interaction and capabilities, whilst the inexperienced users help more in evaluating the software in terms of learning and usability.

4.2.3 Evaluation day plan.

Every user is given two questionnaires with a scenario. The scenario explains the problem to solve and the tasks to complete. See Appendix G for the scenario.

The users helped to compare the prototype and Xmdvtool by completing the same questionnaire twice: once for Xmdvtool and once for the prototype.


4.2.4 Evaluation results.


Six users completed the evaluation questionnaires. The following table shows the results of the users' evaluation.

Criteria

1. Functionality
   Showing relationship between 2 variates.
   Showing relationship between 3 variates.
   Showing relationship between more than 3 variates.
2. Usability
   How easy to use the software.
   Is the user interface helping you to complete a task in a logical series?
3. User interaction
   Using the zooming and rotation features.
   Customizing the scatterplot.
4. Capabilities
   Response time for operations.
   How reliable is the software?
5. Learning
   Learning the software.
   The time to learn to use the software.

Result (Average)                      Scores           The Total
Xmdvtool scatterplot matrix (1-9)     8 6 5 8 8 7 7    72
Prototype 3D scatterplot (1-9)        5 8 6 6 8 8 8    74
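How a "Result (Average)" column and a "The Total" row of this kind can be derived from individual questionnaires is sketched below. This is only an illustration in present-day Python 3: the response values are hypothetical placeholders and not the data collected in this evaluation, the function names are invented for the example, and rounding each average to a whole number is an assumption.

```python
from statistics import mean

def average_scores(responses):
    """Average each question over all users, skipping 'NA' answers (None)."""
    averages = {}
    for question in responses[0]:
        answered = [r[question] for r in responses if r[question] is not None]
        # Rounding to a whole number is an assumption made for this sketch.
        averages[question] = round(mean(answered)) if answered else None
    return averages

def total_score(averages):
    """Sum the per-question averages, ignoring questions nobody answered."""
    return sum(value for value in averages.values() if value is not None)

# Hypothetical answers from three users on a 1-9 scale (None means NA).
responses = [
    {"1.1": 8, "1.2": 5, "2.1": 7},
    {"1.1": 9, "1.2": None, "2.1": 8},
    {"1.1": 7, "1.2": 7, "2.1": 6},
]

averages = average_scores(responses)
print(averages)
print(total_score(averages))
```

Treating "NA" as a missing value rather than as zero keeps a skipped question from lowering the average for that criterion.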


4.3 Case Study:

The Voyager Plasma dataset is used in this case study. The following table gives more information about the dataset.

Name: Voyager
Dimensions: 12
Records: 744
Description: NASA mission data taken from the Voyager 2 mission. The dataset is time-related and was obtained over one month, January 1995.
Dimension list: Date, Hour, S/C_Distance, S/C_Latitude, S/C_Longitude, BR_in_RTN, BT_in_RTN, BN_in_RTN, B_Magnitude, Plasma_Velocity, Plasma_Density, Plasma_Temperature
Source: http://starbrite.jpl.nasa.gov/pdsexplorer/index.jsp?selection=mission&msnname=VOYAGER
        http://davis.wpi.edu/~xmdv/datasets/voyager.html

There is clustering in the plasma velocity dataset. The task is to find out which of the scatterplot matrix and the 3D scatterplot helps more to explore the clustering. In order to compare the scatterplot matrix and the 3D scatterplot, the evaluation criteria are used.


4.3.1 Scatterplot matrix in Xmdvtool


The scatterplot matrix in Xmdvtool is used to find the clustering in the plasma velocity dataset, and it does so successfully. The clustering can be seen clearly when the scatterplot matrix is used to visualize the relationship between the "Date" dimension and the "Plasma_Velocity" dimension. See the next figure.

Figure 9: Clustering can be seen when the relationship between the "Date" dimension and the "Plasma_Velocity" dimension is visualized using Xmdvtool.

After using the scatterplot matrix in Xmdvtool to complete this task, the following was found:
- It was easy and fast to spot the clustering because all the dimensions can be displayed together.
- The zooming feature is not easy to use.

4.3.2 3D scatterplot in the prototype:

The 3D scatterplot in the prototype is used to find the clustering in the plasma velocity dataset, and it does so successfully. The clustering can be seen clearly when the 3D scatterplot is used to visualize the relationship between the "Date" dimension and the "Plasma_Velocity" dimension. See the next figure.

Figure 10: Clustering can be seen when the relationship between the "Date" dimension and the "Plasma_Velocity" dimension is visualized using the prototype.

After using the prototype to complete this task, the following was found:
- It was not easy or fast to find the clustering because only three dimensions can be displayed at a time.
- The zooming and rotation features are useful and easy to use.
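The difference in search effort between the two tools can be illustrated by counting views. A scatterplot matrix shows every pair of dimensions in one display, whereas a 3D scatterplot shows one triple of dimensions at a time. The sketch below is a plain counting exercise in present-day Python 3 using the dimension names of the Voyager dataset; it is not part of the prototype.

```python
from itertools import combinations

# The 12 dimensions of the Voyager dataset, as listed in the case study.
dimensions = [
    "Date", "Hour", "S/C_Distance", "S/C_Latitude", "S/C_Longitude",
    "BR_in_RTN", "BT_in_RTN", "BN_in_RTN", "B_Magnitude",
    "Plasma_Velocity", "Plasma_Density", "Plasma_Temperature",
]

# Distinct pairs: all of them are visible at once in a scatterplot matrix.
pairs = list(combinations(dimensions, 2))

# Distinct triples: a 3D scatterplot displays exactly one of these at a time.
triples = list(combinations(dimensions, 3))

# Triples that include the dimension in which the clustering appears.
velocity_triples = [t for t in triples if "Plasma_Velocity" in t]

print(len(pairs))             # 66
print(len(triples))           # 220
print(len(velocity_triples))  # 55
```

With 12 dimensions there are 66 pairs but 220 triples, so a user who does not already know which dimensions to select may have to try many 3D views before the clustering becomes visible.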

4.3.3 Case Study Evaluation results:


The following table shows the results of the users' evaluation.

Criteria

1. Functionality
   Showing relationship between 2 variates.
   Showing relationship between 3 variates.
   Showing relationship between more than 3 variates.
2. Usability
   How easy to use the software.
   Is the user interface helping you to complete a task in a logical series?
3. User interaction
   Using the zooming and rotation features.
   Customizing the scatterplot.
4. Capabilities
   Response time for operations.
   How reliable is the software?
5. Learning
   Learning the software.
   The time to learn to use the software.

Result                                Scores         The Total
Xmdvtool scatterplot matrix (1-9)     4 5 8 8 7 7    69
Prototype 3D scatterplot (1-9)        8 5 5 8 8 8    70


4.4 Evaluation Conclusion

In terms of showing the relationship between two variates, the results indicate that the scatterplot matrix is slightly better than the 3D scatterplot. On the other hand, in terms of showing the relationship between 3 variates, the results show that the 3D scatterplot is much better than the scatterplot matrix.

In addition, the results show that both the scatterplot matrix and the 3D scatterplot are not quite good enough for showing relationships between more than 3 variates.

In terms of usability, the results show that both the scatterplot matrix and the 3D scatterplot are easy to use, and that both tools have a user-friendly interface. However, the results show that some users prefer the 3D scatterplot interface because it is designed to help the users complete a task in a logical manner.

In terms of the user interaction, the results show that the 3D scatterplot is much better than the scatterplot matrix because the 3D zooming and rotation features give the user the ability to view the scatterplot from different angles.

In terms of learning and capabilities, the results show that both the scatterplot matrix and the 3D scatterplot are reliable and quick and easy to learn. However, because the 3D scatterplot software creates a real-time 3D scene graph every time the users visualize the scatterplot, some users noticed delays in the response time for operations.


5 Further Work and Conclusion


5.1 Further Work

Possible additions to this project are listed below:

- Developing the software into an internet application.
- Developing the software so that the user can visualize more than one 3D scatterplot at the same time.
- Developing a dynamic user interface.
- Developing the software to open and read most dataset file formats.

5.2 Conclusion

This project was to develop a new approach to the visualization of multivariate datasets based on the scatterplot concept. The Prototyping Model was followed as the methodology. The 3D scatterplot software was designed and implemented. Then, five example datasets were used for the demonstration. After that, two methods of evaluation were used, User-evaluation and a Case-study, to compare the prototype with an existing approach, namely scatterplot matrices in Xmdvtool.


References
[1] Jing Yang, Matthew O. Ward and Elke A. Rundensteiner, (2002), Interactive Hierarchical Displays: A General Framework for Visualization and Exploration of Large Multivariate Data Sets, Computers and Graphics Journal, Vol. 27, pp. 265-283.

[2] Xie Z., Huang S., Ward M. and Rundensteiner E., (2006), Exploratory Visualization of Multivariate Data with Variable Quality, IEEE Symposium on VAST, pp. 183-190.

[3] Zaixian Xie, (2007), Towards Exploratory Visualization of Multivariate Streaming Data, IEEE Vis/InfoVis/VAST Doctoral Colloquium.

[4] Q.V. Nguyen and M.L. Huang, (2005), EncCon: an approach to constructing interactive visualization of large hierarchical data, Information Visualization, Vol. 4, No. 1, pp. 1-21.

[5] Y. Fua, M. Ward and E. Rundensteiner, (1999), Hierarchical parallel coordinates for exploration of large datasets, Proc. of Visualization 99, pp. 43-50.

[6] E. Wegman and Q. Luo, (1997), High dimensional clustering using parallel coordinates and the grand tour, Computing Science and Statistics, Vol. 28, pp. 361-8.

[7] Wong P.C. and Bergeron R.D., (1996), Multiresolution multidimensional wavelet brushing, Proceedings of Visualization 96, pp. 141-8.

[8] Keim D.A., Kriegel H.P. and Ankerst M., (1995), Recursive pattern: a technique for visualizing very large amounts of data, Proceedings of Visualization 95, pp. 279-86.

[9] R. Spence, (2007), Information Visualization: Design for Interaction, 2nd edition, ACM Press Books.

[10] Prohaska G., Aigner W. and Miksch S., (2007), Glyphs and Visualization of Multivariate Data, No. Asgaard-TR-2007-2, Vienna University of Technology.

[11] M.O. Ward, (2002), A taxonomy of glyph placement strategies for multidimensional data visualization, Information Visualization, 1, pp. 194-210.

[12] Kraus M. and Ertl T., (2001), Interactive Data Exploration with Customized Glyphs, In Proc. of WSCG, Visualization and Interactive Systems Group, University of Stuttgart, pp. 20-23.

[13] Ying-Huey Fua, Matthew O. Ward and Elke A. Rundensteiner, (1999), Hierarchical Parallel Coordinates for Exploration of Large Datasets, IEEE Visualization, pp. 43-50.

[14] Inselberg A. and Dimsdale B., (1990), Parallel coordinates: a tool for visualizing multidimensional geometry, Proceedings of Visualization '90, pp. 361-378.

[15] M. Ward, (1994), Xmdvtool: Integrating multiple methods for visualizing multivariate data, Proc. of Visualization '94, pp. 326-33.

[16] A. Buja, D. Cook and D. Swayne, (1996), Interactive high-dimensional data visualization, Journal of Computational and Graphical Statistics, 5(1), pp. 78-99.

[17] Friendly M. and Denis D., (2005), The early origins and development of the scatterplot, Journal of the History of the Behavioral Sciences, 41(2), pp. 103-130.

[18] Cleveland W.S. and McGill R., (1984), The many faces of a scatterplot, Journal of the American Statistical Association, Vol. 79, pp. 807-822.

[19] R. Becker and W. Cleveland, (1987), Brushing scatterplots, Technometrics, 29(2), pp. 127-142.

[20] Thorsten Büring, Jens Gerken and Harald Reiterer, (2006), User Interaction with Scatterplots on Small Screens - A Comparative Evaluation of Geometric-Semantic Zoom and Fisheye Distortion, IEEE Transactions on Visualization and Computer Graphics, 12(5), pp. 829-835.

[21] Harald Piringer, Robert Kosara and Helwig Hauser, (2004), Interactive Focus+Context Visualization with Linked 2D/3D Scatterplots, Proceedings of the 2nd International Conference on Coordinated & Multiple Views in Exploratory Visualization (CMV), pp. 49-60.

[22] R. Kosara, G. Sahling and H. Hauser, (2004), Linking scientific and information visualization with interactive 3D scatterplots, In Proceedings of the 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), pp. 133-140.

[23] G. Reina and T. Ertl, (2004), Volume visualization and visual queries for large high-dimensional datasets, In Joint Eurographics IEEE TCVG Symposium on Visualization, pp. 255-260.

[24] Fangohr H., (2006), Exploiting real-time 3d visualisation to enthuse students: A case study of using visual python in engineering, In ICCS 2006: 6th International Conference, pp. 139-146.

[25] Stephen R, Henry G, Shaun P and Linda S, (2004), Teaching Computational Science Using VPython and Virtual Reality, International Conference on Computational Science, pp. 1218-1225.

[26] Rubin, Jeffrey, (1994), Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests, New York: John Wiley and Sons.

[27] Ian Brace, (2004), Questionnaire Design: How to Plan, Structure and Write Survey Material for Effective Market Research, Kogan Page Publishers.

[28] Pressman, Roger S., (2000), Software Engineering: A Practitioner's Approach, McGraw-Hill.


Appendix A: Personal Reflection


This appendix lists the lessons that I have learnt from this project and reflects on my personal experiences and the skills I gained. It also offers some advice for students who might undertake a similar project.

Throughout this project there have been many learning opportunities, both in terms of technical skills, which were developed throughout the design and implementation, and other skills such as project management and requirements analysis techniques.

Many technical skills were gained; some of them are listed below:
- Downloading and installing Python and VPython.
- Writing code in which Python, VPython, the dataset, the log file and the graphical user interface (GUI) work together.
- Writing Python code to build a graphical user interface (GUI).
- Converting the '.okc' dataset format to the '.csv' dataset format using Microsoft Excel.
- Writing Python code to read the '.csv' dataset file.
- Writing Python code to visualize the dataset using a 3D graphics module called Visual.
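The two data-handling skills named in this list, reading a '.csv' dataset and preparing it for a 3D view, can be sketched as follows. This is a minimal illustration in present-day Python 3 and not the code on the CD: the header names, the three sample rows, the function names and the scaling of every axis onto [-1, 1] are all assumptions, and the rendering step with the Visual module is left out.

```python
import csv
import io

def read_columns(csv_file, names):
    """Read the named numeric columns from a CSV file object with a header row."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in names}
    for row in reader:
        for name in names:
            columns[name].append(float(row[name]))
    return columns

def scale(values, low=-1.0, high=1.0):
    """Linearly map values onto [low, high]; a constant column maps to the midpoint."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [(low + high) / 2.0 for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

# A tiny in-memory stand-in for a dataset file; header and rows are illustrative.
sample = io.StringIO(
    "Sepal_Length,Sepal_Width,Petal_Length,Petal_Width\n"
    "5.1,3.5,1.4,0.2\n"
    "7.0,3.2,4.7,1.4\n"
    "6.3,3.3,6.0,2.5\n"
)

# The three variables chosen for the x, y and z axes (an arbitrary choice here).
axes = ["Petal_Length", "Sepal_Length", "Petal_Width"]
columns = read_columns(sample, axes)
points = list(zip(*(scale(columns[name]) for name in axes)))
print(points)
```

Scaling each axis to a common range keeps variables with large values from dominating the 3D scene; each resulting tuple could then be handed to whatever 3D module draws the points.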

Other skills that I have gained are listed below:
- Using requirements analysis techniques.
- Self-learning skills.
- Time management skills.
- Making a flexible and realistic plan.


The lessons that I have learnt and my advice for students who might undertake a similar project are listed below:
- It is important to learn how to search for and select the relevant literature. Taking a workshop in the university's library or skills centre will be very useful.
- It is important to read the abstract before downloading an article; this will save a lot of time.
- It will be useful to have a look at some of the previous years' dissertations, but you have to be careful about the quality of the information.
- It is really important to back up your work regularly by different means, such as a CD, your personal space on the university server or your email.
- It is important to back up your error-free code before you edit it.
- An MSc project is a big project. You are going to read and download a lot of files and documents, so organizing them will help you to find what you are looking for much more quickly.
- Get the maximum benefit from the project supervisor's and the assessor's feedback.
- It is really important to make a flexible and realistic plan.
- It is important to start writing up the dissertation as early as you can.
- Creating the graphical user interface (GUI) is not as easy as people think; therefore, you should allow yourself enough time to design and build it.
- Using IDLE, the Integrated Development Environment that can be downloaded with Python, will make the Python programmer's job much easier.
- VPython depends on Python; therefore, Python should be installed before installing VPython.

Overall, it has been an enjoyable experience doing this project.


Appendix B: Source Code Guide


This appendix is a guide to the software found on the attached CD-ROM.

The CD contains the following files:
- 3dscatterplot.py is the software. It is a Python file. To run the software, just double-click it. A text editor, for example Notepad, can be used to view the source code.
- iris.csv is a comma-separated values file. It contains the dataset and is submitted with the software as an example dataset. Microsoft Excel or Notepad can be used to view the file.
- Comments.log is a text file where the comments are saved. A text editor can be used to view the file.
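How comments might be saved to a text log such as Comments.log is sketched below. The actual format used by 3dscatterplot.py is not described in this guide, so the tab-separated timestamp layout and the function names are assumptions made purely for illustration, written in present-day Python 3.

```python
import os
import tempfile
from datetime import datetime

def append_comment(log_path, comment):
    """Append one timestamped comment line to the log, creating the file if needed."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp}\t{comment}\n")

def read_comments(log_path):
    """Return the comment text of every line, without the timestamp."""
    with open(log_path, encoding="utf-8") as log:
        return [line.rstrip("\n").split("\t", 1)[1] for line in log]

# Write to a temporary directory so the sketch does not touch a real log file.
log_path = os.path.join(tempfile.mkdtemp(), "Comments.log")
append_comment(log_path, "Clustering visible between Date and Plasma_Velocity")
append_comment(log_path, "Rotate the view to check the third axis")
print(read_comments(log_path))
```

Opening the file in append mode means earlier comments are kept between sessions, which is what makes the file readable later in any text editor.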

The project was developed using Python 2.5 and VPython 2.5.1. To run the software, both must be installed, with Python installed before VPython.
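Whether the required modules are available can be checked before the software is started. The sketch below is written in present-day Python 3 rather than the Python 2.5 used for the project, and it assumes that the VPython release used here exposes its 3D graphics module under the name `visual`; the helper name `missing_modules` is invented for the example.

```python
import importlib.util
import sys

def missing_modules(names):
    """Return the top-level module names that cannot be found by the interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'csv' ships with Python; 'visual' is the assumed name of the VPython module.
required = ["csv", "visual"]

print("Python version:", sys.version_info[:2])
missing = missing_modules(required)
if missing:
    print("Install these before running the software:", ", ".join(missing))
else:
    print("All required modules were found.")
```

A check of this kind only reports whether a module can be located; it does not confirm that the installed version matches the one the software was developed with.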


Appendix C: Gantt chart for initial project plan.


Appendix D: Gantt chart for revised project plan.


Appendix E : Questionnaire for user- Evaluation.


The following is the questionnaire which was completed by the users.

Questionnaire for user-Evaluation

Identification number:
Name of the software:   Xmdvtool   prototype.

How long have you worked on this software?
__ less than 1 hour      __ 1 week to 1 month.
__ one hour to 1 day.    __ 1 month to 1 year.
__ 1 day to 1 week.      __ 1 year or more.

Please circle the number that reflects your impression about using this software. Not Applicable = NA.

Part 1: Functionality
1.1 Showing relationship between 2 variates.
    Difficult 1 2 3 4 5 6 7 8 9 Easy   NA
1.2 Showing relationship between 3 variates.
    Difficult 1 2 3 4 5 6 7 8 9 Easy   NA
1.3 Showing relationship between more than 3 variates.
    Difficult 1 2 3 4 5 6 7 8 9 Easy   NA

Part 2: Usability
2.1 Using the software.
    Difficult 1 2 3 4 5 6 7 8 9 Easy   NA
2.2 Does the user interface help you to complete a task in a logical series?
    Agree 1 2 3 4 5 6 7 8 9 Disagree   NA

Part 3: User interaction
3.1 Using the zooming and rotation features.
    Unhelpful 1 2 3 4 5 6 7 8 9 Helpful   NA
3.2 Customizing the scatterplot.
    Unhelpful 1 2 3 4 5 6 7 8 9 Helpful   NA

Part 4: Capabilities
4.1 Response time for operations.
    Slow 1 2 3 4 5 6 7 8 9 Fast   NA
4.2 How reliable is the software?
    Unreliable 1 2 3 4 5 6 7 8 9 Reliable   NA

Part 5: Learning
5.1 Learning the software.
    Difficult 1 2 3 4 5 6 7 8 9 Easy   NA
5.2 The time to learn to use the software.
    Slow 1 2 3 4 5 6 7 8 9 Fast   NA

Comments
Please write any comments you may have in the space below. If this is your second questionnaire, please compare Xmdvtool and the prototype.

Evaluation prepared by:

Date:


Appendix F : The Screening Questionnaire.


The Screening Questionnaire (for user-Evaluation)

1. Have you ever worked on a 2D or 3D scatterplot?
   Yes   No
2. How long have you worked on Xmdvtool software?
   __ Never.
   __ less than 1 hour
   __ one hour to 1 day.
   __ 1 day to 1 week.
   __ 1 week to 1 month.
   __ 1 month to 1 year.
   __ 1 year or more.

Completed by:
Date:


Appendix G : The questionnaire scenario.


The following is the questionnaire scenario that is given to users to complete the evaluation.

The questionnaire scenario

In this evaluation, you are going to use the "Iris" dataset. The Iris dataset contains 4 variables with 150 observations. The variables are the following:
1. Petal_Width.
2. Petal_Length.
3. Sepal_Width.
4. Sepal_Length.

Your assignment is to use the scatterplot matrix in Xmdvtool and then the 3D scatterplot in the prototype to complete the following tasks:
1. Visualize and try to explore a relationship between Petal_Length and Sepal_Length.
2. Visualize and try to explore a relationship between Petal_Length , Sepal_Length and Petal_Length.
3. Visualize and try to explore a relationship between all the 4 variables.
4. Visualize and try to use the interaction features to explore a relationship between Petal_Length and Sepal_Length.
5. Visualize Petal_Width, Petal_Length and Sepal_Width and try to customize your scatterplot.

Note: to use the interaction features in the prototype:
For the zooming feature: hold both the "right" and "left" mouse buttons and, at the same time, move the mouse forward to "zoom in" and backward to "zoom out".
For the rotation feature: hold the "right" mouse button and, at the same time, move the mouse right or left.

After finishing the tasks, please complete the questionnaires for Xmdvtool and the prototype.


Appendix H : The Interim Project Report .
