Motivation

The process of knowledge discovery lies on a continuum ranging from human-driven (manual exploration) approaches to fully automatic data mining methods. As a hybrid approach, the emerging field of visual analytics aims to facilitate human-machine collaborative decision making by providing automated analysis of data via interactive visualizations. One area of interest in visual analytics is the development of data transformation methods that support visualization and analysis.

John W. Tukey pioneered the use of statistical graphics for data modeling in 1977 as an alternative to confirmatory methods, which relied heavily on statistical hypothesis testing. He named this new way of analyzing data Exploratory Data Analysis (EDA) to emphasize that these methods were aimed at helping people build hypotheses from the collected data, as opposed to confirmatory methods, which require pre-existing hypotheses to work with. In his seminal work [1], Tukey introduced many of today's well-known statistical graphs, such as scatterplot matrices, box-and-whisker plots and bubble charts. Tukey's vision of data analysis was impressive given that the computer technology available for creating graphics was very primitive compared to what we have today.

Since the days of manually exploring raw data through visualizations are long gone, the emphasis is now on developing automated methods that explore the space of possible hypotheses and present the users with the most useful ones with respect to some criteria. Here, we focus on the analysis of higher dimensional, labeled datasets through automatically generated visualizations. Two or three dimensional scatterplots have long been the most common visualization tools for multi-variate data. A scatterplot visualizes the data by projecting it onto any two (or three) variables at a time. By inspecting the visualization, users can draw conclusions about how the variables are related (e.g., correlated or not). If the data contains multiple groups (class labels), it is also possible to tell whether the projection reveals these distinct groups. However, it is generally unlikely that projecting onto just two (or three) of the original variables will reveal any interesting patterns. Many automatic feature extraction techniques have been developed over the years to transform the original dataset into a new one that reveals such patterns. For data visualization, feature extraction techniques replace the original variables with two (or three) new variables generated with respect to a pre-defined criterion. These methods are more commonly known as dimensionality reduction techniques.
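As a concrete illustration, principal component analysis (PCA) is one such technique: it replaces the original variables with two linear combinations chosen to maximize retained variance. The following minimal sketch (scikit-learn and the Wine dataset are illustrative choices, not prescribed by this work) projects a 13-dimensional dataset to 2D for scatterplot inspection; note that PCA's criterion is unsupervised and ignores the class labels, a point we return to below.

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Load a 13-dimensional, 3-class dataset and standardize the features.
    X, y = load_wine(return_X_y=True)
    X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

    # Inspect the projection as a scatterplot, coloring points by class label.
    plt.scatter(X2[:, 0], X2[:, 1], c=y)
    plt.xlabel("component 1")
    plt.ylabel("component 2")
    plt.show()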

It is also desirable to build predictive models that learn how to classify the observed data so that, when given unlabeled data from the same problem domain, they can successfully determine the class labels. Data classification is a large research area in the machine learning and data mining fields, and many algorithms have been proposed over the years. Given a data mining problem, two questions arise: 1) Should we try a number of classification algorithms and pick the one with the best performance on the dataset? or 2) Should we pick one classification algorithm and transform the data (extract features) in a way that increases the performance of that classifier? Generally, one of these two options is selected. Here, we pursue a joint approach that searches for the best feature representation/classifier pairs for a given data mining problem.

A Multi-Objective Architecture for Visual Classification

The joint process of feature extraction and classifier selection can be performed independently of the number of features to be extracted. Generally, there is no constraint on the number of extracted features, provided that classifier accuracy is optimized. The classification problem itself can be seen as feature extraction, where the higher dimensional dataset is transformed into a single dimension that identifies the class label of each data item. Visual classification can be seen as an extension of this idea, where the data is transformed into two dimensions that can be visualized as a scatterplot and classified using any classification algorithm.

The visualization generator component performs dimensionality reduction by transforming the data into two dimensions. Visual classification has the advantage that users can visually inspect the data and understand the class structures within it. A predictive model can then be developed on the transformed dataset. It is also highly desirable that the users be presented with the mathematical expressions that transform the data. The interpretability of these expressions is especially important for understanding how the visualization axes relate to the original features and which variables are important in identifying the distinct class structures in the generated visualization.

However, most dimensionality reduction schemes either do not generate explicit mapping functions, or the complexity of the feature transformation functions cannot be easily controlled. Moreover, a number of dimensionality reduction techniques do not consider the label information and therefore do not explicitly aim to reveal clearly separated class structures.

Considering all these shortcomings of the standard practice of data classification, we developed a flexible, multi-objective scheme for visual classification. The algorithm consists of two main components: an evolutionary computing based engine that creates data transformation functions mapping the original higher dimensional dataset into 2D visualizations, and a multi-objective assessment scheme that evaluates the quality of these transformations along three axes (a sketch of one possible assessment follows the list):

  • Classifiability: Performance of one or more classifiers on the two dimensional representation of the dataset generated using the feature transformation functions. Classifier selection is done by determining the best performing classifier on the transformed data.
  • Visual interpretability: The quality of the visualizations from the perspective of how well the class structures are presented in terms of clear separation and compactness of the individual groups.
  • Semantic interpretability of the extracted features: The complexity of the feature transformation expressions that map the original dataset into a 2D scatterplot. The complexity of these expressions is important in terms of providing the user with interpretable visualization axes.
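A minimal sketch of such a three-part assessment is given below. The concrete measures chosen here (cross-validated accuracy, the silhouette coefficient, and expression size) are illustrative stand-ins for the sketch, not necessarily the measures used in the actual system, and the function name is hypothetical.

    import numpy as np
    from sklearn.metrics import silhouette_score
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def assess_projection(X2d, y, expr_sizes):
        """Score a candidate 2D projection X2d of labeled data y.
        expr_sizes: node counts of the two transformation expressions
        (a proxy for semantic interpretability)."""
        # Classifiability: best cross-validated accuracy over a small
        # pool of classifiers (this doubles as classifier selection).
        pool = [KNeighborsClassifier(5), DecisionTreeClassifier()]
        classifiability = max(
            cross_val_score(clf, X2d, y, cv=5).mean() for clf in pool)

        # Visual interpretability: compactness/separation of the classes.
        visual = silhouette_score(X2d, y)

        # Semantic interpretability: penalize large expressions.
        semantic = 1.0 / (1.0 + float(np.mean(expr_sizes)))

        return classifiability, visual, semantic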

G3P: Genetic Programming Projection Pursuit

We implement the visual classification scheme described above using genetic programming. We call this algorithm Genetic Programming Projection Pursuit (G3P), since it uses genetic programming to evolve interesting views of the high dimensional dataset. The degree of interestingness is measured in terms of the multiple objectives given above.
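To make the idea concrete, the toy sketch below generates random pairs of arithmetic expression trees over the original features, projects the data to 2D with them, and keeps the pair that a k-nearest-neighbor classifier separates best. It is a hedged stand-in under stated assumptions (Wine dataset, kNN, random search): a real G3P run evolves the trees with selection, crossover and mutation against all three objectives rather than scoring random candidates on accuracy alone.

    import operator
    import random
    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    OPS = [operator.add, operator.sub, operator.mul]

    def random_expr(n_features, depth=2):
        """Build a random expression tree over the original features:
        nested (op, left, right) tuples with feature indices at the leaves."""
        if depth == 0 or random.random() < 0.3:
            return random.randrange(n_features)
        return (random.choice(OPS),
                random_expr(n_features, depth - 1),
                random_expr(n_features, depth - 1))

    def evaluate(expr, X):
        """Apply an expression tree to every row of X at once."""
        if isinstance(expr, int):
            return X[:, expr]
        op, left, right = expr
        return op(evaluate(left, X), evaluate(right, X))

    X, y = load_wine(return_X_y=True)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the features

    best_pair, best_score = None, -np.inf
    for _ in range(200):  # random search as a stand-in for GP evolution
        pair = (random_expr(X.shape[1]), random_expr(X.shape[1]))
        X2d = np.column_stack([evaluate(e, X) for e in pair])
        score = cross_val_score(KNeighborsClassifier(5), X2d, y, cv=5).mean()
        if score > best_score:
            best_pair, best_score = pair, score

    print("best kNN accuracy on a 2D view:", round(best_score, 3))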

Human Perception and Automated Measures of Interpretability of Data Transformations

Visual Analytics is a human-machine collaboration in decision making based on observed data. In most cases, it is impractical for humans to manually explore large amounts of data to discover hidden patterns that would be useful in decision making. Therefore, the duty of the machine is to provide automated data analysis in order to help the human, the decision maker. On the other hand, humans need to be able to interpret and comprehend the machine-generated models so that they can make informed decisions.

The best way to ensure that the machine returns interpretable models is to instruct it to search for such models. Since we focus on classification problems, we study interpretability within the framework of exploratory analysis of labeled data through visualizations. Two dimensional scatterplots have been among the most common types of visualizations since the early days of statistical graphs. Here, we consider scatterplots of higher dimensional datasets whose axes are generated by feature transformation functions mapping the original features (variables, attributes) into the two projected features.

We consider two important aspects of human interpretability of data projections for labeled data. Visual interpretability is concerned with how humans judge the quality of the views presented as 2D scatterplots. In the case of labeled data, visual interpretability closely relates to how easy it is to tell the members of each class apart by inspecting the scatterplot. Semantic interpretability is concerned with how the views are generated, namely the complexity of the projection functions that relate the projection axes to the original variables. A number of automated measures from machine learning and visual analytics can be used to assess visual interpretability. The goal of our study was to investigate the degree to which the automated measures of visual and semantic interpretability match human perception.

We designed a two-part experiment to study visual and semantic interpretability. In the first part, we showed the participants a number of scatterplots and asked them to rate these views. The participants were not given any background information about the data, since we aimed to investigate how they would rate the views independent of a specific domain. The second part of our experiment investigated how easily humans understand mathematical expressions of varying levels of complexity. We used a generic set of expressions to study interpretability independently of a specific domain. We chose to evaluate these two interrelated aspects of how users might judge the quality of data projections separately, rather than showing the visualizations along with the mathematical expressions transforming the original dataset into the displayed view. Such an experimental setting would impose another level of complexity on our study: the user judgement of a visualization would be affected by how they felt about the corresponding transformation function, and vice versa. It would therefore be difficult to isolate how they rated the visualization or the transformation function and to compare the ratings with the corresponding automated measures.

User Study on Interpretability of Visualizations

We developed computer software that automatically administered the data visualization experiment without investigator intervention. We recruited 20 participants (13 males and 7 females) who had completed or were pursuing graduate degrees in scientific fields such as computer science, physics, biology, engineering, accounting and psychology. At the beginning of the study, the participants were asked to fill out a brief questionnaire about their related coursework or experience. Fourteen of the participants specified that they had taken a statistics, data mining or machine learning course.

We chose four commonly used datasets from the data mining and visual analytics literature (table 3.1). These datasets were selected because they contain different numbers of classes, ranging from 2 to 9, which lets us investigate how the number and shape of the classes affect the relationship between human perception and the automated measures.

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 30 measurements characterizing malignant or benign tumors. The Wine dataset contains 13 attributes related to the chemical properties of wines from three different regions of Italy. The Segments dataset contains 19 features derived from images of seven kinds of scenes (brickface, sky, foliage, cement, window, path, grass). All three datasets were downloaded from the UCI Machine Learning Repository [3]. The Italian olive oils dataset contains the amounts of eight fatty acids in olive oils from nine different regions of the country (downloaded from [2]).

For a dataset with N attributes, there are N(N - 1)/2 unique attribute pairs, and each pair can be visualized as a scatterplot. In order to choose the visualizations for our experiment, we first generated all possible 2D scatterplots for each dataset and computed the values of the automated measures given in table 3.2. Our aim was to ensure a diverse set of visualizations with respect to the automated measures. We created five equi-width bins over the value range [0, 1]; then, for each bin, we selected the two scatterplots that appeared most frequently within that value range across all the automated measures. Upon completion of this process, a total of 40 scatterplots (10 for each dataset) were selected for inclusion in our experiments.
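The selection procedure can be sketched as follows; here `measures` stands in for the automated measures of table 3.2 (each assumed to score a labeled 2D view in [0, 1]), and the function name and signature are hypothetical.

    from collections import Counter
    from itertools import combinations
    import numpy as np

    def select_views(X, y, measures, n_bins=5, per_bin=2):
        """Pick a diverse set of attribute-pair views of (X, y).
        measures: functions scoring a labeled 2D view in [0, 1]."""
        counts = [Counter() for _ in range(n_bins)]
        for i, j in combinations(range(X.shape[1]), 2):
            view = X[:, [i, j]]
            for m in measures:
                score = float(np.clip(m(view, y), 0.0, 1.0))
                b = min(int(score * n_bins), n_bins - 1)  # equi-width bin
                counts[b][(i, j)] += 1
        chosen = []
        for c in counts:  # the two most frequent pairs in each bin
            chosen.extend(pair for pair, _ in c.most_common(per_bin))
        return chosen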

Before starting the study, the participants were told that they would be shown a series of scatterplots of datasets containing multiple groups, and that their task was to rate how good each view was by inspecting the visualization. We did not provide any directions on how to define the "goodness" of a view. Each scatterplot showed only the data points, colored by class label; no further information about the data (such as the name of the dataset or the names of the attributes) was provided. After reading the instructions, the participants were shown one scatterplot at a time and asked to rate it on a continuous scale from 0 (very good) to 1 (very bad).

Each participant rated a total of 45 scatterplots. Undisclosed to the participants, the first five scatterplots were artificial views showing levels of compactness and separation between the classes ranging from very good to very bad with respect to the automated measures. These visualizations served as calibration views to help the participants get used to the interface and build mental models of how they would rate the quality of a view. The participant ratings for these calibration views were not included in the analysis of the responses. The remaining pre-selected scatterplots were displayed to each participant in randomized order. In order to reduce the effect of outliers, we computed the median of the participant responses for each scatterplot and used this value in our comparisons to the automated measures.

Assessing the Quality of Scatterplot Visualizations Automatically

Lee et al. present a measure for exploratory projection pursuit of labeled data that is based on Fisher's linear discriminant analysis [4]. The VizRank algorithm proposed by Leban et al. searches for informative 2D projections of datasets, evaluated by a k-nearest neighbor classifier [5]. The authors claim almost perfect agreement between human judgement and the VizRank algorithm, based on a user study conducted on six datasets. Sips et al. propose two measures based on the notion of class consistency [6]: one based on the preservation of closeness to class centroids after projection, and another based on the entropies of the spatial distributions of the classes. The authors report a user study on a number of datasets with varying numbers of classes, and claim that their measures align with human judgement in that they find all views labeled as good by the participants. Tatu et al. propose two measures to evaluate the degree of separation in scatterplots of labeled data [7]. Tatu et al. also report a user study [8] comparing the four visual quality measures proposed in [6] and [7], and suggest that a combination of measures might be worth investigating.

Wrapper Methods (Using Classifiers)

Since the goal of a classifier is to tell the classes apart, high accuracy on the generated 2D data also indicates a good view of the data. Therefore, classifier accuracy can serve as a quality measure for scatterplots. However, classification algorithms have different underlying principles, so each algorithm displays a different decision boundary characteristic related to how it works. The following figure shows the decision boundaries of various classifiers on an example 2D point cloud.

The characteristics of each algorithm can be contrasted with how humans would draw the boundaries between the classes on the views. Some algorithms have linear boundary characteristics. The boundary generated by a support vector machine depends on the kernel it uses and can be linear or nonlinear. The boundaries produced by k-nearest neighbors are generally irregular, especially for small k.
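A minimal sketch of this wrapper idea, assuming scikit-learn and an illustrative pool of classifiers (the function name is hypothetical): the quality of a 2D view is simply the cross-validated accuracy a classifier attains on it, and different classifiers imply different boundary shapes, so their scores for the same view can differ.

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def wrapper_scores(X2d, y):
        """Score a labeled 2D view by cross-validated classifier accuracy."""
        pool = {
            "kNN (k=5)": KNeighborsClassifier(5),       # irregular boundaries
            "linear SVM": SVC(kernel="linear"),         # linear boundaries
            "RBF SVM": SVC(kernel="rbf"),               # nonlinear boundaries
            "decision tree": DecisionTreeClassifier(),  # axis-aligned splits
        }
        return {name: cross_val_score(clf, X2d, y, cv=5).mean()
                for name, clf in pool.items()}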

Cluster Validity Indices

Data clustering is a well-known machine learning problem: categorizing multi-dimensional data into natural groupings such that items in the same group are more similar to each other than to items from other groups. A number of methods have been proposed to quantify the quality of the outcome of clustering algorithms. In the case of 2D data, cluster validity indices can be used as measures of the interpretability of the visualizations. In this section, we discuss the three measures that were used in our experiments. The unifying theme of these cluster validity indices is that they all aim to measure the compactness and separation of the class structures using a distance measure, and that they are all susceptible to outliers.
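As one common example of such an index (an illustrative choice, not necessarily one of the three measures discussed here), the silhouette coefficient scores each point by comparing its mean distance to members of its own class against its mean distance to the nearest other class, so that values near 1 indicate compact, well-separated classes:

    from sklearn.metrics import silhouette_score

    def view_quality(X2d, y):
        """Mean silhouette over all points of a labeled 2D view.
        Uses Euclidean distance; like the indices discussed above,
        it is sensitive to outliers."""
        return silhouette_score(X2d, y, metric="euclidean")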

Demo of the User Study (VAST'11)

G3P Example: Wine Dataset

QUESTION: Which visualization would you choose?

Bibliography

[1] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[2] Italian olive oils dataset. http://www.ggobi.org/book/data/olive.csv.
[3] UCI Machine Learning Repository. www.ics.uci.edu/~mlearn.
[4] E.-K. Lee, D. Cook, S. Klinke, and T. Lumley. Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics, 14(4):831–846, December 2005.
[5] G. Leban, B. Zupan, G. Vidmar, and I. Bratko. VizRank: Data visualization guided by machine learning. Data Mining and Knowledge Discovery, 13:119–136, 2006. doi:10.1007/s10618-005-0031-5.