Interactive Data Visualization Using Mondrian
Theus, Martin;
2002-01-01 00:00:00
Martin Theus
University of Augsburg
Department of Computer-oriented Statistics and Data Analysis
Universitätsstr. 14, 86135 Augsburg, Germany

Abstract

This paper presents the Mondrian data visualization software. In addition to standard plots like histograms, barcharts, scatterplots or maps, Mondrian offers advanced plots for high dimensional categorical data (mosaic plots) and continuous data (parallel coordinates). All plots are linked and offer various interaction techniques. A special focus is on the seamless integration of categorical data. Unique to Mondrian is its special selection technique, which allows advanced selections in complex data sets.

Besides loading data from local (ASCII) files, Mondrian can connect to databases, avoiding a local copy of the data on the client machine.

Mondrian is written in 100% pure JAVA.

1. Introduction

Data visualization has been acknowledged as an important tool in decision support. But usually visualizations are static and just used for presentation rather than exploration. Interactive statistical data visualization is a powerful tool which reaches beyond the limits of static graphs.

Although there was a big research effort in interactive graphical statistics in the mid 80s, this data analysis tool has not become widely used. One reason might be that the systems designed by researchers 15 years ago (cf. [1]) needed extremely expensive hardware and a big effort in software development. Certainly times have changed since then, and any desktop computer is capable of graphics today. Furthermore, compatibility issues have become less important. Open source projects as well as platform independent programming languages like JAVA have made software more widely accessible.

Only a few software packages are not tailored towards one specific visualization task, like network visualization or 3d-imaging, but instead offer a variety of plots for a general analysis of data.

This paper lists Mondrian's special features and novel implementations. Concepts of how to utilize the interactive visualization tools for an advanced data analysis are presented as well.

Mondrian is freely available and, as a JAVA application, runs on almost any platform.

2. Smart Selections

The main task in interactive data visualization is the identification of patterns and subgroups. Thus selecting and identifying data is of major importance. This section introduces the special selection technique implemented in Mondrian.

2.1. The Progress in Selection Techniques

The way data are selected in interactive visualization software shows the steady advance of research results.

1. The standard way of selecting data is to select data and by doing so replace any other selection that might have been present. There is no way of refining a selection or selecting over different plots and/or variables. This standard selection technique is implemented e.g. in GGobi [7].

2. A more advanced way to handle selections is to allow the current selection to be combined with a new selection using boolean functions like and, or, xor, not. This allows the analyst to refine a selection step by step to drill down to a very specific subset of the data. DataDesk [11] implements this selection technique.

3. When dealing with a whole sequence of selections, it is often desirable to change a selection at an earlier stage, without having to redefine all preceding and successive selection steps. By storing the sequence of selections it is possible to make changes to any step in the sequence. Selection Sequences were first implemented in MANET [9].

4. Although a selection is always performed on the computer screen in the first place, i.e. in terms of screen coordinates, the data selection must be stored in terms of data coordinates.

The approach used by Mondrian keeps a list of any selection associated with a data set.
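As a rough illustration of such a selection list, the following Java sketch stores each step as a range on one variable in data coordinates, together with a selection mode, and recomputes the selected subset by processing the list from first to last step. It is not taken from Mondrian's source code: the class and method names (`SelectionSequence`, `Step`, `evaluate`) are hypothetical, one-dimensional ranges stand in for selection rectangles, and the reading of "not" as "remove from the current selection" is an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch of a selection sequence; not Mondrian's actual code.
class SelectionSequence {

    enum Mode { REPLACE, AND, OR, XOR, NOT }

    /** One selection step: a closed range on one variable, in data coordinates. */
    static final class Step {
        final double[] column;   // values of the variable the step refers to
        final double lo, hi;     // selected range in data coordinates
        final Mode mode;         // how this step combines with the steps before it

        Step(double[] column, double lo, double hi, Mode mode) {
            this.column = column;
            this.lo = lo;
            this.hi = hi;
            this.mode = mode;
        }

        /** Indices of all cases that fall inside this step's range. */
        BitSet hits() {
            BitSet b = new BitSet(column.length);
            for (int i = 0; i < column.length; i++) {
                if (column[i] >= lo && column[i] <= hi) b.set(i);
            }
            return b;
        }
    }

    /** Processes the whole list in order; any edit to the list just triggers a re-run. */
    static BitSet evaluate(List<Step> steps) {
        BitSet current = new BitSet();
        for (Step s : steps) {
            BitSet h = s.hits();
            switch (s.mode) {
                case REPLACE: current = h; break;
                case AND:     current.and(h); break;
                case OR:      current.or(h); break;
                case XOR:     current.xor(h); break;
                case NOT:     current.andNot(h); break; // assumption: remove hits
            }
        }
        return current;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {10, 20, 30, 40, 50, 60};
        List<Step> steps = new ArrayList<>();
        steps.add(new Step(x, 1, 2, Mode.REPLACE)); // S1
        steps.add(new Step(x, 5, 6, Mode.OR));      // S2
        steps.add(new Step(y, 20, 50, Mode.AND));   // S3
        // Evaluated step by step, i.e. (S1 OR S2) AND S3
        System.out.println(evaluate(steps));        // prints {1, 4}
    }
}
```

Because the steps hold data coordinates rather than pixels, changing the range or mode of an earlier step only requires calling `evaluate` again on the modified list.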
For each entry in the list, the selection area in screen coordinates and data coordinates, the selection step, the corresponding plot window and the selection mode (e.g. and, or, not) are stored. The currently selected subset of the data can then be determined by processing all elements of the list, no matter which kind of modification to the list was the reason for an update of the selection subset.

2.2. Selection Rectangles

Allowing multiple selections in a single window as well as across different windows makes a visual guide to the selections performed indispensable.

Figure 1. Selection Rectangles in Mondrian.

Mondrian introduces Selection Rectangles. Figure 1 gives an example of a scatterplot containing two selection rectangles. Selection rectangles indicate the area which was selected. An existing selection rectangle can be used as a brush by simply dragging the selection rectangle. The eight handles on the rectangle permit a flexible resizing of the rectangles. This enables various slicing techniques.

The selection mode can be changed via a pop-up menu. The deletion of a selection can be performed via this pop-up, too. An active (i.e. selected) selection can be deleted by simply pressing the backspace key. Only the active selection is plotted in black. All other selections are plotted in a lighter gray to make them less dominant in the plot.

Figure 2. Zooming in a map: the Selection Rectangle changes accordingly.

Since selections are stored in terms of the data coordinates, they are invariant to any alterations of a plot. Typical scenarios are interactive reordering of the axes in a parallel coordinate plot, flipping the axes in a scatterplot, or zooming. These operations automatically update the selection rectangles. The new screen coordinates of the selection rectangles are calculated from the data coordinates. Figure 2 shows how a selection rectangle reacts to a zoom inside a map.

The ability to handle more than one selection in one window is indispensable when dealing with parallel coordinates.

The way Mondrian handles selections is particularly useful when working with databases, since the selections translate easily into SQL code. At this point it is important to be sure about the precedence of boolean operators. Mondrian always performs selections sequentially, which is in most cases the way the user thinks. Thus an example selection S1 OR S2 AND S3 reads as (S1 OR S2) AND S3, ignoring the usual precedence of boolean operators, where AND has a higher precedence than OR. The WHERE-clause in an SQL-query is thus explicitly bracketed to ensure the sequential order of the operators.

The use of JAVA 2D would make the implementation of arbitrary shapes of a selection area relatively simple. Whereas this would allow very flexible selections, it is not obvious how a resizing of such a selection would look. A more structured approach to more general selection areas could be to allow a rhombus shape of the selection area, which can be resized at the four corner handles. The remaining four handles would then be used to enlarge or reduce the size of the rhombus as needed.

3. Conventions

Conventions are one of the keys to the success of a graphical user interface. Conventions enable the user to perform tasks without learning new, specific interactions. E.g. most graphical user interfaces allow for a change of the window size by dragging the lower right corner of the window. Once the user knows of this behavior he/she is able to resize a window no matter what application or operating system he/she uses. A brilliant collection of good and many bad examples of user interface design is given at http://www.iarchitects.com. A broader discussion of user interface design for interactive visualization software can be found in [8].

High interaction graphics with direct manipulation interfaces offer a lot of interactions. To ease the use of high interaction graphics it is necessary to gather the various interactions into different groups like queries, zooming, selection, reordering. Once these groups are identified, we can assign the various user interface interactions to them; e.g. shift-mouse-click, pop-up-trigger etc.

Inside Mondrian the following groups of interactions have been identified as crucial to perform steps in an interactive graphical data analysis.

Selections

– Creating a selection rectangle
  Click and drag.
– Brushing
  Click inside a selection rectangle and drag.
– Resize a selection (Slice)
  Click-drag a handle of a selection rectangle.
– Change the selection mode
  Shift-click inside the selection rectangle.

Queries

– Popup trigger on an object
  (i.e. right mouse button on most systems).

Alterations

– Zoom-out (-in)
  Meta-click (and drag).
– Change the plot settings
  Popup trigger on the plot background.
– Reorder objects in the plot
  Alternate-click on the object and drag to new position.

Obviously some of these interactions are identical in all plots (e.g. interactions with selections), and some depend on the plot-type. Whereas there is nothing to reorder in a map, you can reorder the categories in a barchart or the axes in a parallel coordinate plot. Given these conventions it is very easy to get used to all the different functions inside Mondrian.

4. Special Plots for High Dimensional Data

Although linking and highlighting across different plots can already increase the number of dimensions to look at simultaneously, it is very desirable to find visualizations which include many variables at a time. Mosaic plots for categorical data and parallel coordinate plots for continuous data are ideal for gaining insight into high dimensional data.

4.1. Parallel Coordinates/Boxplots

Parallel Coordinates are a powerful tool to analyze high dimensional data sets graphically. Since static representations of parallel coordinates are usually not very revealing, several interactive implementations arose very early. These implementations are restricted to very special computing environments and thus not easily accessible for most people. Probably the most advanced implementations are the ones of Inselberg [6] and Wegman [13].

Figure 3 shows a parallel coordinate plot of the Midwest data including no fewer than 14 variables, of which 13 are continuous and one categorical. In addition to the standard selection, highlighting and interrogation methods, parallel coordinates in Mondrian support the following features:

Coordinates can be rearranged manually to look at the most interesting adjacencies. Usually only a few adjacencies are of interest. (Footnote 1: Only (k+1)/2 permutations of the k variable axes are needed to display all possible adjacencies, cf. [13].)

Zooming is implemented for each axis individually. Since parallel coordinates become very cluttered with an increasing number of observations displayed, zooming can focus on a more detailed view of the variable. E.g. for the variable '% American-Indian-Eskimo-Aleut' it would be desirable to simply zoom in, in order to get rid of the outliers and see the shape of the distribution, i.e. the box of the boxplot.

Mondrian offers a special feature to plot categorical variables in parallel box/coordinate plots. Whereas most implementations only use the number coding of a categorical variable, Mondrian plots a stacked barchart, with left-to-right highlighting for each categorical variable. This display is consistent with all other plots representing counts. Additionally, lines can be displayed for the highlighted points in boxplot mode.

Figure 3. Parallel Coordinates for the Midwest data. Counties with high proportion of Asian-Pacifics are selected.

Figure 4. Parallel boxplots for the same data as in Figure 3. The categorical variable in the plot is shown as spineplot.

In Figure 4 the same data as shown in Figure 3 is displayed. Whereas in Figure 3 one cannot really find out how many counties are selected in each state, interrogating Figure 4 shows that the selected counties are mainly in Wisconsin, Illinois and Michigan.

Wills [14] gives an alternate method of incorporating categorical variables into parallel coordinates based on circle sizes, which is not compatible with the way counts are displayed in barcharts.

4.2. Mosaic Plots

Mosaic plots are a relatively new development. Recent implementations include a static version for S-Plus and R by Emerson [3] and an interactive version by Hofmann [4] within the MANET software.

Within Mondrian the four arrow keys can be used to flexibly reorder the variables in the plot and to include and exclude variables. Empty cells, which occur very often if the number of crossed categories is very high, are not subdivided on lower levels. In situations with many crossed variables this usually reduces the number of cells to draw drastically. To make empty cells visually more prominent, they are plotted in red.

Since Mondrian supports queries, there are no labels printed around a mosaic plot. With only a few variables in a mosaic plot, labels would fit around the plot. But more complex plots with, e.g., 8 binary variables would need twice as much space for the labels as for the plot itself. Figure 5 gives an example of a mosaic plot with a query. Besides the query, the name of the data set and the names of the variables in the plot are shown in the title bar of the plot window.

Figure 5. The Titanic Data in a Mosaic Plot.

A special feature of the mosaic plots inside Mondrian is the interactive graphical modeling of loglinear models, based on mosaic plots (cf. [10]).

Weighted Plots

Many data sets and most database queries present data in an already summarized form, i.e. a table. In Mondrian, mosaic plots as well as barcharts can handle data which is summarized, by specifying attribute variables and a count variable. Obviously any non-negative numeric variable can be used as a weight variable, which allows for very flexible plots, though these might be hard to interpret. A simple look up of values can be performed with barcharts by weighting case names by their values, as shown in Figure 7.

5. Working with Categorical Data

Mondrian can handle categorical variables in both ways, as non-informative number coding or as full text labels. It implements interactive barcharts and mosaic plots for analyzing categorical data. Neither plot is very revealing in a static setting, but both are very insightful in an interactive environment providing linked highlighting and interactive reordering of variables and categories.

The Housing Factors Example

The Housing Factors example will underline why interactivity is a key feature for a graphical exploration of categorical data. The data are taken from the investigations of Cox & Snell [2] and Venables [12] (cf. pp 155 and pp 226, respectively).

Data on the housing situation of 1681 tenants in Copenhagen has been classified according to:

Housing Type
  Apartments, Atrium House, Terraced House, Tower Block
Influence on the housing situation
  low, medium, high
Contact to other tenants
  low, high
Satisfaction with the housing situation
  low, medium, high

The data are distributed over all 72 cells, i.e. there are no empty cells. Table 1 lists the complete data set.

        Housing Factors         Housing Type
  Sat.    Infl.   Cont.    App.   Atr.   Terr.  Tower
  low     low     low       61     13     18     21
  low     low     high      78     20     57     14
  low     med     low       43      8     15     34
  low     med     high      48     10     31     17
  low     high    low       26      6      7     10
  low     high    high      15      7      5      3
  med     low     low       23      9      6     21
  med     low     high      46     23     23     19
  med     med     low       35      8     13     22
  med     med     high      45     22     21     23
  med     high    low       18      7      5     11
  med     high    high      25     10      6      5
  high    low     low       17     10      7     28
  high    low     high      43     20     13     37
  high    med     low       40     12     13     36
  high    med     high      86     24     13     40
  high    high    low       54      9     11     36
  high    high    high      62     21     13     23

Table 1. Cross-classification of 1681 tenants

Figure 6 shows the default barcharts and mosaic plot for the four variables. The cases with high satisfaction are selected, to mark the most interesting response. Obviously the ordering of at least two of the variables makes no sense, and the mosaic plot does not reveal any systematic pattern worth fitting a model for. The necessary steps to make the plots more insightful comprise:

Figure 6. The Housing Factors Data in default view.

– Sort the categories of Housing Type according to the relative amount of high satisfaction cases (via the plot-option pop-up). The plot has been switched to the Spineplot view, to make the sorting more obvious.

– Sort Influence and Satisfaction to: low, medium, high (via alt-click and drag).

– Reorder the variables in the mosaic plot such that the plot is conditioned upon the Housing Type, and put Influence, as a variable with many categories, at the deepest stage. The order is then: Housing Type, Contact, Influence. The reordering is done with the four arrow keys.

Certainly it is still hard to read the plots without the interactive queries. But in contrast to the default views, the reordered plots now reveal a clear pattern along with some deviations, which can now be investigated more closely using statistical models as well as other relevant information.

6. Special Features in Standard Plots

6.1. Barcharts

In Mondrian the layout of the bars in barcharts is chosen to be horizontal rather than vertical. This allows full-length printing of category names. The usual barchart view can be switched to a spineplot view (cf. Hummel [5]), so that the height, not the width, is proportional to the number of cases in a category. If the highlighting is still done from left to right, the highlighted proportions can then be compared directly.

When working with large data sets (10,000 to 50,000 cases are usually already sufficient) the number of categories will grow as well. No matter how big the screen or window is, we will encounter situations where we cannot see all bars/categories at the same time. Making barcharts scrollable allows the investigation of variables with dozens of categories.
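The spineplot geometry described in this section can be sketched in a few lines of Java. This is an illustrative sketch only, not Mondrian's implementation: the names `SpineplotLayout`, `Bar` and `layout` are hypothetical, and the counts in the example are made up. Each horizontal bar gets the full plot width, a height proportional to its category count, and a highlight drawn from the left whose width is the selected proportion within that category.

```java
// Hypothetical sketch of spineplot bar geometry; not Mondrian's actual code.
class SpineplotLayout {

    /** Geometry of one horizontal bar, in pixels. */
    static final class Bar {
        final double y;              // top edge of the bar
        final double height;         // proportional to the category count
        final double width;          // the same full width for every bar
        final double highlightWidth; // from the left: selected / count * width

        Bar(double y, double height, double width, double highlightWidth) {
            this.y = y;
            this.height = height;
            this.width = width;
            this.highlightWidth = highlightWidth;
        }
    }

    static Bar[] layout(int[] counts, int[] selected,
                        double plotWidth, double plotHeight) {
        long total = 0;
        for (int c : counts) total += c;

        Bar[] bars = new Bar[counts.length];
        double y = 0;
        for (int i = 0; i < counts.length; i++) {
            // Height encodes the category size ...
            double h = total == 0 ? 0 : plotHeight * counts[i] / total;
            // ... so the highlighted width is a pure proportion, comparable across bars.
            double hl = counts[i] == 0 ? 0 : plotWidth * selected[i] / counts[i];
            bars[i] = new Bar(y, h, plotWidth, hl);
            y += h;
        }
        return bars;
    }

    public static void main(String[] args) {
        // Made-up example: two categories with 100 and 300 cases,
        // of which 50 and 30 are currently selected.
        Bar[] bars = layout(new int[] {100, 300}, new int[] {50, 30}, 200, 400);
        for (Bar b : bars) {
            System.out.println(b.y + " " + b.height + " " + b.highlightWidth);
        }
    }
}
```

In the made-up example the smaller category has the larger highlighted share (50% against 10%), which is directly visible in a spineplot because all bars share the same width.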
Obviously the ordering of the categories then becomes very important. Mondrian offers four ways to order categories in barcharts.

1. Lexicographic Order
This is the default order, which is presented after the plot is constructed and displayed. This ordering is best for looking up categories.

2. Manual Order
Any current order can be changed by manually dragging a bar to its new position. This is useful if all automated sortings fail.

3. Absolute Size of Highlighting
This option sorts the categories according to the absolute number of selected cases in a category. Selecting all data allows for a sorting according to the absolute size of the categories.

4. Relative Size of Highlighting
This option sorts the categories according to the relative amount of highlighting in the categories. In the spineplot view this option nicely shows the ordering of the selected proportions.

A change of the order of the categories of a variable is automatically propagated to any other plot which holds information based on this variable, and that plot is updated instantaneously. This could be a mosaic plot, a parallel barchart or a choropleth map which is shaded according to the levels of the categorical variable.

Figure 7. An example of two linked barcharts.

Figure 7 shows data on the fortune 400 private persons in the US, taken from Forbes Magazine in 1996. The left barchart shows each individual weighted by their worth. The right barchart, showing the 50 US states, has been sorted according to the number of individuals in each state. California has been selected.

6.2. Histograms

The most crucial point in plotting a histogram is to choose the "right" origin of the first bin and the "right" number of bins. Since there exists a vast number of rules and suggestions on what "right" means under different assumptions, the most important interactive manipulation inside histograms is changing the origin and the width of the bins.

Figure 8. Histogram of the age distribution. Cases with more than 60K income are highlighted.

These parameters can be altered by using the four arrow keys (left and right move the origin; up and down change the bin width). Additionally, a popup-menu offers two sliders to set bin width and origin to whole numbers. This is especially useful when the user wants to set "pretty" ticks, i.e. multiples of 1, 2 or 5 to a power of 10.

In order to keep the visual distortion as small as possible, the scale of the histogram axis is not updated during the interactive reparametrization.

Just as barcharts can be switched to spineplots, histograms in Mondrian can be switched to the so-called spinogram view. In a spinogram all bars of the histogram are scaled to be of the same height and are plotted next to each other. Figures 8 and 9 show a corresponding pair of histogram and spinogram.

Figure 9. Spinogram of the age distribution.

Scatterplots

In contrast to most other plots in Mondrian, scatterplots offer axes, showing the maximum and minimum as basic orientation. Interrogation methods inside scatterplots operate on two levels. The first level is a simple overview of the position of the cursor, which is displayed by projections onto the x- and y-axes. This interrogation is invoked by simply pressing the control key. A <ctrl-click> invokes the second level of interrogation, cf. Figure 10.

Figure 10. Both levels of interrogation in a scatterplot.

A pop-up is presented with the data of the x- and y-variables for the closest point. By selecting variables in the main variables window, it is possible to specify the variables for which the pop-up will show the values. If more than one point is found at the same distance, a list of the cases is presented in the pop-up.

7. Conclusions

This paper is meant to encourage the reader to make use of interactive graphical software. Furthermore, writing such software in JAVA is easier than ever. JAVA is capable of all graphical displays we can think of. Carefully designed JAVA applications run fast enough on today's hardware and can compete with classical implementations. The platform independence allows for a much wider distribution than we are used to from former development environments.

Although Mondrian was never designed to be a general purpose graphical data analysis package, it already offers most standard plots. Furthermore, various features and ideas never implemented before are available.

Current development versions of Mondrian implement direct connections to databases. A general interface to databases via JDBC makes it possible to work on huge data sets, reaching far beyond current limits. Certainly display techniques must be adapted. Using alpha-channel transparency allows for plotting vast amounts of data without cluttering the screen.

In order to allow simple extensions to plots, like externally defined scatterplot smoothers, an interface to R is under development as well.

Download

Current versions of Mondrian can be downloaded at http://stats.math.uni-augsburg.de/Mondrian or http://www.theusRus.de/Mondrian. Versions for Windows and Mac OS X, which can be started with no further installations (given an installation of SUN's JDK 1.3 or higher), are provided. For all other platforms a JAR file is distributed.

The latest version covers blending techniques, implemented in scatterplots and parallel coordinates, which are not mentioned in this paper, to cope with very large datasets. The current development version implements the seamless integration of database connections.

Acknowledgments

The development of Mondrian started in 1997 at AT&T Shannon Labs. Further development will be carried out in Augsburg, as well as at other sites. Those who want to contribute to the development may contact the author.

References

[1] W. S. Cleveland and M. E. McGill. Dynamic Graphics for Statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.
[2] D. R. Cox and E. J. Snell. Applied Statistics: Principles and Examples. Chapman & Hall, London, 1991.
[3] J. Emerson. Mosaic displays in S-PLUS: A general implementation and case study. Statistical Computing & Statistical Graphics Newsletter, 9(1):17–23, 1998.
[4] H. Hofmann. Simpson on board the Titanic? Interactive methods for dealing with multivariate categorical data. Statistical Computing & Statistical Graphics Newsletter, 9(2):16–19, 1998.
[5] J. Hummel. Linked bar charts: Analysing categorical data graphically. Computational Statistics, 11(1):23–33, 1996.
[6] A. Inselberg. Visual data mining with parallel coordinates. Computational Statistics, 13(1):47–63, 1998.
[7] D. Swayne, D. Temple, A. Buja, and D. Cook. GGobi: XGobi redesigned and extended. In Proceedings of the 33rd Symposium on the Interface: Computing Science and Statistics.
[8] M. Theus. User interfaces of interactive statistical graphics software. In Proceedings of the 31st Symposium on the Interface: Computing Science and Statistics, 1999.
[9] M. Theus, H. Hofmann, and W. A. Selection sequences: interactive analysis of massive data sets. In Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, 1998.
[10] M. Theus and S. Lauer. Visualizing loglinear models. Journal of Computational and Graphical Statistics, 8(3):396–412, 1999.
[11] P. F. Velleman. DataDesk Version 6.0: Statistics Guide. Data Description Inc., Ithaca, NY, 1997.
[12] W. Venables and B. Ripley. Modern Applied Statistics with S-PLUS, 3rd Ed. Springer, New York, NY, 1999.
[13] E. J. Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85:664–675, 1990.
[14] G. Wills. A good, simple axis. Statistical Computing & Statistical Graphics Newsletter, 11(1):20–25, 2000.
Journal of Statistical Software