95JCGS04\P0229------------------------------------------------------------
Guiding Data Analysts With Visual Statistical Strategies
Forrest W. Young & David J. Lubinsky
The concept of statistical strategy is introduced and used to
develop a structured graphical user interface for guiding data
analysis. The interface visually represents statistical
strategies that are designed by expert data analysts to guide
novices. The representation is an abstraction of the expert's
concepts of the essence of a data analysis. We argue that an
environment that visually guides and structures data analysis
will improve data analysis productivity, accuracy, accessibility,
and satisfaction in comparison to an environment without such
aids, especially for novice data analysts. Our concepts are
based on notions from cognitive science, and can be empirically
evaluated. The interface consists of two interacting windows --- the
guidemap and the workmap. Each window contains a graph that has
nodes and edges. The guidemap graph represents the statistical
strategy for a specific statistical task (such as describing data).
Nodes represent potential data analysis actions that can be taken
by the system. Edges represent potential actions that can be taken
by the analyst. The guidemap graph exists prior to the data
analysis session, having been created by an expert. The workmap
graph represents the complete history of all steps taken by the data
analyst. It is constructed during the data analysis session as a
result of the analyst's actions. Workmap nodes represent data sets,
data models, or data analysis procedures that have been created or
used by the analyst. Workmap edges represent the chronological
sequence of the analyst's actions. One workmap node is highlighted
to indicate which statistical object is the focus of the strategy.
We illustrate our concepts with ViSta, the Visual Statistics system
that we have developed.
Key Words: Artificial intelligence; Cognitive science; Hypertext;
Lisp-Stat; Program visualization; Statistical system design; ViSta;
Visual programming
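The workmap described above is, in essence, a directed graph that grows as the analyst works. A minimal sketch of that idea (hypothetical, not ViSta's actual Lisp-Stat implementation; the node kinds and step names are illustrative):

```python
# Hypothetical sketch of a workmap: a directed graph recording the
# chronological history of a data analysis, with one node highlighted
# as the current focus of the strategy. Not ViSta's actual code.

class WorkmapNode:
    def __init__(self, kind, name):
        self.kind = kind          # "data", "model", or "procedure"
        self.name = name

class Workmap:
    def __init__(self):
        self.nodes = []
        self.edges = []           # (parent_index, child_index) pairs
        self.focus = None         # index of the highlighted node

    def add_step(self, kind, name):
        """Record an analyst action and link it to the current focus."""
        self.nodes.append(WorkmapNode(kind, name))
        new = len(self.nodes) - 1
        if self.focus is not None:
            self.edges.append((self.focus, new))
        self.focus = new          # the new object becomes the focus
        return new

wm = Workmap()
wm.add_step("data", "car-ratings")
wm.add_step("procedure", "principal-components")
wm.add_step("model", "pca-model")
print([n.name for n in wm.nodes], wm.edges, wm.focus)
```

Each call records one analyst action; the edge list preserves the chronological sequence, and `focus` plays the role of the highlighted node.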
95JCGS04\P0251-----------------------------------------------------
Discussion of Guiding Data Analysts with Visual Statistical
Strategies
William DuMouchel & Thomas Lane
95JCGS04\P0257-----------------------------------------------------
Discussion of Guiding Data Analysts with Visual Statistical
Strategies
Daryl Pregibon
95JCGS04\P0259------------------------------------------------------
Rejoinder to Guiding Data Analysts with Visual Statistical
Strategies
Forrest W. Young
95JCGS04\P0261----------------------------------------------------
Dynamic Three-Dimensional Display of U.S. Air Traffic
William F. Eddy & Shingo Oue
We describe methods developed to interpolate and project flight
paths of aircraft in controlled airspace over the continental
United States from aperiodic position reports. There are a number
of unusual features of the dynamic displays we have developed. Our
visualizations can be viewed from either a fixed or moving
viewpoint. The direction and distance of the focal point from the
viewing point are under program control (allowing viewing in a
direction other than the direction of motion of the viewpoint).
The maximum and minimum depth of field are under program control
(allowing viewing of selected local subsets of the data).
Key Words: Animation; Localization; Interactive graphics.
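Interpolating a flight path from aperiodic position reports can be sketched as follows (a minimal illustration, not the authors' actual method; the report format and times are hypothetical):

```python
# A minimal sketch of estimating an aircraft position at time t from
# aperiodic (time, lat, lon, alt) reports, using piecewise-linear
# interpolation between the two bracketing reports. Illustrative only;
# not the interpolation scheme developed in the article.

from bisect import bisect_right

def interpolate_position(reports, t):
    """reports: list of (time, lat, lon, alt) tuples sorted by time."""
    times = [r[0] for r in reports]
    i = bisect_right(times, t)
    if i == 0:
        return reports[0][1:]          # before first report: hold first
    if i == len(reports):
        return reports[-1][1:]         # after last report: hold last
    (t0, *p0), (t1, *p1) = reports[i - 1], reports[i]
    w = (t - t0) / (t1 - t0)           # fraction of the way to next report
    return tuple(a + w * (b - a) for a, b in zip(p0, p1))

reports = [(0, 40.0, -80.0, 30000), (100, 41.0, -79.0, 32000)]
print(interpolate_position(reports, 50))   # midway between the two reports
```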
95JCGS04\P0281-----------------------------------------------------
Huge Data Sets and the Frontiers of Computational Feasibility
Edward J. Wegman
Recently, Huber offered a taxonomy of data set sizes ranging from
tiny (10$^2$ bytes) to huge (10$^{10}$ bytes). This taxonomy is
particularly appealing because it quantifies the meaning of tiny,
small, medium, large, and huge. Indeed, some investigators consider
300 small and 10,000 large, while others consider 10,000 small. In
Huber's taxonomy, most statistical and visualization techniques
are computationally feasible with tiny data sets. With larger data
sets, however, computers run out of computational horsepower and
graphics displays run out of resolution fairly quickly. In this
article, I discuss aspects of data set size and computational
feasibility for general classes of algorithms in the context of CPU
performance, memory size, hard disk capacity, screen resolution and
massively parallel architectures. I discuss some strategies such
as recursive formulations that mitigate the impact of size. I also
discuss the potential for scalable parallelization that will
mitigate the effects of computational complexity.
Key Words: Computational complexity; Grand challenge problems; Teraflop
computer; Visual acuity; Visualization.
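A standard example of the kind of recursive formulation that mitigates data set size is the one-pass update of a mean and variance (Welford's method), which summarizes a stream of any length in constant memory (the sample below is illustrative):

```python
# A recursive (one-pass) formulation of the sample mean and variance:
# each observation updates the running summaries and is then discarded,
# so a huge data set never needs to be held in memory at once.

def running_moments(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n              # recursive mean update
        m2 += delta * (x - mean)       # accumulates sum of squared deviations
    var = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, var

n, mean, var = running_moments(iter(range(1, 6)))   # 1..5 without storing them
print(n, mean, var)
```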
95JCGS04\P0296-----------------------------------------------------------
Theoretical and Empirical Properties of the
Genetic Algorithm as a Numerical Optimizer
Christopher Jennison & Nuala Sheehan
We investigate the basic form of the genetic algorithm as an
optimization technique. Its failure to maximize a simple function of
a string of 50 binary variables prompts a closer study of Holland's
(1975) ``Schema Theorem'' and we find the implications of this result
to be much weaker than are often claimed. Further theoretical results
and exact calculations for simple problems provide an understanding
of how the genetic algorithm works and why it failed in our original
application. We show that the algorithm can be fine-tuned to succeed
in that problem but only by introducing features which could cause
serious difficulties in harder problems.
Key Words: Ergodic distribution; Global optimization; Markov chain;
Schema theorem; Simulated annealing.
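The basic genetic algorithm the authors study can be sketched as fitness-proportional selection, single-point crossover, and bitwise mutation on 50-bit strings. The sketch below maximizes the number of 1-bits; the objective, population size, and rates are illustrative, not those of the article:

```python
# A sketch of the basic genetic algorithm: roulette-wheel selection,
# single-point crossover, and bitwise mutation on 50-bit strings.
# Illustrative parameters; the article's test function differs.

import random

def genetic_algorithm(fitness, nbits=50, pop_size=40, generations=200,
                      p_cross=0.6, p_mut=0.001, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(nbits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        weights = [fitness(s) for s in pop]            # fitness-proportional
        parents = rng.choices(pop, weights=weights, k=pop_size)
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < p_cross:                 # single-point crossover
                cut = rng.randrange(1, nbits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            nxt += [a, b]
        pop = [[1 - g if rng.random() < p_mut else g for g in s] for s in nxt]
        best = max(pop + [best], key=fitness)          # track best-so-far
    return best

best = genetic_algorithm(sum)            # "onemax": fitness = number of 1-bits
print(sum(best), "of 50 bits set")
```

Even on this easy objective, selection pressure under fitness-proportional selection is weak, which hints at why the basic algorithm can stall on harder functions.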
95JCGS04\P0319--------------------------------------------------------
A Subsampling Method for the Computation of Multivariate
Estimators With High Breakdown Point
Jesus Juan & Francisco J. Prieto
All known robust location and scale estimators with high breakdown point
for multivariate samples are very expensive to compute. In practice,
this computation has to be carried out using an approximate subsampling
procedure. In this article we describe an alternative subsampling
scheme, applicable to both the Stahel-Donoho estimator and the
minimum volume ellipsoid estimator, with the property that the number
of subsamples required can be substantially reduced with respect to
the standard subsampling procedures used in both cases. We also discuss
some bias and variability properties of the estimator obtained from the
proposed subsampling process.
Key Words: Minimum volume ellipsoid estimator; Outlier detection; Robust
estimation; Stahel-Donoho estimator.
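For context, the standard subsampling approximation to the minimum volume ellipsoid estimator (the baseline the article improves on) can be sketched as follows, in the bivariate case with illustrative data and subsample count:

```python
# A sketch of the standard MVE subsampling approximation: draw many
# random (p+1)-point subsamples, inflate each candidate ellipsoid to
# cover half the data, and keep the smallest-volume one. Bivariate
# case; data and subsample count are illustrative.

import random, math

def mve_subsampling(data, n_subsamples=500, seed=1):
    n, p = len(data), 2
    h = (n + p + 1) // 2                      # points the ellipsoid must cover
    rng = random.Random(seed)
    best = None
    for _ in range(n_subsamples):
        sub = rng.sample(data, p + 1)
        mx = sum(x for x, _ in sub) / (p + 1)
        my = sum(y for _, y in sub) / (p + 1)
        # sample covariance of the subsample
        sxx = sum((x - mx) ** 2 for x, _ in sub) / p
        syy = sum((y - my) ** 2 for _, y in sub) / p
        sxy = sum((x - mx) * (y - my) for x, y in sub) / p
        det = sxx * syy - sxy ** 2
        if det <= 1e-12:
            continue                          # degenerate (collinear) subsample
        # squared Mahalanobis distances of all points to the candidate
        d2 = sorted(((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy
                     + (y - my) ** 2 * sxx) / det for x, y in data)
        m2 = d2[h - 1]                        # inflation factor covering h points
        volume = math.sqrt(det) * m2 ** (p / 2)
        if best is None or volume < best[0]:
            best = (volume, (mx, my))
    return best[1]                            # robust location estimate

rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
data += [(10 + rng.gauss(0, 0.1), 10 + rng.gauss(0, 0.1)) for _ in range(10)]
center = mve_subsampling(data)
print(center)   # near (0, 0) despite the outlier cluster at (10, 10)
```

The expense the article addresses is visible here: the number of subsamples needed for a good approximation grows rapidly with dimension and contamination.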
95JCGS04\P0335----------------------------------------------------------
Exploring Multidimensional Data With the Flipped Empirical
Distribution Function
Moon Yul Huh
This article introduces a new form of empirical distribution function
(EDF) called the flipped empirical distribution function (FEDF), to
represent univariate data graphically. Because the plot shows the
location of individual points, it may be useful when we need to
manipulate specific data points, as in dynamic graphics. The article
introduces several methods to explore multidimensional data using the
FEDF. They are called a parallel FEDF, an FEDF scatterplot matrix, and
an FEDF starplot. The usefulness of these plots for exploring
multidimensional data becomes more apparent when they are
implemented with dynamic graphics methods such as selecting,
deleting, linking, locating, and identifying a group of data points.
Key Words: Dynamic graphics; FEDF; FEDF scatterplot matrix; FEDF starplot;
Parallel FEDF.
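The FEDF's flipping construction is defined in the article; as background, the ordinary EDF it modifies assigns each value t the fraction of observations at or below t, so plotting EDF(x_i) against x_i marks every individual data point. A minimal sketch (sample data illustrative):

```python
# Background sketch: the ordinary empirical distribution function (EDF)
# on which the FEDF is based. EDF(t) is the fraction of observations
# <= t; evaluating it at the data points themselves marks each point,
# which is what makes EDF-style displays amenable to dynamic-graphics
# operations such as selecting and linking.

from bisect import bisect_right

def edf(sample):
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect_right(xs, t) / n

F = edf([3, 1, 4, 1, 5])
print(F(1), F(3), F(10))   # 0.4 0.6 1.0
```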