The following two S(plus) functions, together with the appropriate
Splus help files, are intended to allow the user to explore the use of the
cone plot, a graphical object I have invented for probing
the geometric structure of high dimensional data.
There are in principle a large number of variations and possibilities
but in the interests of simplicity I have kept it stripped down as far
as possible consistent with actually being useful.
I am happy to respond to requests for further information. See email
address below. I hope to be able to publish more details at a future date.
The two functions are
callcone: intended to make multiple calls to the cone function
cone : produces a single cone plot for given data.
The documentation, together with the comments in the functions themselves,
is intended to serve as a very basic introduction to the definition and
use of the cone plot, which promises to be a potent way of picking up
such things as outliers and clustering.
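As a minimal sketch of the definition (not a substitute for the cone
function given later in this file), the plotted coordinates can be computed
directly: each point is placed at its distance from the vertex and at the
angle between the vertex-to-point line and the vertex-to-apocentre line.
The names conecoords, v and a are hypothetical and used for illustration only.

```r
# Hypothetical helper, for illustration only: cone-plot coordinates of the
# rows of x relative to a vertex v and an apocentre a.
conecoords <- function(x, v, a) {
    d <- sweep(x, 2, v)                      # vectors from the vertex to each point
    r <- sqrt(apply(d^2, 1, sum))            # radial component: distance from vertex
    b <- a - v                               # reference direction: vertex to apocentre
    costheta <- as.vector(d %*% b)/(r * sqrt(sum(b^2)))
    costheta <- pmin(1, pmax(-1, costheta))  # guard against rounding outside [-1, 1]
    cbind(X = r * costheta, Y = r * sqrt(1 - costheta^2))
}
# Three points in 3 dimensions: one along the reference direction,
# one at right angles to it, and one directly opposite.
x <- rbind(c(1, 0, 0), c(0, 2, 0), c(-1, 0, 0))
conecoords(x, v = c(0, 0, 0), a = c(3, 0, 0))
```

The three points land at (1, 0), (0, 2) and (-1, 0): on the positive
horizontal axis, straight up, and on the negative horizontal axis.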
After installing the code, the user should set mfrow to a suitable
value (see the examples in the documentation) and then invoke callcone on a
favourite (small) data set which has outliers etc.
The code could be made much faster by writing parts of it as
external calls to C routines, but that complicates things, so I have written
the routines entirely in S. In general it is best to start with
smallish data sets, say 50-100 observations on a dozen or so variables,
just to get the idea.
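A first session along these lines might look like the sketch below. The
data are synthetic and purely illustrative, and the 8 by 8 layout is simply
an assumption chosen to hold the 60 default plots (one per vertex); the mar
and mgp settings are the ones suggested in the comments of the cone function.

```r
# Synthetic data for illustration only: 60 observations on 6 variables,
# with the last observation shifted so that it acts as an outlier.
set.seed(1)
x <- matrix(rnorm(60 * 6), ncol = 6)
x[60, ] <- x[60, ] + 5
z <- scale(x)    # standardise columns to mean 0, standard deviation 1
# Room for the 60 default plots, with margins squashed to fit the page.
par(mfrow = c(8, 8), mar = c(1, 1, 0, 0), mgp = c(1, 0, 0))
# callcone is defined later in this file; skip the call if not yet sourced.
if(exists("callcone")) callcone(z)
```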
The remainder of the file contains in order the following "blocks"
S(plus) code for callcone
S(plus) code for cone
Splus helpfile for callcone
Splus helpfile for cone
This file should be edited appropriately and the first two "blocks" should
be able to be sourced directly into Splus or S.
The help files should be installed in the appropriate place or possibly
nroff'ed directly.
Brian Dawkins
brian@isor.vuw.ac.nz
************************************************************
*****code and help files only from here to end**************
************************************************************
callcone<-function(x, apocentre, vertex, maxdist)
{
# The function callcone calls the function cone (qv) in order to get a
# series of cone plots for data x which is a matrix with no missing values.
# In general it seems best to invoke with the x standardised.
# apocentre is a matrix of apocentres with same number of cols as x.
# vertex is a matrix of vertices with same number of cols as x.
# Cone plots are to be obtained for each combination of rows from
# apocentre and vertex
# The default will be to take the apocentre as the mean of the data and
# vertex to be the data matrix itself. Also by default the plots will be
# produced without scales and in standard coneplot format.
# This does not capitalise on all the options of the cone function however.
# It is intended as a simple way to invoke a preliminary series of
# cone plots. The user is welcome to fiddle with both this and the
# following function to their heart's content.
#
# See reference in help file for more details.
#
#Supply defaults. maxdist is deliberately not preset here, so that the
# missing(maxdist) tests below still see it as missing and compute it.
#
if(missing(apocentre))
apocentre <- apply(x, 2, mean)
if(missing(vertex)) {
vertex <- x
if(missing(maxdist))
maxdist <- max(dist(x))
}
else {
if(missing(maxdist))
maxdist <- max(dist(rbind(x, vertex)))
}
#
# now get the actual sequence of cone plots required. The user should
# have preset the mfig and other relevant values by a call to par
# before using callcone.
#
if(is.matrix(apocentre)) {
m <- nrow(apocentre)
if(is.matrix(vertex)) {
n <- nrow(vertex)
for(i in 1:m) {
for(j in 1:n) {
cone(x, apocentre[i, ], vertex[j,
], maxdist = maxdist)
}
}
}
else {
for(i in 1:m)
cone(x, apocentre[i, ], vertex, maxdist =
maxdist)
}
}
else {
if(is.matrix(vertex)) {
n <- nrow(vertex)
for(j in 1:n)
cone(x, apocentre, vertex[j, ], maxdist =
maxdist)
}
else cone(x, apocentre, vertex, maxdist = maxdist)
}
invisible()
}
cone<-function(x, apocentre, vertex, maxdist, pchars, Axes = F, cex = 0.8, TS = F,
opt = F)
{
# S(plus) function cone: produces a single cone plot.
# For data x and apocentre,vertex as given, compute the cone angles for rays
# centred on the vertex, with the ray from the vertex to the apocentre
# as the reference direction.
# maxdist should be the maximum possible radial component of the plot
# pchars is vector of plotting characters.
# Axes should be set to T to get labelled axes
# cex determines the size of plotted text
# TS is set to T if the data is time series or sequential data
# opt=T gives a maximal plot on the plotting surface
# See the help file for more details.
#
# Initialisation
if(missing(apocentre)) apocentre <- apply(x, 2, mean)
if(missing(vertex))
vertex <- x[1, ]
if(missing(maxdist)) {
maxdist <- 1 # arbitrary value
missingdist <- T # need to compute maxdist later
}
else missingdist <- F
p <- ncol(x)
n <- nrow(x)
if(missing(pchars)) pchars <- paste(1:n) # default plotting characters
#
# The user may want to fiddle the plotting parameters. I tend to use
# par(mar = c(1, 1, 0, 0), mgp = c(1, 0, 0))
# in order to squash a lot of plots on the page.
#
# determine whether either apocentre or vertex is in the data matrix
#
testa <- apply(abs(t(x) - apocentre), 2, sum)
testv <- apply(abs(t(x) - vertex), 2, sum)
testa <- (testa == 0)
testv <- (testv == 0)
vindx <- (1:n)[testv]
aindx <- (1:n)[testa]
ainx <- any(testa)
vinx <- any(testv)
# now modify the data matrix if necessary, to exclude vertex
if(vinx) {
newx <- x[!(testv), ]
pchars <- pchars[!testv]
}
else {
newx <- x
}
if(missingdist) {
if(vinx)
dmat <- dist(x)
else dmat <- dist(rbind(vertex, x))
maxdist <- max(dmat) #
}
#
# Now compute the radial components of the cone plot
#
Rvect <- t(newx) - vertex#matrix of vectors to data points from the vertex
R <- sqrt(apply(Rvect^2, 2, sum))# the lengths from vertex to data points
#
# and then the angular components
#
baseline <- apocentre - vertex # the vector to the apocentre
baselength <- sqrt(sum((baseline)^2))#distance from vertex to apocentre
cosTheta <- apply(Rvect * baseline, 2, sum)/(R * baselength) # vector of cosines
cosTheta <- pmin(1, pmax(-1, cosTheta)) # guard against rounding just outside [-1, 1]
#
# Now set up the plot
#
X <- R * cosTheta
Y <- R * sin(acos(cosTheta))
#set up plot limits. If largest possible wanted opt should be set to T
if(opt) {
Xlim <- range(X)
Ylim <- range(Y)
}
else {
Xlim <- c( - maxdist, maxdist)
Ylim <- c(0, maxdist)
}
# set Axes to T if scales are wanted.
if(Axes) {
plot(Xlim, Ylim, type = "n")
}
else {
plot(Xlim, Ylim, type = "n", xlab = "", ylab = "", xaxt = "n",
yaxt = "n")
}
text(X, Y, pchars, cex = cex) #
#
#if the vertex is in the data set, plot its index at the origin
#
if(vinx) text(0, 0, vindx)
#
# Now plot the bounding semicircle if opt is not T, add axes and identifiers
#
if(!opt) {
ang <- seq(0, pi, length = 50)
lines(maxdist * cos(ang), maxdist * sin(ang), lty = 2) #
#
# Add axes
#
abline(h = 0)
abline(v = 0)
#
# Add identifiers in upper corners if vertex or apocentre in data
#
if(vinx) {
text(maxdist * 0.9, maxdist * 0.9, vindx)
}
if(ainx) {
text( - maxdist * 0.9, maxdist * 0.9, aindx)
}
}
#
#connect the points if a sequential plot is implied
#
if(TS) lines(X, Y, lty = 1) #
#
invisible()
}
*******************************************************************************
*The first Splus help file which should be installed in the appropriate place*
*******************************************************************************
.BG
.FN callcone
.TL
Function to call cone (qv).
.DN
A function intended to make a sequence of cone plots.
A suitable graphics device should be invoked before calling it.
.CS
callcone(Data, apocentre, vertex, maxdist)
.RA
.AG Data
Data is the data matrix. Missing values are not allowed.
.OA
See the documentation for the function cone for more details
.AG apocentre
apocentre is a matrix, each row of which has the coordinates of
a possible apocentre. Defaults to the vector of column means of Data.
.AG vertex
vertex is a matrix, each row of which has the coordinates of a possible
vertex. Defaults to the data matrix Data.
.AG maxdist
maxdist is a value which will be used as radius of the bounding semicircle
of an individual cone plot. By default it will be calculated as the
maximum value in the distance matrix consisting of all the distinct points
included in the data Data and the set of vertices.
.RT
A sequence of coneplots is produced, one for each combination of apocentre
and vertex. The user should issue the appropriate par command in order to
fit a suitable number of plots on the page and to obtain reasonably
accurate semicircles. See Example below.
It is generally the case that the data Data should be standardised before
invoking the function, although this may not be appropriate at all times.
This is not intended to be the definitive way of invoking the cone function,
if only because it doesn't involve all the possible parameters of that
function. The user is welcome to change this as they wish.
.SE
.DT
See documentation on the function cone or the
following reference for more details.
.SH REFERENCES
Dawkins, B. P. (1992, to appear). Investigating the Geometry of a p-Dimensional
Data Set. Technical Report Number 22. Wellington, New Zealand: The Institute
of Statistics and Operations Research, Victoria University of Wellington.
.SA
cone
.EX
par(mfrow=c(8,7), pin=c(1.0,.5))
callcone(scale(chernoff2)) # produces the default set of cone plots for
                           # the scaled chernoff2 data set.
.KW coneplot
.WR
****************************************************************************
*The 2nd Splus help file which should be installed in the appropriate place*
****************************************************************************
.BG
.FN cone
.TL
Produces a single cone plot.
.DN
This implements cone plots as defined in the reference below. The cone plot
is essentially an exploratory tool designed to offer insight into the
geometry of a multidimensional data set.
.CS
cone(Data, apocentre, vertex, maxdist,pchars, Axes=F,
cex=0.8, TS=F, opt=F)
.RA
.AG Data
Data is a rectangular array of data, each row of which is considered to be
the set of coordinates of a point in the appropriate dimensional space.
Missing values are not permitted.
.OA
.AG apocentre
The apocentre defines a reference direction from the vertex. It
is a vector of the same dimension as the column space of Data. It will
default to the centre of gravity, i.e. apocentre<-apply(Data,2,mean)
.AG vertex
The vertex is a vector of the same dimension as the column space of Data
and together with the apocentre defines a direction from which
all angles are measured. It defaults to Data[1,].
.AG maxdist
maxdist is a value which will define the radius of the bounding semicircle
of the cone plot. This will default to the maximum distance in the
set of points obtained by combining vertex and Data.
.AG pchars
The vector of characters to be plotted. It will default to the row numbers
of Data.
.AG Axes
Axes should be set to T if scales are wanted on the axes.
.AG cex
cex is used to determine the size of the plotting symbols which are
taken to be the row indices of Data by default.
.AG TS
If TS is set to T, the points in the cone plot are connected in order by
a line.
.AG opt
Setting opt to T suppresses the bounding semicircle and plots the points
on as large a scale as possible.
.RT
The cone plot for the given vertex/apocentre combination is plotted. The
default is to produce a semicircular plot area, with the radius determined
by the maximum interpoint distance in the set formed by vertex and the
rows of Data.
If the vertex is actually a row of Data, its index is plotted in the upper
right hand corner. If the apocentre is a row of Data, its index is plotted
in the upper left hand corner.
.SE
.DT
The cone plot is
discussed in more detail in Dawkins (1992, to appear).
The basic idea is to reduce the dimensionality of a given multidimensional
data set by specifying an axis in space, using the vertex and the
apocentre, and then rotating all half--planes through that axis onto an
arbitrary such half--plane. A half--plane is the semi--infinite 2--flat
which is bounded by the axis. See Somerville (1929) or any suitable text
on n-Dimensional Geometry for the precise definition of a 2-flat.
(This glosses over some of the niceties of rotation in n-dimensions!)
The plot itself is an ordinary polar coordinate plot, where for a given
data point (row of the matrix Data), the polar coordinates are the
distance from the vertex to the point, and the angle between the
line to the point and the vertex-apocentre line.
It is easy to pick up such things as outliers and clustering from such
plots. In the example below point 51 is a possible outlier. The
series of default plots obtained by using callcone (qv) confirms this.
Note that in general it will be best to call the function with data
Data standardised to have column means 0 and column standard deviations 1.
.SH REFERENCES
Dawkins, B. P. (1992, to appear). Investigating the Geometry of a
p-Dimensional Data Set. Technical Report Number 22. Wellington,
New Zealand: The Institute of Statistics and Operations Research,
Victoria University of Wellington.
Somerville, D. M. Y. (1929). An Introduction to the Geometry of n Dimensions.
London: Methuen.
.SA
callcone
.EX
printer(height=20,width=40) #invoke the worst graphic device possible!
cone(scale(chernoff2)) # produces a cone plot for the chernoff2 data
show()
. .................
. ... . ... 1
. ... . ...
. ... . ..
. .. . .
. . . 51 ..
. .. . ..
. . . .
. . . 22831.
. . . 504 27332 .
. . . 21 49 .
. . . 19353245 48 .
. . 16 635187 232 3444 .
. . 14. 211 52746 .
. . 13 12 .
. . .8104 .
............................1.............................
..........................................................
More refined plots show more detail of course.
.KW coneplot
.WR