#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 07/29/1994 19:31 UTC by feh@biostat
# Source directory /usr/local/slibrary/local
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length mode name
# ------ ---------- ------------------------------------------
# 10853 -rw-r--r-- transcan.README
# 19742 -rw-r--r-- transcan.s
# 3075 -rw-r--r-- rcspline.eval.s
# 3258 -rw-r--r-- rcspline.restate.s
# 341 -rw-r--r-- in.operator.s
# 987 -rw-r--r-- scat1d.s
# 12311 -rw-r--r-- model.frame.default.s
# 18439 -rw-r--r-- .Data/.Help/transcan
# 2249 -rw-r--r-- .Data/.Help/rcspline.eval
# 3522 -rw-r--r-- .Data/.Help/rcspline.restate
# 991 -rw-r--r-- .Data/.Help/%in%
# 2494 -rw-r--r-- .Data/.Help/scat1d
#
# ============= transcan.README ==============
if test -f 'transcan.README' -a X"$1" != X"-c"; then
echo 'x - skipping transcan.README (File already exists)'
else
echo 'x - extracting transcan.README (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'transcan.README' &&
transcan Simultaneous transformation and imputation using
X canonical variates
X
When a predictor variable that has missing values (NAs) is unrelated to all
other predictors, and the NAs are missing at random, one can efficiently
analyze the data by imputing a constant value such as the median for the
predictor (see the impute function in statlib). Of course, standard errors
of regression coefficients will need to be corrected upward for imputation
using, for example, bootstrapping. When the predictor is correlated with
other predictors, the constant fill-in method is biased and inefficient. It
is then better to impute NAs by predicting the X from the other Xs.
X
In many cases, the variable being imputed is expected to be monotonically
related to the other variables forming the basis for imputation. Functional
status measures and socioeconomic measures fall into this category.
Ordinary linear regression, ordinal logistic regression, and tree-based
methods can be used to derive imputation models in this case.
X
Sometimes, however, the variable being imputed may be non-monotonically
related to the other variables (e.g., U-shaped). This happens commonly
when relating a physiologic measure to other variables, when the physiologic
measure has a ``normal range'' in the interior of its total range. For
example, if blood pressure is missing, one could estimate it (weakly) from
the patient's heart rate. Normal blood pressure is (on the average)
associated with normal heart rate. A very high or very low heart rate could
result in an abnormal blood pressure. In general, the best way to relate
the two variables is to make a U-shaped transformation of both. General
transformation methods can help in this regard, and they are valuable tools
even when there are no missing data. Besides assisting the analyst in
determining good predictor transformations, psychometric scaling methods can
greatly reduce the number of parameters to estimate, since the
pre-transformations are determined without reference to the response
variable.
X
The goal of a transformation/scaling method is to transform each variable in
turn so that it is most highly (linearly) correlated with the other
variables. An iterative procedure is used so that the same transformation
is used for each variable, whether the variable is currently being predicted
or is being used to predict other variables. The first general technique in
this area that can deal with mixtures of categorical and continuous
predictors was the "maximum total variance" (MTV) method of Young, Takane,
and de Leeuw (1978). This method is based on maximizing variation explained
by the first principal component of the series of predictor variables. A
related method is "maximum generalized variance" (MGV) of Sarle (SAS/STAT
User's Guide, Fourth Edition, 1990, SAS Institute, Inc.). MGV does not
require the first principal component to be a good summary of all the
variables. In other words, MGV can handle predictors that are correlated in
groups. When predicting variable i, that variable is represented as a set
of linear and non-linear terms (e.g., spline components). Analysis of
canonical variates can be used to find the linear combination of terms for
XXi (i.e., find a new transformation for Xi) and the linear combination of
the current transformations of all other variables (representing each
variable as a single, transformed, variable) such that these two linear
combinations have maximum correlation. [For example, if there are only two
variables X1 and X2 represented as quadratic polynomials, solve for a,b,c,d
such that aX1+bX1^2 has maximum correlation with cX2+dX2^2.] The process is
repeated until the transformations converge (one iteration is sufficient if
there are only 2 variables and no missing data). For categorical X, the
canonical variate approach is a generalization of Fisher's optimal scoring
method (see also the S functions ace, avas, principal curve, and fda).
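X
A sketch of this canonical-variate step for the two-variable quadratic
example above (illustrative only; the actual transcan function uses
restricted cubic spline terms rather than raw polynomials):
X
x1 <- rnorm(100); x2 <- rnorm(100)
f <- cancor(cbind(x1, x1^2), cbind(x2, x2^2))
f$cor[1]     #maximum attainable correlation between the two combinations
f$xcoef[,1]  #(a,b) up to scaling; f$ycoef[,1] gives (c,d)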
X
Both MTV and MGV are implemented in the qualitative principal components SAS
procedure PRINQUAL of Kuhfeld. For both methods, continuous, categorical,
and monotonically-restricted variables are allowed. These methods will
simultaneously estimate missing data based on multiple regression on the
transformed variables. Unfortunately, PROC PRINQUAL frequently converges to
a solution which assigns imputed values outside the allowable range for a
predictor. This problem is more likely when multiple variables are missing
on the same subjects, since the transformation algorithm may simply separate
missings and non-missings into clusters. We have developed a simple
modification of the MGV algorithm of PRINQUAL which simultaneously imputes
missing values without these problems. The new algorithm is implemented in
S-Plus in the function transcan. Imputed values are initialized to medians
of continuous variables and the most frequent category of categorical
variables. Then when using canonical variates to transform each variable in
turn, observations that are missing on the current ``dependent'' variable
are excluded from consideration, although missing values for the current set
of ``predictors'' are imputed. Although categorical variables are scored
using the first canonical variate, transcan optionally uses the S tree
function to obtain imputed values on the original scale using CART for these
variables. transcan uses restricted cubic splines to model continuous
variables. It does not implement monotonicity constraints as does PRINQUAL,
but the non-monotonically-restricted cubic splines used by transcan (with
linear tail constraints) are better behaved than the unrestricted cubic
splines used by PRINQUAL.
X
transcan has many options, including an MTV lookalike and the ability to
specify constant fill-in values that do not influence the transformation
estimation process. It also has a shrinkage option so that overfitting does
not cause imputed values to be "out on a limb" when estimating NAs --
predicted values are shrunk toward the mean of the transformed Xi using the
method of Van Houwelingen and Le Cessie (1990).
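X
The shrinkage factor used is the simple heuristic
h = [1 - (1-R2)(n-1)/(n-k-1)]/R2, where k is the effective number of
regression parameters; predictions are then shrunk toward the mean as
ybar(1-h) + h(Xb). For example (n, k, R2 values hypothetical):
X
n <- 100; k <- 5; r2 <- 0.3
h <- max(0, (1 - (1-r2)*(n-1)/(n-k-1))/r2)
h   #factor multiplying the regression predictions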
X
Simulations have shown that transcan is robust even to high proportions of
NAs for some of the variables, and high amounts of overlap of NAs between
two predictors. Several data analyses involving medical outcomes have shown
that the automatic transformations resulting from transcan (blinded to
outcomes) are more accurate predictors of outcomes than the original
variable codings, whether or not NAs are present. This implies that
transcan may be a good data reduction tool in many cases, since complex
transformations (and multiple dummy variables) are related to the outcome
using only a single transformed variable. For highly skewed variables,
transcan's cubic spline transformations tend to make the predictors'
distributions fairly symmetric. Even if the transformations are not used in
later modeling of Y, the backsolved imputed values may be used, and the
multicollinearity statistics may suggest other data reduction. Also,
transcan can plot all transformations with tick marks indicating imputed
values, both on the original and transformed scale. It is informative to
see how imputed values are distributed.
X
To use transcan to its fullest, also install the impute package.
X
Here is an example of the usage of transcan and impute:
X
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH", imputed=T)
#Could also have done
#x.trans <- transcan(~age+disease+blood.pressure+pH, asis="pH", imputed=T)
summary(x.trans) #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans) #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age <- impute(x.trans, age)
disease <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH <- impute(x.trans, pH)
summary(pH) #uses summary.impute to tell about imputations
X #and summary.default to tell about pH overall
newx.trans <- predict(x.trans, xnew)
w <- predict(x.trans, xnew, type="original")
#Predict.transcan can get out-of-sample transformed variables and imputations
#With type="original", w has variables on the original scale with NAs imputed
age <- w[,"age"] #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
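X
#The fitted transformations can also be stored as separate S functions
#using the Function method supplied in transcan.s (default prefix "."):
Function(x.trans)      #creates functions .age, .disease, .blood.pressure, .pH
plot(age, .age(age))   #inspect the stored transformation for age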
X
X
Here's example output with random, independent x1, x2 (all R-sq. really=0):
X
> .Random.seed <- c(5, 21, 3, 44, 50, 1, 32, 26, 27, 43, 57, 1)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x <- cbind(x1, x2)
> w <- transcan(x, pl=F) # suppress plots of transformations
Convergence criterion:0.127 0.024 0.014 0.01 0.007
Convergence in 6 iterations
R-squared achieved in predicting each variable:
X
X x1 x2
X 0.122 0.122
X
Adjusted R-squared:
X
X x1 x2
X 0.055 0.055
> w <- transcan(x, asis=c("x1","x2"), pl=F) # force linearity
Convergence criterion:0 0
Convergence in 3 iterations
R-squared achieved in predicting each variable:
X
X x1 x2
X 0.014 0.014
X
Adjusted R-squared:
X
X x1 x2
X 0.004 0.004
X
X
X
X Created by Frank Harrell (feh@biostat.mc.duke.edu)
X
To install the program and help file, place this file in your local library
directory $SHOME/library/local. (Type !echo $SHOME while under S if you
need to find out what $SHOME is). Then type sh fff, where fff is the name
you stored this shar file under. (If you want to overwrite an existing
version, use "sh fff -c"). This will create the files:
transcan.README, transcan.s, rcspline.eval.s, rcspline.restate.s,
in.operator.s, scat1d.s, and .Data/.Help files transcan, rcspline.eval,
rcspline.restate, %in%, and scat1d. While under S or S-Plus in the /local
area, source("transcan.s") and then source() for the other *.s files. This
will install the functions under .Data.
X
If you are running a version of S or S-Plus that does not allow
`na.action' to be specified in `options(na.action=...)' (see if `options'
is mentioned in `model.frame.default'), also install `model.frame.default'
in the same area as `transcan' by `source'-ing `model.frame.default.s'.
Most installations will need to do this step.
X
COPYRIGHT NOTICE
X
The functions, documentation, and descriptions in this README file are
copyrighted by Frank Harrell, 1994. You may distribute these functions
freely as long as you do so without financial gain, acknowledge the
source, and communicate improvements to the author. The author is
willing to help with problems. Send E-mail to feh@biostat.mc.duke.edu.
X
X
This work was supported fully by a grant from the Agency for Health Care Policy
and Research.
X
SHAR_EOF
chmod 0644 transcan.README ||
echo 'restore of transcan.README failed'
Wc_c="`wc -c < 'transcan.README'`"
test 10853 -eq "$Wc_c" ||
echo 'transcan.README: original size 10853, current size' "$Wc_c"
fi
# ============= transcan.s ==============
if test -f 'transcan.s' -a X"$1" != X"-c"; then
echo 'x - skipping transcan.s (File already exists)'
else
echo 'x - extracting transcan.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'transcan.s' &&
#Since last sent to statlib:
#Added error messages for missing imp.con elements that are needed
#Changed canonical variates to have variance 1 instead of 1/(n-1)
X
transcan <- function(x, method=c("canonical","pc"),
X categorical=NULL, asis=NULL, nk,
X imputed=F, trantab=F,
X impcat=c("score","tree"),
X mincut=40, pl=T, allpl=F, show.na=T,
X iter.max=50, eps=.1, curtail=T,
X imp.con=F, shrink=F, init.cat="mode", ...)
{
X
#This is a non-.Internal version of the approx function. The
#S-Plus version of approx sometimes bombs with a bus error.
X
approx <- function(x, y, xout, method = "linear", n = 50, rule = 1, f = 0)
{
X nx <- length(x)
X if(any(is.na(x)) || any(is.na(y)))
X stop("Missing values not allowed")
X if(nx != length(y))
X stop("Lengths of x and y must match")
X if(nx < 2)
X stop("need at least 2 points")
X i <- order(x)
X x <- x[i]
X y <- y[i]
X if(missing(xout))
X xout <- seq(x[1], x[nx], length = n)
X else n <- length(xout)
X methods <- c("linear", "constant")
X if(!(imeth <- pmatch(method, methods, nomatch = 0)))
X stop("method must be \"linear\" or \"constant\"")
X method <- methods[imeth]
X if(method == "linear") {
X f <- -1
X }
X else if(method == "constant") {
X if(f < 0 || f > 1)
X stop("f must be in [0,1]")
X }
X val <- .Fortran("approx",
X x = as.single(x),
X y = as.single(y),
X nx = as.integer(nx),
X xout = as.single(xout),
X m = as.integer(n),
X rule = as.integer(rule),
X f = as.single(f),
X yout = single(n),
X iscr = single(n))[c("xout", "yout")]
X names(val) <- c("x", "y")
X val
}
X
call <- match.call()
method <- match.arg(method)
impcat <- match.arg(impcat)
X
formula <- NULL
X
if(inherits(x,"formula")) {
X formula <- x
X oldop <- options(na.action="na.retain")
X y <- model.frame(x)
X options(oldop)
X d <- dim(y)
X nam <- names(y)
# if(attr(attr(y,"terms"),"response")==1) y <- y[,-1]
X x <- matrix(NA,nrow=d[1],ncol=d[2],
X dimnames=list(attr(y,"row.names"),nam))
X for(i in 1:d[2]) {
X w <- y[[i]]
X if(is.character(w)) w <- factor(w)
X if(is.factor(w)) {
X x[,i] <- unclass(w)
X categorical <- c(categorical, nam[i])
X } else {
X x[,i] <- w
X nu <- length(unique(w, na.rm=T))
X if(nu<2) stop(paste("variable",nam[i],"has only one value"))
X if(nu==2) asis <- c(asis, nam[i])
X else if(nu==3) categorical <- c(categorical, nam[i])
X }
X }
}
X
nam <- dimnames(x)[[2]]
rnam <- dimnames(x)[[1]]
if(length(rnam)==0) rnam <- as.character(1:nrow(x))
p <- ncol(x)
if(is.null(nam)) stop("x must have column names")
X
n <- nrow(x)
if(missing(nk)) nk <- 3*(n<30)+4*(n>=30 & n<100)+5*(n>=100)
X
#Compute constant to multiply canonical variates by to get a variance of 1.0
varconst <- sqrt(n-1)
X
if(!is.null(categorical))
X {
X if(length(categorical)==1 && categorical=="*") categorical <- nam
X oldopts <- options()
X if(impcat=="tree")
X {
X #define na.action that keeps all obs.
X options(na.action="na.retain",
X contrasts=c("contr.treatment","contr.poly"))
X on.exit(options(oldopts))
X }
X else {
X options(contrasts=c("contr.treatment","contr.poly"))
X on.exit(options(oldopts))
X }
X }
if(length(asis)==1 && asis=="*") asis <- nam
X
R <- parms <- coef <- fill.con <- Imputed <- Trantab <- vector("list",p)
fillin <- rep(NA,p); names(fillin) <- nam
scale <- rep(1,p); names(scale) <- nam
#Trantab <- list()
nparm <- shr <- fillin
X
#For canonical-variate expansions (standardized), use scale of 1
xcoef <- matrix(NA, nrow=p, ncol=p+1, dimnames=list(nam,c("intercept",nam)))
usefill <- 1*(is.logical(imp.con) && imp.con)+2*(is.numeric(imp.con))
if(usefill==2 && length(imp.con)!=p)
X stop("length of imp.con != ncol(x)")
X
for(i in 1:p)
{
X lab <- nam[i]
X y <- x[,i]
X na <- is.na(y)
X w <- y[!na]
X if(imputed) Imputed[[i]] <- single(sum(na))
X if(lab %in% asis)
X {
X fillin[i] <- if(usefill==2) imp.con[i] else median(w)
X scale[i] <- mean(abs(w-fillin[i]))
X if(is.na(fillin[i])) stop(paste("fillin value for",lab,"is NA"))
X coef[[i]] <- c(0,1)
X nparm[i] <- 1
X }
X else
X {
X if(lab %in% categorical)
X {
X w <- table(y)
X z <- as.numeric(names(w))
X if(usefill==2) fillin[i] <- imp.con[i]
X else fillin[i] <- z[w==max(w)][1] #most freq. category
X assign("Y", as.factor(y), 1)
X opold <- options(na.action="na.retain")
X w <- model.matrix(~Y) # uses contr.treatment (reference cell coding)
X options(opold)
X r <- attr(w,"contrasts")[[1]]
X attr(r,"codes") <- z
X parms[[i]] <- r
X R[[i]] <- w[,-1,drop=F] #kill intercept column
X nparm[i] <- length(z)-1
X if(usefill>0)
X {
X fill.con[[i]] <- w[y==imp.con[i],-1,drop=F][1,,drop=F]
X #select first hit
X if(length(fill.con[[i]])==0)
X stop("imp.con has a code not in the data for a categorical var")
X }
X }
X else
X {
X fillin[i] <- if(usefill==2) imp.con[i] else median(y[!is.na(y)])
X R[[i]] <- rcspline.eval(y, nk=nk, inclx=T)
X parms[[i]] <- attr(R[[i]], "knots")
X if(usefill>0) fill.con[[i]] <-
X rcspline.eval(fillin[i], parms[[i]], inclx=T)
X nparm[i] <- length(parms[[i]])-1
X }
X }
}
X
xt <- x
if(init.cat %in% c("mode","random")) for(i in (1:p)[nam %in% categorical])
X xt[,i] <- if(init.cat=="mode") {
X if(is.na(fillin[i])) stop(paste("fillin value for",nam[i],"is NA"))
X xt[,i]==fillin[i]
X } else runif(n)
X
p1 <- p-1
R2 <- R2.adj <- single(p); r2 <- r2.adj <- NA
X
last.iter <- F
cat("Convergence criterion:")
for(iter in 1:iter.max)
{
X dmax <- 0
X if(last.iter) xtl <- xt
X for(i in 1:p)
X {
X lab <- nam[i]
X catg <- lab %in% categorical
X xx <- xt[,-i,drop=F]
X k.other <- sum(pmax(nparm[-i]-1,0))/(p-1)+p-1 #effective d.f.
X if(iter==1)
X {
X for(j in 1:p1)
X {
X if(any(z <- is.na(xx[,j])))
X {
X l <- (nam[-i])[j]
X if(is.na(fillin[l]))stop(paste("variable",l,"has fillin value of NA"))
X xx[z,j] <- fillin[l]
X }
X }
X }
X if(method=="pc")
X {
X z <- xx
X for(k in 1:p1) { y <- z[,k]; z[,k] <- (y-mean(y))/sqrt(var(y)) }
X P <- prcomp(z)$x[,1] # 1st prin. comp.
X }
X j <- is.na(x[,i])
X if(lab %in% asis)
X {
X y <- x[!j, i]
X warn <- .Options$warn
X .Options$warn <- -1 #cut out warning about NAs
X f <- lsfit(xx[!j,,drop=F], y)
X .Options$warn <- warn
X newy <- x[,i]
X xcof <- f$coef
X if(shrink)
X {
X nn <- length(y)
X ybar <- mean(y)
X r2 <- 1 - sum(f$residuals^2)/sum((y-ybar)^2)
X h <- max(0,(1-(1-r2)*(nn-1)/(nn-k.other-1))/r2)
X shr[i] <- h
X xcof <- c(ybar*(1-h)+h*xcof[1],h*xcof[-1])
X }
X if(any(j))
X {
X if(usefill>0) newy[j] <- fillin[i]
X else newy[j] <- cbind(1,xx[j,,drop=F]) %*% xcof
X }
X if(last.iter)
X {
X ybar <- mean(y); nn <- length(y)
X if(!shrink) r2 <- 1 - sum(f$residuals^2)/sum((y-ybar)^2)
X r2.adj <- max(0,1-(1-r2)*(nn-1)/(nn-k.other-1))
X if(imputed & any(j)) {names(newy) <- rnam; Imputed[[i]] <- newy[j]; NULL}
X xcoef[i, c("intercept",nam[-i])] <- xcof
X if(trantab) { rr <- range(y); Trantab[[i]] <- list(x=rr, y=rr); NULL }
X }
X }
X else
X {
X f <- cancor(xx[!j,,drop=F], R[[i]][!j,,drop=F])
X r2 <- f$cor[1]^2
X xcof <- c(intercept=-sum(f$xcoef[,1] * f$xcenter),
X f$xcoef[,1])*varconst
X if(method=="canonical")
X cof <- c(intercept=-sum(f$ycoef[,1] * f$ycenter),
X f$ycoef[,1])*varconst
X else cof <- lsfit(R[[i]][!j,,drop=F], P[!j])$coef
X newy <- drop(cbind(1,R[[i]]) %*% cof)
X nn <- n-sum(j)
X k <- nparm[i]-1+k.other
X if(shrink)
X {
X h <- max(0,(1-(1-r2)*(nn-1)/(nn-k-1))/r2)
X shr[i] <- h
X xcof <- h*xcof #mean of can var=0
X }
X if(any(j))
X {
X if(usefill>0) newy[j] <- drop(cbind(1,fill.con[[i]]) %*% cof)
X else newy[j] <- drop(cbind(1,xx[j,,drop=F]) %*% xcof)
X }
X if(last.iter)
X {
X coef[[i]] <- cof
X xcoef[i,c("intercept",nam[-i])] <- xcof
X r2.adj <- max(0,1-(1-r2)*(nn-1)/(nn-k-1))
X if(trantab || (any(j) && catg && impcat=="score"))
X {
X xa <- x[!j, i]
X s <- !duplicated(xa)
X xa <- xa[s]
X ya <- newy[!j][s]
X s <- order(xa)
X Trantab[[i]] <- list(x=xa[s], y=ya[s])
X NULL
X }
X
X if(imputed & any(j))
X {
X if(catg)
X {
X if(usefill>0) pred <- rep(fillin[i], sum(j))
X else
X {
X if(impcat=="tree")
X {
X y <- na.include(as.factor(x[,i]))#keep tree from dropping NAs
X f <- tree(y ~ xx, control=tree.control(nobs=length(y),
X mincut=mincut))
X pred <- (t(apply(-predict(f)[j,,drop=F],1,order)))[,1]
# Gets level number of most probable category
X if(max(pred)==length(levels(y))) #na.include adds NA at end
X warning("imputed value for categorical variable is NA\nuses code 1 higher than real codes in imputed value list")
X }
X else
X { #Get category code with score closest to pred. score
X ww <- apply(outer(newy[j], Trantab[[i]]$y,
X function(x,y)abs(x-y)),1,order)[1,]
X pred <- Trantab[[i]]$x[ww]
X }
X }
X names(pred) <- rnam[j]
X Imputed[[i]] <- pred
X NULL
X }
X else
X {
X if(usefill>0) Im <- rep(fillin[i], sum(j))
X else Im <- approx(newy[!j], x[!j,i], newy[j], rule=2)$y
X names(Im) <- rnam[j]
X Imputed[[i]] <- Im
X NULL
X }
X } #end imputed
X } #end last.iter
X } #end non-asis
X if(curtail && any(j))
X {
X r <- range(newy[!j])
X newy[j] <- pmax(pmin(newy[j],r[2]),r[1])
X }
X if(iter>1)
X {
X dmax <- max(dmax, min(max(abs(xt[,i]-newy),na.rm=T),
X max(abs(-xt[,i]-newy),na.rm=T))/scale[i])
X #Allows for arbitrary flips (negation) of current transformation
X }
X if(last.iter) xtl[,i] <- newy else xt[,i] <- newy
X #don't update working transformations
X #during last iteration since recomputing x-coefficients
X #on the basis of current transformations, which may flip rapidly
X
X if((pl & last.iter) | allpl)
X {
X xti <- if(last.iter) xtl[,i] else xt[,i]
X plot(x[,i], xti, xlab=lab,ylab=paste("Transformed",lab))
X title(sub=paste("R2=",format(round(r2,2)),sep=""),cex=.4,adj=0)
X if(any(j)) title(sub=paste(sum(j),"missing"),cex=.4,adj=1)
X if(show.na && any(j))
X {
X scat1d(xti[j], 4, ...)
X if(imputed && last.iter) scat1d(Imputed[[i]], 3, ...)
X }
X }
X R2[i] <- r2; R2.adj[i] <- r2.adj
X
X } #end i
X if(iter>1)cat(format(round(dmax,3)),"")
X if(iter %% 10 == 0) cat("\n")
X niter <- iter
X if(last.iter) break
X last.iter <- (iter==(iter.max-1)) | (iter>1 && dmax<eps)
} #end iter
if(iter>3 & niter==iter.max & dmax>=eps)
X stop(paste("no convergence in",iter.max,"iterations"))
#last & was (!last.iter)
X
#Use xtl instead of xt, otherwise transformed variables will not
#match ones from predict() or Function() since coefficients have
#been updated
X
attr(xtl, "call") <- call
attr(xtl, "formula") <- formula
attr(xtl, "iter") <- niter
cat("Convergence in",niter,"iterations\n")
X
attr(xtl, "imp.con") <- usefill>0
X
names(R2) <- nam
attr(xtl, "rsq") <- R2
cat("R-squared achieved in predicting each variable:\n\n")
print(round(R2, 3))
names(R2.adj) <- nam
attr(xtl, "rsq.adj") <- R2.adj
cat("\nAdjusted R-squared:\n\n")
print(round(R2.adj, 3))
if(shrink)
{
X names(shr) <- nam
X attr(xtl, "shrinkage") <- shr
X cat("\nShrinkage factors:\n\n")
X print(round(shr,3))
}
X
names(parms) <- names(coef) <- nam
attr(xtl, "categorical") <- categorical
attr(xtl, "asis") <- asis
attr(xtl,"parms") <- parms
attr(xtl,"coef") <- coef
attr(xtl,"xcoef") <- xcoef
attr(xtl,"fillin") <- fillin
attr(xtl,"scale") <- scale
r <- apply(xtl, 2, range)
dimnames(r) <- list(c("low","high"), nam)
attr(xtl,"ranges") <- r
if(trantab) attr(xtl, "trantab") <- Trantab
X
if(imputed)
X {
X names(Imputed) <- nam
X attr(xtl,"imputed") <- Imputed
X }
X
class(xtl) <- "transcan"
invisible(xtl)
}
X
X
summary.transcan <- function(x, long=F) {
X dput(attr(x, "call")); cat("\n")
X cat("Iterations:",attr(x,"iter"),"\n\n")
X cat("R-squared achieved in predicting each variable:\n\n")
X print(round(attr(x,"rsq"),3))
X cat("\nAdjusted R-squared:\n\n")
X print(round(attr(x,"rsq.adj"),3))
X if(!is.null(shr <- attr(x,"shrinkage")))
X {
X cat("\nShrinkage factors:\n\n")
X print(round(shr,3))
X }
X cat("\nCoefficients of canonical variates for predicting each (row) variable\n\n")
X xcoef <- attr(x,"xcoef")[,-1]
X g <- format(round(xcoef,2))
X g[is.na(xcoef)] <- ""
X print(g, quote=F)
X
X imp <- attr(x,"imputed")
X if(!is.null(imp)) {
X nimp <- T
X for(nn in names(imp)) {
X if(length(z <- imp[[nn]])) {
X if(nimp & !long) cat("\nSummary of imputed values\n\n"); nimp <- F
X if(long) {cat("\nImputed values for",nn,"\n\n");print(z)}
X describe(z, nn) } } }
X if(attr(x,"imp.con"))
X cat("\nImputed values set to these constants:\n\n")
X else cat("\nStarting estimates for imputed values:\n\n")
X print(attr(x,"fillin"))
X
X invisible() }
X
print.transcan <- function(x, long=F) {
X dput(attr(x,"call")); cat("\n")
X if(long) print(unclass(x)) else print.matrix(x)
X invisible() }
X
X
uncbind <- function(x, prefix="", suffix="")
{
X nn <- dimnames(x)[[2]]
X for(i in 1:ncol(x)) assign(paste(prefix,nn[i],suffix,sep=""), x[,i], where=1)
X invisible()
}
X
impute.transcan <- function(x, var, name=as.character(substitute(var)))
{
X imp <- attr(x, "imputed")
X if(is.null(imp))
X { warning("imputed was not specified to transcan")
X return(var)
X }
X impval <- imp[[name]]
X if(!length(impval)) return(var)
X
X namvar <- names(var)
X if(is.null(namvar)) names(var) <- namvar <- as.character(1:length(var))
X
X #Now take into account fact that transcan may have been
X #run on a superset of current data frame
X
X nam <- names(impval)[names(impval) %in% namvar]
X m <- length(nam)
X if(m!=sum(is.na(var)))
X warning("number of NAs in var != number of imputed values from transcan")
X if(m==0) return(var)
X if(is.factor(var)) var[nam] <- levels(var)[impval[nam]]
X else var[nam] <- impval[nam]
# lab <- label(var)
# if(is.null(lab) || lab=="") lab <- name
# lab <- paste(lab,"with",m,"NAs imputed")
# attr(var, "label") <- lab
X attr(var, "imputed") <- (1:length(var))[namvar %in% names(impval)]
X class(var) <- c("impute",class(var))
X var
}
X
"[.transcan" <- function(x, ..., drop=T)
{
X ats <- attributes(x)
X ats$dimnames <- ats$dim <- ats$names <- class(x) <- NULL
X y <- x[..., drop = drop]
X attributes(y) <- c(attributes(y), ats)
X if(is.null(dim(y)))
X {
X aty <- attributes(y)
X aty$call <- aty$iter <- aty$rsq <- aty$parms <- aty$coef <-
X aty$xcoef <- aty$rsq.adj <- aty$shrinkage <-
X aty$fillin <- aty$imputed <- aty$class <- aty$ranges <-
X aty$imp.con <- aty$scale <- aty$categorical <- aty$asis <-
X aty$trantab <- NULL
X attributes(y) <- aty
X if(is.character(z <- list(...)[[1]]))
X attr(y,"label") <- paste("Transformed",z)
X }
X y
}
X
X
predict.transcan <- function(obj, newdata=NULL, iter.max=50, eps=.01,
X curtail=T, type=c("transformed","original"))
{
X type <- match.arg(type)
X
X atr <- attributes(obj)
X parms <- atr$parms
X coef <- atr$coef
X xcoef <- atr$xcoef
X fillin <- atr$fillin
X ranges <- atr$ranges
X scale <- atr$scale
X imp.con<- atr$imp.con
X trantab<- atr$trantab
X categorical <- atr$categorical
X formula <- atr$formula
X
X if(type=="original" & is.null(trantab))
X stop('type="trantab" and trantab=T not specified to transcan')
X
X if(!is.null(formula)) {
X oldop <- options(na.action="na.retain")
X y <- model.frame(formula, data=newdata)
X options(oldop)
# if(attr(attr(y,"terms"),"response")==1) y <- y[,-1,drop=T]
X d <- dim(y)
X p <- d[2]
X newdata <- matrix(NA, nrow=d[1], ncol=p,
X dimnames=list(attr(y,"row.names"), names(y)))
X for(i in 1:p) {
X w <- y[[i]]
X if(is.character(w)) {
X warning("character predictor present. Depending on levels being same as in original fit, and that all levels are present in the data")
X w <- factor(w)
X }
X newdata[,i] <- unclass(w)
X }
X } else {
X if(is.null(newdata)) stop("newdata must be given (unless formula was given to transcan)")
X p <- ncol(newdata)
X }
X if(!is.matrix(newdata)) {
X if(is.null(names(newdata))) names(newdata) <- dimnames(obj)[[2]]
X newdata <- t(as.matrix(newdata))
X }
X
if(!any(is.na(newdata))) iter.max <- 1 #only 1 iteration needed if no NAs
X xt <- newdata
X nam <- dimnames(obj)[[2]]
X if(ncol(obj)!=p) stop("wrong number of columns in newdata")
X if(is.null(dimnames(xt)[[2]]))dimnames(xt) <- list(dimnames(xt)[[1]],nam)
X else if(any(dimnames(newdata)[[2]]!=nam))
X warning("column names in newdata do not match column names in obj")
X if(length(dimnames(xt)[[1]])==0) dimnames(xt) <- list(as.character(1:nrow(xt)),
X dimnames(xt)[[2]])
X
X for(iter in 1:iter.max)
X {
X dmax <- 0
X for(i in 1:p)
X {
X lab <- nam[i]
X j <- is.na(newdata[,i])
X prm <- parms[[lab]]
X if(is.null(prm)) #asis
X {
X newy <- newdata[,i]
X if(any(j))newy[j] <- if(iter==1) fillin[i] else
X drop(cbind(1,xt[j,-i,drop=F]) %*% xcoef[i,-i-1])
X }
X else
X {
X if(is.matrix(prm)) #factor
X {
X lev <- attr(prm, "codes")
X consec.lev <- match(newdata[,i], lev) #may give NAs - OK for next line
X R <- prm[consec.lev,, drop=F]
X if(iter==1 && any(match(newdata[!j,i], lev, 0)==0))
X stop("codes for categorical variable not in original list")
X }
X else R <- rcspline.eval(newdata[,i], prm, inclx=T)
X newy <- drop(cbind(1,R) %*% coef[[i]])
X if(any(j)) newy[j] <- if(iter==1) 0 else
X drop(cbind(1, xt[j,-i,drop=F]) %*%xcoef[i, -i-1])
X }
X if(curtail) newy <- pmax(pmin(newy,ranges[2,i]),ranges[1,i])
X if(iter>1) dmax <- max(dmax, min(
X max(abs(xt[,i]-newy),na.rm=T),
X max(abs(-xt[,i]-newy),na.rm=T))/scale[i])
X xt[,i] <- newy
X } #end i
X niter <- iter
X if(niter>1 && dmax<eps) break
X } #end iter
X if(iter.max>3 & niter==iter.max)
X stop(paste("no convergence in",iter.max,"iterations"))
cat("Convergence in",niter,"iterations\n")
if(type=="transformed") return(xt)
X
for(i in 1:p)
{
X ft <- trantab[[i]]
X j <- is.na(newdata[,i])
X if(any(j))
X {
X if(imp.con) newdata[j,i] <- fillin[i]
X else if(nam[i] %in% categorical)
X {
X # find category with scored value closest to given one
X ww <- apply(outer(xt[j,i],ft$y,function(x,y)abs(x-y)),1,order)[1,]
X newdata[j,i] <- ft$x[ww]
X }
X else newdata[j,i] <- approx(ft$y, ft$x, xout=xt[j,i], rule=2)$y
X }
}
newdata
}
X
Function <- function(object, ...)
X UseMethod("Function")
X
Function.transcan <- function(x, prefix=".", suffix="", where=1)
{
at <- attributes(x)
Nam <- dimnames(x)[[2]]
p <- length(Nam)
categorical <- at$categorical
asis <- at$asis
coef <- at$coef
parms <- at$parms
fnames <- character(p)
X
for(i in 1:p) {
X nam <- Nam[i]
X cof <- coef[[nam]]
X if(nam %in% asis) f <- function(x) x
X else if(nam %in% categorical) {
X codes <- attr(parms[[nam]], "codes")
X g <- "{x <- unclass(x);"
X cof[-1] <- cof[-1] + cof[1] #convert from ref cell to cell means model
X for(j in 1:length(codes)) {
X if(j>1 && cof[j]>0) g <- paste(g,"+")
X g <- paste(g, format(cof[j]), "*(x==",format(codes[j]),")",sep="")
X }
X g <- paste(g, "}", sep="")
X f <- function(x) NULL
X f[[2]] <- parse(text=g)[[1]]
X }
X else f <- attr(rcspline.restate(parms[[nam]], cof), "function")
X
X fun.name <- paste(prefix,nam,suffix,sep="")
X cat("Function for transforming",nam,"stored as",fun.name,"\n")
X assign(fun.name, f, where=where)
X fnames[i] <- fun.name
X }
invisible(fnames)
}
X
na.retain <- function(mf) mf
X
X
SHAR_EOF
chmod 0644 transcan.s ||
echo 'restore of transcan.s failed'
Wc_c="`wc -c < 'transcan.s'`"
test 19742 -eq "$Wc_c" ||
echo 'transcan.s: original size 19742, current size' "$Wc_c"
fi
# ============= rcspline.eval.s ==============
if test -f 'rcspline.eval.s' -a X"$1" != X"-c"; then
echo 'x - skipping rcspline.eval.s (File already exists)'
else
echo 'x - extracting rcspline.eval.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'rcspline.eval.s' &&
#rcspline.eval - function to create design matrix for restricted cubic
# spline function of Stone & Koo, given an input vector and optionally
# a vector of knots. If knots are not given, knots are set using
# default algorithm. If the number of knots is not given, 5 are used.
# Terms are normalized by (outer-inner knot)^2.
# Can optionally return antiderivative of spline functions if
# type="integral".
# norm=0 : no normalization of constructed variables
# norm=1 : divide by cube of difference in last 2 knots
# makes all variables unitless
# norm=2 : (default) divide by square of difference in outer knots
# makes all variables in original units of x
#
# Returns:
# x - design matrix for derived spline variables
# (includes original x in first column if inclx=T or
# type="integral")
# attribute knots - input or derived vector of knots
# If knots.only=T, returns instead the vector of estimated or given
# knots.
# If rpm is not null, replaces missing x with rpm before evaluating
# but after estimating knots.
#
# F. Harrell 13 Feb 90
# Modified 28 Mar 90 - improved default knot computation
# 22 Aug 90 - put knots as attribute, return matrix
# 20 Sep 90 - added knots.only argument
# 16 Oct 90 - added rpm argument
# 11 Dec 91 - added type argument
# 27 Dec 91 - added norm argument
# 26 Jun 93 - added evasive action if <3 knots
#
rcspline.eval <- function(x,knots,nk=5,inclx=F,knots.only=F,
X type="ordinary",norm=2, rpm=NULL) {
X
if(missing(knots)) {
X xx <- x[!is.na(x)]
X n <- length(xx)
X if(n<6)stop('fewer than 6 non-missing observations with knots omitted')
X if(nk<3) stop('nk must be >= 3')
X outer <- .1
X if(nk>3) outer <- .05
X if(nk>6) outer <- .025
X knots <- quantile(xx,seq(outer,1.0-outer,length=nk))
X if(length(unique(knots))<3) {
X knots <- quantile(xx,seq(outer,1.0-outer,length=2*nk))
X if((nu <- length(unique(knots)))<3) {
X cat("Fewer than 3 unique knots. Frequency table of variable:\n")
X print(table(xx))
X stop() }
X warning(paste("could not obtain",nk,"knots with default algorithm.\n",
X "Used alternate algorithm to obtain",
X nu,"knots"))
X }
X if(n<100) {
X xx <- sort(xx)
X knots[1]<-xx[5]
X knots[nk]<-xx[n-4] }
X }
knots <- sort(unique(knots))
nk <- length(knots)
if(nk<3) {
X cat("fewer than 3 unique knots. Frequency table of variable:\n")
X print(table(x))
X stop() }
X
if(knots.only) return(knots)
X
x <- as.matrix(x)
storage.mode(x) <- "single"
if(!is.null(rpm)) x[is.na(x)] <- rpm
xx <- matrix(single(1),length(x),nk-2)
knot1 <- knots[1]
knotnk <- knots[nk]
knotnk1 <- knots[nk-1]
if(norm==0) kd <- 1
else if(norm==1) kd <- knotnk-knotnk1
else kd <- (knotnk-knot1)^.66666666666666666666666
X
if(type=="integral") power <- 4 else power <- 3
X
for(j in 1:(nk-2)) {
X xx[,j]<-pmax((x-knots[j])/kd,0)^power +
X ((knotnk1-knots[j])*pmax((x-knotnk)/kd,0)^power -
X (knotnk-knots[j])*(pmax((x-knotnk1)/kd,0)^power))/
X (knotnk-knotnk1)
X }
X
if(power==4) xx <- cbind(x, x*x/2, xx*kd/4)
else if(inclx) xx <- cbind(x, xx)
X
attr(xx,"knots") <- knots
X
xx
X }
X
X
SHAR_EOF
chmod 0644 rcspline.eval.s ||
echo 'restore of rcspline.eval.s failed'
Wc_c="`wc -c < 'rcspline.eval.s'`"
test 3075 -eq "$Wc_c" ||
echo 'rcspline.eval.s: original size 3075, current size' "$Wc_c"
fi
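The basis term constructed in the loop of rcspline.eval can be written out directly, which may help readers checking the algebra or porting the function. A Python sketch of one observation's row of nonlinear terms (a hypothetical helper, not part of the library; only the norm options from the header comments are reproduced):

```python
def rcs_basis(x, knots, norm=0):
    # One row of the design matrix built by rcspline.eval: for k knots
    # there are k-2 nonlinear terms; the linear term is x itself.
    # norm=0: no scaling; norm=1: scale by the last-knot gap (cubed in
    # effect, since kd divides inside the cube); norm=2: by the
    # outer-knot range squared, keeping terms in the units of x.
    k = len(knots)
    t1, tk1, tk = knots[0], knots[-2], knots[-1]
    if norm == 0:
        kd = 1.0
    elif norm == 1:
        kd = tk - tk1
    else:
        kd = (tk - t1) ** (2.0 / 3.0)
    pos = lambda u: max(u, 0.0)
    return [
        pos((x - tj) / kd) ** 3
        + ((tk1 - tj) * pos((x - tk) / kd) ** 3
           - (tk - tj) * pos((x - tk1) / kd) ** 3) / (tk - tk1)
        for tj in knots[:k - 2]
    ]
```

The restriction built into the last two cubic pieces makes each term vanish below the first knot and grow exactly linearly beyond the last knot, which is the defining property of the Stone & Koo restricted cubic spline.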
# ============= rcspline.restate.s ==============
if test -f 'rcspline.restate.s' -a X"$1" != X"-c"; then
echo 'x - skipping rcspline.restate.s (File already exists)'
else
echo 'x - extracting rcspline.restate.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'rcspline.restate.s' &&
rcspline.restate <- function(knots, coef, type=c("ordinary","integral"),
X x="X", lx=nchar(x),norm=2,
X columns=65, before="& &", after="\\\\",
X begin="", nbegin=0) {
X
type <- match.arg(type)
k <- length(knots)
if(k<3) stop("must have >=3 knots in a restricted cubic spline")
p <- length(coef)
if(p == k) {
X Intc <- coef[1]
X coef <- coef[-1]
X p <- p-1
X }
else Intc <- 0
if(k-1 != p) stop("coef must be of length # knots - 1")
X
knotnk <- knots[k]; knotnk1 <- knots[k-1]; knot1 <- knots[1]
if(norm==0) kd <- 1 else
if(norm==1) kd <- (knotnk-knotnk1)^3 else
kd <- (knotnk-knot1)^2
coef[-1] <- coef[-1]/kd
X
d <- c(0, knots-knotnk)[1:p]
coefk <- sum(coef*d)/(knotnk-knotnk1)
X
d <- c(0, knots-knotnk1)[1:p]
coefk1 <- sum(coef*d)/(knotnk1-knotnk)
X
if(is.null(names(coef)))names(coef) <- paste(x,1:length(coef),sep="")
coef <- c(coef, coefk, coefk1)
names(coef)[k] <- "1st restricted coef"
names(coef)[k+1] <- "2nd restricted coef"
X
if(type=="integral") coef <- c(.5*coef[1],.25*coef[-1])
X
olddig <- .Options$digits
.Options$digits <- 12
cof <- format(coef)
kn <- format(-knots)
if(Intc!=0) {
X txt <- format(Intc)
X if(type=="integral") txt <- paste(txt, "* x")
X if(coef[1]>=0) txt <- paste(txt, "+")
X }
else txt <- ""
X
if(cof[1]!=0) txt <- paste(txt,cof[1],if(type=="ordinary")"* x" else
X "* x^2", sep="")
for(i in 2:(p+2)) {
X nam <- paste("pmax(x", if(knots[i-1]<0) "+" else NULL, kn[i-1],
X ",0)^", if(type=="ordinary")"3" else "4", sep="")
X txt <- paste(txt, if(coef[i]>0 & (i>2 | coef[1]!=0 | Intc!=0)) "+" else NULL,
X cof[i], "*", nam, sep="")
X }
func <- function(x) NULL
func[[2]] <- parse(text=txt)[[1]]
X
.Options$digits <- olddig
cof <- format(coef)
kn <- format(-knots)
X
lcof <- nchar(cof)
cof <- unix("sed -e 's/e+00//' -e 's/e-0\\(.\\)/\\\\!\\\\times\\\\!10^{-\\1}/' -e 's/e-\\(..\\)/\\\\!\\\\times\\\\!10^{-\\1}/' -e 's/e+0\\(.\\)/\\\\!\\\\times\\\\!10^{\\1}/' -e 's/e+\\(..\\)/\\\\!\\\\times\\\\!10^{\\1}/'", cof)
X
cur <- begin; colcnt <- nbegin; tex <- NULL
if(Intc!=0) {
X fint <- format(Intc)
X if(type=="integral") { fint <- paste(fint, x); colcnt <- colcnt+2 }
X cur <- paste(cur, fint, sep="")
X colcnt <- colcnt + nchar(fint)
X if(coef[1]>0) { cur <- paste(cur, " + ", sep=""); colcnt <- colcnt+3 }
X }
if(coef[1]!=0) {
X sp <- if(length(grep("times",cof[1]))==0) "\\:" else NULL
X cur <- paste(cur, cof[1], sp, x, if(type=="integral") "^2",sep="")
X #\:=medium space in LaTeX
X colcnt <- colcnt+lcof[1]+lx+(type=="integral")
X }
X
tex.names <- character(p+2)
size <- lx+lcof[-1]+nchar(kn)+3
X
for(i in 2:(p+2)) {
X nam <- paste("(", x, if(knots[i-1]<0) "+" else NULL,
X kn[i-1], ")_{+}^{", if(type=="ordinary")"3}" else "4}", sep="")
X q <- paste(if(coef[i]>0 & (i>2 | coef[1]!=0 | Intc!=0)) "+" else NULL,
X cof[i], nam, sep="")
X n <- size[i-1]
X if(colcnt+n > columns) {
X tex <- c(tex, cur)
X cur <- ""
X colcnt <- 0 }
X cur <- paste(cur, q, sep="")
X colcnt <- colcnt+n
X }
X
tex <- c(tex, cur)
tex <- paste(before, tex, after)
X
if(Intc!=0) coef <- c(Intercept=Intc, coef)
X
attr(coef, "knots") <- knots
attr(coef, "function") <- func
#attr(tex, "class") <- "TeX"
attr(coef, "latex") <- tex
names(colcnt) <- NULL
attr(coef, "columns.used") <- colcnt
X
X
coef }
X
X
X
X
X
X
SHAR_EOF
chmod 0644 rcspline.restate.s ||
echo 'restore of rcspline.restate.s failed'
Wc_c="`wc -c < 'rcspline.restate.s'`"
test 3258 -eq "$Wc_c" ||
echo 'rcspline.restate.s: original size 3258, current size' "$Wc_c"
fi
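The two "restricted" coefficients that rcspline.restate recovers (coefk and coefk1 above) are the cubic coefficients for the last two knots implied by the tail-linearity constraints. A Python sketch of that computation, under the same assumptions as the S code (nonlinear coefficients already un-normalized; the helper name is illustrative):

```python
def restricted_tail_coefs(coef, knots):
    # Recover the last two cubic coefficients implied by tail linearity,
    # as rcspline.restate does: with gamma_1..gamma_{k-2} free, the
    # constraints sum(gamma)=0 and sum(gamma*t)=0 determine gamma_{k-1}
    # and gamma_k.  coef = [linear coef, beta_1, ..., beta_{k-2}];
    # knots is the sorted knot list of length k.
    k = len(knots)
    tk, tk1 = knots[-1], knots[-2]
    betas = coef[1:]
    coefk = sum(b * (t - tk) for b, t in zip(betas, knots[:k - 2])) / (tk - tk1)
    coefk1 = sum(b * (t - tk1) for b, t in zip(betas, knots[:k - 2])) / (tk1 - tk)
    return coefk, coefk1
```

A quick check of the constraints: the full set of cubic coefficients must sum to zero (so the x^3 terms cancel beyond the last knot), and the same holds for the knot-weighted sum.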
# ============= in.operator.s ==============
if test -f 'in.operator.s' -a X"$1" != X"-c"; then
echo 'x - skipping in.operator.s (File already exists)'
else
echo 'x - extracting in.operator.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'in.operator.s' &&
"%in%" <- function(a,b) {
X
if(is.factor(a) & is.numeric(b)) {
X warning("a is factor, b is numeric. Assuming b is coded factor values")
X a <- unclass(a) }
X
else if(is.numeric(a) && is.factor(b)) {
X warning("a is numeric, b is factor. Assuming a is coded factor values")
X b <- unclass(b) }
X
match(a, b, nomatch=0) > 0
X }
SHAR_EOF
chmod 0644 in.operator.s ||
echo 'restore of in.operator.s failed'
Wc_c="`wc -c < 'in.operator.s'`"
test 341 -eq "$Wc_c" ||
echo 'in.operator.s: original size 341, current size' "$Wc_c"
fi
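The core of the `%in%` operator above is the single expression `match(a, b, nomatch=0) > 0`. A Python equivalent of that membership test (the factor/numeric coercion warnings in the S version have no analogue here and are omitted):

```python
def vec_in(a, b):
    # Equivalent of the S expression match(a, b, nomatch=0) > 0:
    # elementwise membership of a's values in b, returning booleans.
    bset = set(b)          # hash lookup instead of S's match() table
    return [v in bset for v in a]
```

For example, `vec_in([1, 2, 5], [2, 3, 4])` yields one boolean per element of the first argument, True only where a value occurs in the second.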
# ============= scat1d.s ==============
if test -f 'scat1d.s' -a X"$1" != X"-c"; then
echo 'x - skipping scat1d.s (File already exists)'
else
echo 'x - extracting scat1d.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'scat1d.s' &&
### -*-S-*- Improvements due to Martin Maechler
X
scat1d <- function(x, side=3, frac=.02, jitfrac=.008, tfrac, eps=.001,
X lwd=0.1)
{
X pr <- par()
X usr <- pr$usr
X
X if(side==1 | side==3) {l <- 1:2; ax <- 1} else {l <- 3:4; ax <- 2}
X u <- usr[l]
X u.opp <- usr[-l]
X w <- u[2]-u[1]
X x <- x[!is.na(x)]
X n <- length(x)
X if(missing(tfrac)) tfrac <- if(n<125) 1 else max(.1, 125/n)
X else if (tfrac < 0 || tfrac > 1) stop("must have 0 <= tfrac <= 1")
X
X if(jitfrac>0 && any(duplicated(round(x/w/eps))))
X x <- x + runif(n, -w*jitfrac, w*jitfrac)
# h <- (u.opp[2]-u.opp[1])*frac*min(fin)/fin[-ax]
X h <- min(pr$pin)*frac/pr$uin[-ax]
X a <- if(side<3) u.opp[1] else u.opp[2]-h
X b <- if(side<3) u.opp[1]+h else u.opp[2]
X if(tfrac<1)
X {
X l <- tfrac*(b-a)
X a <- a + runif(n)*(b-l-a) #runif(n, a, b-l) if frac>0
X b <- a+l
X }
X if(ax==1) segments(x, a, x, b, lwd=lwd, xpd=frac<0)
X else segments(a, x, b, x, lwd=lwd, xpd=frac<0)
X invisible()
}
SHAR_EOF
chmod 0644 scat1d.s ||
echo 'restore of scat1d.s failed'
Wc_c="`wc -c < 'scat1d.s'`"
test 987 -eq "$Wc_c" ||
echo 'scat1d.s: original size 987, current size' "$Wc_c"
fi
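The de-overlap logic at the top of scat1d (bin values to a fine grid, jitter only if any bin holds duplicates) is easy to miss in the S code. A Python sketch of just that step, with the same defaults; the function name and the explicit rng argument are illustrative additions:

```python
import random

def jitter_ticks(x, w, jitfrac=0.008, eps=0.001, rng=None):
    # Sketch of scat1d's overlap check: values are binned to a grid of
    # width w*eps (w = axis range); if any bin holds more than one
    # value, every value is jittered uniformly by +/- w*jitfrac so
    # coincident rug ticks remain visible.
    rng = rng or random.Random(0)
    cells = [round(v / (w * eps)) for v in x]
    if jitfrac > 0 and len(cells) != len(set(cells)):
        return [v + rng.uniform(-w * jitfrac, w * jitfrac) for v in x]
    return list(x)
```

Note that, as in the S original, data with no near-duplicates pass through unchanged; jitter is applied to all values at once, not only to the tied ones.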
# ============= model.frame.default.s ==============
if test -f 'model.frame.default.s' -a X"$1" != X"-c"; then
echo 'x - skipping model.frame.default.s (File already exists)'
else
echo 'x - extracting model.frame.default.s (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'model.frame.default.s' &&
#Change from S-Plus supplied version (3.1) is to look for
#options(na.action), per Terry Therneau, and add Design()
#Redefine [ and [.factor to carry label attribute
X
model.frame.default <-
function(formula, data = NULL, na.action = na.fail, ...)
{
X if(missing(formula)) {
X if(!missing(data) && inherits(data, "data.frame") && length(
X attr(data, "terms")))
X return(data)
X formula <- as.formula(data)
X }
X else if(missing(data) && inherits(formula, "data.frame")) {
X if(length(attr(formula, "terms")))
X return(formula)
X data <- formula
X formula <- as.formula(data)
X }
X if(missing(na.action) && !is.null(tj <- attr(data, "na.action")))
X na.action <- tj
X else if (!is.null(tj <- options("na.action")[[1]])) na.action <- tj
X
X if(!inherits(formula, "terms"))
X formula <- terms(formula, data = data)
X dots <- substitute(list(...))
X
X .Internal(model.frame(formula, data,
X na.action, dots), "model_frame")
}
X
"[" <- function(x, ..., drop = T) {
lab <- attr(x, "label")
x <- .Internal(x[..., drop = drop], "S_extract", T, 1)
attr(x, "label") <- lab
x
}
X
"[.factor" <- function(x,i, ...) {
X co <- attr(x, "contrasts")
X lab <- attr(x, "label")
X ll <- levels(x)
X y <- NextMethod("[")
X attr(y, "contrasts") <- co
X attr(y, "label") <- lab
X levels(y) <- ll #only needed for length 0 subscripts
X class(y) <- class(x)
X y
}
X
X
#For some reason, modifying "[" makes the S-Plus 3.2 functions
#crosstabs and print.char.matrix not work. These versions do work.
X
crosstabs <- function(formula = ~ ., data = sys.parent(), margin, subset, na.action =
X na.fail, drop.unused.levels = T)
{
X if(mode(formula) != "call" || formula[[1]] != as.name("~"))
X stop("first argument must be a formula")
X rhs <- formula[[length(formula)]]
X lhs <- if(length(formula) == 2) NULL else formula[[2]]
X addends <- function(expr)
X {
# returns list of the addends in an expression of
# the form addend1 + addend2 + addend3 + ...
# The addends may be names or literals or non-addition
# expressions. E.g., addends(expression(x+log(x+y)+z)[[1]])
# returns list(x, log(x + y), z).
X result <- list()
X is.sum <- function(expr)
X mode(expr) == "call" && expr[[1]] == as.name("+")
X if(is.sum(expr))
X for(e in expr[-1]) {
X if(!is.sum(e))
X result <- c(result, list(e))
X else result <- c(result, Recall(e))
X }
X else result <- list(expr)
X result
X }
X if(!is.numeric(data) && !is.list(data))
X stop("data must be a list or a frame number")
X addends.rhs <- addends(rhs)
X if(formula.is.dot <- (length(addends.rhs) == 1 && as.character(
X addends.rhs[[1]]) == ".")) {
# formula ~. means all data columns, y~. means rhs
# includes all data columns except y.
X if(!is.list(data)) stop(
X "Cannot use \".\" in formula unless data is an Splus dataset"
X )
X addends.rhs <- names(data)
X if(!is.null(lhs))
X addends.rhs <- setdiff(addends.rhs, all.names(lhs,
X functions = F, unique = T))
X addends.rhs <- lapply(addends.rhs, as.name)
X }
X if(!missing(subset)) {
X subset <- eval(substitute(subset), local = data)
X }
#
# The following transformations of 'data' do sort of what model.frame() does.
# model.extract() didn't do quite what we needed here so I did this from
# scratch.
X if(!is.data.frame(data)) {
# make a data.frame out of data so we can apply na.action
# and subscript to it. Data frame should include all variables
# mentioned in the formula and subscript expressions.
# This may waste some space -- it only needs to be done if there
# are NA's in the data or if 'subset' is given.
# A minor complication -- some variables in subset expression
# may not have same length as variables in formula (e.g., subset=1:n)
X this.call <- c(as.name("data.frame"), lapply(all.names(
X functions = F, unique = T, formula), as.name))
X mode(this.call) <- "call"
X data <- eval(this.call, local = data)
X }
# now data is a data frame, no matter what the input was.
# subset will be expression with variables in data
X if(!missing(subset)) {
X if(any(is.na(subset)))
X stop("Missing values not allowed in subset expression")
X data <- data[subset, ]
X }
X columns.needed <- if(!formula.is.dot) intersect(names(data), all.names(
X functions = F, unique = T, formula)) else names(data)
X if(any(is.na(data <- data[, columns.needed, drop = F]))) {
X data <- na.action(data)
X # na.action should remove all NAs, one way or another
X }
X if(is.null(lhs)) {
X this.call <- addends.rhs
X this.call <- c(list(as.name("table")), this.call)
X mode(this.call) <- "call"
X }
X else {
X list.call <- c(as.name("list"), addends.rhs)
X mode(list.call) <- "call"
X if(length(addends.lhs <- addends(lhs)) > 1)
X stop("Only one term allowed on left hand side of formula"
X )
X this.call <- list(as.name("tapply"), addends.lhs[[1]],
X list.call, sum)
X mode(this.call) <- "call"
X }
X tbl <- eval(this.call, local = data)
X if(!is.array(tbl)) {
# tapply doesn't create 1-d array (table does)
# when there is only one classification variable
X tbl <- array(tbl, dim = length(tbl), dimnames = list(names(tbl)
X ))
X }
X if(!is.null(lhs) && any(is.na(tbl))) {
# tapply() puts NA's into empty table entries -- replace with 0's.
X tbl[is.na(tbl)] <- 0
X }
X if(drop.unused.levels && !all(tbl)) {
# for each dimension non.zero gives slices to keep
X non.zero <- lapply(seq(along = dim(tbl)), function(i, tbl)
X apply(tbl, i, any), tbl)
X # tbl<-tbl[non.zero[[1]],non.zero[[2]],...,drop=F]
X tbl <- do.call("[", c(list(as.name("tbl"), drop = F), non.zero)
X )
X }
X attr(tbl, "na.message") <- attr(data, "na.message")
X names(dimnames(tbl)) <- unlist(lapply(addends.rhs, deparse))
X if(missing(margin)) {
X i <- seq(along = dim(tbl))
X margin <- if(length(i) == 1) list("N/Total" = integer(0)) else
X list("N/RowTotal" = setdiff(i, 2), "N/ColTotal"
X = setdiff(i, 1), "N/Total" = integer(0))
X }
# make sure user-supplied margin doesn't contain out-of-range values
X margin <- lapply(margin, function(x, mx)
X if(length(x)) x[x <= mx] else x, length(dim(tbl)))
X if(length(margin)) {
X marginals <- lapply(margin, proportion.table, data = tbl)
X attr(tbl, "marginals") <- marginals
X }
# marginal summaries. These are hard to parameterize correctly.
# For now, assume default margin argument and give row and column
# totals here.
X attr(tbl, "call") <- sys.call()
X class(tbl) <- "crosstabs"
X tbl
}
X
print.char.matrix <-
function(x, ..., hsep = "|", vsep = "-", csep = "+", print.it = T, prefix.width,
X min.colwidth = .Options$digits, abbreviate.dimnames = T, page.width =
X .Options$width)
{
# print an object of class char.matrix. This is a matrix of character
# strings. The strings typically have \n's in them to indicate
# line breaks and this function will break those strings into separate
# lines and put all the lines for each string into one matrix cell.
# The matrix cells are separated by the horizontal and vertical separator
# characters, hsep and vsep (csep is used where they cross).
# This function does not attempt to break up a large matrix into pieces
# which fit on a page.
# Currently, each call to cat() prints exactly one line. One could
# replace the calls to cat with calls to paste(sep="",...) to collect
# all the output. Then the output could be used as a single cell of
# a new char.matrix object so one could construct complex tables.
X x.orig <- x
X if(!print.it)
X val <- ""
X bbox <- string.bounding.box(x)
X w <- bbox$columns
X h <- bbox$rows
X dim(w) <- dim(x)
X dim(h) <- dim(x)
X w <- pmax(apply(w, 2, max), min.colwidth)
X # prefix.width is number of spaces to leave for row names
X if(missing(prefix.width)) prefix.width <- if(!is.null(dimnames(x)[[1]])
X ) max(min.colwidth, min(na.rm = T, median(w), max(nchar(
X dimnames(x)[[1]])))) else 0
X # If matrix is wider than page split it into 2 pieces, the first of
# which is narrow enough, and call print.char.matrix recursively to
# print the two pieces. (If second piece is still too wide, further
# recursion will take care of it.)
# We do not attempt to size to page if print.it=F.
X columns.need <- prefix.width + (ncol(x) + 1) * nchar(hsep) + sum(w)
X if(print.it && (columns.need > page.width)) {
X which <- (prefix.width + cumsum(w + nchar(hsep)) + nchar(hsep)) <=
X page.width
X if(!any(which)) {
X warning("page.width too small to print even one column of table"
X )
X which[1] <- T
X }
X else if(all(which)) {
X warning("Internal error in print.char.matrix")
X }
X print(x[, which, drop = F], ..., prefix.width = prefix.width,
X hsep = hsep, vsep = vsep, csep = csep, print.it =
X print.it, min.colwidth = min.colwidth,
X abbreviate.dimnames = abbreviate.dimnames, page.width
X = columns.need)
X if(any(!which)) {
X cat("\n")
X print(x[, !which, drop = F], ..., prefix.width =
X prefix.width, hsep = hsep, vsep = vsep, csep =
X csep, print.it = print.it, min.colwidth =
X min.colwidth, abbreviate.dimnames =
X abbreviate.dimnames, page.width = page.width)
X }
X }
X else {
# vsep line is line to print between rows of matrix. It looks like
# ------+-----+-----+, made up of vsep and csep characters.
X nhsep <- nchar(hsep)
X if(nchar(vsep) > 0) {
X if(nchar(csep) != nhsep)
X stop("nchar(csep) != nhsep")
X vsep.line <- vector("list", nchar(vsep))
X for(j in seq(length = nchar(vsep))) {
X vsep.line[[j]] <- rep(rep(substring(vsep, j, j),
X length(w)), w + nhsep)
X for(i in seq(length = nhsep))
X vsep.line[[j]][cumsum(w + nhsep) + i - nhsep] <-
X substring(csep, i, i)
X vsep.line[[j]] <- paste(paste(rep(substring(
X vsep, j, j), prefix.width), collapse = ""),
X csep, paste(vsep.line[[j]], collapse = ""),
X "\n", sep = "")
X }
X vsep.line <- do.call("paste", c(vsep.line, sep = "",
X collapse = ""))
X }
X else vsep.line <- ""
X justify <- function(string, width, chop = T)
X {
# left justify string in a space width characters wide.
# string and width may be vectors.
X if(chop) string <- substring(string, 1, width)
X if(nchar(spaces <- " ") < max(
X width))
X spaces <- paste(collapse = "", rep(" ", max(
X width)))
X paste(sep = "", string, substring(spaces, 1, width -
X nchar(string)))
X }
X if(!is.null(ndn <- names(dimnames(x)))) {
# print names of dimensions above the matrix
X out <- paste(sep = "", justify(ndn[1], prefix.width,
X chop = F), hsep, ndn[2], "\n")
X if(print.it)
X cat(out)
X else val <- paste(val, out, sep = "")
X }
# rn is row names, cn is column names
# We must either abbreviate or truncate dimnames to fit in space provided
# We might choose to not abbreviate them if we've already abbreviated
# them or know that we can fit them into prefix.width spaces.
X if(!is.null(rn <- dimnames(x)[[1]])) {
X if(abbreviate.dimnames && max(nchar(rn)) > prefix.width
X )
X rn <- abbreviate(rn, prefix.width)
X }
X else rn <- character(0)
X if(length(cn <- dimnames(x)[[2]])) {
X if(abbreviate.dimnames && max(nchar(cn)) > min(w))
X cn <- abbreviate(cn, min(w))
X out <- paste(justify("", prefix.width), hsep, sep = "",
X paste(collapse = "", sep = "", justify(cn, w),
X hsep))
X out <- paste(out, "\n", sep = "")
X if(print.it)
X cat(out)
X else val <- paste(val, out, sep = "")
X }
X if(print.it)
X cat(vsep.line)
X else val <- paste(val, vsep.line, sep = "")
X for(i in 1:nrow(x)) {
# loop over rows of matrix
# string.break.line breaks strings with newlines into multiple strings
# with one line per string. It returns a list with component i containing
# the text lines from the i'th input string.
X
#Next line changed 30May94 Frank Harrell: added drop=F - drop not being
#transmitted by my "[" for unknown reasons
X lines <- string.break.line(x[i, ,drop=F])
X for(j in 1:max(h[i, ])) {
# loop over text lines in this row
# print row name on first line only (rn[i][j] is rn[i] if j==1, "" otherwise)
X out <- paste(justify(rn[i][j], prefix.width),
X hsep, sep = "")
X # linesj is vector of text lines for columns in this line
X linesj <- unlist(lapply(lines, function(lines,
X j)
X lines[j], j))
X out <- paste(sep = "", out, paste(sep = "",
X collapse = hsep, justify(linesj, w, chop = F)
X ), hsep, "\n")
X if(print.it)
X cat(out)
X else val <- paste(val, out, sep = "")
X }
X if(print.it)
X cat(vsep.line)
X else val <- paste(val, vsep.line, sep = "")
X }
X }
X if(print.it)
X invisible(x.orig)
X else val
}
SHAR_EOF
chmod 0644 model.frame.default.s ||
echo 'restore of model.frame.default.s failed'
Wc_c="`wc -c < 'model.frame.default.s'`"
test 12311 -eq "$Wc_c" ||
echo 'model.frame.default.s: original size 12311, current size' "$Wc_c"
fi
# ============= .Data/.Help/transcan ==============
if test ! -d '.Data'; then
echo 'x - creating directory .Data'
mkdir '.Data'
fi
if test ! -d '.Data/.Help'; then
echo 'x - creating directory .Data/.Help'
mkdir '.Data/.Help'
fi
if test -f '.Data/.Help/transcan' -a X"$1" != X"-c"; then
echo 'x - skipping .Data/.Help/transcan (File already exists)'
else
echo 'x - extracting .Data/.Help/transcan (Text)'
sed 's/^X//' << 'SHAR_EOF' > '.Data/.Help/transcan' &&
.tr @@
.BG
.FN transcan
.FN summary.transcan
.FN print.transcan
.FN impute.transcan
.FN predict.transcan
.FN Function.transcan
.FN [.transcan
.TL
Transformations/Imputations using Canonical Variates
.DN
These functions are the main transformation/imputation function and
several methods for operating on its results.
`transcan' automatically transforms continuous and categorical variables
to have maximum correlation with the best linear combination of the other
variables. There is also an
option to use an alternative criterion: maximum correlation with the
first principal component of the other variables.
Continuous variables
are expanded as restricted cubic splines and categorical variables are
expanded as contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component columns.
This function is similar to `ace' except that transformations for
continuous variables are fitted using restricted cubic splines,
monotonicity restrictions
are not allowed, and NAs are allowed. When a variable has any NAs,
transformed scores for that variable are imputed using the optimum
transformations or are optionally set to constants. Shrinkage can be
used to safeguard against overfitting when imputing.
Optionally, imputed values on the original scale
are also computed and returned. For this purpose, the `tree' function
can optionally be used to impute categorical variables, using what is
predicted to
be the most probable category.
`summary' prints the function call, R-squares achieved in transforming
each variable, and for each variable the coefficients of all other
transformed variables that are used to estimate the transformation of
the initial variable.
If `imputed=T' was used in the call to transcan, also uses the `describe'
function to print a summary of imputed values. If `long=T', also
prints all imputed values with observation identifiers.
There is also a simple function `print.transcan' which
merely prints the transformation matrix and the function call.
It has an optional argument `long', which if set to `T' causes
detailed parameters to be printed.
`impute' does
imputations for a selected original data variable, on the original scale
(if `imputed=T' was given to `transcan').
`predict' computes predicted variables and imputed values
from a matrix of new data. This matrix should have the same column variables as
the original matrix used with `transcan', and in the same order (unless
a formula was used with `transcan').
`Function.transcan' creates S functions to transform variables using
transformations created by `transcan'. These functions are useful
for getting predicted values with predictors set to values on the original
scale.
The subscript function preserves attributes.
.CS
xt <- transcan(x, method=c("canonical","pc"),
X categorical=NULL, asis=NULL, nk, imputed=F, trantab=F,
X impcat=c("score", "tree"),
X mincut=40, pl=T, allpl=F, show.na=T,
X iter.max=50, eps=.1, curtail=T,
X imp.con=F, shrink=F, init.cat="mode", ...)
summary(xt, long=F)
print(xt, long=F)
impute(xt, var, name)
predict(xt, newdata, iter.max=50, curtail=T,
X type=c("transformed","original"))
Function(xt, prefix=".", suffix="", where=1)
xt.sub <- xt[subset,]
.RA
.AG x
a matrix containing continuous variable values and codes for categorical
variables. The matrix must have column names (`dimnames'). If row
names are present, they are used in forming the `names' attribute
of imputed values if `imputed=T'. `x' may also be a formula, in which
case the model matrix is created automatically, using data in the calling
frame. Advantages of using a formula are that `categorical' variables
can be determined automatically by a variable being a `factor'
variable, and variables with two unique levels are modeled `asis'.
Variables with 3 unique values are considered to be `categorical' if
a formula is specified. The user can add other variable names to the
`asis' and `categorical' vectors.
.AG xt
an object returned by `transcan'
.AG var
For `impute', is a variable that was originally a column in `x', for
which imputed values are to be filled in. `imputed=T' must have been
used in `transcan'.
.OA
.AG method
use `method="canonical"' or any abbreviation thereof, to use canonical
variates (the default).
`method="pc"' transforms a variable instead so as to maximize
the correlation with the first principal component of the other variables.
.AG categorical
a character vector of names of variables in `x' which are categorical,
for which the ordering of re-scored values is not necessarily preserved.
If `categorical' is omitted, it is assumed that all variables are
continuous (or binary). Set `categorical="*"' to treat all variables
as categorical.
.AG asis
a character vector of names of variables that are not to be transformed.
For these variables, `lsfit' is used to impute missing values.
You may want to treat binary variables `asis'. If `imputed=T', you
may want to use `"categorical"' for binary variables if you want
to force imputed values to be one of the original data values.
Set `asis="*"' to treat all variables `asis'.
.AG nk
number of knots to use in expanding each continuous variable (not listed
in `asis') in a restricted cubic spline function. Default is 3 (yielding
2 parameters for a variable) if `n<30', 4 if `30<=n<100', and 5 if
`n>=100' (4 parameters).
.AG imputed
Set to `T' to return a list containing imputed values on the original
scale.
If the transformation for a variable is non-monotonic, imputed
values are not unique. `transcan' uses the `approx' function,
which returns the highest value of the variable with the transformed
score equalling the imputed score. `imputed=T' also causes original-scale imputed values to be shown as tick
marks on the top margin of each graph
when `show.na=T' (for the final iteration only).
For categorical predictors, these imputed values are `jitter'ed so
that their frequencies can be visualized.
.AG trantab
Set to `T' to add an attribute `trantab' to the returned matrix. This
contains a vector of lists each with components `x' and `y' containing
the unique values and corresponding transformed values for the
columns of `x'. This is set up to be used easily with the `approx'
function. You must specify `trantab=T' if you want to later use the
`predict.transcan' function with `type="original"'.
.AG impcat
This argument tells how to impute categorical variables on the original
scale.
The default is `impcat="score"' to impute the category
whose canonical variate score is closest to the predicted score.
Use `impcat="tree"' to impute categorical variables using the
`tree()' function, using the values of all other transformed predictors.
.AG mincut
If `imputed=T' and there are categorical variables, `mincut' specifies
the lowest node size that will be allowed to be split by `tree'. The
default is 40.
.AG pl
Set to `F' to suppress plotting the final transformations with
distribution of scores for imputed values (if `show.na=T').
.AG allpl
Set to `T' to plot transformations for intermediate iterations.
.AG show.na
Set to `F' to suppress the distribution of scores assigned to
missing values (as tick marks on the right margin of each graph).
See also `imputed'.
.AG iter.max
maximum number of iterations to perform for `transcan' or `predict'.
For `predict', only one iteration is used if there
are no NAs in the data.
.AG eps
convergence criterion for `transcan' and `predict'. `eps' is the
maximum change in transformed values from one iteration to the next.
If for a given iteration all new transformations
of variables differ by less than `eps' (with or without negating the
transformation to allow for "flipping") from the transformations in
the previous iteration, one more iteration is done for `transcan'.
During this
last iteration, individual transformations are not updated but
coefficients of transformations are. This improves stability of
coefficients of canonical variates on the right-hand-side.
.AG curtail
for `transcan', causes imputed values on the transformed scale to
be truncated so that their ranges are within the ranges of
non-imputed transformed values.
For `predict', `curtail' defaults to `T' to truncate predicted transformed
values to their ranges in the original fit (`xt').
.AG imp.con
for `transcan', set to `T' to impute NAs on the original scales with
constants (medians or most frequent category codes). Set to a vector
of constants to instead always use these constants for imputation.
These imputed values are ignored when fitting the current working
transformation for a single variable.
.AG shrink
default is `F' to use ordinary least squares or canonical variate estimates.
For the purposes of imputing NAs, you may want to set `shrink=T' to avoid
overfitting when developing a prediction equation to predict each variable
from all the others (see details below).
.AG init.cat
method for initializing scorings of categorical variables. The default,
`"mode"', uses a dummy variable set to 1 if the value is the most
frequent value.
Use `"random"' to use a random 0-1 variable. Set
to `"asis"' to use the original integer codes as starting scores.
.AG ...
arguments passed to `scat1d'
.AG long
for `summary', set to `T' to print all imputed values.
For `print', set to `T' to print details of transformations/imputations.
.AG name
name of variable to impute, for `impute()'. Default is character
string version of the second argument (`var') in the call to `impute'.
.AG newdata
a new data matrix for which to compute transformed variables.
Categorical variables must use the same integer codes as were used
in the call to `transcan'. If a formula was originally specified to
`transcan' (instead of a data matrix), `newdata' is optional and if
given must be a data frame; a model
frame is generated automatically from the previous formula. The
`na.action' is handled automatically, and the levels for factor variables
must be the same and in the same order as were used in the original
variables specified in the formula given to `transcan'.
.AG type
By default, the matrix of transformed variables is returned, with imputed
values on the transformed scale. If you had specified `trantab=T' to
`transcan', specifying `type="original"' does the table look-ups with
linear interpolation to return the input matrix `x' but with imputed
values on the original scale inserted for NAs. For categorical variables,
the method used here is to select
the category code having a corresponding scaled value closest to the
predicted transformed value. This corresponds to the default `impcat';
a problem in getting predicted
values for `tree' objects prevented using `tree' for this.
.AG prefix
.AG suffix
When creating separate S functions for each variable in `x', the name
of each new function is formed by placing `prefix' in front of the
variable name and appending `suffix' after it. The default is to use
names of the form `.varname', where `varname' is the variable name.
.AG where
Place to store new functions created by `Function'. Default is position
1 in the search list. See the `assign' function for more documentation
on the `where' argument.
.AG subset
a standard subsetting/subscripting expression
.RT
For `transcan',
a matrix of class `transcan' with dimnames of `x' and attributes
`call' (with the function call), `iter' (number of iterations done) and
`rsq' and `rsq.adj' containing the
R-squares and adjusted R-squares achieved in predicting each variable
from all the others.
It also has attributes `categorical', `asis', `coef', `xcoef', `parms',
`fillin', `ranges', `scale', and `formula', containing respectively:
the values supplied for `categorical' and `asis'; the within-variable
coefficients used to compute the first canonical variate; the (possibly
shrunk) across-variables coefficients of the first canonical variate
that predicts each variable in turn; the parameters of the
transformation (knots for splines, contrast matrix for categorical
variables); the initial estimates for missing values (NA if a variable
was never missing); the matrix of ranges of the transformed variables
(min and max in the first and second rows); a vector of scales used to
determine convergence for a transformation; and the formula (if `x' was
a formula). An optional attribute holds the vector of shrinkage
factors used for predicting each variable from the others.
For `"asis"' variables, the
scale is the average absolute difference about the median.
For other variables it is unity, since canonical variables are
standardized.
For `xcoef', row `i' has the coefficients to predict transformed
variable `i', with the column for the coefficient of variable `i'
set to NA.
If `imputed=T'
was given, an optional attribute `imputed' also appears. This
is a list with the vector of imputed values (on the original scale)
for each variable containing NAs.
If `trantab=T' was specified, the `trantab' attribute also appears, as described above.
For `impute', returns a vector (the same length as `var') of class
`"impute"' with NAs imputed.
For `predict', returns a matrix with the same number of columns or
variables as were in `x'.
.SE
prints, plots
.DT
The starting approximation to the transformation for each variable
is taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of
the non-missing values for the variable (for continuous ones) or
the most frequent category (for categorical ones). If `imp.con' is
a vector, its values are used instead for imputing NAs. When using each
variable as a dependent variable, NAs on that variable cause all
observations to be temporarily deleted. Once a new working transformation
is found for the variable, along with a model to predict that transformation
from all the other variables, that latter model is used to impute
NAs in the selected dependent variable if `imp.con' is not specified.
When that variable is used
to predict a new dependent variable, the current working imputed values
are inserted. Transformations are updated after each variable becomes
a dependent variable, so the order of variables in `x' could conceivably
make a difference in the final estimates. For obtaining out-of-sample
predictions/transformations, `predict' uses the same iterative
procedure as `transcan' for imputation, with the same starting
values for fill-ins as were used by `transcan'. It also (by default)
uses a conservative approach of curtailing transformed variables to
be within the range of the original ones.
Even when `method="pc"' is specified, canonical variables are used
for imputing missing values.
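The iterative procedure just described can be sketched in miniature.
The following Python fragment is illustrative only and not part of the
S library: it handles continuous ("asis") variables only, regressing
each column on all the others by ordinary least squares and refreshing
that column's working imputations; transformation estimation, canonical
variates, and shrinkage are all omitted.

```python
import numpy as np

def iterative_impute(x, n_iter=10):
    """Toy sketch of transcan-style iterative imputation for continuous
    variables: NAs are first filled with column medians, then each
    column is repeatedly regressed (OLS) on all the others and its NAs
    are re-imputed from that fit."""
    x = np.asarray(x, dtype=float)
    n, m = x.shape
    miss = np.isnan(x)
    filled = x.copy()
    for j in range(m):                       # initial fill-in: medians
        filled[miss[:, j], j] = np.nanmedian(x[:, j])
    for _ in range(n_iter):
        for j in range(m):                   # each variable in turn is "dependent"
            obs = ~miss[:, j]
            others = np.delete(filled, j, axis=1)
            X = np.column_stack([np.ones(n), others])
            beta, *_ = np.linalg.lstsq(X[obs], filled[obs, j], rcond=None)
            # update the working imputations for this variable
            filled[miss[:, j], j] = X[miss[:, j]] @ beta
    return filled
```

As in `transcan', observations missing on the current dependent
variable are excluded from the fit, and the working imputed values of
the other variables are used on the right-hand side.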
.sp \\
Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the transformed
imputed values returned in `xt'. This is because `transcan' uses an
approximate method based on linear interpolation to back-solve for
imputed values on the original scale.
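A one-variable sketch of that table look-up, in illustrative Python
(not part of the S library), with `log' standing in for a fitted
monotone transformation:

```python
import numpy as np

# Hypothetical transformation table for one variable: original values
# and their fitted transformed values, as `trantab=T' would store them.
orig = np.linspace(1.0, 10.0, 50)
trans = np.log(orig)          # stand-in for a fitted monotone transformation

def back_solve(t_imputed):
    """Back-solve an imputed value on the transformed scale to the
    original scale by linear interpolation in the stored table."""
    return np.interp(t_imputed, trans, orig)
```

Because the inversion is piecewise linear between tabulated points, the
recovered original-scale value matches the true inverse only
approximately, which is the source of the mismatch noted above.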
.sp \\
Shrinkage uses the method of Van Houwelingen and Le Cessie (1990) (similar to
Copas, 1983). The shrinkage factor is `[1-(1-R2)(n-1)/(n-k-1)]/R2', where
`R2' is the apparent R-squared for predicting the variable, `n' is the number
of non-missing values, and `k' is the effective number of degrees of freedom
(aside from intercepts). A heuristic estimate is used for `k':
`A - 1 + sum(max(0,Bi-1))/m + m', where
`A' is the number of d.f. required
to represent the variable being predicted, the `Bi' are the number of
columns required to represent all the other variables, and `m' is the
number of all other variables. Division by `m' is done because the
transformations for the other variables are fixed at the estimates from
the last time each of them was the variable being predicted. The `+ m' term
comes from the number of coefficients estimated on the right hand side,
whether by least squares or canonical variates. If a shrinkage factor
is negative, it is set to 0. The shrinkage factor is the ratio of
the adjusted R-squared to the ordinary R-squared.
The adjusted R-squared is `1 - (1 - R2)(n-1)/(n-k-1)', which is also set to
zero if it is negative. If `shrink=F' and the adjusted R-squares are much
smaller than
the ordinary R-squares, you may want to run `transcan' with `shrink=T'.
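The shrinkage computation is small enough to state directly. This
Python sketch (illustrative only) mirrors the formulas in the text:

```python
def shrinkage_factor(r2, n, k):
    """Van Houwelingen-Le Cessie heuristic shrinkage as described above:
    the ratio of the adjusted R-squared to the ordinary R-squared,
    with the adjusted R-squared truncated below at zero."""
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return max(0.0, r2_adj) / r2
```

For example, with an apparent `R2' of .5, `n=100' non-missing values,
and `k=10' effective degrees of freedom, the coefficients are shrunk by
a factor of about .89.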
.sp \\
Canonical variates are scaled to have variance of 1.0, by multiplying canonical
coefficients from `cancor' by `sqrt(n-1)'.
.SH AUTHOR
Frank Harrell
.sp 0
Division of Biometry
.sp 0
Duke University Medical Center
.sp 0
feh@biostat.mc.duke.edu
.SH REFERENCES
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth
Edition, Volume 2, pp. 1265-1323, 1990.
.sp \\
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models.
Statistics in Medicine 9:1303-1325, 1990.
.sp \\
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311-354, 1983.
.SA
`impute',
`ace', `avas', `cancor', `prcomp', `tree', `rcspline.eval',
`lsfit', `approx'
.EX
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH", imputed=T)
summary(x.trans) #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans) #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age <- impute(x.trans, age)
disease <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH <- impute(x.trans, pH)
summary(pH) #uses summary.impute to tell about imputations
X #and summary.default to tell about pH overall
newx.trans <- predict(x.trans, xnew)
w <- predict(x.trans, xnew, type="original")
age <- w[,"age"] #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans) #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + pH,
X asis="pH", imputed=T)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
X blood.pressure=60:260, pH=7.4))
transcan(~ age + factor(disease.code)) # disease.code categorical
.KW smooth
.KW regression
.KW multivariate
.KW Methods and Generic Functions
.WR
SHAR_EOF
chmod 0644 .Data/.Help/transcan ||
echo 'restore of .Data/.Help/transcan failed'
Wc_c="`wc -c < '.Data/.Help/transcan'`"
test 18439 -eq "$Wc_c" ||
echo '.Data/.Help/transcan: original size 18439, current size' "$Wc_c"
fi
# ============= .Data/.Help/rcspline.eval ==============
if test -f '.Data/.Help/rcspline.eval' -a X"$1" != X"-c"; then
echo 'x - skipping .Data/.Help/rcspline.eval (File already exists)'
else
echo 'x - extracting .Data/.Help/rcspline.eval (Text)'
sed 's/^X//' << 'SHAR_EOF' > '.Data/.Help/rcspline.eval' &&
.BG
.FN rcspline.eval
.TL
Restricted Cubic Spline Design Matrix
.DN
Computes a matrix that expands a single variable into the terms needed to
fit a restricted cubic spline (natural spline) function using the
truncated power basis. Two normalization options are given for somewhat
reducing problems of ill-conditioning. The antiderivative function can
be optionally created. If knot locations are not given, they will be
estimated from the marginal distribution of `x'.
.CS
rcspline.eval(x, knots, nk=5, inclx=F, knots.only=F,
X type="ordinary", norm=2, rpm=NULL)
.RA
.AG x
a vector representing a predictor variable
.OA
.AG knots
knot locations. If not given, knots will be estimated using default
quantiles of `x'. For 3-5 knots, the outer quantiles used are .05 and .95.
For `nk>5', the outer quantiles are .025 and .975. The knots are
equally spaced between these on the quantile scale. For fewer than 100
non-missing values of `x', the outer knots are the 5th smallest and 5th
largest values of `x'.
.AG nk
number of knots. Default is 5. The minimum value is 3.
.AG inclx
set to `T' to add `x' as the first column of the returned matrix
.AG knots.only
return the estimated knot locations but not the expanded matrix
.AG type
`"ordinary"' to fit the function, `"integral"' to fit its anti-derivative.
.AG norm
`0' to use the terms as originally given by Devlin and Weeks (1986),
`1' to normalize non-linear terms by the cube of the spacing between the last two
knots, `2' to normalize by the square of the spacing between the first
and last knots (the default).
`norm=2' has the advantage of putting all
nonlinear terms on the scale of `x'.
.AG rpm
If given, any NAs in `x' will be replaced with the value `rpm' after
estimating any knot locations.
.RT
If `knots.only=T', returns a vector of knot locations. Otherwise returns
a matrix with `x' (if `inclx=T') followed by `nk-2' nonlinear terms.
The matrix has an attribute `knots' which is the vector of knots used.
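For reference, the `norm=2' basis can be sketched outside S as follows
(illustrative Python, not part of the library; the nonlinear terms are
divided by the square of the spacing between the first and last knots,
as described above):

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis in the truncated power form, with
    the `norm=2' scaling.  Returns the k-2 nonlinear columns; prepend
    x itself for the full design (the inclx=T analogue).  Each column
    is constructed so the fitted function is linear beyond the outer
    knots."""
    x = np.asarray(x, dtype=float)
    t = np.sort(np.asarray(knots, dtype=float))
    k = len(t)
    norm = (t[-1] - t[0]) ** 2
    cols = []
    for j in range(k - 2):
        term = (np.maximum(x - t[j], 0.0) ** 3
                - np.maximum(x - t[-2], 0.0) ** 3 * (t[-1] - t[j]) / (t[-1] - t[-2])
                + np.maximum(x - t[-1], 0.0) ** 3 * (t[-2] - t[j]) / (t[-1] - t[-2]))
        cols.append(term / norm)
    return np.column_stack(cols)
```

The last two truncated cubic terms are combined with each leading term
so that the cubic and quadratic tail coefficients cancel, which is what
makes the function linear beyond the last knot.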
.SH REFERENCES
Devlin TF and Weeks BJ (1986): Spline functions for logistic regression
modeling. Proc 11th Annual SAS Users Group Intnl Conf, p. 646-651.
Cary NC: SAS Institute, Inc.
.SA
`ns', `rcspline.restate'
.EX
lrm.fit(rcspline.eval(age,nk=4,inclx=T), death)
.KW Regression
.WR
SHAR_EOF
chmod 0644 .Data/.Help/rcspline.eval ||
echo 'restore of .Data/.Help/rcspline.eval failed'
Wc_c="`wc -c < '.Data/.Help/rcspline.eval'`"
test 2249 -eq "$Wc_c" ||
echo '.Data/.Help/rcspline.eval: original size 2249, current size' "$Wc_c"
fi
# ============= .Data/.Help/rcspline.restate ==============
if test -f '.Data/.Help/rcspline.restate' -a X"$1" != X"-c"; then
echo 'x - skipping .Data/.Help/rcspline.restate (File already exists)'
else
echo 'x - extracting .Data/.Help/rcspline.restate (Text)'
sed 's/^X//' << 'SHAR_EOF' > '.Data/.Help/rcspline.restate' &&
.tr @@
.BG
.FN rcspline.restate
.TL
Re-state Restricted Cubic Spline Function
.DN
This function re-states a restricted cubic spline function in
unrestricted form, with a separate coefficient for each truncated
cubic term. Coefficients for that form are returned, along with an
S functional representation and a LaTeX character representation of
the function.
.CS
rcspline.restate(knots, coef, x="X", lx=nchar(x),
X type=c("ordinary","integral"),
X norm=2, columns=65, before="& &", after="\\\\",
X begin="", nbegin=0)
.RA
.AG knots
vector of knots used in the regression fit
.AG coef
vector of coefficients from the fit. If the length of `coef' is
`k-1', where `k=length(knots)', the first coefficient must be
for the linear term and remaining `k-2' coefficients
must be for the constructed terms (e.g., from `rcspline.eval').
If the length of `coef' is `k', an intercept is assumed to be in
the first element.
.OA
.AG type
The default is to represent the cubic spline function corresponding
to the coefficients and knots. Set `type="integral"' to instead represent
its anti-derivative.
.AG x
a character string to use as the variable name in the LaTeX expression
for the formula.
.AG lx
length of `x' to count with respect to `columns'. Default is length
of character string contained by `x'. You may want to set `lx'
smaller than this if it includes non-printable LaTeX commands.
.AG norm
normalization that was used in deriving the original nonlinear terms
used in the fit. See `rcspline.eval' for definitions.
.AG columns
maximum number of symbols in the LaTeX expression to allow before
inserting a newline (\\\\) command. Set to a very large number to
keep text all on one line.
.AG before
text to place before each line of LaTeX output. Use `"& &"' for an equation
array environment in LaTeX where you want to have a left-hand prefix
e.g. `f(X) & = &' or using `\\lefteqn'.
.AG after
text to place at the end of each line of output.
.AG begin
text with which to start the first line of output. Useful when adding
LaTeX output to part of an existing formula
.AG nbegin
number of columns of printable text in `begin'
.RT
a vector of coefficients. The coefficients are un-normalized
and two coefficients are added that are linearly dependent on the
other coefficients and knots. The vector of coefficients has four
attributes. `knots' is a vector of knots, `latex' is a vector of text strings
with the LaTeX representation of the formula.
`columns.used' is the number of columns used in the output string
since the last newline command. `function' is an S function.
.SH AUTHOR
Frank Harrell
.sp 0
Division of Biometry, Duke University Medical Center
.sp 0
feh@biostat.mc.duke.edu
.SA
`rcspline.eval', `ns', `rcs', `display', `Function.transcan'
.EX
xx <- rcspline.eval(blood.pressure, inclx=T, nk=4)
knots <- attr(xx, "knots")
coef <- lsfit(xx, y)$coef
options(digits=4)
# rcspline.restate must ignore intercept
w <- rcspline.restate(knots, coef[-1], x="{\\\\rm BP}")
# could also have used coef instead of coef[-1], to include intercept
cat(attr(w,"latex"), sep="\n", file="formula.tex")
X
bp.trans <- attr(w, "function")
# This is an S function of a single argument
bp.sequence <- 60:170
plot(bp.sequence, bp.trans(bp.sequence), type="l")
# Plots fitted blood pressure transformation
X
x <- blood.pressure
xx.simple <- cbind(x, pmax(x-knots[1],0)^3, pmax(x-knots[2],0)^3,
X              pmax(x-knots[3],0)^3, pmax(x-knots[4],0)^3)
pred.value <- coef[1] + xx.simple %*% w
.KW Regression
.WR
SHAR_EOF
chmod 0644 .Data/.Help/rcspline.restate ||
echo 'restore of .Data/.Help/rcspline.restate failed'
Wc_c="`wc -c < '.Data/.Help/rcspline.restate'`"
test 3522 -eq "$Wc_c" ||
echo '.Data/.Help/rcspline.restate: original size 3522, current size' "$Wc_c"
fi
# ============= .Data/.Help/%in% ==============
if test -f '.Data/.Help/%in%' -a X"$1" != X"-c"; then
echo 'x - skipping .Data/.Help/%in% (File already exists)'
else
echo 'x - extracting .Data/.Help/%in% (Text)'
sed 's/^X//' << 'SHAR_EOF' > '.Data/.Help/%in%' &&
.tr @@
.BG
.FN %in%
.TL
Find Matching Elements
.DN
For two vectors or scalars `a' and `b',
`a %in% b' returns a vector of logical values corresponding to
the elements in `a'. A value is `T' if that element of `a' is
found somewhere in `b', `F' otherwise. If `a' is a factor object
and `b' is numeric, converts `a' to its integer codes before
the comparison. If `a' is numeric and `b' is `factor', converts
`b' to codes.
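The semantics parallel a vectorized membership test; a minimal Python
analogue (illustrative only, ignoring the factor-code conversions):

```python
import numpy as np

def in_op(a, b):
    """Vectorized membership test paralleling `a %in% b': one logical
    per element of `a', True where that element occurs anywhere in `b'."""
    return np.isin(np.asarray(a), np.asarray(b))
```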
.CS
a %in% b
.RA
.AG a
a vector (numeric, character, factor)
.AG b
a vector (numeric, character, factor), matching the mode of `a'
.RT
vector of logical values with length equal to length of `a'.
.SA
`match'
.EX
X w <- factor(c("a","b","c"))
X w %in% c("b","c")
X w %in% c(2,3) #same as previous, but with warning
X
#Suppose that a variable x has levels "a", "b1", "b2" and you
#want to classify levels "b1" and "b2" as just "b". You can use
#the %in% operator and do the following:
X
X levels(x)[levels(x) %in% c("b1","b2")] <- "b"
.KW manip
.KW character
.WR
SHAR_EOF
chmod 0644 .Data/.Help/%in% ||
echo 'restore of .Data/.Help/%in% failed'
Wc_c="`wc -c < '.Data/.Help/%in%'`"
test 991 -eq "$Wc_c" ||
echo '.Data/.Help/%in%: original size 991, current size' "$Wc_c"
fi
# ============= .Data/.Help/scat1d ==============
if test -f '.Data/.Help/scat1d' -a X"$1" != X"-c"; then
echo 'x - skipping .Data/.Help/scat1d (File already exists)'
else
echo 'x - extracting .Data/.Help/scat1d (Text)'
sed 's/^X//' << 'SHAR_EOF' > '.Data/.Help/scat1d' &&
.tr @@
.BG
.FN scat1d
.TL
One-Dimensional Scatter Diagram
.DN
Adds tick marks (bar codes) on any of the four sides of an existing
plot, corresponding with non-missing values of a vector `x'.
.sp 0
If any two values of `x' are within `eps*w' of each other, where `eps'
defaults to .001 and `w' is the span of the intended axis, values of
`x' are jittered by adding a value uniformly distributed in
`[-jitfrac*w, jitfrac*w]', where `jitfrac' defaults to .008.
Allows plotting random sub-segments to handle very large `x' vectors
(see `tfrac').
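The jittering rule can be sketched as follows (illustrative Python,
not part of the library; the span of the data stands in for the axis
span `w'):

```python
import numpy as np

def jitter_if_crowded(x, eps=1e-3, jitfrac=0.008, rng=None):
    """If any two sorted values are within eps*w of each other, add
    uniform noise on [-jitfrac*w, jitfrac*w]; otherwise return x
    unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    w = x.max() - x.min()                       # stand-in for the axis span
    if np.any(np.diff(np.sort(x)) < eps * w):   # any overlapping points?
        return x + rng.uniform(-jitfrac * w, jitfrac * w, size=x.shape)
    return x
```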
.CS
scat1d(x, side=3, frac=0.02, jitfrac=0.008, tfrac,
X eps=1e-3, lwd=0.1)
.RA
.AG x
a vector of data
.OA
.AG side
axis side to use (1=bottom, 2=left, 3=top(default), 4=right)
.AG frac
fraction of smaller of vertical and horizontal axes for tick mark lengths.
Can be negative to move tick marks outside of plot.
.AG jitfrac
fraction of axis for jittering. If <=0, no jittering is done.
.AG tfrac
fraction of tick mark to actually draw. If `tfrac<1',
will draw a random fraction `tfrac' of the line segment at each point.
This is useful for very large samples or ones with some very dense points.
The default value is 1 if the number of non-missing observations `n'
is less than 125, and `max(.1, 125/n)' otherwise.
.AG eps
fraction of axis for determining overlapping points in `x'
.AG lwd
line width for tick marks, passed to `segments'
.SE
adds line segments to plot
.DT
The length of line segments used is `frac*min(par()$pin) / par()$uin[opp]' data units, where `opp' is the
index of the opposite axis and `frac' defaults to .02.
Assumes that `plot' has already been called.
Current `par("usr")' is used to determine the range of data for the axis
of the current plot. This range is used in jittering and in constructing
line segments.
.SH AUTHOR
Frank Harrell
.sp 0
Division of Biometry
.sp 0
Duke University Medical Center
.sp 0
Durham NC USA
.sp 0
feh@biostat.mc.duke.edu
.sp 1
Martin Maechler
.sp 0
Seminar fuer Statistik
.sp 0
ETH Zurich SWITZERLAND
.sp 0
maechler@stat.math.ethz.ch
.SA
`segments', `jitter', `rug'
.EX
plot(x <- rnorm(50), y <- 3*x + rnorm(50)/2 )
scat1d(x) # density bars on top of graph
scat1d(y, 4) # density bars at right
plot(x <- rnorm(250), y <- 3*x + rnorm(250)/2)
scat1d(x, tfrac=0) # dots randomly spaced from axis
scat1d(y, 4, frac=-.03) # bars outside axis
scat1d(y, 2, tfrac=.2) # same bars with smaller random fraction
.KW dplot
.KW aplot
.WR
SHAR_EOF
chmod 0644 .Data/.Help/scat1d ||
echo 'restore of .Data/.Help/scat1d failed'
Wc_c="`wc -c < '.Data/.Help/scat1d'`"
test 2494 -eq "$Wc_c" ||
echo '.Data/.Help/scat1d: original size 2494, current size' "$Wc_c"
fi
exit 0