1. Documentation--------------------------------------------------------
Documentation for the CLUSBAS Program
John S. Uebersax
February 1994
CLUSBAS is a very easy to use program for cluster analysis. It
has limited options--that is partly what makes it easy to
use--but in simple applications should be adequate to get the job
done. Commands are supplied interactively in response to three
prompting questions. CLUSBAS is especially useful for cluster
analyzing variables from a correlation matrix as an adjunct to
factor or principal components analysis.
Specifications
Algorithm: Average-linkage hierarchical clustering
Input: A full matrix of similarity (including
correlations) or dissimilarity measures
between each pair of variables or cases.
Output: A clustering history, and a tree diagram
(dendogram).
Limits: The program allows up to 100 variables or
cases.
Language: MicroSoft QuickBASIC, version 4.0 or higher
(FORTRAN version also supplied).
CLUSBAS may be freely copied and distributed.
Instructions for Use
It is assumed that you have already created a matrix of pairwise
similarities (e.g., correlations) or dissimilarities (e.g.,
Euclidean distances). This must be named MATRIX.DAT. It must be
a full (square), symmetrical matrix; if it is a correlation
matrix, include the 1.0's on the diagonal. Missing values are
not allowed. The data format is free field--that is, any format
is allowed, as long as there are spaces between successive
numbers and, if applicable, decimal points are shown.
When you run the program, you will be issued the following
prompt:
What is the problem title?
Respond to this by typing a short title for the run, and hit
Enter. The next prompt will be,
How many things are to be clustered?
Type in the number of items, variables, or cases and hit Enter.
Next you will be asked
SIM or DIS data?
If you have dissimilarity data, type DIS (must be all capital
letters) and hit Enter. If you type anything else, the default
condition, similarity or correlation data, will be assumed; if
you have similarity or correlation data, you can just hit Enter
in response to this prompt.
The program will then perform the analysis and print results (a
cluster history and a dendogram) to a file named OUTPUT.
I hope you find the program useful in your work. If you have any
questions or problems using the program, please let me know.
John Uebersax
Bowman Gray School of Medicine
of Wake Forest University
Winston-Salem, NC 27157-1063
Internet: uebersax@phs.bgsm.wfu.edu
71302.2362@compuserve.com (after 7/94)
2. QuickBASIC source code-----------------------------------------------
' PROGRAM CLUSBAS
'-----------------------------------------------------------------------
' This is a simple program well suited for cluster analysis of
' variables. The clustering method used is average-linkage
' hierarchical cluster analysis. There are few options, but that
' provides the advantage of easy use. All commands are supplied to
' the program interactively, in response to prompting questions.
'
' Reference: Anderberg, M. R. (1973). Cluster analysis for
' applications. New York: Academic press.
'
' Almost any text on cluster analysis should have a good
' description of the average-linkage hierarchical clustering
' algorithm. The algorithm begins with an initial similarity
' or dissimilarity matrix between pairs of objects. The
' algorithm proceeds in an iterative way. At each iteration
' the two most similar (we assume similarities for explanation)
' objects are combined into one group. At each successive
' iteration, the two most similar objects or groups of objects are
' merged. Similarity between groups is defined as the average
' similarity between objects in one group with objects in the other.
'
' INPUT: A correlation matrix (or some other similarity or
' dissimilarity matrix) in a file named MATRIX.DAT
' This must contain all the elements of a full
' (n x n), symmetrical matrix. Any format is
' allowable, as long as numbers are separated by
' blanks.
'
' OUTPUT: Output consists of a cluster history and a tree
' diagram (dendogram). The cluster history
' indicates, for each iteration, the objects
' or clusters merged, and the average pairwise
' similarity or dissimilarity in the resulting
' cluster.
'
' Two utility files are created by the program: SORTED.DAT
' and CLUM.DAT. These can be deleted after running the program.
' If they are present from a previous run they will be re-written
' and so will not affect the results of a current run.
'-----------------------------------------------------------------------
' Notes:
'
' 1. Compatible with Microsoft QuickBASIC, versions 4.0 or
' above.
'
' 2. This comes from a pre FORTRAN-77 program--please excuse
' lack of block command structure.
'
' 3. The original version was also written as several separate
' programs--which, for example, explains why files are used
' to communicate data between subroutines.
'
' 4. QuickBASIC apparently lists the subroutines in alphabetic
' order, regardless of how they are presented in the original
' code. Please excuse this inconvenience.
'
' 5. The dendogram is formatted for 132-column line printers.
' If your printer can print longer lines (e.g., with compressed
' type you, can reset the ISCL values in subroutine TREE; see
' comments in there flagged "**SCALING**".
'
' 6. The program may be freely copied and distributed.
'
' Author: John Uebersax
' Revised: Change(s):
' -------- ----------------------------------------------------------
' 03/17/93 Translated from FORTRAN
' 03/22/93 More comments added
'-----------------------------------------------------------------------
DECLARE SUB RMAKE (X$, LIMIT!, ISTART!)
DECLARE SUB TREE ()
DECLARE SUB CONV (N1!, N2!, A$)
DECLARE SUB PRETRE ()
DECLARE SUB CLUSV1 ()
DECLARE SUB SCAN ()
DECLARE SUB BSCAN ()
DECLARE SUB ARRANGE ()
'-----------------------------------------------------------------------
' Driver program
'
DIM X(100, 100) ' Element X(i, j) is the similarity or
' dissimilarity between object or group i
' and object or group j
DIM NIN(100) ' NIN(i) is the number of objects in
' cluster I
DIM NVAR(100) ' Used to identify clusters
DIM ROW$(200) ' For output--a line of the dendogram
COMMON SHARED X(), NIN(), NVAR(), K, L, M, RX, ROW$()
CALL CLUSV1 ' CLUSV1 routine
CALL PRETRE ' PRETRE routine
CALL TREE ' TREE routine
END
SUB ARRANGE
'
' This subroutine updates the similarity or dissimilarity matrix.
' If two objects/groups K and L are merged, it calculates the
' similarity or dissimilarity of the new group with all other objects
' or groups. It does this by averaging the elements in row K of
' X() with those in row L, and similarly for columns K and L.
' The new elements are put in row K and column L (K < L). Row K
' and column L are deleted. Columns and rows greater than L are
' shifted up one column or row to fill in the gap. The resulting
' matrix X() thus has one less column and row then at the beginning
' of the subroutine.
'
MN = M - 1
SAV = X(K, L)
SAV2 = X(K, K)
'
' Calculate similarity or dissimilarity of group formed by merging I
' and J to all other groups by averaging the similarities or
' dissimilarities of I and J with other groups
'
FOR I = 1 TO M
X(I, K) = (X(I, K) * NIN(K) + X(I, L) * NIN(L)) / (NIN(K) + NIN(L))
X(K, I) = X(I, K)
NEXT I
X(K, K) = SAV2 * NIN(K) * (NIN(K) - 1) + X(L, L) * NIN(L) * (NIN(L) - 1)
X(K, K) = X(K, K) + SAV * 2 * NIN(K) * NIN(L)
X(K, K) = X(K, K) / ((NIN(K) + NIN(L)) * (NIN(K) + NIN(L) - 1))
IF (L = M) THEN 60
FOR I = 1 TO M
'
' Shift columns after J up one place
'
FOR J = L TO MN
X(I, J) = X(I, J + 1)
NEXT J
NEXT I
FOR I = L TO MN
'
' Shift rows after J up one place
'
FOR J = 1 TO M
X(I, J) = X(I + 1, J)
NEXT J
NEXT I
' Update number of objects in each cluster
'
NIN(K) = NIN(K) + NIN(L)
FOR I = L TO MN
NIN(I) = NIN(I + 1)
NEXT I
GOTO 70
60 NIN(K) = NIN(K) + NIN(L)
70 END SUB
SUB BSCAN
'
' This subroutine looks for the minimum dissimilarity. It finds
' element (K, L), where K and L are the most dissimilar objects
' or groups.
'
N = 1
RRRMIN = 1000000!
MN = M - 1
FOR I = 1 TO MN
N = N + 1
FOR J = N TO M
IF ((RRRMIN - X(I, J) < 0!)) THEN 1010
IF ((RRRMIN - X(I, J) = 0!)) THEN 1005
IF ((RRRMIN - X(I, J) > 0!)) THEN 1005
1005 K = I
L = J
RRRMIN = X(I, J)
1010 NEXT J
NEXT I
RX = RRRMIN
END SUB
SUB CLUSV1
'
' This routine does the cluster analysis, taking data from file
' . Parameters controlling the analysis are obtained
' interactively. The results are stored in file