Clustering a dissimilarity matrix

cluster -## [-= -Xxfile] file

-# number of clusters
-= if set, bias towards similar size clusters
-X ffile list of indices with fixed cluster assignments
-o output file name, just -o means file_clust
-V verbosity level (0 = only fatal errors)
-h show this message
verbosity level (add what you want):
1 = input/output
2 = state of clustering
8 = temperature / cost at cooling
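The phrase "add what you want" suggests that the verbosity values are summed, so that a level of 9 would enable classes 1 and 8. The following sketch decodes such a sum, assuming the values behave as independent bit flags; the names `VERBOSITY_FLAGS` and `describe_verbosity` are illustrative and not part of the program.

```python
# Flag values copied from the list above; treating them as bit flags is an assumption.
VERBOSITY_FLAGS = {
    1: "input/output",
    2: "state of clustering",
    8: "temperature / cost at cooling",
}


def describe_verbosity(level):
    """Return the message classes enabled by a summed verbosity level."""
    return [name for bit, name in VERBOSITY_FLAGS.items() if level & bit]


print(describe_verbosity(1 + 8))
```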

The program reads a dissimilarity matrix of the form i, j, d_i,j (columns 1, 2, 3 of the input file). Any missing values are filled in with the mean of the given values. Now -# clusters are formed by minimising the average dissimilarity of each entity to all the entities within each cluster. The method is described in Schreiber and Schmitz. Certain indices may be assigned to a cluster by listing the index and the cluster number in ffile (the argument of the -X option).
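The input handling and the quantity being minimised can be illustrated with a short Python sketch. It is not the optimisation procedure of Schreiber and Schmitz; it only shows one way to read the triples, fill gaps with the mean, and evaluate a within-cluster cost for a given assignment. The function names, the 1-based indexing, and the inclusion of the diagonal in the averages are assumptions made for illustration.

```python
import numpy as np


def load_dissimilarities(lines, n):
    """Build an n x n matrix from 'i j d' triples (assumed 1-based indices)."""
    d = np.full((n, n), np.nan)
    for line in lines:
        i, j, value = line.split()[:3]
        d[int(i) - 1, int(j) - 1] = float(value)
    # Missing entries are replaced by the mean of the entries that were given.
    d[np.isnan(d)] = np.nanmean(d)
    return d


def within_cluster_cost(d, labels):
    """Sum over items of the mean dissimilarity to the members of the item's own cluster."""
    labels = np.asarray(labels)
    cost = 0.0
    for k in np.unique(labels):
        members = labels == k
        cost += d[np.ix_(members, members)].mean(axis=1).sum()
    return cost


# Three items; the entry (3, 2) is deliberately left out.
triples = ["1 1 0", "1 2 0", "1 3 1",
           "2 1 0", "2 2 0", "2 3 1",
           "3 1 1", "3 3 0"]
d = load_dissimilarities(triples, 3)
print(d[2, 1])                              # filled with the mean of the 8 given values
print(within_cluster_cost(d, [1, 1, 2]))    # items 1 and 2 together
print(within_cluster_cost(d, [1, 2, 2]))    # items 2 and 3 together
```

Under these assumptions, the assignment that groups items 1 and 2 has the lower cost, which is the kind of comparison a minimisation over assignments relies on.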

Progress is monitored by a string printed at brief intervals. In this string, clusters are labelled A, B, ... On output, the clustering is described by giving, for each index, the cluster number and the average dissimilarities of that item to each cluster.

As an example, consider four time series 1, 2, 3, 4 where 1 and 2 are very similar, as are 3 and 4, but the two groups are quite dissimilar. This may be reflected in the dissimilarity matrix

i	j	d_i,j

1	1	0
1	2	0
1	3	1
1	4	1
2	1	0
2	2	0
2	3	1
2	4	1
3	1	1
3	2	1
3	3	0
3	4	0
4	1	1
4	2	1
4	3	0
4	4	0

Running

> cluster -#2

on this data will yield as output

This means that, as expected, items 1 and 2 are in cluster 1 and items 3 and 4 are in cluster 2. Each item has an average dissimilarity of 0 to its home cluster and 1 to the other cluster.
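These averages can be checked independently of the program. The sketch below builds the 4 by 4 example matrix, assumes the assignment described above (items 1 and 2 in cluster 1, items 3 and 4 in cluster 2), and computes each item's average dissimilarity to every cluster. The function name and the printed layout are illustrative and do not reproduce the program's exact output format.

```python
import numpy as np

# Example dissimilarity matrix: two tight groups, {1, 2} and {3, 4}.
d = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)
labels = np.array([1, 1, 2, 2])  # assumed assignment, as described in the text


def average_to_clusters(d, labels):
    """For each item, the mean dissimilarity to the members of each cluster."""
    clusters = np.unique(labels)
    return np.array([[d[i, labels == k].mean() for k in clusters]
                     for i in range(len(labels))])


averages = average_to_clusters(d, labels)
for label, row in zip(labels, averages):
    print(label, *row)
```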

Dissimilarity matrices for time series can be produced either using nstat_z or by computing any other dissimilarity measure (xc2, xzero, xcor, with appropriate settings) in a loop.
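The loop over pairs can be sketched in Python as follows. The measure used here, the absolute difference of segment means, is only a placeholder standing in for a real statistic such as those named above, and the helper names are hypothetical. The point is the structure: one "i j d" line per pair of segments.

```python
import numpy as np


def placeholder_dissimilarity(a, b):
    """Stand-in measure: absolute difference of segment means (not a TISEAN statistic)."""
    return abs(float(np.mean(a)) - float(np.mean(b)))


def dissimilarity_triples(series, n_segments):
    """Split the series into segments and return one 'i j d' line per pair."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    lines = []
    for i, a in enumerate(segments, start=1):
        for j, b in enumerate(segments, start=1):
            lines.append(f"{i} {j} {placeholder_dissimilarity(a, b):g}")
    return lines


# A toy series whose first half differs from its second half.
series = [0.0] * 4 + [1.0] * 4
triples = dissimilarity_triples(series, 4)
print("\n".join(triples))
```

For this toy series the printed triples coincide with the four-item example matrix given earlier, so they could be saved to a file and used as input in the same way.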

Here is one more example, for UNIX users, using a pipeline:

> ( henon -l 1000 ; ikeda -l 1000 ) | nstat_z -#10 | cluster -#2

A time series of 2000 points is produced, the first 1000 from the Hénon map, the second from the Ikeda map. Splitting it into 10 segments, nstat_z produces a 10 by 10 matrix which is then used to form 2 clusters:

 1  0.510285676  1.51835787
 1  0.515677989  1.47538877
 1  0.505068302  1.50731277
 1  0.526149631  1.50477791
 1  0.54192245  1.5086087
 2  1.49558449  0.900290847
 2  1.49753571  0.912206411
 2  1.50899839  0.89825052
 2  1.49374235  0.903614342
 2  1.51858497  0.915635824

Indeed, the first 5 segments form one cluster and segments 6-10 the other.
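The grouping can also be read off programmatically. The sketch below parses the ten output rows shown above, assuming that row order corresponds to segment number and that the first column is the cluster number followed by the average dissimilarity to each cluster. The function name `group_by_cluster` is illustrative.

```python
from collections import defaultdict

output = """\
 1  0.510285676  1.51835787
 1  0.515677989  1.47538877
 1  0.505068302  1.50731277
 1  0.526149631  1.50477791
 1  0.54192245  1.5086087
 2  1.49558449  0.900290847
 2  1.49753571  0.912206411
 2  1.50899839  0.89825052
 2  1.49374235  0.903614342
 2  1.51858497  0.915635824
"""


def group_by_cluster(text):
    """Return {cluster: [segment numbers]} and whether every home cluster is the nearest."""
    groups = defaultdict(list)
    home_is_nearest = True
    for segment, line in enumerate(text.strip().splitlines(), start=1):
        fields = line.split()
        cluster = int(fields[0])
        averages = [float(x) for x in fields[1:]]
        groups[cluster].append(segment)
        # The assigned cluster should have the smallest average dissimilarity.
        home_is_nearest = home_is_nearest and averages[cluster - 1] == min(averages)
    return dict(groups), home_is_nearest


groups, home_is_nearest = group_by_cluster(output)
print(groups)
print(home_is_nearest)
```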
