I wrote a simple script that is intended to do hierarchical clustering on a simple test dataset. The test data that was used.

I found the function fclusterdata to be a candidate to cluster my data into two clusters. It takes two mandatory call parameters: the data set and a threshold. The problem is, I couldn't find a threshold that would yield the expected two clusters.

I'd be happy if anyone can tell me what I am doing wrong. I'd also be happy if anyone could point on other approaches that would be better suited for my clustering (I explicitly want to avoid to specify the number of clusters beforehand.)

Here is my code:

import time
import scipy.cluster.hierarchy as hcluster
import numpy.random as random
import numpy

import pylab
pylab.ion()

data = random.randn(2,200)

data[:100,:100] += 10

for i in range(5,15):
    thresh = i/10.
    clusters = hcluster.fclusterdata(numpy.transpose(data), thresh)
    pylab.scatter(*data[:,:], c=clusters)
    pylab.axis("equal")
    title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
    print title
    pylab.title(title)
    pylab.draw()
    time.sleep(0.5)
    pylab.clf()

Here is the output:

threshold: 0.500000, number of clusters: 129
threshold: 0.600000, number of clusters: 129
threshold: 0.700000, number of clusters: 129
threshold: 0.800000, number of clusters: 75
threshold: 0.900000, number of clusters: 75
threshold: 1.000000, number of clusters: 73
threshold: 1.100000, number of clusters: 58
threshold: 1.200000, number of clusters: 1
threshold: 1.300000, number of clusters: 1
threshold: 1.400000, number of clusters: 1
有帮助吗?

解决方案

Note that the function reference has an error. The correct definition of the t parameter is: "The cut-off threshold for the cluster function or the maximum number of clusters (criterion=’maxclust’)".

So try this:

clusters = hcluster.fclusterdata(numpy.transpose(data), 2, criterion='maxclust', metric='euclidean', depth=1, method='centroid')
许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top