About the return value and usage of scipy.cluster.hierarchy.fcluster
20-12-2019
Question
Suppose we have four observations and the return value of scipy.cluster.hierarchy.linkage is:
[[ 1. 3. 0.08 2. ]
[ 2. 4. 0.28813559 3. ]
[ 0. 5. 1. 4. ]]
This return value means: observations 1 and 3 are first merged into a new cluster 4, then observation 2 is added to it to form another new cluster 5, and finally observation 0 is merged in. Since I want two clusters, {1, 3, 2} and {0}, I expect a return value of [2, 1, 1, 1] with threshold 0.4, meaning that element 0 belongs to cluster 2 and the rest are grouped into cluster 1. But scipy.cluster.hierarchy.fcluster actually returns [3, 1, 2, 1]. Of course I could write Python code to analyse the 2-D array returned by linkage myself, but I think fcluster can return what I want if I set the threshold to 0.4. However, I don't know which parameters to pass, so I wonder if you could provide some example code that performs hierarchical clustering using linkage and gives the final result using fcluster, with the observations in each cluster represented as a set. Thank you.
Solution
fcluster uses 'inconsistent' as the default value of its criterion argument, so a threshold of 0.4 is compared against inconsistency coefficients rather than merge distances. Pass criterion='distance' to compare the threshold against the cophenetic distances, i.e. the merge heights stored in the linkage matrix column Z[:, 2]. If you would rather specify the number of clusters, use criterion='maxclust'. If you're clustering with single linkage, some clusters are likely to be singletons (outliers).
help(fcluster) gives the needed information on how to use the function, and so do the docs.
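As a minimal sketch of the two criteria mentioned above, the snippet below feeds the linkage matrix from the question straight into fcluster and groups the observation indices into sets. The helper clusters_as_sets is a hypothetical name introduced here for illustration, not part of SciPy; in practice Z would come from linkage(X, ...) on your own observations.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

# Linkage matrix copied from the question (four observations, indexed 0..3).
Z = np.array([
    [1.0, 3.0, 0.08, 2.0],
    [2.0, 4.0, 0.28813559, 3.0],
    [0.0, 5.0, 1.0, 4.0],
])


def clusters_as_sets(labels):
    """Group observation indices by flat-cluster label (hypothetical helper)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(int(label), set()).add(index)
    return list(groups.values())


# Cut the tree wherever the merge distance exceeds 0.4.
by_distance = fcluster(Z, t=0.4, criterion="distance")

# Alternatively, ask for at most two flat clusters.
by_count = fcluster(Z, t=2, criterion="maxclust")

sets_by_distance = clusters_as_sets(by_distance)
sets_by_count = clusters_as_sets(by_count)
print(sets_by_distance)
print(sets_by_count)
```

For this matrix both calls should separate observation 0 from observations 1, 2 and 3. The integer labels themselves are only identifiers, so it is safer to compare the resulting sets than to rely on which cluster is numbered 1.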