Question

I'm trying to create a classifier that requires less "manual" work from the user. By less manual work I mean that there won't be an initial phase of manually labeling a training set, as in supervised machine learning.

My dataset is composed of instances that differ greatly by class. They are documents containing orders for specific products from different clients, and every client has its own template.

For example I got:

[Client A]
Image
Date: xxx  Order: 
Products:
Table

[Client B]
Date: xxx
Order
Image
Products:
table
Image

Currently I do the classification with a simple check on every document for the presence of a specific feature, which is manually identified by a user (by area and using edit distance).

The classes are really different (in some cases), and when I try an unsupervised approach like agglomerative clustering, the classes split really well. After that, using measures like TF/ICF, the features with the highest values (in my case I use tokenized and normalized text as features) are often the same ones I use in my manual classification.

The criteria I use to stop the clustering iterations vary (I have different configurations), such as maximum distance or maximum number of clusters.
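Both stopping criteria map directly onto scikit-learn's `AgglomerativeClustering`. A minimal sketch, assuming TF-IDF vectors over tokenized documents; the toy documents and the threshold value are illustrative, not from the original post:

```python
# Sketch: the two stopping criteria as scikit-learn configurations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "date order products table client a",
    "date order products client a",
    "order image products table client b",
    "order image products client b",
]
# Dense TF-IDF matrix (AgglomerativeClustering needs dense input).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Criterion 1: stop when a fixed number of clusters is reached.
by_count = AgglomerativeClustering(n_clusters=2).fit(X)

# Criterion 2: stop merging once the linkage distance exceeds a threshold.
by_distance = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0
).fit(X)
print(by_count.labels_, by_distance.labels_)
```

The distance-threshold variant lets the number of clusters emerge from the data, which matters when the number of client templates is unknown up front.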

After that, my idea is that once the clusters are created, a user will label each cluster, identifying the class by the best TF/ICF (term frequency, inverse cluster frequency) features found in each cluster. The clusters will then be used as a "classifier". I know this approach will lead to worse classification, but that's not a problem.
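The TF/ICF scoring described above can be sketched in a few lines. This is a minimal interpretation (token counts per cluster, score = in-cluster frequency times log of inverse cluster frequency); the function name and toy clusters are hypothetical:

```python
# Sketch of TF-ICF (term frequency, inverse cluster frequency) scoring.
import math
from collections import Counter

def tf_icf(clusters):
    """Per-cluster token scores: count in cluster * log(N / clusters containing token)."""
    n = len(clusters)
    # In how many clusters does each token appear?
    cf = Counter(tok for c in clusters for tok in set(c))
    return [
        {tok: count * math.log(n / cf[tok]) for tok, count in c.items()}
        for c in clusters
    ]

clusters = [
    Counter(["date", "order", "code123", "code123"]),
    Counter(["date", "order", "code456"]),
]
scores = tf_icf(clusters)
# Tokens shared by every cluster ("date", "order") score 0;
# cluster-specific tokens like the customer codes score highest.
```

This matches the observation in the question: the highest-scoring features per cluster tend to be exactly the template-specific tokens a user would pick manually.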

The problem is that when two classes are really similar (I have classes where the only difference is the customer code, for example), they are really difficult to split.

Any ideas on how to approach this problem? Also, is there a way for my algorithm to detect a "new class" appearing in the stream?


Solution

If you have a good amount of instances for every class, you can try using a density-based approach for clustering, with algorithms like DBSCAN.

If you can label at least some of the documents, you can use semi-supervised learning. Usually, when SSL is used for clustering, you need to specify "cannot link" and "must link" constraints for some pairs of instances, which is basically labeling some instances. One algorithm that follows this approach is HMRF-KMeans (Hidden Markov Random Fields K-Means).
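To make the constraint idea concrete: HMRF-KMeans itself is not in scikit-learn, but the constraints it consumes are just pairs of instance indices. The sketch below only shows how such constraints are typically represented and checked against a candidate assignment; the helper name is hypothetical:

```python
# Sketch: must-link / cannot-link constraints for constrained clustering.
must_link = [(0, 1)]      # documents 0 and 1 belong to the same class
cannot_link = [(0, 2)]    # documents 0 and 2 belong to different classes

def violations(labels, must_link, cannot_link):
    """Count how many constraints a candidate cluster assignment breaks."""
    bad = sum(1 for i, j in must_link if labels[i] != labels[j])
    bad += sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return bad

print(violations([0, 0, 1], must_link, cannot_link))  # 0: both satisfied
print(violations([0, 1, 0], must_link, cannot_link))  # 2: both violated
```

For the hard cases in the question (classes differing only in a customer code), a handful of cannot-link pairs between such near-duplicate documents is exactly the kind of cheap supervision these algorithms exploit.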

Other tips

I can't comment because of lack of reputation. Do you have to use only agglomerative clustering?

I think k-means clustering is better for your use case. You can detect subtle differences with k-means.

If you need to use agglomerative clustering, you should tweak the dissimilarity measure.

Licensed under: CC-BY-SA with attribution