Question

I have a load of documents, each containing a load of key-value pairs. The keys might not be unique, so there may be multiple keys of the same type with different values.

I want to compare the similarity of the keys between two documents; more specifically, the string similarity of their values. I am thinking of using something like the Smith-Waterman algorithm for the comparison.

So I've drawn a picture of how I'm thinking about representing the data -

[Image: a matrix of pairwise similarity scores, with one document's values as rows and the other's as columns]

The values in the cells are the result of the Smith-Waterman algorithm (or some other string similarity metric).
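For concreteness, here is a minimal sketch of how such a matrix could be built in Python, using difflib.SequenceMatcher as a stand-in for Smith-Waterman (the value lists are made up):

import numpy as np
from difflib import SequenceMatcher

doc1_values = ["red thing", "blue thing", "green thing"]  # made-up example values
doc2_values = ["red thing", "blue item"]

# Rows are Doc1 values, columns are Doc2 values; cells hold string similarity.
sim_matrix = np.array([[SequenceMatcher(None, a, b).ratio() for b in doc2_values]
                       for a in doc1_values])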

Imagine that this matrix represents a key type of "things". I then need to add the "things" similarity score into a vector as a 0 or 1. That's OK.

What I can't figure out is how to determine whether the matrix is similar or not. Ideally, I want to convert the matrix to a number between 0 and 1, and then I'll just set a threshold to score it as either 0 or 1.

Any ideas how I can create a score for the matrix? Does anyone know any algorithms that do this type of thing? (Obviously the way Smith-Waterman itself works is somewhat applicable.)


Solution

As I understand it, Document 1 and Document 2 may have a different number of keys, and you want a final similarity score between 0 and 1. If so, I would propose the following algorithm:

  1. Initialize the sum of maxima to 0.
  2. Select the maximum value from the doc-doc matrix and add it to the sum of maxima.
  3. Remove the row and column containing that maximum value from the matrix.
  4. Repeat steps 2-3 until the rows or columns are exhausted.
  5. Divide the sum of maxima by the average number of keys in the two documents.

The final score equals 1 if both documents have the same number of keys and every value from Doc 1 has an exact equivalent in Doc 2.

You haven't mentioned which software you are using, but here is an R example of a function computing such a similarity (it takes an object of class matrix as input):

eval.sim <- function(sim.matrix){
  similarity <- 0
  # Average number of keys in the two documents.
  denominator <- sum(dim(sim.matrix)) / 2
  # Greedily take the global maximum, then drop its row and column.
  for(i in seq_len(min(dim(sim.matrix)) - 1)){
    extract <- which(sim.matrix == max(sim.matrix), arr.ind = TRUE)[1, ]
    similarity <- similarity + sim.matrix[extract[1], extract[2]]
    # drop = FALSE keeps the object a matrix when a dimension reaches 1.
    sim.matrix <- sim.matrix[-extract[1], -extract[2], drop = FALSE]
  }
  # One row or column remains; add its best value.
  similarity <- similarity + max(sim.matrix)
  similarity / denominator
}

In Python:

import numpy as np

def score_matrix(sim_matrix):
    """Collapse a pairwise similarity matrix into a single score."""
    similarity = 0.0
    # Average number of keys in the two documents.
    denominator = sum(sim_matrix.shape) / 2
    for _ in range(min(sim_matrix.shape)):
        # Position of the current global maximum.
        x, y = np.unravel_index(np.argmax(sim_matrix), sim_matrix.shape)
        similarity += sim_matrix[x, y]
        # Remove the matched row and column so each key is used once.
        sim_matrix = np.delete(sim_matrix, x, axis=0)
        sim_matrix = np.delete(sim_matrix, y, axis=1)
    return similarity / denominator
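For example, with a made-up 3x2 matrix (the shape need not be square):

scores = np.array([[0.88, 0.10],
                   [0.30, 1.00],
                   [0.10, 0.60]])
print(score_matrix(scores))  # 1.88 / 2.5 = 0.752 for this made-up data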

Other tips

If your goal is to transform your matrix into a number (your similarity measure), you may want to use a matrix norm.

For instance, using the Frobenius norm on your example would return 1.488086.
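In NumPy this is a one-liner; the matrix below is made up and will not reproduce the exact value above:

import numpy as np

sim_matrix = np.array([[0.88, 0.20, 0.10],   # made-up values standing in
                       [0.30, 1.00, 0.00],   # for the matrix in the image
                       [0.10, 0.40, 0.60]])
frobenius = np.linalg.norm(sim_matrix, ord='fro')  # sqrt of summed squares, ~1.563 here

Note that the Frobenius norm grows with matrix size, so you would still need to normalize it to land in [0, 1].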

I think your goal is to find how similar two documents are. If that's the case, I suggest the following algorithm:

This approach measures how similar Doc1 is with respect to Doc2. (The value will differ for Doc2 with respect to Doc1 if the matrix is not square.)

  1. In your matrix between Doc1 and Doc2, take the maximum similarity value row by row.
  2. Sum these maxima and divide by the number of rows.
  3. This gives you the similarity index. For example, in your matrix image the row-by-row maxima are 0.88, 1, and 0.6, so (0.88 + 1 + 0.6) / 3 = 82.67%.

This means Doc2 is 82.67% similar to Doc1. The similarity cannot exceed this value, since we selected the most similar item in each row.
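A minimal NumPy sketch of this row-max approach (the function name is mine):

import numpy as np

def row_max_similarity(sim_matrix):
    # Best match for each row (Doc1 value), averaged over all rows.
    return sim_matrix.max(axis=1).mean()

With row maxima of 0.88, 1 and 0.6 this returns 0.8267, i.e. 82.67%.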

Licensed under: CC-BY-SA with attribution