So here is a way to do it in R, using MorphAdorner, the Northwestern University lemmatizer, via its web service.
lemmatize <- function(wordlist) {
  get.lemma <- function(word, url) {
    response <- GET(url, query=list(spelling=word, standardize="",
                                    wordClass="", wordClass2="",
                                    corpusConfig="ncf",  # Nineteenth Century Fiction
                                    media="xml"))
    content <- content(response, type="text")
    xml     <- xmlInternalTreeParse(content)
    return(xmlValue(xml["//lemma"][[1]]))
  }
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  return(sapply(wordlist, get.lemma, url=url))
}
words <- c("is","am","was","are")
lemmatize(words)
# is am was are
# "be" "be" "be" "be"
As I suspect you are aware, correct lemmatization requires knowing the word class (part of speech) and the contextually correct spelling, and the result also depends on which corpus configuration is used.
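To illustrate the word-class dependence, the same endpoint can in principle be queried with the `wordClass` parameter filled in rather than left blank. This is only a sketch under assumptions: it assumes the service is still reachable, and the tag values passed for `wordClass` are guesses at what MorphAdorner expects, not confirmed values.

```r
library(httr)
library(XML)

# Sketch: ask for a lemma with an explicit part of speech, so that an
# ambiguous form like "leaves" can resolve differently as noun vs. verb.
# NOTE: the wordClass values used here are assumptions about the service's
# tag set; consult the MorphAdorner documentation for the actual tags.
lemmatize.pos <- function(word, pos) {
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  response <- GET(url, query=list(spelling=word, standardize="",
                                  wordClass=pos, wordClass2="",
                                  corpusConfig="ncf", media="xml"))
  xml <- xmlInternalTreeParse(content(response, type="text"))
  xmlValue(xml["//lemma"][[1]])
}

lemmatize.pos("leaves", "noun")  # noun reading
lemmatize.pos("leaves", "verb")  # verb reading
```

If the service rejects unrecognized `wordClass` values, leaving the field empty (as in the function above) falls back to the service's own disambiguation.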