Question

I'm doing a lot of natural language processing with somewhat unusual requirements. I often get tasks similar to lemmatization: given a word (or just a piece of text), I need to find some pattern and transform the word somehow. For example, I may need to correct misspellings, e.g. given the word "eatin" I need to transform it to "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha", and so on.

So I'm looking for a generic tool that lets me define transformation rules for such cases. Rules might look something like this:

 {w}in   ->  {w}ing
 aha(ha)+  ->  ahaha

That is, I need to be able to use patterns captured on the left side on the right side.

I work with linguists who don't know programming at all, so ideally this tool should read its rules from external files written in a simple rule language.
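To make the idea of an external rule file concrete, here is a minimal sketch in plain Java (so it is callable from Clojure). It is not an existing library: the class name RewriteRules, the "pattern -> replacement" line format, and the '#' comment convention are all assumptions made for illustration. The left side uses standard java.util.regex syntax, and the right side refers to captured groups as $1, $2, and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not a real library): an ordered list of "regex -> replacement" rules. */
final class RewriteRules {
    /** One compiled rule: a regex plus a replacement string that may use $1, $2, ... */
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses rule text, one rule per line; blank lines and lines starting with '#' are skipped. */
    static RewriteRules parse(String text) {
        RewriteRules set = new RewriteRules();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Assumed format: everything before the first "->" is the regex.
            int arrow = trimmed.indexOf("->");
            if (arrow < 0) {
                throw new IllegalArgumentException("Missing '->' in rule: " + trimmed);
            }
            String left = trimmed.substring(0, arrow).trim();
            String right = trimmed.substring(arrow + 2).trim();
            // UNICODE_CHARACTER_CLASS makes \w and \b work for non-ASCII alphabets too.
            set.rules.add(new Rule(Pattern.compile(left, Pattern.UNICODE_CHARACTER_CLASS), right));
        }
        return set;
    }

    /** Applies every rule in file order; each rule sees the output of the previous one. */
    String apply(String input) {
        String result = input;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        // The two example rules from the question, rewritten as Java regexes.
        String ruleText =
            "# {w}in -> {w}ing\n"
            + "\\b(\\w+)in\\b -> $1ing\n"
            + "aha(ha)+ -> ahaha\n";
        RewriteRules rules = RewriteRules.parse(ruleText);
        System.out.println(rules.apply("eatin"));     // prints "eating"
        System.out.println(rules.apply("ahahahaha")); // prints "ahaha"
    }
}
```

In a real setup the rule text would come from a file the linguists maintain, for example via Files.readString(Path.of("rules.txt")) on Java 11+ (the file name is made up). Note that the first rule is deliberately naive: it also turns "begin" into "beging", so a usable rule file would need narrower patterns or an exception list.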

I'm doing this project in Clojure, so ideally this tool should be a library for one of the JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.

There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not an expert in all of them, so I may have missed the tool I need there. If so, please let me know.

Note that I'm working with several languages and perform very different tasks, so specific lemmatizers, stemmers, spelling correctors and so on for specific languages do not fit my needs. I really need a more generic tool.

UPD. It seems like I need to give some more details/examples of what I need.

Basically, I need a function that replaces text by some kind of regex (similar to Java's String.replaceAll()) but with the ability to use the captured text in the replacement string. For example, in real-world text people often repeat characters to emphasize a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace the repeated "oooo" with a single character. So there may be a rule like this (in a syntax similar to what I used earlier in this post):

{chars1}<char>+{chars2}?  ->  {chars1}<char>{chars2}

That is, replace a word that starts with some chars (chars1), contains at least 3 repeated chars, and possibly ends with some other chars (chars2) with a similar string that contains only a single <char>. The key point here is that we capture <char> on the left side of a rule and use it on the right side.
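For reference, this particular rule can be expressed with a plain Java regex, since java.util.regex supports backreferences both inside the pattern (\1) and in the replacement string ($1). The sketch below is only an illustration: the class name RepeatCollapser is made up, and the threshold of three repetitions follows the "at least 3" wording above so that ordinary double letters are left alone.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: collapses a letter repeated three or more times into one letter. */
final class RepeatCollapser {
    // (\p{L}) captures a single letter of any alphabet;
    // \1{2,} then requires that same letter at least two more times.
    private static final Pattern REPEATED = Pattern.compile("(\\p{L})\\1{2,}");

    static String collapse(String text) {
        // $1 in the replacement is the letter captured by the first group.
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This film is soooo boooring..."));
        // prints "This film is so boring..."
    }
}
```

The punctuation run "..." is untouched because \p{L} matches letters only. Collapsing to a single letter is a simplification: an emphasized word whose normal spelling has a double letter (e.g. "goood") would come out as "god", so such cases would need a follow-up dictionary check or a separate rule.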


Solution

I've found ICU transforms (http://userguide.icu-project.org/transforms/general) to be useful as well for general pattern/transform tasks like this. Ignore the material about transliteration; it's nice for doing a lot of things.

You can load the rules from a file into a String, register them, and so on. The rule syntax is described here:

http://userguide.icu-project.org/transforms/general/rules

OTHER TIPS

I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmers are used in the Lucene search engine.

Licensed under: CC-BY-SA with attribution