I have never worked with nltk before, so there may be a better solution. My code snippet does the following:
It reads the file to be checked for English and non-English words, frequencyList.txt, into a variable named lines.
It then opens a new file named eng_words_only.txt. This file starts out empty; after the script runs it will contain all the English words present in frequencyList.txt.
For every word in frequencyList.txt, it checks whether the word is also present in wordnet. If it is, the word is written to eng_words_only.txt; otherwise nothing happens. Please note that I am using wordnet just for demo purposes. It doesn't contain all English words!
Code:
from nltk.corpus import wordnet

# Read the words to check, one per line
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Open the output file for writing ("w" starts from an empty file on every run)
with open("eng_words_only.txt", "w") as eWords:
    for line in lines:
        w = line.strip()  # drop the trailing newline, otherwise the lookup fails
        if not w:
            continue  # skip blank lines
        if not wordnet.synsets(w):  # no synsets: treated as non-English
            print('not ' + w)
        else:  # at least one synset: treated as English
            print('yes ' + w)
            eWords.write(w + "\n")  # write to file

Testing: I first created a file named frequencyList.txt with the following contents:
cat
meoooow
mouse
Executing the code snippet then prints the following to the console:
yes cat
not meoooow
yes mouse
The script also creates eng_words_only.txt, which contains only the words recognised as English, here cat and mouse. Note that each line must be stripped before the lookup: readlines() keeps the trailing newline, and wordnet.synsets("cat\n") finds nothing, so without strip() only the last line (which has no newline) would be accepted. Keep in mind that wordnet covers only nouns, verbs, adjectives and adverbs, so common function words may be rejected. This is the reason why you should use a good source instead of wordnet.
Please note: The Python script file and frequencyList.txt should be in the same directory. Also, instead of frequencyList.txt you can use any file that you want to check. In that case, don't forget to change the file names in the code snippet too.
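One detail worth keeping in mind when swapping in your own input file: readlines() keeps the trailing newline on every line except possibly the last, so a raw line such as "cat\n" will not match an entry "cat" in a dictionary or set. The sketch below illustrates this with an in-memory file. Here io.StringIO merely stands in for frequencyList.txt, and the helper name clean_lines is made up for this example.

```python
import io


def clean_lines(handle):
    """Return the non-empty lines of a file-like object, stripped of whitespace."""
    return [line.strip() for line in handle if line.strip()]


# io.StringIO stands in for open("frequencyList.txt"); same contents as the test file.
fake_file = io.StringIO("cat\nmeoooow\nmouse")

raw = fake_file.readlines()
print(raw)            # ['cat\n', 'meoooow\n', 'mouse']
print("cat" in raw)   # False: the stored entry is 'cat\n'

fake_file.seek(0)     # rewind so the same content can be read again
words = clean_lines(fake_file)
print(words)          # ['cat', 'meoooow', 'mouse']
print("cat" in words) # True
```

Stripping once, right after reading, means every later comparison works on clean words regardless of how the input file ends.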
Second solution: Although you didn't ask for it, there is another way to do this English-word test. Here wordlist-eng.txt is a file that contains English words, one per line. You have to keep wordlist-eng.txt, frequencyList.txt and the Python script in the same directory. Here is the code:
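# Load the reference word list into a set for fast lookups
with open("wordlist-eng.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

# Read the words to check
with open("frequencyList.txt", "r") as fList:
    lines = fList.readlines()

# Keep only the words found in the reference list ("w" starts from an empty file)
with open("eng_words_only.txt", "w") as eWords:
    for line in lines:
        w = line.strip()  # drop the trailing newline before comparing
        if w and w.lower() in english_words:
            eWords.write(w + "\n")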
After executing the script, eng_words_only.txt will contain all the English words that were present in frequencyList.txt.
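The same idea can be packaged as a small function, which makes it easier to try out without touching the disk. This is only a sketch: the function name keep_english and the tiny in-memory word list are assumptions for illustration; in practice the set would be loaded from wordlist-eng.txt as above.

```python
def keep_english(candidates, english_words):
    """Return the candidates whose stripped, lower-cased form is in english_words."""
    kept = []
    for raw in candidates:
        word = raw.strip()
        # Skip blank lines, then compare case-insensitively against the set.
        if word and word.lower() in english_words:
            kept.append(word)
    return kept


# Hypothetical stand-in for the contents of wordlist-eng.txt.
english_words = {"cat", "mouse", "dog"}

# Lines as readlines() might return them for a small input file.
lines = ["cat\n", "meoooow\n", "Mouse"]

print(keep_english(lines, english_words))  # ['cat', 'Mouse']
```

Using a set keeps each membership test fast even for a large word list, and the original capitalisation of each kept word is preserved in the result.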
I hope this was helpful.