Ciphey/app/languageCheckerMod/LanguageChecker.py


"""
██████╗██╗██████╗ ██╗ ██╗███████╗██╗ ██╗
██╔════╝██║██╔══██╗██║ ██║██╔════╝╚██╗ ██╔╝
██║ ██║██████╔╝███████║█████╗ ╚████╔╝
██║ ██║██╔═══╝ ██╔══██║██╔══╝ ╚██╔╝
╚██████╗██║██║ ██║ ██║███████╗ ██║
© Brandon Skerritt
Github: brandonskerritt
Class to determine whether something is English or not.
1. Calculate the chi-squared score of a sentence.
2. If the score is significantly lower than the average score, it _might_ be English.
    2.1. If the text _might_ be English, compare it to the sorted dictionary
         in O(n log n) time.
         This produces a percentage answering "How much of this text is in the dictionary?"
         The dictionary contains:
            * 20,000 most common US words
            * 10,000 most common UK words (there's no repetition between the two)
            * The top 10,000 passwords
         If the text "looks like" English (chi-squared) and it contains English words, we can conclude it is
         very likely English. The alternative is running the dictionary check against an entire 479k-word dictionary (slower).
    2.2. If the score does not look English, but we haven't tested enough texts to create an average, then test it against the dictionary anyway.
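As a rough illustration of step 1, the sketch below computes a chi-squared score for a piece of text against the English letter frequencies listed further down in this docstring. It is not the code in chisquared.py: the names `chi_squared_score` and `ENGLISH_FREQ` are hypothetical, and counting only the letters a-z is an assumption.

```python
from collections import Counter
from string import ascii_lowercase

# Expected English letter frequencies for a..z, copied from this docstring.
ENGLISH_FREQ = [
    0.0855, 0.0160, 0.0316, 0.0387, 0.1210, 0.0218, 0.0209, 0.0496, 0.0733,
    0.0022, 0.0081, 0.0421, 0.0253, 0.0717, 0.0747, 0.0207, 0.0010, 0.0633,
    0.0673, 0.0894, 0.0268, 0.0106, 0.0183, 0.0019, 0.0172, 0.0011,
]


def chi_squared_score(text, expected=ENGLISH_FREQ):
    # Hypothetical helper: lower scores mean the letter distribution
    # is closer to the expected one.
    letters = [c for c in text.lower() if c in ascii_lowercase]
    total = len(letters)
    if total == 0:
        # Assumption: text without any letters is treated as maximally un-English.
        return float("inf")
    counts = Counter(letters)
    score = 0.0
    for letter, freq in zip(ascii_lowercase, expected):
        expected_count = freq * total
        score += (counts[letter] - expected_count) ** 2 / expected_count
    return score
```

Under these assumptions, ordinary English sentences should score far lower than text dominated by rare letters such as z, q, x and j.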
Things to optimise:
* We only run the dictionary check if the chi-squared score is 20% smaller than the average chi-squared score
* We consider the text "English" if 45% of it matches the dictionary
* We run the dictionary check if there have been fewer than 10 chi-squared tests in total
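One possible reading of these three thresholds is sketched below. The function names, parameters and the use of a plain Python set for the dictionary are all assumptions made for illustration (the real checker uses a sorted dictionary and lives in dictionaryChecker.py and chisquared.py); only the values 20%, 45% and 10 come from the list above.

```python
from string import punctuation


def dictionary_match_fraction(text, dictionary):
    # Hypothetical helper: fraction of words in `text` found in `dictionary`.
    words = [w.strip(punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def looks_english(chi_score, previous_scores, text, dictionary,
                  chi_margin=0.20, match_threshold=0.45, min_tests=10):
    # Hypothetical decision rule combining the three thresholds.
    if len(previous_scores) < min_tests:
        # Too few chi-squared tests to trust the average: always check the dictionary.
        run_dictionary = True
    else:
        average = sum(previous_scores) / len(previous_scores)
        # Only check the dictionary if the score is at least 20% below the average.
        run_dictionary = chi_score <= average * (1 - chi_margin)
    if not run_dictionary:
        return False
    return dictionary_match_fraction(text, dictionary) >= match_threshold
```

Keeping the thresholds as keyword arguments makes it easy to experiment with different values when tuning.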
How to add a language:
* Download your desired dictionary. Try to make it the most popular words, for example. Place this file into this folder as languagename.txt
  As an example, this comes built in with english.txt
* Find the statistical frequency of each letter in that language.
  For English, we have:
    self.languages = {
        "English":
        [0.0855, 0.0160, 0.0316, 0.0387, 0.1210, 0.0218, 0.0209, 0.0496, 0.0733, 0.0022, 0.0081, 0.0421, 0.0253, 0.0717, 0.0747, 0.0207, 0.0010, 0.0633, 0.0673, 0.0894, 0.0268, 0.0106, 0.0183, 0.0019, 0.0172, 0.0011]
    }
  in chisquared.py
* To add your language, do:
    self.languages = {
        "English":
        [0.0855, 0.0160, 0.0316, 0.0387, 0.1210, 0.0218, 0.0209, 0.0496, 0.0733, 0.0022, 0.0081, 0.0421, 0.0253, 0.0717, 0.0747, 0.0207, 0.0010, 0.0633, 0.0673, 0.0894, 0.0268, 0.0106, 0.0183, 0.0019, 0.0172, 0.0011],
        "German": [0.0973]
    }
  In alphabetical order
And you're done! Make sure the language name matches in both places (the dictionary file name and the key in self.languages).
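If you do not have a published frequency table for your language, one way to estimate it is to count letters in a large sample of text. The helper below is a hypothetical sketch, not part of this module; it only counts a-z, so letters such as German umlauts or ß are ignored under this assumption.

```python
from collections import Counter
from string import ascii_lowercase


def letter_frequencies(sample_text):
    # Hypothetical helper: relative frequency of each letter a..z,
    # returned in alphabetical order and rounded to 4 decimal places.
    letters = [c for c in sample_text.lower() if c in ascii_lowercase]
    if not letters:
        raise ValueError("sample_text contains no a-z letters")
    counts = Counter(letters)
    return [round(counts[letter] / len(letters), 4) for letter in ascii_lowercase]
```

The returned list has 26 entries, matching the shape of the "English" entry above.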
"""
from string import punctuation
import app.languageCheckerMod.dictionaryChecker
import app.languageCheckerMod.chisquared
from app import languageCheckerMod
class LanguageChecker:
    def __init__(self):
        self.dictionary = languageCheckerMod.dictionaryChecker.dictionaryChecker()
        self.chi = languageCheckerMod.chisquared.chiSquared()

    def __add__(self, otherLanguageObject):
        # sets the added chi squared to be of this one
        new = otherLanguageObject.getChiSquaredObj() + self.getChiSquaredObj()
        self.chi = new
        return self

    def checkLanguage(self, text):
        if text == "":
            return False
        result = self.chi.checkChi(text)
        if result:
            result2 = self.dictionary.confirmlanguage(text, "English")
            if result2:
                return True
            else:
                return False
        else:
            return False

    def getChiSquaredObj(self):
        return self.chi

    def getChiScore(self):
        return self.chi.totalChi