While working on a new feature for eSpeak GUI, I started looking into language identification. Forcing users to manually choose the text’s language is a nuisance, so guessing it, for example by checking which system dictionary contains the most words from the text, would surely be beneficial.
After a quick search I learned that it’s much easier than that: it’s possible to reliably determine the language based on statistical n-gram information. Ignoring the fact that I now officially hate Firefox, Chromium, OpenOffice.org and everyone else out there for not implementing this and making me spend the day changing the spell-checker’s language, I was left with the choice of how to use this in eSpeak GUI.
The first option I found was TextCat, which is also the only library I’ve found to be packaged for Debian. However, even ignoring the fact that upstream isn’t maintaining it any more (such a library shouldn’t need too much maintenance, after all), the package declares incorrect dependencies (bug filed a month ago, no response yet) and the API is also pretty crappy (it requires a physical file indicating the location of the statistical models).
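To make the n-gram idea concrete, here is a minimal sketch of one common approach: ranked character n-gram profiles compared with an "out-of-place" distance, in the style of Cavnar and Trenkle's method. The function names, the two tiny training samples and the profile size are assumptions made up for illustration; they are not taken from TextCat or any other library, and real models need far more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the text's most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(rank - ranks[gram]) if gram in ranks else penalty
               for rank, gram in enumerate(doc_profile))


def guess_language(text, models):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))


# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house where they all live",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es la casa donde viven todos",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(guess_language("the dog jumps over the house", MODELS))
```

With samples this small the result is only reliable for text that closely resembles the training sentences; the quality of the models depends almost entirely on the size and quality of the sample texts.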
Unrelated to that, I’ve also found that the Catalan text samples it includes are incorrect, so the same may be true for other languages. I guess it’d make sense to work on a new (and fully Unicode) collection of language samples. I’ve thought of using something like the Universal Declaration of Human Rights, since that way all languages can have the same text, but as it is a legal text the resulting statistics may be biased by some words being repeated too often.
Looking for other alternatives to the TextCat library I’ve only found the following:
- TextCat (same name, different code): PHP licensed, so incompatible with GPL projects.
- Mguesser (part of mnogosearch-mysql): it’s a standalone executable and not a library.
- SpamAssassin’s TextCat.pm: also a standalone executable, this time written in Perl. Apparently they were using a fork of TextCat (the original library, not the PHP licensed one) before that.
So it looks like I’ll have to start by getting a good collection of text samples I can use to generate the statistical data. Then I have several options for how to actually use it. As I see it, these are my possibilities:
- Fixing libtextcat’s packaging and just using that.
- Taking it over as new upstream maintainer. Not my preferred option as I don’t really feel like maintaining a C library at this point.
- Trying to convince the maintainer of the new TextCat (last commit in January this year, and a saner API) to re-license it in a GPL-compatible way, packaging that and seeing how well it works (I haven’t tried it out yet).
- Writing my own implementation in Python, maybe based upon this example or TextCat.pm.
Any other ideas, pointers to some library I may have missed or offers to collaborate are very welcome. Please also note that my intention in writing this post is not only to rant about the lack of a well-maintained, ready-to-use library, but especially to raise awareness of the topic of language identification. I’d love to see this feature all around the desktop, just like (and in combination with) spell-checking, which is already omnipresent.





Chromium also contains a language detection library that may be of use. It’s used for the Google Translate integration.
http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/
The Sonnet spellchecking framework from KDElibs supports automatic language detection, and it seems to work quite well for my use, all around my KDE desktop. You can see the detectLanguage function here: http://websvn.kde.org/trunk/KDE/kdelibs/kdecore/sonnet/globals.cpp?view=markup
I think depending on kdelibs for such a small feature would be overkill, but you could borrow the implementation ideas.
Sounds like a great project, keep us informed.
Thanks all for your comments so far.
@ David:
That looks really interesting, it’s certainly something to keep an eye on. It’s probably a bit too big for what I need in my application, though.
@ xvello:
Oh, thanks for that pointer. From a quick look, it seems they are using the first method I thought of (checking which system dictionary contains the most words from the text).
Now that you’ve got me considering that option again, it’s probably the easiest one for eSpeak GUI. The advantage of the n-gram method, its better scalability for big texts and large numbers of possible languages, isn’t that relevant to me since I only want to check against the languages the user knows.
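As a rough sketch of that dictionary-based method, restricted to the languages the user knows: count how many words of the text appear in each language's word list and pick the language with the most hits. The word lists below are hypothetical miniature examples; a real application would load the installed spell-checker dictionaries instead, and nothing here reflects how Sonnet actually implements it.

```python
import re

# Hypothetical miniature word lists (assumption); a real application would
# load the dictionaries for the languages the user has enabled.
WORDLISTS = {
    "en": {"the", "is", "a", "and", "house", "dog", "this", "of", "in"},
    "ca": {"el", "la", "és", "un", "una", "i", "casa", "gos", "aquest", "de", "a"},
}


def guess_by_dictionary(text, wordlists):
    """Return the language whose word list contains the most words of `text`."""
    words = re.findall(r"\w+", text.lower())
    hits = {lang: sum(word in vocab for word in words)
            for lang, vocab in wordlists.items()}
    best = max(hits, key=hits.get)
    # None signals that no word was recognised in any language.
    return best if hits[best] > 0 else None


print(guess_by_dictionary("Aquest gos és a la casa", WORDLISTS))
```

Since only a handful of languages are checked, the cost is one set lookup per word and language, which is negligible for the short texts a speech front end typically handles.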
OpenOffice.org is using a forked GPL TextCat, see http://qa.openoffice.org/issues/show_bug.cgi?id=73173
I’ve turned Mguesser into a library, in case you’re interested: http://www.miriamruiz.es/code/library_guesser-0.4.tgz
I expect my first thought would have been to slap together some software to calculate each language’s probability using Bayes’ theorem, but I’m glad there’s research (and software) for this. Please post again when you choose a solution!
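For comparison, the Bayes' theorem idea mentioned above could be sketched as a naive Bayes classifier over character trigrams. This is only an illustration under stated assumptions: a uniform prior over languages, add-one smoothing, and invented function names and toy samples; it is not how any of the libraries discussed here work.

```python
import math
from collections import Counter


def trigrams(text):
    """Split text into overlapping character trigrams, padded with spaces."""
    text = f" {text.lower()} "
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Count trigram occurrences per language."""
    return {lang: Counter(trigrams(sample)) for lang, sample in samples.items()}


def log_likelihood(text, counts, vocab_size):
    """log P(text | language) with add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[gram] + 1) / (total + vocab_size))
               for gram in trigrams(text))


def classify(text, models):
    """With a uniform prior, the most likely language maximises the likelihood."""
    vocab = set().union(*models.values())
    return max(models,
               key=lambda lang: log_likelihood(text, models[lang], len(vocab)))


# Toy training samples (assumption): far too small for real use.
models = train({
    "en": "the quick brown fox jumps over the lazy dog and this is the house where they all live",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es la casa donde viven todos",
})
print(classify("the dog jumps over the house", models))
```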
Subscribing to the “language identification” tag.
OpenOffice.org is using BSD licensed libtextcat patched for UTF-8 support (where “gram” is a real multibyte char), see http://qa.openoffice.org/issues/show_bug.cgi?id=73173
If you want help maintaining the upstream C library, I could help you.
I don’t fully comprehend the other tasks, but this is a worthy project.
I’m mailing you the details privately.
Is there a project page?
Greetings,
Apache Nutch (a Java crawler) also has language identification code based on n-grams, AFAIK, in org.apache.nutch.analysis.lang.LanguageIdentifier
http://nutch.apache.org/apidocs-1.2/org/apache/nutch/analysis/lang/LanguageIdentifier.html
We already implemented our own Python code based on the n-gram technique in the Translate Toolkit, which is currently used by Virtaal to help users do language selection. You can see the code here:
http://translate.svn.sourceforge.net/viewvc/translate/src/trunk/translate/lang/ngram.py
http://translate.svn.sourceforge.net/viewvc/translate/src/trunk/translate/lang/identify.py
It is based on a toy Python implementation we found at the time.
Please work with us to make the best Python language detection available. Our code works well, but the models aren’t great, and there are some languages it struggles to identify at the moment. We tried to remove some of the incorrect models from our copy to try to improve the quality. Let me know if you want to discuss some possibilities for reuse/factoring out.