Working on a new feature for eSpeak GUI, I started looking into language identification. Forcing users to manually choose the text’s language is a botheration, so guessing it, either by checking which system dictionary contains the most words from the text or by some other method, would surely be beneficial.
After a quick search I learned that it’s much easier than this: it’s possible to reliably determine the language based on statistical n-gram information, that is, on how often short character sequences occur in the text. Ignoring the fact that now I officially hate Firefox, Chromium, OpenOffice.org and everyone else out there for not implementing this and having me spend the day changing the spell-checker’s language, I was left with the choice of how to use this in eSpeak GUI.
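To give a rough idea of what “n-gram information” means here, this is a minimal Python sketch of how such statistics could be collected. It is my own illustration and not code from any of the libraries mentioned in this post; the function name `ngram_profile`, the `_` word padding and the default limits are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the `top` most frequent character n-grams (lengths 1..max_n).

    Words are lower-cased and padded with "_" so that n-grams at the start
    and end of a word can be told apart from those in the middle.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties are broken alphabetically so the result
    # is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]
```

The ranked list is the language “fingerprint”: texts in the same language tend to put the same n-grams near the top, which is what makes the comparison work even on fairly short input.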
The first option I found was TextCat, which is also the only library I’ve found to be packaged for Debian. However, even ignoring the fact that upstream isn’t maintaining it any more (such a library shouldn’t need too much maintenance, after all), the package declares incorrect dependencies (bug filed a month ago, no response yet) and the API is also pretty crappy (it requires a physical file indicating the location of the statistical models).
Unrelated to that, I’ve also found that the Catalan text samples it includes are incorrect, so the same may be true for other languages. I guess it’d make sense to work on a new (and completely Unicode) collection of language samples. I’ve thought of using something like the Universal Declaration of Human Rights, since this way all languages can have the same text, but being more of a legal thing, its statistics may be biased by some words being repeated too often.
Looking for other alternatives to the TextCat library I’ve only found the following:
- TextCat (same name, different code): PHP licensed, so incompatible with GPL projects.
- Mguesser (part of mnogosearch-mysql): it’s a standalone executable and not a library.
- SpamAssassin’s TextCat.pm: also a standalone executable, this time written in Perl. Apparently they were using a fork of TextCat (the original library, not the PHP licensed one) before that.
So it looks like I’ll have to start by getting a good collection of text samples I can use to generate the statistical data. Then I have several options on how to actually use it. As I see it, those are my possibilities:
- Fixing libtextcat’s packaging and just using that.
- Taking it over as new upstream maintainer. Not my preferred option as I don’t really feel like maintaining a C library at this point.
- Trying to convince the maintainer of the new TextCat (last commit in January this year, and a saner API) to re-license it in a GPL-compatible way, packaging that and seeing how that one works (haven’t tried it out yet).
- Writing my own implementation in Python, maybe based upon this example or TextCat.pm.
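For the last option, a pure Python implementation could be quite small. The following is only a sketch under my own assumptions, not TextCat’s or TextCat.pm’s actual code: it ranks character n-grams per language and compares rankings with a simple “out-of-place” distance. The names `profile`, `out_of_place` and `guess_language` are hypothetical, and the two sample sentences are toy data I made up; real models would need far larger texts.

```python
from collections import Counter


def profile(text, max_n=3, top=300):
    """Map each of the `top` most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(doc, lang):
    """Sum of rank differences; n-grams missing from `lang` get a fixed penalty."""
    penalty = len(lang)
    return sum(
        abs(rank - lang[gram]) if gram in lang else penalty
        for gram, rank in doc.items()
    )


def guess_language(text, models):
    """Return the name of the model whose profile is closest to the text's."""
    doc = profile(text)
    return min(models, key=lambda name: out_of_place(doc, models[name]))


# Toy samples (assumption: invented for illustration, far too short for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "ca": "el gat negre dorm a la cadira i la nena llegeix un llibre a la cuina de casa",
}
MODELS = {name: profile(sample) for name, sample in SAMPLES.items()}
```

Usage would be something like `guess_language(some_text, MODELS)`. Note that the models live in plain dictionaries, so nothing here needs a configuration file on disk pointing at them.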
Any other ideas, pointers to some library I may have missed or offers to collaborate are very welcome. Please also note that my intention in writing this post is not only to rant about there being no well-maintained, ready-to-use library, but especially to raise awareness of language identification. I’d love to see this feature all around the desktop, just like (and in combination with) spell-checking, which is already omnipresent.