Language Identification and its state in Free Software

Working on a new feature for eSpeak GUI I started looking into language identification. Forcing users to manually choose the text’s language is a botheration, so trying to guess it by checking which system dictionary contains the most words from the text or some other method would surely be beneficial.

After a quick search I learned that it’s much easier than this: it’s possible to reliably determine the language based on statistical n-gram information. Ignoring the fact that now I officially hate Firefox, Chromium, OpenOffice.org and everyone else out there for not implementing this and having me spend the day changing the spell-checker’s language, I was left with the choice of how to use this in eSpeak GUI.
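To give an idea of what "statistical n-gram information" means in practice, here is a minimal Python sketch of the rank-order ("out-of-place") comparison described by Cavnar and Trenkle, which is the approach TextCat is based on. Everything in it is an assumption made for illustration: the function names, the `max_n` and `top` values and especially the two tiny sample sentences, which are far too short to produce usable models.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the character n-grams of `text`, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark the word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum the rank differences between a document and a language profile."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)  # cost of an n-gram the language lacks
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Return the key of the profile closest to the text's own profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Hypothetical toy samples; real models need much longer texts.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog chases the fox through the woods",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego "
          "el perro persigue al zorro por el bosque",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

With profiles built from proper samples, `guess_language(some_text, PROFILES)` returns the key of the closest language. The quality of the guess depends entirely on the samples used to build the profiles.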

The first option I found was TextCat, which is also the only library I’ve found to be packaged for Debian. However, ignoring the fact that upstream isn’t maintaining it any more (such a library shouldn’t need too much maintenance, after all), the package declares incorrect dependencies (bug filed a month ago, no response yet) and the API is also pretty crappy (it requires a physical file indicating the location of the statistical models).

Unrelated to that, I’ve also found that the Catalan text samples it includes are incorrect, so the same may be true for other languages. I guess it’d make sense to work on a new (and completely Unicode) language samples collection. I’ve thought of using something like the Universal Declaration of Human Rights since this way all languages can have the same text, but being more of a legal thing it may be biased by some words being too repetitive.

Looking for other alternatives to the TextCat library I’ve only found the following:

  • TextCat (same name, different code): PHP licensed, so incompatible with GPL projects.
  • Mguesser (part of mnogosearch-mysql): it’s a standalone executable and not a library.
  • SpamAssassin’s TextCat.pm: also a standalone executable, this time written in Perl. Apparently they were using a fork of TextCat (the original library, not the PHP licensed one) before that.

So it looks like I’ll have to start by getting a good collection of text samples I can use to generate the statistical data. Then I have several options on how to actually use it. As I see it, those are my possibilities:

  1. Fixing libtextcat’s packaging and just using that.
  2. Taking it over as new upstream maintainer. Not my preferred option as I don’t really feel like maintaining a C library at this point.
  3. Trying to convince the maintainer of the new TextCat (whose last commit was in January this year and which has a saner API) to re-license it in a GPL-compatible way, packaging that and seeing how that one works (haven’t tried it out yet).
  4. Writing my own implementation in Python, maybe based upon this example or TextCat.pm.

Any other ideas, pointers to some library I may have missed or offers to collaborate are very welcome. Please also note that my intention in writing this post is not only to rant about there being no well-maintained, ready-to-use library available, but especially to raise awareness of the topic of language identification. I’d love to see this feature all around the desktop, just like (and in combination with) spell-checking, which is already omnipresent.

The Red Hat Way

I’ve just used one of the GUADEC USB sticks for the first time and found that, in addition to several PDF brochures and a bunch of wallpapers, it includes a Red Hat commercial. Excellent as always.

Command-line script to edit PDF file meta-data

I’ve just written a little wrapper around pdftk to simplify the modification of PDF file meta-data.

It extracts the existing meta-data, opens it in your favourite editor, writes it back to the original PDF and finally removes the temporary files it generated.

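#! /bin/sh

# Edit the meta-data of a PDF file in $EDITOR (wrapper around pdftk).

if [ ! -f "$1" ]
then
	echo "Usage: $0 FILE.pdf"
	exit 1
fi

# Dump the current meta-data into a temporary file.
FILE=$(tempfile --prefix=pdf-metadata-)
pdftk "$1" dump_data output "$FILE"
creation_time=$(stat -c %Y "$FILE")
# Fall back to vi if $EDITOR is unset.
${EDITOR:-vi} "$FILE"

# Compare modification times (one-second resolution) to detect changes.
if [ "$(stat -c %Y "$FILE")" -le "$creation_time" ]
then
	echo "Information not modified, aborting."
	rm -f "$FILE"
	exit 2
fi

output_file=$(tempfile --prefix=pdf-modified-)
# Use { } rather than ( ) so that "exit" ends the script, not a subshell.
pdftk "$1" update_info "$FILE" output "$output_file" || {
	rm -f "$FILE" "$output_file"
	exit 3
}

mv -f "$output_file" "$1"
rm -f "$FILE"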

I thought I’d share it in case anyone else finds it useful. Do whatever you want with it; if you want to use it somewhere serious, the ISC License is fine.

Zeitgeist 0.5.1 released!

On behalf of the Zeitgeist Project team, I am pleased to announce the immediate availability of Zeitgeist 0.5.1.

What is Zeitgeist?

Zeitgeist is an event-logging framework for desktop and mobile devices. Applications can push events into the log, and anyone can query the log via the rich query API. The logged events are semantically categorized and can come from any sort of activity, such as file usage, communications, browsing history, etc.

The Zeitgeist engine is a user-level service and does not provide a GUI. It is intended to support dedicated journalling applications and deep integration with other desktop components.

Where?

Downloads: https://launchpad.net/zeitgeist/+download (zeitgeist-0.5.1.tar.gz)

About Zeitgeist: http://zeitgeist-project.com
Wiki: http://live.gnome.org/Zeitgeist

News since 0.5.0

2010-09-09: Zeitgeist 0.5.1 "Spongebob is not funny"

Engine:

- Don't use the return value of Extension.post_insert_event() when
dispatching the post insert hooks. The post_insert_event() method
has no return value.
- Initialize ZeitgeistEngine after RemoteInterface, so that --replace
does its job before the main engine and extensions start (LP: #614315).
- Added support for queries on the Subject.Storage field of an Event
(LP: #580364).
- Some optimizations in the find_events() method. Also the profiling
data is much more useful.

Python API:

- Check arguments of Event.new_for_values() and Subject.new_for_values()
(LP: #580372).
- Redefined the result of TimeRange.always(), UNIX timestamp "0" is now
the left corner of the interval (LP: #614295).
- Added a new helper module called zeitgeist.mimetypes which basically
provides two functions (LP: #586524):
* get_interpretation_for_mimetype(), which tries to get a suitable
interpretation for a given mime-type.
* get_manifestation_for_uri(), which tries to lookup a manifestation
for the given URI.
- The DataSource model now provides easy access to the information it
holds through properties.

Overall:

- The tool to build our ontology now supports rdflib2 and rdflib3
(LP: #626224).
- Added "make check" and "make doc" commands to the rootlevel Makefile
(LP: #628661)
- Translation updates.
- Updated test suite.
- Documentation updates.

Book Review: The Heart Mender

Some time ago I found out about Book Sneeze, a service by Thomas Nelson offering free books in exchange for reviewing them on your blog (the reviews aren’t required to be positive).

I decided to give it a try, and a couple of days after signing up my account got approved. The website shows a changing selection of a bit more than a handful of books available at any time. When I visited the page, most of what was available seemed to be about religion or self-help, which I’m not really interested in, so I decided on the book which looked the most like a novel. It happened to be Andy Andrews’ «The Heart Mender» (formerly published as Island of Saints).

After the usual delay (snail post is painfully slow here) I received the expected package, and they were even so nice as to include a second copy to give away. I’ll see what I do with it :). Anyway, here goes the review!

When I started reading the book, it surprised me in two ways: first of all, it didn’t begin with the actual story I had expected, but with a narration of how the author came to write the book, after finding a box containing WWII artefacts and a photo buried in his backyard. Even more surprising, though, I was captivated by it after reading just the first few pages, something very few books achieve so quickly.

After a few chapters exploring the origin of the artefacts, the author begins with the actual story of how Josef Landerman, Lieutenant of a German U-Boat sent to the Gulf of Mexico, ends up at the mercy of Helen Mason. Helen, a widow who hasn’t been able to overcome her ill fortune, is thus confronted with the choice of saving the life of someone wearing the same uniform that killed her husband years ago. Even if Helen decides to help him, what will await him so far from home?

While it’s sold as a romance focusing on the topic of forgiveness, the book doesn’t lack its fair share of adventure and it is a pleasurable read. I can only recommend it.