My dot files (Tips and Tricks for Bash & co.)

.bashrc

# If not running interactively, don't do anything
[ -z "$PS1" ] && return

# don't put consecutive duplicate lines or lines starting with a space in the history.
export HISTCONTROL=ignoreboth

# make the history longer
HISTFILESIZE=5000

# append to the history file, don't overwrite it
shopt -s histappend

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

case "$TERM" in
xterm*|rxvt*|screen)
    #PS1='\[\e[1;34m\][\u, \W]\$ \[\e[m\]'
    # http://live.gnome.org/Git/Tips, http://tldp.org/HOWTO/Bash-Prompt-HOWTO/x329.html
    PS1='\[\e[1;34m\][\u, \W$(__git_ps1 "(\[\e[1;30m\]%s\[\e[m\]\[\e[1;34m\])")]\$ \[\e[m\]'
    ;;
*)
    ;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
    eval "$(dircolors -b)"
    alias ls='ls --color=auto'
fi

shopt -s cdspell
shopt -s cmdhist

# enable programmable completion features
if [ -f /etc/bash_dyncompletion ]; then
    . /etc/bash_dyncompletion
elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
fi

[ -f ~/.bash_aliases ] && . ~/.bash_aliases

export PATH=$PATH:/sbin:/usr/sbin:/home/rainct/bin:/home/rainct/.local/bin
export DEBFULLNAME="Siegfried-Angel Gevatter Pujals"
export DEBSIGN_KEYID="363DEAE3"
export DEB_MAINTAINER_MODE=1
export PBUILDFOLDER="/home/rainct/pbuilder"
export QUILT_PATCHES=debian/patches
export GREP_OPTIONS='--color=auto --exclude-dir=\.svn'
export EDITOR=nano

# This also needs an entry in ~/.devscripts:
# DEBUILD_PRESERVE_ENVVARS=DPKG_GENSYMBOLS_CHECK_LEVEL
export DPKG_GENSYMBOLS_CHECK_LEVEL=4


Language Identification and its state in Free Software

Working on a new feature for eSpeak GUI I started looking into language identification. Forcing users to manually choose the text’s language is a botheration, so trying to guess it by checking which system dictionary contains the most words from the text or some other method would surely be beneficial.

After a quick search I learned that it’s much easier than this: it’s possible to reliably determine the language based on statistical n-gram information. Ignoring the fact that now I officially hate Firefox, Chromium, OpenOffice.org and everyone else out there for not implementing this and making me spend the day changing the spell-checker’s language, I was left with the choice of how to use this in eSpeak GUI.
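
For a rough idea of how the n-gram approach works, here is a minimal Python sketch of rank-order profile matching in the style of Cavnar and Trenkle's "out-of-place" measure. It is only an illustration under stated assumptions, not the code of any library mentioned in this post: the function names, the 300-entry profile size and the three one-sentence training samples are all made up for the example, and real models need far more text per language.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; also used as the penalty for unknown n-grams


def ngram_profile(text, max_n=3, top=PROFILE_SIZE):
    """Map the most frequent character n-grams of `text` to their rank (0 = most frequent)."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top])}


def out_of_place(doc_profile, lang_profile, missing_penalty=PROFILE_SIZE):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += missing_penalty
    return distance


def guess_language(text, profiles):
    """Return the key of the language profile closest to the text."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Toy training samples (hypothetical, far too short for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "is sleeping in the house while it is raining",
    "ca": "el gat dorm a la casa mentre plou i la guineu salta per "
          "sobre del gos que és molt mandrós",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze schläft im haus während es regnet",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(guess_language("the dog is in the house", PROFILES))
```

Every n-gram of the test sentence also occurs in the English sample, so English collects no missing-n-gram penalties while the other two profiles do. With realistic amounts of training text the ranking of shared n-grams matters much more than in this toy case.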

The first option I found was TextCat, which is also the only library I’ve found to be packaged for Debian. However, ignoring the fact that upstream isn’t maintaining it any more (such a library shouldn’t need too much maintenance, after all), the package declares incorrect dependencies (bug filed a month ago, no response yet) and the API is also pretty crappy (it requires a physical file indicating the location of the statistical models).

Unrelated to that, I’ve also found that the Catalan text samples it includes are incorrect, so the same may be true for other languages. I guess it’d make sense to work on a new (and completely Unicode) language samples collection. I’ve thought of using something like the Universal Declaration of Human Rights since this way all languages can have the same text, but being more of a legal thing it may be biased by some words being too repetitive.

Looking for other alternatives to the TextCat library I’ve only found the following:

  • TextCat (same name, different code): PHP licensed, so incompatible with GPL projects.
  • Mguesser (part of mnogosearch-mysql): it’s a standalone executable and not a library.
  • SpamAssassin’s TextCat.pm: also a standalone executable, this time written in Perl. Apparently they were using a fork of TextCat (the original library, not the PHP licensed one) before that.

So it looks like I’ll have to start by getting a good collection of text samples I can use to generate the statistical data. Then I have several options on how to actually use it. As I see it, these are my possibilities:

  1. Fixing libtextcat’s packaging and just using that.
  2. Taking it over as new upstream maintainer. Not my preferred option as I don’t really feel like maintaining a C library at this point.
  3. Trying to convince the maintainer of the new TextCat (with its last commit in January of this year and a saner API) to re-license it in a GPL-compatible way, packaging that and seeing how it works (I haven’t tried it out yet).
  4. Writing my own implementation in Python, maybe based upon this example or TextCat.pm.

Any other ideas, pointers to some library I may have missed or offers to collaborate are very welcome. Please also note that my intention in writing this post is not only to rant about there being no well-maintained, ready-to-use library available, but especially to raise awareness of the topic of language identification. I’d love to see this feature all around the desktop, just like (and in combination with) spell-checking, which is already omnipresent.

The Red Hat Way

I’ve just used one of the GUADEC USB sticks for the first time and found that, in addition to several PDF brochures and a bunch of wallpapers, it includes a Red Hat commercial. Excellent as always.

Command-line script to edit PDF file meta-data

I’ve just written a little wrapper around pdftk to simplify the modification of PDF file meta-data.

It extracts the existing meta-data, opens it in your favourite editor, writes it back to the original PDF and finally removes the temporary files it generated.

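#! /bin/sh

if [ ! -f "$1" ]
then
	echo "Usage: $0 <file.pdf>" >&2
	exit 1
fi

# Dump the existing meta-data into a temporary file and open it in the editor
FILE=$(mktemp --tmpdir pdf-metadata-XXXXXX) || exit 3
pdftk "$1" dump_data output "$FILE"
creation_time=$(stat -c %Y "$FILE")
${EDITOR:-vi} "$FILE"

# Compare modification times (in seconds) to see whether the file was saved
if [ "$(stat -c %Y "$FILE")" -le "$creation_time" ]
then
	echo "Information not modified, aborting."
	rm -f "$FILE"
	exit 2
fi

# Write the edited meta-data into a copy and only then replace the original
output_file=$(mktemp --tmpdir pdf-modified-XXXXXX) || exit 3
if ! pdftk "$1" update_info "$FILE" output "$output_file"
then
	rm -f "$FILE" "$output_file"
	exit 3
fi

mv -f "$output_file" "$1"
rm -f "$FILE"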

I thought I’d share it in case anyone else finds it useful. Do whatever you want with it; if you want to use it somewhere serious, the ISC License is fine.

Zeitgeist 0.5.1 released!

On behalf of the Zeitgeist Project team, I am pleased to announce the immediate availability of Zeitgeist 0.5.1.

What is Zeitgeist?

Zeitgeist is an event-logging framework for desktop and mobile devices. Applications can push events into the log, and anyone can query the log via the rich query API. The logged events are semantically categorized and can come from any sort of activity, such as file usage, communications, browsing history, etc.

The Zeitgeist engine is a user-level service and does not provide a GUI. It is intended to support dedicated journalling applications and deep integration with other desktop components.

Where?

Downloads: https://launchpad.net/zeitgeist/+download (zeitgeist-0.5.1.tar.gz)

About Zeitgeist: http://zeitgeist-project.com
Wiki: http://live.gnome.org/Zeitgeist

News since 0.5.0

2010-09-09: Zeitgeist 0.5.1 "Spongebob is not funny"

Engine:

- Don't use the return value of Extension.post_insert_event() when
dispatching the post insert hooks. The post_insert_event() method
has no return value.
- Initialize ZeitgeistEngine after RemoteInterface, so that --replace
does its job before the main engine and extensions start (LP: #614315).
- Added support for queries on the Subject.Storage field of an Event
(LP: #580364).
- Some optimizations in the find_events() method. Also the profiling
data is much more useful.

Python API:

- Check arguments of Event.new_for_values() and Subject.new_for_values()
(LP: #580372).
- Redefined the result of TimeRange.always(): UNIX timestamp "0" is now
the left bound of the interval (LP: #614295).
- Added a new helper module called zeitgeist.mimetypes which basically
provides two functions (LP: #586524):
  * get_interpretation_for_mimetype(), which tries to get a suitable
    interpretation for a given mime-type.
  * get_manifestation_for_uri(), which tries to look up a manifestation
    for the given URI.
- The DataSource model now provides easy access to the information it
holds through properties.

Overall:

- The tool to build our ontology now supports rdflib2 and rdflib3
(LP: #626224).
- Added "make check" and "make doc" commands to the root-level Makefile
(LP: #628661).
- Translation updates.
- Updated test suite.
- Documentation updates.

 