AbiSaurus v0.05


Subject: AbiSaurus v0.05
From: Jared Davis (aiken@clan11.com)
Date: Sun Jul 08 2001 - 23:35:23 CDT


Hello all,

I have recently completed version 0.05 of AbiSaurus, a thesaurus library
that (hopefully) AbiWord can use. The thesaurus is currently English-only,
and since I don't really speak any other languages I'd be hard pressed to
add more.

The library is, for the interim, available at:
    http://aiken.clan11.com/abisaurus/

This is the first version of the thesaurus that I think is generally
"acceptable" in terms of speed and disk space requirements. Decompressed,
the thesaurus data fits into about 2.4 MB of space, and the average search
time on my K7-650 is under 0.2 seconds. In addition, the header file to
actually use the library is nice and light, and doesn't use anything more
than regular old character pointers, so I figure it should be easy to use
with UT_String or any other sort of string class...
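
To illustrate why a plain character-pointer interface is easy to adapt, here
is a minimal sketch of wrapping such a function for a string class. The
function demo_lookup, its signature, and its tiny built-in word list are all
hypothetical stand-ins, not the actual AbiSaurus header; std::string plays
the role that UT_String or another string class would.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// HYPOTHETICAL C-style entry point standing in for whatever the real
// library declares. It writes up to `max` synonym pointers into `out`
// and returns how many were written.
static std::size_t demo_lookup(const char* word, const char** out,
                               std::size_t max) {
    assert(word != nullptr && out != nullptr);
    // Made-up data, for illustration only.
    static const char* const happy[] = {"glad", "cheerful", "content"};
    if (std::strcmp(word, "happy") != 0) return 0;
    std::size_t n = 0;
    for (const char* s : happy) {
        if (n == max) break;
        out[n++] = s;
    }
    return n;
}

// Thin wrapper: the string class only has to hand over a const char*
// going in, and be constructible from a const char* coming out.
std::vector<std::string> synonyms_for(const std::string& word) {
    const char* buf[16];
    const std::size_t n = demo_lookup(word.c_str(), buf, 16);
    return std::vector<std::string>(buf, buf + n);
}
```

Any string class that can produce a const char* and be built from one could
be dropped in the same way.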

I have already written a command-line interface to it, which you run as:
    ./AbiSaurus [word]

There is also an interactive thesaurus program that prompts for words and spits
out synonyms indefinitely. I have provided a simple PHP interface to the
command-line interface so that you can search the thesaurus "online". The
URL for this is:
    http://aiken.clan11.com/abisaurus/online/

The data for the thesaurus comes from Project Gutenberg (linked to from the
AbiSaurus site), and is released into the public domain, so I figure it's
acceptable to use it. The code, of course, is GPL.

I will be working in the next few days to put together an explanation of the
exact steps for converting the Moby thesaurus (from Gutenberg) into the
data files used, and to provide source code for doing so.

I will also be working in the next few weeks to try to add part-of-speech
information to the thesaurus so that it can group results more intelligently
and return better data. (For example, right now if you type 'lead' into
the thesaurus, it returns over 300 synonyms because it is looking for 'lead'
as in the metal, 'lead' as in leadership, 'lead' as in the lead role, etc.)
I have already found a public-domain part-of-speech list, so I'll try to
start working on that as soon as I have some more time.
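
As a rough sketch of the grouping idea, assuming each synonym has already
been tagged with a part of speech, the results could be bucketed like this.
The Pos enum, the Tagged pair, and group_by_pos are hypothetical names made
up for illustration; nothing here reflects the actual AbiSaurus data format.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag set; a real part-of-speech list would have more.
enum class Pos { Noun, Verb, Adjective };

// One synonym together with its (assumed) part-of-speech tag.
using Tagged = std::pair<std::string, Pos>;

// Bucket synonyms by part of speech, keeping their original order
// within each bucket.
std::map<Pos, std::vector<std::string>>
group_by_pos(const std::vector<Tagged>& synonyms) {
    std::map<Pos, std::vector<std::string>> groups;
    for (const Tagged& entry : synonyms) {
        groups[entry.second].push_back(entry.first);
    }
    return groups;
}
```

With groups like these, a caller could show only the verb senses of 'lead'
instead of one flat list of several hundred words.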

I would like AbiSaurus to be associated with the AbiWord project, but as the
online interface demonstrates there's no real reason it must be specific to
AbiWord. I've heard that there were cross-development efforts going on
between the KWord and AbiWord developers, so maybe they'd like to use
AbiSaurus as well. In any case, this is a future discussion... I could put
it up on SourceForge or something, or it can become part of AbiWord proper
if you'd prefer -- no reason it needs to stay on my machine.

Things that the thesaurus still needs, and help that I will definitely need
from others:

1. Portability. Right now the thesaurus compiles and runs on my Linux box,
but I do not know how portable it is. I am not familiar with
autoconf/autogen yet and don't really know how to use DLLs on Windows.

2. Integration with AbiWord. I am not very familiar with AbiWord's source
code yet and still haven't managed to get AbiWord to compile on my Mandrake
box despite Dom and Martin's best efforts. So anyway, I'll be hard pressed
to integrate this with AbiWord, much less to integrate it correctly.
Someone familiar with the code base will certainly do a much better (and
faster) job. =)

In any case, this has been fun, and I hope you like it!

    Jared


