Re: AbiSaurus v0.05


Subject: Re: AbiSaurus v0.05
From: rob.campbell@att.net
Date: Mon Jul 09 2001 - 07:50:24 CDT


I entered "evil" in the online version and received 236
responses. One of the challenges will be filtering this
data so that it is more manageable (for the programmers)
and more useful (for the end user).

I tested MS Word's thesaurus and saw one option (now
obvious) that you would want to implement even in the
command line version. The list can be filtered based on
the word's part of speech (e.g., noun or adjective).
This would be easy to implement, if the Project
Gutenberg database already includes part of speech as a
field.
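
A minimal sketch of such a filter, assuming each synonym is stored
with a part-of-speech tag. The (word, tag) pairs, the function name
filter_by_pos, and the sample data are hypothetical; nothing here is
taken from the AbiSaurus code or from the Moby data files.

```python
# Hypothetical sketch: keep only the synonyms with a given part of speech.
# The tagged list is illustrative sample data, not AbiSaurus output.

def filter_by_pos(synonyms, pos):
    """Return the words whose part-of-speech tag equals `pos`."""
    return [word for word, tag in synonyms if tag == pos]


evil_synonyms = [
    ("wickedness", "noun"),
    ("wicked", "adjective"),
    ("foul", "adjective"),
]

# Show only the adjectives for "evil".
print(filter_by_pos(evil_synonyms, "adjective"))
```

The same call would work for the command line version if the part of
speech were passed as an optional second argument.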

MS Word also lets the list be filtered based on a
particular definition of the word. For example, the
options for "evil" are:

wickedness (noun)
wicked (adjective)
foul (adjective)

This wouldn't be any more difficult than filtering on
parts of speech, except that it's less likely that the
data to do so already exists.
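
If that data did exist, grouping by definition could look like the
sketch below. It assumes every synonym carries a sense label; the
labels reuse the three options listed above, while the synonyms
themselves and the function name group_by_sense are made up for
illustration.

```python
# Hypothetical sketch: group synonyms by the sense they belong to.
# The (synonym, sense) pairs are invented sample data.
from collections import defaultdict


def group_by_sense(entries):
    """Map each sense label to the list of synonyms tagged with it."""
    groups = defaultdict(list)
    for synonym, sense in entries:
        groups[sense].append(synonym)
    return dict(groups)


sample = [
    ("sin", "wickedness (noun)"),
    ("immoral", "wicked (adjective)"),
    ("vile", "foul (adjective)"),
    ("rotten", "foul (adjective)"),
]

# Print one line per sense, with its synonyms.
for sense, words in group_by_sense(sample).items():
    print(sense + ": " + ", ".join(words))
```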

Also, for all its various meanings, MS Word presented
only 20 synonyms for "evil", so some other kind of
filtering is taking place automatically. I think that
MS Word's list is too limited, but AbiSaurus' is
daunting without additional guidance. When I
entered "copious", a less common word, AbiSaurus
returned 65 synonyms. These terms are less common and
more likely to have subtle variations of meaning that I
don't know.

It looks like very good work, and it's very fast. I've
bookmarked the site. Thanks.

Rob Campbell
rob.campbell@att.net
> Hello all,
>
> I have recently completed version 0.05 of AbiSaurus, a thesaurus library
> that (hopefully) AbiWord can use. The thesaurus is currently English-only,
> and since I don't really speak any other languages I'd be hard pressed to
> add more.
>
> The library is, for the interim, available at:
> http://aiken.clan11.com/abisaurus/
>
> This is the first version of the thesaurus that I think is generally
> "acceptable" in terms of speed and disk space requirements. Decompressed,
> the thesaurus data fits into about 2.4 megs of space, and the average search
> time on my K7-650 is under .2 seconds. In addition, the header file to
> actually use the library is nice and light, and doesn't use anything more
> than regular old character pointers, so I figure it should be easy to use
> with UT_String or any other sort of string class...
>
> I have already written a command-line interface to it where you do:
> ./AbiSaurus [word]
>
> And another, interactive thesaurus program that prompts for words and spits
> out synonyms indefinitely. I have provided a simple PHP interface to the
> command-line interface so that you can search the thesaurus "online". The
> URL for this is:
> http://aiken.clan11.com/abisaurus/online/
>
> The data for the thesaurus comes from Project Gutenberg (linked to from the
> AbiSaurus site), and is released into the public domain, so I figure it's
> acceptable to use it. The code, of course, is GPL.
>
> I will be working in the next few days to put together an explanation of the
> exact steps for converting the Moby thesaurus (from Gutenberg) into the
> data files used, and providing source code to do this.
>
> I will also be working in the next few weeks to try to add part-of-speech
> information to the thesaurus so that it can group results more intelligently
> and return better data. (For example, right now if you type 'lead' into
> the thesaurus, it returns over 300 synonyms because it is looking for 'lead'
> as in the metal, 'lead' as in leadership, 'lead' as in the lead role, etc.)
> I have already found a public-domain parts-of-speech list, so I'll try to
> start working on that as soon as I have some more time.
>
> I would like AbiSaurus to be associated with the AbiWord project, but as the
> online interface demonstrates there's no real reason it must be specific to
> AbiWord. I've heard that there were cross-development efforts going on
> between the KWord and AbiWord developers, so maybe they'd like to use
> AbiSaurus as well. In any case, this is a future discussion... I could put
> it up on sourceforge or something, or it can become part of AbiWord proper
> if you'd prefer -- no reason it needs to stay on my machine.
>
> Things that the thesaurus still needs, and help that I will definitely need
> from others:
>
> 1. Portability. Right now the thesaurus compiles and runs on my Linux box,
> but I do not know how portable it is or anything. I am not familiar with
> autoconf/autogen yet and don't really know how to use DLLs in Windows.
>
> 2. Integration with AbiWord. I am not very familiar with AbiWord's source
> code yet and still haven't managed to get AbiWord to compile on my Mandrake
> box despite Dom and Martin's best efforts. So anyway, I'll be hard pressed
> to integrate this with AbiWord, much less to integrate it correctly.
> Someone familiar with the code base will certainly do a much better (and
> faster) job. =)
>
> In any case, this has been fun, and I hope you like it!
>
> Jared


