Re: Integrating OpenMedSpel word list with Abiword's spell checker

From: Dominic Lachowicz <domlachowicz_at_gmail.com>
Date: Tue Apr 30 2013 - 18:50:11 CEST

What I mean by "composable" is that you can create an "English OR
Spanish OR German OR Hindi OR Mandarin" composite dict monstrosity by
issuing the following commands (excuse my Lisp-ish love of parens):

Dict final = compose(English, compose(Spanish, compose(German,
compose(Hindi, Mandarin))));
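
For illustration only, here is a minimal C sketch of how that nesting could work. The Dict struct, make_leaf() and compose() below are hypothetical stand-ins, not Enchant's real types or API, and each leaf dictionary is assumed to be a plain NULL-terminated word list. The point is that compose() returns the same kind of object it accepts, so composites can be nested.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dictionary interface: one "is this word correct?" callback. */
typedef struct Dict Dict;
struct Dict {
    bool (*check)(const Dict *self, const char *word);
    const char *const *words;   /* leaf only: NULL-terminated word list */
    const Dict *lhs, *rhs;      /* composite only: the two operands */
};

/* Leaf lookup: linear scan of the word list. */
static bool leaf_check(const Dict *self, const char *word)
{
    for (const char *const *w = self->words; *w != NULL; ++w)
        if (strcmp(*w, word) == 0)
            return true;
    return false;
}

/* Composite lookup: the word is correct if either operand accepts it. */
static bool or_check(const Dict *self, const char *word)
{
    return self->lhs->check(self->lhs, word) ||
           self->rhs->check(self->rhs, word);
}

/* Wrap a word list as a dictionary. */
Dict make_leaf(const char *const *words)
{
    Dict d = { leaf_check, words, NULL, NULL };
    return d;
}

/* compose() returns a Dict, so its result can be fed back into compose(). */
Dict compose(const Dict *lhs, const Dict *rhs)
{
    Dict d = { or_check, NULL, lhs, rhs };
    return d;
}
```

Given leaf dictionaries english, spanish and german, "Dict es_de = compose(&spanish, &german); Dict all = compose(&english, &es_de);" yields a dictionary whose check callback accepts a word from any of the three. Because the sketch stores pointers, the operands must outlive the composite.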

On Tue, Apr 30, 2013 at 12:46 PM, Dominic Lachowicz
<domlachowicz@gmail.com> wrote:
> I'm simplifying a bit, but Enchant basically exposes 2 methods from
> its dictionary-level API:
>
> 1) "Is this word correct?"
> 2) "Give me some suggestions for this word."
>
> In the case of mashing up (say) English plus OpenMedSpel, imagine that
> you had 2 dictionaries A & B (respectively) wrapped up into a new
> Enchant "CompositeDict" object that implemented Enchant's plugin API.
> The implementation of #1 becomes:
>
> return is_word_correct(A, word) || is_word_correct(B, word);
>
> A naive first implementation of #2 could become:
> return set_union(suggest_for_word(A, word), suggest_for_word(B, word));
>
> Where "set_union()" would contain the set of all unique words
> suggested for @word. If you're worried about the order that the
> suggestions come back in (so that better suggestions bubble to the top
> of your list), you could have a convention that you always return
> suggestions from the Left Hand Side (i.e. dict A) first. Again, it's
> naive, but it probably handles most of the use cases "well enough".
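
A rough C sketch of the suggestion merge described above, assuming the two suggestion lists have already been fetched from dictionaries A and B. The names suggest_union() and append_unique() are invented for this example and are not part of Enchant. Suggestions from A come first, and anything B repeats is dropped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Append every string in src[0..n_src) to out unless it is already present.
 * Returns the new entry count; never writes past out[cap - 1]. */
static size_t append_unique(const char **out, size_t n_out, size_t cap,
                            const char *const *src, size_t n_src)
{
    for (size_t i = 0; i < n_src && n_out < cap; ++i) {
        bool seen = false;
        for (size_t j = 0; j < n_out; ++j) {
            if (strcmp(out[j], src[i]) == 0) {
                seen = true;
                break;
            }
        }
        if (!seen)
            out[n_out++] = src[i];
    }
    return n_out;
}

/* Ordered set union: all of A's suggestions first, then B's new ones. */
size_t suggest_union(const char *const *from_a, size_t n_a,
                     const char *const *from_b, size_t n_b,
                     const char **out, size_t cap)
{
    size_t n = append_unique(out, 0, cap, from_a, n_a);
    return append_unique(out, n, cap, from_b, n_b);
}
```

The duplicate check is quadratic, which should be acceptable for the handful of suggestions a spell checker typically returns. The output array only borrows the strings, so the input lists must stay alive while it is in use.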
>
> The nice thing about implementing a CompositeDict type object as an
> EnchantDict object is that the functions you implement become
> composable. Eg: you could create an "English OR Spanish OR German OR
> Hindi OR Mandarin" dict using the same primitives and loose set of
> conventions.
>
> Is it too simplistic? Maybe. But then again, You (probably) Aren't
> Going to Need It. Keep It Simple, Silly. Only overthink things or add
> complexity to them at the latest possible moment, when you're sure
> that it's not good enough. And you won't know if it was good enough or
> not until you've gotten a demo in front of some users.
>
> Basically, I think that this is about 100 lines of C code inside of
> Enchant proper (inclusive of whitespace, comments, curly braces, and
> copyright headers), plus some smarts in AbiWord's "choose a language"
> dialog box.
>
> On Tue, Apr 30, 2013 at 12:29 PM, Vidh <vidhu2366@gmail.com> wrote:
>> Dom,
>>
>> thanks for your time.
>>
>> First of all, I am happy that for most of the discussion, we both are
>> in sync with what could be done under this feature dev.
>>
>> As I had already guessed, the "AND" operation will be for
>> experimentation/research use cases. Totally agreed on the simplicity
>> of implementing that. I was just curious about "AND" and hence
>> wanted a clarification. Now I am convinced on that part! :)
>>
>> Coming to implementation of OpenMedSpel dictionary, I had already
>> tried it as Aspell dictionary in this short hack experiment:
>> http://vidhoon.wordpress.com/2013/04/02/integrating-openmedspel-wordlist-in-abiword-spell-checker/
>> I have yet to explore it in other forms, and yes, that can be done to
>> make a well-informed decision overall.
>>
>> I needed one more clarification. Again quoting from your earlier mail,
>> you had stated that "intersect the words in a medical dictionary with
>> one in an (eg.)
>> English or Dutch dictionary". I did some analysis and found the
>> enchant broker APIs to support only dictionary level
>> manipulations. That is, we can find if a word is in dict1 or dict2. But
>> we cannot generate a "dict" of English words + OpenMedSpel words using
>> Enchant API and then use "dict" for spell checking.
>>
>> So the application operations would be at dict level and not at words
>> level is what I deduce. Please correct me if I am wrong.
>>
>> On Tue, Apr 30, 2013 at 9:49 PM, Dominic Lachowicz
>> <domlachowicz@gmail.com> wrote:
>>> Hi Vidh,
>>>
>>> Yes, that's what I was thinking, but by no means do I have a monopoly
>>> on good ideas.
>>>
>>> OR is useful for use cases like "This word is either in English or
>>> Spanish". Or in the case of a medical dictionary, "This word is either
>>> English or it's a medical term of art". "OR" will easily cover in
>>> excess of 99.999% of your use cases.
>>>
>>> Other boolean operators besides "OR" may only be useful in a strictly
>>> academic/masturbatory sense. Eg: "I'd like to know what words are
>>> common across these 2 languages. If I had an 'AND' operator, I could
>>> do that." I don't think that the incremental work to support AND vs OR
>>> is all that huge - my sense is that it's on the order of a few lines
>>> of code. Now, I'm not saying that you should implement AND and OR. I'd
>>> just want the design to be flexible enough to support it easily, if it
>>> becomes necessary.
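
To make the "few lines of code" estimate concrete, here is a small C sketch in which the boolean operator is just a parameter. ComposeOp, in_list() and composite_check() are hypothetical names, and the dictionaries are assumed to be NULL-terminated word lists rather than real Enchant dictionaries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { COMPOSE_OR, COMPOSE_AND } ComposeOp;

/* Toy dictionary lookup over a NULL-terminated word list. */
static bool in_list(const char *const *words, const char *word)
{
    for (; *words != NULL; ++words)
        if (strcmp(*words, word) == 0)
            return true;
    return false;
}

/* The operator only changes how the two per-dictionary answers combine. */
bool composite_check(ComposeOp op, const char *const *dict_a,
                     const char *const *dict_b, const char *word)
{
    bool in_a = in_list(dict_a, word);
    bool in_b = in_list(dict_b, word);
    return op == COMPOSE_AND ? (in_a && in_b) : (in_a || in_b);
}
```

With an English list and a Spanish list that both contain "no", COMPOSE_AND accepts only the shared words, while COMPOSE_OR accepts anything found in either list.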
>>>
>>> How you'd do the UI for selecting those multiple languages is up to
>>> you. You could allow users to combine languages "on the fly". You
>>> could pre-populate the list with an "English Medical Dictionary"
>>> option. If this is for a GSoC type of project, I'd say that you have a
>>> good deal of latitude to experiment. Heck, try it both ways, and put
>>> it in front of "focus groups". See which they prefer.
>>>
>>> Think about how you'd want to implement the OpenMedSpel dictionary. We
>>> could package it up as a Hunspell dictionary, an Aspell dictionary, or
>>> pass it to Enchant as a custom word list file. All of the approaches
>>> have their advantages and drawbacks.
>>>
>>> Cheers,
>>> Dom
>>>
>>> On Tue, Apr 30, 2013 at 11:04 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>> Thanks Dom.
>>>>
>>>> OK I have developed ideas for my proposal based on the outline you
>>>> gave in previous mail on Apr 9th.
>>>>
>>>> The idea is "to combine several dictionaries to create a composite
>>>> dictionary and use it for spell checking multi lingual documents".
>>>>
>>>> I need some clarifications from your previous mail:
>>>> When will operations other than OR be needed between multiple
>>>> dictionaries? Let us assume a multilingual document - English, French
>>>> and German. In this case, we need to make sure that each word typed
>>>> by the user belongs to either English or French or German, right?
>>>> So if my understanding is right, why would we need the AND operation
>>>> which you had mentioned earlier?
>>>>
>>>> Next coming to the usage of this feature. I am planning to make the
>>>> user select the languages he would be using in the document (in the
>>>> case of multi lingual docs) in Tools->Set Language(s) option. Then
>>>> based on that we could create a CompositeDict instance and use it for
>>>> spell checking the document.
>>>>
>>>> Is this the kind of usage you have in mind? I just want to pool in
>>>> ideas and refine to select the most practical and useful feature set.
>>>>
>>>> In the case of OpenMedSpel, we would by default have a CompositeDict
>>>> for English-US, and it would consist of normal English-US words (.dict
>>>> file) and other domain-specific lists like OpenMedSpel. So, this
>>>> problem becomes a subset of the proposed idea.
>>>>
>>>> Let me know your thoughts.
>>>>
>>>> Martin,
>>>>
>>>> Feel free to chip in your thoughts also! :)
>>>>
>>>> thanks!
>>>>
>>>> On Tue, Apr 30, 2013 at 6:38 AM, Dominic Lachowicz
>>>> <domlachowicz@gmail.com> wrote:
>>>>> Hi Vidh,
>>>>>
>>>>> Sorry - I'm not often available on IRC or Google Chat. I'm happy to
>>>>> continue the discussion over email, though.
>>>>>
>>>>> Thanks,
>>>>> Dom
>>>>>
>>>>> On Mon, Apr 29, 2013 at 1:31 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>> Hi Martin & Dom,
>>>>>>
>>>>>> Are you guys available sometime on IRC to discuss this?
>>>>>> thanks!
>>>>>>
>>>>>> On Tue, Apr 9, 2013 at 8:10 PM, Dominic Lachowicz
>>>>>> <domlachowicz@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> My thought was to create a new enchant method called "Dict
>>>>>>> enchant_dict_compose(Dict 1, Dict 2, ...)". What it would do is
>>>>>>> intersect the words in a medical dictionary with one in an (eg.)
>>>>>>> English or Dutch dictionary. We'd need to create an internal
>>>>>>> CompositeDict object which would perform some boolean operation on the
>>>>>>> words - eg: OR or AND. You could use this as a building block to
>>>>>>> support mixed-language texts (eg: English + Spanish). "Medical" here
>>>>>>> is really just a domain-specific language.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dom
>>>>>>>
>>>>>>> On Tue, Apr 9, 2013 at 9:59 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>> Martin & Dom,
>>>>>>>>
>>>>>>>> Any thoughts about this?
>>>>>>>> I would like to take this to completion as a feature for abiword.
>>>>>>>>
>>>>>>>> thanks!
>>>>>>>>
>>>>>>>> On Tue, Mar 26, 2013 at 11:53 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>> +screenshot
>>>>>>>>>
>>>>>>>>> On Tue, Mar 26, 2013 at 11:52 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>>> Martin,
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Both the features you have listed work.
>>>>>>>>>> Since Aspell already is built for spell check and suggestion, once the
>>>>>>>>>> word list is integrated, these are automatically taken care of.
>>>>>>>>>>
>>>>>>>>>> Please check the screen shot attached showing the functional state.
>>>>>>>>>>
>>>>>>>>>> I want to focus on taking this work to closure and possibly extending
>>>>>>>>>> it as required.
>>>>>>>>>> I would love to know what Dom thinks about this and hear his suggestions.
>>>>>>>>>>
>>>>>>>>>> thanks & regards,
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 26, 2013 at 6:10 PM, Martin Sevior <msevior@gmail.com> wrote:
>>>>>>>>>>> Hi Vidh,
>>>>>>>>>>>
>>>>>>>>>>> Dom is more the expert on OpenMedSpel and spell checking than me.
>>>>>>>>>>> Would you like to comment, Dom?
>>>>>>>>>>>
>>>>>>>>>>> At first glance this looks really impressive. I guess the next step
>>>>>>>>>>> would be to type in some words that almost match those in OpenMedSpel
>>>>>>>>>>> and see if they are
>>>>>>>>>>> 1. underlined
>>>>>>>>>>> 2. offered the correct OpenMedSpel word as a suggestion in the
>>>>>>>>>>> drop-down list.
>>>>>>>>>>>
>>>>>>>>>>> Congrats on making such fast progress!
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> Martin
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 25, 2013 at 8:11 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>>>>> Hi Martin,
>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 23, 2013 at 12:16 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>>>>>> Integrating OpenMedSpel word list with Abiword's spell checker
>>>>>>>>>>>>> ====================================================
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried out various things in Abiword in order to get familiar with
>>>>>>>>>>>>> the source code.
>>>>>>>>>>>>> I got things moving on this GSOC idea:
>>>>>>>>>>>>> http://www.abisource.com/wiki/Google_Summer_of_Code_2013#OpenMedSpel_plugin_for_AbiWord
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am writing this mail to report what I have done related to this idea
>>>>>>>>>>>>> and to present some results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The goal is to make Abiword simultaneously spell check against
>>>>>>>>>>>>> OpenMedSpel and the default language dictionary.
>>>>>>>>>>>>> I have achieved this result by making a few changes to "enchant" and "aspell".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the steps I followed:
>>>>>>>>>>>>> ========================
>>>>>>>>>>>>> 1) Downloaded the OpenMedSpel wordlist from:
>>>>>>>>>>>>> http://www.e-medtools.com/openmedspel100.zip
>>>>>>>>>>>>>
>>>>>>>>>>>>> This had the following files:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ ls -l ~/Downloads/openmedspel100
>>>>>>>>>>>>> total 2968
>>>>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 607041 Feb 14 2007 OpenMedSpel 100.csv
>>>>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 558312 Mar 14 01:38 OpenMedSpel 100.txt
>>>>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 1169 Feb 14 2007 README_OpenMedSpel.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2) The txt file in the download had DOS characters in it. Hence, I had
>>>>>>>>>>>>> to do this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ dos2unix OpenMedSpel\ 100.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3) Now I created a wordlist for aspell using the command below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $aspell --lang=en create master ./openmedspel.rws <
>>>>>>>>>>>>> ~/Downloads/openmedspel100/OpenMedSpel\ 100.txt
>>>>>>>>>>>>>
>>>>>>>>>>>>> (why I chose Aspell? - I could easily find relevant documentation to
>>>>>>>>>>>>> solve this problem)
>>>>>>>>>>>>> This link was really helpful:
>>>>>>>>>>>>> http://aspell.net/0.50-doc/man-html/5_Working.html#SECTION00640000000000000000
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4) After this, I located Aspell in my local system and found it at:
>>>>>>>>>>>>> /usr/lib/aspell
>>>>>>>>>>>>>
>>>>>>>>>>>>> In this location I could find all "multi" dictionary files and "rws"
>>>>>>>>>>>>> lists included in them.
>>>>>>>>>>>>> I copied the openmedspel.rws list to this location.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 5) I took the en_US.multi (since OpenMedSpel wordlist is also USA
>>>>>>>>>>>>> english) and found it to contain "en_US-wo_accents.multi"
>>>>>>>>>>>>> Then I opened en_US-wo_accents.multi and added the new wordlist
>>>>>>>>>>>>> created as shown below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ cat en_US-wo_accents.multi
>>>>>>>>>>>>> # Generated with Aspell Dicts "proc" script version 0.60.2
>>>>>>>>>>>>> add en-common.rws
>>>>>>>>>>>>> add en_US-wo_accents-only.rws
>>>>>>>>>>>>> add openmedspel.rws
>>>>>>>>>>>>>
>>>>>>>>>>>>> 6) I understood that Aspell is exercised through "enchant" by Abiword
>>>>>>>>>>>>> and located enchant in my local system:
>>>>>>>>>>>>> /usr/share/enchant
>>>>>>>>>>>>>
>>>>>>>>>>>>> 7) I found the "private" enchant.ordering file that determines the way
>>>>>>>>>>>>> spell checker and dictionaries are chosen for each language.
>>>>>>>>>>>>> I learned more about this file from this link:
>>>>>>>>>>>>> https://listman.redhat.com/archives/xdg-list/2003-July/msg00188.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> For time being, I hacked this file to place "aspell" ahead of
>>>>>>>>>>>>> "myspell" from the ordering so that aspell gets picked for en_US and
>>>>>>>>>>>>> this would contain OpenMedSpel wordlist also.
>>>>>>>>>>>>>
>>>>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ cat /usr/share/enchant/enchant.ordering
>>>>>>>>>>>>> *:aspell,myspell,ispell //-> order changed in this line
>>>>>>>>>>>>> fi:voikko,ispell,myspell,aspell
>>>>>>>>>>>>> fi_FI:voikko,ispell,myspell,aspell
>>>>>>>>>>>>> he:hspell,myspell
>>>>>>>>>>>>> he_IL:hspell,myspell
>>>>>>>>>>>>> yi:uspell
>>>>>>>>>>>>> tr:zemberek
>>>>>>>>>>>>> tr_TR:zemberek
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now I can see that Abiword's spell check does not underline words from
>>>>>>>>>>>>> the OpenMedSpel list, which indicates that the goal is achieved. That is,
>>>>>>>>>>>>> Abiword spell checks against both normal English (US) words and
>>>>>>>>>>>>> OpenMedSpel words.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have attached screenshot of abiword illustrating this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So coming to the next steps,
>>>>>>>>>>>>> =======================
>>>>>>>>>>>>> 1) I would like to know how these steps involving enchant and aspell
>>>>>>>>>>>>> (purely! abiword need not even be compiled) can be transformed into an
>>>>>>>>>>>>> abiword plugin.
>>>>>>>>>>>>> 2) Have I achieved the required result in the right way?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or is it that a UI based option from Abiword for users to add their
>>>>>>>>>>>>> own word lists for a language dictionary is needed?
>>>>>>>>>>>>> I think this might fit in the "plugin" definition!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any help, suggestions and thoughts appreciated.
>>>>>>>>>>>>> thanks & regards!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Vidhoon Viswanathan
>>>>>>>>>>>>> +91 7760711773

-- 
"I like to pay taxes. With them, I buy civilization." --  Oliver Wendell Holmes
Received on Tue Apr 30 18:50:24 2013

This archive was generated by hypermail 2.1.8 : Tue Apr 30 2013 - 18:50:24 CEST