Re: Integrating OpenMedSpel word list with Abiword's spell checker

From: Dominic Lachowicz <domlachowicz_at_gmail.com>
Date: Tue Apr 30 2013 - 18:19:32 CEST

Hi Vidh,

Yes, that's what I was thinking, but by no means do I have a monopoly
on good ideas.

OR is useful for use cases like "This word is either in English or
Spanish". Or in the case of a medical dictionary, "This word is either
English or it's a medical term of art". "OR" will easily cover in
excess of 99.999% of your use cases.

Other boolean operators besides "OR" may only be useful in a strictly
academic/masturbatory sense. E.g.: "I'd like to know what words are
common across these 2 languages. If I had an 'AND' operator, I could
do that." I don't think that the incremental work to support AND vs OR
is all that huge - my sense is that it's on the order of a few lines
of code. Now, I'm not saying that you should implement AND and OR. I'd
just want the design to be flexible enough to support it easily, if it
becomes necessary.
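
To make that concrete, here is a minimal Python sketch of the design I
have in mind. It is not Enchant code: CompositeDict, WordSet, Checker and
the "combine" argument are hypothetical names made up for illustration,
and a real implementation would live in Enchant's C code and wrap real
dictionary objects. The only point is that OR and AND can differ by a
single argument.

```python
from typing import Callable, Iterable, Protocol


class Checker(Protocol):
    """Anything that can say whether a word is spelled correctly."""

    def check(self, word: str) -> bool: ...


class WordSet:
    """Toy stand-in for a real dictionary: a case-insensitive set of words."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words = {w.lower() for w in words}

    def check(self, word: str) -> bool:
        return word.lower() in self._words


class CompositeDict:
    """Hypothetical composite: combines member results with a boolean operator.

    combine=any gives OR (accepted if any member accepts the word);
    combine=all gives AND (accepted only if every member accepts it).
    """

    def __init__(self, members: Iterable[Checker],
                 combine: Callable[[Iterable[bool]], bool] = any) -> None:
        self._members = list(members)
        self._combine = combine

    def check(self, word: str) -> bool:
        return self._combine(m.check(word) for m in self._members)


english = WordSet(["patient", "fever", "the"])
medical = WordSet(["fever", "tachycardia"])

either = CompositeDict([english, medical])               # OR
both = CompositeDict([english, medical], combine=all)    # AND
```

Here either.check("tachycardia") is true because the medical list has
the word, while both.check("tachycardia") is false because the English
list does not. Since CompositeDict has the same check() method as its
members, composites can also be nested.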

How you'd do the UI for selecting those multiple languages is up to
you. You could allow users to combine languages "on the fly". You
could pre-populate the list with an "English Medical Dictionary"
option. If this is for a GSoC type of project, I'd say that you have a
good deal of latitude to experiment. Heck, try it both ways, and put
it in front of "focus groups". See which they prefer.

Think about how you'd want to implement the OpenMedSpel dictionary. We
could package it up as a Hunspell dictionary, an Aspell dictionary, or
pass it to Enchant as a custom word list file. All of the approaches
have their advantages and drawbacks.
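
Whichever route you pick, the raw list will probably need some cleaning
first. As a rough sketch (Python, standard library only): strip DOS line
endings, drop blank lines and duplicates, and write one word per line.
The function name and the idea of a separate cleaning step are my own
assumptions for illustration, and I have not checked what else each
backend expects (encoding, affix flags, header lines), so treat this as
a starting point only.

```python
from pathlib import Path


def normalize_word_list(src: Path, dst: Path, encoding: str = "utf-8") -> int:
    """Write a cleaned copy of src to dst, one word per line with LF endings.

    Returns the number of words written. Order of first appearance is kept,
    and words that differ only in case are treated as distinct.
    """
    raw = src.read_bytes().decode(encoding, errors="replace")
    seen = set()
    words = []
    for line in raw.splitlines():       # splitlines() handles both CRLF and LF
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    # Write bytes so the platform cannot translate "\n" back to CRLF.
    dst.write_bytes("".join(w + "\n" for w in words).encode(encoding))
    return len(words)
```

Case is deliberately preserved, on the assumption that some medical
terms and abbreviations are case-sensitive.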

Cheers,
Dom

On Tue, Apr 30, 2013 at 11:04 AM, Vidh <vidhu2366@gmail.com> wrote:
> Thanks Dom.
>
> OK, I have developed ideas for my proposal based on the outline you
> gave in your previous mail of Apr 9th.
>
> The idea is "to combine several dictionaries to create a composite
> dictionary and use it for spell checking multilingual documents".
>
> I need some clarification on your previous mail:
> When would operations other than OR be needed between multiple
> dictionaries? Let us assume a multilingual document in English, French
> and German. In this case, we need to make sure that each word typed
> by the user belongs to either English or French or German, right?
> So if my understanding is right, why would we need the AND operation
> you mentioned earlier?
>
> Next, coming to the usage of this feature: I am planning to let the
> user select the languages used in the document (in the case of
> multilingual docs) in the Tools->Set Language(s) option. Based on that
> selection, we could create a CompositeDict instance and use it for
> spell checking the document.
>
> Is this the kind of usage you have in mind? I just want to pool
> ideas and refine them to arrive at the most practical and useful
> feature set.
>
> In the case of OpenMedSpel, we would by default have a CompositeDict
> for English-US, consisting of the normal English-US words (.dict
> file) plus domain-specific lists like OpenMedSpel. So this problem
> becomes a special case of the proposed idea.
>
> Let me know your thoughts.
>
> Martin,
>
> Feel free to chip in with your thoughts too! :)
>
> thanks!
>
> On Tue, Apr 30, 2013 at 6:38 AM, Dominic Lachowicz
> <domlachowicz@gmail.com> wrote:
>> Hi Vidh,
>>
>> Sorry - I'm not often available on IRC or Google Chat. I'm happy to
>> continue the discussion over email, though.
>>
>> Thanks,
>> Dom
>>
>> On Mon, Apr 29, 2013 at 1:31 AM, Vidh <vidhu2366@gmail.com> wrote:
>>> Hi Martin & Dom,
>>>
>>> Are you guys available sometime on IRC to discuss this?
>>> thanks!
>>>
>>> On Tue, Apr 9, 2013 at 8:10 PM, Dominic Lachowicz
>>> <domlachowicz@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> My thought was to create a new Enchant method, something like "Dict
>>>> enchant_dict_compose(Dict 1, Dict 2, ...)". What it would do is
>>>> combine the words in a medical dictionary with those in an (e.g.)
>>>> English or Dutch dictionary. We'd need to create an internal
>>>> CompositeDict object which would perform some boolean operation on the
>>>> words, e.g. OR or AND. You could use this as a building block to
>>>> support mixed-language texts (e.g. English + Spanish). "Medical" here
>>>> is really just a domain-specific language.
>>>>
>>>> Thanks,
>>>> Dom
>>>>
>>>> On Tue, Apr 9, 2013 at 9:59 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>>> Martin & Dom,
>>>>>
>>>>> Any thoughts about this?
>>>>> I would like to take this to completion as a feature for abiword.
>>>>>
>>>>> thanks!
>>>>>
>>>>> On Tue, Mar 26, 2013 at 11:53 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>> +screenshot
>>>>>>
>>>>>> On Tue, Mar 26, 2013 at 11:52 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>> Martin,
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Both of the features you listed work.
>>>>>>> Since Aspell is already built for spell checking and suggestions, once
>>>>>>> the word list is integrated, these are taken care of automatically.
>>>>>>>
>>>>>>> Please check the attached screenshot showing the functional state.
>>>>>>>
>>>>>>> I want to focus on taking this work to closure and possibly extending
>>>>>>> it as required.
>>>>>>> I would love to know what Dom thinks about this and hear his suggestions.
>>>>>>>
>>>>>>> thanks & regards,
>>>>>>>
>>>>>>> On Tue, Mar 26, 2013 at 6:10 PM, Martin Sevior <msevior@gmail.com> wrote:
>>>>>>>> Hi Vidh,
>>>>>>>>
>>>>>>>> Dom is more of an expert on OpenMedSpel and spell checking than I am.
>>>>>>>> Would you like to comment, Dom?
>>>>>>>>
>>>>>>>> At first glance this looks really impressive. I guess the next step
>>>>>>>> would be to type in some words that almost match those in OpenMedSpel
>>>>>>>> and see if they:
>>>>>>>> 1. Are underlined.
>>>>>>>> 2. Have the correct OpenMedSpel word offered as a suggestion in the
>>>>>>>> drop-down list.
>>>>>>>>
>>>>>>>> Congrats on making such fast progress!
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> Martin
>>>>>>>>
>>>>>>>> On Mon, Mar 25, 2013 at 8:11 PM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>> Hi Martin,
>>>>>>>>> Any thoughts on this?
>>>>>>>>>
>>>>>>>>> On Sat, Mar 23, 2013 at 12:16 AM, Vidh <vidhu2366@gmail.com> wrote:
>>>>>>>>>> Integrating OpenMedSpel word list with Abiword's spell checker
>>>>>>>>>> ====================================================
>>>>>>>>>>
>>>>>>>>>> I tried out various things in Abiword in order to get familiar with
>>>>>>>>>> the source code.
>>>>>>>>>> I got things moving on this GSoC idea:
>>>>>>>>>> http://www.abisource.com/wiki/Google_Summer_of_Code_2013#OpenMedSpel_plugin_for_AbiWord
>>>>>>>>>>
>>>>>>>>>> I am writing this mail to report what I have done related to this idea
>>>>>>>>>> and to present some results.
>>>>>>>>>>
>>>>>>>>>> The goal is to make Abiword spell check against OpenMedSpel and the
>>>>>>>>>> default language dictionary simultaneously.
>>>>>>>>>> I have achieved this by making a few changes to "enchant" and "aspell".
>>>>>>>>>>
>>>>>>>>>> Here are the steps I followed:
>>>>>>>>>> ========================
>>>>>>>>>> 1) Downloaded the OpenMedSpel wordlist from:
>>>>>>>>>> http://www.e-medtools.com/openmedspel100.zip
>>>>>>>>>>
>>>>>>>>>> This had the following files:
>>>>>>>>>>
>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ ls -l ~/Downloads/openmedspel100
>>>>>>>>>> total 2968
>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 607041 Feb 14 2007 OpenMedSpel 100.csv
>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 558312 Mar 14 01:38 OpenMedSpel 100.txt
>>>>>>>>>> -rw-rw-r-- 1 vidhoon vidhoon 1169 Feb 14 2007 README_OpenMedSpel.txt
>>>>>>>>>>
>>>>>>>>>> 2) The txt file in the download had DOS line endings (CRLF). Hence, I
>>>>>>>>>> had to do this:
>>>>>>>>>>
>>>>>>>>>> $dos2unix OpenMedSpel\ 100.txt OpenMedSpelunix.txt
>>>>>>>>>>
>>>>>>>>>> 3) Now I created a wordlist for aspell using the command below:
>>>>>>>>>>
>>>>>>>>>> $aspell --lang=en create master ./openmedspel.rws \
>>>>>>>>>> < ~/Downloads/openmedspel100/OpenMedSpel\ 100.txt
>>>>>>>>>>
>>>>>>>>>> (Why did I choose Aspell? I could easily find the relevant
>>>>>>>>>> documentation for this problem.)
>>>>>>>>>> This link was really helpful:
>>>>>>>>>> http://aspell.net/0.50-doc/man-html/5_Working.html#SECTION00640000000000000000
>>>>>>>>>>
>>>>>>>>>> 4) After this, I located Aspell on my local system and found it at:
>>>>>>>>>> /usr/lib/aspell
>>>>>>>>>>
>>>>>>>>>> In this location I could find all the "multi" dictionary files and the
>>>>>>>>>> "rws" lists included by them.
>>>>>>>>>> I copied the openmedspel.rws list to this location.
>>>>>>>>>>
>>>>>>>>>> 5) I took en_US.multi (since the OpenMedSpel wordlist is also US
>>>>>>>>>> English) and found that it includes "en_US-wo_accents.multi".
>>>>>>>>>> Then I opened en_US-wo_accents.multi and added the newly created
>>>>>>>>>> wordlist as shown below:
>>>>>>>>>>
>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ cat en_US-wo_accents.multi
>>>>>>>>>> # Generated with Aspell Dicts "proc" script version 0.60.2
>>>>>>>>>> add en-common.rws
>>>>>>>>>> add en_US-wo_accents-only.rws
>>>>>>>>>> add openmedspel.rws
>>>>>>>>>>
>>>>>>>>>> 6) I understood that Abiword uses Aspell through "enchant", and
>>>>>>>>>> located enchant on my local system:
>>>>>>>>>> /usr/share/enchant
>>>>>>>>>>
>>>>>>>>>> 7) I found the "private" enchant.ordering file that determines how
>>>>>>>>>> the spell checker and dictionaries are chosen for each language.
>>>>>>>>>> I learned more about this file from this link:
>>>>>>>>>> https://listman.redhat.com/archives/xdg-list/2003-July/msg00188.html
>>>>>>>>>>
>>>>>>>>>> For the time being, I hacked this file to place "aspell" ahead of
>>>>>>>>>> "myspell" in the ordering so that aspell gets picked for en_US, and
>>>>>>>>>> this now contains the OpenMedSpel wordlist as well.
>>>>>>>>>>
>>>>>>>>>> vidhoon@vidhoonv:/usr/lib/aspell$ cat /usr/share/enchant/enchant.ordering
>>>>>>>>>> *:aspell,myspell,ispell //-> order changed in this line
>>>>>>>>>> fi:voikko,ispell,myspell,aspell
>>>>>>>>>> fi_FI:voikko,ispell,myspell,aspell
>>>>>>>>>> he:hspell,myspell
>>>>>>>>>> he_IL:hspell,myspell
>>>>>>>>>> yi:uspell
>>>>>>>>>> tr:zemberek
>>>>>>>>>> tr_TR:zemberek
>>>>>>>>>>
>>>>>>>>>> Now I can see that Abiword's spell check does not underline words from
>>>>>>>>>> the OpenMedSpel list, which indicates that the goal is achieved. That
>>>>>>>>>> is, Abiword spell checks against both normal US English words and
>>>>>>>>>> OpenMedSpel words.
>>>>>>>>>>
>>>>>>>>>> I have attached a screenshot of Abiword illustrating this.
>>>>>>>>>>
>>>>>>>>>> So, coming to the next steps:
>>>>>>>>>> =======================
>>>>>>>>>> 1) I would like to know how these steps, which involve only enchant and
>>>>>>>>>> aspell (Abiword need not even be compiled), can be transformed into an
>>>>>>>>>> Abiword plugin.
>>>>>>>>>> 2) Have I achieved the required result in the right way?
>>>>>>>>>>
>>>>>>>>>> Or is what's needed a UI-based option in Abiword for users to add their
>>>>>>>>>> own word lists to a language dictionary?
>>>>>>>>>> I think this might fit the "plugin" definition!
>>>>>>>>>>
>>>>>>>>>> Any help, suggestions and thoughts appreciated.
>>>>>>>>>> thanks & regards!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Vidhoon Viswanathan
>>>>>>>>>> +91 7760711773

-- 
"I like to pay taxes. With them, I buy civilization." --  Oliver Wendell Holmes
Received on Tue Apr 30 18:19:49 2013
