Re: Updated: Summary of what I have done in GSoc2011_chenxiajian

From: Kathiravelu Pradeeban <kk.pradeeban_at_gmail.com>
Date: Mon Aug 15 2011 - 19:13:29 CEST

Hi Chen,
Had a look. Looks fine.

On Mon, Aug 15, 2011 at 9:49 PM, Chen Xiajian <chenxiajian1985@gmail.com> wrote:
> Hi
> the attachment is the summary of what I have done in GSoc2011. Please
> check it.  We can discus more detailed  tomorrow same time as usual.

Sure. Let's discuss further the usual time tomorrow.

Thank you.
Regards,
Pradeeban.
>
> I have build the User interface to manage hyphenation. user can enable
> or disable hyphenation function in user interface (GUI). Tomorrow I
> will focus on Linux GTK version.
>
> Best Regard!
>
> Chen Xiajian
>
>
>
> ======================================================
> Summary of What I have done in GSoc2011 1
>
> Until now, my works in GSoc2011 including four parts as following:
> 1.Hyphenation module in Enchant
>        Read and get totally understand the source code of Enchant
>        Reuse the abstract layer of Enchant and add Hyphenation function in
> Enchant, so that we can add more language easily
>        Deal with more languages
>        Add five backend implementation, including ispell, myspell,
> zemberek, voikko, uspell
>        Deal with the spelling-checking module
>
> 2.Call the Hyphenation function in Abiword.
>        Find split info using enchant_dict_hyphenate
>        Split Text_Run to split word pass the line width and keep their format
>        Deal with user's operation(select, delete, cut, paste)
>        User can select weather to enable the hyphenation function
>
> 3. Simple Implementation of Chinese Spell-Checking in Enchant
>        Add a simple spell-check framework for Chinese in Enchant
>        Add library to support
>        Some survey about Chinese Spell-checking
>
> 4. Code Re-factor and debug
>        Code Re-factor, include keep the code flexible
>        Debug coding problem
>
>
> The detail things:
> 1 Hyphenation module in Enchant
> 1.1 Add hyphenation function in Enchant
> Firstly, I add hyphenation method in Enchant:
> ================the code===========
> I think we can combine the hyphenation with spell-checking together,
> So that we can make the code more flexible. In my opinion, the
> hyphenation function defines as following:
> EnchantDict* enchant_broker_request_dict (EnchantBroker* broker, const
> char *const lang); //same as spell-checking
> char *enchant_dict_hyphenate(EnchantDict *dict, const char *const
> word,size_t len);
>
> In order to achieve the function and implement in abstract layer, we
> need to add hyphenation function in EnchantDict. something like, just
> as a function pointer:
> char* (*hyphenate) (struct str_enchant_dict * me,
>                          const char *const word, size_t len,
>                          size_t * out_n_suggs);
>
> and the function is implement by the backend. Take “ispell” as example:
> static char * ispell_dict_hyphenate (EnchantDict * me, const char *const word,
>                    size_t len, size_t * out_n_suggs)
> {
>       ISpellChecker * checker;
>       checker = (ISpellChecker *) me->user_data;
>       return checker->hyphenate (word, len, out_n_suggs);
> }
>
> Finally, we set the connetion
>  dict->hyphenate = ispell_dict_hyphenate;
>  dict->suggest = hspell_dict_hyphenate;
> dict->suggest = zemberek_dict_hyphenate;
>
> 1.2 Add five backends to support hyphenation
>  including ispell, myspell, zemberek, voikko, uspell
>        Hunspell: using seperated dictionary: such as hyph_en_us.dic.  we
> can download dic from internet
>        Libhyphenaiton: the dictionary is provided by author, sometimes limited
>        Zemberek: for Turkis
>        Voikko: for Finnish
>
> the changes:
> 1 deleted the unneed connection, such as HSpell
> 2 add hunspell(myspell) hyphenation code
> 3 implement hyphenation using hunspell
> 4 implement hyphenation using Zemberek
>
> ======1 deleted the unneed connection, such as HSpell===========
> Hebrew don’t need any hyphenation
> Yiddish don’t need any hyphenation
> =======2 Implement hyphenation using hunspell
> In order to use libhyphenation. We need to add files:
> hyphen/hnjalloc.h
> hyphen/hnjalloc.c
> hyphen/hyph_en_US.dic
> hyphen/hyphen.c
> hyphen/hyphen.gyp
> hyphen/hyphen.h
> hyphen/hyphen.patch
> hyphen/hyphen.tex
>
> ========3 Implement hyphenation using Zemberek
>  just using dbus_g_proxy_call the same as Spell-Check in Zemberek:
> the hyphenation is as following
>  char* Zemberek::hyphenate(const char* word)
> {
>       char* result;
>       GError *Error = NULL;
>       if (!dbus_g_proxy_call (proxy, "hecele", &Error,
>               G_TYPE_STRING,word,G_TYPE_INVALID,
>               G_TYPE_STRV, &result,G_TYPE_INVALID)) {
>                       g_error_free (Error);
>                       return NULL;
>       }
>       char*result=0;
>       return result;
> }
>
> 1.3 ISpell
> I used Libhyphenation in ISpell. The simple code is just like this:
> static char *
> ispell_dict_hyphenate (EnchantDict * me, const char *const word)
> {
>        ISpellChecker * checker;
>
>        checker = (ISpellChecker *) me->user_data;
>        if(me->tag!="")
>          return checker->hyphenate (word,me->tag);
>    return checker->hyphenate (word,"en_us");
> }
> The concrete code in ISpellChecker is :
> char *
> ISpellChecker::hyphenate(const char * const utf8Word, const char *const tag)
> {  //we must choose the right language tag
>        char* param_value = enchant_broker_get_param (m_broker,
> "enchant.ispell.hyphenation.dictionary.path");
>        if(languageMap[tag]!="")
>        {
>                string result=Hyphenator(RFC_3066::Language(languageMap[tag]),param_value).hyphenate(utf8Word).c_str();
>
>                char* temp=new char[result.length()];
>                strcpy(temp,result.c_str());
>                return temp;
>        }
>        return NULL;
> }
> 1.4 MySpell
> I used Libhyphenate in ISpell. The simple code is just like this:
> char*
> MySpellChecker::hyphenate (const char* const word, size_t len,char* tag)
> {
>        if(len==-1) len=strlen(word);
>        if (len > MAXWORDLEN
>                || !g_iconv_is_valid(m_translate_in)
>                || !g_iconv_is_valid(m_translate_out))
>                return 0;
>        char* result=0;
>        myspell->hyphenate(word,result,tag);
>        return result;
> }
> The concrete code in MySpellChecker is :
> void Hunspell::hyphenate( const char* const word, char* result, char* tag )
> {
>        HyphenDict *dict;
>        char buf[BUFSIZE + 1];
>        char *hyphens=new char[BUFSIZE + 1];
>        char ** rep;
>        int * pos;
>        int * cut;
>        /* load the hyphenation dictionary */
>        string filePath="hyph_";
>        filePath+=tag;
>        filePath+=".dic";
>        if ((dict = hnj_hyphen_load(filePath.c_str())) == NULL) {
>                fprintf(stderr, "Couldn't find file %s\n",tag);
>                fflush(stderr);
>                exit(1);
>        }
>     int len=strlen(word);
>     if (hnj_hyphen_hyphenate2(dict, word, len-1, hyphens, NULL, &rep,
> &pos, &cut)) {
>                                free(hyphens);
>                                fprintf(stderr, "hyphenation error\n");
>                                exit(1);
>                }
>
>        hnj_hyphen_free(dict);
>        result=hyphens;
> }
>
> 1.5 zemberek
> The way in Zemberek is same with the two above:
> static char*
> zemberek_dict_hyphenate (EnchantDict * me, const char *const word)
> {
>        Zemberek *checker;
>        checker = (Zemberek *) me->user_data;
>        return checker->hyphenate (word);
> }
> But the way for the concrete implementation is different from the two.
> We use zemberek_service
> char* Zemberek::hyphenate(const char* word)
> {
>        char* result;
>        GError *Error = NULL;
>        if (!dbus_g_proxy_call (proxy, "hecele", &Error,
>                G_TYPE_STRING,word,G_TYPE_INVALID,
>                G_TYPE_STRV, &result,G_TYPE_INVALID)) {
>                        g_error_free (Error);
>                        return NULL;
>        }
>
>        char*result=0;
>        return result;
> }
> 1.6 voikko
> The hyphenation implementation in Voikko is easy since Voikko has
> hyphenaiton’s API.
> static char **
> voikko_dict_suggest (EnchantDict * me, const char *const word,
>                     size_t len, size_t * out_n_suggs)
> {
>        char **sugg_arr;
>        int voikko_handle;
>
>        voikko_handle = (long) me->user_data;
>        sugg_arr = voikko_suggest_cstr(voikko_handle, word);
>        if (sugg_arr == NULL)
>                return NULL;
>        for (*out_n_suggs = 0; sugg_arr[*out_n_suggs] != NULL; (*out_n_suggs)++);
>        return sugg_arr;
> }
>
> 1.7 Deploy of enchant in Abiword
> I just copy the buliding result of enchant to the right place in Abiword:
> enchant\bin\Debug\libenchant_myspell.dll
> ---->abiword\msvc2008\Debug\lib\enchant\libenchant_myspell.dll
> enchant\bin\Debug\libenchant_ispell.dll
> ---->abiword\msvc2008\Debug\lib\enchant\libenchant_ispell.dll
> enchant\bin\Debug\libenchant.dll---->
> abiword\msvc2008\Debug\bin\ibenchant.dll
>
> 1.8 Test in Linux
> I have test the Enchant module in RedHat.  It works fine for me.
>
> 2 Call the Hyphenation function in Abiword.
>        Split run to split word and keep the format
>        Find split info
>        Deal with user's operation(select, delete, cut, paste)
>
> Main Goal: call hyphenation module of enchant to display the
> hyphenation result in abiword. After user's operation, refresh the
> hyphenation-result accordingly include user adding new word, delete
> word, copy word, cut word
>
> The main code is adding in the format function in LineBreaker.h(cpp)
> // find the split point
> while (pRunToBump && pLine->getNumRunsInLine() && (pLine->getLastRun()
> != m_pLastRunToKeep))
>                {
>                        UT_ASSERT(pRunToBump->getLine() == pLine);
>                        if(!pLine->removeRun(pRunToBump))
>                        {
>                                pRunToBump->setLine(NULL);
>                        }
>                        UT_ASSERT(pLine->getLastRun()->getType() != FPRUN_ENDOFPARAGRAPH);
>                        if(pLine->getLastRun()->getType() == FPRUN_ENDOFPARAGRAPH)
>                        {
>                                fp_Run * pNuke = pLine->getLastRun();
>                                pLine->removeRun(pNuke);
>                        }
>                pRunToBump->printText();  //trace out debug message & run two time
>                pNextLine->insertRun(pRunToBump);  //called when create new line
>                        // to get the split word
>                        if (!(pRunToBump->getPrevRun() && pLine->getNumRunsInLine() &&
> (pLine->getLastRun() != m_pLastRunToKeep)))
>                        {
>                                pRunToSplit=pRunToBump;
>                                PD_StruxIterator text(pRunToBump->getBlock()->getStruxDocHandle(),
>                                        pRunToBump->getBlockOffset() + fl_BLOCK_STRUX_OFFSET);
>
>                                text.setUpperLimit(text.getPosition() + pRunToBump->getLength() - 1);
>                                UT_ASSERT_HARMLESS( text.getStatus() == UTIter_OK );
>                                UT_UTF8String sTmp;
>                                while(text.getStatus() == UTIter_OK)
>                                {
>                                        UT_UCS4Char c = text.getChar();
>                                        UT_DEBUGMSG(("| %d |",c));
>                                        if(c >= ' ' && c <128)
>                                                sTmp +=  static_cast<char>(c);
>                                        ++text;
>                                }
>                                UT_DEBUGMSG(("The Split Text |%s| \n",sTmp.utf8_str()));
>                                if(sTmp.utf8_str()!=0)
>                                {
>                    pWordToSplit=sTmp;
>                                        UT_DEBUGMSG(("wordToSplit |%s| \n",pWordToSplit.utf8_str()));
>                                }
>                        }
>                        pRunToBump = pRunToBump->getPrevRun();
>                        UT_DEBUGMSG(("Next runToBump %x \n",pRunToBump));
>                }
>        }
>        //modify src/text/fmt/xp/fb_LineBreaker.cpp to place hypernation points
>        //spit the word
>        if(pWordToSplit.length()!=NULL)
>        {
>        pWordHyphenationResult=pBlock->_hyphenateWord(pWordToSplit.ucs4_str().ucs4_str(),0,0);
>                int tickLeft=pLine->getAvailableWidth();
>                if (pWordHyphenationResult && *pWordHyphenationResult){
>                        gchar *c = g_ucs4_to_utf8(pWordHyphenationResult, -1, NULL, NULL, NULL);
>                        for(int index=g_utf8_strlen(c,NULL);index>=0;--index)
>                        {
>                                if(pWordHyphenationResult[index]=='-'&&index<tickLeft)
>                                {
>                                        pBreakPoint=index;
>                                        fp_TextRun* textout=static_cast<fp_TextRun*>(pRunToSplit);
>                                        textout->split(pBreakPoint);
>                                }
>                        }
>                }
>        }
>
>
> 3 Simple Implementation of Chinese Spell-Check in Enchant
> After GSoc2011, I would like to add Chinese Spell-Check in Enchant.
> Chinese Spell-Check is also a very important issue in Word-Processor.
> I found some lib to support; I just build a simple framework since
> time is limit.
> The main function:
>
>
> 4 Code Re-factor and debug
> 5. Still to improve
>        Code Re-Factor
>        Deal with more language
>        include more user's operation(such as operate with picture may
> influence the hyphenation result)
>
> more:
>        Fully Support hyphenation in Abiword
>        Support more language
>        More tests in Linux(Unix)
>        Finish the Implementation of Chinese Spell-Check in Enchant
>        User interface about Hyphenation
>

-- 
Kathiravelu Pradeeban.
Software Engineer.
WSO2 Inc.
Blog: [Llovizna] http://kkpradeeban.blogspot.com/
Received on Mon Aug 15 19:14:02 2011

This archive was generated by hypermail 2.1.8 : Mon Aug 15 2011 - 19:14:02 CEST