Re: Patch: RecognizeContents for UTF-8


Subject: Re: Patch: RecognizeContents for UTF-8
From: Vlad Harchev (hvv@hippo.ru)
Date: Mon Apr 09 2001 - 13:05:11 CDT


On Tue, 10 Apr 2001, Andrew Dunbar wrote:

> >> Here's my patch to allow loading UTF-8 files regardless
> >> of the filename extension. Including the .txt which
> >> is generally the case on Windows at least.
> >
> > Hmm, why would all .txt files be utf8 files on Windows? Only ASCII
> > files are really utf8 files; no other encoding is a subset of utf8,
> > so I think nothing should be changed in IE_Imp_Text::RecognizeSuffix.
>
> No. On Windows a .txt file may be any encoding
> including Unicode encodings. I'm sure this is not
> Windows specific. AbiWord already doesn't check
> a file's contents for plain text but provides it as
> a fallback. I have done the same for the extension.
> Is there a better way to do this other than renaming
> all UTF-8 files with a .utf8 extension?

 Hmm, the ability to give utf8-encoded files a .txt extension is a nice thing.
 I'm just having difficulty convincing myself that after this patch the user
will still be able to import .txt files that are not in utf8 - if you are sure
(and have tested) that the user will be able to do that, then your patch is
welcome.
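
 To make the concern concrete, here is a minimal sketch of content-based
recognition that still lets non-utf8 .txt files through. It is not the code
from the patch: the name looksLikeUtf8 and its behaviour are assumptions made
for illustration. The idea is to claim a file as utf8 only if its bytes are
well-formed utf8 and at least one multi-byte sequence is present, so pure
ASCII and legacy 8-bit text fall back to the plain-text importer.

```cpp
#include <cstddef>

// Hypothetical helper, NOT AbiWord's actual RecognizeContents.
// Returns true only if buf is structurally valid UTF-8 and contains at
// least one multi-byte sequence. Simplifications (assumptions): overlong
// 3- and 4-byte forms and surrogates are not rejected, and a sequence
// cut off at the end of the buffer counts as "not UTF-8".
bool looksLikeUtf8(const unsigned char* buf, std::size_t len)
{
    bool sawMultiByte = false;
    std::size_t i = 0;
    while (i < len) {
        const unsigned char c = buf[i];
        std::size_t trail = 0;              // number of continuation bytes expected
        if (c < 0x80) { ++i; continue; }    // plain ASCII byte
        else if (c >= 0xC2 && c <= 0xDF) trail = 1;
        else if (c >= 0xE0 && c <= 0xEF) trail = 2;
        else if (c >= 0xF0 && c <= 0xF4) trail = 3;
        else return false;                  // stray continuation or invalid lead byte

        if (i + trail >= len) return false; // sequence runs past the buffer
        for (std::size_t k = 1; k <= trail; ++k) {
            if ((buf[i + k] & 0xC0) != 0x80) return false; // not 10xxxxxx
        }
        sawMultiByte = true;
        i += trail + 1;
    }
    return sawMultiByte;
}
```

 With a check like this, a Latin-1 file containing a lone 0xE9 is rejected
and would be handled by whatever fallback importer exists, which is the
behaviour that needs testing before the patch goes in.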

> > Also, AW will hit an assertion failure (will crash) on any file with
> > 8-bit content that is not a valid utf8 sequence - that's incorrect
> > behaviour -
> >  UT_ASSERT(UT_SHOULD_NOT_HAPPEN);
> > should definitely be removed.
>
> Actually the if statements cover all 256 possible
> values. This is just extra protection in case the
> logic is broken. Should I replace the assert with
> a debug message or add an explanatory comment?

 Hmm, this is very confusing if the if's above cover all possible variants -
to me it looks like the following (that's why I thought the if's above don't
cover all possible variants):
        if (1==1) {
        } else {
                UT_ASSERT(UT_SHOULD_NOT_HAPPEN);
        }

 So a comment explaining the reason should be added there - IMO it's very
confusing without any comments.
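
 For illustration, a sketch of how such a commented defensive branch might
look. This is not the code from the patch: classifyByte and the ByteClass
enum are hypothetical names, the standard assert stands in for UT_ASSERT,
and treating 0xF8..0xFF as invalid is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical classification of a single byte by its role in UTF-8.
enum ByteClass { ASCII_BYTE, CONTINUATION, LEAD_2, LEAD_3, LEAD_4, INVALID };

ByteClass classifyByte(unsigned char c)
{
    if (c < 0x80) return ASCII_BYTE;    // 0x00..0x7F
    if (c < 0xC0) return CONTINUATION;  // 0x80..0xBF
    if (c < 0xE0) return LEAD_2;        // 0xC0..0xDF (0xC0/0xC1 only start
                                        // overlong forms; not rejected here)
    if (c < 0xF0) return LEAD_3;        // 0xE0..0xEF
    if (c < 0xF8) return LEAD_4;        // 0xF0..0xF7
    if (c <= 0xFF) return INVALID;      // 0xF8..0xFF

    // The tests above cover all 256 values of an unsigned char, so this
    // point is unreachable for ANY input, valid UTF-8 or not. The assert
    // only guards against someone breaking the ranges above in a later
    // edit; it can never fire because of the contents of a user's file.
    assert(!"classifyByte: unreachable, the ranges above are exhaustive");
    return INVALID;
}
```

 The comment answers the question a reader would otherwise ask: whether bad
input can trigger the assert. Malformed bytes get an ordinary return value
(INVALID here) and the assert stays purely a check on the programmer.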

> > Stylistic note: there is no need to surround the return value in
> > parentheses.
>
> Glad to hear. Was copying nearby code's style.

 You should have considered all files in the directory rather than all lines
in that file, and made a weighted decision :)

> Andrew.

 Best regards,
  -Vlad


