#390 Files wrongly shown as different (Unicode)

Status: closed-fixed
Owner: nobody
Priority: 3
Updated: 2009-03-23
Created: 2003-11-20
Creator: ganier
Private: No

File1.txt is an ANSI file (French codepage). It was opened
in Notepad and saved as Unicode to file2.txt.

Compare file1.txt and file2.txt.

Expected :
no difference

Currently :
WinMerge detects a difference on the first line.

Cause : it is a design problem :
Ansi file :
file -> saved as Ansi -> Ansi plugins -> diffutils
Unicode file :
file -> saved as UCS2-LE -> Unicode plugins
-> conversion to UTF-8 -> diffutils

Chars above 0x80 in the current codepage are coded
otherwise in UTF-8.
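
The mismatch described above can be reproduced outside WinMerge. The following is a minimal sketch, not WinMerge code: it assumes the "French codepage" is CP-1252 (as stated later in the thread) and uses a made-up sample string. It shows that the bytes reaching the comparison step differ even though the text is the same.

```python
# Hypothetical sample text; any string with characters above 0x7F works.
text = "déjà vu\n"

# ANSI path: the bytes stay in the Windows codepage (assumed CP-1252).
ansi_bytes = text.encode("cp1252")

# Unicode path: the file is stored as UCS-2 LE, then converted to UTF-8.
ucs2_bytes = text.encode("utf-16-le")
utf8_bytes = ucs2_bytes.decode("utf-16-le").encode("utf-8")

print(ansi_bytes)                # one byte per accented character
print(utf8_bytes)                # two bytes per accented character
print(ansi_bytes == utf8_bytes)  # a byte-wise comparison sees a difference
```

Pure ASCII text would produce identical bytes on both paths, which is why the problem only shows up on lines containing characters above 0x7F.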

Discussion

  • ganier

    ganier - 2003-11-20

    Logged In: YES
    user_id=804270

    Resolution postponed until after the next version of plugins.

     
  • ganier

    ganier - 2003-11-20
    • priority: 5 --> 3
    • status: open --> open-later
     
  • ganier

    ganier - 2003-11-20

    Logged In: YES
    user_id=804270

    Interesting comments in another patch :

    Laurent:

    Ansi files arrive in FileTransform_PreprocessA in Ansi format
    (during plugins call), and they leave
    FileTransform_PreprocessA in the same format.
    FileTransform_PreprocessW is for Unicode files, and there is a
    conversion UCS-2 -> UTF-8 as last step.

    Perry:

    I think that all files should wind up UTF-8 going to
    diffutils, even Ansi files, eg,

    Ansi file :
    file -> saved as Ansi -> Ansi plugins -> conversion
    to UTF-8 -> diffutils

    Unicode file :
    file -> saved as UCS2-LE -> Unicode plugins -> conversion
    to UTF-8 -> diffutils

    My thought is that if you compare an 8-bit (Ansi) version of
    a file to a Unicode version, diffutils needs to see them
    both in the same encoding, which means UTF-8, so both need
    to be put in UTF-8 before hitting diffutils.

    The easy fix is to convert everything to UTF-8 all the time
    going into diffutils. The more efficient solution is to only
    force 8-bit into UTF-8 when the following isn't true:
    m_rtbuff.m_encoding == m_lfbuff.m_encoding
    &&
    m_rtbuff.m_codepage == m_lfbuff.m_codepage

    I mean, if both input files are 8-bit and apparently the
    same codepage, we can omit the UTF-8 conversion. This may
    be the 99% of the time case :)
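
The rule proposed above might be sketched as follows. This is an illustration only, not the WinMerge implementation: the `Buffer` class, the `"8bit"` encoding tag, and the functions `to_utf8` and `prepare_for_diff` are hypothetical names, and the codepage is represented by a Python codec name.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical stand-in for one side of the comparison."""
    data: bytes
    encoding: str       # "8bit" or a Unicode codec name such as "utf-16-le"
    codepage: str = ""  # codec name of the codepage, used only for 8-bit data


def to_utf8(buf: Buffer) -> bytes:
    """Decode the buffer with its own encoding and re-encode it as UTF-8."""
    codec = buf.codepage if buf.encoding == "8bit" else buf.encoding
    return buf.data.decode(codec).encode("utf-8")


def prepare_for_diff(left: Buffer, right: Buffer) -> tuple[bytes, bytes]:
    """Return the byte strings that would be handed to the diff engine."""
    same = left.encoding == right.encoding and left.codepage == right.codepage
    if same and left.encoding == "8bit":
        # Both sides are 8-bit in the same codepage: skip the conversion.
        return left.data, right.data
    # Otherwise unify both sides in UTF-8 before comparing.
    return to_utf8(left), to_utf8(right)


ansi = Buffer("déjà vu\n".encode("cp1252"), "8bit", "cp1252")
uni = Buffer("déjà vu\n".encode("utf-16-le"), "utf-16-le")
a, b = prepare_for_diff(ansi, uni)
print(a == b)
```

The shortcut keeps the common case (two 8-bit files in the same codepage) free of conversion cost, while mixed pairs always meet in UTF-8.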

     
  • Anonymous

    Anonymous - 2003-11-22

    Logged In: YES
    user_id=60964

    > Chars above 0x80 in the current codepage are coded
    > otherwise in UTF-8.

    BTW, UTF-8 is a unicode based encoding, so it is invariant
    -- I mean, it has no relationship to "codepage".

    What is the french codepage ? Is that ISO-8859-1 (or 15), or
    is it CP-1252, or is it an MS-DOS page, or something that I
    don't know about :) ?
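
Why the exact codepage matters can be seen from a quick comparison. This is only an illustrative sketch using Python's built-in codecs for the candidates named above (with `cp850` assumed as the MS-DOS page): the same character maps to different byte values, or to none at all, depending on the codepage, while its UTF-8 form does not depend on any codepage.

```python
# Candidate 8-bit encodings; "cp850" is an assumed example of an MS-DOS page.
candidates = ["cp1252", "latin-1", "iso8859-15", "cp850"]


def encoded(ch: str, codec: str):
    """Return the encoded bytes as hex, or None if the codec lacks the character."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        return None


for ch in "é€":
    per_codepage = {codec: encoded(ch, codec) for codec in candidates}
    print(ch, per_codepage, "utf-8:", ch.encode("utf-8").hex())
```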

     
  • ganier

    ganier - 2003-11-22

    Logged In: YES
    user_id=804270

    French codepage is in fact just CP-1252.

    > Chars above 0x80 in the current codepage are coded
    > otherwise in UTF-8
    It was just a bad translation. Is there an adverb (instead of
    otherwise) to say that the chars above 0x80 have another
    value in UTF-8 ?

     
  • Anonymous

    Anonymous - 2003-11-22

    Logged In: YES
    user_id=60964

    Ok; in the other bug, I was misunderstanding your proposal,
    so I thought maybe you didn't understand encodings. Now, I
    understand your proposal (and want to adopt it anyway), so
    never mind -- the misunderstanding was mine :)

     
  • Kimmo Varis

    Kimmo Varis - 2004-01-30

    Logged In: YES
    user_id=631874

    I just tested this with latest experimental (2.1.5.7) and it
    detects files as different. Interesting though that
    word-diff shows "No difference" messagebox.

     
  • Kimmo Varis

    Kimmo Varis - 2004-04-29
    • labels: 559476 --> Unicode and text encoding
     
  • Kimmo Varis

    Kimmo Varis - 2004-04-29

    Logged In: YES
    user_id=631874

    Changing to new "Unicode" category.

     
  • elsapo

    elsapo - 2005-12-29

    Logged In: YES
    user_id=1195173

    Gee, first I'm going to have to try to clear up some of the
    morass of temp file handling, to even attack this -- it is a
    mess and very hard to follow :(

    So I'll probably post some cleanup patches on the way to
    getting a handle on the flow here...

     
  • elsapo

    elsapo - 2005-12-31

    Logged In: YES
    user_id=1195173

    I've set out to patch this several times, because I think
    that I could now make a patch for this (if I can avoid
    getting bitten by the problems with temp file handling, and
    I think I should be able to duck them), but

    - I've realized the problem is not limited to unicode files:
    two identical files encoded differently are not correctly
    unified in encoding (eg, the same rc file in two different
    encodings should be identical under RCLocalizationHelper,
    but will not be until the prediffer is fixed to unify the
    encodings)

    - I found three bugs in Quick Compare

    - I found two bugs in Guess Encoding

     
  • Kimmo Varis

    Kimmo Varis - 2008-08-04

    Logged In: YES
    user_id=631874
    Originator: NO

    Bug item #1185285 Cannot merge files with different EOLs
    http://winmerge.org/bug/1185285
    is probably related.

     
  • Matthias

    Matthias - 2008-12-30

    solved with patch 2477680

     
  • Kimmo Varis

    Kimmo Varis - 2009-03-23

    I'm not very happy about the fix. But I also want this item out of my view. :)

    I think it is time to get new bugs against current WinMerge versions for possibly related/remaining bugs. So closing this item as "fixed".

     
  • Kimmo Varis

    Kimmo Varis - 2009-03-23
    • status: open-later --> closed-fixed
     
