#390 Files wrongly shown as different (Unicode)

Status: closed-fixed

Owner: nobody

Labels: Unicode and text encoding (33)

Priority: 3

Updated: 2009-03-23

Created: 2003-11-20

Creator: ganier

Private: No

File1.txt is an Ansi file (French codepage). It was
opened in Notepad and saved as Unicode to file2.txt.

Compare file1.txt and file2.txt.

Expected :
no difference

Currently :
WinMerge detects a difference on the first line.

Cause : it is a design problem :
Ansi file :
file -> saved as Ansi -> Ansi plugins -> diffutils
Unicode file :
file -> saved as UCS2-LE -> Unicode plugins
-> conversion to UTF-8 -> diffutils

Chars above 0x80 in the current codepage are coded
otherwise in UTF-8.
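
To illustrate the cause, here is a minimal sketch in Python (not WinMerge code). It assumes the French codepage is CP-1252, as confirmed later in this thread, and uses a made-up first line containing one accented character. Python's codecs only stand in for the two conversion paths described above.

```python
# Illustration only: Python codecs stand in for WinMerge's internal conversions.
text = "café\n"  # hypothetical first line with one char above 0x80

# Ansi path: the bytes reach diffutils unchanged, in the codepage encoding.
ansi_bytes = text.encode("cp1252")

# Unicode path: the file is UCS-2 LE on disk, then converted to UTF-8.
ucs2_bytes = text.encode("utf-16-le")
utf8_bytes = ucs2_bytes.decode("utf-16-le").encode("utf-8")

print(ansi_bytes)                # b'caf\xe9\n'
print(utf8_bytes)                # b'caf\xc3\xa9\n'
print(ansi_bytes == utf8_bytes)  # False: a byte-wise diff sees a difference
```

The ASCII characters are identical on both paths; only the byte above 0x80 changes, which matches the difference being reported on the first line.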

Discussion

ganier - 2003-11-20

files 1&2.zip

ganier - 2003-11-20

Logged In: YES
user_id=804270

Resolution postponed until after the next version of plugins.

ganier - 2003-11-20

priority: 5 --> 3

status: open --> open-later

ganier - 2003-11-20

Logged In: YES
user_id=804270

Interesting comments in another patch :

Laurent:

Ansi files arrive in FileTransform_PreprocessA in Ansi format
(during plugins call), and they leave
FileTransform_PreprocessA in the same format.
FileTransform_PreprocessW is for Unicode files, and there is a
conversion UCS-2 -> UTF-8 as last step.

Perry:

I think that all files should wind up UTF-8 going to
diffutils, even Ansi files, eg,

Ansi file :
file -> saved as Ansi -> Ansi plugins -> conversion
to UTF-8 -> diffutils

Unicode file :
file -> saved as UCS2-LE -> Unicode plugins -> conversion
to UTF-8 -> diffutils

My thought is that if you compare an 8-bit (Ansi) version of
a file to a Unicode version, diffutils needs to see them
both in the same encoding, which means UTF-8, so both need
to be put in UTF-8 before hitting diffutils.

The easy fix is to convert everything to UTF-8 all the time
going into diffutils. The more efficient solution is to only
force 8-bit into UTF-8 when the following isn't true:
m_rtbuff.m_encoding == m_lfbuff.m_encoding
&&
m_rtbuff.m_codepage == m_lfbuff.m_codepage

I mean, if both input files are 8-bit and apparently the
same codepage, we can omit the UTF-8 conversion. This may
be the 99% of the time case :)
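
The proposal above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not WinMerge's actual code: `FileBuffer`, `to_utf8` and `prepare_for_diff` are hypothetical names, the `"ansi"` / `"ucs2le"` labels are invented for the example, and BOM handling is left out.

```python
from dataclasses import dataclass


@dataclass
class FileBuffer:
    """Hypothetical stand-in for m_lfbuff / m_rtbuff."""
    data: bytes
    encoding: str  # assumed labels: "ansi" or "ucs2le"
    codepage: str  # Python codec name such as "cp1252"; unused for "ucs2le"


def to_utf8(buf: FileBuffer) -> bytes:
    """Decode the buffer with its own encoding and re-encode it as UTF-8."""
    codec = "utf-16-le" if buf.encoding == "ucs2le" else buf.codepage
    return buf.data.decode(codec).encode("utf-8")


def prepare_for_diff(left: FileBuffer, right: FileBuffer):
    """Return the byte strings that would be handed to diffutils."""
    same = left.encoding == right.encoding and left.codepage == right.codepage
    if same and left.encoding == "ansi":
        # Both 8-bit with the same codepage: skip the UTF-8 conversion.
        return left.data, right.data
    # Otherwise put both sides in UTF-8 so they are comparable.
    return to_utf8(left), to_utf8(right)


ansi = FileBuffer("café\n".encode("cp1252"), "ansi", "cp1252")
unicode_file = FileBuffer("café\n".encode("utf-16-le"), "ucs2le", "")
a, b = prepare_for_diff(ansi, unicode_file)
print(a == b)  # True: both sides are now UTF-8
```

With two Ansi files in the same codepage the function returns the raw bytes untouched, which is the cheap path suggested for the common case.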

Anonymous - 2003-11-22

Logged In: YES
user_id=60964

> Chars above 0x80 in the current codepage are coded
> otherwise in UTF-8.

BTW, UTF-8 is a unicode based encoding, so it is invariant
-- I mean, it has no relationship to "codepage".

What is the french codepage ? Is that ISO-8859-1 (or 15), or
is it CP-1252, or is it an MS-DOS page, or something that I
don't know about :) ?

ganier - 2003-11-22

Logged In: YES
user_id=804270

French codepage is in fact just CP-1252.

> Chars above 0x80 in the current codepage are coded
> otherwise in UTF-8
It was just a bad translation. Is there an adverb (instead of
otherwise) to say that the chars above 0x80 have another
value in UTF-8 ?

Anonymous - 2003-11-22

Logged In: YES
user_id=60964

Ok; in the other bug, I was misunderstanding your proposal,
so I thought maybe you didn't understand encodings. Now, I
understand your proposal (and want to adopt it anyway), so
never mind -- the misunderstanding was mine :)

Kimmo Varis - 2004-01-30

Logged In: YES
user_id=631874

I just tested this with the latest experimental (2.1.5.7) and it
detects the files as different. Interestingly, though,
word-diff shows the "No difference" message box.

Kimmo Varis - 2004-04-29

labels: 559476 --> Unicode and text encoding

Kimmo Varis - 2004-04-29

Logged In: YES
user_id=631874

Changing to new "Unicode" category.

elsapo - 2005-12-29

Logged In: YES
user_id=1195173

Gee, first I'm going to have to try to clear up some of the
morass of temp file handling, to even attack this -- it is a
mess and very hard to follow :(

So I'll probably post some cleanup patches on the way to
getting a handle on the flow here...

elsapo - 2005-12-31

Logged In: YES
user_id=1195173

I've set out to patch this several times, because I think
that I could now make a patch for this (if I can avoid
getting bitten by the problems with temp file handling, and
I think I should be able to duck them), but

- I've realized the problem is not limited to unicode files:
two identical files encoded differently are not correctly
unified in encoding (eg, the same rc file in two different
encodings should be identical under RCLocalizationHelper,
but will not be until the prediffer is fixed to unify the
encodings)

- I found three bugs in Quick Compare

- I found two bugs in Guess Encoding

Kimmo Varis - 2008-08-04

Logged In: YES
user_id=631874
Originator: NO

Bug item #1185285 Cannot merge files with different EOLs
http://winmerge.org/bug/1185285
is probably related.

Matthias - 2008-12-30

solved with patch 2477680

Kimmo Varis - 2009-03-23

I'm not very happy about the fix. But I also want this item out of my view. :)

I think it is time to get new bugs against current WinMerge versions for possibly related/remaining bugs. So closing this item as "fixed".

Kimmo Varis - 2009-03-23

status: open-later --> closed-fixed