Help Debugging Invalid Unicode Characters

benblamey · Post by **benblamey** » Tue Nov 06, 2012 5:02 am

Hi,

The problem with my file is that it seems to contain byte sequences that are not valid UTF-8 (according to DiffMerge, that is).

I can read the offending file with the .NET stream reader, using UTF-8 and ASCII encodings, and I don't get an exception.

It would be helpful if, say, DiffMerge reported the offending byte sequence in hex; that way I could search for the hex string in a hex editor and fix the problem. With a large file, I have no information to go on! Thanks!

---------------------------
Error!
---------------------------
The file in the left panel:
<filename>
could not be imported.

Error: Cannot import file using given character encoding.
Details: Invalid bytes in file for character encoding 'Unicode 8 bit (UTF-8)'.
---------------------------
OK
---------------------------

Version 3.3.2 (1139) [x64]
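
Until the error message includes the offending bytes, a short script can produce the offset-plus-hex report asked for above. This is a minimal Python sketch and not part of DiffMerge; the names `find_invalid_utf8` and `report` are made up for illustration. It decodes the file as strict UTF-8 and records every byte run the decoder rejects. (The .NET `StreamReader` by default substitutes U+FFFD for invalid bytes instead of throwing, which would explain why no exception shows up there.)

```python
import codecs


def find_invalid_utf8(data: bytes):
    """Return a list of (offset, bad_bytes) for every run the UTF-8 decoder rejects."""
    hits = []

    def record(err):
        # err.start / err.end delimit the rejected bytes within `data`.
        hits.append((err.start, data[err.start:err.end]))
        # Substitute U+FFFD and resume decoding right after the bad bytes.
        return ("\ufffd", err.end)

    # Hypothetical handler name; re-registering it on each call is harmless.
    codecs.register_error("record_invalid_utf8", record)
    data.decode("utf-8", errors="record_invalid_utf8")
    return hits


def report(path: str, context: int = 8) -> None:
    """Print offset and hex for each invalid sequence, plus a few nearby bytes."""
    with open(path, "rb") as fh:
        data = fh.read()
    for offset, bad in find_invalid_utf8(data):
        nearby = data[max(0, offset - context): offset + len(bad) + context]
        print(f"offset {offset} (0x{offset:X}): bytes {bad.hex()}  context {nearby.hex()}")
```

Calling `report("path/to/file")` prints one line per bad sequence; the hex offset can be typed straight into a hex editor's "go to" box. If nearly every accented character shows up as a hit, the file is probably not UTF-8 at all but a single-byte encoding such as ISO-8859-1.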

jeffhostetler · Post by **jeffhostetler** » Tue Nov 06, 2012 1:17 pm

Yeah, that error message could be a little more helpful
and include the offending sequence. I'll log that.

Are the files UTF-8 with or without byte-order marks? Or are
they something else and just getting treated as if they
are UTF-8?
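
One way to answer the byte-order-mark question is to look at the first few bytes of the file. The sketch below is illustrative Python, not something DiffMerge provides, and `sniff_bom` is a hypothetical helper name. It relies only on the BOM constants in the standard `codecs` module.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(head: bytes):
    """Return the name of the BOM at the start of `head`, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

Passing in the first four bytes of the file is enough. A result of `None` only means there is no BOM; the file may still be UTF-8 without one, or some other encoding entirely.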

You might try temporarily modifying the Ruleset for that suffix and
setting it to always ask you, or to always assume ISO-8859-1 (Latin-1).
Then try to load the files and see if there are any differences and/or
broken UTF-8 multi-byte sequences. Often I've noticed that there is just
one bogus char, and that's enough to throw off the parse of the entire
file.

jeff
W7501

Chacapamac · Post by **Chacapamac** » Fri Sep 06, 2013 6:14 am

I can see that in all types of files if you have a French accent in a comment or in language files.
The problem is that when this happens, I’m a little scared to do anything (edit and save) to that file, as it may garble the entire file.

I’m not too sure what to do here.

jeffhostetler · Post by **jeffhostetler** » Tue Sep 10, 2013 5:42 am

If the source files are not UTF-8, but do contain French accent characters,
set the encoding to ISO-8859-1 (or whatever encoding / code page you
normally use) and DiffMerge should import it just fine. Internally, DiffMerge
works with Unicode and will, as it imports the file, convert it from whatever
encoding into Unicode for editing. It will display the encoding(s) that it
used in the status bar at the bottom of the window.

Then when you Save, DiffMerge will use *that* encoding to convert from
Unicode back to the original encoding before writing out the file. The only
caveat is to not add characters to the document that are not in that encoding.
For example, if you are using ISO-8859-1 (Latin-1), it is OK to add any of
the accented characters that appear in Latin-1, but don't add comments in
Russian or Greek.
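
The caveat above can be checked before saving. The following Python sketch is an illustration only and says nothing about how DiffMerge does the conversion internally; `unencodable_chars` is a hypothetical helper name. It lists every character in a piece of text that the target encoding cannot represent.

```python
def unencodable_chars(text: str, encoding: str = "iso-8859-1"):
    """Return (line, column, char) for each character `encoding` cannot represent."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_no, col, ch))
    return problems
```

An empty list means the text survives the round trip back to that encoding. Accented Latin-1 characters such as "é" pass, while Greek or Cyrillic letters are reported with their line and column.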

Hope this helps,

SourceGear Support
