WAD DESCRIBE under TCC 19 doesn't work with diacritics

Feb 1, 2010
15
0
#1
Hi Rex,

Coming back from 2010! There is still a problem with French accented characters (éèàôöùüû...) in file descriptions. The issue is similar to the one I explained here: https://jpsoft.com/forums/threads/describe-under-tcc-11-doesnt-work-with-diacritics.1731/ (that was for TCC 11).

Here is the scenario for this issue:

>TestFile
describe TestFile
Description de "D:\X\TestFile" : This is a test éèàçùôöûü [initial description manually typed here]
describe TestFile
Description de "D:\X\TestFile" : This is a test ÚÞÓþ¨¶÷¹³ [description not edited by hand!]
describe TestFile
Description de "D:\X\TestFile" : This is a test ┌ÌË■¿Â¸╣│
describe TestFile
Description de "D:\X\TestFile" : This is a test +╠╦ª┐┬©ªª
describe TestFile
Description de "D:\X\TestFile" : This is a test +ª-¬+-®¬¬
describe TestFile
Description de "D:\X\TestFile" : This is a test +¬-¼+-«¼¼
describe TestFile
Description de "D:\X\TestFile" : This is a test +¼-╝+-½╝╝
describe TestFile
Description de "D:\X\TestFile" : This is a test +╝-++-¢++
describe TestFile
Description de "D:\X\TestFile" : This is a test ++-++-ó++
describe TestFile
Description de "D:\X\TestFile" : This is a test ++-++-¾++
describe TestFile
Description de "D:\X\TestFile" : This is a test ++-++-¥++
describe TestFile
Description de "D:\X\TestFile" : This is a test ++-++-Ñ++
describe TestFile
Description de "D:\X\TestFile" : This is a test ++-++-Ð++

etc., etc... One can see that all characters with diacritics are progressively changed into "+" or "-" signs. What do you think?

I forgot to say that the description is modified each time the file is copied or moved.

Second issue with files which contain accented characters in their name:

>TestéèàFile
describe TestéèàFile
Description de "D:\X\TestéèàFile" : This is a test
describe TestéèàFile
Description de "D:\X\TestéèàFile" : [The description disappeared!]

In fact the description is indeed inside descript.ion after the first describe but it is not shown by the second describe command, so it will disappear if I simply hit Enter after that.

I guess these issues exist with each alphabet containing diacritics, not only the French one. Are you aware of that?

Bye,
Esteban
 

Charles Dye

Super Moderator
Staff member
May 20, 2008
3,427
40
Albuquerque, NM
prospero.unm.edu
#2
Fun with mojibake! Is TCC perhaps using one code page to write the DESCRIPT.ION file, and another code page to read it back?

I'd try opening the DESCRIPT.ION file in a text editor, then saving it back as UTF-16 with a Byte Order Mark.
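
That guess can be checked against Esteban's transcript with a short Python sketch. It assumes the description is written in Windows code page 1252 and read back in console code page 850 (the value his CHCP reports later in the thread). The function name `misread_once` is made up for illustration; this only imitates the suspected behaviour and is not TCC's code.

```python
def misread_once(text: str, write_cp: str = "cp1252", read_cp: str = "cp850") -> str:
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(write_cp).decode(read_cp)

original = "éèàçùôöûü"
first = misread_once(original)   # one write/read cycle
second = misread_once(first)     # a second cycle on the already damaged text

print(first)
print(second)
```

The two results match the second and third DESCRIBE lines of the transcript. A third cycle cannot be reproduced this way because box-drawing characters do not exist in code page 1252; Windows presumably substitutes look-alike characters such as "+" and "-" at that point, which would be consistent with the later lines.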
 
#3
It seems it's related to my answer in this posting here ...

https://jpsoft.com/forums/threads/for-reads-text-in-ascii.6818/#post-39387

At least on my system (Win 10 x64 Swiss German, TC 19.10.45 x64) and with the setting from the link above, I have no problems with your examples:

Code:
[D:\_Tests_]>afile1

[D:\_Tests_]describe afile1
Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü

[D:\_Tests_]describe afile1
Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü

[D:\_Tests_]describe afile1
Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü

[D:\_Tests_]
[D:\_Tests_]
[D:\_Tests_]
[D:\_Tests_]>TestéèàFile

[D:\_Tests_]DESCRIBE TestéèàFile
Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test

[D:\_Tests_]DESCRIBE TestéèàFile
Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test

[D:\_Tests_]DESCRIBE TestéèàFile
Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test

[D:\_Tests_]

So for me, it's definitely not a bug!

HTH
 
#4
Mmmm... Give me some time to analyze your answers and run some tests! Thanks anyway! :smile:

However, I jumped from TCC 16 (with which I was -- almost -- happy) to TCC 19 without making any change to the system. So something has changed in the meantime; I don't know exactly in which version. It is similar to what happened between TCC 11.0 build 40 and build 46: the code was changed somewhat and the issue with diacritics (another one) disappeared...

Esteban
 
May 29, 2008
521
3
Groton, CT
#6
Though I'm still using TCC v17, I'm going to guess that all you have to do is change the code page. Your default code page is probably 437; try
Code:
CHCP 1252
I put that into my TCSTART.BTM.
 

Charles Dye

#7
I can replicate Esteban's issue using only one DESCRIBE command, and without changing code pages:

Code:
C:\x>> TestFile

C:\x>describe TestFile
Describe "C:\x\TestFile" : This is a test ÉÈÀÇÙÔÖÛÜ

C:\x>dir /z

 Volume in drive C is Hard Drive  Serial number is 6218:b594
 Directory of  C:\x\*

.  <DIR>  3/14/16  9:39
..  <DIR>  3/14/16  9:39
TestFile  0  3/14/16  9:39 This is a test ╔╚└╟┘╘╓█▄
  0 bytes in 1 file and 2 dirs
  184,812,871,680 bytes free

C:\x>type /x DESCRIPT.ION
0000 0000 22 54 65 73 74 46 69 6c  65 22 20 54 68 69 73 20  "TestFile" This
0000 0010 69 73 20 61 20 74 65 73  74 20 c9 c8 c0 c7 d9 d4  is a test ÉÈÀÇÙÔ
0000 0020 d6 db dc 0d 0a  ÖÛÜ..

C:\x>
It seems TCC is writing the description file using the Windows code page (1252 in my case), but reading it back as the console code page (437) -- pretty clearly a bug.
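
The bytes in the hex dump above can be decoded both ways in a few lines of Python to see the mismatch. This is only a sketch: the byte string is copied from the `type /x` output, and the variable names are illustrative.

```python
# Accented bytes from the DESCRIPT.ION hex dump (offsets 0x1A through 0x22)
raw = bytes.fromhex("c9 c8 c0 c7 d9 d4 d6 db dc")

as_windows = raw.decode("cp1252")  # Windows (ANSI) code page
as_console = raw.decode("cp437")   # console (OEM) code page

print(as_windows)  # ÉÈÀÇÙÔÖÛÜ
print(as_console)  # ╔╚└╟┘╘╓█▄
```

The first line is what was typed; the second is what `dir /z` displayed.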

Code pages are not the solution. Code pages are the problem! Now watch this:

Code:
C:\y>option //unicodeoutput=yes

C:\y>> TestFile

C:\y>describe TestFile
Describe "C:\y\TestFile" : This is a test ÉÈÀÇÙÔÖÛÜ

C:\y>dir /z

 Volume in drive C is Hard Drive  Serial number is 6218:b594
 Directory of  C:\y\*

.  <DIR>  3/14/16  9:41
..  <DIR>  3/14/16  9:41
TestFile  0  3/14/16  9:41 This is a test ÉÈÀÇÙÔÖÛÜ
  0 bytes in 1 file and 2 dirs
  184,812,871,680 bytes free

C:\y>type /x DESCRIPT.ION
0000 0000 22 00 54 00 65 00 73 00  74 00 46 00 69 00 6c 00  " T e s t F i l
0000 0010 65 00 22 00 20 00 54 00  68 00 69 00 73 00 20 00  e "  T h i s
0000 0020 69 00 73 00 20 00 61 00  20 00 74 00 65 00 73 00  i s  a  t e s
0000 0030 74 00 20 00 c9 00 c8 00  c0 00 c7 00 d9 00 d4 00  t  É È À Ç Ù Ô
0000 0040 d6 00 db 00 dc 00 0d 00  0a 00  Ö Û Ü . .

C:\y>
Undocumented behavior: New DESCRIPT.ION files are created according to the UnicodeOutput directive.
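
A small Python sketch of why the UTF-16 file avoids the problem: the text is stored as code units that do not depend on any code page. It assumes the file is UTF-16 little-endian without a byte order mark, which is what the dump above appears to show; that is an inference from the hex, not documented behavior.

```python
line = '"TestFile" This is a test ÉÈÀÇÙÔÖÛÜ\r\n'

data = line.encode("utf-16-le")   # two bytes per character, no BOM
print(data[:8].hex(" "))          # 22 00 54 00 65 00 73 00
print(len(data))                  # 74, i.e. the dump ends at offset 0x49

# Decoding needs no code page, so the round trip is lossless
roundtrip = data.decode("utf-16-le")
print(roundtrip == line)          # True
```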

In my arrogant opinion, new DESCRIPT.ION files should always be created as UTF-16, regardless of UnicodeOutput. Floppies are dead! Ditto redirection to the clipboard, which suffers from similar issues.
 
#8
It seems TCC is writing the description file using the Windows code page (1252 in my case), but reading it back as the console code page (437) -- pretty clearly a bug.

Code pages are not the solution. Code pages are the problem!
Okay, if it's so, then I agree that it's a bug.

Nevertheless, not only because of this but also because of similar issues going to and from the console, I changed my SYSTEM code page, not just TCC's (I don't know whether there is a difference), from the "Swiss German Win 10" default of 850 to 1252. For details, see my link above... Also, I do NOT use the Unicode Output option in TCC. Since THEN I have never had a code page problem with TCMD/TCC or even the system console (prompt).

Okay, it's not Unicode support. But at least I have a workaround for problems such as ASCII files with localized characters (generated through redirection or otherwise) passing from the console to "real" Windows programs or vice versa. And IF I have to use old .BAT files or the like written in code page 437, it should not be a problem, because I do not know of ANY such file with COMMANDS outside plain English ("low" ASCII), so the command parser has no trouble. Current files of that kind (.CMD files, console installer scripts and so on) are probably created in code page 1252 (with a "real" Windows program) anyway and no longer in 437 (as far as I have seen). At least I have NEVER had a problem with this configuration so far.

PS: Sorry, I hope you can read AND UNDERSTAND my text; it's a bit difficult for me to explain in English :-)
 
#9
Thanks for all your suggestions.

I admit I have not had enough time to test them, but what troubles me is that TCC worked up to a point (for instance it was OK at v16) and then started to screw up from some release, I don't know which one. A similar situation appeared somewhere in v11, if I remember correctly. When I reported the problem, I guess Rex quietly corrected the issue between v11.0 build 40 and build 46. That was good, but I never got an explanation at the time.

Why should I have to change code pages or anything else in the system when things were OK before? I have tons of descriptions on tens of thousands of files dating back to 4DOS v3 (yes sir!), with diacritics in both file names and descriptions. Some of them have already been destroyed or erased; I cannot imagine using TCC even to copy a file until the issue is corrected.

Hope someone will understand my viewpoint. And sorry also for my bad English!
 

rconn

Administrator
Staff member
May 14, 2008
10,205
86
#10
WAD -- this is a Windows (and user-configuration) issue, not TCC.

Windows has two (incompatible) ways of converting ASCII files to Unicode (and everything internal in Windows is Unicode) - locale based (GUI apps) and codepage based (console apps). The extended diacritical ASCII characters only exist in a few codepages, and not the one you're using.

There are three workarounds for this problem:
  1. Use the NTFSDescriptions option instead of the obsolete DESCRIPT.ION file. This works with all third-party apps, is much, much faster than the DESCRIPT.ION file, and is 100% reliable because it's all Unicode and there's no Unicode->ASCII->Unicode conversion involved.
  2. Use a Unicode DESCRIPT.ION file. This is also 100% reliable regardless of the locale or codepage in use.
  3. Use the correct codepage & locale (and Unicode font) combination. This requires a bit of configuration on your end, and will still cause problems if you switch locales or codepages.
 
#11
Hi Rex,
  1. NTFS descriptions are not compatible with FAT32, which I still use for compatibility. I believe XnView and a few other applications and batch files won't work either. What about my NAS (Synology)? I have many, many files with classic descriptions; even if it should be easy to convert them, I am too afraid of losing all my work. All files on all my PCs and my NAS would have to be converted at once to avoid risky mixes, and that gives me cold sweats... Not to mention the time lost making these changes.
  2. Unicode descriptions: well, perhaps. But what if normal and Unicode descriptions collide? For example, when copying files with Unicode descriptions to a directory with standard descriptions, or conversely? And what about files containing accented letters in their names?
  3. Changing code pages could be risky with other applications or even Windows, I prefer not to try.
That said, I still don't understand why you changed TCC's behavior: it was nearly OK with v16 and KO with v19. I did not test all versions... For all foreign users who need accented letters it's a real pain. Can't you go back on this specific change? You were already able to do something similar between TCC 11.0 build 40 and build 46...

Regards
 
#12
Changing code pages could be risky with other applications or even Windows, I prefer not to try.
There is no need to change the code page for "real" Windows programs; only the OEMCP (console code page) needs to change. And did you know that even this code page is not always the same BY DEFAULT? Example: Windows (10) American English probably uses OEMCP 437, Windows Swiss German 850. I believe this was already the case in Windows 7 (possibly even in XP or earlier).

And as I said, commands in batch files and the like should not differ between 437, 850 and 1252.

By changing the OEMCP to 1252 you get the same code page in the console as in Windows programs, that's all.

That said, I still don't understand why you changed TCC's behavior: it was nearly OK with v16 and KO with v19. I did not test all versions... For all foreign users who need accented letters it's a real pain. Can't you go back on this specific change? You were already able to do something similar between TCC 11.0 build 40 and build 46...
I really do not believe that Rex has changed this! If I remember correctly, the console has had a different code page from the rest of Windows for a long time (always?); see my words above. And you should know: TCC takes its code page from the Windows console; it has no code page of its own!

So please, no changes here: that would break compatibility with the Windows console, which is highly undesirable!
 
#13
Well, I did not say that the change was deliberate, but it may have been caused by another modification. There is no magic here. Just explain to me why TCC 16 works and TCC 19 doesn't. The CHCP command returns 850 in both versions. And I am using the same OS, Win 7 Pro x64 French.

I also tested TCC 19 under Win 10 Home x64 French on a brand new Dell laptop, just unpacked: same issue... So maybe this is not a bug, but at least there was a change which triggered unwanted behavior. Has anyone from JP tested TCC 19 under a French Windows? VirtualBox or VMware are good alternatives.

Now some positive words: Alpengreis and Dcantor, you are right. CP 1252 works great! At least so far (for a couple of hours). Wikipedia says it replaces the old CP 850 for Western European languages using the Latin alphabet. Duly noted. So I added CHCP 1252 to tcstart.btm and all my problems flew away. Magic! I hope I will not have unpleasant surprises in the future...

Thanks to all of you, guys!

Edit: Hmmm, bad news: ALL my old descriptions are incompatible with CP 1252; they have to be edited!!! Ouch, another pain in the ass...
 
#14
Try "iconv" in Cygwin.

n.b. I haven't tried it.

Code:
iconv -f CP850 -t CP1252 descript.ion > newdescript.ion
Here's the man page.

Code:
ICONV(1)                      Linux Programmer's Manual                      ICONV(1)

NAME
       iconv - character set conversion

SYNOPSIS
       iconv [OPTION...] [-f encoding] [-t encoding] [inputfile ...]
       iconv -l

DESCRIPTION
       The  iconv  program converts text from one encoding to another encoding.  More
       precisely, it converts from the encoding given for the -f option to the encod‐
       ing  given for the -t option. Either of these encodings defaults to the encod‐
       ing of the current locale. All the inputfiles are read and converted in  turn;
       if  no  inputfile  is given, the standard input is used. The converted text is
       printed to standard output.

       The encodings permitted are system dependent. For the libiconv implementation,
       they are listed in the iconv_open(3) manual page.

       Options controlling the input and output format:

       -f encoding, --from-code=encoding
              Specifies the encoding of the input.

       -t encoding, --to-code=encoding
              Specifies the encoding of the output.

       Options controlling conversion problems:

       -c     When  this  option  is  given,  characters that cannot be converted are
              silently discarded, instead of leading to a conversion error.

       --unicode-subst=formatstring
              When this option is given, Unicode characters  that  cannot  be  repre‐
              sented  in  the  target encoding are replaced with a placeholder string
              that is constructed from the given formatstring, applied to the Unicode
              code point. The formatstring must be a format string in the same format
              as for the printf command or the printf() function,  taking  either  no
              argument or exactly one unsigned integer argument.

       --byte-subst=formatstring
              When this option is given, bytes in the input that are not valid in the
              source encoding are replaced with a placeholder  string  that  is  con‐
              structed  from the given formatstring, applied to the byte's value. The
              formatstring must be a format string in the  same  format  as  for  the
              printf  command  or the printf() function, taking either no argument or
              exactly one unsigned integer argument.

       --widechar-subst=formatstring
              When this option is given, wide characters in the input  that  are  not
              valid  in  the  source  encoding are replaced with a placeholder string
              that is constructed from the given formatstring, applied to the  byte's
              value.  The  formatstring must be a format string in the same format as
              for the printf command or the printf() function, taking either no argu‐
              ment or exactly one unsigned integer argument.

       Options controlling error output:

       -s, --silent
              When  this  option is given, error messages about invalid or unconvert‐
              ible characters are omitted, but the actual  converted  text  is  unaf‐
              fected.

       The  iconv  -l or iconv --list command lists the names of the supported encod‐
       ings, in a system dependent format. For the libiconv implementation, the names
       are  printed  in  upper  case,  separated by whitespace, and alias names of an
       encoding are listed on the same line as the encoding itself.

EXAMPLES
       iconv -f ISO-8859-1 -t UTF-8
              converts input from the old West-European encoding ISO-8859-1  to  Uni‐
              code.

       iconv -f KOI8-R --byte-subst="<0x%x>"
                       --unicode-subst="<U+%04X>"
              converts  input  from  the  old  Russian  encoding KOI8-R to the locale
              encoding, substituting an angle bracket notation with hexadecimal  num‐
              bers for invalid bytes and for valid but unconvertible characters.

       iconv --list
              lists the supported encodings.

CONFORMING TO
       POSIX:2001

SEE ALSO
       iconv_open(3), locale(7)

GNU                                 March 31, 2007                           ICONV(1)
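
If Cygwin is not available, the same conversion can be sketched in Python. This is a minimal illustration, not a vetted tool: the function name `convert_descriptions` is hypothetical, it assumes the old file really is CP850 throughout, and characters with no CP1252 equivalent (box-drawing characters, for example) are replaced with `?`. Work on a copy of DESCRIPT.ION.

```python
def convert_descriptions(data: bytes, src: str = "cp850", dst: str = "cp1252") -> bytes:
    """Re-encode DESCRIPT.ION bytes from one code page to another."""
    text = data.decode(src)
    # errors="replace" substitutes "?" for characters missing from dst
    return text.encode(dst, errors="replace")

# Sample line as an old CP850 file would store it
old = '"TestFile" This is a test éèàçùôöûü\r\n'.encode("cp850")
new = convert_descriptions(old)

print(new.decode("cp1252"))
```

For a real file, the bytes would be read with `open(path, "rb")` and the result written to a new file, much like the redirection in the iconv command above.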
 
#15
Well, I did not say that the change was deliberate, but it may have been caused by another modification. There is no magic here. Just explain to me why TCC 16 works and TCC 19 doesn't. The CHCP command returns 850 in both versions. And I am using the same OS, Win 7 Pro x64 French.
I can only explain the following:

IF the console uses CP 437 or 850 AND Windows uses 1252, you will ALWAYS have a problem if you create a document with umlauts in Windows and display it in the console, or vice versa!

I also tested TCC 19 under Win 10 Home x64 French on a brand new Dell laptop, just unpacked: same issue... So maybe this is not a bug, but at least there was a change which triggered unwanted behavior.
No, this is normal, as it should be, and fully logical. It was the same here before I changed the CP for the console. This is how it works and it is WAD! Sorry that I cannot give another answer ;-)

If this behaviour was different on your earlier computers, then you had a different configuration. I still do not believe that JP has changed this because, as I said, TCC takes the CP from the Windows console. And that is right and important for compatibility with the Windows console (prompt)!

You could do a fresh setup of, for example, my OS version on 100 000 computers. I'm sure that if you choose "Swiss German" (for example), you will always get the same behaviour (the problem with umlauts) on every single computer! Well, I would nevertheless not bet real money on it ;-)

Now some positive words: Alpengreis and Dcantor, you are right. CP 1252 works great! At least so far (for a couple of hours). Wikipedia says it replaces the old CP 850 for Western European languages using the Latin alphabet. Duly noted. So I added CHCP 1252 to tcstart.btm and all my problems flew away. Magic! I hope I will not have unpleasant surprises in the future...

Thanks to all of you, guys!
Sounds good, but it is not enough to avoid the problems in the Windows console as well. For that you need a "proper" solution (see my link).

Edit: Hmmm, bad news: ALL my old descriptions are incompatible with CP 1252; they have to be edited!!! Ouch, another pain in the ass...
This cannot be avoided if you change the CP. To fix such files you could possibly use the tip from JohnQSmith to convert them automatically (other tools for this purpose may exist too). Or you take another workaround (see the answer from Rex). Just note: if you apply NO workaround, you will probably have to live with this problem until Windows itself changes the CP for the console (and in that case you would have the incompatibility problem with your old DESCRIPT.ION files too)!

Anyway, I hope you can solve your problem!