WAD DESCRIBE under TCC 19 doesn't work with diacritics

Discussion in 'Support' started by Esteban, Mar 12, 2016.

  1. Esteban

    Joined:
    Feb 1, 2010
    Messages:
    15
    Likes Received:
    0
    Hi Rex,

    Coming back from 2010! There is still a problem with French accented characters (éèàôöùüû...) in file descriptions. The issue is similar to the one I explained here: https://jpsoft.com/forums/threads/describe-under-tcc-11-doesnt-work-with-diacritics.1731/ (that was for TCC 11).

    Here is the scenario for this issue:

    >TestFile
    describe TestFile
    Description de "D:\X\TestFile" : This is a test éèàçùôöûü [initial description manually typed here]
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ÚÞÓþ¨¶÷¹³ [description not edited by hand!]
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ┌ÌË■¿Â¸╣│
    describe TestFile
    Description de "D:\X\TestFile" : This is a test +╠╦ª┐┬©ªª
    describe TestFile
    Description de "D:\X\TestFile" : This is a test +ª-¬+-®¬¬
    describe TestFile
    Description de "D:\X\TestFile" : This is a test +¬-¼+-«¼¼
    describe TestFile
    Description de "D:\X\TestFile" : This is a test +¼-╝+-½╝╝
    describe TestFile
    Description de "D:\X\TestFile" : This is a test +╝-++-¢++
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ++-++-ó++
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ++-++-¾++
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ++-++-¥++
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ++-++-Ñ++
    describe TestFile
    Description de "D:\X\TestFile" : This is a test ++-++-Ð++

    etc., etc... One can see that all characters with diacritics are progressively changed into "+" or "-" signs. What do you think?

    I forgot to say that the description is modified each time the file is copied or moved.
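The first two steps of the progression above are what you get when the description is written in one code page and read back in another on every pass. A minimal Python sketch, assuming the file is written as Windows-1252 and read back as CP 850 (an assumption inferred from the output shown, not confirmed TCC behaviour); the function names are hypothetical:

```python
def misread(text, written_as="cp1252", read_as="cp850"):
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


def degrade(text, passes):
    """Apply the write/read mismatch repeatedly and record each result."""
    results = []
    for _ in range(passes):
        text = misread(text)
        results.append(text)
    return results
```

Calling `degrade("éèàçùôöûü", 2)` gives the second and third descriptions listed above. A third pass raises `UnicodeEncodeError` in Python because `┌` has no Windows-1252 encoding; Windows presumably substitutes a best-fit character such as `+` at that point, which would explain the later lines, but that part is a guess.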

    Second issue with files which contain accented characters in their name:

    >TestéèàFile
    describe TestéèàFile
    Description de "D:\X\TestéèàFile" : This is a test
    describe TestéèàFile
    Description de "D:\X\TestéèàFile" : [The description disappeared!]

    In fact the description is indeed inside descript.ion after the first describe but it is not shown by the second describe command, so it will disappear if I simply hit Enter after that.

    I guess these issues exist with every alphabet that uses diacritics, not only the French one. Are you aware of that?

    Bye,
    Esteban
     
  2. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,287
    Likes Received:
    39
    Fun with mojibake! Is TCC perhaps using one code page to write the DESCRIPT.ION file, and another code page to read it back?

    I'd try opening the DESCRIPT.ION file in a text editor, then saving it back as UTF-16 with a Byte Order Mark.
     
  3. Alpengreis

    Joined:
    Jan 12, 2014
    Messages:
    228
    Likes Received:
    6
    It seems it's related to my answer in this posting here ...

    https://jpsoft.com/forums/threads/for-reads-text-in-ascii.6818/#post-39387

    At least on my system (Win 10 x64 Swiss German, TC 19.10.45 x64), with the setting from the link above, I have no problems with your examples:

    Code:
    [D:\_Tests_]>afile1
    
    [D:\_Tests_]describe afile1
    Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü
    
    [D:\_Tests_]describe afile1
    Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü
    
    [D:\_Tests_]describe afile1
    Beschreibung "D:\_Tests_\afile1" : éèàçùôöûü
    
    [D:\_Tests_]
    [D:\_Tests_]
    [D:\_Tests_]
    [D:\_Tests_]>TestéèàFile
    
    [D:\_Tests_]DESCRIBE TestéèàFile
    Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test
    
    [D:\_Tests_]DESCRIBE TestéèàFile
    Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test
    
    [D:\_Tests_]DESCRIBE TestéèàFile
    Beschreibung "D:\_Tests_\TestéèàFile" : This is a Test
    
    [D:\_Tests_]

    So for me, it's definitely not a bug!

    HTH
     
  4. Esteban

    Mmmm... Give me some time to analyze your answers and run some tests! Thanks anyway! :smile:

    However, I jumped from TCC 16 (with which I was -- almost -- happy) to TCC 19 without making any change to the system. So something changed in between; I don't know exactly in which version. It is similar to what happened between TCC 11.0 build 40 and build 46: the code was changed somewhat and the issue with diacritics (a different one) disappeared...

    Esteban
     
  5. Christian Albaret

    Joined:
    Jul 1, 2008
    Messages:
    154
    Likes Received:
    1
    Same here, in French too. I have only just discovered the problem: I don't actually use DESCRIBE that much, and when I do I write in English.
    I am not going to change the code pages; that could interfere with other people in the company.
     
  6. dcantor

    Joined:
    May 29, 2008
    Messages:
    507
    Likes Received:
    3
    Though I'm still using TCC v17, I'm going to guess that all you have to do is change the code page. Your default code page is probably 437; try
    Code:
    CHCP 1252
    
    I put that into my TCSTART.BTM.
     
  7. Charles Dye

    I can replicate Esteban's issue using only one DESCRIBE command, and without changing code pages:

    Code:
    C:\x>> TestFile
    
    C:\x>describe TestFile
    Describe "C:\x\TestFile" : This is a test ÉÈÀÇÙÔÖÛÜ
    
    C:\x>dir /z
    
     Volume in drive C is Hard Drive  Serial number is 6218:b594
     Directory of  C:\x\*
    
    .  <DIR>  3/14/16  9:39
    ..  <DIR>  3/14/16  9:39
    TestFile  0  3/14/16  9:39 This is a test ╔╚└╟┘╘╓█▄
      0 bytes in 1 file and 2 dirs
      184,812,871,680 bytes free
    
    C:\x>type /x DESCRIPT.ION
    0000 0000 22 54 65 73 74 46 69 6c  65 22 20 54 68 69 73 20  "TestFile" This
    0000 0010 69 73 20 61 20 74 65 73  74 20 c9 c8 c0 c7 d9 d4  is a test ÉÈÀÇÙÔ
    0000 0020 d6 db dc 0d 0a  ÖÛÜ..
    
    C:\x>
    
    It seems TCC is writing the description file using the Windows code page (1252 in my case), but reading it back as the console code page (437) -- pretty clearly a bug.
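The mismatch can be checked directly against the hex dump. The sketch below takes the nine description bytes from the `TYPE /X` output and decodes them in both code pages; the helper name is hypothetical:

```python
# Description bytes copied from the TYPE /X dump (c9 c8 c0 c7 d9 d4 d6 db dc).
RAW = bytes.fromhex("c9 c8 c0 c7 d9 d4 d6 db dc")


def decode_both(raw):
    """Decode the same bytes as Windows-1252 and as OEM code page 437."""
    return {"cp1252": raw.decode("cp1252"), "cp437": raw.decode("cp437")}
```

`decode_both(RAW)["cp1252"]` is the text that was typed, and `decode_both(RAW)["cp437"]` is the row of box-drawing characters that `DIR /Z` displayed.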

    Code pages are not the solution. Code pages are the problem! Now watch this:

    Code:
    C:\y>option //unicodeoutput=yes
    
    C:\y>> TestFile
    
    C:\y>describe TestFile
    Describe "C:\y\TestFile" : This is a test ÉÈÀÇÙÔÖÛÜ
    
    C:\y>dir /z
    
     Volume in drive C is Hard Drive  Serial number is 6218:b594
     Directory of  C:\y\*
    
    .  <DIR>  3/14/16  9:41
    ..  <DIR>  3/14/16  9:41
    TestFile  0  3/14/16  9:41 This is a test ÉÈÀÇÙÔÖÛÜ
      0 bytes in 1 file and 2 dirs
      184,812,871,680 bytes free
    
    C:\y>type /x DESCRIPT.ION
    0000 0000 22 00 54 00 65 00 73 00  74 00 46 00 69 00 6c 00  " T e s t F i l
    0000 0010 65 00 22 00 20 00 54 00  68 00 69 00 73 00 20 00  e "  T h i s
    0000 0020 69 00 73 00 20 00 61 00  20 00 74 00 65 00 73 00  i s  a  t e s
    0000 0030 74 00 20 00 c9 00 c8 00  c0 00 c7 00 d9 00 d4 00  t  É È À Ç Ù Ô
    0000 0040 d6 00 db 00 dc 00 0d 00  0a 00  Ö Û Ü . .
    
    C:\y>
    
    Undocumented behavior: New DESCRIPT.ION files are created according to the UnicodeOutput directive.
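Note that the dump above starts with `22 00`, so this file is UTF-16LE without a Byte Order Mark. A reader therefore has to guess the encoding. The following is only a heuristic sketch of how a script might read either kind of file; it is not how TCC decides, and the function name and the `legacy_codepage` default are assumptions:

```python
import codecs


def read_descriptions(raw, legacy_codepage="cp850"):
    """Decode DESCRIPT.ION bytes, guessing between UTF-16LE and a code page."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        # Explicit BOM: skip it and decode the rest as UTF-16LE.
        return raw[2:].decode("utf-16-le")
    if len(raw) % 2 == 0 and raw[1:2] == b"\x00":
        # Heuristic: an ASCII first character in UTF-16LE is followed by NUL.
        return raw.decode("utf-16-le")
    return raw.decode(legacy_codepage)
```

The NUL test fails for a file whose first character is outside Latin-1, which is one reason a BOM is the safer choice.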

    In my arrogant opinion, new DESCRIPT.ION files should always be created as UTF-16, regardless of UnicodeOutput. Floppies are dead! Ditto redirection to the clipboard, which suffers from similar issues.
     
  8. Alpengreis

    Okay, if that's so, then I agree that it's a bug.

    Nevertheless, not only because of this but also because of other similar problems going to and from the console, I changed my SYSTEM code page, not only TCC's (I don't know whether there is a difference), from the "Swiss German Win 10" default of 850 to 1252. For details see my link above... Also, I do NOT use the Unicode Output option in TCC. Since then I have never had a code page problem with TCMD/TCC or even the system console (command prompt).

    Okay, that is not Unicode support. But at least I have a workaround for problems such as ASCII files generated (through redirection or whatever) with localized characters going from the console to "real" Windows programs or vice versa. And IF I have to use old .BAT files or the like written in code page 437, that should not be a problem, because I do not know of ANY such file with COMMANDS outside the English ("low" ASCII) range, so the command parser is unaffected. And current files of that kind (.CMD or installer files for the console, or something like that) are probably created in code page 1252 (with a "real" Windows program) anyway and no longer in 437 (from what I have seen so far). At least I have NEVER had a problem with this configuration so far.

    PS: Sorry, I hope you can read AND UNDERSTAND my text; it's a bit difficult for me to explain in English :-)
     
  9. Esteban

    Thanks for all your suggestions.

    I admit I have not had enough time to test them, but what troubles me is that TCC worked up to a point (for instance it was OK at v16) and then started to screw up from some release onward; I don't know which one. A similar situation appeared somewhere in v11, if I remember correctly. When I reported the problem, I guess Rex quietly corrected the issue between v11.0 build 40 and build 46. Good, but I never got an explanation at the time.

    Why should I have to change code pages or anything else in the system when things were OK before? I have tons of descriptions on tens of thousands of files going back to 4DOS v3 (yes sir!), with diacritics both in file names and in descriptions. Some of them have already been destroyed or erased; I cannot imagine using TCC even to copy a file until the issue is corrected.

    I hope someone will understand my viewpoint. And sorry also for my bad English!
     
  10. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,813
    Likes Received:
    82
    WAD -- this is a Windows (and user-configuration) issue, not TCC.

    Windows has two (incompatible) ways of converting ASCII files to Unicode (and everything internal in Windows is Unicode) - locale based (GUI apps) and codepage based (console apps). The extended diacritical ASCII characters only exist in a few codepages, and not the one you're using.

    There are three workarounds for this problem:
    1. Use the NTFSDescriptions option instead of the obsolete DESCRIPT.ION file. This works with all third-party apps, is much, much faster than the DESCRIPT.ION file, and is 100% reliable because it's all Unicode and there's no Unicode->ASCII->Unicode conversion involved.
    2. Use a Unicode DESCRIPT.ION file. This is also 100% reliable regardless of the locale or codepage in use.
    3. Use the correct codepage & locale (and Unicode font) combination. This requires a bit of configuration on your end, and will still cause problems if you switch locales or codepages.
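For the second workaround, an existing description file has to be re-encoded once. A minimal sketch, assuming the old file was written in CP 850 and that a UTF-16LE file with a Byte Order Mark (as suggested earlier in the thread) is what TCC will accept; both are assumptions to verify on a copy first, and the function name is hypothetical:

```python
import codecs


def to_utf16_with_bom(raw, legacy_codepage="cp850"):
    """Re-encode legacy single-byte description data as UTF-16LE with a BOM."""
    text = raw.decode(legacy_codepage)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```

In practice one would read the bytes of a backup copy of DESCRIPT.ION, pass them through this function, and write the result to a new file rather than overwriting the original in place.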
     
  11. Esteban

    Hi Rex,
    1. NTFS descriptions are not compatible with FAT32, which I still use for compatibility. I believe XnView and a few other applications and batch files won't work with them either. What about my NAS (Synology)? I have many, many files with classic descriptions; even if it should be easy to convert them, I am too afraid of losing all my work. All files on all my PCs and my NAS would have to be converted at once to avoid risky mixes, which gives me cold sweats... Not to mention the time lost making these changes.
    2. Unicode descriptions: well, perhaps. But what if normal and Unicode descriptions collide? For example, when copying files with Unicode descriptions to a directory with standard descriptions, or the reverse? And what about files containing accented letters in their names?
    3. Changing code pages could be risky with other applications or even Windows; I prefer not to try.
    That said, I still don't understand why you changed TCC's behavior: it was nearly OK with v16 and is broken with v19. I did not test all versions... For all foreign users who need accented letters it's a real pain. Can't you go back on this specific change? You were already able to do something similar between TCC 11.0 build 40 and build 46...

    Regards
     
  12. Alpengreis

    No need to change the code page for "real" Windows programs; only the OEMCP (the console code page) needs to change. And did you know that even this code page is not always the same BY DEFAULT? Example: Windows (10) American English is probably OEMCP 437, Windows Swiss German 850. I believe this was already the case in Windows 7 at least (possibly even in XP or earlier).

    And as I said, commands in batch files and the like should not differ between 437, 850 and 1252.

    With the OEMCP changed to 1252 you have the same code page in the console as for Windows programs, that's all.
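To see which code pages are actually in effect on a given machine, the Win32 functions `GetACP`, `GetOEMCP` and `GetConsoleOutputCP` can be queried. A small sketch using Python's `ctypes`; the wrapper name is hypothetical, and it returns `None` anywhere other than Windows:

```python
import ctypes
import sys


def windows_code_pages():
    """Return the ANSI, OEM and console output code pages, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),  # used by "real" Windows programs
        "oem": kernel32.GetOEMCP(),  # system default for consoles
        "console_output": kernel32.GetConsoleOutputCP(),  # what CHCP reports
    }
```

If `ansi` and `console_output` differ, text written under one and read under the other will show the kind of corruption discussed in this thread.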

    I really do not believe that Rex has changed this! If I remember correctly, the console has had a different code page in Windows for a long time (always?), as I said above. And you should know: TCC takes its code page from the Windows console; it is not TCC's own code page!

    So, please, no changes here: that would break compatibility with the Windows console, which is highly undesirable!
     
  13. Esteban

    Well, I did not say that the change was deliberate, but it may have been caused by another modification. There is no magic here. Just explain to me why TCC 16 works and TCC 19 doesn't. The CHCP command returns 850 in both versions. And I am using the same OS, Win 7 Pro x64 French.

    I also tested TCC 19 under Win 10 Home x64 French on a brand-new, just-unpacked Dell laptop: same issue... So maybe this is not a bug, but at least there was a change which triggered unwanted behavior. Has someone from JP tested TCC 19 under a French Windows? VirtualBox or VMware are good options for that.

    Now some positive words: Alpengreis and Dcantor, you are right. CP 1252 works great! At least so far (for a couple of hours). Wikipedia says it replaces the old CP 850 for Western European languages using the Latin alphabet. Duly noted. So I added CHCP 1252 to tcstart.btm and all my problems flew away. Magic! I hope I will not have unpleasant surprises in the future...

    Thanks to you all, guys!

    Edit: Hmmm, bad news, ALL my old descriptions are not compatible with CP 1252, they have to be edited!!! Ouch, another pain in the ass...
     
    #13 Esteban, Mar 20, 2016
    Last edited: Mar 20, 2016
  14. JohnQSmith

    Joined:
    Jan 19, 2011
    Messages:
    559
    Likes Received:
    7
    Try "iconv" in Cygwin.

    n.b. I haven't tried it.

    Code:
    iconv -f CP850 -t CP1252 descript.ion > newdescript.ion
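If Cygwin is not at hand, the same conversion can be sketched in Python. This is an untested-on-real-data illustration, assuming the old file really is CP 850; the function names are hypothetical. Unlike a silent conversion, it refuses to proceed when a character (for example a box-drawing character left over from earlier corruption) has no Windows-1252 equivalent:

```python
def unconvertible(text, target):
    """Return the sorted characters of text that target cannot represent."""
    missing = set()
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


def recode(raw, source="cp850", target="cp1252"):
    """Convert bytes between code pages, failing loudly on lossy characters."""
    text = raw.decode(source)
    missing = unconvertible(text, target)
    if missing:
        raise ValueError("not representable in %s: %r" % (target, missing))
    return text.encode(target)
```

As with iconv, the result should be written to a new file and checked before it replaces the original DESCRIPT.ION.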
    Here's the man page.

    Code:
    ICONV(1)                      Linux Programmer's Manual                      ICONV(1)
    
    NAME
           iconv - character set conversion
    
    SYNOPSIS
           iconv [OPTION...] [-f encoding] [-t encoding] [inputfile ...]
           iconv -l
    
    DESCRIPTION
           The  iconv  program converts text from one encoding to another encoding.  More
           precisely, it converts from the encoding given for the -f option to the encod‐
           ing  given for the -t option. Either of these encodings defaults to the encod‐
           ing of the current locale. All the inputfiles are read and converted in  turn;
           if  no  inputfile  is given, the standard input is used. The converted text is
           printed to standard output.
    
           The encodings permitted are system dependent. For the libiconv implementation,
           they are listed in the iconv_open(3) manual page.
    
           Options controlling the input and output format:
    
           -f encoding, --from-code=encoding
                  Specifies the encoding of the input.
    
           -t encoding, --to-code=encoding
                  Specifies the encoding of the output.
    
           Options controlling conversion problems:
    
           -c     When  this  option  is  given,  characters that cannot be converted are
                  silently discarded, instead of leading to a conversion error.
    
           --unicode-subst=formatstring
                  When this option is given, Unicode characters  that  cannot  be  repre‐
                  sented  in  the  target encoding are replaced with a placeholder string
                  that is constructed from the given formatstring, applied to the Unicode
                  code point. The formatstring must be a format string in the same format
                  as for the printf command or the printf() function,  taking  either  no
                  argument or exactly one unsigned integer argument.
    
           --byte-subst=formatstring
                  When this option is given, bytes in the input that are not valid in the
                  source encoding are replaced with a placeholder  string  that  is  con‐
                  structed  from the given formatstring, applied to the byte's value. The
                  formatstring must be a format string in the  same  format  as  for  the
                  printf  command  or the printf() function, taking either no argument or
                  exactly one unsigned integer argument.
    
           --widechar-subst=formatstring
                  When this option is given, wide characters in the input  that  are  not
                  valid  in  the  source  encoding are replaced with a placeholder string
                  that is constructed from the given formatstring, applied to the  byte's
                  value.  The  formatstring must be a format string in the same format as
                  for the printf command or the printf() function, taking either no argu‐
                  ment or exactly one unsigned integer argument.
    
           Options controlling error output:
    
           -s, --silent
                  When  this  option is given, error messages about invalid or unconvert‐
                  ible characters are omitted, but the actual  converted  text  is  unaf‐
                  fected.
    
           The  iconv  -l or iconv --list command lists the names of the supported encod‐
           ings, in a system dependent format. For the libiconv implementation, the names
           are  printed  in  upper  case,  separated by whitespace, and alias names of an
           encoding are listed on the same line as the encoding itself.
    
    EXAMPLES
           iconv -f ISO-8859-1 -t UTF-8
                  converts input from the old West-European encoding ISO-8859-1  to  Uni‐
                  code.
    
           iconv -f KOI8-R --byte-subst="<0x%x>"
                           --unicode-subst="<U+%04X>"
                  converts  input  from  the  old  Russian  encoding KOI8-R to the locale
                  encoding, substituting an angle bracket notation with hexadecimal  num‐
                  bers for invalid bytes and for valid but unconvertible characters.
    
           iconv --list
                  lists the supported encodings.
    
    CONFORMING TO
           POSIX:2001
    
    SEE ALSO
           iconv_open(3), locale(7)
    
    GNU                                 March 31, 2007                           ICONV(1)
    
     
  15. Alpengreis

    I can only explain the following:

    IF the console uses CP 437 or 850 AND Windows uses 1252, you will ALWAYS have a problem when you create a document with umlauts in Windows and display it in the console, or vice versa!
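This is easy to check: text is only safe to pass between the console and Windows programs if it encodes to the same bytes in every code page involved. A small sketch (the function name is hypothetical) that tests a string against CP 437, CP 850 and Windows-1252:

```python
CODE_PAGES = ("cp437", "cp850", "cp1252")


def portable(text, code_pages=CODE_PAGES):
    """True if text encodes to identical bytes in every listed code page."""
    encodings = set()
    for cp in code_pages:
        try:
            encodings.add(text.encode(cp))
        except UnicodeEncodeError:
            return False  # not representable at all in this code page
    return len(encodings) == 1
```

Plain commands such as `describe TestFile` pass, because the ASCII range is shared by all three code pages. A single `ü` fails, since it is byte 0x81 in CP 437 and CP 850 but 0xFC in Windows-1252.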

    No, this is normal, as it should be, and fully logical. It was the same here before I changed the console CP. This is how it works, and it is WAD! Sorry that I cannot give another answer ;-)

    If the behaviour was different on your earlier computers, then you had a different configuration. I still do not believe that JP changed this because, as I said, TCC takes the CP from the Windows console. And that is right and important for compatibility with the Windows console (command prompt)!

    You could do a fresh setup of, for example, my OS version on 100 000 computers. I'm sure that if you choose "Swiss German" (for example), you will always get the same behaviour (the problem with umlauts) on every single computer! Well, I would nevertheless not bet real money on it ;-)

    Sounds good, but it is not enough to avoid the problems in the Windows console too. For that you need "a right" solution (see my link).

    This cannot be avoided if you change the CP. To fix such files you could possibly use the tip from JohnQSmith for automatic conversion (maybe other tools exist for this purpose too). Or you take another workaround (see Rex's answer). Just: if you use NO workaround, you will probably have to live with this problem until Windows itself changes the console CP (in which case you would have the same incompatibility problem with your old DESCRIPT.ION files)!

    Anyway, I hope you can solve your problem!
     
