Fileread fails on Unicode file

StarliteLemming · Jun 17, 2016

I'm trying to read in log files from Exact Audio Copy, but both @line and @fileread fail.

The files are in Unicode (2 bytes per char), but I think the problem is that every file starts with a single FF byte. I can see how this would desynchronise a Unicode read.

When I use @line, only the first line can be read (correctly) -- requests for higher line numbers return nothing. When I use @fileread, only the first two bytes of the file are read. Subsequent reads return EOF.

I don't know if FOR or DO would work (likely not), but either would require a major restructure of my batch file, so I'd prefer to avoid them.

Is my only other option to read byte by byte with @filereadb? Or could I perhaps discard the initial byte somehow then continue from there with @fileread? Or is the problem elsewhere?

Thanks for any help. (TCC v13 x64)

rconn · Jun 18, 2016

A Unicode file in Windows *requires* the 2-byte header (xFEFF). There is no way to read a file as Unicode in TCC without the correct header.

Can you modify the file & insert the header before reading it with @LINE or @FILEREAD?

StarliteLemming · Jun 18, 2016

Okay, that's interesting. I actually misinterpreted the display slightly, then.

When I list one of the log files, the first six bytes are always FF FE 45 00 78 00 (which is 255 254 69 00 120 00 in decimal). But %@ascii[%@fileread[]] returns 160 9632 69 and nothing else (that's the whole line, and further filereads return EOF). Interpreted as three numbers, that would be A0 25A0 45 in hex. So that doesn't make a lot of sense to me. Note that the fileopen uses ,r,t as options, but ,r alone has the same result.
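Those numbers are at least consistent with the leading bytes being pushed through an OEM code page instead of being decoded as UTF-16. The sketch below uses Python purely for illustration. It does not show what TCC does internally, and the choice of code page 437 and the "stop at the first zero byte" behaviour are both assumptions. It decodes the same six bytes both ways.

```python
# First six bytes of the log file, as listed above.
head = bytes([0xFF, 0xFE, 0x45, 0x00, 0x78, 0x00])

# Read as UTF-16: FF FE is the little-endian byte order mark (U+FEFF),
# so the codec consumes it and decodes the rest as 2-byte characters.
as_utf16 = head.decode("utf-16")

# Assumption: the same bytes read one at a time through OEM code page 437,
# stopping at the first zero byte as a C-style string would.
as_oem = head.split(b"\x00")[0].decode("cp437")
oem_codes = [ord(ch) for ch in as_oem]

print(repr(as_utf16))  # 'Ex'
print(oem_codes)       # [160, 9632, 69]
```

If that reading is right, the 160 and 9632 are just the two BOM bytes seen as OEM characters, and the string ends at the zero byte that follows the E.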

I just tried using FOR at the command line, and that does seem to work correctly (though echoing strings with pipes in them makes for a lot of errors).

I'll attach a sample file so you can try it yourself, if you want; zipped so it doesn't get modified in transit.

Failing any further options, looks like restructuring my code to use FOR is the best way forward. But if you can suggest something else, that would be awesome.

Thanks, mate.

vefatica · Jun 18, 2016

I don't know what the ultimate goal is, but it might be easier to achieve if you read the file into an array. Examples (I renamed your file):

Code:

v:\> setarray /f /r Unicode.log a

v:\> echo %a[0]
Exact Audio Copy V1.1 from 23. June 2015

v:\> echo %a[2]
EAC extraction logfile from 17. June 2016, 0:05

v:\> echo %a[%@dec[%@arrayinfo[a,1]]]
==== Log checksum 513B43020E430C81E1B7DC9C2E9C83F3A69D4DE7E7BA2D93EA3F23D4AECE56CC ====

StarliteLemming · Jun 19, 2016

Interesting idea. Thanks for the suggestion. Unfortunately, my version of TCC doesn't have the SETARRAY command.

As for what I'm doing, I'm trying to build a CSV table of the results of each file extraction. That's why a linear once-over of the file is sufficient (though I suppose memory concerns are a bit passé). However, in my code I tried to do the track processing in a subroutine. Restructuring the code for FOR will involve setting and resetting status flags -- a bit messy.

I'm actually really curious why @FILEREAD is failing. I've had other programs also struggle, on occasion, with these files.

vefatica · Jun 19, 2016

Do you have @FILEREADB? Here are some things to consider.

Code:

v:\> set h=%@fileopen[Unicode.log,r,b]

v:\> set r=%@filereadb[%h,10]

v:\> echo %r
255 254 69 0 120 0 97 0 99 0

v:\> echo %@fileseek[%h,2,0] (skip the BOM)
2 (skip the BOM)

v:\> set r=%@filereadb[%h,10]

v:\> echo %r
69 0 120 0 97 0 99 0 116 0

v:\> do i=0 to 9 ( echos %@if[%@word[%i,%r] NE 0,%@char[%@word[%i,%r]],] )
Exact

dcantor · Jun 19, 2016

Or maybe even

Code:

do i=0 to 8 by 2 ( echos %@char[%@EVAL[256*%@word[%@inc[%i],%r]+%@word[%i,%r]]] )
Exact

vefatica · Jun 19, 2016

dcantor said:
Or maybe even

Code:

do i=0 to 8 by 2 ( echos %@char[%@EVAL[256*%@word[%@inc[%i],%r]+%@word[%i,%r]]] )
Exact

Nice! If he doesn't have SETARRAY (introduced in v10), then he doesn't have a command line DO ... probably no problem.

Dave, Looking at your post and the quoted version in the message composer, I realize that I never realized that the CODE tags work even in lowercase! I always type them myself ... it's faster.

I wish Rex would chime in. I would have expected @FILEREAD to handle Unicode.

StarliteLemming · Jun 20, 2016

Actually, I do have command line DO. I paid for v3, v4, v5, but since then I've found TCC LE to be sufficient for (most of) my needs. Hence, I'm using TCC LE v13 x64. I do plan to buy a more recent version of TCC at some stage in the future, but I've never found much use for all the graphical features in TCE.

OK, so what's being suggested is to read byte-by-byte, strip the UTF header using @fileseek, then convert the bytes to characters by either skipping zero bytes or doing the maths to combine them into a 16-bit value. Both great suggestions. The latter obviously has the advantage of picking up European accented characters, which do appear in some CD track and artist names (Lady Gaga has some tracks with umlauts, for example). The basic problem with both of these approaches is that they don't easily pick up line-endings, and the data I have is heavily line-oriented.
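To make the byte-combining idea and the line-ending concern concrete, here is a rough sketch in Python (for illustration only, not TCC code). The function name and the sample data are made up. The input mimics the space-separated decimal values that @filereadb prints, and the arithmetic is the same 256*high+low used above.

```python
def decode_utf16le_values(values):
    """Combine low/high byte pairs into characters, skipping a leading BOM."""
    if values[:2] == [255, 254]:      # FF FE: little-endian byte order mark
        values = values[2:]
    chars = []
    for i in range(0, len(values) - 1, 2):
        low, high = values[i], values[i + 1]
        chars.append(chr(low + 256 * high))   # 16-bit code unit from two bytes
    return "".join(chars)

# Hypothetical sample: BOM, "Exact", CR LF, "ü", CR LF
raw = "255 254 69 0 120 0 97 0 99 0 116 0 13 0 10 0 252 0 13 0 10 0"
text = decode_utf16le_values([int(v) for v in raw.split()])

# Line endings only become visible after the pairs are combined:
# CR is the pair 13 0 and LF is the pair 10 0.
lines = text.splitlines()
print(lines)  # ['Exact', 'ü']
```

This treats every 16-bit value as a whole character, so characters outside the Basic Multilingual Plane (surrogate pairs) would need extra handling.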

Using FOR %line in (@file.log) ... does work, for some reason, where @fileread doesn't. So that has to be the simplest fall-back.

The trouble with using FOR is that I've structured the code to read in each line just in time for when I need it. For example, I have a subroutine that reads in the data for each track and keeps processing lines until it reaches the end of the track, at which point it returns to another line-reading section that looks for the start of a new track. If I use FOR, all the lines are being read in at the same point. So I have to track where I am using status flags (such as a variable that's set to 1 when I'm inside a track and 0 when I'm not). From a programming point of view, that's pretty clunky.
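The flag-based structure described here looks roughly like the sketch below, written in Python for illustration and not in TCC. The "Track" prefix and the blank-line terminator are hypothetical markers, since the actual log wording isn't shown in this thread.

```python
# Hypothetical markers: the real EAC log wording is not shown in this thread.
TRACK_START = "Track"
TRACK_END = ""          # assume a blank line closes a track block

def collect_tracks(lines):
    """Push-style parsing: every line arrives in one loop, so a flag
    has to remember whether we are currently inside a track."""
    tracks = []
    in_track = False
    for line in lines:
        if not in_track:
            if line.startswith(TRACK_START):
                in_track = True
                tracks.append([line])
        elif line == TRACK_END:
            in_track = False
        else:
            tracks[-1].append(line)
    return tracks

sample = ["header", "Track 1", "  a", "  b", "", "noise", "Track 2", "  c", ""]
print(collect_tracks(sample))
# [['Track 1', '  a', '  b'], ['Track 2', '  c']]
```

The single loop has to decide what each line means from the flag alone, which is the clunkiness being complained about.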

Still, unless someone can see something inherently dodgy about my Unicode files and a work-around for it, looks like I'd better move things around and use FOR.

Thanks for the ideas.

rconn · Jun 24, 2016

I tried your file both with TCC v19 and TCC/LE v14, and had no problems reading lines (with @LINE) from your file. What's the exact syntax you're using?

StarliteLemming · Jun 25, 2016

Well that's a bit embarrassing!

Yes, it seems @LINE[] does work. The results I was getting were because I was reading lines 1, 3, 5, which are all blank in the file, hence ECHO %@LINE["file.log",1] was returning ECHO is OFF.

So I can use @LINE with a global line counter -- that certainly makes life easier!
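The shape of that approach, sketched in Python for illustration only: a shared counter stands in for the global line number handed to @LINE, so a subroutine can pull the next line whenever it needs one. The class, the function names and the "Track" / blank-line markers are all hypothetical.

```python
class LineReader:
    """Pull-style reading: a shared counter plays the role of the global
    line number, so any routine can ask for the next line."""

    def __init__(self, lines):
        self.lines = lines
        self.next_index = 0          # first line is number 0

    def read(self):
        """Return the next line, or None once the lines run out."""
        if self.next_index >= len(self.lines):
            return None
        line = self.lines[self.next_index]
        self.next_index += 1
        return line

def read_track(reader):
    """Consume lines up to the blank line that (hypothetically) ends a track."""
    body = []
    while (line := reader.read()) not in (None, ""):
        body.append(line)
    return body

reader = LineReader(["Track 1", "  a", "", "Track 2", "  b", ""])
tracks = []
while (line := reader.read()) is not None:
    if line.startswith("Track"):
        tracks.append([line] + read_track(reader))
print(tracks)  # [['Track 1', '  a'], ['Track 2', '  b']]
```

No status flag is needed, because the position in the file is carried by the counter and the subroutine returns when its track ends.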

It's still odd that @FILEREAD doesn't work, but I'll leave that in your capable hands. I still suspect there's something odd with the way these log files are written (other programs have occasional glitches reading them -- though it could also be code page issues). If there is a bug in @FILEREAD (or was, in the version of TCC I'm using), it's clearly not very impactful to have slipped by for so long (assuming it hasn't already been fixed in a later version). And if it is still around, I hope this has been useful. :)

I see there's a new version of TCC LE. Thank you. I'll have to upgrade.

And thanks to everyone who's responded.
