How to? filter text stream with a regular expression

Avi Shmidman · Feb 26, 2012

I'm looking for a way to do grep-type filtering on a text stream in TCC. For instance, I'd like to filter out lines of a given file with a string of 5 or more digits.
With powershell I can use the "match" function:
type filename.txt | where {$_ -match "\d{5}"}
Is there an equivalent within TCC? I've already seen that TCC does have excellent regex support. For instance, to perform a "dir" of filenames with strings of 5 or more digits, I can write:
dir "::\d{5}"
However, I'd like to be able to harness TCC's regex processing with any piped text stream on the command line. Is this possible?

Steve Pitts · Feb 26, 2012

Avi Shmidman said:
I'm looking for a way to do grep-type filtering on a text stream in TCC

ffind /e"^\d{5}" /v filename.txt

Avi Shmidman · Feb 26, 2012

Hi Steve,

Thanks for the pointer! I just tried out a few ffind combos, and I was pleased to find that it supported UTF-8 with the /8 option, and that it also supports use as a pipe command. That is, one can write, for instance:

dir | ffind /e"\d{5} /v

(OK, I realize that in this case I could have just added the regex to the dir command directly, but I note it here for demonstration purposes).

However, I've noticed two issues regarding ffind and multilanguage text:

1] When I pipe text into ffind, any non-English text becomes corrupted in the final output, whether or not I set the "/8" switch. I imagine that this is because piped command-line output is not piped as UTF-8. So, I'm wondering - just as Rex recently added the option for all redirected streams to be processed as UTF-8, is there a parallel option for piped streams to be processed as UTF-8, too?

2] Even without the piping, I find that ffind does not properly process regex strings that contain non-English characters. This is strange, because it does properly process simple search strings with non-English chars. Here's my case:

- I have a file, text.txt, with UTF-8 text, containing English and Hebrew characters.
- If I run:
ffind /t"א" /v /8 test.txt (that's an "aleph" character in the /t argument)
Then the result is correct - all lines are displayed containing the character "aleph".
- However, if I run:
ffind /e"א" /v /8 test.txt
Then the results are now blank!
Why would /t process it correctly, but /e not do so?

Charles Dye · Feb 26, 2012

Avi Shmidman said:
1] When I pipe text into ffind, any non-English text becomes corrupted in the final output, whether or not I set the "/8" switch. I imagine that this is because piped command-line output is not piped as UTF-8. So, I'm wondering - just as Rex recently added the option for all redirected streams to be processed as UTF-8, is there a parallel option for piped streams to be processed as UTF-8, too?

Is your input text Unicode, or is it some OEM format?

Avi Shmidman · Feb 26, 2012

1] In my second example, my sample file is encoded in UTF-8.
2] In my first example, regarding piping, I'm simply piping the output of the dir command to ffind. Admittedly, the code page of my shell is set to 1255 (Hebrew-Windows), rather than 65001 (UTF-8), because, as I've noted in a different thread, when I switch TCC to code page 65001 I don't see any Hebrew characters whatsoever. Nevertheless, I was hoping that piped text could be converted "on the fly" to UTF-8, the same way redirected output (with >) is successfully converted to UTF-8.

Steve Pitts · Feb 26, 2012

Avi Shmidman said:
Why would /t process it correctly, but /e not do so?

At a guess (you'll have to wait for Rex to provide the definitive answer) because they are processed by completely different pieces of code, with the /t handled directly by TCC and the /e passed to the third-party REGEX code to handle. As to the rest of your post, I'm afraid that I've no ideas because I'm lucky enough to be able to run vanilla Take Command in vanilla Windows and have no need of any additional language or code page support (or at least, the UK pages are rarely different enough from the American ones to cause any real issues these days), so I've never had the need to dig into those areas, sorry.

Charles Dye · Feb 26, 2012

Avi Shmidman said:
1] In my second example, my sample file is encoded in UTF-8.

I don't know whether FFIND is expected to understand UTF-8 input text, but I would guess not. Have you tried it with UTF-16 text instead?

Avi Shmidman · Feb 26, 2012

Well, Charles, as I noted, ffind does a great job with UTF-8 text with the /T parameters. And I'm using it with the /8 switch, which puts it into UTF-8 mode. So there is something specific about the way the search string and file are sent to the regex processor that seems to be the problem.

rconn · Feb 26, 2012

Avi Shmidman said:
Well, Charles, as I noted, ffind does a great job with UTF-8 text with the /T parameters. And I'm using it with the /8 switch, which puts it into UTF-8 mode. So there is something specific about the way the search string and file are sent to the regex processor that seems to be the problem.

The RE library (Oniguruma) has to be configured to handle anything other than ASCII and Unicode input. Adding UTF-8 shouldn't be too difficult, but it's going to be substantially more work to configure it for other encodings (like RTL languages). That definitely won't be in v13.

rconn · Feb 26, 2012

Avi Shmidman said:
Nevertheless, I was hoping that piped text could be converted "on the fly" to UTF-8, the same way redirected output (with >) is successfully converted to UTF-8.

If you use the /U8 startup switch, all redirected output (including pipes) is converted to UTF-8.

Avi Shmidman · Feb 27, 2012

Hi Rex,
1] Ah, I see, you are correct, it's the UTF-8 that was scaring the regular expression library. When I went back to "UnicodeOutput=Yes", then I found that piping text through to ffind's regular expression parser worked perfectly. That is, I can now write:
dir | ffind /e"א" /v
With UTF-16, this succeeds. With UTF-8, it found nothing.

2] Interestingly, this issue affected ffind's non-regex string processing too. That is, typing the same thing but with /t, like this:
dir | ffind /t"א" /v
resulted in the same issue. With UTF-8, it finds nothing (even if I add the /8 switch). On the other hand, with UTF-16 output, the output is all good.

Avi Shmidman · Feb 27, 2012

So, just to clarify, because there are a lot of variables. I tried:
(a) piping Hebrew text into ffind
(b) running ffind on a UTF-8 text file.
And I tried each of these with both (1) regex and (2) non-regex strings.
I found that:
(1a and 1b) with regex, UTF-8 was not processed correctly, neither with piping nor on a text file
(2a) When piping UTF-8, a non-regex string did not work, either
(2b) However, running ffind on a UTF-8 file with a non-regex string did work.

With UTF-16, all four permutations work.

Search

Welcome!

How to? filter text stream with a regular expression

Avi Shmidman

Steve Pitts

Avi Shmidman

Charles Dye

Super Moderator

Avi Shmidman

Steve Pitts

Charles Dye

Super Moderator

Avi Shmidman

rconn

Administrator

rconn

Administrator

Avi Shmidman

Avi Shmidman

Similar threads