
How to filter a text stream with a regular expression?

Discussion in 'Support' started by Avi Shmidman, Feb 26, 2012.

  1. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    I'm looking for a way to do grep-type filtering on a text stream in TCC. For instance, I'd like to pick out the lines of a given file that contain a string of 5 or more digits.
    With PowerShell I can use the "-match" operator:
    type filename.txt | where {$_ -match "\d{5}"}
    Is there an equivalent within TCC? I've already seen that TCC does have excellent regex support. For instance, to perform a "dir" of filenames with strings of 5 or more digits, I can write:
    dir "::\d{5}"
    However, I'd like to be able to harness TCC's regex processing with any piped text stream on the command line. Is this possible?
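
For comparison outside TCC, the grep-type filtering described above can be sketched in Python. This is only an illustration of the idea, not a TCC feature; the function name `filter_lines` and the sample lines are hypothetical.

```python
import re
from typing import Iterable, Iterator


def filter_lines(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield only the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        # search() matches anywhere in the line, like grep without anchors
        if regex.search(line):
            yield line


# Hypothetical sample input standing in for a piped text stream
sample = ["abc 12345 x", "only 1234 here", "id 1234567"]
kept = list(filter_lines(sample, r"\d{5}"))
for line in kept:
    print(line)
```

Because `\d{5}` is not anchored, a line with more than five consecutive digits also matches, which is the "5 or more digits" behaviour asked about.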
     
  2. Steve Pitts

    Joined:
    Jul 7, 2008
    Messages:
    158
    Likes Received:
    0
    ffind /e"^\d{5}" /v filename.txt
     
  3. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    Hi Steve,

    Thanks for the pointer! I just tried out a few ffind combos, and I was pleased to find that it supported UTF-8 with the /8 option, and that it also supports use as a pipe command. That is, one can write, for instance:

    dir | ffind /e"\d{5}" /v

    (OK, I realize that in this case I could have just added the regex to the dir command directly, but I note it here for demonstration purposes).

    However, I've noticed two issues regarding ffind and multilanguage text:

    1] When I pipe text into ffind, any non-English text becomes corrupted in the final output, whether or not I set the "/8" switch. I imagine that this is because piped command-line output is not piped as UTF-8. So, I'm wondering - just as Rex recently added the option for all redirected streams to be processed as UTF-8, is there a parallel option for piped streams to be processed as UTF-8, too?

    2] Even without the piping, I find that ffind does not properly process regex strings that contain non-English characters. This is strange, because it does properly process simple search strings with non-English chars. Here's my case:

    - I have a file, test.txt, with UTF-8 text, containing English and Hebrew characters.
    - If I run:
    ffind /t"א" /v /8 test.txt (that's an "aleph" character in the /t argument)
    Then the result is correct - all lines containing the character "aleph" are displayed.
    - However, if I run:
    ffind /e"א" /v /8 test.txt
    Then the results are now blank!
    Why would /t process it correctly, but /e not do so?
     
  4. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    Is your input text Unicode, or is it some OEM format?
     
  5. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    1] In my second example, my sample file is encoded in UTF-8.
    2] In my first example, regarding piping, I'm simply piping the output of the dir command to ffind. Admittedly, the code page of my shell is set to 1255 (Hebrew-Windows), rather than 65001 (UTF-8), because, as I've noted in a different thread, when I switch TCC to code page 65001 I don't see any Hebrew characters whatsoever. Nevertheless, I was hoping that piped text could be converted "on the fly" to UTF-8, the same way redirected output (with >) is successfully converted to UTF-8.
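
The "on the fly" conversion described above amounts to decoding bytes from the shell's code page and re-encoding them as UTF-8. A minimal Python sketch of that step follows; it says nothing about how TCC does it internally, and the helper name `reencode` is hypothetical.

```python
def reencode(data: bytes, source: str = "cp1255", target: str = "utf-8") -> bytes:
    """Decode bytes from the source code page and re-encode them in the target encoding."""
    return data.decode(source).encode(target)


# The letter aleph is a single byte in Windows-1255 and two bytes in UTF-8
aleph_cp1255 = "א".encode("cp1255")
aleph_utf8 = reencode(aleph_cp1255)
print(aleph_cp1255.hex(), aleph_utf8.hex())
```

The same character has a different byte representation in each encoding, so a tool reading the pipe has to know which one it is receiving.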
     
  6. Steve Pitts

    Joined:
    Jul 7, 2008
    Messages:
    158
    Likes Received:
    0
    At a guess (you'll have to wait for Rex to provide the definitive answer), it's because they are processed by completely different pieces of code: the /t is handled directly by TCC, while the /e is passed to the third-party REGEX code. As to the rest of your post, I'm afraid I've no ideas. I'm lucky enough to be able to run vanilla Take Command in vanilla Windows, with no need of any additional language or code page support (or at least, the UK pages are rarely different enough from the American ones to cause any real issues these days), so I've never had the need to dig into those areas, sorry.
     
  7. Charles Dye

    Charles Dye Super Moderator
    Staff Member

    Joined:
    May 20, 2008
    Messages:
    3,288
    Likes Received:
    39
    I don't know whether FFIND is expected to understand UTF-8 input text, but I would guess not. Have you tried it with UTF-16 text instead?
     
  8. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    Well, Charles, as I noted, ffind does a great job with UTF-8 text with the /t parameter. And I'm using it with the /8 switch, which puts it into UTF-8 mode. So there is something specific about the way the search string and file are sent to the regex processor that seems to be the problem.
     
  9. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,809
    Likes Received:
    82
    The RE library (Oniguruma) has to be configured to handle anything other than ASCII and Unicode input. Adding UTF-8 shouldn't be too difficult, but it's going to be substantially more work to configure it for other encodings (like RTL languages). That definitely won't be in v13.
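
Why an encoding mismatch produces "no results" rather than an error can be illustrated with a byte-level search in Python. This is a sketch under the assumption that the pattern and the data end up in different encodings; it does not show how Oniguruma or FFIND work internally, and the helper name `contains` is hypothetical.

```python
import re


def contains(pattern_text: str, pattern_encoding: str, data: bytes) -> bool:
    """Search raw bytes for a literal pattern encoded with pattern_encoding."""
    needle = re.escape(pattern_text.encode(pattern_encoding))
    return re.search(needle, data) is not None


line = "file א.txt"
utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")

# Pattern and data in the same encoding: the bytes line up and a match is found
same_encoding = contains("א", "utf-16-le", utf16_bytes)
# UTF-16 pattern against UTF-8 data: the byte sequences differ, so nothing matches
mismatch = contains("א", "utf-16-le", utf8_bytes)
print(same_encoding, mismatch)
```

ASCII characters tend to hide this problem, because their UTF-8 bytes are identical to plain ASCII; it only shows up with characters such as Hebrew letters.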
     
  10. rconn

    rconn Administrator
    Staff Member

    Joined:
    May 14, 2008
    Messages:
    9,809
    Likes Received:
    82
    If you use the /U8 startup switch, all redirected output (including pipes) is converted to UTF-8.
     
  11. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    Hi Rex,
    1] Ah, I see, you are correct, it's the UTF-8 that was scaring the regular expression library. When I went back to "UnicodeOutput=Yes", then I found that piping text through to ffind's regular expression parser worked perfectly. That is, I can now write:
    dir | ffind /e"א" /v
    With UTF-16, this succeeds. With UTF-8, it found nothing.

    2] Interestingly, this issue affected ffind's non-regex string processing too. That is, typing the same thing but with /t, like this:
    dir | ffind /t"א" /v
    resulted in the same issue. With UTF-8, it finds nothing (even if I add the /8 switch). On the other hand, with UTF-16 output, the output is all good.
     
  12. Avi Shmidman

    Joined:
    Feb 23, 2012
    Messages:
    238
    Likes Received:
    3
    So, just to clarify, because there are a lot of variables. I tried:
    (a) piping Hebrew text into ffind
    (b) running ffind on a UTF-8 text file.
    And I tried each of these with both (1) regex and (2) non-regex strings.
    I found that:
    (1a and 1b) With regex, UTF-8 was not processed correctly, either with piping or on a text file
    (2a) When piping UTF-8, a non-regex string did not work, either
    (2b) However, running ffind on a UTF-8 file with a non-regex string did work.

    With UTF-16, all four permutations work.
     
