SDK - GetLine and redirected stdin

May 30, 2008
122
1
#1
I'm writing a plugin which needs to read from standard input. The
GetLine function in the SDK seems appropriate, but I'm having some
trouble using it.

I have a callback routine, which gets called in a loop from a library
API. The callback does

HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
...
GetLine(in, ...)

By tracing the function calls, it seems to me that every time the
GetLine function is called, it starts again from the beginning of
standard input. This only seems to happen when stdin is redirected
from a file, like

plugin /L MyPlugin
myplugincmd <f

It does *not* happen in the case

echo some text | (plugin /L MyPlugin & myplugincmd & plugin /U MyPlugin)

What's the issue here? And how should I use GetLine to be robust when
called with a redirected file as input?

Thanks,
Paul.
 
Jun 10, 2008
35
0
#2
"p.f.moore" <> wrote:

>
> I'm writing a plugin which needs to read from standard input. The
> GetLine function in the SDK seems appropriate, but I'm having some
> trouble using it.
>
> I have a callback routine, which gets called in a loop from a library
> API. The callback does
>
> HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
> ...
> GetLine(in, ...)
>
> By tracing the function calls, it seems to me that every time the
> GetLine function is called, it starts again from the beginning of
> standard input. This only seems to happen when stdin is redirected
> from a file, like
>
> plugin /L MyPlugin
> myplugincmd <f
>
> It does *not* happen in the case
>
> echo some text | (plugin /L MyPlugin & myplugincmd & plugin /U MyPlugin)
>
> What's the issue here? And how should I use GetLine to be robust when
> called with a redirected file as input?
GetLine() is a can of worms.

Having said that, I ran into the same problem and found the culprit to
be a call to QueryIsFileUnicode() which was also in the loop. Move this
out of the loop and it should work. At least it did so for me.
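
A guess at why hoisting the call helps (nothing in this thread confirms it): if the encoding check rewinds a seekable handle to look at its first bytes, then calling it before every read would restart the input each time, while a pipe, which cannot seek, would be unaffected. The sketch below reproduces that behaviour with plain C stdio. probe_is_utf16 is a hypothetical stand-in, not the SDK's QueryIsFileUnicode, and the other function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an encoding check: it rewinds the stream,
   looks for a UTF-16LE BOM, and leaves the stream at the start. */
int probe_is_utf16(FILE *f)
{
    unsigned char bom[2] = {0, 0};
    size_t n;

    assert(f != NULL);
    rewind(f);
    n = fread(bom, 1, 2, f);
    rewind(f);
    return n == 2 && bom[0] == 0xFF && bom[1] == 0xFE;
}

/* Probe once, then read: every line is seen exactly once. */
int count_lines_probe_once(FILE *f)
{
    char line[256];
    int count = 0;

    (void)probe_is_utf16(f);
    while (fgets(line, sizeof line, f))
        count++;
    return count;
}

/* Probe before every read: each probe rewinds, so the same first line
   comes back again and again. Returns how often `first` was read. */
int reads_of_first_line(FILE *f, const char *first, int max_reads)
{
    char line[256];
    int hits = 0;

    for (int i = 0; i < max_reads; i++) {
        (void)probe_is_utf16(f);
        if (!fgets(line, sizeof line, f))
            break;
        if (strcmp(line, first) == 0)
            hits++;
    }
    return hits;
}
```

With a three-line file, count_lines_probe_once sees all three lines, whereas reads_of_first_line keeps returning the first line for as long as it is called.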

--
cheers thomasl

web: http://thomaslauer.com/start
 
#3
On Sat, 05 Jul 2008 10:16:02 -0500, you wrote:


>By tracing the function calls, it seems to me that every time the
>GetLine function is called, it starts again from the beginning of
>standard input. This only seems to happen when stdin is redirected
>from a file, like
>
> plugin /L MyPlugin
> myplugincmd <f
My GREPP uses either of two utility routines: one uses GetLine(), the other uses
fgetws(). When I switch to the GetLine() routine I see that it doesn't work
with redirected input. I don't know why. Maybe Rex will chime in. But
GetLine() is very slow; according to Rex, it reads a byte at a time to
facilitate pipes. My other, much faster, routine looks like this (without all
the Oniguruma stuff).

INT GetEm(HANDLE hFile, WCHAR *pszRegEx, BOOL bCase, BOOL bReverse, BOOL bQuiet)
{
    // Onig stuff

    INT rc = 0;
    WCHAR buf[8192];
    // Check the encoding once, before any reading starts
    BOOL bUnicode = QueryIsFileUnicode(hFile);
    // Wrap the Win32 handle in a CRT descriptor, then in a FILE stream
    INT hCrt = _open_osfhandle((intptr_t) hFile, bUnicode ? _O_BINARY : _O_TEXT);
    FILE *hf = _fdopen( hCrt, bUnicode ? "rb" : "r" );

    // Onig stuff
    // bInterrupt may be set by a temporary console ctrl handler

    while ( !bInterrupt && !feof(hf) && fgetws(buf, 8192, hf) )
    {
        // Onig stuff and set rc
    }

byebye :
    // fclose() also closes hCrt and the underlying handle,
    // so there is no separate _close(hCrt)
    fclose(hf);
    return rc;
}

This approach does work with redirected input:

v:\> grepp reset alterping.btm
:reset
if "%signal" EQ "r" goto reset

v:\> grepp reset < alterping.btm
:reset
if "%signal" EQ "r" goto reset
 
#4
2008/7/5 thomasl <>:


> GetLine() is a can of worms.
Too right! :-)


> Having said that, I ran into the same problem and found the culprit to
> be a call to QueryIsFileUnicode() which was also in the loop. Move this
> out of the loop and it should work. At least it did so for me.
It did indeed. That fixed the problem. Thanks for the suggestion.
Paul.
 
#5
2008/7/5 vefatica <>:

> My GREPP uses either of two utility routines, one uses GetLine() the other uses
> fgetws(). When I switch to the GetLine() routine I see that it doesn't work
> with redirected input. I don't know why. Maybe Rex will chime in.
See Thomas' comment - maybe it's related to QueryIsFileUnicode?


> But GetLine() is very slow, reading a byte at a time, according to Rex, to
> facilitate pipes. My other, much faster, routine looks like this
[...]

Yes, I made an attempt at using ReadFile directly, and it was much easier,
but I had no idea in that case how to support both Unicode and ANSI
stdin. I may go back to that approach and try again: although speed
isn't a huge issue here, the code was a lot simpler.

Thanks for the example.
Paul.
 
#6
On Sat, 05 Jul 2008 11:08:23 -0500, you wrote:


>GetLine() is a can of worms.
>
>Having said that, I ran into the same problem and found the culprit to
>be a call to QueryIsFileUnicode() which was also in the loop. Move this
>out of the loop and it should work. At least it did so for me.
Experiment shows that GetLine()'s nEditFlag should be 0x10000 for a pipe while
it should match the file in the case of redirected stdin. Rex, how does a
plugin tell the difference?
 
#7
On Sat, 05 Jul 2008 11:08:23 -0500, you wrote:


>Having said that, I ran into the same problem and found the culprit to
>be a call to QueryIsFileUnicode() which was also in the loop. Move this
>out of the loop and it should work. At least it did so for me.
Thomas, does your routine work in both cases: stdin redirected from a Unicode
file and from a non-Unicode file?
 

rconn

Administrator
Staff member
May 14, 2008
10,572
97
#8
vefatica wrote:


> Quote:
> >GetLine() is a can of worms.
> >
> >Having said that, I ran into the same problem and found the culprit to
> >be a call to QueryIsFileUnicode() which was also in the loop. Move this
> >out of the loop and it should work. At least it did so for me.
>
> Experiment shows that GetLine()'s nEditFlag should be 0x10000 for a pipe while
> it should match the file in the case of redirected stdin. Rex, how does a
> plugin tell the difference?
QueryIsPipeHandle().

Rex Conn
JP Software
 
#9
2008/7/5 vefatica <>:

> Experiment shows that GetLine()'s nEditFlag should be 0x10000 for a pipe while
> it should match the file in the case of redirected stdin.
Is that affected by the UnicodeOutput flag? I can see rather a lot of
cases to consider here:

* Console input (With or without UnicodeOutput)
* Redirected file (Either ASCII or Unicode)
* Pipe input (With or without UnicodeOutput)
* Here document (In a batch file which can be either Unicode or ASCII)


> Rex, how does a plugin tell the difference?
Indeed, the key question here is this: given that GetLine needs to be
passed a flag to describe the encoding (Unicode or ANSI) and the OS
APIs (ReadFile etc) simply read bytes, how does one derive the correct
encoding to be used? I assume that's what QueryIsFileUnicode is
for, but I suspect it works simply by checking for a BOM - and
hence it won't work elsewhere.

I think the rule should be:

1. If STD_INPUT_HANDLE points at a seekable device, check the start of
the file for a BOM and work from there.
2. If STD_INPUT_HANDLE is not seekable, it's either the console or a
character device. Check using QueryIsConsole, and if it's the console
go on the basis of UnicodeOutput, otherwise assume ASCII.

And maybe plugin commands should have an optional encoding flag, to
override this.
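
The BOM check in rule 1 could be sketched roughly as follows. This is plain C over a byte buffer, not SDK code: BomKind and detect_bom are hypothetical names, and the caller is assumed to have already read the first few bytes of the file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; not part of the TCC plugin SDK. */
typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE, BOM_UTF8 } BomKind;

/* Inspect the first bytes of a buffer for a byte order mark. */
BomKind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-8 BOM: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-16 little-endian BOM: FF FE */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    /* UTF-16 big-endian BOM: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    /* No BOM: the encoding has to be guessed some other way. */
    return BOM_NONE;
}
```

A file without a BOM comes back as BOM_NONE, which is exactly the case where an override flag on the plugin command would be useful.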

Questions: (a) Is this reasonable, and (b) how does it tie in with
what the SDK and/or TCC do at the moment?

Paul.

PS I'll do some experiments when I have a spare moment, and report back...
 
#10
On Sat, 05 Jul 2008 12:17:15 -0500, you wrote:


>> Experiment shows that GetLine()'s nEditFlag should be 0x10000 for a pipe while
>> it should match the file in the case of redirected stdin. Rex, how does a
>> plugin tell the difference?
>---End Quote---
>QueryIsPipeHandle().
That's not in TakeCmd.h or exposed by TakeCmd.dll.
 

rconn

#11
vefatica wrote:

> On Sat, 05 Jul 2008 12:17:15 -0500, you wrote:
>
>
> Quote:
> >> Experiment shows that GetLine()'s nEditFlag should be 0x10000 for a pipe while
> >> it should match the file in the case of redirected stdin. Rex, how does a
> >> plugin tell the difference?
> >---End Quote---
> >QueryIsPipeHandle().
>
> That's not in TakeCmd.h or exposed by TakeCmd.dll.
The entire contents of the function:

// check to see if the specified handle is connected to a pipe
DLLExports int QueryIsPipeHandle( HANDLE hFile )
{
return ( GetFileType( hFile ) == FILE_TYPE_PIPE );
}

Rex Conn
JP Software
 
#12
2008/7/5 p.f.moore <>:

> PS I'll do some experiments when I have a spare moment, and report back...
It looks like the following is effective:

HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
BOOL uni;

if (QueryIsConsole(in)) {
uni = FALSE;
} else if (GetFileType(in) == FILE_TYPE_PIPE) {
uni = QueryUnicodeOutput();
} else {
uni = QueryIsFileUnicode(in);
}

Printf(L"Treat as Unicode: %s\n", uni ? L"Yes" : L"No");

The only case I'm nervous about is where I unilaterally assume that
the console is always ANSI. Rex - is this true? Is it impossible for a
handle for which QueryIsConsole is true, to be Unicode? I certainly
can't make it happen...

Paul.
 

rconn

#13
p.f.moore wrote:

> 2008/7/5 p.f.moore <>:
>
> Quote:
> > PS I'll do some experiments when I have a spare moment, and report back...
>
> It looks like the following is effective:
>
> HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
> BOOL uni;
>
> if (QueryIsConsole(in)) {
> uni = FALSE;
> } else if (GetFileType(in) == FILE_TYPE_PIPE) {
> uni = QueryUnicodeOutput();
> } else {
> uni = QueryIsFileUnicode(in);
> }
>
> Printf(L"Treat as Unicode: %s\n", uni ? L"Yes" : L"No");
>
> The only case I'm nervous about is where I unilaterally assume that
> the console is always ANSI. Rex - is this true? Is it impossible for a
> handle for which QueryIsConsole is true, to be Unicode? I certainly
> can't make it happen...
The (unredirected) console in TCC is always Unicode, never ANSI.

Rex Conn
JP Software
 
#14
2008/7/6 rconn <>:

> The (unredirected) console in TCC is always Unicode, never ANSI.
??? Surely not. What is this code doing, then?

DLLExports INT WINAPI test (LPTSTR lpszString)
{
    HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
    BOOL c = QueryIsConsole(in);
    char buf[10];
    DWORD n;
    int i;

    Printf(L"Stdin is%s a console\n", c ? L"" : L" not");
    // Read up to 5 raw bytes from standard input
    ReadFile(in, buf, 5, &n, NULL);
    // Show the bytes as characters, with '.' for anything unprintable
    for (i = 0; i < n; ++i) {
        Printf(L"%c", isprint((unsigned char)buf[i]) ? buf[i] : '.');
    }
    Printf(L"\n");

    // Show the same bytes in hex, 16 per line
    for (i = 0; i < n; ++i) {
        Printf(L"%2.2x ", (unsigned char)buf[i]);
        if ((i % 16) == 15)
            Printf(L"\n");
    }
    Printf(L"\n");

    return 0;
}

Result:


Stdin is a console
abcdefg
abcde
61 62 63 64 65

So that to me implies that standard input, the console, is returning
bytes. I suspect I'm misunderstanding your use of the term "console"
here, or something else is wrong in what I'm doing. Unnervingly
enough, the characters which were *not* read by my test command did
not get used as input to the next command line, but were left and
picked up by the next execution of the test command.

With a bit of fiddling around, it looks to me like the input is coming
in using the console code page (850 on my machine) but is being
displayed in something else (I can't easily tell what).

Ultimately, what I want to do is to have a plugin command which reads
its "standard input" (pipe, console, redirected file, here document,
whatever) using standard ReadFile, or something equivalent which I can
use to read an arbitrary block of data in one go (using GetLine to
read a line at a time is OK for some uses, but not all), and then
establish what the character encoding of that data is, so that I can
convert it to Unicode. Some aspects of this are impossible (a
redirected file could be in any arbitrary encoding) but I'm willing to
compromise a little (for files, use BOM detection for UTF-16 and
otherwise assume an 8-bit character set which matches ASCII for
0-127). But as things stand, I'm struggling even to understand what
cases I have to address.
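
As a rough sketch of that compromise (BOM detection for UTF-16, otherwise an 8-bit set that matches ASCII for 0-127), something like the following would do for a block of bytes already read with ReadFile. bytes_to_utf16 is a hypothetical helper, not an SDK or OS function, and it only handles little-endian UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode `len` bytes into UTF-16 code units.
   Returns the number of code units written (at most `cap`). */
size_t bytes_to_utf16(const unsigned char *in, size_t len,
                      uint16_t *out, size_t cap)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    if (len >= 2 && in[0] == 0xFF && in[1] == 0xFE) {
        /* UTF-16LE BOM: skip it and join each byte pair, low byte first. */
        for (size_t i = 2; i + 1 < len && n < cap; i += 2)
            out[n++] = (uint16_t)(in[i] | (in[i + 1] << 8));
        return n;
    }

    /* No BOM: assume an 8-bit set that matches ASCII for 0-127.
       Bytes above 127 depend on the code page, so they are replaced
       with U+FFFD here instead of being guessed. */
    for (size_t i = 0; i < len && n < cap; i++)
        out[n++] = (uint16_t)(in[i] < 0x80 ? in[i] : 0xFFFD);
    return n;
}
```

In a real plugin the bytes above 127 would more likely go through MultiByteToWideChar with whatever code page turns out to be right, which is the part that is still unclear.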

The irony of this is that for my personal use, I'm mostly OK with
ASCII - it's only really the odd ISO 8859-15 character (most notably the
pound sign £) that hits me.
Paul.
 
#15
vefatica <> wrote:

>
> On Sat, 05 Jul 2008 11:08:23 -0500, you wrote:
>
> ---Quote---
> >Having said that, I ran into the same problem and found the culprit to
> >be a call to QueryIsFileUnicode() which was also in the loop. Move this
> >out of the loop and it should work. At least it did so for me.
> ---End Quote---
> Thomas, does your routine work in both cases: stdin redirected from a Unicode
> file and from a non-Unicode file?
Hmm, I hope and think it does and my tests seem to support this hope...
but then again, with GetLine() everything is possible;-). This API has
surprised me more often than I care to count.

Have a look into the source for my lua4nt or idle4nt plugin (especially
function reader(), there it is in all its gory detail):
http://thomaslauer.com/download/lua4nt01.zip
http://thomaslauer.com/download/idle4nt01.zip

--
cheers thomasl

web: http://thomaslauer.com/start
 

rconn

#16
p.f.moore wrote:

> 2008/7/6 rconn <>:
>
> Quote:
> > The (unredirected) console in TCC is always Unicode, never ANSI.
>
> ??? Surely not.
Definitely yes -- ALL of the internal APIs (including the console) in XP
/ Vista are Unicode. If you're running an ASCII app, all of the Unicode
APIs get thunked back & forth.


> What is this code doing, then?
>
> DLLExports INT WINAPI test (LPTSTR lpszString)
> {
>     HANDLE in = GetStdHandle(STD_INPUT_HANDLE);
>     BOOL c = QueryIsConsole(in);
>     char buf[10];
>     DWORD n;
>     int i;
>
>     Printf(L"Stdin is%s a console\n", c ? L"" : L" not");
>     ReadFile(in, buf, 5, &n, NULL);
>     for (i = 0; i < n; ++i) {
>         Printf(L"%c", isprint((unsigned char)buf[i]) ? buf[i] : '.');
>     }
>     Printf(L"\n");
>
>     for (i = 0; i < n; ++i) {
>         Printf(L"%2.2x ", (unsigned char)buf[i]);
>         if ((i % 16) == 15)
>             Printf(L"\n");
>     }
>     Printf(L"\n");
>
>     return 0;
> }


You're not directly accessing the console -- you're calling it
indirectly through the ReadFile API, so it's getting converted to ASCII.

If you're using a non-Unicode font (not recommended), you'll add
another layer of confusion (and thunking).

Rex Conn
JP Software
 
#17
On Sun, 06 Jul 2008 09:32:03 -0500, you wrote:


>You're not directly accessing the console -- you're calling it
>indirectly through the ReadFile API, so it's getting converted to ASCII.
Using ReadConsole() instead, I see Unicode.

Why does ReadFile() do that?

I noticed that QueryIsFileUnicode(GetStdHandle(STD_INPUT_HANDLE)) is FALSE.

It's a bit confusing.
 
#18
On Sun, 06 Jul 2008 09:32:03 -0500, you wrote:


>If you're using a non-Unicode font (not recommended), you'll add
>another layer of confusion (and thunking).
While that may be true, it should be noted that this command

timer & *dir f:\windows\system32 & timer

(2283 lines) takes 50% longer when Lucida Console is used than when the same
size raster font is used (here, 2.7 vs. 1.8 seconds when the end of the console
screen buffer is not reached, 1.9 vs. 1.3 seconds when started with a full
console screen buffer).

The added confusion and thunking seem to speed things up!
 
#19
2008/7/6 vefatica <>:

> On Sun, 06 Jul 2008 09:32:03 -0500, you wrote:
>>You're not directly accessing the console -- you're calling it
>>indirectly through the ReadFile API, so it's getting converted to ASCII.

> Using ReadConsole() instead, I see Unicode.
Aargh. I never looked at ReadConsole. I'm not sure I'd even realised
it existed...


> Why does ReadFile() do that?
ReadFile is defined as a bytes-only interface, so it has to encode its
input. I assume it uses the console code page to do this, so it's
entirely valid. I suspect if I had a keyboard which could generate
significant chunks of non-ASCII data (rather than just £, €, ¦ and ¬)
I might stand more of a chance of understanding what's going on...


> I noticed that QueryIsFileUnicode(GetStdHandle(STD_INPUT_HANDLE)) is FALSE.
>
> It's a bit confusing.
Too right!

To simplify right down, suppose I have a plugin which wants to read
from in = GetStdHandle(STD_INPUT_HANDLE). I guess I need to do the
following:

1. Test QueryIsConsole(in) [btw, what is the OS API equivalent to this?]
2. If it's true, use ReadConsole, and I get back wide characters.
3. If it's false, use ReadFile. I now need to know the encoding.
4. Check if it's a pipe (GetFileType(in) == FILE_TYPE_PIPE).
5. If it is, it's UTF-16 (wide characters) if QueryUnicodeOutput() is
true, else *QUESTION 1*
6. If it's not a pipe, it's a file and so it's seekable and we can
check the BOM.
7. If there's no BOM, we're as stuffed as any other application and we
should use the system default (CP_ACP?)

Question 1 - what's the encoding of a pipe when unicode output isn't in force?
Question 2 - is CP_ACP the correct way of specifying the current
system codepage?
Question 3 - is the above correct?

That's so complicated that there's a question 4 - "do I care?" - but
I'm going to be conscientious and try to do it right... :-)
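
Written out as a decision table, steps 1-7 look something like the sketch below. It is only the logic, with no Win32 or SDK calls: the caller is assumed to have already worked out the kind of handle, the UnicodeOutput setting and whether a BOM was found. All the names (InputKind, ReadPlan, choose_read_plan) are hypothetical, and Question 1 is left as an explicit "undecided" result because it is unanswered.

```c
#include <assert.h>

/* All names here are hypothetical; none come from the TCC plugin SDK. */
typedef enum { IN_CONSOLE, IN_PIPE, IN_FILE } InputKind;

typedef enum {
    READ_WIDE_CONSOLE,    /* steps 1-2: use ReadConsole, wide characters */
    READ_UTF16,           /* step 5 or 6: the bytes are UTF-16 */
    READ_SYSTEM_CODEPAGE, /* step 7: no BOM, fall back to the system default */
    READ_UNDECIDED        /* step 5 "else": Question 1 is still open */
} ReadPlan;

/* Map the three facts about stdin onto a way of reading it. */
ReadPlan choose_read_plan(InputKind kind, int unicode_output, int has_bom)
{
    switch (kind) {
    case IN_CONSOLE:
        return READ_WIDE_CONSOLE;
    case IN_PIPE:
        return unicode_output ? READ_UTF16 : READ_UNDECIDED;
    case IN_FILE:
        return has_bom ? READ_UTF16 : READ_SYSTEM_CODEPAGE;
    }
    assert(!"unknown InputKind");
    return READ_UNDECIDED;
}
```

Keeping the decision separate from the handle tests makes it easy to change one branch once Questions 1-3 are settled.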

Paul.
 

rconn

#20
vefatica wrote:


> Quote:
> >If you're using a non-Unicode font (not recommended), you'll add
> >another layer of confusion (and thunking).
>
> While that may be true, it should be noted that this command
>
> timer & *dir f:\windows\system32 & timer
>
> (2283 lines) takes 50% longer when Lucida Console is used than when the same
> size raster font is used (here, 2.7 vs. 1.8 seconds when the end of the console
> screen buffer is not reached, 1.9 vs. 1.3 seconds when started with a full
> console screen buffer).
>
> The added confusion and thunking seem to speed things up!
Here, Lucida Console draws in 0.69 seconds vs. 0.78 seconds for Terminal.

I suspect what you're really measuring is anti-aliasing & ClearType vs.
doing nothing, not Unicode vs. ASCII. (This is going to be highly
dependent on how good a video card you have!)

Rex Conn
JP Software