How many lines are in this file?

#1
I have a ~31MB ASCII export of HKML.

If I count the number of occurrences of 0x0A (LF) with my own program I get 627507, which agrees with the line count given by two versions of WC.EXE (Gnu and Thompson Toolkit).

If I count the number of occurrences of 0x0D (CR), I get 627546, which agrees with the line counts given by TEXTPAD (editor) and DEVENV's editor.

If I count the number of occurrences of 0x0D0A (CRLF), I get 627470, which doesn't agree with anything else I've seen. It does suggest that there are 76 CRs not followed by LFs and 37 LFs not preceded by CRs.
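
The subtraction behind the 76 and the 37 can be sketched in C. This is only an illustration, not the program used for the counts above; EolCounts, count_eols, lone_cr and lone_lf are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Tallies of end-of-line bytes found in a buffer. */
typedef struct {
    size_t cr;    /* every 0x0D */
    size_t lf;    /* every 0x0A */
    size_t crlf;  /* every 0x0D immediately followed by 0x0A */
} EolCounts;

/* Hypothetical helper: scan len bytes and count CR, LF and CR/LF pairs. */
EolCounts count_eols(const unsigned char *buf, size_t len)
{
    EolCounts c = {0, 0, 0};
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0D) {
            c.cr++;
            if (i + 1 < len && buf[i + 1] == 0x0A)
                c.crlf++;
        } else if (buf[i] == 0x0A) {
            c.lf++;
        }
    }
    return c;
}

/* CRs not followed by an LF. */
size_t lone_cr(EolCounts c) { return c.cr - c.crlf; }

/* LFs not preceded by a CR. */
size_t lone_lf(EolCounts c) { return c.lf - c.crlf; }
```

With the figures above, lone_cr gives 627546 - 627470 = 76 and lone_lf gives 627507 - 627470 = 37.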

TCC's @LINES gives 627560. How does it arrive at this number? Is there a sense in which it is correct?
 

rconn

#2
> I have a ~31MB ASCII export of HKML.
>
> If I count the number of occurrences of 0x0A (LF) with my own program I
> get 627507, which agrees with the line count given by two versions of
> WC.EXE (Gnu and Thompson Toolkit).
>
> If I count the number of occurrences of 0x0D (CR), I get 627546, which
> agrees with the line counts given by TEXTPAD (editor) and DEVENV's
> editor.
>
> If I count the number of occurrences of 0x0D0A (CRLF), I get 627470,
> which doesn't agree with anything else I've seen. It does suggest that
> there are 76 CRs not followed by LFs and 37 LFs not preceded by CRs.
>
> TCC's @LINES gives 627560. How does it arrive at this number? Is
> there a sense in which it is correct?
@LINES counts lines ending in either a null, CR, LF, or CR/LF.

But what use is it counting lines in a mangled file with random EOLs?

Rex Conn
JP Software
 
#3
> @LINES counts lines ending in either a null, CR, LF, or CR/LF.
>
> But what use is it counting lines in a mangled file with random EOLs?
>
> Rex Conn
> JP Software
It is plain ASCII, with no NULs. The numbering of lines is a way of referring to where in the file some text is. TCC's notion of what line something is on differs from the notion held by my two editors.

And I haven't yet figured out how to get TCC's count.
 
#4
> @LINES counts lines ending in either a null, CR, LF, or CR/LF.
The routine below would seem to count as you say TCC does, yet it gives a count agreeing with my editors but which is 14 less than TCC's count.

What's TCC counting that I'm not?

Code:
// File is memory-mapped; p points at the first byte, pEnd one past the last
while ( p < pEnd )
{
    if ( *p == 13 ) // count every CR
    {
        dwCount += 1;
        p += 1;
        if ( p < pEnd && *p == 10 ) // if followed by LF, go past it
            p += 1;
    }
    else if ( *p == 10 ) // count LFs not preceded by CR
    {
        dwCount += 1;
        p += 1;
    }
    else if ( *p == 0 ) // count NULs
    {
        dwCount += 1;
        p += 1;
    }
    else // ordinary character
        p += 1;
}
 
#5
rconn wrote:
| @LINES counts lines ending in either a null, CR, LF, or CR/LF.

This explains the line count discrepancies in Vince's report.

| But what use is it counting lines in a mangled file with random
| EOLs?

Technically, there are no "lines" in an ASCII text file. Each printable
character, including space, displays a character in the current cursor
position, and moves the cursor one character to the right. CR, LF, FF, VT,
and HT are "format effectors", and move the cursor without displaying any
new characters. Of these, only LF, FF and VT cause vertical cursor motion,
and should thus be the only ones which affect the line count. NUL and DEL
are neither printable characters nor format effectors. A CR by itself is
intended for overtyping, and should not be counted as a line.

Several conventions are in use to represent text formatted into lines. As we
all know, the PC-DOS convention is to use the CR, LF sequence, imitating the
characters that had to be sent to early mechanical printers. This allows
using the CR to cause overtyping. The Unix convention, to drop the CR and
use the NL character (borrowing the code of LF for the purpose), represents a
serious file size reduction for files with short lines, which several
decades ago was significant. However, the C/Unix convention of using NUL as
a string terminator has always been, and still is, an abomination,
preventing the use of one of the legitimate data characters for its intended
purpose (padding). In the old TRS-80 the worst scheme was used: CR by
itself. This virtually prevents overtyping. (Open)VMS accepts all three of
them, but its native format is to use CR and LF only as format effectors,
and to prefix each line (record) of data with its 16-bit character count.
None of these conventions uses the NUL character to delimit lines.

However, a registry dump is not a "mangled file", nor does it have "random
EOL-s". It is a text file, but not one formatted into lines, thus Rex is
correct, it has no meaningful "line count". How it is displayed, esp. where
the display shows line breaks, depends on the viewing program. Notepad,
textpad, word, wordpad, etc. may each show it differently from any other.
--
Steve
 
#6
On Thu, 08 Oct 2009 21:38:21 -0500, Steve Fábián wrote:

|rconn wrote:
|| @LINES counts lines ending in either a null, CR, LF, or CR/LF.
|
|This explains the line count discrepancies in Vince's report.

No it doesn't.

And @LINES says 3 (4 lines) for this one:

65 0d 0d 0a 66 0d 67 0a 68 0d 0a 69 0d 0d 0d 0a e...f.g.h..i....

How does it do that? One of "0d0d0a", "0d", "0a", "0d0a", and "0d0d0d0a" isn't
counted (apparently contrary to what he wrote).

And (see my other recent post) how does it get **more** lines than me in that
scenario?
--
- Vince
 

rconn

#7
> The routine below would seem to count as you say TCC does, yet it gives
> a count agreeing with my editors but which is 14 less than TCC's count.
>
> What's TCC counting that I'm not?
This is what TCC is doing:

// get a line and set the file pointer to the next line
for ( i = 0; (( i < nMaxSize ) && ( *pszLine != EoF )); i++, pszLine++ ) {

    if ( *pszLine == _TEXT('\r') ) {

        // skip the CR
        i++;

        // check for nitwit MS programmers writing a CR/CR/LF
        if (( pszLine[1] == _TEXT('\r')) && ( pszLine[2] == _TEXT('\n') ))
            i += 2;
        // skip a LF following a CR
        else if ( pszLine[1] == _TEXT('\n') )
            i++;

        break;

    } else if ( *pszLine == _TEXT('\n') ) {
        i++;
        break;
    }
}

If you have a really long line (>32K) it's going to be counted as 2+ lines.
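
The long-line effect can be illustrated with a small sketch. It is not TCC's source: the snippet above does not show where lines are counted or what nMaxSize is, so this version simply assumes that a read stops at a terminator or after max_size characters, and that each read counts as one line. count_lines_bounded and max_size are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical bounded line reader. A "line" ends at CR, LF, CR/LF or
 * CR/CR/LF, or after max_size characters if no terminator was seen.
 * Returns how many such lines buf[0..len) is split into.
 */
size_t count_lines_bounded(const unsigned char *buf, size_t len, size_t max_size)
{
    size_t lines = 0, pos = 0;
    assert(max_size > 0);
    assert(buf != NULL || len == 0);
    while (pos < len) {
        size_t taken = 0;               /* characters read for this line */
        while (pos < len && taken < max_size) {
            unsigned char ch = buf[pos++];
            if (ch == '\r') {
                if (pos + 1 < len && buf[pos] == '\r' && buf[pos + 1] == '\n')
                    pos += 2;           /* CR/CR/LF */
                else if (pos < len && buf[pos] == '\n')
                    pos += 1;           /* CR/LF */
                break;
            }
            if (ch == '\n')
                break;
            taken++;
        }
        lines++;                        /* assumption: one count per read */
    }
    return lines;
}
```

Under these assumptions, three CR/LF-terminated lines of 17000 characters count as 3 lines with a limit of 32768, but as 15 with a limit of 4096, because each line is read in five pieces.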

Rex Conn
JP Software
 
#8
vefatica wrote:
| The routine below would seem to count as you say TCC does, yet it
| gives a count agreeing with my editors but which is 14 less than
| TCC's count.
|
| What's TCC counting that I'm not?

Vince:

The ANSI-C89 program I wrote (last modified 2000-02-08) counts all 256
possible 8-bit character codes, though it reports only the CR, HT, and LF
counts, and whether or not other special characters are used. For your
experiments, I'd count each code individually, and would also count all
uninterrupted sequences of the potential "new line" codes (NUL, CR, LF, FF,
VT). This last is tricky: with sequences such as CR-CR-LF-LF, the
possibilities are unlimited.
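
A minimal sketch of that experiment in C, assuming the file is already in memory. It is not the program mentioned above; ByteStats, is_eol_code and collect_stats are hypothetical names. It tallies every byte value and counts maximal runs of the candidate "new line" codes without trying to classify the runs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t byte_count[256]; /* occurrences of each 8-bit code */
    size_t eol_runs;        /* maximal runs of NUL, CR, LF, FF, VT */
    size_t longest_run;     /* length in bytes of the longest such run */
} ByteStats;

/* Candidate "new line" codes: NUL, LF, VT, FF, CR. */
int is_eol_code(unsigned char ch)
{
    return ch == 0x00 || ch == 0x0A || ch == 0x0B || ch == 0x0C || ch == 0x0D;
}

ByteStats collect_stats(const unsigned char *buf, size_t len)
{
    ByteStats s;
    size_t run = 0;                  /* length of the run in progress */
    assert(buf != NULL || len == 0);
    memset(&s, 0, sizeof s);
    for (size_t i = 0; i < len; i++) {
        s.byte_count[buf[i]]++;
        if (is_eol_code(buf[i])) {
            if (run == 0)
                s.eol_runs++;        /* a new run starts here */
            run++;
            if (run > s.longest_run)
                s.longest_run = run;
        } else {
            run = 0;
        }
    }
    return s;
}
```

Comparing eol_runs with byte_count[0x0A] and byte_count[0x0D] shows at a glance whether a file contains anything other than plain CR/LF or plain LF endings.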
--
Steve
 

rconn

#9
> How does it do that? One of "0d0d0a", "0d", "0a", "0d0a", and
> "0d0d0d0a" isn't counted (apparently contrary to what he wrote).
See my other post -- the CR/CR/LF line ending is a special case; some
less-than-stellar MS programmer(s) put that in some Windows files. (It is
what you get by writing a CR/LF to a file opened in text mode, where the LF
is itself expanded to CR/LF.)

Rex Conn
JP Software
 
#10
vefatica wrote:
| And @LINES says 3 (4 lines) for this one:
|
| 65 0d 0d 0a 66 0d 67 0a 68 0d 0a 69 0d 0d 0d 0a e...f.g.h..i....

I was able to build the above file using the batch file below:

@echos %@char[0x65 0x0d 0x0d 0x0a 0x66 0x0d 0x67 0x0a 0x68 0x0d 0x0a 0x69 0x0d 0x0d 0x0d 0x0a]`` > test.txt

and I can verify your result. It is a highly unusual file format though. My
old program counted CRs and LFs correctly. It was not intended to count
sequences.
--
Steve
 
#11
I can't figure that out. What's actually counting the lines? Are you
incrementing a line counter after leaving the for-loop, then re-entering the
for-loop with a new buffer?

On Thu, 08 Oct 2009 22:00:54 -0500, rconn wrote:

|This is what TCC is doing:
|
| // get a line and set the file pointer to the next line
| for ( i = 0; (( i < nMaxSize ) && ( *pszLine != EoF )); i++, pszLine++ ) {
|
|     if ( *pszLine == _TEXT('\r') ) {
|
|         // skip the CR
|         i++;
|
|         // check for nitwit MS programmers writing a CR/CR/LF
|         if (( pszLine[1] == _TEXT('\r')) && ( pszLine[2] == _TEXT('\n') ))
|             i += 2;
|         // skip a LF following a CR
|         else if ( pszLine[1] == _TEXT('\n') )
|             i++;
|
|         break;
|
|     } else if ( *pszLine == _TEXT('\n') ) {
|         i++;
|         break;
|     }
| }
--
- Vince
 
#12
On Thu, 08 Oct 2009 22:37:16 -0500, Steve Fábián wrote:

|vefatica wrote:
|| And @LINES says 3 (4 lines) for this one:
||
|| 65 0d 0d 0a 66 0d 67 0a 68 0d 0a 69 0d 0d 0d 0a e...f.g.h..i....
|
|I was able to build the above file using the batch file below:
|
|@echos %@char[0x65 0x0d 0x0d 0x0a 0x66 0x0d 0x67 0x0a 0x68 0x0d 0x0a 0x69
|0x0d 0x0d 0x0d 0x0a]`` > test.txt
|
|and I can verify your result. It is a highly unusual file format though. My
|old program counted CRs and LFs correctly. It was not intended to count
|sequences.

OK, I get Rex's response to the example above.

I still can't figure out how TCC can get **more** lines than me. I don't
**think** there are any 32K lines in that reg export.
--
- Vince
 
#13
On Thu, 08 Oct 2009 22:00:54 -0500, rconn wrote:

|If you have a really long line (>32K) it's going to be counted as 2+ lines.

I see what you mean. But I don't exactly get how it works (no problem). There
are several such lines in the REGfile I was using.

v:\> (for /l %i in (1,1,17000) (echos a >> 17000.txt)) & echo. >> 17000.txt

v:\> (for /l %i in (1,1,17000) (echos a >> 17000.txt)) & echo. >> 17000.txt

v:\> (for /l %i in (1,1,17000) (echos a >> 17000.txt)) & echo. >> 17000.txt

v:\> dir /k /m 17*
2009-10-09 00:04 51,006 17000.txt

The file has three normally terminated lines of 17000 a's.

v:\> echo %@lines[17000.txt]
14

FWIW, you can count lines a heck of a lot faster if you memory map the file (and
you don't need a buffer of limited size).

v:\> timer & lines.exe hklm.reg & timer
Timer 1 on: 00:15:32
627546 lines
Timer 1 off: 00:15:33 Elapsed: 0:00:00.09

v:\> timer & echo %@lines[hklm.reg] & timer
Timer 1 on: 00:15:34
627560
Timer 1 off: 00:15:38 Elapsed: 0:00:03.86
--
- Vince
 
#14
> @LINES counts lines ending in either a null, CR, LF, or CR/LF.
In spite of what you said, TCC finds 4 lines here.

Code:
61 62 63 0d 64 65 66 0d  0a 67 68 00 69 0d 0d 0a  abc.def..gh.i...
6a 6b 6c 0d 0d 0d 0a 6d  6e 6f                    jkl....mno

v:\> echo %@lines[test.txt]
3
0. abc[CR]def (terminated by CRLF)
1. gh[NUL]i (terminated by CRCRLF)
2. jkl (terminated by CRCRCRLF)
3. mno (unterminated)

It would seem that, in effect, you're just counting LFs (the CRs and the NUL don't matter) and checking for an unterminated last line.
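
That guess can be written down as a short C sketch. It encodes only the hypothesis in the previous sentence, not TCC's confirmed behavior, and count_lf_lines is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothesis only: one line per LF, plus one more line if any bytes
 * follow the last LF (an unterminated final line). CRs and NULs are
 * treated as ordinary data.
 */
size_t count_lf_lines(const unsigned char *buf, size_t len)
{
    size_t lines = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0A)
            lines++;
    }
    if (len > 0 && buf[len - 1] != 0x0A)
        lines++;                     /* unterminated last line */
    return lines;
}
```

For the 26-byte dump above it returns 4, which matches the four lines listed as 0 to 3, with @LINES printing the number of the last one.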

I'd like to reliably come up with the same counts as TCC. Please help me understand.
 
#16
Rex,

I suggest that the algorithm for counting lines be changed to do this:

1. Count any number of consecutive CRs (possibly followed by a LF) as one line. (But every LF is the end of a line.)

2. Likewise, count any number of consecutive NULs as a single terminator, ending just one line.

3. Count VTs and FFs as line terminators, like LFs.

4. Count the EOF as a line terminator (it probably already is), but not if it follows a line terminator character CR, LF, VT, FF, or NUL.
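
The four rules above could be sketched in C as follows. This is one possible reading of the proposal, not TCC code; count_lines_proposed and the CH_ constants are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { CH_NUL = 0x00, CH_LF = 0x0A, CH_VT = 0x0B, CH_FF = 0x0C, CH_CR = 0x0D };

/* Count lines in buf[0..len) according to the four proposed rules. */
size_t count_lines_proposed(const unsigned char *buf, size_t len)
{
    size_t lines = 0, i = 0;
    assert(buf != NULL || len == 0);
    while (i < len) {
        switch (buf[i]) {
        case CH_CR:
            /* Rule 1: a run of CRs, optionally followed by one LF. */
            while (i < len && buf[i] == CH_CR)
                i++;
            if (i < len && buf[i] == CH_LF)
                i++;
            lines++;
            break;
        case CH_NUL:
            /* Rule 2: a run of NULs is a single terminator. */
            while (i < len && buf[i] == CH_NUL)
                i++;
            lines++;
            break;
        case CH_LF:
        case CH_VT:
        case CH_FF:
            /* Rules 1 and 3: every LF, VT or FF ends a line. */
            i++;
            lines++;
            break;
        default:
            /* Rule 4: EOF ends a line only after an ordinary character. */
            i++;
            if (i == len)
                lines++;
            break;
        }
    }
    return lines;
}
```

Under this reading CR-CR-LF-LF ends two lines: the CR run with its LF is one terminator and the second LF is another.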

If it is felt that this would break existing .btm files that people have, perhaps a newer set of function names could be used, like @XLINE[file, lineno], @XLINES[file]. ('X' for eXtended)

In a perfect world, there'd be a way to specify a list of terminators. I don't expect to be trying to count lines imported from an IBM 1620 BCD file using record marks as terminators, but it would be nice if I could do so.
 
#18
vefatica wrote:
| ---Quote (Originally by vefatica)---
| It would seem that, in effect, you're just counting LFs (the
| CRs and the NUL don't matter) and checking for an unterminated last
| line.
|---End Quote---
| If that's what's happening, I'm not at all complaining. It's simple
| and fast, and agrees with WC.EXE (which I've relied on for years and
| never had reason to question).

wc.exe is from Unix, so its rule is to count "NL" characters; an identical
result is not surprising.
--
Steve