How to? Use wild cards in include list without duplicate processing

Steve Fabian · Aug 24, 2013

I had occasion to search for files by different wildcards using an include list containing multiple wildcard entries, e.g.,
dir *post*;*path*
and file postpath.btm showed up twice - obviously once for each wildcard its name matches. Is there a generic way to avoid such multiple processing? In the present instance the effect was negligible, but if the command were COPY (for instance, to back up selections of files) it would attempt to copy a possibly huge file more than once. I know of many ways to program around it, for example, the /X option of COPY, combined with selecting by attributes: /A:A will in most cases do it, but it is not a generic solution.

vefatica · Aug 24, 2013

Hmmm! I'll bet it has worked that way for a very long time (and that it won't be changed). I'm surprised no one has complained about it before this. I think it's bad behavior. The help doesn't address this specifically: "When you use an include list, all files that match any entry in the include list are processed together, and will appear together in the directory display or SELECT list". DO suffers from this too but winds up enumerating the files in a different order.

Code:

v:\> dir /km *ses*;*sion*
2013-02-27  13:56  3,096  session.log
2013-02-27  13:56  3,096  session.log
2013-02-03  14:35  4,499  session.txt
2013-02-03  14:35  4,499  session.txt
2012-07-19  22:44  7,168  sessionid.exe
2012-07-19  22:44  7,168  sessionid.exe

v:\> do f in *ses*;*sion* ( echo %f )
session.log
session.txt
sessionid.exe
session.log
session.txt
sessionid.exe

Steve Fabian · Aug 24, 2013

FOR and PDIR behave identically. In all of them the /O:NE option sorted them properly, forcing duplicates to be consecutive. Now we just need a new ordering option, e.g., X, to eXclude duplicates, to combine with the other future option, to order all items as a single group, instead of sorting each directory's content separately. Hopefully V16 will at last fulfill that very old request.
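
The idea that sorting makes duplicates adjacent, so that they can then be dropped, can be sketched outside TCC. The Python below is only an illustration under assumptions: the function name `drop_adjacent_duplicates` is hypothetical, the proposed "X" ordering option does not exist, and names are compared case-insensitively as a stand-in for Windows file name rules.

```python
from itertools import groupby

def drop_adjacent_duplicates(names):
    """Sort names, then keep one entry from each run of equal names.

    Hypothetical helper: it mimics "sort by name, then exclude duplicates";
    it is not a TCC feature.
    """
    # Sorting puts repeated names next to each other.
    ordered = sorted(names, key=str.lower)
    # groupby collapses each run of names that compare equal (ignoring case).
    return [next(group) for _, group in groupby(ordered, key=str.lower)]

# The names reported twice by the include list *ses*;*sion* earlier in the thread.
listing = ["session.log", "session.txt", "sessionid.exe",
           "session.log", "session.txt", "sessionid.exe"]
print(drop_adjacent_duplicates(listing))
```

The approach only works because the list is sorted first; on an unsorted list the repeats would not be adjacent and would survive.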

vefatica · Aug 24, 2013

Steve Fabian said:
FOR and PDIR behave identically. In all of them the /O:NE option sorted them properly, forcing duplicates to be consecutive. Now we just need a new ordering option, e.g., X, to eXclude duplicates, to combine with the other future option, to order all items as a single group, instead of sorting each directory's content separately. Hopefully V16 will at last fulfill that very old request.

I don't think the same file should be reported more than once. Even the byte and allocation totals are affected by this.

Steve Fabian · Aug 24, 2013

vefatica said:
I don't think the same file should be reported more than once. Even the byte and allocation totals are affected by this.

I agree, but it is a change from current operation. Though you and I think the current operation is incorrect, and may even be dangerous, adding an option to suppress multiple reporting, rather than suppressing it unconditionally, is more backward compatible: it protects whatever misguided code may exist out there that depends on the current behavior. But I wouldn't complain if the change were made without requiring an option. AFAIK in all of my code the file specifications in any single include list are orthogonal to each other. The case precipitating my OP was an interactive search for similar batch files.

rconn · Aug 24, 2013

vefatica said:
Hmmm! I'll bet it has worked that way for a very long time (and that it won't be changed). I'm surprised no one has complained about it before this.

It's worked that way for 24 years, and it won't be changed. (CMD behaves the same way, albeit without the include lists.)

If nobody's complained about it before, that's a strong indication that nobody cares.

rconn · Aug 25, 2013

If you want to remove duplicates before processing a file list, it's trivial to pipe it through TPIPE.

Steve Fabian · Aug 25, 2013

rconn said:
It's worked that way for 24 years, and it won't be changed. (CMD behaves the same way, albeit without the include lists.)

The issue applies only to include lists, not supported in CMD, so comparison with CMD behavior is not possible.

Originally I just asked how to get a pruned report with wild cards in include lists (i.e., without duplications), and later I came to the conclusion it could only be feasible if the report is sorted, which most commands that handle include lists do support, making it technically feasible.

rconn said:
If nobody's complained about it before, that's a strong indication that nobody cares.

I had not observed the issue before, which certainly indicates it occurs rarely enough not to make a difference in TCMD sales.

rconn said:
If you want to remove duplicates before processing a file list, it's trivial to pipe it through TPIPE.

Well, that is logically feasible, but it turns some include list operations from a single command into 3 stages (1/ collecting the file list, 2/ pruning via TPIPE, 3/ the action on each file), and it requires either the creation of a temporary file (for the result of TPIPE) or the invocation of the actual processing command separately for each file.

Regardless, once the long-requested ability to order all file reports of a single command from all reported directories together as a unit is available, filtering out duplicates ought to be quite simple.
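
For illustration only, the three stages can be written out as a minimal Python sketch. It does not use TPIPE or any TCC command; the function names `collect`, `prune` and `process` are hypothetical, and Python's `glob` wildcard rules are assumed as a stand-in for TCC's.

```python
import glob
import os

def collect(directory, patterns):
    """Stage 1: gather matches pattern by pattern; duplicates are possible."""
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(directory, pattern)))
    return found

def prune(paths):
    """Stage 2: keep the first occurrence of each path, preserving order."""
    seen = set()
    unique = []
    for path in paths:
        # Normalise so the same file is recognised however it was spelled.
        key = os.path.normcase(os.path.abspath(path))
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique

def process(paths, action):
    """Stage 3: apply the action once per surviving path."""
    for path in paths:
        action(path)
```

A call such as `process(prune(collect(".", ["*post*", "*path*"])), print)` would then report postpath.btm once rather than twice. Here the pruned list lives in memory, which is the counterpart of the temporary file mentioned above.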

Charles Dye · Aug 25, 2013

I wonder whether this deduplication couldn't be achieved by enhancing WildcardComparison() to accept an include list in the current pszWildcardName parameter. How would you add this ability without adding another argument and breaking backward compatibility? Perhaps by using e.g. a vertical bar instead of the current semicolon as a wildspec separator?

vefatica · Aug 25, 2013

I agree that duplicate files aren't going to matter often. But having to prepare a list of files and then process the list separately somewhat detracts from the convenience of using include lists in the first place. And there are the byte/allocation totals, as I mentioned earlier. It seems like a good candidate for make_it_work_better and (if absolutely necessary) make it an option to emulate_legacy_behavior/duplicate_cmd_bugs.

Steve Fabian · Aug 25, 2013

Charles Dye said:
...
How would you add this ability without adding another argument and breaking backward compatibility? Perhaps by using e.g. a vertical bar instead of the current semicolon as a wildspec separator?

Using a different marker than semicolon to indicate that the list is capable of generating duplicates is good thinking, but I would rather find a character other than the one which signals a pipe... But though the include list syntax is identical in all commands which support it, a command option (though it may be difficult to find an option code that is available in all such commands, unless using a new /N suboption) would provide backward compatibility.

vefatica · Aug 25, 2013

Steve Fabian said:
Using a different marker than semicolon to indicate that the list is capable of generating duplicates is good thinking, but I would rather find a character other than the one which signals a pipe... But though the include list syntax is identical in all commands which support it, a command option (though it may be difficult to find an option code that is available in all such commands, unless using a new /N suboption) would provide backward compatibility.

If there had to be an option, I'd prefer another radio button in Option\Startup ... "New Style Include Lists". That would allow the legacy behavior to be the default while putting very little burden on users who wanted the newer behavior. I doubt there would be much demand for switching "on the fly".

But, honestly, I find it hard to believe that getting rid of duplicate include list matches by default would bother (m)any users. If users had been working around that (unfortunate) behavior for twenty-some years, I suspect it would have been mentioned before (and with some regularity).

rconn · Aug 25, 2013

Steve Fabian said:
The issue applies only to include lists, not supported in CMD, so comparison with CMD behavior is not possible.

No, you can get the same behavior with multiple arguments or FOR.

Steve Fabian · Aug 25, 2013

Vince, I agree. But see below.

Rex, IMHO multiple arguments are intentional (when not mistypes), so it is unlikely the user would want to filter out duplicates. They are always processed in the order specified, and each argument is processed through the command individually, and may not even be located in the same directory. OTOH matches to the elements of an include list are per definitionem in the same directory, and if any suboption but u of the /O option is used, they are sorted together, thus duplicates will be adjacent. My above contrast of multiple arguments v. include lists applies to the file specification list of FOR as well, as shown in the example below.
>dir/b/w
a.1 a.2 b.1 b.2
>for /o:ne %x in (*.1 *.2) echo %x
a.1
a.2
b.1
b.2
>for /o:ne %x in (*.1;*.2) echo %x
a.1
b.1
a.2
b.2

Of course, the include list a.*;*.1;*.2 (for searching the above directory) would be more representative of the underlying issue we are discussing. Please nobody suggest that the wildcard list could be set up so that no file matches more than one - even the three-step procedure (get all with possible duplicates, sort with no duplicates retained, process each file in result) is simpler than doing that in the general case.

rconn · Aug 25, 2013

Steve Fabian said:
Vince, I agree. But see below.

Rex, IMHO multiple arguments are intentional (when not mistypes), so it is unlikely the user would want to filter out duplicates. They are always processed in the order specified, and each argument is processed through the command individually, and may not even be located in the same directory. OTOH matches to the elements of an include list are per definitionem in the same directory, and if any suboption but u of the /O option is used, they are sorted together, thus duplicates will be adjacent.

It can be done. It will require a complete rewrite of all of the file handling code, so it will end up affecting about 100 commands and several hundred internal variables and variable functions.

It should take about 2 months, plus testing. Do you want it badly enough to give up all of the other new features you have requested?

vefatica · Aug 25, 2013

Steve Fabian said:
Vince, I agree. But see below.

Rex, IMHO multiple arguments are intentional (when not mistypes), so it is unlikely the user would want to filter out duplicates. They are always processed in the order specified, and each argument is processed through the command individually, and may not even be located in the same directory. OTOH matches to the elements of an include list are per definitionem in the same directory, and if any suboption but u of the /O option is used, they are sorted together, thus duplicates will be adjacent. My above contrast of multiple arguments v. include lists applies to the file specification list of FOR as well, as shown in the example below.
>dir/b/w
a.1 a.2 b.1 b.2
>for /o:ne %x in (*.1 *.2) echo %x
a.1
a.2
b.1
b.2
>for /o:ne %x in (*.1;*.2) echo %x
a.1
b.1
a.2
b.2

Of course, the include list a.*;*.1;*.2 (for searching the above directory) would be more representative of the underlying issue we are discussing. Please nobody suggest that the wildcard list could be set up so that no file matches more than one - even the three-step procedure (get all with possible duplicates, sort with no duplicates retained, process each file in result) is simpler than doing that in the general case.

I couldn't care less about the order. It's the duplicates, especially in the case of an include list, that I find objectionable. I think of an include list as a single specification that a file either matches or does not match. With separate arguments, the search must be wildcard_spec by wildcard_spec because the specs may involve different directories. When, as with include lists, the search is within one directory the search should be file by file and the only question is "Is a given file included by the include list or not?". Reporting the same file more than once doesn't make sense to me and it is likely to cause difficulties (at least error messages ... DEL/MOVE).
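
The file-by-file view, in which the only question is whether a given file is included by the include list, can be sketched as follows. This is an illustration under assumptions, not how TCC is implemented: the helper names are hypothetical, and Python's `fnmatch` wildcard rules only approximate TCC's.

```python
from fnmatch import fnmatch

def matches_include_list(name, include_list):
    """Return True if name matches at least one element of the include list."""
    patterns = include_list.split(";")
    # Lower-casing both sides makes the comparison case-insensitive.
    return any(fnmatch(name.lower(), pattern.lower()) for pattern in patterns)

def select(names, include_list):
    """Visit each directory entry once, so no entry can be reported twice."""
    return [name for name in names if matches_include_list(name, include_list)]

names = ["a.1", "a.2", "b.1", "b.2", "c.3"]
print(select(names, "a.*;*.1;*.2"))
```

Because the outer loop runs over files rather than over wildcards, a.1 is reported once even though it matches both a.* and *.1.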

vefatica · Aug 25, 2013

rconn said:
It can be done. It will require a complete rewrite of all of the file handling code, so it will end up affecting about 100 commands and several hundred internal variables and variable functions.

It should take about 2 months, plus testing. Do you want it badly enough to give up all of the other new features you have requested?

I've been around non-stop since 4DOSv2 (I think) and this has never bitten me. So I suppose it's likely to never make a difference to me and changing it probably wouldn't be good use of your time. It's only the perfectionist in me that says "Yeah, take six months and change a few antique behaviors that could be made better".

Steve Fabian · Aug 25, 2013

Vince, your order makes sense if there are hundreds of list elements and only a few files in the directory, but it cannot make use of any wildcard processing provided by the file system API. The other order, breaking up the list into its elements, and finding all matches to each element, can utilize the file system's wildcard APIs. From looking at unsorted reports in TCC include list handling, I guess Rex chose the latter to enhance M$ file handling. The disadvantage is that a file may match multiple list elements, and each such match results in a separate report. To eliminate scattered duplicate reports sorting is needed.

Rex, in my OP I asked if there is a way to eliminate the duplications within the command handling the include list, and the obvious answer is NO. It would be nice, but not essential. I would not want to give up any of my other requests for this one. I do wonder however if the oft-requested "sort all files from all directories together" feature is coming at last? Charles Dye's SIFT plugin does that, but it is limited to 4096 files, and does not have PDIR's flexibility on what fields to include in the report.
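
As an illustration of the element-by-element order and the scattered duplicates it produces, here is a small Python sketch. It assumes `fnmatch` as a stand-in for the file system's wildcard matching, and the function name `enumerate_by_element` is hypothetical; it is a guess at the observable behavior, not TCC's actual code.

```python
from collections import Counter
from fnmatch import fnmatch

def enumerate_by_element(names, include_list):
    """Outer loop over list elements, inner loop over directory entries."""
    hits = []
    for pattern in include_list.split(";"):
        # Each element reports all of its own matches, unaware of earlier ones.
        hits.extend(n for n in names if fnmatch(n.lower(), pattern.lower()))
    return hits

directory = ["a.1", "a.2", "b.1", "b.2"]
hits = enumerate_by_element(directory, "a.*;*.1;*.2")
print(hits)
# Names that were reported by more than one list element.
print([name for name, count in Counter(hits).items() if count > 1])
```

With this order the repeated names are not adjacent, which is why a sort (or some other record of what has already been reported) is needed before they can be removed.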

rconn · Aug 25, 2013

Steve Fabian said:
I do wonder however if the oft-requested "sort all files from all directories together" feature is coming at last? Charles Dye's SIFT plugin does that, but it is limited to 4096 files, and does not have PDIR's flexibility on what fields to include in the report.

It could be done, provided you have an x64 version of Windows and enough RAM.

But nobody has ever come up with a convincing reason why they would want to sort all files from all directories together (and be unable to tell where any of the files actually came from!).

vefatica · Aug 25, 2013

Steve Fabian said:
Vince, your order ...

Please don't call it my order. I don't give a hoot about the order. I object to the same file appearing more than once in a DIR listing, regardless of where, or being processed more than once in a DO loop or a FOR loop (and I suppose by several other commands that honor include lists).

Steve Fabian · Aug 25, 2013

rconn said:
It could be done, provided you have an x64 version of Windows and enough RAM.

But nobody has ever come up with a convincing reason why they would want to sort all files from all directories together (and be unable to tell where any of the files actually came from!).

One of the reasons for sorting all directories together is to prune the proliferation of identical files (usually, though not always, named identically, too). And if you use PDIR and the file is reported by the fpn field, you already have full path.

Steve Fabian · Aug 25, 2013

vefatica said:
Please don't call it my order. I don't give a hoot about the order. I object to the same file appearing more than once in a DIR listing, regardless of where, or being processed more than once in a DO loop or a FOR loop (and I suppose by several other commands that honor include lists).

You used the word order exclusively for the reporting order. In the quoted phrase, I used it to describe the order of operations in finding all matches to an include list in a directory. Once upon a time one might have used row major vs. column major ... I, too, desire no duplicates; but unlike you there are times when I do desire sorted reports, though most of the time it's more for appearance than function.

rconn · Aug 27, 2013

vefatica said:
I couldn't care less about the order. It's the duplicates, especially in the case of an include list, that I find objectionable. I think of an include list as a single specification that a file either matches or does not match. With separate arguments, the search must be wildcard_spec by wildcard_spec because the specs may involve different directories.

You're inventing a nonexistent distinction. An include list is treated exactly the same as multiple arguments; the only difference is that the include list doesn't require a pathname for each subsequent argument.

Steve Fabian · Aug 27, 2013

What TCC does now is acceptable, but I do see a difference between the way separate arguments are treated and the way include list members are treated, as seen in the example below.

Code:

[C:\BTM]*dir/b/s/a-d/ou histc* histi*
C:\BTM\HISTC.BTM
C:\BTM\V0000\HISTC.BTM
C:\BTM\HISTI.BTM
C:\BTM\V0000\HISTI.BTM

[C:\BTM]*dir/b/s/a-d/ou histc*;histi*
C:\BTM\HISTC.BTM
C:\BTM\HISTI.BTM
C:\BTM\V0000\HISTC.BTM
C:\BTM\V0000\HISTI.BTM

In the first command, with its arguments separated by white space, directory recursion is done separately for each argument. In the second command there is only one argument, an include list, of the same wildcards. The search for all possible matches to the include list is done at each directory recursion level separately. Well, this may not be how the command actually operates, but its result cannot be distinguished from a command which does operate as I described.

rconn · Aug 27, 2013

Steve Fabian said:
What TCC does now is acceptable, but I do see a difference between the way separate arguments are treated and the way include list members are treated, as seen in the example below.

That has nothing to do with file searching; it's strictly an artifact of how DIR groups its arguments.

JohnQSmith · Aug 28, 2013

rconn said:
But nobody has ever come up with a convincing reason why they would want to sort all files from all directories together (and be unable to tell where any of the files actually came from!).

How about the ten largest files on the drive?

Code:

c:\> dir /b /s /os | tail

The ten newest files in my internet cache?

Code:

c:\> dir /b /a /s /od "%@shfolder[32]" | tail

I've been asking for this for years, but apparently I haven't had a "convincing reason". My second example is what prompted my initial request.
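
Outside TCC, the "ten largest files" case can be sketched without sorting the whole list at all. The Python below is only an illustration, not what DIR does or could do: the function names are hypothetical, and it keeps just the current top candidates while walking the tree.

```python
import heapq
import os

def iter_files(root):
    """Yield (size, path) for every file name os.walk reports below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield os.path.getsize(path), path
            except OSError:
                # Skip files that vanish or cannot be read.
                continue

def largest_files(root, count=10):
    """Return the count largest files as (size, path), largest first."""
    # nlargest consumes the generator while holding only count entries.
    return heapq.nlargest(count, iter_files(root))
```

A call such as `largest_files("C:\\")` returns full paths, so each file can still be traced back to its directory.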

Dan Glynhampton · Aug 28, 2013

JohnQSmith said:
How about the ten largest files on the drive?

Code:

c:\> dir /b /s /os | tail

YES!

I jump through the hoops of using pdir to create a comma separated list of all the files on the drive and then open that in excel to sort it. Would be sooooo much better to use your example code, if it worked that way of course.

Charles Dye · Aug 28, 2013

Dan Glynhampton said:
YES!

I jump through the hoops of using pdir to create a comma separated list of all the files on the drive and then open that in excel to sort it. Would be sooooo much better to use your example code, if it worked that way of course.

Possibly of interest: http://prospero.unm.edu/plugins/sift.html

Dan Glynhampton · Aug 29, 2013

Charles Dye said:
Possibly of interest: http://prospero.unm.edu/plugins/sift.html

Thanks Charles, I'll try and take a look at the weekend.

rconn · Aug 29, 2013

JohnQSmith said:
How about the ten largest files on the drive?

Code:

c:\> dir /b /s /os | tail

Unless you're running Windows x64 with a LOT of RAM (>8Gb), you're not going to be able to do this regardless. (And you'll need a lot of patience.)
