Disk usage and hard links

ben

Jan 3, 2012
28
4
UK
#1
Is there a command that displays the disk usage of a directory or directory tree without counting the space used by a file twice if there are two hard links to it? DIR counts the space used by a file once for each hard link to it; Linux/Cygwin du counts it only once.
Code:
[z:] dir /h /k
2013-06-08  20:37           1,000  x
          1,000 bytes in 1 file and 0 dirs    4,096 bytes allocated
...

[z:] dir /h /k /u
          1,000 bytes in 1 file and 0 dirs    4,096 bytes allocated
...

[z:] \cygwin\bin\du -b
1000    .

[z:] mklnk x z
\tmp\tcmd\z -> \tmp\tcmd\x

[z:] dir /h /k
2013-06-08  20:37           1,000  x
2013-06-08  20:37           1,000  z
          2,000 bytes in 2 files and 0 dirs    8,192 bytes allocated
...

[z:] dir /h /k /u
          2,000 bytes in 2 files and 0 dirs    8,192 bytes allocated
...

[z:] \cygwin\bin\du -b
1000    .
Thanks.
 
#2
Try "/nj". Here's the difference:
Code:
v:\> dir /a:d /k /m e*
2013-06-08  17:06        <DIR>    empty
2013-06-08  17:08    <JUNCTION>    empty1 [V:\empty]
 
v:\> dir /s /u foo*
 
Volume in drive V is DATA          Serial number is c007:d3e4
 
Directory of  V:\empty\foo*
 
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
    Total for:  V:\empty\foo*
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
Directory of  V:\empty1\foo*
 
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
    Total for:  V:\empty1\foo*
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
    Total for:  V:\foo*
                10 bytes in 2 files and 0 dirs    8,192 bytes allocated
    6,215,962,624 bytes free
 
v:\> dir /s /u /nj foo*
 
Volume in drive V is DATA          Serial number is c007:d3e4
 
Directory of  V:\empty\foo*
 
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
    Total for:  V:\empty\foo*
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
 
    Total for:  V:\foo*
                5 bytes in 1 file and 0 dirs    4,096 bytes allocated
    6,215,962,624 bytes free
 
#4
A hard link is just a directory entry pointing to the same file system object. It is possible to (efficiently) determine the number of links to a file (the original being just one of them); that number is given by GetFileInformationByHandle(). But naming the links requires searching the whole volume for directory entries pointing to the same disk data (inode). And it is not possible to determine which of many hard links was the original file. The search below took about 18 seconds!
Code:
j:\> listlinks c:\Windows\System32\AdapterTroubleshooter.exe
c:\Windows\System32\AdapterTroubleshooter.exe has 2 directory entries:
c:\Windows\System32\AdapterTroubleshooter.exe
c:\Windows\winsxs\x86_microsoft-windows-adaptertroubleshooter_31bf3856ad364e35_6
.1.7600.16385_none_d1d79dd7e49a786f\AdapterTroubleshooter.exe
I suspect it would be quite difficult for TCC to do all the searching and adequately keep track of the hard link data in order to avoid counting a file once for each hard link (especially if "/S" (recurse) were used). And if a hard-linked file appeared in two directories, in which of the directories should its size be reported?

TCC wouldn't have to search for other links. But it would have to keep track of files (actually inodes) with more than 1 hard link and test each subsequent directory entry to see if it's a link to something already counted. No doubt there'd be a performance hit.

That said, on my system drive, with some 3.4GB of links (~15,500 files), the utility below took only about 2% longer when asked not to count every instance of a file with multiple directory entries.

Mark Russinovich's (sysinternals) utility DU.EXE may be of some value to you.
Code:
Du v1.5 - report directory disk usage
Copyright (C) 2005-2013 Mark Russinovich
Sysinternals - www.sysinternals.com
 
usage: du [-c[t]] [-l <levels> | -n | -v] [-u] [-q] <directory>
  -c    Print output as CSV. Use -ct for tab delimiting.
  -l    Specify subdirectory depth of information (default is one level).
  -n    Do not recurse.
  -q    Quiet (no banner).
  -u    Count each instance of a hardlinked file.
  -v    Show size (in KB) of all subdirectories.
 
CSV output is formatted as:
Path,CurrentFileCount,CurrentFileSize,FileCount,DirectoryCount,DirectorySize
Code:
v:\> timer v:\empty\du c:\
Files:        69405
Directories:  17777
Size:        16,489,375,757 bytes
Size on disk: 16,720,311,264 bytes
 
Elapsed: 0:00:26.34
 
v:\> timer v:\empty\du -u c:\
Files:        84925
Directories:  17777
Size:        19,826,638,820 bytes
Size on disk: 20,094,518,240 bytes
 
Elapsed: 0:00:25.99
 

#5
Thanks for your response.

As you say, it's easy for TCC to determine of two directory entries whether they're hard links to the same file, and there's no need for it to search for other links to any given file. And the cost of keeping track of inodes while summing the space used is minimal, as your experiment with Sysinternals du shows.

And thanks for pointing out the existence of Sysinternals du (I use some of their other tools). Their du seems to be a cut-down version of the Unix one.

You ask which of two directories containing a hard link to a single file should include that file's size. Unix and Sysinternals du solve this in the same way: they report the space used by a file the first time they encounter a hard link to it; if they find a hard link to a file already seen, they report its size as zero. If you have parallel directory trees that contain links to the same files, this gives a very clear visual indication of where the links are broken (e.g. by rsync).

What suggested to me that TCC's dir command might help in investigating hard links was the fact that the footer of its output shows a figure labelled "bytes allocated". In the presence of multiple links to a single file, either the figure is wrong or its description is misleading.

Anyway, I know now not to look to dir when I need to investigate hard links.

Thanks again.
 
#6
I have been experimenting with this. Determining whether a file has multiple hard links requires opening the file. DU.EXE cannot always do that, even when running elevated; elevation reduces the misses, but it still doesn't get everything (notably on the system drive).

I wrote a crude utility to count bytes used below a specified starting point **not** counting files with multiple hard links more than once. When used in places without troublesome files, it duplicates DU.EXE's results, and in places where there are no multiply linked files, it duplicates the results of "DIR /A /S". It's considerably faster than DU.EXE. Keeping track of multiply-linked files (on my system drive, probably a worst case) cost an additional 4% in execution time.

FWIW, I kept track of multiply-linked files with the likes of this.
Code:
struct HLINK
{
    ULONGLONG Index;
    DWORD LinksLeft;
    HLINK *pNext;
    HLINK *pPrev;
    static HLINK *pFirst;
    static HLINK *pLast;
    HLINK(ULONGLONG ullIndex, DWORD Links);
    ~HLINK();
};
 
HLINK *HLINK::pFirst = NULL;
HLINK *HLINK::pLast = NULL;
 
HLINK::HLINK(ULONGLONG ullIndex_in, DWORD dwLinks_in)
{
    Index = ullIndex_in;
    LinksLeft = dwLinks_in - 1;
    // standard linked list stuff; new one at end
    pNext = NULL;
    if ( pLast )    { pLast->pNext = this;    pPrev = pLast; }
    else            { pPrev = NULL;            pFirst = this; }
    pLast = this;
}
 
HLINK::~HLINK(VOID)
{
    //standard linked list stuff
    if ( pNext ) pNext->pPrev = pPrev;
    else pLast = pPrev;
    if ( pPrev ) pPrev->pNext = pNext;
    else pFirst = pNext;
}
 
// call this only for files with more than 1 link
BOOL IsInodeInList( ULONGLONG ullIndex_in, DWORD dwLinks_in )
{
    HLINK *p;
    for ( p = HLINK::pFirst; p != NULL; p = p->pNext )
    {
        if ( p->Index == ullIndex_in )  // it's in the list already
        {
            p->LinksLeft -= 1;            // decrement LinksLeft
            if ( p->LinksLeft == 0 )    // if all links found
                delete p;                // keep list small
            return TRUE;
        }
    }
    // else add it to the list
    p = new HLINK(ullIndex_in, dwLinks_in);
    return FALSE;
}
And inside a "do {...} while (FindNextFile())" loop,
Code:
wsprintf(FQName, L"%s%s", szDir, w32fd.cFileName);
hFile = CreateFile(FQName, FILE_READ_ATTRIBUTES | FILE_READ_EA, 7, NULL, OPEN_EXISTING, 0, NULL); // 7 = FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE
if ( hFile != INVALID_HANDLE_VALUE )
{
    if ( GetFileInformationByHandle(hFile, &bhfi) )
    {
        if ( bhfi.nNumberOfLinks <= 1
                || !IsInodeInList(((ULONGLONG) bhfi.nFileIndexHigh * 0x100000000UI64 + bhfi.nFileIndexLow),
                bhfi.nNumberOfLinks) )
        {
            ullTotalBytes += ((ULONGLONG) bhfi.nFileSizeHigh * 0x100000000UI64 + bhfi.nFileSizeLow);
        }
    }
    CloseHandle(hFile);
}
//else
 

#7
I'm gratified that this has piqued your interest.

Thank you for your ideas. I wrote a simple program based on them (using a std::set<ULONGLONG> (or std::map<ULONGLONG,DWORD>) to store the visited inodes) which works nicely. It could be the basis of a useful tool.

Have you found a case where your program produces a more accurate result than Sysinternals du?
 
#8
I'm gratified that this has piqued your interest.

Thank you for your ideas. I wrote a simple program based on them (using a std::set<ULONGLONG> (or std::map<ULONGLONG,DWORD>) to store the visited inodes) which works nicely. It could be the basis of a useful tool.

Have you found a case where your program produces a more accurate result than Sysinternals du?
What's "accurate"? There's no standard.

I think the whole idea of such an exercise is a bit iffy. Here's an interesting anomaly. Below, run elevated, DU and MYDU process the same number of directories and the same number of files, yet come up with different totals. Since the methods are different, I can only suppose each can open some files that the other can't, and the fact that the numbers of files processed are the same is just a coincidence.
Code:
l:\projects\mydu\release> timer u:\du.exe c:\
Timer 1 on: 19:32:52
 
Files:        71101
Directories:  18228
Size:        17,325,502,945 bytes
Size on disk: 17,533,109,216 bytes
 
Timer 1 off: 19:33:20  Elapsed: 0:00:27.56
 
l:\projects\mydu\release> timer mydu.exe c:\
Timer 1 on: 19:33:25
Bytes:  17325457529
Alloc:  17477607424
Files:  71101
Dirs:  18228
Timer 1 off: 19:33:45  Elapsed: 0:00:20.10
My computations are quite straightforward (and I think I got them right).
Code:
hFile = CreateFile(FQName, FILE_READ_ATTRIBUTES | FILE_READ_EA, 7, NULL, OPEN_EXISTING, 0, NULL); // 7 = FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE
if ( hFile != INVALID_HANDLE_VALUE )
{
    if ( GetFileInformationByHandle(hFile, &bhfi) )
    {
        if ( bhfi.nNumberOfLinks <= 1
                || !IsInodeInList(((ULONGLONG) bhfi.nFileIndexHigh * 0x100000000UI64 + bhfi.nFileIndexLow), bhfi.nNumberOfLinks) )
        {
            ULONGLONG size = ((ULONGLONG) bhfi.nFileSizeHigh * 0x100000000UI64 + bhfi.nFileSizeLow);
            ullTotalBytes += size;
            ullTotalAllocated += (((size + dwAllocUnit - 1) / dwAllocUnit) * dwAllocUnit);
            ullFiles += 1;
        }
    }
    CloseHandle(hFile);
}
 
#9
I did just find an oversight. The file sizes reported do not include NTFS streams and I am not checking for the existence of NTFS streams. Oh boy! ... a couple more hours of fun!
 
#10
I did just find an oversight. The file sizes reported do not include NTFS streams and I am not checking for the existence of NTFS streams. Oh boy! ... a couple more hours of fun!
Make that three hours! I have it to this point (see below): when run elevated, DU and MYDU report the same number of files, the same number of directories, and the same number of bytes, but a different number of bytes allocated (and MYDU is still a bit faster). I think I'll next need to consider how to handle compressed files/folders.
Code:
l:\projects\mydu\release> timer mydu.exe c:\
Timer 1 on: 22:07:34
Bytes:  17325653586
Alloc:  17478168576
Files:  71163 (plus 39 streams)
Dirs:  18238
Timer 1 off: 22:07:58  Elapsed: 0:00:24.57
 
l:\projects\mydu\release> timer u:\du.exe c:\
Timer 1 on: 22:08:02
 
Files:        71163
Directories:  18238
Size:        17,325,653,586 bytes
Size on disk: 17,533,490,144 bytes
 
Timer 1 off: 22:08:30  Elapsed: 0:00:27.89
 

#11
Thanks for spending so much time on this. I hope this is as useful to you as it is to me.

There's something odd about Sysinternals du's report. It says:

Code:
Size on disk: 17,533,490,144 bytes
But 17,533,490,144 / 1024 = 17,122,548.96875. I see similar things. A quick investigation suggests that the residues come from streams. TCMD dir /g /: also shows allocated space figures for streams that are not multiples of 1024 (let alone 4096). Can that be right?
 
#12
Thanks for spending so much time on this. I hope this is as useful to you as it is to me.

There's something odd about Sysinternals du's report. It says:

Code:
Size on disk: 17,533,490,144 bytes
But 17,533,490,144 / 1024 = 17,122,548.96875. I see similar things. A quick investigation suggests that the residues come from streams. TCMD dir /g /: also shows allocated space figures for streams that are not multiples of 1024 (let alone 4096). Can that be right?
On NTFS, that can be right. Files small enough to fit in the 1,024-byte MFT entry (alongside the other data that must be stored there) are stored right in the entry and aren't allocated any clusters. I don't know how to tell when this happens. Right now, I'm computing allocation based on whole clusters per stream (including the default :$DATA stream). Even when file byte totals match, either of the allocations reported by DU and MYDU may be the larger one.
 

#13
If a file is stored in the MFT entry, its contribution to the allocated total is 0, so that total should still be divisible by 4096. But I daresay the filesystem regards it as its own business where it puts such a file and sees fit not to tell us.

But I can do what I need to do directly, by checking the identity or otherwise of inodes, rather than indirectly, by counting allocated space. With your help I am now able to write an efficient program to do that. Thank you very much for your efforts.
 
#14
I'm now doing this for my own enjoyment. I'm now including 1,024 bytes allocated for every MFT entry (files, streams, and directories). And I'm including in the allocated total the sizes of the directories themselves (not of the files in them, but of the info that specifies what's in a given directory; get it with GetFileInformationByHandle(hDir)); those are integer multiples of the cluster size. In both cases (especially the second), these are bytes allocated which would become unallocated if the entity were deleted. So I get an inflated (safe?) estimate of allocation. Whatever the number is, I have little use for it.
 
#16
I just discovered this. If you're doing something like this yourself, it can easily be made more accurate.

Code:
GetFileInformationByHandleEx(hFile, FileStandardInfo, &fsi, sizeof(fsi));
ullTotalBytes += fsi.EndOfFile.QuadPart;
ullTotalAllocated += ((fsi.AllocationSize.QuadPart < ClusterSize) ? 0 : fsi.AllocationSize.QuadPart);
How this works (by example):

v:\tpipe.log is 508 bytes, small enough to fit in the MFT entry. So fsi.AllocationSize.QuadPart is 512 (no clusters needed) and fsi.EndOfFile.QuadPart is 508 (the file size).

v:\tp.out is 927 bytes, too big for the MFT entry. So fsi.AllocationSize.QuadPart is 4096 (one cluster) and fsi.EndOfFile.QuadPart is 927 (size).

This sees right through compression in the sense that ...

For v:\4utilcopy\4Utils.vcproj (in a compressed folder, uncompressed size 21966), fsi.AllocationSize.QuadPart is 12288 (enough for the compressed file) and fsi.EndOfFile.QuadPart is 21966 (the uncompressed size).
 
#17
I made DU a plugin in SYSUTILS. It requires Vista or later (it won't even load otherwise) and NTFS. It's in ftp://lucky.syr.edu/4plugins/NeedsVista. Its byte counts usually agree with SysInternals's DU.EXE. Its allocation counts are computed differently (hopefully more accurately). Every file, stream, and directory has a 1,024-byte MFT entry; that space is always counted. For "non-resident" files, streams, and directories (not small enough to fit in their MFT entry) the appropriate multiple of the cluster size (allocation unit) is added.

In the end, my EXE (54KB) was still a tad faster than the SI utility (and 1/4 the size). Adding it to SYSUTILS (taking advantage of all that TakeCmd.dll has to offer) required only 3.5KB.

It works like this (see below). Someday I'll remove @DU from 4UTILS and add it to SYSUTILS to be used to pick out any one of the numbers shown by DU. I wouldn't mind comments and bug reports. I haven't tested (can't) the 64-bit version.

Code:
v:\> du /?
Get size/allocation info or show non-default streams; requires NTFS
 
  MYDU.EXE directory [/S(treams)]
 
  quote directory names containing whitespace
 
v:\> du g:\
Files:                  20,934
  Bytes:                5,264,261,374
  Alloc:                5,322,233,856
  Resident:            3,294
 
Additional Streams:    20
  Bytes:                520
  Alloc:                24,576
  Resident:            19
 
Dirs:                  1,956
  Bytes:                7,599,104
  Alloc:                7,599,104
  Resident:            1,195
 
File/Stream Bytes:      5,264,261,894
File/Stream Alloc:      5,322,258,432
 
Total Bytes:            5,271,860,998
Total Alloc:            5,329,857,536
 
Alloc Unit:            4,096
 
v:\> du g:\ /s
G:\ConEmu\ConEmuPack.130427.7z:Zone.Identifier
G:\Depends\depends.chm:Zone.Identifier
G:\Depends\depends.dll:Zone.Identifier
G:\ExamDiff\Readme.txt:Zone.Identifier
G:\ProcessMonitor\extracted\Eula.txt:Zone.Identifier
G:\ProcessMonitor\extracted\procmon.chm:Zone.Identifier
G:\ProcessMonitor\extracted\Procmon.exe:Zone.Identifier
<more snipped>