
The purpose of the script is the following:

  1. Print the number of files recursively found within a directory (omitting folders themselves)
  2. Print the total sum file size of the directory
  3. Not crash the computer because of massive memory use.

So far (3) is the tough part.

Here is what I have written and tested so far. This works perfectly well on folders with a hundred or even a thousand files:

$hostname = hostname
$directory = "foo"
$dteCurrentDate = Get-Date -f "yyyy/MM/dd"

$FolderItems = Get-ChildItem $directory -recurse
$Measurement = $FolderItems | Measure-Object -property length -sum
$colitems = $FolderItems | Measure-Object -property length -sum
"$hostname;{0:N2}" -f ($colitems.sum / 1MB) + "MB;" + $Measurement.count + " files;" + "$dteCurrentDate"

On folders with millions of files, however, the $colitems variable becomes so massive from the collection of information of millions of files that it makes the system unstable. Is there a more efficient way to draw and store this information?

4 Answers


If you use streaming and pipelining, you should reduce the problem with (3) a lot, because when you stream, each object is passed along the pipeline as soon as it is available and does not take up much memory, so you should be able to process millions of files (though it will take time).

Get-ChildItem $directory -recurse | Measure-Object -property length -sum
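
If you also want the formatted output line from the question, only the small summary object returned by Measure-Object needs to be kept in memory; a minimal sketch along those lines ($m is just an illustrative name, and $hostname/$dteCurrentDate come from the question's script):

$m = Get-ChildItem $directory -Recurse | Measure-Object -Property Length -Sum
"{0};{1:N2}MB;{2} files;{3}" -f $hostname, ($m.Sum / 1MB), $m.Count, $dteCurrentDate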

I don't believe @Stej's statement, "Get-ChildItem probably reads all entries in the directory and then begins pushing them to the pipeline.", is true. Pipelining is a fundamental concept in PowerShell (provided the cmdlets, scripts, etc. support it). It ensures both that processed objects are passed along the pipeline one by one as they become available, and that they are produced only when they are needed. Get-ChildItem is not going to behave differently.

A great example of this is given in Understanding the Windows PowerShell Pipeline.

Quoting from it:

The Out-Host -Paging command is a useful pipeline element whenever you have lengthy output that you would like to display slowly. It is especially useful if the operation is very CPU-intensive. Because processing is transferred to the Out-Host cmdlet when it has a complete page ready to display, cmdlets that precede it in the pipeline halt operation until the next page of output is available. You can see this if you use the Windows Task Manager to monitor CPU and memory use by Windows PowerShell.

Run the following command: Get-ChildItem C:\Windows -Recurse. Compare the CPU and memory usage to this command: Get-ChildItem C:\Windows -Recurse | Out-Host -Paging.

A benchmark using Get-ChildItem on c:\ (about 179,516 files, not millions, but good enough):

Memory usage after running $a = gci c:\ -recurse (and then doing $a.count) was 527,332K.

Memory usage after running gci c:\ -recurse | measure-object was 59,452K and never went above around 80,000K.

(Figures are Memory - Private Working Set from Task Manager, for the powershell.exe process. Initially, it was about 22,000K.)
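
If you'd rather read the figure from PowerShell itself instead of Task Manager, something like the following should work (it reports the plain working set of the current process, not the private working set, so the numbers won't match Task Manager exactly):

"{0:N0} K" -f ((Get-Process -Id $PID).WorkingSet64 / 1KB)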

I also tried with two million files (it took me a while to create them!)

Similar experiment:

Memory usage after running $a = gci c:\ -recurse (and then doing $a.count) was 2,808,508K.

Memory usage while running gci c:\ -recurse | measure-object was 308,060K and never went above around 400,000K. After it finished, a [GC]::Collect() was needed for it to return to the 22,000K level.

I am still convinced that Get-ChildItem and pipelining can get you great memory improvements even for millions of files.


4 Comments

Get-ChildItem actually behaves differently.
@manojlds You might want to add the -force flag to Get-ChildItem so it will read system and hidden files (I hate that "feature").
@stej - I compared the memory usage. Can you try the same, from my updated answer?
@manojlds The memory usage is almost constant for ..|measure-object. But that is true for the case where directories don't have thousands/millions of files. gci internally calls the standard .NET GetFiles, and that is quite cheap (in memory usage) under those conditions. However, it is very different under the conditions that apply to @stephen's measurements (millions of files in a directory).

Get-ChildItem probably reads all entries in the directory and then begins pushing them to the pipeline. If Get-ChildItem doesn't work well, try switching to .NET 4.0 and using EnumerateFiles and EnumerateDirectories:

function Get-HugeDirStats($directory) {
    function go($dir, $stats)
    {
        foreach ($f in [system.io.Directory]::EnumerateFiles($dir))
        {
            $stats.Count++
            $stats.Size += (New-Object io.FileInfo $f).Length
        }
        foreach ($d in [system.io.directory]::EnumerateDirectories($dir))
        {
            go $d $stats
        }
    }
    $statistics = New-Object PsObject -Property @{Count = 0; Size = [long]0 }
    go $directory $statistics

    $statistics
}

#example
$stats = Get-HugeDirStats c:\windows

Here the most expensive part is the New-Object io.FileInfo $f call, because EnumerateFiles returns just file names. So if only the count of files is enough, you can comment out that line.

See Stack Overflow question How can I run PowerShell with the .NET 4 runtime? to learn how to use .NET 4.0.
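
If you need the sizes too, one variation (my own sketch, not part of the original answer or the comparison below) avoids constructing a FileInfo per path by recursing over DirectoryInfo objects instead; DirectoryInfo.EnumerateFiles (also .NET 4.0) yields FileInfo objects lazily, so the Length is typically available without a separate lookup:

function Get-HugeDirStats3($directory) {
    function go($dirInfo, $stats)
    {
        # DirectoryInfo.EnumerateFiles streams FileInfo objects one by one
        foreach ($f in $dirInfo.EnumerateFiles())
        {
            $stats.Count++
            $stats.Size += $f.Length
        }
        foreach ($d in $dirInfo.EnumerateDirectories())
        {
            go $d $stats
        }
    }
    $statistics = New-Object PsObject -Property @{Count = 0; Size = [long]0 }
    go (New-Object IO.DirectoryInfo $directory) $statistics

    $statistics
}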


You may also use the plain old methods, which are also fast but read all the files in a directory at once. So it depends on your needs; just try it. A comparison of all the methods follows.

function Get-HugeDirStats2($directory) {
    function go($dir, $stats)
    {
        foreach ($f in $dir.GetFiles())
        {
            $stats.Count++
            $stats.Size += $f.Length
        }
        foreach ($d in $dir.GetDirectories())
        {
            go $d $stats
        }
    }
    $statistics = New-Object PsObject -Property @{Count = 0; Size = [long]0 }
    go (new-object IO.DirectoryInfo $directory) $statistics

    $statistics
}

Comparison:

Measure-Command { $stats = Get-HugeDirStats c:\windows }
Measure-Command { $stats = Get-HugeDirStats2 c:\windows }
Measure-Command { Get-ChildItem c:\windows -recurse | Measure-Object -property length -sum }
TotalSeconds      : 64,2217378
...

TotalSeconds      : 12,5851008
...

TotalSeconds      : 20,4329362
...

@manojlds: Pipelining is a fundamental concept. But as a concept it has nothing to do with the providers. The file system provider relies on the .NET implementation (.NET 2.0) that has no lazy evaluation capabilities (~ enumerators). Check that yourself.

6 Comments

A good solution, but be careful changing to .NET as in that thread - your standard/v2 PowerShell will not work with all functionality (including remoting). Better off copying the folder and making the changes there, calling that PowerShell only when you need to use v4. They do work in parallel.
@Matt, I have run PowerShell under .NET 4.0 for months and have had no problems. (I changed the config file.)
Remoting works OK? That's cool. I had issues about 3-4 months ago - I wonder if I wasn't using the most up-to-date version of .NET 4.
I use remoting from time to time between Win7<->Win server 2008 and it works :) Try again, I hope it will be ok.
@stej - As a C# dev, I agree with you, and this is how I would do it in C#. But I would like to see if Get-ChildItem can indeed do the job. Can you try the commands from my updated answer and measure the memory usage?

The following function is quite cool and fast at calculating the size of a folder, but it doesn't always work (especially when there is a permission problem or the folder path is too long).

Function sizeFolder($path) # Return the size in MB.
{
    $objFSO = New-Object -com  Scripting.FileSystemObject
    ("{0:N2}" -f (($objFSO.GetFolder($path).Size) / 1MB))
}
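
A usage example (the path is only an illustration):

sizeFolder "C:\Windows"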

1 Comment

Is there a way to get around the limitations?

Here's what I usually use, inspired by old Unix aliases built around du and ls, granted a .NET method is faster (https://github.com/jdhitsolutions/PSScriptTools/blob/master/docs/Get-FolderSizeInfo.md). I only calculate the top-level directories and files.

function com { param([Parameter(ValueFromPipeline=$True)]
  [int64]$number)
  process { '{0:n0}' -f $number } }

function dus($dir=".") { 
  get-childitem -force -directory $dir -erroraction silentlycontinue `
    -attributes !reparsepoint | 
    foreach { $f = $_ 
      get-childitem -force -r $_.FullName  -attributes !reparsepoint -ea 0 | 
        measure-object -property length -sum -erroraction silentlycontinue | 
        select @{Name="Name";Expression={$f}},Sum} | sort Sum |
    select name,@{n='Sum';e={$_.sum | com}}
}

function siz() {
  ls -file -force | select name,length | 
    sort length |
    select name,@{n='Length';e={$_.length | com}}
}
dus

name                   sum
----                   ---
Documents              2,790,100,862
AppData                20,571,019,655
Downloads              25,564,270,200


siz

name                   length
----                   ------
ntuser.dat.LOG2        3,691,520
ntuser.dat.LOG1        3,698,688
NTUSER.DAT             14,680,064

