Monday, April 16, 2012

Speed Reading with PowerShell

Many of you have had to read large text files into PowerShell for processing.  The Get-Content cmdlet is perfect for this.  However, it can be very sloooow with large files.  There are multiple ways to speed this up.  For example, we could dive into .NET and use the [System.IO.File]::ReadAllLines() method.  For simplicity, let's stick with the Get-Content cmdlet.  The following example demonstrates a couple of different techniques; the one to focus on is the use of the "-ReadCount" parameter.

# define some random nouns, verbs and adverbs            
$noun = "Ed","Hal","Jeff","Doug","Don","Kirk","Dmitry"            
$verb = "ran","walked","drank","ate","threw","scripted","snored"       
$adverb = "quickly","randomly","erratically","slowly","slovenly","loudly"            
            
# create an array with 10,000 random sentences             
$content = 1..10000 | foreach {            
    "{0} {1} {2}." -f ($noun|Get-Random),($verb|Get-Random),($adverb|Get-Random)            
}            
            
# save our array to a text file            
$path = "c:\temp\RandomSentences.txt"              
$content | Out-File -FilePath $path            
            
# read in the file four times and measure the time taken (in milliseconds):
# first with the default settings, then with increasing -ReadCount values
(Measure-Command -Expression { Get-Content -Path $path }).TotalMilliseconds
(Measure-Command -Expression { Get-Content -Path $path -ReadCount 10 }).TotalMilliseconds
(Measure-Command -Expression { Get-Content -Path $path -ReadCount 100 }).TotalMilliseconds
(Measure-Command -Expression { Get-Content -Path $path -ReadCount 1000 }).TotalMilliseconds


The results (in milliseconds, in the same order as the commands above)....
164.6186
24.5987
19.7441
16.2411



Explanation

The Get-Content cmdlet does more behind the scenes than just return the data.  As it reads the file, it attaches extra properties (such as the path and the read count) to the objects it sends down the pipeline.  By default, this happens for each line as it is read.  For large files, this overhead can be reduced by setting the -ReadCount parameter.  With this parameter set, lines are sent down the pipeline in collections whose size equals the number you pass to -ReadCount, so the behind-the-scenes properties are populated once per collection instead of once per line.
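
To see what that means in practice, here is a minimal sketch. It assumes a throwaway file in the temp folder (the $demoPath name and the 2,500-line size are made up for illustration, not part of the example above). It lists the extra properties attached to a single line, then shows that with -ReadCount each pipeline object is an array of lines rather than a single string, so your processing code needs an inner loop.

```powershell
# Hypothetical demo file: 2,500 short lines in the temp folder
$demoPath = Join-Path ([System.IO.Path]::GetTempPath()) 'ReadCountDemo.txt'
1..2500 | ForEach-Object { "line $_" } | Set-Content -Path $demoPath

# Default behaviour: each line arrives as its own string, decorated with
# note properties (typically PSPath, PSParentPath, PSChildName, PSDrive,
# PSProvider and ReadCount)
$firstLine = Get-Content -Path $demoPath | Select-Object -First 1
$firstLine.PSObject.Properties |
    Where-Object { $_.MemberType -eq 'NoteProperty' } |
    ForEach-Object { $_.Name }

# With -ReadCount, each pipeline object is an array of up to 1000 lines
$chunkSizes = Get-Content -Path $demoPath -ReadCount 1000 |
    ForEach-Object { $_.Count }
$chunkSizes

# To work on individual lines, loop over each array inside the pipeline
$total = 0
Get-Content -Path $demoPath -ReadCount 1000 | ForEach-Object {
    foreach ($line in $_) { $total++ }
}
$total

# clean up the demo file
Remove-Item -Path $demoPath
```

The trade-off is that anything downstream expecting one string per pipeline object (a Where-Object filter on $_, for example) will now receive a whole array at a time, which is why the inner foreach loop is there.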
  
Hope this helps!