Monday, August 29, 2011

PowerShell and Benford's Law

I was reading through a statistics blog (R) the other day when I came across a post on Benford's law. The blog defines it as follows:

"Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time."

The expected probabilities fall off steadily, from about 30.1% for a leading 1 to about 4.6% for a leading 9.
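
For reference, the expected frequencies come from the formula P(d) = log10(1 + 1/d), where d is the leading digit. Here is a minimal sketch that tabulates them. The variable name $benford and the property names are illustrative choices, not something from the blog quoted above, and [pscustomobject] assumes PowerShell 3.0 or later.

```powershell
# Expected leading-digit probabilities under Benford's law: P(d) = log10(1 + 1/d).
# $benford and its property names are illustrative, not from the original post.
$benford = foreach ($d in 1..9) {
    $p = [math]::Log10(1 + 1 / $d)
    [pscustomobject]@{
        Digit    = $d
        Expected = $p                  # raw probability, between 0 and 1
        Percent  = '{0:0.0%}' -f $p    # same value formatted as a percentage
    }
}

$benford | Format-Table Digit, Percent -AutoSize
```

The nine probabilities sum to 1, because the terms log10((d + 1)/d) telescope to log10(10).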



This seemed counter-intuitive and I wanted to validate it myself. Let's look at the leading digit of the file size of every .txt file in one of my directories. Enter PowerShell...
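# Explore Benford's Law

# Collect the leading digit of the size (in bytes) of every .txt file.
$array = @()
foreach ($item in (Get-ChildItem -Path p:\ -Filter *.txt -Recurse))
{
    # Skip directories and empty files; 0 is never a valid leading digit.
    if (-not $item.PSIsContainer -and $item.Length -gt 0)
    {
        $array += $item.Length.ToString()[0]
    }
}

# Tally each digit, then show its share of the total and a bar of its count.
$array |
    Group-Object -NoElement |
    Sort-Object Count -Descending |
    Format-Table @{label="#";expression={$_.Name}},
        @{label="Percent";expression={"{0:0%}" -f ($_.Count/$array.Count)}},
        @{label="Histogram";expression={"▄" * $_.Count}} -AutoSize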

I consider this a validation, but let's try another example, this time looking at the leading digits of the working set sizes of the processes on my desktop:

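$array = @()

foreach ($a in (Get-Process))
{
    # WorkingSet64 avoids the 32-bit overflow of the older WorkingSet property.
    if ($a.WorkingSet64 -gt 0)
    {
        $array += $a.WorkingSet64.ToString()[0]
    }
}

# Tally each digit, then show its share of the total and a bar of its count.
$array |
    Group-Object -NoElement |
    Sort-Object Count -Descending |
    Format-Table @{label="#";expression={$_.Name}},
        @{label="Percent";expression={"{0:0%}" -f ($_.Count/$array.Count)}},
        @{label="Histogram";expression={"▄" * $_.Count}} -AutoSize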

Again, this seems to hold true. Now that I have examples of Benford's law, I feel compelled to try and understand it. Wish me luck!
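
To go a step beyond eyeballing the histogram, the observed share of each leading digit can be set next to the share Benford's law predicts, log10(1 + 1/d). What follows is only a rough sketch: Compare-Benford is a hypothetical helper name, not a built-in cmdlet, and it assumes PowerShell 3.0 or later and whole-number inputs such as file sizes or working sets.

```powershell
# Hypothetical helper (not part of PowerShell or the snippets above): tallies the
# leading digits of positive whole numbers and pairs each observed share with the
# share predicted by Benford's law.
function Compare-Benford {
    param([long[]] $Values)

    # Start every digit at zero so digits that never occur still get a row.
    $counts = @{}
    foreach ($d in 1..9) { $counts[$d] = 0 }
    $total = 0

    foreach ($v in $Values) {
        if ($v -gt 0) {                       # zero and negatives have no leading digit 1-9
            $digit = [int]($v.ToString().Substring(0, 1))
            $counts[$digit]++
            $total++
        }
    }

    if ($total -eq 0) { return }              # nothing to compare

    foreach ($d in 1..9) {
        [pscustomobject]@{
            Digit    = $d
            Observed = $counts[$d] / $total
            Expected = [math]::Log10(1 + 1 / $d)
        }
    }
}

# Usage: compare the working set sizes of the running processes.
$sizes = Get-Process | ForEach-Object { $_.WorkingSet64 }
Compare-Benford -Values $sizes |
    Format-Table Digit,
        @{ Label = 'Observed'; Expression = { '{0:0.0%}' -f $_.Observed } },
        @{ Label = 'Expected'; Expression = { '{0:0.0%}' -f $_.Expected } } -AutoSize
```

With only a few dozen processes the observed column will be noisy, so a rough match is about all one should expect.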


3 comments:

jsnover said...

That's freaky.

Ashley McGlone said...

Too cool!

Lee Holmes said...

Interesting. Here's your data sorted by number (Benford's Law) as opposed to frequency. Still looks pretty reasonable.

PS E:\lee> gc C:\temp\stahler_1.txt | Convert-TextObject | sort P1 | Out-Graph P1 P2
1 ██████████████████████████████████████████████████████████████ 25
2 ███████████████████████████████████ 14
3 █████████████████████████████████████████████ 18
4 ██████████████████████████████ 12
5 ████████████████████ 8
6 █████████████████████████ 10
7 ███████████████ 6
8 ████████ 3
9 ███████████████ 6
PS E:\lee> gc C:\temp\stahler_2.txt | Convert-TextObject | sort P1 | Out-Graph P1 P2
1 ████████████████████████████████████████████████████████████████ 29
2 ████████████████████████████████████ 16
3 ████████████████████ 9
4 ██████████████████████ 10
5 ███████████████████████████ 12
6 ████████████████ 7
7 ████████████████████ 9
8 ███████ 3
9 ███████ 3