Lazy Evaluation: Benefits, and costs too

I try to avoid performance topics in my blog, because it’s very hard to create generalizations on performance. So, read this not as general performance guidelines, but as a way to understand Lazy Evaluation, LINQ, and Functional Programming.

A developer asked me a question about an algorithm that calculated some statistics for a very large sequence of numbers.  Here’s a (slightly) modified version of the code that calculated the Standard Deviation:

var mean = sequence.Sum() / sequence.Count();
var variance = sequence.Select(n => n * n).Sum() / sequence.Count() -
(sequence.Sum() * sequence.Sum()) /
((double)sequence.Count() * (double)sequence.Count());
var stdDeviation = Math.Sqrt(variance);

In the production code, each value in the sequence is calculated from other values, and requires a file read.

The important point for this code was that the sequence (more than a million numbers) was being enumerated eight times. Every call to Count(), or Sum(), including calculating the sum of squares, required enumerating the entire sequence. For a sample like this, you could pre-calculate and store the count, sum, and sum of squares:

        double sum = sequence.Sum();
int count = sequence.Count();
var mean = sum / count;
var variance = sequence.Select(n => n * n).Sum() / count -
(sum * sum) / ((double)count * (double)count);
var stdDeviation = Math.Sqrt(variance);

More complicated algorithms may require more work.  In many cases, a better solution is to use a different method, Aggregate, to calculate all the needed values in one enumeration.  Here’s the same calculation using one enumeration to calculate the sum, sum of squares, and count for the sequence:

var seed = new { Count = 0, Sum = 0.0, SumOfSquares = 0.0 };
var accValues = sequence.Aggregate(seed, (currentTotal, element) => new
Count = currentTotal.Count + 1,
Sum = currentTotal.Sum + element,
SumOfSquares = currentTotal.SumOfSquares + element * element

// now, calculate Standard deviation:
var mean = accValues.Sum / accValues.Count;
var variance = accValues.SumOfSquares / accValues.Count -
(accValues.Sum * accValues.Sum) /
((double)accValues.Count * (double)accValues.Count);
var stdDeviation = Math.Sqrt(variance);

I said in my opening that this wasn’t about performance. In fact, in my tests on a sequence of random numbers, all three samples take very similar times, within 100ms. In the full production version, where enumeration took much more time, and there were more calculations (resulting in more than 8 enumerations), the results were quite striking.

Another possible enhancement to this algorithm would be to use a new version of Aggregate in PLINQ to perform some of the calculations in parallel.

What I’m saying is this: please don’t rewrite LINQ queries that your happy with. Instead, remember that lazy evaluation means calculating the values each time. File this away for later and when you see code that does exhibit the problems I mentioned above, try folding enumerations into a single operation that calculates multiple results.

Created: 8/29/2011 5:10:24 PM

Current Projects

I create content for .NET Core. My work appears in the .NET Core documentation site. I'm primarily responsible for the section that will help you learn C#.

All of these projects are Open Source (using the Creative Commons license for content, and the MIT license for code). If you would like to contribute, visit our GitHub Repository. Or, if you have questions, comments, or ideas for improvement, please create an issue for us.

I'm also the president of Humanitarian Toolbox. We build Open Source software that supports Humanitarian Disaster Relief efforts. We'd appreciate any help you can give to our projects. Look at our GitHub home page to see a list of our current projects. See what interests you, and dive in.

Or, if you have a group of volunteers, talk to us about hosting a codeathon event.