standard deviation

Sun Mar 26 05:16:02 AEST 1989

There are two issues associated with the computation of the standard
deviation which have been brought up in the provious postings.  The
first of these is which of the following is the "correct" estimator
for the standard deviation from a sample:

	x[0], x[1], ..., x[N-1]

of size N.

	mean = 0.0;
	for (i=0; i<N; i++)
		mean += x[i];
	mean /= N;
	sum = 0.0;
	for (i=0; i<N; i++)
		sum += (x[i] - mean)*(x[i] - mean);

	1) std_dev = sqrt(sum/N);

	2) std_dev = sqrt(sum/(N-1.0));

I would like to emphasis here that both 1) and 2) are valid estimators
with both having people who recommend them.

	1) is what is called the moment estimator, and also the
		maximum likelyhood estimator.

	2) is sometimes called the unbiased estimator since the
		expected value of sum/(N-1) is the variance of
		the population.

In any case there is little difference in the computed value's.

The second issue, which is more important, is how to compute the value
accurately on a machine with finite precision.  There were two computation
formula's provided.  It should be noted that mathematically they are
equivalent, thus the only issue is which provides the more accurate result.

NOTE:  In all the computations below, I will use a denominator of N.  If
	you want the other estimator, just change the one term.

	1)
		mean = 0.0;
		for (i=0; i<N; i++)
			mean += x[i];
		mean /= N;
		sum = 0.0;
		for (i=0; i<N; i++)
			sum += (x[i] - mean)*(x[i] - mean);
		std_dev = sqrt(sum/N);

	2)
		xsum = xsq = 0.0;
		for (i=0; i<N; i++) {
			xsum += x[i];
			xsq += x[i]*x[i];
		}
		std_dev = sqrt((xsq - xsum*xsum/N)/N);

NOTE:  These are code segments, in the sense that above, the x[] array
	was declared and assigned values, i was declared an int, sqrt()
	returns a double, etc.

NOTE:  These coding segments are not the most efficient.

1) is generally called the two pass formula for computing the standard
	deviation.

2) is called a one-pass formula for computing the standard deviation.

At this point, we notice that 1) is definitely slower than 2) since it
requires two passes through the data.  HOWEVER, I claim that it is more
accurate.
Inorder to provide some data rather than words, below are two test programs
each of which:

		1) reads a data set
		2) computes the standard deviation
		3) prints out the result.

------------------- cut here -----------------------
#include <stdio.h>
extern char *malloc();

main(argc, argv)
int argc;
char **argv;
{
	double *x, std_dev, sum, mean, sqrt();
	int i, N;
	FILE *fopen(), *fp;

	if (argc != 2) {
		printf("std: requires an input file!\n");
		exit(1);
	} else if ((fp = fopen(argv[1], "r")) == NULL) {
		printf("std: cannot open %s\n", argv[1]);
		exit(1);
	}
	fscanf(fp, "%d", &N);
	x = (double *) malloc(N*sizeof(double));
	if (x == NULL) {
		printf("std:  not sufficient memory!\n");
		exit(1);
	}
	for (i=0; i<N; i++)
		fscanf(fp, "%lf", &x[i]);
	mean = 0.0;
	for (i=0; i<N; i++)
		mean += x[i];
	mean /= N;
	sum = 0.0;
	for (i=0; i<N; i++)
		sum += (x[i] - mean)*(x[i] - mean);
	std_dev = sqrt(sum/N);
	printf("std_dev = %24.16e\n", std_dev);
}
------------------------- cut here --------------------
#include <stdio.h>
extern char *malloc();

main(argc, argv)
int argc;
char **argv;
{
	double *x, std_dev, xsum, xsq, sqrt();
	int i, N;
	FILE *fopen(), *fp;

	if (argc != 2) {
		printf("std: requires an input file!\n");
		exit(1);
	} else if ((fp = fopen(argv[1], "r")) == NULL) {
		printf("std: cannot open %s\n", argv[1]);
		exit(1);
	}
	fscanf(fp, "%d", &N);
	x = (double *) malloc(N*sizeof(double));
	if (x == NULL) {
		printf("std:  not sufficient memory!\n");
		exit(1);
	}
	for (i=0; i<N; i++)
		fscanf(fp, "%lf", &x[i]);
	xsum = xsq = 0.0;
	for (i=0; i<N; i++) {
		xsum += x[i];
		xsq += x[i]*x[i];
	}
	std_dev = sqrt((xsq - xsum*xsum/N)/N);
	printf("std_dev = %24.16e\n", std_dev);
}
------------------------- cut here -------------------------
The input data is given below:
------------------------- cut here -------------------------
10
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
-------------------------- cut here ------------------------

I will compile the first program and call the executable "twopass,"
and the second program will be called "onepass."  The next section gives
the compile and output from the data:
-------------------------- cut here ------------------------
hrdcpy cc -O -o twopass test.c -lm
+ /etc/preroot /native /bin/cc -O -o twopass test.c -lm 
hrdcpy cc -O -o onepass test1.c -lm
+ /etc/preroot /native /bin/cc -O -o onepass test1.c -lm 
hrdcpy onepass data
std_dev =   1.7888543819998317e+00
hrdcpy twopass data
std_dev =   5.0000000000000000e-01
---------------------------- cut here ----------------------

Notice that the onepass standard deviation is 1.78, while the twopass
standard deviation is 0.5!!!!!!  Incidentally, the true value is 0.5!!

					Bob Craig