faster bcopy using duffs device (source)

Bakul Shah bvs at light.uucp
Sat Sep 9 08:43:33 AEST 1989


In article <19473 at mimsy.UUCP> chris at mimsy.UUCP (Chris Torek) writes:
>In article <5180 at portia.Stanford.EDU> stergios at Jessica.stanford.edu
>(stergios marinopoulos) writes:
>>I wanted a faster bcopy, so I used duffs device as a basis for it.
>
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about, because---for instance---a dbra loop
>beats a Duff loop on a 68010, every time.

A couple more points.

Even on a single processor, different trade-offs exist for different
amounts of copying (e.g. use of movems on a 68000 for large copies) or
different alignments (e.g. word copies when src and dst are word aligned,
something else when they are not).  Vendors providing the standard
library are the ones who should mess with such details.
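
To make the alignment trade-off concrete, here is a minimal sketch in
portable C.  The name copy_by_alignment and the 4-byte word size are
assumptions for illustration only; a real library routine would be tuned
(probably in assembly) for its target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the original post.  Copies word-at-a-time
   when src and dst share the same offset within a 4-byte word, and
   byte-at-a-time otherwise. */
void *copy_by_alignment(void *dst, const void *src, size_t count)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Same low two address bits: both pointers reach a word boundary
	   after the same number of leading bytes. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
		/* Leading bytes, until d (and therefore s) is word aligned. */
		while (count > 0 && ((uintptr_t)d & 3) != 0) {
			*d++ = *s++;
			count--;
		}
		/* Whole words.  The fixed-size memcpy calls avoid aliasing
		   problems; compilers typically turn each one into a single
		   load or store. */
		while (count >= 4) {
			uint32_t w;
			memcpy(&w, s, 4);
			memcpy(d, &w, 4);
			d += 4;
			s += 4;
			count -= 4;
		}
	}
	/* Trailing bytes, or the whole copy when the alignments differ. */
	while (count-- != 0)
		*d++ = *s++;
	return dst;
}
```

The unaligned path here is just the byte loop; a tuned version would
more likely read aligned words and shift them into place.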

It is preferable to use standard functions whenever possible (memcpy
instead of bcopy), since ANSI compilers can optimize them much better.
For instance, on a particular machine a compiler may choose to inline
something like

	void *memcpy(void *dst, const void *src, size_t count)
	{
		/* copy right here for small counts */
		if (count < BREAKEVENCOUNT) {
			char *d = (char *)dst;
			const char *s = (const char *)src;
			size_t c = count;
			while (c-- != 0)
				*d++ = *s++;
			return dst;
		}
		/* call a function depending on relative alignment;
		   __alignedcpy and __unalignedcpy are library-internal
		   routines with the same signature as memcpy */
		return
			(((unsigned long)dst&3) == ((unsigned long)src&3) ?
			 __alignedcpy : __unalignedcpy
			) (dst, src, count);
	}

Ain't that Disgusting!  Even more can be done if the count happens
to be a constant -- though this case can only be handled in a compiler
(as there is no preprocessor equiv. of #if defined(xxx) for detecting
constants).  Inlining is especially useful when small amounts have to
be copied.  Anyway, it is best to hide this in the compiler or in <string.h>.
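
As a hedged illustration of the constant-count case: some later
compilers expose a hook for it.  The sketch below assumes GCC or Clang,
whose __builtin_constant_p extension is not part of ISO C; the names
small_copy, SMALL_LIMIT and COPY are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout.  __builtin_constant_p is a GCC/Clang
   extension, so this does not build with every compiler. */
#define SMALL_LIMIT 16

/* Byte loop that the compiler can inline and unroll when n is known. */
static inline void *small_copy(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n-- != 0)
		*d++ = *s++;
	return dst;
}

/* Use the inline loop only when n is a small compile-time constant;
   otherwise fall back to the library memcpy.  The builtin does not
   evaluate n, and the comparison is reached only when n is constant,
   so a side-effecting n is evaluated once. */
#define COPY(dst, src, n) \
	((__builtin_constant_p(n) && (n) <= SMALL_LIMIT) \
		? small_copy((dst), (src), (n)) \
		: memcpy((dst), (src), (n)))
```

With this, COPY(b, a, 8) takes the inline path, while COPY(b, a, len)
with a run-time len calls memcpy.  The decision is still made by the
compiler rather than the preprocessor.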

To give you a datapoint, on an AMD29000 such tricks cut time down from
about 10 cycles/byte in C code to under 0.7 cycles/byte for aligned src,
dst and 0.9 cycles/byte for unaligned src, dst (for copying about 100
bytes).  For very large copies it is possible to approach the 29k's limit
of 0.5 cycles/byte within 5% -- assuming data memory can stream.

-- Bakul Shah <..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs>


