"conventional" copy fragment?

Sat Oct 1 19:38:26 AEST 1988

In article <2585 at ingr.UUCP>, jones at ingr.UUCP (Mark Jones) writes:
> In article <8809092109.AA06071 at tycho.yerkes.uchicago.edu>, pearce at TYCHO.YERKES.UCHICAGO.EDU ("Eric C. Pearce") writes:
>> Personally, I prefer this "conventional" copy fragment:
>> [code]
> How about this "conventional" copy fragment
> 	memcpy(array2,array1,sizeof(*A)*Count);

As with most of these arguments, it really depends on the specific
application.  On some machines (such as the VAX), function call
overhead is large enough that if you're copying small amounts of data
(like ten or twelve bytes), it's faster to inline it.

> Use the functions in the C library.  If you need more speed, recode
> the library routines in assembly, and link them in ahead of the
> standard library.

Not enough: even the function call overhead can swamp the time taken to
perform the copy.

> Don't write trashy C code to try to get better speed.  It just ain't
> worth it, now or later.

Welcome to the real world, where we need to get the response packet out
to the robot within 28 milliseconds or the whole thing comes to a
screeching halt.  If inlining a copy makes the difference between a
working system and a not-quite-working system, it's worth it.

This is not just my idea, either.  I wrote a program to find out what's
really happening.  This was run on an otherwise free MicroVAX-II under
4.3BSD with the standard 4.3 pcc-based cc, with no optimization options
enabled.  The times given are real times, but they stayed stable across
several runs, so they ought to be reasonably good.  Here's what the
various lines mean:

Overhead: The time taken when there's no copy being done (ie, loop
	overhead).
Fancy inline: The "conventional" copy recommended by "Eric C. Pearce",
	with all inner-loop variables declared register.
Simple inline: An inline loop whose core is
	for (nb=...;nb>0;nb--) *to++ = *from++;
	where all three variables are declared register.
asm inline: In-line asm() directives to implement a loop such as the
	one used for "Simple inline".
Fxncall: Calling a function containing the same loop used by "Simple
	inline".
Library: Calling memcpy().

All pointers are pointers to char, so copying copies one byte at a time.

Here are the times resulting from 100000 iterations, copying 12 bytes
each loop:

Overhead = 2.3 usec/loop
Fancy inline = 60.6 usec/loop
Simple inline = 49.1 usec/loop
asm inline = 40.1 usec/loop
Fxncall = 74.3 usec/loop
Library = 71.9 usec/loop

And here, 10000 iterations, copying 512 bytes each loop:

Overhead = 2 usec/loop
Fancy inline = 979 usec/loop
Simple inline = 1881 usec/loop
asm inline = 1654 usec/loop
Fxncall = 1877 usec/loop
Library = 1794 usec/loop

Moral: Nothing beats actually trying a few things and finding out
what's best.  Of course, "fastest" is not always "best", though if you
care enough to start unrolling your block copies, speed is presumably
important to you.  In particular, note the differences depending on the
amount of data being copied, so what you want to do depends on how much
stuff you're typically copying.

For the curious, the program I used appears below; to extract it, feed
this through atob and uncompress.

xbtoa Begin
+.\K<d^U14V at K"L+F),(6RL%s(MJ1E#WT'H_1Zn#b]uK%9q*mL,u-R97Nb#F9V+ at t1'L&S(bfBW/5s
t+_C8'`'EM&UL("4m`h3V_#DAH]>C+HUZ6SS[U1.XYouRoH)*A#Y^r_5GN1;hD=O;?2_Msos$KRF3P
@e6kU-a]g:uN.*;mgGt_L4Ihm+o'dl+ at 0CO:DQ&:%p%9 at HL9]'KaP8J.YpXM\]tlJCr?Fc&*5YHH`M
Gk&XJGI2&JX?ma+549F3'7"JAn#eY]n(C%_<NRjtb)aEbm[0K<h/9_Y\1Kh8,30uF>#,-Q'DS71P(8
\Y'2IPn*qf]_,\EIfhj1gp.m)H7K*8NKA#n6sTd1kCGMmb8 at s'baP.-7(5n'U;&#VU].hg_@@=Fu.p
1$2SdOYu;kFFAR,+'kaX-K/HsiF!f=(ja8k$?Tr0)M]h2kTPg5Nn0nZR_]9`MIt('UQ$5TfHQ;4VKe
kp56AqON]Om&A<0:d)'b%lp&RKpW2U!N*$2r47N&JU<nh*El.-nocN3cKIENLpdRmFtVc at f]7CUoke
TmM5HJrjVB5W(<Kjo at PMWnH#oBga/IK2k!hgtY28Eb_g\R?Y97hXJqPuCkc5T"RLdA9RE<<oCIJs,p
K:hQMak&G4KaW+l3 at Q#THRs'[@#]74_7i5]cnAS6Ua8po!psn*<Xh!lRl`W&H)[!?>EI'1=]!;><#t
b at O"16)))M8MB,hjgV_PXO=$fo<80"99c3l`1KS3]YBE\n_f(u/0&R]^S&;+;C5KPF_Yk)HpARDPZn
S=7O,\K%-caf1\FRBC_`5[gn5L;9Pg>-ME?URJEe$aZ;W_luT]lbn`j225.:!?2a]Nm='HC9i[>#>n
`S\L2R4KcD).X0-a=EtE+5nT5k\0ud8lYlJUbq?"'/7j"Qn#7>Y`70d#.TqFa(@=1dR at Z)f`7ic(VL
34BG=t9ebkbUEqgMV?@GK(bii<C03Sl#\4Dnf]b). at Jq_VAt<JZd^OKCH;jDqHmJ$8</uhq285E;!a
m'<\h'Lb%mP)3<\2[cDa5DRg+ACT\EnFi`MG$7MG4!ZcH3;9'YfN6q^e!q"kAn%kKO_FlW"UdFL!)u
g*2_ at uBmo[Ke3I8d2S&V`OjZfecbE6(nD)7Me<8&,(e6!q?<Qr23MBV;R@)/n$pokF9-T(U\4P(ULL
VRc8gmoC_o!sjuIL)2Cr$Z6N,=C-LZr7+(A>=sIgVGKPL[0M+(0S4)[O9bL#`c at f)TDX.>^\d,2IuE
RKgU,Os&]2Ns-S.Ui%k*#4/s(AhHk$9*8-]K*&;*%!i!qtrX!qRn9-uL;q_T,(q6c;-0 at ZL#ogV;"[
6[Gj_P'Pe8-]_&OF]^Iaq,q&_>GHRK1=nZBai'N&P+F0-j5).X_9pMl;*?1RGl_modLp,d=Z!uTE<O
9^^.79L`'1O+aI`(QY"BE.AYr1l.8E5_GG9(jPn`C2MOkfd>Wtd at O$_C6oRRjNi0N)'@NO0Sk\`3<F
fMcZcKmXr_HP$gmK/Q&Bc,l13dMd_]"Hc`XBMd@@A\K*b^p$mO=e/
xbtoa End N 1227 4cb E 6f S 24195 R d308a326

					der Mouse

			old: mcgill-vision!mouse
			new: mouse at larry.mcrcim.mcgill.edu