Are you aware of the loop-mode instructions for the 68010?  They are
discussed on the last few pages of the 68000-68008-68010 book from
Motorola.  I did some testing, and for long copies (> ~100 bytes)
they are a whole lot faster.  Apparently the compiler doesn't
use them.  I wrote a memcpy()-type routine, and compiled it with
and without the optimizer, and it did not use these instructions.
The libc.a versions do use them, so either they were hand-coded
in assembler, hand-optimized, or built with a different compiler,
or I'm missing something.  The MGR bitblt could be sped up a lot
just by using these instructions.

The way they work (this is from memory; my book is at home) is as
follows.  Given a normal copy function:
	for (i=100; i > 0; i--)
		*dest++ = *src++;

the compiler outputs something like:

	mov.l	&100,%d0
	mov.l	dest,%a0
	mov.l	src,%a1
top:	mov.b	(%a1)+,(%a0)+
	sub.l	&1,%d0
	bgt	top

Convert this to (with the count reduced by one, since dbf keeps
looping until the counter reaches -1)

	mov.l	&99,%d0
	mov.l	dest,%a0
	mov.l	src,%a1
top:	mov.b	(%a1)+,(%a0)+
	dbf	%d0,top

and the 68010 recognizes this as loop mode (due to its prefetch), and
does not fetch the move or branch instructions again, saving 4
memory accesses per iteration (1 for mov.b, 1 for sub.l, and 2 for
bgt).  This is a big win.  Note that it only works for branches with
a displacement of -4 (i.e. exactly one single-word instruction before
the dbxx), which happens to be ideal for copies.
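
For anyone who wants to see the dbf counting rule in C, here is a rough
model of the loop above.  It is only a sketch: the name copy_dbf_style
is made up for illustration, and I have not checked whether any
particular compiler turns it into the mov.b/dbf pair.  dbf only
decrements the low 16 bits of the register, so one pass of the inner
loop can move at most 65536 bytes; the outer loop handles longer copies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n bytes using a countdown loop shaped
 * like "mov.b (%a1)+,(%a0)+ ; dbf %d0,top". */
void copy_dbf_style(unsigned char *dest, const unsigned char *src, size_t n)
{
    while (n > 0) {
        /* dbf uses a 16-bit counter, so split long copies into chunks. */
        size_t chunk = (n > 65536) ? 65536 : n;

        /* Load count - 1: the body runs once more than the initial value. */
        uint16_t count = (uint16_t)(chunk - 1);

        do {
            *dest++ = *src++;       /* the single-word loop body */
        } while (count-- != 0);     /* like dbf: exit when 0 wraps to -1 */

        n -= chunk;
    }
}
```

Calling copy_dbf_style(dest, src, 100) runs the inner loop with an
initial count of 99 and copies exactly 100 bytes.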

Anyhow, I think this would make a huge improvement to MGR,
since it showed me approx 10 times the performance on a
quick 1000-byte-copy benchmark.  Check it out.

