Binary data file compatibility across machines

Peter Holzer hp at vmars.tuwien.ac.at
Tue Nov 27 03:09:34 AEST 1990


stiber at cs.ucla.edu (Michael D Stiber) writes:


>On different machines, the implementation of C data types is different.
>I forget what fixed types' lengths are, but I know that at least some of
>them may vary.  I also know that doubles can have different encoding
>schemes (ie, IEEE vs. DEC).  Then, there's little endian machines versus
>big endian ones.

>So, my question is this:  Say you want to share data files among
>different machines.  You also want to be able to use the same code on
>each machine.  Therefore, you want to have either a uniform file format,
>or you want the code to be able to figure out what the file format is,
>and convert it to the native data type representation.  Now, one alternative
>would be ASCII files --- this is guaranteed to work (assuming that you
>can get C on an IBM 3090 to write ASCII).  However, in my application,
>ASCII would produce files that are way too huge --- I must use a binary
>format.  So, is there an already-existing, standard solution to this
>problem of binary data file transfer?
>--
>			    Michael Stiber
>			  stiber at cs.ucla.edu
>		   ...{ucbvax,ihpn4}!ucla-cs!stiber
>		     UCLA Computer Science Dept.

I do not know of any standard solution (ANSI, ISO or the like),
but here is my personal ``standard''.
(Well, most of the time I just use ASCII files. They are not that
much bigger, and I can examine (and change!) them with standard
tools.)

For integer data I choose the format that is used on the machines 
I am working on most of the time. Each binary data file then
gets a header describing the data format. Something like

<magic number>		2 Bytes: 'P' 'B' (portable binary)
<integer type>		1 Byte:  0 = 2's compl., 1 = 1's compl.,
				 2 = sign/mag.
<endianness>		1 Byte:  0 = little, 1 = big.
<float-format>		??
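
A minimal sketch of writing and checking such a header, assuming the
byte values listed above. The names pb_header, pb_write_header and
pb_read_header are made up for illustration, and the float-format
byte is left out because its encoding is not defined here.

```c
#include <stdio.h>

/* Values taken from the header layout above. */
enum pb_int_type { PB_TWOS_COMPL = 0, PB_ONES_COMPL = 1, PB_SIGN_MAG = 2 };
enum pb_endian   { PB_LITTLE = 0, PB_BIG = 1 };

#define PB_MAGIC0 0x50	/* ASCII 'P' */
#define PB_MAGIC1 0x42	/* ASCII 'B' */

struct pb_header {		/* hypothetical name */
	unsigned char int_type;
	unsigned char endian;
};

/* Write the 4 header bytes; returns 0 on success, -1 on error. */
int pb_write_header (FILE *fp, const struct pb_header *h)
{
	if (putc (PB_MAGIC0, fp) == EOF || putc (PB_MAGIC1, fp) == EOF ||
	    putc (h->int_type, fp) == EOF || putc (h->endian, fp) == EOF)
		return -1;
	return 0;
}

/* Read and validate the header; returns 0 on success, -1 on error. */
int pb_read_header (FILE *fp, struct pb_header *h)
{
	int m0 = getc (fp);
	int m1 = getc (fp);
	int t = getc (fp);
	int e = getc (fp);

	if (m0 != PB_MAGIC0 || m1 != PB_MAGIC1)
		return -1;		/* wrong magic or short file */
	if (t < PB_TWOS_COMPL || t > PB_SIGN_MAG)
		return -1;
	if (e != PB_LITTLE && e != PB_BIG)
		return -1;
	h->int_type = (unsigned char) t;
	h->endian = (unsigned char) e;
	return 0;
}
```

The magic bytes are written as numeric constants rather than as 'P'
and 'B' so that a machine with a non-ASCII character set produces the
same file.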

I have not needed a float format so far, but I would support at
least two: IEEE, and a generic format where a float is broken
into a mantissa (long int) and an exponent (short int).
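
One possible reading of the generic format, sketched with the
standard frexp and ldexp functions. The struct and function names are
hypothetical, and the choice of a 31-bit mantissa with a power-of-two
exponent is an assumption, not part of the scheme above. Infinities
and NaNs are not handled.

```c
#include <math.h>

/* Hypothetical generic representation: value = mantissa * 2^exponent. */
struct pb_float {
	long  mantissa;		/* signed, magnitude below 2^31 */
	short exponent;
};

/* Split a finite double into mantissa and exponent. */
struct pb_float pb_float_split (double d)
{
	struct pb_float f;
	int e;
	double m = frexp (d, &e);	/* d = m * 2^e, 0.5 <= |m| < 1 */

	/* Scale the fraction to 31 bits; lower bits are truncated. */
	f.mantissa = (long) ldexp (m, 31);
	f.exponent = (short) (e - 31);
	return f;
}

/* Rebuild the double from mantissa and exponent. */
double pb_float_join (struct pb_float f)
{
	return ldexp ((double) f.mantissa, f.exponent);
}
```

A 31-bit mantissa holds every value of an IEEE single-precision float
exactly, but a double loses its low-order bits. The two fields could
then be written with the same routines as any other long and short.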

Shorts are assumed to be 2 bytes, longs 4 bytes (the minimum
sizes required by ANSI).

A program which reads these files would first check whether the
data format is the same as the one it uses internally. If it is,
it can use fread for the rest of the file (and fwrite when writing);
otherwise it has to call special routines to deal with the various
types. Most of the time the file will be read on the same machine
it was written on, so files can usually be read fast.
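
A sketch of how that check could be made at run time. The function
names are made up for illustration; the return values follow the
header codes given earlier (0/1/2 for the integer type, 0/1 for the
byte order). Mixed byte orders (as on the PDP-11) are not detected.

```c
/* Return 1 if the native byte order is big-endian, 0 if little-endian.
   Looks at the lowest-addressed byte of the value 1. */
int native_endian (void)
{
	unsigned long one = 1;

	return *(unsigned char *) &one == 0 ? 1 : 0;
}

/* Return 0, 1 or 2 for two's complement, ones' complement or
   sign/magnitude. */
int native_int_type (void)
{
	/* The low two bits of -1 differ in the three representations:
	   ...11 (two's compl.), ...10 (ones' compl.), ...01 (sign/mag). */
	switch (-1 & 3) {
	case 3:  return 0;
	case 2:  return 1;
	default: return 2;
	}
}

/* Return 1 if the file can be read with plain fread, 0 otherwise.
   The size test reflects the assumption of 2-byte shorts and
   4-byte longs in the file. */
int can_use_fread (int file_int_type, int file_endian)
{
	return file_int_type == native_int_type ()
	    && file_endian == native_endian ()
	    && sizeof (short) == 2 && sizeof (long) == 4;
}
```

On a machine whose long is wider than 4 bytes, can_use_fread returns
0 and the reader falls back to the byte-by-byte routines.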

A portable routine to read a big-endian sign/magnitude long would then
be:

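#include <stdio.h>

/* Read a 4-byte big-endian sign/magnitude long.
   EOF is not checked here; the caller should test feof/ferror. */
long read_long_bs (FILE * fp)
{
	unsigned long ul;

	/* Use only the low eight bits of each character read. */
	ul = (unsigned long) getc (fp) & 0xff;
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);

	/* Top bit is the sign, low 31 bits the magnitude.  Negate as a
	   signed long so the result does not rely on unsigned wraparound. */
	if (ul & 0x80000000UL)
		return - (long) (ul & 0x7fffffffUL);
	return (long) ul;
}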
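
The matching writer might look like the following sketch.
write_long_bs is a hypothetical name chosen to mirror the reader, and
the range check is an addition for machines whose long is wider than
32 bits or can hold a value that has no sign/magnitude counterpart.

```c
#include <stdio.h>

/* Write l as a 4-byte big-endian sign/magnitude value.
   Returns 0 on success, -1 on a write error or if the magnitude
   of l does not fit in 31 bits. */
int write_long_bs (long l, FILE *fp)
{
	unsigned long ul;

	if (l < -0x7fffffffL || l > 0x7fffffffL)
		return -1;

	/* Sign goes into the top bit, magnitude into the low 31 bits. */
	ul = l < 0 ? (0x80000000UL | (unsigned long) -l) : (unsigned long) l;

	/* Most significant byte first. */
	if (putc ((int) ((ul >> 24) & 0xff), fp) == EOF ||
	    putc ((int) ((ul >> 16) & 0xff), fp) == EOF ||
	    putc ((int) ((ul >> 8) & 0xff), fp) == EOF ||
	    putc ((int) (ul & 0xff), fp) == EOF)
		return -1;
	return 0;
}
```

For example, write_long_bs (-5L, fp) puts the bytes 0x80 0x00 0x00
0x05 into the file.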

Oh yes, I am assuming that a character is 8 bits and that the
machine is using the ASCII character set. If that is not the case,
the program must not use more than the lowest eight bits of any
character, and strings must be converted to ASCII first.

--
|    _  | Peter J. Holzer                       | Think of it   |
| |_|_) | Technical University Vienna           | as evolution  |
| | |   | Dept. for Real-Time Systems           | in action!    |
| __/   | hp at vmars.tuwien.ac.at                 |     Tony Rand |


