doubtful assumptions about pointers

Wed Jan 10 19:44:39 AEST 1990

In article <1250.25ab3338 at csc.anu.oz> bdm659 at csc.anu.oz writes:
>The following is a list of Doubtful Assumptions (DAs).  ...
>I'd welcome proofs in either case.

Well, I'll try to respond, but with explanations, not rationalistic
"proofs".  I really have to take issue with people who insist on
dissociating the purpose of the C Standard from reality, instead
arguing excessively over formalism.  We expressed the Standard in
technical English rather than a formal notation primarily in order
to aid programmers (and to a lesser degree, implementors) to relate
it to their daily activity.  It is not intended to form a system
suitable for treating with formal symbolic logic and therefore
should not be taken as such.  Thus, a truly perverse implementation
might actually comply with the letter of the Standard while
exploiting unintended loopholes to produce a travesty quite at
variance with the spirit of C.  (We tried to document in the
Rationale most of the intentional loopholes.)

Another meta-comment here is that the DA examples indicate too
much concern with representational aspects of entities within a
C program and too little concern with dealing with data at the
appropriate level of abstraction.  In the vast majority of
applications, these questions should not even arise.

The answers I give will assume that implementations do not go out of
their way to introduce unnecessary complications.  (Necessary ones,
caused by architectural or environmental considerations, are okay;
we deliberately allowed slack in the specifications to cover those.)

>DA[0]:  int *pi; char *pc;
>        Suppose pi is valid, and do  pc = (char*) pi.  Then *pc overlaps *pi
>        in the sense that changing the value of *pc changes the value of *pi.

TA (True Assumption).  The addresses of the bytes within a single object
constitute a nice linear address space.  (However, there need not be one
global linear address space within which all objects are located.)

It is not specified which PART of *pi is accessed by *pc, but some part
must be.  Big-endian and little-endian architectures will differ here.
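
A minimal sketch of this overlap follows.  The function name is invented
for this note, unsigned char stands in for plain char so that the stored
value is well defined, and int is assumed to have no padding bits.  The
sketch reports only THAT the int changed, not which part of it.

```c
#include <limits.h>

/* Returns 1 if storing through a byte pointer into an int changed the
   value of the int, 0 otherwise. */
int byte_store_changes_int(void)
{
    int value = 0;
    unsigned char *pc = (unsigned char *)&value;  /* some byte of value */

    *pc = UCHAR_MAX;     /* overwrite that one byte with all-one bits */
    return value != 0;   /* which bits changed depends on byte order */
}
```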

>DA[1]:  int *pi, *pj;  char *pc, *pd;
>        Suppose pi and pj are valid,  and that  pi == pj .
>        Now do  pc = (char*) pi; pd = (char*) pj .
>        Then  pc == pd .
>        [I bet this one generates some heat.  Don't forget to justify
>         your disproof with references to pANS.]

TA.  Pointers to distinct objects (including bytes within other objects)
compare unequal and vice-versa.  The only loophole an implementation
could exploit here would be to randomly select a byte address within the
int object when the conversion to char* occurs, knowing that alignment
constraints applied during the inverse conversion would recover the same
int*.  Even if such a loophole is logically permitted by the specification,
I don't think it poses a serious practical threat, because I see no
legitimate reason for introducing such run-time indeterminacy and
therefore don't expect to see it in practical implementations.  (The GNU
project might do it just to show how "clever" they are; that seems to be
their style, judging by their original treatment of #pragma.  Frankly,
such childish antics merely reinforce the negative opinion many already
have of "pointy-headed Ivy League intellectuals", who play obstructive
semantic games while the rest of us are trying to do productive work.)
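
For concreteness, the situation in DA[1] can be sketched as below.  The
function name is hypothetical, and the result shown is what any sane
implementation delivers, per the discussion above.

```c
/* Returns 1 if two equal int pointers remain equal after conversion
   to char *, and the inverse conversion recovers the original. */
int converted_pointers_agree(void)
{
    int object = 42;
    int *pi = &object;
    int *pj = &object;        /* pi == pj by construction */
    char *pc = (char *)pi;
    char *pd = (char *)pj;

    return pi == pj && pc == pd && (int *)pc == pi;
}
```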

>DA[2]:  Just like DA[1], but using type void* instead of char*.

TA.  A void* is really just a byte* (i.e., a char*) subject to additional
programmer-safety compile-time constraints.  The run-time representation
of void* and char* MUST be identical (3.1.2.5), so if one equality
comparison succeeds, the other must succeed as well.
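
One way to see the "same representation" requirement at work is to copy
the bytes of a void* into a char* and compare.  This is only a sketch
with a made-up function name; it assumes nothing beyond the requirement
cited above.

```c
#include <string.h>

/* Returns 1 if the bytes of a void * can be reinterpreted as a char *
   designating the same address. */
int void_and_char_share_representation(void)
{
    int object = 0;
    void *pv = &object;
    char *pc;

    if (sizeof pv != sizeof pc)
        return 0;                  /* would contradict 3.1.2.5 */
    memcpy(&pc, &pv, sizeof pc);   /* copy the representation verbatim */
    return pc == (char *)pv;
}
```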

>DA[3]:  long *pi, *pj;
>        Suppose that pi is valid, and do  pj = (long*)(int*) pi;
>        Then  pi == pj .
>        [comment: there's no rule that says an int can't have a more
>         strict alignment requirement that a long.]

TA.  If the conversion to int* does not violate the alignment constraint,
then the test for equality must succeed.  I don't know of any architectures
where it would be reasonable for the C implementation to impose stricter
alignment constraints on int than on long, so this is in practice a TA.
Artificial implementations could be devised that make this a DA; see my
meta-comments at the beginning of this article.
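
The alignment condition can be checked before attempting the round trip.
The sketch below relies on the _Alignof operator, which was added to C
much later (C11) and is therefore an assumption not available in the
Standard under discussion; the function name is likewise invented.

```c
/* Returns 1 if long * -> int * -> long * preserved the pointer,
   0 if it did not, and -1 if int is more strictly aligned than long,
   in which case the conversion is not attempted. */
int long_roundtrip_ok(void)
{
    long object = 0;
    long *pi = &object;
    long *pj;

    if (_Alignof(int) > _Alignof(long))
        return -1;              /* the perverse case: DA, not TA */
    pj = (long *)(int *)pi;
    return pi == pj;
}
```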

>DA[4]:  int i, *pi;
>        Suppose pi is a null pointer, and do  i = (int) pi .
>        Then  i == 0 .

FA (False Assumption), even assuming that int is the appropriate
implementation-defined integral type to satisfy 3.3.4.  The most obvious
implementation is to simply copy the pointer-representation bit pattern
unchanged into the integral datum, as indicated by the footnote.
Definitely, a null pointer need not be represented as all zero bits.
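
Whether a given implementation happens to use all zero bits can be
probed directly.  This is a sketch with a hypothetical function name;
either answer is conforming, which is the whole point.

```c
#include <string.h>

/* Returns 1 if this implementation represents a null int pointer with
   all bits zero, 0 otherwise.  Both results are permitted. */
int null_is_all_zero_bits(void)
{
    int *pi = 0;                     /* null pointer constant */
    unsigned char zeros[sizeof pi];

    memset(zeros, 0, sizeof zeros);
    return memcmp(&pi, zeros, sizeof pi) == 0;
}
```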

>DA[5]:  Just like DA[4], but with i of type  unsigned long .

Same comments here, assuming that unsigned-long is the appropriate type.
For integral types that are TOO LONG, it is a semantic violation and MUST
BE DIAGNOSED.  I suppose that it's within both the letter and the spirit
of the Standard for an implementation definition of the "size of integer
required" to be "any size greater than or equal to <n>" and a suitable
statement about the integral representation for all qualifying sizes;
that would avoid the need for such a diagnostic.

(NOTE:  I would regard any conformance test that looked for the too-long
diagnostic as being silly, for the same general reasons that I gave for
considering excessive linguistic analysis of the Standard as being silly.
It was certainly not intended that such specifications get in the way of
either programmers or implementors, and so long as there is a sensible
way around having to take the specs so literally when that has undesired
effects, we should all agree to do so -- in this case, as indicated via
a benign interpretation of what the implementation definition can be.)

>DA[6]:  int *pi, *pj;
>        Suppose pi is valid, and do  pj = (int*)(unsigned long) pi.
>        Then  pi == pj .

FA, assuming again that unsigned-long is the appropriate type.  (By the
preceding discussion, unsigned-long should certainly in practice always
be acceptable for 3.3.4 purposes.)  There can be reasonable
implementations such that no integral type can hold all the information
needed to represent a pointer.  That is why the Standard does not
require that the mapping between pointers and integers be invertible.
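
For comparison, a later revision of the Standard (C99) added the
optional type uintptr_t in <stdint.h>; where that type exists, a void*
converted to it and back compares equal to the original.  The sketch
below assumes such an implementation and uses an invented function name.
It does not rescue DA[6] as posed, since unsigned long carries no such
guarantee.

```c
#include <stdint.h>

/* Returns 1 if a pointer survives a round trip through uintptr_t. */
int roundtrip_via_uintptr(void)
{
    int object = 7;
    int *pi = &object;
    uintptr_t bits = (uintptr_t)(void *)pi;  /* pointer -> integer */
    int *pj = (int *)(void *)bits;           /* integer -> pointer */

    return pi == pj;
}
```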

>DA[7]:  int i, *pi;
>        Suppose i != 0, and do  pi = (int*) i .
>        Then  pi != (int*)0 .

FA.  (int*)0 is a null pointer of type (int*), whereas pi is the
implementation-defined result of converting the nonzero integer value
of i to an int*; nothing prevents that result from being a null
pointer.  0 in this source code context may be treated as a special
case by the compiler.

>DA[8]:  int *pi, *pj;
>        Suppose pi is a valid pointer of kind P3, and do
>        pj = (int*)(char*) pi .   Then  pi == pj .
>        [comment: the rule in section 3.3.4 only applies to pointers
>         to objects, which pi might not be.]

[P3 means "one past the end".]  I think 3.3.4 meant for "type" to
distribute over "object or incomplete", as it does explicitly later in
the same sentence.  The intent is to distinguish these from function
pointers.  Even if that interpretation is not upheld by X3J11, it would
be most unlikely that an implementation would cause this example not to
succeed, because it would take more work not to.  Thus, this is also TA.
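
A sketch of the P3 case, with a hypothetical function name.  The pointer
is formed and compared but never dereferenced, and the result shown is
what I would expect of any reasonable implementation rather than
something the wording of 3.3.4 indisputably promises.

```c
/* Returns 1 if a one-past-the-end pointer survives conversion to
   char * and back. */
int one_past_end_roundtrip(void)
{
    int a[3] = {1, 2, 3};
    int *pi = a + 3;               /* one past the end: may be formed */
    int *pj = (int *)(char *)pi;   /* never dereferenced */

    return pi == pj;
}
```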

>DA[9]:  int *pi;
>        Suppose that an external function f() is declared without prototype.
>        It expects a single argument of type void*.   Assume that pi is valid.
>        Then the call  f(pi)  works.
>        [comment:  See my remarks on section 3.1.2.5 below.]

FA.  int* and void* need not have the same representation, and generally
would not for a word-addressed architecture.  f((char*)pi) would work.
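
The safe form of the call looks like the sketch below.  Both function
names are made up; read_int() merely stands in for the f() of the
question.  A cast to void* serves as well as the cast to char*, since
the two share a representation.

```c
/* Stand-in for f(): expects a void * and reads an int through it. */
int read_int(void *p)
{
    return *(int *)p;
}

int call_with_explicit_conversion(void)
{
    int value = 99;
    int *pi = &value;

    /* The cast matters when no prototype is in scope: the argument is
       then passed as whatever type the caller wrote, unconverted. */
    return read_int((void *)pi);
}
```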

>DA[10]: void *pv;  external void *f();
>        In fact, f returns a value of type int*.
>        Then   pv = f()  works.
>        [comment:  See my remarks on section 3.1.2.5 below.]

FA, for the same reason as preceding.  External interfaces must have
matching input and output data representations, for obvious reasons.
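
The return-value direction can be sketched the same way, with invented
names.  The function is declared and defined with one return type, and
the int* to void* conversion happens inside it, where the compiler can
see both types.

```c
static int storage = 5;

/* Declared and defined as returning void *, so caller and callee agree
   on the representation of the returned value. */
void *get_storage(void)
{
    int *p = &storage;

    return p;              /* int * converted to void * here */
}

int read_storage(void)
{
    void *pv = get_storage();

    return *(int *)pv;     /* convert back before use */
}
```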

>References and some nit-picking.

[references omitted, since legitimately the whole Standard must be taken
as an integrated specification (which I once imprecisely labeled a
"gestalt"), not as a set of unrelated axioms from which formal deductions
are to be made]

>3.1.2.5.  types and type terminology
>          definitions of "object type" and "incomplete type"
>   nit-pick:  This section several rules of the form "types X and Y have the
>              same representation and alignment requirements".  Footnote 15
>              tells us that this is intended to imply interchangeability as
>              function arguments, function return values, and members of
>              unions.  However, this does not follow from the rule.
>              Interchangeability of two types as function arguments requires,
>              in addition, equality of argument-passing mechanisms.  This is
>              nowhere prescribed.

I don't know what you mean by this; the footnote is EXPLAINING what we
intended by these terms.  Don't you think that function arguments have
to be somehow represented and aligned?  (Note, by the way, that there are
often different alignment requirements for function arguments than for
other uses of the same data type; e.g., char arguments on the PDP-11.)

>3.3.4.    more on conversion amongst pointer types
>          conversions between integral types and pointers
>   nit-pick:  The case of (obj*)0 should be excluded from these rules
>              as it is specified differently in 3.2.2.3.

3.2.2.3 says that a null pointer constant, which may be expressed as 0
(among other alternatives), converted to a pointer constitutes a null
pointer.  3.3.4 says that an arbitrary integer may be converted to a
pointer.  Thus i=0,pi=(int*)i; does not necessarily result in pi
containing a null pointer representation, and that is intentional.
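
The distinction can be put side by side, as in this sketch.  The
function names are hypothetical, and intptr_t (an optional type from the
later C99 <stdint.h>) replaces int only so that the integer is as wide
as a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* 3.2.2.3: the constant 0 converted to a pointer is a null pointer. */
int constant_zero_is_null(void)
{
    int *pi = (int *)0;

    return pi == NULL;     /* always 1 */
}

/* 3.3.4: a run-time integer converted to a pointer gives an
   implementation-defined result, so this may return 0 or 1. */
int runtime_zero_is_null(void)
{
    intptr_t i = 0;
    int *pi = (int *)i;

    return pi == NULL;
}
```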

3.3.4 explains that conversions involving pointers (except ..., not
relevant here) shall be specified by an explicit cast, and it spells
out their implementation-defined and undefined aspects.  Note that
the construct (int*)0 is not covered under 3.3.4 since it does not
involve the conversion of an integer to a pointer -- 3.2.2.3 has
already given that construct a different interpretation.  That leaves
lots of constructs for which the Standard assigns no other meaning to
be encompassed by 3.3.4, for example (int*)1.

I really don't see that there is any practical problem in understanding
null pointers expressed like (int*)0 in C source code, once one realizes
that in this context 0 is a null pointer constant, not an integer.  There
has been continual confusion about this in comp.lang.c (INFO-C), but it
has nothing to do with the Standard; rather it inheres in the overloading
of the token 0 in source code to have multiple meanings.  This was not an
issue for the architectures for which C was initially implemented, but
the necessity of treating such expressions specially became more evident
as C spread to unusual architectures.  (The theoretically proper way to
have dealt with this would of course have been for the language to have
provided a reserved symbol such as "nil".  Keep that in mind when you
design the D programming language.)

>3.3.8.    relational operators
>   nit-pick:  The phrase "or both are null pointers" is missing from the
>              sentence in lines 8-10.  See the otherwise identical sentence
>              in section 3.3.9.

No, this omission was deliberate, since it is improper to provide a null
pointer as an operand of a relational operator, which is what 3.3.8 is
all about.  3.3.9 covers the equality operators, for which null pointers
are permissible operands.
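
In code, the difference between the two sections looks like this (a
sketch; the function name is made up):

```c
#include <stddef.h>

/* Returns 1 when both the relational and the equality test hold. */
int compare_properly(void)
{
    int a[4] = {0, 0, 0, 0};
    int *first = &a[0];
    int *end = a + 4;      /* one past the end of the same array */
    int *none = NULL;

    /* first < end   : relational, same array, so proper (3.3.8)
       first != none : equality, null operand allowed (3.3.9)
       first < none  : improper, and deliberately not written here */
    return (first < end) && (first != none);
}
```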

In summary:

As I've said in the past and elaborated somewhat upon at the beginning
of this article, one cannot understand what C is by applying formalistic
arguments to the phraseology in the Standard.  I doubt that the Standard
in itself suffices to completely specify what is essential about C to
someone who has never encountered it (or, even more extreme, who knows
nothing about computer programming); THAT IS NOT ITS PURPOSE.  It is
merely intended to serve as a reference "treaty" by which both C
programmers and C implementors agree to be bound, in order to facilitate
the use of C as a practical tool in solving real-world problems, with
particular emphasis on source-level application portability.

Therefore, you should refer to the Standard to see what the terms of the
treaty are, not to determine what is sane or insane.  An unduly warped
implementation does not facilitate the use of C; there is much more
involved in determining the utility of an implementation than merely
literal conformance to the letter of the Standard.  (X3J11 termed these
"quality of implementation" issues.)  An implementor who provides a
perverse implementation would undoubtedly incur the wrath of his
customers, and deservedly so.