Spelling utilities in C (Part 1 of 3)

Thu Jun 28 17:01:49 AEST 1990

#!/bin/sh-----cut here-----cut here-----cut here-----cut here-----
# shar:	Shell Archiver
#	Run the following text with /bin/sh to create:
#	README #	copyr.h #	dawg.h #	dictlib.h #	grope.h 
cat - << \SHAR_EOF > README

This is a suite of programs for working with words.  I am releasing
it to alt.sources primarily to invite people to test it on as many
platforms as possible (mail me back any bug reports please!) so that
the final code will be completely portable.

Unpack with unshar or sh < part1; sh < part2; sh < part3

Utilities for spelling checking and correction are supplied, but
the ultimate aim is to support all sorts of word-manipulation activities
such as a writers workbench and assorted word games (See PS).

This news posting is in three parts; part1: this file + headers;
part2: utility modules in .c files to be #include'd; part3: main
programs; just CC these and they should work -- no fancy makefiles
or link commands.

The code here doesn't have much of a user interface; I'm rather hoping
that people who pick it up will hook it in to their favourite operating
system as smoothly as they can.  Perhaps someone with the time to
spare (who am I kidding :-( ) might integrate it with ispell for instance.

I've experimented with various interfaces already; I can accept TeX
input and write output suitable for correcting text in either emacs
or uEmacs.  I'm not going to release these though, until I'm happy
with the internal code.  Incidentally, the code will be totally
public domain - meaning that I've no objection to its being used
in commercial projects.  HOWEVER I would appreciate if you didn't do so
until it is officially released (probably through comp.sources.misc)
because I would prefer to define portable file-formats and magic numbers
first.

OK, enough babble; here's how it all works:

1) acquire a list of words for yourself; I have one but at 690K it is
   too big to post. (actually even these sources come to around 120K
   so I hope they make it through OK.  I don't think I have a shar
   that splits, so I've divided into three shars myself.)
   The word list should be sorted by ascii order.

2) cc dawg.c
   I've deliberately made main programs into one source file, by
   including *.c for utility modules; I know this is bad style but
   it helps make compiling and porting easier (no worries about
   long external names for instance, or non-case-sensitive linkers)

3) dawg yourdict dict
   'yourdict' is your wordlist; but the second parameter should
   be 'dict' at least while testing.  This will generate dict.dwg
   which is a compacted and *FAST* data structure for word access.

   Building a dawg[See next section for meaning of name] from a word
   list is a real memory pig; so I've written code which will let you
   generate the dawg in chunks (traded off against a small loss of
   compression density).  (Interestingly enough it isn't a time-pig;
   using a hash-table for node merging gives almost linear time - thanks
   to Richard Hooker for that one).

      If you don't have enough memory (say 2Mb?) you'll get a run-time
   message suggesting that you recompile with cc -DSMALL_MEMORY dawg.c
   (OK, so one day I'll make it select this mode itself at run-time)

   IBM PC's and Mac's will get this mode by default. [See FOOTNOTE]

    If you want to check that the data structure generated is ok,
    3a) cc pdawg.c
    3b) pdawg dict.dwg
        This should decompile the data and print out the
        original word list

4) cc dwgcheck.c
   Compile a Mickey Mouse spelling checker

5) dwgcheck a few wurdz on the comand line ?
   ... and test it.

6) cc tell.c
   Now try the 'advanced' stuff! - soundex correction... (I hates it :-()

7) tell me sume wrongli speld werds
   and test it...

   If you're getting into this stuff, here's another incidental
   hack you might want to try...
    7a)  cc proot.c
    7b)  proot root
         print out all words of form 'root*'
         One day I'll hook this stuff into regexp; but not today :-)

----------

Enough examples for now.  If that all worked you are doing well!
If it didn't, and you have the time, please find out what went wrong;
If you can fix it, mail me the fix please, else mail me a log of
the symptoms, such as compiler error messages.

By the way, I know that the spelling correction uses a tacky algorithm;
don't worry about it -- I'm working on some red hot stuff which will
put soundex to shame (which isn't hard :) )  [Late note; it now works
(apparently well) but I need a phonetic dictionary before it is useful :-(
Anyone got a phonetic dictionary?  Alvey people out there?]

You might be interested in the data structures used; the main one is
a dawg - a 'Directed Acyclic Word Graph' -- it's essentially a
Directed Acyclic Graph implementing an FSM which describes the set of
all words in your lexicon.  The name DAWG comes from Appel's in his paper
on Scrabble (although none of the code or algorithms do).

The second most important data structure is superimposed on this
one; it is a packing algorithm due to Liang (of TeX hyphenation
fame) which allows you to convert an implicitly linked trie
into an indexed trie *without* taking any more space.  Liang is
a real smart cookie & it is a shame this technique of his is
not better known... (it should be up there among binary tree and
linked lists!)

OK, more examples:

8)  cc pack.c
    compile the packing code

9)  pack  dict.dwg dict.pck
    this takes rather a long time; you'll get messages saying '1000 nodes'
    '2000 nodes' etc roughly 1 every 20 seconds +/- 50%.  There shouldn't
    be more than a couple of screenfuls of these :-)  I'll speed it up one
    day.

    If you want to check that this data structure generated is ok too,
    9a) cc ppack.c
    9b) ppack dict.pck
        Just as you did for (3)

10) cc pckcheck.c
    Just like dwgcheck but using the other type of data.

11) pckcheck sume moar possibly incorrectly speld woards
    boring, huh?

12) Tha Tha Thats all folks!

---------------

Unless I've made some major cockup at the last minute, you should
actually have a reasonable chance of getting these working on your
machine, whatever it is. (Dec 20's and EBCDIC machines *possibly*
excepted - attempts welcome!)

As I said, please mail me your bugfixes, problems, suggestions etc.
By the way, the standard of coding isn't exceptionally high; this
implementation was always intended to be a prototype.  A rewrite
of dawg.c and pack.c is definitely high on the priority list...
The actual application programs aren't too bad, but the interfaces
to the various utility routines are liable to change on an hour-by-hour
basis! (I've been trying to make them more consistent -- you should have
seen the last lot ;-) ).

However, the algorithms are all complete now; anyone who has ever needed
to convert a tree to a dag might be interested in the linear-time solution
in dawg.c - most other solutions I've seen have been NlogN or N^2.

If you're interested, there's a file dictlib.h which is a proposed interface
for the next iteration; comments are invited.

Share & Enjoy,

Graham Toal <gtoal at uk.ac.ed>

<FOOTNOTE>
   I'm experimenting in this code with ways of finding out at compile
time various parameters needed for portability.  There's a file
grope.h which does this.  If things all work first time, great;
if not, you may have to change various #define's.  The best way
to do that is to add code to grope.h which identifies your system,
and only make changes of the form:

#ifdef SYS_MYSYSTEM
#undef something
#define something (a different value)
#endif /* my system */

There are instructions for customising in grope.h itself.  My overall
aim is to avoid special makefiles and special command line options.
(A bit ambitious I know, but nice idea if it works...)
(Steve Pemberton's config.c might be useful to you too if you are
hacking grope.h.  It was reposted on usenet recently.)

</FOOTNOTE>

<PS>

Future hacks:
   1) Anagrams
   2) Crosswords
   3) Scrabble
   4) Anything where a text key can be used to lookup another
      piece of text.  This is implemented with the 'dawg_locate_prefix'
      routine; it is effectively a cheap substitute for btrees etc.
      (and better than a hash table because you can *do* things with it!)

      4a) Reverse phone book      1234567=Fred Bloggs
      4b) house style enforcer    stupid=infelicitous
      4c) uk/us spelling advisor  sidewalk=pavement
      4d) shorthand/macro expander   f.y.i.=for your information
      etc etc... (Most of wwb in fact)

   5) Anything you can suggest! (or *implement* :-) )

</PS>

<MEMO>

Archive-name: dawgutils/part[1-3].shar
Reply-to: Graham Toal <gtoal%uk.ac.ed at nsfnet-relay.ac.uk>

1) Upload fixed version of dyntrie.c - remember bug with signed chars
   being used as an index [comment rule in dawg.c?]

2) known design flaw: can't handle strings with 0 in them

3) known bug: dyntrie.c has problems if you enter a string
   which *starts* with 255.  This is due to sending 'end_of_node'
   bit rather hackily.  Can be fixed.

4) Test version to be posted to alt.sources by running on machine
   with signed chars (eg MSDOS)

5) Remove hacky Malloc debugging macros in utils.c -- there might be
   a problem if compiled on MSDOS but not under MICROSOFT C

6) Check that LIB_STRINGS is *not* defined when UX63 is set.

7) tweak the constants in pack.c to speed it up a lot

8) hack the pack.c heuristics into dyntrie.c to speed it up too.

</MEMO>
SHAR_EOF
cat - << \SHAR_EOF > copyr.h
/*
 * Copyright 1989 Graham Toal & Richard Hooker
 *
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose with or without fee is hereby granted provided
 * that the above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation.
 *
 * This program is publicly available, but is NOT in the public domain.  The 
 * difference is that copyrights granting rights for unrestricted use and 
 * redistribution have been placed on the software to identify its 
 * authors.  You are allowed and encouraged to take this software and
 * build commercial products, freeware products, shareware products, and
 * any other kind of product you can think up, with the following restriction:
 *
 * If you redistribute the source to this program, or a derivitive of that
 * source, you must include this copyright notice intact. If the system
 * this source is distributed with is under a stricter license (such as
 * a commercial license, the typical freeware "no commercial use" license,
 * or the FSF copyleft) then you must provide the original source under the
 * original terms.
 *
 * Edinburgh Software Products (E.S.P.) makes no representations about the
 * suitability of this software for any purpose.  It is provided "as is"
 * without express or implied warranty.
 *
 * E.S.P. DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
 * ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL
 * E.S.P. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
 * ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER
 * IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
 * OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *
 * Authors:  Graham Toal & Richard Hooker, Edinburgh Software Products,
 *           6 Pypers Wynd, Prestonpans, East Lothian, Scotland.
 */
SHAR_EOF
cat - << \SHAR_EOF > dawg.h
/**
*
*       DAWG.H
*
*       Header file for Directed Acyclic Word Graph access
*
*       The format of a DAWG node (8-bit arbitrary data) is:
*
*        31                24 23  22  21                                     0
*       +--------------------+---+---+--+-------------------------------------+
*       |      Letter        | W | N |??|            Node pointer             |
*       +--------------------+---+---+--+-------------------------------------+
*
*      where N flags the last edge in a node and W flags an edge as the
*      end of a word. 21 bits are used to store the node pointer, so the
*      dawg can contain up to 262143 edges. (and ?? is reserved - all code
*      generating dawgs should set this bit to 0 for now)
*
*      The root node of the dawg is at address 1 (because node 0 is reserved
*      for the node with no edges).
*
*      **** PACKED tries do other things, still to be documented!
*
**/

#define TRUE (0==0)
#define FALSE (0!=0)

#define DAWG_MAGIC 0x01234567
#define PACK_MAGIC 0x34567890

#define TERMINAL_NODE 0
#define ROOT_NODE 1

#define V_END_OF_WORD   23
#define M_END_OF_WORD   (1L << V_END_OF_WORD)
#define V_END_OF_NODE   22                     /* Bit number of N */
#define M_END_OF_NODE   (1L << V_END_OF_NODE)   /* Can be tested for by <0 */

#define V_FREE          22             /* Overloading this bit, as
                                          packed tries need no end-of-node */
#define M_FREE          (1L << V_FREE)

#define V_LETTER        24
#define M_LETTER        0xFF
#define M_NODE_POINTER  0x1FFFFFL     /* Bit mask for node pointer */

/* max_chars and base_chars are a hangover from the days when this
   was trying to save space, and dawgs were only able to contain
   characters in the range 'a'..'z' 'A'..'Z' (squeezed into 6 bits).
   Now that we're '8-bit clean' we no longer need these.  Later code
   in fact doesn't support the old style; but some procedures still
   subtract 'BASE_CHAR' from the character to normalize it.  Since it
   is now 0 it is a no-op... */
#define MAX_CHARS       256
#define BASE_CHAR       0

/* Enough? */
#define MAX_WORD_LEN    256

#define MAX_LINE        256

typedef long NODE;
typedef long INDEX;

SHAR_EOF
cat - << \SHAR_EOF > dictlib.h
/* This header file does not describe any of the code in the
   accompanying files; it is a rough sketch of the NEXT iteration
   of the spelling/word utilities.  Comments are invited.  Missing
   functionality or bits that won't work should be pointed out
   please!   Send mail to gtoal at ed.ac.uk  (via nsfnet-relay.ac.uk
   from US sites)
   many thanks.

      Graham Toal 23/06/90
 */

typedef long INDEX;
typedef long NODE;
#define ROOTNODE 1L

typedef struct DICT {

  /* Although primitive procedures will be provided for handling the
     basic dawg array, by using this interface applications can
     remain independent of the implementation details of the
     dictionary. */

  /* So far there are three forms of dict; 1: a simple dawg with edges
     as 4-byte integers; each node (a set of branches) is stored
     sequentially.  2: An *indexed* form of the above -- much faster
     to lookup in, but slower to walk through.  Although indexing
     would normally be heavy on space, this uses Liang's packed
     overlapping tries, thus using almost exactly the same space as
     method 1.  3: A form of 1, but with all the stops pulled out
     to get as much bit-level compaction as possible; edges can
     be two bytes instead of 4. (by Keith Bostic) */

  /* This struct contains both access procedures and information;
     most fields will be filled in by our code when the dictionary
     is loaded by load_dict().  The fields may be added to in
     later releases, but they will always be upwards compatible;
     none of this data is saved to disk in this format, so changing
     the struct layout will not cause problems as long as the names
     remain the same. */

  /* Note that some procedures take a user-supplied 'apply' parameter;
     This 'apply' procedure has a 'context' parameter which is merely
     a handle to allow the user to pass data in to his code without
     having to use global variables.  If the user's apply procedure
     returns anything other than 0, it will cause an early exit, with
     the users return code being returned from the calling function. */

  /* It is intended that these are used in an object-oriented way;
     although we will supply an external dict_check(word, dict) procedure, for
     instance, all it will do is call dict->check(word, dict->dawg) */

  NODE *dawg;
                                   /* Root of the dictionary tree */
  char *dict_name,                 /* "smiths" - used in command lines etc. */
       *dict_info,                 /* "Johns Smiths Rhyming Dictionary,\n
                                       (c) Smith & Jones, 1892\n
                                       Whitechapel, London.\n" */
       *dict_fname,                /* "/usr/dict/smiths" */

       *lang_charset;              /* "ISO 8859/1 Latin 1" */

  /* These are filled in by us to make life easier for the applications
     programmer.  We'll supply a default table for simple 7-bit ascii
     dawgs; but this mechanism allows us the option of including all
     the salient points about the character set used in a dictionary
     in the dictionary itself.  Because they are char*'s, we'll probably
     end up sharing the same table between multiple dictionaries --
     so don't write to them. (Although you may replace the _pointer_
     with one to a table of your own. */

  /* Later note: I've just found out about LOCALE.H etc.  This stuff
     may therefore change... */

  int *chartype;
                                    /*  1             whitespace           */
                                    /*  2             punctuation          */
                                    /*  4             blank                */
                                    /*  8             lower case letter    */
                                    /*  16            upper case letter    */
                                    /*  32            (decimal) digit      */
                                    /*  64            underscore           */
                                    /*  128           A-F and a-f          */
                                    /*  256           Ignore words starting
                                                      with this char       */
                                    /*  512           Include in definition
                                                      of a word -- mainly
                                                      alpha but also hyphen
                                                      and apostrophe       */

char   *lower,                     /* personal implementation of tolower() */
       *unaccented,                /* map accented chars to non-accenmted */
       *lexorder;                  /* an arbitrary ordering for sorting:
                                      e-acute would come after e and before f
                                      */

  int  (* check)  (NODE *base, INDEX state, struct DICT *d, char *word);
                     /* Check a word - exact match; return true/false */

  int  (* fuzzy)  (NODE *base, INDEX state, struct DICT *d, char *word,
                   char *value);
     /* Check a word - fuzzy case, accents, hyphens etc; return true/false */
     /* value is filled in with the dictionary's idea of what the
        word should look like. */

  /* Fancier correction procedures may be synthesised out of user code and
     'walk' */
  int  (* correct) (
                char *word,
                NODE *base, INDEX state,
                                /* Normally dict's root, but may be subtree */
                struct DICT *d,   /* lowercase/accent information from here */
                int   maxwords,
                int   order,                /* 0: by alpha
                                               1: by highest score first
                                               2: by lexical ordering
                                            (ignores case, accents, hyphens)
                                             */
                int   (* apply) (
                      char *word,     /* Corrected word */
                            int   score,    /* Normalised to range 0..255 */
                            void *context),
                void  *context);
                                   /* Offer spelling corrections for the
                                      given word.  If maxwords > 0, return
                                      the best <maxwords> words in order
                                      of most-likely first; otherwise
                                      many words may be returned, in
                                      alphabetical order, coupled with
                                      a score: 255 means exact match; 0
                                      means totally different */

                                    /* returns 0 if user returned 0 for
                                       all applications */
  int  (* walk) (
               NODE *base, INDEX state,
               int   (* apply) (char *word, void *context),
               void  *context);
                                   /* Apply the procedure given to every
                                      word in the dictionary tree or subtree,
                                      passing in the user-supplied context */

                                    /* returns 0 if user returned 0 for
                                       all applications */

  int (* lookup) (
              NODE *base, INDEX state,
              char  *key,
              /* Out */
              INDEX values);
                                   /* Return the index of the subtree of the
                                      dictionary which has 'key' as a prefix;
                                      for instance, in a reverse telephone
                                      book, you might find entries:
                                         0874=Anytown
                                         0875=Cockenzie
                                         0875=Port Seton
                                         0875=Prestonpans
                                         0876=Toytown
                                       In this case, if searching for "0875=",
                                       a subtree would be returned containing:
                                         Cockenzie
                                         Port Seton
                                         Prestonpans
                                       This subtree would then be walked
                                       using the 'walk' function and a user-
                                       written 'apply' procedure. */

                                    /* returns 0 if successful */

  /* The following two are performance hints ONLY.  They should not
     affect the correct functioning of a program. */

  int  (* uncache) (struct DICT *d);/* The space a dictionary uses may be
                                       reclaimed by calling this.  If the
                                       dictionary is subsequently accessed,
                                       there *may* be a performance penalty --
                                       for instance the dictionary may be
                                       accessed off disk.  However on a machine
                                       with sufficient memory, the system
                                       may choose to leave the data in ram
                                       until, say, malloc runs out of space.
                                         If there is no convenient mechanism
                                       for organising the demand-recacheing,
                                       this may be a no-op. */

                                    /* returns 0 if successful */

  int  (* recache) (struct DICT *d);/* The data of the dictionary is reloaded
                                       into active memory where it will stay */

                                    /* returns 0 if successful */
  /* May be extended in the future */
} DICT;

/*
                            dict_load

   This is provided as a primitive so that dictionaries can be loaded
in the most efficient way the machine supports.  The space for the dict
is supplied *by* this procedure - not *to* it.  This allows a Mach (or Emas)
implementation to mmap (or Connect) the file in memory - the connection
address being chosen by the OS and outside our control.  It also allows
systems like RISC OS on the Acorn Archimedes to use an optimised whole-
file load call to pull the file off disk into real ram in a single disk
operation.

   Since the data parts of the type 1 & 2 dicts are just a set of 4-byte
integers, it is easy to correct a faulty byte-sex by running down this
array *once at start-up* to reverse the order.  This is easy if the
data is loaded into physical ram.  If it is connected as a file by a VM system
however, the VM system must support copy-on-write; if it only has read-only
mapping there must be two files available -- one of each sex.

   A *possible* scheme on VM systems is to byte-sex-reverse a whole
page the first time that page is touched.  This may be a lot of unneccesary
work - I'd recommend keeping a correct-sex file on-line.

   This code will create the DICT struct, and fill in the various fields,
either from the dictionary file itself or from private knowledge if not available.

*/

int dict_load_file(char *fname, DICT **dict_struct);

                                   /* fname is the literal file name */

                                   /* dict_struct is allocated by us,
                                      not by the user. */

                                   /* returns 0 if successfully loaded */

int dict_load_dict(char *dname, DICT **dict_struct);

                                   /* dname is the dictionary name,
                                      e.g. "words", or "web2" etc.
                                      The corresponding file will be searched
                                      for via a system-dependent path, logical
                                      variable, or fixed place. */

                                   /* dict_struct is allocated by us,
                                      not by the user. */

                                   /* returns 0 if successfully loaded */

 int dict_unload(DICT *d);         /*  Use when totally finished with data */

 /* These are user-level veneers for the internal procedures used above.
    Use them in preference to the above.  Use the above only if
    you are a speed-freak! (but disguise them in macros such as:

  #define  dict_check(dawg, dict, word) dict->check(dict->dawg, dict, word)

   These macros will of course go 'Bang!' if dict ptr is unassigned or NULL.
 */

 /* In the worst case, if a machine has a poor quality compiler which
    doesn't support procedure variables well, these routines could be
    the *acual* implementation rather than just a pointer to it. */

 int  check  (NODE *base, INDEX state, DICT *d, char *word);
 int  fuzzy  (NODE *base, INDEX state, DICT *d, char *word, char *value);
 int  correct (char *word,
               NODE *base, INDEX state,
                           /* Normally dictionary root, but may be subtree */
               DICT *d,          /* lowercase/accent information from here */
               int   maxwords,
               int   order,                /* 0: by alpha
                                              1: by highest score first
                                              2: by lexical ordering
                                            */
               int   apply(char *word,     /* Corrected word */
                           int   score,    /* Normalised to range 0..255 */
                           void *context),
               void  *context);
 int  walk   (NODE *base, INDEX state,     /* Yoyo fanatics will be pleased
                                 to learn that they can 'walk the dawg'... :-) */
              int    apply(char *word, void *context),
              void  *context);
 int lookup (NODE *base, INDEX state,
             char  *key,
             /* Out */
             INDEX values);

/* Now here come the fast internal procedures which the above eventually call */
int dawg_check(NODE *base, INDEX state, char *word);
  /* Standard dawg */
int pack_check(NODE *base, INDEX state, char *word);
  /* Overlapped indexed dawg */
int bsd_check(NODE *base, INDEX state, char *word);
  /* Space-optimised dawg (called bsd because Keith Bostic is working
                   on this format for the next bsd4.4 spelling checker) */
/* Must make the internl procs agree with these... save the macros for later... 
#define dict_check(word,dict) \
          (dict != NULL ? (dict->check(dict->dawg, ROOTNODE, word)) : -1)
*/
int dawg_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
int pack_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
int bsd_walk(NODE *base, INDEX state,
              int    apply(char *word, void *context),
              void  *context);
/*
#define dict_walk(dict, apply, context) \
          (dict != NULL ? (dict->walk(dict->dawg, ROOTNODE, apply, context)) : -1)
*/

int dawg_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
int pack_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
int bsd_lookup(NODE *base, INDEX state, char  *key, /* Out */ INDEX values);
/* etc... */

/* Get list of possible next characters from this point in the tree,
   put it into user-supplied array branch[256].  Filled with
   NULL if no branch, or an INDEX if it points to more of the tree.
   Words which may terminate on a particular letter have terminate[let] != 0

   The user-supplied terminate array is deliberately a long array to allow
   for expansion; it may be used one day to return arbitrary codes such as
  'Noun' etc...

 */

void dawg_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void pack_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
void bsd_branch(NODE *base, INDEX state, NODE *branch, long *terminate);
/* No internal proc yet... add it in the morning. */

/* In fact there will be even simpler procedures which can be used by
   people who prefer to include the whole source of this package rather
   than just the library interface parts.  In applications where utmost
   speed is a neccessity (eg Scrabble(tm)) you could use these, or write
   your own.  But most of us can use this clean interface with unnoticable
   overhead. */
SHAR_EOF
cat - << \SHAR_EOF > grope.h
#ifndef grope_h
#define grope_h 1

#ifdef TESTING_DEFS
/*###################################################################*/
/*
 * This is an exercise to assist in writing portable programs.
 *
 * To use as a header file, eg "grope.h", just leave this file
 * as it is.  To use it to define new entries, rename it as
 * "grope.c" and compile it as "cc -DTESTING_DEFS grope".
 *
 * To customise this test program for your system, I have set it up
 * so that all features are enabled.  If you find that one feature
 * causes a compile-time error, test a suitable set of '#ifdef's for
 * your machine and #undef the values below which are not appropriate.
 *
 * Please do your best to see that your tests are unique, and that
 * you do not undo any tests already implemented.
 *
 * One last request; PLEASE PLEASE PLEASE send me any updates you
 * may make. If you cannot get through to me on email, send me a
 * paper copy (or even better, a floppy copy :-)) to Grahan Toal,
 * c/o 6 Pypers wynd, Prestonpans, East Lothian, Scotland EH32 9AH.
 *
 *   Graham Toal <gtoal at uk.ac.ed>
 * [PS: I can read DOS and RISC OS floppies only]
 */
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
#define KNOWS_LINE 1
#define KNOWS_REGISTER 1
#endif /* TESTING_DEFS */
/*###################################################################*/
#define SYS_ANY 1
#define COMPILER_ANY 1
  /* Don't yet know what machine it is.  Add a test once identified. */
/*--------------------------*/
#ifdef MSDOS
#define SYS_MSDOS 1
#endif
/*--------------------------*/
#ifdef __STDC__
#define STDLIBS 1
#define PROTOTYPES 1
#define KNOWS_VOID 1
  /* If the compiler defines STDC it should have these. We can undef
     them later if the compiler was telling lies! */
#endif
/*--------------------------*/
#ifdef MPU8086
#define SYS_MSDOS 1
  /* Aztec does not #define MSDOS.
     We assume it is Aztec on MSDOS from the MPU8086.
   */
#ifdef __STDC__
#define COMPILER_AZTEC 1
  /* Aztec is known to also define __STDC__ (illegally).  Therefore if
     __STDC__ is not defined, it is probably not Aztec */
#endif
#endif

#ifdef SYS_MSDOS
/*--------------------------*/
#ifdef __STDC__
/*----------------------*/
#define COMPILER_MICROSOFT 1
  /* I assume that the combination of #define MSDOS and #define __STDC__
     without any other collaboration means MICROSOFT.  (Who, incidentally,
     are being naughty by declaring __STDC__) */
#define KNOWS_LINE 1
#else
/*----------------------*/
#ifdef __ZTC__
/*------------------*/
#define COMPILER_ZORTECH 1
#undef STDLIBS
  /* Another system without locale.h */
#define PROTOTYPES 1
#else
/*------------------*/
/* A non-std msdos compiler */
#undef STDLIBS
#undef PROTOTYPES
/*------------------*/
#endif
/*----------------------*/
#endif
/*--------------------------*/
#endif
#ifdef TURBOC
  /* Although Turbo C correctly does not define __STDC__, it has SOME
     standard libraries (but not all - locale.h?) and accepts prototypes. */
#undef STDLIBS
#define PROTOTYPES 1
#define SYS_MSDOS 1
#define COMPILER_TURBOC 1
  /* TURBOC is defined, but has no value.  This allows it to be tested
     with #if as well as #ifdef. */
#endif
/*--------------------------*/
#ifdef COMPILER_MICROSOFT
#undef STDLIBS
  /* Again, like Turbo C, does not know locale.h */
#define PROTOTYPES 1
#endif
/*--------------------------*/
#ifdef SYS_MSDOS
#define MEMMODELS 1
  /* We assume ALL MSDOS machines use memory models */
#endif
/*--------------------------*/
#ifdef UX63
#undef STDLIBS
#undef PROTOTYPES
#define SYS_UX63 1
#define COMPILER_PCC 1
/*#define LIB_STRINGS 1 - apparently not... */
#endif
/*--------------------------*/
#ifdef sun
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SUN 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef THINK_C
#define SYS_MAC 1
#define COMPILER_THINKC 1
#define KNOWS_VOID 1
#endif
/*--------------------------*/
#ifdef sparc
#undef STDLIBS
#undef PROTOTYPES
#define SYS_SPARC 1
#define COMPILER_PCC 1
#endif
/*--------------------------*/
#ifdef ARM
#define SYS_RISCOS 1
  /* I fear Acorn may define 'ARM' on both unix and risc os versions.
     Should find out whether they define others as well, to differentiate. */
#endif
#ifdef __ARM__
#define SYS_RISCOS 1
  /* I fear Acorn may define 'ARM' on both unix and risc os versions.
     Should find out whether they define others as well, to differentiate. */
#endif
/*--------------------------*/
#ifdef SYS_RISCOS
#define COMPILER_NORCROFT 1
#define KNOWS_REGISTER 1
#define KNOWS_LINE 1
#endif
/*--------------------------*/
#ifdef vms
#define SYS_VMS 1
#endif
/*--------------------------*/

/*--------------------------*/
#ifdef SYS_UX63
#undef SYS_ANY
#endif
#ifdef SYS_ARM
#undef SYS_ANY
#endif
#ifdef SYS_MSDOS
#undef SYS_ANY
#endif
#ifdef SYS_SUN
#undef SYS_ANY
#endif
#ifdef SYS_SPARC
#undef SYS_ANY
#endif
#ifdef SYS_RISCOS
#undef SYS_ANY
#endif
#ifdef SYS_MAC
#undef SYS_ANY
#endif
#ifdef SYS_VMS
#undef SYS_ANY
#endif
/*--------------------------*/
#ifdef COMPILER_PCC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_MICROSOFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_TURBOC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_ZORTECH
#undef COMPILER_ANY
#endif
#ifdef COMPILER_AZTEC
#undef COMPILER_ANY
#endif
#ifdef COMPILER_NORCROFT
#undef COMPILER_ANY
#endif
#ifdef COMPILER_THINKC
#undef COMPILER_ANY
#endif
/*--------------------------*/
/* ##################################################################### */
#ifdef TESTING_DEFS
#include <stdio.h>
/* ======================================================================= */
#ifdef STDLIBS
              /* If any of these fail, make sure STDLIBS is not
                 defined for your machine. */
#include <stdlib.h>      /* STDLIBS should not be defined. */
#include <time.h>        /* STDLIBS should not be defined. */
#include <locale.h>      /* STDLIBS should not be defined. */
#endif
/* ======================================================================= */
#ifdef KNOWS_VOID
void test() {            /* KNOWS_VOID should not be defined */
  /* Make sure your ifdef's don't #define KNOWS_VOID if this fails */
}
#endif
/* ======================================================================= */
#ifdef KNOWS_REGISTER
int regtest() {
register int i = 0;          /* KNOWS_REGISTER should not be defined */
  /* Make sure your ifdef's don't #define KNOWS_REGISTER if this fails */
  return(i);
}
#endif
/* ======================================================================= */
#ifdef PROTOTYPES
int main(int argc, char **argv)   /* PROTOTYPES should not be defined */
/* If this fails, make sure PROTOTYPES is not defined for your machine. */
#else
int main(argc,argv)
int argc;
char **argv;
#endif
{
/*-------------------------------------------------------------------------*/
  printf("We should know just what the machine is, or 'SYS_ANY':\n");
#ifdef SYS_UX63
  printf("#define SYS_UX63 %d\n", SYS_UX63);
#endif
#ifdef SYS_ARM
  printf("#define SYS_ARM %d\n", SYS_ARM);
#endif
#ifdef SYS_MSDOS
  printf("#define SYS_MSDOS %d\n", SYS_MSDOS);
#endif
#ifdef SYS_SUN
  printf("#define SYS_SUN %d\n", SYS_SUN);
#endif
#ifdef SYS_SPARC
  printf("#define SYS_SPARC %d\n", SYS_SPARC);
#endif
#ifdef SYS_RISCOS
  printf("#define SYS_RISCOS %d\n", SYS_RISCOS);
#endif
#ifdef SYS_MAC
  printf("#define SYS_MAC %d\n", SYS_MAC);
#endif
#ifdef SYS_VMS
  printf("#define SYS_VMS %d\n", SYS_VMS);
#endif
#ifdef SYS_ANY
  printf("#define SYS_ANY %d\n", SYS_ANY);
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine has different memory models or not:\n");
#ifdef MEMMODELS
  printf("#define MEMMODELS %d\n", MEMMODELS);
#else
  printf("#undef MEMMODELS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("One compiler name should be given, or 'COMPILER_ANY':\n");
#ifdef COMPILER_PCC
  printf("#define COMPILER_PCC %d\n", COMPILER_PCC);
#endif
#ifdef COMPILER_MICROSOFT
  printf("#define COMPILER_MICROSOFT %d\n", COMPILER_MICROSOFT);
#endif
#ifdef COMPILER_TURBOC
  printf("#define COMPILER_TURBOC %d\n", COMPILER_TURBOC);
#endif
#ifdef COMPILER_ZORTECH
  printf("#define COMPILER_ZORTECH %d\n", COMPILER_ZORTECH);
#endif
#ifdef COMPILER_AZTEC
  printf("#define COMPILER_AZTEC %d\n", COMPILER_AZTEC);
  /* Can exist on other machines as well as MSDOS */
#endif
#ifdef COMPILER_LATTICE
  /* Currently coming through as 'COMPILER_ANY' - haven't found one to test */
  /* Can exist on other machines as well as MSDOS. Meanwhile pass it in     */
  /* on the command_line?                                                   */
  printf("#define COMPILER_LATTICE %d\n", COMPILER_LATTICE);
#endif
#ifdef COMPILER_GCC
  /* Ditto. Test in appropriate places for each machine. */
  printf("#define COMPILER_GCC %d\n", COMPILER_GCC);
#endif
#ifdef COMPILER_NORCROFT
  printf("#define COMPILER_NORCROFT %d\n", COMPILER_NORCROFT);
#endif
#ifdef COMPILER_THINKC
  printf("#define COMPILER_THINKC %d\n", COMPILER_THINKC);
#endif
#ifdef COMPILER_ANY
  printf("#define COMPILER_ANY %d\n", COMPILER_ANY);
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the compiler accepts ANSI-like libraries or not:\n");
#ifdef STDLIBS
  printf("#define STDLIBS %d\n",STDLIBS);
#else
  printf("#undef STDLIBS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts ANSI prototypes or not:\n");
#ifdef PROTOTYPES
  printf("#define PROTOTYPES %d\n", PROTOTYPES);
#else
  printf("#undef PROTOTYPES\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine needs nonstandard #include <strings.h> or not:\n");
#ifdef LIB_STRINGS
  printf("#define LIB_STRINGS %d\n", LIB_STRINGS);
#else
  printf("#undef LIB_STRINGS\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts the 'void' keyword or not:\n");
#ifdef KNOWS_VOID
  printf("#define KNOWS_VOID %d\n", KNOWS_VOID);
#else
  printf("#undef KNOWS_VOID\n");
#endif
  printf("Either the machine accepts the 'register' keyword or not:\n");
/*-------------------------------------------------------------------------*/
  printf("Either the compiler accepts the '__LINE__' directive or not:\n");
#ifdef KNOWS_LINE
  printf("#define KNOWS_LINE %d\n", KNOWS_LINE);
#else
  printf("#undef KNOWS_LINE\n");
#endif
/*-------------------------------------------------------------------------*/
  printf("Either the machine accepts the 'register' keyword or not:\n");
#ifdef KNOWS_REGISTER
  printf("#define KNOWS_REGISTER %d\n", KNOWS_REGISTER);
#else
  printf("#undef KNOWS_REGISTER\n");
#endif
  /* Note - this is intended to be used only if your compiler accepts
     'register' *AND* compiles code with it correctly!  Some compilers
     show up bugs when register optimisations are attempted! */
/*-------------------------------------------------------------------------*/
  if (argv[argc] != 0) printf("Warning! argv[argc] != NULL !!! (Non ansii)\n");
}
#endif /* TESTING_DEFS */
#endif /* Already seen? */

SHAR_EOF