3b1 Expansion Box with RS-232 boards.

Alvin Henry White Alvin at cup.portal.com
Sun Feb 17 17:34:32 AEST 1991


Thinking about projects.

I have been turning these ideas over in my mind.  It seems that all the
technologies to accomplish this are in existence.  The problem seems to
be that each of the technologies is born of a different specialty, and
thus what is lacking is the technology to bring them together.

What I would like to accomplish.

I have many Bible audio tapes in many languages. I also have some classic
stories in several languages, including the Chinese classics of Confucius,
Lao Tze, the I-Jing or I Ching, and an old Indian song called the
Bhagavad Gita. The latter is from India, not American Indian.

I also have the written text.  I want to have the computer show the
words like the old bouncing ball sing-a-long, and at the same time
I want the computer to put out the sound of the words. I want the text
to be in what is approximately called "interlinear transliteration":
word for word, with the translated clue word starting right under
the word in the original tongue, language 1, or master language.

The tree  is green.
El  arbol es verde.

El arbol verde.
Th tree  green.
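
To make the alignment idea concrete, here is a rough sketch in Python of how
the clue line could be laid out so that each translated word starts in the
same column as its counterpart.  The word pairing is just an example I made
up; a real program would pull the pairs from the interlinear text database.

# Sketch: print two interlinear lines so that each clue word (language 2)
# starts in the same column as its counterpart in the master language.
# The word pairing below is a hypothetical example, not real program data.

def interlinear(pairs):
    """pairs: list of (master_word, clue_word) tuples, aligned one-to-one."""
    top, bottom = [], []
    for master, clue in pairs:
        width = max(len(master), len(clue)) + 1   # pad to the wider word
        top.append(master.ljust(width))
        bottom.append(clue.ljust(width))
    return "".join(top).rstrip(), "".join(bottom).rstrip()

if __name__ == "__main__":
    line1, line2 = interlinear([("El", "The"), ("arbol", "tree"), ("verde.", "green.")])
    print(line1)    # El  arbol verde.
    print(line2)    # The tree  green.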

I want the computer-synthesized speech in stereo: one language in each
ear, with the bouncing ball or highlighted phoneme showing what phoneme
is sounding.  It seems the Unix PC could do this if two voice power cards
could be made to operate at the same time in sync.
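
As a small illustration of the one-language-per-ear idea, a sketch like the
following could interleave two mono tracks into one stereo file, language 1
on the left and language 2 on the right.  The 16-bit sample format and the
file names are assumptions of mine, not anything the voice power cards
actually produce.

# Sketch: put language 1 in the left ear and language 2 in the right by
# interleaving two mono tracks into one stereo file.  16-bit mono inputs
# at a common sample rate are assumed for illustration.
import wave
import numpy as np

def merge_stereo(left_path, right_path, out_path):
    def read_mono(path):
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        return rate, data

    rate, left = read_mono(left_path)
    _, right = read_mono(right_path)
    n = max(len(left), len(right))
    stereo = np.zeros((n, 2), dtype=np.int16)      # pad the shorter track
    stereo[:len(left), 0] = left
    stereo[:len(right), 1] = right
    with wave.open(out_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(stereo.tobytes())

# merge_stereo("english_line.wav", "chinese_line.wav", "both_ears.wav")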

For the songs you would need from one to four more voice power cards to 
play the music. Thus for Rock of Ages you could have 6 voice power cards.

To render the art form as four voices, Father, Mother, Son, and Daughter,
accompanied by an organ recital of four-note polyphony, it would take at
least four cards for the English parts and four cards for the Chinese parts,
and you might be able to get away with somehow sharing the four musical
tones between both languages. In other words, a total of 12
voice power cards operating in sync.

Now if the graphics were treated equally in both directions you would
need to add two more lines to each line to be treated interlinearly.
That would be one line of Chinese Kanji indicating in Kanji the sound of
the English line, and one line of Kanji showing the Chinese text.

The English line accompanying this would be what is known as a phonetic
transliteration.  In other words, an English singer singing that string
of alpha characters would make sounds that a Chinese-speaking person
listening on the other end of the telephone line could interpret in the
Chinese language well enough to take down an order of egg rolls and
record your Visa card number.

So here we need at least 12 AT&T voice power cards running in sync with
the multilingual multimedia graphics on the AT&T Unix PC.

There are programs called approximately "phase vocoders" that can change
the tempo, time, or speed of an audio sample without changing the pitch, but
it is said that these require a lot of processing power.  The subject of
phase vocoders receives considerable consideration in a recent book by
F. Richard Moore of the University of California at San Diego. The book
is Elements of Computer Music.  I get the idea that F. Richard Moore is the
author of a package occasionally referred to as "CMUSIC" in the computer
music section of Usenet.
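
For what it is worth, the core idea of a phase vocoder time stretch can be
sketched in a page of Python using NumPy.  This is only an illustration of
the principle, analysis frames, phase accumulation, and resynthesis at a
different hop, not the treatment in Moore's book, and the frame and hop
sizes are arbitrary choices of mine.

# Sketch of phase-vocoder time stretching: play a mono signal slower or
# faster without changing its pitch.  Input is a NumPy float array; the
# frame size, hop lengths, and Hann window are illustrative choices.
import numpy as np

def time_stretch(x, rate, frame=1024, hop_out=256):
    """Return x with its length scaled by 1/rate (rate > 1.0 speeds it up)."""
    hop_in = int(round(hop_out * rate))              # analysis hop
    window = np.hanning(frame)
    bins = frame // 2 + 1
    expected = 2.0 * np.pi * hop_in * np.arange(bins) / frame
    phase = np.zeros(bins)
    prev = np.zeros(bins)
    n_frames = max(0, (len(x) - frame) // hop_in)
    out = np.zeros(n_frames * hop_out + frame)

    for i in range(n_frames):
        seg = x[i * hop_in : i * hop_in + frame]
        spec = np.fft.rfft(window * seg)
        mag, ang = np.abs(spec), np.angle(spec)
        # keep each bin's frame-to-frame phase advance consistent at the new hop
        delta = ang - prev - expected
        delta -= 2.0 * np.pi * np.round(delta / (2.0 * np.pi))
        phase += (expected + delta) * (hop_out / hop_in)
        prev = ang
        out_frame = np.fft.irfft(mag * np.exp(1j * phase), frame)
        out[i * hop_out : i * hop_out + frame] += window * out_frame
    # overlap-add gain normalization is omitted to keep the sketch short
    return out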

I was reading
my local papers for the last few days, San Jose, California, and there
were stories about 64-meg memory chips around 1994 and, soon now, 100-megahertz
486 chips.  Also not far away, we are told, is an increase in parallel
processing.  Thus I don't feel it unreasonable to think that what I have
suggested will be within reach of the home market, church market, and school
market in four years.  The synchronization problems will take that long to
work out anyway.

Thus I want the option, if one word takes a different length of time
in relation to its counterpart, to be able to speed up or slow down the
tempo of the second language clue word without changing its pitch.

I have bought much of the computerized text-to-speech hardware and software
available over the years.  One of the problems is that they do not speak the
English language as I want it spoken.

Now there are speech recognition programs coming available on the market.
One of the most notable is Dragon Systems' Dragon Dictate, which can
reportedly recognize 30,000 English words. It is said to have an artificial
intelligence capability to learn the particular speech of an individual.
It is not sold with a speech synthesizer.  Here you see the specialization
I mentioned above.  They see themselves as being in the speech recognition
business and having nothing to do with the speech synthesis business.  People
in the speech synthesis business generally have nothing to do with speech
recognition.  Of course it is rare that one or the other knows their own
discipline and also how to incorporate artificial intelligence learning.

I think I can understand that each of these disciplines takes a lifetime
to learn, and that even one accomplishment is a miracle, let alone multiple.
But we the people need the combined abilities of all to help save our lives
and the lives of our loved ones. So while we are asking a lot, it is for
a good cause.

To get the two groups to sing four-part harmony simultaneously, a
considerable knowledge of music is necessary.  Here again is a lifetime's
worth of study. But if we are going to hear God's word and the word of
Confucius in a language that we can all understand and provide justice for
all, it needs to be done.

Some local people have a sound card that does sampling and allows visual
editing by viewing the graphic depiction of the waveform on an IBM
compatible.

If I could take the readings of the classic stories, read on audio tape by
what I consider to be good readers for my purpose, and clip each word into a
sound file dictionary database, then as the graphics terminal generates
the sing-a-long it could call up the necessary sound files and either render,
or edit and generate as necessary, the several sound tracks.
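
A minimal sketch of the lookup side, assuming each word has already been
clipped into its own WAV file, and assuming a directory layout and
file-naming scheme (language, chapter, verse, word number) that are entirely
my own invention, might look like this:

# Sketch: look up pre-clipped word recordings by (language, chapter, verse,
# word number) and splice them into one track for the sing-a-long.
# The directory layout and file-name scheme are hypothetical.
import os
import wave

def word_path(root, language, chapter, verse, word_no):
    return os.path.join(root, language,
                        f"{chapter:03d}_{verse:03d}_{word_no:03d}.wav")

def splice(paths, out_path):
    """Concatenate the WAV files in `paths` (same format assumed) into out_path."""
    with wave.open(out_path, "wb") as out:
        params_set = False
        for p in paths:
            with wave.open(p, "rb") as w:
                if not params_set:
                    out.setparams(w.getparams())
                    params_set = True
                out.writeframes(w.readframes(w.getnframes()))

# e.g. the first three words of a verse, read in English:
# splice([word_path("library", "english", 1, 1, n) for n in (1, 2, 3)], "line.wav")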

There is work going on in "Continuous Speech Recognition".  You can guess
what that is.  It is the leading edge of the art of speech recognition.
I have a task that would be simpler for them.  Where they have little clue
what words are coming next, I have the complete text in machine readable
form.  What I want them to do automatically is to determine where one
word ends and the next begins, then to write out to mass storage each word's
sampled sound file into a database and have each record cataloged by
language, chapter, verse, and word.
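
Since the complete text is known in advance, even a crude energy-based
splitter could make a first pass at finding the word boundaries.  The sketch
below splits a recording on stretches of low energy and pairs each chunk
with the next word of the known text.  The threshold, frame size, minimum
silence length, and cataloging fields are arbitrary illustrative values of
mine, nothing like a real recognizer.

# Sketch: split a recorded verse on silences and catalog each chunk against
# the known text, word by word.  `samples` is assumed to be a mono NumPy
# float array scaled to +/-1; the threshold and silence length are arbitrary.
import numpy as np

def split_on_silence(samples, rate, threshold=0.02, min_silence=0.15):
    """Return (start, end) sample indices of the non-silent stretches."""
    frame = int(rate * 0.01)                     # 10 ms frames
    n = len(samples) // frame
    energy = np.array([np.abs(samples[i*frame:(i+1)*frame]).mean()
                       for i in range(n)])
    loud = energy > threshold
    segments, start = [], None
    quiet_run, min_run = 0, int(min_silence / 0.01)
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i * frame
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_run:
                segments.append((start, (i - quiet_run + 1) * frame))
                start, quiet_run = None, 0
    if start is not None:
        segments.append((start, n * frame))
    return segments

def catalog(segments, words, language, chapter, verse):
    """Pair each detected chunk with the next word of the known text."""
    return [{"language": language, "chapter": chapter, "verse": verse,
             "word": w, "span": span}
            for w, span in zip(words, segments)]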

I clipped part of a usenet article that had the words:

	Automatically segmenting speech signals and feature extraction
	using neural nets for phoneme recognition.
	A possible reference would be Professor T. Kohonen using a Fast
	Fourier Transform to preprocess the voice signal of the Finnish
	language.

	    Kohonen T., Self-Organization and Associative Memory.

Another opportunity for artificial intelligence and computer speech analysis
is then to compare all the various ways a text word has
been rendered into audio and possibly define a "most usual" rendering,
which seems to be what the Dragon is doing.

A person then could have various options as to how they wanted to hear the
text read back. I want to hear it produced close to Latin or Spanish, in
that I want each alpha character to have one and only one audio
representation.  I also want to see if I can define what I think should
be the standard speed for machine-read American English.

I would like to be able to have text-to-speech generated such that
"come" and "home" rhyme. Some people might like "cum hum", others could
prefer "comb" "home", and I want to hear "co may" "ho may". Any of you
unix-pc'ers figured out how to make this thing speak Italian?
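
A strictly one-letter-one-sound reading amounts to a fixed lookup table from
each letter to a single phoneme code, applied before the synthesizer ever
sees the word.  The table below is a made-up illustration of the idea; the
phoneme codes are placeholders, not the inventory of any actual
text-to-speech board.

# Sketch: a one-letter-one-sound spelling-to-phoneme pass, so that "come"
# and "home" come out rhyming.  The phoneme codes are invented placeholders.
LETTER_TO_PHONEME = {
    "a": "AH", "b": "B",  "c": "K",  "d": "D",  "e": "EY",
    "f": "F",  "g": "G",  "h": "H",  "i": "IY", "j": "Y",
    "k": "K",  "l": "L",  "m": "M",  "n": "N",  "o": "OH",
    "p": "P",  "q": "K",  "r": "R",  "s": "S",  "t": "T",
    "u": "UW", "v": "V",  "w": "W",  "x": "KS", "y": "IY", "z": "Z",
}

def spell_out(word):
    """Return one phoneme code per letter, ignoring anything non-alphabetic."""
    return [LETTER_TO_PHONEME[ch] for ch in word.lower() if ch in LETTER_TO_PHONEME]

print(spell_out("come"))   # ['K', 'OH', 'M', 'EY']
print(spell_out("home"))   # ['H', 'OH', 'M', 'EY'] -- the two now rhyme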

At some point
one would like to vary the tempo but not the pitch. But let me get back to
the standard speed and standard audio representation of American English.
People who are trying to produce low-cost speech recognition equipment
would have their task greatly simplified, at least, if the speaker could
conform to this "standard American speech and beat."  The beat could be
prompted by the recognition machine; just add a drum machine.

People hereabouts have considerable practice singing to the beat.

Another option I want is to be able to have the machine generate speech
as a teaching tool.

So it will give an example of how it thinks a word should be rendered and
then the student can try to imitate it.  If the operator does not like the
machine version, the operator can give the machine an example and tell
the machine to imitate the operator more closely.

You could take the machine to church and have several people speak into 
the machine the word of God and the machine would come back and tell you
how each person scored.  Anybody heard of "Pastor for a Day?"  Rightly
dividing the word of truth.  Speakometer.
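
One crude way such a Speakometer score could be computed is to compare each
speaker's recording against the reference rendering in the frequency domain.
The sketch below uses a plain spectral magnitude difference after picking
matching frames from each recording, which is only a toy measure of my own,
nothing like what a real recognizer would use.

# Sketch: a toy "Speakometer" score -- smaller means closer to the reference.
# Inputs are mono NumPy float arrays longer than one frame; the frame size
# and the plain magnitude-difference measure are illustrative only.
import numpy as np

def spectrogram(x, frame=512):
    frames = [x[i:i+frame] * np.hanning(frame)
              for i in range(0, len(x) - frame, frame // 2)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def score(reference, attempt):
    ref, att = spectrogram(reference), spectrogram(attempt)
    n = min(len(ref), len(att))
    # crude alignment: pick n evenly spaced frames from each recording
    ref = ref[np.linspace(0, len(ref) - 1, n).astype(int)]
    att = att[np.linspace(0, len(att) - 1, n).astype(int)]
    return float(np.mean(np.abs(ref - att)))

# lower score = closer imitation of the reference reading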

Just think of the opportunity for a public speaking class!  An automated
machine to tell you who is most able to speak properly, while allowing the
operator of the machine to tell it what is supposed to be right.

The machine might be unpopular with the educational authorities, read
"financial administration", unless the machine could be properly adjusted
such that the purchaser would always be shown to be the best speaker.

Well, I better get off of this before we open up the subject of politics.

-alvin
Alvin H. White, Gen. Sect.
G.O.D.S.B.R.A.I.N.
Government Online Database Systems
Bureau for Resource Allocations to Information Networks
 alvin at cup.portal.com 


