Wednesday, August 24, 2011

Lost in Transliteration


A transmogrifier is a device that a person enters thinking of becoming something and leaves believing they are exactly that. (Invented by Lord Calvin and Sir Hobbes.)

A transliterator is a device that a sound enters in one script and leaves, unchanged, in another script.

I never understood spelling-bee contests: young kids producing words out of infinitely bewildering phonetics. "Trouvaille" (meaning: a finding) is one such word, pronounced "truva", with as many as 4 letters silent. These spelling bees are only gonna get tougher, and three years from now we may have the announcer keeping mum while the poor student has to find the word, because, er.. all the letters are silent. Kudos to the effort and memory power of the students, though.

The whole spelling-bee concept rests on arbitrary and artificial phonetics. If there were a one-to-one correspondence between sound and script, the spelling bee would become a drone bee. If we ruled that English be written phonetically from tomorrow, spelling assignments would disappear. What you say is what you hear is what your mind interprets. WYSIWYHIWYMI.

This reminds me of artificially created problems and equally artificial solutions. A good analogy is how some Java design patterns, like Iterator (where you write elaborate classes), "disappear" in dynamic languages like Groovy. These patterns have been elevated into language features of Groovy, so there is no "problem" to start with!


So how many times in life will you ever use that word again? That thought makes me nostalgic about those TOEFL and GRE vocabularies, where I crammed words to better the neighborhood guy whose veneer impressed my folks, while I stood exasperated. How many of those words am I using now? How many of those words are being used by native English speakers? And I'm told that I should use simple language while writing emails to clients. If I use the words appoggiatura or roscian, the client will simply take the business elsewhere, because I am hard to communicate with. The fact is, for effective communication you just need around 1000 words in any language. The rest are for poets and dictionary makers who just want to intimidate us ordinary folks and keep themselves afloat in business :-)

Sanskritam students often wonder what the best way to type in Sanskritam is. There are a few great tools: Baraha, Google Transliterator etc. There are also a few English transliteration schemes available for the devanagari script. Which one to choose? I feel each of them has its own quirks. Let's examine some of those:

Harvard-Kyoto is mostly used in academic circles. The most jarring usage in this scheme is that of uppercase letters for anunasikas (G/ङ्, J/ञ्) and z for the fricative श्. Try as I might, I could not pronounce z as श् while reading. Similarly, the Tamil zh/ழ doesn't make any sense, but we just adapted to it.

IAST is more confusing, with multiple Latin letters for the same devanagari akshara (k, K -> क, kh, Kh -> ख). Do you ever substitute one letter for another in English? Then why such karakter overdoz for transliterashuns?

ITRANS is even crazier. Doubled letters or uppercase (aa vs A), the exponent symbol (R^i), doubled uppercase followed by lowercase (LLi), tilde before, power after, dot before, dot after: it's hard to write in ITRANS unless you are already a mathematical genius. It could have also used integral and sum symbols, so one could learn advanced calculus simultaneously while writing saMskR^itam.

The other side of the text is parsing it into comprehensible data structures. Writing parse routines for such schemes involves quite a bit of effort. So the question is, can there be a simpler, consistent scheme, with fewer things to remember while typing?

Here is one such minimalistic transliteration scheme and it uses only 4 main rules and 3 corollaries:

Rules

1. lowercase
2. uppercase
3. possible dot after a letter
4. semi-colon for visarga; jihvamuliya; upadhmaaniya

Corollaries

1. uppercase for dIrgha
2. krama -> lowercase; lowercase+dot; uppercase; uppercase+dot
3. 1:1 mapping in the transliteration

a A e E u U r. R. l. E. I O O.

k k. g g. n
c c. j j. n.
t t. d d. N
T T. D D. N.
p p. b b. m
y r l v s s. S h
M :
a. (avasarga)

The principles behind the scheme are:
  1. A devanagari letter is either ONE char or ONE char + dot.
  2. The scheme is almost consistent (lowercase, lowercase+dot, uppercase, uppercase+dot, uppercase for dIrgha svara). This is why we get an orderly (n, n., N, N.).

Parsing or tokenizing such a scheme is also easy. There is only one if condition to determine an akshara: if a dot follows a letter, the two form one akshara; if not, that letter alone is the akshara.
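The single-condition rule can be sketched in a few lines. This is a minimal Python illustration, not the original implementation; the function name `tokenize` is made up for the example, and it assumes the input contains only scheme characters (no spaces or punctuation).

```python
def tokenize(text):
    """Split scheme text into aksharas: one letter plus an optional trailing dot."""
    tokens = []
    i = 0
    while i < len(text):
        # The only condition: if a dot follows the letter, the two form one token.
        if i + 1 < len(text) and text[i + 1] == ".":
            tokens.append(text[i:i + 2])
            i += 2  # skip the letter and its dot
        else:
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("N.ArAyaNa"))
# ['N.', 'A', 'r', 'A', 'y', 'a', 'N', 'a']
```

Because a token is never longer than two characters, no lookahead beyond the next character is needed.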

In other words, converting a sentence into unicode is just a one-liner per akshara, assuming the unicodeMap[:] is defined already.

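// Groovy: look up the akshara at index i (letter + dot if a dot follows, else the letter alone); the caller appends the result and skips the dot.
def toUnicode = { String s, int i -> unicodeMap[(i + 1 < s.length() && s[i + 1] == '.') ? s.substring(i, i + 2) : s.substring(i, i + 1)] }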
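A slightly fuller sketch, in Python rather than Groovy, shows the lookup applied to a whole word. Everything named here (`to_devanagari`, `CONSONANTS`, `VOWELS`) is hypothetical, and the tiny maps cover only the letters needed for N.ArAyaNa, based on my reading of the table above (N. as न, N as ण). One assumption goes beyond the one-line lookup: since a devanagari consonant carries an inherent "a", the sketch emits a vowel sign after a consonant and a virama when no vowel follows.

```python
import re

VIRAMA = "\u094D"  # suppresses the inherent 'a' of a consonant

# Hypothetical partial maps; a real unicodeMap would cover the whole scheme.
CONSONANTS = {"N.": "न", "N": "ण", "r": "र", "y": "य"}
# vowel token -> (independent letter, dependent sign used after a consonant)
VOWELS = {"a": ("अ", ""), "A": ("आ", "\u093E")}


def to_devanagari(text):
    """Convert scheme text to devanagari; raises KeyError on unmapped tokens."""
    out = []
    bare_consonant = False  # True while the last consonant has no vowel yet
    # One character plus an optional dot is one akshara token.
    for token in re.findall(r".\.?", text):
        if token in CONSONANTS:
            if bare_consonant:
                out.append(VIRAMA)  # consonant cluster: kill the inherent 'a'
            out.append(CONSONANTS[token])
            bare_consonant = True
        else:
            independent, sign = VOWELS[token]
            out.append(sign if bare_consonant else independent)
            bare_consonant = False
    if bare_consonant:
        out.append(VIRAMA)  # word ends in a bare consonant
    return "".join(out)


print(to_devanagari("N.ArAyaNa"))
# नारायण
```

The tokenizing step stays a single expression; the extra lines exist only because devanagari writes a vowel differently after a consonant than on its own.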

Q & A


Q. Why favor the 1:1 mapping? After all isn't it good to have options and easy to write in different ways?
A. It sure is, but do you really substitute one letter for another letter in English or Sanskritam? Can you say "sha" is close enough to "sa" and use it for "savam"? A 1:1 mapping allows no options and eventually it gets streamlined in a person's thought process.

Q. viDE.he -> isn't this harder to read than vaidehi?
A. Yes it is, no doubt. But how did you learn to pronounce it as "vaidehi" with "d" as द् while the most common pronunciation of d is ड्? It will be hard to read at first, but didn't we all start by guessing what a dot, dash, tilde or apostrophe means on top of and below the letters :-). If one is already used to a scheme, the mind resists a change.

Q. So what's really easy about this scheme?
A. If you are well versed in the devanagari krama, it is probably easier to apply the transliteration. Also, the dot is the only extra character to remember.

Q. What's really difficult, then?
A. Reading will be unnatural for those who are used to a different scheme (e.g. nAraayaNa vs N.ArAyaNa). We are so used to N being murdha nakara! But even the existing schemes were originally unnatural, so the unnaturalness just shifts to a different section.


(Note: This article is not meant to challenge existing schemes, but to show that a minimalistic scheme, i.e. a scheme with minimum fuss, is possible from a parsing perspective. In any art form, prevalence rules over theories.)

1 comment:

Viswanath said...

I share your opinion on English and on how the spelling bees are doing things that are no good for them. I also used to think that with Indian languages (Telugu and Sanskrit to be precise, since I don't know others), this issue wouldn't be there. You pronounce what you read. That is, until recently.

I have since learnt a few 'rules' of how things are pronounced in Vedic Sanskrit that have no clear indication in what's written.