Rule-Based Thai Transliteration


31 replies to this topic

#26 bytebuster

bytebuster

    Advanced Member

  • Members
  • PipPipPip
  • 52 posts

Posted 2012-01-16 06:06:53

Thank you, Christian. It's always great to hear feedback.
Many things are still incomplete, however, so it's a good idea not to rely on a single service alone.

#27 Richard W

Richard W

    Platinum Member

  • Advanced Members
  • PipPipPipPipPipPip
  • 2,086 posts

Posted 2012-01-22 06:24:55

bytebuster, on 2011-12-28 04:13:43, said:

Added several transcription plugins, improved dictionary of exception words, improved visual appearance, and fixed several bugs.

There are still a few simple bugs left:
พรหมพร is transliterated as pʰrom - pʰɔn; it should be pʰrom - pʰɔːn.
เงิน is transliterated as ŋɤʔn; it should be ŋɤn.

On the exception front, you still haven't entered the data so that สำเนียง is transliterated not as sɑ̌m - nǐːaŋ  but as sɑ̌m - niːaŋ.  Do you have hooks for exception lists so that ผลิกะ and ปลัด will not be mistransliterated as pʰlì - kàʔ and plàt but correctly transliterated as pʰà - lí - kàʔ and pà - làt?  At present, the correct syllabification is apparently not even considered.
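
To make the question about hooks concrete, here is a minimal sketch of an exception list that is consulted before any rules run. It is written in Python purely for illustration; the names `EXCEPTIONS` and `transliterate` and the `rule_based` callback are hypothetical and say nothing about how the tool is actually built. The readings are the ones given above.

```python
from typing import Callable, Dict

# Hypothetical exception list: whole-word readings quoted in this post.
EXCEPTIONS: Dict[str, str] = {
    "ผลิกะ": "pʰà - lí - kàʔ",
    "ปลัด": "pà - làt",
    "สำเนียง": "sɑ̌m - niːaŋ",
}


def transliterate(
    word: str,
    rule_based: Callable[[str], str],
    exceptions: Dict[str, str] = EXCEPTIONS,
) -> str:
    """Return the listed reading for an exception word, else apply the rules.

    `rule_based` stands in for the real rule engine (an assumption).
    """
    if word in exceptions:
        return exceptions[word]
    return rule_based(word)
```

With a hook of this kind the rule engine never sees the listed words, so their syllabification does not have to be derivable from the rules at all.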

You are still using the word 'proof' where 'explanation' is a far better word.  It is not possible to prove that สระ 'pool' is pronounced sà - rà when it is actually pronounced sà.  (At present there is no indication that the possibility has been rejected.)

#28 bytebuster

Posted 2012-01-22 20:41:05

Richard W, on 2012-01-22 06:24:55, said:

There are still a few simple bugs left:

Thank you very much Richard.

In the next version I will change some of the logic for glottal stops so that they no longer appear in improper places.
There are, indeed, a lot of exception words that still come out wrong. I'm thinking of adding them from a large vocabulary, but that takes time.
I've been looking into the Lexitron database, but primarily to gather essential statistics on consonant reduplication. My next goal is to try implementing some logic that would "guess" possible reduplication.

Yes, clusters are treated as single units (unless a blocking rule applies, e.g. "เ"). So decompositions in which they are separate consonants are indeed not considered at all. Such words are exceptions under my approach.
There is a good reason for that. My code uses a backtracking algorithm: each ambiguity creates a tree of possibilities, and each branch is processed independently until the very end of the input text (or until a blocking rule is matched). For instance, the full name of กรุงเทพฯ gives some 14 thousand grammatically possible readings. Not all of them are processed, thanks to lazy evaluation (I use F#), but there are still too many. Cluster optimization is one of the many optimizations I had to add to reach reasonable performance.
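
As a rough illustration of the idea (not the actual code, which is in F#), here is a small Python sketch of lazy backtracking over an ambiguous rule table. The table `RULES`, the function `readings` and the `prefer_clusters` switch are all made up for this example, and Latin letters stand in for Thai spelling units.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy rule table: spelling unit -> possible readings.
RULES: Dict[str, List[str]] = {
    "a": ["A"],
    "b": ["B"],
    "ab": ["AB"],        # a "cluster" read as one unit
    "c": ["C1", "C2"],   # an ambiguous unit with two readings
}


def readings(
    text: str, start: int = 0, prefer_clusters: bool = False
) -> Iterator[Tuple[str, ...]]:
    """Lazily yield every reading of text[start:] that RULES allows.

    Each ambiguity opens a new branch; a branch that reaches a position
    where nothing matches yields nothing, which is the backtracking step.
    """
    if start == len(text):
        yield ()
        return
    matches = [unit for unit in RULES if text.startswith(unit, start)]
    if prefer_clusters and matches:
        # Cluster optimization: keep only the longest matching units,
        # so "ab" is never split into "a" + "b".
        longest = max(len(unit) for unit in matches)
        matches = [unit for unit in matches if len(unit) == longest]
    for unit in matches:
        for option in RULES[unit]:
            for rest in readings(text, start + len(unit), prefer_clusters):
                yield (option,) + rest
```

Because `readings` is a generator, `next(readings("abc"))` explores only the first branch, and `prefer_clusters=True` halves the number of readings for `"abc"` by never splitting the cluster.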

As for "proof", I don't claim it is the best word, but I don't think "explanation" is the best term either. Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits the term "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?

#29 Richard W

Posted 2012-02-20 03:52:11

Another clutch of words that went wrong: เมตร เพชร จักร บุตร are all monosyllabic with short vowels and tones to match.  The tool showed them as disyllabic.  tɕàk - rá for จักร is clearly wrong - as clause final, it would require the spelling จักระ to have that pronunciation.  (As the non-final element of Indic compounds, จักร- is enunciated tɕàk - krà -).  The vowel lengths for เมตร and เพชร definitely need to be marked as exceptions.

#30 Richard W

Posted 2012-02-20 03:58:42

bytebuster, on 2012-01-22 20:41:05, said:

Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits the term "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?

Being able to hide parts of the explanation ('derivation' is a suitable word in this context) that are currently not of interest is useful, but there is a problem with the display technology. If one navigates away from the screen (at least, backwards), the tree has collapsed when one returns. This infuriates me.

#31 bytebuster

Posted 2012-02-23 01:33:16

Richard W, on 2012-02-20 03:52:11, said:

As the non-final element of Indic compounds, จักร- is enunciated tɕàk - krà -
The vowel lengths for เมตร and เพชร definitely need to be marked as exceptions.

I greatly appreciate your feedback. I will add those exceptions.
Did you mean falling tone for เมตร, not the vowel length?
As for จักร: yes, the rule insists that ไม้หันอากาศ is always followed by a final.
However, I think it is rather a reduplication issue. The biggest problem with reduplication is that it usually occurs in the middle of a polysyllabic word and does not occur at the end (วัฏจักร vs จักรภพ). Boundaries of meaningful words can be detected only by a dictionary-based tool, of course.
I cannot think of any effective algorithm, and I would be obliged if someone gave me an idea of how to overcome this.

Richard W, on 2012-02-20 03:58:42, said:

If one navigates away from the screen (at least, backwards), the tree has collapsed when one returns.

Yes, it's a common problem with stateless controls. I will try to find a workaround, and I'm also thinking of simplifying the tree (it is highly redundant).

#32 Richard W

Posted 2012-02-23 07:06:39

bytebuster, on 2012-02-23 01:33:16, said:

Did you mean falling tone for เมตร, not the vowel length?

Both เพชร and เมตร have short vowels and high tone. If you mark the vowel as short in your exception dictionaries, you will get the high tone automatically if you use the vowel length after exception look-up to determine tone. For example, would you take mai tho on a dead syllable with low class initial as indicating a long vowel if the vowel length is otherwise unmarked?
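
A minimal Python sketch of what I mean, assuming the standard tone rules and a hypothetical exception dictionary that stores only the vowel length. The names `VOWEL_LENGTH_EXCEPTIONS`, `tone` and `tone_after_lookup` are invented for the example, and only mai ek and mai tho are handled.

```python
from typing import Optional

# Hypothetical exception dictionary: spelling -> vowel length overriding the rules.
VOWEL_LENGTH_EXCEPTIONS = {"เมตร": "short", "เพชร": "short"}


def tone(initial_class: str, live: bool, vowel_length: str,
         mark: Optional[str] = None) -> str:
    """Standard Thai tone rules for a single syllable.

    initial_class: "low", "mid" or "high"; vowel_length: "short" or "long";
    mark: None, "ek" or "tho" (other tone marks are left out of this sketch).
    """
    if mark == "ek":
        return "falling" if initial_class == "low" else "low"
    if mark == "tho":
        return "high" if initial_class == "low" else "falling"
    if mark is not None:
        raise ValueError(f"tone mark not covered by this sketch: {mark}")
    if live:
        return "rising" if initial_class == "high" else "mid"
    if initial_class == "low":
        # Dead syllable, low-class initial: the vowel length decides the tone.
        return "high" if vowel_length == "short" else "falling"
    return "low"


def tone_after_lookup(word: str, initial_class: str, live: bool,
                      rule_length: str, mark: Optional[str] = None) -> str:
    """Apply the exception look-up first, then derive the tone from the result."""
    length = VOWEL_LENGTH_EXCEPTIONS.get(word, rule_length)
    return tone(initial_class, live, length, mark)
```

Here `tone_after_lookup("เมตร", "low", False, "long")` gives the high tone, whereas the rule-derived long vowel on its own would give falling, so no separate tone entry is needed in the exception dictionary.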

bytebuster, on 2012-02-23 01:33:16, said:

Boundaries of meaningful words can be detected only by a dictionary-based tool, of course.

It depends on your use case, but there is some benefit from detecting the boundary at the end of the input!
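
A tiny Python sketch of that one cheap check, using only the two readings of จักร mentioned in this thread. The function name `read_stem` is made up, and treating every non-final occurrence as the first element of a compound is an assumption that a real tool would need a dictionary to confirm.

```python
from typing import List

STEM = "จักร"


def read_stem(text: str) -> List[str]:
    """Return one reading per occurrence of STEM in text.

    The end of the input is the only word boundary known without a
    dictionary, so an occurrence that ends there gets the final reading.
    """
    found: List[str] = []
    pos = text.find(STEM)
    while pos != -1:
        end = pos + len(STEM)
        if end == len(text):
            found.append("tɕàk")        # word-final: monosyllabic
        else:
            # Assumption: something follows, so treat it as a compound element.
            found.append("tɕàk - krà")
        pos = text.find(STEM, end)
    return found
```

For input consisting of a single word this already separates วัฏจักร (final reading) from จักรภพ (reduplicated reading); it does nothing for boundaries in the middle of running text.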
