Thank you, Christian. It's always great to hear feedback.
Many things are not yet complete, however, so it's a good idea not to rely on a single service alone.
Rule-Based Thai Transliteration
Started by bytebuster, 2011-12-16 20:25
31 replies to this topic
#26 Posted 2012-01-16 06:06:53
Added several transcription plugins, improved dictionary of exception words, improved visual appearance, and fixed several bugs.

#27 Posted 2012-01-22 06:24:55
พรหมพร is transliterated as pʰrom - pʰɔn, should be pʰrom - pʰɔːn.
เงิน is transliterated as ŋɤʔn, should be ŋɤn.

On the exception front, you still haven't entered the data so that สำเนียง is transliterated not as sɑ̌m - nǐːaŋ but as sɑ̌m - niːaŋ.

Do you have hooks for exception lists so that ผลิกะ and ปลัด will not be mistransliterated as pʰlì - kàʔ and plàt but correctly transliterated as pʰà - lí - kàʔ and pà - làt? At present, the correct syllabification is apparently not even considered.

You are still using the word 'proof' where 'explanation' is a far better word. It is not possible to prove that สระ 'pool' is pronounced sà - rà when it is actually pronounced sà. (At present there is no indication that the possibility has been rejected.)

#28 Posted 2012-01-22 20:41:05
There are still a few simple bugs left:

Thank you very much, Richard. In the next version I will change some of the logic for glottal stops, so that they don't appear in improper places.

There are, indeed, a lot of exception words that still work incorrectly. I'm thinking of adding them from a large vocabulary, but that takes time. I've been looking into the Lexitron database, but primarily to gather essential statistics on consonant reduplication. My nearest goal is to try implementing some logic that would "guess" possible reduplication.

Yes, clusters are treated as entire units (unless a blocking rule applies, e.g. "เ"). So decompositions where they are separate consonants are indeed not considered at all. They are exception words for my approach.

There is good logic behind that. My code uses a backtracking algorithm: each ambiguity creates a tree of possibilities, and the branches are then processed independently until the very end of the input text (or until a blocking rule is matched). For instance, the full name of กรุงเทพฯ gives some 14 thousand grammatically possible readings. Thanks to lazy evaluation (I use F#), not all of them are processed, but there are still too many. Cluster optimization is one of many optimizations I had to add to reach fair performance.

As for "proof", I don't claim it is the best word, but I don't think "explanation" is the best term either. Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits the term "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?

#29 Posted 2012-02-20 03:52:11
Another clutch of words that went wrong: เมตร เพชร จักร บุตร are all monosyllabic with short vowels and tones to match. The tool showed them as disyllabic. tɕàk - rá for จักร is clearly wrong - as clause final, it would require the spelling จักระ to have that pronunciation. (As the non-final element of Indic compounds, จักร- is enunciated tɕàk - krà -). The vowel lengths for เมตร and เพชร definitely need to be marked as exceptions.
#30 Posted 2012-02-20 03:58:42
Maybe you agree that tree-style entities are a bit "geeky" by themselves. "Normal" people would rather understand plain text, and that plain text better fits the term "explanation". Sometimes I make it verbose, and that will be called explanation. What do you think?

#31 Posted 2012-02-23 01:33:16
As the non-final element of Indic compounds, จักร- is enunciated tɕàk - krà -

The vowel lengths for เมตร and เพชร definitely need to be marked as exceptions.

I greatly appreciate your feedback. I will add those exceptions. Did you mean falling tone for เมตร, not the vowel length?

As for จักร: yes, the rule insists that ไม้หันอากาศ is always followed by a final. However, I think it's rather a reduplication issue. And the biggest problem with reduplication is that it usually occurs in the middle of a polysyllabic word and does not occur at the end (วัฏจักร vs จักรภพ). Boundaries of meaningful words can be detected only by a dictionary-based tool, of course. I cannot think of any effective algorithm, and I would be obliged if someone gave me an idea how to overcome this.
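One possible direction for the exception words discussed above, given only as a rough sketch and not as a description of how the tool actually works: keep an exception table and try a greedy longest match against it before handing the remaining text to the rule-based logic. Everything named here (`EXCEPTIONS`, `transliterate`, the `rule_based` callback) is hypothetical; the readings in the table are the ones quoted earlier in this thread.

```python
from typing import Callable, Dict, List

# Hypothetical exception table. The readings are copied from posts in this thread.
EXCEPTIONS: Dict[str, str] = {
    "ผลิกะ": "pʰà - lí - kàʔ",
    "ปลัด": "pà - làt",
    "เงิน": "ŋɤn",
}


def transliterate(
    text: str,
    rule_based: Callable[[str], str],
    exceptions: Dict[str, str] = EXCEPTIONS,
) -> List[str]:
    """Greedy longest-match lookup in the exception table.

    Stretches of text not covered by any exception are passed, unchanged,
    to `rule_based` (a stand-in for the real rule engine).
    """
    max_len = max(map(len, exceptions), default=0)
    out: List[str] = []
    pending = ""  # characters seen so far that matched no exception
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, so longer entries win over shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in exceptions:
                match = text[i:i + n]
                break
        if match is None:
            pending += text[i]
            i += 1
            continue
        if pending:
            out.append(rule_based(pending))
            pending = ""
        out.append(exceptions[match])
        i += len(match)
    if pending:
        out.append(rule_based(pending))
    return out
```

Because the longest entry wins, whole words such as วัฏจักร and จักรภพ could be entered as separate keys alongside จักร, which is one way to tell final from non-final position without a full word segmenter. The obvious weakness is that a plain substring match can fire inside an unrelated longer word, so a real implementation would still need some notion of word boundaries.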
If one navigates away from the screen (at least, backwards), the tree has collapsed when one returns.

Yes, it's a common problem with stateless controls. I will try to find a workaround, and I'm also thinking of simplifying the tree (much of it is redundant).

#32 Posted 2012-02-23 07:06:39
Did you mean falling tone for เมตร, not the vowel length?

Boundaries of meaningful words can be detected only by a dictionary-based tool, of course.