Posted 2006-12-14 13:48:23
In English they have a list like this, like the one for the Oxford Advanced Learner's Dictionary, which has a careful selection of commonly used words. I've been searching for a list like this in Thai for years but never found one. Any pointers would be appreciated.
Posted 2006-12-14 14:49:15
Grover, the best I can do is I have a list of the 1000 most common words according to four sources of language corpora. I've attached a spreadsheet that I converted to HTML.
The best one is the Mary Haas list. Not sure about Haas, but the other three I know are all computed automatically, so the digits 0 to 9, among other things, count as "words" in their lists, as do some other things that aren't common Thai at all but appear frequently in their corpora because of a large number of technical texts.
Hope this is helpful.
Edited by Rikker, 2006-12-14 14:51:28.
Posted 2006-12-14 15:15:27
Rikker, that is awesome. I've been looking for something exactly like this for years. Cheers! Where did you get it from, BTW?
Posted 2006-12-14 23:25:49
thai2english.com to the rescue! copy + paste = phonetic translations
Posted 2006-12-16 15:15:21
How should I be reading this list? What do the column headers Haas, Links, Orchid and Tax represent?
What do the numbers mean and why does each list have different numbers?
For example the first row, why do three lists have การ but one has เป็น?
Haas | Links | Orchid | Tax
366 เป็น | 15978 การ | 11888 การ | 9861 การ
I've put it in a slightly cleaner Excel Format attached here.
Edited by wasabi, 2006-12-16 15:17:54.
Posted 2006-12-16 15:21:03
can you pin it meadish? It's good for beginners.
Posted 2006-12-16 16:07:22
This is interesting, thanks Rikker. I did a similar thing a while back using all the text that people paste into thai2english.com, and for comparison the top 100 results in order were :
ที่ , และ , จะ , การ , มี , ใน , ได้ , ของ , เป็น , ให้ , ไป , ก็ , ไม่ , ว่า , แล้ว , มา , กับ , คุณ , ใจ , คน , เรา , ฉัน , แต่ , นะ , นี้ , ครับ , อยู่ , เธอ , กัน , ผม , โดย , มัน , จาก , ต้อง , ด้วย , เลย , ยัง , หรือ , ทำ , ใช้ , คือ , เขา , มาก , ผู้ , บอก , พี่ , ดู , เมื่อ , วัน , อะไร , เรื่อง , ถ้า , ดี , เพราะ , อยาก , ค่ะ , ไม่ได้ , ปี , อีก , เพื่อ , พระ , รัก , นั้น , ตัว , ถึง , งาน , สามารถ , หน้า , เวลา , ใคร , ไทย , เพลง , แบบ , ซึ่ง , ไว้ , ขอ , ส่ง , ต่อ , ความ , ท่าน , อย่าง , ใหม่ , เล่น , ก่อน , หา , บ้าน , ตาม , ทาง , สำหรับ , หนึ่ง , เอา , เค้า , คะ , ทำให้ , ขึ้น , ไม่มี , อ่าน , บาท , ราย , ชื่อ
ที่ was the most common by miles (about twice the count of และ), whereas all the others were relatively close.
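For anyone who wants to try the same kind of count on their own text, here is a minimal sketch in Python. It assumes the text has already been split into words, which is the hard part for Thai and is not shown here; the `sample` list and the `top_words` helper are made up for illustration and are not the thai2english.com data or method.

```python
from collections import Counter

def top_words(tokens, n=100):
    """Return the n most frequent tokens as (word, count) pairs."""
    # Skip empty or whitespace-only tokens left over from segmentation.
    counts = Counter(t for t in tokens if t.strip())
    return counts.most_common(n)

# Hypothetical, already-segmented sample (not real corpus data).
sample = ["ที่", "และ", "ที่", "จะ", "ที่", "และ"]
ranking = top_words(sample, n=3)

for word, count in ranking:
    print(word, count)
```

The ranking you get depends heavily on how the text was segmented, so two tools can disagree on the same input.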
Posted 2006-12-16 18:02:11
Mike, that's interesting to see how commonly used they are.
Posted 2006-12-17 13:07:28
Thanks for doing that. My original is in Excel, I just wanted to make sure everyone could access it.
The four columns are four different text collections/corpora: one from Mary Haas, one from NECTEC's Linguistics and Knowledge Science Laboratory (LINKS), one from Chula University's Orchid Corpus (appears to be offline right now), and one labeled Tax. I'm not clear on the exact source of Tax, but I think it might be the Thai tax code or a corpus of legal documents of some kind, given the high frequency of tax-related terms in its top 1000 words.
The number next to each word is the number of times that word appears in the corpus. The number at the top of each column is just the total number of occurrences of the top 1000 words.
As for why the lists have different words in the top spots, well, that has to do with at least three things: [a] the size of the corpus, [b] the variety (or lack of it) of the subject matter collected in the corpus, [c] the method used to count occurrences.
The line you've quoted is the top word in each of the four corpora. You can see the Haas corpus is much smaller, with its top word only occurring 366 times. The other three, all much larger, agree that การ is more common. Orchid is largest at 416,000, but I don't know what constitutes a "word" for the purposes of counting in the Orchid corpus. While English "words" don't correspond to the collections of letters between spaces as much as we tend to think they do, splitting on spaces gives a clear working definition of "word" for the purpose of gathering corpora (and one that is easily countable via automatic means). Thai, which is written without spaces between words, is a bit trickier. I know the corpora on thai.sealang.net are all counted via number of characters, not words.
Also, one telltale sign that Tax is a very narrow corpus subject-matter-wise is the fact that while it is 269,000 words large, it only has 2100 distinct words in it, while even in Haas there are 4000 distinct words out of 27000 total words.
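To put rough numbers on that comparison, here is a small sketch using only the figures quoted in this thread (27,000 and 269,000 total words, 4000 and 2100 distinct words, and top-word counts of 366 and 9861). The helper names are just for illustration, and the ratio of distinct words to total words naturally drops as a corpus grows, so treat this as a rough indicator rather than a fair like-for-like comparison.

```python
# Figures quoted in the posts above; nothing here is measured independently.
corpora = {
    "Haas": {"tokens": 27_000, "types": 4_000, "top_count": 366},
    "Tax": {"tokens": 269_000, "types": 2_100, "top_count": 9_861},
}

def type_token_ratio(types, tokens):
    """Distinct words divided by total words; lower means more repetition."""
    return types / tokens

def top_word_share(top_count, tokens):
    """Fraction of the whole corpus taken up by its single most common word."""
    return top_count / tokens

ttr = {name: type_token_ratio(c["types"], c["tokens"]) for name, c in corpora.items()}
share = {name: top_word_share(c["top_count"], c["tokens"]) for name, c in corpora.items()}

for name in corpora:
    print(f"{name}: distinct/total {ttr[name]:.1%}, top word share {share[name]:.1%}")
```

On these figures, distinct words make up roughly 15% of Haas but under 1% of Tax, which fits the idea that Tax covers a very narrow subject.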
Edited by Rikker, 2006-12-17 13:28:30.
Posted 2006-12-17 14:56:13
Thanks for the detailed reply Rikker,
Where are you getting the figures of 2100 for Tax and 4000 for Haas? I see each list having 1000 words. And can you further define what you mean by corpus? Is this some underlying body of work the statistics are based on? What is this body of work for each?
Posted 2006-12-17 19:00:48
Thanks for the list, it looks very good, but where do I get the translations for the words?
Posted 2006-12-17 19:06:42
Paste them into www.thai2english.com or buy a dictionary. The process of looking words up is good for the memorization process.
The most common words have many different functions. เป็น for example - while it often just means 'be' or 'is', as in
ผมเป็นหมอ I am a doctor,
...it can also be a grammatical function word indicating result or manner:
หั่นเนื้อเป็นชิ้น Cut beef into slices.
เป็นเวลาสองวัน ...for two days
...as well as ability:
เล่นกีตาร์ไม่เป็น cannot play guitar...
Posted 2006-12-23 22:22:15
Not sure if this would help, but there's a great vocabulary builder from a company called Unforgettable Languages that uses easy memory aids for commonly used words. This is a great addition to your language learning IMHO. It is an easy way to pick up, in this case, about 230 commonly used words. I used it for Thai and Mandarin.
It can be found at: www.unforgettablelanguages.com
Posted 2006-12-24 04:16:52
That is a good link indeed, and a good system for vocab building.
E.g. imagine a fat GUY (gai) eating a large chicken, and so on.
Edited by Grover, 2006-12-24 04:43:08.
Posted 2009-01-05 15:01:59
I realize this thread is getting old, but this has been really helpful to me. Why waste time learning a whole dictionary right away? Start with the 100 most common words and work up to the 1000 most common then perhaps 2500. That's a good vocabulary!
Using Rikker's lists along with thai2eng, and some others on ThailandQA.com, I have massaged the data trying to find consensus or at least trends. I will try to attach the spreadsheets I am using here, but I really don't spend much time using forums, so I may not succeed.
There are two spreadsheets. All the lists were included, sorted, had duplicates removed, and were then trimmed. A frequency table is provided showing the degree of correlation between the lists.
As Rikker kindly pointed out, the Tax list seems to contain tax-related terms. Numbers and special characters were deleted.
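For anyone who would rather script this than do it in a spreadsheet, here is a rough sketch of the same steps: clean each list, remove duplicates, then count how many lists each word appears in. The three short lists below are made-up placeholders, not the real data, and `clean` and `agreement` are hypothetical helper names, not what the spreadsheets actually do.

```python
from collections import Counter

def clean(words):
    """Drop numbers and symbol-only entries, remove duplicates, keep order."""
    seen = set()
    kept = []
    for w in words:
        w = w.strip()
        has_digit = any(ch.isdigit() for ch in w)    # also catches Thai digits
        has_letter = any(ch.isalpha() for ch in w)   # Thai consonants count as letters
        if not w or has_digit or not has_letter:
            continue
        if w not in seen:
            seen.add(w)
            kept.append(w)
    return kept

def agreement(lists):
    """Map each word to the number of lists that contain it."""
    counts = Counter()
    for words in lists.values():
        counts.update(clean(words))
    return counts

# Hypothetical placeholder lists, only to show the shape of the data.
lists = {
    "Haas": ["เป็น", "การ", "ที่"],
    "Links": ["การ", "ที่", "1"],
    "Tax": ["การ", "ภาษี", "%"],
}

overlap = agreement(lists)
consensus = [w for w, n in overlap.items() if n == len(lists)]
print(consensus)
```

Words that show up in every list are the safest bets for a core vocabulary; words found in only one list are more likely to reflect that corpus's subject matter.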
All errors are mine alone and suggestions and corrections are gratefully accepted.
Happy new year.
Posted 2009-01-06 09:14:23
Nice. I went at it from a slightly different angle. Also, I'm going for 3000 as (I might be wrong) 1000 doesn't have enough word combos. I didn't grab the whole frequency list as you did, only Mary's. Then I added from AUA, Byki (not all), AWL, Thai-language.com starred, etc. Mike from Thai2English.com also has a frequency list that will come in at some point. Then there's a dictionary with the supposedly top 3000 but I found what I believe are 3 mistakes just in the first couple of pages, so I backed off from seriously checking it against mine.
My eventual aim is to put each word with phrases, as words on their own don't work with the way I learn. Then when I get to a certain point, I'll have someone in the know look at it, as there are sure to be a ton of iffy words. But right now, I'm just nibbling away and enjoying finding new words as I go.
Posted 2009-01-06 13:12:51
A question for the advanced members: how well do you think this list relates to spoken Thai, as opposed to the written Thai that provides the source?
I would say that ก็ has to be number 1 in terms of spoken Thai, surely!
Posted 2009-01-06 14:18:11
I'd forgotten about pinning this topic. Great thing that you brought it back again.
It's not just you - it won't work for anyone who wants to speak anything resembling intelligible Thai. The example I posted above about how เป็น is used, is just a brief introduction to the word, and can be extended, not to mention the same thing could [should!] be repeated for most of the most common words.
In other words, these words can have completely different functions depending on the context.
If one doesn't learn grammatical patterns as well as idioms too, the words by themselves, with just one translation in English and no usage examples, won't do you much more good than getting 50 tons of bricks and mortar and an order to reconstruct Wat Benjamabophit, Suvarnabhumi airport, or Baiyoke 2.
For one, I think you'll find many more particles, especially if you properly distinguish between ครับ อะ ฮะ ค่ะ จ้ะ วะ หว่า etc. You're right about ก็ too - it's a hesitation word.