Richard W

Thai In Openoffice On Ubuntu Lucid Lynx

29 posts in this topic

I can't get Thai spell-checking to work in the word processor of OpenOffice.org 3.2.0 on Ubuntu 10.04 Lucid Lynx. I have installed the Thai dictionary package myspell-th Version 1:3.2.0-3ubuntu3.1, and its installation set it up for hunspell. The problem seems to be that the words provided to the spell checker are defined as columns, so that the spell-checker tries to check the 4-character word for 'island' as three or four words! The default language for complex-text-layout is set to Thai, and I get 'Thai justification' of Thai text (i.e. spaces are clearly inserted between characters rather than just growing interphrase gaps).

Am I missing a trick, or does Thai spell-checking not work for this combination of OS and OpenOffice.org? The interface language is set up for English - I do not yet need it to be set up for Thai.

Separating words by ZWSP, plain space and even having one word per paragraph all fail to help.


openoffice.org-l10n-th: office productivity suite -- Thai language package

help or not? please give specifics, devs need a little help

The labelling on the tin says not. To quote from http://packages.ubuntu.com/lucid/openoffice.org-l10n-th :

This package contains the localization of OpenOffice.org in Thai. It contains the user interface, the templates and the autotext features. (please note that not all this is available for all possible languages). You can switch user interface language using the locales system.

Spelling dictionaries, hyphenation patterns, thesauri and help are not included in this package. There are some available in separate packages (myspell-*, openoffice.org-hyphenation-*, openoffice.org-thesaurus-*, openoffice.org-help-*)

If you just want to be able to spellcheck etc. in other languages, you can install extra dictionaries/hyphenation patterns/thesauri independently of the language packs.

I gave it a try, but it didn't help. What was dispiriting was that almost none of the spell-checking interface was localised to Thai. I have myspell-th installed, but it doesn't provide anything useful.

I hope it is permitted to give an example in this forum.

As an example, I tried the line ไม่ รู้ ว่า จะ ทำ อย่าง ไหน with three variations - spaces between words, 'zero-width space' (ZWSP) between words, and nothing between words. In each case, the spell-checker saw it as a sequence like ม่ รู้ ว่ ย่ , of which only รู้ is a word - it was recognised as spelt correctly.

Now, Thai may just be a tricky problem because no-one expects to be able to force Thais to use ZWSP, although I get the impression that Cambodians have accepted it. Thai needs a dictionary to split text between lines at word boundaries, but that doesn't help with misspelt words. The task isn't impossible - Word manages it. However, at least one version of OpenOffice seems to have addressed the issue by defining Thai words to be consonants plus associated marks, and these seem to be the words supplied to the spell-checking system.
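
The segmentation described above can be illustrated with a small sketch. It is only a guess at the kind of rule involved, not OpenOffice's actual code: each base character is grouped with the combining marks (Unicode category Mn) that follow it. The helper name `clusters` is made up for this example.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks (category Mn) after it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) == "Mn":
            out[-1] += ch   # attach the mark to the preceding base character
        else:
            out.append(ch)  # start a new cluster
    return out


# The sample line from the post, written without separators.
sample = "ไม่รู้ว่าจะทำอย่างไหน"
# Keep only clusters longer than one character, i.e. a consonant plus its marks.
print([c for c in clusters(sample) if len(c) > 1])
```

Under this rule the only multi-character clusters in the sample line are the consonant-plus-mark groups, which is consistent with the fragments the spell-checker was seen to pick out.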


I did some more digging, and this time I came up with a partial answer - Spellchecker API isn't appropriate for languages without any space between words like Thai. Interestingly it's being worked on for Khmer as well, so it looks as though they may not be working with ZWSP. I still suspect there are other text-breaking issues fouling up Thai spell-checking. So, OpenOffice does not substitute for Word in Thai!

You may want to check the successor to OpenOffice - LibreOffice. A Google search shows it has an add-on for the Thai language. Both have deb packages. I'm not sure OpenOffice has a future any more.


The only Thai-specific improvement I can see in LibreOffice 3.3 over OpenOffice.org 3.2 is that more of the spelling interface has been translated to Thai - and that might be in OpenOffice.org 3.3. The font sizing seems worse for LibreOffice than OpenOffice - as though someone had assumed one pitch (as opposed to x-size) fitted all scripts. It's actually slightly worrying that I couldn't confirm that LibreOffice has inherited OpenOffice's bug and to-do lists. Apart from that, Thai spell-checking is just as dysfunctional.


I thought I had a solution, but I need some help.

My thought for getting some spell-checking was as follows:

1) Separate my Thai words with zero-width space.

2) Extract content.xml and insert line breaks outside paragraphs and headings, e.g.

unzip -p file.odt content.xml | spaceit > content.xml

where spaceit is a program of my own.

3) Edit content.xml with emacs using a Thai spell checker.

4) Put the edited file back:

zip file.odt content.xml

When I edit the file in LibreOffice, the line breaks inserted in Step 2 are removed.
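
For anyone without a 'spaceit' of their own, Steps 2 and 4 can be roughed out in Python. This is a minimal sketch under assumptions: `add_line_breaks` is a hypothetical stand-in for spaceit (whose real behaviour is not shown here) that simply puts a newline after each closing paragraph or heading tag, and the function names are invented for this example. Python's zipfile module cannot update a member in place, so the archive is copied with content.xml swapped.

```python
import os
import re
import zipfile

# Closing tags of ODF paragraphs and headings not already followed by a newline.
BLOCK_END = re.compile(r"(</text:[ph]>)(?!\n)")


def add_line_breaks(xml):
    """Hypothetical stand-in for 'spaceit': newline after each paragraph/heading."""
    return BLOCK_END.sub(r"\1\n", xml)


def read_content(odt_path):
    """Return content.xml from the document as text (like 'unzip -p')."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read("content.xml").decode("utf-8")


def write_content(odt_path, xml):
    """Copy the archive with content.xml replaced, then swap the copy in."""
    tmp_path = odt_path + ".tmp"
    with zipfile.ZipFile(odt_path) as src, zipfile.ZipFile(tmp_path, "w") as dst:
        for item in src.infolist():
            if item.filename == "content.xml":
                data = xml.encode("utf-8")
            else:
                data = src.read(item.filename)
            # Passing the ZipInfo keeps each member's name, order and compression.
            dst.writestr(item, data)
    os.replace(tmp_path, odt_path)
```

Typical use would be to save `add_line_breaks(read_content("file.odt"))` to content.xml, spell-check that file, and pass the edited text back to `write_content`. Work on a copy of the document until you trust the round trip.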

I've been having trouble with Step 3. I've proceeded as follows:

a) Install myspell-th, which is installed in a hunspell directory.

b) Remove the qualifier from TIS620 in the th_TH.aff file. (Grrr!)

c) Add a dictionary definition to ispell-local-dictionary-alist using customize-variable. It took me a while to realise that I should enter the characters in Thai script and specify the encoding as utf-8. (I probably need to tweak the definition to make a combined Thai/English dictionary.)

c) Hunspell then wasn't playing nicely with Emacs. I had to fix the Hunspell bug 'Bad UTF-8 char count in pipe mode' (ID: 3178449), originally raised as Emacs bug #7781, '23.2.91; ispell problem with hunspell and UTF-8 file'.

d) I can then step through the spelling errors in Emacs, but I don't get correction suggestions. This looks like a Thai script or UTF-8 problem. I do get the prompts when I use an English dictionary.

Can anyone help me with problem (d)?

A possible solution is to avoid emacs and use the Hunspell spelling correction program directly, but it isn't very friendly when its suggested corrections omit the correct correction.

hunspell -a is messed up for UTF-8 input. Fixing the problem in emacs is complicated. I've tried dropping the -a qualifier, but when I do that, Hunspell uses a subtly different interface, so though it works with Emacs in a simple test case, it fails with content.xml.

Using Hunspell on its own fails because it can't display very long lines. I also get the impression I hit some of its capacity limits.


I've now identified the relevant problem with Hunspell, at least for Versions 1.2.8 and 1.3.2. The bugs are all in the pipe_interface() function in the hunspell.cxx for the stand-alone program. Firstly, the method get_tokenpos returns an offset in bytes, but it needs to be converted to an offset in characters. Secondly, when generating suggestions the word checked is converted to the dictionary's encoding (typically TIS-620 for Thai, as in Version 1:3.2.0-3ubuntu3.1 of myspell-th) if filter_mode is NORMAL, but not if it is PIPE. It should be converted in both cases.
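
The two conversions involved can be shown in a few lines of Python. This is a sketch of the idea only, not the Hunspell C++ code; the function names are invented for the example, and 'tis_620' is Python's name for the TIS-620 codec.

```python
def byte_to_char_offset(line_utf8, byte_offset):
    """Convert a byte offset into a UTF-8 line to an offset in characters.

    The byte offset must fall on a character boundary, otherwise decoding fails.
    """
    return len(line_utf8[:byte_offset].decode("utf-8"))


def to_dictionary_encoding(word, encoding="tis_620"):
    """Encode a word the way a TIS-620 dictionary stores it."""
    return word.encode(encoding)


line = "กข อไร".encode("utf-8")
start = line.index("อไร".encode("utf-8"))  # byte offset of the second word
print(start, byte_to_char_offset(line, start))
print(to_dictionary_encoding("อไร"))
```

Each Thai letter takes three bytes in UTF-8, so the second word starts at byte 7 but at character 3. An editor that counts characters needs the latter.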

There's a slight fault on the Emacs side, in ispell.el (from Version 1.4.0ubuntu2 of package dictionaries-common) - function ispell-show-choices needs to call fit-window-to-buffer just after the call to switch-to-buffer so it will display Thai choices properly.

I've now used the scheme outlined above to do some Thai spell-checking. Code is available on request. I'm not sure I'll have formalised the bug reports before the new year. There is some odd behaviour in the generation of suggestions for Thai, so I may have more than just the matters mentioned here to report.


I now have a nice collection of bug reports:

On Hunspell:

Bad UTF-8 char count in pipe mode - ID: 3178449

No Encoding of Word for Suggestions in Piped Mode

Multidictionary guesses dictionary for suggestions

Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs

On the Thai dictionary:

th_TH Affix File Inadequate for Hunspell

Corrected code (at least, as far as Hunspell for spell-checking via Emacs is concerned) is pointed to by the last of the four Hunspell bug reports.


I have long bemoaned lack of an effective Thai spell checker in LibreOffice in Ubuntu. So I was very happy to see that you were addressing this bug. I am not a technical type. I understand though from your last post that you have fixed the problem, and if I downloaded the files you posted and ran the configure file that it would fix the problem for me too. I tried this but no change. Is there a way to fix it? I am using Ubuntu 11.10. Will Ubuntu eventually make an update available to fix this problem?

JiangWade, how far did you get with the process? When you say you 'ran the configure file', I take it you rebuilt Hunspell Version 1.2.8 with my changes, for which you would need to run configure and then make, and then look for the hunspell executable in the src/tools subdirectory. (If you didn't get this far, I will happily walk you through the process. It may seem intimidating, but I don't recall any complexities.)

The next thing you need to do is to ensure that your setting of PATH picks up the new executable. (I presume you're loath to mess about with stuff in /usr/share.) For example, on my machine the environment variable PATH has the value

/home/richard/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

and in /home/richard/bin hunspell is set up as a link to my corrected src/tools/hunspell.
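
To confirm which hunspell a given PATH actually picks up, something like the following Python sketch may help. The function name `resolve_on_path` is invented for this example; it uses only the standard library and follows symbolic links, so a link in a personal bin directory is reported as the file it points to.

```python
import os
import shutil


def resolve_on_path(program, path=None):
    """Return the real file that 'program' resolves to on PATH, or None."""
    found = shutil.which(program, path=path)
    return os.path.realpath(found) if found else None


# With no 'path' argument the current PATH environment variable is searched.
print(resolve_on_path("hunspell"))
```

If this prints the system copy rather than the rebuilt one, the directory holding the link needs to come earlier in PATH.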

As a test, try issuing the following two lines at a terminal followed by ctrl/C or ctrl/D:

hunspell -a -d th_TH
อไร

I get the outputs

@(#) International Ispell Version 3.2.06 (but really Hunspell 1.2.8) hacked by JRW
& อไร 4 0: อุไร, อะไร, อมร, อรไท

Note that any ad lib changes in the identification code have to go after the closing parenthesis - that threw me for a while.
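
For checking the reply lines by program rather than by eye, here is a rough Python sketch. It assumes the usual ispell pipe-mode reply layout seen in the output above ('& word count offset: suggestions', and '# word offset' when there are no suggestions); the function names are made up for the example, and `check_word` needs the hunspell found on PATH to have the th_TH dictionary available.

```python
import subprocess


def parse_reply(line):
    """Parse one '&' (suggestions) or '#' (no suggestions) line from 'hunspell -a'."""
    if line.startswith("&"):
        head, _, tail = line.partition(": ")
        _, word, _count, offset = head.split()
        return {"word": word, "offset": int(offset), "suggestions": tail.split(", ")}
    if line.startswith("#"):
        _, word, offset = line.split()
        return {"word": word, "offset": int(offset), "suggestions": []}
    return None  # '*', '+', '-' and blank lines mean the word was accepted


def check_word(word, dictionary="th_TH"):
    """Send one word to 'hunspell -a' and parse the replies after the banner line."""
    result = subprocess.run(
        ["hunspell", "-a", "-d", dictionary],
        input=word + "\n", capture_output=True, text=True, encoding="utf-8",
    )
    replies = (parse_reply(reply) for reply in result.stdout.splitlines()[1:])
    return [r for r in replies if r is not None]


print(parse_reply("& อไร 4 0: อุไร, อะไร, อมร, อรไท"))
```

An offset that does not match the word's position counted in characters, or an empty suggestion list for an obvious near-miss, would point to the pipe-mode bugs discussed in this thread.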

What is your locale set to? That might be a cause of problems. To find your locale, issue the commands:

env | grep LC
env | grep LANG

On my machine, they yield:

LANG=en_GB.utf8
GDM_LANG=en_GB.utf8
LANGUAGE=en_GB:en

The important bit is the '.utf8'.
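
The same check can be scripted. The sketch below is an illustration only, with invented function names: it applies the usual precedence (LC_ALL, then LC_CTYPE, then LANG) and reports whether the codeset part of the winning value is some spelling of UTF-8.

```python
import os


def effective_codeset(env=None):
    """Return the normalised codeset of the locale governing characters, or ''."""
    env = os.environ if env is None else env
    # LC_ALL overrides LC_CTYPE, which overrides LANG; empty values count as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Take the part after '.', dropping any '@modifier' suffix.
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "")
    return ""


def is_utf8_locale(env=None):
    """True if the effective locale names UTF-8 (as 'utf8', 'UTF-8', ...)."""
    return effective_codeset(env) == "utf8"


print(is_utf8_locale())
```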

For running emacs, what do you have in your .emacs file relating to ispell? One thing I didn't mention in my README.txt (because I wasn't aware of it) was that to set up ispell-related variables for emacs, you first need to start ispell (M-x ispell). The emacs commands I used (commands and responses) are:

M-x find-file content.xml          # Load file to edit
M-x ispell                         # Load ispell functions (needed for
                                   # M-x customize-variable ispell-program-name)
q y                                # The initial spell process will have to be killed
                                   # at some point if it is using the wrong dictionary.
M-x ispell-change-dictionary thai
M-x load-file ispell.el            # To get window resizing - your set-up may not need this.
M-x ispell                         # Away we go - x to break out of spell checking,
                                   # q to force a restart.

As to who will fix what, I don't think Ubuntu will fix much directly. Moving fixes downstream is the best I can hope for. This may result in 'Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs' being fixed, possibly by simply moving to a later version of Hunspell. What version is configured for Oneiric? Getting 'th_TH Affix File Inadequate for Hunspell' fixed is trickier - I may have to find another way of getting that fixed. Adding your name to those affected may help.

As for fixing Thai spell-checking in LibreOffice directly, I shall check on related progress. I thought Javier Solá's work for Khmer would fix it, but I am suddenly not so sure.


For LibreOffice, it looks as though someone is going to have to popularise a fully working Thai spell checker. Professionally, it's doable - SIL implemented Graphite font support for a version of OpenOffice, a Burmese company maintained it, and finally it was adopted by OpenOffice. The Khmer spell checker (depending on ZWSP) is already functional. There appears to be a licensing issue with the relevant ICU tool, from what I can glean from various discussions around the internet.

Apparently similar problems arise with Tibetan.

This may result in 'Hunspell 1.2.8 Groups Thai TIS-620 Chars in Lower/Upper Case Pairs' being fixed, possibly by simply moving to a later version of Hunspell.

This happened, apparently today (Thursday 9 February 2012), for Lucid (10.04) at least, presumably as part of the upgrade of the LibreOffice supported in Lucid from 3.3.2 to 3.4.5.

Unfortunately for me, LibreOffice Version 3.4.5 drops support for Graphite fonts using Version 1.0 of the Silf table - 'SIL: Are there any such fonts in the wild?' and mangles fonts using 'pseudoglyphs' (bug and fix reported to SIL). On the other hand, a bug that stopped my upgrading my old fonts to use later table versions has now been fixed in the trunk version of the Graphite compiler.
