Usage Notes for Charset Auto-detection modules
Document prepared by: Katsuhiko Momoi
Direct all technical comments to: Frank Tang
Last Update: 10/11/99
At M10 the Charset Detection modules mentioned in this document have become available via an Auto-Detect menu, which appears right below the Character Set menu in the Browser window. Thus it is no longer necessary to edit the prefs.js file directly to enable a charset detector. Note that the menu enables one charset detector at a time (the same as in M9). The menu can also disable detection.
There is one additional charset detector setting new in M10, which provides a workaround for the EUC-JP mail attachment display failure. Until the EUC-JP display bug is fixed for Japanese mail attachments, we recommend that those interested in viewing Japanese mail files use the following prefs.js setting:
user_pref("mail.charset.detector", "jaclassic");
This enables the Communicator 4.x Japanese detector, and it affects only mail attachment viewing. This setting can co-exist with the Browser charset detector setting. If this line does not exist and the Browser's charset detector is enabled, then that detector will also be used for mail attachments.
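The precedence just described can be restated as a small lookup. This is only a sketch of the rule in Python, not Mozilla code: the function name detector_for and the idea of passing the prefs as a dictionary are assumptions made for illustration, and "intl.charset.detector" is assumed to be the Browser's detector setting.

```python
def detector_for(context, prefs):
    """Pick the detector name for 'mail' or 'browser' from a dict of prefs."""
    # A mail-specific line, when present, only affects mail attachments.
    if context == "mail" and "mail.charset.detector" in prefs:
        return prefs["mail.charset.detector"]
    # No mail-specific line: fall back to the Browser's detector, if any.
    return prefs.get("intl.charset.detector")
```

For example, with both lines present, mail attachments would use "jaclassic" while the Browser keeps its own detector.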
The following usage notes may still be useful for historical reasons, and for those who want to manipulate prefs.js settings directly or who would like to use DetectCharset.exe directly for bulk testing.
==================== Below this line are the usage notes written for M9 ===========================
At M9 and later, we now have two ways to use the auto-charset modules:
- Method 1 (available only in M9 builds starting 7/15/99): Insert the intl.charset.detector line (shown after the Method 2 notes below) into the prefs.js file (prefs50.js if you're using a pre-M9 build). Mozilla will then use the auto-detector at all times, regardless of the encoding menu choice.
- Method 2: Use the "Detectch" or "Detectcharset" utility which is found in the same directory as your "apprunner" file.
- Purpose of the utility: To test charset detection modules Frank Tang has checked in for M8 and later. These detection modules are not meant for all charsets but rather for useful groups of encodings, e.g. Japanese encodings, Traditional Chinese encodings, etc.
- Where: The utility is called DetectCh.exe or DetectCharset.exe and is distributed with the Windows build. On Unix, it isn't built by default, but you can build it by (1) adding "intl/chardet/tests/" to the mozilla/allmakefiles.sh file and (2) running gmake in intl/chardet/tests.
- How:
- Please test against REAL language pages that contain no undefined code points (for example, undefined Shift_JIS code points). For example:
- First, download a web page into the same directory as your apprunner.exe file, and
- Type the following line:
- type file_name | DetectCh(arset) japsm 1024
- ... where file_name is the name of the file you downloaded for testing, japsm is the detector name, and 1024 is the block size.
- Sometimes the detector won't report any charset, which means it cannot figure out the charset.
- For example, it cannot figure out the charset of http://home.netscape.com/index.html. However, it can figure out that http://home.netscape.com/da/index.html (or sv) is windows-1252.
- If you don't want to download pages to test, then you can use a Perl Script to fetch URLs to test.
- First, install Perl 5.004 with LWP, for example ActivePerl from ActiveState Tool Corp.
- Then, type the following line (in the directory that contains DetectCh(arset)):
- perl -MLWP::Simple -e "getprint 'http://home.netscape.com/it/index.html'" |DetectCh(arset) japsm 1024
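For bulk testing of pages already saved to disk, the same pipe can be scripted. The following is a minimal Python sketch, assuming the utility reads the page from standard input and takes the detector name and block size as arguments, as in the command lines above. The function names run_detector and bulk_test, the "*.html" pattern, and the choice to keep only the last output line as the verdict are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def run_detector(data, detector="japsm", block_size=1024, exe=("DetectCh",)):
    """Pipe raw bytes into the utility and return whatever it prints."""
    cmd = [*exe, detector, str(block_size)]
    done = subprocess.run(cmd, input=data, capture_output=True, check=False)
    return done.stdout.decode("ascii", errors="replace")


def bulk_test(directory, pattern="*.html", **kwargs):
    """Map each matching file name to the last line the utility printed."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        lines = run_detector(path.read_bytes(), **kwargs).splitlines()
        results[path.name] = lines[-1].strip() if lines else ""
    return results
```

Run from the directory that contains DetectCh, a call such as bulk_test("pages", detector="kopsm") would return one verdict line per saved page. The directory name "pages" is hypothetical.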
For Method 1, the line to insert into prefs.js is:
user_pref("intl.charset.detector", "chardet_name");
where chardet_name is one of {japsm, kopsm, cjkpsm, zhpsm, zhtwpsm, zhcnpsm, ruprob, ukprob} (see also below for what they detect).
Available detectors: Red indicates that the charset was not implemented in the module at M8. On M9 builds starting with 7/15/99, all the red-marked charsets below are working. Though they are checked in, zhpsm and cjkpsm are not yet finished. Please don't use them at this time.
Here is some sample output using the Perl Script method:
Z:\mozilla\dist\WIN32_D.OBJ\bin>perl -MLWP::Simple -e "getprint 'http://home.netscape.com/it/index.html'"|DetectCh japsm 1024
It cannot find the charset.
Z:\mozilla\dist\WIN32_D.OBJ\bin>perl -MLWP::Simple -e "getprint 'http://home.netscape.com/da/index.html'"|DetectCh japsm 1024
It's NOT ISO-2022-JP- byte 4021
It's NOT EUC-JP- byte 4022
It's NOT UTF-8- byte 4022
It's NOT Shift_JIS- byte 4026
It's windows-1252- byte 4026. The only left
windows-1252
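The sample output above shows the detector ruling out one charset after another until only one is left. The following is a minimal Python sketch of that elimination idea. It is not the Mozilla verifier code: Python's built-in incremental decoders stand in for the per-charset verifiers, and the function name detect_by_elimination and the candidate list are assumptions made for illustration.

```python
import codecs

# Python codec names standing in for the japsm candidates listed in this
# document (UCS2 is left out to keep the sketch short).
CANDIDATES = ["iso2022_jp", "euc_jp", "utf_8", "shift_jis", "cp1252"]


def detect_by_elimination(data, candidates=CANDIDATES):
    """Return (charset or None, {eliminated charset: byte offset})."""
    live = {name: codecs.getincrementaldecoder(name)() for name in candidates}
    eliminated = {}
    for offset in range(len(data)):
        chunk = data[offset:offset + 1]
        for name in list(live):
            try:
                # Feed one byte; incomplete sequences are buffered.
                live[name].decode(chunk)
            except UnicodeDecodeError:
                # Illegal byte sequence: this candidate is ruled out here.
                del live[name]
                eliminated[name] = offset
        if len(live) == 1:
            return next(iter(live)), eliminated
        if not live:
            break
    # Zero or several candidates remain: no single answer.
    return None, eliminated
```

With ASCII-only input no candidate is ever ruled out, so the sketch returns no charset at all, which mirrors the "It cannot find the charset." case. The byte offsets it records depend on Python's decoders and will not match the numbers printed by the real utility.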
- japsm - detect charset among
- ISO-2022-JP
- EUC-JP
- Shift_JIS
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- kopsm - detect charset among
- EUC-KR
- ISO-2022-KR
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- zhtwpsm - detect charset among
- BIG5
- x-euc-tw
- ISO-2022-CN
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- zhcnpsm - detect charset among
- GB2312
- HZ
- ISO-2022-CN
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- zhpsm (Not fully ready yet) - detect charset among
- BIG5
- x-euc-tw
- GB2312
- HZ
- ISO-2022-CN
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- cjkpsm (Not fully ready yet) - detect charset among
- Shift_JIS
- EUC-JP
- ISO-2022-JP
- EUC-KR
- ISO-2022-KR
- BIG5
- x-euc-tw
- GB2312
- HZ
- ISO-2022-CN
- UTF-8
- windows-1252
- UCS2 (Big Endian and Little Endian)
- The Cyrillic detectors below (ruprob and ukprob) are based on "The Cyrillic Software Suite" Perl package developed by John Neystadt. Frank Tang posted a news article about this on 7/14/99.
- ruprob -- detects among
- KOI8-R
- windows-1251
- ISO-8859-5
- x-mac-cyrillic
- IBM866
- ukprob - detects among
- KOI8-U
- windows-1251
- ISO-8859-5
- x-mac-ukrainian
- IBM866
1. Need to add verifier for ISO-2022-KR, ISO-2022-CN and HZ: (Note: although we can detect these three charsets now (M9, starting with the 7/15/99 build), we have not implemented the Unicode converters for them yet.) Anyone want to help? Write to: Frank Tang.
2. Need to add ISO-2022-KR verifier to kopsm, cjkpsm
3. Need to add ISO-2022-CN verifier to zhtwpsm, zhpsm and cjkpsm
4. Need to add HZ verifier to zhpsm, zhcnpsm and cjkpsm
5. Need to rewrite ISO-2022-JP Verifier to make it work properly w/
ISO-2022-CN and ISO-2022-KR
Known problems/requests:
1. The UTF-8 verifier has some problems. It might sometimes identify non-UTF-8 pages as UTF-8. Investigating this problem .... [Fix checked in on 7/14/99. Use 7/15/99 or later M9 builds to see the fix.]
2. zhtwpsm will only work as an x-euc-tw detector if the page contains characters from CNS 11643 plane 2
3. zhpsm and cjkpsm won't detect correctly due to similar code ranges among EUC-KR, GB2312, x-euc-tw, Big5, EUC-JP, and Shift_JIS. We may eventually remove these two detectors, but for now we've included them in the builds so that we can learn how good or bad they are. It is a known issue that the 'cjkpsm' and 'zhpsm' detectors won't report in some cases, and that 'zhtwpsm' won't report 'x-euc-tw' correctly. We need a second-pass probability analysis for 'cjkpsm', 'zhpsm' and 'zhtwpsm'. Please don't report bugs for them; they are incomplete for now.
4. Is there any Vietnamese software that can detect the charset among VIQR, VNI, VPS, and VISCII? Write to: Frank Tang.
Bug reporting procedures: File your bug at this web page (Bugzilla). File it against the Browser. Assign the bug to ftang@netscape.com, and choose the Component to be "International".
1. Please test against japsm, kopsm, zhcnpsm and zhtwpsm, but not zhpsm and cjkpsm. These two detectors are not finished yet and need additional work. [If you use an M9 build later than 7/14/99, then you can also use ruprob and ukprob.]
2. When you report a bug, please include the DOS/prompt window output. You can attach it as follows:
a. Click the upper-left DOS icon in the DOS window, and press control-k control-e.
b. Highlight the portion of the output text by using the mouse to click and click again.
c. Click the upper-left DOS icon in the DOS window, and press control-k control-y.
d. Paste into your Bugzilla report.
e. That will help us a lot, since the DOS output tells us at which byte the detector goes into an error state for a particular charset.
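When preparing a report, it can help to summarize at which byte each charset was ruled out. The following is a small Python sketch that parses output in the format of the sample shown earlier in this document. The function name parse_detector_output and the regular expressions are assumptions, and the real utility's output format may differ between builds.

```python
import re

# Patterns modelled on the sample lines, e.g.
#   It's NOT EUC-JP- byte 4022
#   It's windows-1252- byte 4026. The only left
ELIMINATED = re.compile(r"^It's NOT\s+(\S+?)\s*-\s*byte\s+(\d+)")
WINNER = re.compile(r"^It's\s+(?!NOT\b)(\S+?)\s*-\s*byte\s+(\d+)")


def parse_detector_output(text):
    """Return the detected charset (or None) and the eliminated charsets."""
    eliminated = []
    detected = None
    for line in text.splitlines():
        line = line.strip()
        match = ELIMINATED.match(line)
        if match:
            eliminated.append((match.group(1), int(match.group(2))))
            continue
        match = WINNER.match(line)
        if match:
            detected = match.group(1)
    return {"detected": detected, "eliminated": eliminated}
```

For output that contains only "It cannot find the charset.", the sketch returns no detected charset and an empty elimination list.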