Charset Detection Algorithms

You are currently viewing a snapshot of www.mozilla.org taken on April 21, 2008. Most of this content is highly out of date (some pages haven't been updated since the project began in 1998) and exists for historical purposes only. If there are any pages on this archive site that you think should be added back to www.mozilla.org, please file a bug.

Usage Notes for Charset Auto-detection modules

Document prepared by: Katsuhiko Momoi
Direct all technical comments to: Frank Tang
Last Update: 10/11/99

As of M10, the charset detection modules described in this document are available through an Auto-Detect menu, located directly below the Character Set menu in the Browser window. It is therefore no longer necessary to edit the prefs.js file directly to enable a charset detector. As in M9, the menu enables only one charset detector at a time. The menu can also be used to turn detection off.

M10 adds one new charset detector setting, which works around the failure to display EUC-JP mail file attachments. Until the EUC-JP display bug is fixed for Japanese mail attachments, we recommend that those interested in viewing Japanese mail files use the following prefs.js setting:

user_pref("mail.charset.detector", "jaclassic");

This enables the Communicator 4.x Japanese detector and affects only mail attachment viewing. This setting can co-exist with the Browser charset detector setting. If this line is absent and the Browser's charset detector is enabled, then that detector is also used for mail attachments.
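The precedence just described can be sketched in a few lines of Python. This is only an illustration of the rule, not Mozilla code: the function name is hypothetical, and it assumes that the Browser detector is stored under the "intl.charset.detector" pref used in the M9 notes further down.

```python
from typing import Mapping, Optional


def mail_attachment_detector(prefs: Mapping[str, str]) -> Optional[str]:
    """Return the detector name applied to mail attachments, or None."""
    # An explicit mail setting wins; it affects mail attachment viewing only.
    mail_detector = prefs.get("mail.charset.detector")
    if mail_detector:
        return mail_detector
    # Otherwise fall back to the Browser detector, if one is enabled.
    # (Assumption: the Browser detector lives in "intl.charset.detector".)
    return prefs.get("intl.charset.detector") or None
```

With both prefs set, the mail setting ("jaclassic") is returned. With only the Browser pref set, its value is returned. With neither, the result is None.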

The following usage notes may still be useful for historical reasons, for those who want to manipulate prefs.js settings directly, and for those who would like to use DetectCharset.exe directly for bulk testing.

==================== Below this line are the usage notes written for M9 ===========================

At M9 and later, there are two ways to use the auto-charset modules:

Method 1 (available only in M9 builds dated 7/15/99 or later): Insert the following line into the prefs.js file (prefs50.js if you're using a pre-M9 build). Mozilla will then use the auto-detector at all times, regardless of the encoding menu choice.

user_pref("intl.charset.detector", "chardet_name");

where chardet_name is one of {japsm, kopsm, cjkpsm, zhpsm, zhtwpsm, zhcnpsm, ruprob, ukprob} (see below for what each one detects).
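If you generate prefs.js lines from a script, a small helper along the following lines can catch a mistyped detector name before it reaches the file. This is a sketch only: the helper name is made up for this note, and the list of names is simply copied from the set above.

```python
# Detector names as listed in these notes.
DETECTORS = (
    "japsm", "kopsm", "cjkpsm", "zhpsm",
    "zhtwpsm", "zhcnpsm", "ruprob", "ukprob",
)


def detector_pref_line(chardet_name: str) -> str:
    """Build the prefs.js line that enables the given detector."""
    if chardet_name not in DETECTORS:
        raise ValueError(f"unknown charset detector: {chardet_name!r}")
    return f'user_pref("intl.charset.detector", "{chardet_name}");'
```

For example, detector_pref_line("japsm") returns the line shown above with chardet_name replaced by japsm.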

Method 2: Use the "Detectch" or "Detectcharset" utility which is found in the same directory as your "apprunner" file.

Purpose of the utility: To test the charset detection modules that Frank Tang has checked in for M8 and later. These detection modules are not meant for all charsets but rather for useful groups of encodings, e.g. Japanese encodings, Traditional Chinese encodings, etc.
Where: The utility is called DetectCh.exe or DetectCharset.exe and is distributed with the Windows build. On Unix it is not built by default, but you can build it by (1) adding "intl/chardet/tests/" to the mozilla/allmakefiles.sh file and (2) running gmake in intl/chardet/tests.
How:

Please test against REAL language pages that contain no undefined code points (for example, undefined Shift_JIS code points). For example:

First, download a web page into the same directory as your apprunner.exe file.
Then type the following line:

type file_name | DetectCh(arset) japsm 1024

... where file_name is the name of the file you downloaded for testing, japsm is the detector name, and 1024 is the size of the block.
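To make the block-size argument concrete, here is a minimal Python sketch of reading input in fixed-size blocks. It assumes the utility hands its input to the detector one block at a time; these notes do not spell out how DetectCh actually consumes the data, and iter_blocks is a hypothetical helper name.

```python
import io
from typing import BinaryIO, Iterator


def iter_blocks(stream: BinaryIO, block_size: int = 1024) -> Iterator[bytes]:
    """Yield successive blocks of at most block_size bytes from stream."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    while True:
        block = stream.read(block_size)
        if not block:  # end of input
            break
        yield block


# A 2500-byte input is split into two full blocks and one partial block.
sizes = [len(block) for block in iter_blocks(io.BytesIO(b"x" * 2500), 1024)]
```

Here sizes ends up as [1024, 1024, 452].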

Sometimes the detector does not report any charset, which means it could not determine one.

For example, it cannot determine the charset of http://home.netscape.com/index.html. However, it can determine that http://home.netscape.com/da/index.html (or sv) is windows-1252.

If you don't want to download pages to test, then you can use a Perl Script to fetch URLs to test.

First, install Perl 5.004 with LWP, for example ActivePerl from ActiveState Tool Corp.
Then type the following line in the directory that contains DetectCh(arset):

perl -MLWP::Simple -e "getprint 'http://home.netscape.com/it/index.html'" |DetectCh(arset) japsm 1024
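For those who prefer Python to Perl, the same fetch-and-pipe step might look roughly like the sketch below. It assumes that an executable named DetectCh is in the current directory or on the PATH, and that it takes the detector name and block size as its two arguments, as in the command above. The function names are ours, not part of any Mozilla tool.

```python
import subprocess
import urllib.request
from typing import List


def build_command(detector: str, block_size: int, exe: str = "DetectCh") -> List[str]:
    """Build the DetectCh command line: executable, detector name, block size."""
    return [exe, detector, str(block_size)]


def fetch_and_detect(url: str, detector: str = "japsm",
                     block_size: int = 1024, exe: str = "DetectCh") -> str:
    """Download url and pipe the raw bytes to the detector's standard input."""
    with urllib.request.urlopen(url) as response:
        page = response.read()
    result = subprocess.run(
        build_command(detector, block_size, exe),
        input=page,
        stdout=subprocess.PIPE,
        check=False,
    )
    # The tool's output is short ASCII text; decode it leniently.
    return result.stdout.decode("ascii", errors="replace")
```

Calling fetch_and_detect("http://home.netscape.com/it/index.html") would then play the same role as the Perl one-liner.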

Sample output -- illustration of Method 2:

Here is some sample output using the Perl script method:

Z:\mozilla\dist\WIN32_D.OBJ\bin>perl -MLWP::Simple -e "getprint 'http://home.netscape.com/it/index.html'"|DetectCh japsm 1024

It cannot find the charset.

Z:\mozilla\dist\WIN32_D.OBJ\bin>perl -MLWP::Simple -e "getprint 'http://home.netscape.com/da/index.html'"|DetectCh japsm 1024

It's NOT ISO-2022-JP- byte 4021
It's NOT EUC-JP- byte 4022
It's NOT UTF-8- byte 4022
It's NOT Shift_JIS- byte 4026
It's windows-1252- byte 4026. The only left
windows-1252
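The output above suggests detection by elimination: each candidate charset is checked against the bytes, candidates that hit an illegal sequence are dropped, and the last one standing is reported. The Python sketch below imitates that idea with the standard library's incremental decoders. It is not the Mozilla implementation (which uses its own verifiers), the codec names are this sketch's choice, and it does not report byte offsets.

```python
import codecs
from typing import List, Optional, Tuple

# Charset label -> Python codec name (an assumption of this sketch).
CANDIDATES = {
    "ISO-2022-JP": "iso2022_jp",
    "EUC-JP": "euc_jp",
    "UTF-8": "utf_8",
    "Shift_JIS": "shift_jis",
    "windows-1252": "cp1252",
}


def detect_by_elimination(data: bytes, block_size: int = 1024) -> Tuple[Optional[str], List[str]]:
    """Return (surviving charset or None, list of eliminated charsets)."""
    decoders = {
        label: codecs.getincrementaldecoder(codec)("strict")
        for label, codec in CANDIDATES.items()
    }
    eliminated: List[str] = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        for label in list(decoders):
            try:
                decoders[label].decode(block)
            except UnicodeDecodeError:
                # This charset cannot produce these bytes: drop it.
                del decoders[label]
                eliminated.append(label)
        if len(decoders) == 1:
            # Only one candidate is left, so report it.
            return next(iter(decoders)), eliminated
    # Zero or several candidates remain: no charset can be reported.
    return None, eliminated


danish = "På dansk: æ ø å".encode("cp1252")
charset, dropped = detect_by_elimination(danish)
```

For the Danish sample, charset is "windows-1252" and the other four candidates are dropped. For pure ASCII input nothing is eliminated, so the sketch returns None. That is one plausible reason a detector would report no charset for some pages.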

Available detectors: Charsets marked in red (in the original formatting of this page) were not implemented in the module at M8. On M9 builds starting with 7/15/99, all the red-marked charsets below are working. Although zhpsm and cjkpsm are checked in, they are not yet finished. Please don't use them at this time.

japsm - detect charset among

ISO-2022-JP
EUC-JP
Shift_JIS
UTF-8
Windows-1252
UCS2 (Big Endian and Little Endian)

kopsm - detect charset among

EUC-KR
ISO-2022-KR
UTF-8
windows-1252
UCS2 (Big Endian and Little Endian)

zhtwpsm - detect charset among

BIG5
x-euc-tw
ISO-2022-CN
UTF-8
windows-1252
UCS2 (Big Endian and Little Endian)

zhcnpsm - detect charset among

GB2312
HZ
ISO-2022-CN
UTF-8
windows-1252
UCS2 (Big Endian and Little Endian)

zhpsm (Not fully ready yet) - detect charset among

BIG5
x-euc-tw
GB2312
HZ
ISO-2022-CN
UTF-8
windows-1252
UCS2 (Big Endian and Little Endian)

cjkpsm (Not fully ready yet) - detect charset among

Shift_JIS
EUC-JP
ISO-2022-JP
EUC-KR
ISO-2022-KR
BIG5
x-euc-tw
GB2312
HZ
ISO-2022-CN
UTF-8
windows-1252
UCS2 (Big Endian and Little Endian)

[The following modules are available only for M9 or later builds starting on 7/15/99]

These modules are based on "The Cyrillic Software Suite" Perl package developed by John Neystadt. For more information, see the news article posted by Frank Tang on 7/14/99.

ruprob -- detects among

KOI8-R
windows-1251
ISO-8859-5
x-mac-cyrillic
IBM866

ukprob - detects among

KOI8-U
windows-1251
ISO-8859-5
x-mac-ukrainian
IBM866
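To give a feel for statistical detection among the Russian charsets listed for ruprob, here is a deliberately crude Python sketch. It decodes the bytes under each candidate and prefers the one that yields the most lowercase Cyrillic letters. This is a stand-in heuristic chosen for illustration, not the algorithm used by ruprob or by the Cyrillic Software Suite. The codec names and function names are assumptions of the sketch.

```python
from typing import List, Tuple

# Charset label -> Python codec name (codec names are this sketch's choice).
RUSSIAN_CHARSETS: List[Tuple[str, str]] = [
    ("KOI8-R", "koi8_r"),
    ("windows-1251", "cp1251"),
    ("ISO-8859-5", "iso8859_5"),
    ("x-mac-cyrillic", "mac_cyrillic"),
    ("IBM866", "cp866"),
]


def lowercase_cyrillic_count(text: str) -> int:
    """Count characters in the basic lowercase Cyrillic range U+0430..U+044F."""
    return sum(1 for ch in text if "\u0430" <= ch <= "\u044f")


def guess_russian_charset(data: bytes) -> str:
    """Pick the candidate whose decoding looks most like lowercase Russian."""
    best_label, best_score = "", -1
    for label, codec in RUSSIAN_CHARSETS:
        score = lowercase_cyrillic_count(data.decode(codec, errors="replace"))
        if score > best_score:  # ties go to the earlier candidate
            best_label, best_score = label, score
    return best_label


koi8_sample = "привет мир, это проверка".encode("koi8_r")
dos_sample = "привет".encode("cp866")
```

With these samples, guess_russian_charset(koi8_sample) gives "KOI8-R" and guess_russian_charset(dos_sample) gives "IBM866". Real text with mixed case, or very short input, would need a proper letter-frequency model.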

Work unfinished at M8 but completed for M9: [All of the tasks below were completed and checked in on 7/14/99, after M8. Use 7/15/99 or later M9 builds to see the fixes.]

1. Need to add verifier for ISO-2022-KR, ISO-2022-CN and HZ:
(Note: although we can now detect these three charsets (M9, starting with the 7/15/99 build), we have not yet implemented the Unicode converters for them. Does anyone want to help? Write to: Frank Tang.)
2. Need to add ISO-2022-KR verifier to kopsm, cjkpsm
3. Need to add ISO-2022-CN verifier to zhtwpsm, zhpsm and cjkpsm
4. Need to add HZ verifier to zhpsm, zhcnpsm and cjkpsm
5. Need to rewrite ISO-2022-JP Verifier to make it work properly w/ ISO-2022-CN and ISO-2022-KR
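The 7-bit encodings in this list are all recognizable by characteristic byte sequences, which is also why the ISO-2022-JP verifier has to cooperate with the ISO-2022-CN and ISO-2022-KR ones: all three start their designators with ESC. The Python sketch below looks for those sequences (the ISO-2022 designators and the HZ "~{" shift). It is a simplification for illustration, not the state-machine verifiers checked into Mozilla, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

# Characteristic byte sequences for each 7-bit encoding.
MARKERS: List[Tuple[str, List[bytes]]] = [
    ("ISO-2022-KR", [b"\x1b$)C"]),
    ("ISO-2022-CN", [b"\x1b$)A", b"\x1b$)G", b"\x1b$*H"]),
    ("ISO-2022-JP", [b"\x1b$@", b"\x1b$B", b"\x1b(J"]),
    ("HZ", [b"~{"]),
]


def guess_7bit_charset(data: bytes) -> Optional[str]:
    """Guess among ISO-2022-KR/CN/JP and HZ, or return None."""
    # These encodings are 7-bit only; any high byte rules all of them out.
    if any(byte >= 0x80 for byte in data):
        return None
    for label, sequences in MARKERS:
        if any(seq in data for seq in sequences):
            return label
    return None
```

Note that plain ASCII text containing "~{" would be misreported as HZ by this sketch. A real verifier also has to check that the bytes following the shift are legal.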

Known problems/requests:

1. The UTF-8 verifier has a problem: it sometimes detects non-UTF-8 pages as UTF-8. [Fix checked in on 7/14/99. Use 7/15/99 or later M9 builds to see the fix.]
2. zhtwpsm will only work as an x-euc-tw detector if the page contains characters from CNS 11643 plane 2.
3. zhpsm and cjkpsm won't detect correctly because of the similar code ranges among EUC-KR, GB2312, x-euc-tw, Big5, EUC-JP, and Shift_JIS. We may eventually remove these two detectors, but for now we've included them in the builds so that we can learn how good or bad they are. It is a known issue that the 'cjkpsm' and 'zhpsm' detectors won't report a charset in some cases, and that 'zhtwpsm' won't report 'x-euc-tw' correctly. We need a second-pass probability analysis for 'cjkpsm', 'zhpsm' and 'zhtwpsm'. Please don't report bugs for them; they are incomplete for now.
4. Is there any Vietnamese software that can detect the charset among VIQR, VNI, VPS, and VISCII? Write to: Frank Tang.
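The overlap problem in item 3 is easy to reproduce. The Python sketch below tries one two-byte sequence against several of the East Asian encodings and lists every codec that accepts it. It uses Python's standard codecs, which do not include x-euc-tw, so that charset is left out. The helper name is ours.

```python
from typing import List, Sequence

# Python codec names for encodings with overlapping code ranges.
OVERLAPPING = ["euc_kr", "gb2312", "big5", "euc_jp", "shift_jis"]


def decodable_as(data: bytes, codecs_to_try: Sequence[str] = OVERLAPPING) -> List[str]:
    """Return the codecs under which data is a legal byte sequence."""
    accepted = []
    for codec in codecs_to_try:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        accepted.append(codec)
    return accepted


ambiguous = decodable_as(b"\xb0\xa1")
```

The bytes 0xB0 0xA1 are legal in all five encodings, so legality alone cannot separate them. That is why some additional statistical pass is needed.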

Bug reporting procedures: File your bug in Bugzilla. File it against the Browser, assign it to ftang@netscape.com, and set the Component to "International".

1. Please test against japsm, kopsm, zhcnpsm and zhtwpsm, but not zhpsm and cjkpsm. These two detectors are not finished yet and need additional work. [If you use an M9 build later than 7/14/99, you can also use ruprob and ukprob.]
2. When you report a bug, please include the DOS/prompt window output. You can attach it as follows:
   a. Click the upper-left DOS icon in the DOS window, and press control-k control-e.
   b. Highlight the relevant portion of the output text by clicking at its start and clicking again at its end.
   c. Click the upper-left DOS icon in the DOS window, and press control-k control-y.
   d. Paste the text into your Bugzilla report.
   e. This helps us a lot, since the DOS output tells us at which byte the detector went into an error state for a particular charset.
