How to build standalone universal charset detector from Mozilla source

version: 0.3

Author: Shanjian Li

Date: November 06, 2002

Target Audience

This documentation was prepared for people who want to use the Mozilla universal charset detector in their own project. To be able to perform all the work, you should know:
a. how to use cvs or ftp to get code from mozilla.org
b. C/C++
c. how to write a make file for your environment

Introduction

The universal charset detector was implemented without any dependency on other Mozilla modules. It was later wrapped in an XPCOM interface, and other Mozilla modules reference it through that wrapper. The wrapper is completely separate from the implementation class, so it should be fairly easy to use the universal detector in your own project.

In many situations it is not possible or practical for the caller to provide all the data at one time. In fact it is not really necessary to do so, because a reasonable conclusion can be reached without a lot of data. A stream-based implementation is more suitable for many applications. Since more and more people are showing interest in this module, we are considering providing it as a standalone component. This simple documentation is provided until that work is done.

Step 1, Get the code

All the source files of the universal charset detector can be found in the Mozilla open source project. You can download the Mozilla source code by following the instructions found at "http://www.mozilla.org/source.html". You can use either ftp or cvs to get the source code. If you are familiar with cvs, you don't have to download the whole tree. The "mozilla\extensions\universalchardet" directory contains everything you need. If you want to build and try the detector as it stands, you need to get the whole source tree and build Mozilla, because the test program and the existing wrapper depend on XPCOM modules.

All the source files for the universal charset detector implementation can be found in the "mozilla\extensions\universalchardet\src" directory. Most people won't need "nsUniversalCharDetModule.cpp" and "nsUniversalCharDetDll.h". (Those 2 files are used to make the universal charset detector a dll, but they do it the XPCOM way, so most likely you will need to do it differently.)

Step 2, Write a wrapper class

The major work is to implement your own wrapper class. You need to comment out the wrapper classes in "nsUniversalDetector.h" and "nsUniversalDetector.cpp" and implement your own in a different file. Class "nsUniversalDetector" is the top-level internal implementation class. You need to create a new class that uses this class as its base class and overrides "void Report(const char* aCharset)" to provide your own charset notification mechanism. A more straightforward approach is to check the detection result after each piece of data you feed in has been processed.

We have 2 wrapper classes in nsUniversalDetector.cpp. One is stream based, the other is buffer based. Buffer based means all data is available at the time the detecting function is called. You can use either wrapper class as a reference when implementing your own. Refer to the APIs section for details.
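As a rough illustration of the approach described in this step, here is a minimal sketch of a wrapper that stores the reported charset so the caller can check it after each feed. To keep the sketch compilable on its own, it includes a stand-in "nsUniversalDetector" that only recognizes a UTF-8 byte order mark; the real class would come from the Mozilla source. The names MyCharsetDetector, Feed, Finish, Restart, HasResult and Charset are hypothetical, and the PRUint32 typedef is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

typedef std::uint32_t PRUint32;  // assumption: stand-in for the NSPR type

// Stand-in for the real nsUniversalDetector so this sketch compiles alone.
// Only the four method signatures come from the APIs section; the body is
// a placeholder that recognizes a UTF-8 byte order mark and nothing else.
class nsUniversalDetector {
public:
  nsUniversalDetector() : mDone(false) {}
  virtual ~nsUniversalDetector() {}
  virtual void HandleData(const char* aBuf, PRUint32 aLen) {
    if (!mDone && aLen >= 3 && std::memcmp(aBuf, "\xEF\xBB\xBF", 3) == 0) {
      mDone = true;
      Report("UTF-8");
    }
  }
  virtual void DataEnd(void) {}
protected:
  virtual void Report(const char* aCharset) = 0;
  virtual void Reset() { mDone = false; }
private:
  bool mDone;
};

// Hypothetical wrapper: remembers whatever charset the detector reports.
class MyCharsetDetector : public nsUniversalDetector {
public:
  void Feed(const char* aBuf, PRUint32 aLen) {
    assert(aBuf != nullptr);
    HandleData(aBuf, aLen);
  }
  void Finish() { DataEnd(); }
  // Forget the previous result and clear the detector's internal state.
  void Restart() {
    mCharset.clear();
    Reset();
  }
  bool HasResult() const { return !mCharset.empty(); }
  const std::string& Charset() const { return mCharset; }
protected:
  // Called by the detector once it has reached a conclusion.
  virtual void Report(const char* aCharset) { mCharset = aCharset; }
private:
  std::string mCharset;
};
```

A caller would create one MyCharsetDetector, call Feed for each piece of data, and test HasResult after each call. Storing the result instead of firing a callback is just one possible design; it matches the "check the result after feeding" approach mentioned above.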

What about C

For a C implementation, a context can be used to pass around the internal implementation class. Here is one design we used in another project, for your reference. (This is stream based. A buffer-based version should be straightforward.)

ErrCode initCharsetDetector(LibraryContext libraryContext,
                            charsetDetectorContext* outHandle);
ErrCode releaseCharsetDetector(LibraryContext libraryContext,
                               charsetDetectorContext inHandle);
ErrCode resetCharsetDetector(LibraryContext libraryContext,
                             charsetDetectorContext inHandle);
ErrCode streamDetectCharset(LibraryContext libraryContext,
                            charsetDetectorContext inHandle,
                            const char* inbytes, UINT32 inLen,
                            DetectStatus* detectStatusOut,
                            Encoding* charsetEncodingOut);
ErrCode finishStreamDetectCharset(LibraryContext libraryContext,
                                  charsetDetectorContext inHandle,
                                  DetectStatus* detectStatusOut,
                                  Encoding* charsetEncodingOut);
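
One possible way to back these prototypes with a C++ object is sketched below. Only the five function names and their parameter lists come from the prototypes above. Everything else is an assumption made so the sketch compiles: the definitions of ErrCode, LibraryContext, DetectStatus, Encoding and UINT32, the error and status constants, and the DetectorImpl class, which is a placeholder for a real wrapper around nsUniversalDetector and only recognizes a UTF-8 byte order mark.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <new>

// All definitions in this section are assumptions for illustration.
typedef std::uint32_t UINT32;
typedef int ErrCode;
enum { kOk = 0, kBadArg = 1, kNoMemory = 2 };
typedef void* LibraryContext;  // not used by this sketch
enum DetectStatus { kDetecting, kDetected, kNotDetected };
enum Encoding { kEncodingUnknown, kEncodingUTF8 };

// Placeholder for a wrapper class built on nsUniversalDetector.
struct DetectorImpl {
  bool found;
  DetectorImpl() : found(false) {}
  void HandleData(const char* buf, UINT32 len) {
    if (len >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0) found = true;
  }
  void Reset() { found = false; }
};

// The handle is just a pointer to the C++ object; C callers never look inside.
typedef DetectorImpl* charsetDetectorContext;

// Translate the object's state into the two output parameters.
static void fillResult(const DetectorImpl* d, bool atEnd,
                       DetectStatus* status, Encoding* enc) {
  assert(d && status && enc);
  *status = d->found ? kDetected : (atEnd ? kNotDetected : kDetecting);
  *enc = d->found ? kEncodingUTF8 : kEncodingUnknown;
}

extern "C" {

ErrCode initCharsetDetector(LibraryContext, charsetDetectorContext* outHandle) {
  if (!outHandle) return kBadArg;
  *outHandle = new (std::nothrow) DetectorImpl();
  return *outHandle ? kOk : kNoMemory;
}

ErrCode releaseCharsetDetector(LibraryContext, charsetDetectorContext inHandle) {
  delete inHandle;  // deleting a null handle is harmless
  return kOk;
}

ErrCode resetCharsetDetector(LibraryContext, charsetDetectorContext inHandle) {
  if (!inHandle) return kBadArg;
  inHandle->Reset();
  return kOk;
}

ErrCode streamDetectCharset(LibraryContext, charsetDetectorContext inHandle,
                            const char* inbytes, UINT32 inLen,
                            DetectStatus* detectStatusOut,
                            Encoding* charsetEncodingOut) {
  if (!inHandle || !inbytes || !detectStatusOut || !charsetEncodingOut)
    return kBadArg;
  inHandle->HandleData(inbytes, inLen);
  fillResult(inHandle, false, detectStatusOut, charsetEncodingOut);
  return kOk;
}

ErrCode finishStreamDetectCharset(LibraryContext, charsetDetectorContext inHandle,
                                  DetectStatus* detectStatusOut,
                                  Encoding* charsetEncodingOut) {
  if (!inHandle || !detectStatusOut || !charsetEncodingOut) return kBadArg;
  fillResult(inHandle, true, detectStatusOut, charsetEncodingOut);
  return kOk;
}

}  // extern "C"
```

The calling sequence would be init, any number of streamDetectCharset calls, finishStreamDetectCharset, then release. In a header meant for C callers the handle would be declared as a pointer to an incomplete struct rather than to the C++ class.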

Step 3, Prepare your own make file

No special attention is needed to compile those files. Because every OS is different and every project has its own practices, you need to write your own make file or modify an existing one to include the files.

Step 4, Fix errors and run

After the wrapper class is done, you will hit some compile-time errors such as undefined macros and type names. We recommend that you redefine those things in a new header file and avoid touching the other implementation files. In our experience, only nsUniversalDetector.h and nsUniversalDetector.cpp need to be modified. It is always a good idea to prepare a small test program and try the detector with some sample data.

If you have a better idea or implementation, please let us know so that we can improve the code together and everyone else can benefit from your improvement.
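
A compatibility header of the kind suggested here might look like the sketch below. The NSPR-style names it defines (PRUint32, PRInt32, PRUint16, PRInt16, PRUint8, PRBool, PR_TRUE, PR_FALSE, nsnull) are ones that Mozilla code of this period commonly uses, but which of them your compiler actually reports as undefined depends on the source snapshot you have, so treat the list as an assumption and extend it as errors appear. The file name chardet_compat.h and the helper IsNonEmptyBuffer are hypothetical.

```cpp
// chardet_compat.h (hypothetical file name)
// Stand-ins for NSPR/XPCOM types and macros, so the detector sources can be
// compiled without the rest of the Mozilla tree.
#include <cassert>
#include <cstdint>

typedef std::uint32_t PRUint32;
typedef std::int32_t  PRInt32;
typedef std::uint16_t PRUint16;
typedef std::int16_t  PRInt16;
typedef std::uint8_t  PRUint8;
typedef int           PRBool;

#define PR_TRUE  1
#define PR_FALSE 0
#define nsnull   0

// Catch a wrong typedef at compile time instead of at run time.
static_assert(sizeof(PRUint32) == 4, "PRUint32 must be 32 bits wide");
static_assert(sizeof(PRUint16) == 2, "PRUint16 must be 16 bits wide");

// Hypothetical helper, only here to show the names above in use.
inline PRBool IsNonEmptyBuffer(const char* aBuf, PRUint32 aLen) {
  return (aBuf != nsnull && aLen > 0) ? PR_TRUE : PR_FALSE;
}
```

Including one header like this at the top of the detector sources keeps the changes to the implementation files small, which makes it easier to pick up later fixes from the Mozilla tree.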

APIs

Class "nsUniversalDetector" is the class of most interest when implementing a wrapper. There are 4 methods you need to use:
public:
   virtual void HandleData(const char* aBuf, PRUint32 aLen);
   virtual void DataEnd(void);
protected:
   virtual void Report(const char* aCharset) = 0;
   virtual void Reset();
HandleData sends data to the detector for detection; its arguments are a pointer to the data and the data length. DataEnd notifies the detector that no further data is available. The wrapper class should implement Report in order to provide a charset notification mechanism. Reset clears all internal state of the charset detector. Report and Reset are defined as protected methods to prevent them from being called directly by the application. They should only be called from within the implementation and the wrapper.
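
The call order of these four methods can be sketched as follows: reset, feed chunks through HandleData until something is reported, then call DataEnd. The "nsUniversalDetector" class in this sketch is a stand-in so that the code compiles on its own. Its detection logic is invented for the demonstration (a UTF-8 byte order mark is reported at once, and data with no high bytes is reported as US-ASCII at DataEnd) and is not the real algorithm. ResultHolder, Restart, Charset and DetectFromChunks are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

typedef std::uint32_t PRUint32;  // assumption: stand-in for the NSPR type

// Stand-in with the same four methods as the real class; placeholder logic.
class nsUniversalDetector {
public:
  nsUniversalDetector() : mDone(false), mSawHighByte(false) {}
  virtual ~nsUniversalDetector() {}
  virtual void HandleData(const char* aBuf, PRUint32 aLen) {
    if (mDone) return;
    if (aLen >= 3 && std::memcmp(aBuf, "\xEF\xBB\xBF", 3) == 0) {
      mDone = true;
      Report("UTF-8");
      return;
    }
    for (PRUint32 i = 0; i < aLen; ++i)
      if (static_cast<unsigned char>(aBuf[i]) >= 0x80) mSawHighByte = true;
  }
  virtual void DataEnd(void) {
    if (!mDone && !mSawHighByte) {
      mDone = true;
      Report("US-ASCII");
    }
  }
protected:
  virtual void Report(const char* aCharset) = 0;
  virtual void Reset() { mDone = false; mSawHighByte = false; }
private:
  bool mDone;
  bool mSawHighByte;
};

// Hypothetical wrapper that exposes Reset through a public method.
class ResultHolder : public nsUniversalDetector {
public:
  const std::string& Charset() const { return mCharset; }
  void Restart() { mCharset.clear(); Reset(); }
protected:
  virtual void Report(const char* aCharset) { mCharset = aCharset; }
private:
  std::string mCharset;
};

// Feed chunks until a charset is reported, then signal the end of the data.
// Returns an empty string when nothing was reported.
std::string DetectFromChunks(ResultHolder& det,
                             const std::vector<std::string>& chunks) {
  det.Restart();
  for (std::size_t i = 0; i < chunks.size() && det.Charset().empty(); ++i) {
    assert(chunks[i].size() <= 0xFFFFFFFFu);
    det.HandleData(chunks[i].data(), static_cast<PRUint32>(chunks[i].size()));
  }
  det.DataEnd();
  return det.Charset();
}
```

Calling DataEnd even after an early result keeps the call sequence uniform. Whether the real detector reports anything at DataEnd depends on its internal state, so the wrapper should be prepared for Report to arrive at either point or not at all.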

Open Issues

Bug 178790, separate the XPCOM wrapper from the charset detector. After this bug lands, the work might be much easier.

To do list

Provide a default wrapper with plain API calls.
Collect text data and build language models for popular Western languages.
For some single-byte encoded languages, 2-char sequence analysis might not work well enough. If that proves to be true, we need to figure out how to implement a high-performance n-char sequence analysis (called n-gram analysis in some existing research).
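
To make the last item concrete, the sketch below counts every run of n consecutive bytes in a piece of text; n = 2 gives the 2-char sequences mentioned above. This is only a naive illustration of what an n-char sequence is. It is not how the detector stores or evaluates its language models, and the function name CountNGrams is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Count each sequence of n consecutive bytes in text.
// Returns an empty map when text is shorter than n.
std::map<std::string, int> CountNGrams(const std::string& text, std::size_t n) {
  assert(n > 0);
  std::map<std::string, int> counts;
  for (std::size_t i = 0; i + n <= text.size(); ++i)
    ++counts[text.substr(i, n)];
  return counts;
}
```

A map keyed by strings grows quickly as n increases, which is one reason a high-performance n-char analysis would need a more compact representation than this.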

Feedback

For questions and further information, please contact Frank Tang (ftang@netscape.com) or Shanjian Li (shanjian@netscape.com).

Other References

In the "\mozilla\extensions\universalchardet\doc" directory you can find 2 documents about this implementation. CharsetInterface.htm is simple code documentation. UniversalCharsetDetection.doc is my paper for the Unicode conference; it describes the general idea of auto detection.