How to build a standalone universal charset detector from Mozilla source
Target Audience
This documentation was prepared for people who want to use the Mozilla universal charset detector in their own projects. To be able to do all the work, you should know:
a. how to use cvs or ftp to get code from mozilla.org
c. how to write a make file in your environment
In many situations, it is not possible or practical for the caller to provide all the data at one time. In fact, it is not really necessary to do so, because a reasonable conclusion can be reached without a lot of data. A stream-based implementation is more suitable for many applications. Since more and more people are showing interest in this module, we are considering providing it as a standalone component. This simple documentation is provided until that work is done.
Step 1, Get the code
All the source files of the universal charset detector can be found in the Mozilla open source project. You can download the Mozilla source code by following the instructions found at "http://www.mozilla.org/source.html". You can use either ftp or cvs to get the source code. If you are familiar with cvs, you don't have to download the whole tree. The "mozilla\extensions\universalchardet" directory contains everything you need. If you want to build and try the detector as it is, you need to get the whole source tree and build Mozilla, because the test program and the existing wrapper depend on XPCOM modules.
All the source files for the universal charset detector implementation can be found in the "mozilla\extensions\universalchardet\src" directory. Most people won't need "nsUniversalCharDetModule.cpp" and "nsUniversalCharDetDll.h". (Those 2 files are used to make the universal charset detector a dll, but they follow the XPCOM way, so most likely you will need to do it differently.)
Step 2, write a wrapper class
The major work is implementing your own wrapper class. You need to comment out the wrapper classes in "nsUniversalDetector.h" and "nsUniversalDetector.cpp", and implement your own in a different file. Class "nsUniversalDetector" is the top-level internal implementation class. You need to create a new class that uses this class as its base class, and override "void Report(const char* aCharset)" to provide your own charset notification mechanism. A more straightforward approach is to check the detection result after each chunk of data has been fed in and processed.
We have 2 wrapper classes in nsUniversalDetector.cpp. One is stream based, the other is buffer based. Buffer based means all data is available at the time the detection function is called. You can use either wrapper class as a reference to implement your own. Refer to the APIs section.
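As a rough illustration of the wrapper idea, here is a minimal sketch. It cannot include the real Mozilla sources, so it uses a hypothetical stand-in base class, StandInDetector, that only mirrors the four methods listed in the APIs section and recognizes nothing but a UTF-8 byte order mark. The names MyCharsetDetector, Feed, Finish, Charset and DetectBuffer are made up for this example. In a real port the base class would be nsUniversalDetector and aLen would be a PRUint32.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Stand-in for nsUniversalDetector (NOT the Mozilla class). It only mirrors
// the four methods named in this document and recognizes nothing but a UTF-8
// byte order mark at the start of the data, so the wrapper can be exercised.
class StandInDetector {
public:
  virtual ~StandInDetector() {}
  virtual void HandleData(const char* aBuf, unsigned int aLen) {
    if (mStart && aLen >= 3 && std::memcmp(aBuf, "\xEF\xBB\xBF", 3) == 0)
      mDetected = "UTF-8";
    if (aLen > 0) mStart = false;  // only the very first bytes are inspected
  }
  virtual void DataEnd() { if (mDetected) Report(mDetected); }
protected:
  virtual void Report(const char* aCharset) = 0;
  virtual void Reset() { mDetected = nullptr; mStart = true; }
private:
  const char* mDetected = nullptr;
  bool mStart = true;
};

// Hypothetical wrapper: derives from the detector and overrides Report()
// to remember the detected charset name instead of notifying XPCOM.
class MyCharsetDetector : public StandInDetector {
public:
  // Feed one chunk of data; may be called repeatedly.
  void Feed(const char* aBuf, unsigned int aLen) {
    assert(aBuf != nullptr || aLen == 0);
    HandleData(aBuf, aLen);
  }
  // Signal end of input; the base class calls Report() if it has a result.
  void Finish() { DataEnd(); }
  // Empty string means no charset was reported.
  const std::string& Charset() const { return mCharset; }
protected:
  void Report(const char* aCharset) override { mCharset = aCharset; }
private:
  std::string mCharset;
};

// Buffer-based convenience: all data is available in one call.
std::string DetectBuffer(const char* aBuf, unsigned int aLen) {
  MyCharsetDetector detector;
  detector.Feed(aBuf, aLen);
  detector.Finish();
  return detector.Charset();
}
```

Storing the result in a member keeps the caller free of callbacks: it feeds data, calls Finish, and then asks for the charset.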
What about C
For a C implementation, a context can be used to pass around the internal
implementation class. Here is one design we used in another project for
your reference. (This is stream based. Buffer based should be
ErrCode initCharsetDetector(LibraryContext libraryContext,
ErrCode releaseCharsetDetector(LibraryContext libraryContext,
ErrCode resetCharsetDetector( LibraryContext libraryContext,
ErrCode streamDetectCharset( LibraryContext libraryContext,
const char* inbytes, UINT32 inLen,
ErrCode finishStreamDetectCharset(LibraryContext libraryContext,
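The context-passing idea can be sketched as follows. This is only an assumption about how such a layer might look: the function names (chardet_create, chardet_feed, chardet_finish, chardet_reset, chardet_destroy), the ChardetContext type and the int return codes are all hypothetical and are not the prototypes listed above. StandInDetector is again a made-up stand-in for nsUniversalDetector that only recognizes a UTF-8 byte order mark.

```cpp
#include <cassert>
#include <cstring>
#include <new>
#include <string>

// Stand-in for nsUniversalDetector (NOT the Mozilla class); it only
// recognizes a UTF-8 byte order mark at the start of the data.
class StandInDetector {
public:
  virtual ~StandInDetector() {}
  virtual void HandleData(const char* aBuf, unsigned int aLen) {
    if (mStart && aLen >= 3 && std::memcmp(aBuf, "\xEF\xBB\xBF", 3) == 0)
      mDetected = "UTF-8";
    if (aLen > 0) mStart = false;
  }
  virtual void DataEnd() { if (mDetected) Report(mDetected); }
protected:
  virtual void Report(const char* aCharset) = 0;
  virtual void Reset() { mDetected = nullptr; mStart = true; }
private:
  const char* mDetected = nullptr;
  bool mStart = true;
};

// C++ wrapper hidden behind the C interface.
class ContextDetector : public StandInDetector {
public:
  void Restart() { Reset(); mCharset.clear(); }
  const std::string& Charset() const { return mCharset; }
protected:
  void Report(const char* aCharset) override { mCharset = aCharset; }
private:
  std::string mCharset;
};

// The context is opaque to C callers; a C header would only declare
// "typedef struct ChardetContext ChardetContext;" and the functions below.
struct ChardetContext { ContextDetector detector; };

extern "C" {

// Returns NULL if allocation fails.
ChardetContext* chardet_create(void) {
  return new (std::nothrow) ChardetContext;
}

void chardet_destroy(ChardetContext* ctx) { delete ctx; }

// Feeds one chunk of data. Returns 0 on success, -1 on bad arguments.
int chardet_feed(ChardetContext* ctx, const char* bytes, unsigned int len) {
  if (!ctx || (!bytes && len > 0)) return -1;
  ctx->detector.HandleData(bytes, len);
  return 0;
}

// Ends the stream. Returns the charset name, or NULL if none was reported.
// The pointer stays valid until the context is reset or destroyed.
const char* chardet_finish(ChardetContext* ctx) {
  if (!ctx) return NULL;
  ctx->detector.DataEnd();
  const std::string& name = ctx->detector.Charset();
  return name.empty() ? NULL : name.c_str();
}

// Clears all state so the context can be reused for another stream.
int chardet_reset(ChardetContext* ctx) {
  if (!ctx) return -1;
  ctx->detector.Restart();
  return 0;
}

}  // extern "C"
```

The C caller never sees the C++ class. It only holds the context pointer and passes it back on every call.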
Step 3, Prepare your own make file
No special attention is needed to compile these files. Because every OS is different, and every project has its own practices, you need to write your own make file or modify an existing one to include the files.
Step 4, Fix errors and run
After the wrapper class is done, you will hit some compile time errors such as undefined macros and type names. We recommend that you redefine those things in a new header file and try to avoid touching the other implementation files. In our experience, only nsUniversalDetector.h and nsUniversalDetector.cpp need to be modified. It is always a good idea to prepare a small test program and try the detector with some sample data.
If you have a better idea or implementation, please let us know and we can improve the code together so that everyone else will benefit from your improvement.
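A new header of that kind might look like the sketch below. PRUint32 appears in the detector's API. The other names (PRUint8, PRInt32, PRBool, PR_TRUE, PR_FALSE, nsnull) are NSPR and Mozilla names that we assume the sources may also use, so let the compiler errors tell you which ones you really need. The file name chardet_compat.h and the CheckCompatTypes helper are hypothetical.

```cpp
// chardet_compat.h (hypothetical name): minimal replacements for the
// NSPR/XPCOM definitions, for building outside the Mozilla tree.
#ifndef CHARDET_COMPAT_H
#define CHARDET_COMPAT_H

#include <cassert>
#include <cstdint>

// Fixed-width integer types, named as in NSPR.
typedef std::uint8_t  PRUint8;
typedef std::int32_t  PRInt32;
typedef std::uint32_t PRUint32;

// NSPR defines PRBool as an integer type, so int is used here.
typedef int PRBool;

#ifndef PR_TRUE
#define PR_TRUE 1
#endif
#ifndef PR_FALSE
#define PR_FALSE 0
#endif

// Null pointer constant used by Mozilla code.
#ifndef nsnull
#define nsnull 0
#endif

// Fails at compile time if the 32-bit type has the wrong width.
static_assert(sizeof(PRUint32) == 4, "PRUint32 must be 32 bits wide");

// Small run-time self-check that a test program can call.
inline void CheckCompatTypes() {
  assert(PR_TRUE && !PR_FALSE);
  assert(sizeof(PRInt32) == sizeof(PRUint32));
}

#endif  // CHARDET_COMPAT_H
```

Keeping these definitions in one header means the implementation files only need their include lines changed.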
APIs
Class "nsUniversalDetector" is the class you will be most interested in when implementing a wrapper. There are 4 methods you need to use:
virtual void HandleData(const char* aBuf, PRUint32 aLen);
virtual void DataEnd(void);
virtual void Report(const char* aCharset) = 0;
virtual void Reset();
HandleData sends data to the detector for detection; its arguments are self-explanatory. DataEnd notifies the detector that no further data is available. The wrapper class should implement Report in order to provide a charset notification mechanism. Reset clears all internal state of the charset detector. Report and Reset are defined as protected methods to prevent them from being called directly by the application. They should only be called from within the implementation and the wrapper.
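The call order described above (Reset, then HandleData for each chunk, then DataEnd) can be sketched for a stream as follows. StandInDetector is a hypothetical stand-in for nsUniversalDetector that only recognizes a UTF-8 byte order mark. StreamDetector, Restart, Charset and DetectStream are names invented for this example, and the 4096-byte chunk size is arbitrary.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Stand-in for nsUniversalDetector (NOT the Mozilla class); it only
// recognizes a UTF-8 byte order mark at the start of the data.
class StandInDetector {
public:
  virtual ~StandInDetector() {}
  virtual void HandleData(const char* aBuf, unsigned int aLen) {
    if (mStart && aLen >= 3 && std::memcmp(aBuf, "\xEF\xBB\xBF", 3) == 0)
      mDetected = "UTF-8";
    if (aLen > 0) mStart = false;
  }
  virtual void DataEnd() { if (mDetected) Report(mDetected); }
protected:
  virtual void Report(const char* aCharset) = 0;
  virtual void Reset() { mDetected = nullptr; mStart = true; }
private:
  const char* mDetected = nullptr;
  bool mStart = true;
};

// Wrapper that exposes the protected Reset() through a public Restart().
class StreamDetector : public StandInDetector {
public:
  void Restart() { Reset(); mCharset.clear(); }
  const std::string& Charset() const { return mCharset; }
protected:
  void Report(const char* aCharset) override { mCharset = aCharset; }
private:
  std::string mCharset;
};

// Runs one complete detection over a stream, reading it in chunks.
// Returns an empty string if no charset was reported.
std::string DetectStream(std::istream& in, StreamDetector& detector) {
  detector.Restart();  // clear state left over from an earlier stream
  char buf[4096];
  // read() fails on the last, partial chunk, but gcount() still tells
  // how many bytes arrived, so that chunk is fed as well.
  while (in.read(buf, sizeof buf) || in.gcount() > 0)
    detector.HandleData(buf, static_cast<unsigned int>(in.gcount()));
  detector.DataEnd();  // no further data; triggers Report() if possible
  return detector.Charset();
}
```

Because the wrapper resets the detector at the start of each run, one detector object can be reused for many streams.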
- Bug 178790, separate the XPCOM wrapper and the charset detector. After the fix for this bug gets in, the
work might be much easier.
To do list
- Provide a default wrapper with plain API calls.
- Collect text data and build language models for popular Western languages.
- For some single-byte encoded languages, 2-char sequence analysis
might not work well enough. If that proves to be true, we need to
figure out how to implement a high-performance n-char sequence
analysis (called n-gram analysis in some existing research).