Character Set Converters

by Catalin Rotaru <cata@netscape.com>
Last Modified: 14/Dec/1998

Introduction

From a user's point of view, a human-readable string is an array of characters. To store this text in a computer, however, an encoding (character set) must be used. Internally, NGLayout uses Unicode. The page author may use a different character set, though, and the font author may use yet another one. So our system must be able to first convert data from the input character set into the internal encoding (Unicode), and then into the output character set in order to do the rendering. This is what the Character Set Converters are for: converting data between various encodings. One thing to keep in mind is that a character set is not a converter. A character set is a name, a label for an encoding. A type, if you want. A converter is a piece of code able to convert data between two different encodings.
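The two-step flow described above (input charset into Unicode, then Unicode into the output charset) can be sketched as follows. This is only an illustration in standard C++, not NGLayout code: the function names are made up, ISO-8859-1 stands in for the page author's charset, and UTF-8 stands in for whatever output charset the renderer needs.

```cpp
#include <cassert>
#include <string>

// Step 1: decode ISO-8859-1 bytes into Unicode code points.
// Latin-1 is handy for a sketch: each byte value equals its code point.
std::u32string DecodeLatin1(const std::string& bytes) {
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Step 2: encode Unicode code points into the output charset (UTF-8 here).
std::string EncodeUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);  // outside the Unicode range is a caller bug
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// The full pipeline: input charset -> Unicode -> output charset.
std::string Latin1ToUtf8(const std::string& bytes) {
    return EncodeUtf8(DecodeLatin1(bytes));
}
```

Note that neither step knows about the other: Unicode is the only thing they share, which is why one decoder per input charset and one encoder per output charset are enough.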

Design & Architecture

The Character Set Converter module contains two main components:

The ConverterManager - implementing nsICharsetConverterManager

This guy is responsible for managing all those converters.
It will: resolve charset aliases into canonical names, maintain a mapping between converters and the charsets they convert from and into, return a list of all the encodings for which we have a converter, and so on.

The Converter(s) - implementing nsICharsetConverter and its factory implementing nsICharsetConverterInfo

The converter converts data between two character sets.
The Charset Converter Info is a short description of the converter: which charsets it converts between.
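To make the manager's bookkeeping more concrete, here is a minimal sketch of alias resolution plus a (source, destination) registry. It is not the real ConverterManager: the class name SketchConverterManager, its methods, and the use of std::string instead of nsString are all assumptions made for illustration. The one behavior borrowed from the interface documentation further down is that an unknown name resolves to itself.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the manager's bookkeeping, for illustration only.
class SketchConverterManager {
public:
    // Records that `alias` is another name for `canonical`.
    void AddAlias(const std::string& alias, const std::string& canonical) {
        assert(!alias.empty() && !canonical.empty());
        aliases_[alias] = canonical;
    }

    // Resolves an alias into its canonical name. Unknown names are
    // returned unchanged, so new charsets are never rejected.
    std::string ResolveName(const std::string& name) const {
        const auto it = aliases_.find(name);
        return it == aliases_.end() ? name : it->second;
    }

    // Records that a converter exists from `src` into `dest`.
    void RegisterConverter(const std::string& src, const std::string& dest) {
        pairs_.emplace(ResolveName(src), ResolveName(dest));
    }

    // Looks a converter up by any alias of its source and destination.
    bool HasConverter(const std::string& src, const std::string& dest) const {
        return pairs_.count(std::make_pair(ResolveName(src),
                                           ResolveName(dest))) != 0;
    }

private:
    std::map<std::string, std::string> aliases_;
    std::set<std::pair<std::string, std::string>> pairs_;
};
```

Storing only canonical names in the registry means a converter registered under one alias can be found under any other alias of the same charset.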

Extensibility

Our main goal for the new model is to have full drop-in extensibility for the converters and their corresponding charsets. That means that if a user adds a plugin Converter(FooCharset => Unicode), that charset will have full rights: for example, it will appear in the [View.Character Set] menu and the converter will be used to decode incoming data encoded in the FooCharset.
The reason for this goal is that encodings are usually grouped on a per-language basis. Instead of gathering all the known converters and shipping a converter library containing all known charsets (this can get quite big in time!), we'd rather offer a basic distribution containing the most used converters, and per-language support through SmartUpdate or Plugins. This also gives users the possibility to add converters for the Foo legacy encoding, which is not known or used enough to be included in a Netscape distribution.

XXX Further documentation to be added here as the extensibility mechanisms are resolved at the XPCOM level.

High-level API

This API is expected to be used by most users. It should give very easy access to the most common converter functions. It should work at the stream level: for example, something like new UnicodeInputStream(String * aCharset), or new String(byte * aBuffer, String * aCharset). You get the idea: type safety and all, simplicity. The Converter Manager is well hidden under the hood, and you can very well ignore it if you don't need the extra functionality. Hell, you don't even know you are using a Converter!
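Since the high-level API is not designed yet, the following is only a guess at the flavor of new String(byte * aBuffer, String * aCharset): one call that takes bytes plus a charset name and hands back Unicode text. Everything here is hypothetical (ToUnicode, the Sketch* helpers, the two built-in charsets, and throwing an exception for an unknown charset); the only point is that the caller never sees a manager or a converter object.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical decoder signature: raw bytes in, Unicode code points out.
using SketchDecoder = std::u32string (*)(const std::string&);

// US-ASCII: bytes above 0x7F are invalid and become U+FFFD.
std::u32string SketchDecodeAscii(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(b < 0x80 ? static_cast<char32_t>(b) : char32_t(0xFFFD));
    }
    return out;
}

// ISO-8859-1: every byte value equals its Unicode code point.
std::u32string SketchDecodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// The only function a high-level user would call.
std::u32string ToUnicode(const std::string& bytes, const std::string& charset) {
    // This private table plays the role of the hidden Converter Manager.
    static const std::map<std::string, SketchDecoder> decoders = {
        {"US-ASCII", &SketchDecodeAscii},
        {"ISO-8859-1", &SketchDecodeLatin1},
    };
    const auto it = decoders.find(charset);
    if (it == decoders.end()) {
        throw std::invalid_argument("no converter for charset: " + charset);
    }
    assert(it->second != nullptr);  // table entries are never null
    return it->second(bytes);
}
```

A call then looks like ToUnicode(buffer, "ISO-8859-1"), with no converter lookup or buffer management visible to the caller.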

XXX Further documentation to be added here as the high level API is designed.

Low-level API

This API is the most powerful and the most general one. It gives you direct access to the converters. The downside is that you must be extra careful here with the data types, and you have to manage more complexity.

First you get a character set converter from the converter manager using the following API:

/**
* Interface for a Manager of Charset Converters.
*
* @created 17/Nov/1998
* @author Catalin Rotaru [CATA]
*/
class nsICharsetConverterManager : public nsISupports
{
public:

/**
   * Finds a Converter between the source and the destination character
   * sets.
   *
   * @param aSrc    [IN] the known name/alias of the source character set
   * @param aDest   [IN] the known name/alias of the destination character set
   * @param aResult [OUT] the character set converter
   * @return        NS_CONVERTER_NOT_FOUND if no converter was found for
   *                these charsets
   */
NS_IMETHOD GetConverter(const nsString * aSrc, const nsString * aDest,
      nsICharsetConverter ** aResult) = 0;

/**
   * Returns a list of character sets for which we have converters (from the
   * given charset into them).
   *
   * @param aCharset    [IN] the name/alias of the source character set
   * @param aResult     [OUT] a NULL-terminated array of pointers to Strings
   */
NS_IMETHOD GetCharsetsConvertedFrom(const nsString * aCharset,
      nsString ** aResult) = 0;

/**
   * Returns a list of character sets for which we have converters (from them
   * into the given charset).
   *
   * @param aCharset    [IN] the name/alias of the destination character set
   * @param aResult     [OUT] a NULL-terminated array of pointers to Strings
   */
NS_IMETHOD GetCharsetsConvertedInto(const nsString * aCharset,
      nsString ** aResult) = 0;

/**
   * Resolves the canonical name of a character set. If the given name is
   * unknown to the resolver, a new identical string will be returned! This
   * way, new & unknown charsets are not rejected and they are treated as
   * no-aliases charsets.
   *
   * @param aCharset    [IN] the known name/alias of the character set
   * @param aResult     [OUT] the canonical name of the character set
   */
NS_IMETHOD GetCharsetName(const nsString * aCharset,
      nsString ** aResult) = 0;
};

Then you use the Converter with the following API:

/**
* Interface for a Charset Converter.
*
* XXX Compare this interface with the one from the C++ standard
*
* @created 23/Nov/1998
* @author Catalin Rotaru [CATA]
*/
class nsICharsetConverter : public nsISupports
{
public:

/**
   * Converts the data from one character set to another.
   *
   * @param aDest       [IN/OUT] the destination data buffer
   * @param aDestOffset [IN] the offset in the destination data buffer
   * @param aDestLength [IN/OUT] the length of destination data buffer; after
   *                    conversion will contain the number of bytes written
   * @param aSrc        [IN] the source data buffer
   * @param aSrcOffset  [IN] the offset in the source data buffer
   * @param aSrcLength  [IN/OUT] the length of source data buffer; after
   *                    conversion will contain the number of bytes read
   * @param finish      [IN] if this is the last buffer in this conversion;
   *                    the converter has the possibility to write some extra
   *                    data, flush its final state (but only if success!)
   * @return            error code
   */
NS_IMETHOD Convert(char * aDest, PRInt32 aDestOffset, PRInt32 * aDestLength,
      const char * aSrc, PRInt32 aSrcOffset, PRInt32 * aSrcLength,
      PRBool finish) = 0;

/**
* Resets the charset converter so it may be reused on a different buffer.
*/
NS_IMETHOD Reset() = 0;
};
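The IN/OUT length parameters of Convert are the part most likely to trip people up, so here is a sketch of the calling convention using a stand-in converter written in plain C++. It is not a real nsICharsetConverter: the class SketchLatin1ToUtf8, the result codes, and the driver ConvertAll are hypothetical, and the sketch assumes that each length counts the bytes available starting at the given offset (the interface comment does not say).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical result codes; the interface only promises "error code".
enum SketchResult { kSketchOk = 0, kSketchDestFull = 1 };

// Stand-in converter (ISO-8859-1 -> UTF-8) with a Convert-like signature.
class SketchLatin1ToUtf8 {
public:
    // On entry the lengths hold the available bytes; on return they hold
    // the bytes written to aDest and the bytes read from aSrc.
    SketchResult Convert(char* aDest, int32_t aDestOffset, int32_t* aDestLength,
                         const char* aSrc, int32_t aSrcOffset,
                         int32_t* aSrcLength, bool /*finish*/) {
        assert(aDest && aSrc && aDestLength && aSrcLength);
        const int32_t destCapacity = *aDestLength;
        const int32_t srcAvailable = *aSrcLength;
        int32_t written = 0;
        int32_t read = 0;
        SketchResult result = kSketchOk;
        while (read < srcAvailable) {
            const unsigned char b =
                static_cast<unsigned char>(aSrc[aSrcOffset + read]);
            const int32_t needed = (b < 0x80) ? 1 : 2;
            if (written + needed > destCapacity) {
                result = kSketchDestFull;  // stop before a partial character
                break;
            }
            char* out = aDest + aDestOffset + written;
            if (b < 0x80) {
                out[0] = static_cast<char>(b);
            } else {
                out[0] = static_cast<char>(0xC0 | (b >> 6));
                out[1] = static_cast<char>(0x80 | (b & 0x3F));
            }
            written += needed;
            ++read;
        }
        *aDestLength = written;
        *aSrcLength = read;
        return result;
    }

    // This converter keeps no state between calls, so there is nothing to reset.
    void Reset() {}
};

// Drives the converter through a deliberately tiny destination buffer.
std::string ConvertAll(SketchLatin1ToUtf8& conv, const std::string& input) {
    std::string output;
    char chunk[4];
    const int32_t total = static_cast<int32_t>(input.size());
    int32_t srcOffset = 0;
    while (srcOffset < total) {
        int32_t destLength = static_cast<int32_t>(sizeof(chunk));
        int32_t srcLength = total - srcOffset;
        // All remaining input is offered, so this is the last source buffer.
        conv.Convert(chunk, 0, &destLength, input.data(), srcOffset,
                     &srcLength, true);
        output.append(chunk, static_cast<size_t>(destLength));
        srcOffset += srcLength;  // advance by what was actually consumed
    }
    return output;
}
```

The loop advances by the returned source length rather than by the length it asked for; that is the whole reason the length parameters are IN/OUT.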

The converter discovery mechanism uses the following description API, which is implemented by the Converter factory:

/**
* Interface for getting the Charset Converter information.
*
* @created 08/Dec/1998
* @author Catalin Rotaru [CATA]
*/
class nsICharsetConverterInfo : public nsISupports
{
public:

/**
   * Returns the character set this converter is converting from.
   *
   * @param aCharset    [OUT] a name/alias for the source charset
   */
NS_IMETHOD GetCharsetSrc(nsString ** aCharset) = 0;

/**
   * Returns the character set this converter is converting into.
   *
   * @param aCharset    [OUT] a name/alias for the destination charset
   */
NS_IMETHOD GetCharsetDest(nsString ** aCharset) = 0;
};
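As a rough picture of how discovery could work on top of this description API, the sketch below lets every factory describe itself and derives the list of decodable charsets (what the [View.Character Set] menu would show) from those descriptions alone. The types SketchConverterInfo and SimpleInfo and the function CharsetsConvertedInto are hypothetical stand-ins written in standard C++; the real mechanism depends on XPCOM and is not settled yet.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for nsICharsetConverterInfo.
class SketchConverterInfo {
public:
    virtual ~SketchConverterInfo() = default;
    virtual std::string GetCharsetSrc() const = 0;
    virtual std::string GetCharsetDest() const = 0;
};

// A factory description that simply stores its two charset names.
class SimpleInfo : public SketchConverterInfo {
public:
    SimpleInfo(std::string src, std::string dest)
        : src_(std::move(src)), dest_(std::move(dest)) {
        assert(!src_.empty() && !dest_.empty());
    }
    std::string GetCharsetSrc() const override { return src_; }
    std::string GetCharsetDest() const override { return dest_; }

private:
    std::string src_;
    std::string dest_;
};

using SketchFactoryList = std::vector<std::unique_ptr<SketchConverterInfo>>;

// Lists every source charset that some registered converter turns into
// `dest`, in registration order.
std::vector<std::string> CharsetsConvertedInto(const SketchFactoryList& factories,
                                               const std::string& dest) {
    std::vector<std::string> result;
    for (const auto& factory : factories) {
        if (factory->GetCharsetDest() == dest) {
            result.push_back(factory->GetCharsetSrc());
        }
    }
    return result;
}
```

With this shape, dropping in a Converter(FooCharset => Unicode) only means adding one more factory to the list; FooCharset then shows up in the result for "Unicode" without any other code changing.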

How to write and add a new Character Set Converter

XXX Further documentation to be added here once the API is frozen. Until then, if you want to write a new converter, you can get almost all the data you need from the source code! For the rest, please contact me; I'd be more than happy to help and assist you.

Issues

1) Right now a charset is a string, a label. Should this be an interface (ICharset)?
2) Right now the alias resolution service is done by the CharsetConverterManager. Should this be in a different, independent service (CharsetManager)?