You are currently viewing a snapshot of www.mozilla.org taken on April 21, 2008. Most of this content is highly out of date (some pages haven't been updated since the project began in 1998) and exists for historical purposes only. If there are any pages on this archive site that you think should be added back to www.mozilla.org, please file a bug.




Serializer notes

Purpose:

Take a DOM tree and turn it into a serial stream of characters in a given format.

History:

Originally, the Document knew how to write itself to XIF (eXtensible Interchange Format?), an XML dialect which contained HTML plus meta-information. (rickg)
To turn this into html or text, the parser parsed XIF and sent output to "Sink Streams" (nsHTMLContentSinkStream and nsHTMLToTXTSinkStream, in htmlparser) which turned XIF parser nodes into html or plaintext.

For performance reasons, XIF was eliminated. jst translated the sink streams into the current html and plaintext serializers and moved them from the parser to their current home in content. The XML serializer was added later.

Types:

  • nsHTMLSerializer: show HTML source corresponding to a DOM tree.
  • nsPlaintextSerializer: translate an HTML DOM into plaintext.
  • nsXMLContentSerializer: show XML source (I assume?)

Location:

content/base/src/ns*Serializer

Use:

Not intended to be called directly by classes and embedders. We offer simpler interfaces, such as nsIDocumentEncoder and the new interface nsIWebBrowserPersist, which take a mime type so that callers don't have to worry about the details of how to call serializers.

Flags: currently defined in nsIDocumentEncoder but likely to move soon.

    // Output only the selection (as opposed to the whole document).
    OutputSelectionOnly = 1,

    // Plaintext output: Convert html to plaintext that looks like the html.
    // Implies wrap (except inside <pre>), since html wraps.
    // HTML output: always do prettyprinting, ignoring existing formatting.
    // (Probably not well tested for HTML output.)
    OutputFormatted     = 2,

    // OutputRaw is used when copying text from widgets
    OutputRaw           = 4,

    // No html head tags
    OutputBodyOnly      = 8,

    // Output as though the content is preformatted
    // (e.g. maybe it's wrapped in a MOZ_PRE or MOZ_PRE_WRAP style tag)
    OutputPreformatted  = 16,

    // Wrap even if we're not doing formatted output (e.g. for text fields)
    OutputWrap          = 32,

    // Output for format flowed (RFC 2646). This is used when converting
    // to text for mail sending. This differs just slightly
    // but in an important way from normal formatted, and that is that
    // lines are space stuffed. This can't (correctly) be done later.
    OutputFormatFlowed  = 64,

    // Convert links, image src, and script src to absolute URLs when possible
    OutputAbsoluteLinks = 128,

    // Encode entities when outputting to a string.
    // E.g. for a non-breaking space: if set, we'll output &nbsp;
    // if clear, we'll output 0xa0.
    OutputEncodeEntities = 256,

    // LineBreak processing: we can do either platform line breaks,
    // CR, LF, or CRLF.  CRLF is selected by setting both flags.
    // If neither of these flags is set, then we will use platform
    // line breaks.
    OutputCRLineBreak = 512,
    OutputLFLineBreak = 1024,

    // Output the content of noscript elements (only for serializing
    // to plaintext).
    OutputNoScriptContent = 2048,

    // Output the content of noframes elements (only for serializing
    // to plaintext).
    OutputNoFramesContent = 4096
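
Because each flag is a distinct power of two, callers combine them with bitwise OR and test them with bitwise AND. The following is a minimal sketch of that pattern, not code from the Mozilla tree: the helper LineBreakFor and its platformBreak parameter are hypothetical, and it takes the reading that setting both line-break flags selects CRLF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Flag values copied from the listing above; only a subset is repeated here.
enum : std::uint32_t {
  OutputSelectionOnly = 1,
  OutputFormatted     = 2,
  OutputBodyOnly      = 8,
  OutputWrap          = 32,
  OutputCRLineBreak   = 512,
  OutputLFLineBreak   = 1024
};

// Hypothetical helper: pick the line-break string for a flag word.
// Assumption: both flags together mean CRLF. platformBreak stands in
// for whatever the host platform would use when neither flag is set.
inline std::string LineBreakFor(std::uint32_t flags,
                                const std::string& platformBreak = "\n") {
  const bool cr = (flags & OutputCRLineBreak) != 0;
  const bool lf = (flags & OutputLFLineBreak) != 0;
  if (cr && lf) return "\r\n";
  if (cr) return "\r";
  if (lf) return "\n";
  return platformBreak;
}
```

A caller that wants body-only output with LF line breaks would pass OutputBodyOnly | OutputLFLineBreak as the flag word.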

HTML Serializer:

Purpose:

  • Show HTML source corresponding to a DOM.
  • Not used for Browser's View Source; that just refetches from necko.
  • Is used by Composer (for saving, publishing, and View Source), Mail (for sending or View Source) and HTML copy in all windows.

Issues/Bugs:

  • Preserving formatting: serializer has to guess at formatting based on information from the parser. This often doesn't preserve the formatting of the original document -- a problem for composer.
  • Line breaks: the serializer must insert breaks (otherwise mail and news transmission agents might choke, and the source becomes hard to read), but it can't look ahead. It therefore sometimes breaks between tags, introducing whitespace which becomes significant.

Plaintext Serializer:

Purpose:

  • Translate HTML (eventually XML?) into plaintext.
  • Translate a dom from the html mail compose window into a plaintext mail message.
  • Translate a <pre> formatted html tree (from plaintext mail compose, textarea, etc.) into plaintext.
  • Translate a selection (which we store in html) into plaintext to paste into another app or a text control in our app.

Two modes:

  • Formatted: used primarily for mail compose. "Make output look as much like the html as possible." Includes line breaks, usually wraps; eventually will have table support.
  • Unformatted: used primarily for copy/paste. "Make output be simple, so that the containing app can wrap or otherwise format it."

Issues/Bugs:

  • Line break problems.
  • Wrapping issues. (Complex because of format=flowed.)
  • Doesn't handle tables.
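
Part of what makes format=flowed awkward is space stuffing, which has to happen as each line is produced. RFC 2646 asks the generating agent to space-stuff lines that start with a space, "From ", or ">". The sketch below shows only that rule. SpaceStuff is a hypothetical helper, not the serializer's code, and it assumes the line passed in is unquoted content.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the Mozilla implementation: space-stuff one
// outgoing line as RFC 2646 describes for format=flowed text.
// Assumption: `line` is unquoted content, so a leading '>' is literal text.
inline std::string SpaceStuff(const std::string& line) {
  const bool needsStuffing =
      !line.empty() &&
      (line[0] == ' ' || line[0] == '>' || line.compare(0, 5, "From ") == 0);
  // Prepend exactly one space when the rule applies; otherwise pass through.
  return needsStuffing ? " " + line : line;
}
```

A receiving agent strips one leading space from each line, so stuffed lines come out unchanged on the other side.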

External Contributors:

  • Daniel Bratell and Ben Bucksch contributed a lot of code to the plaintext serializer due to its importance to mail. Some of their contributions: format=flowed wrapping, bold/italic/underline/smiley substitution, html-lookalike formatting (lists, indentation).

Code Overview:

Two modes of use:

There are two methods of calling a serializer: as a serializer on a dom tree, or as a parser sink. So a serializer needs to be able to act on either parser nodes or dom nodes.

DOM serializer mode is straightforward: someone calls methods like AppendElementStart() and AppendElementEnd(), passing in nsIDOMElements, and the serializer potentially has access to the tree.

In parser sink mode, the parser serially calls methods like OpenContainer() and CloseContainer(), passing in a parser node each time. There's no lookahead, so each node must be serialized independently of what might be coming.

Hence the serializer must make all its decisions about a node without assuming it can get that node's neighbors.
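
The sink-mode call pattern can be sketched as follows. This is a toy under stated assumptions, not the real sink: the class ToySink and its string tag arguments are hypothetical (the real sinks receive parser nodes), and no entity escaping or line breaking is attempted.

```cpp
#include <cassert>
#include <string>

// Illustrative only: a toy sink with the call pattern described above.
class ToySink {
 public:
  explicit ToySink(std::string* out) : mOut(out) {}

  // Each call emits its text immediately; nothing is held back in the
  // hope of seeing the next node.
  void OpenContainer(const std::string& tag)  { *mOut += "<" + tag + ">"; }
  void CloseContainer(const std::string& tag) { *mOut += "</" + tag + ">"; }
  void AddLeaf(const std::string& text)       { *mOut += text; }

 private:
  std::string* mOut;  // Not owned; plays the role of the output string.
};

// Stand-in for the parser driving the sink, one node at a time.
inline std::string SerializeExample() {
  std::string out;
  ToySink sink(&out);
  sink.OpenContainer("p");
  sink.AddLeaf("Hello, ");
  sink.OpenContainer("b");
  sink.AddLeaf("world");
  sink.CloseContainer("b");
  sink.CloseContainer("p");
  return out;
}
```

When OpenContainer("b") runs, the sink cannot know whether text or another tag follows, which is why any decision about breaking the line at that point is a guess.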

Output is always to an nsAWritableString, as set in Initialize(nsAWritableString* aOutString, PRUint32 aFlags, PRUint32 aWrapCol). Output to a stream used to be supported but no longer is.

Important methods to know:

  • Initialize(nsAWritableString* aOutString, PRUint32 aFlags, PRUint32 aWrapCol).
  • DoAddLeaf(PRInt32 aTag, const nsAReadableString& aText): use this to add leaf nodes, e.g. text. DoAddLeaf must decide whether the leaf needs to be wrapped.
  • DoOpenContainer(PRInt32 aTag): Start a tag.
  • DoCloseContainer(PRInt32 aTag): End a tag.
  • AddToLine(const PRUnichar* aStringToAdd, PRInt32 aLength): Called by the three preceding methods to add text to the current line.
  • EnsureVerticalSpace(PRInt32 noOfRows)/EndLine(PRBool softlinebreak): generally, call EnsureVerticalSpace rather than calling EndLine directly, to guard against too many extra blank lines.
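
The reason to prefer EnsureVerticalSpace over EndLine can be shown with a small sketch. The method names mirror the list above, but the class ToyLineWriter and all of its bodies are assumptions, not the Mozilla code. Wrapping and the softlinebreak argument are left out, and the start of output is treated as one blank line.

```cpp
#include <cassert>
#include <string>

// Illustrative only: tracks how many blank lines end the output so far.
class ToyLineWriter {
 public:
  explicit ToyLineWriter(std::string* out) : mOut(out) {}

  // Append text to the line currently being built.
  void AddToLine(const std::string& text) { mCurrentLine += text; }

  // Flush the current line (possibly empty) followed by a line break.
  void EndLine() {
    mEmptyLines = mCurrentLine.empty() ? mEmptyLines + 1 : 0;
    *mOut += mCurrentLine;
    *mOut += '\n';
    mCurrentLine.clear();
  }

  // Make sure at least `rows` blank lines precede whatever comes next,
  // adding only the ones that are still missing.
  void EnsureVerticalSpace(int rows) {
    if (!mCurrentLine.empty()) EndLine();
    while (mEmptyLines < rows) EndLine();
  }

 private:
  std::string* mOut;        // Not owned.
  std::string mCurrentLine;
  int mEmptyLines = 1;      // Assumption: start of output counts as blank.
};

inline std::string VerticalSpaceExample() {
  std::string out;
  ToyLineWriter writer(&out);
  writer.EnsureVerticalSpace(1);  // Nothing to add at the start.
  writer.AddToLine("Title");
  writer.EnsureVerticalSpace(1);  // Ends the line, adds one blank line.
  writer.EnsureVerticalSpace(1);  // Already satisfied; adds nothing.
  writer.AddToLine("Body");
  writer.EndLine();
  return out;
}
```

Calling EndLine twice in place of the repeated EnsureVerticalSpace(1) would have produced two blank lines between the title and the body, which is the pile-up the advice above guards against.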