New Layout: Parsing Engine
Author: Rick Gessner
Last update: 1May98
Overview |
The parsing engine in NGLayout has a modular
design that permits the system to parse almost any kind of data, although the
engine is optimized for HTML.
Conceptually speaking, a parsing "engine"
is used to transform a source document from one form into another. In the case
of HTML, the parser transforms the hierarchy of HTML tags (the source form)
into a form that the underlying layout and display engine requires (the target
form).
Major Components |
Scanner Component
The first major component in the parsing
engine is the Scanner. The Scanner provides an incremental "push-based"
API that offers methods for accessing characters in the input stream (usually
a URL), finding particular sequences, collating input data and skipping over
unwanted data. Our experience has shown that a fairly simple scanner can be
used effectively to parse everything from HTML and XML to C++.
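To make the scanner's role concrete, here is a minimal sketch of a push-based scanner. The class name SimpleScanner and the methods Append, Peek, GetChar, SkipOver and ReadUntil are assumptions chosen for illustration; they are not the NGLayout scanner API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical push-based scanner: the caller appends data as it arrives,
// and the tokenizer pulls characters out through a few helper methods.
class SimpleScanner {
public:
    // Push a newly arrived chunk of input into the buffer.
    void Append(const std::string& chunk) { mBuffer += chunk; }

    // True when every buffered character has been consumed.
    bool AtEnd() const { return mOffset >= mBuffer.size(); }

    // Look at the next character without consuming it.
    char Peek() const {
        assert(!AtEnd());
        return mBuffer[mOffset];
    }

    // Consume and return the next character.
    char GetChar() {
        assert(!AtEnd());
        return mBuffer[mOffset++];
    }

    // Skip over any run of characters found in `set` (for example whitespace).
    void SkipOver(const std::string& set) {
        while (!AtEnd() && set.find(mBuffer[mOffset]) != std::string::npos) {
            ++mOffset;
        }
    }

    // Collect characters up to, but not including, `terminator`.
    // Returns false and consumes nothing if the terminator has not arrived yet.
    bool ReadUntil(char terminator, std::string& out) {
        const std::size_t pos = mBuffer.find(terminator, mOffset);
        if (pos == std::string::npos) return false;
        out = mBuffer.substr(mOffset, pos - mOffset);
        mOffset = pos;
        return true;
    }

private:
    std::string mBuffer;      // all data pushed so far
    std::size_t mOffset = 0;  // index of the next unread character
};
```

Because ReadUntil consumes nothing when the terminator is missing, a caller can simply retry after the next chunk has been appended.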
Parser Component
The second major element in the
system is the parser component itself. The parser component controls and
coordinates the activities of the other components in the system. This approach
relies upon the fact that regardless of the form of the source document,
the transformation process remains the same (as we'll explain later). While
other components of the system are meant to be dynamically substituted
according to the source document type, it is rarely necessary to alter
the parser component.
The parser also drives tokenization. Tokenization refers to the process of collating atomic units (characters) in the input stream into higher-level structures called tokens. So for example, the HTML tokenizer converts a raw input stream of characters into HTML tags. For maximum flexibility, the tokenizer makes no assumptions about the underlying grammar. Instead, the details of the actual grammar being parsed are left to the DTD object, which understands the constructs that comprise the grammar. The importance of this design decision is that it allows the engine to dynamically vary the language it is tokenizing without changing the tokenizer itself.
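The separation between tokenizer and grammar can be sketched as follows. This is only an illustration under stated assumptions: IGrammar, TagGrammar, CsvGrammar and Tokenize are hypothetical names, and the two toy grammars stand in for real DTD objects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    std::string type;
    std::string text;
};

// Hypothetical grammar interface: the grammar object, not the tokenizer,
// decides how characters are grouped into tokens.
class IGrammar {
public:
    virtual ~IGrammar() = default;
    // Consume one token from `input` starting at `pos`, advancing `pos`.
    virtual Token ConsumeToken(const std::string& input, std::size_t& pos) const = 0;
};

// Toy grammar: "<...>" is a tag, everything else is text.
class TagGrammar : public IGrammar {
public:
    Token ConsumeToken(const std::string& input, std::size_t& pos) const override {
        if (input[pos] == '<') {
            std::size_t end = input.find('>', pos);
            if (end == std::string::npos) end = input.size();
            Token token{"tag", input.substr(pos + 1, end - pos - 1)};
            pos = (end < input.size()) ? end + 1 : end;
            return token;
        }
        std::size_t end = input.find('<', pos);
        if (end == std::string::npos) end = input.size();
        Token token{"text", input.substr(pos, end - pos)};
        pos = end;
        return token;
    }
};

// Toy grammar: comma-separated fields.
class CsvGrammar : public IGrammar {
public:
    Token ConsumeToken(const std::string& input, std::size_t& pos) const override {
        std::size_t end = input.find(',', pos);
        if (end == std::string::npos) end = input.size();
        Token token{"field", input.substr(pos, end - pos)};
        pos = (end < input.size()) ? end + 1 : end;
        return token;
    }
};

// The tokenizer loop is identical for every grammar.
std::vector<Token> Tokenize(const std::string& input, const IGrammar& grammar) {
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const std::size_t before = pos;
        tokens.push_back(grammar.ConsumeToken(input, pos));
        assert(pos > before);  // a grammar must always make progress
    }
    return tokens;
}
```

Calling Tokenize with a different grammar object changes the language being tokenized while the loop itself stays untouched.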
DTD Component
The final component in the parser engine
is the DTD, which describes the rules for well-formed and/or valid documents
in the target grammar. In HTML, the DTD declares and defines the tag set, the
associated set of attributes and the hierarchical (nesting) rules of the HTML
tags. Once again, by separating the DTD component from the other components
in the parser engine, it becomes possible to use the same system to parse a much
wider range of document types. Simply put, this means that the same parser can
provide input to the browser, biased (via the DTD) to behave like Navigator,
IE, or any other HTML browser. The same can be said for XML.
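As a rough illustration of nesting rules living in a swappable DTD object, consider the sketch below. SimpleDTD, Allow, CanContain and the two factory functions are hypothetical, and the rules themselves are made up for the example; they are not the real HTML containment rules.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical table-driven DTD: records which child tags each parent may contain.
class SimpleDTD {
public:
    // Declare that `parent` may directly contain `child`.
    void Allow(const std::string& parent, const std::string& child) {
        assert(!parent.empty() && !child.empty());
        mRules[parent].insert(child);
    }

    // Nesting check the parser would consult before opening `child` in `parent`.
    bool CanContain(const std::string& parent, const std::string& child) const {
        auto it = mRules.find(parent);
        return it != mRules.end() && it->second.count(child) != 0;
    }

private:
    std::map<std::string, std::set<std::string>> mRules;
};

// Illustrative rules only; these are not the real HTML nesting rules.
SimpleDTD MakeStrictDTD() {
    SimpleDTD dtd;
    dtd.Allow("html", "body");
    dtd.Allow("body", "p");
    dtd.Allow("p", "b");
    return dtd;
}

// Same tag set, but biased to tolerate a paragraph nested inside a paragraph.
SimpleDTD MakeLenientDTD() {
    SimpleDTD dtd = MakeStrictDTD();
    dtd.Allow("p", "p");
    return dtd;
}
```

Handing the parser one DTD object or the other changes which documents it accepts without any change to the parser.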
Sink Component
Once the tokenization process is complete,
the parse-engine needs to emit its content (tokens). Since the parser doesn't
know anything about the document model, the containing application must provide
a "content-sink". The sink is a simple API that accepts container, leaf and
text nodes, and constructs the underlying document model accordingly. The DTD
interacts with the sink to cause the proper content-model to be constructed
based on the input set of tokens.
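A content sink of the kind described above might look roughly like this. The interface name IContentSink and its methods OpenContainer, CloseContainer, AddLeaf and AddText are assumptions made for this sketch, as is the OutlineSink implementation that records what it is told.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sink interface supplied by the containing application.
class IContentSink {
public:
    virtual ~IContentSink() = default;
    virtual void OpenContainer(const std::string& tag) = 0;
    virtual void CloseContainer(const std::string& tag) = 0;
    virtual void AddLeaf(const std::string& tag) = 0;
    virtual void AddText(const std::string& text) = 0;
};

// Example sink that builds a bracketed outline instead of a real document model.
class OutlineSink : public IContentSink {
public:
    void OpenContainer(const std::string& tag) override {
        mOpen.push_back(tag);
        mOutput += "(" + tag;
    }

    void CloseContainer(const std::string& tag) override {
        // The caller is expected to close containers in the reverse order
        // in which it opened them.
        assert(!mOpen.empty() && mOpen.back() == tag);
        mOpen.pop_back();
        mOutput += ")";
    }

    void AddLeaf(const std::string& tag) override { mOutput += " [" + tag + "]"; }

    void AddText(const std::string& text) override { mOutput += " \"" + text + "\""; }

    const std::string& Output() const { return mOutput; }
    std::size_t Depth() const { return mOpen.size(); }

private:
    std::vector<std::string> mOpen;  // containers opened but not yet closed
    std::string mOutput;
};
```

The parser side only ever sees IContentSink, so an application can substitute a sink that builds its own document model.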
While these objects may seem confusing at first, this simple diagram illustrates the runtime relationships between these objects:
<insert parser image here>
Implementation |
Parsing a document is a straightforward operation. The containing application initiates the parse by creating a nIURL object, a nsTokenizer object and an nsHTMLParse object. The parser is assigned a sink and a DTD (remember: the DTD understands the grammar of the document being parsed, while the sink interface allows the DTD to properly build a content model).
Phase 2 -- Opening an Input Stream
The parse process begins when the URL is
opened, and content is provided in the form of a network input stream. The
stream is given to the scanner, which controls all access. The parse-engine
then instructs the tokenizer to initiate the tokenization phase. Tokenization
is an incremental process, and can be interrupted when the scanner is blocked awaiting
network data.
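The incremental, interruptible nature of tokenization can be sketched as below. IncrementalTokenizer, Push and TokenizeStatus are hypothetical names, and the tag and text splitting is deliberately simplistic; the point is only that an incomplete token is held back until more network data arrives.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenizeStatus { kBlocked, kComplete };

// Hypothetical tokenizer that can stop mid-stream and resume later.
class IncrementalTokenizer {
public:
    // Push one chunk of network data. `endOfStream` marks the final chunk.
    TokenizeStatus Push(const std::string& chunk, bool endOfStream) {
        assert(!mDone);  // no data may be pushed after the final chunk
        mPending += chunk;
        std::size_t pos = 0;
        while (pos < mPending.size()) {
            if (mPending[pos] == '<') {
                const std::size_t end = mPending.find('>', pos);
                if (end == std::string::npos) break;  // incomplete tag: wait
                mTokens.push_back(mPending.substr(pos, end - pos + 1));
                pos = end + 1;
            } else {
                std::size_t end = mPending.find('<', pos);
                if (end == std::string::npos) {
                    if (!endOfStream) break;  // text may continue in the next chunk
                    end = mPending.size();
                }
                mTokens.push_back(mPending.substr(pos, end - pos));
                pos = end;
            }
        }
        mPending.erase(0, pos);  // keep only the unfinished remainder
        if (!endOfStream) return TokenizeStatus::kBlocked;
        if (!mPending.empty()) {  // unterminated tag at EOF: emit it as-is
            mTokens.push_back(mPending);
            mPending.clear();
        }
        mDone = true;
        return TokenizeStatus::kComplete;
    }

    const std::vector<std::string>& Tokens() const { return mTokens; }

private:
    std::string mPending;              // data received but not yet tokenized
    std::vector<std::string> mTokens;  // completed tokens
    bool mDone = false;
};
```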
Phase 3 -- Tokenization
The tokenizer controls and coordinates the
tokenization of the input stream into a collection of CTokens. (Different grammars
will have their own subclasses of CToken, as we've done to create CHTMLToken,
as well as their own iDTD.) As the tokenizer runs, it repeatedly calls the method
GetToken(). This continues until EOF occurs on the input stream, or an
unrecoverable error occurs.
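The loop described above, with its two exit conditions, could be sketched like this. The CToken and CHTMLToken classes here are simplified stand-ins that only borrow the names used in the text; HTMLTokenSource, Result, TokenizeAll and the GetToken signature are assumptions and do not reflect the real interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the token base class.
class CToken {
public:
    explicit CToken(std::string text) : mText(std::move(text)) {}
    virtual ~CToken() = default;
    const std::string& Text() const { return mText; }
private:
    std::string mText;
};

// Grammar-specific subclass, as each grammar would provide.
class CHTMLToken : public CToken {
public:
    CHTMLToken(std::string text, bool isTag) : CToken(std::move(text)), mIsTag(isTag) {}
    bool IsTag() const { return mIsTag; }
private:
    bool mIsTag;
};

enum class Result { kOk, kEOF, kError };

// Hypothetical token source over an in-memory string.
class HTMLTokenSource {
public:
    explicit HTMLTokenSource(std::string input) : mInput(std::move(input)) {}

    Result GetToken(std::unique_ptr<CToken>& out) {
        if (mPos >= mInput.size()) return Result::kEOF;
        if (mInput[mPos] == '<') {
            const std::size_t end = mInput.find('>', mPos);
            if (end == std::string::npos) return Result::kError;  // unterminated tag
            out = std::make_unique<CHTMLToken>(mInput.substr(mPos, end - mPos + 1), true);
            mPos = end + 1;
        } else {
            std::size_t end = mInput.find('<', mPos);
            if (end == std::string::npos) end = mInput.size();
            out = std::make_unique<CHTMLToken>(mInput.substr(mPos, end - mPos), false);
            mPos = end;
        }
        return Result::kOk;
    }

private:
    std::string mInput;
    std::size_t mPos = 0;
};

// Call GetToken() repeatedly until EOF or an unrecoverable error.
Result TokenizeAll(HTMLTokenSource& source, std::vector<std::unique_ptr<CToken>>& tokens) {
    for (;;) {
        std::unique_ptr<CToken> token;
        const Result result = source.GetToken(token);
        if (result != Result::kOk) return result;
        assert(token != nullptr);
        tokens.push_back(std::move(token));
    }
}
```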
Phase 4 -- Token Iteration/Document Construction
After the tokenization phase completes, the
parser enters the token iteration phase, which validates the document and causes
a content model to be constructed. Token iteration proceeds until an unrecoverable
error occurs, or the parser has visited each token. The tokens are collected
into related groups of information according to the rules provided by the nsDTD
class. The DTD controls the order in which tokens can appear in relation to
each other. At well-defined times during this process, the parser notifies the
content sink about the parse context, instructing the sink to construct the
document according to the state of the parser.
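The iteration phase can be pictured with the following sketch, which walks a finished token list, checks each start tag against a containment rule, and notifies a sink as the context changes. Token, Sink, CanContain and BuildDocument are all hypothetical, and the containment rule is a made-up stand-in for the DTD.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Token {
    enum Kind { kStart, kEnd, kText } kind;
    std::string value;
};

// Hypothetical sink that records the notifications it receives.
struct Sink {
    std::string log;
    int depth = 0;
    void OpenContainer(const std::string& tag) {
        ++depth;
        log += "<" + tag + ">";
    }
    void CloseContainer(const std::string& tag) {
        assert(depth > 0);  // never close more containers than were opened
        --depth;
        log += "</" + tag + ">";
    }
    void AddText(const std::string& text) { log += text; }
};

// Made-up containment rule standing in for the DTD. An empty parent
// means "document root".
bool CanContain(const std::string& parent, const std::string& child) {
    if (parent.empty()) return child == "html";
    if (parent == "html") return child == "body";
    if (parent == "body") return child == "p";
    return false;
}

// Visit each token in order. Returns false on an unrecoverable error.
bool BuildDocument(const std::vector<Token>& tokens, Sink& sink) {
    std::vector<std::string> context;  // stack of currently open containers
    for (const Token& token : tokens) {
        switch (token.kind) {
        case Token::kStart: {
            const std::string parent = context.empty() ? std::string() : context.back();
            if (!CanContain(parent, token.value)) return false;
            context.push_back(token.value);
            sink.OpenContainer(token.value);
            break;
        }
        case Token::kEnd:
            if (context.empty() || context.back() != token.value) return false;
            context.pop_back();
            sink.CloseContainer(token.value);
            break;
        case Token::kText:
            if (context.empty()) return false;
            sink.AddText(token.value);
            break;
        }
    }
    return context.empty();  // every container must have been closed
}
```

In this sketch the error policy is simply to stop; a DTD for real-world HTML would more likely repair the context than reject the document.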
Phase 5 -- Object Destruction
Once tokenization and iteration have concluded,
the objects in the parse system are destroyed to conserve memory.
Also Of Interest... |
In addition to parsing of documents and dynamic DTD support, the parse engine also offers support for data i/o observers. The intention of these interfaces is to allow secure objects to hook into the i/o system at runtime, transforming the underlying stream before the parser sees it. This can be useful in cases where preprocessing needs to occur, or where transforms from foreign document types into HTML should occur.
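A stream observer of this kind might be sketched as follows. IStreamObserver, OnData, ObservedStream and PlainTextToHTMLObserver are hypothetical names invented for the example; the observer shown turns plain text into HTML-safe text by escaping markup characters before the parser would see them.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical observer interface: receives a chunk, returns the transformed chunk.
class IStreamObserver {
public:
    virtual ~IStreamObserver() = default;
    virtual std::string OnData(const std::string& chunk) = 0;
};

// Example observer: escapes markup characters so plain text is safe as HTML.
// It works character by character, so chunk boundaries do not matter.
class PlainTextToHTMLObserver : public IStreamObserver {
public:
    std::string OnData(const std::string& chunk) override {
        std::string out;
        for (char c : chunk) {
            switch (c) {
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '&': out += "&amp;"; break;
            default:  out += c; break;
            }
        }
        return out;
    }
};

// Hypothetical stream wrapper that runs data through its observers.
class ObservedStream {
public:
    void AddObserver(std::unique_ptr<IStreamObserver> observer) {
        assert(observer != nullptr);
        mObservers.push_back(std::move(observer));
    }

    // Apply every observer in registration order. The result is what the
    // parser would be given.
    std::string Filter(std::string chunk) const {
        for (const auto& observer : mObservers) {
            chunk = observer->OnData(chunk);
        }
        return chunk;
    }

private:
    std::vector<std::unique_ptr<IStreamObserver>> mObservers;
};
```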
Dependencies |
- nsString
- nsCore.h (and prtypes.h)
- The XP_COM system
- Netlib (for urls and input stream)
Roadmap |
- Support for well-formed and/or valid XML documents.
- Support for document "processors" such as XSL and others.
- Backward compatibility -- HTML DTD improvements.
- Performance tuning.
Known Bugs |