New Layout: Parsing Engine
Author: Rick Gessner
Last update: 1 May 98
The parsing engine in NGLayout has a modular design that permits the system to parse almost any kind of data, though the engine is optimized for HTML.
Conceptually speaking, a parsing "engine" is used to transform a source document from one form into another. In the case of HTML, the parser transforms the hierarchy of HTML tags (the source form) into a form that the underlying layout and display engine requires (the target form).
The first major component in the parsing engine is the Scanner. The Scanner provides an incremental "push-based" API that offers methods for accessing characters in the input stream (usually a URL), finding particular sequences, collating input data and skipping over unwanted data. Our experience has shown that a fairly simple scanner can be used effectively to parse everything from HTML and XML to C++.
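To make the scanner's role concrete, here is a minimal, self-contained sketch of such an interface over an in-memory buffer. The method names (GetChar, Peek, ReadUntil, SkipOver) are modeled on the description above and are assumptions, not the actual NGLayout API:

    // Illustrative scanner sketch; method names are assumptions,
    // not the real NGLayout API.
    #include <cstring>
    #include <string>

    enum EScanResult { kNoError, kEOF };

    class CScanner {
      std::string mBuffer;
      size_t mOffset = 0;
    public:
      explicit CScanner(const std::string& aInput) : mBuffer(aInput) {}

      EScanResult GetChar(char& aChar) {        // consume one character
        if (mOffset >= mBuffer.size()) return kEOF;
        aChar = mBuffer[mOffset++];
        return kNoError;
      }
      EScanResult Peek(char& aChar) const {     // look ahead without consuming
        if (mOffset >= mBuffer.size()) return kEOF;
        aChar = mBuffer[mOffset];
        return kNoError;
      }
      EScanResult ReadUntil(std::string& aOut, char aTerminal) {
        while (mOffset < mBuffer.size() && mBuffer[mOffset] != aTerminal)
          aOut += mBuffer[mOffset++];           // collate consumed input
        return (mOffset < mBuffer.size()) ? kNoError : kEOF;
      }
      void SkipOver(const char* aSkipSet) {     // e.g. skip whitespace
        while (mOffset < mBuffer.size() && std::strchr(aSkipSet, mBuffer[mOffset]))
          ++mOffset;
      }
    };

In the real engine the buffer is fed incrementally from the network rather than supplied up front; that difference is what makes the API "push-based".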
The second major element in the system is the parser component itself. The parser component controls and coordinates the activities of the other components in the system. This approach relies upon the fact that regardless of the form of the source document, the transformation process remains the same (as we'll explain later). While other components of the system are meant to be dynamically substituted according to the source document type, it is rarely necessary to alter the parser component.
The parser also drives tokenization. Tokenization refers to the process of collating atomic units (characters) in the input stream into higher-level structures called tokens. So, for example, the HTML tokenizer converts a raw input stream of characters into HTML tags. For maximum flexibility, the tokenizer makes no assumptions about the underlying grammar. Instead, the details of the actual grammar being parsed are left to the DTD object, which understands the constructs that comprise the grammar. The importance of this design decision is that it allows the engine to dynamically vary the language it is tokenizing without changing the tokenizer itself.
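As a rough illustration of the token hierarchy this implies, here is a sketch in which only the class names CToken and CHTMLToken come from this document (they are introduced in Phase 3 below); the members and the token-type enumeration are assumptions:

    // Sketch of a grammar-neutral token and an HTML-specific subclass.
    // CToken and CHTMLToken are named in this document; the members
    // shown here are assumptions.
    #include <string>

    class CToken {
    public:
      explicit CToken(const std::string& aText) : mText(aText) {}
      virtual ~CToken() = default;
      const std::string& GetText() const { return mText; }
    protected:
      std::string mText;              // the raw characters this token collates
    };

    enum eHTMLTokenType { eStartTag, eEndTag, eTextRun, eComment };

    class CHTMLToken : public CToken {
    public:
      CHTMLToken(const std::string& aText, eHTMLTokenType aType)
        : CToken(aText), mType(aType) {}
      eHTMLTokenType GetType() const { return mType; }
    private:
      eHTMLTokenType mType;           // HTML-specific classification
    };

Because the tokenizer traffics only in CToken pointers, switching grammars means supplying a different token subclass and DTD, not a different tokenizer.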
The final component in the parser engine is the DTD, which describes the rules for well-formed and/or valid documents in the target grammar. In HTML, the DTD declares and defines the tag set, the associated set of attributes and the hierarchical (nesting) rules of the HTML tags. Once again, by separating the DTD component from the other components in the parser engine it becomes possible to use the same system to parse a much wider range of document types. Simply put, this means that the same parser can provide input to the browser, biased (via the DTD) to behave like Navigator, IE, or any other HTML browser. The same can be said for XML.
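A DTD along these lines might expose an interface like the following sketch. The text names an iDTD interface but not its methods, so CanContain and HandleToken are assumptions based on the responsibilities described above:

    // Hypothetical DTD interface; only the name iDTD appears in this
    // document, and the methods below are assumptions.
    class CToken;          // see the token sketch above
    class nsIContentSink;  // see the sink sketch below

    class iDTD {
    public:
      virtual ~iDTD() = default;
      // Do the grammar's nesting rules allow aChildTag inside aParentTag?
      virtual bool CanContain(int aParentTag, int aChildTag) const = 0;
      // Consume one token, driving the content sink as a side effect.
      // Returns 0 on success, nonzero on an unrecoverable error.
      virtual int HandleToken(CToken* aToken, nsIContentSink* aSink) = 0;
    };

A Navigator-compatible DTD and a strict DTD could then implement the same interface with different nesting rules.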
Once the tokenization process is complete, the parse-engine needs to emit its content (tokens). Since the parser doesn't know anything about the document model, the containing application must provide a "content sink". The sink is a simple API that accepts container, leaf and text nodes, and constructs the underlying document model accordingly. The DTD interacts with the sink to cause the proper content model to be constructed based on the input set of tokens.
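Based on that description, a sink interface might look like the sketch below; the method names are assumptions, and the logging subclass is purely illustrative:

    // Sketch of a content sink that accepts container, leaf and text
    // nodes; names are assumptions modeled on the description above.
    #include <iostream>
    #include <string>

    class nsIContentSink {
    public:
      virtual ~nsIContentSink() = default;
      virtual void OpenContainer(const std::string& aTag) = 0;
      virtual void CloseContainer(const std::string& aTag) = 0;
      virtual void AddLeaf(const std::string& aTag) = 0;
      virtual void AddText(const std::string& aText) = 0;
    };

    // A trivial sink that logs what the DTD asks it to build, instead
    // of constructing a real document model.
    class CLoggingSink : public nsIContentSink {
    public:
      void OpenContainer(const std::string& aTag) override { std::cout << '<' << aTag << ">\n"; }
      void CloseContainer(const std::string& aTag) override { std::cout << "</" << aTag << ">\n"; }
      void AddLeaf(const std::string& aTag) override { std::cout << '<' << aTag << "/>\n"; }
      void AddText(const std::string& aText) override { std::cout << aText << '\n'; }
    };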
While these objects may seem confusing at first, this simple diagram illustrates their runtime relationships:
[Diagram: runtime relationships between the parser, scanner, tokenizer, DTD and content sink]
Phase 1 -- Initialization
Parsing a document is a straightforward operation. The containing application initiates the parse by creating an nsIURL object, an nsTokenizer object and an nsHTMLParse object. The parser is assigned a sink and a DTD (remember: the DTD understands the grammar of the document being parsed, while the sink interface allows the DTD to properly build a content model).
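In code, the wiring might look roughly like this. Only nsIURL, nsTokenizer and nsHTMLParse are named in this document; the constructor and setter signatures, and the CHTMLDTD class, are assumptions (CLoggingSink comes from the sink sketch above), and error handling is omitted:

    // Illustrative wiring only; the signatures shown are assumptions.
    void ParseDocument(nsIURL* aURL) {
      nsTokenizer* tokenizer = new nsTokenizer();
      nsHTMLParse* parser    = new nsHTMLParse(tokenizer);

      parser->SetContentSink(new CLoggingSink()); // sink builds the content model
      parser->SetDTD(new CHTMLDTD());             // DTD supplies the grammar rules
      parser->Parse(aURL);                        // runs phases 2 through 5 below
    }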
Phase 2 -- Opening an Input Stream
The parse process begins when the URL is opened and content is provided in the form of a network input stream. The stream is given to the scanner, which controls all access to it. The parse-engine then instructs the tokenizer to initiate the tokenization phase. Tokenization is an incremental process, and can be interrupted when the scanner is blocked awaiting network data.
Phase 3 -- Tokenization
The tokenizer controls and coordinates the tokenization of the input stream into a collection of CTokens. (Different grammars will have their own subclasses of CToken, as we've done to create CHTMLToken, as well as their own iDTD.) As the tokenizer runs, it repeatedly calls the method GetToken(). This continues until EOF is reached on the input stream or an unrecoverable error occurs.
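The driving loop might look like the following sketch. Only GetToken() is named in this document; the result codes, the GetToken signature and the CollectToken helper are assumptions:

    // Sketch of the tokenization loop: ask for tokens until EOF, an
    // unrecoverable error, or a blocked scanner. Names other than
    // GetToken() are assumptions.
    enum EResult { kNoError, kEOF, kBlocked, kFatalError };

    void Tokenize(nsTokenizer& aTokenizer) {
      for (;;) {
        CToken* token = nullptr;
        EResult result = aTokenizer.GetToken(token);
        if (result == kEOF || result == kFatalError)
          break;                        // done, or unrecoverable error
        if (result == kBlocked)
          return;                       // scanner awaits network data;
                                        // tokenization resumes later
        aTokenizer.CollectToken(token); // queue token for the next phase
      }
    }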
Phase 4 -- Token Iteration/Document Construction
After the tokenization phase completes, the parser enters the token iteration phase, which validates the document and causes a content model to be constructed. Token iteration proceeds until an unrecoverable error occurs, or the parser has visited each token. The tokens are collected into related groups of information according to the rules provided by the nsDTD class. The DTD controls the order in which tokens can appear in relation to each other. At well-defined times during this process, the parser notifies the content sink about the parse context, instructing the sink to construct the document according to the state of the parser.
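Putting the earlier sketches together, the iteration phase can be pictured like this; everything beyond the class names already introduced is an assumption:

    // Sketch of token iteration: the DTD inspects each token, applies
    // the grammar's nesting rules, and tells the sink what to build.
    #include <vector>

    void BuildContentModel(const std::vector<CHTMLToken*>& aTokens,
                           iDTD& aDTD, nsIContentSink& aSink) {
      for (CHTMLToken* token : aTokens) {
        // The DTD decides how each token maps onto the content model:
        // a start tag may open a container, a text run becomes a leaf,
        // and illegal nesting may force containers to close first.
        if (aDTD.HandleToken(token, &aSink) != 0)
          break;                        // unrecoverable error: stop iterating
      }
    }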
Phase 5 -- Object Destruction
Once tokenization and iteration have concluded, the objects in the parse system are destroyed to conserve memory.
Also Of Interest...
In addition to parsing of documents and dynamic DTD support, the parse engine also offers support for data I/O observers. The intention of these interfaces is to allow secure objects to hook into the I/O system at runtime, transforming the underlying stream before the parser sees it. This can be useful in cases where preprocessing needs to occur, or where transforms from foreign document types into HTML should occur.
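An observer hook of this kind might be declared as in the sketch below; the interface name and method are assumptions, not the actual API:

    // Hypothetical stream-observer hook: an observer sees (and may
    // rewrite) each buffer of raw data before the scanner does.
    #include <string>

    class nsIStreamObserver {
    public:
      virtual ~nsIStreamObserver() = default;
      // Transform aBuffer in place, e.g. to preprocess it or to convert
      // a foreign document type into HTML.
      virtual void OnDataAvailable(std::string& aBuffer) = 0;
    };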
Dependencies
- nsCore.h (and prtypes.h)
- The XP_COM system
- Netlib (for URLs and input streams)
Future Work
- Support for well-formed and/or valid XML documents.
- Support for document "processors" such as XSL and others.
- Backward compatibility -- HTML DTD improvements.
- Performance tuning.