You are currently viewing a snapshot of www.mozilla.org taken on April 21, 2008. Most of this content is highly out of date (some pages haven't been updated since the project began in 1998) and exists for historical purposes only. If there are any pages on this archive site that you think should be added back to www.mozilla.org, please file a bug.



You are here: Editor project page > Care and Feeding of Find in Page

Care and Feeding of Find in Page

Basics

The find algorithm is located in nsFind.cpp, and the core of it is in nsFind::Find.  I'm going to assume that these clusters are entirely located within one text node, so you shouldn't need to work about routines like NextNode or SkipNode -- those should just work.

Find() searches forward or backward through the tree, looking at nodes (almost always a text node, though there's an RFE to be able to search in other types of nodes) and comparing character by character with our pattern string. In reverse, we compare characters starting from the end of the pattern string, not the beginning (that might be important for this bug).

When we start a match (i.e. match one or more characters), we save the point where the match started then continue through the string. If we reach the end of the pattern string and we're still matching, we have succeeded; if we hit a non-match before the end of the pattern string, then we have to go back to the next character in the tree after the place where this match started, and reset our position in the pattern string to the beginning (or end, if backwards).

Important variables within nsFind::Find()

mIterNode is the current node we're examining. It is updated by NextNode, which uses a content iterator to move forward or backward in the tree. It needs to be able to discard nodes which are children of skipped nodes; if we had an iterator which visited nodes both before and after iterating through their children (so we would see the parent before the children in both forward and backward directions), but the current content iterator doesn't support that (in fact, its Post order iteration is unreliable and not supported). So the current code climbs the tree to find the first block parent, and compares that to the last block parent it saw to determine when block boundaries are crossed. I originally tried to use a nsIDOMTreeWalker , which might have made this easier, but didn't get it to work; this might be an improvement at some point.

mIterOffset is the current offset into the node: the place at which the current match started, or, if we're not in a partial match, the character we're comparing right now.

pindex is the current offset into the pattern string, and findex is the current offset into the current node (so these two point to the characters we're currently comparing).

You may have noticed that findex and mIterOffset seem to do the same thing. They basically do, but findex points to a character, whereas mIterOffset is a range offset that points between characters: see the note around line 680.

Since a match may (and often does) span multiple nodes in the tree, matchAnchorNode and matchAnchorOffset store the node and offset (range offset) where the current partial match began, while mIterNode and mIterOffset advance.

Debugging tips

You can #define DEBUG_find 1 and get lots of spew on standard output about what the algorithm is doing; I initially put these in because gdb doesn't always work reliably, but it's also useful in cases where you have to search a lot of text before the first match, so stopping/continuing past a breakpoint would take a long time.

Fragments of text nodes in the dom can be either single or double byte: see the use of the pointers t1b and t2b.

Given this info, the comments in the code may be enough to take you through the algorithm. Most of it is fairly straightforward; some of the tricky logic has to do with matching whitespace (any space in the pattern string matches one or more spaces in the document, so for example a double space in the pattern string must match two or more characters of whitespace -- which includes space, dom newline (\n), and non-breaking space ( ). There's also some tricky logic regarding when to advance findex and pindex.

There is a set of regression tests for find, located here. Please run through the entire set of tests after modifying find, and if possible, add a new test which will detect regressions in the bug you are fixing. Although these find tests are not an official part of the QA acceptance tests, running them regularly will make it much easier to detect regressions.

Hints on likely places that might need work

If you're reading this, it's probably because you have a bug that needs fixing. Here are some areas known to need work, and where to look:

  • Support for languages which use more than one doublewide character per group. The usage of t1b and t2b might be all that need be modified for this. Perhaps we need a tMultiByte option everywhere we now deal with t1b and t2b. You'll need to be able to iterate forward and backward in the string from the position of tMultiByte. If that's not straightforward, then you might have to define iterators which do what we currently do now in the single and double byte case, but do more work to implement forward and backward iteration in the multibyte case. Perhaps you could make a small class with an interator operator on it, which encapsulates the one- vs-two- vs-multiple byte string. If you're feeling ambitious, you could make a derivative of our string classes which handles text fragments of varying length; this would be useful in other areas as well, but writing it in a way that doesn't slow down find performance (no string copies!) might be challenging.
  • Off-by-one problems: be careful of when findex and pindex are incremented, and to where they are reset. It's easy to get off by one here when you're adding new code, and it's certainly possible that there are off-by-one bugs still lurking in the present code. If you have to change the way findex and pindex are incremented or reset, be sure to comment what you're doing (it's way too easy to get lost in this code without comments) and run through the regression tests completely.
  • Wildcard searching (bug 32641) and whole word only (bug 14871): unfortunately, we won't be able to leverage anything already written for our nsAString classes, because of the one- vs two- byte problem, the fact that we can't copy for performance reasons, and the need to match both forward and backward. I think the best way to inplement wildcard searching is to do something like what we already do for whitespace: use a flag to keep track of whether we're in a match, and increment findex but not pindex as long as the wildcard continues to match. This sounds not too hard, but I expect it will take some debugging to get it right. Perhaps wildcards and spaces can be combined into a more general concept.
  • Extending find to look through nodes other than text nodes (bug 158757). This will require changes (probably fairly straightforward) in two places: one in NextNode(), with perhaps also a little work in SkipNode; and a change in the code segment where we get the text out of the text node, which currently assumes a text node.
Author: Akkana Peck.