You are here: Editor project page > Care and Feeding of Find in Page
Care and Feeding of Find in Page
Basics
The find algorithm is located in nsFind.cpp, and the core of it is in nsFind::Find. I'm going to assume that these clusters are entirely located within one text node, so you shouldn't need to work about routines like NextNode or SkipNode -- those should just work.
Find() searches forward or backward through the tree, looking at nodes (almost always a text node, though there's an RFE to be able to search in other types of nodes) and comparing character by character with our pattern string. In reverse, we compare characters starting from the end of the pattern string, not the beginning (that might be important for this bug).
When we start a match (i.e. match one or more characters), we save the point where the match started then continue through the string. If we reach the end of the pattern string and we're still matching, we have succeeded; if we hit a non-match before the end of the pattern string, then we have to go back to the next character in the tree after the place where this match started, and reset our position in the pattern string to the beginning (or end, if backwards).
Important variables within nsFind::Find()
mIterNode
is the current node we're examining. It is updated
by NextNode,
which uses a content iterator to move forward or backward in the tree. It
needs to be able to discard nodes which are children of skipped nodes; if
we had an iterator which visited nodes both before and after iterating
through their children (so we would see the parent before the children
in both forward and backward directions), but the current content
iterator doesn't support that (in fact, its Post order iteration is
unreliable and not supported). So the current code climbs the tree to
find the first block parent, and compares that to the last block parent
it saw to determine when block boundaries are crossed. I originally
tried to use a nsIDOMTreeWalker
, which might have made this easier, but didn't get it to work;
this might be an improvement at some point.
mIterOffset
is the current offset into the node: the
place at which the current match started, or, if we're not in a
partial match, the character we're comparing right now.
pindex
is the current offset into the pattern string,
and findex
is the current offset into the current node
(so these two point to the characters we're currently comparing).
You may have noticed that findex
and
mIterOffset
seem to do the same thing. They basically do,
but findex points to a character, whereas mIterOffset is a range
offset that points between characters: see the note around line 680.
Since a match may (and often does) span multiple nodes in the tree, matchAnchorNode
and matchAnchorOffset
store the node and offset (range
offset) where the current partial match began, while mIterNode
and mIterOffset
advance.
Debugging tips
You can #define DEBUG_find 1 and get lots of spew on standard output about what the algorithm is doing; I initially put these in because gdb doesn't always work reliably, but it's also useful in cases where you have to search a lot of text before the first match, so stopping/continuing past a breakpoint would take a long time.
Fragments of text nodes in the dom can be either single or double byte: see the use of the pointers t1b and t2b.
Given this info, the comments in the code may be enough to take you through the algorithm. Most of it is fairly straightforward; some of the tricky logic has to do with matching whitespace (any space in the pattern string matches one or more spaces in the document, so for example a double space in the pattern string must match two or more characters of whitespace -- which includes space, dom newline (\n), and non-breaking space ( ). There's also some tricky logic regarding when to advance findex and pindex.
There is a set of regression tests for find, located here. Please run through the entire set of tests after modifying find, and if possible, add a new test which will detect regressions in the bug you are fixing. Although these find tests are not an official part of the QA acceptance tests, running them regularly will make it much easier to detect regressions.
Hints on likely places that might need work
If you're reading this, it's probably because you have a bug that needs fixing. Here are some areas known to need work, and where to look:
- Support for languages which use more than one doublewide character per group. The usage of t1b and t2b might be all that need be modified for this. Perhaps we need a tMultiByte option everywhere we now deal with t1b and t2b. You'll need to be able to iterate forward and backward in the string from the position of tMultiByte. If that's not straightforward, then you might have to define iterators which do what we currently do now in the single and double byte case, but do more work to implement forward and backward iteration in the multibyte case. Perhaps you could make a small class with an interator operator on it, which encapsulates the one- vs-two- vs-multiple byte string. If you're feeling ambitious, you could make a derivative of our string classes which handles text fragments of varying length; this would be useful in other areas as well, but writing it in a way that doesn't slow down find performance (no string copies!) might be challenging.
- Off-by-one problems: be careful of when
findex
andpindex
are incremented, and to where they are reset. It's easy to get off by one here when you're adding new code, and it's certainly possible that there are off-by-one bugs still lurking in the present code. If you have to change the wayfindex
andpindex
are incremented or reset, be sure to comment what you're doing (it's way too easy to get lost in this code without comments) and run through the regression tests completely. - Wildcard searching (bug 32641) and whole word only (bug 14871): unfortunately, we won't be able to leverage anything already written for our nsAString classes, because of the one- vs two- byte problem, the fact that we can't copy for performance reasons, and the need to match both forward and backward. I think the best way to inplement wildcard searching is to do something like what we already do for whitespace: use a flag to keep track of whether we're in a match, and increment findex but not pindex as long as the wildcard continues to match. This sounds not too hard, but I expect it will take some debugging to get it right. Perhaps wildcards and spaces can be combined into a more general concept.
- Extending find to look through nodes other than text nodes (bug 158757). This will require changes (probably fairly straightforward) in two places: one in NextNode(), with perhaps also a little work in SkipNode; and a change in the code segment where we get the text out of the text node, which currently assumes a text node.