grendel database requirements
by Terry Weissman
Note: this does not include address book needs! Lester Schueler is working on a separate document to cover our needs there, which are very different.
what we need to store.
- We already have a place to store the messages, and that's largely outside of our control. Messages live on IMAP or NNTP servers. Or they live in Berkeley mail folders. We could conceivably replace the Berkeley mail folders with something else, but there are sound interoperability reasons why we want to keep using them. Anyway, we can't replace the IMAP or NNTP servers. So, we're not looking for a database to store the messages themselves.
- What we need is a way to index the messages. That is, given a set of criteria, we need to find the set of messages that satisfy the criteria. We would expect the returned messages to be in the form of pointers into the actual message store; we can then chase the pointers to get the messages themselves. The database would contain enough information to build a summary line for that message.
- We need to be able to store hundreds of thousands of messages. (If you think that number is too high, then you aren't thinking about news.)
Usually, when I hear people talking about using databases in Xena, they seem to be talking about a convenient place to store a few dozen or a few hundred objects. But Grendel is different in three ways:
Most traditional databases that I'm familiar with let you store records of data. Each record consists of several fields. A few of the fields are special "key" fields, that you can do fast searches on.
We can use this kind of traditional database, but it turns out every field we store will need to be a key field, because we want to do sorting and searching based on any of these fields.
And we have a lot of fields to store:
- Folder this message is stored in
- List of "To" recipients
- List of "Cc" recipients
- List of "Bcc" recipients
- Flags (read/unread, flagged, replied, forwarded, etc.)
- And I'm sure I'm missing some
Actually, the set of headers to store should probably be user-customizable; some users (like jwz) will want to store every possible header.
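To make the "every field is a key field" idea concrete, here is a minimal in-memory sketch. It is only an illustration under stated assumptions, not a proposed design: the names `MessagePointer` and `MessageIndex` are hypothetical, the `subject` field is assumed (it is not in the list above), and a real index would live on disk. It shows a configurable set of indexed fields, queries that return pointers into the message store rather than messages, and enough cached data to build a summary line.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MessagePointer:
    """Hypothetical pointer into the real message store (not the message itself)."""
    folder: str  # e.g. a Berkeley mail folder, IMAP folder, or newsgroup
    key: str     # e.g. a byte offset, IMAP UID, or article number


class MessageIndex:
    """Toy index in which every stored field can be searched on."""

    def __init__(self, indexed_fields):
        # The set of indexed fields is configurable per user.
        self.indexed_fields = tuple(indexed_fields)
        self.records = {}  # MessagePointer -> dict of cached field values
        # field name -> field value -> set of pointers
        self.by_field = {f: defaultdict(set) for f in self.indexed_fields}

    def add(self, pointer, **values):
        record = {"folder": pointer.folder}
        record.update(values)
        self.records[pointer] = record
        for name in self.indexed_fields:
            value = record.get(name)
            # Multi-valued fields (recipient lists, flags) index each element.
            items = value if isinstance(value, (list, tuple, set)) else [value]
            for item in items:
                if item is not None:
                    self.by_field[name][item].add(pointer)

    def find(self, **criteria):
        """Return pointers to the messages that satisfy every criterion."""
        result = None
        for name, wanted in criteria.items():
            matches = self.by_field[name].get(wanted, set())
            result = set(matches) if result is None else result & matches
        return result if result is not None else set(self.records)

    def summary_line(self, pointer):
        """Build a summary line from cached data, without touching the store."""
        r = self.records[pointer]
        return f"{r.get('subject', '(no subject)')} [{r['folder']}]"
```

A caller would run something like `index.find(folder="Inbox", flags="read")` and then chase the returned pointers into the message store only for the messages it actually needs to display.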
Much or all of what we'll store in the database is just cached information from the messages themselves. So, theoretically, if the database blows up, we can recreate it. Practically, though, this would suck a lot, as many users would have enough data that it would take hours or days to recreate.
It's gotta fly. Both reading and writing have to be pretty much instantaneous for the common cases. And the common cases are pretty broad.
We don't believe it is possible to write a database that has all the indexing we need, and all the reliability we need, and still get all the performance we need. So, we've figured out two dodges to help:
- Use 3.0-style summary files for the common case.
We won't even use a spiffy database for the usual case of "show me all the messages in this folder." Instead, we'll maintain a file for each folder that contains all the info we need about that folder; whenever the user opens up a folder, we inhale into memory the entire contents of the relevant file. This is a proven technique that covers the vast majority of common cases really well. It has some scaling problems, though, and it definitely doesn't allow for the nifty cross-folder views and searches that we really want to do. So, we still want a database to handle those new features.
We would love to be proven wrong, and to just throw away the summary file code and use the database for everything. But we are not yet convinced that this will ever work, and so we're not prepared to count on it.
- Update the database in a background thread.
(Any database gurus will probably laugh as I clumsily describe what has got to be a standard database technique. If it's not a standard technique, I deserve a patent.) Whenever changes are made that need to be stored in the database, don't immediately commit them to the database, because this is slow and will block the user from doing anything else that requires the database until it completes. Instead, note them down in a log file, and have a background thread incrementally commit the changes. Any database queries would check for entries in the log file, and merge in results from there. This technique makes database changes appear to happen instantaneously; the cost is that any immediate queries will run slightly slower as they merge in the uncommitted changes from the log file.
The first dodge helps us avoid the need for fast queries for the database, but we'll still want it as fast as possible for the new features that aren't handled well by summary files.
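As a rough illustration of the summary-file dodge, the sketch below keeps one file per folder and reads the whole thing into memory when the folder is opened. Everything here is an assumption made for the example: the function names, the `.summary.json` naming scheme, and the use of JSON, which merely stands in for whatever on-disk format the 3.0-style summary files actually use.

```python
import json
import os
import tempfile


def summary_path(summary_dir, folder_name):
    # Hypothetical naming scheme: one summary file per folder.
    return os.path.join(summary_dir, folder_name + ".summary.json")


def write_summary(summary_dir, folder_name, entries):
    """Write every summary entry for one folder, replacing any older file."""
    path = summary_path(summary_dir, folder_name)
    fd, tmp = tempfile.mkstemp(dir=summary_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(entries, f)
    # Rename into place so a reader never sees a half-written summary.
    os.replace(tmp, path)


def open_folder(summary_dir, folder_name):
    """Inhale the folder's entire summary file into memory in one read."""
    path = summary_path(summary_dir, folder_name)
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return []  # no summary yet; the caller would have to rescan the folder
```

The scaling problem is visible in `open_folder`: its cost grows with the size of the folder, and nothing in it helps with a query that spans folders.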
The second dodge helps us avoid the need for fast updates to the database, but a slow database will still definitely suck in a lot of ways, especially when the user does things like move thousands of messages from one folder to another, or receives a ton of new mail, or imports an entire new folder, or needs to rescan a whole folder (see below).
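The log-and-merge dodge can be sketched roughly as follows. This is a toy under loud assumptions: the class name `LoggedIndex` is hypothetical, a plain dict stands in for the slow database, and the pending log is an in-memory list, whereas the scheme described above uses a log file so that uncommitted changes survive a crash.

```python
import threading


class LoggedIndex:
    """Toy store: writes go to a pending log; a background step commits them."""

    def __init__(self):
        self._committed = {}  # message key -> record (stand-in for the database)
        self._log = []        # pending (key, record-or-None) changes, oldest first
        self._lock = threading.Lock()

    def update(self, key, record):
        """Note a change in the log and return at once; None means delete."""
        with self._lock:
            self._log.append((key, record))

    def commit_one(self):
        """Commit the oldest pending change; return False if the log is empty."""
        with self._lock:
            if not self._log:
                return False
            key, record = self._log.pop(0)
            if record is None:
                self._committed.pop(key, None)
            else:
                self._committed[key] = record
            return True

    def query(self, predicate):
        """Return matching records with uncommitted log entries merged in."""
        with self._lock:
            view = dict(self._committed)
            for key, record in self._log:  # later log entries win
                if record is None:
                    view.pop(key, None)
                else:
                    view[key] = record
        return {k: r for k, r in view.items() if predicate(r)}

    def run_committer(self, stop_event, idle_wait=0.05):
        """Body of the background thread: drain the log a change at a time."""
        while not stop_event.is_set():
            if not self.commit_one():
                stop_event.wait(idle_wait)
```

The cost described above shows up in `query`, which replays the whole pending log on every call; the longer the background thread falls behind, the slower immediate queries get.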
We have to be able to quickly throw out and rebuild large chunks of data, because at any time we may detect that everything we once knew about a folder is suddenly invalid. If another application has changed an IMAP folder or a Berkeley mail folder, we can detect the fact that a change happened, but we can't know what changed. We have no choice but to throw out everything in the database that relates to the folder, and recreate it. Just the "throwing out" part can be a real expensive operation on many databases.
Another nasty consequence of this is that the database is probably not a good place to write down any extra information about the messages. It's tempting to put annotations and extra status information solely in the database, without writing it anywhere in the real message itself. But because folders can be changed out from under us without warning, it is also tempting to consider the entire database as just a cache, where anything can be thrown out and recreated at will. These two goals conflict.
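One way to picture the detect-and-rebuild requirement for a local mail folder is sketched below. It is an illustration only: `FolderCache` and the `rescan` callback are hypothetical names, and using the file's size and modification time as the change detector is an assumption, not something this document specifies. The cache is partitioned by folder so that throwing out everything about one folder is a single cheap step instead of a record-by-record delete.

```python
import os


class FolderCache:
    """Toy cache partitioned by folder so one folder can be dropped cheaply."""

    def __init__(self):
        self._entries = {}      # folder path -> whatever rescan() returned
        self._fingerprint = {}  # folder path -> (size, mtime_ns) at last scan

    @staticmethod
    def fingerprint(path):
        # Tells us THAT the folder changed, not WHAT changed in it.
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns)

    def refresh(self, path, rescan):
        """Return cached data, rebuilding the folder's partition if it changed."""
        current = self.fingerprint(path)  # taken before the rescan starts
        if self._fingerprint.get(path) != current:
            # Throw out the old partition wholesale and recreate it.
            self._entries[path] = rescan(path)
            self._fingerprint[path] = current
        return self._entries[path]
```

Because `refresh` discards the whole partition, anything stored only in the cache (an annotation, say) would be lost on the next external change, which is exactly the conflict described above.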
fitting in with RDF.
One bit of good news is that RDF's view of data is a model that works well for our needs. A really fast database that directly implements the RDF model could be used directly by our stuff. The main thing missing would be the ability to sort query results; as near as I can tell, RDF does not support sorting of results. But I think we can live without the database sorting its results.
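Living without database-side sorting would look something like this sketch, in which the store hands back an unordered set of matches and the application sorts them itself. The plain (subject, predicate, object) tuples and the function names are assumptions made for illustration; they are not any real RDF API.

```python
def query(triples, predicate, obj):
    """Return the subjects that have the given arc, in no particular order."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}


def value_of(triples, subject, predicate, default=""):
    """Look up one property of a subject, or a default if it has none."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return default


def sorted_by(triples, subjects, predicate):
    """Sort query results in the application, since the store does not."""
    return sorted(subjects, key=lambda s: value_of(triples, s, predicate))
```

The price is that the application has to fetch the sort key for every result before it can show the first one.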
truly ambitious stuff.
We haven't thought a lot about it, but we'd love a database that would support full body text indexing on messages. Yow.