Mozilla Tree Verification Process
Author: Chris Yeh
Last updated: October 11th, 1999
This document describes the verification process of the Mozilla source tree, which happens every weekday starting at 8 AM, US Pacific Time. It explains why we use this process, what happens during the process and who is involved.
This document is written in a question/answer format. It can be read either as a whole document or as a FAQ.
- How does the verification process work?
- Wow. So all of the developers can't check in code during this time period?
- Isn't that terribly expensive in terms of development time?
- Why not let people check in during the verification?
- Why not shift the time of the verification earlier so that the time spent pulling and building a tree burns time when no one is up at night?
- Is there anything you can do to make the verification process take less time?
- So do you really hold up all the developers if a single platform is dead? What about the people working on other platforms?
- Where did the verification process come from?
- Why is the verification process used?
- What problems does the verification process attempt to solve?
How does the verification process work?
The verification process starts by "closing" the tree at 8 AM, US Pacific Time. From this time on, no developers are to commit source code changes to the revision control system.
At 8:05 AM, a set of automated verification machines begins pulling a source code tree and building it.
If there is a problem building the product on any of the verification machines, the machines page the release team to inform them of the problem. The release team member in charge of the verification examines the build logs and determines the compile error that caused the build to stop. If the release engineer can determine the problem and implement a fix, the fix is committed to the source tree and the build is restarted.
After the builds are complete on all verification machines, a set of runtime tests (called smoke tests) is performed on the binaries. The smoke tests are a subset of the full QA test suite. They are executed to look for feature or performance regressions in key areas.
If the smoke tests fail on a particular platform (or in some cases, the binary crashes on startup) then the release engineer reports the test failures to all of the engineers who checked in since the last verification build (known as "The Hook" as in being on the hook). The release engineer then looks at the latest source changes and attempts to find the check-in and engineer that may have caused the regression. If the release engineer can't find the cause quickly, the release engineer is empowered to find a software engineer to help track down the regression.
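The "Hook" can be pictured as a filter over the check-in log. The sketch below is illustrative only: the `Checkin` record, the engineer names and the timestamps are all hypothetical, and the real tooling is not described in this document.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Checkin:
    """One commit to the tree (hypothetical record, for illustration only)."""
    engineer: str
    when: datetime


def on_the_hook(checkins, last_verified, tree_closed):
    """Return the engineers who checked in after the last verification build
    and no later than the tree closing, sorted and without duplicates."""
    return sorted({c.engineer for c in checkins
                   if last_verified < c.when <= tree_closed})


# Made-up data: Tuesday's verification build and Wednesday's 8 AM close.
last_verified = datetime(1999, 10, 12, 8, 0)
tree_closed = datetime(1999, 10, 13, 8, 0)
checkins = [
    Checkin("alice", datetime(1999, 10, 11, 17, 30)),  # already verified
    Checkin("bob", datetime(1999, 10, 12, 14, 15)),
    Checkin("carol", datetime(1999, 10, 12, 23, 50)),
    Checkin("bob", datetime(1999, 10, 13, 7, 45)),
]
hook = on_the_hook(checkins, last_verified, tree_closed)
print(hook)  # ['bob', 'carol']
```

Only check-ins made since the last verified build are candidates, which is why the release engineer can limit the first round of notifications to that group.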
Once the cause of the regression has been found, the release engineer and the software engineer find an owner for the bug and evaluate how long it will take to fix the problem. The estimated time to fix the bug is communicated to all developers.
If a fix for the regression is ready, then the fix is checked into the tree and the verification builds are rebuilt. When the builds are complete, they are re-tested to verify the fix and ensure that another regression hasn't been introduced.
If the fix is good, the builds are delivered to QA and the source tree is "opened" for code changes. Developers are free to commit source code changes to the revision control system.
If a fix for the regression is not found, the release engineer and a group of developers re-evaluate the bug and decide whether it is worth keeping the tree "closed" to additional changes. In some cases there has been sufficient progress in the bug investigation and the tree will re-open. In other cases it is determined that the regression has a workaround or does not affect a critical feature, and the tree will re-open. In some rare cases the bug will be deemed so severe that the tree will remain closed (sometimes for days at a time) until a fix is found. Once a fix is found, the code is rebuilt and redelivered to QA.
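Taken together, the steps above form a small decision procedure. The following is a minimal sketch of that procedure, not the release team's actual tooling. The function name, its parameters and the returned strings are all hypothetical.

```python
def next_step(build_ok, smoke_ok, fix_ready=False, keep_closed=False):
    """Decide what happens next in one pass of the verification.

    All parameter names are illustrative assumptions:
      build_ok    -- every verification machine built the tree
      smoke_ok    -- smoke tests passed on every platform
      fix_ready   -- a fix for the regression is in hand
      keep_closed -- the bug was judged severe enough to hold the tree
    """
    if not build_ok:
        # Build breakage: the release engineer reads the logs and fixes it.
        return "fix the compile error, then restart the build"
    if smoke_ok:
        return "deliver builds to QA and open the tree"
    if fix_ready:
        return "check in the fix, rebuild, and re-test"
    if keep_closed:
        return "keep the tree closed until a fix is found"
    # A workaround exists, the feature is not critical, or the
    # investigation has made enough progress.
    return "re-open the tree"


print(next_step(build_ok=True, smoke_ok=True))
# deliver builds to QA and open the tree
print(next_step(build_ok=True, smoke_ok=False, keep_closed=True))
# keep the tree closed until a fix is found
```

The ordering of the checks mirrors the text: a broken build is handled before any testing, and the decision to hold the tree is only reached when no fix is available.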
Wow. So all of the developers can't check in code during this time period?
Yes. Occasionally the release engineer in charge of the verification will grant exceptions to this rule. Most of these exceptions are for changes to parts of the source tree that are not a part of the primary make process.
Isn't that terribly expensive in terms of development time?
Yes. The verification process effectively prevents a large number of engineers from being able to commit changes to the source code. There are approximately 130 active Mozilla contributors (as of October 1999), so every minute the tree is closed equals 130 minutes of development time that is lost.
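The arithmetic is easy to run for a whole closure. The sketch below uses the same simple model as the text, in which every contributor is blocked for the full time the tree is closed, so the result is an upper bound. The function name is made up for illustration.

```python
ACTIVE_CONTRIBUTORS = 130  # figure quoted in the text, as of October 1999


def lost_developer_minutes(minutes_closed, contributors=ACTIVE_CONTRIBUTORS):
    """Upper bound on development time lost while the tree is closed."""
    return minutes_closed * contributors


# A tree that closes at 8:00 and re-opens at 10:30 is closed for 150 minutes.
lost = lost_developer_minutes(150)
print(lost)       # 19500 minutes
print(lost / 60)  # 325.0 hours
```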
Why not let people check in during the verification?
Letting people check in during the verification process destroys the ability to create a known baseline of performance and stability. More importantly, it makes the verification process take longer.
The whole point of the verification is to verify buildability and create a stable platform for feature development. So let's assume that we allow changes to the source code during the verification:
8:00 tree closes
8:05 tree is built
10:00 builds are being tested.
10:10 engineers check in changes to source tree
10:30 builds are verified to be okay, tree opens.
What happens if the changes that were checked in at 10:10 caused the builds to crash at startup? One hundred engineers will pull a tree, thinking that they will get a stable platform to develop on when instead they will waste two hours building and have nothing to work with.
In addition to failing to establish a baseline, you also make the investigation harder if the builds fail their tests. For example:
8:00 tree closes
8:05 tree is built
10:00 builds are being tested.
10:05 build fails tests, investigation starts into bug.
10:10 other engineers check in changes to source tree
10:30 fix in hand for failed test.
10:35 rebuild verifications
11:00 re-test builds, fail again.
Now did we fail because the original fix was bad? Or because someone checked in new bugs?
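One way to see the ambiguity is to count the explanations for the second failure. The sketch below is a toy model, with made-up labels for the changes: every non-empty combination of changes that entered the tree since the first failing build is a possible culprit.

```python
from itertools import combinations


def possible_culprits(changes):
    """All non-empty combinations of changes that could explain a failed re-test."""
    return [combo
            for size in range(1, len(changes) + 1)
            for combo in combinations(changes, size)]


# Tree kept closed: only the fix went in between the two builds.
closed_tree = possible_culprits(["10:30 fix"])
# Tree left open: an unrelated check-in went in as well.
open_tree = possible_culprits(["10:10 check-in", "10:30 fix"])

print(len(closed_tree))  # 1
print(len(open_tree))    # 3
```

With the tree closed, a failed re-test points at the fix. With one extra check-in there are already three explanations, and the count grows as 2^n - 1 with the number of changes.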
Complicating this is the cross platform nature of the code. Any fixes need to be tested on all platforms, because of compiler and runtime differences. So even fixes to address a bug on a particular platform could cause a regression on another platform.
Why not shift the time of the verification earlier so that the time spent pulling and building a tree burns time when no one is up at night?
The source code takes two hours to pull and build on the verification machines. The thinking behind this suggestion is that those two hours could be spent in the early morning, when no one will be committing changes to the tree anyway. (The theory being that engineers are asleep during this time.)
There are three problems with this. First, Mozilla is a global project. There are people working on the code at all times of the day. Second, geeks are notorious for staying up for ridiculous amounts of time to finish something. Third, the suggestion assumes that the verification goes off without any problems.
Let's assume that engineers start work at 10:30 am Pacific Time and that we've moved the verification time to start at 6 am Pacific Time.
6:00 tree closes
6:05 tree is built
8:00 builds are being tested.
If there aren't any problems:
8:30 open the tree.
If we encounter a problem, then it looks like this:
8:30 tests fail. wait until:
10:30 start investigation with engineers.
So you lose 120 minutes because no one is around to start the investigation and bug fixing. It's a false savings. You're betting that there won't be any problem. If there isn't a problem the tree is closed for 2.5 hours. If there is a problem, the tree is closed for at least 4.5 hours.
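The figures follow directly from the timeline. Here is a short check of the arithmetic; the helper name is hypothetical and the times are the ones used in the example above.

```python
from datetime import datetime


def hours_closed(close, reopen):
    """Hours between two same-day clock times given as 'H:MM' strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(reopen, fmt) - datetime.strptime(close, fmt)
    return delta.total_seconds() / 3600


no_problem = hours_closed("6:00", "8:30")          # tree opens right after testing
with_problem = hours_closed("6:00", "10:30")       # before any fixing has started
idle_minutes = hours_closed("8:30", "10:30") * 60  # waiting for engineers to arrive

print(no_problem, with_problem, idle_minutes)  # 2.5 4.5 120.0
```

The 4.5 hours is a lower bound, because it only covers the time until the investigation starts.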
Is there anything you can do to make the verification process take less time?
The only constant in the process is the amount of time required to pull and build the tree on all verification platforms. This is where we throw money at the problem in the form of ultra-fast machines with lots of RAM, multiple CPUs and disk arrays.
So do you really hold up all the developers if a single platform is dead? What about the people working on other platforms?
In Mozilla, the reference platforms (Linux, Macintosh, Win32) are all equal. If one of them breaks, we hold the tree.
During the development of Netscape Navigator and Netscape Communicator it was argued many times that, based upon shipping deadlines and market share, we should care less about a particular set of platforms and fix regressions on the "second-class" platforms later. We tried this once. The reason we don't have Netscape Communicator on Win16 is that we put off the recovery of that platform until later. After a couple of weeks, recovery became impossible.
If you have a regression on a platform, allowing other platforms to continue checking into a common codebase ends up stacking the deck against the one dead platform. The problems will stack up behind the original one as the codebase moves forward and it never catches up.
Where did the verification process come from?
The verification process is the result of four years of development on Netscape Navigator and Netscape Communicator. Netscape Communicator was a large cross platform application built on shared library code on over 30 different computing platforms. It also had a large number of developers checking into the code base at the same time.
We use this process in Mozilla because the software has the same qualities. It is a large application that has shared code that must build and run on multiple platforms. As an open source project, it also has a large body of engineers contributing code.
The verification process is a constantly evolving one as new challenges and problems arise. It has undergone at least four major changes as the software and the number of engineers on the project increased in size. It should be expected to change again in the future to address scalability issues.
The verification process is just a part of what could be called "The Mozilla Development Process". A more detailed document outlining the history and development of that process can be found here.
Why is the verification process used?
During the development of Netscape Navigator and Netscape Communicator we learned that developing a cross platform application with shared code was a tremendous undertaking. In order to be able to meet our deadlines and ship the software, this process was developed to ensure daily deliveries to QA and also provide developers a stable development platform on which to develop new features.
What problems does the verification process attempt to solve?
The verification process was developed in order to deliver builds to QA on a daily basis and ensure that developers had a stable code base with which to develop new features and bug fixes across multiple platforms.
The verification process tries to address two problems: build/compile problems and runtime stability.
Build and Compile Problems
This may sound easy, but in reality keeping the source tree building and compiling is a hard task. There are a number of issues that will stop your ability to compile the code across all platforms:
Compiler differences
Each compiler implements the C/C++ standard in a different way. This isn't merely code generation (which is going to be different across platforms) but also how the compiler handles errors and warnings. Some compilers are rather strict, while other compilers are rather forgiving. We try to set the error and warning levels to be the same as much as possible, but even the settings are implemented differently from compiler to compiler.
As a result, code that will happily compile under one compiler will cause another compiler to spew two pages of errors and warnings or even crash. Code that runs fine when built with one compiler crashes when built with another because of differences in the implementation of 'int'.
Build system differences
This could alternatively be titled "Build Lore differences." The reference platforms that Mozilla uses (currently Macintosh PPC, Linux glibc, and Win32) have different build systems because the build tools and platform dictate it. As a result, it can be difficult to remember what needs to be done on each platform. Documentation alleviates some of this but it doesn't replace hard technical knowledge of each platform's idiosyncrasies.
Broken dependencies
The majority of developers pull a tree, make changes to the code and then commit them to the revision control system. Then they update their source code and do a depend build in their current tree. In a perfect system all dependencies would be known and accounted for and everything would get rebuilt when it needed to.
However, the world is far from perfect. There have been occasions where we've discovered holes in the build system where you can't build from a tree pulled from scratch.
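A toy model shows how such a hole can hide in a developer's existing tree. Everything in this sketch is hypothetical, including the target names: `app.o` really needs `generated.h`, but the build system schedules `app.o` first because that dependency was never declared.

```python
def can_build(order, needs, already_built=()):
    """Walk targets in the order the build system schedules them and
    report the first target whose real inputs are not yet available."""
    available = set(already_built)
    for target in order:
        missing = needs.get(target, set()) - available
        if missing:
            return False, target
        available.add(target)
    return True, None


order = ["app.o", "generated.h"]    # schedule with the undeclared dependency
needs = {"app.o": {"generated.h"}}  # what each target actually requires

# A developer's tree still has generated.h from an earlier build.
print(can_build(order, needs, already_built={"generated.h"}))  # (True, None)
# A tree pulled from scratch has nothing built yet.
print(can_build(order, needs))  # (False, 'app.o')
```

The depend build succeeds only because of leftovers from a previous build, which is why the verification machines pull and build a fresh tree.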
Debug vs. Optimized
Developers build debug. QA wants to test optimized. Occasionally changes made to the build system will work on debug and not on optimized builds.
Runtime Stability
The program has to be stable enough to serve as a platform to code new features. You can't verify new features if the program is crashing at startup or has major runtime flaws. Being able to pull a tree and build it is useless unless you can execute the program long enough to create and debug new code.