Performance: profiling jujuContact: Chris Waterson (firstname.lastname@example.org)
This page contains tricks and tips for the black art of Profiling Mozilla. For information about the tools that are available for profiling, please see this page.
Rational Software's Quantify product can be a truly wonderful tool to work with. If you can get it to work.
Collecting data is actually the "riskiest" part. Mozilla is, unfortunately, a big 500 pound gorilla of a product to profile, and it really pushes Quantify to its limits. Here is the magic recipe that I've used to get traces out of the tool on Win32.
Have lots of swap space. Configure your machine so that you've got a ton of swap space. (I've heard reports of needing as much as 3,000MB -- that's three gigabytes, baby!)
MOZ_DEBUG_SYMBOLS=1 set and
--enable-optimize --disable-debug --enable-quantify.
Rule Number One in profiling is to profile what you'll
ship. Profiling debug bits means you'll fight with
sprintf() calls that generate debugging strings that go
nowhere, sloppy (but quickly) generated code from the compiler, and
#ifdef DEBUG cruft that people have left in to
verify internal structures. In other words, you'll risk spending the
afternoon chasing down a "big win", only to discover somebody's sanity
checking code is really, really slow.
Build in a separate tree. Don't combine debug and optimized bits in the same tree. Doing so risks the anger of the profiling gods.
Run the app once before starting to profile. After a clean build, you should run the app once to let XPCOM autoregistration and other "first time" tasks take place; e.g., selecting the default profile. (Unless, of course, your goal is to profile these "first time" tasks.)
Use "function level timing" for a module that Quantify can't instrument. By default, Quantify will try to instrument each DLL using line-level instrumentation. This is the most accurate; however, sometimes (when appropriate chickens have not been sacrificed), doing line-level instrumentation will cause the target app to crash. If this happens, try changing the instrumentation for the module that was being instrumented when the crash occurred to function-level. To do this:
- Click the Setting... button in the Run Program dialog. This will bring up the Program Settings dialog.
- Click the Modules... button in the Program Settings dialog. This will bring up the Module Instrumentation dialog.
- Find the module that was being instrumented when the crash occurred, and change its Measurement from Default to Function.
(Tip o' the hat to kipp for that trick.)
Remove all the files from your Quantify cache directory if the above doesn't seem to work. (This is what the tech support people at Rational say, anyway.) I think maybe this has worked for me. Or maybe it was because I did it under the light of a full moon...
"Pause" recording of data until you get to the task you want to profile. You do this by clicking the little "VCR pause" button in the toolbar, and then clicking the "Rewind" button to clear out any partially-collected results. Get the app to a state where you are just about to perform the task you want to profile, then "un-pause" the data recorder, perform the task, re-pause the recorder, and shut the app down cleanly. Besides avoiding a ton of unnecessary cruft, you will greatly increase the chances that Quantify will actually be able to parse the collected data set.
Once you've collected the data, you'll quickly find yourself weeding through more information than you probably care to know. Here are some tips for isolating the stuff that's important.
Plan to take notes. As you start to walk through this data, you'll stumble on dozens of things that surprise you. Rather than trying to address each individually, take notes so you can come back to stuff.
Focus on the thread that did the work when analyzing the data. Quantify will collect timing information from all the threads that ran during your task, and will include "dead time" spent waiting in Necko threads even if you didn't happen to load anything. Wading through that just makes your life harder. Start by finding the "green thread" (i.e., the thread that did all the work), and then use the Call Graph window and the right-mouse to Focus on Subtree. I've found it amazingly useful to recursively "focus down the tree" on hot operations.
Use the "Function List" to find hot function. The function list can be sorted by column, so you can find functions by name, function time, and call count. The "function+descendant time" and "call count" columns are most useful for finding the Big Kahunae of low hanging fruit. Start there.
Use the "Function Detail" window to walk up and down call chains. The function list will give you a crude overview of where time is being spent. But you know the code better than Quantify. Use the function list (sorted alphabetically) to find key entry points (e.g., where layout is starting, where a reflow is triggered), and then switch to the function detail view to navigate up and down the tree to see who is calling it and who it calls. You may be surprised. Use the Call Graph window to re-focus on subroutines if you become particularly interested in one place or another.
Use call counts to identify opportunities for algorithmic change. A high call count may indicate an opportunity for algorithmic change; e.g., replacing a list with a hashtable.
Use the "Annotated Source" to examine leaf-level functions that are taking a lot of time, or have high call counts. You can access this menu by right-clicking in the Function Detail menu, choose Annotated Source in the Switch To submenu.
Compare profiles against each other after optimizing. Make sure
the performance optimization you thought you made actually
worked. Save the original profile as a baseline, and then refer back
to it as you tune the code. Not only will this keep you honest, it'll
give you bragging rights ("I made
foo() go 200% faster!").
Visual Studio 2005 Notes
If you're attempting to use Quantify with Visual Studio 2005, you'll need to be using version 7.0 of the Rational suite. Data collection with VS2005 built binaries seems to require much more profiling space than previous required. Discussion with Rational support gave these steps to allow data collection, in addition to the normal Quantify instructions above:
- Enable 3GB mode in Windows. Edit
/3GBto the end of line in the
[operating systems]section. Reboot to make this take effect.
- Before you start a run, you need to add some quantify options:
- In the Run Program dialog, click on Settings...
- Click on the Files tab
- Add the following to Additional options:
If you've already tried profiling before adding those options, make sure Quantify reinstruments the EXE/DLLs by clearing the cache - Run Program, click on Settings..., select the PowerTune tab, click on Configure..., click on Clear Cache.
Mac Profiling Tips
Resolving aliases is slow!
Launching the app after restarting the Mac takes about 37 seconds with a debug build with aliases and about 19 seconds with the real files.
It takes 4.5s to load the 43 DLLs instead of 3.06s, which means that using aliases makes it 50% slower for relatively big files. It must be worse for plenty of small files (like xpt, xul, gif...).
Result: if you are working on performance, you should use a copy of the debug build with real files instead of aliases otherwise you may not be able to appreciate your improvements to their fullest.
File cache makes a big difference
With VM on and a cache of 8Mb, it takes 37 seconds to launch a debug build with aliases. With a cache of 128Kb (the minimum allowed by MacOS), it takes 1 minute and 5 seconds. The time needed to load just the DLLs was 12 seconds instead of 4.5.
Virtual Memory makes almost no difference
Turning the Virtual Memory on and off, makes no difference at all for launching the app for the first time: 37 seconds in both cases. However, VM-on improves quite a bit the time needed to relaunch the app just after a Quit: about 9 seconds instead of 12. Most of this difference comes from the time needed to reload the DLLs: 0.8 seconds in one case, 2.9 seconds in the other.
Optimized builds are a little bit faster
We suspected it, however the difference is not very important. I compared a debug build without aliases and a daily build from Mozilla.org. After restarting the Mac, the debug build takes about 19 seconds to launch, the optimized build about 16 seconds. We have about the same difference when relaunching the applications a second time: 12 seconds versus 9. It can be explained by a smaller size of DLLs in optimized builds rather than by a better code (but it is just my guess).