Demystifying Footprint
Suresh Duddi <dp@netscape.com>
Created: 30 Jan 2002; Last Modified: 2 March 2002
Glossary
It is highly recommended that these terms be looked up and understood.

Virtual Memory

Working Set - The working set of a program or system is the memory, or set of addresses, that it will use in the near future.

max-working-set - The union of all addresses accessed during a given time period {tbegin, tend}.

Resident Set - In a virtual memory system, a process's resident set is the part of its address space that is currently in main memory. If this does not include all of the process's working set, the system may thrash. Usually the resident set is the Working Set plus other pages that the app used earlier and that have not yet been swapped to disk. Operating systems decide when to swap unused pages of an app to disk. Mostly this decision is demand based: pages are swapped out when other applications need more physical memory and none is free. Windows will swap out all unused pages when an application is iconified.

Thrash

Footprint - Can mean different things in different contexts. For example, on a machine without virtual memory, `footprint' would probably mean `the maximum amount of physical memory that the application requires'.
        
Most modern operating systems have a Virtual Memory (VM) system that gives each process its own Virtual Address Space. Only a part of the virtual memory required by a process is paged into physical memory: the part that is required now. This is called the Resident Set.
          
Process address space can be broken down into:

| Stack (grows downward) |
| Heap (grows upward) |
| Static data |
| Code (lib) |
| Code (exe) |
              
Usually,
Virtual Memory > Resident Set > Working Set
When the physical memory available to the app becomes less than the app's Working Set, the app will thrash, causing poor performance.
Understanding the heap
The "heap" is mostly data allocated using malloc() and released using free(). When an application requests memory, the allocator (the implementation of malloc and friends) gets VmData in bulk from the OS and manages it.

| Layer | Role |
|---|---|
| Application | Requests memory via calls to malloc() |
| Allocator - libc | Implements malloc(). Requests VmData from the OS in bulk using sbrk(), mmap() or VirtualAlloc(), and manages the memory returned. |
| Operating System - kernel | Manages physical memory and the mapping of each process's virtual memory into physical memory |
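The layering above can be illustrated with a toy model. This is only a sketch: `ToyAllocator`, its 64 KB `chunk_size` and its counters are hypothetical names and values chosen for illustration, and the model ignores fragmentation and per-block overhead. It shows the one property that matters for footprint: the application asks for small pieces, the allocator asks the OS for large ones, and freed memory returns to the allocator rather than to the OS.

```python
class ToyAllocator:
    """Toy model of an allocator that gets memory from the OS in bulk."""

    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size  # bytes requested from the "OS" per call (assumed value)
        self.from_os = 0              # total bytes obtained from the OS (think VmData)
        self.in_use = 0               # bytes currently handed out to the application
        self.os_calls = 0             # number of bulk requests made to the "OS"

    def malloc(self, nbytes):
        # Grow the heap in whole chunks, and only when the free part is too small.
        while self.from_os - self.in_use < nbytes:
            self.from_os += self.chunk_size
            self.os_calls += 1
        self.in_use += nbytes

    def free(self, nbytes):
        # Freed memory goes back to the allocator, not to the OS.
        self.in_use -= nbytes


heap = ToyAllocator()
for _ in range(1000):
    heap.malloc(100)          # 1000 small application requests
heap.free(100 * 1000)         # the application releases everything
print(heap.os_calls, heap.from_os, heap.in_use)
```

In this model a thousand application requests turn into a couple of OS requests, and the amount obtained from the OS does not shrink when the application frees its memory.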
      
User Statement of the problem
- "When I run Netscape 6 for days on win98, I get alerts warning me of low virtual memory"
This could be because the swap file is small on win98. Swap size + physical memory size caps the total amount of Vm used by all applications, so as processes use more Vm, we could hit this limit. It is really hard to hit this limit on WinNT, 2000, XP or Linux.
- "On my 32MB windows/linux machine, Netscape 6 is slow and sluggish"
This means that Netscape 6's working set for the user scenario is large enough that it cannot all be held in physical memory, and the system thrashes as it cycles pages in and out to satisfy Netscape 6's working set.
- "When Iconifying and deiconifying Netscape 6, it takes a long time to become active"
When an application is iconified on Windows NT/2000/XP, the OS actively starts swapping out unused pages. When it is deiconified, the pages that are needed are swapped back in on demand, as always. If the working set for deiconifying and displaying Netscape 6 is large, a large amount of memory needs to be swapped back in, and to make room for it, potentially as much memory needs to be swapped out to disk from other applications that are using it. This could account for the delay in getting Netscape 6 to become active again after deiconification. It would also account for other apps being slower when Netscape 6 is running, as it causes them to thrash more.
- "Netscape 6 is too fat, consumes too much memory"
This is a perception issue and, stated this way, has no measurable meaning.
 
Ideal Metrics
From the above statement of user problems, and given ideal tools that could measure anything, these would be the numbers to measure and improve:
- Max-working-set : performance threshold
For a given scenario, the union of the working sets required by Netscape 6 at every instant during the scenario. This number says that, on a machine with backing store (swap), Netscape 6 needs this much physical memory available after the OS and other apps have been loaded to give the user non-sluggish performance. It is a function of the process and not of available physical memory; i.e., for a given application and scenario, it is a constant irrespective of what other apps are running or how much physical memory is available.
- Peak-vm-usage : pagesize threshold
For a given scenario, the maximum vm requirement of Netscape 6, assuming no allocator buffering of virtual memory. Usually allocators don't return virtual memory obtained from the operating system (via sbrk()) unless the unused VM is larger than a threshold and is at the end of the process's VM space.
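As a minimal sketch of what measuring max-working-set would involve, suppose we had a trace of memory accesses for a scenario. The trace format of (timestamp, address) pairs, the 4096-byte page size and the function name `max_working_set` are all assumptions made for this illustration; no tool named in this document produces such a trace. Since physical memory is handed out in pages, the sketch takes the union at page granularity.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def max_working_set(trace, t_begin, t_end, page_size=PAGE_SIZE):
    """Return the set of page numbers touched during [t_begin, t_end].

    trace is an iterable of (timestamp, address) pairs (hypothetical format).
    """
    return {addr // page_size for t, addr in trace if t_begin <= t <= t_end}


# A made-up trace: five accesses that fall on three distinct pages.
trace = [(0, 0x1000), (1, 0x1004), (2, 0x5000), (3, 0x9000), (4, 0x1008)]

pages = max_working_set(trace, 0, 4)
print(len(pages), "pages,", len(pages) * PAGE_SIZE, "bytes")
```

The result depends only on the trace, which matches the point above: for a given application and scenario the number is a constant, whatever else is running on the machine.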
 
Measurable Metrics
Max-working-set: We currently don't have the means to measure this. We are working on it.
Peak-vm-usage: This, or a function of it, can be measured reliably but is operating system and allocator dependent.
User Scenarios for measurement:
- Run pageload test : http://cowtools.mcom.com
- Startup w/default profile, home.netscape.com
- Read 10 messages of 200 from netscape.net email using IMAP
- Compose & send one email
 
Windows
First let us see what the numbers reported by the various tools mean:
- TaskManager : [Operating System] Task Manager reports the Resident Set. Not very useful.
- Taskinfo 2000 : [Operating System] Reports total Vm usage on 2000 and a VmCode/VmData breakdown on win98. VmCode includes static data. Gives VmCode for each dll and the executable.
- SpaceTrace : [Application] Reports peak heap requested by the application. VmData is a direct function of the heap requested by the app: the allocator stands between the app and the system and requests Vm in bulk. It also decides when to give unused memory back to the OS.
- HeapInfo : [Allocator] Reports Vm requested by the allocator from the operating system on behalf of the application. Also reports a breakdown of the usage of this memory. win32 only.
 
Linux
  
On Linux, say 22049 was the process id for netscape:
chetu> grep Vm /proc/22049/status
VmSize: 41844 kB
VmRSS: 26260 kB
VmData: 20316 kB
VmLib: 17752 kB

| Field | Meaning |
|---|---|
| VmSize | Virtual memory usage of entire process = VmLib + VmExe + VmData + VmStk |
| VmRSS | Resident Set currently in physical memory, including Code, Data, Stack |
| VmData | Virtual memory usage of Heap |
| VmStk | Virtual memory usage of Stack. Doesn't change much. |
| VmExe | Virtual memory usage by executable and statically linked libraries. 'man top' says this is broken? |
| VmLib | Virtual memory usage by dlls loaded |
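The same fields can be collected from a script. The sketch below parses the Vm lines of a status dump; `parse_vm_fields` is a hypothetical helper name, and the sample text is the output shown above. Taking the VmSize breakdown in the table at face value, whatever VmData and VmLib do not account for is attributed to VmExe + VmStk, which that output does not show.

```python
def parse_vm_fields(status_text):
    """Return {field_name: size_in_kB} for the Vm* lines of a status dump."""
    fields = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.startswith("Vm"):
            fields[name] = int(rest.split()[0])  # first token is the size in kB
    return fields


# Sample taken from the grep output above.
sample = """VmSize: 41844 kB
VmRSS: 26260 kB
VmData: 20316 kB
VmLib: 17752 kB"""

vm = parse_vm_fields(sample)
# Assumes VmSize = VmLib + VmExe + VmData + VmStk, as stated in the table.
remainder = vm["VmSize"] - vm["VmData"] - vm["VmLib"]
print(vm)
print("VmExe + VmStk (kB):", remainder)
```

On a Linux machine the text could instead be read from /proc/<pid>/status for the process of interest and sampled repeatedly during a scenario to track the peak.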
        
Goal
- Reduce Peak-vm-usage : Peak-VmData and Peak-VmCode
 - Reduce Peak-working-set
 
Plan
| Approach | Impact | Benefit |
|---|---|---|
| Release data no longer needed | data | Reduces peak vm usage. Frees more memory so the allocator need not get more from the OS. Reduces max-vm-usage. |
| Reduce <64 byte allocations | data | Reduces overhead. Post startup: 15% of USED memory is consumed by overhead (data from HeapInfo). Post startup: 78% (about 71,000) of allocations are for < 64 bytes (data from SpaceTrace). |
| Reducing memory churn | data | This is performance, not footprint. Since fragmentation isn't high, this won't help footprint much. |
| Reduce code | code | Caution: Focus on code needed for the scenario rather than any code. This is going to be really hard to achieve for Mach-V. Makes more sense for embedding. |
| Delay load dlls | code | Reduces max-working-set and max-vm-usage. Caution: Delaying is useful only if the dll load can be postponed past the scenario. |
          
Notes on allocator
Windows 2000
- Uses a best-fit allocator
 - Fragmentation is less than 4% of free space - not a problem
 - Rarely releases free space back to operating system : HeapCompact() doesn't do much
 - Doesn't use mmap() much
 - Allocated blocks are aligned on 8 byte boundaries
 - Malloc overhead is usually 8 to 16 bytes per block
 
Linux
- Uses ptmalloc - a derivative of Doug Lea's boundary tag allocator
 - Allocated blocks are aligned on 8 byte boundaries
 - Malloc overhead is 4 bytes per allocation.
 - Minimum allocated size is 16 bytes
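To see why small allocations are costly, the Linux numbers above can be combined into a rough estimate of what a request really consumes. This is a simplified model, not the allocator's actual code: `chunk_size` is a hypothetical helper, and it assumes the size is the request plus the per-allocation overhead, rounded up to the alignment, and never below the minimum.

```python
def chunk_size(request, overhead=4, align=8, minimum=16):
    """Estimate the bytes consumed by malloc(request) under the model above."""
    size = request + overhead
    size = (size + align - 1) // align * align  # round up to the alignment
    return max(size, minimum)


for n in (1, 12, 13, 60):
    used = chunk_size(n)
    print(f"request {n:3d} -> chunk {used:3d} (lost {used - n} bytes)")
```

The defaults are the Linux figures listed above. Passing overhead=8 or overhead=16 gives a comparable estimate for the Windows 2000 numbers, with the caveat that no minimum size is stated for that allocator.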
 
Recommended Reading
- Doug Lea's boundary tag allocator (Linux malloc) : http://gee.cs.oswego.edu/dl/html/malloc.html
 - Dynamic Storage Allocation: A Survey and Critical Review : ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps
 - http://developer.apple.com/techpubs/macosx/Essentials/Performance/VirtualMemory/Virtual_Mem_on_Mac_OS_X.html
 - http://developer.apple.com/techpubs/macosx/Essentials/Performance/VirtualMemory/Allocating__eing_Memory.html