Better Understanding PHP’s Garbage Collection

It’s interesting how just a few years can make a difference in the names that are given to things. If this were to come up today, it would probably be called PHP Recycling Options, because rather than picking things up and throwing them into a landfill where they’ll never be seen again, we are really talking about grabbing things whose use has passed and setting them up to be useful again. But recycling wasn’t the darling of society back when the idea was developed, and so this task was given the vulgar name of ‘Garbage Collection’. What can we do but follow what history and common usage have given us?

Program Generated Garbage

Programs use resources; sometimes small ones, sometimes much bigger. An example would be a data field. A program may define a data field, say a sequence number, that is used in the program. And once defined, this data field will take up space in memory, probably only a few bytes, but space nonetheless. Since every machine or programming environment has a finite (albeit large) amount of space available, the remaining space that it has left will be reduced by the amount of space that this field takes up.

When the program ends, naturally, the program, and any space that it has tied up, will disappear and the total space available will expand back to its maximum size. But what happens if the program never ends?

I’ve written a few such programs in my time. Works of beauty they were, and I was always pleased when everyone else in the shop noticed that I had created one. There’s nothing that points out your capabilities quite as much as bringing a big ol’ piece of IBM iron to a standstill all by yourself, while from the surrounding cubicles one person after another says loudly, “hey, is there something wrong with the system?” The trick is to chime in second or third so as to deflect attention from yourself.

But some programs are even meant to run forever, like daemons and other such things. And as they run, the amount of debris they generate can potentially keep growing. If the locked up resources are substantial, then it can have a real negative impact on your system.

As a result, a language needs a way of clearing out orphaned resources, making them available to other users and keeping memory use from creeping upward over time. Fortunately, PHP has a three-tiered approach to garbage removal.

First Level – End of Scope

First, like most languages, whenever a scope of action ends, everything within that scope of action is destroyed, and any allocated resources are released. The scope of action can cover a function, a script, a session, etc., and when that scope ends, so does everything it is holding on to. Of course, you can always let go of a variable any time you want by using the unset() function; the memory behind it is released as soon as nothing else is using the same value.

This is one reason why functions and methods are so very important: they establish a scope of action, marking when particular memory usage begins and when it should end, and so limit how long things can be around. They should be used whenever possible instead of global entities.
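To make the scope rule concrete, here is a minimal sketch. The function name buildReport() and the array sizes are made up for illustration; the only real calls are range(), count(), unset(), memory_get_usage() and printf(). The exact byte counts will vary by PHP version, so treat the numbers as indicative only.

```php
<?php
// Hypothetical helper: builds a large temporary array inside a function scope.
function buildReport()
{
    $rows = range(1, 100000);   // local to this call
    return count($rows);        // $rows is released when the function returns
}

$before    = memory_get_usage();
$count     = buildReport();
$afterCall = memory_get_usage();   // roughly back to $before

$big        = range(1, 100000);    // global scope: lives until the script ends...
$withBig    = memory_get_usage();
unset($big);                       // ...unless we let go of it explicitly
$afterUnset = memory_get_usage();

printf("rows counted:         %d\n", $count);
printf("growth after call:    %d bytes\n", $afterCall - $before);
printf("growth with global:   %d bytes\n", $withBig - $afterCall);
printf("growth after unset(): %d bytes\n", $afterUnset - $afterCall);
```

The array built inside the function costs nothing once the call returns, while the global one keeps its memory until unset() is called or the script ends.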

Second Level – Reference Counting

Second, like most scripting languages, PHP keeps track of how many entities are using a given variable using a technique called reference counting.

When a variable is created in a PHP script, PHP creates a little ‘container’ called a zval that consists of the value assigned to that variable plus two other pieces of information: is_ref and refcount. The zval containers are kept in a table where there is one table per scope of action (script, function, method, whatever).

is_ref is a simple true/false value that indicates if the variable is part of a reference set, thus helping PHP to tell if this is a simple variable or a reference.

The refcount is more interesting in that it holds a numeric value indicating how many different variables are using this value. That is, if you define variable $dave = 6, the refcount will be set to 1. If I then say $programmer = $dave, the refcount will be incremented to 2. PHP knows enough not to create a second zval for the value 6; it just updates the counter on the already existing value container. When the program ends, or when we leave the scope of the function, or when unset() is used, then this refcount will be decremented. When the refcount hits zero, the zval is destroyed and any memory that it was holding is now free. (This is how the PHP 5 engine behaves. From PHP 7 on, simple scalars such as integers are stored directly in the variable and are not reference counted, but strings, arrays and objects still follow the same counting rules.)

Of course, this is a simple example for a simple variable. When you are talking about arrays or objects it’s more complicated, with multiple zvals being created for the individual elements of an array or properties of an object, but the basic processing is the same.
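The counting can be watched from inside a script without any extensions by using an object with a destructor, since the destructor fires at the moment the count reaches zero. This is a sketch only: the Tracked class and its static log are invented for the demonstration, and an object is used in place of the integer 6 because objects are reference counted in every PHP version.

```php
<?php
// Hypothetical class used only to observe when PHP destroys a value.
class Tracked
{
    public static $log = array();
    public $name;

    public function __construct($name)
    {
        $this->name = $name;
    }

    public function __destruct()
    {
        self::$log[] = $this->name . ' destroyed';
    }
}

$dave       = new Tracked('six');  // one variable is using the object
$programmer = $dave;               // a second variable uses the same object

unset($dave);                      // one user gone, one left: nothing is destroyed
$afterFirstUnset = count(Tracked::$log);

unset($programmer);                // last user gone: destroyed right away
$afterSecondUnset = count(Tracked::$log);

printf("destroyed after first unset:  %d\n", $afterFirstUnset);
printf("destroyed after second unset: %d\n", $afterSecondUnset);
```

The first unset() only lowers the count; it is the second one, which removes the last user, that actually frees the object.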

A problem occurs, however, when a structure ends up referring to itself, for example an array that holds a reference to itself as one of its elements, or two objects that point at each other. This happens with some frequency in more complicated PHP scripts. In this case, the refcount for the array is set to 1 when the array is created, then incremented to 2 when the array is stored inside itself. If the variable is then unset, or its scope ends, the refcount is decremented by 1. We are now in a situation where the value is no longer reachable from any variable, but the container (zval) that represents it still has a refcount greater than zero, because it is still counting its own reference to itself.

The end result is that the storage represented by the original array will not be freed up and that amount of memory is now unavailable for use by anything. Normally, we think of this amount of lost storage as being small, but often it isn’t. Arrays can be very big things today and it is especially problematic if the script in which this occurs is a daemon or other nearly continuously running function. In this case, the resultant ‘memory leak’ can have devastating consequences on performance and even the ability of a server to operate.
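A short sketch of that leak, using two objects that point at each other. The Node class and makeCycle() are hypothetical names; gc_enable() and gc_collect_cycles() are the real functions. The script assumes nothing has filled the root buffer, so the collector does not run on its own before it is asked to.

```php
<?php
// Hypothetical class that counts how many instances have been destroyed.
class Node
{
    public static $destroyed = 0;
    public $other = null;

    public function __destruct()
    {
        self::$destroyed++;
    }
}

// Builds two objects that reference each other, then lets the scope end.
function makeCycle()
{
    $a = new Node();
    $b = new Node();
    $a->other = $b;
    $b->other = $a;
}

gc_enable();                          // make sure the collector is available
makeCycle();
$afterScopeEnd = Node::$destroyed;    // scope ended, but each refcount is still 1

$collected    = gc_collect_cycles();  // ask for a collection run now
$afterCollect = Node::$destroyed;

printf("destroyed when the function returned: %d\n", $afterScopeEnd);
printf("destroyed after gc_collect_cycles():  %d\n", $afterCollect);
```

Neither destructor runs when the function returns, even though no variable can reach the objects any more. They are only destroyed once the cycle collector has looked at them.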

Third Level – Formal Garbage Collection

Obviously, reference counting on its own has its limitations, but fortunately PHP 5.3 introduced a garbage collector to help with this situation.

The specific situation that we want our garbage cycle to address is the case where a zval’s refcount has been decremented, but it is still a non-zero value. Basically the cycle sees which counts can be decremented further and then frees up the ones that go to zero.

What really happens is that PHP keeps track of all the possible root containers (zvals). This is done whether garbage collection is turned on or not (because it is faster for it to just do it rather than asking if garbage collection is on, yada, yada, yada). This root buffer holds up to 10,000 roots (fixed size, but this can be changed). When it fills up, the garbage collection mechanism will kick off and begin analyzing this buffer.

The first thing the GC routine does is rip through the root buffer, following each root to the zvals it contains, and decrement the counts it finds by 1. As it does this, it marks each one with a little check mark so that it only decrements a zval once.

Then, it goes through again and marks (this time with a little squiggly line) all of the zvals whose reduced counts are zero. The ones that are not zero are incremented so that they resume their original values.

Finally, it will roll through there one more time, clearing out the non-zero zvals from the buffer, and freeing up the storage for the ones with a zero refcount.
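The three passes can be played out on a toy model. Everything here is an illustration: ToyCollector, the 'loop', 'live' and 'item' entries, and the colour names are invented, and the real engine works on C structures rather than PHP arrays. The sketch follows the usual description of the algorithm, in which the decrement is applied to what each root contains, the "check mark" is a grey colour, and the "squiggly line" is white.

```php
<?php
// Toy model only: each "zval" is an array entry with a refcount, a colour,
// and the ids of the zvals it contains. This is not the engine's real code.
class ToyCollector
{
    public $heap;

    public function __construct(array $heap)
    {
        $this->heap = $heap;
    }

    // Runs the three passes over the given roots; returns the ids it freed.
    public function collect(array $roots)
    {
        foreach ($roots as $id) { $this->markGrey($id); }
        foreach ($roots as $id) { $this->scan($id); }
        $freed = array();
        foreach ($roots as $id) { $this->collectWhite($id, $freed); }
        return $freed;
    }

    // Pass 1: decrement everything reachable, visiting each entry once.
    private function markGrey($id)
    {
        if ($this->heap[$id]['color'] === 'grey') { return; }
        $this->heap[$id]['color'] = 'grey';
        foreach ($this->heap[$id]['children'] as $child) {
            $this->heap[$child]['refcount']--;
            $this->markGrey($child);
        }
    }

    // Pass 2: zero counts are marked white; non-zero counts are restored.
    private function scan($id)
    {
        if ($this->heap[$id]['color'] !== 'grey') { return; }
        if ($this->heap[$id]['refcount'] > 0) {
            $this->scanBlack($id);
            return;
        }
        $this->heap[$id]['color'] = 'white';
        foreach ($this->heap[$id]['children'] as $child) {
            $this->scan($child);
        }
    }

    // Puts back the counts taken away in pass 1 for entries still in use.
    private function scanBlack($id)
    {
        $this->heap[$id]['color'] = 'black';
        foreach ($this->heap[$id]['children'] as $child) {
            $this->heap[$child]['refcount']++;
            if ($this->heap[$child]['color'] !== 'black') {
                $this->scanBlack($child);
            }
        }
    }

    // Pass 3: free whatever is still white.
    private function collectWhite($id, array &$freed)
    {
        if (!isset($this->heap[$id]) || $this->heap[$id]['color'] !== 'white') {
            return;
        }
        $children = $this->heap[$id]['children'];
        unset($this->heap[$id]);
        $freed[] = $id;
        foreach ($children as $child) { $this->collectWhite($child, $freed); }
    }
}

// 'loop' is only referenced by itself; 'live' is still held by a variable.
$collector = new ToyCollector(array(
    'loop' => array('refcount' => 1, 'color' => 'black', 'children' => array('loop')),
    'live' => array('refcount' => 1, 'color' => 'black', 'children' => array('item')),
    'item' => array('refcount' => 1, 'color' => 'black', 'children' => array()),
));
$freed = $collector->collect(array('loop', 'live'));

echo 'freed: ', implode(', ', $freed), "\n";
echo 'kept:  ', implode(', ', array_keys($collector->heap)), "\n";
```

The self-referencing entry drops to zero once its own reference is discounted, so it is freed. The entry that a variable still holds keeps a count above zero, and both it and its contents get their original counts back.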

Garbage collection is turned on by default in PHP, but you can turn it off in the php.ini file with the directive zend.enable_gc. Or, you can switch it on and off within your script by calling the gc_enable() and gc_disable() functions.

As noted above, the garbage collection, if enabled, runs when the root buffer is full, but you can override this and run the collection when you feel like it with the gc_collect_cycles() function. And, you can modify the size of the root buffer by changing the GC_ROOT_BUFFER_MAX_ENTRIES constant in Zend/zend_gc.c in the PHP source code and recompiling PHP.

All in all, this allows you to control whether GC runs and when and where it does, which is a good thing because it is a bit resource intensive and so might not be the sort of thing you run just for the heck of it.
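A minimal sketch of the runtime switches, using only the functions named above plus gc_enabled(), which reports the current state. The variable names are arbitrary. The script puts the setting back the way it found it at the end.

```php
<?php
$wasEnabled = gc_enabled();          // state inherited from zend.enable_gc

gc_disable();                        // no automatic collection runs from here
$whileDisabled = gc_enabled();

gc_enable();                         // automatic collection allowed again
$whileEnabled = gc_enabled();

$collected = gc_collect_cycles();    // force a run now; returns how many cycles were collected

if (!$wasEnabled) {
    gc_disable();                    // restore the original setting
}

printf("enabled at start:        %s\n", $wasEnabled ? 'yes' : 'no');
printf("enabled after disable(): %s\n", $whileDisabled ? 'yes' : 'no');
printf("enabled after enable():  %s\n", $whileEnabled ? 'yes' : 'no');
printf("cycles collected:        %d\n", $collected);
```

To switch the collector off for every script instead, the php.ini line would be zend.enable_gc = 0.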

When Should You Use It

Because there is a performance hit attached to garbage collection, it is worth taking a minute to figure out when it should be used.

First, keep in mind that unless you overtly run it (with the gc_collect_cycles() function), the formal garbage collection will not happen until the root buffer (10,000 entries) is full, and that isn’t going to happen for small scripts that only ever create a handful of values.

Should you use it on small scripts? That’s up to you. It’s hard to say that running something like garbage collection is a bad thing, but if you have small, quick running scripts that start and then end and are gone then there might not be much of a payback. But if your server is running a lot of small scripts that stay persistent, then it will probably be worth the effort. The only real way to know is to benchmark your application and see. And certainly, if you have long running scripts or especially scripts that do not end, then garbage collection is essential if you want to prevent the kind of memory leaking that we talked about above.
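One rough way to benchmark this is to watch memory_get_usage() around the work in question. In the sketch below, handleJob() is a hypothetical unit of work that deliberately leaves a self-referencing object behind, and the job count of 5,000 is chosen to stay under the 10,000-entry root buffer so that no automatic run interferes with the measurement. The byte counts depend on the PHP version.

```php
<?php
// Hypothetical unit of work that leaves a reference cycle behind each time.
function handleJob($n)
{
    $job = new stdClass();
    $job->payload = str_repeat('x', 1024);  // about 1 KB of data per job
    $job->self = $job;                      // cycle: refcounting alone cannot free this
    return $n;
}

gc_enable();
$start = memory_get_usage();

for ($i = 1; $i <= 5000; $i++) {
    handleJob($i);
}
$leaked = memory_get_usage() - $start;        // memory held by unreachable cycles

$freedCount   = gc_collect_cycles();          // run the collector on demand
$afterCollect = memory_get_usage() - $start;  // what is left once the cycles are gone

printf("held before collection: %d bytes\n", $leaked);
printf("cycles collected:       %d\n", $freedCount);
printf("held after collection:  %d bytes\n", $afterCollect);
```

In a daemon, the same measurement taken every few thousand iterations shows whether memory is climbing, and a gc_collect_cycles() call at a quiet point in the loop is one way to keep it in check.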

Perhaps most importantly, we should always try to follow good programming guidelines so that we minimize or eliminate global variables and tie our variables instead to scope, so that even if we have a long running script, we free up that memory when the function, rather than the script, ends. Also be aware of when you are using arrays within arrays, or objects referencing objects, especially where the references loop back on themselves, since such situations can cause memory leaks and are the real target of the formal garbage collection process.


  • http://WebsiteURL sarmen

    Thank you for this article! Garbage collection isn’t something that the majority of PHP developers even think or care about, but it’s best to know because I’ve seen applications that lag, a great example being Magento lol. If I wanted to see the is_ref or refcount values, how would I go about it? Is there a function or method I can use to see them? Just for testing.

    thanks

  • http://www.BitWindow.com Ashesh

    Thanks David for this article with a simple explanation of GC.

  • http://www.ontariodarts.com Steven Scott

    I think your sample for part 2 is wrong. As $dave = 6 is an internal integer type, when $programmer = $dave is called, $dave’s refcount should not be 2, but rather $programmer should have a refcount of 1 as well, since you could clear $dave without affecting $programmer.

    This outputs 6 as would be expected.
    I think you wanted to use a more complex type like an object, for a reference (&) to link $programmer to $dave.

    • http://www.resumebaking.com Aleksey Asiutin

      Steven,
      Dave described everything correctly. PHP doesn’t make a copy of a scalar variable until you modify it. So for example
      gives us one real object in the memory with two variables reference it. The object’s refcount is 2 in this case. But if we change one variable: , then php creates two separate objects in the memory with refcounts equal to 1. Hope it’s clear now. If you have any questions, I can help you with proper article (just need to find it)

      Cheers,
      Aleksey.

      • http://ehsana.info Lasana Murray

        So what happens to references of non-scalars, like an object passed by reference to another object and stored internally? Is the internal storage really just a copy?

  • http://web.performancerasta.com/ Nico

    For performance reasons, it is best to only enable it when needed, i.e. for long running jobs, batch, php-cli daemons, etc.
    Of course there might be cases where you need it for long loops, but the code can probably be optimized (more OOP) before resorting to GC. Also, some cases where you see increased memory are due to PHP memory leaks; the best approach is to isolate the leaking function and go around it (and report the bug to php.net).

  • http://www.sassquad.com Timely Article

    Thanks for this very timely article. My work colleagues and I were discussing how you can explicitly perform garbage collection, as we have some scripts which deal with very large arrays and batch runs of large data files for reporting purposes. PHP is full of surprises, which is a blessing and a curse. I’ll definitely pass this article around. Thanks so much!

  • Dave Shirey

    First, a quick but sincere thank you to Sarmen, Ashesh, and Timely Article for your kind words. Authors are a fragile group and you never know when a few nice words will save them from a life slumped over a bar in a rundown ginmill slurping what is left of their rye on the rocks.

    Also, for Sarmen, you can see the refcount if you have Xdebug installed by using xdebug_debug_zval(). See http://xdebug.org for more info. And remember that if refcount = 1 then is_ref is always false. Hope this helps.

    To Steven and Aleksey – you scared me Steven because I thought, crap, I am going to have to think about this again, but Aleksey came up with a solution and I am sure (praying) that he is right. If this explanation doesn’t satisfy you Steven let me know. I thought I was right but I have been down that road before.

    And finally, Nico. Good call, my man. Yes, don’t turn this on if you don’t need to, although for most applications I am guessing the overhead will not be significant (noticeable). Unfortunately, it seems that every time you do need it is also a situation where you want the stinkin thing to be as fast as possible. Shirey’s Second Law – You’re screwed no matter what you do.

  • http://ehsana.info Lasana Murray

    This is one of the better arguments against global state I have seen in a while. The common chatter about unit testing doesn’t really appeal much to small time hackers like me, but the chance of memory leaks is scary :O .

  • Santosh Moktan

    Thanks for great article, complex topic with simple explanation. Waiting more to see ….