Python garbage collection and the gc module

Originally posted on infoworld.

How does Python deal with memory management? Learn the ins and outs of Python’s garbage collection system and how to avoid its pitfalls.

Python grants its users many conveniences, and one of the largest is (nearly) hassle-free memory management. You don’t need to manually allocate, track, and dispose of memory for objects and data structures in Python. The runtime does all of that for you, so you can focus on solving your actual problems instead of wrangling machine-level details.

Still, it’s good for even modestly experienced Python users to understand how Python’s garbage collection and memory management work. Understanding these mechanisms will help you avoid performance issues that can arise with more complex projects. You can also use Python’s built-in tooling to monitor your program’s memory management behavior.

How Python manages memory

Every Python object has a reference count, also known as a refcount. The refcount is a tally of the total number of other objects that hold a reference to a given object. When you add or remove references to an object, the number goes up or down. When an object’s refcount goes to zero, that object is deallocated and its memory is freed up.

What is a reference? Anything that allows an object to be accessed by way of a name, or by way of an accessor in another object.

Here’s a simple example:

x = "Hello there"

When we give Python this command, two things happen under the hood:

  1. The string "Hello there" is created and stored in memory as a Python object.
  2. The name x is created in the local namespace and pointed at that object, which increases its reference count by 1, to 1.

If we were to say y = x, then the reference count would be raised once again, to 2.

Whenever x and y go out of scope or are deleted from their namespaces, the reference count for the string goes down by 1 for each of those names. Once x and y are both out of scope or deleted, the refcount for the string goes to 0 and the string is removed from memory.
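To watch the count move, here is a minimal sketch using sys.getrefcount(). It assumes CPython, and it uses a small stand-in class (Greeting, a hypothetical name) rather than a string literal, because the interpreter may cache literals and skew the numbers. The absolute value reported is typically higher than you might expect, since the function call itself holds a temporary reference, so the sketch only compares differences.

```python
import sys


class Greeting:
    """Hypothetical stand-in object; avoids string-literal caching effects."""


x = Greeting()
count_one_name = sys.getrefcount(x)   # baseline with only the name x

y = x                                  # a second name for the same object
count_two_names = sys.getrefcount(x)   # one higher than the baseline

del y                                  # remove the second name again
count_after_del = sys.getrefcount(x)   # back to the baseline

print(count_two_names - count_one_name)  # difference caused by y = x
```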

Now, let’s say we create a list with a string in it, like this:

x = ["Hello there", 2, False]

The string remains in memory until either the list itself is removed or the element with the string in it is removed from the list. Either of these actions will cause the only thing holding a reference to the string to vanish.

Now consider this example:

x = "Hello there"
y = [x]

If we remove the first element from y, or delete the list y entirely, the string is still in memory. This is because the name x holds a reference to it.
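The same idea can be checked without counting anything. The sketch below assumes CPython's immediate deallocation and again uses a hypothetical stand-in class, Greeting, because plain strings cannot be weakly referenced. A weak reference serves purely as a probe: it returns the object while it is alive and None once it is gone.

```python
import weakref


class Greeting:
    """Hypothetical stand-in for the string in the example above."""


x = Greeting()
y = [x]                      # the list holds a second reference
probe = weakref.ref(x)       # a probe that does not keep the object alive

del y                        # the list goes away...
alive_after_list_deleted = probe() is not None   # ...but x still holds it

del x                        # the last strong reference goes away
alive_after_name_deleted = probe() is not None   # object is now gone
```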

Reference cycles in Python

In most cases, reference counts work fine. But sometimes you have a case where two objects each hold a reference to each other. This is known as a reference cycle. In this case, the reference counts for the objects will never reach zero on their own, so reference counting alone will never remove them from memory.

Here’s a contrived example:

x = SomeClass()
y = SomeOtherClass()
x.item = y
y.item = x

Since x and y hold references to each other, reference counting alone will never remove them from the system, even if nothing else has a reference to either of them.

It’s actually fairly common for Python’s own runtime to generate reference cycles for objects. One example would be an exception with a traceback object that contains references to the exception itself.

In very early versions of Python, this was a problem. Objects with reference cycles could accumulate over time, which was a big issue for long-running applications. But Python has since introduced the cycle detection and garbage collection system, which manages reference cycles.

The Python garbage collector (gc)

Python’s garbage collector detects objects with reference cycles. It does this by tracking objects that are “containers” (things like lists, dictionaries, and custom class instances) and determining which objects in them can’t be reached from anywhere else.

Once those objects are singled out, the garbage collector removes them by ensuring their reference counts can be safely brought down to zero. (For more about how this works, see the Python developer’s guide.)
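As a rough illustration of the collector cleaning up a cycle, the sketch below builds two objects that point at each other, drops every outside reference, and then runs a collection by hand. The Node class and make_cycle() helper are hypothetical names for this example. Automatic collection is switched off for the duration so the result does not depend on when the collector happens to run.

```python
import gc
import weakref


class Node:
    """Hypothetical container-like object that can point at another Node."""

    def __init__(self):
        self.item = None


def make_cycle():
    """Build a two-object cycle and return a weak-reference probe to it."""
    x = Node()
    y = Node()
    x.item = y
    y.item = x
    return weakref.ref(x)    # x and y go out of scope when we return


was_enabled = gc.isenabled()
gc.disable()                 # keep automatic collections out of the picture
try:
    probe = make_cycle()
    alive_before = probe() is not None   # refcounts alone did not free it
    found = gc.collect()                 # number of unreachable objects found
    alive_after = probe() is not None    # the cycle has been reclaimed
finally:
    if was_enabled:
        gc.enable()
```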

The vast majority of Python objects don’t have reference cycles, so the garbage collector doesn’t need to run 24/7. Instead, the garbage collector uses a few heuristics to run less often and to run as efficiently as possible each time.

When the Python interpreter starts, it tracks how many objects have been allocated but not deallocated. The vast majority of Python objects have a very short lifespan, so they pop in and out of existence quickly. But over time, more long-lived objects hang around. Once more than a certain number of such objects stack up, the garbage collector runs. (The default number of allowed long-lived objects is 700 as of Python 3.10.)

Every time the garbage collector runs, it takes all the objects that survive the collection and puts them together in a group called a generation. These “generation 1” objects get scanned less often for reference cycles. Any generation 1 objects that survive the garbage collector eventually are migrated into a second generation, where they’re scanned even more rarely.
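If you are curious what these numbers look like on your own interpreter, the gc module exposes them. The sketch below only reads values and makes no assumption about what they are, since defaults can differ between Python versions.

```python
import gc

# Collection thresholds, one per generation (a 3-tuple).
thresholds = gc.get_threshold()

# Current counters that are compared against those thresholds.
counts = gc.get_count()

# Per-generation statistics since the interpreter started.
stats = gc.get_stats()

print("thresholds:", thresholds)
print("counts:", counts)
for generation, info in enumerate(stats):
    print("generation", generation, "collections so far:", info["collections"])
```

If you ever need to change the thresholds, gc.set_threshold() accepts new values, though the defaults are usually a sensible starting point.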

How to use the gc module

Generally, the garbage collector doesn’t need tuning to run well. Python’s development team chose defaults that reflect the most common real-world scenarios. But if you do need to tweak the way garbage collection works, you can use Python’s gc module. The gc module provides programmatic interfaces to the garbage collector’s behaviors, and it provides visibility into what objects are being tracked.

One useful thing gc lets you do is toggle off the garbage collector when you’re sure you won’t need it. For instance, if you have a short-running script that piles up a lot of objects, you don’t need the garbage collector. Everything will just be cleared out when the script ends. To that end, you can disable the garbage collector with the command gc.disable(). Later, you can re-enable it with gc.enable().

You can also run a collection cycle manually with gc.collect(). A common application for this would be to manage a performance-intensive section of your program that generates many temporary objects. You could disable garbage collection during that part of the program, then manually run a collection at the end and re-enable collection.
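One way to package the disable, collect, and re-enable pattern is a small context manager. This is only a sketch: gc_paused is a hypothetical helper name, not part of the gc module, and the workload inside the with block is a placeholder for your own performance-sensitive code.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Hypothetical helper: pause automatic collection for a block of code."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        gc.collect()          # one manual pass once the busy section is done
        if was_enabled:
            gc.enable()       # restore the collector only if it was on before


with gc_paused():
    # Placeholder workload that creates many short-lived container objects.
    pairs = [[i, i * i] for i in range(10_000)]
    total = sum(pair[1] for pair in pairs)
```

Restoring the previous state, rather than always calling gc.enable(), keeps the helper safe to use in code where the collector was already turned off on purpose.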

Another useful garbage collection optimization is gc.freeze(). When this command is issued, everything currently tracked by the garbage collector is “frozen,” or listed as exempt from future collection scans. This way, future scans can skip over those objects. If you have a program that imports libraries and sets up a good deal of internal state before starting, you can issue gc.freeze() after all the work is done. This keeps the garbage collector from having to trawl over things that aren’t likely to be removed anyway. (If you want to have garbage collection performed again on frozen objects, use gc.unfreeze().)
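A minimal sketch of that startup pattern follows, assuming Python 3.7 or later, where gc.freeze() was introduced. The settings dictionary is a hypothetical stand-in for whatever long-lived state your program builds before doing its real work.

```python
import gc

# Hypothetical long-lived startup state (stand-in for imports and setup work).
settings = {"plugins": [["plugin", i] for i in range(100)]}

gc.collect()                       # clear out any startup garbage first
gc.freeze()                        # exempt everything tracked so far
frozen_count = gc.get_freeze_count()

# ... the main part of the program would run here ...

gc.unfreeze()                      # make frozen objects collectable again
count_after_unfreeze = gc.get_freeze_count()

print("objects frozen:", frozen_count)
```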

Debugging garbage collection with gc

You can also use gc to debug garbage collection behaviors. If you have an inordinate number of objects stacking up in memory and not being garbage collected, you can use gc’s inspection tools to figure out what might be holding references to those objects.

If you want to know what objects hold a reference to a given object, you can use gc.get_referrers(obj) to list them. You can also use gc.get_referents(obj) to find any objects referred to by a given object.
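Here is a small sketch of both calls. Payload, holder, and bag are hypothetical names for this example. Note that gc.get_referrers() may also report objects you did not create yourself, such as the namespace dictionary that holds the variable, so the sketch checks for specific containers by identity instead of comparing the whole result.

```python
import gc


class Payload:
    """Hypothetical object whose referrers we want to inspect."""


target = Payload()
holder = {"key": target}      # a dict that refers to target
bag = [target]                # a list that refers to target

# Who points at target?
referrers = gc.get_referrers(target)
holder_found = any(r is holder for r in referrers)
bag_found = any(r is bag for r in referrers)

# What does bag point at?
referents = gc.get_referents(bag)
target_in_bag = any(r is target for r in referents)
```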

If you’re not sure if a given object is a candidate for garbage collection, gc.is_tracked(obj) tells you whether or not that object is tracked by the garbage collector. Keep in mind that the garbage collector doesn’t track “atomic” objects (such as integers and strings), and it may also skip some simple containers that hold only atomic objects.
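A quick sketch of gc.is_tracked() on a few kinds of object follows. It sticks to cases that are stable in CPython; whether a particular dict or tuple is tracked depends on its contents and on the interpreter version, so those are left out. Widget is a hypothetical class name.

```python
import gc


class Widget:
    """Hypothetical user-defined class; its instances can hold references."""


int_tracked = gc.is_tracked(42)          # atomic object: not tracked
str_tracked = gc.is_tracked("hello")     # atomic object: not tracked
list_tracked = gc.is_tracked([])         # container: tracked
instance_tracked = gc.is_tracked(Widget())  # class instance: tracked

print(int_tracked, str_tracked, list_tracked, instance_tracked)
```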

If you want to see for yourself what objects are being collected, you can set the garbage collector’s debugging flags with gc.set_debug(gc.DEBUG_LEAK|gc.DEBUG_STATS). This writes information about garbage collection to stderr. Because DEBUG_LEAK includes DEBUG_SAVEALL, it also saves every object found to be unreachable in the list gc.garbage instead of freeing it.
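The sketch below shows the saving behavior on its own. To keep stderr quiet it sets only gc.DEBUG_SAVEALL, which is one of the flags that DEBUG_LEAK combines, and it restores the previous flags afterwards. Leaky is a hypothetical class used to build a throwaway cycle.

```python
import gc


class Leaky:
    """Hypothetical object used to build a throwaway reference cycle."""


old_flags = gc.get_debug()
gc.set_debug(gc.DEBUG_SAVEALL)     # keep unreachable objects in gc.garbage

a = Leaky()
b = Leaky()
a.other = b
b.other = a                        # a and b now form a cycle
del a, b                           # no outside references remain

gc.collect()

# Other unreachable objects may be saved too, so filter by type.
saved = [obj for obj in gc.garbage if isinstance(obj, Leaky)]

gc.set_debug(old_flags)            # restore the previous debug flags
gc.garbage.clear()                 # let go of everything that was saved
```

Clearing gc.garbage when you are done matters: while objects sit in that list, they stay alive.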

Avoid pitfalls in Python memory management

As noted, objects can pile up in memory and not be collected if you still have references to them somewhere. This isn’t a failure of Python’s garbage collection as such; the garbage collector can’t tell if you accidentally kept a reference to something or not.

Let’s end with a few pointers for preventing objects from never being collected.

Pay attention to object scope

If you assign Object 1 to be a property of Object 2 (such as a class), Object 2 will need to go out of scope before Object 1 will:

obj1 = MyClass()
obj2 = MyClass()  # any object that can hold attributes
obj2.prop = obj1  # obj2 now keeps obj1 alive

What’s more, if this happens in a way that’s a side-effect of some other operation, like passing Object 2 as an argument to a constructor for Object 1, you might not realize Object 1 is holding a reference:

obj1 = MyClass(obj2)

Another example: If you push an object into a list and forget about the list, the object will remain until it is removed from the list, or until the list itself no longer has any references. And if that list is a module-level object, it’ll likely hang around until the program terminates.

In short, be conscious of ways your object might be held by another object that doesn’t always look obvious.
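The module-level list case can be sketched as follows, assuming CPython's immediate deallocation. The names _registry, Job, and run_job are hypothetical. A weak reference is used only as a probe to see whether the object is still alive.

```python
import weakref

_registry = []                 # hypothetical module-level list


class Job:
    """Hypothetical object that gets registered and then forgotten."""


def run_job():
    job = Job()
    _registry.append(job)      # easy to overlook: the list now holds the job
    return weakref.ref(job)    # the local name disappears when we return


probe = run_job()
alive_while_registered = probe() is not None   # kept alive by _registry

_registry.clear()                              # drop the forgotten reference
alive_after_clear = probe() is not None        # now the job can be freed
```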

Use weakref to avoid reference cycles

Python’s weakref module lets you create weak references to other objects. Weak references don’t increase an object’s reference count, so an object that has only weak references is a candidate for garbage collection.

One common use for weakref would be an object cache. You don’t want the referenced object to be preserved just because it has a cache entry, so you use a weakref for the cache entry.
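A minimal sketch of such a cache uses weakref.WeakValueDictionary, whose entries disappear once the cached object has no other references. Image is a hypothetical class for this example, and the immediate removal shown here assumes CPython's reference counting.

```python
import weakref


class Image:
    """Hypothetical expensive-to-build object worth caching."""

    def __init__(self, name):
        self.name = name


cache = weakref.WeakValueDictionary()

logo = Image("logo")
cache["logo"] = logo                       # the cache holds only a weak reference

hit_while_alive = cache.get("logo") is logo   # still available from the cache

del logo                                   # the last strong reference goes away
entry_after_delete = cache.get("logo")     # the entry has vanished with it
```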

Manually break reference cycles

Finally, if you’re aware that a given object holds a reference to another object, you can always break the reference to that object manually. For instance, if you have instance_of_class.ref = other_object, you can set instance_of_class.ref = None when you’re preparing to remove instance_of_class.
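To see that breaking the cycle by hand is enough, the sketch below turns automatic collection off, so only reference counting can free anything. Link is a hypothetical class, and the behavior shown assumes CPython.

```python
import gc
import weakref


class Link:
    """Hypothetical object with a single reference slot."""

    def __init__(self):
        self.ref = None


was_enabled = gc.isenabled()
gc.disable()                       # only reference counting is at work now
try:
    instance_of_class = Link()
    other_object = Link()
    instance_of_class.ref = other_object
    other_object.ref = instance_of_class    # the two now form a cycle
    probe = weakref.ref(other_object)

    instance_of_class.ref = None            # break the cycle manually
    del instance_of_class, other_object     # drop the names

    freed_without_collector = probe() is None
finally:
    if was_enabled:
        gc.enable()
```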
