We wrote some new code in the form of celery tasks that we expected to run for up to five minutes, and use a few hundred megabytes of memory. Rinse and repeat for a thousand different data sets. We ran through a few data sets successfully, but once we started running through ALL of them, we noticed that the memory of the celery process was continuing to grow.
In celery, each task runs in one of a fixed number of processes that persist between tasks. We assumed we had a memory leak on our hands; somehow we were leaving references around to our data structures that were remaining in memory and not being garbage collected between tasks. But how do you go about investigating exactly what is happening?
Note: Stop everything, and make sure that you're not in DEBUG mode, assuming you're using Django. In that mode, every database query you make will be stored in memory, which looks a lot like a memory leak.
The command line utilities top or the more pleasing htop should be your first stop for any CPU or memory load investigation. In our case, we had observed that the machine would run out of memory and start paging while running our tasks. So we kicked them off again, and watched the processes in htop. Indeed, the processes grew from their initial size of 100MB, slowly, all the way up to 1GB before we killed them. We could see from the logs that individual tasks were being completed successfully along the way.
We were able to reproduce the behavior in our development environment, though we only had enough data for the process to balloon to a few hundred megabytes. Once we had the behavior reproducible in a script that could be run on its own outside of celery (using CELERY_ALWAYS_EAGER), we could use the GNU time command to track peak memory usage, i.e. /usr/bin/time -v myscript.py.
Note: we're specifying the full path to time so that we get the GNU time command, and not the one built into bash.
Note: there is a bug in some versions of the utility that mis-reports memory usage by multiplying it by a factor of four. Double-check using top.
You can actually get the amount of memory your process is using from inside your Python process, using the resource module.
This can be useful for adding logging statements to your code to measure memory usage over time, or at critical junctures of a long-running process. This can help you isolate the critical section of your code that's causing the memory issue.
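A minimal sketch of reading memory usage with the resource module. The helper name peak_memory_mb is ours, and the conversion assumes Linux, where ru_maxrss is reported in kilobytes (macOS reports bytes). Note that ru_maxrss is the peak resident size so far, not the current size.

```python
import resource


def peak_memory_mb():
    """Peak resident memory of this process in MB (assumes Linux, where ru_maxrss is in KB)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss / 1024.0


# Drop a line like this at critical junctures of a long-running job.
print("Peak memory so far: %.1f MB" % peak_memory_mb())
```

Logging this value before and after each suspect step makes it fairly easy to see which step pushes the peak up.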
Once you have identified a spot in your code just after the memory issue has occurred, you can query for the objects currently in memory right from Python, as well. You will probably need to do a pip install objgraph first.
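A sketch of what that query might look like. If objgraph is installed, show_most_common_types prints the report directly. The fallback helper most_common_types is our own stdlib-only approximation (counting gc-tracked objects by type name), not part of objgraph.

```python
import gc
from collections import Counter


def most_common_types(limit=10):
    """Count live gc-tracked objects by type name, most frequent first."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


try:
    import objgraph  # third-party: pip install objgraph
    objgraph.show_most_common_types(limit=10)
except ImportError:
    # Fallback: a rough equivalent built only on the gc module.
    for name, count in most_common_types(10):
        print("%-30s %d" % (name, count))
```

Either way, the output is a list of type names with instance counts, which is where generic buckets like dict, list and tuple tend to dominate.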
Maybe you'll get lucky and see a custom class that you've defined at the top of the list. But if not, what exactly is in those generic type buckets? Enter guppy, which is like show_most_common_types on steroids. Again, you will likely need to install this via pip install guppy. The great thing about guppy/heapy is that you can take a snapshot of the heap before your critical section and after, and diff them, just getting the objects that were added to the heap in between.
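A minimal sketch of the before/after diff, assuming guppy is importable (on Python 3 the package is published as guppy3). The function build_cache is a hypothetical stand-in for your own critical section.

```python
def build_cache():
    """Hypothetical stand-in for the critical section under investigation."""
    return [str(i) * 10 for i in range(10000)]


try:
    from guppy import hpy  # third-party; the import is guarded so the sketch runs without it
except ImportError:
    hpy = None

if hpy is not None:
    hp = hpy()
    before = hp.heap()          # snapshot of reachable objects before
    cache = build_cache()       # critical section
    after = hp.heap()           # snapshot after
    leftover = after - before   # objects added to the heap in between
    print(leftover)             # summary table grouped by type
    # import pdb; pdb.set_trace()  # uncomment to explore `leftover` interactively
```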
You probably want a pdb session here, so you can interactively investigate the heap diff. The best heapy tutorial I have found is How to use guppy/heapy for tracking down memory usage.
Note: memory dumps have been fabricated to protect the innocent.
An interesting thing happened when we were using heapy. We noticed that heapy was only reporting 128MB of objects in memory, whereas the resource module and top agreed that there was almost 1GB being used.
To get an idea of what was comprising the remaining 800+ MB, we turned to gdb, specifically to a Python helper called gdb-heap.
In our case, what we saw was mostly indecipherable. But there seemed to be a ton of tiny little objects around, like integers.
Long-running Python jobs that consume a lot of memory while running may not return that memory to the operating system until the process actually terminates, even if everything is garbage collected properly. That was news to me, but it's true. What this means is that processes that do need to use a lot of memory will exhibit a "high water" behavior, where they remain forever at the level of memory usage that they required at their peak.
Note: this behavior may be Linux specific; there are anecdotal reports that Python on Windows does not have this problem.
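One way to check for the high water behavior on your own machine is to compare resident memory before allocating, at the peak, and after freeing everything. This is only a sketch: it reads /proc/self/statm, so it assumes Linux, the helper names are ours, and how much memory is actually handed back will vary with the Python version and allocator.

```python
import gc
import os


def current_rss_kb():
    """Current resident set size in KB, read from /proc (Linux only)."""
    with open("/proc/self/statm") as statm:
        resident_pages = int(statm.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") // 1024


def measure_high_water(n=1000000):
    """Return RSS (baseline, peak, after_free) around allocating n small objects."""
    baseline = current_rss_kb()
    data = [object() for _ in range(n)]  # lots of tiny allocations
    peak = current_rss_kb()
    del data
    gc.collect()                         # everything is collectable at this point
    after_free = current_rss_kb()
    return baseline, peak, after_free


print("baseline=%d KB, peak=%d KB, after free=%d KB" % measure_high_water())
```

If the last number stays close to the peak rather than dropping back toward the baseline, you are looking at the behavior described above rather than a leak.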
This problem arises from the fact that the Python VM does its own internal memory management. It's commonly known as memory fragmentation. Unfortunately, there doesn't seem to be any fool-proof method of avoiding it.
Celery tends to bring out this behavior for a lot of users.
"AFAIK this is just how Python works. I would guess that the operating system will reuse the memory anyway, since it can just swap it out if it's not used. If you have allocated a chunk of memory, there's a big chance that you will need it again, and it's better to delegate memory management to the operating system. ... There is no solution - that I know of - to make Python release the memory ..." (Ask Solem, author of celery)
For celery in particular, you can roll the celery worker processes regularly. This is exactly what the CELERYD_MAX_TASKS_PER_CHILD setting does. However, you may end up having to roll the workers so often that you incur an undesirable performance overhead.
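As a sketch, the setting is a single line in your celery configuration. The value 100 is an arbitrary placeholder rather than a recommendation. Newer celery releases use the lowercase name worker_max_tasks_per_child for the same option.

```python
# settings.py or celeryconfig.py (old-style uppercase setting name).
# Replace each worker process with a fresh one after it has run 100 tasks,
# which hands all of its memory back to the operating system.
# 100 is a placeholder; lower values free memory sooner but restart more often.
CELERYD_MAX_TASKS_PER_CHILD = 100
```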
For non-celery systems, you can use the multiprocessing module to run any function in a separate process. There is a simple looking gist called processify that does just that.
Note: This may have the undesirable effect of using more shared resources, like database connections.
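A rough sketch of the idea (not the processify gist itself): run a function in a child process and pass the result back through a queue, so that whatever memory the function used is released when the child exits. The names run_in_subprocess and sum_of_squares are ours, and the "fork" start method is an assumption that limits this to Unix-like systems.

```python
import multiprocessing


def _worker(queue, func, args, kwargs):
    """Run func in the child and send back (ok, payload)."""
    try:
        queue.put((True, func(*args, **kwargs)))
    except Exception as exc:
        queue.put((False, repr(exc)))


def run_in_subprocess(func, *args, **kwargs):
    """Call func(*args, **kwargs) in a forked child process and return its result."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(queue, func, args, kwargs))
    proc.start()
    ok, payload = queue.get()  # read before join so a large result cannot block the child
    proc.join()
    if not ok:
        raise RuntimeError("child process failed: %s" % payload)
    return payload


def sum_of_squares(n):
    """Example memory-hungry job (here just a cheap placeholder)."""
    return sum(i * i for i in range(n))


print(run_in_subprocess(sum_of_squares, 10))
```

The result has to be picklable to travel through the queue, and this sketch would wait forever if the child were killed before sending anything, so a production version would want a timeout.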
You could also run your Python jobs using Jython, which uses the Java JVM and does not exhibit this behavior. Likewise, you could upgrade to Python 3.3,
Ultimately, the best solution is to simply use less memory. In our case, we ended up breaking the work into smaller chunks (individual days). For some tasks, this may not be possible, or may require complicated task coordination.