Tracking down a Rails app memory leak
Recently on thenewsroom we've noticed a memory leak; the process size would just grow and grow up over 1 GB per Mongrel. Restarting them regularly kept things more or less under control, but that's silly. Much better to find the actual problem.
I tried out various memory profilers, like mongrel -B and Scott Laird's MemoryProfiler. But the results just didn't jump out at me; I spent quite a while poking around Objectspace dump files and String diffs and such without really making much progress.
Then I tried Evan Weaver's BleakHouse. This was nicer; the Gruff graphs are cool and it showed that the leak was in the way we were using Ruby and Rails. So we didn't have a class Hash in a controller or anything like that. But I still couldn't pin it down; I was just seeing charts with upwardly sloping numbers of String objects.
Finally I backed down to just poking around the app while watching the output of:
watch -n 1 "ps -o rss,vsz -p 12345"
Nothing interesting surfaced at first. OK, what if I look at the differences between my dev and our production database? Ah ha! A particular table in my DB has 200 rows, the production one has 10K rows, and a particular page hit by a particular user causes 1500 of them to be fetched. And that corresponds with a huge leap in memory size.
Here was the problem. Our architecture involves various Rails apps passing data around - and we were using Marshal'd OpenStruct objects as the transfer mechanism. This is a bad idea. Here's a demo of why. First, let's write 10K Marshal'd OpenStruct objects to a file:
require 'ostruct'
structs = []
10000.times do |x|
structs << OpenStruct.new(
:first_name => "Fred#{x}",
:last_name => "Fredson#{x}",
:hatsize => 12+x,
:country => "FooLand#{x}")
end
File.open("data.dat", "w") { |f| f.syswrite(Marshal.dump(structs)) }
This creates a data.dat file of about 500K. Now let's read them back in, but check the process size at various points. Here's a read.rb:
require 'ostruct'
puts "Hit enter to read data"
gets
puts "Reading data..."
data = Marshal.load(File.read("data.dat"))
puts "Hit enter to set to nil"
gets
data = nil
GC.start
puts "It's nil, back from GC.start"
sleep 500
When we run read.rb, the process is about 3 MB:
$ ps -o rss,vsz -p 30079
RSS VSZ
1516 3048
Now, after we read in the data:
$ ps -o rss,vsz -p 30079
RSS VSZ
80600 82104
Great scott, 82 MB! But surely that GC.start will clean them up, right?
$ ps -o rss,vsz -p 30079
RSS VSZ
80620 82104
And, no. Booooo. But replacing that OpenStruct with a Hash causes the memory size to only go from 3 MB to 7 MB, which is a bit of an improvement over rising from 3 MB to 80 MB.
What would have been nice in this situation would be to have some sort of heap analyzer that showed the size of memory occupied by each object graph subtree. It'd be worth looking at Evan's patch to Ruby 1.8.6 and his profiler to see if we could somehow gather and dump enough information to create something like that. What a cool project that would be for someone much smarter than I :-)
The mem_inspect gets you most of the way there. Bleakhouse uses it to do its object counts. Its fairly time consuming to walk the heap and figure out how big everything is, though.
Posted by: Eric Hodel | September 20, 2007 at 03:10 PM
"But replacing that OpenStruct with a Hash causes the memory size to only go from 3 MB to 7 MB, which is a bit of an improvement over rising from 3 MB to 80 MB."
Does the memory go back down with a Hash, or not? What about a
class Fred
attr_accessor :first_name, :last_name, :hat_size, :country
end
which being a normal class might be better?
(I'm aware that your real OpenStruct may not suit a regular class, but I'm wondering if this tells us something useful about GC in Ruby, or just about OpenStruct.)
While trying to challenge assumptions, as one needs to at times like this: what happens if you use YAML instead of Marshal? I'd hope it makes no difference...
Posted by: Hugh Sasse | September 21, 2007 at 05:31 AM
Check out LiteSpeed too, it will enforce hard and soft memory limits for your rails apps, and auto spawns/kills handlers as needed
Posted by: Jerrett | September 21, 2007 at 01:06 PM
Also check out Dike, which might be able to give you information down to the lines of code that cause the leak.
http://www.codeforpeople.com/lib/ruby/dike/
Posted by: Grant Hutchins | September 29, 2007 at 03:04 PM
So...my question is why doesn't GC.start get it?
Posted by: roger pack | November 03, 2007 at 11:30 PM
Roger - exactly! I have no idea... something for ruby-core, I suppose...
Posted by: tomcopeland | November 21, 2007 at 11:30 AM
You could try making the heap chunk size small (and not grow exponentially) and see if ruby is able to reclaim that unused space
Posted by: roger pack | January 17, 2008 at 06:54 PM
Roger - yup, I haven't been watching ruby-core lately, not sure what the status of this one is...
Posted by: tomcopeland | January 17, 2008 at 09:31 PM