Wednesday, October 9, 2013

JRuby 1.7.x Blocking IO Timeouts

If you haven't read Release It! : Design and Deploy Production-Ready Software I strongly recommend that you do. This book has really changed the way that I think about software design and architecture, in particular I think more about the way software systems fail and developing strategies for dealing with system outages and recovery. The reality is systems need to be designed for failure, i.e. they have to be aware that things will go wrong and that we have to be prepared for the worst.  

One of my favorite tools in Ruby to mitigate "slow" dependencies is to use the Timeout module, in the past Ruby's Green Thread Model rendered it useless. Fortunately the SystemTimer gem was there to help us through the dark ages until the 1.9 version was released and we where able to go back to using the actual timeout module. 
  
A few years ago our team switched to JRuby and we were excited about taking advantage of concurrency on the JVM. You can imaging my surprise when I wrote the Timeout tests for our Gateway classes blocked indefinitely never raising an error! After a bit of searching I read about a defect in JRuby that prevents the timer thread from interrupting the I/O thread. 

Once I gained my composure, I started looking for alternatives to the solution. Fortunately I found this little snippet of code. It seemed to work great until I ran into an issue when the thread executing in the timeout block raised native java.lang.Throwable types would silently fail and not propagate the error the the main thread.  

Once I "fixed" that problem, I started to execute long running load tests against the system. Eventually we ran into a new issue:

 Found one Java-level deadlock:
=============================
"http-/192.168.220.196:8080-8192":
waiting to lock monitor 0x00007fca797bffa8 (object 0x00000005a6894000, a org.jruby.internal.runtime.ThreadService),
which is held by "http-/192.168.220.196:8080-598"
"http-/192.168.220.196:8080-598":
waiting to lock monitor 0x00007fca390d0600 (object 0x00000006dd8466c0, a org.jruby.RubyThread),
which is held by "RubyThread-3685559: <somewhere in my code>"
"RubyThread-3685559: <somewhere in my code>":
waiting to lock monitor 0x00007fca797bffa8 (object 0x00000005a6894000, a org.jruby.internal.runtime.ThreadService),
which is held by "http-/192.168.220.196:8080-598"
Timeout! I think it's time to fix the blocking I/O thread bug.

No comments: