Stock Centos6 Ruby and a MySQL database far away

I wanted to share a recent issue with stock Ruby on Centos6.

We run a Rails application that connects to a MySQL datastore. We're in a datacenter transformation and we deployed the application to our new datacenter though the MySQL datastore is at the other datacenter. As you can see there is about a 35ms distance between the application and the MySQL instance.



[root@host]# ping 172.x.y.19
PING 172.x.y.19 (172.x.y.19) 56(84) bytes of data.
64 bytes from 172.x.y.19: icmp_seq=1 ttl=253 time=32.6 ms
64 bytes from 172.x.y.19: icmp_seq=2 ttl=253 time=38.1 ms
64 bytes from 172.x.y.19: icmp_seq=3 ttl=253 time=36.0 ms
^C
--- 172.x.y.19 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2751ms
rtt min/avg/max/mdev = 32.656/35.600/38.101/2.250 ms


We started the Rails application up though it just was not working. Looking at netstat we observed the following:


[root@host]# netstat -natp | grep SYN
tcp        0      1 172.x.y.63:33739          172.x.y.19:3306           SYN_SENT    -                     
tcp        0      1 172.x.y.63:33745          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33748          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33741          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33744          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33753          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33740          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33751          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33754          172.x.y.19:3306           SYN_SENT    -                  
tcp        0      1 172.x.y.63:33746          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33747          172.x.y.19:3306           SYN_SENT    -                   
tcp        0      1 172.x.y.63:33742          172.x.y.19:3306           SYN_SENT    -


It appeared that the instance was tearing down the TCP connection and reestablishing it.

Stracing called out



strace -f -p 13517 -s 1024 -tttt
[pid 13517] 1334041023.537514 connect(18, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.547227 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.547253 tgkill(13517, 13517, SIGVTALRM) = 0
[pid 13517] 1334041023.547290 <... connect resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.547312 futex(0x3368ce74f0, FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.547327 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 13518] 1334041023.547396 <... futex resumed> ) = 0
[pid 13517] 1334041023.547413 rt_sigreturn(0x1a
[pid 13518] 1334041023.547437 futex(0x3368ce74c4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689355, {1334041023, 557284256}, ffffffff
[pid 13517] 1334041023.547460 <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.547484 shutdown(18, 2 /* send and receive */) = 0
[pid 13517] 1334041023.547532 close(18) = 0
[pid 13517] 1334041023.547578 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 18
[pid 13517] 1334041023.547626 fcntl(18, F_SETFL, O_RDONLY) = 0
[pid 13517] 1334041023.547672 fcntl(18, F_GETFL) = 0x2 (flags O_RDWR)
[pid 13517] 1334041023.547708 connect(18, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.557389 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.557415 tgkill(13517, 13517, SIGVTALRM) = 0
[pid 13517] 1334041023.557465 <... connect resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.557492 futex(0x3368ce74f0, FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.557509 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 13518] 1334041023.557591 <... futex resumed> ) = 0
[pid 13517] 1334041023.557609 rt_sigreturn(0x1a
[pid 13518] 1334041023.557635 futex(0x3368ce74c4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689357, {1334041023, 567453732}, ffffffff
[pid 13517] 1334041023.557669 <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.557695 shutdown(18, 2 /* send and receive */) = 0
[pid 13517] 1334041023.557732 close(18) = 0x.y

So Ruby was connecting to MySQL though after about 10ms was receiving a SIGVTALRM and then immediately after it would re-attempt to establish the connection. tcpdumping 3306 didnt call out anything - it appeared to be doing the initial authentication to the database server and then terminating.

 The debugging called out threading issues that appear to still be in the stock Ruby that ships with Centos6  Now I'm not really over the internals of ruby though I did then build REE and re-attempt starting the application which worked. I'm guessing a bug matching this one  was fixed in REE though I have yet to find the specific bug report.

So in summary there still appears to be a threading bug within stock Ruby that ships with centos6 that will  terminate a MySQL connection attempt if the datastore is more than 10ms away. Please comment if you have come across this yourself or have a different experience here.


As I side note, I found @tmm1 's slide deck on debugging ruby helpful.