I wanted to share a recent issue with stock Ruby on Centos6.
We run a Rails application that connects to a MySQL datastore. We're in a datacenter transformation and we deployed the application to our new datacenter though the MySQL datastore is at the other datacenter. As you can see there is about a 35ms distance between the application and the MySQL instance.
[root@host]# ping 172.x.y.19
PING 172.x.y.19 (172.x.y.19) 56(84) bytes of data.
64 bytes from 172.x.y.19: icmp_seq=1 ttl=253 time=32.6 ms
64 bytes from 172.x.y.19: icmp_seq=2 ttl=253 time=38.1 ms
64 bytes from 172.x.y.19: icmp_seq=3 ttl=253 time=36.0 ms
^C
--- 172.x.y.19 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2751ms
rtt min/avg/max/mdev = 32.656/35.600/38.101/2.250 ms
We started the Rails application up though it just was not working. Looking at netstat we observed the following:
[root@host]# netstat -natp | grep SYN
tcp 0 1 172.x.y.63:33739 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33745 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33748 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33741 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33744 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33753 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33740 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33751 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33754 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33746 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33747 172.x.y.19:3306 SYN_SENT -
tcp 0 1 172.x.y.63:33742 172.x.y.19:3306 SYN_SENT -
It appeared that the instance was tearing down the TCP connection and reestablishing it.
Stracing called out
strace -f -p 13517 -s 1024 -tttt
[pid 13517] 1334041023.537514 connect(18, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.547227 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.547253 tgkill(13517, 13517, SIGVTALRM) = 0
[pid 13517] 1334041023.547290 <... connect resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.547312 futex(0x3368ce74f0, FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.547327 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 13518] 1334041023.547396 <... futex resumed> ) = 0
[pid 13517] 1334041023.547413 rt_sigreturn(0x1a
[pid 13518] 1334041023.547437 futex(0x3368ce74c4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689355, {1334041023, 557284256}, ffffffff
[pid 13517] 1334041023.547460 <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.547484 shutdown(18, 2 /* send and receive */) = 0
[pid 13517] 1334041023.547532 close(18) = 0
[pid 13517] 1334041023.547578 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 18
[pid 13517] 1334041023.547626 fcntl(18, F_SETFL, O_RDONLY) = 0
[pid 13517] 1334041023.547672 fcntl(18, F_GETFL) = 0x2 (flags O_RDWR)
[pid 13517] 1334041023.547708 connect(18, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.557389 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.557415 tgkill(13517, 13517, SIGVTALRM) = 0
[pid 13517] 1334041023.557465 <... connect resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.557492 futex(0x3368ce74f0, FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.557509 --- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
[pid 13518] 1334041023.557591 <... futex resumed> ) = 0
[pid 13517] 1334041023.557609 rt_sigreturn(0x1a
[pid 13518] 1334041023.557635 futex(0x3368ce74c4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689357, {1334041023, 567453732}, ffffffff
[pid 13517] 1334041023.557669 <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.557695 shutdown(18, 2 /* send and receive */) = 0
[pid 13517] 1334041023.557732 close(18) = 0x.y
So Ruby was connecting to MySQL though after about 10ms was receiving a SIGVTALRM and then immediately after it would re-attempt to establish the connection. tcpdumping 3306 didnt call out anything - it appeared to be doing the initial authentication to the database server and then terminating.
The debugging called out threading issues that appear to still be in the stock Ruby that ships with Centos6 Now I'm not really over the internals of ruby though I did then build REE and re-attempt starting the application which worked. I'm guessing a bug matching this one was fixed in REE though I have yet to find the specific bug report.
So in summary there still appears to be a threading bug within stock Ruby that ships with centos6 that will terminate a MySQL connection attempt if the datastore is more than 10ms away. Please comment if you have come across this yourself or have a different experience here.
As I side note, I found @tmm1 's slide deck on debugging ruby helpful.