I wanted to share a recent issue with stock Ruby on
Centos6.
We run a Rails application that connects to a MySQL datastore.
We're in a datacenter transformation and we deployed the
application to our new datacenter though the MySQL datastore is
at the other datacenter. As you can see there is about a 35ms
distance between the application and the MySQL instance.
[root@host]# ping 172.x.y.19
PING 172.x.y.19 (172.x.y.19) 56(84) bytes of data.
64 bytes from 172.x.y.19: icmp_seq=1 ttl=253 time=32.6 ms
64 bytes from 172.x.y.19: icmp_seq=2 ttl=253 time=38.1 ms
64 bytes from 172.x.y.19: icmp_seq=3 ttl=253 time=36.0 ms
^C
--- 172.x.y.19 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time
2751ms
rtt min/avg/max/mdev = 32.656/35.600/38.101/2.250 ms
We started the Rails application up though it just was not
working. Looking at netstat we observed the
following:
[root@host]# netstat -natp | grep SYN
tcp 0 1
172.x.y.63:33739
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33745
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33748
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33741
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33744
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33753
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33740
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33751
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33754
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33746
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33747
172.x.y.19:3306 SYN_SENT
-
tcp 0 1
172.x.y.63:33742
172.x.y.19:3306 SYN_SENT
-
It appeared that the instance was tearing down the TCP connection
and reestablishing it.
Stracing called out
strace -f -p 13517 -s 1024 -tttt
[pid 13517] 1334041023.537514 connect(18, {sa_family=AF_INET,
sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.547227 <... futex resumed> ) = -1
ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.547253 tgkill(13517, 13517, SIGVTALRM) =
0
[pid 13517] 1334041023.547290 <... connect resumed> ) = ?
ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.547312 futex(0x3368ce74f0,
FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.547327 --- SIGVTALRM (Virtual timer
expired) @ 0 (0) ---
[pid 13518] 1334041023.547396 <... futex resumed> ) =
0
[pid 13517] 1334041023.547413 rt_sigreturn(0x1a
[pid 13518] 1334041023.547437 futex(0x3368ce74c4,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689355,
{1334041023, 557284256}, ffffffff
[pid 13517] 1334041023.547460 <... rt_sigreturn resumed> )
= -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.547484 shutdown(18, 2 /* send and receive
*/) = 0
[pid 13517] 1334041023.547532 close(18) = 0
[pid 13517] 1334041023.547578 socket(PF_INET, SOCK_STREAM,
IPPROTO_IP) = 18
[pid 13517] 1334041023.547626 fcntl(18, F_SETFL, O_RDONLY) =
0
[pid 13517] 1334041023.547672 fcntl(18, F_GETFL) = 0x2 (flags
O_RDWR)
[pid 13517] 1334041023.547708 connect(18, {sa_family=AF_INET,
sin_port=htons(3306), sin_addr=inet_addr("172.x.y.19")}, 16
[pid 13518] 1334041023.557389 <... futex resumed> ) = -1
ETIMEDOUT (Connection timed out)
[pid 13518] 1334041023.557415 tgkill(13517, 13517, SIGVTALRM) =
0
[pid 13517] 1334041023.557465 <... connect resumed> ) = ?
ERESTARTSYS (To be restarted)
[pid 13518] 1334041023.557492 futex(0x3368ce74f0,
FUTEX_WAKE_PRIVATE, 1
[pid 13517] 1334041023.557509 --- SIGVTALRM (Virtual timer
expired) @ 0 (0) ---
[pid 13518] 1334041023.557591 <... futex resumed> ) =
0
[pid 13517] 1334041023.557609 rt_sigreturn(0x1a
[pid 13518] 1334041023.557635 futex(0x3368ce74c4,
FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 123689357,
{1334041023, 567453732}, ffffffff
[pid 13517] 1334041023.557669 <... rt_sigreturn resumed> )
= -1 EINTR (Interrupted system call)
[pid 13517] 1334041023.557695 shutdown(18, 2 /* send and receive
*/) = 0
[pid 13517] 1334041023.557732 close(18) = 0x.y
So Ruby was connecting to MySQL though after about 10ms
was receiving a SIGVTALRM and then immediately
after it would re-attempt to establish the connection.
tcpdumping 3306 didnt call out anything - it appeared to be doing
the initial authentication to the database server and
then terminating.
The debugging called out threading issues that
appear to still be in the stock Ruby that ships with Centos6
Now I'm not really over the internals of ruby though I did
then build REE and re-attempt starting the application which
worked. I'm guessing a bug matching this one was fixed in REE though I
have yet to find the specific bug report.
So in summary there still appears to be a threading bug within
stock Ruby that ships with centos6 that will terminate a
MySQL connection attempt if the datastore is more than 10ms away.
Please comment if you have come across this yourself or have a
different experience here.
As I side note, I found @tmm1 's slide deck on debugging ruby helpful.
Apr
10
2012