Forum - Packet loss on demuc02 causing hangups :: SixXS

Packet loss on demuc02 causing hangups

Shadow Hawkins on Wednesday, 03 December 2014 13:15:52

I have a network behind a heartbeat tunnel (MTU 1472, same problem with 1452) to PoP demuc02. When I connect to native IPv6 hosts on the Internet, I regularly experience 3-4 second hangups, especially noticeable during interactive SSH sessions. These hangups only happen over the SixXS tunnel, not when using IPv4. During these hangups, the following packets arrive at the network's router:

456 13:35:27.803350000 local-computer -> remote-computer SSH 124 [TCP ACKed unseen segment] Encrypted request packet len=52
457 13:35:27.823352000 local-computer -> remote-computer SSH 124 [TCP ACKed unseen segment] Encrypted request packet len=52
458 13:35:27.843353000 local-computer -> remote-computer SSH 124 [TCP ACKed unseen segment] Encrypted request packet len=52
459 13:35:27.915359000 local-computer -> remote-computer SSH 228 [TCP ACKed unseen segment] Encrypted request packet len=156
460 13:35:28.059372000 local-computer -> remote-computer TCP 84 [TCP Dup ACK 459#1] [TCP ACKed unseen segment] 41531 > ssh [ACK] Seq=12117 Ack=11805 Win=396 Len=0 TSval=23085426 TSecr=991272835 SLE=11701 SRE=11753
461 13:35:28.151380000 local-computer -> remote-computer SSH 384 [TCP ACKed unseen segment] [TCP Retransmission] Encrypted request packet len=312
462 13:35:28.563415000 local-computer -> remote-computer TCP 84 [TCP Dup ACK 461#1] [TCP ACKed unseen segment] 41531 > ssh [ACK] Seq=12117 Ack=11805 Win=396 Len=0 TSval=23085552 TSecr=991272835 SLE=11701 SRE=11753
463 13:35:28.623420000 local-computer -> remote-computer SSH 384 [TCP ACKed unseen segment] [TCP Retransmission] Encrypted request packet len=312
464 13:35:29.567500000 local-computer -> remote-computer SSH 384 [TCP ACKed unseen segment] [TCP Retransmission] Encrypted request packet len=312
465 13:35:29.571501000 local-computer -> remote-computer TCP 84 [TCP Dup ACK 464#1] [TCP ACKed unseen segment] 41531 > ssh [ACK] Seq=12117 Ack=11805 Win=396 Len=0 TSval=23085804 TSecr=991272835 SLE=11701 SRE=11753
466 13:35:31.459662000 local-computer -> remote-computer SSH 384 [TCP ACKed unseen segment] [TCP Retransmission] Encrypted request packet len=312
467 13:35:31.491664000 remote-computer -> local-computer SSH 124 [TCP Previous segment not captured] Encrypted response packet len=52
468 13:35:31.491664000 local-computer -> remote-computer SSH 2872 [TCP ACKed unseen segment] Encrypted request packet len=2800

while the remote host just retransmits (captured remotely):

457 13:35:28.046184000 remote-computer -> local-computer SSH 140 [TCP Retransmission] Encrypted response packet len=52
458 13:35:28.550139000 remote-computer -> local-computer SSH 140 [TCP Retransmission] Encrypted response packet len=52
459 13:35:29.558186000 remote-computer -> local-computer SSH 140 [TCP Retransmission] Encrypted response packet len=52

This really baffles me. How could this be debugged?
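
As a rough illustration, the timestamps of the frames marked [TCP Retransmission] in the two captures above can be turned into gaps with a few lines of Python. The helper name `gaps` is made up for this sketch; the timestamps are copied from the captures.

```python
from datetime import datetime

def gaps(stamps):
    """Gaps in seconds between consecutive HH:MM:SS.fffffffff capture timestamps."""
    # strptime's %f takes at most 6 fractional digits, so cut the nanoseconds off.
    times = [datetime.strptime(s[:15], "%H:%M:%S.%f") for s in stamps]
    return [round((b - a).total_seconds(), 3) for a, b in zip(times, times[1:])]

# Local capture, frames 461, 463, 464, 466 (all marked as retransmissions).
local = ["13:35:28.151380000", "13:35:28.623420000",
         "13:35:29.567500000", "13:35:31.459662000"]
# Remote capture, frames 457, 458, 459.
remote = ["13:35:28.046184000", "13:35:28.550139000", "13:35:29.558186000"]

print(gaps(local))   # [0.472, 0.944, 1.892]
print(gaps(remote))  # [0.504, 1.008]
# Time between frame 459 and the first response seen again in frame 467.
print(gaps(["13:35:27.915359000", "13:35:31.491664000"]))  # [3.576]
```

The gaps roughly double on both sides, which is consistent with TCP's exponential retransmission backoff: during the stall neither end sees its data acknowledged, and the session resumes about 3.6 seconds after the last regular packet.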

Jeroen Massar (SixXS Staff) on Wednesday, 03 December 2014 13:37:12

> I have a network behind a heartbeat tunnel (MTU 1472, same problem with 1452) to PoP demuc02.

As you mention MTU, what is the real MTU between your tunnel endpoint and the PoP and what tunneling protocol is being used?

> During these hangups, the following packets arrive at the network's router:

Partially futile to look at IPv6 when the drop could be happening at the IPv4 layer.

> These hangups only happen over the SixXS tunnel, not when using IPv4.

IPv4 paths can be completely different from IPv6 paths (physical and code) and thus different nodes can be involved but also different settings/code decisions.

> How could this be debugged?

Running an 'mtr', possibly in both directions can at least show a possible hop where things go missing. Do note that if packets drop on the IPv6 layer that they just might have gone missing on the IPv4 layer as one is tunneling. Hence one also has to check IPv4 at the same time. And that trick only works if it is consistently dropping all packet types and not just dropping certain packets. If that is not the case one really has to start looking at every single hop to see where packets are going missing. Typically though mtr is a good indicator.
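
A minimal sketch of how the matching IPv4 and IPv6 runs could be assembled, assuming a reasonably recent mtr that supports the long options used here. The function name is hypothetical, and the two targets are documentation addresses standing in for the real IPv4 address of the PoP and the IPv6 address of the remote host.

```python
import shlex

def mtr_command(target, ipv6=False, udp=False, cycles=100):
    """Build an mtr report-mode command line for one target."""
    argv = ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns"]
    argv.append("-6" if ipv6 else "-4")
    if udp:
        argv.append("--udp")  # probe with UDP instead of ICMP echo
    argv.append(target)
    return argv

# Placeholder targets (documentation ranges), not the hosts from this thread.
for argv in (mtr_command("192.0.2.1"),
             mtr_command("2001:db8::1", ipv6=True),
             mtr_command("192.0.2.1", udp=True)):
    print(shlex.join(argv))
```

Running the same set from the remote end as well covers the return direction, since the forward and return paths may differ.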

Shadow Hawkins on Wednesday, 03 December 2014 14:13:38

Jeroen Massar wrote:

> As you mention MTU, what is the real MTU between your tunnel endpoint and the PoP and what tunneling protocol is being used?

tracepath reveals a PMTU of 1492, as expected. The tunnel protocol in use is proto-41/heartbeat.
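
These numbers line up with plain proto-41 encapsulation: each IPv6 packet is carried inside an IPv4 packet, so the largest usable tunnel MTU is the IPv4 path MTU minus the 20-byte IPv4 header (assuming no IPv4 options). A small sketch of that arithmetic; the function and constant names are made up for illustration:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
IPV6_MIN_MTU = 1280     # smallest MTU an IPv6 link may have

def tunnel_mtu(ipv4_path_mtu):
    """Largest IPv6 MTU that fits into one unfragmented proto-41 packet."""
    mtu = ipv4_path_mtu - IPV4_HEADER_BYTES
    if mtu < IPV6_MIN_MTU:
        raise ValueError("IPv4 path MTU too small for an IPv6 tunnel")
    return mtu

print(tunnel_mtu(1492))  # 1472
```

With a PMTU of 1492, both tunnel MTUs tried in this thread (1472 and 1452) fit without IPv4 fragmentation.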

> How could this be debugged? Running an 'mtr', possibly in both directions can at least show a possible hop where things go missing.

Isn't the problem with mtr that it uses unrealistic traffic and some of the big routers drop its packets to prioritise real data? That would explain the packet loss statistics below. Anyway, here are the four mtr traces:

# v4 outgoing
 Host                                   Loss%   Snt   Last   Avg  Best  Wrst  StDev
 1. 10.178.17.1                          0.0%    96    0.0   0.0   0.0   0.0   0.0
 2. 82.135.16.28                         0.0%    96   16.0  14.3  12.0  16.0   2.0
 3. 82.135.16.209                        0.0%    96   20.0  22.9  20.0  56.0   5.7
 4. 80.81.192.209                       46.7%    96   20.0  28.1  20.0  64.0  12.7
 5. 91.121.128.82                        0.0%    96   28.0  24.8  24.0  28.0   1.6
 6. 188.165.9.98                         0.0%    96   32.0  31.5  28.0  36.0   1.5
 7. 213.186.32.153                       2.6%    96   28.0  40.6  28.0 168.0  28.2
 8. 91.121.215.129                      11.8%    96   32.0  35.6  28.0 184.0  25.2
 9. 87.98.215.238                        0.0%    95   32.0  30.9  28.0  32.0   1.8

# v4 incoming, to DSL modem only
 Host                                   Loss%   Snt   Last   Avg  Best  Wrst  StDev
 1. 87.98.212.254                        5.2%    96    0.6  12.8   0.3 367.5  47.2
 2. 91.121.215.200                       0.0%    95    0.5  10.4   0.3 166.4  31.0
 3. 213.186.32.146                       0.0%    95    0.9   1.0   0.8   2.0   0.2
 4. 188.165.9.101                        0.0%    95    7.6   7.7   7.2  25.0   1.8
 5. 91.121.128.124                      70.2%    95    9.6  11.6   9.5  45.7   7.1
 6. 80.81.193.7                          0.0%    95   13.9  15.7   9.8  80.9  14.6
 7. 82.135.76.179                        0.0%    95   31.6  32.8  29.0  79.0   6.5

# v6 outgoing
 Host                                   Loss%   Snt   Last   Avg  Best  Wrst  StDev
 1. 2001:a60:f000:19d::1                 0.0%    97   16.0  15.7  12.0  20.0   1.8
 2. 2001:a60:0:30::1:1                   0.0%    97   12.0  14.9  12.0  20.0   1.9
 3. 2001:a60:0:30::1                    58.9%    96   16.0  16.1  12.0  20.0   1.1
 4. 2001:a60::69:0:2:2                   0.0%    96   24.0  25.0  20.0  68.0   8.4
 5. ???
 6. 2001:41d0::8c2                      56.8%    96   40.0  44.4  20.0 280.0  56.6
 7. 2001:41d0::b04                       9.4%    96   24.0  26.0  24.0  28.0   2.0
 8. 2001:41d0::8a2                      12.5%    96   32.0  31.9  28.0  36.0   1.3
 9. 2001:41d0::5d1                       8.4%    96  164.0  45.8  28.0 216.0  37.7
10. 2001:41d0::133                       4.2%    96   40.0  42.2  28.0 244.0  32.7
11. 2001:41d0:14:b301::1                 0.0%    96   32.0  31.6  28.0  36.0   1.7

# v6 incoming
 Host                                   Loss%   Snt   Last   Avg  Best  Wrst  StDev
 1. 2001:41d0:14:b3ff:ff:ff:ff:fd       28.7%    94    0.6   9.6   0.5 198.6  36.6
 2. 2001:41d0::134                       0.0%    94    0.7  17.5   0.5 185.2  44.2
 3. 2001:41d0::5c1                      20.2%    94    1.5   1.2   1.1   1.6   0.1
 4. 2001:41d0::8a3                       1.1%    93    7.5   7.5   7.3   8.4   0.2
 5. 2001:41d0::b05                      58.1%    93   42.7  24.6   9.5 205.4  34.1
 6. 2001:7f8::223f:0:1                   0.0%    93   12.6  10.9   9.8  40.5   3.8
 7. 2001:a60:0:30::1                    33.3%    93   18.3  18.3  18.1  19.5   0.2
 8. 2001:a60:0:30::1:1                   0.0%    93   16.8  16.8  16.8  17.2   0.1
 9. 2001:a60:f000:19d::1                 0.0%    93   17.0  17.0  16.8  22.2   0.6
10. 2001:a60:f10a::1                     0.0%    93   30.1  32.1  29.8  34.3   1.3

> Do note that if packets drop on the IPv6 layer that they just might have gone missing on the IPv4 layer as one is tunneling. Hence one also has to check IPv4 at the same time.

Yeah, except that IPv4 works fine and there have never been problems. The only sensible explanation I have thus far is that the DSL modem doesn't properly forward proto-41 packets, but I am having a really hard time imagining how it could screw that up, given that it also works in between the hangups.

Jeroen Massar (SixXS Staff) on Wednesday, 03 December 2014 14:26:52

Ah you could have mentioned that you are attempting to use OVH. They are known to have flaky network connectivity resulting from their "DDoS protection". Please test from a completely different network. Note that that IPv4 path is not the one being chosen for the path towards the PoP over which your tunnel runs, hence, while it demonstrates that that path has problems it is likely irrelevant to the path the PoP is at. Including source + destination addresses is always a useful thing btw. 80.81.192.209 = OVH @ DE-CIX. Likely their peering port is full and it is indeed not unlikely they are just randomly dropping ICMP (very useful for the errors it reports...)

> Isn't the problem with mtr that it uses unrealistic traffic and some of the big routers drop its packets to prioritise real data?

mtr uses standard ICMP echo requests. Nothing special about that. (-u option can make it do UDP btw). Depending on the configuration of a node it might indeed drop ICMP messages which is why using mtr is just an indication; especially as it does not match the actual traffic you are sending. Hence why I noted about config/code decisions. Typically though one should not see much drop because of that unless something is sending a lot of ICMP and thus triggering the problem.

> Anyway, here are the four mtr traces:

Showing asymmetric IPv4 paths, which can hide all kinds of artefacts; but it clearly shows that there are drops already happening in IPv4.

> # v6 incoming

Big latency spikes (see stdev column) and random packet loss at various hosts. Not at all hosts, which is funny though; typically, if an earlier host is generating consistent loss, the rest of the trace should have it too. I would definitely look at the IPv4 layer before bothering with IPv6.
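
That rule of thumb can be sketched as a small check over the Loss% column: loss at a hop only counts as real if every later hop shows at least as much. This is a heuristic with a made-up function name and an arbitrary tolerance, not a definitive test; with fewer than 100 probes per hop the percentages are noisy.

```python
def persistent_loss_hops(loss_by_hop, tolerance=1.0):
    """Return 1-based hop numbers whose loss is also seen at every later hop."""
    hops = []
    for index, loss in enumerate(loss_by_hop):
        if loss <= tolerance:
            continue
        later = loss_by_hop[index + 1:]
        # Loss that vanishes further down the path points at a router that is
        # merely slow to answer probes, not at forwarded traffic being dropped.
        if all(value >= loss - tolerance for value in later):
            hops.append(index + 1)
    return hops

# Loss% columns copied from the traces posted in this thread.
v4_outgoing = [0.0, 0.0, 0.0, 46.7, 0.0, 0.0, 2.6, 11.8, 0.0]
v6_incoming = [28.7, 0.0, 20.2, 1.1, 58.1, 0.0, 33.3, 0.0, 0.0, 0.0]

print(persistent_loss_hops(v4_outgoing))  # []
print(persistent_loss_hops(v6_incoming))  # []
# Invented example of loss that starts at hop 2 and carries through to the end.
print(persistent_loss_hops([0.0, 10.0, 11.0, 10.5]))  # [2, 3, 4]
```

By this measure none of the per-hop loss in those two traces reaches the final hop, so the traces alone do not pinpoint where tunnel packets go missing.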

Shadow Hawkins on Thursday, 04 December 2014 15:41:06

Jeroen Massar wrote:

> I would definitely look at the IPv4 layer before bothering with IPv6.

Yeah, especially since I just tried a HE.net tunnel and saw the same behaviour. IPv4 works fine though, really. Could you imagine a bug in a router firmware that could exhibit this behaviour? I.e. reliable IPv4 proto 8 and 17, but proto 41 experiences random losses? I am failing to come up with a bug scenario.

Jeroen Massar (SixXS Staff) on Thursday, 04 December 2014 16:21:27

Martin F. Krafft wrote:

Jeroen Massar wrote:

> I would definitely look at the IPv4 layer before bothering with IPv6.

Only 'bug' could be a configuration issue or different queues for different protocols, but that does not make sense. Is the box maybe also doing NAT or some other kind of connection tracking? Also note that this could be happening in any node in the path.

Shadow Hawkins on Thursday, 04 December 2014 22:08:11

Jeroen Massar wrote:

> Is the box maybe also doing NAT or some other kind of connection tracking?

Yeah, it is! And it's not exactly hard to NAT/conntrack a proto-41 connection. This is why I am asking. It works most of the time, there is just repetitive but inconsistent packet loss.

Jeroen Massar (SixXS Staff) on Friday, 05 December 2014 08:36:15

Martin F. Krafft wrote:

Jeroen Massar wrote:

> Is the box maybe also doing NAT or some other kind of connection tracking?

> Yeah, it is! And it's not exactly hard to NAT/conntrack a proto-41 connection. This is why I am asking. It works most of the time, there is just repetitive but inconsistent packet loss.

While it might not be hard, it is not always done correctly. First of all the NAT state can time out. Next to that another host might send a proto-41 packet and update the state in its favor. There are likely other scenarios where this all breaks down. Hence, if you are performing connection tracking (including NAT) and doing proto-41 then you might hit issues with it. There is a reason there is a FAQ about this. Note that this applies to the proto-41 packets being forwarded over that box and even when terminating it on the box. Also note that it could be one of many reasons why those packets go missing. Hard to say without looking at every hop in the path.
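
The two failure modes named above (state timing out, and another host taking over the state) can be shown with a toy model. This is a deliberately simplified sketch and not how any particular router or conntrack implementation works; the class name, host names and the 60-second timeout are all assumptions. The key point is that proto-41 has no port numbers, so the only key left for the state is the remote address.

```python
class Proto41Nat:
    """Toy NAT state for a portless protocol: one inside host per remote endpoint."""

    def __init__(self, timeout=60.0):
        self.timeout = timeout
        self.state = {}  # remote address -> (inside host, time last seen)

    def outbound(self, inside_host, remote, now):
        # Any inside host sending to the remote (re)claims the state entry.
        self.state[remote] = (inside_host, now)

    def inbound(self, remote, now):
        """Return the inside host an inbound packet is forwarded to, or None if dropped."""
        entry = self.state.get(remote)
        if entry is None:
            return None
        host, last_seen = entry
        if now - last_seen > self.timeout:
            del self.state[remote]  # expired state: packet is dropped
            return None
        self.state[remote] = (host, now)
        return host

nat = Proto41Nat(timeout=60.0)
nat.outbound("tunnel-router", "pop", now=0.0)
print(nat.inbound("pop", now=30.0))   # tunnel-router
print(nat.inbound("pop", now=200.0))  # None (idle longer than the timeout)

nat.outbound("tunnel-router", "pop", now=300.0)
nat.outbound("other-host", "pop", now=310.0)
print(nat.inbound("pop", now=311.0))  # other-host (state was taken over)
```

In both cases the tunnel works again as soon as the tunnel endpoint sends its next packet, which would look like short, irregular outages rather than a permanent failure.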
