Posts Tagged ‘bug’

Multiple Network Interface Gotcha in Linux

Sunday, November 9th, 2008

Ok here’s a recipe to try:

  • Get a development board or PC that has two or more network interfaces,
  • Assign each of them a unique IP address,
  • Connect the interfaces to a common network,
  • Finally, ping one of the IP addresses from another machine.

Now, depending on which IP address you chose to ping, you may find that your pings suddenly stop getting replies and time out when you disconnect the other interface (i.e. the interface you are not pinging). Bizarre, isn’t it?

However, as my colleague and I recently discovered whilst debugging a new Ethernet driver, this gotcha is actually correct behaviour for Linux, and in fact it is correct behaviour as defined by the relevant RFCs. I thought I’d use this post to explore what is going on and why this is OK.

In order to reproduce this behaviour I set up a virtual machine and assigned it a number of NAT’d Ethernet devices (four, in fact). I also set up Wireshark (what used to be Ethereal) so that I could monitor any traffic. Here is a cut-down version of the output from ifconfig.

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0c:29:12:0b:bd
          inet addr:192.168.27.132  Bcast:192.168.27.255  Mask:255.255.255.0

eth1      Link encap:Ethernet  HWaddr 00:0c:29:12:0b:c7
          inet addr:192.168.27.133  Bcast:192.168.27.255  Mask:255.255.255.0

eth2      Link encap:Ethernet  HWaddr 00:0c:29:12:0b:d1
          inet addr:192.168.27.135  Bcast:192.168.27.255  Mask:255.255.255.0

eth3      Link encap:Ethernet  HWaddr 00:0c:29:12:0b:db
          inet addr:192.168.27.134  Bcast:192.168.27.255  Mask:255.255.255.0

I cleared the ARP cache on my Windows machine by using “arp -d” and then pinged 192.168.27.132. The packet exchange captured by Wireshark proved to be quite interesting. Let’s take a look.

No.  Source                Destination        Protocol   Info
1  Vmware_c0:00:08       Broadcast             ARP      Who has 192.168.27.132?  Tell 192.168.27.1
2  Vmware_12:0b:db       Vmware_c0:00:08       ARP      192.168.27.132 is at 00:0c:29:12:0b:db
3  192.168.27.1          192.168.27.132        ICMP     Echo (ping) request
4  Vmware_12:0b:d1       Vmware_c0:00:08       ARP      192.168.27.132 is at 00:0c:29:12:0b:d1
5  Vmware_12:0b:c7       Vmware_c0:00:08       ARP      192.168.27.132 is at 00:0c:29:12:0b:c7
6  Vmware_12:0b:bd       Vmware_c0:00:08       ARP      192.168.27.132 is at 00:0c:29:12:0b:bd
7  192.168.27.132        192.168.27.1          ICMP     Echo (ping) reply
8  192.168.27.1          192.168.27.132        ICMP     Echo (ping) request
9  192.168.27.132        192.168.27.1          ICMP     Echo (ping) reply
10 192.168.27.1          192.168.27.132        ICMP     Echo (ping) request
11 192.168.27.132        192.168.27.1          ICMP     Echo (ping) reply

After I invoke the ping command, my machine issues an ARP broadcast, asking for the MAC address currently associated with 192.168.27.132. However, all of the network interfaces of my virtual machine respond, resulting in four ARP replies. When this happens Windows (and other OSes) will ignore all but the first response, on the assumption that the first reply must have come via the quickest route.
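
The “first reply wins” behaviour described above can be sketched in a few lines of C. This is only a toy model of the requester’s ARP cache as this post describes it, not how Windows or any real stack is implemented; the names arp_cache, arp_lookup and arp_handle_reply are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARP_CACHE_SIZE 8

/* One cached mapping from an IPv4 address to a MAC address. */
struct arp_entry {
    uint32_t ip;      /* IPv4 address in host byte order */
    uint8_t  mac[6];
    int      in_use;
};

struct arp_cache {
    struct arp_entry entries[ARP_CACHE_SIZE];
};

/* Return the cached MAC for ip, or NULL if there is no entry yet. */
const uint8_t *arp_lookup(const struct arp_cache *cache, uint32_t ip)
{
    for (size_t i = 0; i < ARP_CACHE_SIZE; i++) {
        const struct arp_entry *e = &cache->entries[i];
        if (e->in_use && e->ip == ip)
            return e->mac;
    }
    return NULL;
}

/*
 * Handle an ARP reply under the assumed "first reply wins" policy.
 * Returns 1 if the reply was stored, 0 if it was ignored because an
 * entry for ip already exists or the cache is full.
 */
int arp_handle_reply(struct arp_cache *cache, uint32_t ip,
                     const uint8_t mac[6])
{
    if (arp_lookup(cache, ip) != NULL)
        return 0;                 /* a quicker reply already won */

    for (size_t i = 0; i < ARP_CACHE_SIZE; i++) {
        struct arp_entry *e = &cache->entries[i];
        if (!e->in_use) {
            e->ip = ip;
            memcpy(e->mac, mac, 6);
            e->in_use = 1;
            return 1;
        }
    }
    return 0;
}
```

Feeding the four replies from the capture into a zero-initialised cache in arrival order would leave 192.168.27.132 mapped to 00:0c:29:12:0b:db, with the other three replies ignored.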

In this example the quickest ARP reply came from the MAC address associated with eth3. Therefore whenever we communicate with 192.168.27.132, as we have done via ping, the traffic will be sent to eth3. As a result, if we now down interface eth3 with “ifconfig eth3 down”, our pings will fail. This behaviour can be confusing: why should eth3 going down affect traffic that is directed to 192.168.27.132, which we believed to be associated with eth0?

Despite the impression ifconfig gives, Linux associates IP addresses with the host as opposed to individual interfaces of the host. With that in mind, the behaviour we’ve seen doesn’t seem so bizarre. When a network interface receives an ARP request for an IP address that the host owns, then in effect a valid network route has been made between the requestor and the requested. This route could potentially be the only route and, as it is likely that the two will communicate with each other, it makes sense to reply to the ARP request. And this is what happens: the network interface that received the ARP request will now act as a proxy for the requested IP address.
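
To make the host-versus-interface distinction concrete, here is a small C sketch of the decision an interface makes when an ARP request arrives. It is a simplified model of the behaviour described in this post and not kernel code; struct iface, host_owns_ip, should_answer_arp and the per_interface_only switch are all hypothetical, and the strict mode is included only for contrast (it is not meant to mirror the exact semantics of any particular sysctl).

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal description of one interface on the host (hypothetical). */
struct iface {
    const char *name;
    uint32_t    ip;   /* IPv4 address in host byte order */
};

/* Return 1 if any interface on the host carries target_ip. */
int host_owns_ip(const struct iface *ifs, size_t n, uint32_t target_ip)
{
    for (size_t i = 0; i < n; i++)
        if (ifs[i].ip == target_ip)
            return 1;
    return 0;
}

/*
 * Decide whether the interface at index rx, which received the ARP
 * request, should answer it.
 *
 *   per_interface_only == 0: addresses belong to the host, so any
 *                            interface answers for any local address.
 *   per_interface_only != 0: contrast case, answer only for the
 *                            address configured on rx itself.
 */
int should_answer_arp(const struct iface *ifs, size_t n, size_t rx,
                      uint32_t target_ip, int per_interface_only)
{
    if (rx >= n)
        return 0;
    if (per_interface_only)
        return ifs[rx].ip == target_ip;
    return host_owns_ip(ifs, n, target_ip);
}
```

With eth0 holding 192.168.27.132 and eth3 holding 192.168.27.134, the host-based mode has eth3 answering a request for 192.168.27.132, which is exactly what the capture shows; in the strict contrast mode only eth0 would answer.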

This behaviour is actually quite convenient. In our example, even though our pings began to fail once we disconnected a route to the host, as soon as the Windows ARP cache times out (after 10 minutes) another ARP request will be broadcast. Like before, any interface that can provide a route to the host will respond, and so connectivity will be restored. If Linux weren’t designed in this way and each interface truly owned an IP address, then if that link went down connectivity would never be restored to that address, even though there are other physical connections to the machine that has that IP address!

The other point of interest here, at least with this contrived networking configuration, is that reliability is favoured over performance. The reason is that where multiple interfaces exist on a machine, it’s quite likely that a priority ordering will exist between them. And so if, as in this case, eth3 replies the quickest then it is likely that it will always be the quickest. As a result, it is also likely to respond to all the ARP requests first and so all traffic for the 4 IP addresses will arrive on a single interface. We can demonstrate this. After pinging all of the IP addresses assigned to my virtual machine we can examine the ARP cache of Windows.

arp -a

Interface: 192.168.27.1 --- 0x2
  Internet Address      Physical Address      Type
  192.168.27.132        00-0c-29-12-0b-bd     dynamic
  192.168.27.133        00-0c-29-12-0b-bd     dynamic
  192.168.27.134        00-0c-29-12-0b-bd     dynamic
  192.168.27.135        00-0c-29-12-0b-bd     dynamic

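As you can see, all the IP addresses map to the same MAC address and therefore to a single interface (in this listing 00-0c-29-12-0b-bd, which the ifconfig output above shows as eth0). Thus all the traffic will go over a single 10/100Mbit link instead of four links.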

Fortunately, where this behaviour isn’t ideal, the proc interface provides a means to modify it. Of particular interest are the arp_filter and rp_filter sysctl knobs, which can be found under /proc/sys/net/ipv4/conf/. I’ve not really managed to make complete sense of these yet and may well write another post on them in the future. To get the behaviour described above, though, I had to invoke “echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter”; without this I would only ever get two ARP replies instead of four. I’m not entirely sure why this is… suggestions anyone?
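
If you would rather inspect these knobs from a program than with cat and echo, something like the following C sketch should do. The helper names read_sysctl_int and print_arp_knobs are my own; the only things taken from the system are the /proc/sys/net/ipv4/conf/<interface>/ paths, and the code simply reports a knob as unreadable if the file is missing.

```c
#include <stdio.h>

/*
 * Read a single integer from a /proc/sys style file.
 * Returns 0 on success and stores the value in *value,
 * or -1 if the file cannot be opened or parsed.
 */
int read_sysctl_int(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print arp_filter and rp_filter for one interface ("all", "eth0", ...). */
void print_arp_knobs(const char *ifname)
{
    static const char *const knobs[] = { "arp_filter", "rp_filter" };
    char path[128];

    for (size_t i = 0; i < sizeof knobs / sizeof knobs[0]; i++) {
        int v;

        snprintf(path, sizeof path, "/proc/sys/net/ipv4/conf/%s/%s",
                 ifname, knobs[i]);
        if (read_sysctl_int(path, &v) == 0)
            printf("%s = %d\n", path, v);
        else
            printf("%s: unreadable\n", path);
    }
}
```

Calling print_arp_knobs("all") and then print_arp_knobs("eth0") is a quick way to compare the global setting with the per-interface one before and after changing it.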

And finally, for those that wish to read more, I recommend an article on LWN which provides some more background information.

GCC Weak Symbols

Saturday, October 18th, 2008

GNU’s GCC has a useful (and perhaps not very well known) feature known as ‘weak symbols’. I first discovered this a while back when building a Linux kernel. Unbeknown to me, the Linux kernel makes great use of weak symbols, yet the compiler I used did not correctly support them. Rather than failing to build, the kernel built fine and even ran; I was instead presented with a number of interesting bugs, but more on this later.

In a nutshell, weak symbols permit you to define a symbol that doesn’t need to be resolved at link time, i.e. it allows you to tell the toolchain that this function may not have a body and that this is OK. Furthermore, if the linker later comes across another symbol with the same name that doesn’t have the weak attribute, the weak symbol will be overridden by the stronger one (without a multiple definition linker error). And finally, you can also use the symbol to determine, at run-time, whether such a body exists.
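
The run-time check mentioned above looks roughly like this. It is a minimal sketch assuming GCC (or a compatible compiler) on an ELF platform such as Linux; optional_hook, hook_is_present and call_hook_if_present are hypothetical names chosen for the example.

```c
#include <stddef.h>

/*
 * Weak declaration: if nothing in the final link defines optional_hook,
 * the link still succeeds and the symbol's address is NULL.
 */
extern void optional_hook(void) __attribute__((weak));

/* Return 1 if a definition of optional_hook was linked in, else 0. */
int hook_is_present(void)
{
    return optional_hook != NULL;
}

/* Call the hook only when a body for it actually exists. */
void call_hook_if_present(void)
{
    if (optional_hook)
        optional_hook();
}
```

Built on its own, hook_is_present() returns 0 and call_hook_if_present() does nothing; add another object file that defines optional_hook and the same code starts calling it without any other change.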

To give you an example of its use let’s refer back to my original bug…

v2.6.27/arch/sh/kernel/cpu/clock.c
void __init __attribute__ ((weak))
arch_init_clk_ops(struct clk_ops **ops, int type)
{
}

This function is part of the architecture specific (SH) code for setting up the various clocks of the device. The function defined above is used to return a structure of clock operations (struct clk_ops) which is later used to register the clock within the kernel. As you can see, the function is declared as a weak symbol via the “weak” attribute. Therefore, when built correctly, the function can be overridden.

The design of this part of the kernel is such that generic clock operations are defined in clock.c and can later be overridden via weak symbols by implementations for specific CPU subtypes. For example, this function is overridden in the clock-sh7712.c file…

v2.6.27/arch/sh/kernel/cpu/clock-sh7712.c
void __init arch_init_clk_ops(struct clk_ops **ops, int idx)
{
...

The function hasn’t been defined as a weak symbol and so will override the weak symbol. In this case the function will provide the caller with the clock operations specific to the SH7712. In this manner the existing generic clock support code has been designed such that it can be easily extended to support future SH subtypes. Likewise weak symbols are used elsewhere in the kernel (since 2.4.0) for similar effect.
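
The same pattern can be tried outside the kernel. The sketch below is a stand-alone illustration, not the kernel’s code: struct clk_ops_sketch, init_clk_ops_sketch and selected_clk_ops_name are invented for the example, and unlike the kernel’s empty weak function the default here hands back a generic set of operations so there is something to observe.

```c
#include <stddef.h>

/* Stand-in for a table of clock operations (hypothetical). */
struct clk_ops_sketch {
    const char *name;
};

static struct clk_ops_sketch generic_ops = { "generic" };

/*
 * Weak default implementation. If another object file provides a
 * non-weak init_clk_ops_sketch(), the linker uses that one instead
 * and does not report a multiple definition error.
 */
__attribute__((weak))
void init_clk_ops_sketch(struct clk_ops_sketch **ops, int idx)
{
    (void)idx;              /* the generic default ignores the index */
    *ops = &generic_ops;
}

/* Ask whichever implementation was linked in for its operations. */
const char *selected_clk_ops_name(int idx)
{
    struct clk_ops_sketch *ops = NULL;

    init_clk_ops_sketch(&ops, idx);
    return ops != NULL ? ops->name : "none";
}
```

Linked on its own, selected_clk_ops_name() reports the generic operations. Adding a second source file that defines init_clk_ops_sketch() without the weak attribute would make that version the one that gets called, which is the mechanism the SH clock code relies on.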

Whilst my version of GCC claimed to support weak symbols, there was a known GCC bug that prevented this from working correctly. I found that the code would only work correctly if the weak arch_init_clk_ops function had code in its body. What was happening was that the compiler, with the -O2 optimisation flag, was optimising out the function altogether, which resulted in the non-weak symbol not being called. (There is a quick hack to fix this, which is to use the -fno-unit-at-a-time flag; however, this is expected to be removed from GCC in the future.)

It’s always worth looking at the “Documentation/Changes” file included with the kernel; it contains a list of the tools required and the minimum version of each tool. Just because the kernel builds doesn’t mean that it has built in the way intended by the Linux contributors!

References:

GCC Function Attributes (gnu.org)
GCC Help Mailing List Archive – Discussing weak symbols and optimisation (gnu.org)
Further Discussion of this bug in KGDB – Here (osdir.com) and Here (lkml.org)