7

I have a third-party component that sends too many UDP messages to too many separate addresses in a certain situation. This burst happens when the software starts, and the situation is temporary. I'm not sure whether the cause is the sheer number of messages or the fact that each of them goes to a separate IP address.

Changing the underlying protocol or the problematic component is not an option, so I'm looking for a workaround. The stack trace looks like this:

java.io.IOException: No buffer space available
    at java.net.PlainDatagramSocketImpl.send(Native Method)
    at java.net.DatagramSocket.send(DatagramSocket.java:612)

This issue occurs (at least) with Java versions 1.6.0_13 and 1.6.0_10 and Linux versions Ubuntu 9.04 and RHEL 4.6.

Are there any Java system properties or Linux configuration tweaks which might help?

5 Answers

10

I've finally determined what the issue is. The Java IOException message, "No buffer space available", is misleading: the root issue is that the local ARP table has filled up. On Linux, the default ARP table limit is 1024 entries. The thresholds are set in the files /proc/sys/net/ipv4/neigh/default/gc_thresh1, /proc/sys/net/ipv4/neigh/default/gc_thresh2, and /proc/sys/net/ipv4/neigh/default/gc_thresh3, where gc_thresh3 is the hard maximum.

What was happening in my case (and, I assume, in yours) is that the Java code was sending UDP packets from an IP address in the same subnet as the destination addresses. In that case, the Linux machine performs an ARP lookup to translate each destination IP address into a hardware MAC address. Because you are blasting out packets to many different IPs, the local ARP table fills up quickly, hits 1024 entries, and that is when the Java exception is thrown.
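To make the same-subnet condition concrete, here is a minimal sketch (not part of my original diagnosis) that checks whether an IPv4 destination shares a prefix with the local address. The class name, the addresses, and the /24 prefix length are all hypothetical; use your interface's real values.

```java
final class OnLinkCheck {
    /** Parses a dotted-quad IPv4 string such as "192.168.1.10" into a 32-bit value. */
    static int toInt(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + dottedQuad);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** True if both addresses share their first prefixLength bits (same subnet). */
    static boolean sameSubnet(String a, String b, int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("Prefix length must be 0..32");
        }
        // A shift by 32 is a no-op in Java, so prefix length 0 is handled separately.
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
        return (toInt(a) & mask) == (toInt(b) & mask);
    }

    public static void main(String[] args) {
        String local = "192.168.1.10"; // hypothetical local address
        // Same /24: the kernel resolves this destination with ARP itself.
        System.out.println(sameSubnet(local, "192.168.1.77", 24));
        // Different /24: the packet goes to the router, so only the router needs an ARP entry.
        System.out.println(sameSubnet(local, "192.168.2.77", 24));
    }
}
```

Every destination for which this returns true costs one ARP table entry on the sending machine.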

The solution is simple: either increase the limit by editing the files mentioned above, or move your server into a different subnet from your destination addresses. The Linux box then no longer performs neighbor ARP lookups for each destination; the traffic is instead handled by a router on the network.
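Before changing anything, it may help to confirm that the table really is near its limit. The following is a rough sketch, assuming a Linux host: it reads gc_thresh3 from the directory mentioned above and counts the IPv4 entries listed in /proc/net/arp (a file not mentioned in the answer itself; it has one header line followed by one line per entry). The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class NeighborTableCheck {
    /** Reads one integer threshold file, e.g. gc_thresh3, from the given directory. */
    static int readThreshold(Path dir, String name) throws IOException {
        byte[] raw = Files.readAllBytes(dir.resolve(name));
        return Integer.parseInt(new String(raw, StandardCharsets.ISO_8859_1).trim());
    }

    /** Counts entries in an ARP listing: every non-blank line after the header. */
    static int countArpEntries(Path arpFile) throws IOException {
        int count = 0;
        boolean header = true;
        for (String line : Files.readAllLines(arpFile, StandardCharsets.ISO_8859_1)) {
            if (header) {
                header = false; // skip the column-title line
                continue;
            }
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path neigh = Paths.get("/proc/sys/net/ipv4/neigh/default");
        Path arp = Paths.get("/proc/net/arp");
        if (!Files.isDirectory(neigh) || !Files.isReadable(arp)) {
            System.out.println("Neighbor table files not found; is this Linux?");
            return;
        }
        int hardLimit = readThreshold(neigh, "gc_thresh3");
        int entries = countArpEntries(arp);
        System.out.println("ARP entries: " + entries + " of hard limit " + hardLimit);
    }
}
```

Running this during the startup burst should show the entry count climbing toward the limit if the ARP table is the cause.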

2 Comments

Not the first place one would look. How did you crack it?
@ThorbjørnRavnAndersen: I wouldn't know how to look for this now, but some 8 years ago /proc/slabinfo had a separate entry for "neigh" (i.e. ARP) entries. When you got ENOBUFS, you just looked at slabinfo to see which buffers were affected. Nowadays it's probably merged into some of the kmalloc-size entries.
3

When sending lots of messages, especially over gigabit ethernet in Linux, the stock parameters for your kernel are usually not optimal. You can increase the Linux kernel buffer size for networking through:

echo 1048576 > /proc/sys/net/core/wmem_max
echo 1048576 > /proc/sys/net/core/wmem_default
echo 1048576 > /proc/sys/net/core/rmem_max
echo 1048576 > /proc/sys/net/core/rmem_default

Run these as root.

Alternatively, use sysctl:

sysctl -w net.core.rmem_max=8388608

There are many more network options.

See "Linux Network Tuning" by IBM and "More tuning information".
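To see what a socket actually gets after these changes, a small probe like the one below may help. It is only a sketch: the class name and the 1 MiB request are arbitrary, and the kernel is free to adjust the value (on Linux a request is typically capped by wmem_max and the reported size may be larger than the requested one).

```java
import java.net.DatagramSocket;
import java.net.SocketException;

final class SendBufferProbe {
    /** Returns {default size, size after requesting requestedBytes} for a fresh UDP socket. */
    static int[] probe(int requestedBytes) throws SocketException {
        try (DatagramSocket socket = new DatagramSocket()) {
            int before = socket.getSendBufferSize();  // reflects the system default
            socket.setSendBufferSize(requestedBytes); // a hint; the OS may clamp it
            int after = socket.getSendBufferSize();
            return new int[] {before, after};
        }
    }

    public static void main(String[] args) throws SocketException {
        int[] sizes = probe(1048576); // arbitrary 1 MiB request
        System.out.println("Default SO_SNDBUF:   " + sizes[0]);
        System.out.println("After 1 MiB request: " + sizes[1]);
    }
}
```

If the reported sizes change with the sysctl values but the exception persists, the send buffer is probably not the limiting resource.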

7 Comments

Thanks. In addition to those parameters, I also tried tweaking net.ipv4.udp_mem and net.ipv4.udp_wmem_min. First I doubled the values, then doubled them again, and finally made them 10 times as big as the defaults. Nothing has helped so far, though.
@auramo, which JVM are you using: the Sun build or the OpenJDK build from your distro? I would recommend using the one from your distro, the open one if possible, as it will be less 'safe' and more accurate in interfacing with the kernel/libc.
I'm using the Sun builds of 1.6.0_13 and 1.6.0_10. I could easily try the OpenJDK versions, but changing from the Sun implementation to OpenJDK for the end product would be a major hassle at this point in the project.
If you switch to OpenJDK (which is based on the Sun source anyway) and the problem is solved, then you can ask around the Sun forums for differences that would cause this, and for help reconfiguring the Sun JRE release to work ;) It might not be the Linux kernel, but some component in between (for example, a C library that the Sun blob links to statically).
I have the same suspicion: Sun's JDK has some C code that makes library calls which override the sysctl values I tried to change. While googling, I bumped into several articles saying that you can override udp_mem, wmem_max, etc. with a C API call in your client code.
1

This might be a bit complicated, but as far as I know, Java uses the SPI [1] pattern for the network sub-library. This allows you to change the implementation used for various network operations. If you use OpenJDK, you could get some hints about how and what to wrap with your implementation. Then, in your implementation, you slow down the I/O, with some sleeps for example.

Or, just for fun, you could override the default DatagramSocket with your own modified implementation. Give it the same package name and, as far as I know, it will take precedence over the default JRE class. At least this method worked for me with a buggy third-party library.

Edit:

[1] Service Provider Interface is a method to separate client and service code within an API. This separation allows different client and different provider implementations. It can usually be recognized by a name ending in Impl: in your stack trace, java.net.PlainDatagramSocketImpl is the provider implementation, whereas DatagramSocket is the client-side API.

You commented that you don't want to slow down the communication the whole time. There are several hacks to avoid that: for example, measure the time in your code and slow the communication only within the first 1-2 minutes after your first incoming method call. After that you can skip the sleep.
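The time-limited slowdown could look roughly like the sketch below. It is an illustration under assumptions, not a drop-in fix: WarmupPacer is a made-up helper, the 2-minute window and 5 ms delay are arbitrary, and for simplicity the window starts when the pacer is constructed rather than at the first call. How you get the third-party code to send through it (wrapping, replacing the class) is a separate problem.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;

final class WarmupPacer {
    private final long warmupMillis; // how long after construction to keep pacing
    private final long delayMillis;  // pause inserted before each send while pacing
    private final long startNanos = System.nanoTime();

    WarmupPacer(long warmupMillis, long delayMillis) {
        this.warmupMillis = warmupMillis;
        this.delayMillis = delayMillis;
    }

    /** Delay to apply for a send that happens elapsedMillis after the start. */
    long delayFor(long elapsedMillis) {
        return elapsedMillis < warmupMillis ? delayMillis : 0L;
    }

    /** Sends the packet, sleeping first if we are still inside the warm-up window. */
    void send(DatagramSocket socket, DatagramPacket packet)
            throws IOException, InterruptedException {
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
        long delay = delayFor(elapsedMillis);
        if (delay > 0) {
            Thread.sleep(delay);
        }
        socket.send(packet);
    }

    public static void main(String[] args) {
        WarmupPacer pacer = new WarmupPacer(120_000L, 5L); // arbitrary example values
        System.out.println("Delay at 1 s:   " + pacer.delayFor(1_000L) + " ms");
        System.out.println("Delay at 3 min: " + pacer.delayFor(180_000L) + " ms");
    }
}
```

Once the warm-up window has passed, send() adds no sleep, so steady-state throughput is not affected.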

Another option would be to identify the misbehaving class in the library, decompile it (with JAD, for example), and fix it. Then replace the original class file in the library.

1 Comment

Can you tell me what an SPI pattern is? I want to overcome a 1-2 minute, boot-time burst. For that I definitely do not want to slow down my UDP I/O which needs to be speedy throughout the time the application is running (it's a server application).
0

I'm currently seeing this problem as well, with both Debian and RHEL. At this point I believe I've isolated it to the NIC and/or the NIC driver. What hardware configuration do you have that also exhibits this problem? It seems to occur only on the new Dell PowerEdge servers we recently acquired, which have Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet NICs.

I too can confirm that the trigger is the rapid generation of outbound UDP packets to many different IP addresses in a short window. I've attempted to write a simple Java application that can reproduce it (since ours is occurring with snmp4j).
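A reproducer along those lines might look like the sketch below. This is not the application I actually wrote, just a minimal illustration: it sends one tiny UDP packet to each of a run of consecutive IPv4 addresses. The class name, the command-line arguments, and the destination port 9 are all assumptions; only point it at a network you control.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

final class UdpBurstReproducer {
    /** Returns the dotted-quad address that is n addresses after base. */
    static String nthAddress(String base, int n) {
        long value = 0;
        for (String part : base.split("\\.")) {
            value = (value << 8) | Integer.parseInt(part);
        }
        value = (value + n) & 0xFFFFFFFFL; // stay within 32 bits
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: UdpBurstReproducer <base-ipv4> <count>");
            return;
        }
        String base = args[0];
        int count = Integer.parseInt(args[1]);
        byte[] payload = {0};
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int i = 0; i < count; i++) {
                InetAddress dest = InetAddress.getByName(nthAddress(base, i));
                try {
                    // Port 9 is an arbitrary choice for this sketch.
                    socket.send(new DatagramPacket(payload, payload.length, dest, 9));
                } catch (IOException e) {
                    System.out.println("Send " + i + " to " + dest + " failed: " + e.getMessage());
                    break;
                }
            }
        }
    }
}
```

If the failure shows up only when the count exceeds roughly the neighbor table limit and the addresses are on the local subnet, that points at the ARP table rather than the NIC.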

EDIT

Look at my answer here: Java IOException: No buffer space available while sending UDP packets on Linux

1 Comment

My problem occurred on many hardware configurations, on an HP workstation as well as a rack server. Eventually we ended up hacking the underlying component (a Java component from another team inside our company) that did the excessive network messaging and triggered the issue. Now that component makes far fewer UDP requests/responses, and the problem is solved for us.
0

I got this error when I tried to run a Coherence cluster in two local JVMs using a Wi-Fi connection to the database. If I run it over Ethernet, it works fine.
