Hey Lapo.
I have developed a custom extension for SmartFox 1.6.6. Everything seems to be running very smoothly, but for some reason, after an extended period of time, SmartFox seems to queue up messages and send them in a batch. For example, if I do 2 actions at very different times (say talk, wait a few seconds, then drop an item), the clients receive the responses after a while, but all at the same time. At first I thought it was because the server was under stress, but after taking the server out of circulation, waiting for the client count to drop, and running the YourKit Java profiler on it, nothing comes up and the behaviour is still present. I've added a custom command that writes to the log. When I talk, wait a second, send the custom command to log something, wait a few more seconds, then drop an item, I receive both the talk and the drop-item responses at the same time, but the custom command is clearly received and executed at almost the same time I sent it.
That leads me to believe that SmartFox is somehow accumulating messages to send to the client and then sending them in a batch after a while, even though reception and execution happen almost right away.
The behaviour does not happen at first, though. We need to get a good number of people on the server, but once it appears, responses still take a while to arrive even with a single client connected.
Is there anything I missed in the configuration that could be causing such a behaviour?
Here's a snippet of our configuration:
<ClientMessagQueue>
<QueueSize>50</QueueSize>
<MaxAllowedDroppedPackets>1</MaxAllowedDroppedPackets>
</ClientMessagQueue>
<OutQueueThreads>1</OutQueueThreads>
<ExtHandlerThreads>1</ExtHandlerThreads>
<MaxWriterQueue>100</MaxWriterQueue>
Accumulation of message to send
I recently learned about something called Nagle's algorithm. See http://forums.smartfoxserver.com/viewtopic.php?t=8990. Maybe that's the reason for the behaviour you're seeing?
What if you set <DebugOutGoingMessages> to true in your config.xml file? Do both messages get sent from the server at the same time, or are they spaced out accordingly?
The server doesn't do anything like this unless it is forced to by one of two things:
1. Lag or a bad network connection with the client. In this case messages are held in the queue until the network is available again.
2. Threading problems. Your extension code is taking a long time to execute a request, holding up the Extension thread. Everything slows down because operations are not executed in parallel.
If you can reproduce the issue on a local network, then I would suggest checking #2.
In your config ExtHandlerThreads is set to 1... change it to 4 and see if this gets any better.
Also check the server queues with the AdminTool. Network problems will make the Outgoing queue grow, while threading problems will make the ExtensionHandler queue grow.
Hope it helps
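To illustrate point 2, here is a minimal Java sketch (Java 8+ assumed) of how a single handler thread delays every request queued behind a slow one. `HandlerThreadDemo` and its method names are made up for illustration and are not SmartFoxServer APIs; a plain `ExecutorService` stands in for the extension handler threads.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: a fixed thread pool stands in for the handler threads.
final class HandlerThreadDemo {

    /**
     * Submits one slow request followed by one fast request and returns how many
     * milliseconds pass before the fast request has finished.
     */
    static long fastRequestLatencyMillis(int handlerThreads, long slowMillis) {
        ExecutorService handlers = Executors.newFixedThreadPool(handlerThreads);
        try {
            long start = System.nanoTime();
            Runnable slowRequest = () -> sleepQuietly(slowMillis);
            Runnable fastRequest = () -> { /* nothing to do: returns immediately */ };
            handlers.submit(slowRequest);
            Future<?> fast = handlers.submit(fastRequest);
            fast.get(); // wait until the fast request has actually run
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            handlers.shutdownNow(); // interrupts the slow request if it is still sleeping
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + fastRequestLatencyMillis(1, 500) + " ms");
        System.out.println("4 threads: " + fastRequestLatencyMillis(4, 500) + " ms");
    }
}
```

With one thread the fast request has to wait for the slow one to finish; with four it runs right away on a free thread. This only models the queuing effect, not the server's real scheduling.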
I do have a couple of service calls that are taking longer to run, but those are executed in a completely different thread than the Extension. As soon as I receive any events from AbstractExtension, I create a runnable task and dispatch it in an ExecutorService.
A network bottleneck was my first guess, but there's plenty of bandwidth available to the server. We first noticed the issue on our live environment, so to eliminate the client's bandwidth as a variable, we devised a stress test and put it on the same network as a test server. After running it over the last weekend, we observed the same result, but this time we were sure there was only one client (myself). We also had the profiler running (YourKit Java Profiler), and it came up with nothing useful.
I intend to let the stress test run this weekend with user.getChannel().socket().setTcpNoDelay(true) applied. Hopefully the problem won't happen again.
Edit: The stress test covers 2 zones: one for custom login, and one for the actual game. Both are hosted on the same SmartFox server instance. A single room is used by the stress test bot to run around and chat.
I left the stress test running for 2 more days to make sure it wasn't a coincidence, and calling socketChannel.socket().setTcpNoDelay(true) seems to do the trick in our local environment. I'll make it part of our next build on the live environment and post back the results to let you know if it worked.
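For anyone who wants to try the same thing, here is a minimal, self-contained Java sketch (Java 7+ assumed) of disabling Nagle's algorithm on a `SocketChannel`. The `NoDelayDemo` class and the loopback connection are only there for illustration; in an extension you would make the same `setTcpNoDelay(true)` call on the channel you get from `user.getChannel()`, as mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical demo class, not part of SmartFoxServer.
final class NoDelayDemo {

    /** Disables Nagle's algorithm on the channel's socket and returns the resulting flag. */
    static boolean disableNagle(SocketChannel channel) throws IOException {
        channel.socket().setTcpNoDelay(true);   // small writes are sent without waiting to coalesce
        return channel.socket().getTcpNoDelay();
    }

    /** Opens a loopback connection and disables Nagle on the server-side (accepted) channel. */
    static boolean demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Port 0 lets the OS pick any free port on the loopback interface.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                return disableNagle(accepted);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TCP_NODELAY enabled: " + demo());
    }
}
```

Keep in mind that turning Nagle off trades a few more small packets on the wire for lower latency, which is usually the right trade for chat and game messages.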