Unexplainable Performance Drop

Posted: 15 Apr 2011, 19:10
by Hygiliak
Here is a full description of our situation:

Server version: 1.6.9
Server hardware: 8 CPU cores, 8GB RAM
Operating system: Linux

Relevant configuration:

<OutQueueThreads>4</OutQueueThreads>
<ExtHandlerThreads>1</ExtHandlerThreads>
<MaxWriterQueue>50</MaxWriterQueue>
<ClientMessagQueue>
<QueueSize>100</QueueSize>
<MaxAllowedDroppedPackets>1</MaxAllowedDroppedPackets>
</ClientMessagQueue>
<MaxIncomingQueue>8000</MaxIncomingQueue>
<DeadChannelsPolicy>strict</DeadChannelsPolicy>
<MaxMsgLen>4096</MaxMsgLen>

Problem:

The game works very well, but once we reach about 400 concurrent users there are moments when players experience severe lag (5-15 seconds), despite good ping values (50-100 ms). This behaviour seems highly erratic, with no apparent direct cause.
We use RAW strings for 95% of our traffic, so our total traffic rarely exceeds 700 kbit/s. The packet rate is about 400 packets per second (200 inbound, 200 outbound).
This is a turn-based game, so not much traffic is to be expected anyway.
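
As a quick sanity check on those figures, the average packet size they imply can be worked out as below. This is only a back-of-the-envelope sketch: it assumes 1 kbit = 1000 bits and treats the 700 kbit/s peak and the 400 packets per second as if they occurred at the same moment.

```python
# Figures taken from the post above.
bandwidth_kbit_per_s = 700   # total traffic, upper bound
packets_per_s = 400          # 200 inbound + 200 outbound

# Assumption: 1 kbit = 1000 bits.
bits_per_packet = bandwidth_kbit_per_s * 1000 / packets_per_s
bytes_per_packet = bits_per_packet / 8

print(f"about {bytes_per_packet:.0f} bytes per packet on average")
```

At roughly 219 bytes per packet, raw bandwidth is unlikely to be the limiting factor on its own.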

CPU usage rarely goes above 10% on the core that handles MySQL, while the other cores are mostly idle, at below 3% each.

I'm positive that there has to be a design flaw in my code, but I can't seem to find it. Any info on this?

Posted: 15 Apr 2011, 20:24
by BigFIsh
This is the kind of problem that will be difficult for us to pinpoint, as it is seemingly random. You will have a better chance of solving it than we will.

But let's try and tackle this problem with questions:
1. How often does the lag occur? Is there any pattern? Does the lag occur at a certain time or during a certain game phase?
2. Did every player experience the lag? If so, it may be server related (possibly bandwidth).
3. Have you monitored your server bandwidth? It is possible that someone (possibly a hacker) is flooding your server (not necessarily through the socket port) within a short period of time.
4. If only some players experienced the lag, it may depend on those players, not the server.
5. How often do you send a ping request for each player? How is the ping value calculated and averaged? If you average the ping over a long period of time, it may not pick up the occasional lag spike.
6. Lastly, while not likely, did your log file show anything suspicious?
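
To illustrate the point about averaged ping values, here is a minimal sketch showing how one long stall can disappear into a long-run average. The sample data is made up for the example, and `summarize_pings` is a hypothetical helper, not part of any server API.

```python
import statistics

def summarize_pings(samples_ms):
    """Return the mean, median and worst case of round-trip times in ms."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
    }

# Hypothetical data: 999 normal samples at 60 ms plus one 10-second stall.
samples = [60] * 999 + [10_000]
summary = summarize_pings(samples)

# The mean stays inside a "healthy" 50-100 ms range; only max reveals the stall.
print(summary)
```

Tracking the maximum (or a high percentile) over short windows alongside the average makes occasional stalls visible.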

Posted: 16 Apr 2011, 07:03
by Hygiliak
Thank you for the quick reply.

I have run a test by pinging the server non-stop while playing the game, waiting in the lobby, etc.

It seems that the performance drop spikes (200 ms ping and timeouts) happen when users enter/exit the game in rapid succession.
When this happens we send a small user count packet in raw string format to every user. During this test there were 150 connected users.

Could the number of packets to be sent be a problem? (Their size is tiny, so bandwidth is not an issue here.)
I will cancel the instant user count updates and replace them with an update that triggers much less frequently.
Should I expect lag when sending a packet to all users, even only once every 5 minutes?
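
One way to make the user count update trigger less often is to coalesce changes and send only the latest value at most once per interval. The sketch below shows the idea in isolation; it is not SmartFoxServer code. `ThrottledBroadcast`, its `send` callback and the injectable `clock` are all hypothetical names chosen for illustration, and `flush()` is assumed to be called from some periodic timer.

```python
import time

class ThrottledBroadcast:
    """Coalesce frequent updates so at most one is sent per interval."""

    def __init__(self, send, min_interval_s, clock=time.monotonic):
        self._send = send                    # hypothetical callback doing the real broadcast
        self._min_interval_s = min_interval_s
        self._clock = clock                  # injectable so the class can be tested
        self._last_sent = None               # time of the last broadcast, None if never sent
        self._pending = None                 # most recent value not yet sent
        self._has_pending = False

    def update(self, value):
        """Record the latest value and send it if the interval has elapsed."""
        self._pending = value
        self._has_pending = True
        return self.flush()

    def flush(self):
        """Send the pending value, if any, unless one was sent too recently."""
        if not self._has_pending:
            return False
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._min_interval_s:
            return False                     # too soon: keep the value for a later flush
        self._send(self._pending)
        self._last_sent = now
        self._has_pending = False
        return True
```

With this shape, a burst of joins and leaves produces one broadcast carrying the final count instead of one broadcast per event.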

Edit: what is strange is that before reducing packet sizes we had 300 concurrent users without any lag, despite frequent user count updates and large packets (about 5 Mbit/s of traffic). So if this lag is indeed caused by user count updates, it's probably just a side effect and not the root cause.

Also, no signs of flooding attempts.
We write an average of 50 traces per second to the log file. Could that account for any lag?

Edit: as it turns out, we did not send the user count update to everyone, but just to those who were in the lobby (only a few users), so the problem must be elsewhere. The only time we send a packet to everyone is when the data of the top 4 players changes, but this happens infrequently.