Page 1 of 1

Buddy List Error - H2 Database - Live Server has Crashed 2x

Posted: 30 Apr 2009, 00:21
by jjduran
This is the second time this issue has occured, it's occurred once on SFS 1.6.6, and one time on the previous edition SFS 1.6.5.

Our live server has completely crashed for the second time. By crash I mean that the server becomes completely unavailable. I can not ping it, it no longer serves pages, the forums become inaccessible, and SFS entirely shuts down. The server requires a manual power cycle before it will work again.

Here is the error we're seeing in the logs (actually, there are several hundred of these duplicate errors in the logs):

Code: Select all

2009/04/29 19:29:22.717 - [ SEVERE ] [id: 10] (DbManager.executeQuery): DbManager could not retrive a connection. org.h2.jdbc.JdbcSQLException: IO Exception: java.io.FileNotFoundException: /home/game/SFS_PRO_1.6.6/Server/datastore/__sfscore__.lock.db (Permission denied); /home/game/SFS_PRO_1.6.6/Server/datastore/__sfscore__.lock.db [90031-110]
	at org.h2.message.Message.getSQLException(Message.java:107)
	at org.h2.message.Message.convertIOException(Message.java:293)
	at org.h2.store.fs.FileSystemDisk.openFileOutputStream(FileSystemDisk.java:348)
	at org.h2.store.FileLock.save(FileLock.java:206)
	at org.h2.store.FileLock.lockFile(FileLock.java:318)
	at org.h2.store.FileLock.lock(FileLock.java:130)
	at org.h2.engine.Database.open(Database.java:528)
	at org.h2.engine.Database.openDatabase(Database.java:223)
	at org.h2.engine.Database.<init>(Database.java:218)
	at org.h2.engine.Engine.openSession(Engine.java:57)
	at org.h2.engine.Engine.openSession(Engine.java:139)
	at org.h2.engine.Engine.getSession(Engine.java:119)
	at org.h2.server.TcpServerThread.run(TcpServerThread.java:120)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.FileNotFoundException: /home/game/SFS_PRO_1.6.6/Server/datastore/__sfscore__.lock.db (Permission denied)
	at java.io.FileOutputStream.open(Native Method)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
	at org.h2.store.fs.FileSystemDisk.openFileOutputStream(FileSystemDisk.java:340)
	... 11 more


2009/04/29 19:29:22.718 - [ WARNING ] [id: 10] (AdvancedBuddyListPersister.checkIfTableExists): Could not connect to H2 Engine, buddy list persister may not work!
When we restart the server and SFS, the errors continue to fire and all of our players' buddy lists are empty. To make things worse, if a player adds a buddy to their list, the server won't save the data when the player logs in again later.

To temporarily correct the problem, I have to stop the server, delete all of the __sfscore__ files and allow SFS to recreate them on restart. Once the __sfscore__ files are recreated, the server runs smoothly. But, after a few days, something repeatedly triggers the same error, and the server crashes hard shortly afterward.

The problem appears to be with permissions regarding the __sfscore__ storage files that pertain to the buddy list and the H2 Database engine. Why are the files suddenly becoming unavailable?

Here is our current buddy list configuration:

Code: Select all

<BuddyList active="true">
  <mode>advanced</mode>
  <size>30</size>
  <maxBuddyVars>10</maxBuddyVars>
  <addBuddyPermission>true</addBuddyPermission>
  <mutualAddBuddy>true</mutualAddBuddy>
  <mutualRemoveBuddy>true</mutualRemoveBuddy>
  <offLineBuddyVariables>true</offLineBuddyVariables>
  <permissionTimeOut>30</permissionTimeOut>
  <ownerListFullErrorMsg>Your buddy list is already full!</ownerListFullErrorMsg>
   <targetListFullErrorMsg>The selected character's buddy list is already full!</targetListFullErrorMsg>
</BuddyList>
(We're not using a custom buddy list persisterClass as you can see in our config).


Update 1:
I wonder if the problems mentioned in these posts are related?

This one mentions the buddy list clearing...
http://forums.smartfoxserver.com/viewto ... base+buddy

This one mentions the buddy list "Going Boom" regarding string variables of length 0 ...
http://forums.smartfoxserver.com/viewto ... base+buddy



Update 2:
I've reviewed this post, but it is NOT the issue. We do NOT have the h2db zone defined in our config file:
http://forums.smartfoxserver.com/viewtopic.php?t=2880

Also, I did try to access the H2 database interface on our remote dedicated server to troubleshoot futher, but the chapter seems to mention that you can't access the interface from a remote machine:
http://www.smartfoxserver.com/docs/inde ... ddedDb.htm


Any ideas? This is a critical error as it has brought down our live server twice now.

Thank you for your support.

Posted: 30 Apr 2009, 11:32
by Lapo
DbManager could not retrive a connection. org.h2.jdbc.JdbcSQLException: IO Exception: java.io.FileNotFoundException: /home/game/SFS_PRO_1.6.6/Server/datastore/__sfscore__.lock.db (Permission denied); /home/game/SFS_PRO_1.6.6/Server/datastore/__sfscore__.lock.db
It looks like you are having a file permission problem.
It wouldn't be strange if the error is always fired, while it's pretty strange that it happens after sometime.
Why are the files suddenly becoming unavailable?
No idea, you should check your permissions, and I would also suggest to run an integrity check on your disk.

Also is there anything else on your server that might alter the file permissions?
To temporarily correct the problem, I have to stop the server, delete all of the __sfscore__ files and allow SFS to recreate them on restart. Once the __sfscore__ files are recreated, the server runs smoothly
Take a look at the permissions when you restart the server and go back to those files as soon as those errors appear again. I'd expect that permissions are changed or files have disappeared... :?

Posted: 30 Apr 2009, 16:55
by jjduran
I have verified that the __sfscore__ files do still exist when the problem occurs.

I agree that this is likely a permissions issue, but I don't think it could be a disk integrity issue because we also run a VPS server for our testing environment. Sure enough, the error has also appeared on the completely separate VPS machine.

Although the same error appears in the logs of VPS machine, the entire VPS server did not crash like it did in the live environment. I'm guessing because the errors fire far less frequently since there are fewer users online in our testing environment.

I will pay attention to the permissions of the __sfscore__ files, but why would that happen on both servers? If I remember correctly, the file ownership of the __sfscore__ was 'root' after the files were no longer accessible.

After I deleted the files and SFS automatically recreated them, the file ownership returned to the linux user account name. Unfortunately, there is no way to be sure of that ownership change until the server starts throwing the errors again. But if it does change, I can't think of a reason why it would occur on both servers? It seems more like an issue where SFS locks the file for read/write, but never unlocks it.

Maybe some intense testing of the buddy lists when they are full or near full would uncover the issue? I'm also wondering about the comment posted by the user 'Mistermind' regarding buddy variables set to string variables of length 0.
http://forums.smartfoxserver.com/viewto ... base+buddy

I couldn't locate any places in our code where we do this, but if there is an error in our logic where a variable accidentally gets set to "", maybe that could trigger the buddy list to "GO BOOM" as Mistermind mentioned?

Further ideas?

Thanks!

Posted: 03 May 2009, 02:46
by jjduran
Problem Update:

The server has crashed again and just like before, the logs contain the same "Permission Denied" error.

Fortunately there is some new information this time. After checking the ownership of the __sfsfile__ files, it turns out that the __sfscore__ files do in fact become owned by 'root' after the server crashes fires. Prior to the error the __sfscore__ files were owned by the user account affiliated with where our SFS install resides

Here is the good news: I was able to stop the buddy list errors by using the chown command to change the ownership of the __sfscore__ files back to the correct user name affiliated with our SFS install.

The only question that remains is why the buddy list __sfscore__ file permissions change? Maybe the server is crashing for completely separate reason, and the buddy list problem is just a consequence of a larger issue?

We'll be posting an update to the game tonight to try a few new things, but if you have any ideas about the permissions issue please let me know.

Thank!

Posted: 03 May 2009, 06:36
by Lapo
The only question that remains is why the buddy list __sfscore__ file permissions change?
At the moment I have no ideas.
It is very unlikely that SFS (via the H2 db engine) can change those permissions. Unless you are starting SmartFoxServer as the root user I would say that this is impossible. In fact there's no way that a non-root user can elevate the permissions of a file or directory to the root user.

On the other hand if you are running SFS as root you shouldn't have problems with the database access because you would have the highest permissions.
Maybe the server is crashing for completely separate reason, and the buddy list problem is just a consequence of a larger issue?
Everything is possible, but it doesn't explain how those files permissions are modified.
Also we should define what "crash" means for you?
Does the JVM crash?
Does the server become unresponsive?
What's the state of the java process? CPU? RAM?
Can you login as Admin? If so could you send us a screenshot of the server status?

Posted: 04 May 2009, 03:38
by jjduran
Unless you are starting SmartFoxServer as the root user I would say that this is impossible. In fact there's no way that a non-root user can elevate the permissions of a file or directory to the root user.
We have in fact been starting SFS as the root user. Is this a bad practice? I just assumed that being root would allow SFS to access everything it needed.
Does the JVM crash?
I'm not sure because the server becomes unresponsive, but the logs do not include anything that would suggest the JVM has crashed. I'm by no means a java expert, but after some research, I found a few forums suggesting to look at a file named 'hs_err_pid' that supposedly contains java error messages. Unfortunately, I was unable to locate that file on our server. Is there a better way to determine if the JVM has crashed, or a way to enable the JVM error logging?
Does the server become unresponsive?

By crash I mean that the server becomes completely unavailable. I can not ping it, it no longer serves pages, the forums become inaccessible, the database is not accessible, and SFS entirely shuts down. The server requires a manual power cycle before it will work again.
What's the state of the java process? CPU? RAM?
I haven't yet been able to actively monitor this information right before a crash, but if it would help, I can send you our process list using the "ps -aux" command. Additionally, it looks like I can use "top", "vmstat", and the "free" commands to monitor the most intensive processes and determine available memory. Would it help if I emailed you that information?
Can you login as Admin? If so could you send us a screenshot of the server status?
When the server crashes we are unable to access the admin tool. I can however confirm that the problem it is not a thread issue. We always maintain a low number of active threads, generally under 30 at all times. I did however notice that the "Extension Handler Queue Size" spiked only moments before the crash occurred... that number generally remains at zero or a very small number on our server when things are working properly.

Posted: 04 May 2009, 05:26
by Lapo
We have in fact been starting SFS as the root user. Is this a bad practice? I just assumed that being root would allow SFS to access everything it needed.
In general it's better not to start it as root, unless you need to run the server on a port <= 1024.

Code: Select all

Is there a better way to determine if the JVM has crashed, or a way to enable the JVM error logging?
The JVM does not crash otherwise you would know it, because the java process would die.
By crash I mean that the server becomes completely unavailable. I can not ping it, it no longer serves pages, the forums become inaccessible, the database is not accessible, and SFS entirely shuts down.
Wait a moment... so it's not just SFS not responding but also all the other services? :shock:

Posted: 04 May 2009, 05:53
by jjduran
Correct, the entire server hangs as I mentioned in the very first post of this thread:
Our live server has completely crashed for the second time. By crash I mean that the server becomes completely unavailable. I can not ping it, it no longer serves pages, the forums become inaccessible, and SFS entirely shuts down. The server requires a manual power cycle before it will work again.
To get the server back online, I've actually had to call our web host and request that they power cycle the machine because it doesn't respond via SSH, a web browser, our FTP client, or anything.


Also, in a PM I just sent you, I mentioned that I did in fact find an error in the logs indicating that the JVM did crash. I apologize that I overlooked this the first time I reviewed the logs :oops:

Code: Select all

ERROR  | wrapper  | 2009/05/02 22:49:15 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR  | wrapper  | 2009/05/02 22:49:19 | JVM did not exit on request, terminated
STATUS | wrapper  | 2009/05/02 22:49:22 | JVM received a signal SIGKILL (9).
STATUS | wrapper  | 2009/05/02 22:49:24 | Launching a JVM...
ERROR  | wrapper  | 2009/05/02 22:49:53 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR  | wrapper  | 2009/05/02 22:49:54 | JVM did not exit on request, terminated
STATUS | wrapper  | 2009/05/02 22:49:55 | JVM received a signal SIGKILL (9).
STATUS | wrapper  | 2009/05/02 22:49:59 | Launching a JVM...
ERROR  | wrapper  | 2009/05/02 22:50:28 | Startup failed: Timed out waiting for a signal from the JVM.
ERROR  | wrapper  | 2009/05/02 22:50:28 | JVM did not exit on request, terminated
STATUS | wrapper  | 2009/05/02 22:50:31 | JVM received a signal SIGKILL (9).
STATUS | wrapper  | 2009/05/02 22:50:33 | Launching a JVM...
STATUS | wrapper  | 2009/05/02 22:50:53 | TERM trapped.  Shutting down.
ERROR  | wrapper  | 2009/05/02 22:51:28 | Shutdown failed: Timed out waiting for signal from JVM.
ERROR  | wrapper  | 2009/05/02 22:51:32 | JVM did not exit on request, terminated 

As always, thanks for your support.

Posted: 04 May 2009, 07:10
by Lapo
Correct, the entire server hangs as I mentioned in the very first post of this thread:
Sorry, with "server" I thought you were meaning SmartFoxServer.

I think it is very likely you have some hosting related problems, which is probably related to your server machine configuration.
The fact that you can't ping the machine and everything halts is a sign of a broader problem.
I would definitely suggest to submit the problem to your hosting company, reporting all the details you have.
ERROR | wrapper | 2009/05/02 22:49:19 | JVM did not exit on request, terminated
STATUS | wrapper | 2009/05/02 22:49:22 | JVM received a signal SIGKILL (9).
Yes, this one is not exactly an error.
It means that another process (which I think is running with high privileges) has sent a SIGKILL to the SFS process, in other words this is an order to shut down immediately.

Finally, as regards memory settings, the 66MB of Heap Memory is perfectly normal. That size is the default for the JVM.
For more infos check our documentation:
http://www.smartfoxserver.com/docs/docP ... vanced.htm
(bottom of the document)

Posted: 13 May 2009, 02:38
by jjduran
Just to provide some closure to this topic, we've discovered several problems after purchasing a few books and doing some serious reading ...

Many of the symptoms I had identified above were only consequences of the real problem > poor optimization across several server-side systems.

The buddy list errors we reported were only a result of our server sending SIGKILL messages to several active processes in order to recover memory. The H2 database engine then crashed, leaving the buddy list data tables improperly closed.

The java errors we reported were again a consequence of our server running low on resources, and sending a SIGKILL message to the java runtime.

The problems were mainly generated because of some poorly optimized SQL statements and the default Apache and MySQL configurations. After monitoring the "top" tool, we discovered that our CPU usage was spiking dramatically and our ram was nearly consumed.

Inevitably, our system would crash when the load reached a certain level of traffic ... or if enough database transactions were processed simultaneously.

If anyone else suddenly encounters server crashes but you haven't made any major changes to your system, I would investigate the performance and optimization of your machine before wasting too much time trying to troubleshoot each individual error in the SmartFox logs.

A good place to start would be using the "top" and "vmstat" tools to monitor resources. Additional, review the status and logs of your database. If you're working with MySQL, you'll want to tweak your "my.cnf" config file to optimize your database transactions and enable error logging.

Additionally, if your server is running Apache, you'll want to customize your "http.conf" file to optimize the number of processes Apache uses.

After about a week of adjustments, we're seeing some amazing performance improvements. I will update if we experience any further problems.

Time to make some real progress :D

Posted: 13 May 2009, 04:33
by Lapo
Great to heat that you have found the culprit. :)

Good luck