Crash Offsets

Cannon

I have just realised that I might have the same problem on a test server. Every 20 minutes or so the server freezes and is restarted. Sometimes an OS reboot seems to fix the problem for a random period of time - sometimes days.

When the server freezes, the flhook console is locked up but the FLServer uptime counter continues to count up.

The server is running in a VM and I thought it might have been related to the VM software. The OS is XP 32bit SP3, a clean install.

When did you install SP3 - a year ago?

The mod definitely works okay on my XP machine. I’m sure it is not the mod. It could be flhook stuff but I am/was running production plugins.

I threw WinDbg at it and I think it was freezing when it was trying to execute ring 0 (kernel-level?) code, but I really have no idea what I’m talking about.

If anybody wants to debug it for me, remote access is available…

Also, (I’ll start another thread on this) I’m wondering if it might be server list related. A couple of times I’ve noticed that FLServer has lost connection to the list server and reacquired it. On at least one occasion, the crash happened at that time.

I wondered the same thing but I removed all list server stuff and it didn’t seem to help.

robocop

Well…, welcome to the club. I was starting to think everyone was looking at me sideways on this issue…

This has been going on for over a year now. The current install is only a week old. WinXPsp3 with no other updates.

Perhaps I’ll try running WinXPsp2 to see if the problem goes away but I’m pretty sure the problem existed well after upgrading to service pack 3.

Watch your event viewer. Application section should show any FLServer crashes. In there you should see which module is affecting flserver.

AlphaWolf

I’m not sure if this will help, but picking up on adoxa’s comment: XP SP3 introduced a memory leak problem, as described here: http://support.microsoft.com/kb/959658.

Perhaps this is what is happening?

-Alpha

? Offline

Cannon wrote:
When the server freezes, the flhook console is locked up but the FLServer uptime counter continues to count up.

The server is running in a VM and I thought it might have been related to the VM software.

1. It was in VirtualBox (not the Oracle version yet).

2. On VMware ESXi I never had this problem.

3. Simple VMware Server 2.0 - I have not checked it; ESXi is better.

Huor

HeIIoween wrote:
What’s wrong with engbase.dll? I do not remember, please remind me.

Um, yeah, there is a crash of FLServer in this module. At least that’s what is logged. Therefore I was searching for any entry that could give me some information on that offset, but I didn’t find any in a DLL viewer. What adoxa explained pretty much explains it.

Nevertheless, I got around this problem: I attached WinDbg to the running exe and so could find the entry point of engbase.dll. So far so good. But now I don’t know whether the offset from the crash log is relative to the module entry point or to the base of the exe…

Finally, if the engbase functions are accessed virtually, it may be that the Dacom routines are causing the crash?! Once I have found the address for the breakpoint, I could try to work out from the call stack what might cause the crash. That’s why I was asking about that offset stuff ;D

If anyone has an idea, I would really appreciate it and am grateful in advance.

? Offline

Try perdr, and look at its readme.txt.

adoxa

The crash report is for the module, so it’ll be relative to engbase.dll. If you look at the details, it should give you the actual address. Depending on how the crash occurs, WinDbg should actually break there (at least, OllyDbg usually does).
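To make the arithmetic concrete: an offset that is relative to a module becomes an absolute address once you add it to the address where that module is loaded. Below is a minimal Python sketch of that, not a tool from this thread. The load address is a made-up value (the real one comes from the debugger's module list, for example WinDbg's `lm` command), and the offset used is the dalib.dll one quoted elsewhere in this thread.

```python
def absolute_address(module_base: int, module_offset: int) -> int:
    """Return the virtual address for an offset relative to a module's load address."""
    return module_base + module_offset


# Hypothetical load address; read the real one from the debugger's module list.
MODULE_BASE = 0x06560000

# Offset as printed in a crash report: hex digits, relative to the module.
CRASH_OFFSET = int("00004353", 16)

# Address at which a breakpoint could be set while attached.
print(hex(absolute_address(MODULE_BASE, CRASH_OFFSET)))
```

The load address can change between runs if the DLL gets relocated, so it is worth re-reading it each time the debugger is attached.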

Huor

Yes, it would, if it crashed while attached.

robocop

Well, the hotfix didn’t work and neither did a reinstall with WinXPsp2 only.
Still crashing in dalib.dll at 00004353

I’m willing to kill for a solution…

Also, during the freeze/crashes, procmon shows buffer overflows in csrss.exe and lsass.exe as well as a bunch of dxdiag stuff flooding the monitor.

That help at all?

Cannon

I might have a fix for this problem: “FLServer locks up every 10-20 minutes, most commonly after about 18 minutes. It tends to lock up more often if flhook event socket stuff is happening. I have both DSPM and DAM connected to flhook doing unicode event socket traffic. DirectPlay traffic still works but no game events are passed.”

The lock up is caused by the thread CRemotePhysicsSimulation+0x2660 ceasing to do stuff. I think this is the main FLServer loop. FLhook event processing is called from it and so if this thread stops, the flhook socket stuff will stop too.

Reviewing one of the stack traces, I suspect that the problem might be related to a QueryPerformanceCounter call. On further investigation, it might be possible for this to return an invalid time, potentially causing FLServer to wait a very long time before doing any processing. I haven’t actually proven that this is happening, but it seems possible, particularly on multi-core processors and/or when running in virtual machines.
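To show why a bad timer reading could look like a freeze, here is a minimal Python sketch. It is not FLServer's code, only an illustration under assumptions: a loop that works out how long to wait from a high-resolution counter will compute a huge wait if the counter appears to jump backwards. The function names, the tick frequency and the 100 ms cap are all hypothetical.

```python
TICKS_PER_SECOND = 1_000_000  # hypothetical counter frequency


def naive_wait_seconds(now_ticks, deadline_ticks):
    """Seconds to sleep until the deadline, trusting the counter completely."""
    return max(0, deadline_ticks - now_ticks) / TICKS_PER_SECOND


def guarded_wait_seconds(now_ticks, deadline_ticks, max_wait=0.1):
    """Same calculation, but never wait longer than max_wait seconds."""
    return min(naive_wait_seconds(now_ticks, deadline_ticks), max_wait)


last_reading = 5_000_000_000
deadline = last_reading + 10_000         # next update due 10 ms after the last reading
bad_reading = last_reading - 60_000_000  # counter appears to jump back by 60 s

print(naive_wait_seconds(bad_reading, deadline))    # about a minute of doing nothing
print(guarded_wait_seconds(bad_reading, deadline))  # capped at max_wait
```

Capping the wait is only one possible defence; an application could also compare each reading with the previous one and discard any that go backwards.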

Adding /usepmtimer to your boot.ini as described in this article might work around this problem. See http://support.microsoft.com/kb/895980 for instructions.

I’ve applied this change and so far, I’m on 1 hour 30 minutes of no crashes. I’ll edit this post if it really does fix it or not.
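For anyone unsure where the switch goes: it is appended to the operating system entry in boot.ini. Here is a minimal Python sketch of that text edit. The sample contents are an assumed, typical single-boot XP boot.ini, not the file from this server, and `add_boot_switch` is a hypothetical helper. It only transforms a string and never touches the real file.

```python
# Assumed example of a typical single-boot XP boot.ini (not from this thread).
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /noexecute=optin /fastdetect
"""


def add_boot_switch(boot_ini_text, switch="/usepmtimer"):
    """Append a switch to every entry under [operating systems], once."""
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only OS entries take switches.
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped:
            # Skip entries that already carry the switch.
            if switch.lower() not in stripped.lower().split():
                line = line.rstrip() + " " + switch
        out.append(line)
    return "\n".join(out)


print(add_boot_switch(SAMPLE_BOOT_INI))
```

The real boot.ini is a hidden, read-only system file, so making the change by hand as the article describes (and keeping a backup copy) is probably the safer route; the sketch is only meant to show the resulting line.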

Cannon

robocop wrote:
Still crashing in dalib.dll at 00004353

According to the code, this crash apparently happens when the server starts hosting. It is in the function call CDPServer::GetHostAddresses()

I might recall having this crash on GC ages ago. I believe it was related to a corrupt character file, causing a stack corruption which somehow ended up in this function. I’m sorry, I’m not more certain than this.

robocop

I’m not sure your first suggestion about the boot.ini applies to us as we don’t use FLHook.

The second interests me, but I know each player character has been ‘cleaned’ by FL PlayerCleaner on more than one occasion, and FLAC is not identifying any attempts to log in with corrupt player chars.

So, short of doing a player wipe, how do you find the corrupt player char(s) if that’s indeed the problem?

FriendlyFire

Robo, don’t think FLAC is anything but FLHook with modifications. It most likely has the same bugs that Hook does.

AlphaWolf

As clarified previously, FLAC didn’t form from FLHook. They both use a similar approach because that’s the approach that’s needed but the sources are significantly different enough for them to be completely separate of each other.

The crash offset being raised does not seem to be triggered by FLAC, and likely wouldn’t be by FLHook either, as the errors appear to be triggered by FL itself. Whether there’s an issue with the hooking functionality further back along the line that’s causing the circumstance, I can’t tell so far, and there’s little to suggest that is the case.

Unfortunately I’ve virtually exhausted options from my own investigations so far (prior to the post) but will continue to dig and will post if there’s anything new I figure out.

-Alpha

robocop

That MS boot.ini workaround refers specifically to servers using AMD procs. We have an Intel proc. Would running that workaround cause any issues or help at all?

Going through player files as we speak. Anyone know of any tools that can identify a corrupted player file? I’m using FL PlayerCleaner to clean everything up, but I’ve been using it for months and still have probs. Willing to try another tool.

FriendlyFire

DSAM does a good job, but I don’t know how dependent it is on FLHook to be able to do that.

HeIIoween

Not only for AMD.

Robocop, try /ONECPU.

In Control Panel, under the power settings, set all power settings to “on”.

In the BIOS, disable power management support for the CPU.

This is called the Time Drift Bug.

On Linux it is known as the PM-Timer Bug.

Cannon

The boot.ini /usepmtimer is a fix for a FLServer issue rather than FLHook or FLAC. I should have mentioned, we have an Intel CPU.

Also the test server has been running for 12 hours now. Yay!

robocop

HeIIoween wrote:
Not only for AMD.

Robocop, try /ONECPU.

In Control Panel, under the power settings, set all power settings to “on”.

In the BIOS, disable power management support for the CPU.

This is called the Time Drift Bug.

On Linux it is known as the PM-Timer Bug.

I’ve implemented the boot.ini workaround, we’ll see what that does for us.

@Helloween, for power options I have set ‘Always On’. Is that what you mean?

What is /ONECPU?

Next reboot I’ll disable BIOS power management support for CPU.

R

robocop

Well, daily misery report again…
Implemented that /usepmtimer suggestion in the boot.ini file, no joy. 18 minutes after reboot we got the triple .\HookFunction.cpp(887): *** ERROR: Exception in Hook_IServerImpl_TradeResponse (unhandled exception) message and everyone online got booted.

So, the fact that it happens within 18 minutes of a reboot indicates to me that it’s not likely a memory leak, especially when the server often runs fine for hours. It’s after 1630 server time so, this is going to go on now every 15-20 minutes or so as long as two or more players are online. At least until midnight anyway…

AlphaWolf seems to think that traderesponse message is related to NPCs scanning something/someone but I don’t know. Nothing in that regard has been changed and this behavior happens whether the mod is activated or not…