Crash Offsets

robocop

Well, the hotfix didn’t work, and neither did a reinstall with only WinXP SP2.
Still crashing in dalib.dll at 00004353.

I’m willing to kill for a solution…

Also, during the freeze/crashes, procmon shows buffer overflows in csrss.exe and lsass.exe as well as a bunch of dxdiag stuff flooding the monitor.

Does that help at all?

Cannon

I might have a fix for this problem: “FLServer locks up every 10-20 minutes, most commonly after about 18 minutes. It tends to lock up more often if flhook event socket stuff is happening. I have both DSPM and DAM connected to flhook doing unicode event socket traffic. DirectPlay traffic still works but no game events are passed.”

The lockup is caused by the thread at CRemotePhysicsSimulation+0x2660 ceasing to do anything. I think this is the main FLServer loop. FLHook event processing is called from it, so if this thread stops, the FLHook socket processing stops too.

Reviewing one of the stack traces, I suspect the problem might be related to a QueryPerformanceCounter call. On further investigation, it might be possible for this call to return an invalid time, potentially causing FLServer to wait a very long time before doing any processing. I haven’t actually proven that this is happening, but it seems possible, particularly on multi-core processors and/or virtual machines.
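To make the suspected failure mode concrete, here is a minimal sketch in C. It assumes, hypothetically, that a server loop computes its sleep time as "frame length minus elapsed time", with the elapsed time derived from two QueryPerformanceCounter readings; FLServer's actual loop is not known here, and naive_wait_ms / guarded_wait_ms are invented names. The counter values and frequency are passed in as plain integers so the arithmetic can be shown without the Windows API.

```c
#include <stdint.h>

/* Hypothetical naive version: if the counter appears to run backwards
   (now < start), the elapsed time goes negative and the computed wait
   grows far beyond one frame. */
int64_t naive_wait_ms(int64_t start, int64_t now, int64_t freq, int64_t frame_ms)
{
    int64_t elapsed_ms = (now - start) * 1000 / freq;
    return frame_ms - elapsed_ms;
}

/* Hypothetical guarded version: a backwards step counts as zero elapsed
   time and a huge forward jump counts as an overdue frame, so the
   returned wait always stays within [0, frame_ms]. */
int64_t guarded_wait_ms(int64_t start, int64_t now, int64_t freq, int64_t frame_ms)
{
    if (freq <= 0 || now <= start)
        return frame_ms;            /* unusable reading: wait one frame at most */

    int64_t delta = now - start;
    int64_t whole_seconds = delta / freq;

    /* More whole seconds than the frame holds: overdue, and this also
       keeps the multiplication below from overflowing. */
    if (whole_seconds > frame_ms / 1000)
        return 0;

    int64_t elapsed_ms = whole_seconds * 1000 + (delta % freq) * 1000 / freq;
    if (elapsed_ms >= frame_ms)
        return 0;                   /* frame already overdue: no wait */
    return frame_ms - elapsed_ms;
}
```

With a 1 kHz counter (chosen only for easy arithmetic; real counter frequencies are much higher) and a 16 ms frame, a reading that steps back by 1,000,000 ticks makes naive_wait_ms return 1000016 ms, roughly 17 minutes, while guarded_wait_ms returns 16.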

Adding /usepmtimer to your boot.ini might work around this problem. See http://support.microsoft.com/kb/895980 for instructions.

I’ve applied this change and so far I’m at 1 hour 30 minutes without a crash. I’ll edit this post once I know whether it really fixes it.

Cannon

robocop wrote:
Still crashing in dalib.dll at 00004353

According to the code, this crash apparently happens when the server starts hosting. It is in the function CDPServer::GetHostAddresses().

I think I recall having this crash on GC ages ago. I believe it was related to a corrupt character file causing stack corruption, which somehow ended up in this function. Sorry I can’t be more certain than that.

robocop

I’m not sure your first suggestion about the boot.ini applies to us as we don’t use FLHook.

The second interests me, but I know each player character has been ‘cleaned’ by FL PlayerCleaner on more than one occasion, and FLAC is not identifying any attempts to log in with corrupt player chars.

So, short of doing a player wipe, how do you find the corrupt player char(s) if that’s indeed the problem?

FriendlyFire

Robo, I don’t think FLAC is anything but FLHook with modifications. It most likely has the same bugs that FLHook does.

AlphaWolf

As clarified previously, FLAC wasn’t derived from FLHook. They both use a similar approach because that’s the approach that’s needed, but the sources differ significantly enough for the two to be completely separate from each other.

The crash offset being reported doesn’t seem to be triggered by FLAC, and likely wouldn’t be by FLHook either, as the errors appear to be triggered by FL itself. Whether an issue with the hooking functionality further back along the line is causing the circumstances, I can’t tell so far, and there’s little to suggest that it is.

Unfortunately I had virtually exhausted the options from my own investigations before this post, but I will keep digging and will post if I figure out anything new.

-Alpha

robocop

That MS boot.ini workaround refers specifically to servers using AMD procs. We have an Intel proc. Would running that workaround cause any issues or help at all?

Going through player files as we speak. Does anyone know of any tools that can identify a corrupted player file? I’m using FL Player Cleaner to clean everything up, but I’ve been using it for months and still have probs. Willing to try another tool.

FriendlyFire

DSAM does a good job, but I don’t know how dependent it is on FLHook to be able to do that.

HeIIoween

Not only for AMD.

Robocop, try /ONECPU.

In Control Panel, under the power settings, set all power settings to “on”.

Disable CPU power management support in the BIOS.

This is called the Time Drift Bug.

In Linux it is called the PM-Timer Bug.

Cannon

The boot.ini /usepmtimer switch is a fix for an FLServer issue rather than an FLHook or FLAC one. I should have mentioned that we have an Intel CPU.

Also the test server has been running for 12 hours now. Yay!

robocop

HeIIoween wrote:
Not only for AMD.

Robocop, try /ONECPU.

In Control Panel, under the power settings, set all power settings to “on”.

Disable CPU power management support in the BIOS.

This is called the Time Drift Bug.

In Linux it is called the PM-Timer Bug.

I’ve implemented the boot.ini workaround, we’ll see what that does for us.

@Helloween, for power options I have set ‘Always On’. Is that what you mean?

What is /ONECPU?

Next reboot I’ll disable BIOS power management support for CPU.

R

robocop

Well, daily misery report again…
I implemented the /usepmtimer suggestion in the boot.ini file: no joy. 18 minutes after a reboot I got the .\HookFunction.cpp(887): *** ERROR: Exception in Hook_IServerImpl_TradeResponse (unhandled exception) message three times, and everyone online got booted.

So the fact that it happens within 18 minutes of a reboot suggests to me that it’s probably not a memory leak, especially since the server often runs fine for hours. It’s after 1630 server time, so this is going to go on every 15-20 minutes or so as long as two or more players are online. At least until midnight, anyway…

AlphaWolf seems to think the TradeResponse message is related to NPCs scanning something or someone, but I don’t know. Nothing in that regard has been changed, and this behavior happens whether the mod is activated or not…

?

/ONECPU: use only one CPU on a multiprocessor system.
http://en.wikipedia.org/wiki/NTLDR

?

I still have one problem: FLServer freezes and its window goes white, while the FLHook window looks fine but of course doesn’t respond to any commands.

Same as always: no logs, no events, nothing.

Because all my servers are on physical machines, I don’t know how things will behave after a restart with the /usepmtimer switch in boot.ini.

I will also try disabling the APIC/ACPI functions in the BIOS; Google says that should help.

robocop

/usepmtimer and /onecpu did not fix my problem.

Still crashing in dalib.dll.

What causes hash collisions?

adoxa

Hash functions are very good, but not infallible, so sometimes two different strings will produce the same result. In the vanilla game, there were five collisions (search jflp.txt for clash; createid -s then whatis -t will find them).
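To illustrate the point about collisions, here is a minimal C sketch that counts colliding pairs in a list of nicknames. It uses 32-bit FNV-1a purely as a stand-in hash; it is not Freelancer's own ID hash, so it will not reproduce the five vanilla clashes, and the function names are hypothetical. Masking the hash down to a few bits makes collisions easy to provoke: 17 distinct strings cannot fit into 16 buckets without at least one clash (the pigeonhole principle).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a, used here only as an example hash (not Freelancer's). */
uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;               /* FNV offset basis */
    for (; *s != '\0'; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                     /* FNV prime */
    }
    return h;
}

/* Count pairs of *different* strings whose hashes agree in the low
   `bits` bits. A plain O(n^2) pair scan keeps the example short. */
size_t count_collisions(const char *const *names, size_t n, unsigned bits)
{
    uint32_t mask = (bits >= 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    size_t pairs = 0;

    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if ((fnv1a32(names[i]) & mask) == (fnv1a32(names[j]) & mask) &&
                strcmp(names[i], names[j]) != 0)
                ++pairs;
    return pairs;
}
```

A real check would hash every nickname in the mod with the game's actual ID function and report equal values, which is what the createid/whatis route does; for large data sets, sorting the hashes or using a hash map would scale better than the pair scan.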

robocop

Well, here’s my issue in a nutshell, and the specific post that explains my interest in the causes of a ‘hash collision’:

http://the-starport.net/modules/newbb/viewtopic.php?topic_id=2191&forum=11&post_id=24563#forumpost24563

?

content.dll at 0x490a5: formation errors; check the formation values in faction_prop.ini.

Huor

I am still chasing this crash at engbase.dll (+0x0124bd). I attached a debugger, and it always halts at this stack backtrace:

 #   Memory  ChildEBP RetAddr  Args to Child  
00           00129544 066123db 00d1cec8 08e5c568 0012d5e4 EngBase+0x124bd
01        20 00129564 0661ae06 06612567 08e5c568 00c46318 EngBase+0x23db
02         4 00129568 06612567 08e5c568 00c46318 08e07aac EngBase+0xae06
03        a8 00129610 4fdf4e22 00002800 00000000 0b662220 EngBase+0x2567
04        14 00129624 4fd9c47d 0b662220 00000000 00000008 d3d9!CD3DDDIDX8::LockVB+0x32 (FPO: [2,0,0])
05        14 00129638 06d12996 0b662220 00006300 00000120 d3d9!CDriverVertexBuffer::Lock+0x4d (FPO: [5,0,4])
06      40a8 0012d6e0 7c9201db 77bfc3c9 00330000 00000000 RP8+0x12996
07       234 0012d914 77bfc3c9 00330000 00000000 77bfc3ce ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
08        40 0012d954 77bfc3e7 00000014 0012d970 77bf9cd4 msvcrt!_heap_alloc+0xe0 (FPO: [Non-Fpo])
09         c 0012d960 77bf9cd4 00000014 00000001 06be0038 msvcrt!_nh_malloc+0x13 (FPO: [2,0,0])
0a        10 0012d970 06b73f73 00000014 00000000 09061c28 msvcrt!operator new+0xf (FPO: [1,0,0])
0b        14 0012d984 06b71935 0012d9a8 090603c0 0012d9ac ReadFile+0x3f73
0c           00000000 00000000 00000000 00000000 00000000 ReadFile+0x1935

As far as I could find out, ReadFile is something to do with hudframe021004123005; I don’t know, maybe something from there…

Does anyone have an idea what is being done here? Does it have something to do with graphics problems, given the d3d9!CD3DDDIDX8 frame?

Note: this code path seems to be reached when the server connection is lost.

Correction: the stack backtrace is:

 #   Memory  ChildEBP RetAddr  Args to Child              
WARNING: Stack unwind information not available. Following frames may be wrong.
00           00127434 0628a428 00d1cec8 0932d098 00000000 EngBase+0x124bd
01        18 0012744c 06293cce 0932d098 00000000 0932cff0 Common!PhySys::FindSphereCollisions+0x438
02        2c 00127478 0629ed87 00127464 77c05c94 001274f0 Common!CNonPhysAttachment::Disconnect+0x26e
03        1c 00127494 77bfc2e3 0932cff0 0629ed87 00000000 Common!CExternalEquip::IsConnected+0x7
04        70 00127504 0629b344 00000001 093292bc 093291d8 msvcrt!free+0xc8 (FPO: [Non-Fpo])
05        20 00127524 062a922e 093293a8 093291d8 09329548 Common!CEquipManager::Clear+0x94
06        34 00127558 062b0c21 00000000 093291d8 093233d4 Common!CEqObj::~CEqObj+0x9e
07        2c 00127584 06288222 00000000 00000000 062af65b Common!CShip::~CShip+0x271
08         c 00127590 062af65b 00000001 09323344 00539599 Common!BaseWatcher::~BaseWatcher+0x52
09           00000000 00000000 00000000 00000000 00000000 Common!CObject::Release+0x1b

I still have no clue what happens in engbase.dll. Is there a way to find out which routines are running there? The disassembly looks like this:

The crash offset is this line:

066224bd 8b4010 mov eax,dword ptr [eax+10h] ds:0023:09054e20=00000000

066224a9 90              nop
066224aa 90              nop
066224ab 90              nop
066224ac 90              nop
066224ad 90              nop
066224ae 90              nop
066224af 90              nop
066224b0 8b442408        mov     eax,dword ptr [esp+8]
066224b4 83f8ff          cmp     eax,0FFFFFFFFh
066224b7 740b            je      EngBase+0x124c4 (066224c4)
066224b9 85c0            test    eax,eax
066224bb 7407            je      EngBase+0x124c4 (066224c4)
066224bd 8b4010          mov     eax,dword ptr [eax+10h] ds:0023:09054e20=00000000
066224c0 85c0            test    eax,eax
066224c2 7503            jne     EngBase+0x124c7 (066224c7)
066224c4 83c8ff          or      eax,0FFFFFFFFh
066224c7 c20800          ret     8
066224ca 90              nop
066224cb 90              nop
066224cc 90              nop
066224cd 90              nop
066224ce 90              nop
066224cf 90              nop

Any hints would be appreciated. I am bad at understanding this assembler stuff ;(
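For readers who find the assembly hard going, here is a rough C rendering of the EngBase routine starting at EngBase+0x124b0, reconstructed only from the instructions listed above. The struct and function names are hypothetical, and the routine is modelled as taking two stack arguments (it ends in ret 8), of which only the second, read via [esp+8], is used; the real calling convention and types are assumptions. The routine rejects NULL and -1 (0xFFFFFFFF) pointers, but it has no way to tell whether a non-NULL pointer still refers to a live object. That fits, though it does not prove, the second backtrace, where the call comes from inside the CShip/CEqObj destructors via CEquipManager::Clear, i.e. a possible use-after-free.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the 32-bit field at offset 0x10 is known. */
struct engbase_object {
    unsigned char unknown[0x10];    /* contents not visible in the listing */
    int32_t field_10;               /* read by: mov eax,dword ptr [eax+10h] */
};

/* Rough equivalent of EngBase+0x124b0. The original is 32-bit x86 and
   pops two stack arguments with ret 8; the first is unused here. */
int32_t engbase_124b0(const void *unused_arg, const struct engbase_object *obj)
{
    (void)unused_arg;

    /* cmp eax,0FFFFFFFFh / je and test eax,eax / je */
    if (obj == NULL || (uintptr_t)obj == UINTPTR_MAX)
        return -1;

    /* EngBase+0x124bd: the crashing read. A dangling (already freed)
       pointer passes both checks above and is dereferenced here. */
    int32_t value = obj->field_10;

    /* test eax,eax / jne; otherwise or eax,0FFFFFFFFh */
    return (value != 0) ? value : -1;
}
```

If the use-after-free reading is right, the fix would likely lie upstream, in whatever releases the object while something still holds a pointer to it, rather than in this routine.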

Chips

Are these servers running mods at all? Some of these problems, if they come from mods, sound like something simple testing should have picked up and recognised…

I know it’s a non-contributory post; it’s just that this whole thread seemed an interesting concept and made me immediately suspect it would replace testing and experience (not trying to offend folks).