Crash at 0x635d32b (Common.dll), dump attached
-
Hey all,
So I’ve been trying to find the source of a particular crash that seems to happen occasionally when switching systems. Now, it’s worth mentioning of course that we use a modified system for jumping which is based off Cannon’s hyperdrive code.
Anyway, the crash is at 0x635d32b, which is in common.dll, but I’ve been unable to find any mention of this one yet. I attempted to trace into the instruction and got rather conflicting results: it seems to be called both by ios_base and by a physics function?
Since this is a server crash, I’d really like to know what the hell is going on with it, so I’ve attached the dump file that FLHook produces right before crashing. I hope someone’s already encountered this or may be able to glean more insight from it. IDA doesn’t even want to run the dump over here, which is great, and Visual Studio typically produces garbage when it doesn’t have a PDB to work off of.
-
Seems to be related to Common.dll 0x0635C376 crash, as it’s pointing to the same data. You could try nopping out the six bytes at 0FD31E (or 635D31E in the debugger; alternatively, [c]and[/c] with -1), which would leave [c]edx[/c] as -1, not 0xFFFF, so comparing with [c]eax[/c] as -1 will work (which, given the registers in the dump, seems to be the problem).
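For anyone following along, here is a rough C++ sketch of why that mask matters. It is not the actual Common.dll code: the function names are made up, and it assumes the six bytes in question are an [c]and edx, 0FFFFh[/c] sitting in front of the compare.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two registers; not taken from Common.dll.

// Assumed current behaviour: edx is masked to 16 bits before the compare,
// so a 32-bit -1 becomes 0x0000FFFF and no longer equals eax == -1.
constexpr bool matchesAfterMask(std::int32_t eax, std::int32_t edx) {
    return eax == (edx & 0xFFFF);
}

// With the mask nopped out (or changed to "and edx, -1"), edx keeps all
// 32 bits, so -1 compares equal to -1 again.
constexpr bool matchesWithoutMask(std::int32_t eax, std::int32_t edx) {
    return eax == edx;
}
```

Small positive values compare the same either way; only the -1 case changes.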
-
Hmm… I’ll try to do that and report if I run into anything. At least it hasn’t crashed when I tried running around and jumping, so that’s a start.
-
So, to report back, I’ve had one crash at 0x635d32b since applying the patch, plus a bunch more at Common.dll+0xf24a0 (which I’ve mentioned having trouble with in the past).
I’m pretty sure the two issues are related, but at this point I’m completely unable to say what would cause them to arise. I’ve already sanitized positions and orientations obtained from players (which were a large source of these errors), but it seems like bad values are still being generated somewhere.
To perhaps provide some context, here are six dump files of those two crashes.
-
This may catch the crash at F24A0, causing it at F23CD instead, so another dump may help narrow it down. You could also try replacing the final [c]88 00[/c] with [c]C2 08 00[/c] to see if that works around it. Don’t know about the other issues, though.
File: Common.dll
0E4432: 8F [ 9A ]
0E44B0: 11 [ 1C ]
0E453E: 83 [ 8E ]
0F23C5: 8B 44 E4 04 85 C0 75 03 88 00 [ 90 90 90 90 90 90 90 90 90 90 ]
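If you would rather apply these from code than from a hex editor, a minimal sketch could look like the following. The helper name is hypothetical, and it assumes the bytes outside the brackets are the new values and the bracketed ones are the originals (which is what the [c]88 00[/c] remark above suggests). It refuses to write unless the original bytes are actually present.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: writes `patched` at `offset` in an in-memory copy of
// the file, but only if the bytes currently there equal `expected`.
bool applyPatch(std::vector<std::uint8_t>& image, std::size_t offset,
                const std::vector<std::uint8_t>& expected,
                const std::vector<std::uint8_t>& patched) {
    // Both byte runs must cover the same span.
    if (expected.size() != patched.size()) return false;
    // Reject patches that would run past the end of the image.
    if (offset > image.size() || patched.size() > image.size() - offset) return false;
    // Only patch when the original bytes match (wrong version, or already patched).
    if (!std::equal(expected.begin(), expected.end(), image.begin() + offset)) return false;
    std::copy(patched.begin(), patched.end(), image.begin() + offset);
    return true;
}
```

Checking the original bytes first means a second run, or a different build of the DLL, fails cleanly instead of corrupting the file.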
-
I’ll definitely try that! How do you load those dump files by the way? Visual Studio works well with them, but since all those crashes are in files without PDBs it’s pretty much useless, and IDA doesn’t seem to be able to do anything useful with them, let alone give me some kind of call stack or crash info.
-
I’ve tried applying all previous fixes with the C2 08 00 workaround and got a bunch more crash logs in different locations. Interestingly, related logs are always 10 seconds apart.
They’re all at the same offset except the one at 19.22.25, which has three cascading points of failure, starting at 0x0635d32b. All seem related to jumping across systems.
Also, I can confirm that they all trace down to PhySys::Update rather than another of the many functions that end up calling the crashing code.
-
I use [c]cdb[/c] (from the debugging tools, the command line version of [c]WinDbg[/c]) to read the dump ([c](for %j in (*.dmp) do cdb -ses -z %j -c .ecxr;k;q) |tde[/c]).
I wanted dumps with the new crash address so I could trace back what was writing the zero in the first place (assuming a zero was being inserted into the hash table, not just replacing an existing entry; given that it’s crashing at a different spot, I guess it is inserted). I could hazard a guess at a [c]PhySys::CreatePhantom[/c] constructor.
-
I’m tracking down the bug a little further (it’s not quite reproducible, but it happens often enough that I can force it) and one thing I’ve noticed is that a lot of the time, there’s at least one player ship with NaNs for position and orientation at the time of the crash. Not all the time, though, which is odd. What’s even more odd, however, is that as I’ve mentioned in the past, we’re blocking NaNs coming from the client, so I don’t see where they could even be coming from.
I’ll try to hook the CreatePhantom calls and see if I get something.
-
So I have conclusions (partial, anyway):
-
CreatePhantom didn’t trigger any exception, so I don’t think the error originates from that.
-
The root cause of the issue remains NaN positions and/or orientations. The way it happened this time around was a lot trickier though, which is why it took me so long to find it. The rotation quaternion the client sent back after jumping was all NaNs because the rotation matrix was slightly denormalized, causing the conversion to fail. However, the quaternion was “sanitized” somewhere along the chain and became all zeros, so it didn’t get caught by the SPObjUpdate patch I added (it checks for NaNs only). A zero quaternion still doesn’t represent a valid rotation, so when converting it back to a matrix it became full of NaNs again and would crash the server on some occasions.
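To show how an all-zero quaternion can turn back into NaNs, here is a small C++ sketch. It is not the game’s conversion code: the struct and function are made up, and it assumes the conversion normalizes the quaternion by dividing by its length, which is 0/0 for a zero quaternion.

```cpp
#include <cmath>

// Hypothetical quaternion type; the real layout in Freelancer may differ.
struct Quat { float w, x, y, z; };

// Naive normalization with no guard against a zero-length input.
Quat normalizeNaive(const Quat& q) {
    const float len = std::sqrt(q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z);
    // For an all-zero quaternion len is 0, so every component becomes 0/0 = NaN.
    return { q.w / len, q.x / len, q.y / len, q.z / len };
}
```

A NaN-only check on the incoming quaternion passes here, because the bad values only appear after this step.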
I’ve tracked down and fixed the quaternion conversion (I can’t really help with rotation matrices becoming slightly denormalized, that’s just normal) on the client, but I’ve also added another check to SPObjUpdate in FLHook which requires the quaternion to be mostly normalized, so it’ll catch erroneous values like this in the future.
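Roughly, the extra check amounts to something like the sketch below. This is not the actual FLHook code: the type, the function name and the tolerance value are placeholders picked for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical quaternion type; not FLHook's own struct.
struct Quaternion { float w, x, y, z; };

// Accepts a rotation only if its squared length is close to 1.
// The tolerance is an arbitrary illustrative value, not the one used on the server.
bool isMostlyNormalized(const Quaternion& q, float tolerance = 0.01f) {
    assert(tolerance >= 0.0f);
    const float lenSq = q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z;
    // NaN components make lenSq NaN, and any comparison with NaN is false,
    // so this rejects NaNs, infinities and the all-zero quaternion alike.
    return std::fabs(lenSq - 1.0f) <= tolerance;
}
```

Written this way, the one comparison covers both the old NaN case and the new zero-quaternion case.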
-