Bug 1191976 - [e10s] Some child process crashes are not being reported
Oh shit, this is pretty serious.

Luckily, I can mostly reproduce.

"
STR for this particular crash:

1) Set layers.progressive-paint to true and restart Nightly
2) Move Nightly over to a Retina display
3) Log in to Workday.com
4) Click into any of the menus that cause the "zoom" animation
5) Keep repeating until a content process crash.

The crash itself is not important - layers.progressive-paint is not a thing we're ever likely to turn on. We should find out why the crash report isn't being generated here in this case though."

Thing to remember - for debug builds, the crash reporter is disabled, so make sure to run with MOZ_CRASHREPORTER=1. I've been burned by that before.

See? I'm learning! :D


Okay, going to attack this on two fronts. First, I'm setting a JS debugger breakpoint on TabCrashReporter.observe for the ipc:content-shutdown observer notification, because it seems like that's either not being called, or the shutdown isn't being treated as abnormal, or something.

Second, I'm setting a breakpoint in ContentParent::ActorDestroy, since that's what gets called when the process crashes.

Welp, shit, I can't reproduce this with a debug build. But I can with an opt build. The crash from the STR must be a racy one or something. :/

So it's down to printf debugging. Lovely.

Well, what I've discovered is that the dumpID is blank. Why?

So who is responsible for setting the dumpID?

nsAutoString dumpID(crashReporter->ChildDumpID());
props->SetPropertyAsAString(NS_LITERAL_STRING("dumpID"), dumpID);

The above is in ContentParent::ActorDestroy. So I guess I need to examine the CrashReporterParent.
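
For context, here's my rough mental model of where that snippet sits in ContentParent::ActorDestroy - treat this as a paraphrase, not the exact tree code:

// Paraphrased sketch, not the literal mozilla-central code. On an abnormal
// shutdown, the parent stuffs whatever the CrashReporterParent produced into
// a property bag and fires the ipc:content-shutdown notification that
// TabCrashReporter.observe is listening for.
if (why == AbnormalShutdown) {
  props->SetPropertyAsBool(NS_LITERAL_STRING("abnormal"), true);

  nsAutoString dumpID(crashReporter->ChildDumpID()); // blank in the bad case!
  props->SetPropertyAsAString(NS_LITERAL_STRING("dumpID"), dumpID);
}
obs->NotifyObservers(props, "ipc:content-shutdown", nullptr);

So (as far as I can tell) if ChildDumpID() comes back empty, the front-end never gets a dump to offer the user, which matches the symptom exactly.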

Huh… it seems we never call into GenerateCrashReportForMinidump… and that's what's necessary to generate the dumpID...

So far, I've traced this to TakeMinidumpForChild in nsExceptionHandler.cpp. There's apparently no child process data when we try to "GetEntry":

MutexAutoLock lock(*dumpMapLock);

ChildProcessData* pd = pidToMinidump->GetEntry(childPid);
if (!pd)
  return false;

NS_IF_ADDREF(*dump = pd->minidump);


So we return false here, which causes us to fail to generate a crash report.

Question: Why is pidToMinidump->GetEntry empty?

Well, better question:

Question: When is pidToMinidump->GetEntry usually set?

There are two places where pidToMinidump->PutEntry is called. Let's instrument those and use TabCrasher.
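
By "instrument" I just mean crude printf logging at each PutEntry call site, something like this (the log tag is made up for this investigation, obviously):

// Hypothetical instrumentation at a PutEntry call site - not tree code.
ChildProcessData* pd = pidToMinidump->PutEntry(pid);
fprintf(stderr, "BUG1191976: PutEntry(%d) from %s\n",
        static_cast<int>(pid), __func__);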

So when I use TabCrasher, we PutEntry in OnChildProcessDumpRequested. Who calls that? Apparently that's something that CrashGenerationServer calls, which I think is part of Breakpad...

Ah, ok, so CrashGenerationServer runs on a separate thread...

* thread #45: tid = 0x5bb87f, 0x00000001065eb80b XUL`CrashReporter::OnChildProcessDumpRequested(aContext=0x0000000000000000, aClientInfo=0x000000013ad03a90, aFilePath=0x000000013ad03a80) + 27 at nsExceptionHandler.cpp:2635, stop reason = breakpoint 2.1
* frame #0: 0x00000001065eb80b XUL`CrashReporter::OnChildProcessDumpRequested(aContext=0x0000000000000000, aClientInfo=0x000000013ad03a90, aFilePath=0x000000013ad03a80) + 27 at nsExceptionHandler.cpp:2635
frame #1: 0x000000010664255c XUL`google_breakpad::CrashGenerationServer::WaitForOneMessage(this=0x0000000139795620) + 764 at crash_generation_server.cc:136
frame #2: 0x000000010664223d XUL`google_breakpad::CrashGenerationServer::WaitForMessages(server=0x0000000139795620) + 29 at crash_generation_server.cc:95
frame #3: 0x00007fff8305d772 libsystem_c.dylib`_pthread_start + 327

Okay, uh… so is that being run in the bad case?

Bah. I'm out of my depth. Needinfo'ing ted.

Ted sez:

5:59 PM < @ ted > mconley: by way of https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#370

I'm noticing these before a crash:

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized

IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value


###!!! [Parent][MessageChannel] Error: (msgtype=0x200084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

Is it possible that the IPC library is doing a killhard of the child on an error message that we're not prepared to handle?

Maybe. I'd have to talk to billm about that maybe.

What I do know is that ExceptionHandler::WriteMinidumpWithException isn't being called… that's what's called in the "show crash report" case.

Okay, this is getting pretty insane. First of all, somehow, I'm able to set breakpoints inside Xcode and walk through an OPT BUILD. WHAT THE FUCK. That's some H.P. Lovecraft-style revelation here. I feel like I'm staring at non-Euclidean geometry. The horror! The majesty! TL;DR: Everything I know is wrong.

Uh. I can't even deal with that right now. Let's pretend that the data I'm getting out of this is useful. I'll curl up into a fetal ball and rock myself to sleep later.

Soooooooooo - I'm starting to suspect that ContentParent::KillHard is sending a SIGTERM to the content process. I can see that SIGTERM being sent at the end of ContentParent::KillHard via ProcessWatcher::EnsureProcessTerminated.

SO, hypothesis: This is a KillHard abort that does not result in a crash report.

Test: Add a (magical?) breakpoint to ContentParent::KillHard, reproduce the crash, and see if we hit the breakpoint.

Result: No. The parent never enters ContentParent::KillHard. Uh… so EnsureProcessTerminated is being called by something else?

Noooo - I think it's happening through ContentParent::ActorDestroy - after we attempt to get a crash report, we do DelayedDeleteSubprocess which kills the process.

So ActorDestroy… the IPC pipe is being shut down. Why? Did the child return false for something?

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized

IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value

So… uh, it looks like something is sending SIGTERM to the content process. And then the content process doesn't handle that and just shuts down without creating a minidump.

Wai?

Hypothesis: Child is returning false for some IPC message. This causes the IPC layer to flip out and kill the child with SIGTERM.

Test: Make RecvSwappedWithOtherRemoteLoader return false, and then tear out a tab to crash.
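
The hack itself is trivial - roughly this, with the real argument list elided since it doesn't matter for the test:

// Temporary test hack in TabChild (args elided / from memory): pretend the
// handler failed and see how the IPC layer reacts.
bool
TabChild::RecvSwappedWithOtherRemoteLoader(/* real args elided */)
{
  return false;
}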

Result: No, we still get a crash report. Plus this error message:

IPDL protocol error: Handler for SwappedWithOtherRemoteLoader returned error code

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x20008C,name=PBrowser::Msg_SwappedWithOtherRemoteLoader) Processing error: message was deserialized, but the handler returned false (indicating failure)

[Child 23849] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
[Child 23849] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005

So I think we handle that case OK.


Hypothesis: Parent asks child to allocate something. Child returns nullptr. Parent flips out and kills it with SIGTERM.

Test: Make TabChild::AllocPWebBrowserPersistDocumentChild return nullptr, and then try to save a page.
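
Same kind of hack as before, just failing the allocation step instead of the handler (again, parameters elided):

// Temporary test hack: fail the child-side allocation so the parent sees the
// "illegal value" path. Parameters elided; they don't matter for this test.
PWebBrowserPersistDocumentChild*
TabChild::AllocPWebBrowserPersistDocumentChild(/* params elided */)
{
  return nullptr;
}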

Result: No, we still get a crash report, and the following error message:

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x200004,name=PBrowser::Msg_PWebBrowserPersistDocumentConstructor) Value error: message was deserialized, but contained an illegal value

[Child 24092] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
[Child 24092] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005

###!!! [Parent][MessageChannel] Error: (msgtype=0x200084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv


Huh. So let's go back to the error messages that we were seeing:

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized

IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'

###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value

Let's try to come up with a scenario here that could explain this.

Where is the PTextureConstructor sent?

Hm, a few places:

ImageBridgeChild::CreateTexture and ShadowLayerForwarder::CreateTexture

Reminds me of… bug 1134252.

That had an error message of "(msgtype=0x200002,name=PBrowser::Msg_PRenderFrameConstructor) Value error: message was deserialized, but contained an illegal value".

And that was because the PARENT was returning nullptr for the constructor.

* thread #25: tid = 0x630940, 0x00000001016bb474 XUL`base::KillProcess(process_id=24855, exit_code=<unavailable>, wait=false) + 20 at process_util_posix.cc:70 , name = 'Compositor', stop reason = breakpoint 1.2
* frame #0: 0x00000001016bb474 XUL`base::KillProcess(process_id=24855, exit_code=<unavailable>, wait=false) + 20 at process_util_posix.cc:70
frame #1: 0x00000001016d9424 XUL`mozilla::ipc::FatalError(aProtocolName=<unavailable>, aMsg=<unavailable>, aOtherPid=<unavailable>, aIsParent=<unavailable>) + 308 at ProtocolUtils.cpp:324
frame #2: 0x00000001017905e8 XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x0000000000000000, v__=<unavailable>, msg__=<unavailable>, iter__=0x00000001129868b8) + 216 at PLayerTransactionParent.cpp:2423
frame #3: 0x000000010178c15a XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x00000001240d0160, v__=0x0000000112986878, msg__=0x0000000112986b20, iter__=0x00000001129868b8) + 170 at PLayerTransactionParent.cpp:4576
frame #4: 0x000000010178b05b XUL`mozilla::layers::PLayerTransactionParent::OnMessageReceived(this=0x00000001240d0160, msg__=0x0000000112986b20) + 1227 at PLayerTransactionParent.cpp:425
frame #5: 0x00000001018f7758 XUL`mozilla::layers::PCompositorParent::OnMessageReceived(this=<unavailable>, msg__=0x0000000112986b20) + 232 at PCompositorParent.cpp:509
frame #6: 0x00000001016d6896 XUL`mozilla::ipc::MessageChannel::DispatchAsyncMessage(this=0x000000011c20ec68, aMsg=0x0000000112986b20) + 118 at MessageChannel.cpp:1382
frame #7: 0x00000001016d6011 XUL`mozilla::ipc::MessageChannel::DispatchMessage(this=0x000000011c20ec68, aMsg=<unavailable>) + 241 at MessageChannel.cpp:1302
frame #8: 0x00000001016d2d77 XUL`mozilla::ipc::MessageChannel::OnMaybeDequeueOne(this=0x000000011c20ec68) + 167 at MessageChannel.cpp:1273
frame #9: 0x00000001016b4524 XUL`MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [inlined] MessageLoop::RunTask(task=0x000000011c21f890, this=0x0000000112986d20) + 132 at message_loop.cc:364
frame #10: 0x00000001016b4513 XUL`MessageLoop::DeferOrRunPendingTask(this=0x0000000112986d20, pending_task=<unavailable>) + 115 at message_loop.cc:372
frame #11: 0x00000001016b483a XUL`MessageLoop::DoWork(this=0x0000000112986d20) + 170 at message_loop.cc:459
frame #12: 0x00000001016b5046 XUL`base::MessagePumpDefault::Run(this=0x0000000112ccb520, delegate=0x0000000112986d20) + 758 at message_pump_default.cc:34
frame #13: 0x00000001016b40ad XUL`MessageLoop::Run() [inlined] MessageLoop::RunInternal(this=<unavailable>) + 77 at message_loop.cc:234
frame #14: 0x00000001016b409e XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:227
frame #15: 0x00000001016b409e XUL`MessageLoop::Run(this=<unavailable>) + 62 at message_loop.cc:201
frame #16: 0x00000001016bc43b XUL`base::Thread::ThreadMain(this=0x0000000112ccb430) + 187 at thread.cc:170
frame #17: 0x00000001016bc49a XUL`ThreadFunc(closure=<unavailable>) + 10 at platform_thread_posix.cc:39
frame #18: 0x00007fff8305d772 libsystem_c.dylib`_pthread_start + 327
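
So, reading that stack, my mental model of the parent-side path is now roughly this (a paraphrase of what I think ProtocolUtils.cpp is doing, not the literal code):

// Paraphrased from the stack above, not copied from ProtocolUtils.cpp. When
// the *parent* fails to deserialize a message, FatalError just kills the
// child outright - nothing on this path takes a minidump, which would
// explain the missing crash reports.
void
FatalError(const char* aProtocolName, const char* aMsg,
           base::ProcessId aOtherPid, bool aIsParent)
{
  // logs the "IPDL protocol error: ..." lines I've been seeing
  if (aIsParent) {
    base::KillProcess(aOtherPid, /* exit code */ 1, /* wait */ false);
  } else {
    // the child side aborts instead, which *does* produce a crash report
  }
}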

From jld:

"From a quick look at the generated code, it looks as if the only cases where mozilla::ipc::FatalError is called if there's an out-of-range IPDL union discriminant or if a ParamTraits<T>::Read (i.e., deserialization) routine fails.  Most of the reasons that can happen mean the data stream is corrupt, but one possibility is a FallibleTArray when the receiver's allocation fails.

It looks like the other cases (Recv* returning false, failed Alloc* on the receive side) just return errors from the OnMessageReceived method, which is probably where we're tapping in to do the paired minidump?"

From billm:

"I suspect the logic here has to do with whether it's reasonable to recover from the error or not. If we can't deserialize some data, then we're unlikely to be able to understand future messages from the child either, so we should kill it. Things like message handlers returning false are potentially recoverable.

I'd suggest we add some sort of ability for individual protocols to override the FatalError behavior. However, the default behavior if you don't implement this method would be to MOZ_CRASH() (which would definitely send us a crash report). Then we would override the FatalError behavior for CrossProcessCompositorParent to do something reasonable. Maybe it could post a message to the main thread asking it to take a minidump and kill the child. We'll need to be careful to close the CompositorParent channel right away though so that we don't get any more bad messages coming in."

Question:

billm suggests using MOZ_CRASH within FatalError in order to definitely send a crash report. Since this code is presumably running in the parent process, is he suggesting we crash the parent here?

Yes, yes he is. And that way, we can get an idea of how often this happens.
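
To make sure I understand billm's suggestion, here's the shape of the hook I think he's describing - entirely hypothetical, nothing like this exists in the tree yet, and the names are made up:

// Hypothetical sketch of billm's idea (made-up names). Protocols would get an
// overridable hook whose default is to crash the current process, which
// guarantees a crash report.
class SomeProtocolParent /* roughly: a top-level protocol actor */ {
protected:
  virtual void OnFatalError(const char* aProtocolName, const char* aMsg)
  {
    MOZ_CRASH("IPC FatalError");
  }
};

// CrossProcessCompositorParent would then override OnFatalError to close its
// channel immediately, post a task to the main thread to take a minidump of
// the child, and kill the child - instead of taking down the whole browser.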

2:31 PM < mconley > jimm: hey - can I pick your brain a bit about bug 1191976? Perhaps jld wants to listen in too - he seems to know a lot about this stuff...
2:33 PM < jimm > mconley: sure, just catching up on that bug
2:33 PM < mconley > thanks
2:33 PM < jimm > "So, to be clear, we want a minidump from the process that saw FatalError."
2:33 PM < jimm > that's the browser right?
2:33 PM < jimm > on the compositor thread
2:33 PM < mconley > jimm: in the case of this bug, yes - but (I believe) that can be any process.
2:34 PM < mconley > like, FatalError is hit if serialization fails on either side
2:34 PM < mconley > so, I'm pretty sure FatalError could be hit on parent, content and plugin side.
2:34 PM < jld > That's a good point.
2:34 PM < jimm > hmm, sounds like a simple MOZ_CRASH in place of that kill the other side code is what you want
2:35 PM < mconley > jimm: which is what billm also suggested
2:35 PM < jld > Thought if a child process is the receiver, then there's no downside to MOZ_CRASH(), because it was going to do that anyway?
2:35 PM < mconley > jimm: but in the case of the parent process, this brings down the whole browse.r
2:35 PM < jld > s/ght/gh/
2:35 PM < mconley > s/.r/r.
2:35 PM < jimm > yep
2:35 PM < jimm > 8mozilla%3A%3Adom%3A%3AIPCDataTransferIt...&version=Firefox%3A42.0a2
2:36 PM < mconley > jimm: we're certain we're willing to make that trade-off?
2:36 PM < jimm > do we have any sense for how often this happens in the wild?
2:36 PM < mconley > because FatalError on the parent side should be recoverable, I'd think - in that, we kill the process that sent us the message, and spew out a minidump for our own process
2:36 PM < mconley > jimm: I don't think we do.
2:37 PM < mconley > like, MOZ_CRASH'ing here would certainly give us an idea o fit
2:37 PM < mconley > s/o fit/of it
2:37 PM < jld > ...gah. The important part of the stack is off the end of the "..." in that signature.
2:37 PM < mconley > like, that's one way of finding out. :)
2:38 PM < jld > jimm: No, not the URL being truncated on IRC (I pasted it back together), the "mozilla::dom::PContentChild::Read(mozilla::dom::IPCDataTransferIt..."
2:38 PM < mconley > ah - so it looks like if we hit FatalError in the child, that we're doing the right thing
2:38 PM < mconley > s/that/then
2:38 PM < jimm > ah sorry
2:38 PM < mconley > geez, I'm fat fingering like crazy
2:39 PM < jimm > mconley: in the case of the compositor crash, I think crashing the browser would be ok assuming you push it to try for a full test run and don't see anything ugly happen.
2:39 PM < jimm > why do we think this is a problem on the compositor side vs. the main thread side?
2:40 PM < jimm > sounds like the data we received was messed up
2:40 PM < jimm > oh, wait, this message came from the child?
2:40 PM < mconley > Yes - we receive a messed up message from the child that we can't deserialize
2:40 PM < jimm > same question though..
2:40 PM < jimm > hmm ok
2:40 PM < jld > ...or, wait. IPCDataTransferItem is useful information.
2:41 PM < mconley > So we're doing the right thing in killing the content process, I think. The new plan is to MOZ_CRASH instead.
2:41 PM < mconley > I guess I'm just hesitant to MOZ_CRASH
2:41 PM < jld > But it would be nice if there were more characters there, or fewer were taken up the stuff leafward of the actor's FatalError.
2:42 PM < mconley > because I feel like there should be a good way, in the parent process FatalError case, to kill the content process still, and take a minidump from ourselves
2:42 PM < mconley > and somehow convince about:tabcrashed to submit _that_ minidump
2:42 PM < jimm > mconley: how does crashing the browser help us diagnose?
2:42 PM < jld > I don't much like MOZ_CRASH()ing either. I guess if the child process is really compromised then there are easier ways to DoS the parent process.
2:43 PM < jld > mconley: And that was what I suggested in the bug: however we get a live-dump-of-self for paired minidumps, do that.
2:43 PM < mconley > jimm: it's pretty brute force. It's a quick and dirty way of ensuring we get a minidump from the parent.
2:43 PM < mconley > jld: yeah, I think I'd like to lean that way if I can
2:43 PM < jld > But I haven't tried going into the code to see how much of the important parts of that are thread-safe.
2:44 PM < mconley > jimm says that the CrashReporterParent is reasonably threadsafe
2:44 PM < jimm > noo
2:44 PM < mconley > oh, sorry, misquoted
2:44 PM < jimm > not CrashReporterParent
2:45 PM < mconley > "the code in CrashReporterParent is a light wrapper around the CrashReporter apis which do most of the work, and I believe these routines are mostly multithread safe" <-- so I was wrong, not CrashReporterParent, but the crash reporter backend.
2:45 PM < jimm > yeah
2:46 PM < jimm > mconley: do we know what ipdl message this happen for?
2:47 PM < mconley > jimm: for this particular bug, it's Msg_PTextureConstructor
2:47 PM < mconley > jimm: but the cause of this FatalError isn't particularly interesting - it's our reaction to the crash that's interesting
2:47 PM < mconley > *sigh*, s/crash/FatalError
2:48 PM < & billm > I just feel like we should get a sense for how common this is before we put a lot of effort into it
2:48 PM < & billm > making the crash reporting threadsafe isn't trivial
2:48 PM < jimm > yeah our handling there is messy.
2:48 PM < mconley > yeah, reading this stuff, I'm getting a sense of it. :/
2:48 PM < mconley > okay
2:48 PM < & billm > if it turns out we crash there a lot, we can revisit the decision
2:48 PM < jimm > billm: pretty sure CrashReporter apis can be called on any thread.
2:48 PM < & billm > if we almost never crash there, we can leave it as is
2:49 PM < jimm > ted: * ?
2:49 PM < jimm > ted: ^ ?
2:49 PM < mconley > billm: okay, I think I'm pursuaded
2:49 PM < & billm > jimm: yes, but the stuff in between isn't
2:49 PM < mconley > persuaded
2:49 PM < jimm > I'm not suggesting calling ContentParent's KillHard there
2:49 PM < jimm > just copying a bit of the logic over
2:50 PM < jld > The core crash reporting stuff *should* be fairly thread-safe, because any thread can crash.
2:51 PM < mconley > welp
2:51 PM < jimm > I don't see how a browser stack helps us here though.. so I don't understand why we want to crash the browser.
2:51 PM < mconley > let's put the spaghetti in the machine
2:51 PM < jimm > unless we just want numbers
2:51 PM < & billm > jimm: we just want the numbers
2:51 PM < jimm > ok :)
2:51 PM < jld > jimm: Because it shows us what we failed to deserialize.
2:51 PM < mconley > jimm: in the case of a FatalError in the parent process, it's because the minidump in the parent will (hopefully) tell us which protocol the bad message was sent over
2:51 PM < jimm > MOZ_CRASH will give you that for sure!
2:51 PM < mconley > which is something, I guess
2:52 PM mconley whips up a MOZ_CRASH patch
2:52 PM < & billm > the child dump won't really tell us anything either I don't think
2:52 PM < & billm > it's just nicer for the user
2:52 PM < mconley > right
2:52 PM < jimm > mconley: thought we already new that
2:52 PM < mconley > which we should strive for
2:52 PM < & billm > but if we almost never hit this, it's not worth the trouble
2:52 PM < mconley > jimm: me too, but you said "I don't see how a browser stack helps us here though"
2:52 PM < mconley > maybe I misunderstood
2:52 PM < jld > MOZ_RELEASE_ASSERT()ing that we don't Write() bytes that won't be Read()able would also work, but there might be a performance cost, and it also won't help stuff like FallibleTArray OOM.
2:53 PM < mconley > alright, well, thanks for jamming with me on this one, folks. :)
2:54 PM < jld > But right now we're not getting *any* reporting for child->parent Read() failure unless the user takes the initiative to file a bug, if I understand correctly?
2:54 PM < mconley > yep, and even then, the information would be scant
2:54 PM < mconley > no minidumps
2:55 PM < jld > Yeah, this one had a usable STR, mercifully.
2:55 PM < jimm > hmm, thought we had stacks already. mike mentioned the message in question was Msg_PTextureConstructor
2:55 PM < mconley > jimm: I only found that out because of the logging we do in the FatalError
2:56 PM < mconley > and because it turns out that I can attach XCode to an opt build and get useful crash information. :)
2:56 PM < jimm > I see so we need crash reports to get a better idea of when this happens. gotcha.
2:56 PM mconley nods
2:56 PM < jimm > crash away!
2:56 PM < mconley > when, and how frequently
2:56 PM < mconley > let's crash this mutha
2:57 PM < jimm > mconley: what's the bug number? I can keep an eye on things after you go on pto.
2:57 PM < jld > Maybe also useful: MOZ_DIAGNOSTIC_ASSERT, which asserts on all nightly/aurora builds (and debug builds of any train).
2:57 PM < mconley > jimm: bug 1191976
2:57 PM < jimm > thxc
2:58 PM < jld > I don't know that it's the right answer here, but it's a tool we have if we don't want to crash beta/release builds.
2:58 PM < mconley > jld: hmmm - yes, that sounds appropriate
2:58 PM < mconley > at least, the behaviour

Ok, using MOZ_DIAGNOSTIC_ASSERT.

Bah, let's just use MOZ_CRASH. Eff it.

Also, adding annotations as per billm's request.
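
Roughly the shape of what I'm pushing (paraphrased from memory, not the actual diff - the annotation key is just whatever I end up calling it):

// Rough sketch of the patch, in the parent-side branch of
// mozilla::ipc::FatalError: annotate the crash report with which protocol
// blew up, then MOZ_CRASH so we get real numbers on how often this happens.
if (aIsParent) {
#ifdef MOZ_CRASHREPORTER
  CrashReporter::AnnotateCrashReport(
      NS_LITERAL_CSTRING("IPCFatalErrorProtocol"),  // key name is my choice
      nsDependentCString(aProtocolName));
#endif
  MOZ_CRASH("IPC FatalError in the parent process!");
}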

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4b930d32cacd