Bug 1191976 - [e10s] Some child process crashes are not being reported
Oh shit, this is pretty serious.
Luckily, I can mostly reproduce.
"
STR for this particular crash:
1) Set layers.progressive-paint to true and restart Nightly
2) Move Nightly over to a Retina display
3) Log in to Workday.com
4) Click into any of the menus that cause the "zoom" animation
5) Keep repeating until a content process crash.
1) Set layers.progressive-paint to true and restart Nightly
2) Move Nightly over to a Retina display
3) Log in to Workday.com
4) Click into any of the menus that cause the "zoom" animation
5) Keep repeating until a content process crash.
The crash itself is not important - layers.progressive-paint is not a thing we're ever likely to turn on. We should find out why the crash report isn't being generated here in this case though."
Thing to remember - for debug builds, the crash reporter is disabled,
so make sure to run with MOZ_CRASHREPORTER=1. I've been burned by that
before.
See? I'm learning! :D
Okay, going to attack this on two fronts. First, I need to attach a JS
debugger breakpoint on TabCrashReporter.observe for the
ipc:content-shutdown observer notification.
Because it seems like that's either not being called, or the shutdown is not abnormal or something.
Also setting a breakpoint in ContentParent::ActorDestroy, since this is what's being called when the process is crashing.
Welp, shit, I can't reproduce this with a debug build. But I can with
an opt build. The crash from the STR must be a racey one or something.
:/
So it's down to printf debugging. Lovely.
Well, what I've discovered is that the dumpID is blank. Why?
So who is responsible for setting the dumpID?
nsAutoString
dumpID
(crashReporter
->
ChildDumpID());
props
->
SetPropertyAsAString(NS_LITERAL_STRING(
"dumpID"
), dumpID);
The above is in ContentParent::ActorDestroy. So I guess I need to examine the CrashReporterParent.
Huh… it seems we never call into GenerateCrashReportForMinidump… and that's what's necessary to generate the dumpID...
So far, I've traced this to TakeMinidumpForChild in
nsExceptionHandler.cpp. There's apparently no child process data when we
try to "GetEntry":
MutexAutoLock
lock
(
*
dumpMapLock);
ChildProcessData * pd = pidToMinidump -> GetEntry(childPid);
if ( ! pd)
return false ;
ChildProcessData * pd = pidToMinidump -> GetEntry(childPid);
if ( ! pd)
return false ;
NS_IF_ADDREF(
*
dump
=
pd
->
minidump);
So we return false here, which causes us to fail to generate a crash report.
Question: Why is pidToMinidump->GetEntry empty?
Well, better question:
Question: When is pidToMinidump->GetEntry usually set?
Two places that pidToMinidump->PutEntry is called. Let's instrument those and use TabCrasher.
So when I use TabCrasher, we PutEntry
in OnChildProcessDumpRequested. Who calls that? Apparently that's
something that CrashGeneratorServer calls, which I think is part of
breakpad...
Ah, ok, so CrashGeneratorServer runs on a separate thread...
* thread #45: tid = 0x5bb87f, 0x00000001065eb80b XUL`CrashReporter::OnChildProcessDumpRequested(aContext=0x0000000000000000, aClientInfo=0x000000013ad03a90, aFilePath=0x000000013ad03a80) + 27 at nsExceptionHandler.cpp:2635, stop reason = breakpoint 2.1
* frame #0: 0x00000001065eb80b XUL`CrashReporter::OnChildProcessDumpRequested(aContext=0x0000000000000000, aClientInfo=0x000000013ad03a90, aFilePath=0x000000013ad03a80) + 27 at nsExceptionHandler.cpp:2635
frame #1: 0x000000010664255c XUL`google_breakpad::CrashGenerationServer::WaitForOneMessage(this=0x0000000139795620) + 764 at crash_generation_server.cc:136
frame #2: 0x000000010664223d XUL`google_breakpad::CrashGenerationServer::WaitForMessages(server=0x0000000139795620) + 29 at crash_generation_server.cc:95
frame #3: 0x00007fff8305d772 libsystem_c.dylib`_pthread_start + 327
Okay, uh… so is that being run in the bad case?
Bah. I'm out of my depth. Needinfo'ing ted.
Ted sez:
5:59 PM
<
@
ted
> mconley: by way of
https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc#370
I'm noticing these before a crash:
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized
IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value
###!!! [Parent][MessageChannel] Error: (msgtype=0x200084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
Is it possible that the IPC library is doing a killhard of the child on an error message that we're not prepared to handle?
Maybe. I'd have to talk to billm about that maybe.
What I do know is
that ExceptionHandler::WriteMinidumpWithException isn't being
called… that's what's called in the "show crash report" case.
Okay, this is getting pretty insane. First of all, somehow, I'm able to
set breakpoints inside XCode and walkthrough an OPT BUILD. WHAT THE
FUCK. That's some H.P. Lovecraft style revelations here. I feel like I'm
staring at non-Euclidean geometry. The horror! The majesty! TL;DR:
Everything I know is wrong.
Uh. I can't even deal with that right now. Let's pretend that the data
I'm getting out of this is useful. I'll curl up into a fetal ball and
rock myself to sleep later.
Soooooooooo - I'm starting to suspect that the ContentParent::KillHard
is sending a SIGTERM to the content process. I can see that SIGTERM
being sent at the end of ContentParent::KillHard
via ProcessWatcher::EnsureProcessTerminated.
SO, hypothesis: This is a KillHard abort that does not result in a crash report.
Test: Add a magical??? breakpoint to ContentParent::KillHard, reproduce the crash and see if we hit the breakpoint.
Result:
No
. The parent never enters ContentParent::KillHard. Uh… so EnsureProcessTerminated is being called by something else?
Noooo - I think it's happening through ContentParent::ActorDestroy -
after we attempt to get a crash report, we
do DelayedDeleteSubprocess which kills the process.
So ActorDestroy… the IPC pipe is being shut down. Why? Did the child return false for something?
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized
IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value
So… uh, it looks like something is sending SIGTERM to the content
process. And then the content process doesn't handle that and just shuts
down without creating a minidump.
Wai?
Hypothesis: Child is returning false for some IPC message. This causes IPC layer to flip out and kill the child with SIGTERM.
Test: Make RecvSwappedWithOtherRemoteLoader return false, and then tear out a tab to crash.
Result:
No
, we still get a crash report. Plus this error message:
IPDL protocol error: Handler for SwappedWithOtherRemoteLoader returned error code
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x20008C,name=PBrowser::Msg_SwappedWithOtherRemoteLoader) Processing error: message was deserialized, but the handler returned false (indicating failure)
[Child 23849] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
[Child 23849] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
So I think we handle that case OK.
Hypothesis: Parent asks child to allocate something. Child returns nullptr. Parent flips out and kills it with SIGTERM.
Test: Make TabChild::AllocPWebBrowserPersistDocumentChild return nullptr, and then try to save a page.
Result:
No
, we still get a crash report, and the following error message:
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x200004,name=PBrowser::Msg_PWebBrowserPersistDocumentConstructor) Value error: message was deserialized, but contained an illegal value
[Child 24092] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
[Child 24092] ###!!! ABORT: Content child abort due to IPC error: file /Users/mikeconley/Projects/mozilla-central/dom/ipc/ContentChild.cpp, line 2005
###!!! [Parent][MessageChannel] Error: (msgtype=0x200084,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv
Huh. So let's go back to the error messages that we were seeing:
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Payload error: message could not be deserialized
IPDL protocol error: Error deserializing 'data' (Shmem) member of 'SurfaceDescriptorShmem'
IPDL protocol error: Error deserializing 'SurfaceDescriptor'
###!!! [Child][DispatchAsyncMessage] Error: (msgtype=0x800006,name=PLayerTransaction::Msg_PTextureConstructor) Value error: message was deserialized, but contained an illegal value
Let's try to come up with a scenario here that could explain this.
Where is the PTextureConstructor sent?
Hm, a few places:
ImageBridgeChild::CreateTexture and ShadowLayerForwarder::CreateTexture
Reminds me of….. bug
1134252
.
That had an error message of
"'(msgtype=0x200002,name=PBrowser::Msg_PRenderFrameConstructor) Value error: message was deserialized, but contained an illegal value'"
And that was because the PARENT was returning nullptr for the constructor.
* thread #25: tid = 0x630940, 0x00000001016bb474 XUL`base::KillProcess(process_id=24855, exit_code=<unavailable>, wait=false) + 20 at
process_util_posix.cc:70
, name = 'Compositor', stop reason = breakpoint 1.2
* frame #0: 0x00000001016bb474 XUL`base::KillProcess(process_id=24855, exit_code=<unavailable>, wait=false) + 20 at process_util_posix.cc:70
frame #1: 0x00000001016d9424 XUL`mozilla::ipc::FatalError(aProtocolName=<unavailable>, aMsg=<unavailable>, aOtherPid=<unavailable>, aIsParent=<unavailable>) + 308 at ProtocolUtils.cpp:324
frame #2: 0x00000001017905e8 XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x0000000000000000, v__=<unavailable>, msg__=<unavailable>, iter__=0x00000001129868b8) + 216 at PLayerTransactionParent.cpp:2423
frame #3: 0x000000010178c15a XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x00000001240d0160, v__=0x0000000112986878, msg__=0x0000000112986b20, iter__=0x00000001129868b8) + 170 at PLayerTransactionParent.cpp:4576
frame #4: 0x000000010178b05b XUL`mozilla::layers::PLayerTransactionParent::OnMessageReceived(this=0x00000001240d0160, msg__=0x0000000112986b20) + 1227 at PLayerTransactionParent.cpp:425
frame #5: 0x00000001018f7758 XUL`mozilla::layers::PCompositorParent::OnMessageReceived(this=<unavailable>, msg__=0x0000000112986b20) + 232 at PCompositorParent.cpp:509
frame #6: 0x00000001016d6896 XUL`mozilla::ipc::MessageChannel::DispatchAsyncMessage(this=0x000000011c20ec68, aMsg=0x0000000112986b20) + 118 at MessageChannel.cpp:1382
frame #7: 0x00000001016d6011 XUL`mozilla::ipc::MessageChannel::DispatchMessage(this=0x000000011c20ec68, aMsg=<unavailable>) + 241 at MessageChannel.cpp:1302
frame #8: 0x00000001016d2d77 XUL`mozilla::ipc::MessageChannel::OnMaybeDequeueOne(this=0x000000011c20ec68) + 167 at MessageChannel.cpp:1273
frame #9: 0x00000001016b4524 XUL`MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [inlined] MessageLoop::RunTask(task=0x000000011c21f890, this=0x0000000112986d20) + 132 at message_loop.cc:364
frame #10: 0x00000001016b4513 XUL`MessageLoop::DeferOrRunPendingTask(this=0x0000000112986d20, pending_task=<unavailable>) + 115 at message_loop.cc:372
frame #11: 0x00000001016b483a XUL`MessageLoop::DoWork(this=0x0000000112986d20) + 170 at message_loop.cc:459
frame #12: 0x00000001016b5046 XUL`base::MessagePumpDefault::Run(this=0x0000000112ccb520, delegate=0x0000000112986d20) + 758 at message_pump_default.cc:34
frame #13: 0x00000001016b40ad XUL`MessageLoop::Run() [inlined] MessageLoop::RunInternal(this=<unavailable>) + 77 at message_loop.cc:234
frame #14: 0x00000001016b409e XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:227
frame #15: 0x00000001016b409e XUL`MessageLoop::Run(this=<unavailable>) + 62 at message_loop.cc:201
frame #16: 0x00000001016bc43b XUL`base::Thread::ThreadMain(this=0x0000000112ccb430) + 187 at thread.cc:170
frame #17: 0x00000001016bc49a XUL`ThreadFunc(closure=<unavailable>) + 10 at platform_thread_posix.cc:39
frame #18: 0x00007fff8305d772 libsystem_c.dylib`_pthread_start + 327
* frame #0: 0x00000001016bb474 XUL`base::KillProcess(process_id=24855, exit_code=<unavailable>, wait=false) + 20 at process_util_posix.cc:70
frame #1: 0x00000001016d9424 XUL`mozilla::ipc::FatalError(aProtocolName=<unavailable>, aMsg=<unavailable>, aOtherPid=<unavailable>, aIsParent=<unavailable>) + 308 at ProtocolUtils.cpp:324
frame #2: 0x00000001017905e8 XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x0000000000000000, v__=<unavailable>, msg__=<unavailable>, iter__=0x00000001129868b8) + 216 at PLayerTransactionParent.cpp:2423
frame #3: 0x000000010178c15a XUL`mozilla::layers::PLayerTransactionParent::Read(this=0x00000001240d0160, v__=0x0000000112986878, msg__=0x0000000112986b20, iter__=0x00000001129868b8) + 170 at PLayerTransactionParent.cpp:4576
frame #4: 0x000000010178b05b XUL`mozilla::layers::PLayerTransactionParent::OnMessageReceived(this=0x00000001240d0160, msg__=0x0000000112986b20) + 1227 at PLayerTransactionParent.cpp:425
frame #5: 0x00000001018f7758 XUL`mozilla::layers::PCompositorParent::OnMessageReceived(this=<unavailable>, msg__=0x0000000112986b20) + 232 at PCompositorParent.cpp:509
frame #6: 0x00000001016d6896 XUL`mozilla::ipc::MessageChannel::DispatchAsyncMessage(this=0x000000011c20ec68, aMsg=0x0000000112986b20) + 118 at MessageChannel.cpp:1382
frame #7: 0x00000001016d6011 XUL`mozilla::ipc::MessageChannel::DispatchMessage(this=0x000000011c20ec68, aMsg=<unavailable>) + 241 at MessageChannel.cpp:1302
frame #8: 0x00000001016d2d77 XUL`mozilla::ipc::MessageChannel::OnMaybeDequeueOne(this=0x000000011c20ec68) + 167 at MessageChannel.cpp:1273
frame #9: 0x00000001016b4524 XUL`MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [inlined] MessageLoop::RunTask(task=0x000000011c21f890, this=0x0000000112986d20) + 132 at message_loop.cc:364
frame #10: 0x00000001016b4513 XUL`MessageLoop::DeferOrRunPendingTask(this=0x0000000112986d20, pending_task=<unavailable>) + 115 at message_loop.cc:372
frame #11: 0x00000001016b483a XUL`MessageLoop::DoWork(this=0x0000000112986d20) + 170 at message_loop.cc:459
frame #12: 0x00000001016b5046 XUL`base::MessagePumpDefault::Run(this=0x0000000112ccb520, delegate=0x0000000112986d20) + 758 at message_pump_default.cc:34
frame #13: 0x00000001016b40ad XUL`MessageLoop::Run() [inlined] MessageLoop::RunInternal(this=<unavailable>) + 77 at message_loop.cc:234
frame #14: 0x00000001016b409e XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:227
frame #15: 0x00000001016b409e XUL`MessageLoop::Run(this=<unavailable>) + 62 at message_loop.cc:201
frame #16: 0x00000001016bc43b XUL`base::Thread::ThreadMain(this=0x0000000112ccb430) + 187 at thread.cc:170
frame #17: 0x00000001016bc49a XUL`ThreadFunc(closure=<unavailable>) + 10 at platform_thread_posix.cc:39
frame #18: 0x00007fff8305d772 libsystem_c.dylib`_pthread_start + 327
From jld:
"From a quick look at the generated code, it looks as if the only cases where mozilla::ipc::FatalError is called if there's an out-of-range IPDL union discriminant or if a ParamTraits<T>::Read (i.e., deserialization) routine fails.
Most of the reasons that can happen mean the data stream is corrupt, but one possibility is a FallibleTArray when the receiver's allocation fails.
It looks like the other cases (Recv* returning false, failed Alloc* on the receive side) just return errors from the OnMessageReceived method, which is probably where we're tapping in to do the paired minidump?"
From billm:
"I suspect the logic here has to do with whether it's reasonable to recover from the error or not. If we can't deserialize some data, then we're unlikely to be able to understand future messages from the child either, so we should kill it. Things like message handlers returning false are potentially recoverable.
I'd suggest we add some sort of ability for individual protocols to override the FatalError behavior. However, the default behavior if you don't implement this method would be to MOZ_CRASH() (which would definitely send us a crash report). Then we would override the FatalError behavior for CrossProcessCompositorParent to do something reasonable. Maybe it could post a message to the main thread asking it to take a minidump and kill the child. We'll need to be careful to close the CompositorParent channel right away though so that we don't get any more bad messages coming in."
Question:
billm suggests using MOZ_CRASH within FatalError in order to definitely
send a crash report. Since this code is presumably running in the
parent process, is he suggesting we crash the parent here?
Yes, yes he is. And that way, we can get an idea of how often this happens.
2:31 PM
<
mconley
> jimm: hey - can I pick your brain a bit about bug 1191976?
Perhaps jld wants to listen in too - he seems to know a lot about this
stuff...
2:33 PM
<
jimm
> mconley: sure, just catching up on that bug
2:33 PM
<
mconley
> thanks
2:33 PM
<
jimm
> "So, to be clear, we want a minidump from the process that saw FatalError."
2:33 PM
<
jimm
> that's the browser right?
2:33 PM
<
jimm
> on the compositor thread
2:33 PM
<
mconley
> jimm: in the case of this bug, yes - but (I believe) that can be any process.
2:34 PM
<
mconley
> like, FatalError is hit if serialization fails on either side
2:34 PM
<
mconley
> so, I'm pretty sure FatalError could be hit on parent, content and plugin side.
2:34 PM
<
jld
> That's a good point.
2:34 PM
→
billm
joined (billm@moz-a8k9ll.mh2d.1ibb.0101.2620.IP)
2:34 PM
<
jimm
> hmm, sounds like a simple MOZ_CRASH in place of that kill the other side code is what you want
2:35 PM
<
mconley
> jimm: which is what billm also suggested
2:35 PM
<
jld
> Thought if a child process is the receiver, then there's no
downside to MOZ_CRASH(), because it was going to do that anyway?
2:35 PM
<
mconley
> jimm: but in the case of the parent process, this brings down the whole browse.r
2:35 PM
<
jld
> s/ght/gh/
2:35 PM
<
mconley
> s/.r/r.
2:35 PM
<
jimm
> yep
2:35 PM
<
jimm
> mconley:
https://crash-stats.mozilla.com/report/list?product=Firefox&range_value=7&range_unit=days&date=2015-08-18&signature=mozalloc_abort%28char+const*+const%29+|+NS_DebugBreak+|+mozilla%3A%3Aipc%3A%3AFatalError%28char+const*%2C+char+const*%2C+unsigned+long%2C+bool%29+|+mozilla%3A%3Adom%3A%3APContentChild%3A%3AFatalError%28char+const*+const%29+|+mozilla%3A%3Adom%3A%3APContentChild%3A%3ARead%2
2:35 PM
<
jimm
> 8mozilla%3A%3Adom%3A%3AIPCDataTransferIt...&version=Firefox%3A42.0a2
2:36 PM
<
mconley
> jimm: we're certain we're willing to make that trade-off?
2:36 PM
<
jimm
> do we have any sense for how often this happens in the wild?
2:36 PM
<
mconley
> because FatalError on the parent side should be recoverable,
I'd think - in that, we kill the process that sent us the message, and
spew out a minidump for our own process
2:36 PM
<
mconley
> jimm: I don't think we do.
2:37 PM
<
mconley
> like, MOZ_CRASH'ing here would certainly give us an idea o fit
2:37 PM
<
mconley
> s/o fit/of it
2:37 PM
<
jld
> ...gah. The important part of the stack is off the end of the "..." in that signature.
2:37 PM
<
mconley
> like, that's one way of finding out. :)
2:37 PM
<
jimm
> jld:
https://crash-stats.mozilla.com/report/index/e3ee109d-376e-42f3-8d61-503332150817
2:38 PM
<
jld
> jimm: No, not the URL being truncated on IRC (I pasted it
back together), the
"mozilla::dom::PContentChild::Read(mozilla::dom::IPCDataTransferIt..."
2:38 PM
<
mconley
> ah - so it looks like if we hit FatalError in the child, that we're doing the right thing
2:38 PM
<
mconley
> s/that/then
2:38 PM
<
jimm
> ah sorry
2:38 PM
<
mconley
> geez, I'm fat fingering like crazy
2:39 PM
<
jimm
> mconley: in the case of the compositor crash, I think
crashing the browser would be ok assuming you push it to try for a full
test run and don't see anything ugly happen.
2:39 PM
<
jimm
> why do we think this is a problem on the compositor side vs. the main thread side?
2:40 PM
<
jimm
> sounds like the data we received was messed up
2:40 PM
<
jimm
> oh, wait, this message came from the child?
2:40 PM
<
mconley
> Yes - we receive a messed up message from the child that we can't deserialize
2:40 PM
<
jimm
> same question though..
2:40 PM
<
jimm
> hmm ok
2:40 PM
<
jld
> ...or, wait. IPCDataTransferItem is useful information.
2:41 PM
<
mconley
> So we're doing the right thing in killing the content process, I think. The new plan is to MOZ_CRASH instead.
2:41 PM
<
mconley
> I guess I'm just hesitant to MOZ_CRASH
2:41 PM
<
jld
> But it would be nice if there were more characters there, or
fewer were taken up the stuff leafward of the actor's FatalError.
2:42 PM
<
mconley
> because I feel like there should be a good way, in the
parent process FatalError case, to kill the content process still, and
take a minidump from ourselves
2:42 PM
<
mconley
> and somehow convince about:tabcrashed to submit _that_ minidump
2:42 PM
<
jimm
> mconley: how does crashing the browser help us diagnose?
2:42 PM
<
jld
> I don't much like MOZ_CRASH()ing either. I guess if the
child process is really compromised then there are easier ways to DoS
the parent process.
2:43 PM
<
jld
> mconley: And that was what I suggested in the bug: however we get a live-dump-of-self for paired minidumps, do that.
2:43 PM
<
mconley
> jimm: it's pretty brute force. It's a quick and dirty way of ensuring we get a minidump from the parent.
2:43 PM
<
mconley
> jld: yeah, I think I'd like to lean that way if I can
2:43 PM
<
jld
> But I haven't tried going into the code to see how much of the important parts of that are thread-safe.
2:44 PM
<
mconley
> jimm says that the CrashReporterParent is reasonably threadsafe
2:44 PM
<
jimm
> noo
2:44 PM
zz_miketaylr →
miketaylr
2:44 PM
<
mconley
> oh, sorry, misquoted
2:44 PM
<
jimm
> not CrashReporterParent
2:45 PM
<
mconley
> "the code in CrashReporterParent is a light wrapper around
the CrashReporter apis which do most of the work, and I believe these
routines are mostly multithread safe" <-- so I was wrong, not
CrashReporterParent, but the crash reporter backend.
2:45 PM
<
jimm
> yeah
2:46 PM
<
jimm
> mconley: do we know what ipdl message this happen for?
2:47 PM
<
mconley
> jimm: for this particular bug, it's Msg_PTextureConstructor
2:47 PM
<
mconley
> jimm: but the cause of this FatalError isn't particularly
interesting - it's our reaction to the crash that's interesting
2:47 PM
<
mconley
> *sigh*, s/crash/FatalError
2:48 PM
<
&
billm
> I just feel like we should get a sense for how common this is before we put a lot of effort into it
2:48 PM
<
jimm
> yeah our handling there is messy.
2:48 PM
<
mconley
> yeah, reading this stuff, I'm getting a sense of it. :/
2:48 PM
<
mconley
> okay
2:48 PM
<
jimm
> billm: pretty sure CrashReporter apis can be called on any thread.
2:49 PM
<
jimm
> ted: * ?
2:49 PM
<
jimm
> ted: ^ ?
2:49 PM
<
mconley
> billm: okay, I think I'm pursuaded
2:49 PM
<
mconley
> persuaded
2:49 PM
<
jimm
> I'm not suggesting calling ContentParent's KillHard there
2:49 PM
<
jimm
> just copying a bit of the logic over
2:50 PM
<
jld
> The core crash reporting stuff *should* be fairly thread-safe, because any thread can crash.
2:51 PM
<
mconley
> welp
2:51 PM
<
jimm
> I don't see how a browser stack helps us here though.. so I don't understand why we want to crash the browser.
2:51 PM
<
mconley
> let's put the spaghetti in the machine
2:51 PM
<
jimm
> unless we just want numbers
2:51 PM
⇐
bholley
quit (
bholley@moz-qpv0dk.ca.comcast.net
) Quit: My MacBook Pro has gone to sleep. ZZZzzz…
2:51 PM
<
jimm
> ok :)
2:51 PM
<
jld
> jimm: Because it shows us what we failed to deserialize.
2:51 PM
<
mconley
> jimm: in the case of a FatalError in the parent process,
it's because the minidump in the parent will (hopefully) tell us which
protocol the bad message was sent over
2:51 PM
<
jimm
> MOZ_CRASH will give you that for sure!
2:51 PM
<
mconley
> which is something, I guess
2:52 PM
—
mconley
whips up a MOZ_CRASH patch
2:52 PM
<
mconley
> right
2:52 PM
<
jimm
> mconley: thought we already new that
2:52 PM
<
mconley
> which we should strive for
2:52 PM
<
mconley
> jimm: me too, but you said "I don't see how a browser stack helps us here though"
2:52 PM
<
mconley
> maybe I misunderstood
2:52 PM
<
jld
> MOZ_RELEASE_ASSERT()ing that we don't Write() bytes that
won't be Read()able would also work, but there might be a performance
cost, and it also won't help stuff like FallibleTArray OOM.
2:53 PM
<
mconley
> alright, well, thanks for jamming with me on this one, folks. :)
2:54 PM
<
jld
> But right now we're not getting *any* reporting for
child->parent Read() failure unless the user takes the initiative to
file a bug, if I understand correctly?
2:54 PM
<
mconley
> yep, and even then, the information would be scant
2:54 PM
<
mconley
> no minidumps
2:55 PM
<
jld
> Yeah, this one had a usable STR, mercifully.
2:55 PM
<
jimm
> hmm, thought we had stacks already. mike mentioned the message in question was Msg_PTextureConstructor
2:55 PM
<
mconley
> jimm: I only found that out because of the logging we do in the FatalError
2:56 PM
<
mconley
> and because it turns out that I can attach XCode to an opt build and get useful crash information. :)
2:56 PM
<
jimm
> I see so we need crash reports to get a better idea of when this happens. gotcha.
2:56 PM
—
mconley
nods
2:56 PM
<
jimm
> crash away!
2:56 PM
<
mconley
> when, and how frequently
2:56 PM
<
mconley
> let's crash this mutha
2:57 PM
<
jimm
> mconley: what's the bug number? I can keep an eye on things after you go on pto.
2:57 PM
<
jld
> Maybe also useful: MOZ_DIAGNOSTIC_ASSERT, which asserts on
all nightly/aurora builds (and debug builds of any train).
2:57 PM
<
mconley
> jimm: bug 1191976
2:57 PM
<
jimm
> thxc
2:58 PM
<
jld
> I don't know that it's the right answer here, but it's a tool we have if we don't want to crash beta/release builds.
2:58 PM
<
mconley
> jld: hmmm - yes, that sounds appropriate
2:58 PM
<
mconley
> at least, the behaviour
Ok, using MOZ_DIAGNOSTIC_ASSERT.
Bah, let's just use MOZ_CRASH. Eff it.
Also, adding annotations as per billm's request.