Bug 1068349 - IPC message handlers that return false cause child process to crash without a crash report
Picking this up from
bug 1099274
. Summary from billm:
"There are a number of places where ContentParent.cpp calls KillHard().
One of them is in ContentParent::ProcessingError. In this case, the
parent has decided that a message sent by the child was malformed or
something and we should kill the child. However, we don't submit a crash
report in that case. For sync messages, a crash report would tell us
why the child sent the message. For async messages, it would crash at a
random spot, but at least we would know about the crash."
and ted:
"We basically want to call CreatePairedMinidumps in this situation:
http://hg.mozilla.org/mozilla-central/annotate/4c449dcda453/dom/ipc/CrashReporterParent.h#l112
Not sure if we need other plumbing around that to make it get submitted
or if the "child died, look for minidump" logic handles that."
Maybe it's really just that simple - maybe we just need to call CreatePairedMinidump in the KillAll method.
Let's try that and see what happens.
Hm - there are two functions that sound interesting:
CreatePairedMinidump, and GeneratePairedMinidump. What's the difference
between these two?
It looks like GeneratePairedMinidump calls CreatePairedMinidump. What else does it do?
Seems to set mChildDumpID to the generated dump ID as well.
Here it is being used for the plugin process:
http://hg.mozilla.org/mozilla-central/file/8c02f3280d0c/dom/plugins/ipc/PluginModuleParent.cpp#l592
- I think this deals with plugin hangs.
I can probably use that as an example.
So how does it work here in PluginModuleParent? … well, first it gets itself a handle on the associated crash reporter...
Then it annotates it, calls it a plugin hang.
Calls crashReporter->GeneratePairedMinidump and passes this, where this is the PluginChromeModuleParent.
So, can I do something similar in ContentParent when it calls KillHard?
Isn't KillHard similar to PluginChromeModuleParent's
TerminateChildProcess? They both end up calling KillProcess, so while
I'm inclined to believe that some of the details might be different, the
idea is very much the same.
So let's see if I can do something similar for ContentParent then.
But first, tea.
Ok, tea acquired. Let's rock and roll.
So what does GeneratePairedMinidump expect as an argument? A "Toplevel*".
What is a Toplevel*?
I'm having trouble finding its definition...
I'm seeing some Toplevel*'s called "actor", so I wonder if this is
something that the IPC libraries give us? Maybe we can just pass in
ContentParent and call it a day?
Let's go on that assumption for now.
Took a little break there for some meetings and some food, some code review, and to land another thing. Back at this now.
Ah, failing right now because I (again) forgot to
set MOZ_CRASHREPORTER env variable. Repeat after me - the crash
reporter is disabled by default in debug builds.
Hey, it worked! Woo! That's awesome. I can get a crash report generated here.
So now… I have a few choices:
http://hg.mozilla.org/mozilla-central/annotate/4c449dcda453/dom/ipc/CrashReporterParent.h#l112
- I can just post what I've got and get feedback
- I can take a closer look at what PluginChromeModuleParent annotates the crash report with and try to adapt it for our new crash reports here, and then ask for feedback.
Gonna go to 2.
NS_LITERAL_CSTRING("1"));
GenerateCrashReport is a combination function that does multiple things. First it takes the minidump (finds it on the filesystem), gets the ID of that minidump, and then calls GenerateChildData. If I already have the ID of the minidump, then all I need is GenerateChildData.
- Set a member variable in ContentParent for the child minidump if we've already created it
- Alter ContentParent::ActorDestroy so that if a minidump pre-exists, instead of calling GenerateCrashReport, we call GenerateChildData to annotate the pre-existing minidump properly.
And I guess that'd be enough? Maybe? bsmedberg says to give it a shot. So let's do it.
"mdsw_status_string": "ERROR_NO_THREAD_LIST",
"json_dump": {
"status": "ERROR_NO_THREAD_LIST",
"sensitive": {
}
},
"mdsw_return_code": 0,
"success": false
},
0000 0000 4c62 8b54 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000
Posting for review...
Landed, but bounced. :(
ted sez:
"It looks like the browser-chrome tests under
browser/components/customizableui/test launch a lot of child processes,
so maybe one of them is hitting this KillHard path you've added, and it
now causes a failure because the minidumps get written?
I do see some spew like this in the log:
11:13:47 INFO - [Parent 2384] WARNING:
pipe error (49): Connection reset by peer: file
/builds/slave/fx-team-lx-0000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc,
line 457
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
11:13:47 INFO - ###!!! [Parent][OnMaybeDequeueOne] Error: Channel error: cannot send/recv
The B2G desktop mochitests seem to be hitting something similar. It's
possible this is something that has always happened and your patch turns
it into a test failure."
So I've reduced the tests in customizableui that cause the stack to appear to 2 subtests:
[browser_878452_drag_to_panel.js]
[browser_880382_drag_wide_widgets_in_panel.js]
The crash seems to depend on both of these tests to run. Interesting. I assume we're hitting kill hard, right?
I would assume so, since we're generating paired minidumps. Let's find out why we're hitting KillHard.
Building a debug build...
Huh. Can't reproduce in a debug build.
I can, however, introduce some logging. Here's what I'm seeing:
First of all - recall that when entering customize mode, we go to
about:customizing, which is not in our whitelisted URIs for about:
pages. We usually transition to about:customizing from about:blank,
which
is
whitelisted.
What that means is that we're going from a single remote tab to a
single non-remote tab, and we do this several times in these tests, back
and forth.
Since the content process is torn down when the last remote tab closes,
this means that many separate content processes are being spawned, and
then being asked to shutdown during the entirety of these tests.
Actually, for these two tests, here's how it's looking when run with --e10s:
- Browser starts with about:blank as remote tab - content process boots
- browser_878452_drag_to_panel.js starts, which moves about:blank to about:customizing - message is sent to content process to die via ContentParent::NotifyTabDestroying - a KillHard timeout is started.
- Child receives message and shuts down successfully.
- We exit customizing mode, which puts us back at about:blank, which spawns a new content process
- We immediately move back into customizing mode for browser_880382_drag_wide_widgets_in_panel.js. Message is sent to content process to die via ContentParent::NotifyTabDestroying. A KillHard timeout is started.
- Child never receives this message!
- KillHard is called, and we create the minidumps, which results in test failure.
Questions to be answered:
Which KillHard is being obeyed in 7? I assume it's the one fired in 5, but are we sure it's not the one fired in 2?
Yes - I am sure. ContentParent::~ContentParent is cancelling the KillHard timeout before it can fire.
So the question is: why isn't the child receiving the message to shut down?
That's the big question here - I think if I can crack that, I've got the answer to this puzzle.
My hypothesis is that teardown is occurring before setup has ever
really finished - like, the IPC communications channel is not set up
yet, so we can't even tell the child to shut down.
So yes, it looks like TabDestroyed has never been called after TabDestroying. WHY???
Ah… so I talked to bent. It looks like a timer issue. Like, if I up the
timeout for KillHard to 100 seconds, everything is hunky-dory.
Here's the explanation of what's going on:
In order for the content process to respond to messages, it needs to
fire up XPCOM and start spinning its event loop. Only then can it react
to messages. We have a 5 second default timeout on opt builds so that
when we send a destroy to the content process, if we don't hear back
over 5 seconds, we should just kill it hard (which causes our minidump).
What bent suggests is that for mochitests, we either up the timeout, or
even better, set dom.ipc.tabs.shutdownTimeoutSecs to 0 so that we
don't actually do the KillHard bit when running mochitests. Let's try
that.
Yes - this appears to work. So, let's do a try push with the builds that failed:
Linux opt (e10s mochitests), B2G Desktop Linux opt mochitests, B2G Desktop Linux x64 opt mochitests
Ok, try push is out:
smaug wants me to just try bumping up the delay a bit - trying 10 seconds. Also, I need to extend the delay for b2g as well.
And here we are just disabling it again:
GRAAWWARR so much rage