Bug 1110887 - With e10s, plugin crash submit UI is broken

While I'm waiting for a fresh build to compile, let's inspect what might be doing on here.

The place I should probably be looking at is PluginContent.jsm's pluginInstanceCrashed method. This method is what handles the PluginCrashed event.

Hrm - so killing the plugin-container from the Activity Monitor is no good. That doesn't send a crash report.

Using kill -ABRT [procid] on OS X should do the trick, according to ted.

So it looks as if the pluginDumpID is the empty string. So we have no crash report for the plugin - or at least, we're not passing up the crash ID.

So I don't think this was caused by bug 899347 , but probably by the plugin process refactoring that billm and jimm did for e10s.

Let's see if I can dig down to the place where we fire the PluginCrashed event and see if I can find out why we don't have a pluginDumpID...

Where do we connect the crash with a pluginDumpID? I feel like I should know that already… but I can't find a reference to pluginDumpID anywhere in my notes. Boooo...

Ok, dxr it is.

Hrm. Killing with kill -ABRT [procid] doesn't apparently give us a plugin dump ID. :(

OH FOR FUCKS SAKE. I've gotta set MOZ_CRASHREPORTER=1 for debug builds! BURNED BY THAT AGAIN. KJSBDFKJSHDFL:KJSDL:RUEOP*)#%OIULJDS:KFSDF

Ok, I'm able to get the dump created in the single process case, which is what I expected.

PluginModuleChromeParent::ProcessFirstMinidump seems to be the place that I want to set a breakpoint in the content process. In the single process case, this is where we get the mPluginDumpID.

Plugin crashes. Both parent and child see ActorDestroy. Parent gets the crash dump information.

Ok, so that's the problem - only the parent has the crash dump information.

  
   16:57
  
   (billm)
  
   mconley: that sounds about right. I'm just wondering if we can get 
away with the content process never needing the dump ID and just 
accessing chrome to submit it.

  
   16:58
  
   (mconley)
  
  billm: I think the only reason that the content process needs that 
information is so that it can tell the parent process which report to 
submit

  
   16:58
  
   (billm)
  
   mconley: ok, that's what I was wondering

  
   16:58
  
   (billm)
  
   mconley: does it ask it to submit using the content script somehow?

  
   16:58
  
   (mconley)
  
  yes

  
   16:59
  
   (mconley)
  
  billm:
  
   https://dxr.mozilla.org/mozilla-central/source/browser/modules/PluginContent.jsm#531

  
   16:59
  
   (billm)
  
   mconley: ok, cool. that should work fine with sandboxing. I guess it should be fine to pass down the dump ID.

  
   17:00
  
   (mconley)
  
  billm: alright - I'll see if I can hook that up

  
   17:00
  
   (mconley)
  
  jimm / billm: thanks

  
   17:01
  
   (billm)
  
   mconley: well, it might not be easy :-). still thinking.

  
   17:01
  
   (jimm)
  
   mconley: hey, sorry, getting side tracked here in a meeting

  
   17:01
  
   (mconley)
  
  no worries

  
   17:07
  
   (mconley)
  
  billm: I wonder if we should move this stuff out of PluginModuleParent and into PluginModuleChromeParent's ActorDestroy:
  
   https://dxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#1330

  
   17:08
  
   (billm)
  
   mconley: well, we somehow need to notify the child too so that it can put up its UI

  
   17:08
  
   (mconley)
  
  and then in PluginModuleChromeParent::NotifyPluginCrashed, if there's a
 PluginModuleContentParent, somehow send a message to it here?:

  
   17:08
  
   (mconley)
  
   https://dxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#1384

  
   17:08
  
   (mconley)
  
  right

  
   17:08
  
   (billm)
  
   mconley: it's sending the message that's tricky

  
   17:08
  
   (mconley)
  
  naive question - is the PluginModuleChromeParent aware of the PluginModuleContentParent?

  
   17:08
  
   (mconley)
  
  judging by your answer, I suspect not

  
   17:08
  
   (billm)
  
   mconley: no :-)

  
   17:08
  
   (mconley)
  
  ah hah

  
   17:09
  
   (mconley)
  
  shoot

  
   17:09
  
   (billm)
  
   mconley: the PluginModuleContentParent lives in the content process

  
   17:09
  
   (mconley)
  
  right

  
   17:09
  
   (billm)
  
   mconley: it links to a PluginModuleChild in the plugin process

  
   17:09
  
   (mconley)
  
  yep

  
   17:09
  
   (mconley)
  
  so it's a triangle without a connection between two points.

  
   17:09
  
   (billm)
  
  but there's no link between a PluginModuleChromeParent and a PluginModuleContentParent

  
   17:09
  
   (billm)
  
  I think we can do some sort of ID scheme on the PluginModuleChromeParent to make it work

  
   17:10
  
   (billm)
  
  and when the PluginModuleContentParent sees a crash, it would send a 
sync message to ContentParent asking for the crash dump ID for its ID

  
   17:11
  
   (mconley)
  
  hmmmm

  
   17:11
  
   (billm)
  
  basically, I think we have to have 
PluginModuleContentParent::ActorDestroy send a message to the 
ContentParent on a crash to get the dump ID

  
   17:11
  
   (mconley)
  
  and that ID would be... the unique ID for the plugin I guess?

  
   17:12
  
   (billm)
  
  well, it's a bit tricky. plugins already have an ID, but the ID persists across crashes, which we sort of don't want here

  
   17:12
  
   (billm)
  
  the situation I'm worried about is if a plugin crashes twice in quick succession.

  
   17:12
  
   (billm)
  
  the plugin will have the same mID in both cases, so we need a new ID that gets regenerated each time it crashes

  
   17:12
  
   (billm)
  
  otherwise we'll submit the same crash dump for both

  
   17:13
  
   (mconley)
  
  I see

  
   17:13
  
   (billm)
  
  although admittedly that's not such a huge deal...

  
   17:13
  
   (mconley)
  
  have the PluginModuleChild message back a guid to both parents?

  
   17:14
  
   (billm)
  
  we can make the ID in the parent process, since it runs code to generate each PluginModuleContentParent

  
   17:14
  
   (billm)
  
  let me find it

  
   17:15
  
   (billm)
  
   mconley: so when the child wants a new PluginModuleContentParent, it asks for it here:
  
   http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#983
  
   (via ContentParent)

  
   17:15
  
   (billm)
  
   mconley: although with async plugin initialization it's a bit different

  
   17:15
  
   (billm)
  
  that goes here:
  
   http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#94

  
   17:16
  
   (billm)
  
   mconley: somehow the ID would be generated in there

  
   17:16
  
   (mconley)
  
  this is pretty intense stuff

  
   17:17
  
   (mconley)
  
  alright

  
   17:17
  
   (mconley)
  
  let me see if I can sum up the plan

  
   17:19
  
   (mconley)
  
  somewhere in
  
   http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#94
  
  , I generate a unique ID for plugin instantiation that is shared 
between PluginModuleContentParent and PluginModuleChromeParent. When the
 plugin crashes, have the PluginModuleContentParent react to 
ActorDestroy by sending a sync message the ContentParent asking for 
browser and plugin dump IDs

  
   17:19
  
   (mconley)
  
  billm: somehow the ContentParent is going to need to know which PluginModuleChromeParent to talk to to get those

  
   17:20
  
   (mconley)
  
  but let's pretend that's not too hard

  
   17:20
  
   (mconley)
  
  ContentParent sends that information back down, the 
PluginModuleContentParent passes those dump IDs along to the mPlugin 
instance it's holding on to

  
   17:20
  
   (mconley)
  
  and everything should just kinda work

  
   17:21
  
   (billm)
  
   mconley: I think we could just have a static hashtable somewhere that
 maps an ID to a dump ID. the PluginModuleChromeParent could stash the 
dump ID there when the crash happens.

  
   17:21
  
   (billm)
  
   mconley: then the message handler in ContentParent to return the dump ID could look there

  
   17:22
  
   (mconley)
  
  billm: is ContentParent the right location for that hashtable?

  
   17:22
  
   (mconley)
  
  er

  
   17:22
  
   (mconley)
  
  static hashtable you said

  
   17:22
  
   (billm)
  
   mconley: probably a static member of PluginModuleChromeParent

  
   17:22
  
   (mconley)
  
  I see, ok

  
   17:22
  
   (billm)
  
   mconley: I'm a little worried about races

  
   17:22
  
   (mconley)
  
  billm: the XPCOM guidelines for hashtables tells me that I should use a
 hashtable for small quantities of data... does that no longer hold?

  
   17:22
  
   (mconley)
  
  er

  
   17:22
  
   (mconley)
  
  should _not_ use

  
   17:23
  
   (billm)
  
   mconley: hopefully PMChromeParent::ActorDestroy always happens before
 we get the sync message from the child asking for the dump ID

  
   17:23
  
   (billm)
  
   mconley: what does it say to use instead?

  
   17:23
  
   (billm)
  
   mconley: I guess you could use an array

  
   17:24
  
   (mconley)
  
  billm:
  
   https://developer.mozilla.org/en-US/docs/Detailed_XPCOM_hashtable_guide

  
   17:24
  
   (mconley)
  
  dunno if this document is still relevant

  
   17:24
  
   (mconley)
  
   https://developer.mozilla.org/en-US/docs/Detailed_XPCOM_hashtable_guide#When_Should_I_Use_a_Hashtable.3F
  
  , specifically

  
   17:24
  
   (billm)
  
   mconley: I dunno. it's true that it won't be efficient, but there's only one per process, so it's not a big deal

  
   17:24
  
   (billm)
  
   mconley: you can use an array too

  
   17:25
  
   (mconley)
  
  billm: ok, I think I'm starting to wrap my head around this

  
   17:25
  
   (mconley)
  
  let me scribble a patch out and we'll fun it up the flagpole and see if anybody salutes

  
   17:25
  
   (mconley)
  
  er,

  
   17:25
  
   (mconley)
  
  run. Run it up the flagpole.

  
   17:25
  
   (billm)
  
  :-)

  
   17:25
  
   (mconley)
  
  billm: thanks!

Plan of record:

Somewhere inside where we set up the bridge with PluginModuleChromeParent and PluginModuleContentParent, create a unique identifier (I'll call it ID) for the plugin run. When the plugin crashes, have the PluginModuleContentParent::ActorDestroy send a synchronous message to the ContentParent asking for the dump information associated with the ID.

Here's where we might get a bit race-y.

When the PluginModuleChromeParent::ActorDestroy gets called, have it stash the browser and plugin dump ID information in a mapping somewhere.

_Hopefully_ we can guarantee that PluginModuleChromeParent::ActorDestroy gets called first. If that's the case, then the ContentParent can get the browser and plugin dump IDs from the mapping, and send those down to the child. The child can then pass those along to the plugin to dispatch the PluginCrashed event, and the in-content plugin crash UI should just work (it'll just send the dump IDs back up to the parent, and the parent will do the submitting work).

So what is the potentiality of races here? Are billm's worries valid?

Let's test. I'll add breakpoints to both PluginModuleChromeParent::ActorDestroy and PluginModuleContentParent::ActorDestroy and see if I consistently hit the first breakpoint first.

No - that's not a good test. Because there's the message that the ContentParent needs to respond to for getting the dump IDs back.

So what I really need to do is have two breakpoints, both in the parent process.

One at ActorDestroy, and one at the handler for a message from the ActorDestroy in the child process.

~~Build confidence that we won't race between ActorDestroy message from plugin and get dump IDs message from content.~~ Yeah right - see bsmedberg's comment below.

~~When we set up the bridge in the parent process , have a plugin run ID be assigned to it. Somehow pass this to content.~~

~~We need this ID because the PluginModuleChromeParent doesn't have direct access to PluginModuleContentParent. There's no mapping between the two of them that I know of.~~

~~How… how do I get that ID down to PluginModuleContentParent?~~

~~Create some kind of mapping in the parent process from plugin run IDs to browser dump IDs and plugin dump IDs~~

~~Hook up RecvGetPluginCrashDumpsIDs to grab and return the items from the mapping (and probably remove the items from the mapping)~~

~~See if we can activate some plugin crash tests for e10s?~~

~~Questions:~~

~~In the mapping… what happens if the plugin and browser dump IDs are never claimed? Do they just stick around forever? Or is there some kind of expiry?~~

HOLD THE PHONE.

From bsmedberg:

"I'm concerned about this plan. I feel like the content process should not have access to the crash ID, and the presentation changes necessary for crash submission in the Firefox UI should be driven by the chrome process.

Discussing how frequent a race might occur seems odd; if we're going to rely on a certain event ordering, we should be able to prove that it will be delivered this way. The order that processes receive pipe error notifications from the OS seems inherently racy. I don't see how you can reason about this ordering given the setup here to build a reliable system."

Yikes.

I guess it hadn't occurred to me that broken pipe errors from the OS were causing the IPC library to call the ActorDestroy methods - though it does make the most sense. And we really can't rely on the ordering there from the OS, can we?

It sounds like what bsmedberg wants me to investigate is whether or not I can alter the in-content crash reporting UI for plugins such that it doesn't need to rely on the browser / plugin dump IDs in order to function properly.

Let's investigate that.

We'll need some way of mapping the message from a remote browser to a set of plugin / browser dump IDs being held in the parent… does the parent actually get a chance to store that stuff at the chrome JS level? Like, I know the PluginModuleChromeParent hears about it via ActorDestroy, and it parses the minidumps and stuff, but does

Idea - create some unique plugin run ID. When plugin crashes, parent process needs to hear about it, collect dump IDs, and then all windows get told about the crash, passing in the plugin run ID. Every instance of that plugin with a matching plugin run ID gets the crashed plugin UI shown. Once a crash report is submitted, all windows hear about it again, and the crash UI is updated so that multiple reports are not submitted.

Before I continue - I want to know - are there other Gecko apps that make use of the dump IDs in PluginCrashed? Maybe I can just chop them out.

Doesn't actually look like it does. Very interesting!

bsmedberg likes the plan. I think we've got a plan!

Ok, let's throw down a plan:

Create some kind of run identifier for every time a new process for a plugin is opened, and somehow make it possible to ask a plugin for that information via some QI magic. Maybe send this down through PluginModuleChromeParent::NP_Initialize? Make sure to do this for both the synchronous and asynchronous case.

PluginModuleParent::GetSettings is where PluginSettings are populated. Here's where we could pack a static integer that gets bumped.

Maybe instead of creating a new run ID, we just use the proc ID?

So some cases to think about:

OOP plugins disabled. This is actually straight-up not supported in e10s, and even in non-e10s, we don't have to worry about this because if the plugin crashes, it brings down the whole browser. So we can discard this case - this one is hopeless.

OOP plugins enabled, e10s disabled. In this case, we have this arrangement:

In the e10s case, we have this arrangement:

Handyman's idea is to have the process ID be the run ID. Now, I can do that, except that it exposes something pretty systems-level to JS. I want to see if I can avoid that, so let me put that on the backburner for a sec.

So PluginModuleChromeParent… when it calls CallNP_Initialize, PluginModuleChild is going to hear about it...

Well, here's the thing. The Plugin(Module|Instance)Child don't really need to know the run ID. It's really only the parents that need to know. Maybe the PluginInstanceParent needs to bea ble to communicate with the PluginModuleContentParent… can it do that?

Yeah, it can totally do that.

So I think the trick is somehow to have both the PluginModuleChromeParent and PluginModuleContentParent on the same page as to what the run ID is.

And I mean, handyman's right - that could be the process ID.

But I am kinda reluctant (maybe needlessly so) to expose something so low-level to the chrome level.

Note that the plugin ID is _almost_ what we want, except that it persists from run to run. :/ The plugin ID is associated with a tag, that exists outside of the lifetime of a plugin run. In other words, if we were to crash, and re-instantiate a plugin of the same type, it'd share IDs.

So I need something more than that - something that probably exists outside of nsPluginTag's, but rather in PluginInstanceParent's - or, more likely, PluginModule(Chrome|Content)Parent's.

So I've got a lead here. It looks like there is already some kind of association made between plugin IDs and process IDs - it seems to go like this:

PluginModuleChromeParent will call sendAssociatePluginId if there exists an mContentParent (so only in the e10s case, I guess). That goes through ContentParent.

ContentChild receives the call in RecvAssociatePluginId, and calls a static method - plugins::PluginModuleContentParent::AssociatePluginId, passing in the plugin ID and the process ID. That calls into PluginModuleMapping::AssociateWithProcessId...

A PluginModuleMapping is an element inside a circularly linked list. Huh.

What is the lifetime of a PluginModuleMapping? According to the class definition:

/**

* Objects of this class remain linked until either an error occurs in the

* plugin initialization sequence, or until

* PluginModuleContentParent::OnLoadPluginResult has completed executing.

class PluginModuleMapping : public PRCList

So maybe it gets destroyed once we're up and running. Is that true?

I can find out by putting a breakpoint in the destructor, and then trying to load a plugin.

Yeah, it goes away after the load has finished. So that mapping is not useful for me right now.

But that's OK! Like, if I just pass the run ID along with this AssociateWithProcessId, then the PluginModuleContentParent instance can hold onto it? Maybe?

And then I can extract it and hold onto it in PluginModuleContentParent::Initialize after the Resolve?

ehsan's idea:

In the child when you receive a notification that the plugin has crashed, put up the crash UI immediately. The crash UI may or may not contain crash report UI. By the time you receive the crash report information that the plugin has crashed, you put up the crash report UI. Otherwise you just wait on that bit, and the message handler for the thing that tells the child that some plugin has crashed needs to have a list of plugins of that type that has crashed. If you receive the notification from the parent, and we don't have the UI, we hold on to the data and wait until we show the PluginCrashed notification and then chuck it. Otherwise, just show the plugin crash report UI and never hold onto that.

Let me try to reformulate this.

Two cases:

ActorDestroy is called in PluginModuleChromeParent first. PluginModuleContentParent has not heard about it yet. We know the type of the plugin that has crashed, and we process minidumps if we can. We then fire an observer notification, sending out the name of the plugin that has crashed, and the dump ID information.

Browser front-end sees the notification, and immediately tells all remote browsers "plugin with type X has crashed". Sends a message down to the content processes.

I suppose it's possible that the content process, at this point, does not know that the plugin with type X has died. PluginContent.jsm hears those messages, and checks an internal WeakSet to see if there are any <object>'s with that pluginID that have reported PluginCrashed. If not, hold on to the information.

Eventually, we should see PluginCrashed events come up from the DOM for each <object>

Q: In this scenario, how long should the PluginCrashed instances hold on to the crash info before getting rid of it? :/ It wouldn't know unless there was some process wide observer notification saying, "I fired all PluginCrashed events", so we could drop the data.

OK, talked to billm. He feels like run ID is just safer. It's bulletproof. Yes, pid might work 99.9999999% of the time, but run ID will work 100% of the time. So let's do it.

Slight alteration though - instead of extracting the PluginModuleChromeParent and setting run ID in it in PluginModuleContentParent::Initialize, just instantiate the PluginModuleContentParent right away when we create the mapping, and set the run ID on it there. Maybe pass it into the constructor.

Huh… so I might have gotten the ordering wrong. It looks as if the bridging occurs _after_ the connection has been set up in the parent process to the plugin process - it's not simultaneous. Good to know.

Ah, so the AssociatePlugin stuff never gets called for the synchronous plugin loading case. :/ I have to find a different solution there...

This AssociatePlugin stuff might be all useless. Hrm...

Actually, no. I think I need it for the async case. Garrr….

Both async and sync plugin instantiation calls LoadPlugin through PContent.ipdl. Maybe it'd be better if I set run ID there instead of all of this async monkeying about.

Ah, heh, and RecvLoadPlugin calls into mozilla::plugins::SetupBridge, which is where billm initially suggested I set the ID in the first place. FULL CIRCLE, BABY.

Yeah, I think this is going to work. I really don't need to wait for the plugin process to gets its initialization done, which is what the AssociatePluginFoo stuff was all about.

Note - mozilla::plugins::SetupBridge is re-entrant.

Ok, the dummy thing I wrote seems to be acceptable by billm / aklotz. Let's move on.

Expose the runID via nsIObjectLoadingContent

If I can get this, I think everything else is mostly gravy.

Now… how the hell do I get from an nsObjectLoadingContent to the PluginModuleContentParent?

Like, I can get from an nsObjectLoadingContent to an nsPluginInstanceOwner, and I can get from a PluginInstanceParent to a PluginModuleContentParent.

But how do I get from an nsPluginInstanceOwner to the PluginInstanceParent? Or is there another path?

Current path I can think of seems to be through this NPP struct that npapi.h seems to define. We static_cast an PluginInstanceParent to a PluginDataResolver* and set it to the instance's NPP pdata… or something. This is so fucked down here.

Hm… apparently, nsNPAPIPluginInstance can call into PluginModuleParent as its "library". Ok...

Yes - PluginModuleParent subclasses PluginLibrary.

I could go from nsObjectLoadingContent.cpp, and reach into mInstanceOwner to get the nsNPAPIPluginInstance...

And then that could be extended to ask its PluginLibrary for a run ID. In the single process case, we can just make that return NS_ERROR_NOT_IMPLEMENTED or something.

jimm likes this plan! I think I can make this work.

This NS_ERROR_NOT_IMPLEMENTED thing - I want to return that in the event that the plugin is loaded in process. Can I easily detect that in nsObjectLoadingContent? Yes - I have to get at the Library and check library->IsOOP.

Sweet! I can totally do this! Let's start by adding the member to nsIObjectLoadingContent.idl.

Done! Now I can forward down to PluginLibrary… lots of hops here. *sigh*.

I've changed the GetRunId method in PluginModuleParent to return an nsresult instead, that was it can be an implementation of the method for PluginLibrary, and the PluginLibrary for the in-process case can just return NS_ERROR_NOT_IMPLEMENTED.

HOLY SHIT that was hard. I had to like… add stuff to this webidl file otherwise it wouldn't show up when I QI'd a Flash object… bah. And then I had to implement _yet another_ GetRunId in nsObjectLoadingContent that calls the XPCOM GetRunId. Kind of a monkey-show here.

And I get NS_ERROR_FAILURE, because at the time I'm interested in the run ID, the nsPluginInstanceOwner has gone away. DAMN IT.

GOD DAMN IT.

So I guess maybe nsObjectLoadingContent should get the run ID on instantiation - like maybe in InstantiatePluginInstance. We get the run ID, store it in a local member on the nsObjectLoadingContent… and a bool for whether or not we were able to successfully get the run ID. Jeebus.

Ok, got this to work finally.

Send an observer notification when a plugin crashes

This already happens! The notification is called plugin-crashed, and passes a property bag with the pluginDumpID and the browserDumpID. I should pass the run ID here too.

Then I need somebody to hold on to it...

Have each window listen for that observer notification to send a message down to every browser window. Pass down the run ID and whether or not we can submit a crash report.

And this probably needs to be in toolkit. :/

Wait. No. It'd only need to be in toolkit if I was removing PluginCrashed. I'm only _considering_ removing PluginCrashed. If I don't, then anybody who consumes PluginCrashed will still work.

So I wonder if PluginCrashReporter code could go into the TabCrashReporter.jsm file, and then rename that thing to CrashReporters.jsm.

Then we have a PluginCrashReporter that gets initted if MOZ_CRASHREPORTER.

On init, it registers a global observer for the plugin-crashed notification. When it sees the plugin crashed notification, it maps run IDs to plugin/browser dump IDs.

We have gPluginHandler add an observer for the plugin crashed event as well, and when we see it, we send a message down to content with the following information:

1) bool for whether or not to show the crash report interface

2) run ID

On submit, pass up information with run ID. This should call into the PluginCrashReporter, which will submit the report.

Only need to do this in the parent process.

I feel… I feel like I need to get a better sense of how the plugin crash UI works and what it's capabilities are before I continue. Let's do a quick study of that.

pluginInstanceCrashed in PluginContent.jsm is where most of the magic happens, it seems...

I see - so the event gets "submittedCrashReport" set on it if we automatically sent the crash report - that way we don't show the crashed UI again.

Ah… a plugin might not be represented by "window-global" plugins. I might have to think about that. Adding TODO.

The plugin crashed UI seems to have several states:

submitted
noSubmit
noReport
please

What about window-global plugins? Do we expose run ID on those? How?

Ah, ok, so these are things like the Adobe PDF reader plugin that take up the entirety of the browser content area. What do we currently do when those crash? Let's test.

OK, so every NPAPI plugin exposes a nsIObjectLoadingContent.

It's Gecko Media Plugins that I need to worry about / test.

Determine what to do about Gecko Media Plugins

On Gecko Media Plugins.

Meta bug.

Bug that added crash reporting capabilities for GMP's.

Found the test Gecko Media Plugin under obj/dist/bin/gmp-fake/1.0/

  
   13:19
  
   (jesup)
  
   mconley|livehacking:
  
   https://mozilla.github.io/webrtc-landing/pc_test.html
  
   and check "Require H.264"

  
   13:20
  
   (jesup)
  
  that will use GMP for the OpenH264 plugin

  
   13:20
  
   (felipe)
  
   mconley|livehacking: actually when I worked on that bug the machinery
 was not fully set up yet, so I don't know that myself either. We 
implemented the bug assuming the platform would send the expected 
notification on a crash

  
   13:20
  
   (mconley|livehacking)
  
  interesting

  
   13:20
  
   (jesup)
  
  there's also a test of the fake GMP plugin in dom/media/tests/mochitest

This works to reproduce the issue and to have a GMP open in a plugin-container, but note that this must be done outside of e10s !

I assume that GMP's never display inline like Flash or Java. Assuming this is true, then when a GMP plugin-container crashes, the main process will have the plugin-crashed observer notification called on it. If that observer notification can tell us that it was a GMP that crashed, we know that we only need to show a notification bar and do not need to notify content about the problem. So we can skip sending the message down with the run ID - the run ID is actually not useful in this case

Verify that GMP never displays inline

Yep, confirmed.

  
   :15
  
   (mconley|livehacking)
  
  jesup: hey - can I assume that GMP's will never display inline - like,
 no type of GMP is rendered in layout in an <object> like Flash? 
That they're all "behind the scenes" helping media elements like 
<video> and <audio>?

  
   14:16
  
   (jesup)
  
  yes.  Deep in the media pipelines

Drew explains everything here.

NPAPI plugin case

Have plugin-crashed notification pass run ID

Have browser-plugins' pluginCrashed hear the observer notification, and message all browsers of the window.

So, another wrinkle. The binding that gets applied to show the crashed plugin UI isn't something that PluginContent sets up, but is something that the nsObjectLoadingContent does by making a pseudoselector "moz-handler-crashed" match. So by the time the parent processes the minidump and sends down the message to the children, we cannot be certain that all plugins with a matching run ID are in the crashed state. At least, I don't think we can. :/

What about ehsan's idea? Let's think about that for a second.

nsIObjectLoadingContent.displayedType if nsIObjectLoadingContent.PLUGIN_CRASHED can tell us if the plugin is actually crashed or not.

So parent sounds down message… we go through all plugins that have the same run ID. If we find some that are not in the PLUGIN_CRASHED state, stash our plugin information in a Map somewhere along with a WeakMap of the plugins that haven't crashed. When we see the PluginCrashed events, remove plugin from WeakMap (if empty, destroy mapping).

On pagehide, also clear mapping.

Making progress!

I only seem to get the UI if I set a breakpoint in PluginContent.jsm in the content process though, which makes me think we've got something kinda race-y here.

It looks like there are times where the child receives the crash message from the parent, and the nsDOMWindowUtils::GetPlugins method returns nothing, because the plugin object has already unbound itself from the tree. I need to investigate that.

So wait, that happens after the PluginCrashed event? What's the order in this case?

Let's add some logging, since a debugger and a debug build seems to screw everything up.

Huh. Ok. So what's the plan there?

Well, what's the order of things?

Oh, interesting… in this last case, the PluginCrashed event fired first, and then we processed the messages from the parent.

But we processed TWO messages from the parent. What? Really? Oh… I guess it's once per tab, and I had two tabs open. Can I confirm that?

Yes, that is the case.

So this is kinda bad. The PluginCrashed event has happened, but by the time we process the message from the parent, nsIDOMWindowUtils::GetPlugins returns us nothing (because the plugins have been torn out of the tree, I guess?).

Woo! This seems to work! Great great great!

Make PluginContent.jsm special case the GMP case to work the old way.

Re-introduce mechanism that makes sure we don't show the plugin crashed notification bar if there's any one plugin instance that is large enough to show the inline UI.

Rename TabCrashReporter.jsm to… ContentCrashReporters.jsm? And then have that host TabCrashReporter and PluginCrashReporter

Have PluginCrashReporter (and I guess TabCrashReporter?) initialize in nsBrowserGlue.js?

Have PluginCrashReporter listen for the plugin-crash observer notification, and stash runId mapped to browser / plugin dump IDs

When submitReport comes up from any browser, pass it off to PluginCrashReporter. Have the PluginCrashReporter tell all navigator:browser windows message managers that a crash report has been submitted for such and such a runId.

Have PluginContent.jsm listen for that message, and update all plugins with matching runIds to the "crash report sent" state.

Deal with the invisible plugin crash case

How come the crashed plugins aren't being updated once I submit the report?

This was just some silliness on my part - I was referring to some variables with the wrong names.

Find out why I can't type anything in the crash reporter UI comment input. :/

This also appears to be the case for single process Firefox, which is a bit odd. Is this reproducible in release?

HOLY SMOKES. This appears to be reproducible up to beta, and is therefore maybe a big deal. I've filed this bug.

Looks like smichaud is on the case. Until he has it fixed, I guess I should move on.

Rename runId to runID - hopefully this doesn't bite me, and somebody asks for it to be put back.

Move optInCB stuff to browser-plugins.js

Do self review pass

Fix up review comments

Split out Run ID stuff to its own bug and get review now. Split this out to bug 1148012 .

Start testing various cases and get patches out of WIP

Test the full-page plugin case (like PDF viewer)

Is there any pocket of time that PluginCrashed will have fired for all plugins on documents / subdocuments, but that nsIDOMWindowUtils::GetPlugins can return an empty array? Example - during binding switch? bsmedberg says no.

   
    14:51
   
    (mconley)
   
   bsmedberg: hey - when plugins crash, is there any point in time where
 the <object> or <embed> is removed from the DOM before it's
 replaced with the crashed plugin sad-Lego block view? Am I correct in 
understanding that it remains in the DOM, but that we attach an XBL 
binding based on some pseudo-selector?

   
    14:56
   
    (bsmedberg)
   
    mconley: it is not removed from the DOM

Write a new test to test non-determinism.

Ugh, this was pretty hacky. It does the job but maaaaaan is it hacky.

Let me explain.

The problem here is that the code in PluginContent.jsm is dependant on how the <object> or <embed> element is expected to behave during a crash. Specifically, the <object> or <embed>, in the event of a crash, will match the embed:-moz-handler-crashed, applet:-moz-handler-crashed, or object:-moz-handler-crashed pseudoselector, which attaches the pluginProblem binding (the sad lego brick thing).

The thing is… it's really hard to mock things out so that the object matches the pseudoselector. And I can't seem to apply the binding manually. It seems I must forcibly crash the plugin in order to show the plugin problem binding.

There is a window of time I'm trying to capture here. The window of time is after the pseudoselector matches, but before the PluginCrashed event fires.

The problem is that I can't get the pseudoselector to match without firing the PluginCrashed event. So what I do is I cause the plugin to crash to have it match the pseudoselector, allow the PluginCrashed event to fire, and maybe the message to come down from the parent non-deterministically… and once that's all done, I rewind the tape a bit. I reset things so that the pluginProblem binding is in the state it was before it saw PluginCrashed or the message from the parent.

Then in one test, I fire the PluginCrashed event and then send the message from the parent, and do the reverse order in my other test.

Try push for test.

Ensure crash reports being submitted are useful

This took some doing, but on OS X, it looks like on a build, we get a test plugin created at objdir/dist/plugins/Test.plugin. Just copy that into ~/Library/Internet Plug-ins/ and then go to http://people.mozilla.org/~tmielczarek/testplugincrash.html to test. Apparently, kill -SIGABRT doesn't produce useful crash dumps.

~~Update plugin crash tests? Maybe I can make this limp along for the non-e10s case.~~

With jimm's patch from bug 1145208 , I can run the tests under browser/base/content/test/plugins.

In the non-e10s case, this is what fails:

browser/base/content/test/plugins/browser_pluginnotification.js

In the e10s case, this is what fails:

browser/base/content/test/plugins/browser_bug812562.js

browser/base/content/test/plugins/browser_pluginnotification.js

browser/base/content/test/plugins/browser_plugins_added_dynamically.js

browser/base/content/test/plugins/browser_CTP_crashreporting.js

browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js passes, but is probably somehow busted.

browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js is busted.

I should update at least

browser/base/content/test/plugins/browser_CTP_crashreporting.js

browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js

browser/base/content/test/plugins/browser_pluginCrashCommentAndURL.js

I think I can maybe spin this out to a separate bug.

Make sure test from this bug still works for non-e10s.

Hm. I appear to have broken this. Let's investigate.

Ah, I need to handle old-style crash reporting for non-e10s for the GMP plugin crash case. Done.

Ensure all NPAPI plugin crash tests work for non-e10s

My try push has some failures. Time to dive in and categorize!

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_CTP_crashreporting.js | submitStatus data should match - Got submitting, expected success

At Mossop's suggestion, I've got a patch in this bug to add some handy functions to the ContentTask scope for things like polling. That'll be useful for rewriting these tests.

Ok, done. I've created this ContentTestUtils.jsm, and a promiseCondition method. That will be useful.

That was pretty straightforward! Ok! Great!

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js | Waited too long for the plugin-crashed notification bar

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js | Successfully got the plugin-crashed notification bar

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js | Infobar was shown.

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_globalplugin_crashinfobar.js | Test timed out - expected PASS

Woo! That was also pretty straight-forward. ContentTask is awesome. So is BrowserTestUtils.

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_pluginnotification.js | Uncaught exception - TypeError: Argument 1 of Document.getAnonymousElementByAttribute is not an object

This looks like jimm just needs to finish refactoring this test - I can ignore this.

TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_CTP_zoom.js | Uncaught exception - TypeError: Argument 1 of Document.getAnonymousElementByAttribute is not an object. (Debug builds and Win XP)

This looks like jimm just needs to finish refactoring this test - I can ignore this.

Timeout on Win XP...

PROCESS-CRASH | Main app process exited normally | application crashed [@ libnptest.so + 0x479d]

Figure out future plan for PluginCrashed event. Just leave it for other XUL apps? Is that necessary? I think I'm going to need this for the GMP case, since it's the only way for now that PluginContent.jsm can be informed of which window is using a GMP (since there's no way of iterating GMP's in a window+descendants like there is for NPAPI plugins).

Address Mossop's review comments in my non-determinism test patch

jimm's test updating patch is still WIP, and it's only a +, so I don't think I should block on it. I've rebased away from it.

Try push.

Huh. So I've got two PROCESS CRASH errors in the log. I suspect that, for some reason, the crashes caused by browser_pluginCrashReportNonDeterminism.js aren't "expected". Weird. Can I prove that?

Yes. That appears to be the case. Now I've added an observer to just delete the minidumps.

Address joshmoz's comments and jimm's comments

New try push

Hm… intermittent orange?

Plug intermittent orange.

Maybe fixed by checking statusDiv exists first. Try push.

Gosh darn it. Need to change promiseCondition to waitForCondition. *sigh*

Another try push.

Audit states to make sure we're using the right ones.

States:

noReport -> .msgNoCrashReport -> "No report available."

please -> .msgPleaseSubmit -> "Send crash report"

noSubmit -> .msgNotSubmitted -> "Crash reporting disabled."

submitting -> .msgSubmitting -> "Sending report…"

success -> .msgSubmitted -> "Crash report sent."

failed -> .msgSubmitFailed -> "Submission failed."

Do I have to support submittedCrashReport on the PluginCrashed event?

  
   16:56
  
   (mconley)
  
  bsmedberg: do you know if Firefox was ever configurable so that it'd 
send plugin crash reports automatically, and if we still support that?

  
   17:23
  
   (bsmedberg)
  
   mconley: I don't think it was ever configurable that way

  
   17:23
  
   (bsmedberg)
  
   mconley: unless perhaps for some special testing configuration

  
   17:23
  
   (mconley)
  
  bsmedberg: ok, good to know, thanks

Looks like that's a NO .

Try push

Another try push to see if those oranges go away (I don't think they're mine).

Awwwww yeah - no orange! Land ho!

SHIT. Bounced. WTF?

Looks like a Linux-only orange. I've retriggered the jobs on the original push here: https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=238352c73015 to see if they are intermittent.

I've also spun up a debug build on my local VM.

Address intermittent orange

It looks like the content process is crashing...

Here's a screenshot of the error occurring from the test machine:

http://i.imgur.com/Tf7xbfb.png

This seems to be the most interesting part of the log:

[Parent 22296] WARNING: '!aObserver', file /media/Projects/mozilla/mozilla-central/xpcom/ds/nsObserverService.cpp, line 287

[Child 22926] ###!!! ABORT: X_CreateGC: BadDrawable (invalid Pixmap or Window parameter); sync; id=0x4400008: file /media/Projects/mozilla/mozilla-central/toolkit/xre/nsX11ErrorHandler.cpp, line 157

#01: X11Error (/media/Projects/mozilla/mozilla-central/toolkit/xre/nsX11ErrorHandler.cpp:158)

[Child 22926] ###!!! ABORT: X_CreateGC: BadDrawable (invalid Pixmap or Window parameter); sync; id=0x4400008: file /media/Projects/mozilla/mozilla-central/toolkit/xre/nsX11ErrorHandler.cpp, line 157

Hit MOZ_CRASH() at /media/Projects/mozilla/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:33

So we're aborting because of a bad drawable, it seems.

I'm going to disable this test for Linux for now, and file a follow-up to investigate and re-enable.

Re-land

Follow-ups

~~File a follow-up to consider merging PluginCrashReporter and TabCrashReporter?~~

Nah, on second thought, they do two completely different things.

~~File follow-up to rename PluginContent.jsm's pluginData and pluginCrashData to NPAPIPluginData and NPAPIPluginCrashData~~ I think this is only important once we start to work on GMP crash reporting, and assumes that we don't use the same run ID root for GMP's. I added this comment to remind me to think about this.

File follow-up to remove submittedCrashReport from PluginCrashed event - filed bug 1153265

File follow-up to retry submitting a crash report some number of times - filed bug 1153272

File a bug to deal with the case where a visible plugin crashes at the same time as an invisible plugin in a different process. It's edge-casey, but that first one might very well hide the notificationbar for the second. Filed bug 1153276

File a bug to re-enable browser_pluginCrashReportNonDeterminism.js on Linux and investigate X_CreateGC: BadDrawable bug. Bug 1152811

GMP

So, below, I'm worrying about GMP for WebRTC and EME. Is it possible to fix the crash reporting UI for e10s for NPAPI plugins while not worrying about the GMP case until later?

How would that work?

Both of the GMP cases rely on the gmp-plugin-crashed observer notification firing in the same process that content is running in, which is not going to work in e10s land, but I don't need to worry about that. It should still work properly for the non-e10s case, assuming the PluginCrashed event handler doesn't change behaviour too much.

Ok, I think I can do that. I should file a bug for the GMP crashing case and just move on.

Everything below is for the GMP case, which I've determined can be solved separately in bug 1146955 . I just have to make sure my solution in this bug doesn't break the non-e10s GMP crash report UI.

GMP + WebRTC case

Here's the bug for GMP e10s support.

It's pretty far along. I think I might need the "runId" thing for GMP's as well, but this might be far simpler because PGMPService.ipdl's LoadGMP will return a processID - I can probably pass a run ID back too.

Node ID isn't what we want. According to this , node ID distinguishes as follows:

"... site A.com running EME in an <iframe> from CDN.com will be considered a different node id than B.com running EME in an <iframe> from CDN.com."

GMPProcessParent might need to be able to get at our static run ID counter. Is there some common place where it makes sense for GMP and PluginModuleChromeParent to get the static counter? Or should they not use the same counter?

Since this is blocked on e10s support for GMP, should I spin this out to a separate bug?

Hm. Maybe I should just base my stuff on top of what's going on in bug 1057908 , since that's likely to land soon.

So, facts:

GMP already uses the PluginCrashed event, but sets the gmpPlugin property to true on it to differentiate and are dispatched on the document itself.
There doesn't seem to be a way of iterating GMP's for some document like there is for NPAPI plugins. I guess that makes sense - GMP's are deep in the plumbing and aren't actually represented in the DOM anywhere

Suppose we have each GMP process have a gmpRunId. On crash, parent gets observer notification, right?

If so, send message down to the child saying, "Hey yo, we heard a crash with such and such a GMP run ID".

Child hears it, and checks to see if it's ever seen a PluginCrashed event for a gmpPlugin with that run ID. If so, sends a message up saying so. Otherwise, adds the run ID to a set of crashed GMP run IDs until pagehide.

If a GMP PluginCrashed event is seen, check to see if there is a run ID being stored that matches, and send up a message immediately saying, "Hey, I was using that GMP when it crashed". Otherwise, toss the run ID into the same set so that we can send a message back up immediately when we get the message from the parent.

Whoa whoa whoa whoa whoa. It looks like the PluginCrashed event is fired from dom/media/PeerConnection.js. I need to understand this thing more.

Confirm that parent process gets informed of GMP crashes via observer notification. This does not appear to be true yet, but might be true once bug 1057908 is landed. I should test this. I've applied the patches - let's see what happens. Yes - I will get the observer notification in the parent process once bug 1057908 lands.

So I suspect I should be basing my work off of bug 1057908.

Hm… this is going to break PeerConnections.js crash handling model. Like, it's not going to get, as far as I can tell, the observer notification in the content process that it currently uses to bubble up the PluginCrashed events.

Ok, so suppose the parent process hears this observer notification (which it will), and then what it does is (like the NPAPI case), messages all browsers that the GMP with run ID X crashed.

Then somehow I need to be able to identify which browsers are making use of a GMP to know whether or not to display the notification bar. Maybe have PeerConnection.js add a message listener to the child process message manager.

Hears the message, then for each PeerConnection implementation that it's tracking, it calls pluginCrash, passing down the GMP run ID. Wait… how is it supposed to match run IDs to running PeerConnection instances? Fuck.

Maybe expose the run ID on PeerConnectionImpl.webidl? Then when we hear the message, PeerConnection.js iterates the list of all active peer connections, queries the PeerConnectionImpl's for matching run IDs, and if they match, calls PluginCrash on it. That causes the PluginCrash event to be bubbled up on the active document in the window that owns the PeerConnection.

That's when PluginContent.jsm picks it up. Notices that it's a gmpPlugin and shunts it accordingly - will sent a message directly to the parent to show the notification bar.

GMP + EME case

This bug adds support for the crash report UI for GMP + EME.