Bug 1069075 - Intermittent browser_CTP_data_urls.js | Test 3a, Waited too long for plugin to activate | Test 3c, Plugin should be activated Ugh. Ok, what's going wrong here?

Ok, I can reproduce this locally using a debug Ubuntu build and running with --run-until-failure. I love that feature.

The failure always seems to be at the same place: "Test 3a" says that it's waiting too long for a plugin to activate, and "Test 3c" checks for an activated plugin.

I suspect that 3a is really the root cause here, and 3c is noise as a result of 3a's failure. Let's see if we can figure out why 3a is going wrong.

What is "Test 3a" doing? It seems to be part of a multi-part test - Test 3a, 3b, and 3c.

What's important is knowing what this test does, and what occurs leading up to this test. The key to the failure will be in there somewhere.

Actually, if we can narrow that down immediately, that'd be good. Let's see if I can eliminate Test 1 and Test 2 and see if Test 3 still fails.

Ok, yes, I am able to run just Test 3a, 3b and 3c. I was able to get the test harness to target 3a first, and I got 3c to shortcut to finishing. So it's just these three.

Now, if I run with --run-until-failure, can I get it to fail? If so, then the orange is likely caused by something within Test 3a. If not, then something in the steps preceeding Test 3a is likely at fault, or at least contributing to the problem. Both pieces of information are useful.

AH HAH. I am not able to reproduce the problem with just Test 3a, 3b and 3c. So something in either Test 1 or Test 2 is contributing. Let's see if we can narrow that down. I will re-enable the Test 2 bits.

Huh… no failures with both Test 2 and 3 running. I'll admit I'm surprised - I expected it to be 2, since I would have expected any flim-flammery from 1 to have affected 2 and not 3. Interesting.

So my current hypothesis is that Test 1, in conjunction with or in spite of Test 2, is causing Test 3 to fail intermittently.

So I guess I'd better understand what the hell Test 1 and Test 2 are doing before I continue. *sigh*.

Test 1

plugin_data_url.html is loaded. This document has two links, which lead to data URIs that show plugins (the first link shows just the test plugin, the second link shows both the test plugin and the second test plugin). The document also has a plugin (which I'll call X) displaying on the page with the links.

We first make sure there's a click-to-play notification (the doorhanger) being shown. We also test that plugin X is not activated.

We then click on data-link-1, which is the data URI showing just the single plug-in, which I'll call Y.

That's the end of Test 1a.

Test 1b starts when the data URI finishes loading and the plugin binding has been attached. We make sure there's a click-to-play doorhanger. We make sure that Y is not activated. Then we  open the click-to-play doorhanger, and click the primary button which "allows" the plugin to play. We then wait for the plugin to activate.

Once the plugin is activated, we move on to Test 1c.

Test 1c clears all plugin permissions we've set up, and then re-loads plugin_data_url.html.

And that's the end of Test 1.

Test 2

plugin_data_url.html is loaded. We make sure there is a click-to-play doorhanger being displayed.

We make sure that the plugin X is not yet activated. And then we click on data-link-2, which loads up the data URI with two plugins - which I'll call Y and Z.

That's the end of Test 2a.

Test 2b starts with that data URI being loaded. We make sure that the click-to-play doorhanger is still there, and we also open it up. We make sure that both plugins are being listed in the doorhanger.

Then, we inspect those listings of plugins, and find the controls for Y. We tell it to allow Y for now, and click on the primary button of the doorhanger to make it so. We then wait for Y to be activated.

That's the end of Test 2b.

Test 2c starts … by also making sure that the plugin Y is activated. Strange… and then it clears all plugin permissions, and reloads plugin_data_url.html.

And that's it for Test 2.

Test 3

plugin_data_url.html is loaded. We make sure there's a click-to-play notification being shown.

We also make sure plugin X is not activated.

We then reshow the doorhanger, and click on the primary button ("allow") to make X active.

We then wait for X to go active. And this is where the test fails - X never goes active.

Ok, now I understand what this test is doing. Now… now why is it failing?

Oh, interesting - now it looks like I can reproduce this with just Test 2 running before Test 3. I was doing a build in the background on my host machine, so I'm guessing a heavily loaded CPU helped reveal that. That's good.

Let's add some logging to PluginContent.jsm to see if we're even calling playPlugin on the thing.

Ah hah - it looks like in Test 3a, the button we're clicking in the PopupNotification has label "OK", so we're sending the wrong message to the content process.

Y'know, I think I actually investigated this briefly as part of bug 899347, but I think I dropped it. Let me read my notes there and extract the important bits.

"Ah, interesting data from those logs. It looks like in the failing case, which is rare, the button that we're pressing in the popup is "Ok" and not "Allow now". Under what circumstances is the popup saying "OK"? Er… interesting. None of those buttons should be "OK"… and the request is sends down is "Block". Wtf?

Ah… that's button-accept from dialog.properties. "

So I suspect the panel is not fully initialized when we get it.

Nope, I was wrong - but I think I've narrowed it down.

The problem is that the framescript has a cache of all plugins its seen for a page, and it only flushes it on signals from the parent when notifications are removed. That's too late, because a notification might be removed long after the next page has loaded, and so we'd be using the old cache. That's why in some cases, we're sending up information on multiple plugins when the page we're on has only ever seen one.

I'm going to modify PluginContent to flush the cache on pageshow (pagehide is too late, since by that point, the page being loaded might have already sent plugin events).

This seems to fix it!

HOWEVER. This seems to break browser_CTP_notificationBar, and I think I've basically just regressed hidden plugin support, because the code attempts to get the information out of pluginData… and it's been cleared.

OK, I think I understand! Here's the story.

This all has to do with the lifetime of a cache inside a PluginContent instance. That cache is called pluginData, and we use this cache to remember what plugins are on a particular page, and map it to information we have about that plugin.

We want that cache to clear when we move from page to page, because one page might have completely different plugins from another.

We're currently clearing that cache when we see that the click-to-play PopupNotification is removed. We send a message asynchronously to the child, asking PluginContent to clear its cache.

But if we're fast enough, we have a pocket of time where we're viewing some page, and our pluginData is out of sync, because we haven't yet heard from the parent to clear our cache!

The natural inclination is to have the PluginContent object listen for pagehide or pageshow, and clear the cache at that point - but both of those points are too late. pageshow is fired _after_ the DOM is bound to the tree, and after plugins report that they instantiate, so we might send messages up to the parent about plugins with bad cache data just before we clear it out.

Same goes for pagehide - it fires too late. We might have browsed to a new page that shows some plugins, and sent messages to the parent with bad cache data before we get cleared out.

The solution, I think, is to listen for unload events for a page, and clear out the cache then. I also want to remove the stuff that clears the cache when the notification is removed - with this new code, that old stuff won't be necessary.

Ok, coming up with a patch.

Patch posted!

Try push: https://tbpl.mozilla.org/?tree=Try&rev=890374cea116

19 INFO TEST-PASS | chrome://mochitests/content/browser/browser/base/content/test/plugins/browser_CTP_data_urls.js | Test 2c, Plugin should be activated
Loading page plugin_data_url.html for next test3a!
For application/x-test found plugin libnptest.so
PluginContent saw event type: PluginInstantiated
_showClickToPlayNotification knows about 1 plugins.
Sending request to show CTP notification from child: 7 from stack: PluginContent.prototype._showClickToPlayNotification@resource:///modules/PluginContent.jsm:715:97
PluginContent.prototype.handleEvent@resource:///modules/PluginContent.jsm:424:7
Saw page hide!
Clearing plugin data cache! All is well!
++DOMWINDOW == 30 (0xa4d04da0) [pid = 30381] [serial = 30] [outer = 0xa4d02ac0]
PopupNotifications saw a location change
There are 1 notifications for the changing browser
After filtering, there are 0 notifications for the changing browser
Received CTP notification request 7 in parent for http://127.0.0.1:8888
Showing a ClickToPlay notification
Creating an empty map for pluginData
PluginData has 0 items, and we received information on 2 plugins.
PluginContent saw event type: PluginClickToPlay
_showClickToPlayNotification knows about 1 plugins.
Sending request to show CTP notification from child: 8 from stack: PluginContent.prototype._showClickToPlayNotification@resource:///modules/PluginContent.jsm:715:97
PluginContent.prototype.handleEvent@resource:///modules/PluginContent.jsm:424:7
pluginProblem_XBL_Constructor@chrome://mozapps/content/plugins/pluginProblem.xml:78:9
PluginContent.prototype.updateNotificationUI@resource:///modules/PluginContent.jsm:772:1
PluginContent.prototype.receiveMessage/<@resource:///modules/PluginContent.jsm:60:26
setTimeout_timer@resource://gre/modules/Timer.jsm:30:5
20 INFO Console message: [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document
contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: " http://127.0.0.1:8888/browser/browse
r/base/content/test/plugins/plugin_data_url.html" line: 0}]
Received CTP notification request 8 in parent for http://127.0.0.1:8888
Showing a ClickToPlay notification
Re-using pre-existing click-to-play notifications pluginData!!
PluginData has 2 items, and we received information on 1 plugins.
There should have been a single click-to-play notification above
Nabbing click-to-play notification
21 INFO TEST-PASS | chrome://mochitests/content/browser/browser/base/content/test/plugins/browser_CTP_data_urls.js | Test 3a, Should have a click-to-play noti


for (let plugin of cwu.plugins) {… is where we suddenly get to the pluginProblem.xml XBL constructor…

What happened was this:

Test 2c runs, or whatever, where we've got the two plugins on screen, and we click to activate the first one.

We activate it, and wait for the plugin to marked activated. Meanwhile, PluginContent.jsm sends a message back to browser-plugins.jsm updating it about the plugins on the screen, because instantiated plugins require that, I guess. But sometimes, we notice the activation, and move on straight away - we load the next page, and THEN we see the update about CTP and stuff and use stale plugin information.

So we wait for the CTP update after clicking on the first plugin, and all is well.

Q: Why does the principal check work?

A: Uh… weirdly, it seems as if the principal that gets sent up from that data_uri matches the page that it came from.

Example:

PluginContent is at data:text/html,<embed%20id="test1"%20style="width:%20200px;%20height:%20200px"%20type="application/x-test"/><embed%20id="test2"%20style="width:%20200px;%20height:%20200px"%20type="application/x-second-test"/> and our principal is: http://127.0.0.1:8888

I think that's OK - according to nsIPrincipal.idl : "... which allows the load of a data: URI (which inherits the principal of its loader)"

Ok, so this is working as suspected. Basically, the Principal is insufficient for things like blobs and data. What I need to do is send the actual location up along with the principal, and do a comparison against that as well. That should fix everything.

Try push with both patches: https://tbpl.mozilla.org/?tree=Try&rev=92c141444ab2

gfritzsche wants to know:

10:06 (gfritzsche) mconley: ok, now i remember - we cover up a race there for navigation between different urls - but what if i just reload the same page?
10:08 (mconley) gfritzsche: are you referring to the first patch or the second?
10:08 (gfritzsche) mconley: the one titled "Send location along with CTP messages."

So georg is worried about the case where we start a reload while a message is still in transit.

That's a good point. How do we deal with that?

Ok, this is a pre-existing thing that we're going to have to solve at some point, but for the sake of the treeherders, we're going to land these changes, and then I'll file a follow-up bug to try to find a solution for this entire class of problem.

Filed bug 1074983

And landed!