Bug 1102410 - AboutProtocolChild attempts to register the same factory twice for multiple nsIAboutModule's
Ok, starting to look into this now.
I was able to reproduce this on the weekend. Let me just confirm again.
16:53
(emtwo)
mconley:
https://github.com/mozilla/interest-dashboard/tree/5cf5a9ac817d486dfa8cfa5b9442dbaec14eac35
16:53
zz_miketaylr is now known as miketaylr
16:53
(emtwo)
the latest commit is "Added pga matcher" that was included on amo
Ok, that's useful. That revision that emtwo linked me to is the revision that's on AMO, which is breaking BugzillaJS.
I can reproduce this without difficulty using that particular revision, and the AMO version of BugzillaJS.
Both BugzillaJS and ID are restartless, and I have BugzillaJS installed
and working, and then install ID, BugzillaJS continues to work until
restart.
Which makes me think that somehow ID is interfering with BugzillaJS's start-up.
Also, I just want to see if this is reproducible using the tip of the
ID repository. If not, then I can bisect, which will really narrow
things down.
HA! LUCKY! Looks like BugzillaJS works with the tip of the ID repository. Bisecting time!
Huh… getting this from git bisect:
Some good revs are not ancestor of the bad rev.
git bisect cannot work properly in this case.
Maybe you mistake good and bad revs?
I guess bisect is not used to the idea that something _used_ to be
broken, and now something is fixed. Huh. So now I have to swap my notion
of good/bad works/broken to use git bisect. Lovely.
Ok, bisect done. Apparently
873566fcf6d5660eed7867a127497783ea4a9d2e
is what fixed things:
So one of these changes fixed things on tip. I'd like to figure out
which one. I'm going to revert them one by one until I go from
working to busted.
And then I'll know which file change resulted in Interest Dashboard
being fixed. And then I'll probably bisect the changes in the file
itself until I find the change that fixed the problem.
Why all of this effort? I want to make sure that whatever problem
Interest Dashboard used to have is not something that could easily
happen with other add-ons, and make sure there's nothing in there that
our shims could have fixed.
Hm, I was just going to do this in order, but I think it's far more
likely that something from lib/ fixed things. Let's go by most to least
likely.
Changes I tested:
deleted: lib/Headliner.js
I skipped ahead and didn't need to test the following:
modified: lib/Application.js
Ah… some change in Application.js broke things… let me try to confirm… what were these changes?
I guess I'll need to revert the changes one by one until I find the thing that caused the problem...
Oh, not revert now - since I've already reverted Application.js for this test, I'll
reapply
the changes, to see when I go from broken to fixed.
Ah, so apparently what fixed things was when in Application.js, we removed DevMenu, which is an about: module.
Hypothesis - our add-on shims break horribly if an SDK add-on (or any
add-on, really) attempts to register more than 1 about: module.
So what's a cheap way to test this?...
Probably the cheapest way is to create a test-case add-on that creates two nsIAboutModule components and registers them.
This is looking very plausible - I've isolated the changes down to the registration of the DevMenu nsIAboutModule factory.
Now looking increeeeedbly plausible. I deregistered the second
nsIAboutModule, and things worked again - so it has nothing specifically
to do with the DevMenu nsIAboutModule per se, but almost certainly
because the fact that there's two of them.
So I think this is definitely our shims bugging out. Let me write this in the bug.
I might be able to create a test case using the pre-existing shims tests.
(If my minimal testcase doesn't work - I should check to make sure that
the problem isn't in how the SDK registers factories… though that seems
pretty unlikely as the cause.)
NS_ERROR_FACTORY_EXISTS: Component returned failure code: 0xc1f30100
(NS_ERROR_FACTORY_EXISTS) [nsIComponentRegistrar.registerFactory]
This is likely the issue. Why would this break pagemod though?
Ohhh… because that operates during initialization...
Wait… is that the problem? Wouldn't this happen even without the ID add-on?
Hm, no it doesn't. Ok.
Time to write a test case. Let's register two about: pages and make sure they work.
My test case isn't doing the right thing - anyhow, I think I've solved
the mystery here - the AboutProtocolChild will attempt to register
itself twice when it tracks an nsIAboutModule getting registered in the
parent.
So here's what happens.
The first browser-child.js starts up… and it calls RemoteAddonsChild.init on itself.
RemoteAddonsChild, starts itself up the first time, and calls makeReady on itself.
NotificationTracker.init is called, which populates the
NotificationTracker with all of the information about shims it should
hook up.
ContentPolicyChild.init is called. That goes just fine.
AboutProtocolChild.init is called - here's where things go wrong - for
each tracking for the AboutProtocol, we attempt to register the same
AboutProtocolChild as the factory. Repeated calls result in an exception
being thrown.
So we throw an exception, and so makeReady fails out. That's why
subsequent creations of browser-child.js's result in makeReady getting
called again - because makeReady didn't succeed, RemoteAddonsChild just
assumed that it had never started up before, and calls makeReady again
for that new browser-child.js.
Let me write this up in the bug.
Changing bug summary from
Interest Dashboard breaks BugzillaJS
to
AboutProtocolChild attempts to register the same factory twice for multiple nsIAboutModule's
since it's more descriptive.
So my test does a good job of getting us into the error state - but
it's not so good at detecting when the error is occurring. This is
because attempts at loading my about: pages results in them being loaded
in the parent process, since they're blacklisted.
What's a good way of testing that this registration of the about: pages has gone all wrong?
Oh - I can take advantage that we prefix test some about: urls in
E10SUtils shouldBrowserBeRemote to determine if… well, a browser should
be remote. about:neterror is one of the things that passes the prefix
sniff test - so perhaps I should name my test pages about:neterror-test1
and about:neterror-test2. It's hacky, but it's also a test and the best
I can come up with, and I'd rather have a hacky test then no test at
all.
Let's give that a shot.
Yes, this works. Before I continue though, I had a new idea - perhaps
add-on code can XHR to those about pages. That'd be a little less hacky,
probably… and it'd bypass the blacklisting that E10SUtils uses. A
little more verbose though.
Let's try that.
Ooooh - yes, this will work nicely.
Ok, test written. Fix written. Cleaning up now, and attaching.
Posted.
The one thing I'm not sure about is this loadGroup change I had to
make. In AboutProtocolChannel in RemoteAddonsChild.jsm, I ran into a
problem where I'd get a "this.loadGroup is null" error when running the
test. From my understanding, loadGroup is something that starts out as
null, and that the caller needs to set (ugh)… so have I uncovered a
networking bug here? I've changed the code to handle the null case, but
I'm not sure if that's the right call. I've asked billm for more
information.
Hey, ok - billm likes the fix. I'm redirected the review request for
the test patch to ally, since she wants to get her hands more dirty,
review-wise.
For the test, ally likes it, mrbkap likes it, billm has the following suggestion:
15:59 billm: mconley: I think this code should just set loadGroup to null when the child had no load group:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/addoncompat/RemoteAddonsParent.jsm#244
So I can do that, no problem, and then I'll land. Woo!
Hrm… actually, I'm not sure if billm meant I should replace what I'd
done with loadGroup with his suggestion. A flat out replacement results
in a throw, because I'm still attempting to access
this.loadGroup.notificationCallbacks in the RemoteAddonsChild.jsm when
this.loadGroup is null. Perhaps billm meant this as "in addition to"?
I've needinfo'd him.
Ah yep, he meant "as well as" - we're good to go to land the test.
Aaaaaaaand the trees are closed because AWS. Guess I'll just wait until they're re-opened.