Bug 1102410 - AboutProtocolChild attempts to register the same factory twice for multiple nsIAboutModule's
Ok, starting to look into this now.

I was able to reproduce this on the weekend. Let me just confirm again.

16:53 zz_miketaylr is now known as miketaylr
16:53 (emtwo) the latest commit is "Added pga matcher" that was included on amo

Ok, that's useful. That revision that emtwo linked me to is the revision that's on AMO, which is breaking BugzillaJS.

I can reproduce this without difficulty using that particular revision, and the AMO version of BugzillaJS.

Both BugzillaJS and ID are restartless, and I have BugzillaJS installed and working, and then install ID, BugzillaJS continues to work until restart.

Which makes me think that somehow ID is interfering with BugzillaJS's start-up.

Also, I just want to see if this is reproducible using the tip of the ID repository. If not, then I can bisect, which will really narrow things down.

HA! LUCKY! Looks like BugzillaJS works with the tip of the ID repository. Bisecting time!

Huh… getting this from git bisect:

Some good revs are not ancestor of the bad rev.
git bisect cannot work properly in this case.
Maybe you mistake good and bad revs?

I guess bisect is not used to the idea that something _used_ to be broken, and now something is fixed. Huh. So now I have to swap my notion of good/bad works/broken to use git bisect. Lovely.

Ok, bisect done. Apparently 873566fcf6d5660eed7867a127497783ea4a9d2e is what fixed things:

So one of these changes fixed things on tip. I'd like to figure out which one. I'm going to revert them one by one until I go from working to busted.

And then I'll know which file change resulted in Interest Dashboard being fixed. And then I'll probably bisect the changes in the file itself until I find the change that fixed the problem.

Why all of this effort? I want to make sure that whatever problem Interest Dashboard used to have is not something that could easily happen with other add-ons, and make sure there's nothing in there that our shims could have fixed.

Hm, I was just going to do this in order, but I think it's far more likely that something from lib/ fixed things. Let's go by most to least likely.

Changes I tested:

deleted:    lib/Headliner.js
deleted:    data/about-keywords.html

I skipped ahead and didn't need to test the following:

modified:   lib/Controller.js
deleted:    lib/Dispatcher.js
modified:   lib/HistoryReader.js
deleted:    lib/KeywordExtractor.js
deleted:    lib/NYTUtils.js
deleted:    lib/NYTimesHistoryVisitor.js
deleted:    lib/Surveyor.js
modified:   lib/Utils.js
modified:   lib/WorkerFactory.js
deleted:    lib/streams/areaDataProcessorBolt.js
deleted:    lib/streams/dailyKeywordsSpout.js
deleted:    lib/streams/intentInterestDataProcessorBolt.js
deleted:    lib/streams/spiderDataProcessorBolt.js
deleted:    lib/streams/timelineDataProcessorBolt.js
deleted:    lib/streams/totalKeywordCountBolt.js
deleted:    lib/streams/weightIntensityDataProcessorBolt.js
deleted:    data/about-keywords.js
deleted:    data/charts/AreaGraph.js
deleted:    data/charts/IntentInterestCharts.js
deleted:    data/charts/IntentInterestCharts.js
deleted:    data/charts/SpiderGraph.js
deleted:    data/charts/TimelineChart.js
deleted:    data/charts/WeightIntensityChart.js
deleted:    data/consent-translations.js
deleted:    data/consent.html
deleted:    data/consent.js
deleted:    data/css/nytimes/newstyles.css
deleted:    data/devmenu.html
deleted:    data/devmenu.js
deleted:    data/interests/keywordsWorker.js

modified:   lib/Application.js

Ah… some change in Application.js broke things… let me try to confirm… what were these changes?

I guess I'll need to revert the changes one by one until I find the thing that caused the problem...

Oh, not revert now - since I've already reverted Application.js for this test, I'll reapply the changes, to see when I go from broken to fixed.

Ah, so apparently what fixed things was when in Application.js, we removed DevMenu, which is an about: module.

Hypothesis - our add-on shims break horribly if an SDK add-on (or any add-on, really) attempts to register more than 1 about: module.

So what's a cheap way to test this?...

Probably the cheapest way is to create a test-case add-on that creates two nsIAboutModule components and registers them.

This is looking very plausible - I've isolated the changes down to the registration of the DevMenu nsIAboutModule factory.

Now looking increeeeedbly plausible. I deregistered the second nsIAboutModule, and things worked again - so it has nothing specifically to do with the DevMenu nsIAboutModule per se, but almost certainly because the fact that there's two of them.

So I think this is definitely our shims bugging out. Let me write this in the bug.

I might be able to create a test case using the pre-existing shims tests.

(If my minimal testcase doesn't work - I should check to make sure that the problem isn't in how the SDK registers factories… though that seems pretty unlikely as the cause.)

NS_ERROR_FACTORY_EXISTS: Component returned failure code: 0xc1f30100 (NS_ERROR_FACTORY_EXISTS) [nsIComponentRegistrar.registerFactory]

This is likely the issue. Why would this break pagemod though?

Ohhh… because that operates during initialization...

Wait… is that the problem? Wouldn't this happen even without the ID add-on?

Hm, no it doesn't. Ok.

Time to write a test case. Let's register two about: pages and make sure they work.

My test case isn't doing the right thing - anyhow, I think I've solved the mystery here - the AboutProtocolChild will attempt to register itself twice when it tracks an nsIAboutModule getting registered in the parent.

So here's what happens.

The first browser-child.js starts up… and it calls RemoteAddonsChild.init on itself.

RemoteAddonsChild, starts itself up the first time, and calls makeReady on itself.

NotificationTracker.init is called, which populates the NotificationTracker with all of the information about shims it should hook up.

ContentPolicyChild.init is called. That goes just fine.

AboutProtocolChild.init is called - here's where things go wrong - for each tracking for the AboutProtocol, we attempt to register the same AboutProtocolChild as the factory. Repeated calls result in an exception being thrown.

So we throw an exception, and so makeReady fails out. That's why subsequent creations of browser-child.js's result in makeReady getting called again - because makeReady didn't succeed, RemoteAddonsChild just assumed that it had never started up before, and calls makeReady again for that new browser-child.js.

Let me write this up in the bug.

Changing bug summary from

Interest Dashboard breaks BugzillaJS
to
AboutProtocolChild attempts to register the same factory twice for multiple nsIAboutModule's

since it's more descriptive.

So my test does a good job of getting us into the error state - but it's not so good at detecting when the error is occurring. This is because attempts at loading my about: pages results in them being loaded in the parent process, since they're blacklisted.

What's a good way of testing that this registration of the about: pages has gone all wrong?

Oh - I can take advantage that we prefix test some about: urls in E10SUtils shouldBrowserBeRemote to determine if… well, a browser should be remote. about:neterror is one of the things that passes the prefix sniff test - so perhaps I should name my test pages about:neterror-test1 and about:neterror-test2. It's hacky, but it's also a test and the best I can come up with, and I'd rather have a hacky test then no test at all.

Let's give that a shot.

Yes, this works. Before I continue though, I had a new idea - perhaps add-on code can XHR to those about pages. That'd be a little less hacky, probably… and it'd bypass the blacklisting that E10SUtils uses. A little more verbose though.

Let's try that.

Ooooh - yes, this will work nicely.

Ok, test written. Fix written. Cleaning up now, and attaching.

Posted.

The one thing I'm not sure about is this loadGroup change I had to make. In AboutProtocolChannel in RemoteAddonsChild.jsm, I ran into a problem where I'd get a "this.loadGroup is null" error when running the test. From my understanding, loadGroup is something that starts out as null, and that the caller needs to set (ugh)… so have I uncovered a networking bug here? I've changed the code to handle the null case, but I'm not sure if that's the right call. I've asked billm for more information.

Hey, ok - billm likes the fix. I'm redirected the review request for the test patch to ally, since she wants to get her hands more dirty, review-wise.

For the test, ally likes it, mrbkap likes it, billm has the following suggestion:

15:59 billm: mconley: I think this code should just set loadGroup to null when the child had no load group: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/addoncompat/RemoteAddonsParent.jsm#244

So I can do that, no problem, and then I'll land. Woo!

Hrm… actually, I'm not sure if billm meant I should replace what I'd done with loadGroup with his suggestion. A flat out replacement results in a throw, because I'm still attempting to access this.loadGroup.notificationCallbacks in the RemoteAddonsChild.jsm when this.loadGroup is null. Perhaps billm meant this as "in addition to"? I've needinfo'd him.

Ah yep, he meant "as well as" - we're good to go to land the test.

Aaaaaaaand the trees are closed because AWS. Guess I'll just wait until they're re-opened.