Bug 1114299 - [e10s] link with target="_blank" opens window with no tabs/toolbars/menus
Ooooh - jst filed this one. Time for mconley to prove his worth!

I can easily reproduce this. Time to get a debug build and figure out wtf is going on.

In the non-e10s case, the chrome flags are 4094.

We eventually reach:

rv = windowCreator2->CreateChromeWindow2(parentChrome, chromeFlags,
contextFlags, uriToLoad,
aOpeningTab, &cancel,
getter_AddRefs(newChrome));

And the flags are still 4094.

Ah, in e10s mode, the flags are 1049806.

Why?

Ah… in non-e10s mode, we match nsIWebBrowserChrome::CHROME_ALL, but not in non-e10s mode… where does that get set?

Ahhhhhhh GROSSSS

So here's what's up:

The problem here appears that nsWindowWatcher::CalculateChromeFlags is not prepared to handle the case where only ",remote" or ",non-remote" are passed as features.

Note that when ",private" is passed in by itself, we have code at the start of CalculateChromeFlags to detect this, and automatically set chromeFlags to nsIWebBrowserChrome::CHROME_ALL, which gives the window all of the default features.

I figure we want to do something similar when aFeatures equals any of:

"private"
"remote"
"non-remote"
"private,remote"
"remote,private"
"private,non-remote"
"non-remote,private"

So I think we need to put some logic somewhere in CalculateChromeFlags to detect these cases. Kinda gross, but there you have it.

It feels really hacky to just add string comparison checks for all of those… and if the number of flags that act like this grows, the conditionals will just go out of control.

Gijs brings up a good point though - why do we even need "remote" and "non-remote"? As far as I can tell, we only want "remote" for tests… only want "non-remote" for the "Open non-e10s window" option, which will eventually go away. Hrm.

Welp, let's suppose we write a new method that checks for these one-off cases. How would it work?

We'll call it, "ImpliesAllChrome". Returns true if aFeatures contains "remote", "non-remote" and/or "private", but nothing else.

  1. Have collection of one-off flags. That'd be "remote", "non-remote" and "private".
  2. Separate by commas
  3. For each feature, check against collection. If doesn't match, return false.
  4. Return true.

That's probably the simplest way to go here, at least to my Javascript-y mind. I have no idea how simple this sort of string thing will be in C++. Knowing Gecko strings, not very.

Ah, nsCTR::strtok is probably what we want.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2b3f9e650a80

Check for green try build ( https://treeherder.mozilla.org/#/jobs?repo=try&revision=2b3f9e650a80 )
Investigate and address test failure: browser/base/content/test/general/browser_bug462673.js

So wtf is going on here? It looks like the test is failing for e10s because we're suddenly opening a new window instead of a new tab. What the hell?

In the e10s case, the chromeFlags become 1052670.

Looks like something different is occurring in nsWindowWatcher::GetWindowOpenLocation...

So this appears to be a special case where the window is an e10s window, but the browser is non-remote, and attempting to open a window. In this case, we calculate that the window being opened has CHROME_ALL, but also CHROME_REMOTE_WINDOW.

Then, afterwards, in nsWindowWatcher::GetWindowOpenLocation, we do this check:

if (restrictionPref == 2 &&
// Only continue if there are no size/position features and no special
// chrome flags.
(aChromeFlags != nsIWebBrowserChrome :: CHROME_ALL ||
aPositionSpecified
|| aSizeSpecified)) {
return nsIBrowserDOMWindow :: OPEN_NEWWINDOW;
}

And we're supposed to not satisfy any of those conditions - AKA, aPositionSpecified should be false (which it is), aSizeSpecified should be false (which it is), but aChromeFlags should be nsIWebBrowserChrome::CHROME_ALL, which it is not! It's not because it has CHROME_REMOTE_WINDOW as well.

So we want to check to see if aChromeFlags is not equal to CHROME_ALL, or CHROME_ALL/PRIVATE, or CHROME_ALL/REMOTE, etc. bleh.
000000111111111110 <-- all
Current:      110000111111111110
100000000000000000000 <-- remote
10000000000000000 <-- private
100000000000000000 <-- non-private

So why didn't this break before my patch?

Because the features string was empty, CHROME_ALL was set, and

Hypothesis :

Calling window.open with the private feature from a non-remote browser in an e10s window will result in a non-e10s window.

Yes, this is true, but only from chrome callers.

Ok, so the problem is that sometimes Gecko decides to attach the "remote" feature for windows that are requested from chrome, or from remote tabs. It'll attach it based on whether or not the default is to make the tabs remote or not in a new window.

So my solution is to ignore the REMOTE bit on the chrome flags when determining which location to open a window to. :/

https://treeherder.mozilla.org/#/jobs?repo=try&revision=32fcfd715655

Answer this question:

How come this test was failing with my patch, but not failing before? In other words, why is the change to the WindowLocation code necessary?

Answer:

The test opens a window without features, and so we shortcut to just setting the chrome flags to CHROME_ALL (and maybe setting the private mode flag on it too - though the test doesn't do that).

What this means is that in the e10s case, we actually open a non-e10s window during the test case, which isn't great. We should always open e10s windows when e10s is the default, and choose during runtime if the URL is blacklisted or not.

With this new patch, we "do it up right", and mark the CHROME_ALL flag, but also the REMOTE flag.

This means that the GetWindowOpenLocation in the parent process goes wrong when we attempt to open a window from a non-remote browser in an e10s window - the window has the REMOTE flag on it!

So now we ignore that flag in the GetWindowOpenLocation calculation.

Ugh. That's actually kind of ugly.

Isn't there a better solution here?

Question: Why aren't we opening a non-e10s window without the patch? How does Gecko know to open a remote window?

It looks like in nsAppShellService::JustCreateTopWindow, after looking at the CHROME_REMOTE_WINDOW flag, we _also_ look at the load context of the parent window. Even if CHROME_REMOTE_WINDOW is not set, we'll still look at the load context of the parent window, and copy it.

What the fuck?

Question: Can e10s windows create non-e10s windows? How?

Yes, we can. Oh what the fuck - so we've got these two checks going on - one in nsAppShellService::JustCreateTopWindow, and one in nsWindowWatcher::OpenWindowInternal… and both do the job of calling SetPrivateBrowsing, SetRemoteTabs. But for the remote case, one defaults to inheriting from the parent (if a parent exists), whereas the other explicitly checks the chrome flags. And the second overrides the first.  why? WHY???

WHAT THE ACTUAL FUCK

Ok, let's calm down. Time to switch to decaf. One of the reasons for the one setting, and then the override is: sensible defaults.

But I do think there's something strange going on in here that we can be smarter about.

Wait a second.

The case that we're failing in nsWindowWatcher::GetWindowOpenLocation is only for when the new nsIDOMWindow is being opened from a JS caller… and when content opens the nsIDOMWindow, it shouldn't be able to specify remote/non-remote - we should default to the user preference.

So why are we still setting it?

Ah - because in the test, the page is a chrome:// URL, meaning it's a chrome caller! AND a caller from JS! GOTCHA.

So, to sum up the problem:

For chrome-privledged JS callers of nsWindowWatcher::OpenWindowInternal, use of private, non-private or remote will result in particular 1's being set on the higher bits of the chrome flags. Those higher 1's are not necessary in the GetWindowOpenLocation check, so we should truncate them.

Wait… wait wait wait.

In jst's test case, the one I'm _actually_ trying to fix, isChromeCaller is false… and so we're not actually getting the window chrome we want. How does this work in single-process?

Oh what the fuck. In the single-process case, clicking on jst's link results in isCallerChrome being true. WTAF???

Hypothesis:

When opening a window via window.open, isCallerChrome is false. When opening via a link click, isCallerChrome is true.

Yes, this appears to be true.

So what the actual fuck are we going to do here???

Thinking time...

There are several layers of problems here. First of all, CalculateChromeFlags is behaving differently between e10s and non-e10s windows. Specifically, in single process Firefox, CalculateChromeFlags makes distinctions between windows that are being opened via JS in content, and windows opened via something like a link click.

Multi-process is different - it only distinguishes between windows opened between content and chrome. It doesn't have the fine distinction between the types of ways windows can be opened from content (via js, or via link click).

Is that stopping me here? Or is that a separate problem?

GARHALKJHL:AEKJTL:KAEJLT:KJS:LK:LSDKFJ:LSKDJF :LSKERKJ ER:LCM SKLFJ

I hate this bug. Hate hate hate hate this bug.

Question: if the content process is opening a tab… can't we just let CalculateChromeFlags do its thing without needing to pass ,remote in the feature list?

Huh… I removed the ,remote suffix, and jst's testcase still failed. Why is that?

Ah, because in the child process, we compute that the window should be opened remotely anyways, so we modify the chrome flags coming up.

FOCUS. I need to figure out how to fix the root bug, and not get caught up in trying to make all of opening windows sane.

Two things:

  1. I need to make sure CHROME_ALL is applied even if the remote feature is supplied
  2. I need to make sure content can communicate a distinction between links opened from script, and links opened by clicking on links

Luckily, the only caller of CalculateChromeFlags (nsWindowWatcher::OpenWindowInternal) knows if the page was called from JS.

Remove useless ,remote thing that TabChild adds
Pass aCalledFromJS to CalculateChromeFlags
Make it clearer what we're using isCallerChrome, aCalledFromJS and aOpenedFromRemoteTab to compute

Huh. Now on startup, the main window has no toolbars. :(

The chrome flags it calculates are 2147483649… and without my patch, it calculates 2148536318

1000001100010010011011 <-- What my patch causes us to generate for the first window
1000001100100010111000 <-- What we were generating before my patch

So I need to know what bits 1, 2, 6, 11 and 12 are. Those don't match.

My patch has:

1, 2, 11 erroneously ON.
6, 12 erroneously OFF.

Translation from bit positions:

1: CHROME_DEFAULT
2: CHROME_WINDOW_BORDERS
6: CHROME_TOOLBAR
11: CHROME_TITLEBAR
12: CHROME_EXTRA

Final tally:

CHROME_DEFAULT, CHROME_WINDOW_BORDERS, CHROME_TITLEBAR are erroneously ON
CHROME_TOOLBAR, CHROME_EXTRA are erroneously OFF.

Ok, why?

Find out why presenceFlag is false with the patch

Because I goofed. By removing the FeaturesImplyAllChrome thing, I failed to put back the check for "all", to set the presenceFlag.

Oh god. This code.

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=11677acbc35b

Bah. Test failures. :(

docshell/test/navigation/test_bug278916.html

FUUUUUUUUUUUUUUUUUUUUUUUUUUUUUCK I HATE THIS BUG.

Today, I'mma smash it. TODAY IS THE DAY. WE SMASH. TODAY. SMASH BUG SMASH BUG.

So, as I wrote earlier, I wonder if part of my problem is that I'm getting distracted by how bad all of this shit is, that I'm not keeping my eye on the prize. EYE ON THE PRIZE.

The prize is fixing the bug! The distractions are trying to fix this whole damn thing. The temptation is to just burn it to the fucking ground.

Fix the bug fix the bug

LET US SYNTHESIZE THE PROBLEM. YET AGAIN.

The problem is that when content in content processes uses window.open with no arguments, and the parent decides that user prefs dictate that we should open a new window, the window is not given the CHROME_ALL chromeFlag.

Fixing that fixes jst's bug.

We also need to be careful - the test in browser_bug462673.js was failing for e10s because a content script was opening a window, and we computed the chrome flags to include the CHROME_REMOTE flag, and not just be CHROME_ALL, which causes us to open the window as a real window instead of a tab.

What to do, what to do, what to do?

Think think think


Ok, let's see if we can break this down into some invariants:

  1. Content can never dictate remote, private, non-remote, non-private. Anything opened by content will use the defaults that CalculateChromeFlags computes based on the user preferences.
  2. Assert that any chrome flags from the child do not include the remote, non-remote, private, non-private flags.
  3. If the parent is opening either a tab or a window, set the chrome flags accordingly, with the default remote chrome flag. Assert that remote begets remote.
  4. Profit???

Oh lord, here we go.

Add assertions that remote, private, non-remote and non-private flags never come from the content process.

So in nsWindowWatcher::GetWindowOpenLocation...

I want to be false if the bottom 12 bits do not match the bottom 12 bits of nsIWebBrowserChrome::CHROME_ALL.

0000 0000 0000 0000 0000 1111 1111 1110 <-- CHROME_ALL
0000 0000 0001 0000 0000 1111 1111 1110 <-- want to match
0000 0000 0000 0000 0000 1101 1111 1110 <-- no match
0000 0000 0000 0000 0001 1111 1111 1110 <-- want to match

WebBrowserChrome Features that content can choose:

menubar, toolbar, location, personalbar, status, dependent, dialog, minimizable, resizable, scrollbars

WebBrowserChrome features that content CANNOT choose:

chrome, modal, titlebar, alwaysRaised, alwaysLowered, z-lock, close,

FUCK. I have tests that fail the assertions I was passing in. :/

Maybe this is good. Maybe I've actually found bugs here. Let's examine.

docshell/test/navigation/test_bug278916.html

Ok, tons of assertion failures. Why?

Well, it looks like there are a number of places where content tries to open a window, and we calculate chrome flags...

AH FUCK, FAILING TESTS AGAIN.

A bunch are intermittents, but several are not:

docshell/test/test_bug637644.html

102 INFO TEST-UNEXPECTED-FAIL | docshell/test/test_bug637644.html | Snapshot canvases are not the same size - comparing them makes no sense - expected PASS
103 INFO TEST-UNEXPECTED-FAIL | docshell/test/test_bug637644.html | Popups should look identical - expected PASS
104 INFO TEST-UNEXPECTED-FAIL | docshell/test/test_bug637644.html | Snapshot canvases are not the same size - comparing them makes no sense - expected PASS
105 INFO TEST-UNEXPECTED-FAIL | docshell/test/test_bug637644.html | Popups should not look identical - expected PASS

Ah - interesting. The windows in e10s are not the right size. Wtf?

Mn-e10s...

"Connection to Marionette server is lost. Check gecko.log (desktop firefox) or logcat (b2g) for errors."

Gecko log: http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/fc02969961c1c8ffa6a779a11790288c2458f650f3b8b9bd760d86bef88ae145745f7b02ae5f3d2af6d662784c36efc35a411d06f67f34c31dde70d8effd5d89

Oh for fucks sake, I apparently have to ENABLE_MARIONETTE in order to run marionette tests. ???

Ok, fixed both of these - I think both were related. I was incorrectly passing the features from the child. The IPDL was using nsString when it should have used nsCString, and in all of the conversions, I think I was doing something dumb, where the features were always blank when they got to the parent.

So we're passing!

GAHHHH - AND NOW WE'RE STILL NOT OPENING THE WINDOW CORRECTLY FOR JST'S TEST CASE.

BAHHHHH

Ah - it's because of the features thing. I still need to make it nullptr if the length is 0 in TabParent.

Ok, and we're good again.

Neeeeed a regression test.

10000 0000 0110 1100 1110
0000 0000 1111 1111 1110 <-- CHROME_ALL

110 1100 1110

Write some regression tests

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4027fd3c62ee

HOLY SHIT, we've landed and merged!