Bug 1096093 - [e10s] Scrollbar missing when e10s enabled
Tracy was able to reproduce this after a Nightly update installed itself and he restarted.

His settings:
  1. External monitor (with Nightly on it)
  2. Mouse plugged in
  3. "System Preferences" > "General" "Show scroll bars" is set at "Show scroll bars Automatically"

I have STR!

I was finally able to reproduce this. My STR (likely not minimal, but it's a start):

0) Make sure Firefox is set up to restore your tabs / windows from last time when starting up
1) Open a single tab in Firefox that is long (requires a vertical scrollbar). The root document needs to be long - no CSS trickery (so a dxr page is no good. This Bugzilla bug is probably a good page to load, actually).
2) Make sure devtools.chrome.enabled and devtools.debugger.remote-enabled are set to true
3) Open up the Scratchpad from DevTools, and in the titlebar, set Environment to Browser
4) Paste the script I've attached to this bug ("restart.js") into the Scratchpad, and click "Run"

The browser should restart and load the test page. Every now and then, the vertical scrollbar is missing.

Should look at DrawScrollbar method in nsNativeThemeCocoa. In the _parent_ process, because the content process knows nothing about scrolling or anything.

Huh. We never seem to enter DrawScrollbar. What the hell?


So the left is the bug, the right is what we should be drawing. Notice that there's not even the "track" of the scrollbar back there. It's like Gecko is prepared to show the scrollbar, but just doesn't draw it. And this is the non-overlay scrollbar, obviously.

17:17 (spohl) mconley: pong
17:18 (mconley) spohl: hi! Investigating bug 1096093, which I've just managed to reproduce
17:18 (mconley) on the left is the bug
17:18 (mconley) the right is what's supposed to be rendered
17:18 (mconley) it looks like we're not drawing the non-overlay scrollbar at all in the buggy case. No track, no scrolly-nubbin, nothing
17:18 (mconley) spohl: got any suggestions about where I should to figure out why that might be happening?
17:21 (spohl) mconley: hmm, hard to say, but here is the bug (with patch) of the initial rollout of overlay scrollbars: https://bugzilla.mozilla.org/show_bug.cgi?id=636564
17:21 (spohl) mconley: I guess that could give you an idea of the code we touched
17:22 (spohl) mconley: is there something in particular that you do to reproduce the problem?
17:22 (mconley) spohl: yeah, it's kinda wacky. https://bugzilla.mozilla.org/show_bug.cgi?id=1096093#c23
17:23 (spohl) lol, you're kidding, right?
17:24 (mconley) spohl: I wish. :)
17:24 (spohl) mconley: and this is e10s only?
17:24 (mconley) spohl: yeah
17:24 (mconley) so, full of wtf
17:25 (spohl) mconley: I seem to recall bugmail about invalid drawables, which could obviously cause us to not draw the scrollbars
17:25 (spohl) mconley: but honestly, I know nothing about that…
17:25 (spohl) sorry… :-/
17:26 (mconley) spohl: no worries.
17:26 mconley applies latex gloves

Welp, shit. Now I'm learning about overlay scrollbars and how that worked.

Ohhhhh… ok, now I see why I might not have entered DrawScrollbar. DrawScrollbar looks really really old. I don't even know if it's used anymore.

I think I want the part of nsNativeThemeCocoa.mm that calls RenderWithCoreUILegacy? NS_THEME_SCROLLBAR_TRACK_VERTICAL maybe?

Hypothesis: I cannot replicate this behaviour if I have MOZ_DEBUG_CHILD_PROCESS=1.
FALSE. I was able to reproduce this. Even stranger, it looks like the scroll bar went away after being around for a while!

Bugzilla seems to be the place where the scrollbars really disappear… why is that?

Ooooh - what's really interesting is that in the buggy case, where I still have a scrollbar, and then I go back to Bugzilla, and the scrollbar is gone… the scrollbar seems to be in a weird overlay mode where it disappears unless the window is scrolling.

So overlay really seems to be involved here. Which means.. *sigh*, I guess I'd really better dig into that overlay scrollbar patch to see what it does.

Welp.

Here we go.

So the patch adds a -moz-overlay-scrollbars GkAtom...

There's this ScrollbarActivity thing that looks pretty important, and it implements nsIDOMEventListener. According to ScrollbarActivity.h:

/**
* ScrollbarActivity
*
* This class manages scrollbar behavior that imitates the native Mac OS X
* Lion overlay scrollbar behavior: Scrollbars are only shown while "scrollbar
* activity" occurs, and they're hidden with a fade animation after a short
* delay.
*
* Scrollbar activity has these states:
*  - inactive:
*      Scrollbars are hidden.
*  - ongoing activity:
*      Scrollbars are visible and being operated on in some way, for example
*      because they're hovered or pressed.
*  - active, but waiting for fade out
*      Scrollbars are still completely visible but are about to fade away.
*  - fading out
*      Scrollbars are subject to a fade-out animation.
*
* Initial scrollbar activity needs to be reported by the scrollbar holder that
* owns the ScrollbarActivity instance. This needs to happen via a call to
* ActivityOccurred(), for example when the current scroll position or the size
* of the scroll area changes.
*
* As soon as scrollbars are visible, the ScrollbarActivity class manages the
* rest of the activity behavior: It ensures that mouse motions inside the
* scroll area keep the scrollbars visible, and that scrollbars don't fade away
* while they're being hovered / dragged. It also sets a sticky hover attribute
* on the most recently hovered scrollbar.
*
* ScrollbarActivity falls into hibernation after the scrollbars have faded
* out. It only starts acting after the next call to ActivityOccurred() /
* ActivityStarted().
*/

Hm. So you'd think in the case where we have perma-scrollbars, that ScrollbarActivity wouldn't be necessary. Like, we just wouldn't need it. Right? So who is in charge of ScrollbarActivity?

Hypothesis: In the bad case, we're still using ScrollbarActivity when we're not supposed to.

That… doesn't seem to be the case, as far as I can tell. I don't see a construction of ScrollbarActivity. :/

I should put some logging in just to be sure.

It looks like both the child and the parent processes query nsLookAndFeel.mm to find out if we're using overlay scrollbars. And according to the child, the answer is always "true", and for the parent, the answer is "false".

What happens if I force the value to always be false?

So… forcing that value to be false seems to save the day here. The bug is gone.

I wonder why?

And what if we always return true?

Ok, still can't reproduce. And what if I feed one value to the content process, and another one to the parent process?

Another observation - I can easily get into the state where the initial browser has the overlay scrollbar when it shouldn't - subsequent tabs get no scrollbars!

I do think that relationship is important - the missing scrollbar manifests if multiple tabs are restored, and the initial tab gets the overlay scroll state.

ScrollbarActivity came out of bug 778810 , and was a B2G bug for making it possible to have overlay scrollbars in FxOS. This was before the Lion stuff came along - spohl's Lion scrollbar work came about and built on top of this.

Bug 778810 really just adds this ScrollbarActivity thing, and makes it manage a timer of 450 milliseconds that gets reset if there is any scrollbar activity. After the timeout is hit, the active attribute is removed from the vertical and horizontal scrollbars. As soon as there's more activity, it re-adds the active attribute. The active attribute is for visibility and effects.

So there are two things to consider - the styling of the scrollbars, and whether or not they can overlay content. I think it's the latter which is busted.

The scrollbar work that mstange (and spohl) worked on does fancy tracking of mouse events and scheduling of opacity, but that's about it.

Ah… interesting, so it looks like I do hit ScrollbarActivity::ActivityOccurred in the content process. So the content process is under the impression that we should be using overlay scrollbars. Huh.

What happens if the content process always thinks that we're _not_ using the overlay scrollbars?

That… seems to solve it? In a debug build anyway. Let's try on an opt build.

Yes, this also seems to solve it. Now the question is… is this the right solution?

For a bunch of time, eIntID_UseOverlayScrollbars is returning true, and then suddenly it returns false. That looks suspicious.

How does NSScroller work? Ah, that's something we bring in from Cocoa.h

bool nsLookAndFeel :: SystemWantsOverlayScrollbars()
{
return ([NSScroller respondsToSelector : @selector (preferredScrollerStyle)] &&
[NSScroller preferredScrollerStyle]
== mozNSScrollerStyleOverlay);
}

Hypothesis - the child process is starting up, and doesn't yet have up-to-date information about the scrollbar preference. It is not set up to listen for scrollbar preference changes. We do, however, notify the child when the parent process notices the preference changes. So the problem is occurring when the content process starts up after the parent process already knows the preference, but before the child process figures it out.

FALSE . Or at least partially false. ScrollbarActivity is constantly querying the OS for the LookAndFeel metric on overlay scrollbars, and if it doesn't match expectation, it's supposed to trigger a theme refresh on the PresContext.

WAT.

Ohhh, interesting…

void
ScrollFrameHelper
:: HandleScrollbarStyleSwitching()
{
// Check if we switched between scrollbar styles.
if (mScrollbarActivity &&
LookAndFeel
:: GetInt(LookAndFeel :: eIntID_UseOverlayScrollbars) == 0 ) {
mScrollbarActivity
-> Destroy();
mScrollbarActivity
= nullptr;
mOuter
-> PresContext() -> ThemeChanged();
}
else if ( ! mScrollbarActivity &&
LookAndFeel
:: GetInt(LookAndFeel :: eIntID_UseOverlayScrollbars) != 0 ) {
mScrollbarActivity
= new ScrollbarActivity(do_QueryFrame(mOuter));
mOuter
-> PresContext() -> ThemeChanged();
}
}

This thing is called everytime a scrollframe is reflowed… in the bad case, we satisfy neither of these conditions. mScrollbarActivity is nullptr, and LookAndFeel::GetInt(LookAndFeel::eIntID_UseOverlayScrollbars) is 0.

So what the hell is going on? What influences that white strip? Is it because we're waiting for a ThemeChanged that never comes?

I… I guess so.

So let's take a look at this again, in the single process case, and in the multi-process case. Let's set a breakpoint in nsPresContext::ThemeChanged in the single process case on start-up and see how we get there.

So, good to know - when a mouse is plugged in or unplugged, ChildView gets that information via scrollbarSystemMetricChanged.

It looks like even in the single-process case, the scrollbar LookAndFeel metric returns 1 for UseOverlayScrollbars for a while, and then false.

Hypothesis - I suspect that in the single-process case, we'll have scrollbarSystemMetricChanged on us...

TRUE .

So, conclusion: In single-process Firefox, we construct the DOM, and meanwhile, the operating system is asynchronously figuring out about the preferred scrollbar state. But we initially assume overlay scrollbars. As soon as the change occurs, nsChildView.mm hears about it in scrollbarSystemMetricChanged, and we do a NotifyThemeChanged to queue a ThemeChange flush. And meanwhile, the LookAndFeel for UseOverlayScrollbars has changed for everybody reading it.

In the multi-process case, the UseOverlayScrollbars metric changes, but the child never hears about the scrollbarSystemMetricChanged event. So it never does a ThemeChanged on the nsPresContext.

Solution brainstorming time.

  1. Have nsLookAndFeel for OS X in the content process get initial value not from NSScroller, but by messaging the parent, and caching the value. That way the child and parent are always in sync. When the parent hears that the theme has changed, send a message down to the child to flush. That would also potentially fix the highlight colours bug that mstange noticed.
  2. Have something in the content process also listen for the NSPreferredScrollerStyleDidChangeNotification notification and flush the theme accordingly. I can't think of a really great place to do that right now though. Also, seems break-out-of-sandbox-y?
  3. Variation on (1) - have the parent send down the initial overlay scrollbar value, which the content process will cache until it hears that it should flush from the parent.

If I do (1), maybe I can hitch the value onto the GetXPCOMProcessAttributes PContent message that ContentChild synchronously asks for when initting XPCOM.

Suppose I do that… I think I need a way for nsLookAndFeel.mm to know that we're feeding it a value that it should cache instead of reading the NSScroller information. Huh. Thinking…

Assumption: It's not necessary for nsLookAndFeel.mm to keep polling NSScroller for the preferred scrollbar style in the parent process, since we know we'll get a NSPreferredScrollerStyleDidChangeNotification when it changes. So read it once and cache it once.

So what we'll do is memoize it the first time in the parent process, and invalidate once we get NSPreferredScrollerStyleDidChangeNotification fired. Add methods to nsLookAndFeel.mm so that we can update it with the right value.

Then, in nsLookAndFeel.mm, when reading the overlay scrollbar int metric, instead of asking NSScroller, have the content process ask the ContentChild for the value. The ContentChild should have this information stashed in it from when it got initted.

When the parent sees NSPreferredScrollerStyleDidChangeNotification, LookAndFeel's will eventually hear about it. Maybe nsXPLookAndFeel.cpp can fire an Observer notification after LookAndFeel::Refresh that ContentParent's can listen for and message their children about.

ContentChild hears that LookAndFeel was updated.

Note that PuppetWidget doesn't have an mWidgetListener, so it…uh… we'd have to find some other way of updating the UI...

Q: Suppose I have 3 non-remote tabs - simple, with a single vertical scrollbar. When I change scrollbar stuff, how many times is nsPresContext::ThemeChanged called?

A: 4 times. One might be for the main window, and the other three for the content windows.

So I suspect every tab needs to know.

So have the ContentParent send a message down the ContentChild. The ContentChild manages TabChild's, so use:

PManager* PFoo::Manager()
const InfallibleTArray<PManagee*> PFoo::ManagedPManagee();
void PFoo::ManagedPManagee(InfallibleTArray<PManagee*>&);

And access each TabChild, telling each to call ThemeChanged on their PresContext's. Oh, maybe instead of using that snippet above, I use nsContentUtils::CallOnAllRemoteChildren...

Let's try to boil that down.

ContentChild gets initial overlay scrollbar metric on init, and tells nsLookAndFeel about it. We need to do that in order for the processes to stay in sync.

I need a way to poke that value into nsLookAndFeel. Or have nsLookAndFeel query ContentChild about it.

3:10 PM < mconley > jimm: hey, so I've got a plan for bug 1096093, and I want to run it by a widget peer
3:10 PM < firebot > https://bugzil.la/1096093 — ASSIGNED, mconley@mozilla.com — [e10s] Scrollbar missing when e10s enabled
3:11 PM jimm reads the bug
3:12 PM < mconley > jimm: what I propose is that nsLookAndFeel.mm should start caching the overlay scrollbar metric, and that ContentParent should send down an initial seed value of that metric to ContentChild. ContentChild sets that seed value in nsLookAndFeel.
3:12 PM < mconley > that way they're in sync with one another
3:12 PM chmanchester → chmanchester|afk
3:13 PM < mconley > then, when LookAndFeel::Refresh is called in the parent if the overlay scrollbar value is changed, we send a message down to all remote frames saying so
3:14 PM < mconley > wait
3:14 PM < mconley > I've already noticed a problem with this plan.
3:14 PM < mconley > jimm: lemme refine this - I'll get back to you
3:14 PM < jimm > this sounds like a general problem in that nsLookAndFeel in the child process isn't getting notifiedabout theme changed events, so its cache never updates?
3:15 PM < mconley > jimm: correct - but the added bonus is that for Cocoa, the overlay scrollbar metric that we query the OS for is async
3:15 PM < jimm > nice
3:15 PM < mconley > and that two separate processes might get different values for that metric
3:15 PM < mconley > and it's up to those processes to register notifications to hear about when the metric is updated
3:15 PM myrdd quit ( quassel@moz-3qofuv.access.ecotel.net ) Ping timeout: 121 seconds
3:16 PM < mconley > but we have no nice place to register for those sorts of notifications in the content process (I'm not even sure we'd want to - that might violate sandbox rules?)
3:16 PM Gijs → Gijs_away
3:16 PM < Mossop > Seems like the child process should never be asking the OS and just always getting it from the parent process
3:17 PM jchen → jchen|away
3:17 PM < jimm > I really hate the fact that we have two modules for everything
3:17 PM < jimm > one in each process
3:17 PM mconley nods
3:17 PM < jimm > we should have just one, and the other be a proxy
3:18 PM < mconley > Mossop: yes - but various parts of graphics / layout asks for that metric a lot, so we'd need to cache it and then eventually invalidate it when the value is updated
3:18 PM < jimm > mconley: sounds like certain values in nsLookAndFeel need to get forward
3:18 PM < jimm > from parent to child
3:18 PM < Mossop > mconley: Sure
3:20 PM < mconley > jimm: do you have any suggestions about how I should modify nsLookAndFeel to ask the parent (and cache) when running in the content process? Like, do we really want to send an individual message up to the parent for the first time each metric is queried for?
3:20 PM < mconley > Or do we want to send all of the current metrics down to the child when we initialize the ContentChild?
3:21 PM < jimm > I would build something into nsLookAndFeel that handles syncing certain values between the two.. and I would make that ipc interface async
3:21 PM < jimm > and flush when the parent picks up a change
3:21 PM < jimm > mconsley: would that work?
3:22 PM < jimm > mconley
3:22 PM < mconley > jimm: how do we make the interface async? nsLookAndFeel querying is very much synchronous...
3:23 PM < jimm > you mentioned that content polls repeatedly for this overlay setting..
3:23 PM < mconley > yes, content is constantly asking for immediate results for the overlay setting.
3:24 PM < jimm > which means updating it in the content process doesn't have to happen right when the content process asks for the value the first time
3:24 PM < Mossop > You said that querying the OS for the metric is async? So instead of querying the OS query the parent the first time
3:24 PM < jimm > I guess I'm saying that the paretn laf module listens for changes..
3:24 PM mconley thinks
3:24 PM < mconley > I guess I'm trying to make this solution too generic
3:24 PM < mconley > but really, I think the overlay scrollbar thing is really special
3:25 PM < mconley > because I can't find many other examples of async OS look and feel metrics
3:25 PM < mconley > jimm: would you agree with that?
3:25 PM < mconley > am I correct in saying that, for the most part, LookAndFeel metrics synchronously gather information from the underlying OS? And this scrollbar one is an outlier?
3:26 PM < jimm > are there other values in nsLookAndFeel we need to sync or is this the only value that isn't correct in the content process?
3:26 PM < jimm > yes I beleive so
3:26 PM jimm looks at the header
3:26 PM < mconley > jimm: we definitely need to tell the content process when to flush its look and feel cache
3:26 PM < mconley > for things like colours

Let nsLookAndFeel.mm cache the eIntID_UseOverlayScrollbars and eIntID_AllowOverlayScrollbarsOverlap values.
Create LookAndFeelIntCache struct, mapping IDs to values.
Add nsLookAndFeel::SeedIntCache(LookAndFeelIntCache); that will populate the cache that I created two steps above.
Add nsLookAndFeel::GetIntCache(), so that nsLookAndFeel is responsible for serializing the stuff to send down
Make sure GetIntCache signature makes sense for getting an InfallibleTArray.
Rename SeedIntCache to SetIntCache
Have ContentChild query the parent for the LookAndFeelIntCache when initting XPCOM stuff.
Have ContentParent call nsLookAndFeel::GetIntCache, and then send the resulting cache down to the child.
Have the ContentChild call nsLookAndFeel::SetIntCache
Have PresContext tell every TabParent when the ThemeChanges.
Have TabChild call ThemeChanged on each PresShell.

Figure out why the overlay scrollbar is scaled like crazy...


It's scaled completely wrong… why?

What controls that?

Looking at layout/xul/nsScrollbarFrame.cpp right now...

Wait… a big clue - with e10s disabled, all overlay scrollbars look like that. What happens if I remove my caching stuff?

wtf - that doesn't fix it… something weird is going on… I backed out most of the stuff under cocoa, and still no fix.

Hypothesis - I didn't introduce this problem. It's a problem from inbound.

TRUE. None of my patches introduced this problem. \o/ I just had an unlucky pull.

What about botond's patch? Nope, not botond's patch. Wtf.

Well, let's just ignore it for now. If it ends up merging to central, well, I guess I can bisect there. But for now, I need botond's patch.

Send over the cache (again) when the theme is refreshed. In the parent, clear the cache.
Find the changeset that introduced the massive overlay scrollbar problem - somebody already filed this bug. Bug 1156918.

Cleanup:
Move LookAndFeelInt IPDL struct into widget somehow?

Patch posted!

Got review! Try push for good measure: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5d612c989c5

Bah - orange.

5:45 PM < mrbkap > mconley: So, I bet your patch is causing us to try to get the gtk theme or something.
5:45 PM < mrbkap > mconley: but in xpcshell we probably don't initialize it (or can't since there's no display)
5:46 PM mconley thinks
5:46 PM < mrbkap > mconley: so that'd cause GetXPCOMProcessAttributes to fail and the child process wouldn't start up properly.
5:46 PM < mconley > that sounds very plausible
5:46 PM < mconley > hrm
5:47 PM < mconley > mrbkap: any suggestions on ways around that?
5:47 PM < mconley > assuming it is the case?
5:47 PM < mrbkap > mconley: Let me run with the patch to at least get a stack to where it's happening.
5:47 PM < mrbkap > mconley: This should be easy to work around.
6:56 PM < mrbkap > mconley: ok
6:56 PM < mrbkap > mconley: I have your patch in a debugger.
6:56 PM < mconley > excellent
6:57 PM < mrbkap > mconley: things fall apart here: https://pastebin.mozilla.org/8831102
6:57 PM < mrbkap > mconley: You might have to talk to karlt to figure out how to fix this.
6:58 PM < mconley > mrbkap: ok, sounds good. Thanks so much!
6:58 PM < mrbkap > mconley: We didn't use to call into the look and feel code at all in xpcshell (where there's no display, etc.) but now we do.
6:58 PM < mrbkap > mconley: and the gtk code doesn't deal well with it.
6:58 PM < mconley > understood - yeah, I can see how this could break things.
7:00 PM < mrbkap > mconley: would you mind sneaking https://pastebin.mozilla.org/8831104 into the patch to fix your orange?
7:00 PM < mrbkap > mconley: I had to write it in order to get to the new error.


Here's the stack:

at /build/buildd/gtk+2.0-2.24.23/gtk/gtkinvisible.c:319
#10 0x00007fffeb4379de in ?? () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#11 0x00007fffeb43910d in g_object_newv () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#12 0x00007fffeb4398bc in g_object_new () from /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0
#13 0x00007ffff246d48a in nsLookAndFeel::Init (this=0x7fffe164eff0)
at /home/mrbkap/work/main/mozilla/widget/gtk/nsLookAndFeel.cpp:906
#14 0x00007ffff246b76c in nsLookAndFeel::nsLookAndFeel (this=0x7fffe164eff0)
at /home/mrbkap/work/main/mozilla/widget/gtk/nsLookAndFeel.cpp:49
#15 0x00007ffff2432274 in nsXPLookAndFeel::GetInstance ()
at /home/mrbkap/work/main/mozilla/widget/nsXPLookAndFeel.cpp:256
#16 0x00007ffff2433417 in mozilla::LookAndFeel::GetIntCache ()
at /home/mrbkap/work/main/mozilla/widget/nsXPLookAndFeel.cpp:777
#17 0x00007ffff2113c9f in mozilla::dom::ContentParent::RecvGetXPCOMProcessAttributes (this=
0x7fffe1472800, aIsOffline=0x7fffffff96ac, aIsLangRTL=0x7fffffff9750,
dictionaries=0x7fffffff98d0, clipboardCaps=0x7fffffff9770, domainPolicy=0x7fffffff9900,
lookAndFeelIntCache=0x7fffffff98e0)

Simplest approach might be to send a sync message up to the parent when the LookAndFeel is first Init'd.

Send sync message up to parent to get LookAndFeel int cache when initted.

Ok, let's TRY again! https://treeherder.mozilla.org/#/jobs?repo=try&revision=4e057bc88610

Fuuuuuuuuck - burned the tree for B2G and emulator stuff. :( . I think I'm missing a header import of nsTArray.h.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f87eaa8bf913

File a bug to flush the LookAndFeel cache in the content process when it is flushed in the parent (otherwise, things like highlight colours become invalidated) This is actually getting fixed with my patch.
File a bug for mrbkap's patch with nsBidiKeyboard.cpp. Bug 1158791
File a follow-up to evaluate jimm's suggestion for moving the look and feel caching into nsXPLookAndFeel. Bug 1158793
File a follow-up where we don't get the inline scrollbars when starting up if mouse plugged in. Bug 1158798