Bug 1193838 - Make it possible to gather profiles from processes that have exited
It looks like the tps profiles aren't being gathered because the content process exits before we send a message down to gather the profiles.

General idea:

  • Tie lifetime of ProfileGatherer in the parent to the window of time in which we're profiling.
  • If a ContentParent starts up while we're already profiling, grab a reference to the ProfileGatherer
  • If a ContentParent notices that we've stopped profiling, have it drop the reference to ProfileGatherer.
  • If a ContentParent notices that we've started profiling, have it grab a reference to the ProfileGatherer.
  • If a ContentChild is shutting down and we're profiling, blast up a profile to the parent, and have ContentParent notify the ProfileGatherer.
  • Extend ProfileGatherer so that it can accept an array of ended-process profiles. Have a 5 profile cap on this.
  • When the actual gathering starts, have ProfileGatherer dump out the ended-process profiles as well as the ones for already-existing profiles, and then drop the ended-process profiles.
    • Alternatively, smartly drop any profiles that are outside of our sampling window.


Make ProfileGatherer be alive during the profiling window

Currently, ProfileGatherer is held onto by GeckoSampler. There's no direct access to GeckoSampler.

Have GeckoSampler instantiate and hold a ref-counted reference to a ProfileGatherer on construction.

Expose ProfileGatherer as an nsISupports on nsIProfiler
Make new ContentParents and PluginModuleParents get a reference to the ProfileGatherer on instantiation.

GAHHHH - I feel like I've got drywall spackle all over the place. This feels sloppy once placed on top of the "start profiling if the parent is profiling" stuff, because I've got all of these gatherers getting assigned in a shit tonne of places.

Okay, focus FOCUS - how should this work?

The profiler is either off, or on. There is no third state. A process comes in and out of existence. The first can occur at any time. I need to make sure that a ContentParent / ContentChild pair are ready and able to profile (and the parent ready to gather) any time the parent is told to profile, and a content process is around (or comes around).

So how should it work?

How many states are there? I think there are 16:

Profiler is off, content process does not exist

No problem, run like normal.

Profiler is turned on, content process does not exist

Profiler goes into active state. That is all.

Profiler is on, content process does not exist

Profiler is in active state. That is all.

Profiler is turned off, content process does not exist

Profiler is put into inactive state, that is all.

Profiler is off, content process starting up

Content process starts up normally.

Profiler is turned on, content process starting up

I think what should happen is that the ContentParent should init, realize we're profiling, and inform the child to start profiling with the right settings. This is not what we're currently doing in my patch set, so I've got some re-work there.

At this point, the ContentParent should also get and hold a reference to a ProfileGatherer.

Profiler is turned off, content process starting up

If we ever had a reference to the ProfileGatherer, we should drop it. We tell the content process to stop profiling, regardless of whether or not we'd originally told it to profile.

Profiler is off, content process starting up

Content process starts normally.

Profiler is off, content process is running

Content process running normally.

Profiler is turned on, content process is running

Content process is informed to start profiling.

Profiler is on, content process is running

Content process is running, should be profiling.

Profiler is turned off, content process is running

Content process should be profiling. Parent tells child to stop. Gatherer is dropped in ContentParent.

Profiler is off, content process is shutting down

Content process should shut down normally.

Profiler is turned on, content process is shutting down

ContentParent might get a reference to ProfileGatherer, but once the ContentParent actor goes away, we'll, we're done here. It's possible that the ContentChild will send up a profile before the parent completely disappears.

Profiler is on, content process is shutting down

ContentChild sends up a profile to the parent which gets put onto the ProfileGatherer.

Profiler is turned off, content process is shutting down

ProfileGatherer is dropped by ContentParent. Any profiles received from the child are ignored.

Ok, so course correction:

Modify my fix for 1103094 so that instead of sending things with the XPCOM Attributes message, we have the ContentParent notice during Init that we're profiling, and send an async message down.
Rebase this patch on top of fix for 1103094, and set a member variable to point to the ProfileGatherer. Same goes for PluginModuleParent.
Abstract the method for sending the async message and setting the member for ProfileGatherer. Hit it not only when profiler-started is seen, but also on ContentParent::Init if we notice we're profiling.
Make the above work for PluginModuleParent as well
On profile-stopped have ContentParent and PluginModuleParent drop the reference to ProfileGatherer
Modify ProfileGatherer to hold onto 5 profiles from dying subprocesses
Modify ProfileGatherer to accept profiles from dying subprocesses
Modify ProfileGatherer so that when gathering finishes, we can then sew together all of the dead-process profiles in as well as the other subprocess profiles. (This might involve registering the gatherer as an nsIObserver for the profile-subprocess observer notification)

Current problem: ProfileGatherer::Observe is not being fired. Why not?

Hypothesis 1: The topic is not being fired.

Test : Find another profiler-subprocess topic observer, and see if it gets handled.

Result : It is - this hypothesis is FALSE .

Hypothesis 2 : The topic observer is not being registered properly.

Test : Set a breakpoint where we set the observer, make sure the return code is NS_OK.

I suspect this is actually the case - I wasn't able to inspect the return value, but I'm pretty sure I'm passing the wrong third argument to AddObserver because ProfileGatherer does not support being held with a weak reference.

Let's set that third argument to false then, and see if that helps.

YES IT DOES HELP. Things appear to be working!

GAHHHHHHHH. Why aren't we getting our content process profiles when running with talos? What the dealio?

So the problem with gathering talos profiles from tps is that the tps test causes finishTest to run too early. This means that the profile that we dump out is for loading the test.html document, and nothing else! That's no good.

How does TART do this? TART uses the same pageloader mechanism to get started up, but I guess it doesn't cause finishTest to be called until its ready… how is that so?

It looks like we're profiling at the wrong times and then attempting to gather the profile at the wrong times. And that sucks.

This needs more investigation.

3:06 PM < mconley > avih: so TART is one of those tests that pageloader loads up, but doesn't actually measure or care about the page it loads. Like, it's loading that tart.html page, but it's not considering that "done" and finishing the test when it loads
3:07 PM < mconley > because then TART does its thing
3:07 PM < mconley > and TART tells pageloader when it's done, right?
3:07 PM < avih > yes
3:07 PM < mconley > avih: how do you tell pageloader not to consider the test done when the initial page loads?
3:07 PM < mconley > avih: and how do you tell pageloader that you're done?
3:07 PM < avih > in page loader terminology, it's called "the page does its own measurement"
3:08 PM < avih > mconley: 1. at the manifest you prefix the pages with '%'
3:09 PM < avih > 2. you call a function (tpRecordTime) with the results, where the pageloader addon injects this function into the content such that the content can call it without any special privileges (even if TART specifically does have privs since it's an addon, but the same system also happens with tscrollx and tsvgx which are non privileged pages which still use this system)
3:10 PM < avih > mconley: adding '%' at the page URL at the manifest tells the pageloader addon "don't measure load time, instead wait for tpRecordTime call with the results"
3:10 PM < mconley > avih: ah, I see
3:10 PM < mconley > avih: thanks!
3:10 PM < avih > sure :)

Consider increasing ContentChild shutdown timeout if parent process notices we're profiling
Consider not holding a reference to ProfileGatherer in PluginModuleChromeParent and ContentParent, and just query nsIProfile for the ProfileGatherer every time.

From BenWa:

" https://reviewboard.mozilla.org/r/16277/#review23801

::: tools/profiler/core/GeckoSampler.cpp:251
(Diff revision 7)
> +  mGatherer = new mozilla::ProfileGatherer(this);

This grabs a pointer to this non-reference counted and the ProfileGatherer can live longer than the GeckoSampler.

We've talked about a few solutions to this in person. We should fix this before landing this patch."

SHIT. He’s right. What should the lifetime of a GeckoSampler be in relationship to a ProfileGatherer, and what occurs during the following scenario:

  1. User starts profiling
  2. User gathers async profile
  3. While profile is still profiling, user stops profiling
  4. User starts profiling again and gathers a new profile

During the above scenario, what happens to the gatherer and the sampler? I already know that the GeckoSampler gets de’ref’d on profiler shutdown… or something like that. Hum.

A process that is still in the midst of sending up a profile will not have heard the stop request before it has finished gathering the profile (since it does that synchronously within its own process).

How about some invariants:

  1. The Profiler can be started and stopped at any time - even during an async profile gathering
  2. A subprocess that is in the process of gathering a profile (whether its started to do so or not), cannot be told about the starting or stopping of the Profiler
  3. A parent actor for a subprocess in the above state should not handle starting or stopping of a Profiler
  4. The ProfileGatherer and GeckoSampler should keep each other alive until all subprocesses have had their profiles gathered.
  5. After a subprocess has finished gathering an async profile, it should check to see if we’re still profiling and act accordingly. (Make sure to see if there’s a new ProfilerGatherer to use next time.)

That sounds like good follow-up fodder. For now, let’s just reject the Promise and destroy the ProfileGatherer if we stop sampling while gathering profiles.

r+!

Try build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=27df9e5d68d2

And another: https://treeherder.mozilla.org/#/jobs?repo=try&revision=83d7ade77a01

Aaaaaand another: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0396face8d98

Follow-ups:
https://bugzilla.mozilla.org/show_bug.cgi?id=1193838#c35 - filed bug 1229441