Bug 1193838 - Make it possible to gather profiles from processes that have exited
It looks like the tps profiles aren't being gathered because the
content process exits before we send a message down to gather the
profiles.
General idea:
- Tie lifetime of ProfileGatherer in the parent to the window of time in which we're profiling.
- If a ContentParent starts up while we're already profiling, grab a reference to the ProfileGatherer
- If a ContentParent notices that we've stopped profiling, have it drop the reference to ProfileGatherer.
- If a ContentParent notices that we've started profiling, have it grab a reference to the ProfileGatherer.
- If a ContentChild is shutting down and we're profiling, blast up a profile to the parent, and have ContentParent notify the ProfileGatherer.
- Extend ProfileGatherer so that it can accept an array of ended-process profiles. Have a 5 profile cap on this.
-
When the actual gathering starts, have ProfileGatherer dump out the
ended-process profiles as well as the ones for already-existing
profiles, and then drop the ended-process profiles.
- Alternatively, smartly drop any profiles that are outside of our sampling window.
-
Make ProfileGatherer be alive during the profiling window
Currently, ProfileGatherer is held onto by GeckoSampler. There's no direct access to GeckoSampler.
Have GeckoSampler instantiate and hold a ref-counted reference to a ProfileGatherer on construction.
Expose ProfileGatherer as an nsISupports on nsIProfiler
Make new ContentParents and PluginModuleParents get a reference to the ProfileGatherer on instantiation.
GAHHHH - I feel like I've got drywall spackle all over the place. This
feels sloppy once placed on top of the "start profiling if the parent is
profiling" stuff, because I've got all of these gatherers getting
assigned in a shit tonne of places.
Okay, focus FOCUS - how should this work?
The profiler is either off, or on. There is no third state. A process
comes in and out of existence. The first can occur at any time. I need
to make sure that a ContentParent / ContentChild pair are ready and able
to profile (and the parent ready to gather) any time the parent is told
to profile, and a content process is around (or comes around).
So how should it work?
How many states are there? I think there are 16:
Profiler is off, content process does not exist
No problem, run like normal.
Profiler is turned on, content process does not exist
Profiler goes into active state. That is all.
Profiler is on, content process does not exist
Profiler is in active state. That is all.
Profiler is turned off, content process does not exist
Profiler is put into inactive state, that is all.
Profiler is off, content process starting up
Content process starts up normally.
Profiler is turned on, content process starting up
I think what should happen is that the ContentParent should init,
realize we're profiling, and inform the child to start profiling with
the right settings. This is not what we're currently doing in my patch
set, so I've got some re-work there.
At this point, the ContentParent should also get and hold a reference to a ProfileGatherer.
Profiler is turned off, content process starting up
If we ever had a reference to the ProfileGatherer, we should drop it.
We tell the content process to stop profiling, regardless of whether or
not we'd originally told it to profile.
Profiler is off, content process starting up
Content process starts normally.
Profiler is off, content process is running
Content process running normally.
Profiler is turned on, content process is running
Content process is informed to start profiling.
Profiler is on, content process is running
Content process is running, should be profiling.
Profiler is turned off, content process is running
Content process should be profiling. Parent tells child to stop. Gatherer is dropped in ContentParent.
Profiler is off, content process is shutting down
Content process should shut down normally.
Profiler is turned on, content process is shutting down
ContentParent might get a reference to ProfileGatherer, but once the
ContentParent actor goes away, we'll, we're done here. It's possible
that the ContentChild will send up a profile before the parent
completely disappears.
Profiler is on, content process is shutting down
ContentChild sends up a profile to the parent which gets put onto the ProfileGatherer.
Profiler is turned off, content process is shutting down
ProfileGatherer is dropped by ContentParent. Any profiles received from the child are ignored.
Ok, so course correction:
Modify my fix for 1103094 so that instead of sending things with the
XPCOM Attributes message, we have the ContentParent notice during Init
that we're profiling, and send an async message down.
Rebase this patch on top of fix for 1103094, and set a member variable
to point to the ProfileGatherer. Same goes for PluginModuleParent.
Abstract the method for sending the async message and setting the
member for ProfileGatherer. Hit it not only when profiler-started is
seen, but also on ContentParent::Init if we notice we're profiling.
Make the above work for PluginModuleParent as well
On profile-stopped have ContentParent and PluginModuleParent drop the reference to ProfileGatherer
Modify ProfileGatherer to hold onto 5 profiles from dying subprocesses
Modify ProfileGatherer to accept profiles from dying subprocesses
Modify ProfileGatherer so that when gathering finishes, we can then sew
together all of the dead-process profiles in as well as the other
subprocess profiles. (This might involve registering the gatherer as an
nsIObserver for the profile-subprocess observer notification)
Current problem: ProfileGatherer::Observe is not being fired. Why not?
Hypothesis 1:
The topic is not being fired.
Test
: Find another profiler-subprocess topic observer, and see if it gets handled.
Result
: It is - this hypothesis is
FALSE
.
Hypothesis 2
: The topic observer is not being registered properly.
Test
: Set a breakpoint where we set the observer, make sure the return code is NS_OK.
I suspect this is actually the case - I wasn't able to inspect the
return value, but I'm pretty sure I'm passing the wrong third argument
to AddObserver because ProfileGatherer does not support being held with a
weak reference.
Let's set that third argument to false then, and see if that helps.
YES IT DOES HELP. Things appear to be working!
GAHHHHHHHH. Why aren't we getting our content process profiles when running with talos? What the dealio?
So the problem with gathering talos profiles from tps is that the tps
test causes finishTest to run too early. This means that the profile
that we dump out is for loading the test.html document, and nothing
else! That's no good.
How does TART do this? TART uses the same pageloader mechanism to get
started up, but I guess it doesn't cause finishTest to be called until
its ready… how is that so?
It looks like we're profiling at the wrong times and then attempting to gather the profile at the wrong times. And that sucks.
This needs more investigation.
3:06 PM
<
mconley
> avih: so TART is one of those tests that pageloader loads
up, but doesn't actually measure or care about the page it loads. Like,
it's loading that tart.html page, but it's not considering that "done"
and finishing the test when it loads
3:07 PM
<
mconley
> because then TART does its thing
3:07 PM
<
mconley
> and TART tells pageloader when it's done, right?
3:07 PM
<
avih
> yes
3:07 PM
<
mconley
> avih: how do you tell pageloader not to consider the test done when the initial page loads?
3:07 PM
<
mconley
> avih: and how do you tell pageloader that you're done?
3:07 PM
<
avih
> in page loader terminology, it's called "the page does its own measurement"
3:08 PM
<
avih
> mconley: 1. at the manifest you prefix the pages with '%'
3:09 PM
<
avih
> 2. you call a function (tpRecordTime) with the results,
where the pageloader addon injects this function into the content such
that the content can call it without any special privileges (even if
TART specifically does have privs since it's an addon, but the same
system also happens with tscrollx and tsvgx which are non privileged
pages which still use this system)
3:10 PM
<
avih
> mconley: adding '%' at the page URL at the manifest tells
the pageloader addon "don't measure load time, instead wait for
tpRecordTime call with the results"
3:10 PM
<
mconley
> avih: ah, I see
3:10 PM
<
mconley
> avih: thanks!
3:10 PM
<
avih
> sure :)
Consider increasing ContentChild shutdown timeout if parent process notices we're profiling
Consider not holding a reference to ProfileGatherer in
PluginModuleChromeParent and ContentParent, and just query nsIProfile
for the ProfileGatherer every time.
From BenWa:
::: tools/profiler/core/GeckoSampler.cpp:251
(Diff revision 7)
> + mGatherer = new mozilla::ProfileGatherer(this);
This grabs a pointer to this non-reference counted and the ProfileGatherer can live longer than the GeckoSampler.
We've talked about a few solutions to this in person. We should fix this before landing this patch."
SHIT. He’s right. What should the lifetime of a GeckoSampler be in
relationship to a ProfileGatherer, and what occurs during the following
scenario:
- User starts profiling
- User gathers async profile
- While profile is still profiling, user stops profiling
- User starts profiling again and gathers a new profile
During the above scenario, what happens to the gatherer and the
sampler? I already know that the GeckoSampler gets de’ref’d on profiler
shutdown… or something like that. Hum.
A process that is still in the midst of sending up a profile will not
have heard the stop request before it has finished gathering the profile
(since it does that synchronously within its own process).
How about some invariants:
- The Profiler can be started and stopped at any time - even during an async profile gathering
- A subprocess that is in the process of gathering a profile (whether its started to do so or not), cannot be told about the starting or stopping of the Profiler
- A parent actor for a subprocess in the above state should not handle starting or stopping of a Profiler
- The ProfileGatherer and GeckoSampler should keep each other alive until all subprocesses have had their profiles gathered.
- After a subprocess has finished gathering an async profile, it should check to see if we’re still profiling and act accordingly. (Make sure to see if there’s a new ProfilerGatherer to use next time.)
That sounds like good follow-up fodder. For now, let’s just reject the
Promise and destroy the ProfileGatherer if we stop sampling while
gathering profiles.
r+!
Aaaaaand another:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0396face8d98
Follow-ups: