Bug 1226564 - [e10s] CACHE V2 telemetry probe regressions
4:23 PM < chutten > mconley: billm: are you taking my analysis in vain? They haven't yet been reviewed...
4:23 PM < mconley > chutten: heya
4:24 PM < mconley > chutten: I’ve just been tasked with driving that bug forward
4:24 PM < mconley > chutten: to make sure we understand what’s going on here
4:24 PM < mconley > chutten: because a regression in network requests is no bueno
4:25 PM < mconley > chutten: the subtests you looked at look positive though - like, assuming “lower is better”, e10s appears to perform better there
4:25 PM < mconley > chutten: am I interpreting that correctly?
4:25 PM < chutten > mconley: Yes, but don't treat them as official. Still need an r+, and for that I need a cluster that can access parquet
4:25 PM < mconley > gotcha
4:26 PM LA-MJ quit ( kvirc@moz-2s60tn.static.zebra.lt ) Ping timeout: 121 seconds
4:26 PM < mconley > chutten: let’s assume that you confirm your analysis and get r+. What does that mean for the bug?
4:26 PM < mconley > does this mean e10s does not regress the CACHE V2 probes? Or only regress some of them?
4:26 PM < mconley > Or something else entirely?
4:27 PM < chutten > It regresses the startup cache v2 probes
4:27 PM < chutten > This might be due to different meanings of what "startup" means between e10s and non e10s, and it may also be that the probes are just bunk...
4:27 PM < chutten > ...or any of the other reasons outlined in the bug.
4:28 PM < chutten > But, yes. If necko says that the only ones to watch are the ones in that review, then e10s is looking Vehry Naice.


Here are the list of probes with their descriptions:


STARTUP_HTTP_CACHE_DISPOSITION_2_V2

From mayhemer in Comment 16 :
"My theory is that the breakpoint we stop measuring STARTUP_ and do "normal" capturing is simply in different time on e10s vs non-e10s.  It's happening sooner with e10s or less number of requests is happening before that point in e10s.  At the start we do OCSP requests that are simply not cached.  If there is more rate of OCSPs against normal requests it will show up as a regression.

I would ignore the regression here."



STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS
STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS

From mayhemer in Comment 16 :
"These are more a mystery for me.  Captured only on the parent process, no IPC overhead can cause this.

It might be that we do more I/O with e10s on startup.  Also can be the same cause as for STARTUP_HTTP_CACHE_DISPOSITION_2_V2.  Simply the time span used for "STARTUP_" capture is different (longer in non-e10s or we do less HTTP requests in e10s).  For longer intervals OS preloading may intervene or theoretically entries may be loaded/purged/reused during a longer startup span and that would greatly improve these probes thanks system caching.


Do we have probes for how many measurements, or HTTP requests directly, we do during the "STARTUP" phase?"




HTTP_PAGE_OPEN_TO_FIRST_FROM_CACHE_V2
HTTP_PAGE_COMPLETE_LOAD_CACHED_V2
HTTP_SUB_COMPLETE_LOAD_CACHED_V2
HTTP_SUB_OPEN_TO_FIRST_FROM_CACHE_V2

From mayhemer in Comment 16 :
"I'm afraid we don't correctly carry the "cached" timing values to the child.  What we are getting captured here are just few requests made on the parent like https://self-repair.mozilla.org/en-US/repair/ and similar.

I think this bug makes the most important probes useless.  Filed bug 1253379 on it."




HTTP_PAGE_COMPLETE_LOAD_NET_V2
HTTP_SUB_COMPLETE_LOAD_NET_V2

From mayhemer in Comment 16 :
"Can we check also HTTP_*_COMPLETE_LOAD_NET_V2 probe for regressions?"

https://github.com/chutten/e10s_analyses/blob/0d7a789a673f9c825389263764175d502191d9f8/beta45-withoutaddons/e10s_experiment.ipynb

e10s looks better here, assuming lower is better.


So now I’m looking at:
STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS
STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS

From mayhemer in Comment 16 :
"These are more a mystery for me.  Captured only on the parent process, no IPC overhead can cause this.

It might be that we do more I/O with e10s on startup.  Also can be the same cause as for STARTUP_HTTP_CACHE_DISPOSITION_2_V2.  Simply the time span used for "STARTUP_" capture is different (longer in non-e10s or we do less HTTP requests in e10s).  For longer intervals OS preloading may intervene or theoretically entries may be loaded/purged/reused during a longer startup span and that would greatly improve these probes thanks system caching.

That capture occurs here: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#1846

Note the discrepancy in the number of samples for e10s in both cases.  There are like… 1/4 of the samples for e10s than with non-e10s. Why?

What kind of network requests can go out before the sessionstore-windows-restored notification is fired?

So mayhemer's theory here is that STARTUP_NETWORK_CACHE_V2_HIT_TIME_MS and STARTUP_NETWORK_CACHE_METADATA_FIRST_READ_TIME_MS seem worse due to the fact that the parent is doing fewer network requests up until the sessionstore-windows-restored observer notification is fired (this is what is used to snapshot the histograms and prefix with STARTUP_ at http://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySession.jsm#1846 ).

The idea here is that the reads off of the disk for the cache become faster as more lookups occur because the OS is optimizing for the disk accesses. This means that the more requests there are, the more accesses will report as cheap.

The idea that there are fewer requests is bolstered by the fact that there are fewer samples for these two probes in the e10s case. It's roughly 25% of the samples gathered for the e10s case.

It seems to be more rare for there to be any histograms captured for either of these probes with e10s enabled than with non-e10s. In fact, in the e10s case on my Windows machine, I can only get the probes to appear if the OS has just started up (so the file access optimizations from the OS haven't been invoked yet).

So that's my explanation for the shape of the curve here for those two probes. I'm not sure there's much else to do here. I'm closing this one out.