**************************************************
Comment from Guy Almes
> Stacey Bruno (2004-04-01 13:46:39)
>
> By: Richard Hughes-Jones - richardh_j
> Comments on NM-WG-Hierarchy from Guy T Almes
> 2004-03-26 10:32
>
> I very much appreciate the work and insights in this document. I do, though, have several comments. These all concern delay.
> -- Guy

A general response by the authors is that this document is not intended to supersede or suggest changes to the IPPM metrics document. Rather, this document addresses issues encompassing portability of measurements, even when those measurements are not taken according to official RFC specifications.

> [Section 8: Delay Characteristics]
> Near the bottom of page 20, the text notes that "The above raises several issues including:" and lists three points. The nature of the three 'issues' seems neither clear nor consistent. The first seems to be an observation that it's hard to measure one-way delay without synchronizing time. Maybe this is really just an observation. If so, it's exactly right: doing a good job of measuring one-way delay usually requires lots of work to succeed at achieving good synchronization. Is the 'issue' more than that observation? The second seems perhaps to be seeking clarification on how one-way delay is defined when the packet is fragmented. If this is a claim that RFC 2679 is incomplete on this point, then that would be good to state. Otherwise, it's not clear what the 'issue' is. The third seems similar to the first -- achieving accuracy in measurements is hard work, and characterizing the accuracy of a measurement is subtle. Is that what is being said?
>
> In short, this portion of the document is vague and may or may not be asking questions about the accuracy or completeness of RFC 2679. Both the intent and the substance of the section should be (dropped or) clarified.

As Guy suspected, the intention of these three points was to make observations about the difficulties. The phrase "The above raises several issues including:" has been changed to "Issues complicating the precise measurement of delay include:".

> [Section 8.1: One-way Delay]
> Is this paragraph taken to be the definition of one-way delay in the context of the hierarchy? In conversation with one of the authors, I get the impression that there is no intent to depart in substance from the metric defined in RFC 2679. If so, it would be useful to add a simple statement saying so. Lots of work went into 2679, and it would be useful to build on that work. Perhaps a more carefully crafted statement would note that the Hierarchy Characteristic is the same as that described in 2679, that good Hierarchy Methodologies might result from following points made in 2679, and that a Hierarchy Observation of one instance of one-way delay is consistent with an instance of a 2679 Singleton. It might further be noted that the current Hierarchy document does not attempt to address issues analogous to the Sample and Statistics parts of 2679.
>
> If this is done, then those who have worked with 2679 will not be left with the (incorrect?) impression that the Hierarchy intends to depart from 2679 on any point of substance (though the Hierarchy document leaves some points open that 2679 does not leave open).

The hierarchy description of one-way delay cannot be the same as that of 2679 because the goal of the hierarchy is to describe a characteristic, while the goal of 2679 is to define a measurement methodology. The text has been updated to reflect this fact (and not to suggest that there is something peculiar about recording lost packets as having infinite delay).
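As a purely illustrative aside (not draft text, and not a proposed schema), a minimal sketch of what recording one-way delay singletons in this spirit could look like, with a lost packet retained as an infinite delay rather than silently discarded; every field name here is invented for the example:

    import math

    # Hypothetical records of one-way delay singletons. A packet that never
    # arrived is kept in the sample with an infinite delay, so that "no result"
    # is not silently confused with "small delay".
    observations = [
        {"src": "hostA", "dst": "hostC", "sent": 1080300000.000, "delay_s": 0.009},
        {"src": "hostA", "dst": "hostC", "sent": 1080300001.000, "delay_s": 0.012},
        {"src": "hostA", "dst": "hostC", "sent": 1080300002.000, "delay_s": math.inf},  # lost
    ]

    for obs in observations:
        result = "lost (infinite delay)" if math.isinf(obs["delay_s"]) else f"{obs['delay_s'] * 1000:.1f} ms"
        print(f"{obs['src']} -> {obs['dst']} sent at {obs['sent']}: {result}")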
> [Section 8.1.1: Jitter]
> You should, of course, avoid citing Internet Drafts in GGF Proposed Recommendations (since the I-Ds are ephemeral and the PRs are much less so).

The reference was updated to the final RFC 3393.

> [Section 8.2: Roundtrip Delay]
> The relation of Hierarchy Roundtrip Delay to RFC 2681, parallel to that sketched above for One-way Delay, would be useful.

A comment analogous to that for RFC 2679 was added.

> [Section 8.3: Issues in Measuring Delay]
> The next-to-last paragraph goes off topic and discusses an interesting issue of reporting and of statistics of sets of delay measurements, specifically in the context of how to deal with delay Observations that cannot be completed since the launched packet never arrives. My first point is that this is off-topic (since it does not relate to "issues in measuring") and should perhaps be placed under a separate section called something like "interpretation of sets of Observations".
>
> The cited RFCs take what might be characterized as a cautious -- perhaps over-cautious -- view and should perhaps be critiqued from time to time. The NM-WG should consider that these cautious positions were not lightly taken, of course.
>
> Specifically, as regards isolated instances of One-way Delay Observations (what 2679 calls Singletons), it would seem to be useful, and certainly not harmful, to retain the idea that attempts to launch a one-way delay Observation that do not complete due to packet loss can be characterized as having Infinite delay. How that infinite delay is interpreted is a separate matter. In some cases, it might make sense to interpret Infinite as 'inconclusive' (in all good humor, I append a lame joke peculiar to my native land in which a similar interpretation is made). We indeed have much to learn about the phenomena of delay and its impacts on applications. Three points summarize part of the cautious view taken in 2679:
>
> <> with respect to the causes of delay and/or loss, it is noted that, in many cases, packet loss is caused by extreme instances of queueing delay, as when a packet makes it to a congested router and is then dropped due to tail-drop or RED. And these are the losses of most interest to TCP dynamics. But, of course, packets are also lost due to bit errors or routing instabilities.
>
> <> with respect to impact on applications, packet loss's impact is indistinguishable from that of a very large delay. Thus, for example, many UDP streaming media applications posit some threshold of delay and, once this threshold is exceeded, treat the delayed packet as lost even if it arrives a moment later.
>
> <> it is important to avoid saying that a network that delivers a packet with large delay performed better than one that lost that same packet. Otherwise, strange confusions can result.
>
> The real problems emerge when one tries to measure a bunch of instances of one-way delay and summarize network performance with a statistic (what 2679 treats in its Statistics of Samples of one-way delay Singletons). If one launches 100 one-way delay tests and accurately measures one-way delay for 95 of them and the other 5 are lost, one must be cautious about how one treats those five lost packets.
> 2679 shows how to use percentile-based statistics to define, for instance, the minimum, the median, and other percentiles of delay. A fine point is that the 97th percentile of delay in the example given, when there is 5% packet loss, is itself regarded as Infinite. Those interested are encouraged to read the RFC. I'll make three points here:
>
> <> following 2679 in its treatment of Statistics is clear, if a bit cautious.
>
> <> it might also be possible to do some extra work and define percentiles of those packets with finite delay. For example, if you convince yourself that loss is *not* due to congestion/queueing, then this might make sense. (For example, if you have determined that those five losses were (almost) certainly caused by bit errors or by routing instabilities.) In this case, one could imagine a carefully crafted notion of cooked delay percentiles. If work is done in this area, by all means take it (also) back to the IETF IPPM WG.
>
> <> but please do not talk about averages and standard deviations and such, since this mathematical framework provides no help in dealing with packet loss. Moreover, even apart from packet loss, distributions of delay are often heavy-tailed, and in these cases (again, even apart from issues of loss), means and standard deviations are fragile notions.
>
> I would add, by the way, that our community is likely to learn a great deal about the phenomena of networks from accurate measurement of one-way delay and from careful interpretation of these phenomena.

The discussion of representing "unusual" cases was moved to a new section 8.4, and an admonition that there are reasons to follow the IPPM RFCs was included.
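To make the point about percentile statistics concrete (purely as an illustration of the discussion above, not as text for the draft), a small sketch of how RFC 2679-style percentiles behave when lost packets are retained as infinite delays, together with the kind of "cooked" percentile over finite delays mentioned by Guy; the delay values are made up:

    import math

    def percentile(values, p):
        """Empirical percentile: the smallest observed value v such that at
        least p percent of the sample is less than or equal to v."""
        ordered = sorted(values)
        k = math.ceil(p / 100.0 * len(ordered))
        return ordered[max(k, 1) - 1]

    # 100 one-way delay results in seconds; 5 launched packets never arrived
    # and are kept in the sample as infinite delays.
    delays = [0.009 + 0.0001 * i for i in range(95)] + [math.inf] * 5

    print(percentile(delays, 50))   # median: finite
    print(percentile(delays, 95))   # 95th percentile: still finite
    print(percentile(delays, 97))   # 97th percentile: inf, since loss exceeds 3%

    # A "cooked" percentile over only the packets with finite delay -- defensible
    # only if one is confident the losses were not caused by congestion/queueing.
    finite = [d for d in delays if math.isfinite(d)]
    print(percentile(finite, 97))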
**************************************************
Comment from Eric Boyd

> By: Eric Boyd - ericlboyd
> Draft-GGF-NMWG-Hierarchy-02.pdf
> 2004-03-25 12:05
>
> Overall, I think the document is a valuable and useful contribution to GGF.
>
> Specific Quibbles
> =============
> 1) In Section 4.5, the last sentence says "We believe that in general the guiding principle should be return as much of the pertinent information as possible and let the querying application make the decision [as to] what information to use."
>
> I disagree with this statement and think it contradicts the NMWG request requirements document (https://forge.gridforum.org/docman2/ViewCategory.php?group_id=63&category_id=513), which says:
>
> "It should be possible to specify a query (albeit a simple one) with a minimum amount of information. This would be a test start and end point, a time, and a test type (metric/measurement). Note: this is a definition of the minimum set of information required to request a test or historic data, not the minimum set of values to be contained in a message based on the schema. Default values could be defined for non-specified information. Although obviously not a direct requirement of a request schema, it is nevertheless expected that a minimally specified request should result in a correspondingly minimal response, e.g. containing time information and a measurement value only."
>
> At the very least, I think we should drop the sentence altogether.
>
> At the very most, I'd like to replace it with the statement from the request schema requirements document or something like it.
>
> My intuition, based on the reactions when this topic was discussed in Berlin, is that there is not unanimity among the group on this point and that we should hash this point out in the very near future (as it strongly impacts the request schema and the interaction of the request schema and response schema).

The sentence was deleted, as there is disagreement about the correct response to a vague query.

> 2) In sections 5.1, 6.1, and 9.2, not to beat a dead horse from Chicago, but I still think the "hoplist" is misleading for a round-trip measurement associated with a path in the instance where the path is asymmetric. I think there should be a way of indicating 2 hoplists, or, if only one is given, assuming the path is symmetric.

Section 6.1 was modified to describe that the hoplist for a round-trip path can be either a round-trip hoplist (there and back again) or may be one-way (in the case where the return path isn't known, such as with traceroute). This appears to be the cleanest approach to providing a real answer to this issue.

> 3) In section 5.3, where would optical splitters along a lightpath fall? They are essentially "observation points" but not internal nodes.

I would probably label it a virtual node, personally, but in general an optical splitter is a form of multicast, which we don't try to address. (Which isn't to say we shouldn't, but while an observation tap is a simple concept to add, I could see the problem mushrooming quickly into the general multicast problem.)

> 4) In section 7.1.1, in the first paragraph second sentence, I would change it to "However, in general it is not [yet] possible [to] get access to this information for network switches and routers in different administrative domains. For example, an end-user is [as yet] unlikely to be able to access SNMP information from a commercial ISP's network."
>
> In a sense, this is what many projects (e.g. piPEs) are trying to enable in the academic world and thus eventually in the commercial world. We shouldn't give up before we start.

**************************************************
Comment from Stanislav Shalunov

> Comments from SourceForge:
>
> By: stanislav shalunov - shalunov
> draft-ggf-nmwg-hierarchy-02.pdf
> 2004-03-25 10:59
>
> Summary: This is an important document that should go forward with appropriate editing. My most pertinent comments are about packet reordering.
>
> The following are more specific comments for draft-ggf-nmwg-hierarchy-02.pdf.
>
> 3.2 Measurement Methodology
>
> This section mentions, by way of example, a methodology of measurement of round-trip delay from separate one-way measurements. First, I am not sure that the use of the word ``methodology'' here and often subsequently is not somewhat misleading: adding one-way delay for the two directions is a *method* of obtaining round-trip delay; it's difficult to see how it can be a methodology (``the science of method or arrangement; a treatise on method,'' according to the Webster dictionary).
>
> Second, while such a technique is a valid way of projecting round-trip delay, it might be worthwhile to mention that this projection could produce consistent bias for some networks: consider, e.g., a networking connection that consists of a wireless link from host A to base station B, followed by a WAN link to host C; further, assume that the WiFi link is operating in power-saving mode, with A asking B for packets after 20ms of inactivity (assume that propagation, processing, and serialization delays between A and B are negligible); further, assume that the WAN link from B to C is a simple deterministic link where no queuing ever occurs and the (symmetric) delay is equal to 9ms; one-way delay from A to C would then be 9ms, while one-way delay from C to A would then be (on average) 19ms (9ms of WAN delay plus a uniformly distributed between 0ms and 20ms random delay from B to A); the sum of these---the round-trip delay projected by the mentioned technique---is 28ms; however, the actual round-trip delay measured, for example, by sending a Poisson stream of once-per-second on average probes from A to C would be 20ms.

"Measurement methodology" seems to be an established enough term (Google shows many hits in multiple disciplines with usage similar to ours). A comment was added noting that the different examples have associated issues.
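To make the numbers in Stanislav's wireless example concrete (a toy simulation written only for this discussion; the 20ms polling model is a simplification of the power-saving behaviour described, not a claim about any particular hardware):

    import random

    WAN_ONE_WAY = 9.0    # ms, deterministic and symmetric between B and C
    POLL_PERIOD = 20.0   # ms: A asks B for buffered packets 20 ms after its last activity

    def one_way_a_to_c():
        # A is awake when it transmits, so only the WAN delay applies.
        return WAN_ONE_WAY

    def one_way_c_to_a():
        # The packet reaches B after the WAN delay, then waits in B's buffer until
        # A's next poll; for traffic uncorrelated with A's poll cycle, that wait is
        # uniformly distributed between 0 and 20 ms.
        return WAN_ONE_WAY + random.uniform(0.0, POLL_PERIOD)

    def round_trip_from_a():
        # A's own probe resets its inactivity timer, so A polls exactly 20 ms after
        # sending; the reply, which reached B after 18 ms, is collected at that poll.
        reply_reaches_b = 2 * WAN_ONE_WAY
        next_poll = POLL_PERIOD
        return max(reply_reaches_b, next_poll)

    N = 100_000
    projected = sum(one_way_a_to_c() + one_way_c_to_a() for _ in range(N)) / N
    measured = sum(round_trip_from_a() for _ in range(N)) / N
    print(f"round-trip projected from one-way averages: {projected:.1f} ms")  # about 28 ms
    print(f"round-trip measured directly from A:        {measured:.1f} ms")   # 20 ms

The projection is biased because an isolated probe from C to A waits at B for half the poll period on average, whereas a reply that A itself has just solicited is picked up at A's next scheduled poll, 20ms after the probe went out.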
> 4.5 Storage versus Retrieval
>
> In ``the guiding principle should be return as much [...]'', is there perhaps a missing ``to'' before ``return''?

The sentence was deleted (see above).

> 7. Bandwidth Characteristics
>
> While I suspect that the informal use of the word ``bandwidth'' to describe things related to capacity and throughput will be with us for a long time, I wonder if the authors could adopt terminology that avoids the word entirely.

We debated removing the term "bandwidth" from our vocabulary, but decided it will never happen...

> That said, I fail to understand the definitions:
>
> Available bandwidth is meaningless without an averaging time period. After all, at any time the link is either (completely) used, or idle.

At one point something explicitly discussing instantaneous bandwidth was in the document. However, we decided that specifying it as "data per unit time" captured the important essence---the instantaneous bandwidth concept seemed to confuse more people than it helped. Essentially, "data per unit time" specifies an average anyway.

> What you term achievable bandwidth is more clear, but I wonder how one is going to reduce ``the protocol and operating system used, and the end-host performance capability'' to a set of enumerable or quantifiable parameters suitable for formal use.

True, but people measure this all the time, so it needs to be in there even if it can't be formally specified.

> 7.1 Capacity
>
> While right now links with variable, adaptable, or flexible capacity are rare in high-performance networks (but quite common elsewhere: all of dial-up modems, DSL modems, and WiFi use variable capacity), the situation could change. It might be useful to consider this (perhaps one could tie precise timestamps to capacity numbers, or one could use ranges into which capacity would be likely to fall to describe the physical capability of the link without a need to query low-level hardware---try that with a DSL modem).

I think associating attributes (including observation time) with observations should be sufficient to capture this.
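As an illustration of that response (hypothetical only; none of these field names come from the draft or any proposed schema), an observation that carries its own attributes, so that a link whose capacity changes over time needs no special handling:

    from dataclasses import dataclass, field

    @dataclass
    class Observation:
        """Illustrative only: one observation of a characteristic, carrying its
        attributes explicitly rather than leaving them implicit."""
        characteristic: str      # e.g. "capacity" or "utilization"
        value: float
        units: str
        timestamp: float         # when the observation was made (seconds since epoch)
        attributes: dict = field(default_factory=dict)

    # A DSL-style link whose capacity varies: each observation records when it
    # was taken, so the variation is captured by ordinary attributes.
    history = [
        Observation("capacity", 6.0e6, "bit/s", 1080300000.0, {"layer": "physical"}),
        Observation("capacity", 4.5e6, "bit/s", 1080303600.0, {"layer": "physical"}),
    ]
    for obs in history:
        print(f"{obs.characteristic}: {obs.value:.2e} {obs.units} at t={obs.timestamp}")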
> 7.2 Utilization
>
> It might be useful to stress how a utilization number alone (``that network uses 5% of its capacity right now'') is meaningless without an averaging interval. For the same packet trace, the variability of utilization as a function of time generally increases as the averaging interval decreases.

Added another sentence to emphasize burstiness.

> 8. Delay Characteristics
>
> I support the comments made earlier by Guy Almes.
>
> In addition, I wonder if it would be productive to trim the laundry list of reasons delay is important to only those that are relevant to Grids, or at least to rearrange the list of reasons so that the relevant ones go first. Sure, (overall system) one-way delay that exceeds 150ms makes it difficult to talk on the phone, but is that one of the reasons you'd like to measure it?

I don't want to get into a discussion of what is or isn't a grid (go ask the marketing people), but to me interactive real-time applications are an important class for grids. I read through the list and combined the two "minimum value" points.

> 11. Queuing Information
>
> This section opens with ``In wired networks, queue overflow in routers is the predominant reason for packet loss.'' Leaving aside scenarios where the NIC is the bottleneck (not so uncommon for high-performance networks) and most losses have to occur in hosts, this sentence still leaves an impression that these losses are undesirable artifacts of network design. In fact, they are the intended congestion signaling mechanism for loss-based congestion control algorithms (such as that of (New)Reno TCP).

Agreed. Tried to clarify that such loss is due to congestion.

> 12. Packet Reordering
>
> This is the section that I find in need of significant reworking.
>
> We have different network measurement metrics with definitions that are obvious to different degrees. It is generally clear, for example, what one-way delay or loss is---the only thing that is necessary here is to agree on the details (e.g., first bit sent to last bit received). With metrics related to unused capacity it is less obvious what the natural definition is, but the intent is usually clear. With reordering, on the other hand, there is a genuine need to come up with a definition of something that is not at all obvious. Multiple equally intuitive definitions are possible. Multiple useful definitions are possible (although they might be useful for different purposes).
>
> What is the degree of reordering of the sample `2 1' (50% or 100%)? What about `2 1 4 3'? What about `4 3 2 1'? Multiple different (and equally ``valid'') answers are possible.
>
> In the example that the authors mention---TCP---the effect of reordering is important. However, ``early'' packets cause TCP much less anguish than ``late'' packets (in the case of ``early'' packets, TCP simply stops increasing the congestion window until the sequence is ``straightened out'', while in the case of a ``late'' packet, the congestion window might be halved). There seems to be no single obvious and straightforward way that would allow one to take this kind of issue into account while still making it possible to compute the metric on the fly with bounded memory use.
>
> My personal favorite is---not surprisingly---my own definition [1] (it grew out of analysis of the aspects of reordering that are important for TCP).
> The IETF IPPM WG is currently working on a definition [2] that incorporates these ideas along with more metrics. There is already more than a single definition of reordering in current active use. Different network measurement hardware vendors already use different metrics, some more helpful than others, none very useful for predicting the effect of reordering on TCP.
>
> I earnestly plead the case for not defining Yet Another Reordering Metric within GGF. The situation with reordering definitions is bad enough as it is. Especially unhelpful would be a vague ``percentage of reordering'' definition (most seem to agree that reordering is poorly described by a single percentage number, but even if it has to be described thusly, there is no single obvious way of computing the number, so a very precise definition would need to be used).
>
> It would seem to me that the least bad option in the case of reordering would be for now to include a placeholder that would say that the matter is on hold until the IETF IPPM WG makes up its mind about reordering. Once that happens, use a single definition from the IPPM document (and refer to it by section number) that is most suitable for the purposes of GGF.
>
> [1] S. Shalunov, ``Definition of IP Packet Reordering Metric,''
>     http://www.internet2.edu/~shalunov/ippm/draft-shalunov-reordering-definition-02.txt
>
> [2] A. Morton, L. Ciavattone, G. Ramachandran, S. Shalunov, J. Perser, ``Packet Reordering Metric for IPPM,''
>     http://www.ietf.org/internet-drafts/draft-ietf-ippm-reordering-05.txt
>
> --
> Stanislav Shalunov  http://www.internet2.edu/~shalunov/

I think the placeholder will have to do, because I can't cite Internet Drafts in a Proposed Recommendation. I tried to add some more text clarifying why reordering is important and mentioning the shortcomings of simple percentage-based approaches. Note again that, while we don't want to define a new percentage-based methodology, the fact that such approaches are currently prevalent means we need to include them.
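To make the ambiguity concrete (two toy scoring rules invented for this note; neither is the definition of [1] or [2], nor anything we propose to standardize), the samples above get different "percentages of reordering" depending on which equally plausible rule is applied:

    def pct_displaced(seq):
        """Toy rule 1: fraction of packets that did not arrive in the position
        they would occupy under in-order delivery."""
        moved = sum(1 for pos, s in enumerate(seq, start=min(seq)) if s != pos)
        return 100.0 * moved / len(seq)

    def pct_late(seq):
        """Toy rule 2: fraction of packets arriving with a sequence number smaller
        than the largest sequence number already seen."""
        seen_max, late = float("-inf"), 0
        for s in seq:
            if s < seen_max:
                late += 1
            seen_max = max(seen_max, s)
        return 100.0 * late / len(seq)

    for sample in ([2, 1], [2, 1, 4, 3], [4, 3, 2, 1]):
        print(sample, pct_displaced(sample), pct_late(sample))
    # [2, 1]       -> 100.0 vs 50.0: both answers to "50% or 100%?" are defensible
    # [2, 1, 4, 3] -> 100.0 vs 50.0
    # [4, 3, 2, 1] -> 100.0 vs 75.0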
**************************************************
Comment from Franco Travostino

> Thanks, Bruce. Hereafter my comments to nmwg-hierarchy-02:
>
> 1) This being a candidate Proposed Recommendation, I find that the *standard* text (i.e. the text resulting in RFC 2119 requirement levels for portability/interoperability) is diluted in a large amount of: could, would, in general, asides, anecdotal text. I submit the idea of splitting the draft in two drafts. The first draft would be a lean recommendation document, roughly up to Figure 3 included, and with a trailing terse section serving as a glossary of terms. The second draft would be an informational or BCP document, with all the envisioned usages, rationales, TCP vs UDP vs ... wrinkles, examples, etc., all without the strength of RFC 2119 directives (or else, it moves to the former document).

I see the point you make, as the first half is tighter. However, without the more detailed discussions and definitions of what the terminology means, I'm not sure the first half would be as meaningful---i.e. if you have to read the informational part to understand the recommendation part, I'm not sure splitting them would be useful.

> 2) With regard to the standard text, I further observe that the SHOULDs are roughly as many as the MUSTs. By definition, a SHOULD is a gray area when it comes to interoperability. It should be used as sparingly as possible, and it works best when the "valid reasons" to take exception from a SHOULD are matter-of-fact ones.

In my experience, protocol interoperability is a very different issue from measurement collection. We can't force people to collect only the correct N measurements with the approved tools. Therefore, we have to be more flexible in this context.

> 3) Precision and accuracy are two integral dimensions of a measured value (e.g. time measurements). Where do you see these coming to bear in the hierarchy?

I don't. I see them as issues for a particular schema to capture accuracy, precision, variability, etc. That is, they are issues that should be addressed in an implementation of the hierarchy.

> 4) I fully support the utility and originality of this work, to the extent that I will be asking one more question: Would this recommendation (and the derivative ones) be optimally housed in the IETF? I think of the Grid component being a worthy catalyst for this work, and not its exclusive consumer. My guess is that this work would get broader scrutiny and impact in the IETF.

A few reasons to keep it in GGF:

- This work fits in much better with the Grid need to build an interoperable framework around pre-existing (and co-existing) non-interoperable parts than it does with the IETF's tradition of defining its own standard that should supersede the other parts. For example, see the discussion above about MUST vs SHOULD. This is just a fundamentally different type of document than the process the IETF is designed for. We need to enable portability, but we can't require compliance below the surface level.

- Because the other work on exchanging the other information needed for grid environments---security, host load, reservations, performance, software, etc.---is in GGF, I don't see a reason to identify networks as "special" and send them to a different organization.

**************************************************
Comment from Tiziana Ferrari

> COMMENTS
>
> - Introduction: the last paragraph says "[..] from implementation based on this hierarchy". I may be wrong, but it looks like the hierarchy mentioned in the sentence hasn't been described/introduced in the previous paragraphs of the introduction.

Not specifically, but it's been referred to in general terms a couple of times before in the intro and abstract.

> - "Sample Grid use ..": "There is no advantage to splitting up the file copy in this way" Even if a bottleneck link is shared, the splitting may still be useful as it contributes to distributing the traffic load among several source sites, and consequently to alleviating bottlenecks around each individual site.

Good point.

> - "Characteristic": "property that is related to the performance and reliability" What does reliability mean in this context? Isn't reliability a specific "characteristic" that describes performance? The text seems to suggest that reliability is something that complements it. Is reliability described by the "availability" characteristic described later on in the text, or is it something else?

This is probably true. I think that deleting "and reliability" would be fine. OTOH, I think performance is also only one example of a characteristic. I'm not sure we gain anything from deleting reliability either---really, I think it makes it clear that the concept is broader than just "raw bandwidth."

> - I would suggest clarifying the relationship between the "Predictions" section and the "Measurement methodology" section.
> For example, is the prediction an example of a derived methodology?

As much as it pains me to do it, I don't think the "Predictions" section has much relevance to this document anymore. I just deleted it, although I considered moving it to the "Observations" subsection, where I think it would make more sense.

> - "Observations": I would clarify that the singletons forming a sample should be singletons of the same characteristic referring to the same network entity.

Done.

> - "Overview of measurements representation": "in fact some QoS policies may specify different routes [..]" Not only that: packets with different QoS forwarding behavior may show different behaviors because of the different traffic conditioning they are subject to, even if they follow the same route.

I think the handling was intended to be included in "all aspects," and pointing out the route differences was just to make sure readers understand it really means everything.

> Also, this section introduces the term "measurement system"; it would be good to have a definition of this. Alternatively, a reference to a document which defines it could be added.

I don't know how or where to define this, but I tried to add a clarifying comment.

> - Figure 2: what does "Proxy" mean?

I added a clarification to section 5.3 that defines a proxy as "a device that implements NAT or SOCKS, or any other in-stream node that may modify the data."

> - "Attributes and profiles": examples of tuples and profiles would be helpful to clarify this section.

I hesitate to do this, because everywhere else we give an example, someone says we're defining a standard with that example. Anyone want to write one?

> - "Physical and functional topologies": could the authors clarify the last sentence of the first paragraph: "Physical topology can be determined for both LANs and WANs"?

I'm not sure what you're looking for. There are four references in that sentence.

> - An example which clarifies the "virtual node" concept would help.

Two examples are given in 5.3. I'm happy to provide more examples. What remains unclear about the concept?

> - References: the authors may want to reorder them so that they are listed according to the order in which they are cited in the text.

Ugh. Since the GGF, in their wisdom, settled on Word DOC as the standard format, I don't have any tool to help solve that problem.

> That's all. Congratulations to all the authors for the useful document produced!
>
> Tiziana