General comments:

I don't quite understand the need for this mechanism -- why would one use these markings instead of transport-layer signals a la ECN? -- so I've constrained my comments to the mechanical details. My only high-level comment pertains to the threat model and value of these metrics. In particular, it's not clear to me how an operator would distinguish actual operational problems causing loss or delay from an attacker modifying marking flags to give the appearance of loss or delay. In untrusted domains, how are these markings expected to be used reliably? (I guess I just don't understand the threat model well enough, and I couldn't glean it from the security considerations.)

Specific comments:

Section 2.

   o In case of Hop-by-Hop Option Header carrying Alternate Marking bits, it is not inserted or deleted, but can be read by any node along the path. The intermediate nodes may be configured to support this Option or not and the measurement can be done only for the nodes configured to read the Option. Anyway this should not affect the traffic throughput on nodes that do not recognize the Option, as further discussed in Section 4.

A couple of questions come to mind when reading this. In no particular order:

- What stops a hop along the path from inserting or deleting these markings? What is affected if that happens?
- Does it affect throughput on nodes that _do_ recognize the option?

While the threat model (monitoring within a controlled domain) seems to rule out these issues, the implications of alterations, even if accidental, seem worth elaborating upon.

   Flow Label and FlowMonID within the same packet have different scope, identify different flows, and are intended for different use cases.

Is the set of packets defined by a FlowMonID a subset of those defined by a Flow Label, do they have some overlap, or are they completely disjoint?
(Writing out the relationship in more detail might help clarify for non-experts why a new label is indeed needed.) It seems like a shame to redefine yet another flow field. As a nit, given the relation to and possible confusion with Flow Label, perhaps we could rename FlowMonID to something like TraceID?

   So, for the purposes of this document, both IP addresses and Flow Label should not change in flight and, in some cases, they could be considered together with the FlowMonID for disambiguation.

The restriction to a controlled domain, in which no attacker is assumed to be able to modify these fields, is probably worth noting here. It appears in Section 2.1 and in the "harm to measurements" part of the security considerations, but that is somewhat buried. It is perhaps worth promoting to an earlier point in the document.

Section 2.1.

This should probably point to the security considerations for more information about controlled domains.

Section 3.1.

   o Opt Data Len: The length of the Option Data Fields of this Option in bytes.

Are there requirements for how long the reserved field in the option data is supposed to be? It seems that this field must consist of all zeroes, but that it can be up to 255 bytes long. Given that the data consists of a FlowMonID (20 bits) and two flags (2 bits), would it be useful to recommend (or require) a size for this?

Section 5.

   It is important to highlight that the definition of the Hop-by-Hop Options in this document SHOULD NOT affect the throughput on nodes that do not recognize the Option.

This is an interesting requirement. Surely a node that processes the option does more work before forwarding a packet, which seems like it would affect throughput, even if that impact is negligible. Perhaps "SHOULD NOT affect the throughput" could be rephrased as "is designed to minimize throughput impact on nodes that do not support the option"?

Section 5.1.

   The measurement of the packet loss is really straightforward.
   The packets of the flow are grouped into batches, and all the packets within a batch are marked by setting the L bit (Loss flag) to a same value.

Does this require nodes to batch packets in memory before forwarding? (As written, that seems to be the case, which seems odd.)

   The source node can switch the value of the L bit between 0 and 1 after a fixed number of packets or according to a fixed timer, and this depends on the implementation.

Using a timer for this seems like a very error-prone or noisy implementation approach. Beyond having tightly synchronized clocks, which is already a challenging requirement, is the idea that using a counter is somehow more complex than a timer? (If there's no benefit to using a timer, and it only introduces operational challenges, I'd recommend just removing the suggestion altogether, but I may be missing something.)

   In a few words this implies that the length of the batches MUST be chosen large enough so that the method is not affected by those factors.

There does not seem to be enough guidance here to enforce this MUST, especially given the different factors that affect batch size. What happens if this MUST is violated? (Perhaps downgrading to a SHOULD would be better.)

Section 5.2.

How do nodes know if they should measure delay using the single- or double-marking methodology? Is that determined by some per-domain policy?

   The most efficient and robust mode is to select a single double-marked packet for each batch, in this way there is no time gap to consider between the double-marked packets to avoid their reorder.

I'm having a hard time understanding this guidance. How exactly does one select a single packet? Is it done at random, or is there another way? (The figures seem to suggest that the packet is picked from the "middle" of a batch.)

Section 5.3.

   The FlowMon identifier field is to uniquely identify a monitored flow within the measurement domain. The field is set at the source node.
   The FlowMonID can be uniformly assigned by the central controller or algorithmically generated by the source node. The latter approach cannot guarantee the uniqueness of FlowMonID but it may be preferred for local or private network, where the conflict probability is small due to the large FlowMonID space.

What happens when all values in the FlowMonID space are consumed? Are old flows discarded or overwritten? I would imagine there's some way IDs are recycled given the finite 2^20 space, but that's not discussed.

Section 5.3.1.

This seems like text that should be moved to the security considerations. In doing so, it can also be trimmed. (I would claim that the 32-bit FlowMonID example is irrelevant given that these labels are 20 bits long, for example.)

Section 6.

   Moreover, Alternate Marking should usually be applied in a controlled domain and this also helps to limit the problem.

Does this mean to suggest that Alternate Marking can be used in networks where attackers exist? If so, comments above regarding the integrity of these fields should be addressed, I think.

   The privacy concerns of network measurement are limited because the method only relies on information contained in the Option Header without any release of user data. Although information in the Option Header is metadata that can be used to compromise the privacy of users, the limited marking technique seems unlikely to substantially increase the existing privacy risks from header or encapsulation metadata.

The QUIC working group spent a _long_ time trying to understand the privacy implications of a single latency bit. I'd encourage the authors here to review the history of that discussion, and then revisit this paragraph. While privacy implications may not seem obvious, I think it's a mistake to say that it is unlikely to introduce any new sort of attack vector.

   The Alternate Marking application described in this document relies on an time synchronization protocol.
   Thus, by attacking the time protocol, an attacker can potentially compromise the integrity of the measurement.

This seems somewhat buried, and probably worth promoting to the introduction.

Editorial comments:

- Some language is a bit informal, e.g., "Anyway, ...". I recommend removing such phrasings throughout.
- "Alternate Marking" and "alternate marking" are inconsistently capitalized. Is that intentional?
- OAM is undefined in Section 4 -- perhaps we can spell it out? (I assume it's Operations, Administration, and Maintenance.)
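
Returning to the Section 5.1 question about batching: the sketch below shows how I would have expected counter-based marking to work. It reflects my own assumptions, not anything the draft states, and the names (`CounterMarker`, `batch_size`, `next_l_bit`) are hypothetical. Each packet is stamped with the current L bit as it is forwarded, and the bit flips after a fixed number of packets, so nothing needs to be held in memory. If this matches the authors' intent, saying so explicitly would resolve the ambiguity.

```python
class CounterMarker:
    """Hypothetical source-node marker: flips the L bit every batch_size packets."""

    def __init__(self, batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.batch_size = batch_size
        self.l_bit = 0          # L bit value for the current batch
        self.sent_in_batch = 0  # packets already marked in the current batch

    def next_l_bit(self) -> int:
        """Return the L bit for the packet being forwarded right now."""
        if self.sent_in_batch == self.batch_size:
            # Current batch is full: start a new batch with the other value.
            self.l_bit ^= 1
            self.sent_in_batch = 0
        self.sent_in_batch += 1
        return self.l_bit


marker = CounterMarker(batch_size=3)
bits = [marker.next_l_bit() for _ in range(7)]
print(bits)
```

Under this reading, a "batch" is just a run of consecutive packets carrying the same L value, and the per-packet state is one counter and one bit.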