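Multi-hop INT (in-band network telemetry)
Two switches are chained in series, and each one inserts its own 14-byte INT shim header into every packet it forwards. The receiver parses the whole per-hop stack and reconstructs the path the packet took through the topology. This is the INT example closest to a real deployment; for a simpler single-switch introduction, see INT (in-band network telemetry).
What this example shows
- Per-hop metadata accumulation: every switch on the path inserts its own shim, so the receiver on the far side sees one block of metadata per switch traversed.
- next_proto chaining: each shim's next_proto field names the header that follows it. The parser walks etherType → shim_1.next_proto → shim_2.next_proto → ipv4, so the two-hop case needs no P4 header stack.
- Per-switch identity from a register: the same P4 program runs on both switches; at startup each switch's switch_id is written through the v1.2 register API, write_register("MyIngress.switch_id_reg", index=0, value=N).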
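The next_proto chaining described above can be sketched in a few lines of Python. This is only an illustration of the wire layout, not part of the example's code: pack_shim and walk_chain are hypothetical helper names, and the timestamp values and the all-zero IPv4 placeholder are made up for the demonstration.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_INT = 0x88B6
SHIM_LEN = 14


def pack_shim(switch_id, ts_us, egress_port, queue_depth, next_proto):
    """Pack one shim: u8 id, u48 timestamp, u16 port, u16 depth, u16 next_proto, u8 reserved."""
    return (
        bytes([switch_id])
        + ts_us.to_bytes(6, "big")
        + struct.pack("!HHH", egress_port, queue_depth, next_proto)
        + b"\x00"
    )


def walk_chain(frame):
    """Return (switch_ids in path order, first non-INT protocol number)."""
    next_proto = int.from_bytes(frame[12:14], "big")  # outer EtherType
    offset = 14
    switch_ids = []
    while next_proto == ETHERTYPE_INT and offset + SHIM_LEN <= len(frame):
        shim = frame[offset:offset + SHIM_LEN]
        switch_ids.append(shim[0])
        next_proto = int.from_bytes(shim[11:13], "big")
        offset += SHIM_LEN
    return switch_ids, next_proto


# Ethernet header (dst, src, EtherType = INT) followed by two chained shims.
eth = bytes.fromhex("000000000002" "000000000001") + struct.pack("!H", ETHERTYPE_INT)
frame = (
    eth
    + pack_shim(1, 800454, 2, 0, ETHERTYPE_INT)   # hop 1 points at another shim
    + pack_shim(2, 699418, 2, 0, ETHERTYPE_IPV4)  # hop 2 points at IPv4
    + bytes(20)                                   # placeholder IPv4 header
)
print(walk_chain(frame))
```

The walk stops as soon as next_proto is anything other than 0x88B6, which is why a chain of any length can be read without knowing the hop count in advance.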
Topology
examples/int_multi_hop/topology.py:
"""Linear 4-node topology demonstrating multi-hop INT.

    h1 (10.0.0.1/24) --- s1 --------- s2 --- h2 (10.0.0.2/24)
                   port1   port2 port1   port2

Both switches run the same P4 program (``int_multi_hop.p4``). Each switch's
``switch_id`` register is written at start-up via the v1.2 register API:
s1 gets ``1``, s2 gets ``2``. L2 forwarding is exact-match on destination
MAC; static ARP is seeded between the hosts.

Run as root:

    sudo p4net examples/int_multi_hop/topology.py

Then in another terminal (``--iface`` is required by the listener):

    sudo ip netns exec h2 python3 examples/int_multi_hop/listener.py --iface h2-eth0

And from the p4net shell (or a third terminal):

    sudo ip netns exec h1 ping -c 3 -W 1 10.0.0.2

The listener prints one block per packet, with one line per traversed
switch (two lines per packet in this topology).
"""

from __future__ import annotations

import json
import os
from pathlib import Path

from p4net import Network
from p4net.topo import Topology

HERE = Path(__file__).resolve().parent

# Coordination file consumed by ``listener.py`` so the listener can align
# each switch's per-process BMv2 timestamp to wall-clock microseconds:
#
#     wall_clock_us = switch.boot_timestamp_us + shim.ingress_timestamp_us
#
# Written at the end of ``setup(net)`` once both switches are running.
# The path is overridable via the ``P4NET_INT_BOOT_TIMES_PATH`` environment
# variable so multiple multi-hop INT topologies can coexist on one host.
# Pass it with ``sudo -E`` to preserve the variable across privilege
# escalation; both topology.py and listener.py read the same env var.
BOOT_TIMES_PATH = Path(
    os.environ.get(
        "P4NET_INT_BOOT_TIMES_PATH",
        "/tmp/p4net-int-multi-hop-boot-times.json",
    )
)

topology = Topology()
h1 = topology.add_host("h1", ip="10.0.0.1/24", mac="00:00:00:00:00:01")
h2 = topology.add_host("h2", ip="10.0.0.2/24", mac="00:00:00:00:00:02")
s1 = topology.add_switch("s1", p4_src=HERE / "int_multi_hop.p4")
s2 = topology.add_switch("s2", p4_src=HERE / "int_multi_hop.p4")
topology.add_link(h1, s1, port_b=1)
topology.add_link(s1, s2, port_a=2, port_b=1)
topology.add_link(s2, h2, port_a=2)


def setup(net: Network) -> None:
    """Static ARP, l2_forward tables, switch_id registers."""
    h1_rt = net.host("h1")
    h2_rt = net.host("h2")
    s1_rt = net.switch("s1")
    s2_rt = net.switch("s2")

    h1_rt.exec(
        [
            "ip",
            "neigh",
            "replace",
            "10.0.0.2",
            "lladdr",
            "00:00:00:00:00:02",
            "dev",
            "h1-eth0",
            "nud",
            "permanent",
        ]
    )
    h2_rt.exec(
        [
            "ip",
            "neigh",
            "replace",
            "10.0.0.1",
            "lladdr",
            "00:00:00:00:00:01",
            "dev",
            "h2-eth0",
            "nud",
            "permanent",
        ]
    )

    # Per-switch INT identity.
    s1_rt.client.write_register("MyIngress.switch_id_reg", index=0, value=1)
    s2_rt.client.write_register("MyIngress.switch_id_reg", index=0, value=2)

    # L2 forwarding: route by destination MAC out the link toward the host.
    for sw_rt in (s1_rt, s2_rt):
        sw_rt.client.insert_table_entry(
            table="MyIngress.l2_forward",
            match={"hdr.ethernet.dstAddr": "00:00:00:00:00:02"},
            action="MyIngress.set_egress_port",
            params={"port": 2},
        )
        sw_rt.client.insert_table_entry(
            table="MyIngress.l2_forward",
            match={"hdr.ethernet.dstAddr": "00:00:00:00:00:01"},
            action="MyIngress.set_egress_port",
            params={"port": 1},
        )

    # Publish each switch's BMv2 boot timestamp so the listener can align
    # per-switch ``ingress_timestamp_us`` values to a common wall clock.
    # ``Network.boot_timestamps`` (v1.5+) returns the same mapping as the
    # previous manual ``{name: net.switch(name).boot_timestamp_us}`` form,
    # and adapts automatically if more switches are added later.
    boot_times = net.boot_timestamps
    BOOT_TIMES_PATH.write_text(json.dumps(boot_times, indent=2))
    print(f"boot timestamps written to {BOOT_TIMES_PATH}", flush=True)


if __name__ == "__main__":
    from p4net.cli.main import main

    raise SystemExit(main([__file__]))
Four nodes and three links in a line: h1 - s1 - s2 - h2.
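Both switches receive the same two l2_forward entries, which works only because each switch has its h1-facing link on port 1 and its h2-facing link on port 2. The following sketch traces a destination MAC through the port wiring to make that visible. LINKS, L2_FORWARD and trace are illustrative names that restate topology.py as plain data; they are not part of p4net.

```python
# Port wiring restated from topology.py: (switch, port) -> neighbour.
LINKS = {
    ("s1", 1): "h1",
    ("s1", 2): "s2",
    ("s2", 1): "s1",
    ("s2", 2): "h2",
}

# The two entries setup(net) installs on every switch: dst MAC -> egress port.
L2_FORWARD = {
    "00:00:00:00:00:01": 1,
    "00:00:00:00:00:02": 2,
}


def trace(first_switch, dst_mac, max_hops=8):
    """Follow the forwarding decision switch by switch until a host is reached."""
    path = []
    node = first_switch
    while node.startswith("s") and len(path) < max_hops:
        port = L2_FORWARD[dst_mac]
        path.append((node, port))
        node = LINKS[(node, port)]
    return path, node


print(trace("s1", "00:00:00:00:00:02"))  # h1 -> h2 direction
print(trace("s2", "00:00:00:00:00:01"))  # h2 -> h1 direction
```

A topology where the switches are wired differently would need per-switch entries instead of the shared loop in setup(net).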
P4 program
examples/int_multi_hop/int_multi_hop.p4:
/* Multi-hop in-band network telemetry — two-switch demo.
*
* Each switch on the path inserts its own 14-byte INT shim header between
* Ethernet and IPv4 on every forwarded packet. Shim chaining uses each
* shim's ``next_proto`` field rather than a P4 header stack:
*
* [ Ethernet (etherType = 0x88B6 if any shim is present) ]
* [ INT shim 1 (14 B; next_proto = 0x88B6 or 0x0800) ] <- inserted by hop 1
* [ INT shim 2 (14 B; next_proto = 0x0800) ] <- inserted by hop 2
* [ IPv4 + payload ]
*
* Shim format (identical to ``examples/int/int.p4`` in v1.1.0/v1.2.0):
* switch_id uint8
* ingress_timestamp_us uint48
* egress_port uint16
* queue_depth uint16
* next_proto uint16 (chains to next header in order)
* reserved uint8
*
* Wire-compatible with the single-switch INT listener: a v1.2.0 listener
* pointed at h2 will decode the first shim correctly and stop at the
* ``next_proto`` it doesn't recognize. The multi-hop listener
* (``listener.py``) walks the full chain.
*
* The same P4 program runs on both switches; each switch's identity comes
* from the ``switch_id_reg`` register, written at start-up via the v1.2
* register API.
*
* 2-hop maximum. Real production INT uses a P4 header stack of MAX_HOPS
* depth and ``push_front``; that's left as an extension exercise — see
* the README for the recipe.
*
* Pairs with ``examples/int_multi_hop/topology.py`` (4-node linear:
* h1 — s1 — s2 — h2).
*/
#include <core.p4>
#include <v1model.p4>
const bit<16> ETHERTYPE_IPV4 = 0x0800;
const bit<16> ETHERTYPE_INT = 0x88B6;
header ethernet_t {
bit<48> dstAddr;
bit<48> srcAddr;
bit<16> etherType;
}
header int_shim_t {
bit<8> switch_id;
bit<48> ingress_timestamp_us;
bit<16> egress_port;
bit<16> queue_depth;
bit<16> next_proto;
bit<8> reserved;
}
header ipv4_t {
bit<4> version;
bit<4> ihl;
bit<8> diffserv;
bit<16> totalLen;
bit<16> identification;
bit<3> flags;
bit<13> fragOffset;
bit<8> ttl;
bit<8> protocol;
bit<16> hdrChecksum;
bit<32> srcAddr;
bit<32> dstAddr;
}
struct headers {
ethernet_t ethernet;
int_shim_t int_shim_1;
int_shim_t int_shim_2;
ipv4_t ipv4;
}
struct metadata {}
parser MyParser(packet_in pkt, out headers hdr, inout metadata meta,
inout standard_metadata_t std) {
state start {
pkt.extract(hdr.ethernet);
transition select(hdr.ethernet.etherType) {
ETHERTYPE_IPV4: parse_ipv4;
ETHERTYPE_INT: parse_shim_1;
default: accept;
}
}
state parse_shim_1 {
pkt.extract(hdr.int_shim_1);
transition select(hdr.int_shim_1.next_proto) {
ETHERTYPE_IPV4: parse_ipv4;
ETHERTYPE_INT: parse_shim_2;
default: accept;
}
}
state parse_shim_2 {
pkt.extract(hdr.int_shim_2);
transition select(hdr.int_shim_2.next_proto) {
ETHERTYPE_IPV4: parse_ipv4;
default: accept;
}
}
state parse_ipv4 {
pkt.extract(hdr.ipv4);
transition accept;
}
}
control MyVerifyChecksum(inout headers hdr, inout metadata meta) { apply {} }
control MyIngress(inout headers hdr, inout metadata meta,
inout standard_metadata_t std) {
/* One-element register holding this switch's INT identifier.
* Written by the controller via P4RuntimeClient.write_register. */
register<bit<8>>(1) switch_id_reg;
action drop_packet() {
mark_to_drop(std);
}
action set_egress_port(bit<9> port) {
std.egress_spec = port;
}
table l2_forward {
key = {
hdr.ethernet.dstAddr: exact;
}
actions = {
drop_packet;
set_egress_port;
NoAction;
}
default_action = NoAction();
size = 1024;
}
apply {
if (hdr.ipv4.isValid()) {
l2_forward.apply();
if (std.egress_spec != 0) {
bit<8> sid;
switch_id_reg.read(sid, 0);
if (!hdr.int_shim_1.isValid()) {
/* First hop on path. */
hdr.int_shim_1.setValid();
hdr.int_shim_1.switch_id = sid;
hdr.int_shim_1.ingress_timestamp_us = (bit<48>) std.ingress_global_timestamp;
hdr.int_shim_1.egress_port = (bit<16>) std.egress_spec;
hdr.int_shim_1.queue_depth = (bit<16>) std.deq_qdepth;
hdr.int_shim_1.next_proto = hdr.ethernet.etherType;
hdr.int_shim_1.reserved = 0;
hdr.ethernet.etherType = ETHERTYPE_INT;
} else if (!hdr.int_shim_2.isValid()) {
/* Second hop. Chain shim_1.next_proto -> 0x88B6 so the
* receiver sees shim_1 -> shim_2 -> IPv4. */
hdr.int_shim_2.setValid();
hdr.int_shim_2.switch_id = sid;
hdr.int_shim_2.ingress_timestamp_us = (bit<48>) std.ingress_global_timestamp;
hdr.int_shim_2.egress_port = (bit<16>) std.egress_spec;
hdr.int_shim_2.queue_depth = (bit<16>) std.deq_qdepth;
hdr.int_shim_2.next_proto = hdr.int_shim_1.next_proto;
hdr.int_shim_2.reserved = 0;
hdr.int_shim_1.next_proto = ETHERTYPE_INT;
}
/* Both shim slots full = 3+ hop topology; this example does
* not support that. Real deployments use a header stack of
* MAX_HOPS depth and push_front(1). The packet still
* forwards correctly through this switch; the receiver
* just won't see the third hop's metadata. */
}
}
}
}
control MyEgress(inout headers hdr, inout metadata meta,
inout standard_metadata_t std) { apply {} }
control MyComputeChecksum(inout headers hdr, inout metadata meta) { apply {} }
control MyDeparser(packet_out pkt, in headers hdr) {
apply {
pkt.emit(hdr.ethernet);
pkt.emit(hdr.int_shim_1);
pkt.emit(hdr.int_shim_2);
pkt.emit(hdr.ipv4);
}
}
V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
MyComputeChecksum(), MyDeparser()) main;
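Key points:
- Two named header instances, int_shim_1 and int_shim_2, are used instead of a P4 header stack. This is easier to read for two hops; for N hops, see the "Extending to N hops" section of the example README.
- Ingress picks the first unfilled shim slot and fills it from standard_metadata and the configured switch_id. The next_proto links are rewritten so that the receiver sees eth → shim_1 → shim_2 → ipv4 in order.
- The deparser emits the headers in a fixed order (ethernet, int_shim_1, int_shim_2, ipv4) and skips any that are not valid.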
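The slot selection and relinking done in the ingress apply block can be modelled in plain Python. This is a simplified sketch, not a translation of the P4: packets are dictionaries, the ipv4.isValid() and egress_spec guards are left out, and the names ingress and chain are hypothetical.

```python
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_INT = 0x88B6


def ingress(pkt, switch_id, ts_us, egress_port):
    """Fill the first free shim slot and relink next_proto, like the apply block."""
    # Copy so the caller's packet is left untouched.
    pkt = {k: (dict(v) if isinstance(v, dict) else v) for k, v in pkt.items()}
    shim = {"switch_id": switch_id, "ts_us": ts_us, "egress_port": egress_port}
    if pkt["shim_1"] is None:
        # First hop: the shim inherits the original EtherType.
        shim["next_proto"] = pkt["ether_type"]
        pkt["shim_1"] = shim
        pkt["ether_type"] = ETHERTYPE_INT
    elif pkt["shim_2"] is None:
        # Second hop: shim_2 takes over shim_1's link, shim_1 now points at INT.
        shim["next_proto"] = pkt["shim_1"]["next_proto"]
        pkt["shim_2"] = shim
        pkt["shim_1"]["next_proto"] = ETHERTYPE_INT
    # Both slots full (third hop or later): forwarded unchanged.
    return pkt


def chain(pkt):
    """Protocol numbers in wire order: EtherType, then each present shim's next_proto."""
    protos = [pkt["ether_type"]]
    for slot in ("shim_1", "shim_2"):
        if pkt[slot] is not None:
            protos.append(pkt[slot]["next_proto"])
    return protos


pkt0 = {"ether_type": ETHERTYPE_IPV4, "shim_1": None, "shim_2": None}
after_s1 = ingress(pkt0, switch_id=1, ts_us=100, egress_port=2)
after_s2 = ingress(after_s1, switch_id=2, ts_us=50, egress_port=2)
after_s3 = ingress(after_s2, switch_id=3, ts_us=10, egress_port=2)
print([hex(p) for p in chain(after_s2)])
```

Calling ingress a third time returns the packet unchanged, which mirrors the behaviour the P4 comment describes for a third hop.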
Listener
examples/int_multi_hop/listener.py:
"""Multi-hop INT listener: decodes a chain of stacked INT shim headers.

Walks the receiving frame's protocol chain starting from the outer
EtherType, parsing one 14-byte shim per hop until ``next_proto`` points
back into a non-INT protocol (typically IPv4, ``0x0800``).

If a coordination file is present at
``/tmp/p4net-int-multi-hop-boot-times.json`` (written by
``topology.py``'s ``setup(net)``), each switch's BMv2 boot timestamp is
loaded and combined with the per-hop ``ingress_timestamp_us`` to print
wall-clock arrival times and a per-hop forwarding-latency line.

Usage (must be run as root for AF_PACKET access):

    sudo ip netns exec h2 python3 listener.py --iface h2-eth0

Or from the p4net interactive shell:

    h2 xterm
    # in the spawned xterm:
    sudo python3 examples/int_multi_hop/listener.py --iface h2-eth0
"""

from __future__ import annotations

import argparse
import json
import os
import socket
import struct
import sys
from pathlib import Path

ETH_P_ALL = 0x0003
ETHERTYPE_INT = 0x88B6
ETHERTYPE_IPV4 = 0x0800
SHIM_LEN = 14

# ``P4NET_INT_BOOT_TIMES_PATH`` overrides the coordination file path; pass
# it with ``sudo -E`` to preserve the variable across privilege escalation.
# Both this listener and ``topology.py`` read the same env var.
DEFAULT_BOOT_TIMES_PATH = Path(
    os.environ.get(
        "P4NET_INT_BOOT_TIMES_PATH",
        "/tmp/p4net-int-multi-hop-boot-times.json",
    )
)

# Map a 1-based hop index in the captured frame to the switch name in the
# coordination file. The 2-switch example always sees s1 first, then s2.
HOP_INDEX_TO_SWITCH = {1: "s1", 2: "s2"}


def _decode_shim(buf: bytes) -> dict[str, int]:
    """Decode one 14-byte INT shim."""
    if len(buf) < SHIM_LEN:
        raise ValueError(f"INT shim truncated: got {len(buf)} bytes, need {SHIM_LEN}")
    return {
        "switch_id": buf[0],
        "ingress_timestamp_us": int.from_bytes(buf[1:7], "big"),
        "egress_port": struct.unpack("!H", buf[7:9])[0],
        "queue_depth": struct.unpack("!H", buf[9:11])[0],
        "next_proto": struct.unpack("!H", buf[11:13])[0],
        "reserved": buf[13],
    }


def _decode_ipv4_addrs(buf: bytes) -> tuple[str, str] | None:
    if len(buf) < 20:
        return None
    src = socket.inet_ntoa(buf[12:16])
    dst = socket.inet_ntoa(buf[16:20])
    return src, dst


def _load_boot_times(path: Path) -> dict[str, int] | None:
    """Return ``{switch_name: boot_timestamp_us}`` or ``None`` if not present."""
    if not path.is_file():
        return None
    try:
        raw = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(raw, dict):
        return None
    return {str(k): int(v) for k, v in raw.items()}


def _render_packet(
    hops: list[dict[str, int]],
    next_proto: int,
    flow: str,
    boot_times: dict[str, int] | None,
) -> str:
    """Format one packet's hops for stdout. Used by both modes."""
    lines: list[str] = [f"packet ({len(hops)} hop(s), final proto 0x{next_proto:04x}):{flow}"]
    aligned_per_hop: list[int | None] = []
    for i, hop in enumerate(hops, 1):
        boot_us = None
        if boot_times is not None:
            sw_name = HOP_INDEX_TO_SWITCH.get(i)
            if sw_name is not None:
                boot_us = boot_times.get(sw_name)
        if boot_us is not None:
            aligned_us = boot_us + hop["ingress_timestamp_us"]
            aligned_per_hop.append(aligned_us)
            lines.append(
                f" hop {i}: switch_id={hop['switch_id']} "
                f"ts={hop['ingress_timestamp_us']}us "
                f"aligned={aligned_us}us "
                f"egress_port={hop['egress_port']} "
                f"queue_depth={hop['queue_depth']}"
            )
        else:
            aligned_per_hop.append(None)
            lines.append(
                f" hop {i}: switch_id={hop['switch_id']} "
                f"ts={hop['ingress_timestamp_us']}us "
                f"[unaligned] "
                f"egress_port={hop['egress_port']} "
                f"queue_depth={hop['queue_depth']}"
            )
    if boot_times is None:
        lines.append(
            " (run via `sudo p4net examples/int_multi_hop/topology.py` to get aligned timestamps)"
        )
    elif len(aligned_per_hop) == 2 and all(a is not None for a in aligned_per_hop):
        delta = aligned_per_hop[1] - aligned_per_hop[0]  # type: ignore[operator]
        lines.append(f" latency_s1_to_s2 = {delta}us")
    return "\n".join(lines) + "\n"


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Decode stacked INT shim headers from a raw AF_PACKET socket."
    )
    parser.add_argument(
        "--iface",
        required=True,
        help="Interface name to bind to (e.g. h2-eth0).",
    )
    parser.add_argument(
        "--count",
        type=int,
        default=0,
        help="Exit after printing this many INT frames (0 = forever).",
    )
    parser.add_argument(
        "--boot-times",
        type=Path,
        default=DEFAULT_BOOT_TIMES_PATH,
        help=(
            "Path to the coordination JSON written by topology.py "
            "(default: %(default)s). If missing, timestamps are shown unaligned."
        ),
    )
    args = parser.parse_args()

    boot_times = _load_boot_times(args.boot_times)

    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind((args.iface, 0))
    if boot_times is not None:
        sys.stdout.write(
            f"[listener] bound on {args.iface}; "
            f"boot times loaded from {args.boot_times}: {boot_times}\n"
        )
    else:
        sys.stdout.write(
            f"[listener] bound on {args.iface}; "
            f"no boot-times file at {args.boot_times}; running unaligned\n"
        )
    sys.stdout.flush()

    seen = 0
    while True:
        frame, _addr = sock.recvfrom(65535)
        if len(frame) < 14 + SHIM_LEN:
            continue
        etype = int.from_bytes(frame[12:14], "big")
        if etype != ETHERTYPE_INT:
            continue

        # Walk the shim chain. ``next_proto`` on each shim points to the
        # next header in order; we stop when it leaves the INT space.
        hops: list[dict[str, int]] = []
        offset = 14
        next_proto = etype
        while next_proto == ETHERTYPE_INT and offset + SHIM_LEN <= len(frame):
            shim = _decode_shim(frame[offset : offset + SHIM_LEN])
            hops.append(shim)
            offset += SHIM_LEN
            next_proto = shim["next_proto"]

        addrs = _decode_ipv4_addrs(frame[offset:]) if next_proto == ETHERTYPE_IPV4 else None
        flow = f" {addrs[0]} -> {addrs[1]}" if addrs else ""
        sys.stdout.write(_render_packet(hops, next_proto, flow, boot_times))
        sys.stdout.flush()

        seen += 1
        if args.count and seen >= args.count:
            return 0


if __name__ == "__main__":
    raise SystemExit(main())
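The listener walks the shim chain starting from the outer EtherType, decoding one 14-byte shim per hop, until next_proto leaves the INT space.
Running it
In one terminal:
sudo p4net examples/int_multi_hop/topology.py
setup(net) installs the L2 forwarding entries on both switches, seeds static ARP on both hosts, and writes each switch's identity register. You then land in the p4net> shell.
In another terminal (or run h2 xterm from the shell):
sudo ip netns exec h2 python3 examples/int_multi_hop/listener.py --iface h2-eth0
In a third terminal:
sudo ip netns exec h1 ping -c 3 -W 1 10.0.0.2
The listener prints one block per packet; in this topology each block has two hop lines, one per switch.
Example output
Measured in aligned mode by the v1.4 multi-hop integration test: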
packet (2 hop(s), final proto 0x0800): 10.0.0.1 -> 10.0.0.2
hop 1: switch_id=1 ts=800454us aligned=1778513670403185us egress_port=2 queue_depth=0
hop 2: switch_id=2 ts=699418us aligned=1778513670403875us egress_port=2 queue_depth=0
latency_s1_to_s2 = 690us
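hop 1 is s1 and hop 2 is s2. ts is the BMv2 per-process local timestamp; aligned is the aligned Unix wall-clock value in microseconds; latency_s1_to_s2 is the difference between the two aligned timestamps, that is, the time from ingress at s1 to ingress at s2, which covers s1's BMv2 user-space pipeline plus the veth pair between the switches.
If the listener runs without setup(net) having written the coordination file, it falls back to the v1.3 unaligned display: each hop line shows ts and an [unaligned] marker instead of an aligned= field, and there is no latency line.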
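How cross-switch timestamp alignment works
BMv2's standard_metadata.ingress_global_timestamp is per-process: each simple_switch_grpc instance's clock starts from zero when that instance starts, so the raw ts values in shim_1 and shim_2 cannot be compared directly across switches. Since v1.4, every RunningSwitch exposes a boot_timestamp_us attribute (the Unix wall-clock time in microseconds at process start, captured just before subprocess.Popen). The alignment formula is:
wall_clock_us = switch.boot_timestamp_us + shim.ingress_timestamp_us
setup(net) writes both switches' boot timestamps to the coordination file /tmp/p4net-int-multi-hop-boot-times.json. The listener reads it at startup and prints aligned=...us next to each raw ts; the difference between the two aligned values is latency_s1_to_s2.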
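As a worked example, the arithmetic behind the sample output in the previous section can be reproduced directly. The boot timestamps below are not taken from a real coordination file; they are back-derived from that sample as aligned minus ts, so treat them as illustrative values.

```python
# Boot timestamps back-derived from the sample output (aligned - ts); illustrative only.
boot_times = {"s1": 1778513669602731, "s2": 1778513669704457}

# Raw per-process ingress timestamps carried in the two shims of the sample packet.
hops = [("s1", 800454), ("s2", 699418)]

# wall_clock_us = switch.boot_timestamp_us + shim.ingress_timestamp_us
aligned = [boot_times[name] + ts for name, ts in hops]
latency_s1_to_s2 = aligned[1] - aligned[0]

# Comparing the raw values directly gives a meaningless (here negative) delta,
# because s2's clock started later than s1's.
raw_delta = hops[1][1] - hops[0][1]

print(aligned)
print(latency_s1_to_s2, raw_delta)
```

The raw delta is negative even though the packet reached s2 after s1, which is exactly the problem the per-switch boot offset corrects.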
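The alignment error is set by the overhead of Popen plus process initialisation: typically sub-millisecond, occasionally a few milliseconds under load. That is enough to tell whether cross-hop latency is on the order of microseconds or milliseconds; nanosecond-level latency studies still need a shared time source such as PTP.
topology.py builds the coordination dictionary from Network.boot_timestamps (v1.5+), a read-only dictionary keyed by switch name. The earlier hand-written form {name: net.switch(name).boot_timestamp_us} still works, but it does not adapt automatically when switches are added.
Running several topologies in parallel on one host
The coordination file defaults to /tmp/p4net-int-multi-hop-boot-times.json. To run several multi-hop INT topologies in parallel on the same host without them overwriting each other's file, point the P4NET_INT_BOOT_TIMES_PATH environment variable at a separate path for each: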
P4NET_INT_BOOT_TIMES_PATH=/tmp/topo-a.json \
sudo -E p4net examples/int_multi_hop/topology.py
P4NET_INT_BOOT_TIMES_PATH=/tmp/topo-a.json \
sudo -E ip netns exec h2 python3 \
examples/int_multi_hop/listener.py --iface h2-eth0
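sudo -E is required: without it sudo strips most environment variables, and both processes silently fall back to the default path and clobber each other's data. The topology and the listener must read the same path.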
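The silent fallback is easy to see in isolation. The sketch below mirrors the os.environ.get lookup that topology.py and listener.py both perform; resolve_boot_times_path is a hypothetical helper that takes the environment as an argument so the two cases can be compared side by side.

```python
from pathlib import Path

DEFAULT_PATH = "/tmp/p4net-int-multi-hop-boot-times.json"


def resolve_boot_times_path(environ):
    """Same lookup as both scripts: the env var if present, else the default."""
    return Path(environ.get("P4NET_INT_BOOT_TIMES_PATH", DEFAULT_PATH))


# Environment as the process sees it when the variable survives (sudo -E).
preserved = {"P4NET_INT_BOOT_TIMES_PATH": "/tmp/topo-a.json"}
# Environment as the process sees it when sudo has dropped the variable.
stripped = {}

print(resolve_boot_times_path(preserved))
print(resolve_boot_times_path(stripped))
```

No error is raised in the second case, so a mismatch between the topology and the listener only shows up as wrong or missing aligned values.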
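Things worth noting
- Per-hop forwarding latency is now observable. On this machine the latency_s1_to_s2 line ranges from a few hundred microseconds to a few milliseconds. Real ASIC switches are 10 to 100 times faster; the BMv2 user-space interpreter is the bottleneck.
- Egress ports follow the path direction. s1 forwards toward s2 out of port 2; s2 forwards toward h2 out of port 2. Other topologies give different port numbers.
- queue_depth stays at 0 under this workload: with BMv2's default queue settings, non-zero values do not appear without explicit queue configuration and a saturating load.
Caveats
- The pipeline currently supports only two hops. A third switch finds both shim slots already valid and forwards the packet without appending anything. Real deployments use a P4 header stack of MAX_HOPS depth; the example README has the steps for the rewrite.
- Alignment has sub-millisecond drift. boot_timestamp_us is captured just before Popen, and BMv2's internal clock zero is slightly later. This is good enough for a rough look, not for nanosecond-precision studies.
- The listener depends on a coordination file under /tmp/. The default path is /tmp/p4net-int-multi-hop-boot-times.json; to run several topologies in parallel, point P4NET_INT_BOOT_TIMES_PATH (with sudo -E) at separate paths.
- queue_depth is almost always 0, as in the single-switch example.
- No checksum is recomputed for the inserted shims. The IPv4 checksum covers only the IPv4 header itself; the shim layer between Ethernet and IPv4 is unprotected, which matches the INT specification's assumption of link-layer integrity.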