Multi-hop INT (in-band telemetry)
Two switches in series each insert their own 14-byte INT shim header into every IPv4 packet they forward. The receiver decodes the full hop-by-hop stack to reconstruct the packet's journey across the topology. This is the production-style INT example; for the simpler single-switch introduction, see INT (in-band telemetry).
What this demonstrates
- Hop-by-hop metadata accumulation: every switch on the path inserts its own shim, so the egress receiver sees one block of metadata per traversed switch.
- Shim chaining via next_proto: each shim's next_proto field names the next header in order. The parser walks etherType → shim_1.next_proto → shim_2.next_proto → ipv4. No P4 header stack required for the two-hop case.
- Per-switch identity from a register: the same P4 program runs on both switches; per-switch switch_id comes from the v1.2 register API via write_register("MyIngress.switch_id_reg", index=0, value=N) at start-up.
Topology
examples/int_multi_hop/topology.py:
"""Linear 4-node topology demonstrating multi-hop INT.

    h1 (10.0.0.1/24) --- s1 --------- s2 --- h2 (10.0.0.2/24)
                   port1  port2  port1  port2

Both switches run the same P4 program (``int_multi_hop.p4``). Each switch's
``switch_id`` register is written at start-up via the v1.2 register API:
s1 gets ``1``, s2 gets ``2``. L2 forwarding is exact-match on destination
MAC; static ARP is seeded between the hosts.

Run as root:

    sudo p4net examples/int_multi_hop/topology.py

Then in another terminal (``--iface`` is required by the listener):

    sudo ip netns exec h2 python3 examples/int_multi_hop/listener.py --iface h2-eth0

And from the p4net shell (or a third terminal):

    sudo ip netns exec h1 ping -c 3 -W 1 10.0.0.2

The listener prints one block per packet, with one line per traversed
switch (two lines per packet in this topology).
"""
from __future__ import annotations

import json
import os
from pathlib import Path

from p4net import Network
from p4net.topo import Topology

HERE = Path(__file__).resolve().parent

# Coordination file consumed by ``listener.py`` so the listener can align
# each switch's per-process BMv2 timestamp to wall-clock microseconds:
#
#     wall_clock_us = switch.boot_timestamp_us + shim.ingress_timestamp_us
#
# Written at the end of ``setup(net)`` once both switches are running.
# The path is overridable via the ``P4NET_INT_BOOT_TIMES_PATH`` environment
# variable so multiple multi-hop INT topologies can coexist on one host.
# Pass it with ``sudo -E`` to preserve the variable across privilege
# escalation; both topology.py and listener.py read the same env var.
BOOT_TIMES_PATH = Path(
    os.environ.get(
        "P4NET_INT_BOOT_TIMES_PATH",
        "/tmp/p4net-int-multi-hop-boot-times.json",
    )
)

topology = Topology()
h1 = topology.add_host("h1", ip="10.0.0.1/24", mac="00:00:00:00:00:01")
h2 = topology.add_host("h2", ip="10.0.0.2/24", mac="00:00:00:00:00:02")
s1 = topology.add_switch("s1", p4_src=HERE / "int_multi_hop.p4")
s2 = topology.add_switch("s2", p4_src=HERE / "int_multi_hop.p4")
topology.add_link(h1, s1, port_b=1)
topology.add_link(s1, s2, port_a=2, port_b=1)
topology.add_link(s2, h2, port_a=2)
def setup(net: Network) -> None:
    """Static ARP, l2_forward tables, switch_id registers."""
    h1_rt = net.host("h1")
    h2_rt = net.host("h2")
    s1_rt = net.switch("s1")
    s2_rt = net.switch("s2")

    h1_rt.exec(
        [
            "ip",
            "neigh",
            "replace",
            "10.0.0.2",
            "lladdr",
            "00:00:00:00:00:02",
            "dev",
            "h1-eth0",
            "nud",
            "permanent",
        ]
    )
    h2_rt.exec(
        [
            "ip",
            "neigh",
            "replace",
            "10.0.0.1",
            "lladdr",
            "00:00:00:00:00:01",
            "dev",
            "h2-eth0",
            "nud",
            "permanent",
        ]
    )

    # Per-switch INT identity.
    s1_rt.client.write_register("MyIngress.switch_id_reg", index=0, value=1)
    s2_rt.client.write_register("MyIngress.switch_id_reg", index=0, value=2)

    # L2 forwarding: route by destination MAC out the link toward the host.
    for sw_rt in (s1_rt, s2_rt):
        sw_rt.client.insert_table_entry(
            table="MyIngress.l2_forward",
            match={"hdr.ethernet.dstAddr": "00:00:00:00:00:02"},
            action="MyIngress.set_egress_port",
            params={"port": 2},
        )
        sw_rt.client.insert_table_entry(
            table="MyIngress.l2_forward",
            match={"hdr.ethernet.dstAddr": "00:00:00:00:00:01"},
            action="MyIngress.set_egress_port",
            params={"port": 1},
        )

    # Publish each switch's BMv2 boot timestamp so the listener can align
    # per-switch ``ingress_timestamp_us`` values to a common wall clock.
    # ``Network.boot_timestamps`` (v1.5+) returns the same mapping as the
    # previous manual ``{name: net.switch(name).boot_timestamp_us}`` form,
    # and adapts automatically if more switches are added later.
    boot_times = net.boot_timestamps
    BOOT_TIMES_PATH.write_text(json.dumps(boot_times, indent=2))
    print(f"boot timestamps written to {BOOT_TIMES_PATH}", flush=True)


if __name__ == "__main__":
    from p4net.cli.main import main

    raise SystemExit(main([__file__]))
Four nodes, three links, linear path: h1 — s1 — s2 — h2.
P4 program
examples/int_multi_hop/int_multi_hop.p4:
/* Multi-hop in-band network telemetry: two-switch demo.
 *
 * Each switch on the path inserts its own 14-byte INT shim header between
 * Ethernet and IPv4 on every forwarded packet. Shim chaining uses each
 * shim's ``next_proto`` field rather than a P4 header stack:
 *
 *   [ Ethernet    (etherType = 0x88B6 if any shim is present) ]
 *   [ INT shim 1  (14 B; next_proto = 0x88B6 or 0x0800)       ]  <- inserted by hop 1
 *   [ INT shim 2  (14 B; next_proto = 0x0800)                 ]  <- inserted by hop 2
 *   [ IPv4 + payload                                          ]
 *
 * Shim format (identical to ``examples/int/int.p4`` in v1.1.0/v1.2.0):
 *   switch_id              uint8
 *   ingress_timestamp_us   uint48
 *   egress_port            uint16
 *   queue_depth            uint16
 *   next_proto             uint16  (chains to next header in order)
 *   reserved               uint8
 *
 * Wire-compatible with the single-switch INT listener: a v1.2.0 listener
 * pointed at h2 will decode the first shim correctly and stop at the
 * ``next_proto`` it doesn't recognize. The multi-hop listener
 * (``listener.py``) walks the full chain.
 *
 * The same P4 program runs on both switches; each switch's identity comes
 * from the ``switch_id_reg`` register, written at start-up via the v1.2
 * register API.
 *
 * 2-hop maximum. Real production INT uses a P4 header stack of MAX_HOPS
 * depth and ``push_front``; that's left as an extension exercise. See
 * the README for the recipe.
 *
 * Pairs with ``examples/int_multi_hop/topology.py`` (4-node linear:
 * h1 - s1 - s2 - h2).
 */
#include <core.p4>
#include <v1model.p4>
const bit<16> ETHERTYPE_IPV4 = 0x0800;
const bit<16> ETHERTYPE_INT = 0x88B6;
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

header int_shim_t {
    bit<8>  switch_id;
    bit<48> ingress_timestamp_us;
    bit<16> egress_port;
    bit<16> queue_depth;
    bit<16> next_proto;
    bit<8>  reserved;
}

header ipv4_t {
    bit<4>  version;
    bit<4>  ihl;
    bit<8>  diffserv;
    bit<16> totalLen;
    bit<16> identification;
    bit<3>  flags;
    bit<13> fragOffset;
    bit<8>  ttl;
    bit<8>  protocol;
    bit<16> hdrChecksum;
    bit<32> srcAddr;
    bit<32> dstAddr;
}

struct headers {
    ethernet_t ethernet;
    int_shim_t int_shim_1;
    int_shim_t int_shim_2;
    ipv4_t     ipv4;
}

struct metadata {}
parser MyParser(packet_in pkt, out headers hdr, inout metadata meta,
                inout standard_metadata_t std) {
    state start {
        pkt.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            ETHERTYPE_IPV4: parse_ipv4;
            ETHERTYPE_INT:  parse_shim_1;
            default:        accept;
        }
    }

    state parse_shim_1 {
        pkt.extract(hdr.int_shim_1);
        transition select(hdr.int_shim_1.next_proto) {
            ETHERTYPE_IPV4: parse_ipv4;
            ETHERTYPE_INT:  parse_shim_2;
            default:        accept;
        }
    }

    state parse_shim_2 {
        pkt.extract(hdr.int_shim_2);
        transition select(hdr.int_shim_2.next_proto) {
            ETHERTYPE_IPV4: parse_ipv4;
            default:        accept;
        }
    }

    state parse_ipv4 {
        pkt.extract(hdr.ipv4);
        transition accept;
    }
}
control MyVerifyChecksum(inout headers hdr, inout metadata meta) { apply {} }
control MyIngress(inout headers hdr, inout metadata meta,
                  inout standard_metadata_t std) {
    /* One-element register holding this switch's INT identifier.
     * Written by the controller via P4RuntimeClient.write_register. */
    register<bit<8>>(1) switch_id_reg;

    action drop_packet() {
        mark_to_drop(std);
    }

    action set_egress_port(bit<9> port) {
        std.egress_spec = port;
    }

    table l2_forward {
        key = {
            hdr.ethernet.dstAddr: exact;
        }
        actions = {
            drop_packet;
            set_egress_port;
            NoAction;
        }
        default_action = NoAction();
        size = 1024;
    }

    apply {
        if (hdr.ipv4.isValid()) {
            l2_forward.apply();
            if (std.egress_spec != 0) {
                bit<8> sid;
                switch_id_reg.read(sid, 0);
                if (!hdr.int_shim_1.isValid()) {
                    /* First hop on path. */
                    hdr.int_shim_1.setValid();
                    hdr.int_shim_1.switch_id = sid;
                    hdr.int_shim_1.ingress_timestamp_us = (bit<48>) std.ingress_global_timestamp;
                    hdr.int_shim_1.egress_port = (bit<16>) std.egress_spec;
                    hdr.int_shim_1.queue_depth = (bit<16>) std.deq_qdepth;
                    hdr.int_shim_1.next_proto = hdr.ethernet.etherType;
                    hdr.int_shim_1.reserved = 0;
                    hdr.ethernet.etherType = ETHERTYPE_INT;
                } else if (!hdr.int_shim_2.isValid()) {
                    /* Second hop. Chain shim_1.next_proto -> 0x88B6 so the
                     * receiver sees shim_1 -> shim_2 -> IPv4. */
                    hdr.int_shim_2.setValid();
                    hdr.int_shim_2.switch_id = sid;
                    hdr.int_shim_2.ingress_timestamp_us = (bit<48>) std.ingress_global_timestamp;
                    hdr.int_shim_2.egress_port = (bit<16>) std.egress_spec;
                    hdr.int_shim_2.queue_depth = (bit<16>) std.deq_qdepth;
                    hdr.int_shim_2.next_proto = hdr.int_shim_1.next_proto;
                    hdr.int_shim_2.reserved = 0;
                    hdr.int_shim_1.next_proto = ETHERTYPE_INT;
                }
                /* Both shim slots full = 3+ hop topology; this example does
                 * not support that. Real deployments use a header stack of
                 * MAX_HOPS depth and push_front(1). The packet still
                 * forwards correctly through this switch; the receiver
                 * just won't see the third hop's metadata. */
            }
        }
    }
}
control MyEgress(inout headers hdr, inout metadata meta,
                 inout standard_metadata_t std) { apply {} }

control MyComputeChecksum(inout headers hdr, inout metadata meta) { apply {} }

control MyDeparser(packet_out pkt, in headers hdr) {
    apply {
        pkt.emit(hdr.ethernet);
        pkt.emit(hdr.int_shim_1);
        pkt.emit(hdr.int_shim_2);
        pkt.emit(hdr.ipv4);
    }
}

V1Switch(MyParser(), MyVerifyChecksum(), MyIngress(), MyEgress(),
         MyComputeChecksum(), MyDeparser()) main;
Key points:
- Two named header instances, int_shim_1 and int_shim_2, instead of a P4 header stack. Easier to read at two hops; for N hops, see the "Extending" section in the example README.
- Ingress picks the first unfilled shim slot and writes it from standard_metadata plus the configured switch_id. The next_proto chain is re-stitched so the receiver sees eth → shim_1 → shim_2 → ipv4.
- The deparser emits every valid header in the order of its emit calls: Ethernet, shim 1, shim 2, IPv4. Headers that are not valid are skipped.
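The slot selection and next_proto re-stitching described above can be traced with a small Python model. This is only a sketch of the control flow in MyIngress.apply, not part of the example: the ingress function and the dict layout are illustrative assumptions, and every telemetry field except switch_id is left out.

```python
ETHERTYPE_INT = 0x88B6
ETHERTYPE_IPV4 = 0x0800


def ingress(hdr, switch_id):
    """Mimic the shim-slot selection of MyIngress.apply on a dict of headers.

    hdr["shim_1"] / hdr["shim_2"] are None while the P4 header is invalid.
    """
    if hdr["shim_1"] is None:
        # First hop: the shim inherits the old EtherType, Ethernet points at INT.
        hdr["shim_1"] = {"switch_id": switch_id, "next_proto": hdr["ether_type"]}
        hdr["ether_type"] = ETHERTYPE_INT
    elif hdr["shim_2"] is None:
        # Second hop: shim_2 takes over shim_1's next_proto, shim_1 points at INT.
        hdr["shim_2"] = {
            "switch_id": switch_id,
            "next_proto": hdr["shim_1"]["next_proto"],
        }
        hdr["shim_1"]["next_proto"] = ETHERTYPE_INT
    # Both slots full: the packet is forwarded without further annotation.
    return hdr


pkt = {"ether_type": ETHERTYPE_IPV4, "shim_1": None, "shim_2": None}
for sid in (1, 2, 3):  # a hypothetical third hop leaves the headers untouched
    ingress(pkt, sid)

chain = [pkt["ether_type"], pkt["shim_1"]["next_proto"], pkt["shim_2"]["next_proto"]]
print([hex(p) for p in chain])  # ['0x88b6', '0x88b6', '0x800']
print(pkt["shim_1"]["switch_id"], pkt["shim_2"]["switch_id"])  # 1 2
```

The third call shows the two-hop limit: switch 3 finds both slots taken and its identity never reaches the receiver.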
The listener
examples/int_multi_hop/listener.py:
"""Multi-hop INT listener — decodes a chain of stacked INT shim headers.

Walks the receiving frame's protocol chain starting from the outer
EtherType, parsing one 14-byte shim per hop until ``next_proto`` points
back into a non-INT protocol (typically IPv4, ``0x0800``).

If a coordination file is present at
``/tmp/p4net-int-multi-hop-boot-times.json`` (written by
``topology.py``'s ``setup(net)``), each switch's BMv2 boot timestamp is
loaded and combined with the per-hop ``ingress_timestamp_us`` to print
wall-clock arrival times and a per-hop forwarding-latency line.

Usage (must be run as root for AF_PACKET access):

    sudo ip netns exec h2 python3 listener.py --iface h2-eth0

Or from the p4net interactive shell:

    h2 xterm
    # in the spawned xterm:
    sudo python3 examples/int_multi_hop/listener.py --iface h2-eth0
"""
from __future__ import annotations

import argparse
import json
import os
import socket
import struct
import sys
from pathlib import Path

ETH_P_ALL = 0x0003
ETHERTYPE_INT = 0x88B6
ETHERTYPE_IPV4 = 0x0800
SHIM_LEN = 14

# ``P4NET_INT_BOOT_TIMES_PATH`` overrides the coordination file path; pass
# it with ``sudo -E`` to preserve the variable across privilege escalation.
# Both this listener and ``topology.py`` read the same env var.
DEFAULT_BOOT_TIMES_PATH = Path(
    os.environ.get(
        "P4NET_INT_BOOT_TIMES_PATH",
        "/tmp/p4net-int-multi-hop-boot-times.json",
    )
)

# Map a 1-based hop index in the captured frame to the switch name in the
# coordination file. The 2-switch example always sees s1 first, then s2.
HOP_INDEX_TO_SWITCH = {1: "s1", 2: "s2"}
def _decode_shim(buf: bytes) -> dict[str, int]:
    """Decode one 14-byte INT shim."""
    if len(buf) < SHIM_LEN:
        raise ValueError(f"INT shim truncated: got {len(buf)} bytes, need {SHIM_LEN}")
    return {
        "switch_id": buf[0],
        "ingress_timestamp_us": int.from_bytes(buf[1:7], "big"),
        "egress_port": struct.unpack("!H", buf[7:9])[0],
        "queue_depth": struct.unpack("!H", buf[9:11])[0],
        "next_proto": struct.unpack("!H", buf[11:13])[0],
        "reserved": buf[13],
    }


def _decode_ipv4_addrs(buf: bytes) -> tuple[str, str] | None:
    if len(buf) < 20:
        return None
    src = socket.inet_ntoa(buf[12:16])
    dst = socket.inet_ntoa(buf[16:20])
    return src, dst


def _load_boot_times(path: Path) -> dict[str, int] | None:
    """Return ``{switch_name: boot_timestamp_us}`` or ``None`` if not present."""
    if not path.is_file():
        return None
    try:
        raw = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(raw, dict):
        return None
    return {str(k): int(v) for k, v in raw.items()}
def _render_packet(
    hops: list[dict[str, int]],
    next_proto: int,
    flow: str,
    boot_times: dict[str, int] | None,
) -> str:
    """Format one packet's hops for stdout. Used by both modes."""
    lines: list[str] = [f"packet ({len(hops)} hop(s), final proto 0x{next_proto:04x}):{flow}"]
    aligned_per_hop: list[int | None] = []
    for i, hop in enumerate(hops, 1):
        boot_us = None
        if boot_times is not None:
            sw_name = HOP_INDEX_TO_SWITCH.get(i)
            if sw_name is not None:
                boot_us = boot_times.get(sw_name)
        if boot_us is not None:
            aligned_us = boot_us + hop["ingress_timestamp_us"]
            aligned_per_hop.append(aligned_us)
            lines.append(
                f"  hop {i}: switch_id={hop['switch_id']} "
                f"ts={hop['ingress_timestamp_us']}us "
                f"aligned={aligned_us}us "
                f"egress_port={hop['egress_port']} "
                f"queue_depth={hop['queue_depth']}"
            )
        else:
            aligned_per_hop.append(None)
            lines.append(
                f"  hop {i}: switch_id={hop['switch_id']} "
                f"ts={hop['ingress_timestamp_us']}us "
                f"[unaligned] "
                f"egress_port={hop['egress_port']} "
                f"queue_depth={hop['queue_depth']}"
            )
    if boot_times is None:
        lines.append(
            "  (run via `sudo p4net examples/int_multi_hop/topology.py` to get aligned timestamps)"
        )
    elif len(aligned_per_hop) == 2 and all(a is not None for a in aligned_per_hop):
        delta = aligned_per_hop[1] - aligned_per_hop[0]  # type: ignore[operator]
        lines.append(f"  latency_s1_to_s2 = {delta}us")
    return "\n".join(lines) + "\n"
def main() -> int:
    parser = argparse.ArgumentParser(
        description="Decode stacked INT shim headers from a raw AF_PACKET socket."
    )
    parser.add_argument(
        "--iface",
        required=True,
        help="Interface name to bind to (e.g. h2-eth0).",
    )
    parser.add_argument(
        "--count",
        type=int,
        default=0,
        help="Exit after printing this many INT frames (0 = forever).",
    )
    parser.add_argument(
        "--boot-times",
        type=Path,
        default=DEFAULT_BOOT_TIMES_PATH,
        help=(
            "Path to the coordination JSON written by topology.py "
            "(default: %(default)s). If missing, timestamps are shown unaligned."
        ),
    )
    args = parser.parse_args()

    boot_times = _load_boot_times(args.boot_times)

    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind((args.iface, 0))
    if boot_times is not None:
        sys.stdout.write(
            f"[listener] bound on {args.iface}; "
            f"boot times loaded from {args.boot_times}: {boot_times}\n"
        )
    else:
        sys.stdout.write(
            f"[listener] bound on {args.iface}; "
            f"no boot-times file at {args.boot_times} - running unaligned\n"
        )
    sys.stdout.flush()

    seen = 0
    while True:
        frame, _addr = sock.recvfrom(65535)
        if len(frame) < 14 + SHIM_LEN:
            continue
        etype = int.from_bytes(frame[12:14], "big")
        if etype != ETHERTYPE_INT:
            continue

        # Walk the shim chain. ``next_proto`` on each shim points to the
        # next header in order; we stop when it leaves the INT space.
        hops: list[dict[str, int]] = []
        offset = 14
        next_proto = etype
        while next_proto == ETHERTYPE_INT and offset + SHIM_LEN <= len(frame):
            shim = _decode_shim(frame[offset : offset + SHIM_LEN])
            hops.append(shim)
            offset += SHIM_LEN
            next_proto = shim["next_proto"]

        addrs = _decode_ipv4_addrs(frame[offset:]) if next_proto == ETHERTYPE_IPV4 else None
        flow = f" {addrs[0]} -> {addrs[1]}" if addrs else ""
        sys.stdout.write(_render_packet(hops, next_proto, flow, boot_times))
        sys.stdout.flush()

        seen += 1
        if args.count and seen >= args.count:
            return 0


if __name__ == "__main__":
    raise SystemExit(main())
The listener walks the shim chain starting from the outer EtherType,
parsing one 14-byte shim per hop until next_proto leaves the INT
space.
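To try the chain walk without a running topology or root access, the same loop can be run against a hand-built frame. The sketch below is not part of listener.py: make_shim and walk_chain are illustrative names, the field offsets follow int_shim_t, and the IPv4 header is a zero-filled placeholder.

```python
import struct

ETHERTYPE_INT = 0x88B6
ETHERTYPE_IPV4 = 0x0800
ETH_LEN = 14
SHIM_LEN = 14


def make_shim(switch_id, ts_us, egress_port, queue_depth, next_proto):
    """Pack one shim in the field order of int_shim_t (big-endian)."""
    return (
        bytes([switch_id])
        + ts_us.to_bytes(6, "big")  # uint48 has no struct format code
        + struct.pack("!HHH", egress_port, queue_depth, next_proto)
        + b"\x00"  # reserved
    )


def walk_chain(frame):
    """Return ([(switch_id, ts_us), ...], final_proto, payload_offset)."""
    next_proto = int.from_bytes(frame[12:14], "big")
    offset = ETH_LEN
    hops = []
    while next_proto == ETHERTYPE_INT and offset + SHIM_LEN <= len(frame):
        shim = frame[offset:offset + SHIM_LEN]
        hops.append((shim[0], int.from_bytes(shim[1:7], "big")))
        next_proto = int.from_bytes(shim[11:13], "big")
        offset += SHIM_LEN
    return hops, next_proto, offset


# Ethernet header: dst MAC of h2, src MAC of h1, EtherType = INT.
eth = bytes.fromhex("000000000002" "000000000001") + struct.pack("!H", ETHERTYPE_INT)
frame = (
    eth
    + make_shim(1, 800454, 2, 0, ETHERTYPE_INT)   # written by s1
    + make_shim(2, 699418, 2, 0, ETHERTYPE_IPV4)  # written by s2
    + b"\x45" + bytes(19)                         # placeholder IPv4 header
)

hops, final_proto, offset = walk_chain(frame)
print(hops)              # [(1, 800454), (2, 699418)]
print(hex(final_proto))  # 0x800
print(offset)            # 42
```

The payload offset of 42 is the 14-byte Ethernet header plus two 14-byte shims, which is where the listener starts decoding the IPv4 addresses.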
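Run it
In one terminal:
    sudo p4net examples/int_multi_hop/topology.py
setup(net) installs the L2 forwarding tables on both switches,
pre-seeds static ARP on both hosts, and writes each switch's identity
register. The command then drops into the p4net> shell.
In a second terminal (or in an h2 xterm from the shell, where the ip netns exec h2 prefix is not needed):
    sudo ip netns exec h2 python3 examples/int_multi_hop/listener.py --iface h2-eth0
From a third terminal:
    sudo ip netns exec h1 ping -c 3 -W 1 10.0.0.2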
The listener prints one block per packet, with one hop line per
traversed switch (two lines per packet in this topology).
Sample output
Captured by the v1.4 multi-hop integration test (aligned mode):
packet (2 hop(s), final proto 0x0800): 10.0.0.1 -> 10.0.0.2
  hop 1: switch_id=1 ts=800454us aligned=1778513670403185us egress_port=2 queue_depth=0
  hop 2: switch_id=2 ts=699418us aligned=1778513670403875us egress_port=2 queue_depth=0
  latency_s1_to_s2 = 690us
hop 1 is s1; hop 2 is s2. Each ts is BMv2's per-process
ingress_global_timestamp; aligned is wall-clock μs since Unix
epoch; latency_s1_to_s2 is the wall-clock delta between aligned
arrival times — real per-hop forwarding latency through BMv2's
userspace pipeline plus the veth pair.
Running the listener directly without setup(net) (so no coordination
file is present) falls back to the v1.3 unaligned display: raw ts,
no aligned= line, no latency.
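The listener maps hop position to switch name (hop 1 is s1, hop 2 is s2), which holds for h1 to h2 traffic in this topology. One possible variation, sketched here as an assumption and not something the example ships, keys the boot-time lookup on the shim's switch_id instead, so a frame captured in the reverse direction would still be aligned against the right switch. SWITCH_ID_TO_NAME and align_hops are hypothetical names, and the boot timestamps are back-computed from the sample output (aligned minus ts).

```python
# Mirrors the register values written in setup(net): s1 gets 1, s2 gets 2.
SWITCH_ID_TO_NAME = {1: "s1", 2: "s2"}


def align_hops(hops, boot_times):
    """Return one aligned timestamp per hop, or None where alignment is impossible."""
    aligned = []
    for hop in hops:
        name = SWITCH_ID_TO_NAME.get(hop["switch_id"])
        boot_us = boot_times.get(name) if name is not None else None
        if boot_us is None:
            aligned.append(None)
        else:
            aligned.append(boot_us + hop["ingress_timestamp_us"])
    return aligned


# Boot timestamps implied by the sample output above (aligned - ts).
boot_times = {"s1": 1778513669602731, "s2": 1778513669704457}

# A made-up reverse-direction frame: s2 is now the first hop.
reverse_hops = [
    {"switch_id": 2, "ingress_timestamp_us": 699418},
    {"switch_id": 1, "ingress_timestamp_us": 800454},
]
print(align_hops(reverse_hops, boot_times))
# [1778513670403875, 1778513670403185]
```

A lookup by hop index would have paired s2's raw timestamp with s1's boot time in this case.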
How cross-switch timestamp alignment works
BMv2's standard_metadata.ingress_global_timestamp is per-process:
each simple_switch_grpc instance's clock starts at zero on boot, so
raw shim_1.ts and shim_2.ts aren't directly comparable across
hops. Since v1.4, every RunningSwitch exposes a boot_timestamp_us
property (wall-clock μs since Unix epoch at process start, captured
immediately before subprocess.Popen). The alignment formula is:
    wall_clock_us = switch.boot_timestamp_us + shim.ingress_timestamp_us
setup(net) writes both switches' boot timestamps to a JSON
coordination file at /tmp/p4net-int-multi-hop-boot-times.json; the
listener reads it at startup and prints aligned=...us next to each
raw ts. Subtraction across hops gives the latency_s1_to_s2 line.
Drift is bounded by Popen + early-init overhead — sub-millisecond typically, occasionally a couple of milliseconds under load. Good enough for μs-vs-ms regime decisions; for serious latency research use a real shared time source (PTP).
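As a worked check of the formula, the numbers from the sample output can be plugged in directly. The boot timestamps below are not logged anywhere in this page; they are back-computed from the sample (aligned minus ts), and align is an illustrative helper name.

```python
def align(boot_timestamp_us, ingress_timestamp_us):
    """wall_clock_us = boot_timestamp_us + ingress_timestamp_us"""
    return boot_timestamp_us + ingress_timestamp_us


# Back-computed from the sample output: aligned - ts for each switch.
BOOT_US = {"s1": 1778513669602731, "s2": 1778513669704457}
RAW_TS_US = {"s1": 800454, "s2": 699418}

aligned = {name: align(BOOT_US[name], RAW_TS_US[name]) for name in BOOT_US}

raw_delta = RAW_TS_US["s2"] - RAW_TS_US["s1"]
latency = aligned["s2"] - aligned["s1"]

print(raw_delta)  # -101036: raw per-process timestamps are not comparable
print(latency)    # 690: matches latency_s1_to_s2 in the sample
```

The negative raw delta is the reason alignment is needed: s2 started about 0.1 s after s1, so its clock reads lower even though it sees the packet later.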
topology.py builds the coordination dict via
Network.boot_timestamps (v1.5+) — a read-only dict keyed by
switch name. The previous hand-written
{name: net.switch(name).boot_timestamp_us} comprehension still
works but doesn't adapt as switches are added.
Running concurrent topologies
The coordination file path defaults to
/tmp/p4net-int-multi-hop-boot-times.json. To run multiple multi-hop
INT topologies on one host without trampling each other's
coordination state, set P4NET_INT_BOOT_TIMES_PATH to a unique
path before starting each:
    P4NET_INT_BOOT_TIMES_PATH=/tmp/topo-a.json \
        sudo -E p4net examples/int_multi_hop/topology.py

    P4NET_INT_BOOT_TIMES_PATH=/tmp/topo-a.json \
        sudo -E ip netns exec h2 python3 \
        examples/int_multi_hop/listener.py --iface h2-eth0
sudo -E is required — without it sudo strips most env vars
and both processes silently fall back to the default path, causing
collisions. Topology and listener must agree on the path.
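The fallback behaviour is easy to see in isolation. The sketch below reproduces the os.environ.get lookup both scripts use; boot_times_path is an illustrative helper, and the two dictionaries stand in for the environment a process sees with and without sudo -E.

```python
import os
from pathlib import Path

ENV_VAR = "P4NET_INT_BOOT_TIMES_PATH"
DEFAULT_PATH = "/tmp/p4net-int-multi-hop-boot-times.json"


def boot_times_path(environ=None):
    """Resolve the coordination file path from an environment mapping."""
    env = os.environ if environ is None else environ
    return Path(env.get(ENV_VAR, DEFAULT_PATH))


# Variable preserved (sudo -E): topology and listener agree on the override.
with_e = boot_times_path({ENV_VAR: "/tmp/topo-a.json"})
# Variable stripped (plain sudo): the process falls back to the default path.
without_e = boot_times_path({})

print(with_e)     # /tmp/topo-a.json
print(without_e)  # /tmp/p4net-int-multi-hop-boot-times.json
```

If only one of the two processes loses the variable, they resolve different paths, and the listener either runs unaligned or reads another topology's boot times.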
What's interesting
- Per-hop forwarding latency is now observable. The latency_s1_to_s2 line ranges from a few hundred microseconds to a few milliseconds on this rig. Real ASIC switches are 10–100× faster; BMv2's userspace interpreter is the bottleneck.
- Egress ports correspond to the path direction. s1 forwards out port 2 toward s2; s2 forwards out port 2 toward h2. Different topologies produce different numbers.
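- queue_depth is reliably 0 at this offered load: BMv2's default queueing doesn't surface non-zero values without explicit configuration and saturation. The shim is also filled in MyIngress, while BMv2's simple_switch documentation describes queueing metadata such as deq_qdepth as valid only in the egress pipeline, so seeing a non-zero depth would likely also require moving that write to MyEgress.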
Caveats
- Two hops only with the current pipeline. A third switch on the path would find both shim slots full and forward without further annotation. Real deployments use a P4 header stack of MAX_HOPS depth; see the example README for the rewrite recipe.
- Alignment drift is typically sub-millisecond, occasionally a couple of milliseconds under load. boot_timestamp_us is captured immediately before Popen, but BMv2's actual internal clock zero is slightly later. Good enough for μs/ms regime checks, not good enough for nanosecond-scale latency research; use PTP for that.
- Listener relies on a /tmp/ coordination file. Path defaults to /tmp/p4net-int-multi-hop-boot-times.json; set P4NET_INT_BOOT_TIMES_PATH (with sudo -E) to run concurrent multi-hop INT topologies on the same host.
- queue_depth is almost always 0. Same as the single-switch example.
- No checksum recomputation for the inserted shims. The IPv4 checksum covers only the IPv4 header; the shim layer between Ethernet and IPv4 is unprotected, matching how production INT works (the INT spec assumes link-layer integrity).