Discussion:
400g Tomahawk-3 chip
Dave Taht
2018-01-27 03:31:18 UTC
If someone could translate the "smart buffering", "flow aware" and
"mice v elephants" comments into how it actually works in the
tomahawk-3 chip... I'd love it.

from: https://www.nextplatform.com/2018/01/20/flattening-networks-budgets-400g-ethernet/

"The packet processing pipeline sits behind the ports, implementing
ingress and egress packet processing, and because the buffer is
shared, it can hold a lot of data if a destination port becomes
clogged, waiting out the traffic jam without dropping packets
and forcing a resend. The buffer architecture has support for RoCEv2
remote direct memory access and congestion control, and also has what
Broadcom calls flow-aware traffic scheduling, which is a key for
hyperscalers and cloud builders and more than a few HPC and AI shops,
too.

“Companies are not running monolithic workloads,” explains Sankar.
“They run many different applications, running in parallel, as they go
through the switches. In that case, you have elephants and mice, the
former of which causes queueing delays that affect different flows in
different ways. So in real time, we can see those bandwidth-consuming
elephants and reprioritize them versus the mice to ensure that both
classes of traffic are experiencing the least amount of queueing
delay. This new architecture for buffering ultimately eliminates drops
and has 3X to 5X greater incast absorption and lossless capacity
versus alternatives, and in the process it drives down tail latency.”
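
As a rough mental model of what that "flow-aware" elephant/mouse
reprioritization could mean, here is a toy software sketch of one plausible
interpretation: count bytes per flow over a window, call the heavy hitters
elephants, and dequeue mice first. The flow key, threshold, and strict-priority
scheduler below are my own assumptions, not anything Broadcom has documented
about the Tomahawk-3 pipeline.

from collections import defaultdict, deque

# Assumed threshold: a flow that has sent more than this many bytes in the
# current window is treated as an "elephant". Purely illustrative.
ELEPHANT_BYTES = 100 * 1024

class FlowAwareScheduler:
    """Toy two-class scheduler: mice are dequeued before elephants."""

    def __init__(self):
        self.bytes_per_flow = defaultdict(int)  # per-flow byte counters
        self.mice = deque()                     # (flow_key, packet) entries
        self.elephants = deque()

    def enqueue(self, flow_key, packet):
        # packet is assumed to be a bytes-like object
        self.bytes_per_flow[flow_key] += len(packet)
        if self.bytes_per_flow[flow_key] > ELEPHANT_BYTES:
            self.elephants.append((flow_key, packet))  # heavy hitter: lower priority
        else:
            self.mice.append((flow_key, packet))       # short flow: serve first

    def dequeue(self):
        # Strict priority for simplicity; real hardware would presumably use
        # weighted sharing so elephants are not starved.
        if self.mice:
            return self.mice.popleft()
        if self.elephants:
            return self.elephants.popleft()
        return None

    def end_of_window(self):
        # Reset counters periodically so a flow's classification can change.
        self.bytes_per_flow.clear()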
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Jonathan Morton
2018-01-27 06:35:32 UTC
Post by Dave Taht
If someone could translate the "smart buffering", "flow aware" and
"mice v elephants" comments into how it actually works in the
tomahawk-3 chip... I'd love it.
Just reading what you quoted, it looks like they have a big buffer which can absorb incast bursts effectively (that is, without incurring burst loss), but they also have flow isolation to minimise the inter-flow induced latency that a large buffer would normally imply. That's definitely a step forward if it's actually in the hardware at those sorts of speeds.
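
For what it's worth, shared buffers like that are usually managed with dynamic per-queue thresholds: a queue can keep absorbing an incast burst as long as it stays under some multiple of the remaining free space. A toy version of that admission check (sizes and alpha invented, not the actual chip's scheme):

class SharedBuffer:
    """Toy dynamic-threshold admission control for a shared packet buffer.

    A queue may use up to alpha * (free cells), so a single congested port
    can soak up a burst while idle ports leave headroom for everyone else.
    """

    def __init__(self, total_cells=64 * 1024, alpha=2.0):
        self.total = total_cells
        self.used = 0
        self.alpha = alpha
        self.per_queue = {}  # queue id -> cells in use

    def admit(self, queue_id, cells):
        free = self.total - self.used
        limit = self.alpha * free                    # dynamic threshold
        if self.per_queue.get(queue_id, 0) + cells > limit:
            return False                             # over fair share: drop or mark
        self.per_queue[queue_id] = self.per_queue.get(queue_id, 0) + cells
        self.used += cells
        return True

    def release(self, queue_id, cells):
        self.per_queue[queue_id] -= cells
        self.used -= cells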

Precisely what counts as "a flow" is not quite clear. It could be just IP address pairs, rather than 5-tuples.
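
For concreteness, the difference is just how much of the header goes into the flow key: a 5-tuple separates individual connections, while an address pair lumps all traffic between two hosts together. Something like this (field names are illustrative, not chip-specific):

def flow_key_5tuple(pkt):
    # Distinguishes individual TCP/UDP connections between the same hosts.
    return (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
            pkt["src_port"], pkt["dst_port"])

def flow_key_addr_pair(pkt):
    # Coarser: all traffic between two hosts counts as one flow.
    return (pkt["src_ip"], pkt["dst_ip"])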

There doesn't seem to be any explicit mention of AQM here, but that reference to RoCEv2 "congestion control" might imply something in that category - or it could just be a back-pressure mechanism to hint to local applications.
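
If it's the usual RoCEv2/DCQCN arrangement, the switch's contribution is simply RED-style probabilistic ECN marking once a queue crosses a threshold, with the endpoints doing the actual rate reduction. Roughly (thresholds invented):

import random

# Invented thresholds (in cells); real deployments tune these per queue.
KMIN, KMAX, PMAX = 200, 2000, 0.1

def maybe_mark_ecn(queue_depth):
    """RED-style probabilistic ECN marking, as in DCQCN-style schemes."""
    if queue_depth <= KMIN:
        return False
    if queue_depth >= KMAX:
        return True
    prob = PMAX * (queue_depth - KMIN) / (KMAX - KMIN)
    return random.random() < prob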

- Jonathan Morton
