Discussion:
[Bloat] quick review and rant of "Identifying and Handling Non Queue Building Flows in a Bottleneck Link"
Dave Taht
2018-10-29 04:02:45 UTC
Permalink
Dear Greg:

I don't feel like commenting much on ietf matters these days
but, jeeze, I'm really tired of rehashed arguments based on bad papers
that aren't even cited in the document, that aren't talking about the
real problem.

I get that some magical solution to a non-problem is wanted here:

https://tools.ietf.org/id/draft-white-tsvwg-nqb-00.txt

but then I just read this part... and had to cringe and respond.

Flow queueing approaches (such as fq_codel RFC 8290 [RFC8290]), on
the other hand, achieve latency improvements by associating packets
into "flow" queues and then prioritizing "sparse flows", i.e. packets
that arrive to an empty flow queue. Flow queueing does not attempt
to differentiate between flows on the basis of value (importance or
latency-sensitivity), it simply gives preference to sparse flows, and
tries to guarantee that the non-sparse flows all get an equal share

Nope, that's not what it does. please strike the sentence "it simply
gives preference... " and replace with something like "all flows
get as fully mixed as possible."

It's just slightly better min/max fair than most other fair queuing
mechanisms.

Better:

Flow queueing approaches (such as fq_codel RFC 8290 [RFC8290]), on
the other hand, achieve latency improvements by breaking flows back
into individual packets, and then prioritizing "sparse flows", i.e. packets
that arrive to an empty flow queue that empties every round.

Flow queueing does not attempt to differentiate between flows on the
basis of value (importance or latency-sensitivity), it simply
interleaves flows as best as possible, giving a slight preference to
sparse flows.

As a result, fq mechanisms are appropriate for unmanaged environments
and general internet traffic.

Editorial:

Are. Have. A full deployment is basically done now. fq_codel is
universally "on" on most linuxes, with sch_fq and pacing filling in the
rest. An FQ technique is a near-universal default in the cloud
now, deployed in the tens of millions on commercial linux APs. fq_codel
is the default in OSX wifi. It's now in freebsd, and there was a great
thread recently on how it all works in pfsense here:
https://forum.netgate.com/topic/112527/playing-with-fq_codel-in-2-4/709

Got any users for this other stuff yet?

Downsides to this approach can include loss of low latency performance
due to hash collisions (where a sparse flow shares a queue with a
bulk data flow)

sch_fq has *no* buckets. it's *perfectly* fair to millions of tcp flows.

fq_codel hash collisions only become likely as the number of concurrent
flows approaches sqrt(buckets); 1024 buckets (sqrt = 32) has thus far
been pretty good. When you combine that with the probability of
there being a fat flow to collide with (call it 3%? of all flows), the
real world impact of these collisions is nearly immeasurable against
normal traffic.
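To put rough numbers on that (a back-of-the-envelope sketch, assuming uniform random hashing; the 1024 buckets and the handful of fat flows are the figures from above):

```python
def collision_probability(buckets, fat_flows):
    # Chance that a given sparse flow lands in the same hash bucket as at
    # least one of `fat_flows` bulk flows, assuming uniform random hashing.
    return 1 - (1 - 1 / buckets) ** fat_flows

# fq_codel's default 1024 buckets vs. a handful of fat flows:
print(collision_probability(1024, 3))   # well under 1%

# a single queue is the degenerate case: every flow "collides"
print(collision_probability(1, 1))      # 1.0
```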

vs a single queue!!! which imposes the queue-length from one fat flow on
*all* flows. The probability of that happening is 100%.

Jeeze. Describing a "loss of low latency performance" this way is a 99%
lie vs what you are comparing against, and it's because I'm ranting
today since I've encountered this lie a dozen times in a dozen documents
as if repetition made it true!

NOW...

I NOTE HOWEVER: We happen to agree that that one collision in a few
hundred in the case of something you really really care about is too
much. So we added not only 8 way set associative hashing to sch_cake
(collision probability of roughly zero), but full support for diffserv
classification, and it's shipping in linux 4.19. (been available for two
years in openwrt, ubnt, evenroute and elsewhere). It's got great support
for docsis framing also.

I sure hope more people outside the ietf try it. Oh yea, we also
defaulted sch_cake to per host FQ (running simultaneously with the per
flow FQ), which works, even through nat, and that does a great job of
giving low latency to all those IOT devices you might have... keeps
torrents even more under control...

https://kernelnewbies.org/Linux_4.19#Better_networking_experience_with_the_CAKE_queue_management_algorithm

Anyway, observed use of diffserv markings is at nearly nil, and comcast
rewrites all traffic it doesn't recognise as CS1, so we wash that clean
on inbound.

And most of the time, at real bandwidths, fq_codel more than suffices,
except when you need to toss off a one-liner for a shaper, like

tc qdisc add dev eth0 root cake docsis bandwidth 20mbit ack-filter

"complexity in managing a large number of queues"

fq_codel, um, a couple hundred lines of code. Trivial, compared to just
about anything real. tcp is a few thousand, for example. It's smaller
than most device drivers. The core of the algorithm is 20 lines.
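For a sense of scale, a simplified sketch of that core (hedged: this is a toy model of the RFC 8290 DRR scheduler, not the kernel code; the QUANTUM value and byte-string packets are illustrative):

```python
from collections import deque

QUANTUM = 1514  # bytes of credit per scheduling round (illustrative value)

class Flow:
    def __init__(self):
        self.queue = deque()    # this flow's packets (byte strings here)
        self.deficit = QUANTUM  # a newly active flow starts with one quantum

def dequeue(new_flows, old_flows):
    # Sparse ("new") flows are served before backlogged ("old") ones;
    # a flow that exhausts its deficit rotates to the back of old_flows.
    while new_flows or old_flows:
        lst = new_flows if new_flows else old_flows
        flow = lst[0]
        if flow.deficit <= 0:
            flow.deficit += QUANTUM
            old_flows.append(lst.popleft())
        elif not flow.queue:
            # empty queue: a new flow demotes to old_flows (its sparse
            # "turn" is used up); an empty old flow leaves the schedule
            lst.popleft()
            if lst is new_flows:
                old_flows.append(flow)
        else:
            pkt = flow.queue.popleft()
            flow.deficit -= len(pkt)
            return pkt
    return None  # nothing left to send
```

A packet arriving to an empty flow puts that flow on new_flows and jumps the backlog; that is the whole "sparse flow" mechanism.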

(cake is more complicated)

Please stop calling it complicated. The codebase for nearly any queuing
system is going to be within 10-20% of each other...

THE REAL ELEPHANT IN THE ROOM is SHAPING.

... the 680ms of totally gratuitous buffering comcast CMTSes have at
100mbit, and 280 at 10mbit up. Having to outbound shape is not a problem
with any hardware we have deployed, even with cpus designed in the late
80s. Inbound shaping of the "Badwidth" being provided by ISPs today is
CPU intensive, easily 95% of the overall cost and complexity of doing
queuing "right". Can we have a discussion on fixing SHAPING somewhere
in the ietf? Can we get ISPs to, at the very least, buffer no more than
85ms of data?

"and
the scheduling (typically DRR) that enforces that each non-sparse
flow gets an equal fraction of link bandwidth causes problems with
VPNs and other tunnels"

strike "non-sparse" and say "saturating". The problem is MOST VPN
traffic is not saturating either, and when it is, it's usually managed
by a lousy FIFO queue in the first place.

So fq_codel gained the ability to also hash stuff in a terminating
tunnel with really nice results for voip in particular a few years
back. I can't find the test right now, but we observed peak latency and
jitter in the 100ms range before doing that, and 2ms after.

Sure, in contrast to other saturating flows outside the tunnel, you
might get less bandwidth, but all the vpn bandwidth you do get, is
goodwidth. For all traffic.

And it's per vpn tunnel. These days I see one vpn per user...

Did I rant already that the vast majority of flows are non-saturating?

I really wish, just once, one day that the l4s effort would actually try
real traffic in a real scenario, with web, voip, dns, gaming. I've
reached the point where I just start calling crazy conclusions based on
crazy assumptions "big buck bunny scenarios".

start with an hour's packet capture of your home gateways normal
traffic. count the number of distinct flows.... or your office gateways'...

, exhibits poor behavior with less-aggressive
CA algos, e.g. LEDBAT, and exhibits poor behavior with RMCAT CA
algos.

I would not call it poor. I would call it undesirable based on the
intent of the algorithm's designers. HOWEVER:

LEDBAT advocates think that imposing 100ms of queuing delay on all flows
is *ok*. To me, that's *poor*. In fact - *nuts*.

https://perso.telecom-paristech.fr/drossi/paper/rossi14comnet-b.pdf

I really wish I wasn't outnumbered on writing that paper, and could
release a sequel to this based on real world results...

'cause fq_codel users run bittorrent all day long and don't notice it's
there. Every freaking day. I'm running 10 torrents right now. Never
notice. I knew when we wrote that we'd obsolete LPCC's brain damaged
100ms induced latency concept as a "good" method... *entirely*, and we
did, and *WE DON'T CARE*. You shouldn't either.

Try running torrent all day on cable without an inbound fq_codel
shaper. Just try it. Choose.

* "Exhibits poor behavior with rmcat CA algos."

The relevant paper for this was terrible: based on an artificial
benchmark not modeling real traffic, run at 2Mbit to boot, and written
by a very biased observer.

Videoconferencing traffic works GREAT with fq_codel at higher bandwidths
against all forms of real traffic. 20mbit down/4mbit up and above.

2Mbit is just going to always suck for videoconferencing, above that...

To quote VJ : "low rate videoconferencing never gets dropped"

Which is really the measured case, in real traffic, on real networks,
doing videoconferencing. Try it.

" In effect the network element is making a decision as to what
constitutes a flow, and then forcing all such flows to take equal
bandwidth at every instant."

Still wouldn't say it that way. It's maximizing entropy that matters.
Microbursts get ripped out. Jitter vanishes. Only fat flows see "equal
bandwidth".

My biggest bugaboo with the l4s thing is that somehow y'all think that
traffic will pre-arrive, automagically dispersed in statistically random
ways, and that does not happen. Ever. Traffic is naturally bursty and
needs to be broken up.

Anyway this rant is somewhat mis-directed. I do hope some magical way of
identifying sparser flows appears, other than FQ already does so
beautifully, but until then I really wish more people would repeat less
bullshit and go actually use the stuff they are dissing and understand
it more deeply.

It certainly is possible to do better entropy e2e than we do.

There has certainly been wonderful things with fq + pacing of late, and
I look forward to things like the new etx scheduler and intel hardware
support for timerwheels to be widely deployed.
Toke Høiland-Jørgensen
2018-10-29 14:15:13 UTC
Permalink
Hi Greg

Since Dave's email prompted me to read your draft, I thought I'd chime
in:

3.2. Queuing behavior analysis
Similar to the queue protection function outlined in the previous
section, it may be feasible to devise a real time flow analyzer for a
node that would identify flows that are causing queue build up, and
redirect those flows to the QB queue, leaving the remaining flows in
the NQB queue.
To me, this sounds exactly like a description of what the sparse flow
optimisation of FQ-CoDel does, so I'm puzzled why you don't believe this
to be the case? Sure, FQ-CoDel uses its per-flow queueing system as a
mechanism to achieve this, but if you really can't do per-flow queueing,
it is conceivable that the mechanism could be implemented without it.
The high-level logic is basically (in your terms) "on enqueue, does this
flow already have a packet queued? if yes, it's a QB flow, if no, it's
an NQB flow". It is simple, but quite effective.
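That enqueue-time test is small enough to sketch in a few lines (a hedged illustration of the logic described above, not anyone's shipping code; `backlog` is just a per-flow count of packets currently queued):

```python
def classify_on_enqueue(backlog, flow_id):
    # QB if the flow already has a packet queued, NQB otherwise.
    label = "QB" if backlog.get(flow_id, 0) > 0 else "NQB"
    backlog[flow_id] = backlog.get(flow_id, 0) + 1
    return label

def on_dequeue(backlog, flow_id):
    # Forget the packet once it leaves the queue.
    backlog[flow_id] -= 1

backlog = {}
print(classify_on_enqueue(backlog, "voip"))   # NQB: queue was empty
print(classify_on_enqueue(backlog, "bulk"))   # NQB: first packet
print(classify_on_enqueue(backlog, "bulk"))   # QB: one already queued
on_dequeue(backlog, "voip")
print(classify_on_enqueue(backlog, "voip"))   # NQB again: it drained
```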

I've recently analysed the behaviours that fall out of this simple
mechanism in some detail, which may be relevant to your efforts. See
this publication in IEEE Communication Letters:
https://doi.org/10.1109/LCOMM.2018.2871457

-Toke
Greg White
2018-10-31 00:12:39 UTC
Permalink
Hi Toke, thanks for the pointer to your paper, I had not seen it before.

I agree that could be a candidate algorithm. It is certainly simple. It may not be the only (or perhaps even the best) solution for the dual-queue case though. I'm thinking in the context of an L4S TCP flow, which can respond quickly to "new" ECN markings and achieve link saturation with ultra low (but not zero) queuing delay. A good property for a queue protection algorithm would be that these L4S flows could be consistently placed in the NQB queue. I think that the simple approach you mentioned would result in many L4S flows being deemed Queue Building.

Additionally, we've observed applications that send variable sized "messages" at a fixed rate (e.g. 20 messages/second) where the message sometimes exceeds the MTU and results in two closely spaced (possibly back-to-back) packets. This is a flow that I think should be considered to be NQB, but would get flagged as QB by the simple approach. You described this case in your paper, where you state that the first Q bytes of each burst will be treated as NQB (the first packet in the case we're talking about here), but the rest will be treated as QB. Assuming that message latency is important for these sorts of applications, this is equivalent to saying that the entire burst is considered as QB. In the fq_codel case, the message latency would be something like Q(n-1)(N+1)/R (assuming no other sparse flow arrivals), something like 1.3ms using the example values in your paper (plus n=2, N=10) which may be ok. In the dual-queue case it is a bigger deal, because the remaining packets would be put at the end of the QB queue, which could have a latency of 10 or 20 ms.
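The 1.3ms figure can be reproduced directly from that formula (assuming, and this is my assumption here, a quantum of one 1514-byte MTU and a 100 Mbit/s link as the example values):

```python
Q = 1514 * 8   # quantum in bits (one MTU-sized packet, assumed)
R = 100e6      # link rate in bits/s (assumed example value)
n = 2          # packets in the over-MTU message
N = 10         # competing backlogged flows

latency = Q * (n - 1) * (N + 1) / R
print(latency * 1e3)  # ~1.33 ms, matching the figure above
```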

So, a queue protection function that provides a bit more (but still limited) allowance for a flow to have packets in queue would likely work better in the dual-queue case.

-Greg



Toke Høiland-Jørgensen
2018-11-01 13:25:01 UTC
Permalink
Post by Greg White
Hi Toke, thanks for the pointer to your paper, I had not seen it before.
You're welcome :)
Post by Greg White
I agree that could be a candidate algorithm. It is certainly simple.
It may not be the only (or perhaps even the best) solution for the
dual-queue case though. I'm thinking in the context of an L4S TCP
flow, which can respond quickly to "new" ECN markings and achieve link
saturation with ultra low (but not zero) queuing delay. A good
property for a queue protection algorithm would be that these L4S
flows could be consistently placed in the NQB queue. I think that the
simple approach you mentioned would result in many L4S flows being
deemed Queue Building.
Yes, I think you are right (depending on traffic mix, of course). It
might be possible to tweak it to work better, though. E.g., by changing
the threshold (moving flows to QB if they end up with more than X ms of
queue). This would only work if you start out all flows at NQB, with the
associated aggressive marking behaviour; so I'm not sure if a normal TCP
flow would ever manage to get to QB state before getting clobbered by
the NQB markings...
Post by Greg White
Additionally, we've observed applications that send variable sized
"messages" at a fixed rate (e.g. 20 messages/second) where the message
sometimes exceeds the MTU and results in two closely spaced (possibly
back-to-back) packets. This is a flow that I think should be
considered to be NQB, but would get flagged as QB by the simple
approach. You described this case in your paper, where you state that
the first Q bytes of each burst will be treated as NQB (the first
packet in the case we're talking about here), but the rest will be
treated as QB. Assuming that message latency is important for these
sorts of applications, this is equivalent to saying that the entire
burst is considered as QB. In the fq_codel case, the message latency
would be something like Q(n-1)(N+1)/R (assuming no other sparse flow
arrivals), something like 1.3ms using the example values in your paper
(plus n=2, N=10) which may be ok. In the dual-queue case it is a
bigger deal, because the remaining packets would be put at the end of
the QB queue, which could have a latency of 10 or 20 ms.
Sure, it's by no means a perfect mechanism. But it makes up for that by
its simplicity, IMO. And it does work really well for *a lot* of
today's latency-sensitive traffic.

(In your case of two-MTU messages, you could tune the quantum to allow
those; but of course you can construct examples that won't work).
Post by Greg White
So, a queue protection function that provides a bit more (but still
limited) allowance for a flow to have packets in queue would likely
work better in the dual-queue case.
Yeah, that's tricky, especially if you want it to be very accurate in
its distinction; which I sort of gather that you do, right?

-Toke
Dave Taht
2018-11-01 19:13:19 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Greg White
Hi Toke, thanks for the pointer to your paper, I had not seen it before.
You're welcome :)
Post by Greg White
I agree that could be a candidate algorithm. It is certainly simple.
It may not be the only (or perhaps even the best) solution for the
dual-queue case though. I'm thinking in the context of an L4S TCP
flow, which can respond quickly to "new" ECN markings and achieve link
saturation with ultra low (but not zero) queuing delay. A good
property for a queue protection algorithm would be that these L4S
flows could be consistently placed in the NQB queue. I think that the
simple approach you mentioned would result in many L4S flows being
deemed Queue Building.
Yes, I think you are right (depending on traffic mix, of course). It
might be possible to tweak it to work better, though. E.g., by changing
the threshold (moving flows to QB if they end up with more than X ms of
queue). This would only work if you start out all flows at NQB, with the
associated aggressive marking behaviour; so I'm not sure if a normal TCP
flow would ever manage to get to QB state before getting clobbered by
the NQB markings...
Post by Greg White
Additionally, we've observed applications that send variable sized
"messages" at a fixed rate (e.g. 20 messages/second) where the message
sometimes exceeds the MTU and results in two closely spaced (possibly
back-to-back) packets. This is a flow that I think should be
considered to be NQB, but would get flagged as QB by the simple
approach. You described this case in your paper, where you state that
the first Q bytes of each burst will be treated as NQB (the first
packet in the case we're talking about here), but the rest will be
treated as QB. Assuming that message latency is important for these
sorts of applications, this is equivalent to saying that the entire
burst is considered as QB. In the fq_codel case, the message latency
would be something like Q(n-1)(N+1)/R (assuming no other sparse flow
arrivals), something like 1.3ms using the example values in your paper
(plus n=2, N=10) which may be ok. In the dual-queue case it is a
bigger deal, because the remaining packets would be put at the end of
the QB queue, which could have a latency of 10 or 20 ms.
Sure, it's by no means a perfect mechanism. But it makes up for that by
it's simplicity, IMO. And it does work really well for *a lot* of
today's latency-sensitive traffic.
(In your case of two-MTU messages, you could tune the quantum to allow
those; but of course you can construct examples that won't work).
sch_fq has a default quantum of 2 MTU, with an initial burst of 10.
There's all sorts of interesting work inside that to "right-size" the
ongoing gso offloads and a major new advance over there on calculating
rtts properly is described here:

https://lwn.net/Articles/766564/

...

there was at one point, an fq_pie implementation that used the rbtree
in sch_fq to achieve perfect fairness.

we often tune fq_codel's quantum as low as 300, at low rates.
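In the same one-liner style as earlier (illustrative config fragment; `eth0` and the 300-byte value are just the example parameters mentioned):

```shell
tc qdisc add dev eth0 root fq_codel quantum 300
```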
Post by Toke Høiland-Jørgensen
Post by Greg White
So, a queue protection function that provides a bit more (but still
limited) allowance for a flow to have packets in queue would likely
work better in the dual-queue case.
For inbound shaping sch_cake defaults to 2mtu at higher rates.

This kind of opens a question: what is a typical target
bandwidth for l4s applications?

the videoconferencing paper I dissed in my earlier rant focused only
at 2mbits (and drew the conclusion that sfq was the best option for
the cc algo's 300ms desired range)

I have generally been focused on badwidths in the 4mbit-40gbit range.
Post by Toke Høiland-Jørgensen
Yeah, that's tricky, especially if you want it to be very accurate in
its distinction; which I sort of gather that you do, right?
In part, that is why I would like the language in the problem
statement clarified. To give a concrete counterexample,
think upon the ultimate choice of a crc32 algorithm and the parameters
that drove that choice.

If an accuracy of identifying "sparse flows" or NQB flows can be
specified, then all sorts of algorithms can be thrown into the fray.
AI. Bloom filters. rbtrees. regexes. cuckoo++ hashes... and boring
old fq tech. :)
Post by Toke Høiland-Jørgensen
-Toke
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Greg White
2018-11-06 04:17:38 UTC
Permalink
Hi Dave,

Thanks for the constructive review comments on the text of the draft, including pointers to references that could be included. I'll take those comments on in the next revision.

As I know you are aware, the majority of bottleneck links are not linux machines (running tc and feeding packets to a continuous dedicated link), so comparing lines of kernel code as a metric for complexity isn't the most useful. In the bottleneck links that I am concerned about (and other similar links), it is advantageous to tightly couple the queuing and scheduling mechanisms with the hardware-optimized discontinuous MAC layer, and also take into account engineering tradeoffs that are different than those that apply in a linux machine.

The point about bittorrent and RMCAT wasn't about latency, it was about throughput. I have come to share the view that others have expressed that it is a valuable property of the internet that congestion control algorithm developers are free to innovate. FQ in the bottleneck link diminishes this ability in some important ways. Some applications (cloud backup, mass uploads like the icloud photo library updates, bittorrent, etc) might not be interested in per-flow fairness with TCP, but would prefer to be less aggressive and thus take a smaller fraction of the link when there are competing TCP flows (note, I am not arguing that the *current* ledbat protocol is the right one). Other applications (RMCAT) might in the future aim to be per-flow fair, but need slightly longer to adapt than FQ allows.

Some folks are just uncomfortable with the departure from the "end-to-end" principle with fq (i.e. a middlebox gets to determine what constitutes a "flow", then enforces that all fat "flows" get equal bandwidth at every instant). Fq_codel is a pragmatic solution for some systems, and it gives a real undeniable benefit in common cases with current applications/mixes, but it isn't perfect. Perhaps sch_cake is better in some cases, but it makes a lot more decisions on behalf of the user that might be ok now, but may not be in the future. In light of all of the above, I'm open to considering other alternatives. So, I don't see this as a non-problem, and I don't see the potential solutions as being in any way magical.

-Greg
Dave Taht
2018-11-12 22:19:50 UTC
Permalink
Post by Greg White
Hi Dave,
Thanks for the constructive review comments on the text of the draft,
including pointers to references that could be included. I'll take
those comments on in the next revision.
Thank you for filtering out the late-night noisy part of my post.

I hit save on this one... please forgive me for making these points
strenuously, one last time, in this forum, I plan to just go back to
work, making running code run everywhere I can, after this.
Post by Greg White
As I know you are aware, the majority of bottleneck links are not
linux machines (running tc and feeding packets to a continuous
dedicated link)
I imagine there are ~2b home/business/edge routers in the world. I do not
know how many of those run a linux derived or bsd derived operating
system or hardware. A recent 5G test was also purely linux based.

The other bottlenecks (CMTS/DSLAM/ENODE-Bs) seem to be running something
else entirely.

2b ain't scratch.

There's still dpdk and other software solutions left to chase here.
Post by Greg White
, so comparing lines of kernel code as a metric for
complexity isn't the most useful.
What would you like as a metric for complexity?

In pure computer science we can build things up as an O(x) or O(n log n)
or a variety of other bits of math. QFQ has a good analysis that way,
but that sort of analysis does tend to assume that stuff that's already
done in hw has a cost, when it doesn't.

This is kind of a serious question, in that you are seeking some method
to identify "sparse" packets more cheaply than DRR or SFQ, without
clearly identifying what that method is. For another complexity
analysis, see the HHF thing I'm linking to later here.

I've wracked my brain trying to come up with other things than hashes
and TCAMs - like bloom filters - and all of them have more complexity
and cost than hashes and FQ. In hardware. Or software.

People do regexps in hardware now. Those are *gnarly*.

It would be great to nail your complexity requirement to a
wall. Example:

"We want something other than hash + FQ that can run in a single clock
cycle on a Xilinx FPGA at 1GHz to pull sparser flows out of the mix and
insert them ahead of other bulk flows."

That makes the research simpler, I think. That's a hardware spec.

hw bloom filters would be my best guess at something that might be able
to identify flows.

https://www.eecs.harvard.edu/~dbrooks/lyon_islped09.pdf

However, it is easier to identify heavy hitters. The HHF algorithm for
example was a *good try*

https://lwn.net/Articles/577208/

but in terms of complexity and cost, well, I've not got around to
benching that one or looking into a hardware implementation, in my
world, lacking TCAMS - route table lookups, firewalls, natting and
shaping, far outweigh the cost of the base fq_codel algo.

shaping is *easy* in hw.
Post by Greg White
In the bottleneck links that I am
concerned about (and other similar links), it is advantageous to
tightly couple the queuing and scheduling mechanisms with the
hardware-optimized discontinuous MAC layer
It would be good if you clarified your targets in the draft.

Secondly, we did "tightly couple queuing and scheduling mechanisms with
the hardware-optimized discontinuous MAC layer" in our wifi work.

https://www.usenix.org/system/files/conference/atc17/atc17-hoiland-jorgensen.pdf

I'd hoped that that would beat a path forward for 5G, cable and GPON
networks designs.
Post by Greg White
, and also take into account
engineering tradeoffs that are different than those that apply in a
linux machine.
I am painfully aware of these; however, my points about engineering
tradeoffs do seem to slide off of some people.

On the high end, on gear I'm familiar with today:

* we don't know how to get past 100Gbit without multi-queueing
* most (all) of the 10GB+ hardware I'm aware of has at least 64 HW
  queues
* hashing is done in hw

after that...

* 1024 DRR or SFQ require a 10 bit AND gate wired to the output of the
hasher that looks up where to stick the packet.
* timestamps are "free"

The DRR stuff has existed in netfpga for years.
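A software model of that arrangement, for illustration: the hasher output is masked down to 10 bits to pick one of 1024 DRR queues. The quantum and packet sizes below are my assumptions, not a hardware spec:

```python
from collections import deque

# Illustrative Deficit Round Robin over 1024 hashed flow queues.
NUM_QUEUES = 1024
QUANTUM = 1514  # bytes of credit per backlogged queue per round

queues = [deque() for _ in range(NUM_QUEUES)]
deficits = [0] * NUM_QUEUES

def enqueue(flow_hash, packet_len):
    # The "10 bit AND" from the text: mask the hasher output down
    # to a queue index.
    queues[flow_hash & (NUM_QUEUES - 1)].append(packet_len)

def drr_round():
    """One DRR round; returns a list of (queue_index, packet_len) sent."""
    sent = []
    for i, q in enumerate(queues):
        if not q:
            continue
        deficits[i] += QUANTUM
        # Send packets while the head packet fits in the deficit.
        while q and q[0] <= deficits[i]:
            deficits[i] -= q[0]
            sent.append((i, q.popleft()))
        if not q:
            deficits[i] = 0  # empty queues don't bank credit
    return sent
```

Note the per-round state is one counter per queue and a compare against the head packet length - which is the sense in which DRR is cheap enough to have lived in netfpga for years.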
Post by Greg White
The point about bittorrent and RMCAT wasn't about latency, it was
about throughput. I have come to share the view that others have
expressed that it is a valuable property of the internet that
congestion control algorithm developers are free to innovate.
I certainly think cc devs are free to innovate.

Thing is, 99.99999% of the world doesn't care about CC. Take
freeswitch's voip stuff - they don't do any CC, they just blast packets.
Plenty of applications just blast packets. The TCP-friendly world exists
only in the IETF. Outside, it's gnarly.

Of the remainder of the population, most cc devs want theirs to be the
one true algorithm, and that's the source of endless contention - like
watching BBR duke it out against cubic, which is horrible to behold.

Lowering the allowable buffering bar to something reasonable and then
innovating with an FQ'd environment is perfectly doable.

using FQ actually *opens up the possibilities* to do whatever CC you
like, because whatever you do does not impact anyone else (much) and
neither are you impacted by them.
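To be concrete about the isolation mechanism, here is a minimal sketch of fq_codel's two-list scheduling (simplified by assumption: real fq_codel also keeps a per-flow byte deficit and runs CoDel on each queue). A packet arriving to an idle flow puts that flow on a "new flows" list, which is served before the round-robin of backlogged "old flows":

```python
from collections import deque

# Simplified fq_codel-style scheduling: sparse flows get served first.
flows = {}            # flow_id -> deque of packets
new_flows = deque()   # flows that were idle: served first
old_flows = deque()   # flows with a backlog: round-robined

def enqueue(flow_id, packet):
    q = flows.setdefault(flow_id, deque())
    if not q and flow_id not in new_flows and flow_id not in old_flows:
        new_flows.append(flow_id)  # the "sparse" bonus
    q.append(packet)

def dequeue():
    """Serve one packet: new (sparse) flows first, then old flows."""
    while new_flows or old_flows:
        lst = new_flows if new_flows else old_flows
        fid = lst.popleft()
        q = flows[fid]
        if not q:
            continue  # defensive; empty flows are never left on a list
        pkt = q.popleft()
        if q:
            old_flows.append(fid)  # still backlogged: it's a bulk flow now
        return pkt
    return None
```

A flow below its fair share keeps draining its queue, keeps re-qualifying as "new", and keeps jumping the line; a flow above it sits in the old-flows rotation with everyone else - interleaving, not prioritization by value.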

Want to grab more bandwidth? Go ahead, try, until you get delay. Too
much delay and you drop packets. Don't want to drop packets, use
ecn... eat as much delay as you want.

Want to use less bandwidth and be a burden? That's totally doable
also (in my case I just reverted reno to IW2 and cut the growth rate to 40%)

Diffserv - were it only transported e2e - is how to better express
your intent.

.. as it is cake's diffserv works pretty well on egress from your network.
Post by Greg White
FQ in
the bottleneck link diminishes this ability in some important ways.
Some applications (cloud backup, mass uploads like the icloud photo
library updates, bittorrent, etc) might not be interested in per-flow
fairness with TCP, but would prefer to be less aggressive and thus
take a smaller fraction of the link when there are competing TCP flows.
And that happens, automatically: less greedy flows get a smaller
fraction of the link.
Post by Greg White
(note, I am not arguing that the *current* ledbat protocol is the
right one). Other applications (RMCAT) might in the future aim to be
per-flow fair, but need slightly longer to adapt than FQ allows.
"but need slightly longer to adapt than FQ allows"

I don't see your point here. Videoconferencing slightly below the fair
share never gets a drop or sees delay. Higher rates, see a bit of
delay.

Videoconferencing is very good at dealing with drops and jitter
far far worse than what FQ or FQ_codel can inflict.

In terms of never dropping... I do hope to fiddle with ECN in NADA or
google's CC someday. Still, for videoconferencing, the minimum reaction
time is about a frame, 16ms -

... not hundreds. Certainly encoders today have difficulty shifting
rates faster than every few 100ms, but keeping videoconferencing delays
below 150ms total, e2e, is required for a good experience - and I'd
really love to see 16ms on that side, and for that matter, a return to
scan-lines to get rid of that 16ms we currently lose per frame.

The LOLA project had it right.
Post by Greg White
Some folks are just uncomfortable with the departure from the
"end-to-end" principle with fq
The authors of that paper, notably Dave Reed, have embraced this
departure from the e2e principle. The author of RED never shared it.
Post by Greg White
(i.e. a middlebox gets to determine
what constitutes a "flow", then enforces that all fat "flows" get
equal bandwidth at every instant).
but that's in general, not my concern. Microbursts, clumping at
bottlenecks, DOS attacks, and applications that care not one whit about
CC in the first place are the largest part of the problem.

I'm unfond of middleboxes myself; however, I do not think re-ordering
packets *fairly* is unfair to anything.
Post by Greg White
Fq_codel is a pragmatic solution
for some systems, and it gives a real undeniable benefit in common
cases with current applications/mixes, but it isn't perfect.
I wish, very much, that the AQM working group had merely come up with a
recommendation for "no more than a continental BDP's worth of buffering"
at the head end. It's the multiple ISPs and designs that still have > 200ms
of buffering built into them that make me the most nuts.

Can anyone here make a spirited and robust defence of > 100ms of buffering
at the bottleneck links along the edge? Of 500ms? Of 2 seconds?

We never even get close to 100ms worth of buffering with
today's paced TCPs. Videoconferencing, you need a frame. What else?

The perfect is the enemy of the good. I'm perfectly happy to have cut
the internet's observable buffering to nearly 0 for most flows, and to
far less than 100ms for all others. I'm happy to see innovations like
BBR, which, plus fq of any sort, works great.

There may well be an achievable target even further below what we
currently achieve, but given a choice between allowing things to
persist...

well... I went off and fixed wifi. good wifi will keep isps relevant
relative to 5G for a few more years at least... and the uptake is
near total on the hardware and software I've been able to influence,
with universal acclaim from the users.
Post by Greg White
Perhaps
sch_cake is better in some cases, but it makes a lot more decisions on
behalf of the user that might be ok now, but may not be in the future.
Certainly we put the control back in the users' hands here. I did rather
wish with cake that someone here would go and play with the diffserv
stuff (which allows a flow to express, directly, what it wants), or the
ECN implementation, which allows a flow to either respond with more or
less latency as it chooses, without loss, with their CC algo or
application of choice.

... and get back to us.

I do also note a subtle point. In many parts of the DSL world, at least,
users get to choose how their ISP shapes their traffic. I wish we could
have devices and ISP links, universally, that handed the users back that
control, and took the ISPs and CC devs out of it.
Post by Greg White
In light of all of the above, I'm open to considering other
alternatives. So, I don't see this as a non-problem, and I don't see
the potential solutions as being in any way magical.
What solutions do you see? I'm stumped. Best idea I have is bloom
filters, in hw, and even those would be far less effective for the vast
majority of real traffic and real applications than fq_codel has been.

Can you point at anything that might work? Especially anything that can
also handle all the bad non-cc'd behavior on the internet today?
Post by Greg White
-Greg
Post by Toke Høiland-Jørgensen
Post by Greg White
Hi Toke, thanks for the pointer to your paper, I had not seen it before.
You're welcome :)
Post by Greg White
I agree that could be a candidate algorithm. It is certainly simple.
It may not be the only (or perhaps even the best) solution for the
dual-queue case though. I'm thinking in the context of an L4S TCP
flow, which can respond quickly to "new" ECN markings and achieve link
saturation with ultra low (but not zero) queuing delay. A good
property for a queue protection algorithm would be that these L4S
flows could be consistently placed in the NQB queue. I think that the
simple approach you mentioned would result in many L4S flows being
deemed Queue Building.
Yes, I think you are right (depending on traffic mix, of course). It
might be possible to tweak it to work better, though. E.g., by changing
the threshold (moving flows to QB if they end up with more than X ms of
queue). This would only work if you start out all flows at NQB, with the
associated aggressive marking behaviour; so I'm not sure if a normal TCP
flow would ever manage to get to QB state before getting clobbered by
the NQB markings...
Post by Greg White
Additionally, we've observed applications that send variable sized
"messages" at a fixed rate (e.g. 20 messages/second) where the message
sometimes exceeds the MTU and results in two closely spaced (possibly
back-to-back) packets. This is a flow that I think should be
considered to be NQB, but would get flagged as QB by the simple
approach. You described this case in your paper, where you state that
the first Q bytes of each burst will be treated as NQB (the first
packet in the case we're talking about here), but the rest will be
treated as QB. Assuming that message latency is important for these
sorts of applications, this is equivalent to saying that the entire
burst is considered as QB. In the fq_codel case, the message latency
would be something like Q(n-1)(N+1)/R (assuming no other sparse flow
arrivals), something like 1.3ms using the example values in your paper
(plus n=2, N=10) which may be ok. In the dual-queue case it is a
bigger deal, because the remaining packets would be put at the end of
the QB queue, which could have a latency of 10 or 20 ms.
Sure, it's by no means a perfect mechanism. But it makes up for that by
it's simplicity, IMO. And it does work really well for *a lot* of
today's latency-sensitive traffic.
(In your case of two-MTU messages, you could tune the quantum to allow
those; but of course you can construct examples that won't work).
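As a sanity check on the Q(n-1)(N+1)/R figure quoted above: assuming Q = 1514 bytes and R = 100 Mbit/s (assumed values; with n = 2 and N = 10 from the text they reproduce the ~1.3 ms number):

```python
# Message latency per the formula quoted above: Q(n-1)(N+1)/R.
Q = 1514 * 8        # quantum, in bits (assumed 1514-byte MTU)
n = 2               # packets in the message
N = 10              # competing bulk flows
R = 100_000_000     # link rate, bits/s (assumed 100 Mbit/s)

latency_ms = Q * (n - 1) * (N + 1) / R * 1000
print(round(latency_ms, 2))  # prints 1.33
```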
sch_fq has a default quantum of 2 MTU, with an initial burst of 10.
There's all sorts of interesting work inside that to "right-size" the
ongoing GSO offloads, and a major new advance over there:
https://lwn.net/Articles/766564/
...
There was, at one point, an fq_pie implementation that used the rbtree
in sch_fq to achieve perfect fairness.
we often tune fq_codel's quantum as low as 300, at low rates.
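For reference, that sort of tuning is a one-liner with tc (the device name here is an assumption; the quantum value is the one from the text):

```shell
# Shrink fq_codel's per-flow quantum at low link rates so one big
# packet can't dominate a scheduling round (eth0 is an assumed device).
tc qdisc replace dev eth0 root fq_codel quantum 300
```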
Post by Toke Høiland-Jørgensen
Post by Greg White
So, a queue protection function that provides a bit more (but still
limited) allowance for a flow to have packets in queue would likely
work better in the dual-queue case.
For inbound shaping sch_cake defaults to 2mtu at higher rates.
This kind of opens a question in that, what is a typical target
bandwidth for l4s applications?
the videoconferencing paper I dissed in my earlier rant focused only
on 2mbits (and drew the conclusion that sfq was the best option for
the cc algo's 300ms desired range)
I have generally been focused on bandwidths in the 4mbit-40gbit range.
Post by Toke Høiland-Jørgensen
Yeah, that's tricky, especially if you want it to be very accurate in
its distinction; which I sort of gather that you do, right?
In part, that is why I would like the language in the problem
statement clarified. To give a concrete counterexample,
think upon the ultimate choice of a crc32 algorithm and the parameters
that drove that choice.
If an accuracy of identifying "sparse flows" or NQB flows can be
specified, then all sorts of algorithms can be thrown into the fray.
AI. Bloom filters. rbtrees. regexes. cuckoo++ hashes... and boring
old fq tech. :)
Post by Toke Høiland-Jørgensen
-Toke
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Michael Welzl
2018-11-01 10:39:18 UTC
Permalink
Hi,
Post by Dave Taht
I don't feel like commenting much on ietf matters these days
but, jeeze,
(snip)
Post by Dave Taht
Did I rant already that the vast majority of flows are non-saturating?
That's a bug, not a feature - and you seem to treat it as an unchangeable fact.
Despite the undebatable importance of bufferbloat and its impact on e2e packet latency, this is only one of the factors playing into the "latency" that I perceive when I click on the link as I surf the Internet.
Flow completion time has to do with saturation as well.

Cheers,
Michael
Dave Taht
2018-11-01 14:20:34 UTC
Permalink
Post by Michael Welzl
Hi,
Post by Dave Taht
I don't feel like commenting much on ietf matters these days
but, jeeze,
(snip)
Post by Dave Taht
Did I rant already that the vast majority of flows are non-saturating?
That's a bug, not a feature - and you seem to treat it as an unchangeable fact.
It's been the case for 50 years, and will be until we have one sole
source of data (which, given industry consolidation, seems increasingly
likely), coming from one big machine (unlikely). The original cablelabs
(2012?) study of this stuff assumed that web page growth would take the
average page past 6MB by now; instead it is under 3MB.
Post by Michael Welzl
Despite the undebatable importance of bufferbloat and its impact on e2e packet latency, this is only one of the factors playing into the "latency" that I perceive when I click on the link as I surf the Internet.
Doing a breakdown of that latency - most of that seems solved...

Background prefetch
DNS lookup
SSL connection negotiation
The actual transfer.
Screen draw....

I'm still missing your point. Is looking for "sparseness" part of a
CCN-like effort?

In my mind QUIC, when basically nailed up, with its improvements over
tcp (0-rtt crypto, pacing, bbr), is the end game here - and "if only"
fq existed on the CMTSes, as I'd had so much hope for after the arris
study,

bufferbloat would basically be over and dealt with.

And most flows would still be short.
Post by Michael Welzl
Flow completion time has to do with saturation as well.
FCT was not a subject of that draft.

My (admittedly ranty) points were:

* Misrepresentation of how fq_codel actually works, for political
purposes - in particular mischaracterizing its behavior for
videoconferencing over more typical links...

* a complaint that despite all the work on bufferbloat, measured CMTS
buffering was still in the 680ms range at 100mbit. If anyone had
bothered to click on the pfsense discussion, I "subtly" pointed at a
FIOS measurement of 120ms...

Still waiting for that first step from the US cable industry. Things
are getting MUCH better in the past 2 years (see
http://www.dslreports.com/speedtest/results/bufferbloat ) just not (if
you scroll down) in the US.

I don't, incidentally, know why they are getting better... certainly I
have some good numbers on the size of the wifi deployment...

* Pointing out that sch_cake did diffserv, per host and per flow fq,
and docsis framing. Some of which had applicability to the draft.

We have not the slightest intent to bring that to the ietf, but it
does tackle and solve what little there was of a "collision" problem,
thoroughly... the diffserv stuff might help (I routinely give dns a
boost)... feel free to go reflash a router and see if it helps in your
scenarios.

Inbound shaping is cpu intensive, but it works pretty well.
Post by Michael Welzl
Cheers,
Michael
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Michael Welzl
2018-11-04 18:16:51 UTC
Permalink
Hi,
Post by Dave Taht
Post by Michael Welzl
Despite the undebatable importance of bufferbloat and its impact on e2e packet latency, this is only one of the factors playing into the "latency" that I perceive when I click on the link as I surf the Internet.
Doing a breakdown of that latency - most of that seems solved...
Background prefetch
DNS lookup
SSL connection negotiation
The actual transfer.
Screen draw....
I'm still missing your point. Is looking for "sparseness" part of a
CCN-like effort?
No, it’s just that flow completion time (“The actual transfer” above) is a function not only of the queue length, but also of the capacity the flow gets to use. Hence the push for a larger initial window.
Post by Dave Taht
Post by Michael Welzl
Flow completion time has to do with saturation as well.
FCT was not a subject of that draft.
Right - sorry for side-tracking.
I read them - I didn’t want to get into this debate, it was only a side comment about not all flows being limited, and there being some value in better capacity usage too.

Cheers,
Michael
Michael Welzl
2018-11-01 19:12:11 UTC
Permalink
I was thinking about the web. You’re right about all the rest.

Cheers
Michael


Sent from my iPhone
Post by Michael Welzl
Post by Dave Taht
I don't feel like commenting much on ietf matters these days
but, jeeze,
(snip)
Post by Dave Taht
Did I rant already that the vast majority of flows are non-saturating?
That's a bug, not a feature - and you seem to treat it as an unchangeable fact.
Why would you think that saturating flows should be common? A very large percentage of Internet traffic is streaming audio/video and that should never saturate a link, it should be pacing the data to the rate of the content.
DNS queries are not going to be saturating.
queries to check cache validity are not going to be saturating.
microservices calls (including most IoT data) and their replies are not going to be saturating, in part because they don't have much to say, and in part because even if they do have more to say, they aren't going to ramp up to high packet rates before they run out of data to send.
It's only bulk transfers of data that are possibly going to be saturating, and they are only going to saturate their allowed share of the slowest link in the path. On all other links they are going to be non-saturating.
As links get faster, things that would have been saturating years ago fail to saturate the new, faster links.
So what would the Internet look like if it didn't have the vast majority of flows being non-saturating?
David Lang
David Lang
2018-11-01 18:15:25 UTC
Permalink
re-sending since it bounced the first time
Post by Michael Welzl
Post by Dave Taht
I don't feel like commenting much on ietf matters these days
but, jeeze,
(snip)
Post by Dave Taht
Did I rant already that the vast majority of flows are non-saturating?
That's a bug, not a feature - and you seem to treat it as an unchangeable fact.
Why would you think that saturating flows should be common? A very large
percentage of Internet traffic is streaming audio/video and that should never
saturate a link, it should be pacing the data to the rate of the content.
DNS queries are not going to be saturating.
queries to check cache validity are not going to be saturating.
microservices calls (including most IoT data) and their replies are not going
to be saturating, in part because they don't have much to say, and in part
because even if they do have more to say, they aren't going to ramp up to
high packet rates before they run out of data to send.
It's only bulk transfers of data that are possibly going to be saturating,
and they are only going to saturate their allowed share of the slowest link
in the path. On all other links they are going to be non-saturating.
As links get faster, things that would have been saturating years ago fail to
saturate the new, faster links.
So what would the Internet look like if it didn't have the vast majority of
flows being non-saturating?
David Lang