Discussion:
[Bloat] lwn.net's tcp small queues vs wifi aggregation solved
Dave Taht
2018-06-21 04:58:59 UTC
Permalink
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved, and the article points to a few usb wifi dongles that work
better now.

https://lwn.net/SubscriberLink/757643/b25587e3593e9f9e/
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Toke Høiland-Jørgensen
2018-06-21 09:22:46 UTC
Permalink
Post by Dave Taht
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved
This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
that were running TCP on the local stack. Which would be mostly laptops,
I guess...

-Toke
Eric Dumazet
2018-06-21 12:55:45 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Dave Taht
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved
This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
that were running TCP on the local stack. Which would be mostly laptops,
I guess...
Yes.

Also switching TCP stack to always GSO has been a major gain for wifi in my tests.

(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a6b2a1dc2a2105f178255fe495eb914b09cb37a
tcp: switch to GSO being always on
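[Ed.: a rough numeric illustration of the truesize effect Eric describes; the byte budget and per-skb overhead below are invented round numbers, and the real accounting lives in net/ipv4/tcp_output.c.]

#include <stdio.h>

/* TSQ caps queued skb *memory* (truesize), not payload. For a fixed
 * byte budget, a worse truesize/payload ratio means less payload can
 * be in flight at once. */
int main(void)
{
    long budget = 128 * 1024;    /* assumed TSQ-style byte budget       */
    long payload = 1448;         /* TCP payload per segment             */
    long oh_nogso = 768;         /* assumed per-skb truesize overhead   */
    long oh_gso = 768 / 32;      /* amortized over a 32-segment GSO skb */

    printf("no GSO: %ld payload bytes in flight\n",
           budget / (payload + oh_nogso) * payload);
    printf("GSO:    %ld payload bytes in flight\n",
           budget / (payload + oh_gso) * payload);
    return 0;
}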

I expect SACK compression to also give a nice boost to wifi.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
tcp: add SACK compression

Lastly I am working on adding ACK compression in TCP stack itself.
Dave Taht
2018-06-21 15:18:07 UTC
Permalink
Post by Eric Dumazet
Post by Toke Høiland-Jørgensen
Post by Dave Taht
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved
This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
that were running TCP on the local stack. Which would be mostly laptops,
I guess...
Yes.
Also switching TCP stack to always GSO has been a major gain for wifi in my tests.
(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a6b2a1dc2a2105f178255fe495eb914b09cb37a
tcp: switch to GSO being always on
I expect SACK compression to also give a nice boost to wifi.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
tcp: add SACK compression
Lastly I am working on adding ACK compression in TCP stack itself.
One thing I've seen repeatedly on mac80211 aircaps is a tendency for
clients to use up two TXOPs rather than one.

scenario:

1) A tcp burst arrives at the client
2) A single ack migrates down the client stack into the driver, into
the device, which then attempts to compete for airtime on that TXOP
for that single ack, sometimes waiting 10s of msec to get that op
3) a bunch more acks arrive "slightly late"[1], and then get queued
for the next TXOP, waiting, again sometimes 10s of msec

(similar scenario in a client making a quick string of web related requests)

This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the next 10us, or... something like that, could help.
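[Ed.: a minimal sketch of the "hold a lone ack a teeny bit" idea; all names, the timer plumbing, and HOLD_US are invented for illustration, not ath9k/mac80211 API.]

#include <stdbool.h>
#include <stddef.h>

struct txq_sketch {
    int    len;          /* packets queued                 */
    size_t bytes;        /* bytes queued                   */
    size_t agg_limit;    /* bytes that fill one aggregate  */
    bool   timer_armed;
};

/* Assumed to be provided elsewhere in this sketch. */
void start_hold_timer(struct txq_sketch *q, int usec);
void cancel_hold_timer(struct txq_sketch *q);
void contend_for_txop(struct txq_sketch *q);

enum { HOLD_US = 10 };

/* On the first packet of an idle queue, wait a teeny bit for the
 * trailing acks instead of contending immediately; flush early once a
 * full aggregate's worth of bytes has accumulated. */
void txq_enqueue(struct txq_sketch *q, size_t pkt_bytes)
{
    q->len++;
    q->bytes += pkt_bytes;
    if (q->len == 1) {
        start_hold_timer(q, HOLD_US);
        q->timer_armed = true;
    } else if (q->timer_armed && q->bytes >= q->agg_limit) {
        cancel_hold_timer(q);
        q->timer_armed = false;
        contend_for_txop(q);
    }
}

void hold_timer_expired(struct txq_sketch *q)
{
    q->timer_armed = false;
    contend_for_txop(q);   /* send whatever accumulated in the window */
}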

...

[1] if you need coffee through your nose this morning, regarding usage
of the phrase "slightly late", read
http://www.rawbw.com/~svw/superman.html
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Caleb Cushing
2018-06-21 15:31:18 UTC
Permalink
actually... all of my devices, including my desktop, connect through wifi
these days... and only one of them isn't running some variant of linux.
Post by Dave Taht
Post by Eric Dumazet
Post by Toke Høiland-Jørgensen
Post by Dave Taht
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved
This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
that were running TCP on the local stack. Which would be mostly laptops,
I guess...
Yes.
Also switching TCP stack to always GSO has been a major gain for wifi in my tests.
(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a6b2a1dc2a2105f178255fe495eb914b09cb37a
tcp: switch to GSO being always on
I expect SACK compression to also give a nice boost to wifi.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
tcp: add SACK compression
Lastly I am working on adding ACK compression in TCP stack itself.
One thing I've seen repeatedly on mac80211 aircaps is a tendency for
clients to use up two TXOPs rather than one.
1) A tcp burst arrives at the client
2) A single ack migrates down the client stack into the driver, into
the device, which then attempts to compete for airtime on that TXOP
for that single ack, sometimes waiting 10s of msec to get that op
3) a bunch more acks arrive "slightly late"[1], and then get queued
for the next TXOP, waiting, again sometimes 10s of msec
(similar scenario in a client making a quick string of web related requests)
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the next 10us, or... something like that, could help.
...
[1] if you need coffee through your nose this morning, regarding usage
of the phrase "slightly late", read
http://www.rawbw.com/~svw/superman.html
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Caleb Cushing

http://xenoterracide.com
Stephen Hemminger
2018-06-21 15:46:11 UTC
Permalink
On Thu, 21 Jun 2018 10:31:18 -0500
Post by Caleb Cushing
actually... all of my devices, including my desktop connect through wifi
these days... and only one of them isn't running some variant of linux.
Sigh. My experience with wifi is that it is not stable enough for that.
With both APs I have tried, a Linksys ACM3200 and a Netgear WNDR3800, I still see random drop outs.
Not sure if these are device resets (i.e. workarounds) or other issues.

These happen independent of firmware (vendor, OpenWRT, or LEDE).
So my suspicion is that the wifi hardware is shite and that the firmware is trying
to work around and mask the problem.
Caleb Cushing
2018-06-21 17:41:10 UTC
Permalink
I'm not disagreeing, just saying that wifi is much more prevalent now than
just laptops... literally I only have a cord for emergency use

On Thu, Jun 21, 2018 at 10:46 AM Stephen Hemminger <
Post by Stephen Hemminger
On Thu, 21 Jun 2018 10:31:18 -0500
Post by Caleb Cushing
actually... all of my devices, including my desktop connect through wifi
these days... and only one of them isn't running some variant of linux.
Sigh. My experience with wifi is that it is not stable enough for that.
Both AP's I have tried Linksys ACM3200 or Netgear WNDR3800 I still see random drop outs.
Not sure if these are device resets (ie workarounds) or other issues.
These happen independent of firmware (vendor, OpenWRT, or LEDE).
So my suspicion is the that Wifi hardware is shite and that firmware is trying
to workaround and mask the problem.
--
Caleb Cushing

http://xenoterracide.com
Dave Taht
2018-06-21 15:50:30 UTC
Permalink
Post by Caleb Cushing
actually... all of my devices, including my desktop connect through wifi
these days... and only one of them isn't running some variant of linux.
Yes, the tendency of manufacturers to hook things up to the more
convenient, but overbuffered and less opaque, USB bus has become an
increasingly large problem (canonical example: the raspberry pi). In the
case of LTE, especially, everything is a USB dongle, and the CDC_ETH driver
and device spec actually mandates at least 32k of on-chip buffering on the
other side of the bus.

We had tried at one point (5 years ago) to find ways to apply
something BQL-like to this but failed.

I am currently getting miserable performance out of the one LTE dongle
I have (16K/sec up), but have not gone and fiddled with it with more
modern kernels. I ended up just tethering via an android phone, which
cracks 1mbit up.

The quality of the wifi drivers for USB is almost uniformly miserable,
and out of tree.
Post by Caleb Cushing
Post by Dave Taht
Post by Eric Dumazet
Post by Toke Høiland-Jørgensen
Post by Dave Taht
Nice war story. I'm glad this last problem with the fq_codel wifi code
is solved
This wasn't specific to the fq_codel wifi code, but hit all WiFi devices
that were running TCP on the local stack. Which would be mostly laptops,
I guess...
Yes.
Also switching TCP stack to always GSO has been a major gain for wifi in my tests.
(TSQ budget is based on sk_wmem_alloc, tracking truesize of skbs, and not having
GSO is considerably inflating the truesize/payload ratio)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a6b2a1dc2a2105f178255fe495eb914b09cb37a
tcp: switch to GSO being always on
I expect SACK compression to also give a nice boost to wifi.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5d9f4262b7ea41ca9981cc790e37cca6e37c789e
tcp: add SACK compression
Lastly I am working on adding ACK compression in TCP stack itself.
One thing I've seen repeatedly on mac80211 aircaps is a tendency for
clients to use up two TXOPs rather than one.
1) A tcp burst arrives at the client
2) A single ack migrates down the client stack into the driver, into
the device, which then attempts to compete for airtime on that TXOP
for that single ack, sometimes waiting 10s of msec to get that op
3) a bunch more acks arrive "slightly late"[1], and then get queued
for the next TXOP, waiting, again sometimes 10s of msec
(similar scenario in a client making a quick string of web related requests)
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
...
[1] if you need coffee through your nose this morning, regarding usage
of the phrase "slightly late", read
http://www.rawbw.com/~svw/superman.html
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Caleb Cushing
http://xenoterracide.com
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
David Collier-Brown
2018-06-21 16:29:18 UTC
Permalink
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Hmmn, that sounds like a pattern seen in physical switching systems:
someone with knowledge that another car is coming (especially if it's 
unexpected) waves a flag at the dispatcher to warn them to leave space
and avoid a nasty ka-thump and the extra strain on the couplers (;-))

--dave
--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
***@spamcop.net | -- Mark Twain
Jonathan Morton
2018-06-21 16:54:08 UTC
Permalink
Post by David Collier-Brown
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Hmmn, that sounds like a pattern seen in physical switching systems: someone with knowledge that another car is coming (especially if it's unexpected) waves a flag at the dispatcher to warn them to leave space and avoid a nasty ka-thump and the extra strain on the couplers (;-))
A more relevant railway analogy would be that a passenger train keeps its doors open while waiting for the departure signal to clear, permitting more passengers to board. At large stations the crew will press a TRTS (Train Ready To Start) button on the platform about half a minute before departure time, to prompt setting of the departure route in time, but a conflicting movement may delay the signal actually clearing.

- Jonathan Morton
Kathleen Nichols
2018-06-21 16:43:28 UTC
Permalink
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.

Kathie

(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
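[Ed.: a minimal sketch of Kathie's inference, with invented names; a real driver would track this per flow or per station rather than globally.]

#include <stdbool.h>
#include <stdint.h>

uint64_t now_us(void);           /* platform clock, provided elsewhere */

enum { ACK_EXPECT_US = 100 };    /* assumed expectation window */

static uint64_t ack_expected_until;

/* RX side: data arriving for a local TCP flow implies an ack is coming. */
void on_rx_tcp_data(void)
{
    ack_expected_until = now_us() + ACK_EXPECT_US;
}

/* TX side: is it worth holding the next TXOP request open a moment? */
bool ack_imminent(void)
{
    return now_us() < ack_expected_until;
}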
Dave Taht
2018-06-21 19:17:21 UTC
Permalink
Post by Kathleen Nichols
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.
Certainly in the case of wifi and lte and other half-duplex technologies
this seems feasible...

'cept that we're all busy finding ways to do ack compression this
month, and thus the two big tcp packets = 1 ack rule is going away.
Still, an estimate, with a short timeout, might help.

Another thing I've longed for (sometimes) is whether or not an
application like a web browser signalling the OS that it has a batch of
network packets coming would help...

web browser:
setsockopt(batch_everything)
parse the web page, generate all your dns, tcp requests, etc, etc
setsockopt(release_batch)
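[Ed.: per socket, Linux already has something in this spirit: TCP_CORK holds partially-filled frames until uncorked. Dave's batch_everything idea would generalize the hint across all of an application's sockets and DNS traffic, for which no API exists. The per-socket piece, which is real:]

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hold partial frames while queueing a burst of writes, then flush. */
void corked_burst(int fd)
{
    int on = 1, off = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    /* ... write() the whole batch of requests here ... */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}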
Post by Kathleen Nichols
Kathie
(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
voip doesn't behave this way, but for recognisable protocols like tcp
and perhaps quic...
Post by Kathleen Nichols
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Sebastian Moeller
2018-06-21 19:41:26 UTC
Permalink
Hi All,
Post by Dave Taht
Post by Kathleen Nichols
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.
Certainly in the case of wifi and lte and other simplex technologies
this seems feasible...
'cept that we're all busy finding ways to do ack compression this
month and thus the
two big tcp packets = 1 ack rule is going away. Still, an estimate,
with a short timeout
might help.
That short timeout seems essential: just because a link is wireless does not mean the ACKs for passing TCP packets will appear shortly; who knows what routing happens after the wireless link (think city-wide mesh network). In a way, such a solution should first figure out whether waiting has any chance of being useful, by looking at the typical delay between data packets and the matching ACKs.
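[Ed.: a minimal sketch of "first figure out whether waiting is likely to be useful", with invented names: keep a smoothed estimate of the local data-to-ack gap and only hold the queue when the estimate fits the hold budget.]

#include <stdbool.h>

/* EWMA (alpha = 1/8) of the observed gap, in usec, between TCP data
 * going up the stack and the matching ack coming back down.
 * Starts at 0, i.e. optimistic until measurements arrive. */
static long ack_gap_ewma_us;

void record_ack_gap(long gap_us)
{
    ack_gap_ewma_us += (gap_us - ack_gap_ewma_us) / 8;
}

/* Only delay the TXOP if acks typically show up within the budget. */
bool worth_waiting(long hold_budget_us)
{
    return ack_gap_ewma_us <= hold_budget_us;
}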
Post by Dave Taht
Another thing I've longed for (sometimes) is whether or not an
application like a web
browser signalling the OS that it has a batch of network packets
coming would help...
Wireless unfortunately uses a very high per-packet overhead, which batching just tries to "hide" by amortizing it over more than one data packet. How about trying to find a better, less wasteful MAC instead ;) (and now we have two problems...) Really, from a latency perspective it is clearly better to avoid overhead instead of using "batching" to better amortize it, since batching increases latency (I stipulate that there are conditions in which clever batching will not increase the noticeable latency, if it can hide inside another latency-increasing process).
Post by Dave Taht
setsockopt(batch_everything)
parse the web page, generate all your dns, tcp requests, etc, etc
setsockopt(release_batch)
Post by Kathleen Nichols
Kathie
(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
voip doesn't behave this way, but for recognisable protocols like tcp
and perhaps quic...
I note that for voip, waiting does not make sense, as all packets carry information and keeping jitter low will noticeably increase a call's perceived quality (if just by allowing the application to use a small de-jitter buffer and hence less latency). There is a reason why wifi's voice access class, which has the highest probability to get the next tx-slot, also is not allowed to send aggregates (whether that is fully sane is another question, answering which I do not feel competent).
I also think that on a docsis system it is probably a decent heuristic to assume that the endpoints will be a few milliseconds away at most (and only due to the coarse docsis grant-request clock).

Best Regards
Sebastian
Post by Dave Taht
Post by Kathleen Nichols
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Toke Høiland-Jørgensen
2018-06-21 19:51:55 UTC
Permalink
Post by Sebastian Moeller
To make up for the fact that wireless uses unfortunately uses a
very high per packet overhead it just tries to "hide" by
amortizing it over more than one data packet. How about trying
to find a better, less wasteful MAC instead ;) (and now we have
two problems...) Now really from a latency perspective it
clearly is better to ovoid overhead instead of use "batching" to
better amortize it since batching increases latency (I stipulate
that there are condition in which clever batching will not
increase the noticeable latency if it can hide inside another
latency increasing process).
Seems that 802.11ax will have some interesting features to this end.
Specifically, the spectrum can be split, allowing smaller chunks of it
to be used for reverse path transmissions (full-duplex at last?).

https://en.wikipedia.org/wiki/802.11ax#Technical_improvements

Also, 1024-QAM on 160MHz channels; omg...

-Toke
Dave Taht
2018-06-21 19:54:21 UTC
Permalink
Post by Sebastian Moeller
Hi All,
Post by Dave Taht
Post by Kathleen Nichols
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.
Certainly in the case of wifi and lte and other simplex technologies
this seems feasible...
'cept that we're all busy finding ways to do ack compression this
month and thus the
two big tcp packets = 1 ack rule is going away. Still, an estimate,
with a short timeout
might help.
That short timeout seems essential, just because a link is wireless, does not mean the ACKs for passing TCP packets will appear shortly, who knows what routing happens after the wireless link (think city-wide mesh network). In a way such a solution should first figure out whether waiting has any chance of being useful, by looking at te typical delay between Data packets and the matching ACKs.
We are, in this discussion, having a few issues with multiple contexts.
Mine (and Eric's) is in improving wifi clients' (laptops, handhelds)
behavior, where the tcp stack is local.

packet pairing estimates on routers... well, if you get an aggregate
"in", you should be able to get an aggregate "out" when it traverses
the same driver. routerwise, ack compression "done right" will help a
bit... it's the "done right" part that's the sticking point.
Post by Sebastian Moeller
Post by Dave Taht
Another thing I've longed for (sometimes) is whether or not an
application like a web
browser signalling the OS that it has a batch of network packets
coming would help...
To make up for the fact that wireless uses unfortunately uses a very high per packet overhead it just tries to "hide" by amortizing it over more than one data packet. How about trying to find a better, less wasteful MAC instead ;) (and now we have two problems...)
On my bad days I'd really like to have a do-over on wifi. The only
hope I've had has been for LiFi or a resurrection of...

I haven't poked into what's going on in 5G lately (the mac is
"better", but towers being distant does not help), nor have I been
tracking 802.11ax for a few years. Lower latency was all over the
802.11ax standard when I last paid attention.

Has 802.11ad gone anywhere?
Post by Sebastian Moeller
Now really from a latency perspective it clearly is better to ovoid overhead instead of use "batching" to better amortize it since batching increases latency (I stipulate that there are condition in which clever batching will not increase the noticeable latency if it can hide inside another latency increasing process).
Post by Dave Taht
setsockopt(batch_everything)
parse the web page, generate all your dns, tcp requests, etc, etc
setsockopt(release_batch)
Post by Kathleen Nichols
Kathie
(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
voip doesn't behave this way, but for recognisable protocols like tcp
and perhaps quic...
I note that for voip, waiting does not make sense as all packets carry information and keeping jitter low will noticeably increase a calls perceived quality (if just by allowing the application yo use a small de-jitter buffer and hence less latency). There is a reason why wifi's voice access class, oith has the highest probability to get the next tx-slot and also is not allowed to send aggregates (whether that is fully sane is another question, answering which I do not feel competent).
I also think that on a docsis system it is probably a decent heuristic to assume that the endpoints will be a few milliseconds away at most (and only due to the coarse docsis grant-request clock).
Best Regards
Sebastian
Post by Dave Taht
Post by Kathleen Nichols
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Sebastian Moeller
2018-06-21 20:11:43 UTC
Permalink
Hi Dave,
Post by Dave Taht
Post by Sebastian Moeller
Hi All,
Post by Dave Taht
Post by Kathleen Nichols
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.
Certainly in the case of wifi and lte and other simplex technologies
this seems feasible...
'cept that we're all busy finding ways to do ack compression this
month and thus the
two big tcp packets = 1 ack rule is going away. Still, an estimate,
with a short timeout
might help.
That short timeout seems essential, just because a link is wireless, does not mean the ACKs for passing TCP packets will appear shortly, who knows what routing happens after the wireless link (think city-wide mesh network). In a way such a solution should first figure out whether waiting has any chance of being useful, by looking at te typical delay between Data packets and the matching ACKs.
We are in this discussion, having a few issues with multiple contexts.
Mine (and eric's) is in improving wifi clients (laptops, handhelds)
behavior, where the tcp stack is local.
Ah, sorry, I got this wrong and was looking at this from the AP's perspective; sorry for the noise... and thanks for the patience.
Post by Dave Taht
packet pairing estimates on routers... well, if you get an aggregate
"in", you should be able to get an aggregate "out" when it traverses
the same driver. routerwise, ack compression "done right" will help a
bit... it's the "done right" part that's the sticking point.
How will ACK compression help? If done aggressively it will thin out the ACK stream, potentially making ACK aggregation infeasible, no? On the other hand, if sparse enough, maybe not aggregating is not too painful? I guess I am just slow today...

Best Regards
Sebastian
Post by Dave Taht
Post by Sebastian Moeller
Post by Dave Taht
Another thing I've longed for (sometimes) is whether or not an
application like a web
browser signalling the OS that it has a batch of network packets
coming would help...
To make up for the fact that wireless uses unfortunately uses a very high per packet overhead it just tries to "hide" by amortizing it over more than one data packet. How about trying to find a better, less wasteful MAC instead ;) (and now we have two problems...)
On my bad days I'd really like to have a do-over on wifi. The only
hope I've had has been for LiFi or a ressurection of
I haven't poked into what's going on in 5G lately (the mac is
"better", but towers being distant does not help), nor have I been
tracking 802.11ax for a few years. Lower latency was all over the
802.11ax standard when I last paid attention.
Has 802.11ad gone anywhere?
Post by Sebastian Moeller
Now really from a latency perspective it clearly is better to ovoid overhead instead of use "batching" to better amortize it since batching increases latency (I stipulate that there are condition in which clever batching will not increase the noticeable latency if it can hide inside another latency increasing process).
Post by Dave Taht
setsockopt(batch_everything)
parse the web page, generate all your dns, tcp requests, etc, etc
setsockopt(release_batch)
Post by Kathleen Nichols
Kathie
(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
voip doesn't behave this way, but for recognisable protocols like tcp
and perhaps quic...
I note that for voip, waiting does not make sense as all packets carry information and keeping jitter low will noticeably increase a calls perceived quality (if just by allowing the application yo use a small de-jitter buffer and hence less latency). There is a reason why wifi's voice access class, oith has the highest probability to get the next tx-slot and also is not allowed to send aggregates (whether that is fully sane is another question, answering which I do not feel competent).
I also think that on a docsis system it is probably a decent heuristic to assume that the endpoints will be a few milliseconds away at most (and only due to the coarse docsis grant-request clock).
Best Regards
Sebastian
Post by Dave Taht
Post by Kathleen Nichols
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Kathleen Nichols
2018-06-22 14:01:37 UTC
Permalink
Post by Dave Taht
Post by Kathleen Nichols
Post by Dave Taht
This is a case where inserting a teeny bit more latency to fill up the
queue (ugh!), or a driver having some way to ask the probability of
seeing more data in the
next 10us, or... something like that, could help.
Well, if the driver sees the arriving packets, it could infer that an
ack will be produced shortly and will need a sending opportunity.
Certainly in the case of wifi and lte and other simplex technologies
this seems feasible...
'cept that we're all busy finding ways to do ack compression this
month and thus the
two big tcp packets = 1 ack rule is going away. Still, an estimate,
with a short timeout
might help.
It would be a poor algorithm that assumed the answer was "1" or "2" or
"42". It would be necessary to analyze data to see if something adaptive
is possible, and it may not be. Your original note was looking for a way
of finding out if the probability of seeing more data in the next 10us
was sufficiently large to delay "a teeny bit", so that would be the
problem statement.
Post by Dave Taht
Another thing I've longed for (sometimes) is whether or not an
application like a web
browser signalling the OS that it has a batch of network packets
coming would help...
setsockopt(batch_everything)
parse the web page, generate all your dns, tcp requests, etc, etc
setsockopt(release_batch)
Post by Kathleen Nichols
Kathie
(we tried this mechanism out for cable data head ends at Com21 and it
went into a patent that probably belongs to Arris now. But that was for
cable. It is a fact universally acknowledged that a packet of data must
be in want of an acknowledgement.)
voip doesn't behave this way, but for recognisable protocols like tcp
and perhaps quic...
Post by Kathleen Nichols
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Jonathan Morton
2018-06-22 14:12:30 UTC
Permalink
Post by Kathleen Nichols
Your original note was looking for a way
for finding out if the probability of seeing more data in the next 10us
was sufficiently large to delay "a teeny bit" so that would be the
problem statement.
I would instead frame the problem as "how can we get hardware to incorporate extra packets, which arrive between the request and grant phases of the MAC, into the same TXOP?" Then we no longer need to think probabilistically, or induce unnecessary delay in the case that no further packets arrive.

- Jonathan Morton
Michael Richardson
2018-06-22 14:49:46 UTC
Permalink
Post by Jonathan Morton
Post by Kathleen Nichols
Your original note was looking for a way
for finding out if the probability of seeing more data in the next 10us
was sufficiently large to delay "a teeny bit" so that would be the
problem statement.
I would instead frame the problem as "how can we get hardware to
incorporate extra packets, which arrive between the request and grant
phases of the MAC, into the same TXOP?" Then we no longer need to
think probabilistically, or induce unnecessary delay in the case that
no further packets arrive.
I've never looked at the ring/buffer/descriptor structure of the ath9k, but
most ethernet devices would just continue reading descriptors until the
ring was empty. Is there some reason that something similar cannot occur?

Or is the problem at a higher level?
Or is that we don't want to enqueue packets so early, because it's a source
of bloat?

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] ***@sandelman.ca http://www.sandelman.ca/ | ruby on rails [
Jonathan Morton
2018-06-22 15:02:55 UTC
Permalink
Post by Michael Richardson
Post by Jonathan Morton
I would instead frame the problem as "how can we get hardware to
incorporate extra packets, which arrive between the request and grant
phases of the MAC, into the same TXOP?" Then we no longer need to
think probabilistically, or induce unnecessary delay in the case that
no further packets arrive.
I've never looked at the ring/buffer/descriptor structure of the ath9k, but
with most ethernet devices, they would just continue reading descriptors
until it was empty. Is there some reason that something similar can not
occur?
Or is the problem at a higher level?
Or is that we don't want to enqueue packets so early, because it's a source
of bloat?
The question is when the aggregate frame is constructed and "frozen", using only the packets in the queue at that instant. When the MAC grant occurs, transmission must begin immediately, so most hardware prepares the frame in advance of that moment - but how far in advance?

Behaviour suggests that it can be as soon as the MAC request is issued, in response to the *first* packet arriving in the queue - so a second TXOP is required for the *subsequent* packets arriving a microsecond later, even though there's technically still plenty of time to reform the aggregate then.

In principle it should be possible to delay frame construction until the moment the radio is switched on; there is a short period consumed by a data-independent preamble sequence. In the old days, HW designers would have bent over backwards to make that happen.

- Jonathan Morton
Michael Richardson
2018-06-22 21:55:05 UTC
Permalink
Post by Jonathan Morton
Post by Michael Richardson
Post by Jonathan Morton
I would instead frame the problem as "how can we get hardware to
incorporate extra packets, which arrive between the request and grant
phases of the MAC, into the same TXOP?" Then we no longer need to
think probabilistically, or induce unnecessary delay in the case that
no further packets arrive.
I've never looked at the ring/buffer/descriptor structure of the ath9k, but
with most ethernet devices, they would just continue reading descriptors
until it was empty. Is there some reason that something similar can not
occur?
Or is the problem at a higher level?
Or is that we don't want to enqueue packets so early, because it's a source
of bloat?
The question is of when the aggregate frame is constructed and
"frozen", using only the packets in the queue at that instant. When
the MAC grant occurs, transmission must begin immediately, so most
hardware prepares the frame in advance of that moment - but how far in
advance?
Oh, I understand now. The aggregate frame has to be constructed, and it's
this frame that is actually in the xmit queue. I'm guessing that it's in the
hardware, because if it was in the driver, then we could perhaps do something?
Post by Jonathan Morton
In principle it should be possible to delay frame construction until
the moment the radio is switched on; there is a short period consumed
by a data-indepedent preamble sequence. In the old days, HW designers
would have bent over backwards to make that happen.
--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] ***@sandelman.ca http://www.sandelman.ca/ | ruby on rails [
Toke Høiland-Jørgensen
2018-06-25 10:38:24 UTC
Permalink
Post by Michael Richardson
Post by Jonathan Morton
Post by Michael Richardson
Post by Jonathan Morton
I would instead frame the problem as "how can we get hardware to
incorporate extra packets, which arrive between the request and grant
phases of the MAC, into the same TXOP?" Then we no longer need to
think probabilistically, or induce unnecessary delay in the case that
no further packets arrive.
I've never looked at the ring/buffer/descriptor structure of the ath9k, but
with most ethernet devices, they would just continue reading descriptors
until it was empty. Is there some reason that something similar can not
occur?
Or is the problem at a higher level?
Or is that we don't want to enqueue packets so early, because it's a source
of bloat?
The question is of when the aggregate frame is constructed and
"frozen", using only the packets in the queue at that instant. When
the MAC grant occurs, transmission must begin immediately, so most
hardware prepares the frame in advance of that moment - but how far in
advance?
Oh, I understand now. The aggregate frame has to be constructed, and it's
this frame that is actually in the xmit queue. I'm guessing that it's in the
hardware, because if it was in the driver, then we could perhaps do something?
No, it's in the driver for ath9k. So it would be possible to delay it
slightly to try to build a larger one. The timing constraints are too
tight to do it reactively when the request is granted, though; so
delaying would result in idleness if there are no other flows to queue
before then...

Even for devices that build aggregates in firmware or hardware (as all
AC chipsets do), it might be possible to throttle the queues at higher
levels to try to get better batching. It's just not obvious that there's
an algorithm that can do this in a way that will "do no harm" for other
types of traffic, for instance...

-Toke
Jim Gettys
2018-06-25 23:54:18 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Michael Richardson
Post by Jonathan Morton
Post by Michael Richardson
Post by Jonathan Morton
I would instead frame the problem as "how can we get hardware to
incorporate extra packets, which arrive between the request and grant
phases of the MAC, into the same TXOP?" Then we no longer need to
think probabilistically, or induce unnecessary delay in the case that
no further packets arrive.
I've never looked at the ring/buffer/descriptor structure of the ath9k, but
with most ethernet devices, they would just continue reading descriptors
until it was empty. Is there some reason that something similar can not
occur?
Or is the problem at a higher level?
Or is that we don't want to enqueue packets so early, because it's a source
of bloat?
The question is of when the aggregate frame is constructed and
"frozen", using only the packets in the queue at that instant. When
the MAC grant occurs, transmission must begin immediately, so most
hardware prepares the frame in advance of that moment - but how far in
advance?
Oh, I understand now. The aggregate frame has to be constructed, and it's
this frame that is actually in the xmit queue. I'm guessing that it's in the
hardware, because if it was in the driver, then we could perhaps do something?
No, it's in the driver for ath9k. So it would be possible to delay it
slightly to try to build a larger one. The timing constraints are too
tight to do it reactively when the request is granted, though; so
delaying would result in idleness if there are no other flows to queue
before then...
Even for devices that build aggregates in firmware or hardware (as all
AC chipsets do), it might be possible to throttle the queues at higher
levels to try to get better batching. It's just not obvious that there's
an algorithm that can do this in a way that will "do no harm" for other
types of traffic, for instance...
Isn't this sort of delay a natural consequence of a busy channel?

What matters is not conserving txops *all the time*, but only when the
channel is busy and there aren't more txops available....

So when you are trying to transmit on a busy channel, that contention time
will naturally increase, since you won't
be able to get a transmit opportunity immediately. So you should queue up
more packets into an aggregate in that case.

We only care about conserving txops when they are scarce, not when they are
abundant.

This principle is why a window system as crazy as X11 is competitive: it
naturally becomes more efficient in the face of load (more and more
requests batch up and are handled at maximum efficiency, so the system is
at maximum efficiency at full load).

Or am I missing something here?

Jim
Jonathan Morton
2018-06-26 00:07:30 UTC
Permalink
Post by Toke Høiland-Jørgensen
No, it's in the driver for ath9k. So it would be possible to delay it
slightly to try to build a larger one. The timing constraints are too
tight to do it reactively when the request is granted, though; so
delaying would result in idleness if there are no other flows to queue
before then...
There has to be some sort of viable compromise here. How about initiating the request immediately, then building the aggregate when the request completes transmission? That should give at least the few microseconds required for the immediately following acks to reach the queue, and be included in the same aggregate.
Post by Jim Gettys
Isn't this sort of delay a natural consequence of a busy channel?
What matters is not conserving txops *all the time*, but only when the channel is busy and there aren't more txops available....
So when you are trying to transmit on a busy channel, that contention time will naturally increase, since you won't be able to get a transmit opportunity immediately. So you should queue up more packets into an aggregate in that case.
We only care about conserving txops when they are scarce, not when they are abundant.
This principle is why a window system as crazy as X11 is competitive: it naturally becomes more efficient in the face of load (more and more requests batch up and are handled at maximum efficiency, so the system is at maximum efficiency at full load).
Or am I missing something here?
The problem is that currently every data aggregate received (one TXOP each from the AP) results in two TXOPs just to acknowledge them, the first one containing only a single ack. This is clearly wasteful, given the airtime overhead per TXOP relative to the raw data rate of modern wifi. Relying solely on backpressure would require that the channel was sufficiently busy to prevent the second TXOP from occurring until the following data aggregate is received, and that just seems too delicate to me.
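[Ed.: rough arithmetic behind "clearly wasteful"; every number here is an assumed round figure, not a measurement.]

#include <stdio.h>

int main(void)
{
    double txop_overhead_us = 100.0; /* assumed DIFS + backoff + preamble + block-ack */
    double phy_mbps = 600.0;         /* assumed 802.11ac-class PHY rate */
    double ack_bytes = 64.0;         /* roughly one TCP ack frame */

    double ack_airtime_us = ack_bytes * 8.0 / phy_mbps;   /* ~0.85us */
    printf("lone-ack payload airtime %.2fus vs ~%.0fus per-TXOP overhead (%.0fx)\n",
           ack_airtime_us, txop_overhead_us, txop_overhead_us / ack_airtime_us);
    return 0;
}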

- Jonathan Morton
David Lang
2018-06-26 00:21:34 UTC
Permalink
Post by Jonathan Morton
Post by Jim Gettys
We only care about conserving txops when they are scarce, not when they are abundant.
This principle is why a window system as crazy as X11 is competitive: it naturally becomes more efficient in the face of load (more and more requests batch up and are handled at maximum efficiency, so the system is at maximum efficiency at full load.
Or am I missing something here?
The problem is that currently every data aggregate received (one TXOP each
from the AP) results in two TXOPs just to acknowledge them, the first one
containing only a single ack. This is clearly wasteful, given the airtime
overhead per TXOP relative to the raw data rate of modern wifi. Relying
solely on backpressure would require that the channel was sufficiently busy to
prevent the second TXOP from occurring until the following data aggregate is
received, and that just seems too delicate to me.
If there are no other stations competing for airtime, why does it matter that we
use two txops? [1]

If there are no other stations that you are competing with for airtime, go ahead
and use it. If there are other stations that you are competing with for airtime,
you are unlikely to get the txop immediately, so as long as you can keep
updating the rf packet to send until the txop actually happens, the later data
will get folded in.

There will be a few times when you do get the txop immediately, and so you do
end up 'wasting' a txop, but the vast majority of the time you will be able to
combine the packets.

Now, the trick is figuring out how long we can wait to finalize the rf packet.

David Lang


[1] ignoring the hidden transmitter problem for the moment
Simon Barber
2018-06-26 00:36:11 UTC
Permalink
Most hardware needs the packet finalized before it starts to contend for the medium (as far as I’m aware - let me know if you know differently). One issue is that if RTS/CTS is in use, then the packet duration needs to be known in advance (or at least mid point of the RTS transmission).

Simon
Post by Jonathan Morton
Post by Jim Gettys
We only care about conserving txops when they are scarce, not when they are abundant.
This principle is why a window system as crazy as X11 is competitive: it naturally becomes more efficient in the face of load (more and more requests batch up and are handled at maximum efficiency, so the system is at maximum efficiency at full load.
Or am I missing something here?
The problem is that currently every data aggregate received (one TXOP each from the AP) results in two TXOPs just to acknowledge them, the first one containing only a single ack. This is clearly wasteful, given the airtime overhead per TXOP relative to the raw data rate of modern wifi. Relying solely on backpressure would require that the channel was sufficiently busy to prevent the second TXOP from occurring until the following data aggregate is received, and that just seems too delicate to me.
If there are no other stations competing for airtime, why does it matter that we use two txops? [1]
If there are no other stations that you are competing with for airtime, go ahead and use it. If there are other stations that you are competing with for airtime, you are unlikely to get the txop immediately, so as long as you can keep updating the rf packet to send until the txop actially happens, the later data will get folded in.
There will be a few times when you do get the txop immediately, and so you do end up 'wasting' a txop, but the vast majority of the time you will be able to combine the packets.
Now, the trick is figureing out how long we can wait to finalize the rf packet
David Lang
[1] ignoring the hidden transmitter problem for the moment)
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Jonathan Morton
2018-06-26 00:44:00 UTC
Permalink
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for the medium (as far as I’m aware - let me know if you know differently). One issue is that if RTS/CTS is in use, then the packet duration needs to be known in advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay of 1ms, if there isn't already enough data in the queue to fill an aggregate, after the oldest packet arrives until a request is issued.
Post by David Lang
If there are no other stations competing for airtime, why does it matter that we use two txops?
One further argument would be power consumption. Radio transmitters eat batteries for lunch; the only consistently worse offender I can think of is a display backlight, assuming the software is efficient.

- Jonathan Morton
Jim Gettys
2018-06-26 00:52:08 UTC
Permalink
Post by Jonathan Morton
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for
the medium (as far as I’m aware - let me know if you know differently). One
issue is that if RTS/CTS is in use, then the packet duration needs to be
known in advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay
of 1ms, if there isn't already enough data in the queue to fill an
aggregate, after the oldest packet arrives until a request is issued.
Post by David Lang
If there are no other stations competing for airtime, why does it matter
that we use two txops?
One further argument would be power consumption. Radio transmitters eat
batteries for lunch; the only consistently worse offender I can think of is
a display backlight, assuming the software is efficient.
Not clear if this is true; we need current data.

In OLPC days, we measured the receive/transmit power consumption, and
transmit took essentially no more power than receive. The dominant power
consumption was due to signal processing the RF, not the transmitter. Just
listening sucked power....

Does someone understand what current 802.11 and actual chip sets consume
for power?

Jim
- Jonathan Morton
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
David Lang
2018-06-26 00:56:05 UTC
Permalink
Post by Jonathan Morton
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for the
medium (as far as I’m aware - let me know if you know differently). One issue
is that if RTS/CTS is in use, then the packet duration needs to be known in
advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay of
1ms, if there isn't already enough data in the queue to fill an aggregate,
after the oldest packet arrives until a request is issued.
why does the length of the txop need to be known at the time that it's
requested?

I could see an argument that fairness algorithms need this info, but the per
txop overhead is _so_ much larger than the data transmission, that you would
have to add a huge amount of data to noticeably affect the length of the
transmission.

remember, in wifi you don't ask a central point for permission to use X amount
of airtime; you wait for everyone to stop transmitting (and then a random time)
and then start sending. Nothing else in the area knows that you are going to
start transmitting, and it's only once they start decoding the start of the rf
packet you are sending that they can see how long it will be before you finish.
Post by Jonathan Morton
Post by David Lang
If there are no other stations competing for airtime, why does it matter that we use two txops?
One further argument would be power consumption. Radio transmitters eat
batteries for lunch; the only consistently worse offender I can think of is a
display backlight, assuming the software is efficient.
True, but this gets back to the question of how frequent this case is.

If you are in areas with congestion most of the time, so the common case is to
have to wait long enough for the data to be combined, then the difference in
power savings is going to be small.

'waiting just in case there is more to send' looks good on specific benchmarks,
but it adds latency all the time, even when it's not needed.

Now, using a travel analogy:

I think how we operate today is as if we were a train at a station: when we
first are ready to move, the doors are closed and everyone sits inside waiting
for permission to move (think of how annoyed you have been sitting in a closed
aircraft at an airport waiting to move), and anyone outside has to wait for the
next train.

But if instead we leave the doors open after we request permission, and only
close them when we know that we are going to be able to send very soon, late
arrivals can board.
Toke Høiland-Jørgensen
2018-06-26 11:16:54 UTC
Permalink
Post by David Lang
Post by Jonathan Morton
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for the
medium (as far as I’m aware - let me know if you know differently). One issue
is that if RTS/CTS is in use, then the packet duration needs to be known in
advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay of
1ms, if there isn't already enough data in the queue to fill an aggregate,
after the oldest packet arrives until a request is issued.
why does the length of the txop need to be known at the time that it's
requested?
Because that's how the hardware is designed. There are really two
discussions here: (1) what could we do with a clean-slate(ish) design,
and (2) what can we retrofit into existing drivers such as the ath9k.

I think that the answer to (1) is probably 'quite a lot', but
unfortunately the answer to (2) is 'not that much'. We could probably do
a little bit better in ath9k, but for anything newer all bets are off,
as this functionality has moved into firmware.

Now, if there was a hardware vendor that was paying attention and could
do the right thing throughout the stack, that would be awesome of
course. But for Linux at least, sadly it seems that most hardware
vendors can barely figure out how to get *any* driver upstream... :/

Also, from a conceptual point of view, I really think ACK timing issues
are best solved at the TCP stack level. Which Eric is already working on
(SACK compression is already in 4.18, with normal ACK compression to
follow).

-Toke

Dave Taht
2018-06-26 01:27:55 UTC
Permalink
Post by Jonathan Morton
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for the medium (as far as I’m aware - let me know if you know differently). One issue is that if RTS/CTS is in use, then the packet duration needs to be known in advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay of 1ms, if there isn't already enough data in the queue to fill an aggregate, after the oldest packet arrives until a request is issued.
Whoa, nelly! In the context of the local tcp stack over wifi, I was
making an observation that I "frequently" saw a pattern of a single
ack txop followed by a bunch in a separate txop, and I suggested a
very short (10us) timeout before committing to the hw - not 1ms.

Aside from this anecdote we have not got real data or statistics. The
closest thing I have to a tool that can take apart wireless aircaps is
here: https://github.com/dtaht/airtime-pie-chart which can be hacked
to take more things apart than it currently does. Looking for this
pattern in more traffic would be revealing in multiple ways. Looking
for more patterns in bigger wifi networks would be good also.

I like Eric's suggestion of doing more ack compression higher up in the
tcp stack.

There are two other things I've suggested in the past we look at. 1)
The current fq_codel_for_wifi code has a philosophy of "one aggregate
in the hardware, one ready to go". A simpler modification to fit more
in would be to wait (the best-case estimate for delivering the one in
the hardware, minus a bit), then form the one ready-to-go (see the
sketch after item 2).

2) Rate limiting mcast and smoothing mcast bursts over time, allowing
more unicast through. Presently the mcast queue is infinite and very
bursty. The 802.11 std actually suggests mcast be rate limited, by htb
say, where I'd do htb + fq + merging of dup packets. I was routinely
able to blow up the c.h.i.p's wifi and the babel protocol by flooding it
with mcast, as the local mcast queue could easily grow 16+ seconds long.
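[Ed.: a minimal sketch of suggestion 1) above, with invented names; real code would also need to handle rate changes and retries.]

#include <stdint.h>

/* Best-case airtime of the in-flight aggregate at the current rate
 * (per-TXOP overheads ignored for the sketch). bytes*8 bits divided
 * by Mbit/s yields microseconds. */
static uint64_t best_case_airtime_us(uint64_t agg_bytes, uint64_t rate_mbps)
{
    return (agg_bytes * 8) / rate_mbps;
}

/* Defer forming the ready-to-go aggregate until just before the
 * in-flight one should complete, so late arrivals can still join it. */
uint64_t form_next_at(uint64_t now_us, uint64_t agg_bytes,
                      uint64_t rate_mbps, uint64_t margin_us)
{
    return now_us + best_case_airtime_us(agg_bytes, rate_mbps) - margin_us;
}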

um, I'm giving a preso tomorrow and will run behind this thread. It's
nice to see the renewed enthusiasm here, keep it up.
Post by Jonathan Morton
Post by David Lang
If there are no other stations competing for airtime, why does it matter that we use two txops?
One further argument would be power consumption. Radio transmitters eat batteries for lunch; the only consistently worse offender I can think of is a display backlight, assuming the software is efficient.
- Jonathan Morton
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Simon Barber
2018-06-26 03:30:55 UTC
Permalink
Current versions of Wireshark have an experimental feature I added to
expose airtime usage per packet and show 802.11 pcaps on a timeline.

Enable it under Preferences->Protocol->802.11 Radio

Simon
Post by Dave Taht
Post by Jonathan Morton
Post by Simon Barber
Most hardware needs the packet finalized before it starts to contend for
the medium (as far as I’m aware - let me know if you know differently). One
issue is that if RTS/CTS is in use, then the packet duration needs to be
known in advance (or at least mid point of the RTS transmission).
This is a valid argument. I think we could successfully argue for a delay
of 1ms, if there isn't already enough data in the queue to fill an
aggregate, after the oldest packet arrives until a request is issued.
Whoa, nelly! In the context of the local tcp stack over wifi, I was
making an observation that I "frequently" saw a pattern of a single
ack txop followed by a bunch in a separate txop. and I suggested a
very short (10us) timeout before committing to the hw - not 1ms.
Aside from this anecdote we have not got real data or statistics. The
closest thing I have to a tool that can take apart wireless aircaps is
here: https://github.com/dtaht/airtime-pie-chart which can be hacked
to take more things apart than it currently does. Looking for this
pattern in more traffic would be revealing in multiple ways. Looking
for more patterns in bigger wifi networks would be good also.
I like erics suggestion of doing more ack compression higher up in the
tcp stack.
There are two other things I've suggested in the past we look at. 1)
The current fq_codel_for_wifi code has a philosophy of "one aggregate
in the hardware, one ready to go". A simpler modification to fit more
in would be to (wait the best case estimate for delivering the one in
the hardware - a bit), then form the one ready-to-go.
2) rate limiting mcast and smoothing mcast bursts over time, allowing
more unicast through. presently the mcast queue is infinite and very
bursty. 802.11 std actually suggests mcast be rate limited by htb,
where I'd be htb + fq + merging dup packets. I was routinely able to
blow up the c.h.i.p's wifi and the babel protocol by flooding it with
mcast, as the local mcast queue could easily grow 16+ seconds long.
um, I'm giving a preso tomorrow and will run behind this thread. It's
nice to see the renewed enthusiasm here, keep it up.
Post by Jonathan Morton
Post by David Lang
If there are no other stations competing for airtime, why does it matter
that we use two txops?
One further argument would be power consumption. Radio transmitters eat
batteries for lunch; the only consistently worse offender I can think of is
a display backlight, assuming the software is efficient.
- Jonathan Morton
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619