Discussion:
[Bloat] beating the drum for BQL
Dave Taht
2018-08-23 00:49:08 UTC
Permalink
I had a chance to give a talk at broadcom recently, slides here:

http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf

(there's a fun slide on "carmageddon", and I am finding the
"tcp_square_wave" test *works* on EE types)

I was very happy to see BQL support in all of broadcom's *ethernet*
drivers, but along the way I noticed how many other drivers still
lacked it in the current kernel. Notably I figured a few dsl devices
should have it by now, and no doubt a few other device types.

I/we really should have beat the bql drum harder over the last 6
years. It's the basic start to all the debloating.
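
(For the driver folks: "adding BQL" really is small. A minimal sketch
of the hooks involved - the netdev_tx_*() calls are the real kernel
API, everything prefixed my_ is a placeholder:

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        my_queue_to_hw_ring(skb);                /* hand the skb to DMA */
        netdev_tx_sent_queue(txq, skb->len);     /* BQL byte accounting */
        return NETDEV_TX_OK;
}

static void my_tx_complete(struct net_device *dev)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
        unsigned int pkts = 0, bytes = 0;

        my_reap_tx_descriptors(&pkts, &bytes);   /* count what the NIC sent */
        netdev_tx_completed_queue(txq, pkts, bytes); /* release that budget */
}

plus a netdev_tx_reset_queue(txq) wherever the ring gets reset. BQL
then keeps only as many bytes in the ring as needed to avoid
starvation, so the excess queue sits up in the qdisc where fq_codel
can actually manage it.)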
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Mikael Abrahamsson
2018-08-23 05:47:40 UTC
Permalink
I/we really should have beat the bql drum harder over the last 6 years.
It's the basic start to all the debloating.
It only helps with kernel based forwarding. A lot of devices don't even
use this, especially as speeds go up. They use packet accelerators so the
kernel never sees the packets after initial flow setup.

So you need to get the people developing that silicon to get with the
program.
--
Mikael Abrahamsson email: ***@swm.pp.se
Sebastian Moeller
2018-08-23 13:06:46 UTC
Permalink
Hi Mikael,
I/we really should have beat the bql drum harder over the last 6 years. It's the basic start to all the debloating.
It only helps with kernel based forwarding. A lot of devices don't even use this, especially as speeds go up. They use packet accelerators so the kernel never sees the packets after initial flow setup.
So you need to get the people developing that silicon to get with the program.
Or we could convince customers to stop buying toy routers that only work under a severely limited set of circumstances and opt for devices that pack enough CPU punch to be actually adequate for modern internet speed tiers, no? Now if the packet accelerator is just there to help save energy for typical cases but the CPU is powerful enough to handle a modern line with all the bells and whistles customers expect, then I am all for it, but if the packet accelerator is just there to paper over an anemic CPU...

Best Regards
Sebastian

P.S.: I ignore issues of cost for ISP-supplied CPE devices here, since a) most ISPs I know actually charge rent for these devices now and b) they should have enough volume to drive down prices even for devices that are sufficiently fast for at least the 300 - 500 Mbps class of combined up- and download....
Jan Ceuleers
2018-08-23 13:57:52 UTC
Permalink
Post by Sebastian Moeller
Or we could convince customers to stop buying toy routers that only work under a severely limited set of circumstances and opt for devices that pack enough CPU punch to be actually adequate for modern internet speed tiers, no? Now if the packet accelerator is just there to help save energy for typical cases but the CPU is powerful enough to handle a modern line with all the bells and whistles customers expect, then I am all for it, but if the packet accelerator is just there to paper over an anemic CPU...
Not realistic, and also not environmentally friendly.

Doing packet forwarding in hardware is much more energy efficient, for
example. So not only is the hardware cheaper, it doesn't generate as
much heat, is cheaper to run, etc.

Jan
Sebastian Moeller
2018-08-23 14:22:07 UTC
Permalink
Hi Jan,
Post by Jan Ceuleers
Post by Sebastian Moeller
Or we could convince customers to stop buying toy routers that only work under a severely limited set of circumstances and opt for devices that pack enough CPU punch to be actually adequate for modern internet speed tiers, no? Now if the packet accelerator is just there to help save energy for typical cases but the CPU is powerful enough to handle a modern line with all the bells and whistles customers expect, then I am all for it, but if the packet accelerator is just there to paper over an anemic CPU...
Not realistic, and also not environmentally friendly.
I unhappily agree with the first (market forces being what they are); I claim the second is untrue, since I stipulated packet accelerators as a way to save energy, allowing a competent main CPU to idle in low-power mode but pack the punch if needed. That sounds more environmentally sane than the current model, in which you have to replace the incompetent router with a more competent one later... (and my issue is not that a router needs to last forever, but that e.g. an ISP-supplied router should be able to handle at least the sold plan's bandwidth with its main CPU...)
Post by Jan Ceuleers
Doing packet forwarding in hardware is much more energy efficient, for
example. So not only is the hardware cheaper, it doesn't generate as
much heat, is cheaper to run, etc.
Sure, doing less / a half-assed job is less costly than doing it right, but in the extreme not doing the job at all saves even more energy ;). And I am not sure we are barking up the right tree here; it is not that all home CPE are rigorously optimized for low power and energy saving... my gut feeling is that the only optimizing principle is cost for the manufacturer/OEM, and that causes underpowered CPUs that are "packet-accelerator"-doped to appear able to do their job. I might be wrong though, as I have no ISP-internal numbers on this issue.

Best Regards
Sebastian
Post by Jan Ceuleers
Jan
Mikael Abrahamsson
2018-08-23 15:32:26 UTC
Permalink
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with
its main CPU...)
There is exactly one SoC on the market that does this, and that's Marvell
Armada 385, and it hasn't been very successful when it comes to ending up
in these kinds of devices. It's mostly ended up in NASes and devices such
as WRT1200AC, WRT1900ACS, WRT3200AC.
Post by Sebastian Moeller
Sure, doing less / a half-assed job is less costly than doing it
right, but in the extreme not doing the job at all saves even more
energy ;). And I am not sure we are barking up the right tree here; it
is not that all home CPE are rigorously optimized for low power and
energy saving... my gut feeling is that the only optimizing principle
is cost for the manufacturer/OEM, and that causes underpowered CPUs
that are "packet-accelerator"-doped to appear able to do their job. I
might be wrong though, as I have no ISP-internal numbers on this issue.
The CPU power and RAM/flash have crept up a lot in the past 5 years
because of requirements for the HGW to support other applications
beyond just being a very simple NAT44+wifi router.

Cost is definitely an optimization, and when you're expected to have a
price-to-customer including software in the 20-40 EUR/device range, then
the SoC can't cost much. There has also been a lot of vendor lock-in.

But now speeds are creeping up even more, we're now seeing 2.5GE and 10GE
platforms, which require substantial CPU power to do forwarding. The Linux
kernel is now becoming the bottleneck in the forwarding, not even on a
3GHz Intel CPU is it possible to forward even 10GE using the normal Linux
kernel path (my guess right now is that this is due to context switching
etc, not really CPU performance).

Marvell has been the only one to really aim for lots of CPU performance in
their SoC; there might be others now going the same path, but it's also a
downside if the CPU becomes bogged down with packet forwarding when it's
also expected to perform other tasks on behalf of the user (and ISP).
--
Mikael Abrahamsson email: ***@swm.pp.se
Rosen Penev
2018-08-23 17:14:57 UTC
Permalink
Post by Mikael Abrahamsson
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with
its main CPU...)
There is exactly one SoC on the market that does this, and that's Marvell
Armada 385, and it hasn't been very successful when it comes to ending up
in these kinds of devices. It's mostly ended up in NASes and devices such
as WRT1200AC, WRT1900ACS, WRT3200AC.
I completely agree with this as my Turris Omnia has solid Ethernet
performance. Low latency as well.

Well, from a driver point of view. Qualcomm and Mediatek also make
competitive hardware, but the driver situation is so terrible that
other developers have to do the work. Marvell employees do most of the
work on mvneta.
Post by Mikael Abrahamsson
The CPU power and RAM/flash have crept up a lot in the past 5 years
because of requirements for the HGW to support other applications
beyond just being a very simple NAT44+wifi router.
Cost is definitely an optimization, and when you're expected to have a
price-to-customer including software in the 20-40 EUR/device range, then
the SoC can't cost much. There has also been a lot of vendor lock-in.
But now speeds are creeping up even more, we're now seeing 2.5GE and 10GE
platforms, which require substantial CPU power to do forwarding. The Linux
kernel is now becoming the bottleneck in the forwarding: not even on a
3GHz Intel CPU is it possible to forward even 10GE using the normal Linux
kernel path (my guess right now is that this is due to context switching
etc, not really CPU performance).
Flow offloading can save quite a bit of CPU, even when done in
software. It also helps that the kernel network stack is getting
better.
Post by Mikael Abrahamsson
Marvell has been the only one to really aim for lots of CPU performance in
their SoC; there might be others now going the same path, but it's also a
downside if the CPU becomes bogged down with packet forwarding when it's
also expected to perform other tasks on behalf of the user (and ISP).
If only there were more devices...
Mikael Abrahamsson
2018-08-23 17:50:21 UTC
Permalink
Flow offloading can save quite a bit of CPU, even when done in software.
It also helps that the kernel network stack is getting better.
I tried this on my 10GE x86-64 test bed. It didn't help; it seems to be
%sirq-limited, and flowoffload changed nothing. It helps on lower-end CPU
platforms (I've tried it there too), but not for the 10GE forwarding case.
--
Mikael Abrahamsson email: ***@swm.pp.se
Dave Taht
2018-08-23 18:21:33 UTC
Permalink
One of the things not readily evident in trying to scale up is the
cost of even the most basic routing table lookup. A lot of good work
in this area landed in linux 4.1 and 4.2 (see a couple posts here:
https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
)

Lookup time for even the smallest number of routes is absolutely
miserable for IPv6 -
https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
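
(For scale, the benchmark in those posts is essentially timing this one
call per packet; a sketch of the in-kernel IPv4 side, error handling
elided - fib_lookup() is the real API from include/net/ip_fib.h:

#include <net/ip_fib.h>

static int route_lookup_cost_demo(struct net *net, __be32 daddr)
{
        struct flowi4 fl4 = { .daddr = daddr };
        struct fib_result res;

        /* the LPM trie walk: the part whose latency grows with table
         * size, and which the v6 code did far less efficiently */
        return fib_lookup(net, &fl4, &res, 0);
}

and every forwarded packet that misses whatever caching sits above
this pays for that walk.)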

I think one of the biggest driving factors of the whole TSO/GRO thing
is the desire to get smaller packets through this phase of the
kernel, and not that they are so much more efficient at the card
itself. Given the kerfuffle over here (
https://github.com/systemd/systemd/issues/9725 ) I'd actually like to
come up with a way to move the linux application socket buffers to the
post-lookup side of the routing table. We spend a lot of extra time
bloating up superpackets just so they are cheaper to route.
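
(Concretely, the batching happens where the driver's rx poll loop
hands frames up: instead of one netif_receive_skb() per wire frame it
does something like - my_rx_frame() standing in for the ring refill:

        skb = my_rx_frame(priv);
        napi_gro_receive(&priv->napi, skb);

and GRO may merge consecutive segments of one flow into a single
superpacket, so the route lookup and netfilter run once per merged skb
rather than once per frame.)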

TCAMs are expensive as hell, but the addition of even a small one,
readily accessible to userspace or from the kernel, might help in the
general case. I've actually oft wished to be able to offload these
sort of lookups into higher level algorithms and languages like
python, as a general purpose facility. Hey, if we can have giant GPUs,
why can't our cpus have tcams?

Programmable TCAM support got enabled in a recent (Mellanox?) product;
can't find the link at the moment. TCAMs, of course, are where big fat
dedicated routers and switches shine over linux - and even arp table
lookups are expensive in linux, though I'm not sure if anyone has
looked lately.
Dave Taht
2018-08-23 20:15:29 UTC
Permalink
I should also point out that the kinds of routing latency numbers in
those blog entries were on very high end intel hardware. It would be
good to re-run those sorts of tests on the armada and others for
1, 10, 100, 1000 routes. Clever complicated algorithms have a tendency
to bloat icache and cost more than they are worth, fairly often, on
hardware that typically has 32k i/d caches, and a small L2.

BQL's XMIT_MORE is one example - while on the surface it looked like a
win, it cost too much on the ar71xx to use. Similarly I worry about
the new rx batching code (
https://lwn.net/SubscriberLink/763056/f9a20ec24b8d29dd/ ) which looks
*GREAT* - on *intel* - although I *think* it will be a win everywhere
this time. I tend to think a smaller napi value would help, and
sometimes I think about revisiting napi itself.
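
(The "napi value" here is the weight a driver registers its poll
function with - stock drivers pass NAPI_POLL_WEIGHT, which is 64. A
sketch with placeholder my_*() names:

static int my_poll(struct napi_struct *napi, int budget)
{
        int done = my_clean_rx_ring(napi, budget); /* budget <= weight */

        if (done < budget)
                napi_complete_done(napi, done);    /* re-enable rx irq */
        return done;
}

/* probe time: cap one softirq pass at 16 packets instead of 64,
 * trading a little peak throughput for latency on small-cache hw */
netif_napi_add(dev, &priv->napi, my_poll, 16);

is the whole experiment.)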

(and I'm perfectly willing to wait til openwrt does the rest of the
port for mips to 4.19 before fiddling with it... or longer. I could
use a dayjob)

Still, it's been the rx side of linux that has been increasingly
worrisome of late, and anything that can be done there for any chip
seems like a goodness.

on the mvneta front... I've worked on that driver... oh... if I could
get a shot at ripping out all the bloat in it and see what happened...

On the marvell front... yes, they tend to produce hardware that runs
too hot. I too rather like the chipset, and it's become my default hw
for most things in the midrange.

Lastly... there are still billions of slower ISP links left in the
world to fix, with hardware that now costs well under
40 bucks. The edgerouter X is 50 bucks (sans wifi) and good to
~180mbps for inbound shaping presently. Can we get those edge
connections fixed???
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Mikael Abrahamsson
2018-08-24 07:00:56 UTC
Permalink
On the marvell front... yes, they tend to produce hardware that runs too
hot. I too rather like the chipset, and it's become my default hw for
most things in the midrange.
I checked my WRT1200AC and it idles at 8W. My similar Broadcom box idles
at 10W, but that one has a lot more on the motherboard plus 4x4 wifi that
tends to run very hot. I intend to try them under load though and see how
much power usage changes.
Lastly... there are still billions of slower ISP links left in the
world to fix, with hardware that now costs well under
40 bucks. The edgerouter X is 50 bucks (sans wifi) and good to
~180mbps for inbound shaping presently. Can we get those edge
connections fixed???
There are indeed these kinds of slower devices, but they also tend to
be the kind of device that last saw development a few years ago, and
the only reason they're still being newly installed is that they're
cheap.

In most of the world, customers do not rent the CPE so there is no cash
flow to the ISP to fix anything. So they tend to sit there until they
break.
--
Mikael Abrahamsson email: ***@swm.pp.se
Dave Taht
2018-08-24 08:06:48 UTC
Permalink
Post by Mikael Abrahamsson
On the marvell front... yes, they tend to produce hardware that runs too
hot. I too rather like the chipset, and it's become my default hw for
most things in the midrange.
I checked my WRT1200AC and it idles at 8W. My similar Broadcom box idles
at 10W, but that one has a lot more on the motherboard plus 4x4 wifi that
tends to run very hot. I intend to try them under load though and see how
much power usage changes.
My ar71xx/ath9 hw - like nanostations - was below 2W. The wndr3800 I
don't remember; I think the ethernet switch added quite a bit. But 8W?
Not even close to that. A modern LED lightbulb eats that and sheds
quite a lot of light.

Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
has anyone got a gpon interface for the omnia yet? I *hate* the need
for ONTs.
Post by Mikael Abrahamsson
Lastly... there are still billions of slower ISP links left in the
world to fix, with hardware that now costs well under
40 bucks. The edgerouter X is 50 bucks (sans wifi) and good to
~180mbps for inbound shaping presently. Can we get those edge
connections fixed???
There are indeed these kinds of slower devices, but they also tend to
be the kind of device that last saw development a few years ago, and
the only reason they're still being newly installed is that they're
cheap.
with better software on by default. I fully expect my wndr3800s to see
service for another 10 years. The failure rate is amazingly low.
Post by Mikael Abrahamsson
In most of the world, customers do not rent the CPE so there is no cash
flow to the ISP to fix anything. So they tend to sit there until they
break.
There are two different debloating tactics to take then... but it's
late I gotta sleep.
Post by Mikael Abrahamsson
--
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Toke Høiland-Jørgensen
2018-08-24 11:22:11 UTC
Permalink
Post by Dave Taht
Post by Mikael Abrahamsson
On the marvell front... yes, they tend to produce hardware that runs too
hot. I too rather like the chipset, and it's become my default hw for
most things in the midrange.
I checked my WRT1200AC and it idles at 8W. My similar Broadcom box idles
at 10W, but that one has a lot more on the motherboard plus 4x4 wifi that
tends to run very hot. I intend to try them under load though and see how
much power usage changes.
My ar71xx/ath9 hw - like nanostations - was below 2W. The wndr3800 I
don't remember; I think the ethernet switch added quite a bit. But 8W?
Not even close to that. A modern LED lightbulb eats that and sheds
quite a lot of light.
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
has anyone got a gpon interface for the omnia yet? I *hate* the need
for ONTs.
I have a regular Ethernet SFP module in one of my omnias. Can't very
well check its power usage ATM though, as I'm currently in a different
country from that box...

-Toke
Jan Ceuleers
2018-08-24 11:46:22 UTC
Permalink
Post by Dave Taht
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
I have taken a look at a couple. I see numbers in the range 1.7 - 2.2W
for GPON ONTs.
Jan Ceuleers
2018-08-24 13:43:42 UTC
Permalink
Post by Jan Ceuleers
Post by Dave Taht
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
I have taken a look at a couple. I see numbers in the range 1.7 - 2.2W
for GPON ONTs.
Just to be clear: that's for GPON SFP ONTs.
Mikael Abrahamsson
2018-08-24 13:44:42 UTC
Permalink
Post by Jan Ceuleers
Post by Jan Ceuleers
Post by Dave Taht
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
I have taken a look at a couple. I see numbers in the range 1.7 - 2.2W
for GPON ONTs.
Just to be clear: that's for GPON SFP ONTs.
Just the SFP, right?
--
Mikael Abrahamsson email: ***@swm.pp.se
Jan Ceuleers
2018-08-24 13:56:16 UTC
Permalink
Post by Mikael Abrahamsson
Post by Jan Ceuleers
Post by Jan Ceuleers
Post by Dave Taht
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
I have taken a look at a couple. I see numbers in the range 1.7 - 2.2W
for GPON ONTs.
Just to be clear: that's for GPON SFP ONTs.
Just the SFP, right?
Yes
Mikael Abrahamsson
2018-08-24 13:36:20 UTC
Permalink
Post by Dave Taht
My ar71xx/ath9 hw - like nanostations - was below 2W. The wndr3800 I
don't remember; I think the ethernet switch added quite a bit. But 8W?
Not even close to that. A modern LED lightbulb eats that and sheds
quite a lot of light.
My very simple and stupid 1GE SFP/ethernet fiber media converter uses
4.3W when idling.
Post by Dave Taht
Random curiosity: what do various SFP+ interfaces (notably gpon) eat?
has anyone got a gpon interface for the omnia yet? I *hate* the need
for ONTs.
These can easily be 1-2 watts. I put a 1GE SFP into the
before-mentioned Broadcom HGW and power usage went up from 9.4W to
10.2W. So if it's a GPON or similar, I'd imagine it's substantially
more, considering there are quite a lot more things a GPON device
needs to do.
--
Mikael Abrahamsson email: ***@swm.pp.se
Mikael Abrahamsson
2018-08-24 07:05:38 UTC
Permalink
Post by Dave Taht
I should also point out that the kinds of routing latency numbers in
those blog entries were on very high end intel hardware. It would be
good to re-run those sorts of tests on the armada and others for
1, 10, 100, 1000 routes. Clever complicated algorithms have a tendency
to bloat icache and cost more than they are worth, fairly often, on
hardware that typically has 32k i/d caches, and a small L2.
My testing has been on OpenWrt with 4.14 on intel x86-64. Looking at how
the box behaves, I'd say it's limited by context switching / interrupt
load, and not actually by the CPU being busy doing "hard work".

All of the fast routing implementations (snabbswitch, FD.IO/VPP etc)
take away CPU and devices from Linux and run a busy-loop with polling
a lot of the time, never context switching, which means the L1 cache is
never churned. This is how they become fast. I see potential to do "XDP
offload" of forwarding here, basically doing a similar job to what a
hardware packet accelerator does. Then we can potentially optimise
forwarding using lessons learnt from the other projects. Need to keep
the bufferbloat work in mind when doing this though, so we don't make
that bad again.
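
(The hook for this already exists: an eBPF program attached at the
driver's earliest rx point. The minimal shape, before any flow-table
logic - headers as in the kernel's samples/bpf:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_early(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        if (data + 14 > data_end)   /* bounds check the verifier wants */
                return XDP_DROP;    /* runt frame */

        /* a software "packet accelerator" would look the flow up in a
         * BPF map here and bpf_redirect() known flows straight to the
         * egress port, leaving everything else to the normal stack */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

This runs before skb allocation, which is where a lot of the
per-packet cost comes from.)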
--
Mikael Abrahamsson email: ***@swm.pp.se
Toke Høiland-Jørgensen
2018-08-24 11:24:42 UTC
Permalink
Post by Mikael Abrahamsson
Post by Dave Taht
I should also point out that the kinds of routing latency numbers in
those blog entries were on very high end intel hardware. It would be
good to re-run those sorts of tests on the armada and others for
1, 10, 100, 1000 routes. Clever complicated algorithms have a tendency
to bloat icache and cost more than they are worth, fairly often, on
hardware that typically has 32k i/d caches, and a small L2.
My testing has been on OpenWrt with 4.14 on intel x86-64. Looking at how
the box behaves, I'd say it's limited by context switching / interrupt
load, and not actually by the CPU being busy doing "hard work".
All of the fast routing implementations (snabbswitch, FD.IO/VPP etc)
take away CPU and devices from Linux and run a busy-loop with polling
a lot of the time, never context switching, which means the L1 cache is
never churned. This is how they become fast. I see potential to do "XDP
offload" of forwarding here, basically doing a similar job to what a
hardware packet accelerator does.
Yup, that would help; we see basically 2-3x improvement in routing
performance with XDP over the regular stack. Don't think there's XDP
support in any of the low-end ethernet drivers yet, though...
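
(For the curious, those tests are roughly this shape: an XDP program
that asks the kernel FIB via the bpf_fib_lookup() helper (the helper
landed in 4.18, the RET_* status codes right after), rewrites the MACs
and redirects. Parsing and error paths elided, and the usual
<linux/bpf.h>, <linux/if_ether.h> and bpf_helpers.h includes assumed:

SEC("xdp")
int xdp_router(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct bpf_fib_lookup fib = {};

        if (data + sizeof(*eth) > data_end)
                return XDP_DROP;

        fib.family = AF_INET;
        /* ...fill fib.ipv4_dst etc. from the IP header here... */

        if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) ==
            BPF_FIB_LKUP_RET_SUCCESS) {
                __builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
                __builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
                return bpf_redirect(fib.ifindex, 0);
        }
        return XDP_PASS;        /* slow path handles the rest */
}

Same FIB the stack uses, minus the skb and the per-layer overhead.)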

-Toke
Toke Høiland-Jørgensen
2018-08-23 21:01:50 UTC
Permalink
Post by Dave Taht
One of the things not readily evident in trying to scale up, is the
cost of even the most basic routing table lookup. A lot of good work
https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
)
Lookup time for even the smallest number of routes is absolutely
miserable for IPv6 -
https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
The IPv6 routing lookup is on par with v4 these days. We got 7.2M pkts/s
in our XDP tests on a single core (although admittedly a fairly high-end
Intel one). Which allows you to route 10Gbps of 64-byte packets on two
cores...
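
(That's against a 64-byte line rate of 10^10 / ((64 + 20) * 8) ≈ 14.88M
pkts/s once you count preamble and inter-frame gap, so two cores at
7.2M each just about cover it.)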

-Toke
Dave Taht
2018-08-24 14:44:07 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Dave Taht
One of the things not readily evident in trying to scale up, is the
cost of even the most basic routing table lookup. A lot of good work
https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
)
Lookup time for even the smallest number of routes is absolutely
miserable for IPv6 -
https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
The IPv6 routing lookup is on par with v4 these days. We got 7.2M pkts/s
in our XDP tests on a single core (although admittedly a fairly high-end
Intel one). Which allows you to route 10Gbps of 64-byte packets on two
cores...
Call me cynical, call me grumpy...

but did you get that result with testing 1, 10, 100, 1000, 10,000,
100k, 1M routes? The best-case performance on that test looked like
0.150us, the worst case 1.75us
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Toke Høiland-Jørgensen
2018-08-24 17:58:42 UTC
Permalink
Post by Dave Taht
Post by Toke Høiland-Jørgensen
Post by Dave Taht
One of the things not readily evident in trying to scale up, is the
cost of even the most basic routing table lookup. A lot of good work
https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
)
Lookup time for even the smallest number of routes is absolutely
miserable for IPv6 -
https://vincent.bernat.im/en/blog/2017-ipv6-route-lookup-linux
The IPv6 routing lookup is on par with v4 these days. We got 7.2M pkts/s
in our XDP tests on a single core (although admittedly a fairly high-end
Intel one). Which allows you to route 10Gbps of 64-byte packets on two
cores...
Call me cynical, call me grumpy...
but did you get that result with testing 1, 10, 100, 1000, 10,000,
100k, 1M routes? The best-case performance on that test looked like
0.150us, the worst case 1.75us
Think those were single-route tests, or close to it. Don't have results
handy for large routing tables for IPv6, but IPv4 performance drops by
~33% when going from a single route to a full BGP table dump...

Can run some tests for v6 once my testbed is running again...

-Toke
Sebastian Moeller
2018-08-23 23:35:55 UTC
Permalink
Dear Mikael,
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with its main CPU...)
There is exactly one SoC on the market that does this, and that's Marvell Armada 385, and it hasn't been very successful when it comes to ending up in these kinds of devices. It's mostly ended up in NASes and devices such as WRT1200AC, WRT1900ACS, WRT3200AC.
Interesting question: how will the intel grx750 perform (https://www.intel.com/content/www/us/en/smart-home/connected-home/anywan-grx750-home-gateway-brief.html)? After all, that is a SoC with a dual-core atom CPU at up to 2.5 GHz.
Post by Sebastian Moeller
Sure, doing less / a half-assed job is less costly than doing it right, but in the extreme not doing the job at all saves even more energy ;). And I am not sure we are barking up the right tree here; it is not that all home CPE are rigorously optimized for low power and energy saving... my gut feeling is that the only optimizing principle is cost for the manufacturer/OEM, and that causes underpowered CPUs that are "packet-accelerator"-doped to appear able to do their job. I might be wrong though, as I have no ISP-internal numbers on this issue.
The CPU power and RAM/flash have crept up a lot in the past 5 years because of requirements for the HGW to support other applications beyond just being a very simple NAT44+wifi router.
That matches my observation as well: people seem to want to concentrate more functionality at the one device that needs to run 24/7 instead of using multiple independent devices (and I do not want to blame them, even though I believe from a robustness perspective it would be better not to concentrate everything in the routing/firewall device).
Cost is definitely an optimization, and when you're expected to have a price-to-customer including software in the 20-40 EUR/device range, then the SoC can't cost much. There has also been a lot of vendor lock-in.
Sure, but my ISP charged 4 EUR per month for the DSL router; that adds up to 12*2*4 = 96 EUR over the 2-year contract duration and to 12*5*4 = 240 EUR over my renting duration. Assuming that my ISP does not need to make a profit on this device (after all, I am renting this to be able to consume internet and telephone from them), that is considerably more than 20-40 EUR. This is especially farcical since until a few years ago the dsl-routers were given away for "free", and when they switched to mandatory renting the baseplan price was not reduced by the same amount. I guess what I want to convey is that while cost is important it is not a good excuse to distribute underpowered devices....
But now speeds are creeping up even more, we're now seeing 2.5GE and 10GE platforms, which require substantial CPU power to do forwarding.
Well, it is all swell if a router delivers 2.5/5/10 Gbps on the LAN side, but a) I know only a few households that would profit from that, b) at those speeds the shortcomings of a router become even more obvious, and c) bandplans to actually feed such a beast from the wan side seem expensive enough that the customer should also be able to pay for a competent router (one can get intel-based multicore atom boards at the same price point as the high-end homerouters at ~250EUR).
The Linux kernel is now becoming the bottleneck in the forwarding: not even on a 3GHz Intel CPU is it possible to forward even 10GE using the normal Linux kernel path (my guess right now is that this is due to context switching etc, not really CPU performance).
That is a bridge to cross once we reach it; I doubt that we will realistically reach 10 Gbps home internet access for the masses soon.
Marvell has been the only one to really aim for lots of CPU performance in their SoC; there might be others now going the same path, but it's also a downside if the CPU becomes bogged down with packet forwarding when it's also expected to perform other tasks on behalf of the user (and ISP).
As stated above, there is an argument for concentrating non-core router functionality on another device (like one of those NAS devices that can also share a printer).

All that said, I believe that your opinion is far closer to the real world and the positions of the ISPs, so I expect things to stay as they are, but I can dream, can't I ;) ...


Best Regards
Sebastian
--
David Collier-Brown
2018-08-24 01:08:02 UTC
Permalink
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with its main CPU...)
Looking at this as an economic decision, were I a modern ISP, I would want

* as few different devices as possible
* good margins on each

and I would not care about whether the smarts were in the CPU or a
controller, as my only interests would lie in solving a linear
programming problem for the least cost in base cost, inventory and
replacement costs.

A quick look at Tek Savvy (our best non-monopolist isp) shows they have
one modem they rent...

plus a large list of ones they support for a given speed range, from
which they chose their best priced one:

Manufacturer   Model       Hardware Version   Firmware Version
Cisco          DPC3848N    2.0                dpc3800-v303r2042161-160115a
Cisco          DPC3848V    1.0                dpc3800-v303r2042162-160115a
Technicolor    DPC3848V    1.0                dpc3800-v303r2042162-160620a  <- their favorite
Technicolor    TC4350      3.0                50041.1.19.0
Hitron         CDA3        1A                 4.5.0.14
Hitron         CDA3-35     1.A                6.1.2.26
TP-Link        TC7650      1.0                v1.0.3 Build 20161117 Rel358190
Hitron         CGNM-3550   1A                 4.5.11.8-TPIA
Hitron         CGN3-RES    1A                 4.2.4.11RES
SmartRG        SR808ac     1.0

--dave
--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
***@spamcop.net | -- Mark Twain
Jan Ceuleers
2018-08-24 06:16:18 UTC
Permalink
Post by Sebastian Moeller
Sure, but my ISP charged 4 EUR per month for the DSL router; that adds up to 12*2*4 = 96 EUR over the 2-year contract duration and to 12*5*4 = 240 EUR over my renting duration. Assuming that my ISP does not need to make a profit on this device (after all, I am renting this to be able to consume internet and telephone from them), that is considerably more than 20-40 EUR. This is especially farcical since until a few years ago the dsl-routers were given away for "free", and when they switched to mandatory renting the baseplan price was not reduced by the same amount. I guess what I want to convey is that while cost is important it is not a good excuse to distribute underpowered devices....
A few points:

1. The CPE makers sell these devices to ISPs wholesale. The price point
they have to design to is determined by those ISPs' willingness to pay,
which is also influenced/co-determined by the market price. What ISPs
manage to rake in over the technical life span of those devices has
nothing to do with it - the CPE makers don't get a cut of the rental
charges. The economics explain the behaviour of both the ISP and the CPE
maker very well - nothing's going to change without some substantial
disruption.

2. Mandatory rental of CPE is a racket that's seen in various markets
(e.g. STBs in the cable market). It is more defendable in some
situations than in others. For example cable companies get away with it
because of a lack of sufficient standardisation (which is also not going
to be improved to the point where the STB market can be completely
opened because the standards are set by a cable industry body).

It's not so defendable in the DSL market anymore, although I can tell
you from experience that there's a lot of very poor DSL modems out there
which cause interop problems in the real world, including interference
with other people's DSL lines. The network operator having firm control
of which DSL modems are permitted onto their network helps performance
for all of their customers. But this can also be achieved by means other
than mandatory rental of CPE: there are countries where the regulator
oversees a DSL modem certification scheme, and end-users can then go out
and buy a certified device through their preferred retail outlet, at
very reasonable prices.

FWIW I bought three STBs for 50EUR from my cable company 5 years ago,
rather than renting them for 7EUR/month each.

3. ISPs and network operators will buy the cheapest CPE that meets their
needs. In respect of traffic throughput they will invariably use Ookla
to test. This is why I've been saying for quite some time that unless we
can convince Ookla to penalise bufferbloat we won't make a dent in how
CPE are designed in this respect.

Jan
Toke Høiland-Jørgensen
2018-08-24 11:27:13 UTC
Permalink
Post by Mikael Abrahamsson
But now speeds are creeping up even more, we're now seeing 2.5GE and 10GE
platforms,
Are there actually any 10GE embedded platforms one can buy? I've been
thinking about how to upgrade my home network without putting x86 boxes
everywhere...

-Toke
Pedro Tumusok
2018-08-24 12:46:34 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Mikael Abrahamsson
But now speeds are creeping up even more, we're now seeing 2.5GE and
10GE platforms,
Are there actually any 10GE embedded platforms one can buy? I've been
thinking about how to upgrade my home network without putting x86 boxes
everywhere...
From the few vendor roadmaps I have seen, there are a few Broadcom-based
CPEs coming out this Q4 and Q1 next year that support up to 10GE
throughput, but again this is with offload/acceleration hw in there.
--
Best regards / Mvh
Jan Pedro Tumusok
Michael Richardson
2018-08-24 13:09:51 UTC
Permalink
Post by Toke Høiland-Jørgensen
Post by Mikael Abrahamsson
But now speeds are creeping up even more, we're now seeing 2.5GE and
10GE platforms,
Are there actually any 10GE embedded platforms one can buy? I've been
thinking about how to upgrade my home network without putting x86 boxes
everywhere...
http://wiki.macchiatobin.net/tiki-index.php has been recommended.
I tried to buy one last year, but they had production problems, and I
cancelled. They are apparently now for sale.

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] ***@sandelman.ca http://www.sandelman.ca/ | ruby on rails [
Mikael Abrahamsson
2018-08-24 13:37:25 UTC
Permalink
Post by Toke Høiland-Jørgensen
Are there actually any 10GE embedded platforms one can buy? I've been
thinking about how to upgrade my home network without putting x86 boxes
everywhere...
https://www.solid-run.com/marvell-armada-family/macchiatobin/

I know people currently working on XDP-enabling the drivers for that
board (Marvell 8040).
--
Mikael Abrahamsson email: ***@swm.pp.se
Toke Høiland-Jørgensen
2018-08-24 13:44:41 UTC
Permalink
Post by Mikael Abrahamsson
Post by Toke Høiland-Jørgensen
Are there actually any 10GE embedded platforms one can buy? I've been
thinking about how to upgrade my home network without putting x86 boxes
everywhere...
https://www.solid-run.com/marvell-armada-family/macchiatobin/
I know people currently working on XDP-enabling the drivers for that
board (Marvell 8040).
Cool! Guess I should try to get my hands on one :)

-Toke
Dave Taht
2018-08-26 04:29:29 UTC
Permalink
Post by Mikael Abrahamsson
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with
its main CPU...)
There is exactly one SoC on the market that does this, and that's Marvell
Armada 385, and it hasn't been very successful when it comes to ending up
in these kinds of devices. It's mostly ended up in NASes and devices such
as WRT1200AC, WRT1900ACS, WRT3200AC.
I just pulled two of those out of my junk drawer. (bricked presently).
It looks like we
can't apply fq_codel for wifi to it (big binary blob), still.

The firmware interface code is pretty clean though.

https://github.com/kaloz/mwlwifi

I rather liked the 385 chip myself, but wifi... can't fix, going back
in junk drawer
unless someone wants one.

The expressobin is a Marvell Armada "3700LP (88F3720) dual core ARM
Cortex A53 processor up to 1.2GHz" - how does that compare? I have
plenty of ath10k and ath9k pcmcia cards....
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Jonathan Morton
2018-08-26 04:32:22 UTC
Permalink
dual core ARM Cortex A53 processor up to 1.2GHz
...that basically makes it half of a Raspberry Pi 3, CPU-wise.

- Jonathan Morton
Rosen Penev
2018-08-26 16:28:30 UTC
Permalink
Post by Dave Taht
Post by Mikael Abrahamsson
Post by Sebastian Moeller
router should be able to handle at least the sold plan's bandwidth with
its main CPU...)
There is exactly one SoC on the market that does this, and that's Marvell
Armada 385, and it hasn't been very successful when it comes to ending up
in these kinds of devices. It's mostly ended up in NASes and devices such
as WRT1200AC, WRT1900ACS, WRT3200AC.
I just pulled two of those out of my junk drawer. (bricked presently).
It looks like we
can't apply fq_codel for wifi to it (big binary blob), still.
Yeah it's junk. While still developed, it has outstanding issues that
have not been fixed in a long time.

One example: monitor mode does not show data packets. Packet injection
still works though.
Post by Dave Taht
The firmware interface code is pretty clean though.
https://github.com/kaloz/mwlwifi
I rather liked the 385 chip myself, but wifi... can't fix, going back
in junk drawer
unless someone wants one.
The expressobin is a Marvell Armada "3700LP (88F3720) dual core ARM
Cortex A53 processor up to 1.2GHz" - how does that compare? I have
plenty of ath10k and ath9k pcmcia cards....
Mikael Abrahamsson
2018-08-26 18:44:34 UTC
Permalink
Post by Dave Taht
The expressobin is a Marvell Armada "3700LP (88F3720) dual core ARM
Cortex A53 processor up to 1.2GHz" - how does that compare? I have
plenty of ath10k and ath9k pcmcia cards....
I have one of these, incl wifi. Right now the drivers are not in great
shape, but they're being worked on. My espressobin has worse performance
on its wired ports than my WRT1200AC (Armada 385).

I have talked to people who say the drivers are being worked on though...
If you have input, Kaloz is probably a great person to take that input. I
know other people working on Marvell drivers as well.
--
Mikael Abrahamsson email: ***@swm.pp.se
Rosen Penev
2018-08-26 20:58:22 UTC
Permalink
Post by Mikael Abrahamsson
Post by Dave Taht
The expressobin is a Marvell Armada "3700LP (88F3720) dual core ARM
Cortex A53 processor up to 1.2GHz" - how does that compare? I have
plenty of ath10k and ath9k pcmcia cards....
I have one of these, incl wifi. Right now the drivers are not in great
shape, but they're being worked on. My espressobin has worse performance
than on its wired ports than my WRT1200AC (Armada 385).
If, as you mentioned earlier, ethernet performance is limited by
interrupts, then this commit is kind of depressing:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/net/ethernet/marvell/mvneta.c?h=next-20180824&id=0f5c6c30a0f8c629b92ecdaef61b315c43fde10a
Dave Taht
2018-08-26 21:08:55 UTC
Permalink
Post by Rosen Penev
If, as you mentioned earlier, ethernet performance is limited by
interrupts, then this commit is kind of depressing:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/net/ethernet/marvell/mvneta.c?h=next-20180824&id=0f5c6c30a0f8c629b92ecdaef61b315c43fde10a
I was on that thread. It was broken before entirely. As for the single
interrupt on this chip variant - believe it or not, I'm not huge on
lots of different interrupts for everything. I'd like it if we had
more than an rx or tx interrupt
in general; I'd love it if we had a programmable "tx is almost done"
interrupt that you could tune to the
interrupt latency... and it's complicated and costs wires to have lots
of different interrupt types... and (fantasizing again) I'd love it if
we had a scratchpad or dedicated memory to store interrupt handlers in
rather than relying on cache....

I'd looked deeply into improving this driver once upon a time, and
wanted to rip the software gro out of it,
in particular, and not defer things as much, trying things like a NAPI
weight of 16 and measuring where time was spent. The copy to memory is
expensive, and then it defers further work.

Less code, particularly near interrupt time, is better than a lot.
Adding XMIT_MORE to the ar71xx driver (which hurt it badly) is one
example.
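
(for reference, the pattern in question - drivers of that era read the
hint from skb->xmit_more and defer the doorbell; my_* names are
placeholders:

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

        my_fill_tx_descriptor(skb);

        /* defer the expensive doorbell/MMIO kick to the last skb of a
         * batch; a win on big NICs, a loss on ar71xx where the extra
         * test and the deferral cost more than the write they saved */
        if (!skb->xmit_more || netif_xmit_stopped(txq))
                my_ring_doorbell(dev);

        return NETDEV_TX_OK;
}

and that conditional sits in the hottest path in the driver.)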

Given something *cool* now landing ( skb list batching, see lwn ) it
would be worthwhile to revisit this. I don't care if I get more
interrupts/sec (particularly on a multicore) if we could drain the rx
ring over smaller intervals...

but that's me, I'm all about the latency. :) Nobody's willing to rip
the latency out of stuff, they'd rather add features. It's really hard
to correctly measure interrupt latency regardless.
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Rosen Penev
2018-08-27 00:55:22 UTC
Permalink
Post by Dave Taht
Less code, particularly near interrupt time, is better than a lot.
Adding XMIT_MORE to the ar71xx driver (which hurt it badly) is one
example.
I've been looking a lot at ag71xx. The driver has so much low hanging
fruit. Unfortunately I lack the knowledge to fix most of it.

This patch for example gives me a ~20-30mbps improvement in iperf
tests: https://pastebin.com/ZExWjXQZ

It's kind of unfortunate given how much atheros hardware is out there.
Dave Taht
2018-08-27 01:21:46 UTC
Permalink
Post by Rosen Penev
I've been looking a lot at ag71xx. The driver has so much low hanging
fruit. Unfortunately I lack the knowledge to fix most of it.
This patch for example gives me a ~20-30mbps improvement in iperf
tests: https://pastebin.com/ZExWjXQZ
cool! that driver as widely used as it is has never had enough eyeballs on it.

Have you put that patch in front of the openwrt folk? I note that that
driver didn't
used to live in that part of the tree - have they finally moved over?
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Rosen Penev
2018-08-27 01:33:45 UTC
Permalink
Post by Dave Taht
cool! that driver as widely used as it is has never had enough eyeballs on it.
No. Definitely not enough.
Post by Dave Taht
Have you put that patch in front of the openwrt folk? I note that that
driver didn't use to live in that part of the tree - have they finally
moved over?
Yeah I did. Felix rejected it on the grounds that he didn't understand
it. So I've just been keeping it in my tree. The patch itself comes
from an old SDK by Qualcomm.
Dave Taht
2018-08-27 01:45:11 UTC
Permalink
Post by Rosen Penev
Yeah I did. Felix rejected it on the grounds that he didn't understand
it. So I've just been keeping it in my tree. The patch itself comes
from an old SDK by Qualcomm.
Hmm. By eyeball I can't see how it could speed up things that much either. :)

But I used to be willing to benchmark. Not this month, though.
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Rosen Penev
2018-08-27 01:53:54 UTC
Permalink
Post by Dave Taht
Hmm. By eyeball I can't see how it could speed up things that much either. :)
Another user and I think it causes fewer cache misses.

I ported one of Felix's ramips patches that got rid of
netdev_alloc_frag, and was disappointed to find that performance
dropped by 40 Mbps. That was fixed just by moving the new struct
member higher.
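A handy way to spot that kind of thing is pahole from the dwarves
package, which prints a struct's layout with its padding holes and
cacheline boundaries; a minimal sketch, assuming an object built with
debug info (the struct name and file here are just illustrative):

# Show member offsets, padding holes, and cacheline boundaries.
# Hot members pushed past the first cacheline, or members straddling
# one, are candidates for moving "higher" in the struct.
pahole -C ag71xx_ring ag71xx.o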

So yeah. Lots of low hanging fruit.
Post by Dave Taht
But I used to be willing to benchmark. Not this month, though.
No worries.
Mikael Abrahamsson
2018-08-27 11:33:08 UTC
Permalink
Post by Dave Taht
I was on that thread. It was entirely broken before. As for the single
interrupt on this chip variant - believe it or not, I'm not huge on
lots of different interrupts for everything.
When doing 10GE tests on x86-64, I got the highest performance when I
set the interrupt affinity to a single core per interface.
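For reference, a minimal sketch of pinning interrupts that way (the
IRQ numbers and interface names are made up; check /proc/interrupts
for the real ones, and stop irqbalance first so it doesn't rewrite
the masks behind your back):

# Find the IRQ numbers assigned to each interface:
grep -E 'eth[01]' /proc/interrupts
# Pin each interface's interrupt to its own core:
echo 0 > /proc/irq/24/smp_affinity_list   # eth0 -> CPU 0
echo 1 > /proc/irq/25/smp_affinity_list   # eth1 -> CPU 1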
--
Mikael Abrahamsson email: ***@swm.pp.se
Pete Heist
2018-08-23 08:26:19 UTC
Permalink
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf
Thanks for sharing, this is really useful, raising awareness where it matters. Quite a bit of content... :)

Ubiquiti needs some work getting this into more of their products (EdgeMAX in particular). A good time to lobby for this might have been, well, a couple of months ago, as they’re producing alpha builds for their upcoming 2.0 release with kernel 4.9 and new Cavium/Mediatek/Octeon SDKs. I just asked about the status in the EdgeRouter Beta forum, in case it finds the right eyes before the release:

https://community.ubnt.com/t5/EdgeRouter-Beta/BQL-support/m-p/2466657

https://community.ubnt.com/t5/EdgeMAX-Beta-Blog/New-EdgeRouter-firmware-2-0-0-alpha-2-has-been-released/ba-p/2414938
Mikael Abrahamsson
2018-08-23 10:51:42 UTC
Permalink
Post by Pete Heist
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf
Thanks for sharing, this is really useful, raising awareness where it matters. Quite a bit of content... :)
https://community.ubnt.com/t5/EdgeRouter-Beta/BQL-support/m-p/2466657
https://community.ubnt.com/t5/EdgeMAX-Beta-Blog/New-EdgeRouter-firmware-2-0-0-alpha-2-has-been-released/ba-p/2414938
My only experience with these devices is the EdgeRouter 3/5/X, and
they have very low performance if you disable offloads (which you need
to do to enable AQM) and run everything on the CPU - around 100
megabit/s of uni-directional traffic.

Do they have other platforms where this would actually matter?
--
Mikael Abrahamsson email: ***@swm.pp.se
Pete Heist
2018-08-23 11:38:01 UTC
Permalink
Post by Pete Heist
Thanks for sharing, this is really useful, raising awareness where it matters. Quite a bit of content... :)
https://community.ubnt.com/t5/EdgeRouter-Beta/BQL-support/m-p/2466657
https://community.ubnt.com/t5/EdgeMAX-Beta-Blog/New-EdgeRouter-firmware-2-0-0-alpha-2-has-been-released/ba-p/2414938
Post by Mikael Abrahamsson
My only experience with these devices is the EdgeRouter 3/5/X, and they have very low performance if you disable offloads (which you need to do to enable AQM) and run everything on the CPU, around 100 megabit/s of uni-directional traffic.
I have a similar experience with my ER-X: with soft rate limiting (the only thing allowed in the UI), 120-140 Mbit/s is the upper limit.
Do they have other platforms where this would actually matter?
One can add fq_codel or Cake from the command line (many are doing so), and some may be using 100 Mbit Ethernet. In that case, BQL could be useful even on these devices when run at line rate, without soft rate limiting.
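For concreteness, a minimal sketch of that kind of command-line setup
(the interface name and rate are placeholders; the Cake line assumes a
kernel or package that ships the qdisc):

# fq_codel at line rate; BQL keeps the driver's tx ring short underneath:
tc qdisc replace dev eth0 root fq_codel
# or Cake with a soft rate limit, for when the bottleneck is elsewhere:
tc qdisc replace dev eth0 root cake bandwidth 95mbit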

I’m also assuming that their higher-end products are capable of fq_codel at 1 Gbit, and that some may use those devices at line rate. Any time this is the case, even if only during bursts, BQL would be useful, I suppose...
Pete Heist
2018-08-24 17:13:54 UTC
Permalink
Post by Pete Heist
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf
Thanks for sharing, this is really useful, raising awareness where it matters. Quite a bit of content... :)
https://community.ubnt.com/t5/EdgeRouter-Beta/BQL-support/m-p/2466657
This started a discussion, and no, so far it looks like there’s no BQL support in the upcoming 2.0 release.

For my own benefit, re-reading the original patch series comment (https://lwn.net/Articles/469652/) makes it sound like BQL is useful even without AQM (the original benchmarks were done with straight pfifo_fast). I didn’t realize this, actually. If anything incorrect about BQL was said in this discussion, correct us, please :)
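One way to check that for yourself is a before/after latency-under-load
run with flent's rrul test (the hostname is a placeholder; it needs a
netserver instance on the far end), roughly:

# Run rrul against a netperf server and render the combined plot;
# repeat with BQL effectively disabled and compare the induced latency.
flent rrul -p all_scaled -l 60 -H netperf.example.com -t bql-on -o bql-on.png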

Pete
Dave Taht
2018-08-24 17:30:22 UTC
Permalink
Post by Pete Heist
This started a discussion, and no, so far it looks like there’s no BQL support in the upcoming 2.0 release.
For my own benefit, re-reading the original patch series comment (https://lwn.net/Articles/469652/) makes it sound like BQL is useful even without AQM (the original benchmarks were done with straight pfifo_fast). I didn’t realize this, actually. If anything incorrect about BQL was said in this discussion, correct us, please… :)
Yes, BQL is very useful even with pfifo_fast. Without BQL I doubt the
internet would be scaling as it is today in the DC, or on the smaller
hosts and devices that support it. It's in mvneta, it's in ar71xx,
with documented results there that I could dig up (though things like
TSQ are helping and mask the problem on simple tests). The experiment
I documented on the slides that kicked off this thread, and the other
experiment on the systemd bug, easily show the benefit on hosts
forwarding packets (be they from local applications, from various
sources like docker containers, etc), and anyone can show what goes
wrong if you disable BQL nowadays, basically restoring linux-3.3
behavior, with a very simple test:

# Effectively disable BQL by forcing a huge minimum byte limit
# on every tx queue of the device:
for i in /sys/class/net/your_device/queues/tx*/byte_queue_limits/limit_min
do
    echo 10000000 > $i
done

so long as you run enough kinds of flows that don't engage TSQ.
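To watch BQL react while such a test runs, the per-queue sysfs
counters can be read directly; a small sketch, with the device and
queue names as placeholders:

# 'inflight' is bytes currently queued to the tx ring; 'limit' is the
# cap BQL has computed, clamped between limit_min and limit_max.
watch -n1 grep . /sys/class/net/your_device/queues/tx-0/byte_queue_limits/*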

However, in the edgerouter-with-offloads case, all that part of the
stack has been short-circuited into the offload engine. I don't know
how much buffering is in there on the new firmware; I'd done a few
tests on it in the old days showing it to be around 10ms at gigE, but
even that memory is kind of vague (the easy test here is to slam two
ports into one), and for all I know the new firmware is worse, without
going back to track this new release. (I do have a few edgerouters,
but they are all in production.)

There was also a paper on BQL a few years back that I can dig up....
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
Dave Taht
2018-08-24 18:43:01 UTC
Permalink
Post by Dave Taht
There was also a paper on BQL a few years back that I can dig up....
The only academic analysis of BQL I knew of was this one, "bufferbloat
systemic analysis": http://200 dot 131 dot 219 dot 61/publications/2014/its2014_bb.pdf
- note that bufferbloat.net's filters don't let me post numeric urls;
you can find the paywalled versions by searching for that title on
google scholar, or on sci-hub.

I found that again by re-reading my preso to sigcomm 2014, "The value
of repeatable experiments and negative results"
(https://conferences.sigcomm.org/sigcomm/2014/doc/slides/137.pdf),
which - in addition to providing some valuable history and links to
the bufferbloat, fq, and aqm efforts - is really one of my best rants
*ever* aimed at the academic research and publication process.

I enjoyed writing that, and giving the preso *a lot*. For some reason
or another sigcomm has not invited me back. :)
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619