Discussion:
one benefit of turning off shaping + fq_codel
Dave Taht
2018-11-13 16:54:15 UTC
It turns out we are contributing to global warming.

https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Mikael Abrahamsson
2018-11-14 17:09:07 UTC
Post by Dave Taht
It turns out we are contributing to global warming.
https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
There is a reason vendors have packet accelerators. It's more efficient
compared to doing everything in CPU.
--
Mikael Abrahamsson email: ***@swm.pp.se
David Lang
2018-11-15 00:56:10 UTC
Post by Dave Taht
Post by Dave Taht
It turns out we are contributing to global warming.
https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
so how much power is wasted in re-transmitting packets due to bloat?
Dave Taht
2018-11-15 03:44:38 UTC
Post by David Lang
Post by Dave Taht
Post by Dave Taht
It turns out we are contributing to global warming.
https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
so how much power is wasted in re-transmitting packets due to bloat?
That might be a good way to look at it also. It seems possible
to make the calculation in time for midnight of March 31st.
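The shape of that calculation might look like the sketch below; every number in it is an illustrative assumption, not a measurement, and getting a defensible joules-per-byte figure is of course the hard part:

```python
# Rough sketch: energy wasted on bloat-induced retransmissions.
# All numbers below are placeholder assumptions, not measurements.

def wasted_retransmit_energy_kwh(total_bytes, retransmit_fraction,
                                 joules_per_byte):
    """Energy spent moving bytes that only exist because earlier
    copies were dropped or timed out in overlong queues."""
    wasted_joules = total_bytes * retransmit_fraction * joules_per_byte
    return wasted_joules / 3.6e6  # joules per kWh

# Assumed: 1 PB of traffic, 2% retransmits, 1e-7 J per delivered byte.
kwh = wasted_retransmit_energy_kwh(1e15, 0.02, 1e-7)  # ~0.56 kWh
```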
Post by David Lang
_______________________________________________
Bloat mailing list
https://lists.bufferbloat.net/listinfo/bloat
Pete Heist
2018-11-23 11:47:57 UTC
Post by Dave Taht
It turns out we are contributing to global warming.
https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
Would it be right to say that the biggest opportunity for reducing consumption is to avoid shaping, i.e. by adding BQL-like functionality to all classes of device drivers, and/or by deploying congestion control globally that avoids the need for it?

Other ideas: move queue management into hardware, power network equipment with renewables, or just use the Internet less. :)

Pete

(I noticed an audience member brought this up in Toke’s thesis defense)
Dave Taht
2018-11-23 16:26:43 UTC
Post by Dave Taht
It turns out we are contributing to global warming.
https://community.ubnt.com/t5/UniFi-Routing-Switching/USG-temperature/m-p/2547046/highlight/true#M115060
Would it be right to say that the biggest opportunity for reducing
consumption is to avoid shaping, i.e. by adding BQL-like functionality
to all classes of device drivers
Shaping outbound with BQL's support for a dynamic interrupt would be
*free*. A few ethernet chips already have that. Basically you set a
register saying "you are really a 200Mbit interface, return a completion
interrupt after the equivalent of that amount of time has passed".

I can neither remember which chips can do this already, nor the name of
the BQL feature that does it, this morning.

But it's a register you twiddle and a simple divider circuit.
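As a sketch of the arithmetic such a divider would implement (the rates here are hypothetical, and the real mechanism is a hardware register, not software):

```python
def completion_delay_ns(packet_bytes, pretend_rate_bps):
    """The divider: wire time for a packet as if the link really ran
    at pretend_rate_bps (say, 200 Mbit/s on a GigE port). The NIC
    would hold the TX completion interrupt for this long."""
    return packet_bytes * 8 * 1e9 / pretend_rate_bps

# A 1500-byte packet at a pretend 200 Mbit/s: 60,000 ns = 60 us.
delay = completion_delay_ns(1500, 200e6)
```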

But outbound is not the problem for us from a heat generation standpoint...
Post by Dave Taht
and/or by deploying congestion control globally that avoids the need for it?
I think it would be interesting to compare energy per byte successfully
delivered across various technologies. Driving fiber lines is pretty
high energy, though, and I think (without a back of the envelope handy)
that that would be far more expensive than shaping currently is.

still, adding 6 °C to everybody's home router to shape inbound under
heavy load is pretty costly, both in energy and in reduced service life.
Post by Dave Taht
Other ideas: move queue management into hardware
I have increasingly high hopes for P4 and other forms of hardware to
finally do shaping and queue management right.

https://github.com/ralfkundel/p4-codel/issues/2

Back in the day, I was a huge fan of async logic, which I first
encountered via Caltech's CPU and later the AMULET.

https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU

This reduces power consumption enormously. The Caltech logic design
system is now open source, and I'd looked it over a few years ago hoping
I could use it to resurrect my ancient skills in this department. I
can't find it this morning, either. There's coffee around here
somewhere... My *big* interest in this tech was that it essentially
eliminates clock noise, so you can build a much more sensitive wireless
receiver with it. I got bitten by DRAMs being "too loud" on several occasions.

Fulcrum (before they got bought by Intel) used async logic in their switch chips.

I think (but am not sure) that the technique is undergoing a
renaissance in the AI chips. The big IBM chip uses it, and it just
totally makes sense, if you have zillions of small CPUs doing neural
networks, to only power them up when needed. No crazy P1, P2, P3 etc.
clock states are needed; the chip just speeds up or slows down as a
function of heat.

I've never really understood why it didn't take off. I think, in part,
it doesn't scale to wide buses well, and centrally clocked designs
are how most engineers, FPGAs, and code got designed since. Anything
with delay built into it seems hard for EEs to grasp... but I wish I
knew why, or had the time to go play with circuits again at a reasonable
scale.
Post by Dave Taht
power network
equipment with renewables, or just use the Internet less. :)
I am glad to see more of the former happening. A recent data center
design in Singapore basically needed its own nuclear power plant.

In my case I've always wanted the computing to take place under the
user's fingers; I do not like the centralization trend we are in today at
all. I like that Apple seems to be leading the way in putting all
these cool new AI tools in your own hands.

As for the latter... I'm using browsers less now (emacs rocks), and
seem to be getting more done.
Post by Dave Taht
Pete
(I noticed an audience member brought this up in Toke’s thesis
defense)
I sadly slept through that. I hope it was recorded.
Jonathan Morton
2018-11-23 16:43:02 UTC
Post by Dave Taht
Post by Dave Taht
(I noticed an audience member brought this up in Toke’s thesis
defense)
I sadly slept through that. I hope it was recorded.
As it was streamed on YT, YT archived it.

- Jonathan Morton
Dave Taht
2018-11-23 16:48:43 UTC
Ahhh.... good. My morning has improved. I found the coffee, and I have
something way more interesting than CNN on.


Post by Jonathan Morton
Post by Dave Taht
Post by Dave Taht
(I noticed an audience member brought this up in Toke’s thesis
defense)
I sadly slept through that. I hope it was recorded.
As it was streamed on YT, YT archived it.
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Luca Muscariello
2018-11-23 17:16:54 UTC
Yes there was some discussion about that.
Moving things to hardware should fix that.

Even traffic management in NPU-based routers makes use of hardware-based
polling for shaping. These are trade-offs one has to face all the time.

There has been a discussion at the defense about hardware vs software,
hardware + software, when one, when the other.


BTW Toke is Doctor Toke now :-)
Post by Dave Taht
Ahhh.... good. My morning has improved. I found the coffee, and I have
something way more interesting that CNN on.
http://youtu.be/upvx6rpSLSw
Post by Jonathan Morton
Post by Dave Taht
Post by Pete Heist
(I noticed an audience member brought this up in Toke’s thesis defense)
I sadly slept through that. I hope it was recorded.
As it was streamed on YT, YT archived it.
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Dave Taht
2018-11-23 17:27:42 UTC
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
Post by Luca Muscariello
Yes there was some discussion about that.
Moving things to hardware should fix that.
Evens traffic management in NPU based routers makes use of hardware based polling for shaping. These are trade offs one has to face all the time.
There has been a discussion at the defense about hardware vs software, hardware + software, when one, when the other.
I'm still listening/watching.
Post by Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Post by Luca Muscariello
Post by Dave Taht
Ahhh.... good. My morning has improved. I found the coffee, and I have
something way more interesting that CNN on.
http://youtu.be/upvx6rpSLSw
Post by Jonathan Morton
Post by Dave Taht
Post by Dave Taht
(I noticed an audience member brought this up in Toke’s thesis
defense)
I sadly slept through that. I hope it was recorded.
As it was streamed on YT, YT archived it.
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Luca Muscariello
2018-11-23 17:32:38 UTC
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
Post by Luca Muscariello
Yes there was some discussion about that.
Moving things to hardware should fix that.
Evens traffic management in NPU based routers makes use of hardware
based polling for shaping. These are trade offs one has to face all the
time.
Post by Luca Muscariello
There has been a discussion at the defense about hardware vs software,
hardware + software, when one, when the other.
I'm still listening/watching.
Post by Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweat here. It's freezing, but I'm leaving already. Toke, I
think, has already started to eat and drink too much...
Post by Dave Taht
Post by Luca Muscariello
Post by Dave Taht
Ahhh.... good. My morning has improved. I found the coffee, and I have
something way more interesting that CNN on.
http://youtu.be/upvx6rpSLSw
Post by Jonathan Morton
Post by Dave Taht
Post by Pete Heist
(I noticed an audience member brought this up in Toke’s thesis
defense)
I sadly slept through that. I hope it was recorded.
As it was streamed on YT, YT archived it.
- Jonathan Morton
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Toke Høiland-Jørgensen
2018-11-25 21:14:17 UTC
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
Post by Luca Muscariello
Yes there was some discussion about that.
Moving things to hardware should fix that.
Evens traffic management in NPU based routers makes use of hardware
based polling for shaping. These are trade offs one has to face all the
time.
Post by Luca Muscariello
There has been a discussion at the defense about hardware vs software,
hardware + software, when one, when the other.
I'm still listening/watching.
Post by Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweet here. It’s freezing, but I’m leaving already. Toke I
think has already started to eat and drink too much...
Haha, indeed. Only catching up to email now... :D

Thanks for a fun discussion, Luca, and to everyone who listened in. I
had a tremendously fun three hours, hope everyone else did as well! :)

-Toke
Pete Heist
2018-11-26 12:52:39 UTC
Post by Toke Høiland-Jørgensen
Post by Luca Muscariello
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
Post by Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweet here. It’s freezing, but I’m leaving already. Toke I
think has already started to eat and drink too much...
Haha, indeed. Only catching up to email now... :D
Thanks for a fun discussion, Luca, and to everyone who listened in. I
had a tremendously fun three hours, hope everyone else did as well! :)
I did! Great work, Toke, and congratulations on the result! Wish there were more interesting discussions like that. :)

Pete
Dave Taht
2018-11-26 12:54:38 UTC
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweet here. It’s freezing, but I’m leaving already. Toke I
think has already started to eat and drink too much...
Haha, indeed. Only catching up to email now... :D
Thanks for a fun discussion, Luca, and to everyone who listened in. I
had a tremendously fun three hours, hope everyone else did as well! :)
I did- great work Toke and congratulations on the result! Wish there were more interesting discussions like that. :)
I had great difficulty making it out. Would it be possible to get a transcript?
Post by Dave Taht
Pete
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Toke Høiland-Jørgensen
2018-11-26 13:30:14 UTC
Post by Dave Taht
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweet here. It’s freezing, but I’m leaving already. Toke I
think has already started to eat and drink too much...
Haha, indeed. Only catching up to email now... :D
Thanks for a fun discussion, Luca, and to everyone who listened in. I
had a tremendously fun three hours, hope everyone else did as well! :)
I did- great work Toke and congratulations on the result! Wish there
were more interesting discussions like that. :)
I had great difficulty making it out. Would it be possible to get a transcript?
Yeah, we only had the one mic, unfortunately. Youtube does an automatic
transcription that you can find if you press the three dots beneath the
video (or just turn on subtitles); but I don't think we have any
volunteers to do a manual one...

-Toke
Toke Høiland-Jørgensen
2018-11-26 13:26:57 UTC
Post by Pete Heist
Post by Toke Høiland-Jørgensen
Post by Dave Taht
On Fri, Nov 23, 2018 at 9:17 AM Luca Muscariello
Post by Luca Muscariello
BTW Toke is Doctor Toke now :-)
I hope you made him sweat, at least a little. :)
Difficult to sweet here. It’s freezing, but I’m leaving already. Toke I
think has already started to eat and drink too much...
Haha, indeed. Only catching up to email now... :D
Thanks for a fun discussion, Luca, and to everyone who listened in. I
had a tremendously fun three hours, hope everyone else did as well! :)
I did- great work Toke and congratulations on the result! Wish there
were more interesting discussions like that. :)
Thanks! :)

-Toke
Pete Heist
2018-11-24 11:49:49 UTC
Post by Dave Taht
Post by Pete Heist
Would it be right to say that the biggest opportunity for reducing
consumption is to avoid shaping, i.e. by adding BQL-like functionality
to all classes of device drivers
Shaping outbound with BQL's support for a dynamic interrupt would be
*free*. A few ethernet chips already have that. Basically you set a
register saying "you are really a 200Mbit interface, return a completion
interrupt after the equivalent of that amount of time has passed”.
Ok, for Intel I see something called “Interrupt Rate Limiting” on the XL710, which sets the number of microseconds between interrupts (section 4.2 in https://www.intel.com/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf). I don’t think that’s exactly it though.

I also wanted to suggest that something “BQL-like” be added to WiFi (I already saw discussion of that in make-wifi-fast), to ADSL (I guess that’s mostly proprietary stuff though?), and to other techs where it’s needed, so that we stop shaping whenever possible; as Toke mentioned in his defense, shaping is really a workaround anyway. I feel guilty now when shaping.
Post by Dave Taht
I can neither remember what chips can do this already, or the name of
the bql feature that does it, this morning.
But it's a register you twiddle and a simple divider circuit.
It sounds like there won’t always be fine-grained control over the rate.
Post by Dave Taht
But outbound is not the problem for us from a heat generation standpoint

Actually, why is inbound shaping that much harder on the CPU than outbound?
Post by Dave Taht
Post by Pete Heist
and/or by deploying congestion control globally that avoids the need for it?
I think it would be interesting to compare energy per byte successfully
delivered across various technologies. Driving fiber lines is pretty
high energy, though, and I think (without a back of envelope handy),
that that would be far more expensive than shaping currently is.
still, adding 6 C to everybody's home router to shape inbound under
heavy load is pretty costly both in energy and reduced service life.
I’m intrigued, and care about this topic. A few watts on millions of devices might at least make some difference. Analyzing where we stand in terms of energy per byte for different techs, and also what shaping does to this, might be a place to start.
Post by Dave Taht
Post by Pete Heist
Other ideas: move queue management into hardware
I have increasingly high hopes for P4 and other forms of hardware to
finally do shaping and queue management right.
https://github.com/ralfkundel/p4-codel/issues/2
Back in the day, I was a huge fan of async logic, which I first
encountered via caltech's cpu and later the amulet.
https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU
This reduces power consumption enormously.
There are things that make so much sense as to seem that they must eventually happen, and this is one of those things.
Post by Dave Taht
Post by Pete Heist
power network
equipment with renewables, or just use the Internet less. :)
I am glad to see more of the former happening. A recent data center
design in singapore basically needed it's own nuclear power plant.
At least it doesn’t emit CO2. :) I’m in the process of trying to build a low-cost solar/battery setup for my home equipment. It seems that a system that works ~100% of the time can be much more expensive than one that works ~95% of the time, especially in central Europe’s winters, where cloudy streaks can last for weeks, so I’ll probably accept the grid as a backup.
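For anyone sizing a similar system, the battery arithmetic that drives that cost difference can be sketched like this (the load, autonomy, and depth-of-discharge figures are assumptions for illustration):

```python
def battery_kwh(daily_load_kwh, autonomy_days, usable_fraction=0.8):
    """Battery capacity needed to ride out autonomy_days without sun,
    given the fraction of nameplate capacity you can safely cycle."""
    return daily_load_kwh * autonomy_days / usable_fraction

# Assumed 0.5 kWh/day of network gear. Riding out a 2-day cloudy
# streak vs a 14-day central-European one is a 7x storage difference:
small = battery_kwh(0.5, 2)    # ~1.25 kWh
large = battery_kwh(0.5, 14)   # ~8.75 kWh
```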
Post by Dave Taht
In my case I've always wanted the computing to take place under the
users fingers, I do not like the centralization trend we are in today at
all. I like that apple seems to be leading the way to be putting all
these cool new AI tools in your own hands.
As for the latter... I'm using browsers less now (emacs rocks), and
seem to be getting more done.
I’m with you, only ‘:s/emacs/vim/g’, but we won’t start that. :)
Holland, Jake
2018-11-27 18:14:01 UTC
On 2018-11-23, 08:33, "Dave Taht" <***@taht.net> wrote:
Back in the day, I was a huge fan of async logic, which I first
encountered via caltech's cpu and later the amulet.

https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU

...

I've never really understood why it didn't take off, I think, in part,
it doesn't scale to wide busses well, and that centrally clocked designs
are how most engineers and fpgas and code got designed since. Anything
with delay built into it seems hard for EEs to grasp.... but I wish I
knew why, or had the time to go play with circuits again at a reasonable
scale.

At the time, I was told the objections they got were that it uses about 2x the space for the same functionality, that space usage is approximately linear with chip cost, and that under load you still need reasonable cooling, so it was considered worthwhile only for some narrow use cases.

I don't really know enough to confirm or deny the claim, and the use cases may have gotten a lot closer to a good match by now, but this was the opinion of at least some of the people involved with the work, IIRC.
Stephen Hemminger
2018-11-27 18:31:14 UTC
On Tue, 27 Nov 2018 18:14:01 +0000
Post by Dave Taht
Back in the day, I was a huge fan of async logic, which I first
encountered via caltech's cpu and later the amulet.
https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU
...
I've never really understood why it didn't take off, I think, in part,
it doesn't scale to wide busses well, and that centrally clocked designs
are how most engineers and fpgas and code got designed since. Anything
with delay built into it seems hard for EEs to grasp.... but I wish I
knew why, or had the time to go play with circuits again at a reasonable
scale.
At the time, I was told the objections they got were that it uses about 2x the space for the same functionality, and space usage is approximately linear with the chip cost, and when under load you still need reasonable cooling, so it was only considered maybe worthwhile for some narrow use cases.
I don't really know enough to confirm or deny the claim, and the use cases may have gotten a lot closer to a good match by now, but this was the opinion of at least some of the people involved with the work, IIRC.
With asynchronous circuits there is too much unpredictability and instability.
I seem to remember there are even cases where two inputs arrive at once and the output is non-deterministic.
Dave Taht
2018-11-27 19:09:44 UTC
On Tue, Nov 27, 2018 at 10:31 AM Stephen Hemminger
Post by Stephen Hemminger
On Tue, 27 Nov 2018 18:14:01 +0000
Post by Dave Taht
Back in the day, I was a huge fan of async logic, which I first
encountered via caltech's cpu and later the amulet.
https://en.wikipedia.org/wiki/Asynchronous_circuit#Asynchronous_CPU
...
I've never really understood why it didn't take off, I think, in part,
it doesn't scale to wide busses well, and that centrally clocked designs
are how most engineers and fpgas and code got designed since. Anything
with delay built into it seems hard for EEs to grasp.... but I wish I
knew why, or had the time to go play with circuits again at a reasonable
scale.
At the time, I was told the objections they got were that it uses about 2x the space for the same functionality, and space usage is approximately linear with the chip cost, and when under load you still need reasonable cooling, so it was only considered maybe worthwhile for some narrow use cases.
And the ultimate cost here was unpredictable and numerous power
states, hyperthreading (which is looking likely to die post-Spectre), and
things like DPDK, which spins processors madly to keep up. I always
liked things like what Fulcrum did; I wish I knew more about their
switch designs...

Everybody knows I'm a fan of the Mill CPU, which has lots of little
optimizations close to each functional unit (among many other things,
using virtual memory internally for everything, and separating out the
PLB (protection level buffer) from the TLB). I would really like to
bring back an era where CPUs could context- or security-level switch
in 5 clocks.

Someday something like that will be built. Till then, the closest chip
to something I'd like to be working on for networks is how the XMOS is
designed: https://en.wikipedia.org/wiki/XMOS#xCORE_multicore_microcontrollers
- or https://www.xmos.com/developer/silicon/xcore200-ethernet which
has 1 MByte of single-clock SRAM on it.

"The xCORE architecture delivers, in hardware, many of the elements
that are usually seen in a real-time operating system (RTOS). This
includes the task scheduler, timers, I/O operations, and channel
communication. By eliminating sources of timing uncertainty
(interrupts, caches, buses and other shared resources), xCORE can
provide deterministic and predictable performance for many
applications. A task can typically respond in nanoseconds to events
such as external I/O or timers. This makes it possible to program
xCORE devices to perform hard real-time tasks that would otherwise
require dedicated hardware."

Nobody else's ethernet controllers work this way.
Post by Stephen Hemminger
Post by Dave Taht
I don't really know enough to confirm or deny the claim, and the use cases may have gotten a lot closer to a good match by now, but this was the opinion of at least some of the people involved with the work, IIRC.
With asynchronous circuits there is too much unpredictablity and instability.
Seem to remember there are even cases where two inputs arrive at once and output is non-determistic.
Yes, that was a big problem... in the 90s... but CPUs that didn't do
that *were* successfully designed.

I am the sort of character that is totally willing to toss out decades
of evolution in chip design in order to get better SNR for wireless.
:)

I wish I knew of a mailing list where I could get a definitive answer
on "modern problems with async circuits", or an update on the kind of
techniques the new AI chips were using to keep their power consumption
so low. I'll keep googling.
--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Pete Heist
2018-11-27 22:07:12 UTC
Post by Dave Taht
I wish I knew of a mailing list where I could get a definitive answer
on "modern problems with async circuits", or an update on the kind of
techniques the new AI chips were using to keep their power consumption
so low. I'll keep googling.
I’d be interested in knowing this as well. This gives some examples of async circuits: https://web.stanford.edu/class/archive/ee/ee371/ee371.1066/lectures/lect_12.pdf

Page 43, “Bottom Line” mentions that asynchronous design has “some delay matching / overhead issues”. Apparently delay matching means getting the signal outputs on two separate paths to arrive at the same time(?) Presumably overhead refers to the 2x space on the die previously mentioned, for completion detection. Pages 23-25 on “data-bundling constraints” might also highlight some other challenges. Some more current material would be interesting though...
Jonathan Morton
2018-11-27 22:33:45 UTC
Post by Dave Taht
I wish I knew of a mailing list where I could get a definitive answer
on "modern problems with async circuits", or an update on the kind of
techniques the new AI chips were using to keep their power consumption
so low. I'll keep googling.
I’d be interested in knowing this as well. This gives some examples of async circuits: https://web.stanford.edu/class/archive/ee/ee371/ee371.1066/lectures/lect_12.pdf
Page 43, “Bottom Line” mentions that asynchronous design has “some delay matching / overhead issues”. Apparently delay matching means getting the signal outputs on two separate paths to arrive at the same time(?) Presumably overhead refers to the 2x space on the die previously mentioned, for completion detection. Pages 23-25 on “data-bundling constraints” might also highlight some other challenges. Some more current material would be interesting though...
The area overhead is at least partly mitigated by the major advantage of not having to distribute and gate a coherent clock signal across the entire chip. I half-remember seeing a quote that distributing the clock represents about 30% of the area and/or power consumption of a modern deep-sub-micron design. This is area and power that is not directly contributing to functionality.

Generally there are two major styles of asynchronous logic:

1: Standard combinatorial logic stages accompanied by self-timing circuits with a matched delay, generally known as "bundled data". This style has little overhead (probably less than the clock distribution it replaces) but requires local timing closure (the timing circuit must have strictly *more* delay than the logic it accompanies) to assure correct functionality. I suspect that achieving local timing closure is easier than the global timing closure required by conventional synchronous logic.

2: Dual-rail QDI logic, in which completion is explicitly signalled by the arrival of a result. This almost completely eliminates timing closure from the logic correctness equation, but the area overhead can be substantial. Achieving maximum performance in this style can also be challenging, but suitable approaches do exist, eg:

https://brej.org/papers/mapld.pdf

Both styles can inherently adapt timings to thermal and voltage conditions within a design range without much explicit provisioning, and typically have much cleaner power load and EMI characteristics than synchronous logic. But as you can see from the above, the downsides typically associated with async logic tend to apply to one or the other of the styles, not to both at once.
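The local timing-closure requirement for the bundled-data style can be expressed as a one-line check; the delays and margin below are hypothetical numbers, just to make the constraint concrete:

```python
def bundled_data_closure_ok(worst_logic_delay_ns, matched_delay_ns,
                            margin_ns=0.2):
    """Bundled-data correctness: the matched delay line that signals
    'done' must be strictly slower than the worst-case combinational
    path it accompanies, plus a safety margin."""
    return matched_delay_ns > worst_logic_delay_ns + margin_ns

ok = bundled_data_closure_ok(worst_logic_delay_ns=1.5,
                             matched_delay_ns=2.0)  # True
```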

- Jonathan Morton
Jonathan Morton
2018-11-27 22:36:25 UTC
Just to add - I think the biggest impediment to experimentation in asynchronous logic is the complete absence of convenient Muller C-element gates in the 74-series logic family. If you want to build some, I recommend using NAND and OR gates as inputs to active-low SR flipflops.
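A behavioral sketch of that construction, in Python rather than hardware, just to show the hold semantics (this models the logic, not gate-level timing):

```python
def c_element(a, b, q_prev):
    """Muller C-element via the suggested construction: a NAND gate
    drives the active-low set input, and an OR gate drives the
    active-low reset input, of an SR flipflop."""
    n_set = int(not (a and b))  # low (asserted) only when a = b = 1
    n_rst = int(a or b)         # low (asserted) only when a = b = 0
    if n_set == 0:
        return 1                # both inputs high -> output goes high
    if n_rst == 0:
        return 0                # both inputs low -> output goes low
    return q_prev               # inputs disagree -> hold state

# The output only moves when the inputs agree:
q = 0
q = c_element(1, 0, q)  # hold: q stays 0
q = c_element(1, 1, q)  # both high: q -> 1
q = c_element(0, 1, q)  # hold: q stays 1
```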

- Jonathan Morton
Dave Taht
2018-11-28 07:23:55 UTC
Post by Jonathan Morton
Just to add - I think the biggest impediment to experimentation in
asynchronous logic is the complete absence of convenient Muller
C-element gates in the 74-series logic family. If you want to build
some, I recommend using NAND and OR gates as inputs to active-low SR
flipflops.
Need millions of transistors, not dozens. :)

To me the biggest barrier is in tools. I'm still looking for the Caltech
tool and language which really helped in thinking this way; I did find
it on github once, and it still seemed to be under development....

And the field is not entirely dead, after all. I keep meaning to pick up
one of the new RISC-V boards. Here's an async design of a RISC-V... in
Go of all things. (I also really hate the universal adoption of Java
amongst the circuit design folk... and I really loved the prospects of
Chisel, except for the JVM dependency):

https://www.inf.pucrs.br/~calazans/publications/2017_MarcosSartori_EoTW.pdf

In the RISC-V world, well, it's still trundling forward.

https://www.lowrisc.org/about/

This is pretty neat - standby is 2uA:

https://greenwaves-technologies.com/en/gap8-product/

And pulp is pretty neat.

https://pulp-platform.org//

Still, I liked XMOS's stuff... REX Computing hasn't surfaced in a while.

In the last weird embedded-hardware news of the day, you can get an old
Intel Compute Stick for 34 dollars on eBay.

https://www.ebay.com/p/Intel-Compute-Stick-STCK1A8LFC-Intel-Atom-Z3735F-1-33GHz-8GB-PC-Stick-BOXSTCK1A8LFC/11020833331?iid=153273128090&chn=ps

they were painfully slow but fit on your keychain. The most modern
version of this design is
https://www.amazon.com/Intel-Compute-Computer-processor-BOXSTK2m3W64CC/dp/B01AZC4IKK/ref=sr_1_4?s=electronics&ie=UTF8&qid=1543389564&sr=1-4&keywords=intel+compute+stick

2 cores, 4MB of cache, 64GB of flash... on your keychain.

I rather miss VGA, in that it would be better to be able to screw these in...
Holland, Jake
2018-11-27 19:11:00 UTC
On 2018-11-27, 10:31, "Stephen Hemminger" <***@networkplumber.org> wrote:
With asynchronous circuits there is too much unpredictablity and instability.
Seem to remember there are even cases where two inputs arrive at once and output is non-determistic.

IIRC they talked about that some too. I think maybe some papers were going back and forth. But last I heard, they proved that this is not a real objection, in that:
1. you can quantify the probability of failure and ensure a design keeps it under threshold when operating within specified conditions (e.g. normal temperature and voltage thresholds),
2. you can work around the issues where it's critical by adding failure detection and fault handling, and
3. you have the exact same fundamental theoretical problem with synchronous circuits, particularly in registers that can keep a value through a clock cycle, but it hasn't stopped them from being useful.

I'm not an expert and this was all a long time ago for me, but the QDI wiki page doesn't disagree with what I'm remembering here, and has some good references on the topic:
https://en.wikipedia.org/wiki/Quasi-delay-insensitive_circuit#Stability_and_non-interference
https://en.wikipedia.org/wiki/Quasi-delay-insensitive_circuit#Timing
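Quantifying the probability of failure (point 1 above, and the same issue synchronous registers face in point 3) is usually done with the classic synchronizer MTBF estimate; here is a sketch, where the time constant and window are process-dependent placeholder values:

```python
import math

def synchronizer_mtbf_s(t_resolve_s, tau_s, t_window_s,
                        f_clock_hz, f_data_hz):
    """Classic mean-time-between-failures estimate for a flipflop
    sampling an asynchronous input:
        MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data)
    tau (regeneration time constant) and T_w (metastability window)
    are process-dependent; the values below are placeholders."""
    return (math.exp(t_resolve_s / tau_s)
            / (t_window_s * f_clock_hz * f_data_hz))

# Assumed: 2 ns settling budget, tau = 50 ps, 20 ps window,
# 500 MHz clock, 10 MHz of asynchronous events.
mtbf = synchronizer_mtbf_s(2e-9, 50e-12, 20e-12, 500e6, 10e6)
```

The exponential dependence on settling time is why a small extra resolution budget drives the failure rate below any threshold you care about.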