Discussion:
DETNET
Matthias Tafelmeier
2017-11-04 13:45:13 UTC
Permalink
Perceived that as shareworthy/entertaining ..

https://tools.ietf.org/html/draft-ietf-detnet-architecture-03#section-4.5

without wanting to belittle it.

--
Besten Gruß

Matthias Tafelmeier
Bob Briscoe
2017-11-12 19:58:26 UTC
Permalink
Matthias, Dave,

The sort of industrial control applications that detnet is targeting
require far lower queuing delay and jitter than fq_CoDel can give. They
have thrown around numbers like 250us jitter and 1E-9 to 1E-12 packet
loss probability.

However, like you, I just sigh when I see the behemoth detnet is building.

Nonetheless, it's important to have a debate about where to go to next.
Personally I don't think fq_CoDel alone has legs to get (that) much better.

I prefer the direction that Mohammad Alizadeh's HULL pointed in:
Less is More: Trading a little Bandwidth for Ultra-Low Latency in the
Data Center <https://people.csail.mit.edu/alizadeh/papers/hull-nsdi12.pdf>

In HULL you have i) a virtual queue that models what the queue would be
if the link were slightly slower, and marks with ECN based on that;
ii) a much more well-behaved TCP (HULL uses DCTCP with hardware pacing
in the NICs).
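
To make i) concrete, here is a minimal sketch in C of that marking step:
a virtual byte counter drained at a fraction of the line rate, with ECN
marking once the virtual backlog crosses a threshold. Names and constants
are illustrative, not HULL's actual implementation.

/* HULL-style phantom/virtual queue sketch: the counter drains at
 * gamma * link rate (gamma < 1) and packets are ECN-marked when the
 * *virtual* backlog, not the real FIFO, exceeds a threshold. */
#include <stdbool.h>
#include <stdint.h>

struct vqueue {
    double   gamma;        /* virtual drain rate as fraction of link rate, e.g. 0.95 */
    double   link_bps;     /* physical link rate in bits/s */
    double   vbytes;       /* current virtual backlog in bytes */
    double   mark_thresh;  /* marking threshold in bytes */
    uint64_t last_ns;      /* timestamp of the previous update */
};

/* Called per packet on the egress path; returns true if the packet
 * should carry an ECN CE mark. */
static bool vq_mark(struct vqueue *vq, uint32_t pkt_len, uint64_t now_ns)
{
    /* Drain the virtual queue for the elapsed time at gamma * link rate. */
    double elapsed_s = (now_ns - vq->last_ns) / 1e9;
    double drained   = elapsed_s * vq->gamma * vq->link_bps / 8.0;

    vq->vbytes  = vq->vbytes > drained ? vq->vbytes - drained : 0.0;
    vq->last_ns = now_ns;

    /* Account the packet against the virtual queue, then compare. */
    vq->vbytes += pkt_len;
    return vq->vbytes > vq->mark_thresh;
}

Because the virtual queue fills up before the real one does, the real FIFO
stays almost empty while the bottleneck still runs near capacity.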

I would love to be able to demonstrate that HULL can achieve the same
extremely low latency and loss targets as detnet, but with a fraction of
the complexity.

*Queuing latency?* This keeps the real FIFO queue in the low hundreds to
tens of microseconds.

*Loss prob?* Mohammad doesn't recall seeing a loss during the entire
period of the experiments, but he doubted their measurement
infrastructure was sufficiently accurate (or went on long enough) to be
sure they were able to detect one loss per 10^12 packets.

For their research prototype, HULL used a dongle they built, plugged
into each output port to constrict the link in order to shift the AQM
out of the box. However, Broadcom mid-range chipsets already contain
virtual queue hardware (courtesy of a project we did with them when I
was at BT:
How to Build a Virtual Queue from Two Leaky Buckets (and why one is not
enough) <http://bobbriscoe.net/pubs.html#vq2lb> ).

*For public Internet, not just for DCs?* You might have seen the work
we've done (L4S <https://riteproject.eu/dctth/>) to get queuing delay
over regular public Internet and broadband down to about mean 500us;
90%-ile 1ms, by making DCTCP deployable alongside existing Internet
traffic (unlike HULL, pacing at the source is in Linux, not hardware).
My personal roadmap for that is to introduce virtual queues at some
future stage, to get down to the sort of delays that detnet wants, but
over the public Internet with just FIFOs.

Radio links are harder, of course, but a lot of us are working on that too.



Bob

On 12/11/2017 22:58, Matthias Tafelmeier wrote:
> On 11/07/2017 01:36 AM, Dave Taht wrote:
>>> Perceived that as shareworthy/entertaining ..
>>>
>>> https://tools.ietf.org/html/draft-ietf-detnet-architecture-03#section-4.5
>>>
>>> without wanting to belittle it.
>> Hope springs eternal that they might want to look over the relevant
>> codel and fq_codel RFCS at some point or another.
>
> Not sure, appears like juxtaposing classical mechanics to nanoscale
> physics.
>
> --
> Besten Gruß
>
> Matthias Tafelmeier
>
>
>
> _______________________________________________
> Bloat mailing list
> ***@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
Matthias Tafelmeier
2017-11-13 17:56:09 UTC
Permalink
> However, like you, I just sigh when I see the behemoth detnet is building.
>
Does it? Well, so far the scope seems justifiable for what they want to
achieve, at least as far as I can tell from these still rather abstract
concepts.


>> The sort of industrial control applications that detnet is targeting
>> require far lower queuing delay and jitter than fq_CoDel can give.
>> They have thrown around numbers like 250us jitter and 1E-9 to 1E-12
>> packet loss probability.
>>
> Nonetheless, it's important to have a debate about where to go to
> next. Personally I don't think fq_CoDel alone has legs to get (that)
> much better.
>
Certainly, all you said is valid - as I stated, I mostly wanted to share
the digest/the existence of the initiative without
judging/reproaching/preaching ...

> I prefer the direction that Mohammad Alizadeh's HULL pointed in:
> Less is More: Trading a little Bandwidth for Ultra-Low Latency in the
> Data Center <https://people.csail.mit.edu/alizadeh/papers/hull-nsdi12.pdf>
>
> In HULL you have i) a virtual queue that models what the queue would
> be if the link were slightly slower, then marks with ECN based on
> that. ii) a much more well-behaved TCP (HULL uses DCTCP with hardware
> pacing in the NICs).
>
> I would love to be able to demonstrate that HULL can achieve the same
> extremely low latency and loss targets as detnet, but with a fraction
> of the complexity.
>
Well, if it already calls for specific HW, then I'd prefer to see RDMA in
place right away, getting rid of IRQs and other TCP/IP-specific rust
along the way, at least for DC realms :) Although, this HULL might have a
spin for it from an economics perspective.

> *For public Internet, not just for DCs?* You might have seen the work
> we've done (L4S <https://riteproject.eu/dctth/>) to get queuing delay
> over regular public Internet and broadband down to about mean 500us;
> 90%-ile 1ms, by making DCTCP deployable alongside existing Internet
> traffic (unlike HULL, pacing at the source is in Linux, not hardware).
> My personal roadmap for that is to introduce virtual queues at some
> future stage, to get down to the sort of delays that detnet wants, but
> over the public Internet with just FIFOs.
>
>
Thanks for sharing, that sounds thrilling - especially the achieved
latencies and the absence of special HW requirements. All the best with
it; then again, it is maybe more of an economic battle to overcome.

--
Besten Gruß

Matthias Tafelmeier
Dave Taht
2017-11-15 19:31:53 UTC
Permalink
Matthias Tafelmeier <***@gmx.net> writes:

> However, like you, I just sigh when I see the behemoth detnet is building.
>
> Does it? Well, so far the circumference seems justifiable for what they want
> to achieve, at least according to what I can tell from these rather still
> abstract concepts.
>
> The sort of industrial control applications that detnet is targeting
> require far lower queuing delay and jitter than fq_CoDel can give. They
> have thrown around numbers like 250us jitter and 1E-9 to 1E-12 packet
> loss probability.
>
> Nonetheless, it's important to have a debate about where to go to next.
> Personally I don't think fq_CoDel alone has legs to get (that) much better.

The place where bob and I always disconnect is that I care about
interflow latencies generally more than queuing latencies and prefer to
have strong incentives for non-queue building flows in the first
place. This results in solid latencies of 1/flows at your bandwidth. At
100Mbit, a single 1500 byte packet takes 130us to deliver, gbit, 13us,
10Gbit, 1.3us.
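
A back-of-the-envelope check of those figures, and of the worst-case
inter-flow wait you get from per-flow round-robin, assuming roughly 1538
bytes on the wire per 1500-byte MTU frame (preamble, Ethernet header, FCS
and inter-frame gap included):

/* Serialization delay per MTU frame and worst-case inter-flow wait
 * under DRR-style fair queuing (one packet per competing flow). */
#include <stdio.h>

int main(void)
{
    const double wire_bytes  = 1538.0;               /* assumed on-wire size */
    const double rates_bps[] = { 100e6, 1e9, 10e9 };
    const int    flows[]     = { 2, 20, 200 };

    for (int i = 0; i < 3; i++) {
        double us_per_pkt = wire_bytes * 8.0 / rates_bps[i] * 1e6;
        printf("%6.0f Mbit/s: %6.2f us/pkt, ~%5.0f us worst-case wait with %d flows\n",
               rates_bps[i] / 1e6, us_per_pkt, us_per_pkt * flows[i], flows[i]);
    }
    return 0;
}
/* Prints ~123, ~12.3 and ~1.23 us per packet (the ~130/13/1.3 us above are
 * the same numbers, rounded), and ~246 us of worst-case inter-flow wait in
 * all three cases -- right at the ~250 us jitter figure quoted earlier. */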

So for values of flows of 2, 20, 200, at these bandwidths, we meet
this detnet requirement. As for queuing, if you constrain the network diameter,
and use ECN, fq_codel can scale down quite a lot, but I agree there is
other work in this area that is promising.

However, underneath this, unless a shaper like htb or cake is used, is
additional unavoidable buffering in the device driver, at 1Gbit and
higher, managed by BQL. We've successfully used sch_cake to hold things
down to a single packet, soft-shaped, at speeds of 15Gbit or so, on high
end hardware.
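
If you want to see what BQL is doing on your own hardware, the per-queue
limits are exposed in sysfs; a rough sketch is below (the interface name,
queue index and the two-frame cap are just example values, and it needs
root):

/* Read the current BQL limit of one tx queue and clamp how far it may
 * grow, via /sys/class/net/<dev>/queues/tx-<n>/byte_queue_limits/. */
#include <stdio.h>

int main(void)
{
    const char *base = "/sys/class/net/eth0/queues/tx-0/byte_queue_limits";
    char path[256], buf[64];
    FILE *f;

    snprintf(path, sizeof(path), "%s/limit", base);
    if ((f = fopen(path, "r")) != NULL) {           /* current dynamic limit */
        if (fgets(buf, sizeof(buf), f))
            printf("current BQL limit: %s", buf);
        fclose(f);
    }

    snprintf(path, sizeof(path), "%s/limit_max", base);
    if ((f = fopen(path, "w")) != NULL) {           /* cap it at ~two frames */
        fprintf(f, "3028\n");
        fclose(f);
    }
    return 0;
}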

Now, I don't honestly know enough about detnet to say if any part of
this discussion actually applies to what they are trying to solve! and I
don't plan to look into it until the next IETF meeting.

I've been measuring overlying latencies elsewhere in the Linux kernel at
2-6us with a long tail to about 2ms for years now. There is a lot of
passionate work trying to get latencies down for small packets above
10Gbits in the Linux world.. but there, it's locking, and routing - not
queueing - that is the dominating factor.

>
>
>
> Certainly, all you said is valid - as I stated, I mostly wanted to share the
> digest/the existence of the initiative without judging/reproaching/preaching .
> ..
>
> I prefer the direction that Mohammad Alizadeh's HULL pointed in:
> Less is More: Trading a little Bandwidth for Ultra-Low Latency in the Data
> Center

I have adored all his work. DCTCP, HULL, one other paper... what's he
doing now?

> In HULL you have i) a virtual queue that models what the queue would be if
> the link were slightly slower, then marks with ECN based on that. ii) a much
> more well-behaved TCP (HULL uses DCTCP with hardware pacing in the NICs).

I do keep hoping that more folk will look at cake... it's a little
crufty right now, but we just added ack filtering to it and are starting
up a major set of test runs in December/January.

> I would love to be able to demonstrate that HULL can achieve the same
> extremely low latency and loss targets as detnet, but with a fraction of the
> complexity.
>
> Well, if it's already for specific HW, then I'd prefer to see RDMA in place
> right away with getting rid of IRQs and other TCP/IP specific rust along the
> way, at least for DC realms :) Although, this HULL might have a spin for it from
> economics perspective.

It would be good for more to read the proceedings from the recent netdev
conference:

https://lwn.net/Articles/738912/

>
> For public Internet, not just for DCs? You might have seen the work we've
> done (L4S) to get queuing delay over regular public Internet and broadband
> down to about mean 500us; 90%-ile 1ms, by making DCTCP deployable alongside
> existing Internet traffic (unlike HULL, pacing at the source is in Linux,
> not hardware). My personal roadmap for that is to introduce virtual queues
> at some future stage, to get down to the sort of delays that detnet wants,
> but over the public Internet with just FIFOs.

My personal goal is to just apply what we got to incrementally reduce
all delays from seconds to milliseconds, across the internet and on
every device I can fix. Stuff derived from the sqm-scripts is
universally available in third party firmware now, and in many devices.

Also: I'm really really happy with what we've done for wifi so far, I
think we can cut peak latencies by another factor of 3, maybe even 5,
with what we got coming up next from the make-wifi-fast project.

And that's mostly *driver* work, not abstract queuing theory.
Matthias Tafelmeier
2017-11-15 20:09:32 UTC
Permalink
On 11/15/2017 08:45 PM, Ken Birman wrote:
> I'm missing context. Can someone tell me why I'm being cc'ed on these?
Right, I cc'ed you, so I'm obliged to clarify.

> Topic seems germane to me (Derecho uses RDMA on RoCE) but I'm unclear what this thread is "about"

I perceive you as a strong advocate for RDMA, at least that's a
supposition I have held ever since reading your blogs/research around it.
RDMA was mentioned here alongside other existing or emerging
technologies in the context of improving network performance in
general - especially for certain domains. Therefore, I thought
blending/exchanging knowledge or inciting a discussion might be
profitable for both worlds.

Originally, this ML is for the Bufferbloat project, but occasionally
it does get hijacked for other topics ...

>> Feel free to tinker with your Bayes filter if this all sounds too alien.<<

--
Besten Gruß

Matthias Tafelmeier
Dave Taht
2017-11-15 20:16:12 UTC
Permalink
Matthias Tafelmeier <***@gmx.net> writes:

> On 11/15/2017 08:45 PM, Ken Birman wrote:
>> I'm missing context. Can someone tell me why I'm being cc'ed on these?
> right, I cced you so I'm liable to clarify.
>
>> Topic seems germane to me (Derecho uses RDMA on RoCE) but I'm unclear what this thread is "about"
>
> I perceived you as a strong kind of advocate for RDMA, at least that's a
> supposition I hold ever since having read your blogs/research around it.
> It was mentioned here in context with other existing or already
> pertaining technologies as to the purpose of improving network
> performance in general - especially for certain domains. Therefore, I
> thought blending/exchanging knowledge or inciting a discussion might be
> profitable for both worlds.

I'm notorious for trying to engage other folk slightly outside our
circle, thx for carrying on the grand tradition. I do admit it can be
quite a lot to get blindsided by!

>
> Originally, this ML is for the Bufferbloat project, but occasionally
> it does get hijacked for other topics ...

Well, the overarching goal here is to reduce latencies on everything
(hardware, interconnects, busses, drivers, oses, stacks) to the bare
minimum, on every technology we can reach, and leverage other ideas in
other technologies to do so, whenever possible.

If we had a name for that, other than bufferbloat, I'd switch to it,
because we are way beyond just dealing with excessive buffering after 6+
years of operation.

I will go read up on Ken's RDMA stuff.

>
>>> Feel free to tinker with your Bayes filter if this all sounds too alien.<<
Matthias Tafelmeier
2017-11-18 15:56:07 UTC
Permalink
Hello
> We would love to have users who are intent on breaking everything... all open source, no weird IP or anything, no plan to commercialize it.
>
> Ken

I'm certain there are; a small impediment, rather, seems to be the
hardware prerequisite. One does need an RDMA-capable NIC at least at the
edge nodes - and only Mellanox manufactures those atm., afaik. Hence, any
plans to form an open emulator or something along those lines?

--
Besten Gruß

Matthias Tafelmeier
Toke Høiland-Jørgensen
2017-12-11 20:32:23 UTC
Permalink
Ken Birman <***@cornell.edu> writes:

> If you love low latency and high speed, you will be blown away by
> http://www.cs.cornell.edu/derecho.pdf

That link is a 404 for me; do you have a working one? :)

-Toke
Ken Birman
2017-12-11 20:43:20 UTC
Permalink
Oh, sorry, my mistake! The URL is actually http://www.cs.cornell.edu/ken/derecho.pdf

The download site is http://GitHub.com/Derecho-Project

Ken

Matthias Tafelmeier
2017-11-18 15:38:52 UTC
Permalink
On 11/15/2017 08:31 PM, Dave Taht wrote:
>> However, like you, I just sigh when I see the behemoth detnet is building.
>>
>> Does it? Well, so far the circumference seems justifiable for what they want
>> to achieve, at least according to what I can tell from these rather still
>> abstract concepts.
>>
>> The sort of industrial control applications that detnet is targeting
>> require far lower queuing delay and jitter than fq_CoDel can give. They
>> have thrown around numbers like 250us jitter and 1E-9 to 1E-12 packet
>> loss probability.
>>
>> Nonetheless, it's important to have a debate about where to go to next.
>> Personally I don't think fq_CoDel alone has legs to get (that) much better.
> The place where bob and I always disconnect is that I care about
> interflow latencies generally more than queuing latencies and prefer to
> have strong incentives for non-queue building flows in the first
> place. This results in solid latencies of 1/flows at your bandwidth. At
> 100Mbit, a single 1500 byte packet takes 130us to deliver, gbit, 13us,
> 10Gbit, 1.3us.

A not necessarily well-informed question on that: couldn't this
marking-based virtual queueing be extended to a per-flow mechanism if
the marking loop was implemented in an efficient way?

--
Besten Gruß

Matthias Tafelmeier
Dave Taht
2017-11-19 18:33:02 UTC
Permalink
Matthias Tafelmeier <***@gmx.net> writes:

> On 11/15/2017 08:31 PM, Dave Taht wrote:
>
> Nonetheless, it's important to have a debate about where to go to next.
> Personally I don't think fq_CoDel alone has legs to get (that) much better.
>
> The place where bob and I always disconnect is that I care about
> interflow latencies generally more than queuing latencies and prefer to
> have strong incentives for non-queue building flows in the first
> place. This results in solid latencies of 1/flows at your bandwidth. At
> 100Mbit, a single 1500 byte packet takes 130us to deliver, gbit, 13us,
> 10Gbit, 1.3us.
>
> A not necessarily well-informed question on that: couldn't this marking-based
> virtual queueing be extended to a per-flow mechanism if the marking loop was
> implemented in an efficient way?

Bob's mechanism splits into 2 queues based on the presence of ecn in the
header. It is certainly possible to do fq with more queues within their
concept. :)
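
The split itself is cheap; a rough sketch of that kind of classifier is
below (queue names and the surrounding enqueue machinery are
placeholders, not an existing API):

/* Steer ECN-capable packets to a shallow low-latency queue and everything
 * else to the classic queue, based on the two ECN bits of the IP header. */
#include <stdint.h>

enum { ECN_NOT_ECT = 0, ECN_ECT1 = 1, ECN_ECT0 = 2, ECN_CE = 3 };
enum queue_id { CLASSIC_QUEUE, LOWLAT_QUEUE };

static enum queue_id classify_by_ecn(const uint8_t *ipv4_header)
{
    /* IPv4: ECN is the low two bits of the TOS/DSCP byte (offset 1). */
    uint8_t ecn = ipv4_header[1] & 0x03;

    return (ecn == ECN_NOT_ECT) ? CLASSIC_QUEUE : LOWLAT_QUEUE;
}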

I've tossed fq_codel onto their group's demo, which, if you toss any but
their kind of traffic through it, seemed to perform pretty well, and
even with... well, I'd like to see some long-RTT tests of their stuff vs
fq_codel.


At the moment, I'm heads down on trying to get sch_cake (
https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/ ), maybe
even the "cobalt" branch which has a codel/blue hybrid in it,
upstream.

Mellanox is making available a toolkit for some of their boards, I've
been meaning to take a look at it. ENOTIME.

On the high end, ethernet hardware is already doing the 5-tuple hash
needed for good FQ for us (this is used for sch_fq, fq_codel, and even
more importantly RSS steering), but cake uses a set-associative hash
and some other tricks that would be best also to push into hardware. I'd
love (or rather, an ISP would love) to be able to run 10k+ good shapers
in hardware.
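
For anyone unfamiliar with the term, a rough sketch of a set-associative
flow table is below; the sizes and the eviction rule are illustrative,
not cake's actual parameters.

/* The 5-tuple hash picks a set; a handful of "ways" inside the set are
 * searched for the flow's tag, which sharply reduces collisions between
 * flows compared to a direct-mapped table of the same size. */
#include <stdint.h>

#define SETS 128
#define WAYS 8

struct flow_slot {
    uint32_t tag;      /* upper hash bits, 0 means the slot is empty */
    uint32_t backlog;  /* per-flow state would live here */
};

static struct flow_slot table[SETS][WAYS];

static struct flow_slot *flow_lookup(uint32_t hash)
{
    struct flow_slot *set = table[hash % SETS];
    uint32_t tag = (hash / SETS) | 1;        /* never 0, so 0 can mean empty */

    for (int w = 0; w < WAYS; w++)           /* hit, or first free way */
        if (set[w].tag == tag || set[w].tag == 0) {
            set[w].tag = tag;
            return &set[w];
        }

    /* All ways occupied: evict way 0 here for brevity; a real table would
     * prefer an idle or least-recently-used way. */
    set[0].tag = tag;
    set[0].backlog = 0;
    return &set[0];
}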

...

At some point next year I'll take a look at detnet. Maybe.
Matthias Tafelmeier
2017-11-20 17:56:21 UTC
Permalink
> If this thread is about a specific scenario, maybe someone could point me to the OP where the scenario was first described?

I've forwarded the root of the thread to you - no specific scenario,
I was only sharing/referring to the DETNET IETF papers.

--
Besten Gruß

Matthias Tafelmeier
Ken Birman
2017-11-20 19:04:38 UTC
Permalink
Well, the trick of running RDMA side by side with TCP inside a datacenter using 2 Diffserv classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way. You get the highest level of determinism for the reliable one-sided write case, provided that your network only has switches and no routers in it (so in a Clos network, rack-scale cases, or perhaps racks with one TOR switch, but not the leaf and spine routers). The reason for this is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits). Switches always allow sending, and have full bisection bandwidth, and in this specific configuration of RDMA, the receiver grants permission at the time the one-sided receive buffer is registered, so after that setup, the delays will be a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, (3) queue priorities, when there are multiple queues sharing one NIC.
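
For concreteness, posting such a one-sided reliable write with libibverbs
looks roughly like the sketch below. It assumes the RC queue pair is
already connected and the remote buffer's address and rkey were exchanged
out of band; error handling is trimmed.

/* One-sided RDMA write: no receiver CPU involvement, completion is
 * reported only on the sender's completion queue. Link with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_one_sided_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                                void *local_buf, size_t len,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE;      /* write into the remote buffer */
    wr.send_flags = IBV_SEND_SIGNALED;      /* ask for a local completion */
    wr.wr.rdma.remote_addr = remote_addr;   /* registered remote address */
    wr.wr.rdma.rkey        = rkey;          /* granted at registration time */

    return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
}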

Other sources of non-determinism for hardware RDMA would include limited resources within the NIC itself. An RDMA NIC has to cache the DMA mappings for pages you are using, as well as qpair information for the connected qpairs. The DMA mapping itself has a two-level structure. So there are three kinds of caches, and each of these can become overfull. If that happens, the needed mapping is fetched from host memory, but this evicts data, so you can see a form of cache-miss-thrashing occur in which performance will degrade sharply. Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then they see performance collapse as they scale.

But you can control for essentially all of these factors.

You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory allocation area affinities (e.g. core X allocates block B, but then core Y tries to read or write into it), locking, etc. Those effects can be quite large. Getting Derecho to run at the full 100Gbps network rates was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole. Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object. So even a single copy operation, or a single checksum computation, can actually turn out to be by far the bottleneck -- and can be a huge source of non-determinism if you trigger this but only now and then, as with a garbage collected language.

Priority inversions are another big issue, at the OS thread level or in threaded applications. What happens with this case is that you have a lock and accidentally end up sharing it between a high priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority), and a lower priority thread (like any random application thread). If the application uses the thread priorities features of Linux/Mesos, this can exacerbate the chance of causing inversions.

So an inversion would arise if for high priority thread A to do something, like check a qpair for an enabled transfer, a lower priority thread B needs to run (like if B holds the lock but then got preempted). This is a rare race-condition sort of problem, but when it bites, A gets stuck until B runs. If C is high priority and doing something like busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C. Then B finishes, A resumes, etc. Non-deterministic network delay ensues.

So those are the kinds of examples I spend a lot of my time thinking about. The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options. The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.

I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video or GPU, for example (in this case the real issue is barriers: how do you know if the cache and other internal pipelines of that device flushed when the transfer into its memory occurred? Turns out that there is no hardware standard for this, and it might not always be a sure thing). If they use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism... yet it actually is an end-host "issue", not a network issue.

Ken

Matthias Tafelmeier
2017-12-17 12:46:26 UTC
Permalink
> Well, the trick of running RDMA side by side with TCP inside a datacenter using 2 Diffserv classes would fit the broad theme of deterministic traffic classes, provided that you use RDMA in just the right way. You get the highest level of determinism for the reliable one-sided write case, provided that your network only has switches and no routers in it (so in a Clos network, rack-scale cases, or perhaps racks with one TOR switch, but not the leaf and spine routers). The reason for this is that with routers you can have resource delays (RDMA sends only with permission, in the form of credits). Switches always allow sending, and have full bisection bandwidth, and in this specific configuration of RDMA, the receiver grants permission at the time the one-sided receive buffer is registered, so after that setup, the delays will be a function of (1) traffic on the sender NIC, (2) traffic on the receiver NIC, (3) queue priorities, when there are multiple queues sharing one NIC.
>
> Other sources of non-determinism for hardware RDMA would include limited resources within the NIC itself. An RDMA NIC has to cache the DMA mappings for pages you are using, as well as qpair information for the connected qpairs. The DMA mapping itself has a two-level structure. So there are three kinds of caches, and each of these can become overfull. If that happens, the needed mapping is fetched from host memory, but this evicts data, so you can see a form of cache-miss-thrashing occur in which performance will degrade sharply. Derecho avoids putting too much pressure on these NIC resources, but some systems accidentally overload one cache or another and then they see performance collapse as they scale.
>
> But you can control for essentially all of these factors.
>
> You would then only see non-determinism to the extent that your application triggers it, through paging, scheduling effects, poor memory allocation area affinities (e.g. core X allocates block B, but then core Y tries to read or write into it), locking, etc. Those effects can be quite large. Getting Derecho to run at the full 100Gbps network rates was really hard because of issues of these kinds -- and there are more and more papers reporting similar issues for Linux and Mesos as a whole. Copying will also kill performance: 100Gbps is faster than memcpy for a large, non-cached object. So even a single copy operation, or a single checksum computation, can actually turn out to be by far the bottleneck -- and can be a huge source of non-determinism if you trigger this but only now and then, as with a garbage collected language.
>
> Priority inversions are another big issue, at the OS thread level or in threaded applications. What happens with this case is that you have a lock and accidentally end up sharing it between a high priority thread (like an RDMA NIC, which acts like a super-thread with the highest possible priority), and a lower priority thread (like any random application thread). If the application uses the thread priorities features of Linux/Mesos, this can exacerbate the chance of causing inversions.
>
> So an inversion would arise if for high priority thread A to do something, like check a qpair for an enabled transfer, a lower priority thread B needs to run (like if B holds the lock but then got preempted). This is a rare race-condition sort of problem, but when it bites, A gets stuck until B runs. If C is high priority and doing something like busy-waiting for a doorbell from the RDMA NIC, or for a completion, C prevents B from running, and we get a form of deadlock that can persist until something manages to stall C. Then B finishes, A resumes, etc. Non-deterministic network delay ensues.
> So those are the kinds of examples I spend a lot of my time thinking about. The puzzle for me isn't at the lower levels -- RDMA and the routers already have reasonable options. The puzzle is that the software can introduce tons of non-determinism even at the very lowest kernel or container layers, more or less in the NIC itself or in the driver, or perhaps in memory management and thread scheduling.
>
> I could actually give more examples that relate to interactions between devices: networks plus DMA into a frame buffer for a video or GPU, for example (in this case the real issue is barriers: how do you know if the cache and other internal pipelines of that device flushed when the transfer into its memory occurred? Turns out that there is no hardware standard for this, and it might not always be a sure thing). If they use a sledgehammer solution, like a bus reset (available with RDMA), that's going to have a BIG impact on perceived network determinism... yet it actually is an end-host "issue", not a network issue.
>
>
Nothing to challenge here. For the scheduler part I only want to add -
you certainly know this better than me - that there are quite nifty
software techniques to practically eradicate at least the priority
inversion problem. Speaking only for Linux, I know there's quite some
movement in general on scheduler amendments for network processing at
the moment. Not sure whether vendors of embedded versions, or the RT
patch, haven't made it extinct already. Though that's not the point
you're making. Further, it would still leave the clock source as a
non-determinism introducer.
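
One of the techniques I have in mind is the priority-inheritance mutex
that the RT work builds on: the low-priority lock holder is temporarily
boosted to the priority of the highest waiter instead of being starved by
a busy-waiting high-priority thread. A minimal sketch, with return-value
checks omitted:

/* Initialize a priority-inheritance mutex (POSIX, supported by glibc on
 * Linux); threads contending on `lock` then avoid the inversion described
 * above. Link with -pthread. */
#include <pthread.h>

static pthread_mutex_t lock;

static void init_pi_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}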

Quite a valuable research endeavor would be to quantify all of those
traits and compare them to the characteristics of certain other
approaches, e.g., the quite promising Linux kernel busy polling [1]
mechanisms. All of them suffer from similar weaknesses; the open
question is which ones. I'm saying that a little briskly, without having
clearly thought through its feasibility for the time being.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
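
The basic per-socket knob that [1] builds on is a one-liner to try; a
minimal sketch, with the 50 us budget being an arbitrary example:

/* SO_BUSY_POLL asks the kernel to spin on the device queue for up to the
 * given number of microseconds instead of sleeping until an interrupt. */
#include <sys/socket.h>

int open_busy_polling_udp_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int busy_usec = 50;

    if (fd >= 0)
        setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_usec, sizeof(busy_usec));
    return fd;
}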

--
Besten Gruß

Matthias Tafelmeier
Ken Birman
2017-12-17 16:06:38 UTC
Permalink
I see this as a situation that really argues for systematic experiments, maybe a paper you could aim towards SIGCOMM IMC. Beyond a certain point, you simply need to pull out the stops and try to understand what the main obstacles to determinism turn out to be in practice, for realistic examples of systems that might need determinism (medical monitoring in a hospital, for example, or wide-area tracking of transients in the power grid, things of that sort).

In fact I can see how a case could be made for doing a series of such papers: one purely in network settings, one looking at datacenter environments (not virtualized, unless you want to do even one more paper), one looking at WAN distributed infrastructures. Maybe one for embedded systems like self-driving cars or planes.

I would find that sort of paper interesting if the work was done really well, and really broke down the causes for unpredictability, traced each one to a root cause, maybe even showed how to fix the issues identified.

There are also several dimensions to consider (one paper could still tackle multiple aspects): latency, throughput. And then there are sometimes important tradeoffs for people trying to run at the highest practical data rates versus those satisfied with very low rates, so you might also want to look at how the presented load impacts the stability of the system.

But I think the level of interest in this topic would be very high. The key is to acknowledge that with so many layers of hardware and software playing roles, only an experimental study can really shed much light. Very likely, this is like anything else: 99% of the variability is coming from 1% of the end-to-end pathway... fix that 1% and you'll find that there is a similar issue but emerging from some other place. But fix enough of them, and you could have a significant impact -- and industry would adopt solutions that really work...

Ken

Ken Birman
2017-11-18 15:45:43 UTC
Permalink
You can do that with packet tags and filtering

Sent from my iPhone

On Nov 18, 2017, at 10:39 AM, Matthias Tafelmeier <***@gmx.net> wrote:

On 11/15/2017 08:31 PM, Dave Taht wrote:

However, like you, I just sigh when I see the behemoth detnet is building.

Does it? Well, so far the circumference seems justifiable for what they want
to achieve, at least according to what I can tell from these rather still
abstract concepts.

The sort of industrial control applications that detnet is targeting
require far lower queuing delay and jitter than fq_CoDel can give. They
have thrown around numbers like 250us jitter and 1E-9 to 1E-12 packet
loss probability.

Nonetheless, it's important to have a debate about where to go to next.
Personally I don't think fq_CoDel alone has legs to get (that) much better.


The place where bob and I always disconnect is that I care about
interflow latencies generally more than queuing latencies and prefer to
have strong incentives for non-queue building flows in the first
place. This results in solid latencies of 1/flows at your bandwidth. At
100Mbit, a single 1500 byte packet takes 130us to deliver, gbit, 13us,
10Gbit, 1.3us.

A not necessarily well-informed question on that: couldn't this marking-based virtual queueing be extended to a per-flow mechanism if the marking loop was implemented in an efficient way?

--
Besten Gruß

Matthias Tafelmeier



Matthias Tafelmeier
2017-11-18 17:55:32 UTC
Permalink
On 11/15/2017 08:31 PM, Dave Taht wrote:
>> Well, if it's already for specific HW, then I'd prefer to see RDMA in place
>> right away with getting rid of IRQs and other TCP/IP specific rust along the
>> way, at least for DC realms :) Although, this HULL might have a spin for it from
>> economics perspective.
> It would be good for more to read the proceeds from the recent netdev
> conference:
>
> https://lwn.net/Articles/738912/
>
Especially ...


"TCP Issues (Eric Dumazet)

At 100gbit speeds with large round trip times we hit very serious
scalability issues in the TCP stack.

Particularly, retransmit queues of senders perform quite poorly. It has
always been a simple linked list."

is in line with the aforementioned.
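
A quick back-of-the-envelope illustration of why a linked-list retransmit
queue hurts at 100gbit with large RTTs (MTU-sized segments assumed):

/* Segments in flight = bandwidth-delay product / segment size; each SACK
 * may have to walk a list of roughly this length. */
#include <stdio.h>

int main(void)
{
    const double rate_bps = 100e9;
    const double rtts_s[] = { 0.001, 0.01, 0.1 };   /* 1, 10, 100 ms */

    for (int i = 0; i < 3; i++) {
        double inflight = rate_bps * rtts_s[i] / 8.0 / 1500.0;
        printf("RTT %5.0f ms: ~%.0f segments in flight\n",
               rtts_s[i] * 1e3, inflight);
    }
    return 0;
}
/* Roughly 8k, 83k and 833k segments -- which is part of why that queue has
 * since been reworked into an rb-tree keyed by sequence number. */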


--
Besten Gruß

Matthias Tafelmeier
Matthias Tafelmeier
2017-11-20 18:32:12 UTC
Permalink
>
> - RDMA didn’t work well on Ethernet until recently, but this was fixed
> by a technique called DCQCN (Mellanox), or its cousin TIMELY (Google).
> Microsoft recently had a SIGCOMM paper on running RDMA+DCQCN side by
> side with TCP/IP support their Azure platform, using a single data
> center network, 100Gb. They found it very feasible, although
> configuration of the system requires some sophistication. Azure
> supports Linux, Mesos, Windows, you name it.
Yes, here's the DCQCN paper:
https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf
A really soothing read for anyone interested in TCP congestion control,
actually. TIMELY is quite impressive, too. Wasn't aware of that, thanks
for sharing. Thought Google was refraining from using RDMA since it does
not effectively get rid of the tail-loss scenarios, but obviously that's
only for the WAN use case.

> The one thing they didn’t try was heavy virtualization. [...]
>
That should have been covered by now, though not by Microsoft - I saw
recent work being done by a certain virtualization vendor exploiting
RDMA for its 'all the rage' storage stack. Therefore, it shouldn't be a
real stumbling block.

>
> . It can also discover that you lack RDMA hardware and in that case,
> will automatically use TCP.
But hold on, that's still useless, isn't it, since it does get
capricious/shaky rather quickly? Does this still hold?


> We are doing some testing of pure LibFabrics performance now, both in
> data centers and in WAN networks (after all, you can get 100Gbps over
> substantial distances these days...

> Cornell has it from Ithaca to New York City where we have a hospital
> and our new Tech campus). We think this could let us run Derecho over
> a WAN with no hardware RDMA at all.
>
Interesting - are you planning to cast these WAN evaluations into a paper
or other pieces of intelligence? Never thought it'd make it out of the
data centre, actually. There is hardly anything to read in this direction.


--
Besten Gruß

Matthias Tafelmeier
Ken Birman
2017-11-18 19:46:59 UTC
Permalink
“clan” is an iPad typo. “vlan”

From: Ken Birman
Sent: Saturday, November 18, 2017 2:44 PM
To: Matthias Tafelmeier <***@gmx.net>
Cc: Dave Taht <***@taht.net>; Bob Briscoe <***@bobbriscoe.net>; ***@cs.cornell.edu; ***@lists.bufferbloat.net
Subject: Re: [Bloat] DETNET

Several remarks:
- if you have hardware RDMA, you can use it instead of TCP, but only within a data center, or at most, between two side by side data centers. In such situations the guarantees of RDMA are identical to TCP: lossless, uncorrupted, ordered data delivery. In fact there are versions of TCP that just map your requests to RDMA. But for peak speed, and lowest latency, you need the RDMA transfer to start in user space, and terminate in user space (end to end). Any kernel involvement will slow things down, even with DMA scatter gather (copying would kill performance, but as it turns out, scheduling delays between user and kernel, or interrupts, are almost as bad)

- RDMA didn’t work well on Ethernet until recently, but this was fixed by a technique called DCQCN (Mellanox), or its cousin TIMELY (Google). Microsoft recently had a SIGCOMM paper on running RDMA+DCQCN side by side with TCP/IP to support their Azure platform, using a single data center network, 100Gb. They found it very feasible, although configuration of the system requires some sophistication. Azure supports Linux, Mesos, Windows, you name it. The one thing they didn’t try was heavy virtualization. In fact they disabled enterprise clan functionality in their routers. So if you need that, you might have issues.

- One-way latencies tend to be in the range reported earlier today, maybe 1-2us for medium sized transfers. A weakness of RDMA is that it has a smallest latency that might still be larger than you would wish, in its TCP-like mode. Latency is lower for unreliable RDMA, or for direct writes into shared remote memory regions. You can get down to perhaps 0.1us in those cases, for a small write like a single integer. In fact the wire transfer format always moves fairly large numbers of bytes, maybe 96? It varies by wire speed. But the effect is that writing one bit or writing 512 bytes can actually be pretty much identical in terms of latency.

- the HPC people figured out how to solve this issue of not having hardware RDMA on development machines. The main package they use is called MPI, and it has an internal split: the user-mode half talks to an adaptor library called LibFabrics, and then this maps to RDMA. It can also discover that you lack RDMA hardware and in that case, will automatically use TCP. We plan to port Derecho to run on this soon, almost certainly by early spring 2018. Perhaps sooner: the API mimics the RDMA one, so it won’t be hard to do. I would recommend this for anyone doing new development. The only issue is that for now LibFabrics is a fancy C header file that uses C macro expansion, which means you can’t use it directly from C++...you need a stub library, which can add a tiny bit of delay. I’m told that the C++ library folks are going to create a variadic templates version, which would eliminate the limitation and offer the same inline code expansion as with the C header, but I don’t know when that will happen.
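
The discovery step looks roughly like the sketch below, as I understand
the LibFabrics (libfabric) API: ask fi_getinfo for providers that can
satisfy a reliable endpoint with RMA and see whether anything beyond the
sockets/TCP provider comes back. The hints and the version constant are
illustrative.

/* List libfabric providers usable for reliable messaging + RMA; if only
 * the sockets/tcp providers appear, you are on the TCP fallback rather
 * than hardware RDMA. Link with -lfabric. */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info, *cur;

    hints->ep_attr->type = FI_EP_MSG;        /* connected, reliable endpoint */
    hints->caps          = FI_MSG | FI_RMA;  /* send/recv plus one-sided RMA */

    if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info) == 0) {
        for (cur = info; cur; cur = cur->next)
            printf("provider: %s\n", cur->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}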

We are doing some testing of pure LibFabrics performance now, both in data centers and in WAN networks (after all, you can get 100Gbps over substantial distances these days... Cornell has it from Ithaca to New York City where we have a hospital and our new Tech campus). We think this could let us run Derecho over a WAN with no hardware RDMA at all.

- there is also a way to run Derecho on SoftRoCE, which is a Linux software emulation of RDMA. We tried this, and it is a solid 34-100x slower, so not interesting except for development. I would steer away from SoftRoCE as an option. It also pegs two cores at 100%... one in the kernel and one in user space. So maybe this is just in need of tuning, but it certainly seems like that code path is just not well optimized. A poor option for developing code that would be interoperable between software or hardware accelerated RDMA at this stage. LibFabrics probably isn’t a superstar either in terms of speed, but so far it appears to be faster, quite stable, easier to install, and much less of a background load...

Ken