Re: pending changes for TOE support



My initial feeling is that, even if an interface supports TOE, we shouldn't
enable the capability in the enabled vector by default, as TOE bypasses
firewall behavior, etc, and would certainly be a surprise if an admin swapped
a chelsio card for a non-TOE supporting card. What's your feeling on this?


The current implementation bypasses the firewall. This and likely
other hardware has extensive filtering support so it isn't
neccessarily intrinsic.

The usage model at this moment is that the customer makes a conscious
decision to load the TOE driver and understands the implications. I
think this is quite adequate for 10GigE cards currently. However, this
will need to be revisited when these chips start showing up on
mainstream motherboards.


+ * The TOE API assumes that the tcp offload engine can offload the
+ * the entire connection from set up to teardown, with some provision
+ * being made to allowing the software stack to handle time wait. If
+ * the device does not meet these criteria, it is the driver's responsibility
+ * to overload the functions that it needs to in tcp_usrreqs and make
+ * its own calls to tcp_output if it needs to do so.

While I'm familiar with TCP, I'm less familiar with the scope of what cards
support for TOE. Do we know of any cards that are less capable than the
chelsio card in this respect, or are they all sort of on-par on that front?
I.e., do we think the above eventuality is likely?

I don't have any way of knowing. I think it is probably safe to say
that any vendors that don't meet that criteria now will in the future
as transistor density increases.


If we don't, then one of the things I'd like to see us do is fairly carefully
assert, at least for a few months, that TCP never "slips" into any
transmission-related paths that could lead to truly odd and hard-to-diagnose
behavior when runnning with TOE. I.e., tcp_output, etc.


I'm happy to do that. However, I see problems introduced by offloading
connections as being driver bugs much the same as problems caused by
the driver's TCP segmentation offload or checksum offload. The
problems will be isolated to connections using a specific interface.



If we do think it's likely, we don't need to address this immediately, but we
should make sure that before we ship TOE in a release, we've thought somewhat
more thoroughly about that case. As long as TOE remains un-MFC'd, we don't
find ourselves with an obligation to maintain guarantees about the interfaces,
and that includes dealing with incompatibility :-). Do we know if any of the
current 10gbps vendors other than chelsio are actively looking at TOE on
FreeBSD, and could be engaged in discussion?


As with most vendors and FreeBSD, I suspect that the two that I know
of will have zero interest until they have a prospective customer. At
which point they'll want it done yesterday.


I think it might be useful to add a couple of paragraphs here on three topics:

(1) Clarify the way in which windows are updated between the device driver and
the socket code, both for sending/receiving. You talk a bit about
"credit", but introducing it up-front would be useful.

I didn't realize a definition was necessary. To the best of my
knowledge this is the common term used when discussing flow control.
I've seen it used for Fibre Channel and IB. The one ambiguity that
arises is whether or not it refers to bytes or segments.



(2) One of the issues I've run into in the TCP and socket code generally is
that there was significant lack of clarity on the "life cycle" of the set
of related data structures. Could you write a bit of text about when
drivers will allocate state and when they will free it? I.e., tu_attach
allocates state, tu_{abort,detach} free it, and TCP promises not to call
anything before attach or anything after abort/detach.

(3) Could you talk at a high level about the ways in which TOE drivers will
interact with TCP? You do it a bit in each of the sections, but if
there's a principle, pulling it out would be useful. Also, you should
indicate whether the driver is allowed to drop the inpcb lock or not.

I've done my best to minimize changes to TCP. It is safe to assume
that the invariants are the same as those for tcp_output. I think we
should ask the author of tcp_output to document the interface,
expected state transitions, and its invariants (joke).


Doing this would address a few of the comments I have below also.

+ * + tu_send
+ * - tells the driver that new data may have been added to the
+ * socket's send buffer - the driver should not fail if the
+ * buffer is in fact unchanged

I'm a bit confused by the description of the error condition here. Could you
clarify when a driver should return an error, and what the impact of an error
returned will be on the connection state? In fact, it probably makes sense to
have an up-front comment on conventions for error-handling -- if TOE returns
an error will that generally lead to a TCP tear-down?

The offload routines are substituted for tcp_output and thus should
interact with the stack in the same way. By extension they should have
the same failure modes and invariants.


+ * - The driver expects the inpcb lock to be held and

This comment is truncated -- is there an and?

We should specify that drivers are not allowed to drop the inpcb lock if that
is the case, FYI.

+ * + tu_rcvd
+ * - returns credits to the driver and triggers window updates
+ * to the peer (a credit is a byte in the user's receive window)

Might begin with a sentence defining the notion of credit. Is it possible to
use tu_rcvd to reduce credit to the card if the socket buffer size is changed,
or just increase it?

+ * - the driver is expected to determine how many bytes have been
+ * consumed and credit that back to the card so that it can grow
+ * the window again

Could you provide an example of how it is to do that -- i.e., is it just going
to inspect so_rcv in the same way native TCP does?

Correct. It is up to the driver to maintain any ancillary state needed
to determine that.



+ * - this function needs to correctly handle being called any number of
+ * times without any bytes being consumed from the receive buffer.
+ * - the driver expects the inpcb lock to be held
+ *
+ * + tu_disconnect
+ * - tells the driver to send FIN to peer
+ * - driver is expected to send the remaining data and then do a clean half
close
+ * - disconnect implies at least half-close so only send, abort, and detach
+ * are legal

Could you clarify this a bit? Do you mean that TCP guarangees that only
tu_send, tu_abort, and tu_detach will be delivered to the driver in the
future?

Those are the only things that make sense, but the driver is not
expected to break if TCP does.



+ * - the driver is expected to handle transition through the shutdown
+ * state machine and allow the stack to support SO_LINGER.

Probably worth commenting that the device driver won't detach the toe state.


+ *
+ * + tu_abort
+ * - closes the connection and sends a RST to peer
+ * - driver is expectd to trigger an RST and detach the toepcb

In regular TCP, the pru_abort method is only called on pending connections
while still in the listen queues of a listen socket. Is this true of
tu_abort, or is tu_abort a more general method to be used to cancel
connections? If so, probably worth commenting on that.


tu_abort is called in place of tcp_output in pru_abort.


+ * - no further calls are legal after abort
+ * - the driver expects the inpcb lock to be held
+ *
+ * + tu_detach
+ * - tells driver that the socket is going away so disconnect
+ * the toepcb and free appropriate resources
+ * - allows the driver to cleanly handle the case of connection state
+ * outliving the socket
+ * - no further calls are legal after detach
+ * - the driver acquires the tcbinfo lock

For this call, you haven't specified whether the inpcb lock is held. If it
is, the driver acquiring the tcbinfo lock without first dropping the inpcb
lock would be a lock order reversal. Should the caller instead acquire/hold
it?

The inpcb lock no longer exists at this point.



For the above calls, what guarantees does the TCP stack make about the
presence of the socket, if any?

The assumptions are the same as those of tcp_output except for
syncache_event and detach, at which points the socket does not yet
exist or no longer exists.


These interfaces all pass the tcpcb, but in our regular TCP stack, the
invariant is the existence of the inpcb, not the tcpcb, which may be replaced
with a tcptw (or in one edge case, inp_ppcb may be NULL). If there will be
drivers in the future that implement timewait, perhaps we should be passing in
the inpcb?

The interface is intended to drop in the place of tcp_output.


+ * + tu_syncache_event
+ * - even if it is not actually needed, the driver is expected to
+ * call syncache_add for the initial SYN and then syncache_expand
+ * for the SYN,ACK
+ * - tells driver that a connection either has not been added or has
+ * been dropped from the syncache
+ * - the driver is expected to maintain state that lives outside the
+ * software stack so the syncache needs to be able to notify the
+ * toe driver that the software stack is not going to create a connection
+ * for a received SYN
+ * - the driver is responsible for any synchronization required

Presumably tu_syncache_event is called from the syncache and locks will be
held when that happens...?

The driver doesn't care what locks the syncache holds. The syncache
event handler is responsible for acquiring any locks necessary to
synchronize with the rest of the driver for the transition from
SYN_RCVD -> ESTABLISHED.


How will the race between the syncache deciding to drop a connection of its
own accord and the hardware/driver deciding to accept be addressed, generally
speaking?

That is a driver implementation issue. The one case to avoid is a
deadlock between the driver calling syncache_expand and the syncache
calling syncache_event.


+8
+extern struct toe_usrreqs tcp_offload_usrreqs;

What is the purpose of this global? Presumably we can have two drivers that
both implement offload at once?

I think you're follow on reading of tcp_offload.c answers that question.


-Kip
_______________________________________________
freebsd-arch@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "freebsd-arch-unsubscribe@xxxxxxxxxxx"



Relevant Pages

  • Re: pending changes for TOE support
    ... decision to load the TOE driver and understands the implications. ... While I'm familiar with TCP, I'm less familiar with the scope of what cards ... indicate whether the driver is allowed to drop the inpcb lock or not. ...
    (freebsd-current)
  • Re: pending changes for TOE support
    ... decision to load the TOE driver and understands the implications. ... I think I would prefer that our policy switch be the capenable flag, so that compiling things in or out doesn't change functional behavior for existing interfaces. ... carefully assert, at least for a few months, that TCP never "slips" into any transmission-related paths that could lead to truly odd and hard-to-diagnose behavior when runnning with TOE. ... I see problems introduced by offloading connections as being driver bugs much the same as problems caused by the driver's TCP segmentation offload or checksum offload. ...
    (freebsd-arch)
  • Re: pending changes for TOE support
    ... decision to load the TOE driver and understands the implications. ... I think I would prefer that our policy switch be the capenable flag, so that compiling things in or out doesn't change functional behavior for existing interfaces. ... carefully assert, at least for a few months, that TCP never "slips" into any transmission-related paths that could lead to truly odd and hard-to-diagnose behavior when runnning with TOE. ... I see problems introduced by offloading connections as being driver bugs much the same as problems caused by the driver's TCP segmentation offload or checksum offload. ...
    (freebsd-current)
  • Re: pending changes for TOE support
    ... * A driver publishes that it provides offload services ... * connection would use does not support TCP offload. ... My initial feeling is that, even if an interface supports TOE, we shouldn't enable the capability in the enabled vector by default, as TOE bypasses firewall behavior, etc, and would certainly be a surprise if an admin swapped a chelsio card for a non-TOE supporting card. ... indicate whether the driver is allowed to drop the inpcb lock or not. ...
    (freebsd-arch)
  • Re: pending changes for TOE support
    ... * A driver publishes that it provides offload services ... * connection would use does not support TCP offload. ... My initial feeling is that, even if an interface supports TOE, we shouldn't enable the capability in the enabled vector by default, as TOE bypasses firewall behavior, etc, and would certainly be a surprise if an admin swapped a chelsio card for a non-TOE supporting card. ... indicate whether the driver is allowed to drop the inpcb lock or not. ...
    (freebsd-current)