Skip to content

provider/irdma: Implement CoCo buffer APIs#1766

Draft
10000TB wants to merge 33 commits into
linux-rdma:masterfrom
10000TB:irdma-CoCo-buffer
Draft

provider/irdma: Implement CoCo buffer APIs#1766
10000TB wants to merge 33 commits into
linux-rdma:masterfrom
10000TB:irdma-CoCo-buffer

Conversation

@10000TB

@10000TB 10000TB commented Jun 26, 2026

Copy link
Copy Markdown

[Draft]: Implement CoCo DMA Bounce buffer APIs defined in #1748

TODOs:

  • alloc parent domain
  • ibv_alloc_buf, ibv_free_buf
  • internal decrypted queue buffer allocation
  • irdma-abi.h update to be dmabuf aware
    • Simply extend existing legacy irdma ABI to be dmabuf aware.
    • Alternatively, a major re-write is needed to use IOCTL ABI, that is a larger undertake, however.
  • Migrations:
    • QP (push dbl / hw state included)
    • CQ
    • SRQ (shadow area appended at the end for doorbell)
  • Unit tests
  • irdma driver support.
  • Kernel Integrations.

cc: @jakemoroni @jpirko

yishaih and others added 28 commits June 25, 2026 15:33
ibv_cmd_reg_mr_ex() did not record the MR's access flags in vmr->access,
unlike ibv_cmd_reg_mr().  ibv_dereg_mr() decides whether to re-enable fork
tracking from verbs_get_mr(mr)->access:

    if (... && !(access & IBV_ACCESS_ON_DEMAND))
            ibv_dofork_range(addr, length);

With vmr->access left zero, an on-demand (ODP) MR registered through
ibv_reg_mr_ex() / ibv_cmd_reg_mr_ex() reads access == 0 at dereg, so
ibv_dofork_range() runs even though registration skipped the matching
ibv_dontfork_range() for ODP (need_fork = !(ON_DEMAND || FD)).  The
unbalanced dofork corrupts the fork-range accounting when ibv_fork_init()
is enabled.

Set vmr->access = mr_init_attr->access on the success path, mirroring
ibv_cmd_reg_mr(), so the dereg fork decision is correct.  (The write-ABI
fallback path already goes through ibv_cmd_reg_mr(), which sets it.)

Fixes: ca61708 ("verbs: Add ibv_cmd_reg_mr_ex() to be used by drivers")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
To commit: d7a40b519497 ("RDMA/uverbs: Expose CoCo DMA bounce
requirement to userspace").

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add struct ibv_buf with addr and size fields to provide a common
abstraction for buffer metadata that providers can embed in their
internal buffer structures. Introduce an init helper alongside.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add an int dmabuf_fd field to struct ibv_buf so that providers can
store the DMA-buf file descriptor directly in the common buffer
abstraction.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add alloc_buf/free_buf provider ops and corresponding helpers to
allow applications to allocate buffers using the provider's configured
allocation method.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
The default (dummy) reg_mr_ex op returns EOPNOTSUPP, so ibv_reg_mr_ex()
fails on any provider that does not implement its own reg_mr_ex,
even when the provider supports the equivalent legacy registration ops.

Translate the ibv_mr_init_attr request into the provider's existing
reg_mr() or reg_dmabuf_mr() op, selecting the dmabuf-based op when the
request carries an fd and the address-based op otherwise.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Extend struct ibv_mr_init_attr with a buf field and the
IBV_REG_MR_MASK_BUF comp_mask bit so a handle obtained from
ibv_alloc_buf() can be registered directly through ibv_reg_mr_ex().

When the bit is set, translage the ibv_buf fields into the concreate
init attr fields before registration, so the rest of the flow stays
unchanged: a dma-buf backed buffer is mapped to the fd-based path
(fd/fd_offset/iova) while a plain memory buffer is mapped to the
addr-based path. Consume the IBV_REG_MR_MASK_BUF bit during this
translation, leave only the masks the lower layers already understand.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add ibv_reg_buf_mr() to register memory returned by ibv_alloc_buf().

Implement it as a thin wrapper: set IBV_REG_MR_MASK_BUF and forward the
buffer handle to ibv_reg_mr_ex(), which resolves it to the appropriate
addr-based or DMA-buf-based registration and validates the range against
the stored buffer metadata. Keep it equivalent to calling
ibv_reg_mr_ex() with IBV_REG_MR_MASK_BUF.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
ibv_cmd_reg_mr_ex() issued the REG_MR ioctl unconditionally (execute_ioctl)
with no fallback to the legacy write-based registration, unlike
ibv_cmd_reg_mr() which is write-ABI based.  An address-based MR registered
through the extended path therefore fails on a kernel old enough to lack
the REG_MR ioctl method, even though it still supports the legacy write ABI
that ibv_cmd_reg_mr() uses.

Issue the command through the existing ioctl/write fallback machinery
instead: declare the command buffer with DECLARE_FBCMD_BUFFER and dispatch
via execute_ioctl_fallback().  On TRY_WRITE -- the kernel has no REG_MR
ioctl method -- fall back to the legacy write REG_MR command via
ibv_cmd_reg_mr() for the address-based case.  A dma-buf (fd) or DMA-handle
request has no write-ABI equivalent, and a kernel without the ioctl could
not service it anyway, so those return EOPNOTSUPP.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add CoCo unprotected alloc flag for applications to opt-in to
unprotected/shared memory allocation via parent domain creation.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a flag the kernel sets when the device is in a CoCo guest
and requires DMA bounce buffering.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
The new umem UAPI gives every dmabuf-backed buffer (CQ ring, QP
main/RQ/SQ, mlx5 doorbell record) its own ioctl attribute of type
UVERBS_ATTR_UMEM whose payload is a single struct ib_uverbs_buffer_desc.

Add a per-attribute helper, fill_attr_in_buf_umem(), that providers
call once per buffer they want to register through the kernel.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a driver command-buffer chain argument to the private QP create
helper ibv_cmd_create_qp_ex2() and build its command buffer with
DECLARE_CMD_BUFFER_LINK_COMPAT(), mirroring ibv_cmd_create_cq_ex2().
This lets a provider link a driver-namespace command buffer onto the
create-QP ioctl; later commits use it to attach per-buffer UMEM
attributes (UVERBS_ATTR_CREATE_QP_*_BUF_UMEM).

Update all existing callers (mlx4, mlx5, hns, mana, rxe, ionic) to
pass NULL as they have no driver chain to attach.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add dmabuf_heap.c/h including a set of internal helpers for providers
to allocate memory from Linux DMA-buf heaps (/dev/dma_heap/<name>).
Add a helper to initialize "system_cc_shared" heap allocator.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Replace the separate void *buf and size_t length fields in struct
mlx5_buf with an embedded struct ibv_buf, aligning the provider's
internal buffer representation with the common ibv_buf abstraction.

Link struct mlx5_buf to the ibv_buf API (pd, addr, size) instead of
maintaining duplicate fields. Add a pd argument to the internal
allocation helpers and initialize the embedded buffer through
ibv_buf_init(), so each buffer records its owning protection domain.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
When a parent domain is created with ALLOW_CC_UNPROTECTED_ALLOC and the
device reports CC_DMA_BOUNCE, open a dmabuf heap and use it for all
provider-internal buffer allocations (CQ, QP, SRQ, RWQ, doorbell
records) through the existing preferred allocation path.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Implement mlx5_alloc_buf_op()/mlx5_free_buf_op() using the existing
struct mlx5_buf based allocation infrastructure.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Wire CQ and QP create paths to the new per-attribute UMEM UAPI: emit
a struct ib_uverbs_buffer_desc for each dmabuf-backed buffer (CQ ring,
QP main/SQ, doorbell record) on the driver_attrs chain via
fill_attr_in_buf_umem().

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
In preparation for the follow-up patch, move ctx->buf allocation later
in pp_init_ctx().

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add -U/--allow-cc-unprotected option for running in CoCo guests if
device DMA requires unprotected/shared memory. When set, create a parent
domain with ALLOW_CC_UNPROTECTED_ALLOC and use
ibv_alloc_buf()/ibv_free_buf()/ibv_reg_buf_mr() for MR buffer allocation
and registration.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Expose IBV_DEVICE_CC_DMA_BOUNCE to pyverbs so the extended device
capability can be queried from Python.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Expose IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC to pyverbs
so it can be passed in the parent domain init comp_mask.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a comp_mask argument to ParentDomainInitAttr so callers can request
IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC, the opt-in used
by DMA-bounce devices on Confidential Computing (CoCo) guests.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Wrap the provider-aware buffer API (ibv_alloc_buf, ibv_free_buf,
ibv_reg_buf_mr) with new Buf and BufMR classes. Buf owns a buffer
allocated through a PD and is tracked in a per-PD weakset, so it is
torn down before the PD it belongs to. BufMR registers an MR over a
(sub)range of a Buf and deregisters it on close without freeing the
underlying buffer.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
MREx.close() called the base close() and then reset its own cdef object
member via 'self.dmah = None'. That assignment is both redundant and
unsafe during deallocation.

Cython's generated subclass deallocator clears the subclass cdef object
members (dmah) before chaining to the base deallocator that runs the
inherited MR.__dealloc__ -> self.close(). So when close() runs at
deallocation time, dmah is already NULL and 'self.dmah = None' does an
unguarded DECREF on NULL, segfaulting. This only stayed hidden while
every MREx was closed explicitly first (which sets self.mr = NULL and
makes the deallocation-time close() skip its body); an MREx reclaimed by
the garbage collector crashes.

Releasing dmah here is unnecessary: tp_dealloc already drops the
reference, and MR.close() performs the ibv_dereg_mr(). Drop the override
and inherit MR.close(), matching the other MR subclasses (e.g. DMMR)
that do not reset their extra members in close().

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add test_buf.py exercising the ibv_buf API (ibv_alloc_buf,
ibv_reg_buf_mr, ibv_free_buf), and the ibv_reg_mr_ex()
IBV_REG_MR_MASK_BUF path, over both a plain PD and a parent domain
created with ALLOW_CC_UNPROTECTED_ALLOC, in API-only and RC/UD traffic
variants. Add an is_cq_ex option to the rdma_traffic and atomic_traffic
helpers so the tests can drive an extended CQ.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Allocation of system_cc_dmabuf is only carried out if CoCo
DMA bounce feature in device, and cc unprotected attribute are
both present. Proper cleanup is added to cleanup path.

Signed-off-by: David Hu <xuehaohu@google.com>
Signed-off-by: David Hu <xuehaohu@google.com>
@10000TB 10000TB marked this pull request as draft June 26, 2026 06:19
@10000TB 10000TB changed the title provider/Irdma: Implement CoCo buffer APIs provider/irdma: Implement CoCo buffer APIs Jun 26, 2026
10000TB added 2 commits June 27, 2026 05:12
Move common allocation logic into shared allocator function
__irdma_alloc_buf. As a result, uAPI irdma_ualloc_buf is
updated accordingly. Proper cleanup paths are also updated
accordingly.

Signed-off-by: David Hu <xuehaohu@google.com>
For simplicity, irdma legacy ABI is updated to support
passing dmabuf fd to irdma driver. Alternatively, a
major write is needed to migrate queues ops via IOCTL
ABI.

Signed-off-by: David Hu <xuehaohu@google.com>
@10000TB 10000TB force-pushed the irdma-CoCo-buffer branch from 08ba9c9 to 8e618f9 Compare June 27, 2026 06:53
10000TB added 2 commits June 27, 2026 06:54
Cleanup path is also properly updated to be dmabuf aware

Signed-off-by: David Hu <xuehaohu@google.com>
Cleanup path is also updated accordingly

Signed-off-by: David Hu <xuehaohu@google.com>
@10000TB 10000TB force-pushed the irdma-CoCo-buffer branch from 8e618f9 to fbd917c Compare June 27, 2026 07:06
Cleanup path is also updated accordingly

Signed-off-by: David Hu <xuehaohu@google.com>
@jakemoroni

jakemoroni commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Hi,

Thanks for getting a jump on this. We should probably work out some of the uAPI details since irdma is a bit unusual in that regard.

As you are aware, irdma handles user-allocated QP/CQ/SRQ ring buffers differently from other drivers (from what I can tell). For irdma, QP/CQ/SRQ creation is a two step process where the ring buffers are allocated and registered as special memory regions (indicated via flags in udata) using the normal reg_mr command prior to issuing the actual ibv_cmd_create_qp/cq/srq call. The driver maintains an internal registry to correlate these special MRs with their child ring object when the object is created.

So, I think we have a few options:

  1. Add a flag attribute to irdma's reg_dmabuf_mr command (UVERBS_ATTR_FLAGS_IN) to indicate whether the region is a normal memory region or a user ring region, similar to what is passed for reg_mr today. Then, the provider would follow mostly the same flow, but would just use reg_dmabuf_mr for dmabuf backed rings. This is likely the simplest change in terms of the irdma kernel driver as well since it already shares a lot of code between reg_dmabuf_mr and normal reg_mr.
  2. Change irdma to support providing umem during the actual create_qp/cq/srq call. This is similar to what you've done but would ideally use the new UVERBS_ATTR_UMEM attribute as opposed to adding new fields to the driver udata for each command. This would be a larger change and should probably be treated as a separate endeavor if we decide to go down this route, since it's kind of unrelated to dmabuf ring support (though we would get that for free).
  3. There were also talks of some way to allow for mmap-ed dmabuf VAs to be passed to normal ibv_reg_mr. I don't know if that's still PoR but if so, then we wouldn't need any irdma kernel driver changes at all.

Option 1 is the easiest and probably carries less risk, but I am not sure if the maintainers would prefer option 2 since it's a more standard way.

As a side note, we will probably need to fix a few irdma kernel driver issues before adding new features:

  1. Need to support robust udata. I just sent a patch series for this.
  2. There were some bugs related to this two step user ring process. Specifically, we need to prevent a bad userspace from deregistering these special MRs if there's still a QP/CQ/SRQ object in HW pointing to it. FWIW, this probably wouldn't exist with option 2, but we're still stuck supporting the two step process anyway.

@10000TB

10000TB commented Jun 27, 2026

Copy link
Copy Markdown
Author

SG to clarify and align on uAPIs. Good call.

The simpler the better - Therefore, I also prefer option 1.

Option 2 appear a reasonable step forward to converge with standard ways. Though, non-trivial work to support IOCTl ABIs appears needed. Simply modifying the legacy write syscall ABI structs appears a plausible middle ground. A recent patch for fixing a vulnerability on user triggered NULL deref appears specific to the two step process and a counter argument against the two step process (I think your side is also spot on). Though I don't think the two step process is bad. Converging with standard ways is probably the strongest argument, if the alignment is option 2.

For VA dmabuf, it sounds interesting. I'd be curious to learn more. My limited understanding is that kernel side needs a dmabuf fd to cast it to dmabuf structure. Maybe there is no need to cast it at all in kernel with the mmapped approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants