provider/irdma: Implement CoCo buffer APIs by 10000TB · Pull Request #1766 · linux-rdma/rdma-core

10000TB · 2026-06-26T06:18:21Z

[Draft]: Implement CoCo DMA Bounce buffer APIs defined in #1748

TODOs:

ibv_cmd_reg_mr_ex() did not record the MR's access flags in vmr->access, unlike ibv_cmd_reg_mr(). ibv_dereg_mr() decides whether to re-enable fork tracking from verbs_get_mr(mr)->access: if (... && !(access & IBV_ACCESS_ON_DEMAND)) ibv_dofork_range(addr, length); With vmr->access left zero, an on-demand (ODP) MR registered through ibv_reg_mr_ex() / ibv_cmd_reg_mr_ex() reads access == 0 at dereg, so ibv_dofork_range() runs even though registration skipped the matching ibv_dontfork_range() for ODP (need_fork = !(ON_DEMAND || FD)). The unbalanced dofork corrupts the fork-range accounting when ibv_fork_init() is enabled. Set vmr->access = mr_init_attr->access on the success path, mirroring ibv_cmd_reg_mr(), so the dereg fork decision is correct. (The write-ABI fallback path already goes through ibv_cmd_reg_mr(), which sets it.) Fixes: ca61708 ("verbs: Add ibv_cmd_reg_mr_ex() to be used by drivers") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

To commit: d7a40b519497 ("RDMA/uverbs: Expose CoCo DMA bounce requirement to userspace"). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add struct ibv_buf with addr and size fields to provide a common abstraction for buffer metadata that providers can embed in their internal buffer structures. Introduce an init helper alongside. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add an int dmabuf_fd field to struct ibv_buf so that providers can store the DMA-buf file descriptor directly in the common buffer abstraction. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add alloc_buf/free_buf provider ops and corresponding helpers to allow applications to allocate buffers using the provider's configured allocation method. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

The default (dummy) reg_mr_ex op returns EOPNOTSUPP, so ibv_reg_mr_ex() fails on any provider that does not implement its own reg_mr_ex, even when the provider supports the equivalent legacy registration ops. Translate the ibv_mr_init_attr request into the provider's existing reg_mr() or reg_dmabuf_mr() op, selecting the dmabuf-based op when the request carries an fd and the address-based op otherwise. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Extend struct ibv_mr_init_attr with a buf field and the IBV_REG_MR_MASK_BUF comp_mask bit so a handle obtained from ibv_alloc_buf() can be registered directly through ibv_reg_mr_ex(). When the bit is set, translage the ibv_buf fields into the concreate init attr fields before registration, so the rest of the flow stays unchanged: a dma-buf backed buffer is mapped to the fd-based path (fd/fd_offset/iova) while a plain memory buffer is mapped to the addr-based path. Consume the IBV_REG_MR_MASK_BUF bit during this translation, leave only the masks the lower layers already understand. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add ibv_reg_buf_mr() to register memory returned by ibv_alloc_buf(). Implement it as a thin wrapper: set IBV_REG_MR_MASK_BUF and forward the buffer handle to ibv_reg_mr_ex(), which resolves it to the appropriate addr-based or DMA-buf-based registration and validates the range against the stored buffer metadata. Keep it equivalent to calling ibv_reg_mr_ex() with IBV_REG_MR_MASK_BUF. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

ibv_cmd_reg_mr_ex() issued the REG_MR ioctl unconditionally (execute_ioctl) with no fallback to the legacy write-based registration, unlike ibv_cmd_reg_mr() which is write-ABI based. An address-based MR registered through the extended path therefore fails on a kernel old enough to lack the REG_MR ioctl method, even though it still supports the legacy write ABI that ibv_cmd_reg_mr() uses. Issue the command through the existing ioctl/write fallback machinery instead: declare the command buffer with DECLARE_FBCMD_BUFFER and dispatch via execute_ioctl_fallback(). On TRY_WRITE -- the kernel has no REG_MR ioctl method -- fall back to the legacy write REG_MR command via ibv_cmd_reg_mr() for the address-based case. A dma-buf (fd) or DMA-handle request has no write-ABI equivalent, and a kernel without the ioctl could not service it anyway, so those return EOPNOTSUPP. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add CoCo unprotected alloc flag for applications to opt-in to unprotected/shared memory allocation via parent domain creation. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add a flag the kernel sets when the device is in a CoCo guest and requires DMA bounce buffering. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

The new umem UAPI gives every dmabuf-backed buffer (CQ ring, QP main/RQ/SQ, mlx5 doorbell record) its own ioctl attribute of type UVERBS_ATTR_UMEM whose payload is a single struct ib_uverbs_buffer_desc. Add a per-attribute helper, fill_attr_in_buf_umem(), that providers call once per buffer they want to register through the kernel. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add a driver command-buffer chain argument to the private QP create helper ibv_cmd_create_qp_ex2() and build its command buffer with DECLARE_CMD_BUFFER_LINK_COMPAT(), mirroring ibv_cmd_create_cq_ex2(). This lets a provider link a driver-namespace command buffer onto the create-QP ioctl; later commits use it to attach per-buffer UMEM attributes (UVERBS_ATTR_CREATE_QP_*_BUF_UMEM). Update all existing callers (mlx4, mlx5, hns, mana, rxe, ionic) to pass NULL as they have no driver chain to attach. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add dmabuf_heap.c/h including a set of internal helpers for providers to allocate memory from Linux DMA-buf heaps (/dev/dma_heap/<name>). Add a helper to initialize "system_cc_shared" heap allocator. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Replace the separate void *buf and size_t length fields in struct mlx5_buf with an embedded struct ibv_buf, aligning the provider's internal buffer representation with the common ibv_buf abstraction. Link struct mlx5_buf to the ibv_buf API (pd, addr, size) instead of maintaining duplicate fields. Add a pd argument to the internal allocation helpers and initialize the embedded buffer through ibv_buf_init(), so each buffer records its owning protection domain. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

When a parent domain is created with ALLOW_CC_UNPROTECTED_ALLOC and the device reports CC_DMA_BOUNCE, open a dmabuf heap and use it for all provider-internal buffer allocations (CQ, QP, SRQ, RWQ, doorbell records) through the existing preferred allocation path. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Implement mlx5_alloc_buf_op()/mlx5_free_buf_op() using the existing struct mlx5_buf based allocation infrastructure. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Wire CQ and QP create paths to the new per-attribute UMEM UAPI: emit a struct ib_uverbs_buffer_desc for each dmabuf-backed buffer (CQ ring, QP main/SQ, doorbell record) on the driver_attrs chain via fill_attr_in_buf_umem(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

In preparation for the follow-up patch, move ctx->buf allocation later in pp_init_ctx(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add -U/--allow-cc-unprotected option for running in CoCo guests if device DMA requires unprotected/shared memory. When set, create a parent domain with ALLOW_CC_UNPROTECTED_ALLOC and use ibv_alloc_buf()/ibv_free_buf()/ibv_reg_buf_mr() for MR buffer allocation and registration. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Expose IBV_DEVICE_CC_DMA_BOUNCE to pyverbs so the extended device capability can be queried from Python. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Expose IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC to pyverbs so it can be passed in the parent domain init comp_mask. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add a comp_mask argument to ParentDomainInitAttr so callers can request IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC, the opt-in used by DMA-bounce devices on Confidential Computing (CoCo) guests. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Wrap the provider-aware buffer API (ibv_alloc_buf, ibv_free_buf, ibv_reg_buf_mr) with new Buf and BufMR classes. Buf owns a buffer allocated through a PD and is tracked in a per-PD weakset, so it is torn down before the PD it belongs to. BufMR registers an MR over a (sub)range of a Buf and deregisters it on close without freeing the underlying buffer. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

MREx.close() called the base close() and then reset its own cdef object member via 'self.dmah = None'. That assignment is both redundant and unsafe during deallocation. Cython's generated subclass deallocator clears the subclass cdef object members (dmah) before chaining to the base deallocator that runs the inherited MR.__dealloc__ -> self.close(). So when close() runs at deallocation time, dmah is already NULL and 'self.dmah = None' does an unguarded DECREF on NULL, segfaulting. This only stayed hidden while every MREx was closed explicitly first (which sets self.mr = NULL and makes the deallocation-time close() skip its body); an MREx reclaimed by the garbage collector crashes. Releasing dmah here is unnecessary: tp_dealloc already drops the reference, and MR.close() performs the ibv_dereg_mr(). Drop the override and inherit MR.close(), matching the other MR subclasses (e.g. DMMR) that do not reset their extra members in close(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Add test_buf.py exercising the ibv_buf API (ibv_alloc_buf, ibv_reg_buf_mr, ibv_free_buf), and the ibv_reg_mr_ex() IBV_REG_MR_MASK_BUF path, over both a plain PD and a parent domain created with ALLOW_CC_UNPROTECTED_ALLOC, in API-only and RC/UD traffic variants. Add an is_cq_ex option to the rdma_traffic and atomic_traffic helpers so the tests can drive an extended CQ. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

Allocation of system_cc_dmabuf is only carried out if CoCo DMA bounce feature in device, and cc unprotected attribute are both present. Proper cleanup is added to cleanup path. Signed-off-by: David Hu <xuehaohu@google.com>

Signed-off-by: David Hu <xuehaohu@google.com>

Move common allocation logic into shared allocator function __irdma_alloc_buf. As a result, uAPI irdma_ualloc_buf is updated accordingly. Proper cleanup paths are also updated accordingly. Signed-off-by: David Hu <xuehaohu@google.com>

For simplicity, irdma legacy ABI is updated to support passing dmabuf fd to irdma driver. Alternatively, a major write is needed to migrate queues ops via IOCTL ABI. Signed-off-by: David Hu <xuehaohu@google.com>

Cleanup path is also properly updated to be dmabuf aware Signed-off-by: David Hu <xuehaohu@google.com>

Cleanup path is also updated accordingly Signed-off-by: David Hu <xuehaohu@google.com>

jakemoroni · 2026-06-27T18:26:14Z

Hi,

Thanks for getting a jump on this. We should probably work out some of the uAPI details since irdma is a bit unusual in that regard.

As you are aware, irdma handles user-allocated QP/CQ/SRQ ring buffers differently from other drivers (from what I can tell). For irdma, QP/CQ/SRQ creation is a two step process where the ring buffers are allocated and registered as special memory regions (indicated via flags in udata) using the normal reg_mr command prior to issuing the actual ibv_cmd_create_qp/cq/srq call. The driver maintains an internal registry to correlate these special MRs with their child ring object when the object is created.

So, I think we have a few options:

Add a flag attribute to irdma's reg_dmabuf_mr command (UVERBS_ATTR_FLAGS_IN) to indicate whether the region is a normal memory region or a user ring region, similar to what is passed for reg_mr today. Then, the provider would follow mostly the same flow, but would just use reg_dmabuf_mr for dmabuf backed rings. This is likely the simplest change in terms of the irdma kernel driver as well since it already shares a lot of code between reg_dmabuf_mr and normal reg_mr.
Change irdma to support providing umem during the actual create_qp/cq/srq call. This is similar to what you've done but would ideally use the new UVERBS_ATTR_UMEM attribute as opposed to adding new fields to the driver udata for each command. This would be a larger change and should probably be treated as a separate endeavor if we decide to go down this route, since it's kind of unrelated to dmabuf ring support (though we would get that for free).
There were also talks of some way to allow for mmap-ed dmabuf VAs to be passed to normal ibv_reg_mr. I don't know if that's still PoR but if so, then we wouldn't need any irdma kernel driver changes at all.

Option 1 is the easiest and probably carries less risk, but I am not sure if the maintainers would prefer option 2 since it's a more standard way.

As a side note, we will probably need to fix a few irdma kernel driver issues before adding new features:

Need to support robust udata. I just sent a patch series for this.
There were some bugs related to this two step user ring process. Specifically, we need to prevent a bad userspace from deregistering these special MRs if there's still a QP/CQ/SRQ object in HW pointing to it. FWIW, this probably wouldn't exist with option 2, but we're still stuck supporting the two step process anyway.

10000TB · 2026-06-27T19:30:28Z

SG to clarify and align on uAPIs. Good call.

The simpler the better - Therefore, I also prefer option 1.

Option 2 appear a reasonable step forward to converge with standard ways. Though, non-trivial work to support IOCTl ABIs appears needed. Simply modifying the legacy write syscall ABI structs appears a plausible middle ground. A recent patch for fixing a vulnerability on user triggered NULL deref appears specific to the two step process and a counter argument against the two step process (I think your side is also spot on). Though I don't think the two step process is bad. Converging with standard ways is probably the strongest argument, if the alignment is option 2.

For VA dmabuf, it sounds interesting. I'd be curious to learn more. My limited understanding is that kernel side needs a dmabuf fd to cast it to dmabuf structure. Maybe there is no need to cast it at all in kernel with the mmapped approach.

yishaih and others added 28 commits June 25, 2026 15:33

Update kernel headers

752a9b2

To commit: d7a40b519497 ("RDMA/uverbs: Expose CoCo DMA bounce requirement to userspace"). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

libibverbs: Add dmabuf fd field to struct ibv_buf

60a6681

Add an int dmabuf_fd field to struct ibv_buf so that providers can store the DMA-buf file descriptor directly in the common buffer abstraction. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

libibverbs: Add ibv_alloc_buf/ibv_free_buf verbs API

0dd1543

Add alloc_buf/free_buf provider ops and corresponding helpers to allow applications to allocate buffers using the provider's configured allocation method. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

libibverbs: Add parent domain ALLOW_CC_UNPROTECTED_ALLOC flag

7691b51

Add CoCo unprotected alloc flag for applications to opt-in to unprotected/shared memory allocation via parent domain creation. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

libibverbs: Add CoCo DMA bounce flag

47d9907

Add a flag the kernel sets when the device is in a CoCo guest and requires DMA bounce buffering. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

mlx5: Implement ibv_alloc_buf/ibv_free_buf ops

28142de

Implement mlx5_alloc_buf_op()/mlx5_free_buf_op() using the existing struct mlx5_buf based allocation infrastructure. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

rc_pingpong: Move buffer allocation after PD setup

de196b3

In preparation for the follow-up patch, move ctx->buf allocation later in pp_init_ctx(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>

pyverbs: Add CoCo DMA bounce device cap flag

9fe1f21

Expose IBV_DEVICE_CC_DMA_BOUNCE to pyverbs so the extended device capability can be queried from Python. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

pyverbs: Add parent domain CC unprotected flag

7f3c8bd

Expose IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC to pyverbs so it can be passed in the parent domain init comp_mask. Signed-off-by: Jiri Pirko <jiri@nvidia.com>

provider/irdma: Implement alloc_parent_domain for CoCo

c3cc7df

Allocation of system_cc_dmabuf is only carried out if CoCo DMA bounce feature in device, and cc unprotected attribute are both present. Proper cleanup is added to cleanup path. Signed-off-by: David Hu <xuehaohu@google.com>

provider/irdma: Implement ibv_alloc_buf and ibv_free_buf

0c1e6e3

Signed-off-by: David Hu <xuehaohu@google.com>

10000TB marked this pull request as draft June 26, 2026 06:19

10000TB changed the title ~~provider/Irdma: Implement CoCo buffer APIs~~ provider/irdma: Implement CoCo buffer APIs Jun 26, 2026

10000TB added 2 commits June 27, 2026 05:12

provider/irdma: Update irdma-abi.h to support dmabuf

7225039

For simplicity, irdma legacy ABI is updated to support passing dmabuf fd to irdma driver. Alternatively, a major write is needed to migrate queues ops via IOCTL ABI. Signed-off-by: David Hu <xuehaohu@google.com>

10000TB force-pushed the irdma-CoCo-buffer branch from 08ba9c9 to 8e618f9 Compare June 27, 2026 06:53

10000TB added 2 commits June 27, 2026 06:54

provider/irdma: Migrate CQ creation to be dmabuf aware

3b412cc

Cleanup path is also properly updated to be dmabuf aware Signed-off-by: David Hu <xuehaohu@google.com>

provider/irdma: Migrate QP creation to be dmabuf aware

fbd917c

Cleanup path is also updated accordingly Signed-off-by: David Hu <xuehaohu@google.com>

10000TB force-pushed the irdma-CoCo-buffer branch from 8e618f9 to fbd917c Compare June 27, 2026 07:06

provider/irdma: Migrate srq creation to be dmabuf aware

1dba9e5

Cleanup path is also updated accordingly Signed-off-by: David Hu <xuehaohu@google.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

provider/irdma: Implement CoCo buffer APIs#1766

provider/irdma: Implement CoCo buffer APIs#1766
10000TB wants to merge 33 commits into
linux-rdma:masterfrom
10000TB:irdma-CoCo-buffer

10000TB commented Jun 26, 2026 •

edited

Loading

Uh oh!

jakemoroni commented Jun 27, 2026 •

edited

Loading

Uh oh!

10000TB commented Jun 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

10000TB commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jakemoroni commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

10000TB commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

10000TB commented Jun 26, 2026 •

edited

Loading

jakemoroni commented Jun 27, 2026 •

edited

Loading

10000TB commented Jun 27, 2026 •

edited

Loading