provider/irdma: Implement CoCo buffer APIs#1766
Conversation
ibv_cmd_reg_mr_ex() did not record the MR's access flags in vmr->access,
unlike ibv_cmd_reg_mr(). ibv_dereg_mr() decides whether to re-enable fork
tracking from verbs_get_mr(mr)->access:
if (... && !(access & IBV_ACCESS_ON_DEMAND))
ibv_dofork_range(addr, length);
With vmr->access left zero, an on-demand (ODP) MR registered through
ibv_reg_mr_ex() / ibv_cmd_reg_mr_ex() reads access == 0 at dereg, so
ibv_dofork_range() runs even though registration skipped the matching
ibv_dontfork_range() for ODP (need_fork = !(ON_DEMAND || FD)). The
unbalanced dofork corrupts the fork-range accounting when ibv_fork_init()
is enabled.
Set vmr->access = mr_init_attr->access on the success path, mirroring
ibv_cmd_reg_mr(), so the dereg fork decision is correct. (The write-ABI
fallback path already goes through ibv_cmd_reg_mr(), which sets it.)
Fixes: ca61708 ("verbs: Add ibv_cmd_reg_mr_ex() to be used by drivers")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
To commit: d7a40b519497 ("RDMA/uverbs: Expose CoCo DMA bounce
requirement to userspace").
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add struct ibv_buf with addr and size fields to provide a common abstraction for buffer metadata that providers can embed in their internal buffer structures. Introduce an init helper alongside. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add an int dmabuf_fd field to struct ibv_buf so that providers can store the DMA-buf file descriptor directly in the common buffer abstraction. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add alloc_buf/free_buf provider ops and corresponding helpers to allow applications to allocate buffers using the provider's configured allocation method. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
The default (dummy) reg_mr_ex op returns EOPNOTSUPP, so ibv_reg_mr_ex() fails on any provider that does not implement its own reg_mr_ex, even when the provider supports the equivalent legacy registration ops. Translate the ibv_mr_init_attr request into the provider's existing reg_mr() or reg_dmabuf_mr() op, selecting the dmabuf-based op when the request carries an fd and the address-based op otherwise. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Extend struct ibv_mr_init_attr with a buf field and the IBV_REG_MR_MASK_BUF comp_mask bit so a handle obtained from ibv_alloc_buf() can be registered directly through ibv_reg_mr_ex(). When the bit is set, translage the ibv_buf fields into the concreate init attr fields before registration, so the rest of the flow stays unchanged: a dma-buf backed buffer is mapped to the fd-based path (fd/fd_offset/iova) while a plain memory buffer is mapped to the addr-based path. Consume the IBV_REG_MR_MASK_BUF bit during this translation, leave only the masks the lower layers already understand. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add ibv_reg_buf_mr() to register memory returned by ibv_alloc_buf(). Implement it as a thin wrapper: set IBV_REG_MR_MASK_BUF and forward the buffer handle to ibv_reg_mr_ex(), which resolves it to the appropriate addr-based or DMA-buf-based registration and validates the range against the stored buffer metadata. Keep it equivalent to calling ibv_reg_mr_ex() with IBV_REG_MR_MASK_BUF. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
ibv_cmd_reg_mr_ex() issued the REG_MR ioctl unconditionally (execute_ioctl) with no fallback to the legacy write-based registration, unlike ibv_cmd_reg_mr() which is write-ABI based. An address-based MR registered through the extended path therefore fails on a kernel old enough to lack the REG_MR ioctl method, even though it still supports the legacy write ABI that ibv_cmd_reg_mr() uses. Issue the command through the existing ioctl/write fallback machinery instead: declare the command buffer with DECLARE_FBCMD_BUFFER and dispatch via execute_ioctl_fallback(). On TRY_WRITE -- the kernel has no REG_MR ioctl method -- fall back to the legacy write REG_MR command via ibv_cmd_reg_mr() for the address-based case. A dma-buf (fd) or DMA-handle request has no write-ABI equivalent, and a kernel without the ioctl could not service it anyway, so those return EOPNOTSUPP. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add CoCo unprotected alloc flag for applications to opt-in to unprotected/shared memory allocation via parent domain creation. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a flag the kernel sets when the device is in a CoCo guest and requires DMA bounce buffering. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
The new umem UAPI gives every dmabuf-backed buffer (CQ ring, QP main/RQ/SQ, mlx5 doorbell record) its own ioctl attribute of type UVERBS_ATTR_UMEM whose payload is a single struct ib_uverbs_buffer_desc. Add a per-attribute helper, fill_attr_in_buf_umem(), that providers call once per buffer they want to register through the kernel. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a driver command-buffer chain argument to the private QP create helper ibv_cmd_create_qp_ex2() and build its command buffer with DECLARE_CMD_BUFFER_LINK_COMPAT(), mirroring ibv_cmd_create_cq_ex2(). This lets a provider link a driver-namespace command buffer onto the create-QP ioctl; later commits use it to attach per-buffer UMEM attributes (UVERBS_ATTR_CREATE_QP_*_BUF_UMEM). Update all existing callers (mlx4, mlx5, hns, mana, rxe, ionic) to pass NULL as they have no driver chain to attach. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add dmabuf_heap.c/h including a set of internal helpers for providers to allocate memory from Linux DMA-buf heaps (/dev/dma_heap/<name>). Add a helper to initialize "system_cc_shared" heap allocator. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Replace the separate void *buf and size_t length fields in struct mlx5_buf with an embedded struct ibv_buf, aligning the provider's internal buffer representation with the common ibv_buf abstraction. Link struct mlx5_buf to the ibv_buf API (pd, addr, size) instead of maintaining duplicate fields. Add a pd argument to the internal allocation helpers and initialize the embedded buffer through ibv_buf_init(), so each buffer records its owning protection domain. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
When a parent domain is created with ALLOW_CC_UNPROTECTED_ALLOC and the device reports CC_DMA_BOUNCE, open a dmabuf heap and use it for all provider-internal buffer allocations (CQ, QP, SRQ, RWQ, doorbell records) through the existing preferred allocation path. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Implement mlx5_alloc_buf_op()/mlx5_free_buf_op() using the existing struct mlx5_buf based allocation infrastructure. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Wire CQ and QP create paths to the new per-attribute UMEM UAPI: emit a struct ib_uverbs_buffer_desc for each dmabuf-backed buffer (CQ ring, QP main/SQ, doorbell record) on the driver_attrs chain via fill_attr_in_buf_umem(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>
In preparation for the follow-up patch, move ctx->buf allocation later in pp_init_ctx(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add -U/--allow-cc-unprotected option for running in CoCo guests if device DMA requires unprotected/shared memory. When set, create a parent domain with ALLOW_CC_UNPROTECTED_ALLOC and use ibv_alloc_buf()/ibv_free_buf()/ibv_reg_buf_mr() for MR buffer allocation and registration. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Expose IBV_DEVICE_CC_DMA_BOUNCE to pyverbs so the extended device capability can be queried from Python. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Expose IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC to pyverbs so it can be passed in the parent domain init comp_mask. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add a comp_mask argument to ParentDomainInitAttr so callers can request IBV_PARENT_DOMAIN_INIT_ATTR_ALLOW_CC_UNPROTECTED_ALLOC, the opt-in used by DMA-bounce devices on Confidential Computing (CoCo) guests. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Wrap the provider-aware buffer API (ibv_alloc_buf, ibv_free_buf, ibv_reg_buf_mr) with new Buf and BufMR classes. Buf owns a buffer allocated through a PD and is tracked in a per-PD weakset, so it is torn down before the PD it belongs to. BufMR registers an MR over a (sub)range of a Buf and deregisters it on close without freeing the underlying buffer. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
MREx.close() called the base close() and then reset its own cdef object member via 'self.dmah = None'. That assignment is both redundant and unsafe during deallocation. Cython's generated subclass deallocator clears the subclass cdef object members (dmah) before chaining to the base deallocator that runs the inherited MR.__dealloc__ -> self.close(). So when close() runs at deallocation time, dmah is already NULL and 'self.dmah = None' does an unguarded DECREF on NULL, segfaulting. This only stayed hidden while every MREx was closed explicitly first (which sets self.mr = NULL and makes the deallocation-time close() skip its body); an MREx reclaimed by the garbage collector crashes. Releasing dmah here is unnecessary: tp_dealloc already drops the reference, and MR.close() performs the ibv_dereg_mr(). Drop the override and inherit MR.close(), matching the other MR subclasses (e.g. DMMR) that do not reset their extra members in close(). Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Add test_buf.py exercising the ibv_buf API (ibv_alloc_buf, ibv_reg_buf_mr, ibv_free_buf), and the ibv_reg_mr_ex() IBV_REG_MR_MASK_BUF path, over both a plain PD and a parent domain created with ALLOW_CC_UNPROTECTED_ALLOC, in API-only and RC/UD traffic variants. Add an is_cq_ex option to the rdma_traffic and atomic_traffic helpers so the tests can drive an extended CQ. Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Allocation of system_cc_dmabuf is only carried out if CoCo DMA bounce feature in device, and cc unprotected attribute are both present. Proper cleanup is added to cleanup path. Signed-off-by: David Hu <xuehaohu@google.com>
Signed-off-by: David Hu <xuehaohu@google.com>
Move common allocation logic into shared allocator function __irdma_alloc_buf. As a result, uAPI irdma_ualloc_buf is updated accordingly. Proper cleanup paths are also updated accordingly. Signed-off-by: David Hu <xuehaohu@google.com>
For simplicity, irdma legacy ABI is updated to support passing dmabuf fd to irdma driver. Alternatively, a major write is needed to migrate queues ops via IOCTL ABI. Signed-off-by: David Hu <xuehaohu@google.com>
08ba9c9 to
8e618f9
Compare
Cleanup path is also properly updated to be dmabuf aware Signed-off-by: David Hu <xuehaohu@google.com>
Cleanup path is also updated accordingly Signed-off-by: David Hu <xuehaohu@google.com>
8e618f9 to
fbd917c
Compare
Cleanup path is also updated accordingly Signed-off-by: David Hu <xuehaohu@google.com>
|
Hi, Thanks for getting a jump on this. We should probably work out some of the uAPI details since irdma is a bit unusual in that regard. As you are aware, irdma handles user-allocated QP/CQ/SRQ ring buffers differently from other drivers (from what I can tell). For irdma, QP/CQ/SRQ creation is a two step process where the ring buffers are allocated and registered as special memory regions (indicated via flags in udata) using the normal reg_mr command prior to issuing the actual ibv_cmd_create_qp/cq/srq call. The driver maintains an internal registry to correlate these special MRs with their child ring object when the object is created. So, I think we have a few options:
Option 1 is the easiest and probably carries less risk, but I am not sure if the maintainers would prefer option 2 since it's a more standard way. As a side note, we will probably need to fix a few irdma kernel driver issues before adding new features:
|
|
SG to clarify and align on uAPIs. Good call. The simpler the better - Therefore, I also prefer option 1. Option 2 appear a reasonable step forward to converge with standard ways. Though, non-trivial work to support IOCTl ABIs appears needed. Simply modifying the legacy write syscall ABI structs appears a plausible middle ground. A recent patch for fixing a vulnerability on user triggered NULL deref appears specific to the two step process and a counter argument against the two step process (I think your side is also spot on). Though I don't think the two step process is bad. Converging with standard ways is probably the strongest argument, if the alignment is option 2. For VA dmabuf, it sounds interesting. I'd be curious to learn more. My limited understanding is that kernel side needs a dmabuf fd to cast it to dmabuf structure. Maybe there is no need to cast it at all in kernel with the mmapped approach. |
[Draft]: Implement CoCo DMA Bounce buffer APIs defined in #1748
TODOs:
irdma-abi.hupdate to be dmabuf awareirdmadriver support.cc: @jakemoroni @jpirko