[model] support Unlimited_OCR #9645
- Add MLLMModelType.unlimited_ocr and MLLMTemplateType.unlimited_ocr
- Add UnlimitedOCRLoader with multi-GPU device_map patch
- Fix torch.cat device mismatch for image_newline/view_seperator
- Fix masked_scatter_ device mismatch caused by hard-coded .cuda()
- Add UnlimitedOCR template inheriting from DeepseekOCR
- Override image_placeholder to remove trailing newline
- Add _fix_device() for parameter device alignment
- Register model: PaddlePaddle/Unlimited-OCR
Tested: LoRA fine-tuning on LaTeX_OCR dataset with 8x GPU,
inference verified with 4/5 exact match on validation set.
Code Review
This pull request adds support for the unlimited-ocr model, including its model type, architecture registration, loader, and template. Key feedback points out a potential race condition when dynamically patching global PyTorch functions inside the forward pass, a redundancy in calling super() in UnlimitedOCRLoader.get_model which bypasses the parent class's patching logic, and an inconsistency in using ModelArch instead of MLLMModelArch during architecture registration.
def _apply_multi_gpu_patch():
    """
    Fixed two bugs affecting `UnlimitedOCRModel` in multi-GPU scenarios using `device_map='auto'`:

    Bug 1 - Device mismatch in `torch.cat`:
    `image_newline` and `view_seperator` are `nn.Parameter`s;
    under `device_map='auto'`, their device placement might not align
    with the image features.

    Bug 2 - Device mismatch in `masked_scatter_`:
    Hard-coded `.cuda()` usage caused a conflict where `images_in_this_batch`
    resided on the projector's device (e.g., `cuda:7`),
    while `inputs_embeds` resided on the device hosting `embed_tokens` (e.g., `cuda:0`).

    Fix strategy: Temporarily replace `torch.cat` and `torch.Tensor.masked_scatter_` during the forward pass
    to handle device placement automatically, then restore the original methods after execution.
    """
    import sys
    import torch

    modeling_module = None
    for mod_name, mod in sys.modules.items():
        if 'modeling_unlimitedocr' in mod_name:
            modeling_module = mod
            break

    if modeling_module is None:
        return False

    UnlimitedOCRModel = getattr(modeling_module, 'UnlimitedOCRModel', None)
    if UnlimitedOCRModel is None:
        return False

    # Avoid redundant patching
    if getattr(UnlimitedOCRModel, '_swift_multi_gpu_patched', False):
        return True

    _original_forward = UnlimitedOCRModel.forward

    def _patched_forward(self, *args, **kwargs):
        _orig_cat = torch.cat
        _orig_masked_scatter_ = torch.Tensor.masked_scatter_

        def _safe_cat(tensors, dim=0, **cat_kwargs):
            # Using the device of the first tensor as the reference, the others are aligned to it.
            ref_device = None
            for t in tensors:
                if isinstance(t, torch.Tensor):
                    ref_device = t.device
                    break
            if ref_device is None:
                return _orig_cat(tensors, dim, **cat_kwargs)
            aligned = [
                t.to(ref_device) if isinstance(t, torch.Tensor) and t.device != ref_device else t for t in tensors
            ]
            return _orig_cat(aligned, dim, **cat_kwargs)

        def _safe_masked_scatter_(tensor_self, mask, source):
            # Use the device of tensor_self (inputs_embeds[idx]) as the reference.
            dev = tensor_self.device
            if mask.device != dev:
                mask = mask.to(dev)
            if source.device != dev:
                source = source.to(dev)
            return _orig_masked_scatter_(tensor_self, mask, source)

        # Simultaneously replace the module namespace and the global scope (double insurance).
        modeling_module.torch.cat = _safe_cat
        torch.cat = _safe_cat
        torch.Tensor.masked_scatter_ = _safe_masked_scatter_
        try:
            return _original_forward(self, *args, **kwargs)
        finally:
            # Restore the state to avoid contaminating other modules.
            modeling_module.torch.cat = _orig_cat
            torch.cat = _orig_cat
            torch.Tensor.masked_scatter_ = _orig_masked_scatter_

    UnlimitedOCRModel.forward = _patched_forward
    UnlimitedOCRModel._swift_multi_gpu_patched = True
    return True
Dynamically patching global functions like torch.cat and torch.Tensor.masked_scatter_ inside the forward pass and restoring them in a finally block is highly prone to race conditions in concurrent/multi-threaded environments (e.g., multi-threaded inference servers). If two threads execute the forward pass concurrently, they can overwrite each other's saved original functions, leading to permanent global pollution of torch.cat with the patched version.
Instead, apply the safe wrapper patches globally once during the initialization of the patch. This completely avoids any race conditions and simplifies the forward pass logic.
@staticmethod
def _apply_multi_gpu_patch():
    """
    Fixed two bugs affecting `UnlimitedOCRModel` in multi-GPU scenarios using `device_map='auto'`:

    Bug 1 - Device mismatch in `torch.cat`:
    `image_newline` and `view_seperator` are `nn.Parameter`s;
    under `device_map='auto'`, their device placement might not align
    with the image features.

    Bug 2 - Device mismatch in `masked_scatter_`:
    Hard-coded `.cuda()` usage caused a conflict where `images_in_this_batch`
    resided on the projector's device (e.g., `cuda:7`),
    while `inputs_embeds` resided on the device hosting `embed_tokens` (e.g., `cuda:0`).

    Fix strategy: Replace `torch.cat` and `torch.Tensor.masked_scatter_` globally once
    with thread-safe wrappers that handle device alignment automatically.
    """
    import sys
    import torch

    modeling_module = None
    for mod_name, mod in sys.modules.items():
        if 'modeling_unlimitedocr' in mod_name:
            modeling_module = mod
            break

    if modeling_module is None:
        return False

    UnlimitedOCRModel = getattr(modeling_module, 'UnlimitedOCRModel', None)
    if UnlimitedOCRModel is None:
        return False

    # Avoid redundant patching
    if getattr(UnlimitedOCRModel, '_swift_multi_gpu_patched', False):
        return True

    _orig_cat = torch.cat

    def _safe_cat(tensors, dim=0, **cat_kwargs):
        ref_device = None
        has_mismatch = False
        for t in tensors:
            if isinstance(t, torch.Tensor):
                if ref_device is None:
                    ref_device = t.device
                elif t.device != ref_device:
                    has_mismatch = True
        if not has_mismatch or ref_device is None:
            return _orig_cat(tensors, dim, **cat_kwargs)
        aligned = [
            t.to(ref_device) if isinstance(t, torch.Tensor) and t.device != ref_device else t for t in tensors
        ]
        return _orig_cat(aligned, dim, **cat_kwargs)

    _orig_masked_scatter_ = torch.Tensor.masked_scatter_

    def _safe_masked_scatter_(tensor_self, mask, source):
        dev = tensor_self.device
        if mask.device != dev:
            mask = mask.to(dev)
        if source.device != dev:
            source = source.to(dev)
        return _orig_masked_scatter_(tensor_self, mask, source)

    torch.cat = _safe_cat
    if hasattr(modeling_module, 'torch'):
        modeling_module.torch.cat = _safe_cat
    torch.Tensor.masked_scatter_ = _safe_masked_scatter_

    UnlimitedOCRModel._swift_multi_gpu_patched = True
    return True
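
The race described in this comment can be reproduced deterministically without threads by replaying one unlucky interleaving by hand. The sketch below is illustrative only: `shared` stands in for a global module such as `torch`, and `PatchedCall` is a hypothetical helper that mimics the save/patch/restore sequence of the per-forward approach. None of these names come from the PR.

```python
import types

# Stand-in for a shared module such as `torch`; `cat` is a hypothetical global function.
shared = types.SimpleNamespace(cat=lambda xs: list(xs))
original_cat = shared.cat


class PatchedCall:
    """Mimics one forward pass that saves, patches, and later restores `shared.cat`."""

    def enter(self):
        # Save whatever is installed right now (not necessarily the true original).
        self.saved = shared.cat
        shared.cat = lambda xs: ["patched"] + list(xs)

    def exit(self):
        # Equivalent of the `finally` block: put back what was saved.
        shared.cat = self.saved


a, b = PatchedCall(), PatchedCall()
a.enter()  # call A saves the original and installs its wrapper
b.enter()  # call B saves A's wrapper, believing it is the original
a.exit()   # A restores the real original
b.exit()   # B "restores" A's wrapper, so the patch is now permanent

leaked = shared.cat is not original_cat
print(leaked)
```

With this ordering, `shared.cat` still points at a wrapper after both calls have finished, which is the "permanent global pollution" the comment warns about. Installing the wrapper once at patch time removes the save/restore step and therefore this interleaving.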

register_model_arch(
    MultiModelKeys(
        ModelArch.unlimited_ocr,

self.auto_model_cls = self.auto_model_cls or AutoModel
model = super(DeepseekOCRLoader, self).get_model(model_dir, *args, **kwargs)
patch_output_clone(model.model.embed_tokens)
patch_output_to_input_device(model.model.sam_model)
patch_output_to_input_device(getattr(model.model, self.visual_name))
patch_output_to_input_device(model.model.projector)
patch_output_to_input_device(model.model)
Since UnlimitedOCRLoader inherits from DeepseekOCRLoader, calling super(DeepseekOCRLoader, self).get_model(...) bypasses DeepseekOCRLoader.get_model and duplicates all of its patching logic. This is highly redundant and hard to maintain.
Instead, use super().get_model(...) to let DeepseekOCRLoader apply its patches, and then simply apply the additional patch_output_to_input_device(model.model) patch.
- self.auto_model_cls = self.auto_model_cls or AutoModel
- model = super(DeepseekOCRLoader, self).get_model(model_dir, *args, **kwargs)
- patch_output_clone(model.model.embed_tokens)
- patch_output_to_input_device(model.model.sam_model)
- patch_output_to_input_device(getattr(model.model, self.visual_name))
- patch_output_to_input_device(model.model.projector)
- patch_output_to_input_device(model.model)
+ model = super().get_model(model_dir, *args, **kwargs)
+ patch_output_to_input_device(model.model)
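
To illustrate why the two `super` forms behave differently, here is a minimal sketch with toy classes. `BaseLoader`, `ParentLoader`, `ChildSkipping`, and `ChildChaining` are hypothetical stand-ins (with `ParentLoader` playing the role of `DeepseekOCRLoader`), and the returned lists merely record which `get_model` bodies ran. This is not the ms-swift implementation.

```python
class BaseLoader:
    def get_model(self):
        return ["base"]


class ParentLoader(BaseLoader):  # plays the role of DeepseekOCRLoader
    def get_model(self):
        model = super().get_model()
        model.append("parent patches")
        return model


class ChildSkipping(ParentLoader):
    def get_model(self):
        # Starts the MRO lookup *after* ParentLoader, so ParentLoader.get_model never runs
        # and its patches would have to be duplicated here.
        model = super(ParentLoader, self).get_model()
        model.append("child patch")
        return model


class ChildChaining(ParentLoader):
    def get_model(self):
        # Zero-argument super() resolves to ParentLoader.get_model, so its patches are reused.
        model = super().get_model()
        model.append("child patch")
        return model


skipping = ChildSkipping().get_model()
chaining = ChildChaining().get_model()
print(skipping)
print(chaining)
```

In the skipping variant the parent's step is absent from the result, which is why the original code had to repeat every `patch_output_*` call itself.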
Support for PaddlePaddle/Unlimited-OCR.

Changes:
- swift/model/constant.py: Add MLLMModelType.unlimited_ocr
- swift/model/model_arch.py: Add MLLMModelArch.unlimited_ocr
- swift/model/models/deepseek.py: Add UnlimitedOCRLoader with multi-GPU device_map patch
- swift/template/constant.py: Add MLLMTemplateType.unlimited_ocr
- swift/template/templates/deepseek.py: Add UnlimitedOCR template

Usage:

swift sft \
    --model PaddlePaddle/Unlimited-OCR \
    --model_type unlimited-ocr \
    --template unlimited_ocr \
    --dataset AI-ModelScope/LaTeX_OCR \
    --lazy_tokenize true

swift infer \
    --adapters <checkpoint> \
    --load_data_args true \
    --stream true