
[model] support Unlimited_OCR#9645

Merged
Jintao-Huang merged 6 commits into modelscope:main from z0o0ey:support_unlimited_ocr
Jun 26, 2026

@z0o0ey z0o0ey commented Jun 25, 2026

Collaborator

Support for PaddlePaddle/Unlimited-OCR.

Changes:

  • swift/model/constant.py: Add MLLMModelType.unlimited_ocr
  • swift/model/model_arch.py: Add MLLMModelArch.unlimited_ocr
  • swift/model/models/deepseek.py: Add UnlimitedOCRLoader with multi-GPU device_map patch
  • swift/template/constant.py: Add MLLMTemplateType.unlimited_ocr
  • swift/template/templates/deepseek.py: Add UnlimitedOCR template

Usage:

swift sft \
    --model PaddlePaddle/Unlimited-OCR \
    --model_type unlimited-ocr \
    --template unlimited_ocr \
    --dataset AI-ModelScope/LaTeX_OCR \
    --lazy_tokenize true

swift infer \
    --adapters <checkpoint> \
    --load_data_args true \
    --stream true

- Add MLLMModelType.unlimited_ocr and MLLMTemplateType.unlimited_ocr
- Add UnlimitedOCRLoader with multi-GPU device_map patch
  - Fix torch.cat device mismatch for image_newline/view_seperator
  - Fix masked_scatter_ device mismatch caused by hard-coded .cuda()
- Add UnlimitedOCR template inheriting from DeepseekOCR
  - Override image_placeholder to remove trailing newline
  - Add _fix_device() for parameter device alignment
- Register model: PaddlePaddle/Unlimited-OCR

Tested: LoRA fine-tuning on LaTeX_OCR dataset with 8x GPU,
        inference verified with 4/5 exact match on validation set.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request adds support for the unlimited-ocr model, including its model type, architecture registration, loader, and template. Key feedback points out a potential race condition when dynamically patching global PyTorch functions inside the forward pass, a redundancy in calling super() in UnlimitedOCRLoader.get_model which bypasses the parent class's patching logic, and an inconsistency in using ModelArch instead of MLLMModelArch during architecture registration.


Comment on lines +384 to +464
    def _apply_multi_gpu_patch():
        """
        Fixed two bugs affecting `UnlimitedOCRModel` in multi-GPU scenarios using `device_map='auto'`:

        Bug 1 - Device mismatch in `torch.cat`:
            `image_newline` and `view_seperator` are `nn.Parameter`s;
            under `device_map='auto'`, their device placement might not align
            with the image features.

        Bug 2 - Device mismatch in `masked_scatter_`:
            Hard-coded `.cuda()` usage caused a conflict where `images_in_this_batch`
            resided on the projector's device (e.g., `cuda:7`),
            while `inputs_embeds` resided on the device hosting `embed_tokens` (e.g., `cuda:0`).

        Fix strategy: Temporarily replace `torch.cat` and `torch.Tensor.masked_scatter_` during the forward pass
        to handle device placement automatically, then restore the original methods after execution.
        """
        import sys
        import torch

        modeling_module = None
        for mod_name, mod in sys.modules.items():
            if 'modeling_unlimitedocr' in mod_name:
                modeling_module = mod
                break

        if modeling_module is None:
            return False

        UnlimitedOCRModel = getattr(modeling_module, 'UnlimitedOCRModel', None)
        if UnlimitedOCRModel is None:
            return False

        # Avoid redundant patching
        if getattr(UnlimitedOCRModel, '_swift_multi_gpu_patched', False):
            return True

        _original_forward = UnlimitedOCRModel.forward

        def _patched_forward(self, *args, **kwargs):
            _orig_cat = torch.cat
            _orig_masked_scatter_ = torch.Tensor.masked_scatter_

            def _safe_cat(tensors, dim=0, **cat_kwargs):
                # Using the device of the first tensor as the reference, the others are aligned to it.
                ref_device = None
                for t in tensors:
                    if isinstance(t, torch.Tensor):
                        ref_device = t.device
                        break
                if ref_device is None:
                    return _orig_cat(tensors, dim, **cat_kwargs)
                aligned = [
                    t.to(ref_device) if isinstance(t, torch.Tensor) and t.device != ref_device else t for t in tensors
                ]
                return _orig_cat(aligned, dim, **cat_kwargs)

            def _safe_masked_scatter_(tensor_self, mask, source):
                # Use the device of tensor_self (inputs_embeds[idx]) as the reference.
                dev = tensor_self.device
                if mask.device != dev:
                    mask = mask.to(dev)
                if source.device != dev:
                    source = source.to(dev)
                return _orig_masked_scatter_(tensor_self, mask, source)

            # Simultaneously replace the module namespace and the global scope (double insurance).
            modeling_module.torch.cat = _safe_cat
            torch.cat = _safe_cat
            torch.Tensor.masked_scatter_ = _safe_masked_scatter_
            try:
                return _original_forward(self, *args, **kwargs)
            finally:
                # Restore the state to avoid contaminating other modules.
                modeling_module.torch.cat = _orig_cat
                torch.cat = _orig_cat
                torch.Tensor.masked_scatter_ = _orig_masked_scatter_

        UnlimitedOCRModel.forward = _patched_forward
        UnlimitedOCRModel._swift_multi_gpu_patched = True
        return True
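
The `_swift_multi_gpu_patched` guard in the hunk above is what keeps repeated calls from stacking wrappers on `forward`. A minimal sketch of that effect, using a hypothetical stand-in class instead of `UnlimitedOCRModel` (the names `make_model_cls`, `apply_patch`, `count_layers` and the `_patched` flag are illustrative and not part of the PR):

```python
# Counts how many wrapper layers run for a single forward() call.
wrapper_calls = 0


def make_model_cls():
    """Return a fresh stand-in model class (hypothetical, not UnlimitedOCRModel)."""
    class Model:
        def forward(self, x):
            return x + 1
    return Model


def apply_patch(model_cls, use_guard):
    """Wrap model_cls.forward; with use_guard=True, skip if already wrapped."""
    if use_guard and getattr(model_cls, "_patched", False):
        return True
    original_forward = model_cls.forward

    def patched_forward(self, x):
        global wrapper_calls
        wrapper_calls += 1  # one increment per wrapper layer traversed
        return original_forward(self, x)

    model_cls.forward = patched_forward
    model_cls._patched = True
    return True


def count_layers(use_guard, times=3):
    """Apply the patch `times` times, then report (result, wrapper layers executed)."""
    global wrapper_calls
    wrapper_calls = 0
    cls = make_model_cls()
    for _ in range(times):
        apply_patch(cls, use_guard)
    return cls().forward(1), wrapper_calls


guarded = count_layers(use_guard=True)
unguarded = count_layers(use_guard=False)
print(guarded)    # (2, 1)
print(unguarded)  # (2, 3)
```

The result is the same either way, but without the guard every extra call adds another layer that runs on each forward pass.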

Contributor


high

Dynamically patching global functions like torch.cat and torch.Tensor.masked_scatter_ inside the forward pass and restoring them in a finally block is highly prone to race conditions in concurrent/multi-threaded environments (e.g., multi-threaded inference servers). If two threads execute the forward pass concurrently, they can overwrite each other's saved original functions, leading to permanent global pollution of torch.cat with the patched version.

Instead, apply the safe wrapper patches globally once during the initialization of the patch. This completely avoids any race conditions and simplifies the forward pass logic.

    @staticmethod
    def _apply_multi_gpu_patch():
        """
        Fixed two bugs affecting `UnlimitedOCRModel` in multi-GPU scenarios using `device_map='auto'`:

        Bug 1 - Device mismatch in `torch.cat`:
            `image_newline` and `view_seperator` are `nn.Parameter`s;
            under `device_map='auto'`, their device placement might not align
            with the image features.

        Bug 2 - Device mismatch in `masked_scatter_`:
            Hard-coded `.cuda()` usage caused a conflict where `images_in_this_batch`
            resided on the projector's device (e.g., `cuda:7`),
            while `inputs_embeds` resided on the device hosting `embed_tokens` (e.g., `cuda:0`).

        Fix strategy: Replace `torch.cat` and `torch.Tensor.masked_scatter_` globally once
        with thread-safe wrappers that handle device alignment automatically.
        """
        import sys
        import torch

        modeling_module = None
        for mod_name, mod in sys.modules.items():
            if 'modeling_unlimitedocr' in mod_name:
                modeling_module = mod
                break

        if modeling_module is None:
            return False

        UnlimitedOCRModel = getattr(modeling_module, 'UnlimitedOCRModel', None)
        if UnlimitedOCRModel is None:
            return False

        # Avoid redundant patching
        if getattr(UnlimitedOCRModel, '_swift_multi_gpu_patched', False):
            return True

        _orig_cat = torch.cat
        def _safe_cat(tensors, dim=0, **cat_kwargs):
            ref_device = None
            has_mismatch = False
            for t in tensors:
                if isinstance(t, torch.Tensor):
                    if ref_device is None:
                        ref_device = t.device
                    elif t.device != ref_device:
                        has_mismatch = True
            if not has_mismatch or ref_device is None:
                return _orig_cat(tensors, dim, **cat_kwargs)
            aligned = [
                t.to(ref_device) if isinstance(t, torch.Tensor) and t.device != ref_device else t for t in tensors
            ]
            return _orig_cat(aligned, dim, **cat_kwargs)

        _orig_masked_scatter_ = torch.Tensor.masked_scatter_
        def _safe_masked_scatter_(tensor_self, mask, source):
            dev = tensor_self.device
            if mask.device != dev:
                mask = mask.to(dev)
            if source.device != dev:
                source = source.to(dev)
            return _orig_masked_scatter_(tensor_self, mask, source)

        torch.cat = _safe_cat
        if hasattr(modeling_module, 'torch'):
            modeling_module.torch.cat = _safe_cat
        torch.Tensor.masked_scatter_ = _safe_masked_scatter_

        UnlimitedOCRModel._swift_multi_gpu_patched = True
        return True

Comment thread swift/model/model_arch.py Outdated

register_model_arch(
    MultiModelKeys(
        ModelArch.unlimited_ocr,

Contributor


medium

For consistency with all other model architecture registrations in this file, use MLLMModelArch.unlimited_ocr instead of ModelArch.unlimited_ocr.

Suggested change
-        ModelArch.unlimited_ocr,
+        MLLMModelArch.unlimited_ocr,

Comment on lines +472 to +478
    self.auto_model_cls = self.auto_model_cls or AutoModel
    model = super(DeepseekOCRLoader, self).get_model(model_dir, *args, **kwargs)
    patch_output_clone(model.model.embed_tokens)
    patch_output_to_input_device(model.model.sam_model)
    patch_output_to_input_device(getattr(model.model, self.visual_name))
    patch_output_to_input_device(model.model.projector)
    patch_output_to_input_device(model.model)

Contributor


medium

Since UnlimitedOCRLoader inherits from DeepseekOCRLoader, calling super(DeepseekOCRLoader, self).get_model(...) bypasses DeepseekOCRLoader.get_model and duplicates all of its patching logic. This is highly redundant and hard to maintain.

Instead, use super().get_model(...) to let DeepseekOCRLoader apply its patches, and then simply apply the additional patch_output_to_input_device(model.model) patch.

Suggested change
-    self.auto_model_cls = self.auto_model_cls or AutoModel
-    model = super(DeepseekOCRLoader, self).get_model(model_dir, *args, **kwargs)
-    patch_output_clone(model.model.embed_tokens)
-    patch_output_to_input_device(model.model.sam_model)
-    patch_output_to_input_device(getattr(model.model, self.visual_name))
-    patch_output_to_input_device(model.model.projector)
-    patch_output_to_input_device(model.model)
+    model = super().get_model(model_dir, *args, **kwargs)
+    patch_output_to_input_device(model.model)
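
For readers unfamiliar with the two-argument form, here is a small sketch of why `super(DeepseekOCRLoader, self)` skips the parent's `get_model`. The classes are hypothetical stand-ins (`ParentLoader` plays the role of `DeepseekOCRLoader`), and the "patches" are plain strings rather than the real patch functions:

```python
class BaseLoader:
    def get_model(self):
        # Returns a bare "model" with no patches applied yet.
        return {"patches": []}


class ParentLoader(BaseLoader):  # plays the role of DeepseekOCRLoader
    def get_model(self):
        model = super().get_model()
        model["patches"].append("parent_patch")
        return model


class SkippingChild(ParentLoader):
    def get_model(self):
        # Two-argument super starts the MRO lookup *after* ParentLoader,
        # so ParentLoader.get_model never runs.
        model = super(ParentLoader, self).get_model()
        model["patches"].append("child_patch")
        return model


class CooperativeChild(ParentLoader):
    def get_model(self):
        # Zero-argument super runs ParentLoader.get_model first.
        model = super().get_model()
        model["patches"].append("child_patch")
        return model


skipping = SkippingChild().get_model()["patches"]
cooperative = CooperativeChild().get_model()["patches"]
print(skipping)     # ['child_patch']
print(cooperative)  # ['parent_patch', 'child_patch']
```

With the skipping form, every parent patch has to be copied into the child by hand, which is the duplication this comment flags.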

@Jintao-Huang Jintao-Huang merged commit 0c3e6ea into modelscope:main Jun 26, 2026
2 of 3 checks passed

2 participants