Skip to content

ryanlin10/SyLLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SyLLM — Structured Reasoning LLMs with Formal Verification

Teaching LLMs to reason correctly, verifiably, and by construction.

License: MIT Python 3.8+ PyTorch Z3 vLLM


What if every inference step an LLM produces could be formally verified? What if the training data itself were provably sound — with no fabricated reasoning, no hallucinated premises, and no broken proof steps?

SyLLM is an end-to-end research system that fine-tunes LLMs to produce structured, tag-delimited semi-formal language proofs — and then verifies every step using a Z3 SMT solver. Training data is generated via backward proof chain construction, guaranteeing soundness by construction. Reinforcement learning uses an outcome-gated process reward that scores both the correctness of conclusions and the soundness of intermediate reasoning steps.


Why This Matters

Modern LLMs can write convincing reasoning that is subtly wrong. Chain-of-thought prompting generates plausible-looking steps that can violate logic rules, contradict each other, or hallucinate unseen premises. SyLLM attacks this problem at three levels:

  1. Data: Synthetic proofs generated by backward natural deduction are sound by construction — every premise is essential, every step is justified by an explicit inference rule.
  2. Training: Two-stage LoRA fine-tuning teaches the model what to conclude (Stage 0), then how to write a structured proof trace (Stage 1).
  3. Reinforcement Learning: GRPO with outcome-gated process rewards reinforces correct verdicts while penalizing wrong answers regardless of how "well-reasoned" they appear.

Key Innovations

Backward Natural Deduction Chain Generation

Standard forward data generation risks fabricated premises and dead-end derivations. SyLLM uses goal-directed backward construction: start from the desired conclusion, recursively choose a rule that derives it, and construct the required sub-goals — guaranteeing every premise is load-bearing and the entire chain is valid.

All 13 natural deduction rules are implemented, including discharge rules that create ASSUME/DERIVE/DISCHARGE proof blocks:

Category Rules
Conjunction AND_INTRO, AND_ELIM
Disjunction OR_INTRO, OR_ELIM (discharge)
Implication IMPLIES_INTRO (discharge), IMPLIES_ELIM
Negation NOT_INTRO (discharge), NOT_ELIM
Biconditional IFF_INTRO, IFF_ELIM
First-Order FORALL_INTRO, FORALL_ELIM, EXISTS_INTRO, EXISTS_ELIM (discharge)

Every generated chain is independently verified by Z3 as a backstop.

Outcome-Gated Process Reward

A key insight in GRPO training: process quality (sound intermediate steps) should not be able to compensate for a wrong answer. SyLLM implements an outcome-gated reward:

Correct answer:   reward = 0.7 × (+1.0) + 0.3 × log(1 + sound_steps)
Wrong answer:     reward = 0.7 × (−1.0)    ← no process offset
No verdict:       reward = 0.3 × log(1 + sound_steps)  ← cold-start gradient

This prevents the model from gaming the reward by writing elaborate but incorrect reasoning. The process reward uses a log scale to prevent an arms race in step count.

Parallel Z3 Verification in the RL Loop

Each GRPO step scores 8–16 model generations for soundness. Z3's Python bindings use a global shared context that is not thread-safe — using a thread pool causes C++ assertion violations. SyLLM uses a per-call ProcessPoolExecutor with the spawn start method, giving each worker an isolated Z3 context. Workers exist only during the ~1–2 seconds of actual Z3 work and are torn down immediately, preventing the idle-worker OS kill that would otherwise corrupt the training loop.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        DATA PIPELINE                            │
│                                                                 │
│  syntax_tree.py          chain_generator.py                     │
│  ┌──────────────┐        ┌──────────────────────────────────┐   │
│  │ FormulaNode  │──────▶ │ Backward ND Construction         │   │
│  │ (AST)        │        │ _justify(goal, depth)            │   │
│  └──────────────┘        │ 13 rules + Z3 verification       │   │
│         │                └──────────────┬───────────────────┘   │
│  nl_renderer.py                         │                        │
│  ┌──────────────┐         ┌─────────────▼──────────────┐        │
│  │ AST → NL     │         │ Stage 0 JSONL              │        │
│  │ {if P, Q}    │         │ (premises + conclusion)    │        │
│  └──────────────┘         ├────────────────────────────┤        │
│                           │ Stage 1 JSONL              │        │
│                           │ (full proof trace)         │        │
│                           └────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                       TRAINING PIPELINE                         │
│                                                                 │
│  Stage 0 SFT               Stage 1 SFT            GRPO RL       │
│  ┌─────────────┐           ┌─────────────┐       ┌───────────┐  │
│  │ lora_       │──────────▶│ lora_       │──────▶│ train_    │  │
│  │ finetune.py │           │ finetune.py │       │ grpo.py   │  │
│  │ (LoRA r=16) │           │ Stage 1 data│       │           │  │
│  └─────────────┘           └─────────────┘       └─────┬─────┘  │
│                                                         │        │
│                                               ┌─────────▼──────┐ │
│                                               │ soundness_     │ │
│                                               │ reward.py      │ │
│                                               │ + Z3 scoring   │ │
│                                               └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    VERIFICATION PIPELINE                        │
│                                                                 │
│  Model output → Semi-formal parser → FormulaNode AST           │
│                                              │                  │
│                                    Z3Translator                 │
│                                              │                  │
│                                    z3.BoolRef expressions       │
│                                              │                  │
│                              z3.Solver().check() == z3.unsat   │
│                              (entailment check)                 │
└─────────────────────────────────────────────────────────────────┘

Semi-Formal Proof Format

The model outputs structured proofs using tag-delimited natural language with curly-bracket connectives:

<PREMISE> {if {the bridge is structurally sound} and {the load limit is respected}, then {traffic can cross}} </PREMISE>
<PREMISE> the bridge is structurally sound </PREMISE>
<PREMISE> the load limit is respected </PREMISE>
<ASSUME> {it is not the case that {traffic can cross}} </ASSUME>
<CONCLUSION> {traffic can cross} </CONCLUSION>
<DISCHARGE> {it is not the case that {traffic can cross}} </DISCHARGE>

Connectives: {A and B}, {A or B}, {if A, then B}, {A if and only if B}, {it is not the case that A}, {for all x, BODY}, {there exist x such that BODY}

The semi-formal parser is a recursive descent parser that converts these into FormulaNode ASTs, which are then translated to Z3 for formal verification.


Project Structure

SyLLM/
├── src/
│   ├── data/
│   │   ├── syntax_tree.py              # Polymorphic logic formula AST (prop + FOL)
│   │   ├── inference_generator.py      # 16 inference patterns (11 prop + 5 FOL)
│   │   ├── chain_generator.py          # Backward ND proof chain generator (13 rules)
│   │   ├── nl_renderer.py              # Formula AST → natural language
│   │   ├── atomic_proposition_generator.py  # Diverse propositions via OpenAI API
│   │   └── stage2_generator.py         # FOLIO dataset preparation
│   ├── training/
│   │   ├── grpo.py                     # GRPO trainer (869 LOC)
│   │   ├── soundness_reward.py         # Outcome-gated process reward function
│   │   └── finetune.py                 # LoRA fine-tuning utilities
│   ├── verification/
│   │   ├── verifier.py                 # Z3-based symbolic verification
│   │   ├── parser.py                   # Semi-formal recursive descent parser
│   │   └── translator.py               # FormulaNode → Z3 expressions
│   ├── inference/
│   │   └── predictor.py                # vLLM batched inference + LoRA loading
│   └── evaluation/
│       ├── evaluator.py                # Metrics: entailment acc, ECE, parse rate
│       └── benchmark_evaluator.py      # FOLIO, LogiQA benchmark runner
├── scripts/
│   ├── lora_finetune.py               # Main SFT entry point (~1100 LOC)
│   ├── train_grpo.py                  # GRPO RL entry point
│   ├── generate_chain_data.py         # Generate Stage 0 / Stage 1 chain data
│   ├── prepare_grpo_data.py           # Prepare FOLIO data for GRPO
│   ├── generate_logic_data.py         # Generate logic inference examples
│   └── evaluate.py                    # Evaluation runner
├── experiments/
│   ├── exp1_premise_perturbation/     # Sensitivity to premise changes
│   ├── exp2_positional_permutation/   # Stability to premise ordering
│   ├── exp3_monotonicity/             # Monotonicity violation detection
│   ├── exp4_lora_svd/                 # LoRA weight SVD analysis across stages
│   └── exp5_probing/                  # Linear probing of intermediate representations
├── config.yaml                        # Global configuration
└── requirements.txt                   # Dependencies

Installation

git clone https://github.com/ryanlin10/LLM-Syllogistic-Formal-Verification.git
cd LLM-Syllogistic-Formal-Verification
pip install -r requirements.txt

Requirements: Python 3.8+, PyTorch 2.0+, CUDA GPU (H100/H200 recommended for 24B models), Z3 SMT solver


Quick Start

1. Generate Proof Chain Training Data

# Stage 0: premises + conclusion only (10k examples, ~14MB)
python3 scripts/generate_chain_data.py --stage 0 -n 10000 -o ./data/

# Stage 1: full proof traces with intermediate steps (10k examples, ~26MB)
python3 scripts/generate_chain_data.py --stage 1 -n 10000 -o ./data/

2. Two-Stage Supervised Fine-Tuning

# Stage 0: Teach the model what to conclude
python3 scripts/lora_finetune.py \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --train-path ./data/chain_stage0_n10000.jsonl \
  --output-dir ./outputs/stage0 \
  --bf16 --4bit --use-wandb

# Stage 1: Teach the model how to write structured proofs
python3 scripts/lora_finetune.py \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --lora-adapter ./outputs/stage0/final \
  --train-path ./data/chain_stage1_n10000.jsonl \
  --output-dir ./outputs/stage1 \
  --bf16 --4bit --use-wandb

3. Prepare FOLIO Data for GRPO

python3 scripts/prepare_grpo_data.py \
  --output-path data/grpo_folio_prompted.jsonl

This loads 1202 examples from the tasksource/folio HuggingFace dataset, adds the 13-rule instruction preamble and few-shot demos, and writes GRPO-ready JSONL.

4. Reinforcement Learning with GRPO

python3 scripts/train_grpo.py \
  --model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
  --lora-adapter ./outputs/stage1/final \
  --train-path data/grpo_folio_prompted.jsonl \
  --output-dir ./outputs/grpo_folio \
  --task-type verdict --verifier-weight 0.3 --outcome-weight 0.7 \
  --group-size 8 --batch-size 2 \
  --max-prompt-length 2048 --max-gen-length 300 \
  --n-verify-workers 16 \
  --kl-coeff 0.1 --lr 5e-6 --epochs 3 \
  --bf16 --4bit --temperature 0.8 \
  --wandb-project grpo-folio

5. Inference

python3 scripts/inference_demo.py "All humans are mortal. Socrates is human. Therefore?"

6. Evaluate

python3 scripts/evaluate.py \
  --model_outputs outputs.jsonl \
  --ground_truth data/test.jsonl

Model Registry

Key Model Context
mistral-small-24b-instruct mistralai/Mistral-Small-3.2-24B-Instruct-2506 128K
deepseek-v3 deepseek-ai/deepseek-v3
llama3-8b-instruct meta-llama/Meta-Llama-3-8B-Instruct 8K

All models use q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj as LoRA target modules. Switch models:

python3 scripts/switch_model.py --list
python3 scripts/switch_model.py mistralai/Mistral-Small-3.2-24B-Instruct-2506

Experiments

Five experiments analyze and validate model behavior:

Experiment Question Key Finding
Exp 1: Premise Perturbation Does the model respond to factual changes? 86% → 74% accuracy after perturbation; 16% sensitivity rate — model generalizes, doesn't memorize
Exp 2: Positional Permutation Is reasoning stable to premise ordering? Tracks accuracy drop and stability by premise count
Exp 3: Monotonicity Does adding premises help/hurt correctly? Measures monotonicity violations
Exp 4: LoRA SVD Analysis What structure do LoRA matrices learn? Stage 0: gate_proj dominates (1.3 avg); Stage 1: 2.4× growth, richer attention; effective rank 8–14
Exp 5: Probing Which layers encode logical structure? Linear probing classifiers on intermediate representations

Exp 4 Highlights: LoRA Weight Structure

The SVD analysis of LoRA matrices (B·A for each layer and module) reveals the geometry of what the model learns at each training stage:

  • Stage 0 (conclusion-only training): Total Frobenius norm 146.3. gate_proj and up_proj dominate — the gating mechanism learns to identify relevant logical patterns.
  • Stage 1 (full proof trace): Total Frobenius norm 233.5. Attention modules grow substantially (2.4× in some layers). The model is developing structured attention over proof steps.
  • Top layers by activation (Stage 0): layers 37, 38, 36, 16, 21, 22, 19, 17, 35, 39 — reasoning concentrates in the upper third of the network.
  • Effective ranks: 8–14 across modules. v_proj is most concentrated (eff_rank 10.5 → 11.4), suggesting the value projection learns a compact, specialized subspace for logic.

Verification System

Model output
     │
     ▼
Semi-formal parser (recursive descent)
     │  {if P, then Q}, {P and Q}, {for all x, ...}
     ▼
FormulaNode AST
     │
     ▼
Z3Translator
     │  AtomNode → z3.Bool(name)
     │  BinaryNode(AND) → z3.And(l, r)
     │  QuantifiedNode(FORALL) → z3.ForAll([x], body)
     ▼
z3.Solver
     │  solver.add(*premises)
     │  solver.add(z3.Not(conclusion))
     ▼
z3.unsat → ENTAILED ✓
z3.sat   → NOT ENTAILED ✗

The parser handles free-text input ("Premise. Therefore, Conclusion.") and structured tag-based input (<PREMISE>...</PREMISE><CONCLUSION>...</CONCLUSION>). The Z3 translator maintains shared context so identical atom text maps to the same Boolean variable.


Evaluation Metrics

Metric Description
Entailment Accuracy Fraction of conclusions correctly classified as entailed/not-entailed
Verifier Calibration (ECE) Expected calibration error between verifier confidence and accuracy
Premise Precision/Recall Quality of generated premises against ground truth
Evidence Recall Fraction of premises with linked evidence spans
Parse Success Rate Fraction of model outputs successfully parsed into formal logic

Configuration

Config priority: .env > environment variables > config.yaml

Key environment variables: MODEL_NAME, DATA_DIR, OUTPUT_DIR

# config.yaml
model:
  base_model: "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
  lora_adapter: "./outputs/stage1/final"
data:
  train_path: "./data/train.jsonl"
  val_path: "./data/val.jsonl"

Performance Notes (H200 143GB)

  • LoRA efficiency: 101M / 24B trainable parameters (0.42%), r=16, α=32
  • 4-bit quantization: ~12GB model weights; full training footprint ~80–120GB
  • GRPO step speed: ~70–80s/step with group_size=8, batch_size=2, parallel Z3, Flash Attention 2
  • Z3 verification: ~0.03s for 16 segments in parallel (after worker startup); sequential would be ~400s for 80 segments

Requirements

torch>=2.0
transformers>=4.35
peft>=0.7
accelerate>=0.24
vllm>=0.6
z3-solver>=4.12
datasets>=2.14
wandb>=0.15
openai>=2.0
faiss-cpu
sentence-transformers>=2.2

License

MIT License. Ensure compliance with individual model licenses (Mistral, Llama, DeepSeek) for commercial use.


Citation

If you use SyLLM in your research:

@software{syllm2026,
  title  = {SyLLM: Structured Reasoning LLMs with Formal Verification},
  author = {Lin, Ryan},
  year   = {2026},
  url    = {https://github.com/ryanlin10/LLM-Syllogistic-Formal-Verification}
}

About

A natural language formal verification and reinforcement learning framework to enhance reasoning capabilities within language models.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages