What if every inference step an LLM produces could be formally verified? What if the training data itself were provably sound — with no fabricated reasoning, no hallucinated premises, and no broken proof steps?
SyLLM is an end-to-end research system that fine-tunes LLMs to produce structured, tag-delimited semi-formal language proofs — and then verifies every step using a Z3 SMT solver. Training data is generated via backward proof chain construction, guaranteeing soundness by construction. Reinforcement learning uses an outcome-gated process reward that scores both the correctness of conclusions and the soundness of intermediate reasoning steps.
Modern LLMs can write convincing reasoning that is subtly wrong. Chain-of-thought prompting generates plausible-looking steps that can violate logic rules, contradict each other, or hallucinate unseen premises. SyLLM attacks this problem at three levels:
- Data: Synthetic proofs generated by backward natural deduction are sound by construction — every premise is essential, every step is justified by an explicit inference rule.
- Training: Two-stage LoRA fine-tuning teaches the model what to conclude (Stage 0), then how to write a structured proof trace (Stage 1).
- Reinforcement Learning: GRPO with outcome-gated process rewards reinforces correct verdicts while penalizing wrong answers regardless of how "well-reasoned" they appear.
Standard forward data generation risks fabricated premises and dead-end derivations. SyLLM uses goal-directed backward construction: start from the desired conclusion, recursively choose a rule that derives it, and construct the required sub-goals — guaranteeing every premise is load-bearing and the entire chain is valid.
All 13 natural deduction rules are implemented, including discharge rules that create ASSUME/DERIVE/DISCHARGE proof blocks:
| Category | Rules |
|---|---|
| Conjunction | AND_INTRO, AND_ELIM |
| Disjunction | OR_INTRO, OR_ELIM (discharge) |
| Implication | IMPLIES_INTRO (discharge), IMPLIES_ELIM |
| Negation | NOT_INTRO (discharge), NOT_ELIM |
| Biconditional | IFF_INTRO, IFF_ELIM |
| First-Order | FORALL_INTRO, FORALL_ELIM, EXISTS_INTRO, EXISTS_ELIM (discharge) |
Every generated chain is independently verified by Z3 as a backstop.
A key insight in GRPO training: process quality (sound intermediate steps) should not be able to compensate for a wrong answer. SyLLM implements an outcome-gated reward:
Correct answer: reward = 0.7 × (+1.0) + 0.3 × log(1 + sound_steps)
Wrong answer: reward = 0.7 × (−1.0) ← no process offset
No verdict: reward = 0.3 × log(1 + sound_steps) ← cold-start gradient
This prevents the model from gaming the reward by writing elaborate but incorrect reasoning. The process reward uses a log scale to prevent an arms race in step count.
Each GRPO step scores 8–16 model generations for soundness. Z3's Python bindings use a global shared context that is not thread-safe — using a thread pool causes C++ assertion violations. SyLLM uses a per-call ProcessPoolExecutor with the spawn start method, giving each worker an isolated Z3 context. Workers exist only during the ~1–2 seconds of actual Z3 work and are torn down immediately, preventing the idle-worker OS kill that would otherwise corrupt the training loop.
┌─────────────────────────────────────────────────────────────────┐
│ DATA PIPELINE │
│ │
│ syntax_tree.py chain_generator.py │
│ ┌──────────────┐ ┌──────────────────────────────────┐ │
│ │ FormulaNode │──────▶ │ Backward ND Construction │ │
│ │ (AST) │ │ _justify(goal, depth) │ │
│ └──────────────┘ │ 13 rules + Z3 verification │ │
│ │ └──────────────┬───────────────────┘ │
│ nl_renderer.py │ │
│ ┌──────────────┐ ┌─────────────▼──────────────┐ │
│ │ AST → NL │ │ Stage 0 JSONL │ │
│ │ {if P, Q} │ │ (premises + conclusion) │ │
│ └──────────────┘ ├────────────────────────────┤ │
│ │ Stage 1 JSONL │ │
│ │ (full proof trace) │ │
│ └────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE │
│ │
│ Stage 0 SFT Stage 1 SFT GRPO RL │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ lora_ │──────────▶│ lora_ │──────▶│ train_ │ │
│ │ finetune.py │ │ finetune.py │ │ grpo.py │ │
│ │ (LoRA r=16) │ │ Stage 1 data│ │ │ │
│ └─────────────┘ └─────────────┘ └─────┬─────┘ │
│ │ │
│ ┌─────────▼──────┐ │
│ │ soundness_ │ │
│ │ reward.py │ │
│ │ + Z3 scoring │ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ VERIFICATION PIPELINE │
│ │
│ Model output → Semi-formal parser → FormulaNode AST │
│ │ │
│ Z3Translator │
│ │ │
│ z3.BoolRef expressions │
│ │ │
│ z3.Solver().check() == z3.unsat │
│ (entailment check) │
└─────────────────────────────────────────────────────────────────┘
The model outputs structured proofs using tag-delimited natural language with curly-bracket connectives:
<PREMISE> {if {the bridge is structurally sound} and {the load limit is respected}, then {traffic can cross}} </PREMISE>
<PREMISE> the bridge is structurally sound </PREMISE>
<PREMISE> the load limit is respected </PREMISE>
<ASSUME> {it is not the case that {traffic can cross}} </ASSUME>
<CONCLUSION> {traffic can cross} </CONCLUSION>
<DISCHARGE> {it is not the case that {traffic can cross}} </DISCHARGE>
Connectives: {A and B}, {A or B}, {if A, then B}, {A if and only if B}, {it is not the case that A}, {for all x, BODY}, {there exist x such that BODY}
The semi-formal parser is a recursive descent parser that converts these into FormulaNode ASTs, which are then translated to Z3 for formal verification.
SyLLM/
├── src/
│ ├── data/
│ │ ├── syntax_tree.py # Polymorphic logic formula AST (prop + FOL)
│ │ ├── inference_generator.py # 16 inference patterns (11 prop + 5 FOL)
│ │ ├── chain_generator.py # Backward ND proof chain generator (13 rules)
│ │ ├── nl_renderer.py # Formula AST → natural language
│ │ ├── atomic_proposition_generator.py # Diverse propositions via OpenAI API
│ │ └── stage2_generator.py # FOLIO dataset preparation
│ ├── training/
│ │ ├── grpo.py # GRPO trainer (869 LOC)
│ │ ├── soundness_reward.py # Outcome-gated process reward function
│ │ └── finetune.py # LoRA fine-tuning utilities
│ ├── verification/
│ │ ├── verifier.py # Z3-based symbolic verification
│ │ ├── parser.py # Semi-formal recursive descent parser
│ │ └── translator.py # FormulaNode → Z3 expressions
│ ├── inference/
│ │ └── predictor.py # vLLM batched inference + LoRA loading
│ └── evaluation/
│ ├── evaluator.py # Metrics: entailment acc, ECE, parse rate
│ └── benchmark_evaluator.py # FOLIO, LogiQA benchmark runner
├── scripts/
│ ├── lora_finetune.py # Main SFT entry point (~1100 LOC)
│ ├── train_grpo.py # GRPO RL entry point
│ ├── generate_chain_data.py # Generate Stage 0 / Stage 1 chain data
│ ├── prepare_grpo_data.py # Prepare FOLIO data for GRPO
│ ├── generate_logic_data.py # Generate logic inference examples
│ └── evaluate.py # Evaluation runner
├── experiments/
│ ├── exp1_premise_perturbation/ # Sensitivity to premise changes
│ ├── exp2_positional_permutation/ # Stability to premise ordering
│ ├── exp3_monotonicity/ # Monotonicity violation detection
│ ├── exp4_lora_svd/ # LoRA weight SVD analysis across stages
│ └── exp5_probing/ # Linear probing of intermediate representations
├── config.yaml # Global configuration
└── requirements.txt # Dependencies
git clone https://github.com/ryanlin10/LLM-Syllogistic-Formal-Verification.git
cd LLM-Syllogistic-Formal-Verification
pip install -r requirements.txtRequirements: Python 3.8+, PyTorch 2.0+, CUDA GPU (H100/H200 recommended for 24B models), Z3 SMT solver
# Stage 0: premises + conclusion only (10k examples, ~14MB)
python3 scripts/generate_chain_data.py --stage 0 -n 10000 -o ./data/
# Stage 1: full proof traces with intermediate steps (10k examples, ~26MB)
python3 scripts/generate_chain_data.py --stage 1 -n 10000 -o ./data/# Stage 0: Teach the model what to conclude
python3 scripts/lora_finetune.py \
--model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
--train-path ./data/chain_stage0_n10000.jsonl \
--output-dir ./outputs/stage0 \
--bf16 --4bit --use-wandb
# Stage 1: Teach the model how to write structured proofs
python3 scripts/lora_finetune.py \
--model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
--lora-adapter ./outputs/stage0/final \
--train-path ./data/chain_stage1_n10000.jsonl \
--output-dir ./outputs/stage1 \
--bf16 --4bit --use-wandbpython3 scripts/prepare_grpo_data.py \
--output-path data/grpo_folio_prompted.jsonlThis loads 1202 examples from the tasksource/folio HuggingFace dataset, adds the 13-rule instruction preamble and few-shot demos, and writes GRPO-ready JSONL.
python3 scripts/train_grpo.py \
--model mistralai/Mistral-Small-3.2-24B-Instruct-2506 \
--lora-adapter ./outputs/stage1/final \
--train-path data/grpo_folio_prompted.jsonl \
--output-dir ./outputs/grpo_folio \
--task-type verdict --verifier-weight 0.3 --outcome-weight 0.7 \
--group-size 8 --batch-size 2 \
--max-prompt-length 2048 --max-gen-length 300 \
--n-verify-workers 16 \
--kl-coeff 0.1 --lr 5e-6 --epochs 3 \
--bf16 --4bit --temperature 0.8 \
--wandb-project grpo-foliopython3 scripts/inference_demo.py "All humans are mortal. Socrates is human. Therefore?"python3 scripts/evaluate.py \
--model_outputs outputs.jsonl \
--ground_truth data/test.jsonl| Key | Model | Context |
|---|---|---|
mistral-small-24b-instruct |
mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
128K |
deepseek-v3 |
deepseek-ai/deepseek-v3 |
— |
llama3-8b-instruct |
meta-llama/Meta-Llama-3-8B-Instruct |
8K |
All models use q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj as LoRA target modules. Switch models:
python3 scripts/switch_model.py --list
python3 scripts/switch_model.py mistralai/Mistral-Small-3.2-24B-Instruct-2506Five experiments analyze and validate model behavior:
| Experiment | Question | Key Finding |
|---|---|---|
| Exp 1: Premise Perturbation | Does the model respond to factual changes? | 86% → 74% accuracy after perturbation; 16% sensitivity rate — model generalizes, doesn't memorize |
| Exp 2: Positional Permutation | Is reasoning stable to premise ordering? | Tracks accuracy drop and stability by premise count |
| Exp 3: Monotonicity | Does adding premises help/hurt correctly? | Measures monotonicity violations |
| Exp 4: LoRA SVD Analysis | What structure do LoRA matrices learn? | Stage 0: gate_proj dominates (1.3 avg); Stage 1: 2.4× growth, richer attention; effective rank 8–14 |
| Exp 5: Probing | Which layers encode logical structure? | Linear probing classifiers on intermediate representations |
The SVD analysis of LoRA matrices (B·A for each layer and module) reveals the geometry of what the model learns at each training stage:
- Stage 0 (conclusion-only training): Total Frobenius norm 146.3.
gate_projandup_projdominate — the gating mechanism learns to identify relevant logical patterns. - Stage 1 (full proof trace): Total Frobenius norm 233.5. Attention modules grow substantially (2.4× in some layers). The model is developing structured attention over proof steps.
- Top layers by activation (Stage 0): layers 37, 38, 36, 16, 21, 22, 19, 17, 35, 39 — reasoning concentrates in the upper third of the network.
- Effective ranks: 8–14 across modules.
v_projis most concentrated (eff_rank 10.5 → 11.4), suggesting the value projection learns a compact, specialized subspace for logic.
Model output
│
▼
Semi-formal parser (recursive descent)
│ {if P, then Q}, {P and Q}, {for all x, ...}
▼
FormulaNode AST
│
▼
Z3Translator
│ AtomNode → z3.Bool(name)
│ BinaryNode(AND) → z3.And(l, r)
│ QuantifiedNode(FORALL) → z3.ForAll([x], body)
▼
z3.Solver
│ solver.add(*premises)
│ solver.add(z3.Not(conclusion))
▼
z3.unsat → ENTAILED ✓
z3.sat → NOT ENTAILED ✗
The parser handles free-text input ("Premise. Therefore, Conclusion.") and structured tag-based input (<PREMISE>...</PREMISE><CONCLUSION>...</CONCLUSION>). The Z3 translator maintains shared context so identical atom text maps to the same Boolean variable.
| Metric | Description |
|---|---|
| Entailment Accuracy | Fraction of conclusions correctly classified as entailed/not-entailed |
| Verifier Calibration (ECE) | Expected calibration error between verifier confidence and accuracy |
| Premise Precision/Recall | Quality of generated premises against ground truth |
| Evidence Recall | Fraction of premises with linked evidence spans |
| Parse Success Rate | Fraction of model outputs successfully parsed into formal logic |
Config priority: .env > environment variables > config.yaml
Key environment variables: MODEL_NAME, DATA_DIR, OUTPUT_DIR
# config.yaml
model:
base_model: "mistralai/Mistral-Small-3.2-24B-Instruct-2506"
lora_adapter: "./outputs/stage1/final"
data:
train_path: "./data/train.jsonl"
val_path: "./data/val.jsonl"- LoRA efficiency: 101M / 24B trainable parameters (0.42%), r=16, α=32
- 4-bit quantization: ~12GB model weights; full training footprint ~80–120GB
- GRPO step speed: ~70–80s/step with
group_size=8,batch_size=2, parallel Z3, Flash Attention 2 - Z3 verification: ~0.03s for 16 segments in parallel (after worker startup); sequential would be ~400s for 80 segments
torch>=2.0
transformers>=4.35
peft>=0.7
accelerate>=0.24
vllm>=0.6
z3-solver>=4.12
datasets>=2.14
wandb>=0.15
openai>=2.0
faiss-cpu
sentence-transformers>=2.2
MIT License. Ensure compliance with individual model licenses (Mistral, Llama, DeepSeek) for commercial use.
If you use SyLLM in your research:
@software{syllm2026,
title = {SyLLM: Structured Reasoning LLMs with Formal Verification},
author = {Lin, Ryan},
year = {2026},
url = {https://github.com/ryanlin10/LLM-Syllogistic-Formal-Verification}
}