convergence-debugger

诊断并解决AI共识系统中的收敛检测问题,包括语义相似性判断异常、投票结果与相似性状态冲突、提前停止失效等情况,通过检查后端选择、配置参数与日志输出,确保多轮推理能在满足条件时正确终止。

快捷安装

在终端运行此命令,即可一键安装该 Skill 到您的 Claude 中

npx skills add blueman82/ai-counsel --skill "convergence-debugger"

Convergence Detection Debugger

Overview

The AI Counsel convergence detection system (deliberation/convergence.py) determines when models have reached consensus and can stop deliberating early. It uses semantic similarity comparison between consecutive rounds, with voting outcomes taking precedence when available.

Common Issue: Convergence not detected → wasted API calls Common Issue: Early stopping not triggering → deliberation runs full max_rounds Common Issue: Semantic vs voting status conflict → confusing results

Diagnostic Workflow

Step 1: Examine the Transcript

What to look for:

  • Are responses actually similar between rounds?
  • Is there a “Convergence Information” section?
  • What’s the reported status and similarity scores?
  • Are there votes? What’s the voting outcome?

File location:

# Transcripts are in project root
ls -lt transcripts/*.md | head -5

# Open most recent
open "transcripts/$(ls -t transcripts/*.md | head -1)"

Convergence section example:

## Convergence Information
- **Status**: refining (40.00% - 85.00% similarity)
- **Average Similarity**: 72.31%
- **Minimum Similarity**: 68.45%

Voting section example (overrides semantic status):

## Final Voting Results
- **Winner**: TypeScript ✓
- **Status**: majority_decision
- **Tally**: TypeScript: 2, JavaScript: 1

Missing convergence section? → Check if round_number <= min_rounds_before_check (see Step 2)

Step 2: Check Configuration

Read the config:

cat config.yaml

Key settings to verify:

deliberation:
  convergence_detection:
    enabled: true  # Must be true
    semantic_similarity_threshold: 0.85  # Convergence if ALL participants >= this
    divergence_threshold: 0.40  # Diverging if ANY participant < this
    min_rounds_before_check: 1  # Must be <= (total_rounds - 1)
    consecutive_stable_rounds: 2  # Require this many stable rounds

  early_stopping:
    enabled: true  # Must be true for model-controlled stopping
    threshold: 0.66  # Fraction of models that must want to stop (2/3)
    respect_min_rounds: true  # Won't stop before defaults.rounds

Common misconfigurations:

ProblemCauseFix
No convergence info in transcriptmin_rounds_before_check too highFor 2-round deliberation, use min_rounds_before_check: 1
Early stopping never triggersrespect_min_rounds: true but models converge before defaults.roundsSet to false or reduce defaults.rounds
Convergence threshold too strictsemantic_similarity_threshold: 0.95Lower to 0.80-0.85 for practical convergence
Everything marked “diverging”divergence_threshold: 0.70 too highUse default 0.40 (models rarely agree <40%)

Step 3: Check Backend Selection

The system auto-selects the best available backend:

  1. SentenceTransformerBackend (best) - requires sentence-transformers
  2. TFIDFBackend (good) - requires scikit-learn
  3. JaccardBackend (fallback) - zero dependencies (word overlap)

Check what’s installed:

python -c "import sentence_transformers; print('✓ SentenceTransformer available')" 2>/dev/null || echo "✗ SentenceTransformer not available"
python -c "import sklearn; print('✓ TF-IDF available')" 2>/dev/null || echo "✗ TF-IDF not available"

Check what backend was used:

# Look at server logs
tail -50 mcp_server.log | grep -i "backend\|similarity"

Expected log output:

INFO - ConvergenceDetector initialized with SentenceTransformerBackend
INFO - Using SentenceTransformerBackend (best accuracy)

If using Jaccard (fallback):

  • Similarity scores will be lower (word overlap only)
  • Semantic paraphrasing NOT detected
  • Consider installing optional dependencies:
# Install enhanced backends
pip install -r requirements-optional.txt

# Or individually
pip install sentence-transformers  # Best
pip install scikit-learn           # Good

Step 4: Debug Semantic Similarity Scores

Scenario: Responses look identical but similarity is low

Possible causes:

  1. Using Jaccard backend (doesn’t understand semantics)
  2. Responses have different formatting/structure
  3. Models added different examples/details

Test similarity manually:

# Create test script: test_similarity.py
from deliberation.convergence import ConvergenceDetector
from models.config import load_config

config = load_config("config.yaml")
detector = ConvergenceDetector(config)

text1 = "I prefer TypeScript for type safety and better tooling"
text2 = "TypeScript is better because it has types and good IDE support"

score = detector.backend.compute_similarity(text1, text2)
print(f"Similarity: {score:.2%}")
print(f"Backend: {detector.backend.__class__.__name__}")

Run it:

python test_similarity.py

Expected results by backend:

  • SentenceTransformer: 75-85% (understands semantic similarity)
  • TF-IDF: 50-65% (word importance weighting)
  • Jaccard: 30-45% (simple word overlap)

Step 5: Debug Voting vs Semantic Status Conflicts

The voting outcome ALWAYS overrides semantic similarity status when votes are present.

Status precedence (highest to lowest):

  1. unanimous_consensus - All models voted same option
  2. majority_decision - 2+ models agreed (e.g., 2-1 vote)
  3. tie - Equal votes for all options (e.g., 1-1-1)
  4. Semantic status - Only used if no votes present:
    • converged - All participants ≥85% similar
    • refining - Between 40-85% similarity
    • diverging - Any participant <40% similar
    • impasse - Stable disagreement over multiple rounds

Check if votes are being parsed:

# Search transcript for VOTE markers
grep -A 5 "VOTE:" "transcripts/$(ls -t transcripts/*.md | head -1)"

Expected vote format in model responses:

VOTE: {"option": "TypeScript", "confidence": 0.85, "rationale": "Type safety is crucial", "continue_debate": false}

If votes aren’t being parsed:

  • Check that models are outputting exact VOTE: {json} format
  • Verify JSON is valid (use online JSON validator)
  • Check logs for parsing errors: grep -i "vote\|parse" mcp_server.log

Step 6: Debug Early Stopping

Early stopping requires:

  1. early_stopping.enabled: true in config
  2. At least threshold fraction of models set continue_debate: false
  3. Current round ≥ defaults.rounds (if respect_min_rounds: true)

Example: 3 models, threshold 0.66 (66%)

  • Round 1: All say continue_debate: true → continues
  • Round 2: 2 models say continue_debate: false → stops (2/3 = 66.7%)

Debug steps:

  1. Check if enabled:
grep -A 3 "early_stopping:" config.yaml
  1. Check model votes in transcript:
# Look for continue_debate flags
grep -i "continue_debate" "transcripts/$(ls -t transcripts/*.md | head -1)"
  1. Check logs for early stop decision:
grep -i "early stop\|continue_debate" mcp_server.log | tail -20
  1. Common issues:
ProblemCauseSolution
Models want to stop but deliberation continuesrespect_min_rounds: true and not at min rounds yetWait for min rounds or set to false
Threshold not metOnly 1/3 models want to stop (33% < 66%)Need 2/3 consensus
Not enabledenabled: falseSet to true
Models not outputting flagVote JSON missing continue_debate fieldAdd to model prompts

Step 7: Debug Impasse Detection

Impasse = stable disagreement over multiple rounds

Requirements:

  1. Status is diverging (min_similarity < 0.40)
  2. consecutive_stable_rounds threshold reached (default: 2)

Check impasse logic:

# Read convergence.py lines 380-385
# Impasse is only detected if diverging AND stable

Common issue: Never reaches impasse

  • Models keep changing positions → not stable
  • Divergence threshold too low → never marks as “diverging”
  • Need at least 2-3 rounds of consistent disagreement

Manual check:

# Look at similarity scores across rounds
grep "Minimum Similarity" "transcripts/$(ls -t transcripts/*.md | head -1)"

If similarity jumps around (45% → 25% → 60%): → Models aren’t stable, impasse won’t trigger

Step 8: Performance Diagnostics

If convergence detection is slow:

  1. Check if SentenceTransformer is downloading models:
# First run downloads ~500MB model
tail -f mcp_server.log | grep -i "loading\|download"
  1. Model is cached after first load:
  • Subsequent deliberations are instant (model reused from memory)
  • Cache is per-process (each server restart reloads)
  1. Check computation time:
# Add timing to test script
import time

start = time.time()
score = detector.backend.compute_similarity(text1, text2)
elapsed = time.time() - start

print(f"Computation time: {elapsed*1000:.2f}ms")

Expected times:

  • SentenceTransformer: 50-200ms per comparison (first run slower)
  • TF-IDF: 10-50ms per comparison
  • Jaccard: <1ms per comparison

Quick Reference

Configuration Parameters

# Convergence thresholds
semantic_similarity_threshold: 0.85  # Range: 0.0-1.0, higher = stricter
divergence_threshold: 0.40           # Range: 0.0-1.0, lower = more sensitive

# Round constraints
min_rounds_before_check: 1           # Must be <= (total_rounds - 1)
consecutive_stable_rounds: 2         # Stability requirement

# Early stopping
early_stopping.threshold: 0.66       # Fraction of models needed (0.5 = majority)
respect_min_rounds: true             # Honor defaults.rounds minimum

Status Definitions

StatusMeaningSimilarity Range
convergedAll participants agree≥85% (by default)
refiningModerate agreement40-85%
divergingLow agreement<40%
impasseStable disagreement<40% for 2+ rounds
unanimous_consensusAll voted same (overrides semantic)N/A (voting)
majority_decision2+ voted same (overrides semantic)N/A (voting)
tieEqual votes (overrides semantic)N/A (voting)

Common Fixes

Problem: No convergence info in transcript

# Fix: Lower min_rounds_before_check
min_rounds_before_check: 1  # For 2-round deliberations

Problem: Never converges despite identical responses

# Fix: Install better backend
pip install sentence-transformers

Problem: Early stopping not working

# Fix: Check these settings
early_stopping:
  enabled: true
  threshold: 0.66
  respect_min_rounds: false  # Allow stopping before min rounds

Problem: Everything marked “diverging”

# Fix: Lower divergence threshold
divergence_threshold: 0.40  # Default (not 0.70)

Files to Check

  • Config: /Users/harrison/Github/ai-counsel/config.yaml
  • Engine: /Users/harrison/Github/ai-counsel/deliberation/convergence.py
  • Transcripts: /Users/harrison/Github/ai-counsel/transcripts/*.md
  • Logs: /Users/harrison/Github/ai-counsel/mcp_server.log
  • Schema: /Users/harrison/Github/ai-counsel/models/schema.py (Vote models)
  • Config models: /Users/harrison/Github/ai-counsel/models/config.py

Testing Convergence Detection

Create integration test:

# tests/integration/test_convergence_debug.py
import pytest
from deliberation.convergence import ConvergenceDetector
from models.config import load_config
from models.schema import RoundResponse, Participant

def test_convergence_identical_responses():
    """Test that identical responses trigger convergence."""
    config = load_config("config.yaml")
    detector = ConvergenceDetector(config)

    # Create identical responses
    round1 = [
        RoundResponse(participant="claude", response="TypeScript is best", vote=None),
        RoundResponse(participant="codex", response="TypeScript is best", vote=None),
    ]
    round2 = [
        RoundResponse(participant="claude", response="TypeScript is best", vote=None),
        RoundResponse(participant="codex", response="TypeScript is best", vote=None),
    ]

    result = detector.check_convergence(round2, round1, round_number=2)

    assert result is not None, "Should check convergence at round 2"
    assert result.avg_similarity > 0.90, f"Identical responses should be >90% similar, got {result.avg_similarity}"
    print(f"Backend: {detector.backend.__class__.__name__}")
    print(f"Similarity: {result.avg_similarity:.2%}")
    print(f"Status: {result.status}")

Run test:

pytest tests/integration/test_convergence_debug.py -v -s

Advanced Debugging

Enable Debug Logging

# Add to deliberation/engine.py or server.py
import logging
logging.basicConfig(level=logging.DEBUG)

Inspect Backend State

# In Python shell or test
from deliberation.convergence import ConvergenceDetector
from models.config import load_config

config = load_config("config.yaml")
detector = ConvergenceDetector(config)

print(f"Backend: {detector.backend.__class__.__name__}")
print(f"Threshold: {detector.config.semantic_similarity_threshold}")
print(f"Min rounds: {detector.config.min_rounds_before_check}")
print(f"Consecutive stable: {detector.config.consecutive_stable_rounds}")

Compare All Backends

# test_all_backends.py
from deliberation.convergence import (
    JaccardBackend,
    TFIDFBackend,
    SentenceTransformerBackend
)

text1 = "I prefer TypeScript for type safety"
text2 = "TypeScript is better because it has types"

backends = {
    "Jaccard": JaccardBackend(),
}

try:
    backends["TF-IDF"] = TFIDFBackend()
except ImportError:
    print("TF-IDF not available")

try:
    backends["SentenceTransformer"] = SentenceTransformerBackend()
except ImportError:
    print("SentenceTransformer not available")

for name, backend in backends.items():
    score = backend.compute_similarity(text1, text2)
    print(f"{name:20s}: {score:.2%}")

Summary

Always check in order:

  1. ✅ Transcript - Does convergence info appear?
  2. ✅ Config - Are thresholds reasonable?
  3. ✅ Backend - Is the best backend installed?
  4. ✅ Voting - Are votes being parsed correctly?
  5. ✅ Early stopping - Is it enabled and configured correctly?
  6. ✅ Logs - Any errors or warnings?

Most common fixes:

  • Lower min_rounds_before_check to 1 for short deliberations
  • Install sentence-transformers for better semantic detection
  • Set early_stopping.respect_min_rounds: false for faster stopping
  • Lower semantic_similarity_threshold from 0.95 to 0.85
  • Check that models output valid VOTE: JSON with continue_debate field