
Python MCP Evaluation Server

Overview

An AI evaluation platform for the MCP ecosystem, providing 63 specialized tools across 14 categories for end-to-end assessment of AI systems using LLM-as-a-judge techniques combined with rule-based metrics.

Author: Mihai Criveti

Key Highlights:

  • 🤖 63 specialized evaluation tools
  • 📊 14 distinct tool categories
  • 🎯 LLM-as-a-judge with bias mitigation
  • 📈 Statistical rigor with confidence intervals
  • 🌐 Multi-modal assessment capabilities
  • 🔄 Extensible rubric system
  • 🚀 Multiple server modes (MCP, REST, HTTP Bridge)

Quick Start

Installation

# Navigate to server directory
cd mcp-servers/python/mcp_eval_server

# Install with development dependencies
pip install -e ".[dev]"

Running the Server

MCP Server Mode (stdio)

# Launch the MCP server over stdio for Claude Desktop and other MCP clients
python -m mcp_eval_server.server

# Or use make command
make dev
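
For programmatic access over stdio, a minimal client sketch using the official mcp Python SDK might look like the following. This is an illustration only: the SDK is assumed to be installed separately (it is not part of this package), and the evaluate_single call simply mirrors the example shown under Tool Categories below.

# Minimal sketch: connect to the stdio server with the official `mcp` Python SDK.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Spawn the evaluation server as a subprocess speaking MCP over stdio.
    params = StdioServerParameters(command="python", args=["-m", "mcp_eval_server.server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Call one of the tools documented below.
            result = await session.call_tool(
                "evaluate_single",
                arguments={
                    "response": "The capital of France is Paris.",
                    "criteria": ["accuracy", "clarity"],
                },
            )
            print(result)


asyncio.run(main())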

REST API Server Mode

# Launch FastAPI REST server
python -m mcp_eval_server.rest_server --port 8080 --host 0.0.0.0

# Or use make command
make serve-rest

# Access interactive docs
open http://localhost:8080/docs

HTTP Bridge Mode

# MCP protocol over HTTP with Server-Sent Events
make serve-http

# Access via JSON-RPC on port 9000
curl -X POST -H 'Content-Type: application/json' \
     -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}' \
     http://localhost:9000/
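
The same exchange can be scripted. Below is a minimal sketch using only the Python standard library; the tools/call request shape (name plus arguments) follows the standard MCP protocol rather than anything specific to this server.

# Sketch: call the HTTP bridge with JSON-RPC using only the standard library.
import json
import urllib.request
from typing import Optional

BRIDGE_URL = "http://localhost:9000/"


def jsonrpc(method: str, params: Optional[dict] = None, request_id: int = 1) -> dict:
    """POST a single JSON-RPC 2.0 request and return the parsed response."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        payload["params"] = params
    req = urllib.request.Request(
        BRIDGE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# List the available evaluation tools (same call as the curl example above).
print(jsonrpc("tools/list"))

# Invoke a tool; "tools/call" with name/arguments is the standard MCP shape.
print(jsonrpc("tools/call", {
    "name": "evaluate_single",
    "arguments": {"response": "The capital of France is Paris.", "criteria": ["accuracy"]},
}, request_id=2))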

Health Monitoring

# Liveness probe
curl http://localhost:8080/health

# Readiness probe
curl http://localhost:8080/ready

# Performance metrics
curl http://localhost:8080/metrics
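
In scripted or containerized setups, one way to gate evaluation traffic on startup is to poll the /ready endpoint shown above; a small illustrative sketch:

# Sketch: block until the REST server reports ready, then proceed.
import time
import urllib.error
import urllib.request

READY_URL = "http://localhost:8080/ready"

for attempt in range(30):
    try:
        with urllib.request.urlopen(READY_URL, timeout=2) as resp:
            if resp.status == 200:
                print("server ready")
                break
    except (urllib.error.URLError, OSError):
        pass  # not up yet
    time.sleep(1)
else:
    raise SystemExit("server did not become ready in time")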

Tool Categories

🤖 LLM-as-a-Judge Tools (5 tools)

evaluate_single

Evaluate a single response with customizable criteria.

{
  "tool": "evaluate_single",
  "arguments": {
    "response": "The capital of France is Paris.",
    "criteria": ["accuracy", "clarity", "completeness"],
    "weights": {"accuracy": 0.5, "clarity": 0.3, "completeness": 0.2},
    "judge_model": "gpt-4"
  }
}

compare_pairwise

Compare two responses head-to-head with position bias mitigation.

{
  "tool": "compare_pairwise",
  "arguments": {
    "response_a": "Response 1 text",
    "response_b": "Response 2 text",
    "criteria": ["relevance", "coherence"],
    "mitigate_position_bias": true
  }
}
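
Position bias means a judge may favor whichever response it sees first. A common mitigation, sketched below with a hypothetical judge_prefers() standing in for the actual LLM judge call, is to evaluate both orderings and keep the verdict only when it is consistent:

# Sketch of position-bias mitigation: judge both orderings and reconcile.
# `judge_prefers` is a hypothetical stand-in for an LLM judge call that
# returns "first" or "second" for the pair it is shown.

def judge_prefers(first: str, second: str, criteria: list[str]) -> str:
    raise NotImplementedError("call your LLM judge here")


def compare_with_bias_mitigation(response_a: str, response_b: str, criteria: list[str]) -> str:
    verdict_ab = judge_prefers(response_a, response_b, criteria)  # A shown first
    verdict_ba = judge_prefers(response_b, response_a, criteria)  # B shown first

    a_wins_both = verdict_ab == "first" and verdict_ba == "second"
    b_wins_both = verdict_ab == "second" and verdict_ba == "first"

    if a_wins_both:
        return "a"
    if b_wins_both:
        return "b"
    return "tie"  # verdict flipped with position, so treat as no clear winner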

rank_multiple

Rank multiple responses using tournament or scoring algorithms.

{
  "tool": "rank_multiple",
  "arguments": {
    "responses": ["Response 1", "Response 2", "Response 3"],
    "method": "tournament",
    "criteria": ["quality", "accuracy"]
  }
}
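
A tournament ranking can be built from repeated pairwise comparisons. The sketch below runs a simple round-robin and orders responses by win count, reusing a hypothetical compare() judge call:

# Sketch of round-robin tournament ranking: every response is compared with
# every other, and responses are ranked by number of pairwise wins.
from itertools import combinations


def compare(a: str, b: str) -> str:
    """Hypothetical pairwise judge; returns 'a', 'b', or 'tie'."""
    raise NotImplementedError


def rank_by_tournament(responses: list[str]) -> list[str]:
    wins = {r: 0 for r in responses}
    for a, b in combinations(responses, 2):
        verdict = compare(a, b)
        if verdict == "a":
            wins[a] += 1
        elif verdict == "b":
            wins[b] += 1
    return sorted(responses, key=lambda r: wins[r], reverse=True)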

📝 Prompt Evaluation Tools (4 tools)

analyze_prompt_clarity

Detect ambiguity and provide improvement recommendations.

{
  "tool": "analyze_prompt_clarity",
  "arguments": {
    "prompt": "Write a story about a bank",
    "context": "creative writing",
    "suggest_improvements": true
  }
}

test_prompt_consistency

Analyze variance across multiple runs.

{
  "tool": "test_prompt_consistency",
  "arguments": {
    "prompt": "Generate a product description",
    "num_runs": 10,
    "temperature_range": [0.3, 0.7, 1.0]
  }
}
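
Consistency testing boils down to generating the prompt several times and measuring how much the outputs (or their scores) spread. A minimal sketch, with hypothetical generate() and score() functions standing in for the model call and the quality metric:

# Sketch of consistency testing: run the prompt repeatedly across temperatures
# and report the spread of a quality score.
import statistics


def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError  # hypothetical model call


def score(output: str) -> float:
    raise NotImplementedError  # hypothetical quality metric


def test_consistency(prompt: str, num_runs: int, temperatures: list[float]) -> dict:
    scores = [
        score(generate(prompt, t))
        for t in temperatures
        for _ in range(num_runs)
    ]
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.pstdev(scores),  # lower spread = more consistent
        "range": max(scores) - min(scores),
    }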

🛠️ Agent Evaluation Tools (4 tools)

evaluate_tool_usage

Assess agent tool selection and usage patterns.

{
  "tool": "evaluate_tool_usage",
  "arguments": {
    "agent_trace": [...],
    "available_tools": ["search", "calculate", "summarize"],
    "task_requirements": ["find information", "compute result"]
  }
}

assess_reasoning_chain

Evaluate logical reasoning and coherence.

{
  "tool": "assess_reasoning_chain",
  "arguments": {
    "reasoning_steps": [...],
    "expected_logic": "deductive",
    "check_consistency": true
  }
}

🔗 RAG Evaluation Tools (8 tools)

evaluate_retrieval

Assess retrieval relevance and precision.

{
  "tool": "evaluate_retrieval",
  "arguments": {
    "query": "What is quantum computing?",
    "retrieved_docs": [...],
    "relevance_threshold": 0.7,
    "use_reranking": true
  }
}
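
Retrieval precision here is simply the fraction of retrieved documents whose relevance to the query meets the threshold. A sketch with a hypothetical relevance() scorer:

# Sketch of retrieval precision: fraction of retrieved documents whose
# relevance score clears the threshold. `relevance` is a hypothetical scorer
# (e.g. embedding similarity or an LLM judge) returning a value in [0, 1].

def relevance(query: str, doc: str) -> float:
    raise NotImplementedError


def precision_at_threshold(query: str, retrieved_docs: list[str], threshold: float = 0.7) -> float:
    if not retrieved_docs:
        return 0.0
    relevant = sum(1 for doc in retrieved_docs if relevance(query, doc) >= threshold)
    return relevant / len(retrieved_docs)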

check_grounding

Verify response grounding in source documents.

{
  "tool": "check_grounding",
  "arguments": {
    "response": "Generated answer",
    "source_docs": [...],
    "require_citations": true
  }
}

⚖️ Bias & Fairness Tools (6 tools)

detect_demographic_bias

Identify demographic biases in responses.

{
  "tool": "detect_demographic_bias",
  "arguments": {
    "responses": [...],
    "demographics": ["gender", "age", "ethnicity"],
    "baseline_comparison": true
  }
}

analyze_representation

Check representation equity across groups.

{
  "tool": "analyze_representation",
  "arguments": {
    "content": "Generated text",
    "groups": ["professional", "cultural", "geographic"],
    "expected_distribution": {...}
  }
}

🛡️ Robustness Tools (5 tools)

test_adversarial

Test against adversarial inputs.

{
  "tool": "test_adversarial",
  "arguments": {
    "base_input": "Original prompt",
    "attack_types": ["typo", "semantic", "injection"],
    "num_variants": 20
  }
}

check_stability

Analyze output stability across variations.

{
  "tool": "check_stability",
  "arguments": {
    "prompt_template": "Explain {topic} in simple terms",
    "variations": ["quantum physics", "machine learning"],
    "stability_threshold": 0.8
  }
}

🔒 Safety & Alignment Tools (4 tools)

detect_harmful_content

Identify potentially harmful content.

{
  "tool": "detect_harmful_content",
  "arguments": {
    "content": "Generated text",
    "categories": ["violence", "bias", "misinformation"],
    "sensitivity": "high"
  }
}

check_instruction_adherence

Verify alignment with instructions.

{
  "tool": "check_instruction_adherence",
  "arguments": {
    "instructions": ["Be concise", "Use examples"],
    "response": "Generated response",
    "strict_mode": true
  }
}

🌍 Multilingual Tools (4 tools)

evaluate_translation

Assess translation quality across languages.

{
  "tool": "evaluate_translation",
  "arguments": {
    "source_text": "Hello world",
    "translated_text": "Bonjour le monde",
    "source_lang": "en",
    "target_lang": "fr",
    "check_fluency": true
  }
}

check_cross_lingual_consistency

Verify consistency across languages.

{
  "tool": "check_cross_lingual_consistency",
  "arguments": {
    "responses": {
      "en": "English response",
      "fr": "Réponse française",
      "es": "Respuesta española"
    },
    "check_semantic": true
  }
}

⚡ Performance Tools (4 tools)

measure_latency

Track response latency metrics.

{
  "tool": "measure_latency",
  "arguments": {
    "operation": "text_generation",
    "input_size": 1000,
    "num_samples": 100,
    "percentiles": [50, 90, 95, 99]
  }
}
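
Latency percentiles can be computed from repeated timed runs. A self-contained sketch using only the standard library, with a hypothetical run_operation() standing in for the call being measured:

# Sketch of latency measurement: time the operation repeatedly and report
# the requested percentiles.
import statistics
import time


def run_operation() -> None:
    raise NotImplementedError  # hypothetical operation under test


def measure_latency(num_samples: int = 100, percentiles: tuple[int, ...] = (50, 90, 95, 99)) -> dict:
    samples_ms = []
    for _ in range(num_samples):
        start = time.perf_counter()
        run_operation()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {f"p{p}": cuts[p - 1] for p in percentiles}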

analyze_efficiency

Evaluate computational efficiency.

{
  "tool": "analyze_efficiency",
  "arguments": {
    "model": "gpt-3.5-turbo",
    "task": "summarization",
    "input_tokens": 500,
    "measure_memory": true
  }
}

🔐 Privacy Tools (8 tools)

detect_pii

Identify personally identifiable information.

{
  "tool": "detect_pii",
  "arguments": {
    "text": "John Doe lives at 123 Main St",
    "pii_types": ["name", "address", "phone", "email"],
    "redact": true
  }
}
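
Rule-based PII detection typically pairs pattern matching with redaction. A rough sketch for email and phone patterns only (reliable name and address detection generally needs an NER model and is out of scope here):

# Rough sketch of rule-based PII detection and redaction for two pattern types.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}


def detect_pii(text: str, redact: bool = False) -> dict:
    findings = {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}
    redacted = text
    if redact:
        for name, pattern in PII_PATTERNS.items():
            redacted = pattern.sub(f"[{name.upper()}]", redacted)
    return {"findings": findings, "text": redacted if redact else text}


print(detect_pii("Contact jane@example.com or 555-123-4567", redact=True))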

check_data_minimization

Verify data minimization practices.

{
  "tool": "check_data_minimization",
  "arguments": {
    "collected_fields": [...],
    "required_fields": [...],
    "purpose": "user_registration"
  }
}

🔄 Workflow Tools (3 tools)

run_evaluation_suite

Execute comprehensive evaluation suites.

{
  "tool": "run_evaluation_suite",
  "arguments": {
    "suite_name": "production_readiness",
    "components": ["safety", "performance", "quality"],
    "parallel": true
  }
}

compare_results

Compare evaluation results across versions.

{
  "tool": "compare_results",
  "arguments": {
    "baseline_results": {...},
    "current_results": {...},
    "significance_level": 0.05
  }
}
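
Whether a metric change is significant can be checked with a simple resampling test. The sketch below uses a two-sided permutation test on per-sample scores, built only on the standard library; at significance_level 0.05, a p-value below 0.05 would flag a real difference. The sample scores in the usage line are illustrative only.

# Sketch of a two-sided permutation test for comparing per-sample metric
# scores between a baseline run and a current run (stdlib only).
import random
from statistics import fmean


def permutation_test(baseline: list[float], current: list[float],
                     num_permutations: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    observed = abs(fmean(current) - fmean(baseline))
    pooled = baseline + current
    n = len(baseline)

    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[n:]) - fmean(pooled[:n]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (num_permutations + 1)


p_value = permutation_test([0.71, 0.69, 0.74, 0.70], [0.78, 0.80, 0.75, 0.79])
print("significant at 0.05:", p_value < 0.05)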

Configuration

Environment Variables

# LLM Configuration
OPENAI_API_KEY=your-key-here
AZURE_OPENAI_ENDPOINT=https://your-instance.openai.azure.com
ANTHROPIC_API_KEY=your-anthropic-key

# Server Configuration
EVAL_SERVER_PORT=8080
EVAL_SERVER_HOST=0.0.0.0
EVAL_CACHE_SIZE=1000
EVAL_MAX_WORKERS=4

# Evaluation Settings
DEFAULT_JUDGE_MODEL=gpt-4
CONFIDENCE_LEVEL=0.95
POSITION_BIAS_MITIGATION=true

Configuration File

Create config.yaml:

evaluation:
  default_judge: gpt-4
  temperature: 0.0
  max_retries: 3
  timeout: 30

  criteria_weights:
    accuracy: 0.3
    relevance: 0.3
    coherence: 0.2
    safety: 0.2

caching:
  enabled: true
  ttl: 3600
  max_size: 1000

logging:
  level: INFO
  format: json
  output: stdout

Advanced Usage

Custom Rubrics

# Define custom evaluation rubric
{
  "tool": "evaluate_with_rubric",
  "arguments": {
    "response": "AI-generated content",
    "rubric": {
      "technical_accuracy": {
        "weight": 0.4,
        "criteria": "Factually correct technical details"
      },
      "clarity": {
        "weight": 0.3,
        "criteria": "Clear and understandable explanation"
      },
      "completeness": {
        "weight": 0.3,
        "criteria": "Covers all required aspects"
      }
    }
  }
}

Batch Evaluation

# Evaluate multiple samples in parallel
{
  "tool": "batch_evaluate",
  "arguments": {
    "samples": [
      {"id": "1", "response": "Response 1"},
      {"id": "2", "response": "Response 2"}
    ],
    "evaluation_type": "quality",
    "parallel_workers": 4
  }
}

Multi-Judge Consensus

# Use multiple judges for consensus
{
  "tool": "multi_judge_consensus",
  "arguments": {
    "response": "Content to evaluate",
    "judges": ["gpt-4", "claude-3", "gemini-pro"],
    "aggregation": "weighted_average",
    "confidence_threshold": 0.8
  }
}
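
Weighted-average aggregation across judges might look like the sketch below; the per-judge scores and weights are illustrative inputs (assumed to be in [0, 1]), and agreement is approximated as the spread between judges:

# Sketch of weighted-average consensus across judges. Scores and weights are
# illustrative; in practice each score comes from a separate judge model.

def consensus(scores: dict[str, float], weights: dict[str, float],
              confidence_threshold: float = 0.8) -> dict:
    total_weight = sum(weights.get(judge, 1.0) for judge in scores)
    weighted = sum(score * weights.get(judge, 1.0) for judge, score in scores.items())
    aggregate = weighted / total_weight

    # Simple agreement signal: judges close together => high confidence
    # (assumes scores are normalized to [0, 1]).
    spread = max(scores.values()) - min(scores.values())
    confidence = 1.0 - spread
    return {
        "score": aggregate,
        "confidence": confidence,
        "accepted": confidence >= confidence_threshold,
    }


print(consensus(
    scores={"gpt-4": 0.82, "claude-3": 0.78, "gemini-pro": 0.85},
    weights={"gpt-4": 1.0, "claude-3": 1.0, "gemini-pro": 0.8},
))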

Example Workflows

Complete Model Evaluation Pipeline

# 1. Quality evaluation
{
  "tool": "evaluate_single",
  "arguments": {
    "response": "Model output",
    "criteria": ["quality", "accuracy", "relevance"]
  }
}

# 2. Safety check
{
  "tool": "detect_harmful_content",
  "arguments": {
    "content": "Model output",
    "categories": ["all"]
  }
}

# 3. Bias detection
{
  "tool": "detect_demographic_bias",
  "arguments": {
    "responses": ["Model output"],
    "demographics": ["all"]
  }
}

# 4. Performance measurement
{
  "tool": "measure_latency",
  "arguments": {
    "operation": "inference",
    "num_samples": 100
  }
}

# 5. Generate report
{
  "tool": "generate_evaluation_report",
  "arguments": {
    "include_all_metrics": true
  }
}

A/B Testing Workflow

# Compare two model versions
{
  "tool": "run_ab_test",
  "arguments": {
    "model_a": "v1.0",
    "model_b": "v2.0",
    "test_cases": [...],
    "metrics": ["quality", "latency", "safety"],
    "sample_size": 1000,
    "confidence_level": 0.95
  }
}

Performance Optimization

  • Caching: Results are cached to avoid redundant evaluations (see the sketch after this list)
  • Parallel Processing: Multi-threaded evaluation for batch operations
  • Lazy Loading: Models loaded on-demand
  • Connection Pooling: Efficient API connection management
  • Async Operations: Non-blocking I/O for improved throughput
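
The caching and parallel-processing patterns above, as a minimal sketch: lru_cache handles repeated inputs and a thread pool fans out a batch. evaluate_one is a hypothetical single-sample evaluation call, not part of this package's API.

# Sketch of the caching + parallel-processing pattern: identical inputs hit
# an in-memory cache, and a batch is fanned out across worker threads.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1000)
def evaluate_one(response: str, criteria: tuple[str, ...]) -> float:
    raise NotImplementedError("call the evaluation backend here")


def batch_evaluate(responses: list[str], criteria: tuple[str, ...],
                   max_workers: int = 4) -> list[float]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: evaluate_one(r, criteria), responses))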

Troubleshooting

API Key Issues

# Check API keys are set
echo $OPENAI_API_KEY
echo $ANTHROPIC_API_KEY

# Test API connectivity
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"

Memory Issues

# Reduce cache size in config.yaml
caching:
  max_size: 100
  ttl: 600

Timeout Errors

# Increase timeouts in config.yaml
evaluation:
  timeout: 60
  max_retries: 5

Author

Mihai Criveti - GitHub: cmihai - Project: IBM MCP Context Forge