Python Data Analysis Server¶
Overview¶
A comprehensive MCP server providing advanced data analysis, statistical testing, visualization, and transformation capabilities. This server enables AI applications to perform sophisticated data science workflows through a standardized interface.
Key Features:
- 📊 Multi-format data loading (CSV, JSON, Parquet, SQL, Excel)
- 📈 Statistical analysis and hypothesis testing
- 📉 Rich visualization with matplotlib, seaborn, and plotly
- ⏰ Time series analysis and forecasting
- 🔄 Data transformation and feature engineering
- 🔍 SQL-like querying with pandas
- 💾 Smart dataset caching with automatic eviction
Quick Start¶
Installation¶
# Navigate to the server directory
cd mcp-servers/python/data_analysis_server
# Install dependencies
pip install -r requirements.txt
# Or install in development mode
pip install -e ".[dev]"
Running the Server¶
# Run with default configuration
python -m data_analysis_server.server
# Run with custom configuration
python -m data_analysis_server.server --config my_config.yaml
Integration with MCP Gateway¶
# Register with MCP Gateway
curl -X POST http://localhost:4444/servers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "data-analysis-server",
    "transport": "stdio",
    "command": "python -m data_analysis_server.server",
    "description": "Advanced data analysis and visualization server"
  }'
Available Tools¶
Data Management¶
load_dataset¶
Load data from various sources and formats.
{
  "tool": "load_dataset",
  "arguments": {
    "source": "./data/sales.csv",
    "format": "csv",
    "dataset_id": "sales_data",
    "sample_size": 1000,
    "cache_data": true
  }
}
Supported formats:
- CSV, TSV
- JSON, JSONL
- Parquet
- Excel (xlsx, xls)
- SQL databases
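Under the hood, these formats map naturally onto the standard pandas readers. The sketch below is purely illustrative of that dispatch (the helper names and sampling behaviour are assumptions, not the server's actual loader code):
import pandas as pd
# Illustrative mapping from format names onto the standard pandas readers.
_READERS = {
    "csv": pd.read_csv,
    "tsv": lambda src, **kw: pd.read_csv(src, sep="\t", **kw),
    "json": pd.read_json,
    "jsonl": lambda src, **kw: pd.read_json(src, lines=True, **kw),
    "parquet": pd.read_parquet,
    "xlsx": pd.read_excel,
    "xls": pd.read_excel,
}
# SQL sources go through pd.read_sql(query, connection) rather than a file path.
def load_frame(source, fmt, sample_size=None):
    df = _READERS[fmt](source)
    if sample_size is not None and len(df) > sample_size:
        df = df.sample(n=sample_size, random_state=0)
    return df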
list_datasets¶
List all currently loaded datasets.
Statistical Analysis¶
analyze_dataset¶
Perform comprehensive dataset analysis.
{
  "tool": "analyze_dataset",
  "arguments": {
    "dataset_id": "sales_data",
    "analysis_type": "exploratory",
    "include_distributions": true,
    "include_correlations": true
  }
}
Analysis types:
- exploratory - Full EDA with summary statistics
- descriptive - Basic statistics only
- correlation - Correlation analysis
- distribution - Distribution analysis
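For orientation, the exploratory and correlation analyses correspond closely to standard pandas operations. A minimal sketch, assuming the sales CSV from the earlier example (illustrative only, not the server's internal code):
import pandas as pd
df = pd.read_csv("./data/sales.csv")
summary = df.describe(include="all")        # summary statistics (descriptive / EDA)
correlations = df.corr(numeric_only=True)   # pairwise correlations of numeric columns
skewness = df.skew(numeric_only=True)       # distribution shape
missing = df.isna().mean()                  # fraction of missing values per column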
statistical_test¶
Perform various statistical tests.
{
  "tool": "statistical_test",
  "arguments": {
    "dataset_id": "sales_data",
    "test_type": "t_test",
    "column": "revenue",
    "group_column": "region",
    "alpha": 0.05
  }
}
Supported tests:
- Parametric: t-test, ANOVA, linear regression
- Non-parametric: Mann-Whitney U, Wilcoxon, Kruskal-Wallis
- Correlation: Pearson, Spearman, Chi-square
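These tests line up with standard scipy.stats routines. A rough sketch of the equivalent direct calls, assuming the sales data from the example above (illustrative only):
import pandas as pd
from scipy import stats
df = pd.read_csv("./data/sales.csv")
north = df.loc[df["region"] == "North", "revenue"]
south = df.loc[df["region"] == "South", "revenue"]
t_stat, p_value = stats.ttest_ind(north, south)       # parametric two-sample t-test
u_stat, p_value = stats.mannwhitneyu(north, south)    # non-parametric alternative
r, p_value = stats.pearsonr(df["advertising_spend"], df["revenue"])  # Pearson correlation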
Data Visualization¶
create_visualization¶
Generate various types of plots.
{
  "tool": "create_visualization",
  "arguments": {
    "dataset_id": "sales_data",
    "plot_type": "scatter",
    "x": "advertising_spend",
    "y": "revenue",
    "hue": "product_category",
    "title": "Revenue vs Advertising Spend",
    "save_path": "./plots/revenue_analysis.png"
  }
}
Plot types:
- Distribution: histogram, kde, box, violin
- Relationship: scatter, line, regression
- Categorical: bar, count, swarm, strip
- Matrix: heatmap, pair plot
- Time Series: line, area, seasonal decomposition
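The static plot types are thin wrappers around matplotlib and seaborn; the scatter example above corresponds roughly to the following sketch (illustrative, not the server's code):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("./data/sales.csv")
ax = sns.scatterplot(data=df, x="advertising_spend", y="revenue", hue="product_category")
ax.set_title("Revenue vs Advertising Spend")
plt.savefig("./plots/revenue_analysis.png", dpi=100)   # matches the default_dpi config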
create_interactive_plot¶
Generate interactive Plotly visualizations.
{
  "tool": "create_interactive_plot",
  "arguments": {
    "dataset_id": "sales_data",
    "plot_type": "3d_scatter",
    "x": "price",
    "y": "quantity",
    "z": "revenue",
    "color": "region",
    "save_html": true
  }
}
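Interactive output is produced with Plotly; the call above corresponds roughly to this plotly.express sketch (the output file name is arbitrary):
import pandas as pd
import plotly.express as px
df = pd.read_csv("./data/sales.csv")
fig = px.scatter_3d(df, x="price", y="quantity", z="revenue", color="region")
fig.write_html("revenue_3d.html")   # self-contained interactive HTML file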
Time Series Analysis¶
time_series_analysis¶
Perform time series decomposition and analysis.
{
  "tool": "time_series_analysis",
  "arguments": {
    "dataset_id": "sales_data",
    "date_column": "date",
    "value_column": "daily_revenue",
    "frequency": "D",
    "decomposition_type": "additive",
    "include_forecast": true,
    "forecast_periods": 30
  }
}
Features:
- Trend detection
- Seasonality analysis
- Stationarity testing
- ARIMA forecasting
- Seasonal decomposition
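These steps correspond to standard statsmodels routines. A compact sketch, assuming a daily series with date and daily_revenue columns as in the example above (the ARIMA order is arbitrary, and this is not the server's actual pipeline):
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
ts = (pd.read_csv("./data/sales.csv", parse_dates=["date"])
        .set_index("date")["daily_revenue"]
        .asfreq("D")
        .interpolate())                                   # fill gaps so decomposition works
decomposition = seasonal_decompose(ts, model="additive")  # trend / seasonal / residual
adf_stat, p_value, *_ = adfuller(ts)                      # ADF stationarity test
forecast = ARIMA(ts, order=(1, 1, 1)).fit().forecast(steps=30)  # order chosen for illustration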
Data Transformation¶
transform_data¶
Apply various transformations to datasets.
{
  "tool": "transform_data",
  "arguments": {
    "dataset_id": "sales_data",
    "transformations": [
      {"type": "scale", "columns": ["price", "quantity"], "method": "standard"},
      {"type": "encode", "columns": ["category"], "method": "onehot"},
      {"type": "impute", "columns": ["rating"], "method": "mean"}
    ],
    "save_as": "sales_data_transformed"
  }
}
Transformations:
- Scaling: standard, minmax, robust, normalizer
- Encoding: label, onehot, ordinal, target
- Imputation: mean, median, mode, forward fill
- Feature Engineering: polynomial, binning, interaction
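Each transformation family maps onto familiar pandas/scikit-learn operations. An illustrative equivalent of the example above (not the server's internals):
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
df = pd.read_csv("./data/sales.csv")
# Scaling ("standard"): zero mean, unit variance
df[["price", "quantity"]] = StandardScaler().fit_transform(df[["price", "quantity"]])
# Imputation: fill missing ratings with the column mean
df["rating"] = SimpleImputer(strategy="mean").fit_transform(df[["rating"]]).ravel()
# Encoding ("onehot"): expand category into indicator columns
df = pd.get_dummies(df, columns=["category"], prefix="category")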
clean_data¶
Automated data cleaning operations.
{
  "tool": "clean_data",
  "arguments": {
    "dataset_id": "sales_data",
    "remove_duplicates": true,
    "handle_missing": "drop",
    "remove_outliers": true,
    "outlier_method": "iqr",
    "save_as": "sales_data_clean"
  }
}
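For reference, the IQR outlier rule and the other cleaning steps are standard pandas idioms; a minimal sketch of what the example above amounts to (illustrative only):
import pandas as pd
df = pd.read_csv("./data/sales.csv")
df = df.drop_duplicates().dropna()          # remove_duplicates, handle_missing="drop"
# IQR rule: keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]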
Data Querying¶
query_data¶
Execute SQL-like queries on datasets using pandas.
{
  "tool": "query_data",
  "arguments": {
    "dataset_id": "sales_data",
    "query": "SELECT region, AVG(revenue) as avg_revenue FROM data WHERE date > '2024-01-01' GROUP BY region ORDER BY avg_revenue DESC",
    "limit": 100
  }
}
Supported SQL features:
- SELECT with column aliases
- WHERE clauses with complex conditions
- GROUP BY with aggregations
- ORDER BY (ASC/DESC)
- JOINs between datasets
- LIMIT and OFFSET
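As noted under performance considerations, these queries are translated into pandas operations. The example query above is roughly equivalent to:
import pandas as pd
df = pd.read_csv("./data/sales.csv", parse_dates=["date"])
result = (df[df["date"] > "2024-01-01"]                    # WHERE
            .groupby("region", as_index=False)             # GROUP BY
            .agg(avg_revenue=("revenue", "mean"))          # AVG(revenue) AS avg_revenue
            .sort_values("avg_revenue", ascending=False)   # ORDER BY ... DESC
            .head(100))                                    # LIMIT 100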
filter_data¶
Apply filters to create dataset subsets.
{
  "tool": "filter_data",
  "arguments": {
    "dataset_id": "sales_data",
    "filters": [
      {"column": "revenue", "operator": ">", "value": 1000},
      {"column": "region", "operator": "in", "value": ["North", "South"]}
    ],
    "save_as": "high_revenue_sales"
  }
}
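In pandas terms, this filter is a simple boolean mask; an illustrative equivalent:
import pandas as pd
df = pd.read_csv("./data/sales.csv")
mask = (df["revenue"] > 1000) & (df["region"].isin(["North", "South"]))
high_revenue_sales = df[mask]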
Configuration¶
Create a config.yaml file:
server:
  max_datasets: 100
  max_memory_mb: 1024
  cache_ttl_seconds: 3600
visualization:
  plot_output_dir: "./plots"
  default_dpi: 100
  default_figsize: [10, 6]
  style: "seaborn"
analysis:
  max_query_results: 10000
  default_sample_size: 5000
  confidence_level: 0.95
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Environment Variables¶
- DATA_ANALYSIS_CONFIG: Path to configuration file
- DATA_ANALYSIS_CACHE_DIR: Directory for cached datasets
- DATA_ANALYSIS_PLOT_DIR: Directory for saved visualizations
- DATA_ANALYSIS_LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR)
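A small illustrative sketch of how a server might combine these variables with config.yaml (an assumption about the resolution order, not documented behaviour):
import os
import yaml
# Illustrative only: environment variables consulted first, config.yaml as fallback.
config_path = os.environ.get("DATA_ANALYSIS_CONFIG", "config.yaml")
with open(config_path) as fh:
    config = yaml.safe_load(fh)
plot_dir = os.environ.get("DATA_ANALYSIS_PLOT_DIR", config["visualization"]["plot_output_dir"])
log_level = os.environ.get("DATA_ANALYSIS_LOG_LEVEL", config["logging"]["level"])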
Advanced Usage¶
Working with Large Datasets¶
# Use sampling for large datasets
{
  "tool": "load_dataset",
  "arguments": {
    "source": "large_data.parquet",
    "format": "parquet",
    "dataset_id": "large_data",
    "sample_size": 10000,
    "sampling_method": "stratified",
    "stratify_column": "category"
  }
}
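Stratified sampling keeps the class proportions of the stratify column intact. In pandas terms it amounts to sampling within each group, roughly as follows (illustrative, not the server's code):
import pandas as pd
df = pd.read_parquet("large_data.parquet")
frac = 10_000 / len(df)                      # aim for roughly 10,000 rows overall
sample = df.groupby("category").sample(frac=frac, random_state=0)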
Custom Statistical Models¶
# Linear regression with multiple features
{
  "tool": "statistical_test",
  "arguments": {
    "dataset_id": "sales_data",
    "test_type": "linear_regression",
    "dependent": "revenue",
    "independent": ["price", "advertising", "season"],
    "include_intercept": true,
    "return_coefficients": true
  }
}
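The regression itself is a standard OLS fit. A statsmodels sketch of the same model, assuming season is categorical (illustrative only):
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv("./data/sales.csv")
# C(season) expands the categorical column; the intercept is included by default.
model = smf.ols("revenue ~ price + advertising + C(season)", data=df).fit()
print(model.params)     # fitted coefficients
print(model.summary())  # full table with p-values and R-squared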
Advanced Visualizations¶
# Complex multi-panel visualization
{
  "tool": "create_visualization",
  "arguments": {
    "dataset_id": "sales_data",
    "plot_type": "pair_plot",
    "variables": ["revenue", "cost", "profit", "units"],
    "hue": "product_line",
    "diag_kind": "kde",
    "corner": true
  }
}
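This corresponds closely to seaborn's pairplot; for reference (illustrative, output path arbitrary):
import pandas as pd
import seaborn as sns
df = pd.read_csv("./data/sales.csv")
grid = sns.pairplot(df, vars=["revenue", "cost", "profit", "units"],
                    hue="product_line", diag_kind="kde", corner=True)
grid.savefig("./plots/pair_plot.png")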
Example Workflows¶
Complete EDA Pipeline¶
# 1. Load data
{
  "tool": "load_dataset",
  "arguments": {
    "source": "sales_2024.csv",
    "dataset_id": "sales"
  }
}
# 2. Clean data
{
  "tool": "clean_data",
  "arguments": {
    "dataset_id": "sales",
    "remove_duplicates": true,
    "handle_missing": "impute"
  }
}
# 3. Exploratory analysis
{
  "tool": "analyze_dataset",
  "arguments": {
    "dataset_id": "sales",
    "analysis_type": "exploratory"
  }
}
# 4. Visualize distributions
{
  "tool": "create_visualization",
  "arguments": {
    "dataset_id": "sales",
    "plot_type": "histogram",
    "column": "revenue",
    "bins": 30
  }
}
# 5. Statistical testing
{
  "tool": "statistical_test",
  "arguments": {
    "dataset_id": "sales",
    "test_type": "anova",
    "column": "revenue",
    "group_column": "region"
  }
}
Time Series Forecasting¶
# Load and prepare time series data
{
  "tool": "load_dataset",
  "arguments": {
    "source": "daily_sales.csv",
    "dataset_id": "timeseries"
  }
}
# Perform time series analysis
{
  "tool": "time_series_analysis",
  "arguments": {
    "dataset_id": "timeseries",
    "date_column": "date",
    "value_column": "sales",
    "frequency": "D",
    "include_forecast": true,
    "forecast_periods": 90
  }
}
Performance Considerations¶
- Dataset Caching: Frequently accessed datasets are kept in memory
- Lazy Loading: Large files are loaded on-demand
- Query Optimization: SQL queries are converted to efficient pandas operations
- Memory Management: Automatic eviction of least-recently-used datasets
- Parallel Processing: Multi-core support for heavy computations
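The least-recently-used eviction mentioned above can be pictured as a small ordered cache; a toy sketch (purely illustrative, the real implementation may differ):
from collections import OrderedDict
class DatasetCache:
    """Hold at most max_datasets frames; evict the least recently used first."""
    def __init__(self, max_datasets=100):
        self.max_datasets = max_datasets
        self._frames = OrderedDict()
    def put(self, dataset_id, df):
        self._frames[dataset_id] = df
        self._frames.move_to_end(dataset_id)         # newest entry goes to the end
        while len(self._frames) > self.max_datasets:
            self._frames.popitem(last=False)         # evict the oldest entry
    def get(self, dataset_id):
        self._frames.move_to_end(dataset_id)         # mark as recently used
        return self._frames[dataset_id]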
Troubleshooting¶
Common Issues¶
Out of Memory: reduce max_memory_mb in the server configuration, or reload large datasets with a smaller sample_size so that only a sample is held in memory.
Slow Queries:
# Use sampling for large datasets
{
  "tool": "query_data",
  "arguments": {
    "dataset_id": "large_data",
    "query": "SELECT * FROM data TABLESAMPLE(10 PERCENT)"
  }
}
Missing Dependencies:
# Install optional dependencies
pip install plotly  # For interactive plots
pip install xlrd    # For old Excel files
pip install sqlalchemy  # For SQL databases