
The Scarf (short for Self-Contained Application Refactoring) benchmark, or ScarfBench, is a suite of Java applications implemented in three frameworks (Jakarta EE, Quarkus, and Spring) for evaluating agentic transformation between them. The suite enables systematic assessment of AI agents' ability to migrate enterprise Java applications while preserving functionality, idiomatic patterns, and architectural integrity across different runtime environments.
The benchmark includes comprehensive examples ranging from focused layer-specific demonstrations to complete production-grade applications, each with verified implementations across all supported frameworks.
Manual Conversions with Developer Verification
All applications in this benchmark have been manually converted and verified by experienced developers. Each implementation has undergone rigorous testing to ensure functional correctness, adherence to framework-specific idioms, and preservation of architectural integrity across Jakarta EE, Quarkus, and Spring frameworks.
Getting Started with the Scarf Benchmark
- Run through an example to quickly set up ScarfBench and evaluate agentic solutions.
- Leaderboard: Compare AI agents and transformation tools on the benchmark suite. Track performance metrics and identify best practices for automated application migration.
Benchmark Applications
This benchmark contains self-contained applications that demonstrate core Jakarta EE functionality and its framework-specific implementations. Each example has been manually converted and verified across all target frameworks, and includes smoke tests to verify application behavior after transformation.
The benchmark includes two types of examples:
Focused Examples
Application examples organized per layer, where each example demonstrates a specific technology within that layer (e.g., persistence, presentation, integration).
- Core business logic implementations using Enterprise JavaBeans (EJBs). Demonstrates stateful, stateless, and singleton session beans for shopping carts, currency conversion, hit counters, web services, and standalone EJB usage.
- CDI and dependency injection patterns including custom qualifiers, interceptors, decorators, producer methods, event observers, and alternative implementations for conditional bean selection.
- Enterprise features including managed executors for concurrency, asynchronous EJB methods, interceptors for cross-cutting concerns, and timer services for scheduled task execution.
- Integration technologies featuring Jakarta Batch processing, JMS messaging patterns, message-driven beans, JAX-WS web services, and Java Connector Architecture for enterprise system integration.
- Data persistence patterns using JPA entities with CRUD operations, complex entity relationships, composite keys, inheritance strategies, and JPQL queries for database interactions.
- Web tier implementations including servlets, JAX-RS REST APIs, WebSocket endpoints, server-sent events, file uploads, filters, listeners, and real-time communication patterns.
- Authentication and authorization patterns featuring Jakarta Security identity stores, form-based and basic authentication, EJB security, role-based access control, and password hashing.
Whole Applications
Complete, functioning applications that demonstrate the coordination and interaction between multiple layers.
- Domain-Driven Design cargo shipping tracker with Jakarta Faces, CDI, Enterprise Beans, JPA, REST, Batch, and JMS. Showcases aggregates, repositories, and domain events following Eric Evans' DDD patterns.
- Event-driven microservices with Orders, Barista, and Kitchen services via Kafka. Demonstrates MicroProfile stack, reactive messaging, distributed transactions, and eventual consistency.
- High-performance stock trading benchmark with stateless session beans, JPA optimistic locking, transaction management, and connection pooling. Used for measuring server performance.
- Veterinary clinic management with Jakarta Faces (PrimeFaces), complex JPA relationships, CDI, and Bean Validation. Complete workflows for owners, pets, visits, and veterinarians.
- Medium.com clone with MicroProfile JWT, JAX-RS REST API, article management, comments, favorites, tags, and user following. Includes Testcontainers integration tests.
Roadmap
ScarfBench is actively maintained and continuously evolving to support the research community. We are committed to expanding the benchmark's capabilities and improving its utility for evaluating AI-driven application transformation. Here's what's coming:
- Comprehensive Smoke Tests: We are developing an extensive suite of automated smoke tests to validate functional equivalence across framework migrations. These tests will ensure that transformed applications maintain their original behavior, catching subtle regressions and framework-specific issues that may arise during migration.
- Dynamic Leaderboard: A live leaderboard will track and compare the performance of different AI agents and transformation tools across the benchmark suite. This will provide transparent, reproducible metrics for the research community and help identify best practices in automated application migration.
- Rich Taxonomy of Errors: We are building a comprehensive taxonomy that categorizes transformation errors, anti-patterns, and common pitfalls. This taxonomy will help researchers understand where current approaches struggle and guide development of more robust transformation strategies.
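To make the idea of validating functional equivalence concrete, here is a minimal sketch of what such a smoke check could look like. It is not ScarfBench's actual test harness. The class `SmokeCheck`, the record `Observation`, the base URLs, and the endpoint path are all hypothetical names introduced for illustration, and the sketch assumes that "equivalent" means the same HTTP status and the same body after whitespace normalization. It uses only the JDK's `java.net.http` client (Java 16 or later for records).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch: compares one endpoint on the original and migrated deployments.
class SmokeCheck {

    // Minimal view of an HTTP response: status code plus body text.
    record Observation(int status, String body) {}

    // Assumption: responses are equivalent when the status codes match and the
    // bodies match after trimming and collapsing runs of whitespace.
    static boolean equivalent(Observation a, Observation b) {
        return a.status() == b.status()
                && normalize(a.body()).equals(normalize(b.body()));
    }

    // Trims the text and collapses internal whitespace; null is treated as empty.
    static String normalize(String s) {
        return s == null ? "" : s.strip().replaceAll("\\s+", " ");
    }

    // Issues a GET against baseUrl + path and records the status and body.
    static Observation fetch(HttpClient client, String baseUrl, String path) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + path))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return new Observation(response.statusCode(), response.body());
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: SmokeCheck <originalBaseUrl> <migratedBaseUrl> <path>");
            return;
        }
        HttpClient client = HttpClient.newHttpClient();
        Observation original = fetch(client, args[0], args[2]);
        Observation migrated = fetch(client, args[1], args[2]);
        System.out.println(equivalent(original, migrated) ? "PASS" : "FAIL");
    }
}
```

With both deployments running locally, the check could be invoked as `java SmokeCheck.java <originalBaseUrl> <migratedBaseUrl> <path>`. A real suite would likely need looser comparisons for fields that legitimately differ between frameworks, such as timestamps or generated identifiers.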
ScarfBench will continue to receive regular updates with new applications, enhanced documentation, and improved tooling. We welcome community contributions and feedback to make this benchmark more valuable for advancing the state of automated application transformation.
Contact
For any questions, feedback, or suggestions, please contact the authors:
| Name | Email |
|---|---|
| Rahul Krishna | i.m.ralk@gmail.com |
| Raju Pavuluri | pavuluri@us.ibm.com |