Latency Breakdown (typical)
Total Request Time: ~100-200ms
┌─────────────────────────────────────────────────────┐
│ Total: ~150ms │
├─────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────┐ │
│ │ Java Gateway: ~20ms │ │
│ │ - Parse request: 2ms │ │
│ │ - Validate data: 3ms │ │
│ │ - HTTP call: 10ms │ │
│ │ - Response building: 5ms │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │ Network: ~10-30ms │ │
│ │ - Request transmission: 5ms │ │
│ │ - Response transmission: 5ms │ │
│ │ - Latency: 0-20ms │ │
│ └─────────────────────────────────────┘ │
│ ┌─────────────────────────────────────┐ │
│ │ Python Service: ~80-120ms │ │
│ │ - Request parsing: 5ms │ │
│ │ - Model prediction: 70-100ms │ │
│ │ - Response formatting: 5ms │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Throughput (requests per second)
- Single instance: 10-50 RPS (requests per second)
- With horizontal scaling: Linear increase with load balancing
- Bottleneck: Python model inference (70-100ms per prediction)
Resource Usage
| Service |
CPU |
Memory |
Disk |
| Java Gateway |
0.1-0.2 cores |
512 MB |
100 MB |
| Python Service |
0.5-1.0 cores |
1-2 GB |
200 MB + model size |
| Model Storage |
N/A |
N/A |
50-500 MB per model |