Skip to main content

Monitoring & Metrics

The Transfer Learning Video Processing Pipeline includes comprehensive monitoring and metrics tracking capabilities to help you understand performance, resource usage, and processing status.

Real-time Dashboard

The pipeline provides a beautiful real-time dashboard powered by Rich, accessible through the CLI:
transfer-learning monitor

Dashboard Options

video_id
string
Filter metrics for a specific video ID
refresh_rate
number
default:"2.0"
Dashboard refresh rate in seconds
show_history
boolean
default:"false"
Show historical metrics instead of real-time data

Dashboard Sections

The dashboard is divided into several panels:
  1. Header Panel
    • Current video being processed
    • Processing duration
    • Frame progress
  2. Performance Metrics
    • Mean and P95 latency
    • Throughput statistics
    • Total processing time
  3. System Resources
    • CPU usage (color-coded)
    • Memory usage (color-coded)
    • Disk usage
    • Network I/O
    • GPU metrics (if available)
  4. API Statistics
    • API call counts
    • Mean latency per API
    • Success/failure rates
  5. Custom Metrics
    • User-defined metrics
    • Application-specific KPIs

Metrics Collection

System Metrics

The pipeline automatically tracks system resource usage:
from transfer_learning.monitoring.metrics import MetricsTracker

metrics = MetricsTracker()
metrics.start_processing("video_id")

# System metrics are collected automatically
# Access current metrics
system_metrics = metrics.current_metrics.system
print(f"CPU Usage: {system_metrics.cpu_percent}%")

Performance Metrics

Track performance-related metrics:
# Record API call latency
metrics.record_api_call("openai_vision", 150.5)  # 150.5ms

# Add custom processing time
metrics.current_metrics.performance.add_processing_time(200.5)

# Record throughput
metrics.current_metrics.performance.add_throughput(10.5)  # 10.5 items/second

Custom Metrics

Add your own custom metrics:
# Add a custom metric
metrics.add_custom_metric("quality_score", 0.95)
metrics.add_custom_metric("detected_objects", ["person", "car", "dog"])

Using the Timer

The pipeline provides a convenient Timer context manager for timing operations:
from transfer_learning.monitoring.metrics import Timer

# Time a processing operation
with Timer(metrics_tracker, "frame_processing", operation_type="processing"):
    process_frame(frame)

# Time an API call
with Timer(metrics_tracker, "openai_api", operation_type="api"):
    response = openai.ChatCompletion.create(...)

Metrics Storage

Metrics are automatically saved to JSON files in dated directories:
metrics/
  └── 2024-03-05/
      └── video_123_20240305_143022.json

Metrics File Format

{
  "video_id": "video_123",
  "start_time": "2024-03-05T14:30:22.123456",
  "end_time": "2024-03-05T14:35:42.987654",
  "frame_count": 150,
  "processed_frames": 150,
  "performance_metrics": {
    "latency_stats": {
      "mean": 125.5,
      "median": 120.0,
      "p95": 180.5,
      "p99": 200.0
    },
    "throughput_stats": {
      "mean": 10.5,
      "max": 15.2
    }
  },
  "system_metrics": {
    "cpu_percent": 45.2,
    "memory_percent": 62.8,
    "disk_usage_percent": 78.5,
    "network_bytes_sent": 1024,
    "network_bytes_recv": 2048,
    "gpu_utilization": 85.2,
    "gpu_memory_used": 4096
  },
  "custom_metrics": {
    "quality_score": 0.95,
    "detected_objects": ["person", "car", "dog"]
  },
  "system_info": {
    "platform": "Darwin-24.3.0-x86_64",
    "python_version": "3.10.0",
    "processor": "i386",
    "memory_total": 17179869184,
    "disk_total": 994662584320
  }
}

Historical Analysis

View historical metrics for a specific video:
transfer-learning monitor --video-id=video_123 --show-history
This will load and display the most recent metrics file for the specified video ID.

Best Practices

  1. Start/End Tracking
    metrics.start_processing("video_id")
    try:
        # Your processing code
    finally:
        metrics.end_processing()
    
  2. Use Timers for Operations
    with Timer(metrics, "operation_name"):
        # Your operation code
    
  3. Add Custom Metrics
    metrics.add_custom_metric("key", value)
    
  4. Monitor Resource Usage
    • Keep an eye on system resource usage
    • Set up alerts for high resource utilization
    • Use color-coded indicators in the dashboard
  5. Regular Cleanup
    # Clean up old metrics files
    transfer-learning cleanup --max-age-hours=168  # 1 week
    

Configuration

Configure monitoring settings in your .env file:
ENABLE_MONITORING=true
LOG_LEVEL=INFO
METRICS_ENABLED=true
Or use the config command:
transfer-learning config

Troubleshooting

Common Issues

  1. High Resource Usage
    • Check system metrics panel for bottlenecks
    • Adjust batch sizes and concurrent processing
    • Consider GPU offloading if available
  2. Slow Processing
    • Monitor API latencies in the dashboard
    • Check network I/O metrics
    • Optimize batch sizes and caching
  3. Missing Metrics
    • Ensure metrics tracking is enabled
    • Check file permissions in metrics directory
    • Verify proper start/end processing calls

Debug Mode

Enable debug logging for more detailed information:
LOG_LEVEL=DEBUG transfer-learning monitor

API Reference

MetricsTracker

start_processing
method
Start tracking metrics for a video processing run
end_processing
method
End tracking metrics and save results
record_api_call
method
Record an API call with its latency
add_custom_metric
method
Add a custom metric

Timer

__init__
constructor
Initialize a timer for operation tracking