diff --git a/dockmon/.dockerignore b/dockmon/.dockerignore
new file mode 100644
index 0000000..3cd7b04
--- /dev/null
+++ b/dockmon/.dockerignore
@@ -0,0 +1,54 @@
+# Git
+.git
+.gitignore
+.github
+
+# Documentation
+README.md
+*.md
+wiki-content
+
+# Tests
+tests
+test-results
+playwright-report
+node_modules
+package.json
+package-lock.json
+
+# Screenshots (keep in git for README, but not in Docker image)
+screenshots
+
+# Development
+.claude
+.vscode
+.idea
+
+# Python
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+.Python
+*.so
+*.egg
+*.egg-info
+dist
+build
+venv
+env
+
+# Docker
+docker-compose.yml
+.dockerignore
+
+# Scripts (if not needed in image)
+scripts
+
+# Temporary files
+*.tmp
+*.bak
+*.swp
+*.swo
+*~
+.DS_Store
diff --git a/dockmon/.gitignore b/dockmon/.gitignore
new file mode 100644
index 0000000..1f7154e
--- /dev/null
+++ b/dockmon/.gitignore
@@ -0,0 +1,79 @@
+
+# Claude Development Context and Internal Files
+CLAUDE_CONTEXT.md
+BASELINE_STATUS.md
+
+# Wiki Content (lives in wiki repo, not main repo)
+wiki-content/
+wiki-repo/
+
+# Documentation (internal development docs)
+docs/
+
+# Test Infrastructure (Development Only)
+tests/
+test-results/
+test-results.json
+playwright-report/
+*-results.log
+baseline*.log
+pre-refactor*.log
+post-refactor*.log
+
+# Node.js Dependencies (from test setup)
+node_modules/
+package-lock.json
+npm-debug.log*
+
+# SSL Certificates
+docker/certs/
+*.crt
+*.key
+*.pem
+*.csr
+
+# Database and persistent data
+backend/data/
+*.db
+*.db-journal
+*.db-wal
+*.db-shm
+
+# Environment files
+.env
+.env.*
+!.env.example
+
+# Python
+__pycache__/
+backend/__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+venv/
+env/
+*.egg-info/
+dist/
+build/
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Logs
+*.log
+logs/
+
+# Temporary files
+*.tmp
+*.bak
+.coverage
+htmlcov/
diff --git a/dockmon/LICENSE b/dockmon/LICENSE
new file mode 100644
index 0000000..f1bc979
--- /dev/null
+++ b/dockmon/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 darthnorse
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/dockmon/README.md b/dockmon/README.md
new file mode 100644
index 0000000..3fd5a17
--- /dev/null
+++ b/dockmon/README.md
@@ -0,0 +1,135 @@
+# DockMon
+
+A comprehensive Docker container monitoring and management platform with real-time statistics, intelligent auto-restart, multi-channel alerting, and complete event logging.
+
+## Key Features
+
+- **Multi-Host Monitoring** - Monitor containers across unlimited Docker hosts (local and remote)
+- **Real-Time Statistics** - Live CPU, memory, network, and disk I/O metrics for hosts and containers
+- **Real-Time Container Logs** - View logs from multiple containers simultaneously with live updates
+- **Event Viewer** - Comprehensive audit trail with filtering, search, and real-time updates
+- **Intelligent Auto-Restart** - Per-container auto-restart with configurable retry logic
+- **Advanced Alerting** - Discord, Slack, Telegram, Pushover, Gotify, SMTP with customizable templates
+- **Real-Time Dashboard** - Drag-and-drop customizable widgets with WebSocket updates
+- **Secure by Design** - Session-based auth, rate limiting, mTLS for remote hosts
+- **Mobile-Friendly** - Responsive design that works seamlessly on all devices
+
+## Documentation
+
+- **[Complete User Guide](https://github.com/darthnorse/dockmon/wiki)** - Full documentation
+- **[Quick Start](https://github.com/darthnorse/dockmon/wiki/Quick-Start)** - Get started in 5 minutes
+- **[Installation](https://github.com/darthnorse/dockmon/wiki/Installation)** - Docker, unRAID, Synology, QNAP
+- **[Configuration](https://github.com/darthnorse/dockmon/wiki/Notifications)** - Alerts, notifications, settings
+- **[Security](https://github.com/darthnorse/dockmon/wiki/Security-Guide)** - Best practices and mTLS setup
+- **[Remote Monitoring](https://github.com/darthnorse/dockmon/wiki/Remote-Docker-Setup)** - Monitor remote Docker hosts
+- **[Event Viewer](https://github.com/darthnorse/dockmon/wiki/Event-Viewer)** - Comprehensive audit trail with filtering
+- **[Container Logs](https://github.com/darthnorse/dockmon/wiki/Container-Logs)** - Real-time multi-container log viewer
+- **[API Reference](https://github.com/darthnorse/dockmon/wiki/API-Reference)** - REST and WebSocket APIs
+- **[FAQ](https://github.com/darthnorse/dockmon/wiki/FAQ)** - Frequently asked questions
+- **[Troubleshooting](https://github.com/darthnorse/dockmon/wiki/Troubleshooting)** - Common issues
+
+## Use Cases
+
+### Home Lab
+- Monitor all your Docker containers in one place
+- Get notified when critical services go down
+- Automatically restart failed containers
+- Track container events and changes
+
+### Small Business
+- Centralized monitoring across multiple servers
+- Multi-channel alerting (Discord, Slack, Telegram, Pushover, Gotify, SMTP)
+- Schedule maintenance windows with blackout periods
+- Audit trail of all container operations
+
+### Development Teams
+- Monitor dev, staging, and production environments
+- Quick container management (start, stop, restart, logs)
+- Test notifications before deploying to production
+- Share monitoring dashboard with team
+
+## Support & Community
+
+- **[Report Issues](https://github.com/darthnorse/dockmon/issues)** - Found a bug?
+- **[Discussions](https://github.com/darthnorse/dockmon/discussions)** - Ask questions, share ideas
+- **[Wiki](https://github.com/darthnorse/dockmon/wiki)** - Complete documentation
+- **[Star on GitHub](https://github.com/darthnorse/dockmon)** - Show your support!
+
+## Roadmap
+
+### Completed (v1.0)
+- [x] Full backend API with FastAPI
+- [x] WebSocket real-time updates
+- [x] Multi-channel notifications
+- [x] Comprehensive event logging
+- [x] Event log viewer with filtering and search
+- [x] Real-time container logs viewer (multi-container support)
+- [x] Drag-and-drop dashboard
+- [x] Auto-restart with retry logic
+
+### Completed (v1.1)
+- [x] Real-time performance metrics (CPU, memory, network, disk I/O)
+- [x] Host-level and container-level statistics
+- [x] TLS/mTLS support for secure remote Docker connections
+- [x] Optimized streaming architecture with Go backend
+
+### Planned (v1.5+)
+- [ ] Performance metrics dashboard with historical graphs
+- [ ] Container auto-update feature with version tracking
+- [ ] Configuration export/import
+- [ ] Automatic Proxmox LXC installation script
+
+See the [full roadmap](https://github.com/darthnorse/dockmon/wiki/Roadmap) for details.
+
+## Contributing
+
+Contributions are welcome! Here's how you can help:
+
+- Report bugs via [GitHub Issues](https://github.com/darthnorse/dockmon/issues)
+- Suggest features in [Discussions](https://github.com/darthnorse/dockmon/discussions)
+- Improve documentation (edit the [Wiki](https://github.com/darthnorse/dockmon/wiki))
+- Submit pull requests (see [Contributing Guide](https://github.com/darthnorse/dockmon/wiki/Contributing))
+
+## Development
+
+Want to contribute code or run DockMon in development mode?
+
+See [Development Setup](https://github.com/darthnorse/dockmon/wiki/Development-Setup) for:
+- Local development environment setup
+- Architecture overview
+- Running tests
+- Building from source
+
+## License
+
+MIT License - see [LICENSE](LICENSE) file for details.
+
+## Author
+
+Created by [darthnorse](https://github.com/darthnorse)
+
+## Acknowledgments
+
+This project has been developed with **vibe coding** and **AI assistance** using Claude Code. The codebase emphasizes clean, well-documented code with proper error handling, attention to testing, modern async/await patterns, robust database design, and production-ready deployment configuration.
+
+---
+
+If DockMon helps you, please consider giving it a star!
+
+[Documentation](https://github.com/darthnorse/dockmon/wiki) • [Issues](https://github.com/darthnorse/dockmon/issues) • [Discussions](https://github.com/darthnorse/dockmon/discussions)
\ No newline at end of file
diff --git a/dockmon/backend/auth/__init__.py b/dockmon/backend/auth/__init__.py
new file mode 100644
index 0000000..a906ca5
--- /dev/null
+++ b/dockmon/backend/auth/__init__.py
@@ -0,0 +1 @@
+# Authentication module for DockMon
\ No newline at end of file
diff --git a/dockmon/backend/auth/routes.py b/dockmon/backend/auth/routes.py
new file mode 100644
index 0000000..6cdfd7e
--- /dev/null
+++ b/dockmon/backend/auth/routes.py
@@ -0,0 +1,260 @@
+"""
+Authentication Routes for DockMon
+Handles login, logout, API key access, and session management endpoints
+"""
+
+import os
+import secrets
+import logging
+from typing import Optional
+
+from fastapi import APIRouter, Request, Response, HTTPException, Depends
+from fastapi.responses import JSONResponse
+
+from models.auth_models import LoginRequest, ChangePasswordRequest
+from security.rate_limiting import rate_limit_auth
+from security.audit import security_audit
+from auth.session_manager import session_manager
+from database import DatabaseManager
+from config.paths import DATABASE_PATH
+
+
+logger = logging.getLogger(__name__)
+
+
+router = APIRouter(prefix="/api/auth", tags=["authentication"])
+
+
+def _is_localhost_or_internal(client_ip: str) -> bool:
+ """Check if request is from localhost or internal network"""
+ import ipaddress
+ try:
+ addr = ipaddress.ip_address(client_ip)
+
+ # Allow localhost
+ if addr.is_loopback:
+ return True
+
+ # Allow private networks (RFC 1918) - for Docker networks and internal deployments
+ if addr.is_private:
+ return True
+
+ return False
+ except ValueError:
+ # Invalid IP format
+ return False
+
+
+# Initialize database for user management - use centralized path config
+db = DatabaseManager(DATABASE_PATH)
+
+# Ensure default user exists on startup
+db.get_or_create_default_user()
+
+
+def _get_session_from_cookie(request: Request) -> Optional[str]:
+ """Extract session ID from cookie"""
+ return request.cookies.get("dockmon_session")
+
+
+async def verify_frontend_session(request: Request) -> bool:
+ """Dependency to verify frontend session authentication"""
+ session_id = _get_session_from_cookie(request)
+
+ if not session_manager.validate_session(session_id, request):
+ raise HTTPException(
+ status_code=401,
+ detail="Authentication required"
+ )
+
+ return True
+
+
+# Note: This endpoint will be implemented in main.py since it needs monitor instance
+# @router.get("/key") - implemented directly in main.py
+
+
+@router.post("/login")
+async def login(login_data: LoginRequest, request: Request, response: Response, rate_limit_check: bool = rate_limit_auth):
+ """Frontend login endpoint"""
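+    # Example request (sketch; host and port depend on your deployment, the backend default port is 8080):
+    #   curl -c cookies.txt -H "Content-Type: application/json" \
+    #        -d '{"username": "<user>", "password": "<password>"}' \
+    #        http://localhost:8080/api/auth/login
+    # On success the response sets the dockmon_session cookie used by the other /api/auth routes.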
+ client_ip = request.client.host
+ user_agent = request.headers.get("user-agent", "Unknown")
+
+ # Verify credentials using database
+ user_info = db.verify_user_credentials(login_data.username, login_data.password)
+ if not user_info:
+ security_audit.log_login_failure(client_ip, user_agent, "Invalid credentials")
+ raise HTTPException(
+ status_code=401,
+ detail="Invalid username or password"
+ )
+
+ # Create session with username
+ session_id = session_manager.create_session(request, user_info["username"])
+
+ # Set secure cookie
+ # Detect if we're using HTTPS based on common headers
+ is_https = request.url.scheme == "https" or request.headers.get("x-forwarded-proto") == "https"
+
+ response.set_cookie(
+ key="dockmon_session",
+ value=session_id,
+ httponly=True, # Prevent XSS access to cookie
+ secure=is_https, # Use secure flag when on HTTPS
+ samesite="lax", # CSRF protection
+ max_age=24*60*60 # 24 hours
+ )
+
+ return {
+ "success": True,
+ "message": "Login successful",
+ "username": user_info["username"],
+ "must_change_password": user_info["must_change_password"],
+ "is_first_login": user_info["is_first_login"]
+ }
+
+
+@router.post("/logout")
+async def logout(request: Request, response: Response, authenticated: bool = Depends(verify_frontend_session)):
+ """Frontend logout endpoint"""
+ session_id = _get_session_from_cookie(request)
+
+ if session_id:
+ session_manager.delete_session(session_id)
+
+ # Clear cookie
+ response.delete_cookie("dockmon_session")
+
+ return {"success": True, "message": "Logout successful"}
+
+
+@router.get("/status")
+async def auth_status(request: Request):
+ """Check authentication status"""
+ session_id = _get_session_from_cookie(request)
+ authenticated = session_manager.validate_session(session_id, request) if session_id else False
+
+ response = {
+ "authenticated": authenticated,
+ "session_valid": authenticated
+ }
+
+ # If authenticated, include username and password change requirement
+ if authenticated:
+ username = session_manager.get_session_username(session_id)
+ if username:
+ response["username"] = username
+ # Check if user must change password
+ with db.get_session() as session:
+ from database import User
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ response["must_change_password"] = user.must_change_password
+ response["is_first_login"] = user.is_first_login
+
+ return response
+
+
+@router.post("/change-password")
+async def change_password(password_data: ChangePasswordRequest, request: Request, authenticated: bool = Depends(verify_frontend_session)):
+ """Change user password"""
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ if not username:
+ raise HTTPException(
+ status_code=401,
+ detail="Session invalid"
+ )
+
+ # Verify current password
+ user_info = db.verify_user_credentials(username, password_data.current_password)
+ if not user_info:
+ raise HTTPException(
+ status_code=401,
+ detail="Current password is incorrect"
+ )
+
+ # Change password
+ success = db.change_user_password(username, password_data.new_password)
+ if not success:
+ raise HTTPException(
+ status_code=500,
+ detail="Failed to change password"
+ )
+
+ # Log security event
+ client_ip = request.client.host
+ user_agent = request.headers.get("user-agent", "Unknown")
+ security_audit.log_password_change(client_ip, user_agent, username)
+
+ return {
+ "success": True,
+ "message": "Password changed successfully"
+ }
+
+
+@router.post("/change-username")
+async def change_username(username_data: dict, request: Request, authenticated: bool = Depends(verify_frontend_session)):
+ """Change username"""
+ session_id = _get_session_from_cookie(request)
+ current_username = session_manager.get_session_username(session_id)
+
+ if not current_username:
+ raise HTTPException(
+ status_code=401,
+ detail="Session invalid"
+ )
+
+ # Verify current password
+ current_password = username_data.get("current_password")
+ new_username = username_data.get("new_username", "").strip()
+
+ if not current_password or not new_username:
+ raise HTTPException(
+ status_code=400,
+ detail="Current password and new username required"
+ )
+
+ # Verify current credentials
+ user_info = db.verify_user_credentials(current_username, current_password)
+ if not user_info:
+ raise HTTPException(
+ status_code=401,
+ detail="Current password is incorrect"
+ )
+
+ # Validate new username
+ if len(new_username) < 3 or len(new_username) > 50:
+ raise HTTPException(
+ status_code=400,
+ detail="Username must be between 3 and 50 characters"
+ )
+
+ # Check if new username already exists (but allow keeping the same username)
+ if new_username != current_username and db.username_exists(new_username):
+ raise HTTPException(
+ status_code=400,
+ detail="Username already exists"
+ )
+
+ # Change username
+ if not db.change_username(current_username, new_username):
+ raise HTTPException(
+ status_code=500,
+ detail="Failed to change username"
+ )
+
+ # Update session with new username
+ session_manager.update_session_username(session_id, new_username)
+
+ # Log security event
+ client_ip = request.client.host
+ user_agent = request.headers.get("user-agent", "Unknown")
+ security_audit.log_username_change(client_ip, user_agent, current_username, new_username)
+
+ return {
+ "success": True,
+ "message": "Username changed successfully"
+ }
\ No newline at end of file
diff --git a/dockmon/backend/auth/session_manager.py b/dockmon/backend/auth/session_manager.py
new file mode 100644
index 0000000..6f547f8
--- /dev/null
+++ b/dockmon/backend/auth/session_manager.py
@@ -0,0 +1,138 @@
+"""
+Session Management System for DockMon
+Provides secure session tokens with IP validation and automatic cleanup
+"""
+
+import logging
+import secrets
+import threading
+import time
+from datetime import datetime, timedelta
+from typing import Dict, Optional
+
+from fastapi import Request
+
+from security.audit import security_audit
+
+logger = logging.getLogger(__name__)
+
+
+class SessionManager:
+ """
+ Custom session management for frontend authentication
+ Provides secure session tokens with configurable expiry
+ """
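+    # Typical flow (sketch, mirroring auth/routes.py): the login route calls
+    # create_session(request, username) and stores the returned token in the dockmon_session
+    # cookie; later requests pass that token to validate_session(token, request), which rejects
+    # expired sessions and IP mismatches.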
+ def __init__(self):
+ self.sessions: Dict[str, dict] = {}
+ self.session_timeout = timedelta(hours=24) # 24 hour sessions
+ self._sessions_lock = threading.Lock()
+ self._shutdown_event = threading.Event()
+ self._cleanup_thread = threading.Thread(target=self._periodic_cleanup, daemon=True)
+ self._cleanup_thread.start()
+
+ def _periodic_cleanup(self):
+ """Run cleanup every hour"""
+ while not self._shutdown_event.wait(timeout=3600):
+ try:
+ deleted = self.cleanup_expired_sessions()
+ if deleted > 0:
+ logger.info(f"Cleaned up {deleted} expired sessions")
+ except Exception as e:
+ logger.error(f"Session cleanup failed: {e}", exc_info=True)
+
+ def create_session(self, request: Request, username: str = None) -> str:
+ """Create a new session token"""
+ session_id = secrets.token_urlsafe(32)
+ client_ip = request.client.host
+ user_agent = request.headers.get("user-agent", "Unknown")
+
+ with self._sessions_lock:
+ self.sessions[session_id] = {
+ "created_at": datetime.utcnow(),
+ "last_accessed": datetime.utcnow(),
+ "client_ip": client_ip,
+ "user_agent": user_agent,
+ "authenticated": True,
+ "username": username
+ }
+
+ # Security audit log
+ security_audit.log_login_success(client_ip, user_agent, session_id)
+
+ return session_id
+
+ def validate_session(self, session_id: Optional[str], request: Request) -> bool:
+ """Validate session token and update last accessed time"""
+ if not session_id:
+ return False
+
+ with self._sessions_lock:
+ if session_id not in self.sessions:
+ return False
+
+ session = self.sessions[session_id]
+ current_time = datetime.utcnow()
+ client_ip = request.client.host
+
+ # Check if session has expired
+ if current_time - session["created_at"] > self.session_timeout:
+ del self.sessions[session_id]
+ security_audit.log_session_expired(client_ip, session_id)
+ return False
+
+ # Validate IP consistency for security
+ if session["client_ip"] != client_ip:
+ security_audit.log_session_hijack_attempt(
+ original_ip=session["client_ip"],
+ attempted_ip=client_ip,
+ session_id=session_id
+ )
+ del self.sessions[session_id]
+ return False
+
+ # Update last accessed time
+ session["last_accessed"] = current_time
+ return True
+
+ def delete_session(self, session_id: str):
+ """Delete a session (logout)"""
+ with self._sessions_lock:
+ if session_id in self.sessions:
+ del self.sessions[session_id]
+
+ def get_session_username(self, session_id: str) -> Optional[str]:
+ """Get username from session"""
+ with self._sessions_lock:
+ if session_id in self.sessions:
+ return self.sessions[session_id].get("username")
+ return None
+
+ def update_session_username(self, session_id: str, new_username: str):
+ """Update username in session"""
+ with self._sessions_lock:
+ if session_id in self.sessions:
+ self.sessions[session_id]["username"] = new_username
+
+ def cleanup_expired_sessions(self):
+ """Clean up expired sessions periodically"""
+ current_time = datetime.utcnow()
+ expired_sessions = []
+
+ with self._sessions_lock:
+ for session_id, session_data in self.sessions.items():
+ if current_time - session_data["created_at"] > self.session_timeout:
+ expired_sessions.append(session_id)
+
+ for session_id in expired_sessions:
+ self.delete_session(session_id)
+
+ return len(expired_sessions)
+
+ def shutdown(self):
+ """Shutdown the session manager and cleanup thread"""
+ self._shutdown_event.set()
+ self._cleanup_thread.join(timeout=5)
+
+
+# Global session manager instance
+session_manager = SessionManager()
\ No newline at end of file
diff --git a/dockmon/backend/blackout_manager.py b/dockmon/backend/blackout_manager.py
new file mode 100644
index 0000000..9fae432
--- /dev/null
+++ b/dockmon/backend/blackout_manager.py
@@ -0,0 +1,268 @@
+"""
+Blackout Window Management for DockMon
+Handles alert suppression during maintenance windows
+"""
+
+import asyncio
+import logging
+from datetime import datetime, time, timedelta, timezone
+from typing import Dict, List, Optional, Tuple
+from database import DatabaseManager
+
+logger = logging.getLogger(__name__)
+
+
+class BlackoutManager:
+ """Manages blackout windows and deferred alerts"""
+
+ def __init__(self, db: DatabaseManager):
+ self.db = db
+ self._check_task: Optional[asyncio.Task] = None
+ self._last_check: Optional[datetime] = None
+ self._connection_manager = None # Will be set when monitoring starts
+
+ def is_in_blackout_window(self) -> Tuple[bool, Optional[str]]:
+ """
+ Check if current time is within any blackout window
+ Returns: (is_blackout, window_name)
+ """
+ try:
+ settings = self.db.get_settings()
+ if not settings or not settings.blackout_windows:
+ return False, None
+
+ # Get timezone offset from settings (in minutes), default to 0 (UTC)
+ timezone_offset = getattr(settings, 'timezone_offset', 0)
+
+ # Get current time in UTC and convert to user's timezone
+ now_utc = datetime.now(timezone.utc)
+ now_local = now_utc + timedelta(minutes=timezone_offset)
+ current_time = now_local.time()
+ current_weekday = now_local.weekday() # 0=Monday, 6=Sunday
+
+ for window in settings.blackout_windows:
+ if not window.get('enabled', True):
+ continue
+
+ days = window.get('days', [])
+ start_str = window.get('start', '00:00')
+ end_str = window.get('end', '00:00')
+
+ start_time = datetime.strptime(start_str, '%H:%M').time()
+ end_time = datetime.strptime(end_str, '%H:%M').time()
+
+ # Handle overnight windows (e.g., 23:00 to 02:00)
+ if start_time > end_time:
+ # For overnight windows, check if we're in the late night part (before midnight)
+ # or the early morning part (after midnight)
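+                    # Worked example (sketch): window 23:00-02:00 with days=[4] (Friday):
+                    #   Friday 23:30   -> current_time >= start_time and weekday 4 in days -> blackout
+                    #   Saturday 01:30 -> current_time < end_time and prev_day 4 in days   -> blackout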
+ if current_time >= start_time:
+ # Late night part - check if today is in the window
+ if current_weekday in days:
+ window_name = window.get('name', f"{start_str}-{end_str}")
+ return True, window_name
+ elif current_time < end_time:
+ # Early morning part - check if YESTERDAY was in the window
+ prev_day = (current_weekday - 1) % 7
+ if prev_day in days:
+ window_name = window.get('name', f"{start_str}-{end_str}")
+ return True, window_name
+ else:
+ # Regular same-day window
+ if current_weekday in days and start_time <= current_time < end_time:
+ window_name = window.get('name', f"{start_str}-{end_str}")
+ return True, window_name
+
+ return False, None
+
+ except Exception as e:
+ logger.error(f"Error checking blackout window: {e}")
+ return False, None
+
+ def get_last_window_end_time(self) -> Optional[datetime]:
+ """Get when the last blackout window ended (for tracking)"""
+ return getattr(self, '_last_window_end', None)
+
+ def set_last_window_end_time(self, end_time: datetime):
+ """Set when the last blackout window ended"""
+ self._last_window_end = end_time
+
+ async def check_container_states_after_blackout(self, notification_service, monitor) -> Dict:
+ """
+ Check all container states after blackout window ends.
+ Alert if any containers are in problematic states.
+ Returns summary of what was found.
+
+ Args:
+ notification_service: The notification service instance
+ monitor: The DockerMonitor instance (reused, not created)
+ """
+ summary = {
+ 'containers_down': [],
+ 'total_checked': 0,
+ 'window_name': None
+ }
+
+ try:
+
+ problematic_states = ['exited', 'dead', 'paused', 'removing']
+
+ # Check all containers across all hosts
+ for host_id, host in monitor.hosts.items():
+ if not host.client:
+ continue
+
+ try:
+ containers = host.client.containers.list(all=True)
+ summary['total_checked'] += len(containers)
+
+ for container in containers:
+ if container.status in problematic_states:
+ # Get exit code if container exited
+ exit_code = None
+ if container.status == 'exited':
+ try:
+ exit_code = container.attrs.get('State', {}).get('ExitCode')
+ except (AttributeError, KeyError, TypeError) as e:
+ logger.debug(f"Could not get exit code for container {container.id[:12]}: {e}")
+
+ summary['containers_down'].append({
+ 'id': container.id[:12],
+ 'name': container.name,
+ 'host_id': host_id,
+ 'host_name': host.name,
+ 'state': container.status,
+ 'exit_code': exit_code,
+ 'image': container.image.tags[0] if container.image.tags else 'unknown'
+ })
+
+ except Exception as e:
+ logger.error(f"Error checking containers on host {host.name}: {e}")
+
+ # Send alert if any containers are down
+ if summary['containers_down'] and notification_service:
+ await self._send_post_blackout_alert(notification_service, summary)
+
+ except Exception as e:
+ logger.error(f"Error checking container states after blackout: {e}")
+
+ return summary
+
+ async def _send_post_blackout_alert(self, notification_service, summary: Dict):
+ """Send alert for containers found in problematic state after blackout"""
+ try:
+ containers_down = summary['containers_down']
+
+ # Get all alert rules that monitor state changes
+ alert_rules = self.db.get_alert_rules()
+
+ # For each container that's down, check if it matches any alert rules
+ for container_info in containers_down:
+ # Find matching alert rules for this container
+ matching_rules = []
+ for rule in alert_rules:
+ if not rule.enabled:
+ continue
+
+ # Check if this rule monitors the problematic state
+ if rule.trigger_states and container_info['state'] in rule.trigger_states:
+ # Check if container matches rule's container pattern
+ if self._container_matches_rule(container_info, rule):
+ matching_rules.append(rule)
+
+ # Send alert through matching rules
+ if matching_rules:
+ from notifications import AlertEvent
+ event = AlertEvent(
+ container_id=container_info['id'],
+ container_name=container_info['name'],
+ host_id=container_info['host_id'],
+ host_name=container_info['host_name'],
+ old_state='unknown_during_blackout',
+ new_state=container_info['state'],
+ exit_code=container_info.get('exit_code'),
+ timestamp=datetime.now(),
+ image=container_info['image'],
+ triggered_by='post_blackout_check'
+ )
+
+ # Send through each matching rule's channels
+ for rule in matching_rules:
+ try:
+ # Add note about blackout in the event
+ event.notes = f"Container found in {container_info['state']} state after maintenance window ended"
+ await notification_service.send_alert(event, rule)
+ except Exception as e:
+ logger.error(f"Failed to send post-blackout alert for {container_info['name']}: {e}")
+
+ except Exception as e:
+ logger.error(f"Error sending post-blackout alerts: {e}")
+
+ def _container_matches_rule(self, container_info: Dict, rule) -> bool:
+ """Check if container matches an alert rule's container criteria"""
+ try:
+ # If rule has specific container+host pairs
+ if hasattr(rule, 'containers') and rule.containers:
+ for container_spec in rule.containers:
+ if (container_spec.container_name == container_info['name'] and
+ container_spec.host_id == container_info['host_id']):
+ return True
+ return False
+
+ # Otherwise, rule applies to all containers
+ return True
+
+ except Exception as e:
+ logger.error(f"Error matching container to rule: {e}")
+ return False
+
+ async def start_monitoring(self, notification_service, monitor, connection_manager=None):
+ """Start monitoring for blackout window transitions
+
+ Args:
+ notification_service: The notification service instance
+ monitor: The DockerMonitor instance (reused, not created)
+ connection_manager: Optional WebSocket connection manager
+ """
+ self._connection_manager = connection_manager
+ self._monitor = monitor # Store monitor reference
+
+ async def monitor_loop():
+ was_in_blackout = False
+
+ while True:
+ try:
+ is_blackout, window_name = self.is_in_blackout_window()
+
+ # Check if blackout status changed
+ if was_in_blackout != is_blackout:
+ # Broadcast status change to all WebSocket clients
+ if self._connection_manager:
+ await self._connection_manager.broadcast({
+ 'type': 'blackout_status_changed',
+ 'data': {
+ 'is_blackout': is_blackout,
+ 'window_name': window_name
+ }
+ })
+
+ # If we just exited blackout, process suppressed alerts
+ if was_in_blackout and not is_blackout:
+ logger.info(f"Blackout window ended. Processing suppressed alerts...")
+ await notification_service.process_suppressed_alerts(self._monitor)
+
+ was_in_blackout = is_blackout
+
+ # Check every 15 seconds for more responsive updates
+ await asyncio.sleep(15)
+
+ except Exception as e:
+ logger.error(f"Error in blackout monitoring: {e}")
+ await asyncio.sleep(15)
+
+ self._check_task = asyncio.create_task(monitor_loop())
+
+ def stop_monitoring(self):
+ """Stop the monitoring task"""
+ if self._check_task:
+ self._check_task.cancel()
+ self._check_task = None
\ No newline at end of file
diff --git a/dockmon/backend/config/__init__.py b/dockmon/backend/config/__init__.py
new file mode 100644
index 0000000..0dbcc16
--- /dev/null
+++ b/dockmon/backend/config/__init__.py
@@ -0,0 +1 @@
+# Configuration module for DockMon
\ No newline at end of file
diff --git a/dockmon/backend/config/paths.py b/dockmon/backend/config/paths.py
new file mode 100644
index 0000000..1651c3e
--- /dev/null
+++ b/dockmon/backend/config/paths.py
@@ -0,0 +1,27 @@
+"""
+Centralized path configuration for DockMon
+Ensures all modules use consistent, volume-mounted paths
+"""
+
+import os
+
+# Base paths - these MUST use absolute paths to the volume mount
+# The /app/data directory is mounted as a volume in Docker
+DATA_DIR = os.getenv('DOCKMON_DATA_DIR', '/app/data')
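+# Example override (sketch): start the backend with DOCKMON_DATA_DIR=/srv/dockmon/data to keep
+# the database, credentials file and TLS certs outside the default /app/data volume mount.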
+
+# Database path - MUST be in the volume mount for persistence
+DATABASE_PATH = os.path.join(DATA_DIR, 'dockmon.db')
+DATABASE_URL = f'sqlite:///{DATABASE_PATH}'
+
+# Credentials file - also in volume for persistence
+CREDENTIALS_FILE = os.path.join(DATA_DIR, 'frontend_credentials.txt')
+
+# Certificates directory for TLS
+CERTS_DIR = os.path.join(DATA_DIR, 'certs')
+
+# Ensure data directory exists with proper permissions
+def ensure_data_dirs():
+ """Create data directories if they don't exist"""
+ for directory in [DATA_DIR, CERTS_DIR]:
+ # Use mode parameter to avoid TOCTOU race condition
+ os.makedirs(directory, mode=0o700, exist_ok=True)
\ No newline at end of file
diff --git a/dockmon/backend/config/settings.py b/dockmon/backend/config/settings.py
new file mode 100644
index 0000000..0dd5fbf
--- /dev/null
+++ b/dockmon/backend/config/settings.py
@@ -0,0 +1,165 @@
+"""
+Configuration Management for DockMon
+Centralizes all environment-based configuration and settings
+"""
+
+import os
+import logging
+from logging.handlers import RotatingFileHandler
+from typing import List
+
+
+def setup_logging():
+ """Configure application logging with rotation"""
+ from .paths import DATA_DIR
+
+ # Create logs directory with secure permissions
+ log_dir = os.path.join(DATA_DIR, 'logs')
+ os.makedirs(log_dir, mode=0o700, exist_ok=True)
+
+ # Set up root logger
+ root_logger = logging.getLogger()
+
+ # Check if handlers are already configured (prevent duplicate handlers)
+ if root_logger.handlers:
+ return
+
+ root_logger.setLevel(logging.INFO)
+
+ # Console handler for stdout
+ console_handler = logging.StreamHandler()
+ console_handler.setLevel(logging.INFO)
+ console_formatter = logging.Formatter(
+ '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+ )
+ console_handler.setFormatter(console_formatter)
+
+ # File handler with rotation for application logs
+ # Max 10MB per file, keep 14 backups
+ file_handler = RotatingFileHandler(
+ os.path.join(log_dir, 'dockmon.log'),
+ maxBytes=10*1024*1024, # 10MB
+ backupCount=14, # Keep 14 old files
+ encoding='utf-8'
+ )
+ file_handler.setLevel(logging.INFO)
+ file_handler.setFormatter(console_formatter)
+
+ # Add handlers to root logger
+ root_logger.addHandler(console_handler)
+ root_logger.addHandler(file_handler)
+
+
+def _is_docker_container_id(hostname: str) -> bool:
+ """Check if hostname looks like a Docker container ID"""
+ if len(hostname) == 64 or len(hostname) == 12:
+ try:
+ int(hostname, 16) # Check if it's hexadecimal
+ return True
+ except ValueError:
+ pass
+ return False
+
+
+def get_cors_origins() -> List[str]:
+ """Get CORS origins from environment or use defaults"""
+ # Check for custom origins from environment
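+    # Example (sketch): DOCKMON_CORS_ORIGINS="https://dockmon.example.com,https://monitor.example.com"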
+ custom_origins = os.getenv('DOCKMON_CORS_ORIGINS')
+ if custom_origins:
+ return [origin.strip() for origin in custom_origins.split(',')]
+
+ # Default origins for development and common deployment scenarios
+ default_origins = [
+ "http://localhost:3000",
+ "http://localhost:8080",
+ "http://localhost:8081",
+ "http://127.0.0.1:3000",
+ "http://127.0.0.1:8080",
+ "http://127.0.0.1:8081"
+ ]
+
+ # Auto-detect common production patterns (but skip Docker container IDs)
+ hostname = os.getenv('HOSTNAME', 'localhost')
+ if hostname != 'localhost' and not _is_docker_container_id(hostname):
+ default_origins.extend([
+ f"http://{hostname}:3000",
+ f"http://{hostname}:8080",
+ f"https://{hostname}:3000",
+ f"https://{hostname}:8080"
+ ])
+
+ return default_origins
+
+
+class RateLimitConfig:
+ """Rate limiting configuration from environment variables"""
+
+ @staticmethod
+ def get_limits() -> dict:
+ """Get all rate limiting configuration from environment"""
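+        # Example (sketch): DOCKMON_RATE_LIMIT_AUTH=30 and DOCKMON_RATE_BURST_AUTH=10 tighten the
+        # auth endpoints to 30 requests/minute with a burst of 10; unset variables keep the defaults below.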
+ return {
+ # endpoint_pattern: (requests_per_minute, burst_limit, violation_threshold)
+ "default": (
+ int(os.getenv('DOCKMON_RATE_LIMIT_DEFAULT', 120)),
+ int(os.getenv('DOCKMON_RATE_BURST_DEFAULT', 20)),
+ int(os.getenv('DOCKMON_RATE_VIOLATIONS_DEFAULT', 8))
+ ),
+ "auth": (
+ int(os.getenv('DOCKMON_RATE_LIMIT_AUTH', 60)),
+ int(os.getenv('DOCKMON_RATE_BURST_AUTH', 15)),
+ int(os.getenv('DOCKMON_RATE_VIOLATIONS_AUTH', 5))
+ ),
+ "hosts": (
+ int(os.getenv('DOCKMON_RATE_LIMIT_HOSTS', 60)),
+ int(os.getenv('DOCKMON_RATE_BURST_HOSTS', 15)),
+ int(os.getenv('DOCKMON_RATE_VIOLATIONS_HOSTS', 8))
+ ),
+ "containers": (
+ int(os.getenv('DOCKMON_RATE_LIMIT_CONTAINERS', 200)),
+ int(os.getenv('DOCKMON_RATE_BURST_CONTAINERS', 40)),
+ int(os.getenv('DOCKMON_RATE_VIOLATIONS_CONTAINERS', 15))
+ ),
+ "notifications": (
+ int(os.getenv('DOCKMON_RATE_LIMIT_NOTIFICATIONS', 30)),
+ int(os.getenv('DOCKMON_RATE_BURST_NOTIFICATIONS', 10)),
+ int(os.getenv('DOCKMON_RATE_VIOLATIONS_NOTIFICATIONS', 5))
+ ),
+ }
+
+
+class AppConfig:
+ """Main application configuration"""
+
+ # Server settings
+ HOST = os.getenv('DOCKMON_HOST', '0.0.0.0')
+ PORT = int(os.getenv('DOCKMON_PORT', 8080))
+
+ # Security settings
+ CORS_ORIGINS = get_cors_origins()
+
+ # Import centralized paths
+ from .paths import DATABASE_URL as DEFAULT_DATABASE_URL, CREDENTIALS_FILE as DEFAULT_CREDENTIALS_FILE
+
+ # Database settings
+ DATABASE_URL = os.getenv('DOCKMON_DATABASE_URL', DEFAULT_DATABASE_URL)
+
+ # Logging
+ LOG_LEVEL = os.getenv('DOCKMON_LOG_LEVEL', 'INFO')
+
+ # Authentication
+ CREDENTIALS_FILE = os.getenv('DOCKMON_CREDENTIALS_FILE', DEFAULT_CREDENTIALS_FILE)
+ SESSION_TIMEOUT_HOURS = int(os.getenv('DOCKMON_SESSION_TIMEOUT_HOURS', 24))
+
+ # Rate limiting
+ RATE_LIMITS = RateLimitConfig.get_limits()
+
+ @classmethod
+ def validate(cls):
+ """Validate configuration"""
+ if cls.PORT < 1 or cls.PORT > 65535:
+ raise ValueError(f"Invalid port: {cls.PORT}")
+
+ if cls.SESSION_TIMEOUT_HOURS < 1:
+ raise ValueError(f"Session timeout must be at least 1 hour: {cls.SESSION_TIMEOUT_HOURS}")
+
+ return True
\ No newline at end of file
diff --git a/dockmon/backend/database.py b/dockmon/backend/database.py
new file mode 100644
index 0000000..de38200
--- /dev/null
+++ b/dockmon/backend/database.py
@@ -0,0 +1,1091 @@
+"""
+Database models and operations for DockMon
+Uses SQLite for persistent storage of configuration and settings
+"""
+
+from datetime import datetime, timedelta
+from typing import Optional, List, Dict, Any
+from sqlalchemy import create_engine, Column, String, Integer, Boolean, DateTime, JSON, ForeignKey, Text, UniqueConstraint, text
+from sqlalchemy.ext.declarative import declarative_base
+from sqlalchemy.orm import sessionmaker, Session, relationship
+from sqlalchemy.pool import StaticPool
+import json
+import os
+import logging
+import secrets
+import bcrypt
+
+logger = logging.getLogger(__name__)
+
+Base = declarative_base()
+
+class User(Base):
+ """User authentication and settings"""
+ __tablename__ = "users"
+
+ id = Column(Integer, primary_key=True, autoincrement=True)
+ username = Column(String, nullable=False, unique=True)
+ password_hash = Column(String, nullable=False)
+ is_first_login = Column(Boolean, default=True)
+ must_change_password = Column(Boolean, default=False)
+ dashboard_layout = Column(Text, nullable=True) # JSON string of GridStack layout
+ event_sort_order = Column(String, default='desc') # 'desc' (newest first) or 'asc' (oldest first)
+ container_sort_order = Column(String, default='name-asc') # Container sort preference on dashboard
+ modal_preferences = Column(Text, nullable=True) # JSON string of modal size/position preferences
+ created_at = Column(DateTime, default=datetime.now)
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+ last_login = Column(DateTime, nullable=True)
+
+class DockerHostDB(Base):
+ """Docker host configuration"""
+ __tablename__ = "docker_hosts"
+
+ id = Column(String, primary_key=True)
+ name = Column(String, nullable=False, unique=True)
+ url = Column(String, nullable=False)
+ tls_cert = Column(Text, nullable=True)
+ tls_key = Column(Text, nullable=True)
+ tls_ca = Column(Text, nullable=True)
+ security_status = Column(String, nullable=True) # 'secure', 'insecure', 'unknown'
+ is_active = Column(Boolean, default=True)
+ created_at = Column(DateTime, default=datetime.now)
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+
+ # Relationships
+ auto_restart_configs = relationship("AutoRestartConfig", back_populates="host", cascade="all, delete-orphan")
+
+class AutoRestartConfig(Base):
+ """Auto-restart configuration for containers"""
+ __tablename__ = "auto_restart_configs"
+
+ id = Column(Integer, primary_key=True, autoincrement=True)
+ host_id = Column(String, ForeignKey("docker_hosts.id"))
+ container_id = Column(String, nullable=False)
+ container_name = Column(String, nullable=False)
+ enabled = Column(Boolean, default=True)
+ max_retries = Column(Integer, default=3)
+ retry_delay = Column(Integer, default=30)
+ restart_count = Column(Integer, default=0)
+ last_restart = Column(DateTime, nullable=True)
+ created_at = Column(DateTime, default=datetime.now)
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+
+ # Relationships
+ host = relationship("DockerHostDB", back_populates="auto_restart_configs")
+
+
+class GlobalSettings(Base):
+ """Global application settings"""
+ __tablename__ = "global_settings"
+
+ id = Column(Integer, primary_key=True, default=1)
+ max_retries = Column(Integer, default=3)
+ retry_delay = Column(Integer, default=30)
+ default_auto_restart = Column(Boolean, default=False)
+ polling_interval = Column(Integer, default=2)
+ connection_timeout = Column(Integer, default=10)
+ log_retention_days = Column(Integer, default=7)
+ event_retention_days = Column(Integer, default=30) # Keep events for 30 days
+ enable_notifications = Column(Boolean, default=True)
+ auto_cleanup_events = Column(Boolean, default=True) # Auto cleanup old events
+ alert_template = Column(Text, nullable=True) # Global notification template
+ blackout_windows = Column(JSON, nullable=True) # Array of blackout time windows
+ first_run_complete = Column(Boolean, default=False) # Track if first run setup is complete
+ polling_interval_migrated = Column(Boolean, default=False) # Track if polling interval has been migrated to 2s
+ timezone_offset = Column(Integer, default=0) # Timezone offset in minutes from UTC
+ show_host_stats = Column(Boolean, default=True) # Show host statistics graphs on dashboard
+ show_container_stats = Column(Boolean, default=True) # Show container statistics on dashboard
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+
+class NotificationChannel(Base):
+ """Notification channel configuration"""
+ __tablename__ = "notification_channels"
+
+ id = Column(Integer, primary_key=True, autoincrement=True)
+ name = Column(String, nullable=False, unique=True)
+ type = Column(String, nullable=False) # telegram, discord, slack, pushover
+ config = Column(JSON, nullable=False) # Channel-specific configuration
+ enabled = Column(Boolean, default=True)
+ created_at = Column(DateTime, default=datetime.now)
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+
+class AlertRuleDB(Base):
+ """Alert rules for container state changes"""
+ __tablename__ = "alert_rules"
+
+ id = Column(String, primary_key=True)
+ name = Column(String, nullable=False)
+ trigger_events = Column(JSON, nullable=True) # list of Docker events that trigger alert
+ trigger_states = Column(JSON, nullable=True) # list of states that trigger alert
+ notification_channels = Column(JSON, nullable=False) # list of channel IDs
+ cooldown_minutes = Column(Integer, default=15) # prevent spam
+ enabled = Column(Boolean, default=True)
+ last_triggered = Column(DateTime, nullable=True)
+ created_at = Column(DateTime, default=datetime.now)
+ updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
+
+ # Relationships
+ containers = relationship("AlertRuleContainer", back_populates="alert_rule", cascade="all, delete-orphan")
+
+class AlertRuleContainer(Base):
+ """Container+Host pairs for alert rules"""
+ __tablename__ = "alert_rule_containers"
+
+ id = Column(Integer, primary_key=True, autoincrement=True)
+ alert_rule_id = Column(String, ForeignKey("alert_rules.id", ondelete="CASCADE"), nullable=False)
+ host_id = Column(String, ForeignKey("docker_hosts.id", ondelete="CASCADE"), nullable=False)
+ container_name = Column(String, nullable=False)
+ created_at = Column(DateTime, default=datetime.now)
+
+ # Relationships
+ alert_rule = relationship("AlertRuleDB", back_populates="containers")
+ host = relationship("DockerHostDB")
+
+ # Unique constraint to prevent duplicates
+ __table_args__ = (
+ UniqueConstraint('alert_rule_id', 'host_id', 'container_name', name='_alert_container_uc'),
+ )
+
+class EventLog(Base):
+ """Comprehensive event logging for all DockMon activities"""
+ __tablename__ = "event_logs"
+
+ id = Column(Integer, primary_key=True, autoincrement=True)
+ correlation_id = Column(String, nullable=True) # For linking related events
+
+ # Event categorization
+ category = Column(String, nullable=False) # container, host, system, alert, notification
+ event_type = Column(String, nullable=False) # state_change, action_taken, error, etc.
+ severity = Column(String, nullable=False, default='info') # debug, info, warning, error, critical
+
+ # Target information
+ host_id = Column(String, nullable=True)
+ host_name = Column(String, nullable=True)
+ container_id = Column(String, nullable=True)
+ container_name = Column(String, nullable=True)
+
+ # Event details
+ title = Column(String, nullable=False) # Short description
+ message = Column(Text, nullable=True) # Detailed description
+ old_state = Column(String, nullable=True)
+ new_state = Column(String, nullable=True)
+ triggered_by = Column(String, nullable=True) # user, system, auto_restart, alert
+
+ # Additional data
+ details = Column(JSON, nullable=True) # Structured additional data
+ duration_ms = Column(Integer, nullable=True) # For performance tracking
+
+ # Timestamps
+ timestamp = Column(DateTime, default=datetime.now, nullable=False)
+
+ # Index for efficient queries
+ __table_args__ = (
+ {"sqlite_autoincrement": True},
+ )
+
+class DatabaseManager:
+ """Database management and operations"""
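+    # Typical usage (sketch, mirroring auth/routes.py):
+    #   db = DatabaseManager(DATABASE_PATH)  # path from config.paths
+    #   db.get_or_create_default_user()      # ensure the default account exists
+    #   settings = db.get_settings()         # read the GlobalSettings row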
+
+ def __init__(self, db_path: str = "data/dockmon.db"):
+ """Initialize database connection"""
+ self.db_path = db_path
+
+ # Ensure data directory exists
+ data_dir = os.path.dirname(db_path)
+ os.makedirs(data_dir, exist_ok=True)
+
+ # Set secure permissions on data directory (rwx for owner only)
+ try:
+ os.chmod(data_dir, 0o700)
+ logger.info(f"Set secure permissions (700) on data directory: {data_dir}")
+ except OSError as e:
+ logger.warning(f"Could not set permissions on data directory {data_dir}: {e}")
+
+ # Create engine with connection pooling
+ self.engine = create_engine(
+ f"sqlite:///{db_path}",
+ connect_args={"check_same_thread": False},
+ poolclass=StaticPool,
+ echo=False
+ )
+
+ # Create session factory
+ self.SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=self.engine)
+
+ # Create tables if they don't exist
+ Base.metadata.create_all(bind=self.engine)
+
+ # Run database migrations
+ self._run_migrations()
+
+ # Set secure permissions on database file (rw for owner only)
+ self._secure_database_file()
+
+ # Initialize default settings if needed
+ self._initialize_defaults()
+
+ def _run_migrations(self):
+ """Run database migrations for schema updates"""
+ try:
+ with self.get_session() as session:
+ # Migration: Populate security_status for existing hosts
+ hosts_without_security_status = session.query(DockerHostDB).filter(
+ DockerHostDB.security_status.is_(None)
+ ).all()
+
+ for host in hosts_without_security_status:
+ # Determine security status based on existing data
+ if host.url and not host.url.startswith('unix://'):
+ if host.tls_cert and host.tls_key:
+ host.security_status = 'secure'
+ else:
+ host.security_status = 'insecure'
+ # Unix socket connections don't need security status
+
+ if hosts_without_security_status:
+ session.commit()
+ print(f"Migrated {len(hosts_without_security_status)} hosts with security status")
+
+ # Migration: Add event_sort_order column to users table if it doesn't exist
+ inspector = session.connection().engine.dialect.get_columns(session.connection(), 'users')
+ column_names = [col.get('name', '') for col in inspector if 'name' in col]
+
+ if 'event_sort_order' not in column_names:
+ # Add the column using raw SQL
+ session.execute(text("ALTER TABLE users ADD COLUMN event_sort_order VARCHAR DEFAULT 'desc'"))
+ session.commit()
+ print("Added event_sort_order column to users table")
+
+ # Migration: Add container_sort_order column to users table if it doesn't exist
+ if 'container_sort_order' not in column_names:
+ # Add the column using raw SQL
+ session.execute(text("ALTER TABLE users ADD COLUMN container_sort_order VARCHAR DEFAULT 'name-asc'"))
+ session.commit()
+ print("Added container_sort_order column to users table")
+
+ # Migration: Add modal_preferences column to users table if it doesn't exist
+ if 'modal_preferences' not in column_names:
+ # Add the column using raw SQL
+ session.execute(text("ALTER TABLE users ADD COLUMN modal_preferences TEXT"))
+ session.commit()
+ print("Added modal_preferences column to users table")
+
+ # Migration: Add show_host_stats and show_container_stats columns to global_settings table
+ settings_inspector = session.connection().engine.dialect.get_columns(session.connection(), 'global_settings')
+ settings_column_names = [col['name'] for col in settings_inspector]
+
+ if 'show_host_stats' not in settings_column_names:
+ session.execute(text("ALTER TABLE global_settings ADD COLUMN show_host_stats BOOLEAN DEFAULT 1"))
+ session.commit()
+ print("Added show_host_stats column to global_settings table")
+
+ if 'show_container_stats' not in settings_column_names:
+ session.execute(text("ALTER TABLE global_settings ADD COLUMN show_container_stats BOOLEAN DEFAULT 1"))
+ session.commit()
+ print("Added show_container_stats column to global_settings table")
+
+ # Migration: Drop deprecated container_history table
+ # This table has been replaced by the EventLog table
+ inspector_result = session.connection().engine.dialect.get_table_names(session.connection())
+ if 'container_history' in inspector_result:
+ session.execute(text("DROP TABLE container_history"))
+ session.commit()
+ print("Dropped deprecated container_history table (replaced by EventLog)")
+
+ # Migration: Add polling_interval_migrated column if it doesn't exist
+ if 'polling_interval_migrated' not in settings_column_names:
+ session.execute(text("ALTER TABLE global_settings ADD COLUMN polling_interval_migrated BOOLEAN DEFAULT 0"))
+ session.commit()
+ print("Added polling_interval_migrated column to global_settings table")
+
+ # Migration: Update polling_interval to 2 seconds (only once, on first startup after this update)
+ settings = session.query(GlobalSettings).first()
+ if settings and not settings.polling_interval_migrated:
+ # Only update if the user hasn't customized it (still at old default of 5 or 10)
+ if settings.polling_interval >= 5:
+ settings.polling_interval = 2
+ settings.polling_interval_migrated = True
+ session.commit()
+ print("Migrated polling_interval to 2 seconds (from previous default)")
+ else:
+ # User has already customized to something < 5, just mark as migrated
+ settings.polling_interval_migrated = True
+ session.commit()
+
+ except Exception as e:
+ print(f"Migration warning: {e}")
+ # Don't fail startup on migration errors
+
+ def _secure_database_file(self):
+ """Set secure file permissions on the SQLite database file"""
+ try:
+ if os.path.exists(self.db_path):
+ # Set file permissions to 600 (read/write for owner only)
+ os.chmod(self.db_path, 0o600)
+ logger.info(f"Set secure permissions (600) on database file: {self.db_path}")
+ else:
+ # File doesn't exist yet - will be created by SQLAlchemy
+ # Schedule permission setting for after first connection
+ self._schedule_file_permissions()
+ except OSError as e:
+ logger.warning(f"Could not set permissions on database file {self.db_path}: {e}")
+
+ def _schedule_file_permissions(self):
+ """Schedule file permission setting for after database file is created"""
+ # Create a connection to ensure the file exists
+ with self.engine.connect() as conn:
+ pass
+
+ # Now set permissions
+ try:
+ if os.path.exists(self.db_path):
+ os.chmod(self.db_path, 0o600)
+ logger.info(f"Set secure permissions (600) on newly created database file: {self.db_path}")
+ except OSError as e:
+ logger.warning(f"Could not set permissions on newly created database file {self.db_path}: {e}")
+
+ def _initialize_defaults(self):
+ """Initialize default settings if they don't exist"""
+ with self.get_session() as session:
+ # Check if global settings exist
+ settings = session.query(GlobalSettings).first()
+ if not settings:
+ settings = GlobalSettings()
+ session.add(settings)
+ session.commit()
+
+ def get_session(self) -> Session:
+ """Get a database session"""
+ return self.SessionLocal()
+
+ # Docker Host Operations
+ def add_host(self, host_data: dict) -> DockerHostDB:
+ """Add a new Docker host"""
+ with self.get_session() as session:
+ try:
+ host = DockerHostDB(**host_data)
+ session.add(host)
+ session.commit()
+ session.refresh(host)
+ logger.info(f"Added host {host.name} ({host.id[:8]}) to database")
+ return host
+ except Exception as e:
+ logger.error(f"Failed to add host to database: {e}")
+ raise
+
+ def get_hosts(self, active_only: bool = True) -> List[DockerHostDB]:
+ """Get all Docker hosts ordered by creation time"""
+ with self.get_session() as session:
+ query = session.query(DockerHostDB)
+ if active_only:
+ query = query.filter(DockerHostDB.is_active == True)
+ # Order by created_at to ensure consistent ordering (oldest first)
+ query = query.order_by(DockerHostDB.created_at)
+ return query.all()
+
+ def get_host(self, host_id: str) -> Optional[DockerHostDB]:
+ """Get a specific Docker host"""
+ with self.get_session() as session:
+ return session.query(DockerHostDB).filter(DockerHostDB.id == host_id).first()
+
+ def update_host(self, host_id: str, updates: dict) -> Optional[DockerHostDB]:
+ """Update a Docker host"""
+ with self.get_session() as session:
+ try:
+ host = session.query(DockerHostDB).filter(DockerHostDB.id == host_id).first()
+ if host:
+ for key, value in updates.items():
+ setattr(host, key, value)
+ host.updated_at = datetime.now()
+ session.commit()
+ session.refresh(host)
+ logger.info(f"Updated host {host.name} ({host_id[:8]}) in database")
+ return host
+ except Exception as e:
+ logger.error(f"Failed to update host {host_id[:8]} in database: {e}")
+ raise
+
+ def delete_host(self, host_id: str) -> bool:
+ """Delete a Docker host and clean up related alert rules"""
+ with self.get_session() as session:
+ try:
+ host = session.query(DockerHostDB).filter(DockerHostDB.id == host_id).first()
+ if not host:
+ logger.warning(f"Attempted to delete non-existent host {host_id[:8]}")
+ return False
+
+ host_name = host.name
+
+ # Get all alert rules
+ all_rules = session.query(AlertRuleDB).all()
+
+ # Process each alert rule to remove containers from the deleted host
+ for rule in all_rules:
+ if not rule.containers:
+ continue
+
+ # Filter out containers from the deleted host
+ # rule.containers is a list of AlertRuleContainer objects
+ remaining_containers = [
+ c for c in rule.containers
+ if c.host_id != host_id
+ ]
+
+ if not remaining_containers:
+ # No containers left, delete the entire alert
+ session.delete(rule)
+ logger.info(f"Deleted alert rule '{rule.name}' (all containers were on deleted host {host_id})")
+ elif len(remaining_containers) < len(rule.containers):
+ # Some containers remain, update the alert
+ rule.containers = remaining_containers
+ rule.updated_at = datetime.now()
+ logger.info(f"Updated alert rule '{rule.name}' (removed containers from host {host_id})")
+
+ # Delete the host
+ session.delete(host)
+ session.commit()
+ logger.info(f"Deleted host {host_name} ({host_id[:8]}) from database")
+ return True
+ except Exception as e:
+ logger.error(f"Failed to delete host {host_id[:8]} from database: {e}")
+ raise
+
+ # Auto-Restart Configuration
+ def get_auto_restart_config(self, host_id: str, container_id: str) -> Optional[AutoRestartConfig]:
+ """Get auto-restart configuration for a container"""
+ with self.get_session() as session:
+ return session.query(AutoRestartConfig).filter(
+ AutoRestartConfig.host_id == host_id,
+ AutoRestartConfig.container_id == container_id
+ ).first()
+
+ def set_auto_restart(self, host_id: str, container_id: str, container_name: str, enabled: bool):
+ """Set auto-restart configuration for a container"""
+ with self.get_session() as session:
+ try:
+ config = session.query(AutoRestartConfig).filter(
+ AutoRestartConfig.host_id == host_id,
+ AutoRestartConfig.container_id == container_id
+ ).first()
+
+ if config:
+ config.enabled = enabled
+ config.updated_at = datetime.now()
+ if not enabled:
+ config.restart_count = 0
+ logger.info(f"Updated auto-restart for {container_name} ({container_id[:12]}): enabled={enabled}")
+ else:
+ config = AutoRestartConfig(
+ host_id=host_id,
+ container_id=container_id,
+ container_name=container_name,
+ enabled=enabled
+ )
+ session.add(config)
+ logger.info(f"Created auto-restart config for {container_name} ({container_id[:12]}): enabled={enabled}")
+
+ session.commit()
+ except Exception as e:
+ logger.error(f"Failed to set auto-restart for {container_id[:12]}: {e}")
+ raise
+
+ def increment_restart_count(self, host_id: str, container_id: str):
+ """Increment restart count for a container"""
+ with self.get_session() as session:
+ try:
+ config = session.query(AutoRestartConfig).filter(
+ AutoRestartConfig.host_id == host_id,
+ AutoRestartConfig.container_id == container_id
+ ).first()
+
+ if config:
+ config.restart_count += 1
+ config.last_restart = datetime.now()
+ session.commit()
+ logger.debug(f"Incremented restart count for {container_id[:12]} to {config.restart_count}")
+ except Exception as e:
+ logger.error(f"Failed to increment restart count for {container_id[:12]}: {e}")
+ raise
+
+ def reset_restart_count(self, host_id: str, container_id: str):
+ """Reset restart count for a container"""
+ with self.get_session() as session:
+ try:
+ config = session.query(AutoRestartConfig).filter(
+ AutoRestartConfig.host_id == host_id,
+ AutoRestartConfig.container_id == container_id
+ ).first()
+
+ if config:
+ config.restart_count = 0
+ session.commit()
+ logger.debug(f"Reset restart count for {container_id[:12]}")
+ except Exception as e:
+ logger.error(f"Failed to reset restart count for {container_id[:12]}: {e}")
+ raise
+
+ # Global Settings
+ def get_settings(self) -> GlobalSettings:
+ """Get global settings"""
+ with self.get_session() as session:
+ return session.query(GlobalSettings).first()
+
+ def update_settings(self, updates: dict) -> GlobalSettings:
+ """Update global settings"""
+ with self.get_session() as session:
+ try:
+ settings = session.query(GlobalSettings).first()
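+                # Only keys that match existing GlobalSettings attributes are applied;
+                # unknown keys are silently ignored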
+ for key, value in updates.items():
+ if hasattr(settings, key):
+ setattr(settings, key, value)
+ settings.updated_at = datetime.now()
+ session.commit()
+ session.refresh(settings)
+ logger.info("Changed global settings")
+ return settings
+ except Exception as e:
+ logger.error(f"Failed to update global settings: {e}")
+ raise
+
+ # Notification Channels
+ def add_notification_channel(self, channel_data: dict) -> NotificationChannel:
+ """Add a notification channel"""
+ with self.get_session() as session:
+ try:
+ channel = NotificationChannel(**channel_data)
+ session.add(channel)
+ session.commit()
+ session.refresh(channel)
+ logger.info(f"Added notification channel: {channel.name} (type: {channel.type})")
+ return channel
+ except Exception as e:
+ logger.error(f"Failed to add notification channel: {e}")
+ raise
+
+ def get_notification_channels(self, enabled_only: bool = True) -> List[NotificationChannel]:
+        """Get notification channels (only enabled ones by default; pass enabled_only=False for all)"""
+ with self.get_session() as session:
+ query = session.query(NotificationChannel)
+ if enabled_only:
+ query = query.filter(NotificationChannel.enabled == True)
+ return query.all()
+
+ def get_notification_channels_by_ids(self, channel_ids: List[int]) -> List[NotificationChannel]:
+ """Get notification channels by their IDs"""
+ with self.get_session() as session:
+ channels = session.query(NotificationChannel).filter(
+ NotificationChannel.id.in_(channel_ids),
+ NotificationChannel.enabled == True
+ ).all()
+
+ # Detach from session to avoid lazy loading issues
+ session.expunge_all()
+ return channels
+
+ def update_notification_channel(self, channel_id: int, updates: dict) -> Optional[NotificationChannel]:
+ """Update a notification channel"""
+ with self.get_session() as session:
+ try:
+ channel = session.query(NotificationChannel).filter(NotificationChannel.id == channel_id).first()
+ if channel:
+ for key, value in updates.items():
+ setattr(channel, key, value)
+ channel.updated_at = datetime.now()
+ session.commit()
+ session.refresh(channel)
+ logger.info(f"Updated notification channel: {channel.name} (ID: {channel_id})")
+ return channel
+ except Exception as e:
+ logger.error(f"Failed to update notification channel {channel_id}: {e}")
+ raise
+
+ def delete_notification_channel(self, channel_id: int) -> bool:
+ """Delete a notification channel"""
+ with self.get_session() as session:
+ try:
+ channel = session.query(NotificationChannel).filter(NotificationChannel.id == channel_id).first()
+ if channel:
+ channel_name = channel.name
+ session.delete(channel)
+ session.commit()
+ logger.info(f"Deleted notification channel: {channel_name} (ID: {channel_id})")
+ return True
+ logger.warning(f"Attempted to delete non-existent notification channel {channel_id}")
+ return False
+ except Exception as e:
+ logger.error(f"Failed to delete notification channel {channel_id}: {e}")
+ raise
+
+ # Alert Rules
+ def add_alert_rule(self, rule_data: dict) -> AlertRuleDB:
+ """Add an alert rule with container+host pairs"""
+ with self.get_session() as session:
+ try:
+ # Extract containers list if present
+ containers_data = rule_data.pop('containers', None)
+
+ rule = AlertRuleDB(**rule_data)
+ session.add(rule)
+ session.flush() # Flush to get the ID without committing
+
+ # Add container+host pairs if provided
+ if containers_data:
+ for container in containers_data:
+ container_pair = AlertRuleContainer(
+ alert_rule_id=rule.id,
+ host_id=container['host_id'],
+ container_name=container['container_name']
+ )
+ session.add(container_pair)
+
+ session.commit()
+ logger.info(f"Added alert rule: {rule.name} (ID: {rule.id})")
+ except Exception as e:
+ logger.error(f"Failed to add alert rule: {e}")
+ raise
+
+ # Create a detached copy with all needed attributes
+ rule_dict = {
+ 'id': rule.id,
+ 'name': rule.name,
+ 'trigger_events': rule.trigger_events,
+ 'trigger_states': rule.trigger_states,
+ 'notification_channels': rule.notification_channels,
+ 'cooldown_minutes': rule.cooldown_minutes,
+ 'enabled': rule.enabled,
+ 'last_triggered': rule.last_triggered,
+ 'created_at': rule.created_at,
+ 'updated_at': rule.updated_at
+ }
+
+ # Return a new instance that's not attached to the session
+ detached_rule = AlertRuleDB(**rule_dict)
+ detached_rule.containers = [] # Initialize empty containers list
+
+ return detached_rule
+
+ def get_alert_rule(self, rule_id: str) -> Optional[AlertRuleDB]:
+ """Get a single alert rule by ID"""
+ with self.get_session() as session:
+ from sqlalchemy.orm import joinedload
+ rule = session.query(AlertRuleDB).options(joinedload(AlertRuleDB.containers)).filter(AlertRuleDB.id == rule_id).first()
+ if rule:
+ session.expunge(rule)
+ return rule
+
+ def get_alert_rules(self, enabled_only: bool = True) -> List[AlertRuleDB]:
+        """Get alert rules (only enabled ones by default; pass enabled_only=False for all)"""
+ with self.get_session() as session:
+ from sqlalchemy.orm import joinedload
+ query = session.query(AlertRuleDB).options(joinedload(AlertRuleDB.containers))
+ if enabled_only:
+ query = query.filter(AlertRuleDB.enabled == True)
+ rules = query.all()
+ # Detach from session to avoid lazy loading issues
+ for rule in rules:
+ session.expunge(rule)
+ return rules
+
+ def update_alert_rule(self, rule_id: str, updates: dict) -> Optional[AlertRuleDB]:
+ """Update an alert rule and its container+host pairs"""
+ with self.get_session() as session:
+ try:
+ from sqlalchemy.orm import joinedload
+ rule = session.query(AlertRuleDB).options(joinedload(AlertRuleDB.containers)).filter(AlertRuleDB.id == rule_id).first()
+ if rule:
+ # Check if containers field is present before extracting it
+ has_containers_update = 'containers' in updates
+ containers_data = updates.pop('containers', None)
+
+ # Update rule fields
+ for key, value in updates.items():
+ setattr(rule, key, value)
+ rule.updated_at = datetime.now()
+
+ # Update container+host pairs if containers field was explicitly provided
+ # (could be None for "all containers", empty list, or list with specific containers)
+ if has_containers_update:
+ # Delete existing container pairs
+ session.query(AlertRuleContainer).filter(
+ AlertRuleContainer.alert_rule_id == rule_id
+ ).delete()
+
+ # Add new container pairs (if containers_data is None or empty, no new pairs are added)
+ if containers_data:
+ for container in containers_data:
+ container_pair = AlertRuleContainer(
+ alert_rule_id=rule_id,
+ host_id=container['host_id'],
+ container_name=container['container_name']
+ )
+ session.add(container_pair)
+
+ session.commit()
+ session.refresh(rule)
+ # Load containers relationship
+ _ = rule.containers
+ session.expunge(rule)
+ logger.info(f"Updated alert rule: {rule.name} (ID: {rule_id})")
+ return rule
+ except Exception as e:
+ logger.error(f"Failed to update alert rule {rule_id}: {e}")
+ raise
+
+ def delete_alert_rule(self, rule_id: str) -> bool:
+ """Delete an alert rule"""
+ with self.get_session() as session:
+ try:
+ rule = session.query(AlertRuleDB).filter(AlertRuleDB.id == rule_id).first()
+ if rule:
+ rule_name = rule.name
+ session.delete(rule)
+ session.commit()
+ logger.info(f"Deleted alert rule: {rule_name} (ID: {rule_id})")
+ return True
+ logger.warning(f"Attempted to delete non-existent alert rule {rule_id}")
+ return False
+ except Exception as e:
+ logger.error(f"Failed to delete alert rule {rule_id}: {e}")
+ raise
+
+ def get_alerts_dependent_on_channel(self, channel_id: int) -> List[dict]:
+        """Find alert rules that would be left without any notification channel if this channel were deleted (i.e., it is their only channel)"""
+ with self.get_session() as session:
+ all_rules = session.query(AlertRuleDB).all()
+ dependent_alerts = []
+
+ for rule in all_rules:
+ # Parse notification_channels JSON
+ channels = rule.notification_channels if isinstance(rule.notification_channels, list) else []
+ # Check if this is the ONLY channel for this alert
+ if len(channels) == 1 and channel_id in channels:
+ dependent_alerts.append({
+ 'id': rule.id,
+ 'name': rule.name
+ })
+
+ return dependent_alerts
+
+ # Event Logging Operations
+ def add_event(self, event_data: dict) -> EventLog:
+ """Add an event to the event log"""
+ with self.get_session() as session:
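+            # event_data keys must match EventLog column names; an unexpected key
+            # would raise a TypeError from the model constructor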
+ event = EventLog(**event_data)
+ session.add(event)
+ session.commit()
+ session.refresh(event)
+ return event
+
+ def get_events(self,
+ category: Optional[List[str]] = None,
+ event_type: Optional[str] = None,
+ severity: Optional[List[str]] = None,
+ host_id: Optional[List[str]] = None,
+ container_id: Optional[str] = None,
+ container_name: Optional[str] = None,
+ start_date: Optional[datetime] = None,
+ end_date: Optional[datetime] = None,
+ correlation_id: Optional[str] = None,
+ search: Optional[str] = None,
+ limit: int = 100,
+ offset: int = 0) -> tuple[List[EventLog], int]:
+ """Get events with filtering and pagination - returns (events, total_count)
+
+ Multi-select filters (category, severity, host_id) accept lists for OR filtering.
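+
+        Example: get_events(category=["container"], severity=["error", "critical"], limit=50)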
+ """
+ with self.get_session() as session:
+ query = session.query(EventLog)
+
+ # Apply filters - use IN clause for lists
+ if category:
+ if isinstance(category, list) and category:
+ query = query.filter(EventLog.category.in_(category))
+ elif isinstance(category, str):
+ query = query.filter(EventLog.category == category)
+ if event_type:
+ query = query.filter(EventLog.event_type == event_type)
+ if severity:
+ if isinstance(severity, list) and severity:
+ query = query.filter(EventLog.severity.in_(severity))
+ elif isinstance(severity, str):
+ query = query.filter(EventLog.severity == severity)
+ if host_id:
+ if isinstance(host_id, list) and host_id:
+ query = query.filter(EventLog.host_id.in_(host_id))
+ elif isinstance(host_id, str):
+ query = query.filter(EventLog.host_id == host_id)
+ if container_id:
+ query = query.filter(EventLog.container_id == container_id)
+ if container_name:
+ query = query.filter(EventLog.container_name.like(f'%{container_name}%'))
+ if start_date:
+ query = query.filter(EventLog.timestamp >= start_date)
+ if end_date:
+ query = query.filter(EventLog.timestamp <= end_date)
+ if correlation_id:
+ query = query.filter(EventLog.correlation_id == correlation_id)
+ if search:
+ search_term = f'%{search}%'
+ query = query.filter(
+ (EventLog.title.like(search_term)) |
+ (EventLog.message.like(search_term)) |
+ (EventLog.container_name.like(search_term))
+ )
+
+ # Get total count for pagination
+ total_count = query.count()
+
+ # Apply ordering, limit and offset
+ events = query.order_by(EventLog.timestamp.desc()).offset(offset).limit(limit).all()
+
+ return events, total_count
+
+ def get_event_by_id(self, event_id: int) -> Optional[EventLog]:
+ """Get a specific event by ID"""
+ with self.get_session() as session:
+ return session.query(EventLog).filter(EventLog.id == event_id).first()
+
+ def get_events_by_correlation(self, correlation_id: str) -> List[EventLog]:
+ """Get all events with the same correlation ID"""
+ with self.get_session() as session:
+ return session.query(EventLog).filter(
+ EventLog.correlation_id == correlation_id
+ ).order_by(EventLog.timestamp.asc()).all()
+
+ def cleanup_old_events(self, days: int = 30):
+ """Clean up old event logs"""
+ with self.get_session() as session:
+ cutoff_date = datetime.now() - timedelta(days=days)
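+            # Bulk delete via Query.delete() returns the number of rows removed
+            # without loading them into the session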
+ deleted_count = session.query(EventLog).filter(
+ EventLog.timestamp < cutoff_date
+ ).delete()
+ session.commit()
+ return deleted_count
+
+ def get_event_statistics(self,
+ start_date: Optional[datetime] = None,
+ end_date: Optional[datetime] = None) -> Dict[str, Any]:
+ """Get event statistics for dashboard"""
+ with self.get_session() as session:
+ query = session.query(EventLog)
+
+ if start_date:
+ query = query.filter(EventLog.timestamp >= start_date)
+ if end_date:
+ query = query.filter(EventLog.timestamp <= end_date)
+
+ total_events = query.count()
+
+            # Count by category
+            from sqlalchemy import func
+            category_counts = {}
+            for category, count in session.query(
+                    EventLog.category, func.count(EventLog.id)
+            ).group_by(EventLog.category).all():
+                category_counts[category] = count
+
+            # Count by severity
+            severity_counts = {}
+            for severity, count in session.query(
+                    EventLog.severity, func.count(EventLog.id)
+            ).group_by(EventLog.severity).all():
+                severity_counts[severity] = count
+
+ return {
+ 'total_events': total_events,
+ 'category_counts': category_counts,
+ 'severity_counts': severity_counts,
+ 'period_start': start_date.isoformat() if start_date else None,
+ 'period_end': end_date.isoformat() if end_date else None
+ }
+
+
+ # User management methods
+ def _hash_password(self, password: str) -> str:
+ """Hash a password using bcrypt with salt"""
+ # Generate salt and hash password
+ salt = bcrypt.gensalt(rounds=12) # 12 rounds is a good balance of security/speed
+ hashed = bcrypt.hashpw(password.encode('utf-8'), salt)
+ return hashed.decode('utf-8')
+
+ def _verify_password(self, password: str, hashed: str) -> bool:
+ """Verify a password against a bcrypt hash"""
+ return bcrypt.checkpw(password.encode('utf-8'), hashed.encode('utf-8'))
+
+ def get_or_create_default_user(self) -> None:
+ """Create default admin user if no users exist"""
+ with self.get_session() as session:
+ # Check if ANY user exists (not just 'admin')
+ user_count = session.query(User).count()
+ if user_count == 0:
+ # Only create default admin user if no users exist at all
+ user = User(
+ username="admin",
+ password_hash=self._hash_password("dockmon123"), # Default password
+ is_first_login=True,
+ must_change_password=True
+ )
+ session.add(user)
+ session.commit()
+ logger.info("Created default admin user")
+
+ def verify_user_credentials(self, username: str, password: str) -> Optional[Dict[str, Any]]:
+ """Verify user credentials and return user info if valid"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+
+ # Prevent timing attack: always run bcrypt even if user doesn't exist
+ if user:
+ is_valid = self._verify_password(password, user.password_hash)
+ else:
+ # Run dummy bcrypt to maintain constant time
+ dummy_hash = "$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewY5GyYFj.N/wx9S"
+ self._verify_password(password, dummy_hash)
+ is_valid = False
+
+ if user and is_valid:
+ # Update last login
+ user.last_login = datetime.now()
+ session.commit()
+ return {
+ "username": user.username,
+ "is_first_login": user.is_first_login,
+ "must_change_password": user.must_change_password
+ }
+ return None
+
+ def change_user_password(self, username: str, new_password: str) -> bool:
+ """Change user password"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ user.password_hash = self._hash_password(new_password)
+ user.is_first_login = False
+ user.must_change_password = False
+ user.updated_at = datetime.now()
+ session.commit()
+ logger.info(f"Password changed for user: {username}")
+ return True
+ return False
+
+ def username_exists(self, username: str) -> bool:
+ """Check if username already exists"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ return user is not None
+
+ def change_username(self, old_username: str, new_username: str) -> bool:
+ """Change user's username"""
+ with self.get_session() as session:
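+            # Uniqueness is not enforced here; callers should check username_exists()
+            # first to avoid colliding with an existing username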
+ user = session.query(User).filter(User.username == old_username).first()
+ if user:
+ user.username = new_username
+ user.updated_at = datetime.now()
+ session.commit()
+ logger.info(f"Username changed from {old_username} to {new_username}")
+ return True
+ return False
+
+    def reset_user_password(self, username: str, new_password: Optional[str] = None) -> Optional[str]:
+ """Reset user password (for CLI tool)"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if not user:
+ return None
+
+ # Generate new password if not provided
+ if not new_password:
+ new_password = secrets.token_urlsafe(12)
+
+ user.password_hash = self._hash_password(new_password)
+ user.must_change_password = True
+ user.updated_at = datetime.now()
+ session.commit()
+ logger.info(f"Password reset for user: {username}")
+ return new_password
+
+ def list_users(self) -> List[str]:
+ """List all usernames"""
+ with self.get_session() as session:
+ users = session.query(User.username).all()
+ return [u[0] for u in users]
+
+ def get_dashboard_layout(self, username: str) -> Optional[str]:
+ """Get dashboard layout for a user"""
+ with self.get_session() as session:
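+            # The layout is stored and returned as an opaque string; any
+            # (de)serialization is left to the caller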
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ return user.dashboard_layout
+ return None
+
+ def save_dashboard_layout(self, username: str, layout: str) -> bool:
+ """Save dashboard layout for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ user.dashboard_layout = layout
+ user.updated_at = datetime.now()
+ session.commit()
+ return True
+ return False
+
+ def get_modal_preferences(self, username: str) -> Optional[str]:
+ """Get modal preferences for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ return user.modal_preferences
+ return None
+
+ def save_modal_preferences(self, username: str, preferences: str) -> bool:
+ """Save modal preferences for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ user.modal_preferences = preferences
+ user.updated_at = datetime.now()
+ session.commit()
+ return True
+ return False
+
+ def get_event_sort_order(self, username: str) -> str:
+ """Get event sort order preference for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user and user.event_sort_order:
+ return user.event_sort_order
+ return 'desc' # Default to newest first
+
+ def save_event_sort_order(self, username: str, sort_order: str) -> bool:
+ """Save event sort order preference for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ # Validate sort order
+ if sort_order not in ['asc', 'desc']:
+ return False
+ user.event_sort_order = sort_order
+ user.updated_at = datetime.now()
+ session.commit()
+ return True
+ return False
+
+ def get_container_sort_order(self, username: str) -> str:
+ """Get container sort order preference for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user and user.container_sort_order:
+ return user.container_sort_order
+ return 'name-asc' # Default to name A-Z
+
+ def save_container_sort_order(self, username: str, sort_order: str) -> bool:
+ """Save container sort order preference for a user"""
+ with self.get_session() as session:
+ user = session.query(User).filter(User.username == username).first()
+ if user:
+ # Validate sort order
+ valid_sorts = ['name-asc', 'name-desc', 'status', 'memory-desc', 'memory-asc', 'cpu-desc', 'cpu-asc']
+ if sort_order not in valid_sorts:
+ return False
+ user.container_sort_order = sort_order
+ user.updated_at = datetime.now()
+ session.commit()
+ return True
+ return False
\ No newline at end of file
diff --git a/dockmon/backend/docker_monitor/__init__.py b/dockmon/backend/docker_monitor/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/dockmon/backend/docker_monitor/monitor.py b/dockmon/backend/docker_monitor/monitor.py
new file mode 100644
index 0000000..e5016f3
--- /dev/null
+++ b/dockmon/backend/docker_monitor/monitor.py
@@ -0,0 +1,1554 @@
+"""
+Docker Monitoring Core for DockMon
+Main monitoring class for Docker containers and hosts
+"""
+
+import asyncio
+import logging
+import os
+import shutil
+import time
+import uuid
+from datetime import datetime
+from typing import Dict, List, Optional
+
+import docker
+from docker import DockerClient
+from fastapi import HTTPException
+
+from config.paths import DATABASE_PATH, CERTS_DIR
+from database import DatabaseManager, AutoRestartConfig, GlobalSettings, DockerHostDB
+from models.docker_models import DockerHost, DockerHostConfig, Container
+from models.settings_models import AlertRule, NotificationSettings
+from websocket.connection import ConnectionManager
+from realtime import RealtimeMonitor
+from notifications import NotificationService, AlertProcessor
+from event_logger import EventLogger, EventSeverity, EventType
+from stats_client import get_stats_client
+from docker_monitor.stats_manager import StatsManager
+from auth.session_manager import session_manager
+
+
+logger = logging.getLogger(__name__)
+
+
+def _handle_task_exception(task: asyncio.Task) -> None:
+ """Handle exceptions from fire-and-forget async tasks"""
+ try:
+ task.result()
+ except asyncio.CancelledError:
+ pass # Task was cancelled, this is normal
+ except Exception as e:
+ logger.error(f"Unhandled exception in background task: {e}", exc_info=True)
+
+
+def sanitize_host_id(host_id: str) -> str:
+ """
+ Sanitize host ID to prevent path traversal attacks.
+ Only allows valid UUID format or alphanumeric + dash characters.
+ """
+ if not host_id:
+ raise ValueError("Host ID cannot be empty")
+
+ # Check for path traversal attempts
+ if ".." in host_id or "/" in host_id or "\\" in host_id:
+ raise ValueError(f"Invalid host ID format: {host_id}")
+
+ # Try to validate as UUID first
+ try:
+ uuid.UUID(host_id)
+ return host_id
+ except ValueError:
+ # If not a valid UUID, only allow alphanumeric and dashes
+ import re
+ if re.match(r'^[a-zA-Z0-9\-]+$', host_id):
+ return host_id
+ else:
+ raise ValueError(f"Invalid host ID format: {host_id}")
+
+
+class DockerMonitor:
+ """Main monitoring class for Docker containers"""
+
+ def __init__(self):
+ self.hosts: Dict[str, DockerHost] = {}
+ self.clients: Dict[str, DockerClient] = {}
+ self.db = DatabaseManager(DATABASE_PATH) # Initialize database with centralized path
+ self.settings = self.db.get_settings() # Load settings from DB
+ self.notification_settings = NotificationSettings()
+ self.auto_restart_status: Dict[str, bool] = {}
+ self.restart_attempts: Dict[str, int] = {}
+ self.restarting_containers: Dict[str, bool] = {} # Track containers currently being restarted
+ self.monitoring_task: Optional[asyncio.Task] = None
+
+ # Reconnection tracking with exponential backoff
+ self.reconnect_attempts: Dict[str, int] = {} # Track reconnect attempts per host
+ self.last_reconnect_attempt: Dict[str, float] = {} # Track last attempt time per host
+ self.manager = ConnectionManager()
+ self.realtime = RealtimeMonitor() # Real-time monitoring
+ self.event_logger = EventLogger(self.db, self.manager) # Event logging service with WebSocket support
+ self.notification_service = NotificationService(self.db, self.event_logger) # Notification service
+ self.alert_processor = AlertProcessor(self.notification_service) # Alert processor
+ self._container_states: Dict[str, str] = {} # Track container states for change detection
+ self._recent_user_actions: Dict[str, float] = {} # Track recent user actions: {container_key: timestamp}
+ self.cleanup_task: Optional[asyncio.Task] = None # Background cleanup task
+
+ # Locks for shared data structures to prevent race conditions
+ self._state_lock = asyncio.Lock()
+ self._actions_lock = asyncio.Lock()
+ self._restart_lock = asyncio.Lock()
+
+ # Stats collection manager
+ self.stats_manager = StatsManager()
+ self._load_persistent_config() # Load saved hosts and configs
+
+ def add_host(self, config: DockerHostConfig, existing_id: str = None, skip_db_save: bool = False, suppress_event_loop_errors: bool = False) -> DockerHost:
+ """Add a new Docker host to monitor"""
+ client = None # Track client for cleanup on error
+        try:
+            # Resolve and sanitize the host ID up front so the certificate storage
+            # directory and the host record share the same identifier
+            try:
+                host_id = sanitize_host_id(existing_id or str(uuid.uuid4()))
+            except ValueError as e:
+                logger.error(f"Invalid host ID: {e}")
+                raise HTTPException(status_code=400, detail=str(e))
+
+            # Validate certificates if provided (before trying to use them)
+            if config.tls_cert or config.tls_key or config.tls_ca:
+                self._validate_certificates(config)
+
+ # Create Docker client
+ if config.url.startswith("unix://"):
+ client = docker.DockerClient(base_url=config.url)
+ else:
+ # For TCP connections
+ tls_config = None
+ if config.tls_cert and config.tls_key:
+                    # Create persistent certificate storage directory keyed by the host ID
+                    cert_dir = os.path.join(CERTS_DIR, host_id)
+
+ # Create with secure permissions - handle TOCTOU race condition
+ try:
+ os.makedirs(cert_dir, mode=0o700, exist_ok=False)
+ except FileExistsError:
+ # Verify it's actually a directory and not a symlink/file
+ import stat
+ st = os.lstat(cert_dir) # Use lstat to not follow symlinks
+ if not stat.S_ISDIR(st.st_mode):
+ raise ValueError("Certificate path exists but is not a directory")
+
+ # Write certificate files
+ cert_file = os.path.join(cert_dir, 'client-cert.pem')
+ key_file = os.path.join(cert_dir, 'client-key.pem')
+ ca_file = os.path.join(cert_dir, 'ca.pem') if config.tls_ca else None
+
+ with open(cert_file, 'w') as f:
+ f.write(config.tls_cert)
+ with open(key_file, 'w') as f:
+ f.write(config.tls_key)
+ if ca_file and config.tls_ca:
+ with open(ca_file, 'w') as f:
+ f.write(config.tls_ca)
+
+ # Set secure permissions
+ os.chmod(cert_file, 0o600)
+ os.chmod(key_file, 0o600)
+ if ca_file:
+ os.chmod(ca_file, 0o600)
+
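+                    # Server certificate verification is enabled only when a CA cert was
+                    # supplied; otherwise the client cert is still presented but the
+                    # server's identity is not verified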
+ tls_config = docker.tls.TLSConfig(
+ client_cert=(cert_file, key_file),
+ ca_cert=ca_file,
+ verify=bool(config.tls_ca)
+ )
+
+ client = docker.DockerClient(
+ base_url=config.url,
+ tls=tls_config,
+ timeout=self.settings.connection_timeout
+ )
+
+ # Test connection
+ client.ping()
+
+ # Validate TLS configuration for TCP connections
+ security_status = self._validate_host_security(config)
+
+            # Create host object using the ID resolved above (an existing ID is
+            # preserved so hosts keep their identity across restarts)
+
+ host = DockerHost(
+ id=host_id,
+ name=config.name,
+ url=config.url,
+ status="online",
+ security_status=security_status
+ )
+
+ # Store client and host
+ self.clients[host.id] = client
+ self.hosts[host.id] = host
+
+ # Save to database only if not reconnecting to an existing host
+ if not skip_db_save:
+ db_host = self.db.add_host({
+ 'id': host.id,
+ 'name': config.name,
+ 'url': config.url,
+ 'tls_cert': config.tls_cert,
+ 'tls_key': config.tls_key,
+ 'tls_ca': config.tls_ca,
+ 'security_status': security_status
+ })
+
+ # Register host with stats and event services
+ # Only register if we're adding a NEW host (not during startup/reconnect)
+ # During startup, monitor_containers() handles all registrations
+ if not skip_db_save: # New host being added by user
+ try:
+ import asyncio
+ stats_client = get_stats_client()
+
+ async def register_host():
+ try:
+ await stats_client.add_docker_host(host.id, host.url, config.tls_ca, config.tls_cert, config.tls_key)
+ logger.info(f"Registered {host.name} ({host.id[:8]}) with stats service")
+
+ await stats_client.add_event_host(host.id, host.url, config.tls_ca, config.tls_cert, config.tls_key)
+ logger.info(f"Registered {host.name} ({host.id[:8]}) with event service")
+ except Exception as e:
+ logger.error(f"Failed to register {host.name} with Go services: {e}")
+
+ # Try to create task if event loop is running
+ try:
+ task = asyncio.create_task(register_host())
+ task.add_done_callback(_handle_task_exception)
+ except RuntimeError:
+ # No event loop running - will be registered by monitor_containers()
+ logger.debug(f"No event loop yet - {host.name} will be registered when monitoring starts")
+ except Exception as e:
+ logger.warning(f"Could not register {host.name} with Go services: {e}")
+
+ # Log host connection
+ self.event_logger.log_host_connection(
+ host_name=host.name,
+ host_id=host.id,
+ host_url=config.url,
+ connected=True
+ )
+
+ # Log host added (only for new hosts, not reconnects)
+ if not skip_db_save:
+ self.event_logger.log_host_added(
+ host_name=host.name,
+ host_id=host.id,
+ host_url=config.url,
+ triggered_by="user"
+ )
+
+ logger.info(f"Added Docker host: {host.name} ({host.url})")
+ return host
+
+ except Exception as e:
+ # Clean up client if it was created but not stored
+ if client is not None:
+ try:
+ client.close()
+ logger.debug(f"Closed orphaned Docker client for {config.name}")
+ except Exception as close_error:
+ logger.debug(f"Error closing Docker client: {close_error}")
+
+ # Suppress event loop errors during first run startup
+ if suppress_event_loop_errors and "no running event loop" in str(e):
+ logger.debug(f"Event loop warning for {config.name} (expected during startup): {e}")
+ # Re-raise so the caller knows host was added but with event loop issue
+ raise
+ else:
+ logger.error(f"Failed to add host {config.name}: {e}")
+ error_msg = self._get_user_friendly_error(str(e))
+ raise HTTPException(status_code=400, detail=error_msg)
+
+ def _get_user_friendly_error(self, error: str) -> str:
+ """Convert technical Docker errors to user-friendly messages"""
+ error_lower = error.lower()
+
+ # SSL/TLS certificate errors
+ if 'ssl' in error_lower or 'tls' in error_lower:
+ if 'pem lib' in error_lower or 'pem' in error_lower:
+ return (
+ "SSL certificate error: The certificates provided appear to be invalid or don't match. "
+ "Please verify:\n"
+ "• The certificates are for the correct server (check hostname/IP)\n"
+ "• The client certificate and private key are a matching pair\n"
+ "• The CA certificate matches the server's certificate\n"
+ "• The certificates haven't expired"
+ )
+ elif 'certificate verify failed' in error_lower:
+ return (
+ "SSL certificate verification failed: The server's certificate is not trusted by the CA certificate you provided. "
+ "Make sure you're using the correct CA certificate that signed the server's certificate."
+ )
+ elif 'ssleof' in error_lower or 'connection reset' in error_lower:
+ return (
+ "SSL connection failed: The server closed the connection during SSL handshake. "
+ "This usually means the server doesn't recognize the certificates. "
+ "Verify you're using the correct certificates for this server."
+ )
+ else:
+ return f"SSL/TLS error: Unable to establish secure connection. {error}"
+
+ # Connection errors
+ elif 'connection refused' in error_lower:
+ return (
+ "Connection refused: The Docker daemon is not accepting connections on this address. "
+ "Make sure:\n"
+ "• Docker is running on the remote host\n"
+ "• The Docker daemon is configured to listen on the specified port\n"
+ "• Firewall allows connections to the port"
+ )
+ elif 'timeout' in error_lower or 'timed out' in error_lower:
+ return (
+ "Connection timeout: Unable to reach the Docker daemon. "
+ "Check that the host address is correct and the host is reachable on your network."
+ )
+ elif 'no route to host' in error_lower or 'network unreachable' in error_lower:
+ return (
+ "Network unreachable: Cannot reach the specified host. "
+ "Verify the IP address/hostname is correct and the host is on your network."
+ )
+ elif 'http request to an https server' in error_lower:
+ return (
+ "Protocol mismatch: You're trying to connect without TLS to a server that requires TLS. "
+ "The server expects HTTPS connections. Please provide TLS certificates or change the server configuration."
+ )
+
+ # Return original error if we don't have a friendly version
+ return error
+
+ def _validate_certificates(self, config: DockerHostConfig):
+ """Validate certificate format before attempting to use them"""
+
+ def check_cert_format(cert_data: str, cert_type: str):
+ """Check if certificate has proper PEM format markers"""
+ if not cert_data or not cert_data.strip():
+ raise HTTPException(
+ status_code=400,
+ detail=f"{cert_type} is empty. Please paste the certificate content."
+ )
+
+ cert_data = cert_data.strip()
+
+ # Check for BEGIN marker
+ if "-----BEGIN" not in cert_data:
+ raise HTTPException(
+ status_code=400,
+ detail=f"{cert_type} is missing the '-----BEGIN' header. Make sure you copied the complete certificate including the BEGIN line."
+ )
+
+ # Check for END marker
+ if "-----END" not in cert_data:
+ raise HTTPException(
+ status_code=400,
+ detail=f"{cert_type} is missing the '-----END' footer. Make sure you copied the complete certificate including the END line."
+ )
+
+ # Check BEGIN comes before END
+ begin_pos = cert_data.find("-----BEGIN")
+ end_pos = cert_data.find("-----END")
+ if begin_pos >= end_pos:
+ raise HTTPException(
+ status_code=400,
+ detail=f"{cert_type} format is invalid. The '-----BEGIN' line should come before the '-----END' line."
+ )
+
+ # Check for certificate data between markers
+ cert_content = cert_data[begin_pos:end_pos + 50] # Include END marker
+ lines = cert_content.split('\n')
+ if len(lines) < 3: # Should have BEGIN, at least one data line, and END
+ raise HTTPException(
+ status_code=400,
+ detail=f"{cert_type} appears to be incomplete. Make sure you copied all lines between BEGIN and END."
+ )
+
+ # Validate each certificate type
+ if config.tls_ca:
+ check_cert_format(config.tls_ca, "CA Certificate")
+
+ if config.tls_cert:
+ check_cert_format(config.tls_cert, "Client Certificate")
+
+ if config.tls_key:
+ # Private keys can be PRIVATE KEY or RSA PRIVATE KEY
+ key_data = config.tls_key.strip()
+ if "-----BEGIN" not in key_data or "-----END" not in key_data:
+ raise HTTPException(
+ status_code=400,
+ detail="Client Private Key is incomplete. Make sure you copied the complete key including both '-----BEGIN' and '-----END' lines."
+ )
+
+ def _validate_host_security(self, config: DockerHostConfig) -> str:
+ """Validate the security configuration of a Docker host"""
+ if config.url.startswith("unix://"):
+ return "secure" # Unix sockets are secure (local only)
+ elif config.url.startswith("tcp://"):
+ if config.tls_cert and config.tls_key and config.tls_ca:
+ return "secure" # Has TLS certificates
+ else:
+ logger.warning(f"Host {config.name} configured without TLS - connection is insecure!")
+ return "insecure" # TCP without TLS
+ else:
+ return "unknown" # Unknown protocol
+
+ def _cleanup_host_certificates(self, host_id: str):
+ """Clean up certificate files for a host"""
+ safe_id = sanitize_host_id(host_id)
+ cert_dir = os.path.join(CERTS_DIR, safe_id)
+
+ # Defense in depth: verify path is within CERTS_DIR
+ abs_cert_dir = os.path.abspath(cert_dir)
+ abs_certs_dir = os.path.abspath(CERTS_DIR)
+ if not abs_cert_dir.startswith(abs_certs_dir):
+ logger.error(f"Path traversal attempt detected: {host_id}")
+ raise ValueError("Invalid certificate path")
+
+ if os.path.exists(cert_dir):
+ try:
+ shutil.rmtree(cert_dir)
+ logger.info(f"Cleaned up certificate files for host {host_id}")
+ except Exception as e:
+ logger.warning(f"Failed to clean up certificates for host {host_id}: {e}")
+
+ async def remove_host(self, host_id: str):
+ """Remove a Docker host"""
+ # Validate host_id to prevent path traversal
+ try:
+ host_id = sanitize_host_id(host_id)
+ except ValueError as e:
+ logger.error(f"Invalid host ID: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+ if host_id in self.hosts:
+ # Get host info before removing
+ host = self.hosts[host_id]
+ host_name = host.name
+
+ del self.hosts[host_id]
+ if host_id in self.clients:
+ self.clients[host_id].close()
+ del self.clients[host_id]
+
+ # Remove from Go stats and event services (await to ensure cleanup completes before returning)
+ try:
+ stats_client = get_stats_client()
+
+ try:
+ # Remove from stats service (closes Docker client and stops all container streams)
+ await stats_client.remove_docker_host(host_id)
+ logger.info(f"Removed {host_name} ({host_id[:8]}) from stats service")
+
+ # Remove from event service
+ await stats_client.remove_event_host(host_id)
+ logger.info(f"Removed {host_name} ({host_id[:8]}) from event service")
+ except asyncio.TimeoutError:
+ # Timeout during cleanup is expected - Go service closes connections immediately
+ logger.debug(f"Timeout removing {host_name} from Go services (expected during cleanup)")
+ except Exception as e:
+ logger.error(f"Failed to remove {host_name} from Go services: {e}")
+ except Exception as e:
+ logger.warning(f"Failed to remove host {host_id} from Go services: {e}")
+
+ # Clean up certificate files
+ self._cleanup_host_certificates(host_id)
+ # Remove from database
+ self.db.delete_host(host_id)
+
+ # Clean up container state tracking for this host
+ async with self._state_lock:
+ containers_to_remove = [key for key in self._container_states.keys() if key.startswith(f"{host_id}:")]
+ for container_key in containers_to_remove:
+ del self._container_states[container_key]
+
+ # Clean up recent user actions for this host
+ async with self._actions_lock:
+ actions_to_remove = [key for key in self._recent_user_actions.keys() if key.startswith(f"{host_id}:")]
+ for container_key in actions_to_remove:
+ del self._recent_user_actions[container_key]
+
+ # Clean up notification service's container state tracking for this host
+ notification_states_to_remove = [key for key in self.notification_service._last_container_state.keys() if key.startswith(f"{host_id}:")]
+ for container_key in notification_states_to_remove:
+ del self.notification_service._last_container_state[container_key]
+
+ # Clean up alert processor's container state tracking for this host
+ alert_processor_states_to_remove = [key for key in self.alert_processor._container_states.keys() if key.startswith(f"{host_id}:")]
+ for container_key in alert_processor_states_to_remove:
+ del self.alert_processor._container_states[container_key]
+
+ # Clean up notification service's alert cooldown tracking for this host
+ alert_cooldowns_to_remove = [key for key in self.notification_service._last_alerts.keys() if key.startswith(f"{host_id}:")]
+ for container_key in alert_cooldowns_to_remove:
+ del self.notification_service._last_alerts[container_key]
+
+ # Clean up reconnection tracking for this host
+ if host_id in self.reconnect_attempts:
+ del self.reconnect_attempts[host_id]
+ if host_id in self.last_reconnect_attempt:
+ del self.last_reconnect_attempt[host_id]
+
+ # Clean up auto-restart tracking for this host
+ async with self._restart_lock:
+ auto_restart_to_remove = [key for key in self.auto_restart_status.keys() if key.startswith(f"{host_id}:")]
+ for container_key in auto_restart_to_remove:
+ del self.auto_restart_status[container_key]
+ if container_key in self.restart_attempts:
+ del self.restart_attempts[container_key]
+ if container_key in self.restarting_containers:
+ del self.restarting_containers[container_key]
+
+ # Clean up stats manager's streaming containers for this host
+ # Remove using the full composite key (format: "host_id:container_id")
+ for container_key in containers_to_remove:
+ self.stats_manager.streaming_containers.discard(container_key)
+
+ if containers_to_remove:
+ logger.debug(f"Cleaned up {len(containers_to_remove)} container state entries for removed host {host_id[:8]}")
+ if notification_states_to_remove:
+ logger.debug(f"Cleaned up {len(notification_states_to_remove)} notification state entries for removed host {host_id[:8]}")
+ if alert_processor_states_to_remove:
+ logger.debug(f"Cleaned up {len(alert_processor_states_to_remove)} alert processor state entries for removed host {host_id[:8]}")
+ if alert_cooldowns_to_remove:
+ logger.debug(f"Cleaned up {len(alert_cooldowns_to_remove)} alert cooldown entries for removed host {host_id[:8]}")
+ if auto_restart_to_remove:
+ logger.debug(f"Cleaned up {len(auto_restart_to_remove)} auto-restart entries for removed host {host_id[:8]}")
+
+ # Log host removed
+ self.event_logger.log_host_removed(
+ host_name=host_name,
+ host_id=host_id,
+ triggered_by="user"
+ )
+
+ logger.info(f"Removed host {host_id}")
+
+ def update_host(self, host_id: str, config: DockerHostConfig):
+ """Update an existing Docker host"""
+ # Validate host_id to prevent path traversal
+ try:
+ host_id = sanitize_host_id(host_id)
+ except ValueError as e:
+ logger.error(f"Invalid host ID: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+ client = None # Track client for cleanup on error
+ try:
+ # Get existing host from database to check if we need to preserve certificates
+ existing_host = self.db.get_host(host_id)
+ if not existing_host:
+ raise HTTPException(status_code=404, detail=f"Host {host_id} not found")
+
+ # If certificates are not provided in the update, use existing ones
+ # This allows updating just the name without providing certificates again
+ if not config.tls_cert and existing_host.tls_cert:
+ config.tls_cert = existing_host.tls_cert
+ if not config.tls_key and existing_host.tls_key:
+ config.tls_key = existing_host.tls_key
+ if not config.tls_ca and existing_host.tls_ca:
+ config.tls_ca = existing_host.tls_ca
+
+ # Only validate certificates if NEW ones are provided (not using existing)
+ # Check if any NEW certificate data was actually sent in the request
+ if (config.tls_cert and config.tls_cert != existing_host.tls_cert) or \
+ (config.tls_key and config.tls_key != existing_host.tls_key) or \
+ (config.tls_ca and config.tls_ca != existing_host.tls_ca):
+ self._validate_certificates(config)
+
+ # Remove the existing host from memory first
+ if host_id in self.hosts:
+ # Close existing client first (this should stop the monitoring task)
+ if host_id in self.clients:
+ logger.info(f"Closing Docker client for host {host_id}")
+ self.clients[host_id].close()
+ del self.clients[host_id]
+
+ # Remove from memory
+ del self.hosts[host_id]
+
+ # Validate TLS configuration
+ security_status = self._validate_host_security(config)
+
+ # Update database
+ updated_db_host = self.db.update_host(host_id, {
+ 'name': config.name,
+ 'url': config.url,
+ 'tls_cert': config.tls_cert,
+ 'tls_key': config.tls_key,
+ 'tls_ca': config.tls_ca,
+ 'security_status': security_status
+ })
+
+ if not updated_db_host:
+ raise Exception(f"Host {host_id} not found in database")
+
+ # Create new Docker client with updated config
+ if config.url.startswith("unix://"):
+ client = docker.DockerClient(base_url=config.url)
+ else:
+ # For TCP connections
+ tls_config = None
+ if config.tls_cert and config.tls_key:
+ # Create persistent certificate storage directory
+ safe_id = sanitize_host_id(host_id)
+ cert_dir = os.path.join(CERTS_DIR, safe_id)
+ # Create with secure permissions to avoid TOCTOU race condition
+ os.makedirs(cert_dir, mode=0o700, exist_ok=True)
+
+ # Write certificate files
+ cert_file = os.path.join(cert_dir, 'client-cert.pem')
+ key_file = os.path.join(cert_dir, 'client-key.pem')
+ ca_file = os.path.join(cert_dir, 'ca.pem') if config.tls_ca else None
+
+ with open(cert_file, 'w') as f:
+ f.write(config.tls_cert)
+ with open(key_file, 'w') as f:
+ f.write(config.tls_key)
+ if ca_file and config.tls_ca:
+ with open(ca_file, 'w') as f:
+ f.write(config.tls_ca)
+
+ # Set secure permissions
+ os.chmod(cert_file, 0o600)
+ os.chmod(key_file, 0o600)
+ if ca_file:
+ os.chmod(ca_file, 0o600)
+
+ tls_config = docker.tls.TLSConfig(
+ client_cert=(cert_file, key_file),
+ ca_cert=ca_file,
+ verify=bool(config.tls_ca)
+ )
+
+ client = docker.DockerClient(
+ base_url=config.url,
+ tls=tls_config,
+ timeout=self.settings.connection_timeout
+ )
+
+ # Test connection
+ client.ping()
+
+ # Create host object with existing ID
+ host = DockerHost(
+ id=host_id,
+ name=config.name,
+ url=config.url,
+ status="online",
+ security_status=security_status
+ )
+
+ # Store client and host
+ self.clients[host.id] = client
+ self.hosts[host.id] = host
+
+ # Re-register host with stats and event services (in case URL changed)
+ # Note: add_docker_host() automatically closes old client if it exists
+ try:
+ import asyncio
+ stats_client = get_stats_client()
+
+ async def reregister_host():
+ try:
+ # Re-register with stats service (automatically closes old client)
+ await stats_client.add_docker_host(host.id, host.url, config.tls_ca, config.tls_cert, config.tls_key)
+ logger.info(f"Re-registered {host.name} ({host.id[:8]}) with stats service")
+
+ # Remove and re-add event monitoring
+ await stats_client.remove_event_host(host.id)
+ await stats_client.add_event_host(host.id, host.url, config.tls_ca, config.tls_cert, config.tls_key)
+ logger.info(f"Re-registered {host.name} ({host.id[:8]}) with event service")
+ except Exception as e:
+ logger.error(f"Failed to re-register {host.name} with Go services: {e}")
+
+ # Create task to re-register (fire and forget)
+ task = asyncio.create_task(reregister_host())
+ task.add_done_callback(_handle_task_exception)
+ except Exception as e:
+ logger.warning(f"Could not re-register {host.name} with Go services: {e}")
+
+ # Log host update
+ self.event_logger.log_host_connection(
+ host_name=host.name,
+ host_id=host.id,
+ host_url=config.url,
+ connected=True
+ )
+
+ logger.info(f"Successfully updated host {host_id}: {host.name} ({host.url})")
+ return host
+
+ except Exception as e:
+ # Clean up client if it was created but not stored
+ if client and host_id not in self.clients:
+ try:
+ client.close()
+ logger.debug(f"Closed orphaned Docker client for host {host_id[:8]}")
+ except Exception as close_error:
+ logger.debug(f"Error closing Docker client: {close_error}")
+
+ logger.error(f"Failed to update host {host_id}: {e}")
+ error_msg = self._get_user_friendly_error(str(e))
+ raise HTTPException(status_code=400, detail=error_msg)
+
+ async def get_containers(self, host_id: Optional[str] = None) -> List[Container]:
+ """Get containers from one or all hosts"""
+ containers = []
+
+ hosts_to_check = [host_id] if host_id else list(self.hosts.keys())
+
+ for hid in hosts_to_check:
+ host = self.hosts.get(hid)
+ if not host:
+ continue
+
+ # Try to reconnect if host exists but has no client (offline)
+ if hid not in self.clients:
+ # Exponential backoff: 5s, 10s, 20s, 40s, 80s, max 5 minutes
+ now = time.time()
+ attempts = self.reconnect_attempts.get(hid, 0)
+ last_attempt = self.last_reconnect_attempt.get(hid, 0)
+ backoff_seconds = min(5 * (2 ** attempts), 300)
+
+ # Skip reconnection if we're in backoff period
+ if now - last_attempt < backoff_seconds:
+ time_remaining = backoff_seconds - (now - last_attempt)
+ logger.debug(f"Skipping reconnection for {host.name} - backoff active (attempt {attempts}, {time_remaining:.1f}s remaining)")
+ host.status = "offline"
+ continue
+
+ # Record this reconnection attempt
+ self.last_reconnect_attempt[hid] = now
+ logger.info(f"Attempting to reconnect to offline host {host.name} (attempt {attempts + 1})")
+
+ # Attempt to reconnect offline hosts
+ try:
+ # Fetch TLS certs from database for reconnection
+ with self.db.get_session() as session:
+ db_host = session.query(DockerHostDB).filter_by(id=hid).first()
+
+ if host.url.startswith("unix://"):
+ client = docker.DockerClient(base_url=host.url)
+ elif db_host and db_host.tls_cert and db_host.tls_key and db_host.tls_ca:
+ # Reconnect with TLS using certs from database
+ logger.debug(f"Reconnecting to {host.name} with TLS")
+
+ # Write certs to temporary files for TLS config
+ cert_dir = os.path.join(CERTS_DIR, hid)
+ os.makedirs(cert_dir, exist_ok=True)
+
+ cert_file = os.path.join(cert_dir, 'cert.pem')
+ key_file = os.path.join(cert_dir, 'key.pem')
+ ca_file = os.path.join(cert_dir, 'ca.pem') if db_host.tls_ca else None
+
+ with open(cert_file, 'w') as f:
+ f.write(db_host.tls_cert)
+ with open(key_file, 'w') as f:
+ f.write(db_host.tls_key)
+ if ca_file:
+ with open(ca_file, 'w') as f:
+ f.write(db_host.tls_ca)
+
+ # Set secure permissions
+ os.chmod(cert_file, 0o600)
+ os.chmod(key_file, 0o600)
+ if ca_file:
+ os.chmod(ca_file, 0o600)
+
+ tls_config = docker.tls.TLSConfig(
+ client_cert=(cert_file, key_file),
+ ca_cert=ca_file,
+ verify=bool(db_host.tls_ca)
+ )
+
+ client = docker.DockerClient(
+ base_url=host.url,
+ tls=tls_config,
+ timeout=self.settings.connection_timeout
+ )
+ else:
+ # Reconnect without TLS
+ client = docker.DockerClient(
+ base_url=host.url,
+ timeout=self.settings.connection_timeout
+ )
+
+ # Test the connection
+ client.ping()
+ # Connection successful - add to clients
+ self.clients[hid] = client
+
+ # Reset reconnection attempts on success
+ self.reconnect_attempts[hid] = 0
+ logger.info(f"Reconnected to offline host: {host.name}")
+
+ # Re-register with stats and events service
+ try:
+ stats_client = get_stats_client()
+ tls_ca = db_host.tls_ca if db_host else None
+ tls_cert = db_host.tls_cert if db_host else None
+ tls_key = db_host.tls_key if db_host else None
+
+ await stats_client.add_docker_host(hid, host.url, tls_ca, tls_cert, tls_key)
+ await stats_client.add_event_host(hid, host.url, tls_ca, tls_cert, tls_key)
+ logger.info(f"Re-registered {host.name} ({hid[:8]}) with stats/events service after reconnection")
+ except Exception as e:
+ logger.warning(f"Failed to re-register {host.name} with Go services after reconnection: {e}")
+
+ except Exception as e:
+ # Increment reconnection attempts on failure
+ self.reconnect_attempts[hid] = attempts + 1
+
+ # Still offline - update status and continue
+ host.status = "offline"
+ host.error = f"Connection failed: {str(e)}"
+ host.last_checked = datetime.now()
+
+ # Log with backoff info to help debugging
+ next_attempt_in = min(5 * (2 ** (attempts + 1)), 300)
+ logger.debug(f"Host {host.name} still offline (attempt {attempts + 1}). Next retry in {next_attempt_in}s")
+ continue
+
+ client = self.clients[hid]
+
+ try:
+ docker_containers = client.containers.list(all=True)
+ host.status = "online"
+ host.container_count = len(docker_containers)
+ host.error = None
+
+ for dc in docker_containers:
+ try:
+ container_id = dc.id[:12]
+
+ # Try to get image info, but handle missing images gracefully
+ # Access dc.image first to trigger any errors before accessing its properties
+ try:
+ container_image = dc.image
+ image_name = container_image.tags[0] if container_image.tags else container_image.short_id
+ except Exception as img_error:
+ # Image may have been deleted - use image ID from container attrs
+ # This is common when containers reference deleted images
+ image_name = dc.attrs.get('Config', {}).get('Image', 'unknown')
+ if image_name == 'unknown':
+ # Try to get from ImageID in attrs
+ image_id = dc.attrs.get('Image', '')
+ if image_id.startswith('sha256:'):
+ image_name = image_id[:19] # sha256: + first 12 chars
+ else:
+ image_name = image_id[:12] if image_id else 'unknown'
+
+ container = Container(
+ id=dc.id,
+ short_id=container_id,
+ name=dc.name,
+ state=dc.status,
+ status=dc.attrs['State']['Status'],
+ host_id=hid,
+ host_name=host.name,
+ image=image_name,
+ created=dc.attrs['Created'],
+ auto_restart=self._get_auto_restart_status(hid, container_id),
+ restart_attempts=self.restart_attempts.get(container_id, 0)
+ )
+ containers.append(container)
+ except Exception as container_error:
+ # Log but don't fail the whole host for one bad container
+ logger.warning(f"Skipping container {dc.name if hasattr(dc, 'name') else 'unknown'} on {host.name} due to error: {container_error}")
+ continue
+
+ except Exception as e:
+ logger.error(f"Error getting containers from {host.name}: {e}")
+ host.status = "offline"
+ host.error = str(e)
+
+ host.last_checked = datetime.now()
+
+ # Fetch stats from Go stats service and populate container stats
+ try:
+ from stats_client import get_stats_client
+ stats_client = get_stats_client()
+ container_stats = await stats_client.get_container_stats()
+
+ # Populate stats for each container using composite key (host_id:container_id)
+ for container in containers:
+ # Use composite key to support containers with duplicate IDs on different hosts
+ composite_key = f"{container.host_id}:{container.id}"
+ stats = container_stats.get(composite_key, {})
+ if stats:
+ container.cpu_percent = stats.get('cpu_percent')
+ container.memory_usage = stats.get('memory_usage')
+ container.memory_limit = stats.get('memory_limit')
+ container.memory_percent = stats.get('memory_percent')
+ container.network_rx = stats.get('network_rx')
+ container.network_tx = stats.get('network_tx')
+ container.disk_read = stats.get('disk_read')
+ container.disk_write = stats.get('disk_write')
+ logger.debug(f"Populated stats for {container.name} ({container.short_id}) on {container.host_name}: CPU {container.cpu_percent}%")
+ except Exception as e:
+ logger.warning(f"Failed to fetch container stats from stats service: {e}")
+
+ return containers
+
+ def restart_container(self, host_id: str, container_id: str) -> bool:
+ """Restart a specific container"""
+ if host_id not in self.clients:
+ raise HTTPException(status_code=404, detail="Host not found")
+
+ host = self.hosts.get(host_id)
+        host_name = host.name if host else 'Unknown Host'
+        container_name = container_id  # Fallback for error logging if the lookup below fails
+
+ start_time = time.time()
+ try:
+ client = self.clients[host_id]
+ container = client.containers.get(container_id)
+ container_name = container.name
+
+ container.restart(timeout=10)
+ duration_ms = int((time.time() - start_time) * 1000)
+
+ logger.info(f"Restarted container '{container_name}' on host '{host_name}'")
+
+ # Log the successful restart
+ self.event_logger.log_container_action(
+ action="restart",
+ container_name=container_name,
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=True,
+ triggered_by="user",
+ duration_ms=duration_ms
+ )
+ return True
+ except Exception as e:
+ duration_ms = int((time.time() - start_time) * 1000)
+ logger.error(f"Failed to restart container '{container_name}' on host '{host_name}': {e}")
+
+ # Log the failed restart
+ self.event_logger.log_container_action(
+ action="restart",
+ container_name=container_id, # Use ID if name unavailable
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=False,
+ triggered_by="user",
+ error_message=str(e),
+ duration_ms=duration_ms
+ )
+ raise HTTPException(status_code=500, detail=str(e))
+
+ def stop_container(self, host_id: str, container_id: str) -> bool:
+ """Stop a specific container"""
+ if host_id not in self.clients:
+ raise HTTPException(status_code=404, detail="Host not found")
+
+ host = self.hosts.get(host_id)
+        host_name = host.name if host else 'Unknown Host'
+        container_name = container_id  # Fallback for error logging if the lookup below fails
+
+ start_time = time.time()
+ try:
+ client = self.clients[host_id]
+ container = client.containers.get(container_id)
+ container_name = container.name
+
+ container.stop(timeout=10)
+ duration_ms = int((time.time() - start_time) * 1000)
+
+ logger.info(f"Stopped container '{container_name}' on host '{host_name}'")
+
+ # Track this user action to suppress critical severity on expected state change
+ container_key = f"{host_id}:{container_id}"
+ self._recent_user_actions[container_key] = time.time()
+ logger.info(f"Tracked user stop action for {container_key}")
+
+ # Log the successful stop
+ self.event_logger.log_container_action(
+ action="stop",
+ container_name=container_name,
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=True,
+ triggered_by="user",
+ duration_ms=duration_ms
+ )
+ return True
+ except Exception as e:
+ duration_ms = int((time.time() - start_time) * 1000)
+ logger.error(f"Failed to stop container '{container_name}' on host '{host_name}': {e}")
+
+ # Log the failed stop
+ self.event_logger.log_container_action(
+ action="stop",
+ container_name=container_id,
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=False,
+ triggered_by="user",
+ error_message=str(e),
+ duration_ms=duration_ms
+ )
+ raise HTTPException(status_code=500, detail=str(e))
+
+ def start_container(self, host_id: str, container_id: str) -> bool:
+ """Start a specific container"""
+ if host_id not in self.clients:
+ raise HTTPException(status_code=404, detail="Host not found")
+
+ host = self.hosts.get(host_id)
+        host_name = host.name if host else 'Unknown Host'
+        container_name = container_id  # Fallback for error logging if the lookup below fails
+
+ start_time = time.time()
+ try:
+ client = self.clients[host_id]
+ container = client.containers.get(container_id)
+ container_name = container.name
+
+ container.start()
+ duration_ms = int((time.time() - start_time) * 1000)
+
+ logger.info(f"Started container '{container_name}' on host '{host_name}'")
+
+ # Track this user action to suppress critical severity on expected state change
+ container_key = f"{host_id}:{container_id}"
+ self._recent_user_actions[container_key] = time.time()
+ logger.info(f"Tracked user start action for {container_key}")
+
+ # Log the successful start
+ self.event_logger.log_container_action(
+ action="start",
+ container_name=container_name,
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=True,
+ triggered_by="user",
+ duration_ms=duration_ms
+ )
+ return True
+ except Exception as e:
+ duration_ms = int((time.time() - start_time) * 1000)
+ logger.error(f"Failed to start container '{container_name}' on host '{host_name}': {e}")
+
+ # Log the failed start
+ self.event_logger.log_container_action(
+ action="start",
+ container_name=container_name, # Real name if the lookup succeeded, otherwise the ID
+ container_id=container_id,
+ host_name=host_name,
+ host_id=host_id,
+ success=False,
+ triggered_by="user",
+ error_message=str(e),
+ duration_ms=duration_ms
+ )
+ raise HTTPException(status_code=500, detail=str(e))
+
+ def toggle_auto_restart(self, host_id: str, container_id: str, container_name: str, enabled: bool):
+ """Toggle auto-restart for a container"""
+ # Get host name for logging
+ host = self.hosts.get(host_id)
+ host_name = host.name if host else 'Unknown Host'
+
+ # Use host_id:container_id as key to prevent collisions between hosts
+ container_key = f"{host_id}:{container_id}"
+ self.auto_restart_status[container_key] = enabled
+ if not enabled:
+ self.restart_attempts[container_key] = 0
+ self.restarting_containers[container_key] = False
+ # Save to database
+ self.db.set_auto_restart(host_id, container_id, container_name, enabled)
+ logger.info(f"Auto-restart {'enabled' if enabled else 'disabled'} for container '{container_name}' on host '{host_name}'")
+
+ async def check_orphaned_alerts(self):
+ """Check for alert rules that reference non-existent containers
+ Returns dict mapping alert_rule_id to list of orphaned container entries"""
+ orphaned = {}
+
+ try:
+ # Get all alert rules
+ with self.db.get_session() as session:
+ from database import AlertRuleDB, AlertRuleContainer
+ alert_rules = session.query(AlertRuleDB).all()
+
+ # Get all current containers (name + host_id pairs)
+ current_containers = {}
+ for container in await self.get_containers():
+ key = f"{container.host_id}:{container.name}"
+ current_containers[key] = True
+
+ # Check each alert rule's containers
+ for rule in alert_rules:
+ orphaned_containers = []
+ for alert_container in rule.containers:
+ key = f"{alert_container.host_id}:{alert_container.container_name}"
+ if key not in current_containers:
+ # Container doesn't exist anymore
+ orphaned_containers.append({
+ 'host_id': alert_container.host_id,
+ 'host_name': alert_container.host.name if alert_container.host else 'Unknown',
+ 'container_name': alert_container.container_name
+ })
+
+ if orphaned_containers:
+ orphaned[rule.id] = {
+ 'rule_name': rule.name,
+ 'orphaned_containers': orphaned_containers
+ }
+
+ if orphaned:
+ logger.info(f"Found {len(orphaned)} alert rules with orphaned containers")
+
+ return orphaned
+
+ except Exception as e:
+ logger.error(f"Error checking orphaned alerts: {e}")
+ return {}
+
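The mapping returned above keys on the alert rule ID and carries the rule name plus its orphaned entries; a minimal sketch of how a caller might surface it, assuming a `DockerMonitor` instance named `monitor`:

```python
# Illustrative only: consuming the check_orphaned_alerts() result.
# `monitor` is assumed to be a DockerMonitor instance; the output format is arbitrary.
import asyncio

async def report_orphaned_alerts(monitor):
    orphaned = await monitor.check_orphaned_alerts()
    for rule_id, info in orphaned.items():
        names = ", ".join(
            f"{c['container_name']}@{c['host_name']}"
            for c in info["orphaned_containers"]
        )
        print(f"Rule '{info['rule_name']}' ({rule_id}) references missing containers: {names}")

# asyncio.run(report_orphaned_alerts(monitor))
```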
+ async def _handle_docker_event(self, event: dict):
+ """Handle Docker events from Go service"""
+ try:
+ action = event.get('action', '')
+ container_id = event.get('container_id', '')
+ container_name = event.get('container_name', '')
+ host_id = event.get('host_id', '')
+ attributes = event.get('attributes', {})
+ timestamp_str = event.get('timestamp', '')
+
+ # Filter out noisy exec_* events (health checks, etc.)
+ if action.startswith('exec_'):
+ return
+
+ # Only log important events
+ important_events = ['create', 'start', 'stop', 'die', 'kill', 'destroy', 'pause', 'unpause', 'restart', 'oom', 'health_status']
+ if action in important_events:
+ logger.info(f"Docker event: {action} - {container_name} ({container_id[:12]}) on host {host_id[:8]}")
+
+ # Process event for notifications/alerts
+ if self.notification_service and action in ['die', 'oom', 'kill', 'health_status', 'restart']:
+ # Parse timestamp
+ from datetime import datetime
+ try:
+ timestamp = datetime.fromisoformat(timestamp_str.replace('Z', '+00:00'))
+ except (ValueError, AttributeError, TypeError) as e:
+ logger.warning(f"Failed to parse timestamp '{timestamp_str}': {e}, using current time")
+ timestamp = datetime.now()
+
+ # Get exit code for die events
+ exit_code = None
+ if action == 'die':
+ exit_code_str = attributes.get('exitCode', '0')
+ try:
+ exit_code = int(exit_code_str)
+ except (ValueError, TypeError):
+ exit_code = None
+
+ # Create alert event
+ from notifications import DockerEventAlert
+ alert_event = DockerEventAlert(
+ container_id=container_id,
+ container_name=container_name,
+ host_id=host_id,
+ event_type=action,
+ timestamp=timestamp,
+ attributes=attributes,
+ exit_code=exit_code
+ )
+
+ # Process in background to not block event monitoring
+ task = asyncio.create_task(self.notification_service.process_docker_event(alert_event))
+ task.add_done_callback(_handle_task_exception)
+
+ except Exception as e:
+ logger.error(f"Error handling Docker event from Go service: {e}")
+
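The handler above only reads a handful of keys from the Go service's payload. A hypothetical event of the shape it expects, inferred from those keys (values are made up; the Go service defines the real format):

```python
# Hypothetical payload matching the keys read by _handle_docker_event().
# Field values are illustrative only.
example_event = {
    "action": "die",                                     # Docker event action
    "container_id": "3f9c1a2b4d5e6f708192a3b4c5d6e7f8",  # full container ID
    "container_name": "web-frontend",
    "host_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "attributes": {"exitCode": "137"},                   # consulted for 'die' events
    "timestamp": "2024-01-01T12:00:00Z",                 # ISO 8601; the 'Z' suffix is handled
}
```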
+ async def monitor_containers(self):
+ """Main monitoring loop"""
+ logger.info("Starting container monitoring...")
+
+ # Get stats client instance
+ # Note: streaming_containers is now managed by self.stats_manager
+ stats_client = get_stats_client()
+
+ # Register all hosts with the stats and event services on startup
+ for host_id, host in self.hosts.items():
+ try:
+ # Get TLS certificates from database
+ session = self.db.get_session()
+ try:
+ db_host = session.query(DockerHostDB).filter_by(id=host_id).first()
+ tls_ca = db_host.tls_ca if db_host else None
+ tls_cert = db_host.tls_cert if db_host else None
+ tls_key = db_host.tls_key if db_host else None
+ finally:
+ session.close()
+
+ # Register with stats service
+ await stats_client.add_docker_host(host_id, host.url, tls_ca, tls_cert, tls_key)
+ logger.info(f"Registered host {host.name} ({host_id[:8]}) with stats service")
+
+ # Register with event service
+ await stats_client.add_event_host(host_id, host.url, tls_ca, tls_cert, tls_key)
+ logger.info(f"Registered host {host.name} ({host_id[:8]}) with event service")
+ except Exception as e:
+ logger.error(f"Failed to register host {host_id} with services: {e}")
+
+ # Connect to event stream WebSocket
+ try:
+ await stats_client.connect_event_stream(self._handle_docker_event)
+ logger.info("Connected to Go event stream")
+ except Exception as e:
+ logger.error(f"Failed to connect to event stream: {e}")
+
+ while True:
+ try:
+ containers = await self.get_containers()
+
+ # Centralized stats collection decision using StatsManager
+ has_viewers = self.manager.has_active_connections()
+
+ if has_viewers:
+ # Determine which containers need stats (centralized logic)
+ containers_needing_stats = self.stats_manager.determine_containers_needing_stats(
+ containers,
+ self.settings
+ )
+
+ # Sync streams with what's needed (start new, stop old)
+ await self.stats_manager.sync_container_streams(
+ containers,
+ containers_needing_stats,
+ stats_client,
+ _handle_task_exception
+ )
+ else:
+ # No active viewers - stop all streams
+ await self.stats_manager.stop_all_streams(stats_client, _handle_task_exception)
+
+ # Track container state changes and log them
+ for container in containers:
+ container_key = f"{container.host_id}:{container.id}"
+ current_state = container.status
+
+ # Hold lock during entire read-process-write to prevent race conditions
+ async with self._state_lock:
+ previous_state = self._container_states.get(container_key)
+
+ # Log state changes
+ if previous_state is not None and previous_state != current_state:
+ # Check if this state change was expected (recent user action)
+ async with self._actions_lock:
+ last_user_action = self._recent_user_actions.get(container_key, 0)
+ time_since_action = time.time() - last_user_action
+ is_user_initiated = time_since_action < 30 # Within 30 seconds
+
+ logger.info(f"State change for {container_key}: {previous_state} → {current_state}, "
+ f"time_since_action={time_since_action:.1f}s, user_initiated={is_user_initiated}")
+
+ # Clean up old tracking entries (5 minutes or older)
+ if time_since_action >= 300:
+ async with self._actions_lock:
+ self._recent_user_actions.pop(container_key, None)
+
+ self.event_logger.log_container_state_change(
+ container_name=container.name,
+ container_id=container.short_id,
+ host_name=container.host_name,
+ host_id=container.host_id,
+ old_state=previous_state,
+ new_state=current_state,
+ triggered_by="user" if is_user_initiated else "system"
+ )
+
+ # Update tracked state (still inside lock)
+ self._container_states[container_key] = current_state
+
+ # Check for containers that need auto-restart
+ for container in containers:
+ if (container.status == "exited" and
+ self._get_auto_restart_status(container.host_id, container.short_id)):
+
+ # Use host_id:container_id as key to prevent collisions between hosts
+ container_key = f"{container.host_id}:{container.short_id}"
+ attempts = self.restart_attempts.get(container_key, 0)
+ is_restarting = self.restarting_containers.get(container_key, False)
+
+ if attempts < self.settings.max_retries and not is_restarting:
+ self.restarting_containers[container_key] = True
+ task = asyncio.create_task(
+ self.auto_restart_container(container)
+ )
+ task.add_done_callback(_handle_task_exception)
+
+ # Process alerts for container state changes
+ await self.alert_processor.process_container_update(containers, self.hosts)
+
+ # Only fetch and broadcast stats if there are active viewers
+ if has_viewers:
+ # Prepare broadcast data
+ broadcast_data = {
+ "containers": [c.dict() for c in containers],
+ "hosts": [h.dict() for h in self.hosts.values()],
+ "timestamp": datetime.now().isoformat()
+ }
+
+ # Only include host metrics if host stats are enabled
+ if self.stats_manager.should_broadcast_host_metrics(self.settings):
+ # Get host metrics from stats service (fast HTTP call)
+ host_metrics = await stats_client.get_host_stats()
+ logger.debug(f"Retrieved metrics for {len(host_metrics)} hosts from stats service")
+ broadcast_data["host_metrics"] = host_metrics
+
+ # Broadcast update to all connected clients
+ await self.manager.broadcast({
+ "type": "containers_update",
+ "data": broadcast_data
+ })
+
+ except Exception as e:
+ logger.error(f"Error in monitoring loop: {e}")
+
+ await asyncio.sleep(self.settings.polling_interval)
+
+ async def auto_restart_container(self, container: Container):
+ """Attempt to auto-restart a container"""
+ container_id = container.short_id
+ # Use host_id:container_id as key to prevent collisions between hosts
+ container_key = f"{container.host_id}:{container_id}"
+
+ self.restart_attempts[container_key] = self.restart_attempts.get(container_key, 0) + 1
+ attempt = self.restart_attempts[container_key]
+
+ correlation_id = self.event_logger.create_correlation_id()
+
+ logger.info(
+ f"Auto-restart attempt {attempt}/{self.settings.max_retries} "
+ f"for container '{container.name}' on host '{container.host_name}'"
+ )
+
+ # Wait before attempting restart
+ await asyncio.sleep(self.settings.retry_delay)
+
+ try:
+ success = self.restart_container(container.host_id, container.id)
+ if success:
+ self.restart_attempts[container_key] = 0
+
+ # Log successful auto-restart
+ self.event_logger.log_auto_restart_attempt(
+ container_name=container.name,
+ container_id=container_id,
+ host_name=container.host_name,
+ host_id=container.host_id,
+ attempt=attempt,
+ max_attempts=self.settings.max_retries,
+ success=True,
+ correlation_id=correlation_id
+ )
+
+ await self.manager.broadcast({
+ "type": "auto_restart_success",
+ "data": {
+ "container_id": container_id,
+ "container_name": container.name,
+ "host": container.host_name
+ }
+ })
+ except Exception as e:
+ logger.error(f"Auto-restart failed for {container.name}: {e}")
+
+ # Log failed auto-restart
+ self.event_logger.log_auto_restart_attempt(
+ container_name=container.name,
+ container_id=container_id,
+ host_name=container.host_name,
+ host_id=container.host_id,
+ attempt=attempt,
+ max_attempts=self.settings.max_retries,
+ success=False,
+ error_message=str(e),
+ correlation_id=correlation_id
+ )
+
+ if attempt >= self.settings.max_retries:
+ self.auto_restart_status[container_key] = False
+ await self.manager.broadcast({
+ "type": "auto_restart_failed",
+ "data": {
+ "container_id": container_id,
+ "container_name": container.name,
+ "attempts": attempt,
+ "max_retries": self.settings.max_retries
+ }
+ })
+ finally:
+ # Always clear the restarting flag when done (success or failure)
+ self.restarting_containers[container_key] = False
+
+ def _load_persistent_config(self):
+ """Load saved configuration from database"""
+ try:
+ # Load saved hosts
+ db_hosts = self.db.get_hosts(active_only=True)
+
+ # Detect and warn about duplicate hosts (same URL)
+ seen_urls = {}
+ for host in db_hosts:
+ if host.url in seen_urls:
+ logger.warning(
+ f"Duplicate host detected: '{host.name}' ({host.id}) and "
+ f"'{seen_urls[host.url]['name']}' ({seen_urls[host.url]['id']}) "
+ f"both use URL '{host.url}'. Consider removing duplicates."
+ )
+ else:
+ seen_urls[host.url] = {'name': host.name, 'id': host.id}
+
+ # Check if this is first run
+ with self.db.get_session() as session:
+ settings = session.query(GlobalSettings).first()
+ if not settings:
+ # Create default settings
+ settings = GlobalSettings()
+ session.add(settings)
+ session.commit()
+
+ # Auto-add local Docker only on first run (outside session context)
+ with self.db.get_session() as session:
+ settings = session.query(GlobalSettings).first()
+ if settings and not settings.first_run_complete and not db_hosts and os.path.exists('/var/run/docker.sock'):
+ logger.info("First run detected - adding local Docker automatically")
+ host_added = False
+ try:
+ config = DockerHostConfig(
+ name="Local Docker",
+ url="unix:///var/run/docker.sock",
+ tls_cert=None,
+ tls_key=None,
+ tls_ca=None
+ )
+ self.add_host(config, suppress_event_loop_errors=True)
+ host_added = True
+ logger.info("Successfully added local Docker host")
+ except Exception as e:
+ # Check if this is the benign "no running event loop" error during startup
+ # The host is actually added successfully despite this error
+ error_str = str(e)
+ if "no running event loop" in error_str:
+ host_added = True
+ logger.debug(f"Event loop warning during first run (host added successfully): {e}")
+ else:
+ logger.error(f"Failed to add local Docker: {e}")
+ session.rollback()
+
+ # Mark first run as complete if host was added
+ if host_added:
+ settings.first_run_complete = True
+ session.commit()
+ logger.info("First run setup complete")
+
+ for db_host in db_hosts:
+ try:
+ config = DockerHostConfig(
+ name=db_host.name,
+ url=db_host.url,
+ tls_cert=db_host.tls_cert,
+ tls_key=db_host.tls_key,
+ tls_ca=db_host.tls_ca
+ )
+ # Try to connect to the host with existing ID and preserve security status
+ host = self.add_host(config, existing_id=db_host.id, skip_db_save=True, suppress_event_loop_errors=True)
+ # Override with stored security status
+ if hasattr(host, 'security_status') and db_host.security_status:
+ host.security_status = db_host.security_status
+ except Exception as e:
+ # Suppress event loop errors during startup
+ error_str = str(e)
+ if "no running event loop" not in error_str:
+ logger.error(f"Failed to reconnect to saved host {db_host.name}: {e}")
+ # Add host to UI even if connection failed, mark as offline
+ # This prevents "disappearing hosts" bug after restart
+ host = DockerHost(
+ id=db_host.id,
+ name=db_host.name,
+ url=db_host.url,
+ status="offline",
+ client=None
+ )
+ host.security_status = db_host.security_status or "unknown"
+ self.hosts[db_host.id] = host
+ logger.info(f"Added host {db_host.name} in offline mode - connection will retry")
+
+ # Load auto-restart configurations
+ for host_id in self.hosts:
+ with self.db.get_session() as session:
+ configs = session.query(AutoRestartConfig).filter(
+ AutoRestartConfig.host_id == host_id,
+ AutoRestartConfig.enabled == True
+ ).all()
+ # Read attributes inside the session so the objects are not detached
+ for config in configs:
+ # Use host_id:container_id as key to prevent collisions between hosts
+ container_key = f"{config.host_id}:{config.container_id}"
+ self.auto_restart_status[container_key] = True
+ self.restart_attempts[container_key] = config.restart_count
+
+ logger.info(f"Loaded {len(self.hosts)} hosts from database")
+ except Exception as e:
+ logger.error(f"Error loading persistent config: {e}")
+
+ def _get_auto_restart_status(self, host_id: str, container_id: str) -> bool:
+ """Get auto-restart status for a container"""
+ # Use host_id:container_id as key to prevent collisions between hosts
+ container_key = f"{host_id}:{container_id}"
+
+ # Check in-memory cache first
+ if container_key in self.auto_restart_status:
+ return self.auto_restart_status[container_key]
+
+ # Check database
+ config = self.db.get_auto_restart_config(host_id, container_id)
+ if config:
+ self.auto_restart_status[container_key] = config.enabled
+ return config.enabled
+
+ return False
+
+ async def cleanup_old_data(self):
+ """Periodic cleanup of old data"""
+ logger.info("Starting periodic data cleanup...")
+
+ while True:
+ try:
+ settings = self.db.get_settings()
+
+ if settings.auto_cleanup_events:
+ # Clean up old events
+ event_deleted = self.db.cleanup_old_events(settings.event_retention_days)
+ if event_deleted > 0:
+ self.event_logger.log_system_event(
+ "Automatic Event Cleanup",
+ f"Cleaned up {event_deleted} events older than {settings.event_retention_days} days",
+ EventSeverity.INFO,
+ EventType.STARTUP
+ )
+
+ # Clean up expired sessions (runs daily regardless of event cleanup setting)
+ expired_count = session_manager.cleanup_expired_sessions()
+ if expired_count > 0:
+ logger.info(f"Cleaned up {expired_count} expired sessions")
+
+ # Sleep for 24 hours before next cleanup
+ await asyncio.sleep(24 * 60 * 60) # 24 hours
+
+ except Exception as e:
+ logger.error(f"Error in cleanup task: {e}")
+ # Wait 1 hour before retrying
+ await asyncio.sleep(60 * 60) # 1 hour
\ No newline at end of file
diff --git a/dockmon/backend/docker_monitor/stats_manager.py b/dockmon/backend/docker_monitor/stats_manager.py
new file mode 100644
index 0000000..bb0855f
--- /dev/null
+++ b/dockmon/backend/docker_monitor/stats_manager.py
@@ -0,0 +1,170 @@
+"""
+Stats Collection Manager for DockMon
+Centralized logic for determining which containers need stats collection
+"""
+
+import asyncio
+import logging
+from typing import Set, List
+from models.docker_models import Container
+from database import GlobalSettings
+
+logger = logging.getLogger(__name__)
+
+
+class StatsManager:
+ """Manages stats collection decisions based on settings and modal state"""
+
+ def __init__(self):
+ """Initialize stats manager"""
+ self.streaming_containers: Set[str] = set() # Currently streaming container keys (host_id:container_id)
+ self.modal_containers: Set[str] = set() # Composite keys (host_id:container_id) with open modals
+ self._streaming_lock = asyncio.Lock() # Protect streaming_containers set from race conditions
+
+ def add_modal_container(self, container_id: str, host_id: str) -> None:
+ """Track that a container modal is open"""
+ composite_key = f"{host_id}:{container_id}"
+ self.modal_containers.add(composite_key)
+ logger.debug(f"Container modal opened for {container_id[:12]} on host {host_id[:8]} - stats tracking enabled")
+
+ def remove_modal_container(self, container_id: str, host_id: str) -> None:
+ """Remove container from modal tracking"""
+ composite_key = f"{host_id}:{container_id}"
+ self.modal_containers.discard(composite_key)
+ logger.debug(f"Container modal closed for {container_id[:12]} on host {host_id[:8]}")
+
+ def clear_modal_containers(self) -> None:
+ """Clear all modal containers (e.g., on WebSocket disconnect)"""
+ if self.modal_containers:
+ logger.debug(f"Clearing {len(self.modal_containers)} modal containers")
+ self.modal_containers.clear()
+
+ def determine_containers_needing_stats(
+ self,
+ containers: List[Container],
+ settings: GlobalSettings
+ ) -> Set[str]:
+ """
+ Centralized decision: determine which containers need stats collection
+
+ Rules:
+ 1. If show_container_stats OR show_host_stats is ON → collect ALL running containers
+ (host stats are aggregated from container stats)
+ 2. Always collect stats for containers with open modals
+
+ Args:
+ containers: List of all containers
+ settings: Global settings with show_container_stats and show_host_stats flags
+
+ Returns:
+ Set of composite keys (host_id:container_id) that need stats collection
+ """
+ containers_needing_stats = set()
+
+ # Rule 1: Container stats OR host stats enabled = ALL running containers
+ # (host stats need container data for aggregation)
+ if settings.show_container_stats or settings.show_host_stats:
+ for container in containers:
+ if container.status == 'running':
+ containers_needing_stats.add(f"{container.host_id}:{container.id}")
+
+ # Rule 2: Always add modal containers (even if settings are off)
+ # Modal containers are already stored as composite keys
+ for modal_composite_key in self.modal_containers:
+ # Verify container is still running before adding
+ for container in containers:
+ container_key = f"{container.host_id}:{container.id}"
+ if container_key == modal_composite_key and container.status == 'running':
+ containers_needing_stats.add(container_key)
+ break
+
+ return containers_needing_stats
+
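A toy illustration of the two rules above, using `SimpleNamespace` as a stand-in for the real `Container` and `GlobalSettings` models (only the fields this method reads are populated):

```python
# Toy illustration of the decision rules; SimpleNamespace stands in for the
# real Container and GlobalSettings models.
from types import SimpleNamespace

containers = [
    SimpleNamespace(id="aaa111", host_id="host-1", status="running"),
    SimpleNamespace(id="bbb222", host_id="host-1", status="running"),
]
settings = SimpleNamespace(show_container_stats=False, show_host_stats=False)

sm = StatsManager()
sm.add_modal_container("bbb222", "host-1")  # an open modal forces stats even when settings are off

print(sm.determine_containers_needing_stats(containers, settings))
# {'host-1:bbb222'} -- only the modal container; enable either setting and
# every running container would be included instead.
```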
+ async def sync_container_streams(
+ self,
+ containers: List[Container],
+ containers_needing_stats: Set[str],
+ stats_client,
+ error_callback
+ ) -> None:
+ """
+ Synchronize container stats streams with what's needed
+
+ Starts streams for containers that need stats but aren't streaming yet
+ Stops streams for containers that no longer need stats
+
+ Args:
+ containers: List of all containers
+ containers_needing_stats: Set of composite keys (host_id:container_id) that need stats
+ stats_client: Stats client instance
+ error_callback: Callback for handling async task errors
+ """
+ async with self._streaming_lock:
+ # Start streams for containers that need stats but aren't streaming yet
+ for container in containers:
+ container_key = f"{container.host_id}:{container.id}"
+ if container_key in containers_needing_stats and container_key not in self.streaming_containers:
+ task = asyncio.create_task(
+ stats_client.start_container_stream(
+ container.id,
+ container.name,
+ container.host_id
+ )
+ )
+ task.add_done_callback(error_callback)
+ self.streaming_containers.add(container_key)
+ logger.debug(f"Started stats stream for {container.name} on {container.host_name}")
+
+ # Stop streams for containers that no longer need stats
+ containers_to_stop = self.streaming_containers - containers_needing_stats
+
+ for container_key in containers_to_stop:
+ # Extract host_id and container_id from the key (format: host_id:container_id)
+ try:
+ host_id, container_id = container_key.split(':', 1)
+ except ValueError:
+ logger.error(f"Invalid container key format: {container_key}")
+ self.streaming_containers.discard(container_key)
+ continue
+
+ task = asyncio.create_task(stats_client.stop_container_stream(container_id, host_id))
+ task.add_done_callback(error_callback)
+ self.streaming_containers.discard(container_key)
+ logger.debug(f"Stopped stats stream for container {container_id[:12]}")
+
+ async def stop_all_streams(self, stats_client, error_callback) -> None:
+ """
+ Stop all active stats streams
+
+ Used when there are no active viewers
+
+ Args:
+ stats_client: Stats client instance
+ error_callback: Callback for handling async task errors
+ """
+ async with self._streaming_lock:
+ if self.streaming_containers:
+ logger.info(f"Stopping {len(self.streaming_containers)} stats streams")
+ for container_key in list(self.streaming_containers):
+ # Extract host_id and container_id from the key (format: host_id:container_id)
+ try:
+ host_id, container_id = container_key.split(':', 1)
+ except ValueError:
+ logger.error(f"Invalid container key format during cleanup: {container_key}")
+ continue
+
+ task = asyncio.create_task(stats_client.stop_container_stream(container_id, host_id))
+ task.add_done_callback(error_callback)
+ self.streaming_containers.clear()
+
+ def should_broadcast_host_metrics(self, settings: GlobalSettings) -> bool:
+ """Determine if host metrics should be included in broadcast"""
+ return settings.show_host_stats
+
+ def get_stats_summary(self) -> dict:
+ """Get current stats collection summary for debugging"""
+ return {
+ "streaming_containers": len(self.streaming_containers),
+ "modal_containers": len(self.modal_containers),
+ "modal_container_ids": list(self.modal_containers)
+ }
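A short sketch of how the monitoring loop is expected to drive this class on each poll: decide which containers need stats, then reconcile the active streams against that set. `FakeStatsClient` and `poll_once` below are illustrative stand-ins, not part of this diff.

```python
# Sketch of the per-poll StatsManager flow; FakeStatsClient only implements
# the two coroutines that sync_container_streams() invokes.
import asyncio

class FakeStatsClient:
    async def start_container_stream(self, container_id, container_name, host_id):
        print(f"start stream {container_name} ({container_id[:12]}) on {host_id[:8]}")

    async def stop_container_stream(self, container_id, host_id):
        print(f"stop stream {container_id[:12]} on {host_id[:8]}")

def on_task_done(task):
    # Surface exceptions from fire-and-forget tasks instead of dropping them
    if not task.cancelled() and task.exception():
        print(f"stream task failed: {task.exception()}")

async def poll_once(stats_manager, containers, settings):
    needed = stats_manager.determine_containers_needing_stats(containers, settings)
    await stats_manager.sync_container_streams(containers, needed, FakeStatsClient(), on_task_done)
```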
diff --git a/dockmon/backend/event_logger.py b/dockmon/backend/event_logger.py
new file mode 100644
index 0000000..890abb3
--- /dev/null
+++ b/dockmon/backend/event_logger.py
@@ -0,0 +1,606 @@
+"""
+Comprehensive event logging service for DockMon
+Provides structured logging for all system activities
+"""
+
+import asyncio
+import logging
+import time
+import uuid
+from datetime import datetime, timezone
+from typing import Dict, List, Optional, Any, Union
+from enum import Enum
+from dataclasses import dataclass
+from database import DatabaseManager, EventLog
+
+logger = logging.getLogger(__name__)
+
+class EventCategory(str, Enum):
+ """Event categories"""
+ CONTAINER = "container"
+ HOST = "host"
+ SYSTEM = "system"
+ ALERT = "alert"
+ NOTIFICATION = "notification"
+ USER = "user"
+
+class EventType(str, Enum):
+ """Event types"""
+ # Container events
+ STATE_CHANGE = "state_change"
+ ACTION_TAKEN = "action_taken"
+ AUTO_RESTART = "auto_restart"
+
+ # Host events
+ CONNECTION = "connection"
+ DISCONNECTION = "disconnection"
+ HOST_ADDED = "host_added"
+ HOST_REMOVED = "host_removed"
+
+ # System events
+ STARTUP = "startup"
+ SHUTDOWN = "shutdown"
+ ERROR = "error"
+ PERFORMANCE = "performance"
+
+ # Alert events
+ RULE_TRIGGERED = "rule_triggered"
+ RULE_CREATED = "rule_created"
+ RULE_DELETED = "rule_deleted"
+
+ # Notification events
+ SENT = "sent"
+ FAILED = "failed"
+ CHANNEL_CREATED = "channel_created"
+ CHANNEL_TESTED = "channel_tested"
+
+ # User events
+ LOGIN = "login"
+ LOGOUT = "logout"
+ CONFIG_CHANGED = "config_changed"
+
+class EventSeverity(str, Enum):
+ """Event severity levels"""
+ DEBUG = "debug"
+ INFO = "info"
+ WARNING = "warning"
+ ERROR = "error"
+ CRITICAL = "critical"
+
+@dataclass
+class EventContext:
+ """Context information for events"""
+ correlation_id: Optional[str] = None
+ host_id: Optional[str] = None
+ host_name: Optional[str] = None
+ container_id: Optional[str] = None
+ container_name: Optional[str] = None
+ user_id: Optional[str] = None
+ user_name: Optional[str] = None
+
+class EventLogger:
+ """Comprehensive event logging service"""
+
+ def __init__(self, db: DatabaseManager, websocket_manager=None):
+ self.db = db
+ self.websocket_manager = websocket_manager
+ self._event_queue = asyncio.Queue(maxsize=10000) # Prevent unbounded memory growth
+ self._processing_task: Optional[asyncio.Task] = None
+ self._active_correlations: Dict[str, List[str]] = {}
+ self._dropped_events_count = 0 # Track dropped events for monitoring
+
+ async def start(self):
+ """Start the event processing task"""
+ if not self._processing_task:
+ self._processing_task = asyncio.create_task(self._process_events())
+ logger.info("Event logger started")
+
+ async def stop(self):
+ """Stop the event processing task"""
+ if self._processing_task:
+ self._processing_task.cancel()
+ try:
+ await self._processing_task
+ except asyncio.CancelledError:
+ pass
+
+ # Drain the queue to prevent memory leak
+ while not self._event_queue.empty():
+ try:
+ self._event_queue.get_nowait()
+ self._event_queue.task_done()
+ except Exception:
+ break
+
+ logger.info("Event logger stopped")
+
+ def log_event(self,
+ category: EventCategory,
+ event_type: EventType,
+ title: str,
+ severity: EventSeverity = EventSeverity.INFO,
+ message: Optional[str] = None,
+ context: Optional[EventContext] = None,
+ old_state: Optional[str] = None,
+ new_state: Optional[str] = None,
+ triggered_by: Optional[str] = None,
+ details: Optional[Dict[str, Any]] = None,
+ duration_ms: Optional[int] = None):
+ """Log an event asynchronously"""
+
+ if context is None:
+ context = EventContext()
+
+ event_data = {
+ 'correlation_id': context.correlation_id,
+ 'category': category.value,
+ 'event_type': event_type.value,
+ 'severity': severity.value,
+ 'host_id': context.host_id,
+ 'host_name': context.host_name,
+ 'container_id': context.container_id,
+ 'container_name': context.container_name,
+ 'title': title,
+ 'message': message,
+ 'old_state': old_state,
+ 'new_state': new_state,
+ 'triggered_by': triggered_by,
+ 'details': details or {},
+ 'duration_ms': duration_ms,
+ 'timestamp': datetime.now(timezone.utc)
+ }
+
+ # Add to queue for async processing
+ try:
+ self._event_queue.put_nowait(event_data)
+ except asyncio.QueueFull:
+ self._dropped_events_count += 1
+ # Log more prominently for critical events
+ if severity in [EventSeverity.CRITICAL, EventSeverity.ERROR]:
+ logger.error(f"Event queue FULL! Dropped {severity.value} event: {title} (total dropped: {self._dropped_events_count})")
+ else:
+ # Periodic warning to avoid log spam
+ if self._dropped_events_count % 100 == 1:
+ logger.warning(f"Event queue full, dropped {self._dropped_events_count} events total")
+
+ # Also log to Python logger for immediate visibility
+ python_logger_level = {
+ EventSeverity.DEBUG: logging.DEBUG,
+ EventSeverity.INFO: logging.INFO,
+ EventSeverity.WARNING: logging.WARNING,
+ EventSeverity.ERROR: logging.ERROR,
+ EventSeverity.CRITICAL: logging.CRITICAL
+ }[severity]
+
+ logger.log(python_logger_level, f"[{category.value.upper()}] {title}: {message or ''}")
+
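Every convenience method further down funnels into `log_event()`; a direct call, assuming an `EventLogger` instance named `event_logger` and made-up identifiers, might look like:

```python
# Illustrative direct call; the convenience helpers below wrap this same API.
context = EventContext(
    host_id="a1b2c3d4",          # made-up IDs for illustration
    host_name="prod-docker-01",
    container_id="3f9c1a2b4d5e",
    container_name="web-frontend",
)
event_logger.log_event(
    category=EventCategory.CONTAINER,
    event_type=EventType.STATE_CHANGE,
    title="Container web-frontend state changed",
    severity=EventSeverity.WARNING,
    message="State changed from running to exited",
    context=context,
    old_state="running",
    new_state="exited",
    triggered_by="system",
)
```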
+ def create_correlation_id(self) -> str:
+ """Create a new correlation ID for linking related events"""
+ correlation_id = str(uuid.uuid4())
+ self._active_correlations[correlation_id] = []
+ return correlation_id
+
+ def end_correlation(self, correlation_id: str):
+ """End a correlation session"""
+ if correlation_id in self._active_correlations:
+ del self._active_correlations[correlation_id]
+
+ async def _process_events(self):
+ """Process events from the queue"""
+ while True:
+ try:
+ event_data = await self._event_queue.get()
+ # Save to database
+ event_obj = self.db.add_event(event_data)
+
+ # Broadcast to WebSocket clients
+ if self.websocket_manager and event_obj:
+ try:
+ await self.websocket_manager.broadcast({
+ 'type': 'new_event',
+ 'event': {
+ 'id': event_obj.id,
+ 'correlation_id': event_obj.correlation_id,
+ 'category': event_obj.category,
+ 'event_type': event_obj.event_type,
+ 'severity': event_obj.severity,
+ 'host_id': event_obj.host_id,
+ 'host_name': event_obj.host_name,
+ 'container_id': event_obj.container_id,
+ 'container_name': event_obj.container_name,
+ 'title': event_obj.title,
+ 'message': event_obj.message,
+ 'old_state': event_obj.old_state,
+ 'new_state': event_obj.new_state,
+ 'triggered_by': event_obj.triggered_by,
+ 'details': event_obj.details,
+ 'duration_ms': event_obj.duration_ms,
+ 'timestamp': event_obj.timestamp.isoformat()
+ }
+ })
+ except Exception as ws_error:
+ logger.debug(f"WebSocket broadcast failed (non-critical): {ws_error}")
+
+ self._event_queue.task_done()
+ except asyncio.CancelledError:
+ break
+ except Exception as e:
+ logger.error(f"Error processing event: {e}")
+
+ # Convenience methods for common event types
+
+ def log_container_state_change(self,
+ container_name: str,
+ container_id: str,
+ host_name: str,
+ host_id: str,
+ old_state: str,
+ new_state: str,
+ triggered_by: str = "system",
+ correlation_id: Optional[str] = None):
+ """Log container state change"""
+ # Match severity with alert rule definitions
+ if triggered_by == "user":
+ # User-initiated changes are WARNING (intentional but noteworthy)
+ # except for starting containers which is INFO
+ if new_state in ['running', 'restarting']:
+ severity = EventSeverity.INFO
+ else:
+ severity = EventSeverity.WARNING
+ elif new_state in ['exited', 'dead']:
+ severity = EventSeverity.CRITICAL # Unexpected crash
+ elif new_state in ['stopped', 'paused']:
+ severity = EventSeverity.WARNING # Stopped but not crashed
+ else:
+ severity = EventSeverity.INFO
+
+ context = EventContext(
+ correlation_id=correlation_id,
+ host_id=host_id,
+ host_name=host_name,
+ container_id=container_id,
+ container_name=container_name
+ )
+
+ # Add context to message if user-initiated
+ if triggered_by == "user":
+ title = f"Container {container_name} state changed (user action)"
+ message = f"Container '{container_name}' on host '{host_name}' state changed from {old_state} to {new_state} (user action)"
+ else:
+ title = f"Container {container_name} state changed"
+ message = f"Container '{container_name}' on host '{host_name}' state changed from {old_state} to {new_state}"
+
+ self.log_event(
+ category=EventCategory.CONTAINER,
+ event_type=EventType.STATE_CHANGE,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ old_state=old_state,
+ new_state=new_state,
+ triggered_by=triggered_by
+ )
+
+ def log_container_action(self,
+ action: str,
+ container_name: str,
+ container_id: str,
+ host_name: str,
+ host_id: str,
+ success: bool,
+ triggered_by: str = "user",
+ error_message: Optional[str] = None,
+ duration_ms: Optional[int] = None,
+ correlation_id: Optional[str] = None):
+ """Log container action (start, stop, restart, etc.)"""
+ severity = EventSeverity.ERROR if not success else EventSeverity.INFO
+ title = f"Container {action} {'succeeded' if success else 'failed'}"
+ message = f"Container '{container_name}' on host '{host_name}' {action} {'completed successfully' if success else 'failed'}"
+
+ if error_message:
+ message += f": {error_message}"
+
+ context = EventContext(
+ correlation_id=correlation_id,
+ host_id=host_id,
+ host_name=host_name,
+ container_id=container_id,
+ container_name=container_name
+ )
+
+ self.log_event(
+ category=EventCategory.CONTAINER,
+ event_type=EventType.ACTION_TAKEN,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ triggered_by=triggered_by,
+ duration_ms=duration_ms,
+ details={'action': action, 'success': success, 'error': error_message}
+ )
+
+ def log_auto_restart_attempt(self,
+ container_name: str,
+ container_id: str,
+ host_name: str,
+ host_id: str,
+ attempt: int,
+ max_attempts: int,
+ success: bool,
+ error_message: Optional[str] = None,
+ correlation_id: Optional[str] = None):
+ """Log auto-restart attempt"""
+ severity = EventSeverity.ERROR if not success else EventSeverity.INFO
+ title = f"Auto-restart attempt {attempt}/{max_attempts}"
+ message = f"Auto-restart attempt {attempt} of {max_attempts} for container '{container_name}' on host '{host_name}' {'succeeded' if success else 'failed'}"
+
+ if error_message:
+ message += f": {error_message}"
+
+ context = EventContext(
+ correlation_id=correlation_id,
+ host_id=host_id,
+ host_name=host_name,
+ container_id=container_id,
+ container_name=container_name
+ )
+
+ self.log_event(
+ category=EventCategory.CONTAINER,
+ event_type=EventType.AUTO_RESTART,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ triggered_by="auto_restart",
+ details={'attempt': attempt, 'max_attempts': max_attempts, 'success': success, 'error': error_message}
+ )
+
+ def log_host_connection(self,
+ host_name: str,
+ host_id: str,
+ host_url: str,
+ connected: bool,
+ error_message: Optional[str] = None):
+ """Log host connection/disconnection"""
+ severity = EventSeverity.WARNING if not connected else EventSeverity.INFO
+ event_type = EventType.CONNECTION if connected else EventType.DISCONNECTION
+ title = f"Host {host_name} {'connected' if connected else 'disconnected'}"
+ message = f"Docker host {host_name} ({host_url}) {'connected successfully' if connected else 'disconnected'}"
+
+ if error_message:
+ message += f": {error_message}"
+
+ context = EventContext(
+ host_id=host_id,
+ host_name=host_name
+ )
+
+ self.log_event(
+ category=EventCategory.HOST,
+ event_type=event_type,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ details={'url': host_url, 'connected': connected, 'error': error_message}
+ )
+
+ def log_alert_triggered(self,
+ rule_name: str,
+ rule_id: str,
+ container_name: str,
+ container_id: str,
+ host_name: str,
+ host_id: str,
+ old_state: str,
+ new_state: str,
+ channels_notified: int,
+ total_channels: int,
+ correlation_id: Optional[str] = None):
+ """Log alert rule trigger"""
+ severity = EventSeverity.WARNING if new_state in ['exited', 'dead'] else EventSeverity.INFO
+ title = f"Alert rule '{rule_name}' triggered"
+ message = f"Alert rule triggered for {container_name} state change ({old_state} → {new_state}). Notified {channels_notified}/{total_channels} channels."
+
+ context = EventContext(
+ correlation_id=correlation_id,
+ host_id=host_id,
+ host_name=host_name,
+ container_id=container_id,
+ container_name=container_name
+ )
+
+ self.log_event(
+ category=EventCategory.ALERT,
+ event_type=EventType.RULE_TRIGGERED,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ old_state=old_state,
+ new_state=new_state,
+ details={'rule_id': rule_id, 'channels_notified': channels_notified, 'total_channels': total_channels}
+ )
+
+ def log_notification_sent(self,
+ channel_name: str,
+ channel_type: str,
+ success: bool,
+ container_name: str,
+ error_message: Optional[str] = None,
+ correlation_id: Optional[str] = None):
+ """Log notification attempt"""
+ severity = EventSeverity.ERROR if not success else EventSeverity.INFO
+ title = f"Notification {'sent' if success else 'failed'} via {channel_name}"
+ message = f"Notification via {channel_name} ({channel_type}) {'sent successfully' if success else 'failed'}"
+
+ if error_message:
+ message += f": {error_message}"
+
+ context = EventContext(
+ correlation_id=correlation_id,
+ container_name=container_name
+ )
+
+ self.log_event(
+ category=EventCategory.NOTIFICATION,
+ event_type=EventType.SENT if success else EventType.FAILED,
+ title=title,
+ severity=severity,
+ message=message,
+ context=context,
+ details={'channel_name': channel_name, 'channel_type': channel_type, 'success': success, 'error': error_message}
+ )
+
+ def log_host_added(self,
+ host_name: str,
+ host_id: str,
+ host_url: str,
+ triggered_by: str = "user"):
+ """Log host addition"""
+ context = EventContext(
+ host_id=host_id,
+ host_name=host_name
+ )
+
+ self.log_event(
+ category=EventCategory.HOST,
+ event_type=EventType.HOST_ADDED,
+ title=f"Host {host_name} added",
+ severity=EventSeverity.INFO,
+ message=f"Docker host '{host_name}' ({host_url}) was added to monitoring",
+ context=context,
+ triggered_by=triggered_by,
+ details={'url': host_url}
+ )
+
+ def log_host_removed(self,
+ host_name: str,
+ host_id: str,
+ triggered_by: str = "user"):
+ """Log host removal"""
+ context = EventContext(
+ host_id=host_id,
+ host_name=host_name
+ )
+
+ self.log_event(
+ category=EventCategory.HOST,
+ event_type=EventType.HOST_REMOVED,
+ title=f"Host {host_name} removed",
+ severity=EventSeverity.INFO,
+ message=f"Docker host '{host_name}' was removed from monitoring",
+ context=context,
+ triggered_by=triggered_by
+ )
+
+ def log_alert_rule_created(self,
+ rule_name: str,
+ rule_id: str,
+ container_count: int,
+ channels: List[str],
+ triggered_by: str = "user"):
+ """Log alert rule creation"""
+ self.log_event(
+ category=EventCategory.ALERT,
+ event_type=EventType.RULE_CREATED,
+ title=f"Alert rule '{rule_name}' created",
+ severity=EventSeverity.INFO,
+ message=f"New alert rule '{rule_name}' created with {container_count} container(s) and {len(channels)} notification channel(s)",
+ triggered_by=triggered_by,
+ details={'rule_id': rule_id, 'container_count': container_count, 'channels': channels}
+ )
+
+ def log_alert_rule_deleted(self,
+ rule_name: str,
+ rule_id: str,
+ triggered_by: str = "user"):
+ """Log alert rule deletion"""
+ self.log_event(
+ category=EventCategory.ALERT,
+ event_type=EventType.RULE_DELETED,
+ title=f"Alert rule '{rule_name}' deleted",
+ severity=EventSeverity.INFO,
+ message=f"Alert rule '{rule_name}' was deleted",
+ triggered_by=triggered_by,
+ details={'rule_id': rule_id}
+ )
+
+ def log_notification_channel_created(self,
+ channel_name: str,
+ channel_type: str,
+ triggered_by: str = "user"):
+ """Log notification channel creation"""
+ self.log_event(
+ category=EventCategory.NOTIFICATION,
+ event_type=EventType.CHANNEL_CREATED,
+ title=f"Notification channel '{channel_name}' created",
+ severity=EventSeverity.INFO,
+ message=f"New notification channel '{channel_name}' ({channel_type}) was created",
+ triggered_by=triggered_by,
+ details={'channel_name': channel_name, 'channel_type': channel_type}
+ )
+
+ def log_system_event(self,
+ title: str,
+ message: str,
+ severity: EventSeverity = EventSeverity.INFO,
+ event_type: EventType = EventType.STARTUP,
+ details: Optional[Dict[str, Any]] = None):
+ """Log system-level events"""
+ self.log_event(
+ category=EventCategory.SYSTEM,
+ event_type=event_type,
+ title=title,
+ severity=severity,
+ message=message,
+ details=details
+ )
+
+class PerformanceTimer:
+ """Context manager for timing operations"""
+
+ def __init__(self, event_logger: EventLogger, operation_name: str, context: Optional[EventContext] = None):
+ self.event_logger = event_logger
+ self.operation_name = operation_name
+ self.context = context or EventContext()
+ self.start_time = None
+ self.correlation_id = None
+
+ def __enter__(self):
+ self.start_time = time.time()
+ self.correlation_id = self.event_logger.create_correlation_id()
+ self.context.correlation_id = self.correlation_id
+ return self
+
+ def __exit__(self, exc_type, exc_val, exc_tb):
+ duration_ms = int((time.time() - self.start_time) * 1000)
+
+ if exc_type is None:
+ # Success
+ self.event_logger.log_event(
+ category=EventCategory.SYSTEM,
+ event_type=EventType.PERFORMANCE,
+ title=f"{self.operation_name} completed",
+ severity=EventSeverity.DEBUG,
+ message=f"Operation '{self.operation_name}' completed in {duration_ms}ms",
+ context=self.context,
+ duration_ms=duration_ms
+ )
+ else:
+ # Error occurred
+ self.event_logger.log_event(
+ category=EventCategory.SYSTEM,
+ event_type=EventType.ERROR,
+ title=f"{self.operation_name} failed",
+ severity=EventSeverity.ERROR,
+ message=f"Operation '{self.operation_name}' failed after {duration_ms}ms: {exc_val}",
+ context=self.context,
+ duration_ms=duration_ms,
+ details={'error_type': exc_type.__name__ if exc_type else None, 'error_message': str(exc_val)}
+ )
+
+ self.event_logger.end_correlation(self.correlation_id)
\ No newline at end of file
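`PerformanceTimer` is a synchronous context manager; a minimal usage sketch, assuming an `EventLogger` instance named `event_logger` and a hypothetical operation to time:

```python
# Sketch: wrap a blocking operation to emit a DEBUG performance event on
# success, or an ERROR event (with duration) if the block raises.
with PerformanceTimer(event_logger, "host reconnect"):
    reconnect_all_hosts()  # hypothetical operation being timed
```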
diff --git a/dockmon/backend/main.py b/dockmon/backend/main.py
new file mode 100644
index 0000000..1dca962
--- /dev/null
+++ b/dockmon/backend/main.py
@@ -0,0 +1,1632 @@
+#!/usr/bin/env python3
+"""
+DockMon Backend - Docker Container Monitoring System
+Supports multiple Docker hosts with auto-restart and alerts
+"""
+
+import asyncio
+import json
+import logging
+import time
+import uuid
+from datetime import datetime, timedelta
+from typing import Dict, List, Optional, Any
+from contextlib import asynccontextmanager
+
+import docker
+from docker import DockerClient
+from docker.errors import DockerException, APIError
+from fastapi import FastAPI, WebSocket, WebSocketDisconnect, HTTPException, Request, Depends, status, Cookie, Response, Query
+from fastapi.middleware.cors import CORSMiddleware
+# Session-based auth - no longer need HTTPBearer
+from fastapi.responses import FileResponse
+from database import DatabaseManager
+from realtime import RealtimeMonitor
+from notifications import NotificationService, AlertProcessor
+from event_logger import EventLogger, EventContext, EventCategory, EventType, EventSeverity, PerformanceTimer
+
+# Import extracted modules
+from config.settings import AppConfig, get_cors_origins, setup_logging
+from models.docker_models import DockerHostConfig, DockerHost
+from models.settings_models import GlobalSettings, AlertRule
+from models.request_models import (
+ AutoRestartRequest, AlertRuleCreate, AlertRuleUpdate,
+ NotificationChannelCreate, NotificationChannelUpdate, EventLogFilter
+)
+from security.audit import security_audit
+from security.rate_limiting import rate_limiter, rate_limit_auth, rate_limit_hosts, rate_limit_containers, rate_limit_notifications, rate_limit_default
+from auth.routes import router as auth_router, verify_frontend_session
+verify_session_auth = verify_frontend_session
+from websocket.connection import ConnectionManager, DateTimeEncoder
+from websocket.rate_limiter import ws_rate_limiter
+from docker_monitor.monitor import DockerMonitor
+
+# Configure logging
+setup_logging()
+logger = logging.getLogger(__name__)
+
+
+# ==================== FastAPI Application ====================
+
+# Create monitor instance
+monitor = DockerMonitor()
+
+
+# ==================== Authentication ====================
+
+# Session-based authentication only - no API keys needed
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+ """Manage application lifecycle"""
+ # Startup
+ logger.info("Starting DockMon backend...")
+
+ # Ensure default user exists
+ monitor.db.get_or_create_default_user()
+
+ await monitor.event_logger.start()
+ monitor.event_logger.log_system_event("DockMon Backend Starting", "DockMon backend is initializing", EventSeverity.INFO, EventType.STARTUP)
+
+ # Connect security audit logger to event logger
+ security_audit.set_event_logger(monitor.event_logger)
+ monitor.monitoring_task = asyncio.create_task(monitor.monitor_containers())
+ monitor.cleanup_task = asyncio.create_task(monitor.cleanup_old_data())
+
+ # Start blackout window monitoring with WebSocket support
+ await monitor.notification_service.blackout_manager.start_monitoring(
+ monitor.notification_service,
+ monitor, # Pass DockerMonitor instance to avoid re-initialization
+ monitor.manager # Pass ConnectionManager for WebSocket broadcasts
+ )
+ yield
+ # Shutdown
+ logger.info("Shutting down DockMon backend...")
+ monitor.event_logger.log_system_event("DockMon Backend Shutting Down", "DockMon backend is shutting down", EventSeverity.INFO, EventType.SHUTDOWN)
+ if monitor.monitoring_task:
+ monitor.monitoring_task.cancel()
+ if monitor.cleanup_task:
+ monitor.cleanup_task.cancel()
+ # Stop blackout monitoring
+ monitor.notification_service.blackout_manager.stop_monitoring()
+ # Close stats client (HTTP session and WebSocket)
+ from stats_client import get_stats_client
+ await get_stats_client().close()
+ # Close notification service
+ await monitor.notification_service.close()
+ # Stop event logger
+ await monitor.event_logger.stop()
+ # Dispose SQLAlchemy engine
+ monitor.db.engine.dispose()
+ logger.info("SQLAlchemy engine disposed")
+
+app = FastAPI(
+ title="DockMon API",
+ version="1.0.0",
+ lifespan=lifespan
+)
+
+# Configure CORS - Production ready with environment-based configuration
+app.add_middleware(
+ CORSMiddleware,
+ allow_origins=AppConfig.CORS_ORIGINS,
+ allow_credentials=True,
+ allow_methods=["GET", "POST", "PUT", "DELETE", "OPTIONS"],
+ allow_headers=["Content-Type", "Authorization"],
+)
+
+logger.info(f"CORS configured for origins: {AppConfig.CORS_ORIGINS}")
+
+# ==================== API Routes ====================
+
+# Register authentication router
+app.include_router(auth_router)
+
+@app.get("/")
+async def root(authenticated: bool = Depends(verify_session_auth)):
+ """Backend API root - frontend is served separately"""
+ return {"message": "DockMon Backend API", "version": "1.0.0", "docs": "/docs"}
+
+@app.get("/health")
+async def health_check():
+ """Health check endpoint for Docker health checks - no authentication required"""
+ return {"status": "healthy", "service": "dockmon-backend"}
+
+def _is_localhost_or_internal(ip: str) -> bool:
+ """Check if IP is localhost or internal network (Docker networks, private networks)"""
+ import ipaddress
+ try:
+ addr = ipaddress.ip_address(ip)
+
+ # Allow localhost
+ if addr.is_loopback:
+ return True
+
+ # Allow private networks (RFC 1918) - for Docker networks and internal deployments
+ if addr.is_private:
+ return True
+
+ return False
+ except ValueError:
+ # Invalid IP format
+ return False
+
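For reference, representative inputs and the results this helper produces (loopback and RFC 1918 private addresses pass; public or malformed addresses do not):

```python
# Expected behaviour of _is_localhost_or_internal() for representative inputs.
assert _is_localhost_or_internal("127.0.0.1") is True      # loopback
assert _is_localhost_or_internal("172.18.0.5") is True     # typical Docker bridge network
assert _is_localhost_or_internal("192.168.1.42") is True   # RFC 1918 private range
assert _is_localhost_or_internal("8.8.8.8") is False       # public address
assert _is_localhost_or_internal("not-an-ip") is False     # invalid input
```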
+
+# ==================== Frontend Authentication ====================
+
+async def verify_session_auth(request: Request):
+ """Verify authentication via session cookie only"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ # Since backend only listens on 127.0.0.1, all requests must come through nginx
+ # No need to check client IP - the backend binding ensures security
+
+ # Check session authentication
+ session_id = _get_session_from_cookie(request)
+ if session_id and session_manager.validate_session(session_id, request):
+ return True
+
+ # No valid session found
+ raise HTTPException(
+ status_code=401,
+ detail="Authentication required - please login"
+ )
+
+
+
+@app.get("/api/hosts")
+async def get_hosts(authenticated: bool = Depends(verify_session_auth)):
+ """Get all configured Docker hosts"""
+ return list(monitor.hosts.values())
+
+@app.post("/api/hosts")
+async def add_host(config: DockerHostConfig, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_hosts, request: Request = None):
+ """Add a new Docker host"""
+ try:
+ host = monitor.add_host(config)
+
+ # Security audit log - successful privileged action
+ if request:
+ security_audit.log_privileged_action(
+ client_ip=request.client.host if hasattr(request, 'client') else "unknown",
+ action="ADD_DOCKER_HOST",
+ target=f"{config.name} ({config.url})",
+ success=True,
+ user_agent=request.headers.get('user-agent', 'unknown')
+ )
+
+ # Broadcast host addition to WebSocket clients so they refresh
+ await monitor.manager.broadcast({
+ "type": "host_added",
+ "data": {"host_id": host.id, "host_name": host.name}
+ })
+
+ return host
+ except Exception as e:
+ # Security audit log - failed privileged action
+ if request:
+ security_audit.log_privileged_action(
+ client_ip=request.client.host if hasattr(request, 'client') else "unknown",
+ action="ADD_DOCKER_HOST",
+ target=f"{config.name} ({config.url})",
+ success=False,
+ user_agent=request.headers.get('user-agent', 'unknown')
+ )
+ raise
+
+@app.put("/api/hosts/{host_id}")
+async def update_host(host_id: str, config: DockerHostConfig, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_hosts):
+ """Update an existing Docker host"""
+ host = monitor.update_host(host_id, config)
+ return host
+
+@app.delete("/api/hosts/{host_id}")
+async def remove_host(host_id: str, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_hosts):
+ """Remove a Docker host"""
+ await monitor.remove_host(host_id)
+
+ # Broadcast host removal to WebSocket clients so they refresh
+ await monitor.manager.broadcast({
+ "type": "host_removed",
+ "data": {"host_id": host_id}
+ })
+
+ return {"status": "success", "message": f"Host {host_id} removed"}
+
+@app.get("/api/hosts/{host_id}/metrics")
+async def get_host_metrics(host_id: str, authenticated: bool = Depends(verify_session_auth)):
+ """Get aggregated metrics for a Docker host (CPU, RAM, Network)"""
+ try:
+ host = monitor.hosts.get(host_id)
+ if not host:
+ raise HTTPException(status_code=404, detail="Host not found")
+
+ client = monitor.clients.get(host_id)
+ if not client:
+ raise HTTPException(status_code=503, detail="Host client not available")
+
+ # Get all running containers on this host
+ containers = client.containers.list(filters={'status': 'running'})
+
+ total_cpu = 0.0
+ total_memory_used = 0
+ total_memory_limit = 0
+ total_net_rx = 0
+ total_net_tx = 0
+ container_count = 0
+
+ for container in containers:
+ try:
+ stats = container.stats(stream=False)
+
+ # Calculate CPU percentage
+ cpu_delta = stats['cpu_stats']['cpu_usage']['total_usage'] - \
+ stats['precpu_stats']['cpu_usage']['total_usage']
+ system_delta = stats['cpu_stats']['system_cpu_usage'] - \
+ stats['precpu_stats']['system_cpu_usage']
+
+ if system_delta > 0:
+ # Prefer online_cpus when present (cgroup v2 hosts omit percpu_usage)
+ num_cpus = stats['cpu_stats'].get('online_cpus') or len(stats['cpu_stats']['cpu_usage'].get('percpu_usage', [1]))
+ cpu_percent = (cpu_delta / system_delta) * num_cpus * 100.0
+ total_cpu += cpu_percent
+
+ # Memory
+ mem_usage = stats['memory_stats'].get('usage', 0)
+ mem_limit = stats['memory_stats'].get('limit', 1)
+ total_memory_used += mem_usage
+ total_memory_limit += mem_limit
+
+ # Network I/O
+ networks = stats.get('networks', {})
+ for net_stats in networks.values():
+ total_net_rx += net_stats.get('rx_bytes', 0)
+ total_net_tx += net_stats.get('tx_bytes', 0)
+
+ container_count += 1
+
+ except Exception as e:
+ logger.warning(f"Failed to get stats for container {container.id}: {e}")
+ continue
+
+ # Calculate percentages
+ avg_cpu = round(total_cpu / container_count, 1) if container_count > 0 else 0.0
+ memory_percent = round((total_memory_used / total_memory_limit) * 100, 1) if total_memory_limit > 0 else 0.0
+
+ return {
+ "cpu_percent": avg_cpu,
+ "memory_percent": memory_percent,
+ "memory_used_bytes": total_memory_used,
+ "memory_limit_bytes": total_memory_limit,
+ "network_rx_bytes": total_net_rx,
+ "network_tx_bytes": total_net_tx,
+ "container_count": container_count,
+ "timestamp": int(time.time())
+ }
+
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Error fetching metrics for host {host_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
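For reference, the per-container CPU figure above uses the standard Docker stats delta formula; a worked example with invented sample values:

```python
# Worked example of the CPU% formula used above, with invented sample deltas.
cpu_delta = 200_000_000        # ns of container CPU time since the previous sample
system_delta = 4_000_000_000   # ns of total host CPU time since the previous sample
num_cpus = 4

cpu_percent = (cpu_delta / system_delta) * num_cpus * 100.0
print(cpu_percent)  # 20.0 -> the container used roughly a fifth of one core over the window
```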
+@app.get("/api/containers")
+async def get_containers(host_id: Optional[str] = None, authenticated: bool = Depends(verify_session_auth)):
+ """Get all containers"""
+ return await monitor.get_containers(host_id)
+
+@app.post("/api/hosts/{host_id}/containers/{container_id}/restart")
+async def restart_container(host_id: str, container_id: str, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_containers):
+ """Restart a container"""
+ success = monitor.restart_container(host_id, container_id)
+ return {"status": "success" if success else "failed"}
+
+@app.post("/api/hosts/{host_id}/containers/{container_id}/stop")
+async def stop_container(host_id: str, container_id: str, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_containers):
+ """Stop a container"""
+ success = monitor.stop_container(host_id, container_id)
+ return {"status": "success" if success else "failed"}
+
+@app.post("/api/hosts/{host_id}/containers/{container_id}/start")
+async def start_container(host_id: str, container_id: str, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_containers):
+ """Start a container"""
+ success = monitor.start_container(host_id, container_id)
+ return {"status": "success" if success else "failed"}
+
+@app.get("/api/hosts/{host_id}/containers/{container_id}/logs")
+async def get_container_logs(
+ host_id: str,
+ container_id: str,
+ tail: int = 100,
+ since: Optional[str] = None, # ISO timestamp for getting logs since a specific time
+ authenticated: bool = Depends(verify_session_auth)
+ # No rate limiting - authenticated users can poll logs freely
+):
+ """Get container logs - Portainer-style polling approach"""
+ if host_id not in monitor.clients:
+ raise HTTPException(status_code=404, detail="Host not found")
+
+ try:
+ client = monitor.clients[host_id]
+
+ # Run blocking Docker calls in executor with timeout
+ loop = asyncio.get_event_loop()
+
+ # Get container with timeout
+ try:
+ container = await asyncio.wait_for(
+ loop.run_in_executor(None, client.containers.get, container_id),
+ timeout=5.0
+ )
+ except asyncio.TimeoutError:
+ raise HTTPException(status_code=504, detail="Timeout getting container")
+
+ # Prepare log options
+ log_kwargs = {
+ 'timestamps': True,
+ 'tail': tail
+ }
+
+ # Add since parameter if provided (for getting only new logs)
+ if since:
+ try:
+ # Parse ISO timestamp and convert to Unix timestamp for Docker
+ import dateutil.parser
+ dt = dateutil.parser.parse(since)
+ # Docker's 'since' expects a Unix timestamp as a float; datetime.timestamp()
+ # honors the timezone carried by the parsed value (mktime would assume local time)
+ log_kwargs['since'] = dt.timestamp()
+ log_kwargs['tail'] = 'all' # Get all logs since timestamp
+ except Exception as e:
+ logger.debug(f"Ignoring unparseable 'since' parameter: {e}")
+
+ # Fetch logs with timeout
+ try:
+ logs = await asyncio.wait_for(
+ loop.run_in_executor(
+ None,
+ lambda: container.logs(**log_kwargs).decode('utf-8', errors='ignore')
+ ),
+ timeout=5.0
+ )
+ except asyncio.TimeoutError:
+ raise HTTPException(status_code=504, detail="Timeout fetching logs")
+
+ # Parse logs and extract timestamps
+ # Docker log format with timestamps: "2025-09-30T19:30:45.123456789Z actual log message"
+ parsed_logs = []
+ for line in logs.split('\n'):
+ if not line.strip():
+ continue
+
+ # Try to extract timestamp (Docker format: ISO8601 with nanoseconds)
+ try:
+ # Find the space after timestamp
+ space_idx = line.find(' ')
+ if space_idx > 0:
+ timestamp_str = line[:space_idx]
+ log_text = line[space_idx + 1:]
+
+ # Parse timestamp (remove nanoseconds for Python datetime)
+ # Format: 2025-09-30T19:30:45.123456789Z -> 2025-09-30T19:30:45.123456Z
+ if 'T' in timestamp_str and timestamp_str.endswith('Z'):
+ # Truncate to microseconds (6 digits) if nanoseconds present
+ parts = timestamp_str[:-1].split('.')
+ if len(parts) == 2 and len(parts[1]) > 6:
+ timestamp_str = f"{parts[0]}.{parts[1][:6]}Z"
+
+ parsed_logs.append({
+ "timestamp": timestamp_str,
+ "log": log_text
+ })
+ else:
+ # No valid timestamp, use current time
+ parsed_logs.append({
+ "timestamp": datetime.utcnow().isoformat() + 'Z',
+ "log": line
+ })
+ else:
+ # No space found, treat whole line as log
+ parsed_logs.append({
+ "timestamp": datetime.utcnow().isoformat() + 'Z',
+ "log": line
+ })
+ except Exception:
+ # If parsing fails, use current time
+ parsed_logs.append({
+ "timestamp": datetime.utcnow().isoformat() + 'Z',
+ "log": line
+ })
+
+ return {
+ "container_id": container_id,
+ "logs": parsed_logs,
+ "last_timestamp": datetime.utcnow().isoformat() + 'Z' # For next 'since' parameter
+ }
+
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to get logs for {container_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
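+
+# Illustrative polling flow (client-side sketch; the 'requests' session and base URL
+# below are assumptions, not part of this module):
+#   r = session.get(f"{base}/api/hosts/{hid}/containers/{cid}/logs?tail=100")
+#   last = r.json()["last_timestamp"]
+#   # the next poll only fetches lines newer than the previous response
+#   r = session.get(f"{base}/api/hosts/{hid}/containers/{cid}/logs?since={last}")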
+
+# Container exec endpoint removed for security reasons
+# Users should use direct SSH, Docker CLI, or other appropriate tools for container access
+
+
+# WebSocket log streaming removed in favor of HTTP polling (Portainer-style)
+# This is more reliable for remote Docker hosts
+
+
+@app.post("/api/containers/{container_id}/auto-restart")
+async def toggle_auto_restart(container_id: str, request: AutoRestartRequest, authenticated: bool = Depends(verify_session_auth)):
+ """Toggle auto-restart for a container"""
+ monitor.toggle_auto_restart(request.host_id, container_id, request.container_name, request.enabled)
+ return {"container_id": container_id, "auto_restart": request.enabled}
+
+@app.get("/api/rate-limit/stats")
+async def get_rate_limit_stats(authenticated: bool = Depends(verify_session_auth)):
+ """Get rate limiter statistics - admin only"""
+ return rate_limiter.get_stats()
+
+@app.get("/api/security/audit")
+async def get_security_audit_stats(authenticated: bool = Depends(verify_session_auth), request: Request = None):
+ """Get security audit statistics - admin only"""
+ if request:
+ security_audit.log_privileged_action(
+ client_ip=request.client.host if request.client else "unknown",
+ action="VIEW_SECURITY_AUDIT",
+ target="security_audit_logs",
+ success=True,
+ user_agent=request.headers.get('user-agent', 'unknown')
+ )
+ return security_audit.get_security_stats()
+
+@app.get("/api/settings")
+async def get_settings(authenticated: bool = Depends(verify_session_auth)):
+ """Get global settings"""
+ settings = monitor.db.get_settings()
+ return {
+ "max_retries": settings.max_retries,
+ "retry_delay": settings.retry_delay,
+ "default_auto_restart": settings.default_auto_restart,
+ "polling_interval": settings.polling_interval,
+ "connection_timeout": settings.connection_timeout,
+ "log_retention_days": settings.log_retention_days,
+ "enable_notifications": settings.enable_notifications,
+ "alert_template": getattr(settings, 'alert_template', None),
+ "blackout_windows": getattr(settings, 'blackout_windows', None),
+ "timezone_offset": getattr(settings, 'timezone_offset', 0),
+ "show_host_stats": getattr(settings, 'show_host_stats', True),
+ "show_container_stats": getattr(settings, 'show_container_stats', True)
+ }
+
+@app.post("/api/settings")
+async def update_settings(settings: GlobalSettings, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_default):
+ """Update global settings"""
+ # Check if stats settings changed
+ old_show_host_stats = monitor.settings.show_host_stats
+ old_show_container_stats = monitor.settings.show_container_stats
+
+ updated = monitor.db.update_settings(settings.dict())
+ monitor.settings = updated # Update in-memory settings
+
+ # Log stats collection changes
+ if old_show_host_stats != updated.show_host_stats:
+ logger.info(f"Host stats collection {'enabled' if updated.show_host_stats else 'disabled'}")
+ if old_show_container_stats != updated.show_container_stats:
+ logger.info(f"Container stats collection {'enabled' if updated.show_container_stats else 'disabled'}")
+
+ # Broadcast blackout status change to all clients
+ is_blackout, window_name = monitor.notification_service.blackout_manager.is_in_blackout_window()
+ await monitor.manager.broadcast({
+ 'type': 'blackout_status_changed',
+ 'data': {
+ 'is_blackout': is_blackout,
+ 'window_name': window_name
+ }
+ })
+
+ return settings
+
+@app.get("/api/alerts")
+async def get_alert_rules(authenticated: bool = Depends(verify_session_auth)):
+ """Get all alert rules"""
+ rules = monitor.db.get_alert_rules(enabled_only=False)
+ logger.info(f"Retrieved {len(rules)} alert rules from database")
+
+ # Check for orphaned alerts
+ orphaned = await monitor.check_orphaned_alerts()
+
+ return [{
+ "id": rule.id,
+ "name": rule.name,
+ "containers": [{"host_id": c.host_id, "container_name": c.container_name}
+ for c in rule.containers] if rule.containers else [],
+ "trigger_events": rule.trigger_events,
+ "trigger_states": rule.trigger_states,
+ "notification_channels": rule.notification_channels,
+ "cooldown_minutes": rule.cooldown_minutes,
+ "enabled": rule.enabled,
+ "last_triggered": rule.last_triggered.isoformat() if rule.last_triggered else None,
+ "created_at": rule.created_at.isoformat(),
+ "updated_at": rule.updated_at.isoformat(),
+ "is_orphaned": rule.id in orphaned,
+ "orphaned_containers": orphaned.get(rule.id, {}).get('orphaned_containers', []) if rule.id in orphaned else []
+ } for rule in rules]
+
+
+@app.post("/api/alerts")
+async def create_alert_rule(rule: AlertRuleCreate, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_default):
+ """Create a new alert rule"""
+ try:
+ # Validate cooldown_minutes
+ if rule.cooldown_minutes < 0 or rule.cooldown_minutes > 10080: # Max 1 week
+ raise HTTPException(status_code=400, detail="Cooldown must be between 0 and 10080 minutes (1 week)")
+
+ rule_id = str(uuid.uuid4())
+
+ # Convert ContainerHostPair objects to dicts for database
+ containers_data = None
+ if rule.containers:
+ containers_data = [{"host_id": c.host_id, "container_name": c.container_name}
+ for c in rule.containers]
+
+ logger.info(f"Creating alert rule: {rule.name} with {len(containers_data) if containers_data else 0} container+host pairs")
+
+ db_rule = monitor.db.add_alert_rule({
+ "id": rule_id,
+ "name": rule.name,
+ "containers": containers_data,
+ "trigger_events": rule.trigger_events,
+ "trigger_states": rule.trigger_states,
+ "notification_channels": rule.notification_channels,
+ "cooldown_minutes": rule.cooldown_minutes,
+ "enabled": rule.enabled
+ })
+ logger.info(f"Successfully created alert rule with ID: {db_rule.id}")
+
+ # Log alert rule creation
+ monitor.event_logger.log_alert_rule_created(
+ rule_name=db_rule.name,
+ rule_id=db_rule.id,
+ container_count=len(db_rule.containers) if db_rule.containers else 0,
+ channels=db_rule.notification_channels or [],
+ triggered_by="user"
+ )
+
+ return {
+ "id": db_rule.id,
+ "name": db_rule.name,
+ "containers": [{"host_id": c.host_id, "container_name": c.container_name}
+ for c in db_rule.containers] if db_rule.containers else [],
+ "trigger_events": db_rule.trigger_events,
+ "trigger_states": db_rule.trigger_states,
+ "notification_channels": db_rule.notification_channels,
+ "cooldown_minutes": db_rule.cooldown_minutes,
+ "enabled": db_rule.enabled,
+ "last_triggered": None,
+ "created_at": db_rule.created_at.isoformat(),
+ "updated_at": db_rule.updated_at.isoformat()
+ }
+ except Exception as e:
+ logger.error(f"Failed to create alert rule: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
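+
+# Example request body for POST /api/alerts (all values illustrative):
+#   {"name": "web down", "containers": [{"host_id": "<host-uuid>", "container_name": "web"}],
+#    "trigger_states": ["exited"], "notification_channels": [1],
+#    "cooldown_minutes": 15, "enabled": true}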
+
+@app.put("/api/alerts/{rule_id}")
+async def update_alert_rule(rule_id: str, updates: AlertRuleUpdate, authenticated: bool = Depends(verify_session_auth)):
+ """Update an alert rule"""
+ try:
+ # Validate cooldown_minutes if provided
+ if updates.cooldown_minutes is not None:
+ if updates.cooldown_minutes < 0 or updates.cooldown_minutes > 10080: # Max 1 week
+ raise HTTPException(status_code=400, detail="Cooldown must be between 0 and 10080 minutes (1 week)")
+
+ # Include all fields that are explicitly set, even if empty
+ # This allows clearing trigger_events or trigger_states
+ update_data = {}
+ for k, v in updates.dict().items():
+ if v is not None:
+ # Convert empty lists to None for trigger fields
+ if k in ['trigger_events', 'trigger_states'] and isinstance(v, list) and not v:
+ update_data[k] = None
+ # Handle containers field separately
+ elif k == 'containers':
+ # v is already a list of dicts after .dict() call
+ # Include it even if None to clear the containers (set to "all containers")
+ update_data[k] = v
+ else:
+ update_data[k] = v
+ elif k == 'containers':
+ # Explicitly handle containers=None to clear specific container selection
+ update_data[k] = None
+
+ # Validate that at least one trigger type remains after update
+ if 'trigger_events' in update_data or 'trigger_states' in update_data:
+ # Get current rule to check what will remain
+ current_rule = monitor.db.get_alert_rule(rule_id)
+ if current_rule:
+ final_events = update_data.get('trigger_events', current_rule.trigger_events)
+ final_states = update_data.get('trigger_states', current_rule.trigger_states)
+
+ if not final_events and not final_states:
+ raise HTTPException(status_code=400,
+ detail="Alert rule must have at least one trigger event or state")
+
+ db_rule = monitor.db.update_alert_rule(rule_id, update_data)
+
+ if not db_rule:
+ raise HTTPException(status_code=404, detail="Alert rule not found")
+
+ # Refresh in-memory alert rules
+ return {
+ "id": db_rule.id,
+ "name": db_rule.name,
+ "containers": [{"host_id": c.host_id, "container_name": c.container_name}
+ for c in db_rule.containers] if db_rule.containers else [],
+ "trigger_events": db_rule.trigger_events,
+ "trigger_states": db_rule.trigger_states,
+ "notification_channels": db_rule.notification_channels,
+ "cooldown_minutes": db_rule.cooldown_minutes,
+ "enabled": db_rule.enabled,
+ "last_triggered": db_rule.last_triggered.isoformat() if db_rule.last_triggered else None,
+ "created_at": db_rule.created_at.isoformat(),
+ "updated_at": db_rule.updated_at.isoformat()
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to update alert rule: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
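+
+# Update semantics (illustrative): PUT /api/alerts/{rule_id} with {"trigger_states": []}
+# clears the state triggers, and a rule must retain at least one trigger event or
+# state after the update (enforced above).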
+
+@app.delete("/api/alerts/{rule_id}")
+async def delete_alert_rule(rule_id: str, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_default):
+ """Delete an alert rule"""
+ try:
+ # Get rule info before deleting for logging
+ rule = monitor.db.get_alert_rule(rule_id)
+ if not rule:
+ raise HTTPException(status_code=404, detail="Alert rule not found")
+
+ rule_name = rule.name
+ success = monitor.db.delete_alert_rule(rule_id)
+ if not success:
+ raise HTTPException(status_code=404, detail="Alert rule not found")
+
+ # Refresh in-memory alert rules
+ # Log alert rule deletion
+ monitor.event_logger.log_alert_rule_deleted(
+ rule_name=rule_name,
+ rule_id=rule_id,
+ triggered_by="user"
+ )
+
+ return {"status": "success", "message": f"Alert rule {rule_id} deleted"}
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to delete alert rule: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+@app.get("/api/alerts/orphaned")
+async def get_orphaned_alerts(authenticated: bool = Depends(verify_session_auth)):
+ """Get alert rules that reference non-existent containers"""
+ try:
+ orphaned = await monitor.check_orphaned_alerts()
+ return {
+ "count": len(orphaned),
+ "orphaned_rules": orphaned
+ }
+ except Exception as e:
+ logger.error(f"Failed to check orphaned alerts: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+# ==================== Blackout Window Routes ====================
+
+@app.get("/api/blackout/status")
+async def get_blackout_status(authenticated: bool = Depends(verify_session_auth)):
+ """Get current blackout window status"""
+ try:
+ is_blackout, window_name = monitor.notification_service.blackout_manager.is_in_blackout_window()
+ return {
+ "is_blackout": is_blackout,
+ "current_window": window_name
+ }
+ except Exception as e:
+ logger.error(f"Error getting blackout status: {e}")
+ return {"is_blackout": False, "current_window": None}
+
+# ==================== Notification Channel Routes ====================
+
+
+@app.get("/api/notifications/template-variables")
+async def get_template_variables(authenticated: bool = Depends(verify_session_auth)):
+ """Get available template variables for notification messages"""
+ return {
+ "variables": [
+ {"name": "{CONTAINER_NAME}", "description": "Name of the container"},
+ {"name": "{CONTAINER_ID}", "description": "Short container ID (12 characters)"},
+ {"name": "{HOST_NAME}", "description": "Name of the Docker host"},
+ {"name": "{HOST_ID}", "description": "ID of the Docker host"},
+ {"name": "{OLD_STATE}", "description": "Previous state of the container"},
+ {"name": "{NEW_STATE}", "description": "New state of the container"},
+ {"name": "{IMAGE}", "description": "Docker image name"},
+ {"name": "{TIMESTAMP}", "description": "Full timestamp (YYYY-MM-DD HH:MM:SS)"},
+ {"name": "{TIME}", "description": "Time only (HH:MM:SS)"},
+ {"name": "{DATE}", "description": "Date only (YYYY-MM-DD)"},
+ {"name": "{RULE_NAME}", "description": "Name of the alert rule"},
+ {"name": "{RULE_ID}", "description": "ID of the alert rule"},
+ {"name": "{TRIGGERED_BY}", "description": "What triggered the alert"},
+ {"name": "{EVENT_TYPE}", "description": "Docker event type (if applicable)"},
+ {"name": "{EXIT_CODE}", "description": "Container exit code (if applicable)"}
+ ],
+ "default_template": """🚨 **DockMon Alert**
+
+**Container:** `{CONTAINER_NAME}`
+**Host:** {HOST_NAME}
+**State Change:** `{OLD_STATE}` → `{NEW_STATE}`
+**Image:** {IMAGE}
+**Time:** {TIMESTAMP}
+**Rule:** {RULE_NAME}
+───────────────────────""",
+ "examples": {
+ "simple": "Alert: {CONTAINER_NAME} on {HOST_NAME} changed from {OLD_STATE} to {NEW_STATE}",
+ "detailed": """🔴 Container Alert
+Container: {CONTAINER_NAME} ({CONTAINER_ID})
+Host: {HOST_NAME}
+Status: {OLD_STATE} → {NEW_STATE}
+Image: {IMAGE}
+Time: {TIMESTAMP}
+Triggered by: {RULE_NAME}""",
+ "minimal": "{CONTAINER_NAME}: {NEW_STATE} at {TIME}"
+ }
+ }
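+
+# Illustrative rendering of the "simple" example template with hypothetical values:
+#   "Alert: nginx on prod-host-1 changed from running to exited"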
+
+@app.get("/api/notifications/channels")
+async def get_notification_channels(authenticated: bool = Depends(verify_session_auth)):
+ """Get all notification channels"""
+ channels = monitor.db.get_notification_channels(enabled_only=False)
+ return [{
+ "id": ch.id,
+ "name": ch.name,
+ "type": ch.type,
+ "config": ch.config,
+ "enabled": ch.enabled,
+ "created_at": ch.created_at.isoformat(),
+ "updated_at": ch.updated_at.isoformat()
+ } for ch in channels]
+
+@app.post("/api/notifications/channels")
+async def create_notification_channel(channel: NotificationChannelCreate, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_notifications):
+ """Create a new notification channel"""
+ try:
+ db_channel = monitor.db.add_notification_channel({
+ "name": channel.name,
+ "type": channel.type,
+ "config": channel.config,
+ "enabled": channel.enabled
+ })
+
+ # Log notification channel creation
+ monitor.event_logger.log_notification_channel_created(
+ channel_name=db_channel.name,
+ channel_type=db_channel.type,
+ triggered_by="user"
+ )
+
+ return {
+ "id": db_channel.id,
+ "name": db_channel.name,
+ "type": db_channel.type,
+ "config": db_channel.config,
+ "enabled": db_channel.enabled,
+ "created_at": db_channel.created_at.isoformat(),
+ "updated_at": db_channel.updated_at.isoformat()
+ }
+ except Exception as e:
+ logger.error(f"Failed to create notification channel: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+@app.put("/api/notifications/channels/{channel_id}")
+async def update_notification_channel(channel_id: int, updates: NotificationChannelUpdate, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_notifications):
+ """Update a notification channel"""
+ try:
+ update_data = {k: v for k, v in updates.dict().items() if v is not None}
+ db_channel = monitor.db.update_notification_channel(channel_id, update_data)
+
+ if not db_channel:
+ raise HTTPException(status_code=404, detail="Channel not found")
+
+ return {
+ "id": db_channel.id,
+ "name": db_channel.name,
+ "type": db_channel.type,
+ "config": db_channel.config,
+ "enabled": db_channel.enabled,
+ "created_at": db_channel.created_at.isoformat(),
+ "updated_at": db_channel.updated_at.isoformat()
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to update notification channel: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+@app.get("/api/notifications/channels/{channel_id}/dependent-alerts")
+async def get_dependent_alerts(channel_id: int, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_notifications):
+ """Get alerts that would be orphaned if this channel is deleted"""
+ try:
+ dependent_alerts = monitor.db.get_alerts_dependent_on_channel(channel_id)
+ return {"alerts": dependent_alerts}
+ except Exception as e:
+ logger.error(f"Failed to get dependent alerts: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
+
+@app.delete("/api/notifications/channels/{channel_id}")
+async def delete_notification_channel(channel_id: int, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_notifications):
+ """Delete a notification channel and cascade delete alerts that would become orphaned"""
+ try:
+ # Find alerts that would be orphaned (only have this channel)
+ affected_alerts = monitor.db.get_alerts_dependent_on_channel(channel_id)
+
+ # Find all alerts that use this channel (for removal from multi-channel alerts)
+ all_alerts = monitor.db.get_alert_rules()
+
+ # Delete the channel
+ success = monitor.db.delete_notification_channel(channel_id)
+ if not success:
+ raise HTTPException(status_code=404, detail="Channel not found")
+
+ # Delete orphaned alerts
+ deleted_alerts = []
+ for alert in affected_alerts:
+ if monitor.db.delete_alert_rule(alert['id']):
+ deleted_alerts.append(alert['name'])
+
+ # Remove channel from multi-channel alerts
+ updated_alerts = []
+ for alert in all_alerts:
+ # Skip if already deleted
+ if alert.id in [a['id'] for a in affected_alerts]:
+ continue
+
+ # Check if this alert uses the deleted channel
+ channels = alert.notification_channels if isinstance(alert.notification_channels, list) else []
+ if channel_id in channels:
+ # Remove the channel
+ new_channels = [ch for ch in channels if ch != channel_id]
+ monitor.db.update_alert_rule(alert.id, {'notification_channels': new_channels})
+ updated_alerts.append(alert.name)
+
+ result = {
+ "status": "success",
+ "message": f"Channel {channel_id} deleted"
+ }
+
+ if deleted_alerts:
+ result["deleted_alerts"] = deleted_alerts
+ result["message"] += f" and {len(deleted_alerts)} orphaned alert(s) removed"
+
+ if updated_alerts:
+ result["updated_alerts"] = updated_alerts
+ if "deleted_alerts" in result:
+ result["message"] += f", {len(updated_alerts)} alert(s) updated"
+ else:
+ result["message"] += f" and {len(updated_alerts)} alert(s) updated"
+
+ return result
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to delete notification channel: {e}")
+ raise HTTPException(status_code=400, detail=str(e))
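+
+# Cascade behavior (illustrative): deleting a channel that is the only channel of one
+# alert and one of several channels of two others would return something like
+#   {"status": "success", "message": "Channel 3 deleted and 1 orphaned alert(s) removed, 2 alert(s) updated",
+#    "deleted_alerts": [...], "updated_alerts": [...]}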
+
+@app.post("/api/notifications/channels/{channel_id}/test")
+async def test_notification_channel(channel_id: int, authenticated: bool = Depends(verify_session_auth), rate_limit_check: bool = rate_limit_notifications):
+ """Test a notification channel"""
+ try:
+ if not hasattr(monitor, 'notification_service'):
+ raise HTTPException(status_code=503, detail="Notification service not available")
+
+ result = await monitor.notification_service.test_channel(channel_id)
+ return result
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to test notification channel: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+# ==================== Event Log Routes ====================
+
+@app.get("/api/events")
+async def get_events(
+ category: Optional[List[str]] = Query(None),
+ event_type: Optional[str] = None,
+ severity: Optional[List[str]] = Query(None),
+ host_id: Optional[List[str]] = Query(None),
+ container_id: Optional[str] = None,
+ container_name: Optional[str] = None,
+ start_date: Optional[str] = None,
+ end_date: Optional[str] = None,
+ correlation_id: Optional[str] = None,
+ search: Optional[str] = None,
+ limit: int = 100,
+ offset: int = 0,
+ hours: Optional[int] = None,
+ authenticated: bool = Depends(verify_session_auth),
+ rate_limit_check: bool = rate_limit_default
+):
+ """
+ Get events with filtering and pagination
+
+ Query parameters:
+ - category: Filter by category (container, host, system, alert, notification)
+ - event_type: Filter by event type (state_change, action_taken, etc.)
+ - severity: Filter by severity (debug, info, warning, error, critical)
+ - host_id: Filter by specific host
+ - container_id: Filter by specific container
+ - container_name: Filter by container name (partial match)
+ - start_date: Filter events after this date (ISO 8601 format)
+ - end_date: Filter events before this date (ISO 8601 format)
+ - hours: Shortcut to get events from last X hours (overrides start_date)
+ - correlation_id: Get related events
+ - search: Search in title, message, and container name
+ - limit: Number of results per page (default 100, max 500)
+ - offset: Pagination offset
+ """
+ try:
+ # Validate and parse dates
+ start_datetime = None
+ end_datetime = None
+
+ # If hours parameter is provided, calculate start_date
+ if hours is not None:
+ start_datetime = datetime.now() - timedelta(hours=hours)
+ elif start_date:
+ try:
+ start_datetime = datetime.fromisoformat(start_date.replace('Z', '+00:00'))
+ except ValueError:
+ raise HTTPException(status_code=400, detail="Invalid start_date format. Use ISO 8601 format.")
+
+ if end_date:
+ try:
+ end_datetime = datetime.fromisoformat(end_date.replace('Z', '+00:00'))
+ except ValueError:
+ raise HTTPException(status_code=400, detail="Invalid end_date format. Use ISO 8601 format.")
+
+ # Limit maximum results per page
+ if limit > 500:
+ limit = 500
+
+ # Query events from database
+ events, total_count = monitor.db.get_events(
+ category=category,
+ event_type=event_type,
+ severity=severity,
+ host_id=host_id,
+ container_id=container_id,
+ container_name=container_name,
+ start_date=start_datetime,
+ end_date=end_datetime,
+ correlation_id=correlation_id,
+ search=search,
+ limit=limit,
+ offset=offset
+ )
+
+ # Convert to JSON-serializable format
+ events_json = []
+ for event in events:
+ events_json.append({
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ })
+
+ return {
+ "events": events_json,
+ "total_count": total_count,
+ "limit": limit,
+ "offset": offset,
+ "has_more": (offset + limit) < total_count
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to get events: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
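+
+# Example query (illustrative):
+#   GET /api/events?category=container&severity=error&hours=24&limit=50
+# returns up to 50 error-level container events from the last 24 hours, together with
+# total_count and has_more for pagination.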
+
+@app.get("/api/events/{event_id}")
+async def get_event_by_id(
+ event_id: int,
+ authenticated: bool = Depends(verify_session_auth),
+ rate_limit_check: bool = rate_limit_default
+):
+ """Get a specific event by ID"""
+ try:
+ event = monitor.db.get_event_by_id(event_id)
+ if not event:
+ raise HTTPException(status_code=404, detail="Event not found")
+
+ return {
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to get event: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/events/correlation/{correlation_id}")
+async def get_events_by_correlation(
+ correlation_id: str,
+ authenticated: bool = Depends(verify_session_auth),
+ rate_limit_check: bool = rate_limit_default
+):
+ """Get all events with the same correlation ID (related events)"""
+ try:
+ events = monitor.db.get_events_by_correlation(correlation_id)
+
+ events_json = []
+ for event in events:
+ events_json.append({
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ })
+
+ return {"events": events_json, "count": len(events_json)}
+ except Exception as e:
+ logger.error(f"Failed to get events by correlation: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+# ==================== User Dashboard Routes ====================
+
+@app.get("/api/user/dashboard-layout")
+async def get_dashboard_layout(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Get dashboard layout for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ layout = monitor.db.get_dashboard_layout(username)
+ return {"layout": layout}
+
+@app.post("/api/user/dashboard-layout")
+async def save_dashboard_layout(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Save dashboard layout for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ try:
+ body = await request.json()
+ layout_json = body.get('layout')
+
+ if layout_json is None:
+ raise HTTPException(status_code=400, detail="Layout is required")
+
+ # Validate JSON structure
+ if layout_json:
+ try:
+ parsed_layout = json.loads(layout_json) if isinstance(layout_json, str) else layout_json
+
+ # Validate it's a list
+ if not isinstance(parsed_layout, list):
+ raise HTTPException(status_code=400, detail="Layout must be an array of widget positions")
+
+ # Validate each widget has required fields
+ required_fields = ['x', 'y', 'w', 'h']
+ for widget in parsed_layout:
+ if not isinstance(widget, dict):
+ raise HTTPException(status_code=400, detail="Each widget must be an object")
+ for field in required_fields:
+ if field not in widget:
+ raise HTTPException(status_code=400, detail=f"Widget missing required field: {field}")
+ if not isinstance(widget[field], (int, float)):
+ raise HTTPException(status_code=400, detail=f"Widget field '{field}' must be a number")
+
+ # Convert back to string for storage
+ layout_json = json.dumps(parsed_layout)
+ except json.JSONDecodeError:
+ raise HTTPException(status_code=400, detail="Invalid JSON format for layout")
+
+ success = monitor.db.save_dashboard_layout(username, layout_json)
+ if not success:
+ raise HTTPException(status_code=500, detail="Failed to save layout")
+
+ return {"success": True}
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to save dashboard layout: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
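+
+# Example request body (illustrative): {"layout": [{"x": 0, "y": 0, "w": 4, "h": 3},
+# {"x": 4, "y": 0, "w": 8, "h": 3}]} - each widget needs numeric x, y, w and h; the
+# layout may also be sent as a JSON-encoded string.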
+
+@app.get("/api/user/event-sort-order")
+async def get_event_sort_order(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Get event sort order preference for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ sort_order = monitor.db.get_event_sort_order(username)
+ return {"sort_order": sort_order}
+
+@app.post("/api/user/event-sort-order")
+async def save_event_sort_order(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Save event sort order preference for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ try:
+ body = await request.json()
+ sort_order = body.get('sort_order')
+
+ if sort_order not in ['asc', 'desc']:
+ raise HTTPException(status_code=400, detail="sort_order must be 'asc' or 'desc'")
+
+ success = monitor.db.save_event_sort_order(username, sort_order)
+ if not success:
+ raise HTTPException(status_code=500, detail="Failed to save sort order")
+
+ return {"success": True}
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to save event sort order: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/user/container-sort-order")
+async def get_container_sort_order(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Get container sort order preference for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ sort_order = monitor.db.get_container_sort_order(username)
+ return {"sort_order": sort_order}
+
+@app.post("/api/user/container-sort-order")
+async def save_container_sort_order(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Save container sort order preference for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ try:
+ body = await request.json()
+ sort_order = body.get('sort_order')
+
+ valid_sorts = ['name-asc', 'name-desc', 'status', 'memory-desc', 'memory-asc', 'cpu-desc', 'cpu-asc']
+ if sort_order not in valid_sorts:
+ raise HTTPException(status_code=400, detail=f"sort_order must be one of: {', '.join(valid_sorts)}")
+
+ success = monitor.db.save_container_sort_order(username, sort_order)
+ if not success:
+ raise HTTPException(status_code=500, detail="Failed to save sort order")
+
+ return {"success": True}
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to save container sort order: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/user/modal-preferences")
+async def get_modal_preferences(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Get modal preferences for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ preferences = monitor.db.get_modal_preferences(username)
+ return {"preferences": preferences}
+
+@app.post("/api/user/modal-preferences")
+async def save_modal_preferences(request: Request, authenticated: bool = Depends(verify_session_auth)):
+ """Save modal preferences for current user"""
+ from auth.routes import _get_session_from_cookie
+ from auth.session_manager import session_manager
+
+ session_id = _get_session_from_cookie(request)
+ username = session_manager.get_session_username(session_id)
+
+ try:
+ body = await request.json()
+ preferences = body.get('preferences')
+
+ if preferences is None:
+ raise HTTPException(status_code=400, detail="Preferences are required")
+
+ success = monitor.db.save_modal_preferences(username, preferences)
+ if not success:
+ raise HTTPException(status_code=500, detail="Failed to save preferences")
+
+ return {"success": True}
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to save modal preferences: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+# ==================== Event Log Routes ====================
+# Note: Main /api/events endpoint is defined earlier with full feature set
+
+@app.get("/api/events/{event_id}")
+async def get_event(event_id: int, authenticated: bool = Depends(verify_session_auth)):
+ """Get a specific event by ID"""
+ try:
+ event = monitor.db.get_event_by_id(event_id)
+ if not event:
+ raise HTTPException(status_code=404, detail="Event not found")
+
+ return {
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to get event {event_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/events/correlation/{correlation_id}")
+async def get_events_by_correlation(correlation_id: str, authenticated: bool = Depends(verify_session_auth)):
+ """Get all events with the same correlation ID"""
+ try:
+ events = monitor.db.get_events_by_correlation(correlation_id)
+
+ return {
+ "correlation_id": correlation_id,
+ "events": [{
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ } for event in events]
+ }
+ except Exception as e:
+ logger.error(f"Failed to get events by correlation {correlation_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/events/statistics")
+async def get_event_statistics(start_date: Optional[str] = None,
+ end_date: Optional[str] = None,
+ authenticated: bool = Depends(verify_session_auth)):
+ """Get event statistics for dashboard"""
+ try:
+ # Parse dates
+ parsed_start_date = None
+ parsed_end_date = None
+
+ if start_date:
+ try:
+ parsed_start_date = datetime.fromisoformat(start_date.replace('Z', '+00:00'))
+ except ValueError:
+ raise HTTPException(status_code=400, detail="Invalid start_date format")
+
+ if end_date:
+ try:
+ parsed_end_date = datetime.fromisoformat(end_date.replace('Z', '+00:00'))
+ except ValueError:
+ raise HTTPException(status_code=400, detail="Invalid end_date format")
+
+ stats = monitor.db.get_event_statistics(
+ start_date=parsed_start_date,
+ end_date=parsed_end_date
+ )
+
+ return stats
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to get event statistics: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/events/container/{container_id}")
+async def get_container_events(container_id: str, limit: int = 50, authenticated: bool = Depends(verify_session_auth)):
+ """Get events for a specific container"""
+ try:
+ events, total_count = monitor.db.get_events(
+ container_id=container_id,
+ limit=limit,
+ offset=0
+ )
+
+ return {
+ "container_id": container_id,
+ "events": [{
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ } for event in events],
+ "total_count": total_count
+ }
+ except Exception as e:
+ logger.error(f"Failed to get events for container {container_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.get("/api/events/host/{host_id}")
+async def get_host_events(host_id: str, limit: int = 50, authenticated: bool = Depends(verify_session_auth)):
+ """Get events for a specific host"""
+ try:
+ events, total_count = monitor.db.get_events(
+ host_id=host_id,
+ limit=limit,
+ offset=0
+ )
+
+ return {
+ "host_id": host_id,
+ "events": [{
+ "id": event.id,
+ "correlation_id": event.correlation_id,
+ "category": event.category,
+ "event_type": event.event_type,
+ "severity": event.severity,
+ "host_id": event.host_id,
+ "host_name": event.host_name,
+ "container_id": event.container_id,
+ "container_name": event.container_name,
+ "title": event.title,
+ "message": event.message,
+ "old_state": event.old_state,
+ "new_state": event.new_state,
+ "triggered_by": event.triggered_by,
+ "details": event.details,
+ "duration_ms": event.duration_ms,
+ "timestamp": event.timestamp.isoformat()
+ } for event in events],
+ "total_count": total_count
+ }
+ except Exception as e:
+ logger.error(f"Failed to get events for host {host_id}: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.delete("/api/events/cleanup")
+async def cleanup_old_events(days: int = 30, authenticated: bool = Depends(verify_session_auth)):
+ """Clean up old events - DANGEROUS: Can delete audit logs"""
+ try:
+ if days < 1:
+ raise HTTPException(status_code=400, detail="Days must be at least 1")
+
+ deleted_count = monitor.db.cleanup_old_events(days)
+
+ monitor.event_logger.log_system_event(
+ "Event Cleanup Completed",
+ f"Cleaned up {deleted_count} events older than {days} days",
+ EventSeverity.INFO,
+ EventType.STARTUP
+ )
+
+ return {
+ "status": "success",
+ "message": f"Cleaned up {deleted_count} events older than {days} days",
+ "deleted_count": deleted_count
+ }
+ except HTTPException:
+ raise
+ except Exception as e:
+ logger.error(f"Failed to cleanup events: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
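+# Client messages handled by the /ws endpoint below (illustrative JSON; field names
+# come from the handler): {"type": "subscribe_stats", "container_id": "<id>"},
+# {"type": "unsubscribe_stats", "container_id": "<id>"},
+# {"type": "modal_opened", "container_id": "<id>", "host_id": "<id>"},
+# {"type": "modal_closed", "container_id": "<id>", "host_id": "<id>"}, {"type": "ping"}.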
+@app.websocket("/ws")
+async def websocket_endpoint(websocket: WebSocket):
+ """WebSocket endpoint for real-time updates"""
+ # Generate a unique connection ID for rate limiting
+ connection_id = f"ws_{id(websocket)}_{time.time()}"
+
+ await monitor.manager.connect(websocket)
+ await monitor.realtime.subscribe_to_events(websocket)
+
+ # Send initial state
+ settings_dict = {
+ "max_retries": monitor.settings.max_retries,
+ "retry_delay": monitor.settings.retry_delay,
+ "default_auto_restart": monitor.settings.default_auto_restart,
+ "polling_interval": monitor.settings.polling_interval,
+ "connection_timeout": monitor.settings.connection_timeout,
+ "log_retention_days": monitor.settings.log_retention_days,
+ "enable_notifications": monitor.settings.enable_notifications,
+ "alert_template": getattr(monitor.settings, 'alert_template', None),
+ "blackout_windows": getattr(monitor.settings, 'blackout_windows', None),
+ "timezone_offset": getattr(monitor.settings, 'timezone_offset', 0),
+ "show_host_stats": getattr(monitor.settings, 'show_host_stats', True),
+ "show_container_stats": getattr(monitor.settings, 'show_container_stats', True)
+ }
+
+ initial_state = {
+ "type": "initial_state",
+ "data": {
+ "hosts": [h.dict() for h in monitor.hosts.values()],
+ "containers": [c.dict() for c in await monitor.get_containers()],
+ "settings": settings_dict
+ }
+ }
+ await websocket.send_text(json.dumps(initial_state, cls=DateTimeEncoder))
+
+ try:
+ while True:
+ # Keep connection alive and handle incoming messages
+ message = await websocket.receive_json()
+
+ # Check rate limit for incoming messages
+ allowed, reason = ws_rate_limiter.check_rate_limit(connection_id)
+ if not allowed:
+ # Send rate limit error to client
+ await websocket.send_text(json.dumps({
+ "type": "error",
+ "error": "rate_limit",
+ "message": reason
+ }))
+ # Don't process the message
+ continue
+
+ # Handle different message types
+ if message.get("type") == "subscribe_stats":
+ container_id = message.get("container_id")
+ if container_id:
+ await monitor.realtime.subscribe_to_stats(websocket, container_id)
+ # Find the host and start monitoring
+ for host_id, client in monitor.clients.items():
+ try:
+ client.containers.get(container_id)
+ await monitor.realtime.start_container_stats_stream(
+ client, container_id, interval=2
+ )
+ break
+ except Exception as e:
+ logger.debug(f"Container {container_id} not found on host {host_id[:8]}: {e}")
+ continue
+
+ elif message.get("type") == "unsubscribe_stats":
+ container_id = message.get("container_id")
+ if container_id:
+ await monitor.realtime.unsubscribe_from_stats(websocket, container_id)
+
+ elif message.get("type") == "modal_opened":
+ # Track that a container modal is open - keep stats running for this container
+ container_id = message.get("container_id")
+ host_id = message.get("host_id")
+ if container_id and host_id:
+ # Verify container exists and user has access to it
+ try:
+ containers = await monitor.get_containers() # Must await async function
+ container_exists = any(
+ c.id == container_id and c.host_id == host_id
+ for c in containers
+ )
+ if container_exists:
+ monitor.stats_manager.add_modal_container(container_id, host_id)
+ else:
+ logger.warning(f"User attempted to access stats for non-existent container: {container_id[:12]} on host {host_id[:8]}")
+ except Exception as e:
+ logger.error(f"Error validating container access: {e}")
+
+ elif message.get("type") == "modal_closed":
+ # Remove container from modal tracking
+ container_id = message.get("container_id")
+ host_id = message.get("host_id")
+ if container_id and host_id:
+ monitor.stats_manager.remove_modal_container(container_id, host_id)
+
+ elif message.get("type") == "ping":
+ await websocket.send_text(json.dumps({"type": "pong"}, cls=DateTimeEncoder))
+
+ except WebSocketDisconnect:
+ await monitor.manager.disconnect(websocket)
+ await monitor.realtime.unsubscribe_from_events(websocket)
+ # Unsubscribe from all stats
+ for container_id in list(monitor.realtime.stats_subscribers):
+ await monitor.realtime.unsubscribe_from_stats(websocket, container_id)
+ # Clear modal containers (user disconnected, modals are closed)
+ monitor.stats_manager.clear_modal_containers()
+ # Clean up rate limiter tracking
+ ws_rate_limiter.cleanup_connection(connection_id)
\ No newline at end of file
diff --git a/dockmon/backend/models/__init__.py b/dockmon/backend/models/__init__.py
new file mode 100644
index 0000000..6bfcba3
--- /dev/null
+++ b/dockmon/backend/models/__init__.py
@@ -0,0 +1 @@
+# Data models module for DockMon
\ No newline at end of file
diff --git a/dockmon/backend/models/auth_models.py b/dockmon/backend/models/auth_models.py
new file mode 100644
index 0000000..21deba2
--- /dev/null
+++ b/dockmon/backend/models/auth_models.py
@@ -0,0 +1,18 @@
+"""
+Authentication Models for DockMon
+Pydantic models for authentication requests and responses
+"""
+
+from pydantic import BaseModel, Field
+
+
+class LoginRequest(BaseModel):
+ """Login request model with validation"""
+ username: str = Field(..., min_length=1, max_length=50)
+ password: str = Field(..., min_length=1, max_length=100)
+
+
+class ChangePasswordRequest(BaseModel):
+ """Change password request model"""
+ current_password: str = Field(..., min_length=1, max_length=100)
+ new_password: str = Field(..., min_length=8, max_length=100) # Minimum 8 characters for security
\ No newline at end of file
diff --git a/dockmon/backend/models/docker_models.py b/dockmon/backend/models/docker_models.py
new file mode 100644
index 0000000..4ee5aab
--- /dev/null
+++ b/dockmon/backend/models/docker_models.py
@@ -0,0 +1,175 @@
+"""
+Docker Models for DockMon
+Pydantic models for Docker hosts, containers, and configurations
+"""
+
+import re
+import uuid
+import logging
+from datetime import datetime
+from typing import Optional
+
+from pydantic import BaseModel, Field, validator, model_validator
+
+
+logger = logging.getLogger(__name__)
+
+
+class DockerHostConfig(BaseModel):
+ """Configuration for a Docker host"""
+ name: str = Field(..., min_length=1, max_length=100, pattern=r'^[a-zA-Z0-9][a-zA-Z0-9 ._-]*$')
+ url: str = Field(..., min_length=1, max_length=500)
+ tls_cert: Optional[str] = Field(None, max_length=10000)
+ tls_key: Optional[str] = Field(None, max_length=10000)
+ tls_ca: Optional[str] = Field(None, max_length=10000)
+
+ @validator('name')
+ def validate_name(cls, v):
+ """Validate host name for security"""
+ if not v or not v.strip():
+ raise ValueError('Host name cannot be empty')
+ # Prevent XSS and injection
+ sanitized = re.sub(r'[<>"\']', '', v.strip())
+ if len(sanitized) != len(v.strip()):
+ raise ValueError('Host name contains invalid characters')
+ return sanitized
+
+ @validator('url')
+ def validate_url(cls, v):
+ """Validate Docker URL for security - prevent SSRF attacks"""
+ if not v or not v.strip():
+ raise ValueError('URL cannot be empty')
+
+ v = v.strip()
+
+ # Only allow specific protocols
+ allowed_protocols = ['tcp://', 'unix://', 'http://', 'https://']
+ if not any(v.startswith(proto) for proto in allowed_protocols):
+ raise ValueError('URL must use tcp://, unix://, http:// or https:// protocol')
+
+ # Block ONLY the most dangerous SSRF targets (cloud metadata & loopback)
+ # Allow private networks (10.*, 172.16-31.*, 192.168.*) for legitimate Docker hosts
+ extremely_dangerous_patterns = [
+ r'169\.254\.169\.254', # AWS/GCP metadata (specific)
+ r'169\.254\.', # Link-local range (broader)
+ r'metadata\.google\.internal', # GCP metadata
+ r'metadata\.goog', # GCP metadata alternative
+ r'100\.100\.100\.200', # Alibaba Cloud metadata
+ r'fd00:ec2::254', # AWS IPv6 metadata
+ r'0\.0\.0\.0', # All interfaces binding
+ r'::1', # IPv6 localhost
+ r'localhost(?![:\d])', # Bare localhost but allow localhost:port
+ r'127\.0\.0\.(?!1(?:[:/]|$))', # 127.x.x.x but allow 127.0.0.1
+ ]
+
+ # Check for extremely dangerous metadata service targets
+ for pattern in extremely_dangerous_patterns:
+ if re.search(pattern, v, re.IGNORECASE):
+ # Special handling for localhost - allow localhost:port but block bare localhost
+ if 'localhost' in pattern.lower() and re.search(r'localhost:\d+', v, re.IGNORECASE):
+ continue # Allow localhost:2376 etc
+ raise ValueError('URL targets cloud metadata service or dangerous internal endpoint')
+
+ # Additional validation: warn about but allow private networks
+ private_network_patterns = [
+ r'10\.', # 10.0.0.0/8
+ r'172\.(1[6-9]|2[0-9]|3[01])\.', # 172.16.0.0/12
+ r'192\.168\.', # 192.168.0.0/16
+ ]
+
+ # Log private network usage for monitoring (but don't block)
+ for pattern in private_network_patterns:
+ if re.search(pattern, v, re.IGNORECASE):
+ logger.info(f"Docker host configured on private network: {v[:50]}...")
+ break
+
+ return v
+
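+ # Illustrative outcomes of this validator: tcp://192.168.1.50:2376 and
+ # unix:///var/run/docker.sock are accepted (private networks are allowed and logged),
+ # while tcp://169.254.169.254 or ftp://somehost are rejected.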
+ @validator('tls_cert', 'tls_key', 'tls_ca')
+ def validate_certificate(cls, v):
+ """Validate TLS certificate data"""
+ if v is None:
+ return v
+
+ v = v.strip()
+ if not v:
+ return None
+
+ # Basic PEM format validation with helpful error messages
+ if '-----BEGIN' not in v and '-----END' not in v:
+ raise ValueError('Certificate is incomplete. PEM certificates must start with "-----BEGIN" and end with "-----END". Please copy the entire certificate including both lines.')
+ elif '-----BEGIN' not in v:
+ raise ValueError('Certificate is missing the "-----BEGIN" header line. Make sure you copied the complete certificate starting from the "-----BEGIN" line.')
+ elif '-----END' not in v:
+ raise ValueError('Certificate is missing the "-----END" footer line. Make sure you copied the complete certificate including the "-----END" line.')
+
+ # Block potential code injection
+ dangerous_patterns = ['
[The remainder of this hunk is garbled in extraction: the rest of docker_models.py and the diff for the frontend page were reduced to scattered text fragments. The recoverable fragments describe the Docker Monitor v1.1.2 UI: the dashboard with host/container sorting and a blackout-window indicator, the events view (time range, host, and search filters), the multi-container log viewer (container picker, line count, search, timestamp toggle), the settings panels (auto-restart retries and delay, polling interval, connection timeout, host and container statistics toggles), the notification settings tabs (channels, message template with {CONTAINER_NAME}-style variables, blackout windows and how they defer alerts), the container detail modal (Info / Statistics / Logs with CPU, memory, network, and disk charts), and the mobile navigation bar.]