Sensu Alerts & Operational Issues - Ard Labs

This document tracks common operational issues, their resolutions, and troubleshooting procedures for ModMS and Pinpoint services.

Operational Issues Log

Component	Example	Server	Issue	Details	Action Taken	Command
Sams Server Raw Data Storage		`sams.devops.arabiaweather.com`	Disk filling with raw data	Satellite data accumulation	Regularly delete old raw data	`cd /data/raw-data` `# remove old data`
Sams Log Files	Sensu Alert	`sams.devops.arabiaweather.com`	Excessive log file growth	Log files grew too large on the Sams server	Truncated log files	`df -h` `find / -xdev -type f -size +500M 2>/dev/null` `truncate -s 0 /var/lib/docker/containers/<container-id>/*.log`
Historical Server pengine Cronjob		`htz-historical-01` (144.76.56.17)	High memory usage	`pengine` cronjob consuming excessive memory	Killed all `pengine` processes	`killall -9 pengine`
Redis Sentinel Alert		`cluster-n03`, `cluster-n04`	Unnecessary monitoring alert	Only two Redis nodes; Sentinel not required	Disabled Redis Sentinel alert	No command specified
Redis Ping Failure (node03 → node04)		`cluster-n03`, `cluster-n04`	Slave-to-master ping failure	node03 slave couldn’t ping node04 master on port 3680 through HAProxy	Restarted HAProxy on node03	`redis-cli info replication` `systemctl reload haproxy`
Bader 1		`bader-deploy` (85.10.197.28)	Containers unhealthy	Docker containers showing unhealthy status	Restart unhealthy containers	`docker container ls -a` `# restart unhealthy container`
Bader 2		`bader.arabiaweather.com`	Restart order causing lost events	Removed `last_state` and cleared processed map keys; aggregator republished before engine was ready	Restart engine first, then aggregator	`cd data/aggregator-downloads/` `rm last_state` `redis-cli keys 'maps:processed_files:*' | xargs redis-cli del` `systemctl restart engine.service` `systemctl restart aggregator.service`
Charts Archiver Jobs	Sensu Alert	`backend-charts-1` (3.249.150.105), `backend-charts-2` (34.246.218.122)	Disk space not released after files were deleted	Deleted files still held open by running processes kept consuming disk space, preventing proper operation	Check disk space, identify the processes holding deleted files, stop them, then start unicorn again	`df -h` `sudo lsof | grep deleted` `# stop the identified processes` `cd /path/to/aw_charts_backend` `/etc/init.d/unicorn start`
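
Several rows in the table leave the actual cleanup step as a comment (`# remove old data`, `# restart unhealthy container`). The following is a minimal sketch of how those steps could be scripted, not the team's established procedure. The function names, the dry-run default, and any retention period passed in are assumptions for illustration; only `find`, `docker ps --filter health=unhealthy`, `docker restart`, and `lsof +L1` are standard tool behavior.

```shell
#!/usr/bin/env bash
# Sketch only: helper functions for the cleanup steps listed in the table.
# Function names and the dry-run default are assumptions, not documented policy.

# List (default) or delete regular files older than DAYS days under DIR.
# Usage: prune_old_files DIR DAYS [--delete]
prune_old_files() {
    local dir="$1" days="$2" mode="${3:-}"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    if [ "$mode" = "--delete" ]; then
        # Print each file as it is removed so the run leaves a record.
        find "$dir" -xdev -type f -mtime +"$days" -print -delete
    else
        # Dry run: only show what would be removed.
        find "$dir" -xdev -type f -mtime +"$days" -print
    fi
}

# Restart every container that Docker currently reports as unhealthy.
restart_unhealthy_containers() {
    local ids
    ids="$(docker ps --filter health=unhealthy --quiet)"
    [ -n "$ids" ] || { echo "no unhealthy containers"; return 0; }
    # Word splitting is intentional: one container ID per argument.
    # shellcheck disable=SC2086
    docker restart $ids
}

# Show files that were deleted but are still held open by a process
# (link count below 1); the owning PID is in the second column.
list_deleted_open_files() {
    sudo lsof +L1
}
```

For example, `prune_old_files /data/raw-data 30` would only list raw-data files older than 30 days (the 30-day retention is a placeholder), and adding `--delete` as a third argument removes them. Running the dry-run form first makes it easier to confirm that nothing still needed is about to be deleted.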

Known Instability Issues

ModMS & Pinpoint Operations
