lm-sensors: hardware health monitoring project.
safte-monitor. SAF-TE stands for SCSI Accessed Fault-Tolerant Enclosures. If you have a SAF-TE compatible storage enclosure, safte-monitor lets you read the enclosure configuration, including the number of fans, power supplies, and slots, and the mapping of slots to SCSI IDs. safte-monitor reads disk enclosure status information from SAF-TE capable enclosures. SAF-TE is a component of SES (SCSI Enclosure Services), which is common on most SCSI disk enclosures these days. saftemon can monitor multiple SAF-TE devices and will automatically probe and detect them. The information retrieved includes power supply and fan status, temperature, audible alarm, drive faults, array critical / failed / rebuilding state, and door lock status. saftemon logs changes in the status of these enclosure elements to syslog and can optionally execute an alert helper program with details of the component failure; this could, for example, send a pager message. Temperature alert limits can also be set.
The SAF-TE spec is on the web, and an addendum is at this location. More information about the specification is available here.
Linux RAS project. Ambitious. Mailing list information can be found on the web.
Linux RAM ECC monitoring, with a corresponding mailing list.
Chris Brady's x86 Memory Testing program (memtest86). This ships with newer versions of SuSE Linux.
Mon: service monitoring daemon.
HAPM: another service monitoring daemon. High Availability Port Monitor (HAPM) is a local port status checker. It is a simple, light, and fast daemon that checks TCP/UDP ports. If one or more monitored ports (per IP) go down, HAPM kills the primary Heartbeat.
OpenNMS is an open-source project dedicated to the creation of an enterprise grade network management platform.
Spumoni enables any program which can be queried via local commands to be health-checked via SNMP. This allows admins to use enterprise-level monitoring programs such as OpenNMS, Tivoli, OpenView, MRTG and RRDTool for even non-SNMP-enabled applications.
Monit: Monit is a utility for monitoring daemons or similar programs running on a Unix system. It will start specified programs if they are not running and restart programs that are not responding.
VACM: VA Cluster Manager. VACM provides cluster monitoring and control at a very fundamental level.
PIKT: Problem Informant/Killer Tool
NOCOL/SNIPS - system and network monitoring tool
Big Brother - Systems and Network monitor. It monitors both network and system information.
Big Sister - a real time system and network health monitoring application
Nagios - Network monitor (formerly NetSaint)
MAT - Monitoring and Administration Tool. MAT is an easy-to-use, network-enabled UNIX configuration and monitoring tool. It provides an integrated tool for many common system administration tasks, including backups and replication.
WebRAT - a web-based administration tool for administering several nodes on a network from a central host (the administration server). The main purpose of WebRAT is to administer a network with many nodes remotely. The more nodes there are on the network, the more irreplaceable WebRAT will seem.
xCAT (Extreme Cluster Administration Toolkit) - A toolkit that can be used for the deployment and administration of (primarily high-performance) Linux clusters.
dwatch - daemon watching program -- last updated in 2001 -- appears to be dead
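
The local port status check described in the HAPM entry above can be sketched roughly as follows. This is only an illustration under stated assumptions, not HAPM's actual code: the function names `port_is_open` and `failed_ports`, the one-second timeout, and the example ports are all hypothetical. Only TCP is handled, and the failure action (HAPM kills the primary Heartbeat) is left to the caller.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or
        # an unreachable host; any of these counts as "down" here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_ports(host, ports, timeout=1.0):
    """Return the ports on host that do not accept TCP connections."""
    return [p for p in ports if not port_is_open(host, p, timeout)]


if __name__ == "__main__":
    # Hypothetical example: check SSH and HTTP on the local machine.
    down = failed_ports("127.0.0.1", [22, 80])
    if down:
        # A real monitor would trigger its failover action here.
        print("ports down:", down)
    else:
        print("all monitored ports are up")
```

A daemon built on this idea would run the check in a loop at a fixed interval and act only when the list of failed ports is non-empty.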