The "failure" wasn't a system collapse—it was the system getting confused by its own shadow.
To help narrow this down, please tell me what runs the health check (e.g., AWS Lambda, an EC2 script, or a third-party tool). Sharing any specific error codes or error messages from your logs will also help me provide a more targeted solution.
He dove into the alert logs. Just seconds before the health checker tripped, he saw a flurry of ORA-15130 errors: diskgroup "DATA" is being dismounted. This was the DBA equivalent of a ship taking on water.
By default, Oracle ASM expects a periodic heartbeat write to the Partnership and Status Table (PST) on the disks. If your storage area network (SAN) or network-attached storage (NAS) undergoes a controller failover or a brief network hang that lasts longer than the threshold (typically 15 seconds), the heartbeat fails. The Health Checker registers this temporary drop as a localized failure.
3. Duplicate Disk Discovery (ORA-15063 / ORA-15020)
The system is designed to run these checks periodically. When it finds a new issue, it logs the message "ASM Health Checker found 1 new failures" in the ASM alert log, located at <GRID_HOME>/diag/asm/+asm/+<ASM>/alert/log.xml. This alert is often also picked up and reported by Oracle Enterprise Manager (OEM) as a 'Checker Failure Detected' event.
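A minimal sketch of how that message could be picked out of an alert log, assuming the log is read as plain text line by line. The helper names (`find_checker_failures`, `scan_alert_log`) are hypothetical and not part of any Oracle tooling, and the sample lines are modeled on the messages quoted in this article.

```python
import re

# Message text as quoted in the alert log; the failure count can vary.
CHECKER_RE = re.compile(r"ASM Health Checker found (\d+) new failures", re.IGNORECASE)


def find_checker_failures(lines):
    """Return (line_number, failure_count) for each Health Checker message."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = CHECKER_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits


def scan_alert_log(path):
    """Scan an alert log file; the path is supplied by the caller (assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_checker_failures(handle)


# Sample lines modeled on the alert log messages quoted in this article.
sample = [
    'ERROR: ORA-15130: diskgroup "LOG" is being dismounted.',
    "ASM Health Checker found 1 new failures",
]
print(find_checker_failures(sample))  # [(2, 1)]
```

Because the match is a plain substring search, it should also work on the XML form of the log, where the message text is wrapped in markup.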
When managing high-performance databases on Oracle Automatic Storage Management (ASM), anomalies in the storage infrastructure can lead to unexpected downtime if left unaddressed. One critical message that database administrators (DBAs) might encounter in their ASM alert logs is: "ASM Health Checker found 1 new failures".
The execution role used by the health checker lost its secretsmanager:GetSecretValue or secretsmanager:DescribeSecret permissions.
Are you seeing any for your users?
asmcmd chkdg DATA
Enable ASM Scrub (foreground checking):
WARNING: Offline of disk 3 (LOGA2) in group 2 failed on ASM inst 1.
ERROR: ORA-15130: diskgroup "LOG" is being dismounted.
ASM Health Checker found 1 new failures.
2. Storage Heartbeat Failures and Timeouts
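A toy model can make the heartbeat timeout concrete. The sketch below is a simplification: it treats the heartbeat as a list of timestamps of successful PST writes and flags any gap longer than a threshold. The 15-second default mirrors the figure cited in this article and may differ by release. The function name and the sample timestamps are hypothetical, and real ASM tracks I/O waits internally rather than exposing such a list.

```python
def missed_heartbeats(write_times, threshold_s=15.0):
    """Return (earlier, later) pairs of consecutive successful writes
    whose spacing exceeds the threshold, i.e. a missed heartbeat window."""
    gaps = []
    for earlier, later in zip(write_times, write_times[1:]):
        if later - earlier > threshold_s:
            gaps.append((earlier, later))
    return gaps


# Hypothetical timestamps in seconds: writes every 3 s, then a 21 s stall
# such as a controller failover might cause, then recovery.
times = [0, 3, 6, 9, 30, 33]
print(missed_heartbeats(times))  # [(9, 30)]
```

A stall that stays under the threshold produces no entry, which is why short storage hiccups usually go unnoticed while a slow failover can take a disk offline.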
Any SAN, multipath, or OS upgrade should trigger a manual health check:
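One way to script such a check is to generate the SQL for each disk group and run it from SQL*Plus against the ASM instance. The sketch below only builds the statements; it does not connect to anything. `check_statements` is a hypothetical helper, the disk group names DATA and LOG are taken from the log excerpts in this article, and the name validation is a conservative assumption rather than Oracle's full naming rule.

```python
import re

# Conservative pattern for disk group names (assumption, stricter than Oracle's rules).
_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")


def check_statements(diskgroups, repair=False):
    """Build one ALTER DISKGROUP ... CHECK statement per disk group.

    NOREPAIR only reports inconsistencies; REPAIR also attempts to fix them.
    """
    mode = "REPAIR" if repair else "NOREPAIR"
    statements = []
    for name in diskgroups:
        if not _NAME_RE.match(name):
            raise ValueError(f"unexpected disk group name: {name!r}")
        statements.append(f"ALTER DISKGROUP {name.upper()} CHECK {mode};")
    return statements


print("\n".join(check_statements(["DATA", "LOG"])))
# ALTER DISKGROUP DATA CHECK NOREPAIR;
# ALTER DISKGROUP LOG CHECK NOREPAIR;
```

Starting with NOREPAIR keeps the first pass read-only, so the findings in the alert log can be reviewed before any repair is attempted.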