
Autonomous Health Framework (Part 1)

Autonomous Health Framework: How to Use Your Database "Swiss Army Knife" (Without Poking an Eye Out!) Part 1.

APAC GroundBreakers Tour 2022

Sean Scott

April 27, 2022

Transcript

  1. @ViscosityNA www.viscosityna.com Sean Scott Working with Oracle DB since 1995

    • Application Developer • “Accidental” DBA • Application & DB Design • HA, DR & Site Continuity Design • Database Reliability Engineering • Automation Engineer • Cloud Architecture
  2. @ViscosityNA www.viscosityna.com Sean Scott Professional Focus • MAA: RAC, Data

    Guard, Sharding • Disaster Recovery, High Availability, Site Continuity • Diagnostic Tools: AHF/TFA, CHA, RDA • Upgrades & Migrations • Engineered Systems: Exadata, ODA
  3. @ViscosityNA www.viscosityna.com Sean Scott Professional Focus • Infrastructure as Code

    • Docker & Linux Containers • Automation • Virtualization • Cloud & Cloud Native Technologies
  4. Why you need AHF • AHF diagnostic collections required by

    MOS for some SRs • Diagnostic collections accelerate SR resolution • Cluster-aware ADR log inspection and management • Advanced system and log monitoring • Incident control and notification • Connect to MOS • SMTP, REST APIs
  5. Why you need AHF • Built-in Oracle tools: • ORAchk/EXAchk

    • OS Watcher • Cluster Verification Utility (CVU) • Hang Manager • Diagnostic Assistant
  6. Why you need AHF • Integrated with: • Database •

    ASM and Clusterware • Automatic Diagnostic Repository (ADR) • Grid Infrastructure Management Repository (GIMR) • Cluster Health Advisor (CHA) & Cluster Health Monitor (CHM) • Enterprise Manager
  7. Why you need AHF • Cluster aware: • Run commands

    for all, some nodes • Cross-node configuration and file inspection • Central management for ADR • Consolidated diagnostic collection
  8. Why you need AHF • Over 800 health checks •

    400 identified as critical/failures • Severe problem check daily: 2AM • All known problem check weekly: 3AM Sunday • Auto-generates a collection when problems detected • Everything required to diagnose & resolve • Results delivered to the notification email
  9. A brief history lesson… • There are two flavors of

    TFA • A version downloaded from MOS • A version included in Grid Infrastructure install & patches • GI version is not fully featured • GI and MOS versions can interfere, conflict
  10. Download AHF • AHF Parent Page: Doc ID 2550798.1 •

    AHF On-Premises: Doc ID 2832630.1 (New) • Linux, ZLinux • Solaris x86/SPARC64 • HPUX • AIX 6/7 • Win 64-bit • AHF Gen-2 Cloud: Doc ID 2832594.1 (New)
  11. Download AHF • Major release each quarter • Typically follows

    DBRU schedule • Naming convention is year, quarter, release: YY.Q.R • 21.4.0, 21.4.1 • Intermediate releases are common!
  12. Types of installs: Daemon or root • Recommended method •

    Cluster awareness • Full AHF capabilities • Includes compliance checks • Enables notifications • Automatic diagnostic collection when issues are detected • May conflict with existing AHF/TFA installations
  13. Types of installs: Local or non-root • Reduced feature set

    • No automatic or remote diagnostics, collections • Limited file visibility (must be readable by Oracle home owner) • /var/log/messages • Some Grid Infrastructure logs • May co-exist with Daemon installations • No special pre-install considerations
  14. Install AHF • Oracle’s instructions work when things are perfect

    • Systems are rarely perfect! • AHF and TFA are known for certain… ahem, peculiarities
  15. TFA pre-installation checks

    # Uninstall TFA (as root)
    tfactl uninstall

    # Check for existing AHF/TFA installs
    which tfactl
    which ahfctl
  16. TFA pre-installation checks

    # Kill any leftover processes (and make sure they stay that way!)
    pkill "oswbb|OSWatcher*|toolstatus|tfactl"
    sleep 300
    ps -ef | egrep -i "oswbb|OSWatcher|toolstatus|tfactl"

    # Check for leftover, conflicting processes
    ps -ef | egrep -i "oswbb|OSWatcher|ahf|tfa|prw|toolstatus"
  17. TFA pre-installation checks

    # Locate leftover setup configuration files
    find / -name tfa_setup.txt

    # Verify files are removed
    find / -name tfactl
    find / -name startOSWbb.sh
  18. TFA pre-installation checks

    # Remove legacy/existing AHF/TFA installations
    # Ensure ALL AHF/TFA processes are stopped/inactive prior to uninstall
    # PERFORM THIS STEP ON ALL NODES
    for d in $(find / -name uninstalltfa)
    do
      cd $(dirname $d)
      ./tfactl uninstall
      # cd .. && rm -fr .
    done
  19. Installation—unzip [root@vna1 ahf]# ls -l total 407412 -rw-r--r--. 1 oracle

    dba 417185977 Feb 1 23:17 AHF-LINUX_v21.4.1.zip [root@vna1 ahf]# unzip AHF-LINUX_v21.4.1.zip Archive: AHF-LINUX_v21.4.1.zip inflating: README.txt inflating: ahf_setup extracting: ahf_setup.dat inflating: oracle-tfa.pub
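    A minimal sketch of the next step, running the extracted installer. The -ahf_loc and -data_dir flags and the paths are assumptions recalled from the AHF documentation (verify against the installer help); running ahf_setup as root gives the daemon install, running it as a non-root user gives the local install:

    # Install AHF as root (daemon install); adjust locations to suit
    ./ahf_setup -ahf_loc /opt/oracle.ahf -data_dir /opt/oracle.ahf/data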
  20. Post-install checks

    ahfctl version
    tfactl status
    ahfctl statusahf
    tfactl toolstatus
    tfactl print hosts
    tfactl print components
    tfactl print protocols
    tfactl print config -node all
  21. Warning remains after a successful upgrade [root@node1 ahf]# ahfctl statusahf

    WARNING - AHF Software is older than 180 days. Please consider upgrading AHF to the latest version using ahfctl upgrade. .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 24554 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' • Run ahfctl syncpatch
  22. Not all nodes appear after upgrade [root@node1 ahf]# tfactl status

    .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ahf]# • Run ahfctl syncnodes
  23. Not all nodes appear after upgrade [root@node1 ahf]# tfactl syncnodes

    Current Node List in TFA : 1. node1 2. node2 Node List in Cluster : 1. node1 2. node2 Node List to sync TFA Certificates : 1 node2 Do you want to update this node list? Y|[N]: Syncing TFA Certificates on node2 : TFA_HOME on node2 : /opt/oracle.ahf/tfa ...
  24. Not all nodes appear after upgrade (cont) ... TFA_HOME on

    node2 : /opt/oracle.ahf/tfa DATA_DIR on node2 : /opt/oracle.ahf/data/node2/tfa Shutting down TFA on node2... Copying TFA Certificates to node2... Copying SSL Properties to node2... Sleeping for 5 seconds... Starting TFA on node2... .---------------------------------------------------------------------------------------------. | Host | Status of TFA | PID | Port | Version | Build ID | Inventory Status | +-------+---------------+-------+------+------------+----------------------+------------------+ | node1 | RUNNING | 28883 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | | node2 | RUNNING | 30339 | 5000 | 21.4.1.0.0 | 21410020220111213353 | COMPLETE | '-------+---------------+-------+------+------------+----------------------+------------------' [root@node1 ahf]#
  25. OS Watcher not managed by AHF/TFA message • Legacy TFA

    install present • OS Watcher process running during install/upgrade • Multiple install.properties or tfa_setup.txt files • Check logs & permissions • Reinstall [root@node1 ahf]# tfactl toolstatus | grep oswbb | | oswbb | 8.3.2 | NOT MANAGED BY TFA | | | oswbb | 8.3.2 | RUNNING |
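    A possible remediation sketch once any conflicting legacy install is cleaned up (assumes tfactl can stop/start the bundled oswbb tool by name; verify with tfactl toolstatus before and after):

    tfactl stop oswbb
    tfactl start oswbb
    tfactl toolstatus | grep oswbb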
  26. Installation and upgrade issues • Post-installation troubleshooting: • ahfctl stopahf;

    ahfctl startahf • tfactl stop; tfactl start • tfactl status • ahfctl statusahf • tfactl toolstatus • tfactl syncnodes • ahfctl syncpatch
  27. Update repository location tfactl> print repository .--------------------------------------------------------. | node1 |

    +----------------------+---------------------------------+ | Repository Parameter | Value | +----------------------+---------------------------------+ | Location | /opt/oracle.ahf/data/repository | | Maximum Size (MB) | 10240 | | Current Size (MB) | 11 | | Free Size (MB) | 10229 | | Status | OPEN | '----------------------+---------------------------------'
  28. Update repository location tfactl> set repositorydir=/some/directory/repository Successfully changed repository .-------------------------------------------------------------.

    | Repository Parameter | Value | +---------------------------+---------------------------------+ | Old Location | /opt/oracle.ahf/data/repository | | New Location | /some/directory/repository | | Current Maximum Size (MB) | 10240 | | Current Size (MB) | 0 | | Status | OPEN | ‘---------------------------+---------------------------------' # Repository commands are applied only on the local node
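    Because repository settings apply only to the local node, a hypothetical way to push the same change to every node (the node names and the repository path are placeholders):

    for h in node1 node2; do
      ssh root@$h "/opt/oracle.ahf/tfa/bin/tfactl set repositorydir=/some/directory/repository"
    done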
  29. Set email notifications [root@node1 ~]# tfactl set [email protected] Successfully set

    [email protected] .---------------------------------------------------------------------------. | node1 | +----------------------------------------------+----------------------------+ | Configuration Parameter | Value | +----------------------------------------------+----------------------------+ | Notification Address ( notificationAddress ) | [email protected] | '----------------------------------------------+----------------------------'
  30. Set email notifications [root@node1 ~]# tfactl print smtp .---------------------------. |

    SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.auth | false | | smtp.port | 25 | | smtp.from | tfa | | smtp.cc | - | | smtp.password | ******* | | smtp.ssl | false | | smtp.debug | true | | smtp.user | - | | smtp.host | localhost | | smtp.bcc | - | | smtp.to | - | '---------------+-----------' View SMTP settings: tfactl print smtp
  31. Set email notifications [root@node1 ~]# tfactl set smtp .---------------------------. |

    SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.password | ******* | | smtp.debug | true | | smtp.user | - | | smtp.cc | - | | smtp.port | 25 | | smtp.from | tfa | | smtp.bcc | - | | smtp.to | - | | smtp.auth | false | | smtp.ssl | false | | smtp.host | localhost | ‘---------------+-----------' Enter the SMTP property you want to update : smtp.host Configure SMTP settings: tfactl set smtp Opens an interactive dialog
  32. Set email notifications Enter the SMTP property you want to

    update : smtp.host Enter value for smtp.host : 127.0.0.1 SMTP Property smtp.host updated with 127.0.0.1 Do you want to continue ? Y|N : Y .---------------------------. | SMTP Server Configuration | +---------------+-----------+ | Parameter | Value | +---------------+-----------+ | smtp.port | 25 | | smtp.cc | - | | smtp.user | - | | smtp.password | ******* | | smtp.debug | true | | smtp.host | 127.0.0.1 | | smtp.ssl | false | ... View SMTP settings: tfactl print smtp Configure SMTP settings: tfactl set smtp
  33. Recommended configurations

    # Repository settings
    tfactl set autodiagcollect=ON               # default
    tfactl set trimfiles=ON                     # default
    tfactl set reposizeMB=                      # default=10240
    tfactl set rtscan=ON                        # default
    tfactl set redact=mask                      # default=none

    # Disk space monitoring
    tfactl set diskUsageMon=ON                  # default=OFF
    tfactl set diskUsageMonInterval=240         # Depends on activity. default=60

    # Log purge
    tfactl set autopurge=ON                     # If space is slim. default=OFF
    tfactl set manageLogsAutoPurge=ON           # default=OFF
    tfactl set manageLogsAutoPurgeInterval=720  # Set to 12 hours. default=60
    tfactl set manageLogsAutoPurgePolicyAge=30d # default=30
    tfactl set minfileagetopurge=48             # default=12
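    A quick verification sketch using tfactl get (covered on the next slide), assuming get accepts the same parameter names used with set; the list below is just a sample of the settings above:

    for p in autodiagcollect trimfiles rtscan diskUsageMon autopurge; do
      tfactl get $p
    done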
  34. View configurations

    Default configuration list is… unsorted :(
    Some configurations listed as parameters, others as descriptions :(

    tfactl print config
    tfactl print config | grep -e "^\|.*\|.*\|$" | sort
    tfactl print config | egrep "^\|.*\|.*\|$" | sort
    tfactl print config | egrep "^\|.*\|.*\|$" | \
      awk -F'|' '{print $2, $3}' | sort
    tfactl get <configuration>
  35. View configurations [root@node1 ~]# tfactl print config | egrep "^\|.*\|.*\|$"

    | awk -F'|' '{print $2, $3}' | sort actionrestartlimit 30 Age of Purging Collections (Hours) ( minFileAgeToPurge ) 12 AlertLogLevel ALL Alert Log Scan ( rtscan ) ON Allowed Sqlticker Delay in Minutes ( sqltickerdelay ) 3 analyze OFF arc.backupmissing 1 arc.backupmissing.samples 2 arc.backup.samples 3 arc.backupstatus 1 Archive Backup Delay Minutes ( archbackupdelaymins ) 40 Auto Diagcollection ( autodiagcollect ) ON Automatic Purging ( autoPurge ) ON Automatic Purging Frequency ( purgeFrequency ) 4 Auto Sync Certificates ( autosynccertificates ) ON BaseLogPath ERROR cdb.backupmissing 1 cdb.backupmissing.samples 2 cdb.backup.samples 1 cdb.backupstatus 1 ...
  36. Annoyances • Documentation isn’t always current • Commands, options, and

    syntax may not match docs • Run tfactl <command> -h or tfactl <command> help • Some commands are user (root, oracle, grid) specific • Regression (usually minor) • Don’t build complex automation on new features • Don’t (always) rush to upgrade to the latest version • Example: GI can’t always see/manage DB & vice-versa
  37. Annoyances • The transition from tfactl to ahfctl is incomplete

    • Commands may be: • …available in both • …deprecated in tfactl • …new and unavailable in tfactl • …not ported to ahfctl (yet)
  38. Annoyances • Date format options in commands are inconsistent •

    Some require quotes, some don’t, some work either way • Some take double quotes, others take single quotes • YYYY/MM/DD or YYYY-MM-DD or YYYYMMDD or … • Some take dates and times separately • Sometimes there are -d and -t flags • Some take timestamps • Some work with either, others are specific
  39. However… Most commands have good help options: tfactl <command> -h

    [root@node1 ~]# tfactl diagcollect -h Collect logs from across nodes in cluster Usage : /opt/oracle.ahf/tfa/bin/tfactl diagcollect [ [component_name1] [component_name2] ... [component_nameN] | [-srdc <srdc_profile>] | [-defips]] [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>] [-last <n><m|h|d>| -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores][-collectalldirs][-collectdir <dir1,dir2..>][-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles]][- examples] components:-ips|-database|-asm|-crsclient|-dbclient|-dbwlm|-tns|-rhp|-procinfo|-cvu|-afd|-crs|-cha|-wls|-emagenti|-emagent|-oms|-omsi|-ocm|-emplugins|-em|- acfs|-install|-cfgtools|-os|-ashhtml|-ashtext|-awrhtml|-awrtext|-sosreport|-qos|-ahf|-dataguard -srdc Service Request Data Collection (SRDC). -database Specify comma separated list of db unique names for collection -defips Include in the default collection the IPS Packages for: ASM, CRS and Databases -sr Enter SR number to which the collection will be uploaded -node Specify comma separated list of host names for collection -tag <tagname> The files will be collected into tagname directory inside repository -z <zipname> The collection zip file will be given this name within the TFA collection repository -last <n><m|h|d> Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours -since Same as -last. Kept for backward compatibility. -from "Mon/dd/yyyy hh:mm:ss" From <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or “yyyy-mm-dd" ...
  40. However… Many commands (incl. complex ones) have an -examples option

    [root@node1 ~]# tfactl diagcollect -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl diagcollect Trim and Zip all files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. /opt/oracle.ahf/tfa/bin/tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node /opt/oracle.ahf/tfa/bin/tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node ...
  41. However… Many commands (incl. complex ones) have an -examples option

    [oracle@node1 ~]$ tfactl analyze -examples Examples: /opt/oracle.ahf/tfa/bin/tfactl analyze -since 5h Show summary of events from alert logs, system messages in last 5 hours. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp os -since 1d Show summary of events from system messages in last 1 day. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "ORA-" -since 2d Search string ORA- in alert and system logs in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -search "/Starting/c" -since 2d Search case sensitive string "Starting" in past 2 days. /opt/oracle.ahf/tfa/bin/tfactl analyze -comp osw -since 6h Show OSWatcher Top summary in last 6 hours. ...
  42. analyze (Only runs as root in 21.4)

    # Perform system analysis of DB, ASM, GI, system, OS Watcher logs/output
    tfactl analyze

    # Options:
    -comp [db|asm|crs|acfs|oratop|os|osw|oswslabinfo]  # default=all
    -type [error|warning|generic]                      # default=error
    -node [all|local|nodename]                         # default=all
    -o filename                                        # Output to filename

    # Times and ranges
    -for "YYYY-MM-DD"
    -from "YYYY-MM-DD" -to "YYYY-MM-DD"
    -from "YYYY-MM-DD HH24:MI:SS" -to "YYYY-MM-DD HH24:MI:SS"
    -last 6h
    -last 1d
  43. analyze (Only runs as root in 21.4)

    # Perform system analysis of DB, ASM, GI, system, OS Watcher logs/output
    tfactl analyze

    # Options:
    -search "pattern"               # Search in DB and CRS alert logs
                                    # Sets the search period to -last 1h
                                    # Override with -last xh|xd
    -verbose timeline file1 file2   # Shows timeline for specified files
  44. INFO: analyzing all (Alert and Unix System Logs) logs for

    the last 1440 minutes... Please wait... INFO: analyzing host: node1 Report title: Analysis of Alert,System Logs Report date range: last ~1 day(s) Report (default) time zone: GMT - Greenwich Mean Time Analysis started at: 03-Feb-2022 06:27:46 PM GMT Elapsed analysis time: 0 second(s). Configuration file: /opt/oracle.ahf/tfa/ext/tnt/conf/tnt.prop Configuration group: all Total message count: 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT Messages matching last ~1 day(s): 963, from 02-Feb-2022 08:01:39 PM GMT to 03-Feb-2022 04:23:43 PM GMT last ~1 day(s) error count: 4, from 02-Feb-2022 08:03:31 PM GMT to 02-Feb-2022 08:11:12 PM GMT last ~1 day(s) ignored error count: 0 last ~1 day(s) unique error count: 3 Message types for last ~1 day(s) Occurrences percent server name type ----------- ------- -------------------- ----- 952 98.9% node1 generic 7 0.7% node1 WARNING 4 0.4% node1 ERROR ----------- ------- 963 100.0% analyze (Only runs as root in 21.4)
  45. ... Unique error messages for last ~1 day(s) Occurrences percent

    server name error ----------- ------- ----------- ----- 2 50.0% node1 [OCSSD(30863)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . 1 25.0% node1 [OCSSD(2654)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 . ----------- ------- 4 100.0% analyze (Only runs as root in 21.4)
  46. changes

    # Find changes made on the system
    tfactl changes

    # Times and ranges
    -for "YYYY-MM-DD"
    -from "YYYY-MM-DD" -to "YYYY-MM-DD"
    -from "YYYY-MM-DD HH24:MI:SS" -to "YYYY-MM-DD HH24:MI:SS"
    -last 6h
    -last 1d
  47. changes [root@node1 ~]# tfactl changes -last 2d Output from host

    : node2 ------------------------------ [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 Output from host : node1 ------------------------------ [Feb/02/2022 19:57:16.438]: Package: cvuqdisk-1.0.10-1.x86_64 [Feb/02/2022 20:11:16.438]: Package: cvuqdisk-1.0.10-1.x86_64
  48. events [root@node1 ~]# tfactl events -last 1d Output from host

    : node2 ------------------------------ Event Summary: INFO :3 ERROR :2 WARNING :0 Event Timeline: [Feb/02/2022 20:10:46.649 GMT]: [crs]: 2022-02-02 20:10:46.649 [ORAROOTAGENT(27881)]CRS-5822: Agent '/u01/app/19.3.0.0/grid/ bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:1:3} in /u01/app/grid/diag/crs/node2/crs/trace/ ohasd_orarootagent_root.trc. [Feb/02/2022 20:11:12.856 GMT]: [crs]: 2022-02-02 20:11:12.856 [OCSSD(28472)]CRS-1601: CSSD Reconfiguration complete. Active nodes are node1 node2 . [Feb/02/2022 20:11:57.000 GMT]: [asm.+ASM2]: Reconfiguration started (old inc 0, new inc 4) [Feb/02/2022 20:28:31.000 GMT]: [db.db193h1.DB193H12]: Starting ORACLE instance (normal) (OS id: 24897) [Feb/02/2022 20:28:42.000 GMT]: [db.db193h1.DB193H12]: Reconfiguration started (old inc 0, new inc 4)
  49. ps # List processes - default flags are "-ef" ps

    pmon ps <flags> pmon tfactl> ps pmon Output from host : vna1 ------------------------------ grid 15260 1 0 14:30 ? 00:00:00 asm_pmon_+ASM1 oracle 16883 1 0 14:31 ? 00:00:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 1 0 14:25 ? 00:00:00 asm_pmon_+ASM2 oracle 9929 1 0 14:27 ? 00:00:00 ora_pmon_VNA2...
  50. ps tfactl> ps aux pmon Output from host : vna1

    ------------------------------ grid 15260 0.0 1.0 1556860 79508 ? Ss 14:30 0:00 asm_pmon_+ASM1 oracle 16883 0.0 0.8 2297012 66148 ? Ss 14:31 0:00 ora_pmon_VNA1 Output from host : vna2 ------------------------------ grid 8063 0.0 1.0 1556860 79896 ? Ss 14:25 0:00 asm_pmon_+ASM2 oracle 9929 0.0 0.8 2297012 66168 ? Ss 14:27 0:00 ora_pmon_VNA2
  51. param (Broken in 21.4) # View database parameters - cluster

    aware param <parameter> tfactl> param sga_target Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------'
  52. # View database parameters - cluster aware tfactl> param -h

    Output from host : vna1 ------------------------------ Usage : /opt/oracle.ahf/tfa/bin/tfactl [run] param <name pattern> Show value of OS/DB parameters matching input e.g: /opt/oracle.ahf/tfa/bin/tfactl param sga_max /opt/oracle.ahf/tfa/bin/tfactl param sga_min /opt/oracle.ahf/tfa/bin/tfactl param db_unique /opt/oracle.ahf/tfa/bin/tfactl param shmmax /opt/oracle.ahf/tfa/bin/tfactl run param sga_max /opt/oracle.ahf/tfa/bin/tfactl run param sga_min /opt/oracle.ahf/tfa/bin/tfactl run param db_unique /opt/oracle.ahf/tfa/bin/tfactl run param shmmax param (Broken in 21.4)
  53. # View database parameters - cluster aware tfactl> param sga_target

    Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------' param (Broken in 21.4)
  54. # View database parameters - cluster aware tfactl> param sga

    Output from host : vna1 ------------------------------ .-------------------------------------------------. | DB PARAMETERS | +----------+------+----------+------------+-------+ | DATABASE | HOST | INSTANCE | PARAM | VALUE | +----------+------+----------+------------+-------+ | vna | vna1 | VNA1 | sga_target | 1536M | ‘----------+------+----------+------------+-------' param (Broken in 21.4)
  55. # There are more parameters for sga* SQL> show parameter

    sga NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ allow_group_access_to_sga boolean FALSE lock_sga boolean FALSE pre_page_sga boolean TRUE sga_max_size big integer 1536M sga_min_size big integer 0 sga_target big integer 1536M unified_audit_sga_queue_size integer 1048576 param (Broken in 21.4)
  56. # View database parameters - cluster aware tfactl> param sga_max

    Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------ param (Broken in 21.4)
  57. # View database parameters - cluster aware tfactl> param shmmax

    Output from host : vna1 ------------------------------ Output from host : vna2 ------------------------------ param (Broken in 21.4)
  58. alertsummary # Summarize events in database and ASM alert logs

    tfactl alertsummary [root@node1 ~]# tfactl alertsummary Output from host : node1 ------------------------------ Reading /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ------------------------------------------------------------------------ 02 02 2022 20:04:57 Database started ------------------------------------------------------------------------ 02 02 2022 20:07:41 Database started Summary: Ora-600=0, Ora-7445=0, Ora-700=0 ~~~~~~~ Warning: Only FATAL errors reported Warning: These errors were seen and NOT reported Ora-15173 Ora-15032 Ora-15017 Ora-15013 Ora-15326
  59. grep # Find patterns in multiple files tfactl grep "ERROR"

    alert tfactl grep -i "error" alert,trace [root@node1 ~]# tfactl grep -i "error" alert Output from host : node1 ------------------------------ Searching 'error' in alert Searching /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- 28: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 375:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32035.trc: 378:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_32049.trc: 446:ERROR: /* ASMCMD */ALTER DISKGROUP ALL MOUNT 543: PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s) 1034:Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_28105.trc: ...
  60. tail

    # Tail logs by name or pattern
    tfactl tail alert_                  # Tail all logs matching alert_
    tfactl tail alert_ORCL1.log -exact  # Tail for an exact match
    tfactl tail -f alert_               # Follow logs (local node only)

    [root@node1 ~]# tfactl tail -f alert_
    Output from host : node1
    ------------------------------
    ==> /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log <==
    NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup00.ocr.274.1095654191'
    2022-02-03T12:23:35.194335+00:00
    NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.274.1095654191'
    2022-02-03T16:23:43.602629+00:00
    NOTE: cleaning up empty system-created directory '+DATA/vgtol7-rac-c/OCRBACKUP/backup01.ocr.275.1095668599'
    ==> /u01/app/oracle/diag/rdbms/db193h1/DB193H11/trace/alert_DB193H11.log <==
    TABLE SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY: ADDED INTERVAL PARTITION SYS_P301 (44594) VALUES LESS THAN (TO_DATE('...
    SYS.WRI$_OPTSTAT_HISTGRM_HISTORY: ADDED INTERVAL PARTITION SYS_P304 (44594) VALUES LESS THAN (TO_DATE('...
    2022-02-03T06:00:16.143988+00:00
    Thread 1 advanced to log sequence 22 (LGWR switch)
    Current log# 2 seq# 22 mem# 0: +DATA/DB193H1/ONLINELOG/group_2.265.1095625353
  61. pstack # Print a stack trace for a process .------------------------------------------------------------------.

    | TOOLS STATUS - HOST : vna1 | +----------------------+--------------+--------------+-------------+ | Tool Type | Tool | Version | Status | +----------------------+--------------+--------------+-------------+ | AHF Utilities | alertsummary | 21.4.1 | DEPLOYED | | | calog | 21.4.1 | DEPLOYED | | | dbglevel | 21.4.1 | DEPLOYED | | | grep | 21.4.1 | DEPLOYED | | | history | 21.4.1 | DEPLOYED | | | ls | 21.4.1 | DEPLOYED | | | managelogs | 21.4.1 | DEPLOYED | | | menu | 21.4.1 | DEPLOYED | | | orachk | 21.4.1 | DEPLOYED | | | param | 21.4.1 | DEPLOYED | | | ps | 21.4.1 | DEPLOYED | | | pstack | 21.4.1 | DEPLOYED | | | summary | 21.4.1 | DEPLOYED | | | tail | 21.4.1 | DEPLOYED | | | triage | 21.4.1 | DEPLOYED | | | vi | 21.4.1 | DEPLOYED | +----------------------+--------------+--------------+-------------+
  62. pstack tfactl> pstack -h Output from host : vna1 ------------------------------

    Error: pstack command not found in system. If its installed, please set the PATH and try again. yum install -y gdb tfactl> pstack mmon Output from host : vna1 ------------------------------ # pstack output for pid : 15318 #0 0x00007f33bac6928a in semtimedop () from /lib64/libc.so.6 #1 0x0000000011c58285 in sskgpwwait () #2 0x0000000011c543db in skgpwwait () #3 0x000000001144ccba in ksliwat () #4 0x000000001144c06c in kslwaitctx () #5 0x0000000011a6fd40 in ksarcv () #6 0x00000000038174fa in ksbabs () #7 0x0000000003835ab3 in ksbrdp () #8 0x0000000003c19a4d in opirip () #9 0x00000000024c23e5 in opidrv ()
  63. pstack # ahfctl pstack accepts standard flags Usage : /opt/oracle.ahf/tfa/bin/tfactl.pl

    [run] pstack <pid|process name> [-n <n>] [-s <secs>] Print stack trace of a running process <n> times. Sleep <secs> seconds between runs. e.g: /opt/oracle.ahf/tfa/bin/tfactl.pl pstack lmd /opt/oracle.ahf/tfa/bin/tfactl.pl pstack 2345 -n 5 -s 5 /opt/oracle.ahf/tfa/bin/tfactl.pl run pstack lmd /opt/oracle.ahf/tfa/bin/tfactl.pl run pstack 2345 -n 5 -s 5
  64. # Generate a system summary tfactl> summary -h --------------------------------------------------------------------------------- Usage

    : TFACTL [run] summary -help --------------------------------------------------------------------------------- Command : /opt/oracle.ahf/tfa/bin/tfactl [run] summary [OPTIONS] Following Options are supported: [no_components] : [Default] Complete Summary Collection -overview : [Optional/Default] Complete Summary Collection - Overview -crs : [Optional/Default] CRS Status Summary -asm : [Optional/Default] ASM Status Summary -acfs : [Optional/Default] ACFS Status Summary -database : [Optional/Default] DATABASE Status Summary -exadata : [Optional/Default] EXADATA Status Summary Not enabled/ignored in Windows and Non-Exadata machine -patch : [Optional/Default] Patch Details -listener : [Optional/Default] LISTENER Status Summary -network : [Optional/Default] NETWORK Status Summary -os : [Optional/Default] OS Status Summary -tfa : [Optional/Default] TFA Status Summary -summary : [Optional/Default] Summary Tool Metadata -json : [Optional] - Prepare json report -html : [Optional] - Prepare html report -print : [Optional] - Display [html or json] Report at Console -silent : [Optional] - Interactive console by defauly -history <num> : [Optional] - View Previous <numberof> Summary Collection History in Interpreter -node <node(s)> : [Optional] - local or Comma Separated Node Name(s) -help : Usage/Help. --------------------------------------------------------------------------------- summary (Only runs as root in 21.4)
  65. summary Example output tfactl> summary Executing Summary in Parallel on

    Following Nodes: Node : vna1 Node : vna2 LOGFILE LOCATION : /opt/oracle.ahf/…/log/summary_command_20220316151853_vna1_18097.log Component Specific Summary collection : - Collecting CRS details ... Done. - Collecting ASM details ... Done. - Collecting ACFS details ... Done. - Collecting DATABASE details ... Done. - Collecting PATCH details ... Done. - Collecting LISTENER details ... Done. - Collecting NETWORK details ... Done. - Collecting OS details ... Done. - Collecting TFA details ... Done. - Collecting SUMMARY details ... Done. Remote Summary Data Collection : In-Progress - Please wait ... - Data Collection From Node - vna2 .. Done. Prepare Clusterwide Summary Overview ... Done cluster_status_summary
  66. summary Example output (cont) COMPONENT DETAILS STATUS +-----------+---------------------------------------------------------------------------------------------------+---------+ CRS .-----------------------------------------------.

    PROBLEM | CRS_SERVER_STATUS : ONLINE | | CRS_STATE : ONLINE | | CRS_INTEGRITY_CHECK : FAIL | | CRS_RESOURCE_STATUS : OFFLINE Resources Found | '-----------------------------------------------' ASM .-----------------------------. PROBLEM | ASM_DISK_SIZE_STATUS : OK | | ASM_BLOCK_STATUS : PASS | | ASM_CHAIN_STATUS : PASS | | ASM_INCIDENTS : FAIL | | ASM_PROBLEMS : FAIL | '-----------------------------' ACFS .-----------------------. OFFLINE | ACFS_STATUS : OFFLINE | ‘-----------------------' DATABASE .-----------------------------------------------------------------------------------------------. PROBLEM | ORACLE_HOME_NAME | ORACLE_HOME_DETAILS | +------------------+----------------------------------------------------------------------------+ | OraDB19Home1 | .------------------------------------------------------------------------. | | | | INCIDENTS | DB_BLOCKS | DATABASE_NAME | DB_CHAINS | PROBLEMS | STATUS | | | | +-----------+-----------+---------------+-----------+----------+---------+ | | | | PROBLEM | PASS | VNA | PROBLEM | PROBLEM | PROBLEM | | | | '-----------+-----------+---------------+-----------+----------+---------' | '------------------+----------------------------------------------------------------------------'
  67. summary Example output (cont) COMPONENT DETAILS STATUS +-----------+---------------------------------------------------------------------------------------------------+---------+ ... PATCH

    .----------------------------------------------. OK | CRS_PATCH_CONSISTENCY_ACROSS_NODES : OK | | DATABASE_PATCH_CONSISTENCY_ACROSS_NODES : OK | '----------------------------------------------' LISTENER .-----------------------. OK | LISTNER_STATUS : OK | '-----------------------' NETWORK .---------------------------. OK | CLUSTER_NETWORK_STATUS : | '---------------------------' OS .-----------------------. OK | MEM_USAGE_STATUS : OK | '-----------------------' TFA .----------------------. OK | TFA_STATUS : RUNNING | '----------------------' SUMMARY .------------------------------------. OK | SUMMARY_EXECUTION_TIME : 0H:1M:52S | ‘------------------------------------' +-----------+---------------------------------------------------------------------------------------------------+---------+
  68. summary Interactive menu ### Entering in to SUMMARY Command-Line Interface

    ### tfactl_summary>list Components : Select Component - select [component_number|component_name] 1 => overview 2 => crs_overview 3 => asm_overview 4 => acfs_overview 5 => database_overview 6 => patch_overview 7 => listener_overview 8 => network_overview 9 => os_overview 10 => tfa_overview 11 => summary_overview tfactl_summary>
  69. summary Interactive menu tfactl_summary>5 ORACLE_HOME_DETAILS ORACLE_HOME_NAME +-----------------------------------------------------------------------------------+------------------+ .-------------------------------------------------------------------------------. OraDB19Home1 |

    DATABASE_DETAILS | DATABASE_NAME | +---------------------------------------------------------------+---------------+ | .-----------------------------------------------------------. | VNA | | | DB_BLOCKS | STATUS | DB_CHAINS | INSTANCE_NAME | HOSTNAME | | | | +-----------+--------+-----------+---------------+----------+ | | | | PASS | OPEN | FAIL | VNA1 | vna1 | | | | | PASS | OPEN | FAIL | VNA2 | vna2 | | | | '-----------+--------+-----------+---------------+----------' | | '---------------------------------------------------------------+---------------' +-----------------------------------------------------------------------------------+------------------+ tfactl_summary_databaseoverview>list Status Type: Select Status Type - select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2
  70. summary Interactive menu tfactl_summary_databaseoverview>list Status Type: Select Status Type -

    select [status_type_number|status_type_name] 1 => database_clusterwide_status 2 => database_vna1 3 => database_vna2 tfactl_summary_databaseoverview>2 =====> database_sql_statistics =====> database_instance_details =====> database_components_version =====> database_system_events =====> database_hanganalyze =====> database_rman_stats =====> database_incidents =====> database_account_status =====> database_tablespace_details =====> database_status_summary =====> database_sqlmon_statistics =====> database_problems =====> database_statistics =====> database_group_details =====> database_pdb_stats =====> database_configuration_details
  71. Diagnostic collections diagcollect [ [component1] [component2] ... [componentN] | [-srdc

    <srdc_profile>] | [-defips] ] [-sr <SR#>] [-node <all|local|n1,n2,..>] [-tag <tagname>] [-z <filename>] [-acrlevel <system,database,userdata>] [-last <n><m|h|d> | -from <time> -to <time> | -for <time>] [-nocopy] [-notrim] [-silent] [-cores] [-collectalldirs] [-collectdir <dir1,dir2..>] [-collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] ]
  72. Diagnostic collections - components diagcollect [component1] [component2] ... [componenteN] -acfs

    -afd -ahf -ashhtml -ashtext -asm -awrhtml -awrtext -cfgtools -cha -crs 
 -crsclient -cvu -database -dataguard -dbclient -dbwlm -em -emagent -emagenti -emplugins -install 
 -ips -ocm -oms -omsi -os -procinfo -qos -rhp -sosreport -tns -wls
  73. Diagnostic collections - 170+ SRDC profiles diagcollect ... -srdc <srdc_profile>

    diagcollect -srdc -help <srdc_profile> can be any of the following, DBCORRUPT Required Diagnostic Data Collection for a Generic Database Corruption DBDATAGUARD Required Diagnostic Data Collection for Data Guard issues including Broker Listener_Services SRDC - Data Collection for TNS-12516 / TNS-12518 / TNS-12519 / TNS-12520. Naming_Services SRDC - Data Collection for ORA-12154 / ORA-12514 / ORA-12528. ORA-00020 SRDC for database ORA-00020 Maximum number of processes exceeded ORA-00060 SRDC for ORA-00060. Internal error code. ORA-00494 SRDC for ORA-00494. ORA-00600 SRDC for ORA-00600. Internal error code. ... ora4023 SRDC - ORA-4023 : Checklist of Evidence to Supply ora4063 SRDC - ORA-4063 : Checklist of Evidence to Supply ora445 SRDC - ORA-445 or Unable to Spawn Process: Checklist of Evidence to Supply (Doc ID 2500730.1) xdb600 SRDC - Required Diagnostic Data Collection for XDB ORA-00600 and ORA-07445 zlgeneric SRDC - Zero Data Loss Recovery Appliance (ZDLRA) Data Collection.
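    An example invocation using one of the SRDC profiles listed above (SRDC collections typically prompt for details such as the incident time and database; add -sr to attach the collection to an SR number):

    tfactl diagcollect -srdc ORA-00600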
  74. Diagnostic collections - Misc diagcollect ... -defips -sr <SR#> -node

    <all|local|n1,n2,..> -defips Include in the default collection the IPS Packages for: ASM, CRS and Databases -sr Enter SR number to which the collection will be uploaded -node Specify comma separated list of host names for collection
  75. Diagnostic collections - Time ranges diagcollect ... -last <n><m|h|d> -since

    -from <time> -to <time> -for <time> -last <n><m|h|d> Files from last 'n' [m]inutes, 'n' [d]ays or 'n' [h]ours -since Same as -last. Kept for backward compatibility. -from "Mon/dd/yyyy hh:mm:ss" From <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -to "Mon/dd/yyyy hh:mm:ss" To <time> or "yyyy-mm-dd hh:mm:ss" or "yyyy-mm-ddThh:mm:ss" or "yyyy-mm-dd" -for "Mon/dd/yyyy" For <date>. or "yyyy-mm-dd"
  76. Diagnostic collections - File management diagcollect ... -nocopy -notrim -tag

    <tagname> -z <zipname> -collectalldirs -collectdir <dir1,dir2..> -collectfiles <file1,..,fileN,dir1,..,dirN> [-onlycollectfiles] -nocopy Does not copy back the zip files to initiating node from all nodes -notrim Does not trim the files collected -tag <tagname> The files will be collected into tagname directory inside the repository -z <zipname> The collection zip file will be given this name in the collection repo -collectalldirs Collect all files from a directory marked "Collect All” flag to true -collectdir Specify a comma separated list of directories and the collection will include all files from these irrespective of type and time constraints in addition to the components specified -collectfiles Specify a comma separated list of files/directories and the collection will include the files and directories in addition to the components specified. if -onlycollectfiles is also used, then no other components will be collected.
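    An example combining the file-management options above (the tag, zip name, and extra directories are placeholders):

    tfactl diagcollect -crs -last 6h -tag case123 -z case123 -collectdir /tmp/dir1,/tmp/dir2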
  77. Diagnostic collections - File redaction

    diagcollect ... -mask | -sanitize

    tfactl set redact=mask
    tfactl set redact=sanitize
    tfactl set redact=none

    sanitize: Replaces sensitive data in collections with random characters
    mask: Replaces sensitive data in collections with asterisks (*)
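    A one-off redacted collection using the diagcollect flags named in the slide title (assumes -mask can be supplied per collection instead of relying on the global redact setting):

    tfactl diagcollect -last 1h -mask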
  78. Diagnostic collections: diagcollect -examples tfactl diagcollect Trim and Zip all

    files updated in the last 1 hours as well as chmos/osw data from across the cluster and collect at the initiating node Note: This collection could be larger than required but is there as the simplest way to capture diagnostics if an issue has recently occurred. tfactl diagcollect -last 8h Trim and Zip all files updated in the last 8 hours as well as chmos/osw data from across the cluster and collect at the initiating node tfactl diagcollect -database hrdb,fdb -last 1d -z foo Trim and Zip all files from databases hrdb & fdb in the last 1 day and collect at the initiating node tfactl diagcollect -crs -os -node node1,node2 -last 6h Trim and Zip all crs files, o/s logs and chmos/osw data from node1 & node2 updated in the last 6 hours and collect at the initiating node
  79. Diagnostic collections: diagcollect -examples tfactl diagcollect -asm -node node1 -from

    "Mar/15/2022" -to "Mar/15/2022 21:00:00" Trim and Zip all ASM logs from node1 updated between from and to time and collect at the initiating node tfactl diagcollect -for "Mar/15/2022" Trim and Zip all log files updated on "Mar/15/2022" and collect at the collect at the initiating node tfactl diagcollect -for "Mar/15/2022 21:00:00" Trim and Zip all log files updated from 09:00 on "Mar/15/2022" to 09:00 on “Mar/16/2022"(i.e. 12 hours before and after the time given) and collect at the initiating node tfactl diagcollect -crs -collectdir /tmp_dir1,/tmp_dir2 Trim and Zip all crs files updated in the last 1 hours Also collect all files from /tmp_dir1 and /tmp_dir2 at the initiating node
  80. Managing ADR logs

    Report space use for database, GI logs
    Report space variations over time

    # Reporting
    tfactl managelogs -show usage            # Show all space use in ADR
    tfactl managelogs -show usage -gi        # Show GI space use
    tfactl managelogs -show usage -database  # Show DB space use
    tfactl managelogs -show usage -saveusage # Save use for variation reports

    # Report space use variation
    tfactl managelogs -show variation -since 1d
    tfactl managelogs -show variation -since 1d -gi
    tfactl managelogs -show variation -since 1d -database
  81. Managing ADR logs

    Purge logs in ADR across cluster nodes
    ALERT, INCIDENT, TRACE, CDUMP, HM, UTSCDMP, LOG
    All diagnostic subdirectories must be owned by dba/grid

    # Purge ADR files
    tfactl managelogs -purge -older 30d -dryrun        # Estimated space saving
    tfactl managelogs -purge -older 30d                # Purge logs > 30 days old
    tfactl managelogs -purge -older 30d -gi            # GI only
    tfactl managelogs -purge -older 30d -database      # Database only
    tfactl managelogs -purge -older 30d -database all  # All databases
    tfactl managelogs -purge -older 30d -database SID1,SID3
    tfactl managelogs -purge -older 30d -node all      # All nodes
    tfactl managelogs -purge -older 30d -node local    # Local node
    tfactl managelogs -purge -older 30d -node NODE1,NODE3
  82. Managing ADR logs - Things to know • First-time purge

    can take a long time for: • Large directories • Many files • NOTE: Purge operation loops over files • Strategies for first time purge: • Delete in batches by age—365 days, 180 days, 90 days, etc. • Delete database and GI homes separately • Delete for individual SIDs, nodes
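    A staged first-time purge sketch following the strategy above (the age steps are illustrative; the -dryrun pass estimates the space each step would free before committing):

    for age in 365d 180d 90d 30d; do
      tfactl managelogs -purge -older $age -dryrun
      tfactl managelogs -purge -older $age
    done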
  83. Managing ADR logs - File ownership • Files cannot be

    deleted if subdirectories under ADR_HOME are not owned by grid/oracle or oinstall/dba • One mis-owned subdirectory • No files under that ADR_HOME will be purged • Even subdirectories with correct ownership! • Depending on version • grid may not be able to delete files in database ADR_HOMEs • oracle may not be able to delete files in GI ADR_HOMEs
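    A hypothetical ownership check for the condition above (the diag base paths and owners are examples; adjust for your environment):

    find /u01/app/oracle/diag /u01/app/grid/diag -type d ! -user oracle ! -user grid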
  84. Managing ADR logs - Files not deleted when • ADR_HOME:

    • …schema version is mismatched • …library version is mismatched • …schema version is obsolete • …is not registered • …is for an orphaned CRS event or user • …is for an inactive listener
  85. Managing ADR logs - Files not deleted when • ORACLE_SID

    or ORACLE_HOME not present in oratab • Duplicate ORACLE_SIDs are present in oratab • Database unique name is mismatched to its directory • Can occur during cloning operations • ADR_BASE is not set properly • $ORACLE_HOME/log/diag directory is missing • $ORACLE_HOME/log/diag/adrci_dir.mif missing • $ORACLE_HOME/log/diag/adrci_dir.mif doesn’t list ADR_BASE
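    Hypothetical spot checks for a few of the conditions above (the oratab path is the Linux default; both commands only read, never modify):

    grep -v '^#' /etc/oratab | cut -d: -f1 | sort | uniq -d   # duplicate ORACLE_SIDs
    cat $ORACLE_HOME/log/diag/adrci_dir.mif                   # should list ADR_BASE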