Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Database Troubleshooting

Avatar for Sean Scott Sean Scott
September 17, 2024

Database Troubleshooting

Avatar for Sean Scott

Sean Scott

September 17, 2024
Tweet

More Decks by Sean Scott

Other Decks in Technology

Transcript

  1. Database Reliability Engineering MAA ⁘ RAC ⁘ RMAN Data Guard

    ⁘ Sharding ⁘ Partitioning Information Lifecycle Management Exadata & Engineered Systems Database Modernization Upgrades ⁘ Patching ⁘ Migrations Cloud ⁘ Hybrid Automation DevOps ⁘ IaC ⁘ Containers ⁘ Terraform Vagrant ⁘ Ansible Observability AHF ⁘ TFA ⁘ CHA ⁘ CHM
  2. www.viscosityna.com @ViscosityNA Oracle on Docker Running Oracle Databases in Linux

    Containers Free sample chapter: https://oraclesean.com
  3. Fixes vs. solutions Fixes: • often quick and undocumented (I

    know what's wrong!) • alter system state, make it dif fi cult to discover the original condition and confuse later troubleshooting • can spiral out of control
  4. Fixes vs. solutions Fixes: • don't address root causes •

    may introduce unintended effects • are perceived as faster but often take longer than deliberate efforts toward a solution
  5. First moves • take a deep breath • list what

    you know: • add as you learn more about the problem • determine the impact—who and what are affected • establish implications to SLAs • re fi ne the scope • centralize information • create dedicated Teams/Slack channels • regularly share updates, observations
  6. First moves Multiple systems/symptoms: • what they have in common

    may be a clue • prioritize the issue that is most likely the root cause • share responsibilities and coordinate efforts across a team
  7. Leadership responsibilities Resist the urge to micromanage! • don't interfere—let

    people work! • support and protect the team • assign a liaison for all communications—no exceptions! • agree on an update cycle • organize breaks, snacks, drinks • plan early for shift work/relief teams
  8. Preserve evidence • create a new directory dedicated to the

    issue • make copies of logs, directories, con fi gurations, etc. • document every change!
  9. Ask questions • what changed? • when did it start?

    • who discovered it? • who/what does it affect? • how do we reproduce it? • what's been done so far? • if intermittent, how often or what is the timing? • is it related to or dependent on something else?
  10. Ask questions Eliminate ambiguities • "It's slow." How slow? •

    "It started recently." When, exactly? • "It's been like this for a while." For how long? • "It only happens sometimes." What do events have in common? Establish expectations • How long should it take?
  11. Logs are friends Oracle is a well-instrumented application. • Read

    the logs. • No, really. Read the logs. • Look for errors in the minutes/hours/days before the incident • Look for recent changes • grep the diagnostic directory for similar/related errors/entries
  12. Believe only what you can prove • "It's not X.

    I checked." Really? Show me proof. • "It can't possibly be X!" It's... probably X! • "We didn't change anything." Except for... • "That change is unrelated." It's probably that change
  13. Proof isn't always proof • ...when queries are incorrect or

    based on assumptions; • ...when queries come from a blog post; • ...when using duplicates or "improvements" of built-in instrumentation. You can't prove a negative! • "If that were true, we'd get an alert." • "We've never had that problem before."
  14. Reproduce the issue Test system guidelines: • Identical (or nearly

    identical) con fi gurations and topologies You can't test a RAC issue on a single-node system. • Populated with representative data Performance problems in 1M row tables won't show in small samples. • Similar visibility to production Stakeholders and monitoring tools need access to duplicate results.
  15. Test systems To be useful, test systems, tools & utilities

    need to exist before the problem occurs.
  16. What to do when you're stuck • Read the documentation

    • Recruit a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break
  17. Database tools • Autonomous Health Framework (AHF) • changes, events,

    analyze, tail, param • diagcollect • OraCHK/ExaCHK • Remote Diagnostic Assistant (RDA) • Enterprise Manager (OEM) • EDB360 (https://carlos-sierra.net) • ShellCheck online (https://www.shellcheck.net) install (https://github.com/koalaman/shellcheck)
  18. Database tables • dba_errors • dba_source • dba_autotask* • dba_scheduler*

    • v$rman_output • v$rman_status • v$session* • v$active_session_* • v$backup_* • v$datafile* • v$parameter* • v$spparameter
  19. OS tools & utilities • Read the documentation • Recruit

    a second set of eyes • Ask questions • Diagram the problem or draw a picture • Take a break
  20. www.viscosityna.com @ViscosityNA tee, >, >> diff curl, wget, mailx mktemp

    ps wc date du, df iostat, vmstat, sar, etc. env | sort history cat more, less tail, head, watch, strace grep awk sed Regular expressions OS tools & utilities
  21. www.viscosityna.com @ViscosityNA export NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS' bash: set -e: exit on

    non-zero return code set -u: treat unset variables as errors set -x: generate verbose output Environmental setup
  22. www.viscosityna.com @ViscosityNA alter session set nls_date_format='YYYY-MM-DD HH24:MI:SS'; select sys_context() --

    Write messages to the alert log: dbms_system.ksdwrt(2, to_char(sysdate, 'YYYY-MM-DD HH24:MI:SS') || ' My text'); SQL*Plus setup
  23. www.viscosityna.com @ViscosityNA set pages 9999 set lines 200 set timing

    on set time on -- Format columns col member format a20 col member for a20 SQL*Plus setup
  24. Track your work • Create a dedicated directory • Add

    only! Never overwrite anything! • Use timestamps in the fi lename • Log everything • Redirect all output to a fi le • >>, tee -a • Add times strategically throughout • Too much information is better than not enough • Don't assume—capture basic information (environment, settings, etc)
  25. Finding answers • Chrome: Return 100 AI-free results by adding

    to its search engine settings: {google:baseURL}search?q=%s&{google:RLZ} {google:originalQueryForSuggestion} {google:assistedQueryStats}{google:searchFieldtrialParameter} {google:language}{google:prefetchSource}{google:searchClient} {google:sourceId}{google:contextualSearchVersion} ie={inputEncoding}&num=100
  26. Finding answers • Not everything you read on the internet

    is true. • Being repeated on multiple blogs doesn't make it accurate. • Does it apply to: • your situation? • your version? • your OS? • Be cautious of "silver bullet" fi xes.
  27. Finding answers • Wrapping terms in quotes forces them into

    the results. • These are not the same: • ORA-600 kdspf 4194 • ORA-600 "kdspf" "4194"
  28. Errors are errors. There is no such thing as "acceptable

    errors" in production environments.