Oracle data guard switch over and fail over

1. [PRIMARY] Switch log file on primary database.

SQL>alter system switch logfile;
2. [PRIMARY] Check switchover status before switching database.

SQL>select switchover_status from v$database;
You must see “TO_STANDBY” as result.

3. [PRIMARY] Switch primary database to standby database.

SQL>alter database commit to switchover to physical standby with session shutdown;

SQL>shutdown immediate;

SQL>startup nomount;

SQL>alter database mount standby database;
4. [PRIMARY] Defer for archive log apply. Because I didn’t set  my standby database as primary yet.

SQL>alter system set log_archive_dest_state_2=defer;
5. [Standby] Switch standby database to primary. Check switchover status before switching database.

SQL>select switchover_status from v$database;
You must see “TO_PRIMARY” as result. Now let’s swtich

SQL>alter database commit to switchover to primary;

SQL>shutdown immediate;

SQL>startup;
Our switchover process is successfully completed .
6. [PRIMARY] Start real-time recovery process..

SQL>recover managed standby database using current logfile disconnect;
Finally let’s open our database with “Read Only with Apply”.

SQL>recover managed standby database cancel;

SQL>alter database open;

SQL>recover managed standby database using current logfile disconnect;
FAILOVER:

In short, the failover is the deformation of the production (primary) database and activating standby database as the primary. It is not reversible. When enabled, re-create the standby database. What to do in case of failover:

(Important note: PRIMARY is the primary server and Standby is the standby server)

1. [PRIMARY] If the primary database is accessible and running, then it must provided  to send redo buffer to the standby database.

SQL> alter system flush redo to standby_db_name;

SQL>alter system archive log current;
If you don’t receive an error, you can continue with step 5th. In this case, the system can be opened by zero data loss. If you receive an error, We continue with step 2 to open the system at least data loss.

2. [Standby] We must run the following query to learn last applied archive log sequence number.

SQL> SELECT UNIQUE THREAD# AS THREAD, MAX(SEQUENCE#) OVER (PARTITION BY thread#) AS LAST from V$ARCHIVED_LOG;
3. [PRIMARY’dan Standby’ye] If you can access archive logs which are not copied to standby then copy archives to standby. After copy archive log files we must register them to standby database. This operation must be done for every thread.

SQL> alter database register physical logfile '/oracle/ora11g/dbs/arch/ TALIP_991834413_1_102.arc ';
4. [Standby] Check the standby database for redo gap. If there is a gap then we must copy archive log files and register.

SQL> SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# FROM V$ARCHIVE_GAP;

SQL> alter database register physical logfile '/oracle/ora11g/dbs/arch/ TALIP_991834413_1_101.arc ';
As a result of the above query until it returns to zero.

5. [Standby] Stop the redo apply process in standby database.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
6. [Standby] Finish to apply archive logs copied from primary.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
If you get an error, it means there are redo logs not applied. Consider 2th and 4th steps. You can also continue with following command;

SQL> ALTER DATABASE ACTIVATE PHYSICAL STANDBY DATABASE;
In this situation you can open database in 8th step.  If you get no error, continue with 7th step.

7. [Standby] Switch standby database to primary database.

SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
8. [Standby] Open database.

SQL> ALTER DATABASE OPEN;
After opening standby database as primary with failover you must take full backup.

Oracle Data guard Protection modes

Maximum Protection
In the Maximum Protection mode a transaction is only confirmed as “Committed”, when the data has been written both local and into at least one Standby Redolog file. If the Standby Database or the network between the databases breaks down, transactions could not be performed any longer, the primary database shuts down automatically. Oracle recommends to only use Maximum Protection mode, if at least two Standby databases exist.

For the Maximum Protection mode the following parameters must be set for the Redo transport:

AFFIRM
SYNC
Maximum Availibility
This mode is a compromise between data security and performance. First the Maximum Availability mode works just like the Maximum Protection mode. That means all transactions are transmitted simultaneously and the commit of the transaction is only sent, when the transaction is saved both local and on at least one Standby Redolog file. With the 12c Oracle feature “FastSync” the performance can be increased a bit by already confirming the transaction, when the data reached the Standby side, so when it is located in that memory.

Unlike the Maximum Protection mode, the primary database continues working after a short time, if the Standby side breaks down or a network error occurs. By this it switches to the Maximum Performance mode automatically, that means transactions are committed immediately. Once the Standby database is available again it automatically switches back to the Maximum Protection mode.

Following parameters are responsible for the Redolog transport in Maximum Availability mode:

AFFIRM
SYNC
NET_TIMEOUT
The parameter NET_TIMEOUT (default 30) tells the time (in seconds) after that the database shall switch to the Maximum Performance mode.

Maximum Performance
The Maximum Performance mode is used, when the primary database must not be compromised. That means transactions are confirmed, once they are saved into the local Redolog files and asynchronously transmitted to the Standby database. So in case of a breakdown of the primary database you can expect a loss of transactions.

Following parameters are responsible for the Redolog transport here:

NOAFFIRM
ASYNC

Oracle RAC GRID Daemons and Background process


Oracle Cluster Specific Daemons:
Crsd :
The CRS daemon (crsd) manages cluster resources based on configuration information that is stored in Oracle Cluster Registry (OCR) for each resource. This includes start, stop, monitor, and failover operations. The crsd process generates events when the status of a resource changes.

Cssd :
 Cluster Synchronization Service (CSS): Manages the cluster configuration by controlling which nodes are members of the cluster and by notifying members when a node joins or leaves the cluster. If you are using certified third-party clusterware, then CSS processes interfaces with your clusterware to manage node membership information. CSS has three separate processes: the CSS daemon (ocssd), the CSS Agent (cssdagent), and the CSS Monitor (cssdmonitor). The cssdagent process monitors the cluster and provides input/output fencing. This service formerly was provided by Oracle Process Monitor daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure results in Oracle Clusterware restarting the node. 
Diskmon :
Disk Monitor daemon (diskmon): Monitors and performs input/output fencing for Oracle Exadata Storage Server. As Exadata storage can be added to any Oracle RAC node at any point in time, the diskmon daemon is always started when ocssd is started. 
Evmd :
Event Manager (EVM): Is a background process that publishes Oracle Clusterware events 
Mdnsd :
Multicast domain name service (mDNS): Allows DNS requests. The mDNS process is a background process on Linux and UNIX, and a service on Windows. 
Gnsd :
Oracle Grid Naming Service (GNS): Is a gateway between the cluster mDNS and external DNS servers. The GNS process performs name resolution within the cluster. 
Ons :
Oracle Notification Service (ONS): Is a publish-and-subscribe service for communicating Fast Application Notification (FAN) events 
Oraagent :
oraagent: Extends clusterware to support Oracle-specific requirements and complex resources. It runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g Release 1 (11.1). 
Orarootagent :
Oracle root agent (orarootagent): Is a specialized oraagent process that helps CRSD manage resources owned by root, such as the network, and the Grid virtual IP address 
Oclskd :
Cluster kill daemon (oclskd): Handles instance/node evictions requests that have been escalated to CSS .
Gipcd :
Grid IPC daemon (gipcd): Is a helper daemon for the communications infrastructure 
Ctssd :
Cluster time synchronisation daemon(ctssd) to manage the time syncrhonization between nodes, rather depending on NTP. 

 RAC Background Process:

LMSn — Global Cache Service Process: It can mainly handle the cache fusion part. It handles the consistent copies of blocks that are transferred between instances. It receives the request from the LMD to perform lock requests. It rolls back any uncommitted transactions. There can be up to 10 LMS process running and can be started dynamically if demand requires. It also handles the global deadlock detections and monitors for the lock conversion timeouts.

LMON    — Global Enqueue Service Monitor: This process manages the GES, it maintains the consistency of GCSmemory in case of process death. It also responsible for cluster reconfiguration and locks reconfiguration.

LMD     — Global Enqueue Service Daemon: This manages eneque manager service request for the GCS. It also handles the deadlock detections and remote resource requests the other instances.

LCK0    — Instance Enqueue Process: Manages the instance resource requests and cross instance call operations for shared resources. It builds a list of invalid lock elements and validates the lock elements during recovery.

DIAG    — Diagnosability Daemon

GCS ensures a single system image of the data even though the data is accessed by multiple instances.

GES maintains or handles the synchronization of the dictionary cache, library cache, transaction locks, and DDL locks. In other words, GES manages enqueues other than data blocks. To synchronize access to the data dictionary cache, latches are used in exclusive (X) mode and in single-node cluster databases. Global enqueues are used in cluster database mode

RAC/Grid startup sequence in Oracle 11gR2


OHASD has access to OLR (oracle local registry). OHASD then reads the OLR content and initialize accordingly. 

OHASD brings up GPnP (ora.gpnpd)Daemon and CSS (ora.cssd) Daemon. 

CSS Daemon has access to the GPNP Profile stored on the local file system. I even found a copy of GPNP Profile directly stored  in OLR (in Oracle 12c release 2)

The Voting Files locations on ASM Disks are accessed by CSSD with well-known pointers in the ASM Disk headers and CSSD is able to complete initialization and start or join an existing cluster.

OHASD starts an ASM instance. The ASM instance uses special code to locate the contents of the ASM SPFILE, if it is stored in a Diskgroup. 

With an ASM instance operating and its Diskgroups mounted, access to Clusterware’s OCR is available to CRS.

OHASD then starts CRSD (ora.crsd)damon with access to the OCR in an ASM Diskgroup.

And thus Clusterware completes initialization and brings up other cluster managed resources defined in OCR.

Level 1: OHASD Spawns:
  • cssdagent – Agent responsible for spawning CSSD.
  • orarootagent – Agent responsible for managing all root owned ohasd resources.
  • oraagent – Agent responsible for managing all oracle owned ohasd resources.
  • cssdmonitor – Monitors CSSD and node health (along wth the cssdagent).
Level 2: OHASD rootagent spawns:
  • CRSD – Primary daemon responsible for managing cluster resources.
  • CTSSD – Cluster Time Synchronization Services Daemon
  • Diskmon
  • ACFS (ASM Cluster File System) Drivers
Level 2: OHASD oraagent spawns:
  • MDNSD – Used for DNS lookup
  • GIPCD – Used for inter-process and inter-node communication
  • GPNPD – Grid Plug & Play Profile Daemon
  • EVMD – Event Monitor Daemon
  • ASM – Resource for monitoring ASM instances
Level 3: CRSD spawns:
  • orarootagent – Agent responsible for managing all root owned crsd resources.
  • oraagent – Agent responsible for managing all oracle owned crsd resources.
Level 4: CRSD rootagent spawns:
  • Network resource – To monitor the public network
  • SCAN VIP(s) – Single Client Access Name Virtual IPs
  • Node VIPs – One per node
  • ACFS Registery – For mounting ASM Cluster File System
  • GNS VIP (optional) – VIP for GNS
Level 4: CRSD oraagent spawns:
  • ASM Resouce – ASM Instance(s) resource
  • Diskgroup – Used for managing/monitoring ASM diskgroups.
  • DB Resource – Used for monitoring and managing the DB and instances
  • SCAN Listener – Listener for single client access name, listening on SCAN VIP
  • Listener – Node listener listening on the Node VIP
  • Services – Used for monitoring and managing services
  • ONS – Oracle Notification Service
  • eONS – Enhanced Oracle Notification Service
  • GSD – For 9i backward compatibility
  • GNS (optional) – Grid Naming Service – Performs name resolution

Long running queries troubleshooting in oracle database


In general application team come to DBA’s and complaint on one application query, last time or previous times it was done in few seconds now it is taking 1 minute or more time. For this issue no concreate answer. Every DBA has their own approaches to solve the issue.

Step:1
Find the long running query:
Long running session are currently running, then we can use v$session find the long running query sql_id, using sql_id get the sql using v$sqltext view.
Sql>select sql_id from v$session where sid=1234;
      Sql_id
 btdzd9ktsa55n
sql> select sql_text from v$sqltext where sql_id=’ btdzd9ktsa55n’ order by piece;
Sql_text
select name, accno, htown, amt, transactions from abd.transact where mobile_no=’9933445599’;
If in-case query already completed, then we can find the sql_id from AWR, ASH and dba_hist_active_session_history views on specific time interval.

Step2:
1.     Run the Sql_Tuning_advisor for the sql_id:
SET SERVEROUTPUT ON
DECLARE
  l_sql_tune_task_id  VARCHAR2(200);
BEGIN
  l_sql_tune_task_id := DBMS_SQLTUNE.create_tuning_task (
                          sql_id      => 'btdzd9ktsa55n',
                          scope       => DBMS_SQLTUNE.scope_comprehensive,
                          time_limit  => 60,
                          task_name   => 'btdzd9ktsa55n_tuning',
                          description => 'Tuning for btdzd9ktsa55n');
  DBMS_OUTPUT.put_line('l_sql_tune_task_id: ' || l_sql_tune_task_id);
END;
/
2.     Execute the DBMS_SQLTUNE using task_name
EXEC DBMS_SQLTUNE.execute_tuning_task(task_name => 'btdzd9ktsa55n_tuning');
3.     Once tuning task done, execute report tuning task.

SELECT DBMS_SQLTUNE.report_tuning_task('btdzd9ktsa55n_tuning') FROM dual;

It gives us few recommendations have to take the actions accordingly.
Recommendations could be gather statistics, creating indexes, drop indexes……..

Step 3:
Few time we didn’t get any recommendations from the Sql_tune, then that time we have to verify is there any plan changed for the query using dba_sqlstat, dba_hist_snapshot and DBA_HIST_SQL_PLAN.

select distinct b.BEGIN_INTERVAL_TIME as Snap, a.PLAN_HASH_VALUE as plan, a.EXECUTIONS_DELTA as EXECUTIONS, a.ELAPSED_TIME_DELTA/1000000 as ELAPSED_SEC, ROWS_PROCESSED_DELTA as "ROWS" , a.ROWS_PROCESSED_DELTA/CASE WHEN a.EXECUTION_DELTA = 0 THEN -1 ELSE a.EXECUTIONS_DELTA END "Avg Rows", a.ELAPSED_TIME_DELTA/1000000/CASE WHEN a. EXECUTION_DELTA = 0 THEN -1 ELSE a.EXECUTION_DELTA END "Avg Elapsed", a.optimizer_cost, a.SQL_PROFILE from DBA_HIST_SQLSTAT a, DBA_HIST_SNAPHOT b where a.SQL_ID = '&sqlid' and a.snap_id = b.snap_id order by b.BEGIN_INTERVAL_TIME;

or

SELECT DISTINCT sql_id, plan_hash_value
FROM dba_hist_sqlstat q,
    (
    SELECT /*+ NO_MERGE */ MIN(snap_id) min_snap, MAX(snap_id) max_snap
    FROM dba_hist_snapshot ss
    WHERE ss.begin_interval_time BETWEEN (SYSDATE - &No_Days) AND SYSDATE
    ) s
WHERE q.snap_id BETWEEN s.min_snap AND s.max_snap
  AND q.sql_id IN ( '&SQLID')
/

Above queries gives us sql paln from number of days, from the list we can choose the best plan and pin the best plan.
How to pin the sql plan: