[HANA] 2100009 - FAQ: SAP HANA Savepoints

Symptom

 

You face issues or have questions related to savepoints in SAP HANA environments.

Environment

 

SAP HANA

Cause

 

1. What are savepoints in SAP HANA environments?
2. When is a savepoint triggered?
3. Where can I find more information related to savepoints?
4. Which indications exist for problems related to savepoints?
5. Are savepoints online operations?
6. How can typical savepoint issues be analyzed and resolved?
7. What is the prepare flush retry count and how can I optimize it?
8. What kinds of snapshots exist?
9. When and how are pages actually flushed to disk?
10. What are shadow pages?
11. What are typical initiations and purposes for savepoints?
12. What are reasons for snapshots being retained for a long time?
13. What are savepoint callbacks?

Resolution

 

1. What are savepoints in SAP HANA environments?

Savepoints are required to synchronize changes in memory with the persistency on disk level. All modified pages of row and column store are written to disk during a savepoint.

Each SAP HANA host and service has its own savepoints.

The data belonging to a savepoint represents a consistent state of the data on disk and remains untouched until the next savepoint operation has been completed.

The availability of a recent savepoint improves the restart time of SAP HANA, because fewer redo logs need to be applied to make the database consistent.

2. When is a savepoint triggered?

Savepoints are triggered in the following ways:

Scenario Details

Savepoint interval (automatic)

During normal operations savepoints are automatically triggered when a predefined time has passed since the last savepoint. The length of the time interval between two consecutive savepoints can be controlled with the following parameter:

global.ini -> [persistence] -> savepoint_interval_s

Its default value is 300, so savepoints are taken at intervals of 300 seconds (5 minutes).

System command (manual)

The following command can be used to execute a savepoint manually:

ALTER SYSTEM SAVEPOINT

Soft shutdown

A soft shutdown invokes a savepoint before the services are stopped.

A hard shutdown doesn't trigger a savepoint. This can increase the subsequent restart time.

Backup

A global savepoint is performed before a data backup is started.

A savepoint is written after the backup of a specific service is finished.

Startup

After a consistent database state is reached during startup, a savepoint is performed.

Snapshots

Snapshots are savepoints that are preserved for longer use and so they are not overwritten by the next savepoint.

Reclaim Datavolume

When RECLAIM DATAVOLUME is executed to defragment persistence (SAP Note 2400005), it regularly triggers savepoints.
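
As a rough illustration of the interval-based trigger described above, the following Python sketch computes when the next automatic savepoint would be due. It is a toy model only: the function names are hypothetical, the 300 second default mirrors the savepoint interval given in the text, and nothing here talks to SAP HANA.

```python
from datetime import datetime, timedelta

# Default savepoint interval in seconds, as stated in the text.
DEFAULT_SAVEPOINT_INTERVAL_S = 300


def next_savepoint_due(last_savepoint: datetime,
                       interval_s: int = DEFAULT_SAVEPOINT_INTERVAL_S) -> datetime:
    """Return the time at which the next interval-based savepoint is due."""
    return last_savepoint + timedelta(seconds=interval_s)


def savepoint_is_due(last_savepoint: datetime, now: datetime,
                     interval_s: int = DEFAULT_SAVEPOINT_INTERVAL_S) -> bool:
    """True once the configured interval since the last savepoint has passed."""
    return now >= next_savepoint_due(last_savepoint, interval_s)
```

The other triggers in the list (manual command, shutdown, backup, startup, snapshots, reclaim) are event-driven and are not covered by this sketch.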

3. Where can I find more information related to savepoints?

The following SAP HANA views contain savepoint related information:

View Details
M_SAVEPOINT_STATISTICS Global savepoint information per host and service
M_SAVEPOINTS Detailed information for individual savepoints
M_SERVICE_THREADS, M_SERVICE_THREAD_SAMPLES, HOST_SERVICE_THREAD_SAMPLES As of SAP HANA SPS 10, savepoint details are logged for THREAD_TYPE = 'PeriodicSavepoint' (see SAP Note 2114710).

 

The following SQL statements of SAP Note 1969700 can be used to analyze savepoints:

SQL statement Details
SQL: "HANA_IO_Savepoints" Detailed information for individual savepoints
SQL: "HANA_IO_Snapshots" Snapshot information

 

4. Which indications exist for problems related to savepoints?

The following SAP HANA alerts indicate problems in the area of savepoints:

Alert Name Description
28 Most recent savepoint operation Determines how long ago the last savepoint was defined, that is, how long ago a complete, consistent image of the database was persisted to disk.
54 Savepoint duration Identifies long-running savepoint operations.
66 Storage snapshot is prepared Determines whether or not the period, during which the database is prepared for a storage snapshot, exceeds a given threshold.
107 Inconsistent database fallback snapshot Determines whether an inconsistent fallback snapshot exists.
108 Database fallback snapshot age Determines if a snapshot exists for an extended period of time.

 

SQL: "HANA_Configuration_MiniChecks" (SAP Notes 1969700, 1999993) returns a potentially critical issue (C = 'X') for one of the following individual checks:

Check ID Details
M0346 Long waitForLock savepoint phases (last day)
M0348 Long critical savepoint phases (last day)
M0350 Blocking savepoint phases > 10 s (last day)
M0351 Blocking savepoint phase avg. (s, last day)
M0352 Blocking savepoint phase max. (s, last day)
M0355 Time since last savepoint (s)
M0356 Savepoint crit. phase write throughput (MB/s)
M0357 Savepoint write throughput (MB/s)
M0358 Savepoints taking longer than 900 s (last day)
M0380 Age of oldest backup snapshot (days)
M0381 Age of oldest fallback snapshot (days)
M0383 Max. size of shadow pages (GB, last day)
M0385 Savepoint vol. per day vs. data (%, last week)
M0386 Max. savepoint prepare flush retries (current)
M0387 Avg. savepoint prepare flush retries (current)
M1830 Age of oldest replication snapshot (h)

 

SQL: "HANA_TraceFiles_MiniChecks" (SAP Notes 1969700, 2380176) returns one of the following check IDs:

Check ID Details
T0869 Long runtime of savepoint callback

 

INSERT / UPDATE / DELETE threads may be blocked by a savepoint if they are in state SharedLockEnter waiting for a lock of type ConsistentChangeLock. See SAP Note 1999998 for more information.

 

 

5. Are savepoints online operations?

The majority of the savepoint is performed online without holding a lock, but the finalization of the savepoint requires a lock. This step is called the blocking phase of the savepoint. It consists of two major phases:

Phase Sub phase Mini check Thread detail Description
Blocking WaitForLock 346 ("Long waitForLock savepoint phases (last day)") enterCriticalPhase(waitForLock)

Before the critical phase is entered, the savepoint needs to acquire a ConsistentChangeLock. If this lock is held by other threads / transactions, the duration of this phase increases. At the same time all other modifications on the underlying table like INSERT, UPDATE or DELETE are blocked by the savepoint with ConsistentChangeLock.

→ Before entering the critical phase, a ConsistentChangeLock has to be allocated (concurrent modifications are blocked)

Blocking Critical 348 ("Long critical savepoint phases (last day)") processCriticalPhase

Once the ConsistentChangeLock is acquired, the actual critical phase is entered and the remaining I/O writes are performed in order to guarantee a consistent set of data on disk level. During this time other transactions aren't allowed to perform changes on the underlying table and are blocked with ConsistentChangeLock.

→ Once the ConsistentChangeLock is allocated, the actual critical phase is performed (I/O writes are performed for consistency on disk level, so other transactions cannot make changes)

 

Usually the blocking phase shouldn't take longer than 1 to 2 seconds.

→ The final step of the savepoint requires a lock, but it is usually finished within about 1 to 2 seconds (the impact of this may need to be verified?)
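
The relationship between the two sub phases can be sketched in Python as follows. This is a minimal model, assuming the blocking phase is simply the sum of the waitForLock and critical sub phases; the class name, function name and the 2 second limit (taken from the "1 to 2 seconds" guideline above) are illustrative choices and not part of any SAP HANA interface.

```python
from dataclasses import dataclass


@dataclass
class SavepointPhases:
    """Durations of the two blocking sub phases of one savepoint (seconds)."""
    wait_for_lock_s: float   # enterCriticalPhase(waitForLock)
    critical_s: float        # processCriticalPhase

    @property
    def blocking_s(self) -> float:
        # Assumption: blocking phase = waitForLock + critical phase.
        return self.wait_for_lock_s + self.critical_s


def assess_blocking_phase(phases: SavepointPhases, limit_s: float = 2.0) -> str:
    """Classify a savepoint by its blocking phase and dominant sub phase."""
    if phases.blocking_s <= limit_s:
        return "ok"
    if phases.wait_for_lock_s >= phases.critical_s:
        return "long waitForLock phase"
    return "long critical phase"
```

For example, `assess_blocking_phase(SavepointPhases(5.0, 1.0))` reports a long waitForLock phase, which would point towards lock contention rather than disk I/O.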

6. How can typical savepoint issues be analyzed and resolved?

You can use the following approaches to analyze and resolve typical savepoint issues. Column "Check ID" refers to the related check ID of the SAP HANA Mini Checks (SAP Note 1999993).

Symptoms Check ID Thread detail Details
Long critical phase M0348 processCriticalPhase

Delays during the critical phase are often caused by problems in the disk I/O area. See SAP Note 1999930 for further information about analyzing the SAP HANA I/O performance. Pay particular attention to the trigger read and write ratios, because they can indicate increased amounts of synchronous I/O (SAP Note 1930979).

You can check for the savepoint write throughput via SQL: "HANA_Configuration_MiniChecks" ('Savepoint write throughput (MB/s)') of SAP Note 1999993. Values below the expectation can be caused by problems in the I/O area or by significant flush overhead outside of the I/O area like data volume encryption.

Be aware that a long waitForLock phase (see below) can increase the runtime of the critical phase because all changes introduced during the waitForLock phase need to be written to disk. So if both the critical phase and the waitForLock phase take longer for the same savepoint, you should analyze the waitForLock phase first.

Specific tasks of the critical savepoint phase (e.g. processing of private log buffers) require JobWorker threads, so a lack of JobWorker threads (i.e. a higher demand than configured via global.ini -> [execution] -> max_concurrency, SAP Note 2222250) can result in increased critical savepoint phases.

Long critical phase runtimes can also occur when the non-critical flush phase is left before the majority of changed blocks has been flushed to disk. Make sure that the related SAP HANA thresholds (global.ini -> [persistence] -> savepoint_max_pre_critical_flush_duration, global.ini -> [persistence] -> savepoint_pre_critical_flush_retry_threshold) are set to reasonable values (typically the defaults) and see "Long running non-critical savepoint phase" below for a description of other root causes.

In rare cases the critical phase can be dominated by accesses to the secondary system replication site (SAP Note 1999880) that are required for synchronization purposes. In these cases a DisasterRecovery module will show up in the savepoint call stack (SAP Note 2313619), e.g.:

You have to check for issues in the system replication environment in this case.

Another reason for long critical phases is an extraordinarily high number of modification operations (e.g. BW full loads or individual issues like the significant /IWFND/SU_STATS INSERTs in the context of SAP Note 2293307). Therefore you should check for frequent modification operations like INSERT, UPDATE or DELETE executed against the database. See SAP Note 2000002 for more information regarding SQL statement analysis.

Long waitForLock phase M0346 enterCriticalPhase(waitForLock)

Long durations of the blocking phase (outside of the critical phase) are typically caused by SAP HANA internal lock contention. The following known scenarios exist:

 

Starting with Rev. 1.00.102 you can configure the following parameter in order to trigger a runtime dump (SAP Note 2400007) if waiting to enter the critical phase takes longer than the configured number of seconds:

global.ini -> [persistence] -> runtimedump_for_blocked_savepoint_timeout

Per default a maximum of one runtime dump is created within 24 hours. Starting with SAP HANA 1.00.122.07 it can be adjusted via:

Long running non-critical savepoint phase

M0358
M0385
M0386
M0387

flushPagesinNonCriticalPhase

You can use SQL: "HANA_IO_Savepoints" (input parameter MIN_SAVEPOINT_DURATION_S) of SAP Note 1969700 to check for long running savepoints. A good value for MIN_SAVEPOINT_DURATION_S can be 900. This means that only savepoints with a duration of more than 15 minutes are displayed.

Depending on the output of this command the following situations can be distinguished:

Long running savepoints due to low write throughput:

If the write throughput (MB_PER_S) is significantly lower than 100 MB / s, you should check the I/O write performance based on SAP Note 1999930 and analyze the non-I/O components of the page flushes via SQL: "HANA_IO_Flushes_Details" (SAP Note 1969700).

If you face a high runtime for encryption, you can consider disabling data volume encryption as a workaround. Furthermore, an improvement in encryption performance is available with SAP HANA >= 1.00.122.14 and >= 2.00.012.04.

Call stack modules related to encryption are e.g.:

Long running savepoints due to high I/O write volume:

If the amount of data written to disk (SUM_SIZE_MB) is much higher than expected, you should at first check via SQL: "HANA_Tables_IOStatistics" (ORDER_BY = 'WRITE') of SAP Note 1969700 for specific tables with a high amount of I/O writes. Check for these tables from an application perspective if you can reduce the amount of changes. If writes are linked to delta merges (SAP Note 2057046) or optimize compression (SAP Note 2119087) you can check if optimizing the merge / compression configuration or defining more table partitions (SAP Note 2044468) for a better load distribution can help.

A particularly high amount of data being written in the non-critical savepoint phase can be a consequence of LOB garbage collector activities at a time when a blocked garbage collection (SAP Note 2169283) is released. The behavior is improved with SAP HANA >= 1.00.122.10 and >= 2.00.012.

Additionally, a high "prepare flush retry count" can have a significant impact on the amount of data written during the savepoint. SAP Note 2538561 describes an SAP HANA bug with Rev. 2.00.000 - 2.00.012.04 and 2.00.020 - 2.00.023; as a workaround, the savepoint_pre_critical_flush_retry_threshold parameter value can be adjusted.
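
The distinction between the two situations above can be sketched as a small Python helper. It is a simplification under stated assumptions: the function name and return strings are made up for illustration, the 900 second and 100 MB / s limits are the rule-of-thumb values from the text, and "significantly lower" is reduced to a plain comparison. The inputs correspond to the duration and SUM_SIZE_MB figures mentioned above.

```python
def classify_long_savepoint(duration_s: float, sum_size_mb: float,
                            min_duration_s: float = 900,
                            min_mb_per_s: float = 100.0) -> str:
    """Rough triage of a savepoint based on duration and written volume."""
    if duration_s < min_duration_s:
        return "not long-running"
    # Write throughput over the whole savepoint (MB_PER_S in the text).
    mb_per_s = sum_size_mb / duration_s
    if mb_per_s < min_mb_per_s:
        # Points towards I/O or flush overhead (e.g. encryption).
        return "low write throughput"
    # Throughput is fine, so the written volume itself is the suspect.
    return "high write volume"
```

A savepoint running 1000 seconds while writing 20000 MB averages 20 MB / s and would be triaged as a throughput problem, whereas the same runtime with 500000 MB written points to the data volume.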

Significant time since last savepoint M0355  

Make sure that the parameter for controlling automatic savepoints is set to a reasonable value; ideally the default value is kept:

global.ini -> [persistence] -> savepoint_interval_s

Particularly long running savepoints that do not finish successfully for a long time can also be responsible for a significant time since the last successful savepoint. See "Long running non-critical savepoint phase" above for more details.

Old database snapshots not related to backups M1830   You can look into M_SNAPSHOTS or use SQL: "HANA_IO_Snapshots" (SAP Note 1969700) in order to check for currently existing snapshots. Old snapshots can result in increased disk space requirements. See SAP Note 2039883 for more information regarding database snapshots.

7. What is the prepare flush retry count and how can I optimize it?

As already explained, the blocking phase of the savepoint requires a consistent change lock on the underlying table. In order to make sure that the blocking phase is as short as possible, SAP HANA only enters the blocking phase if it expects that the duration will not exceed a critical limit. If many memory pages are modified in parallel to the savepoint preparation it can happen that the critical limit will not be met and so SAP HANA starts another savepoint preparation. This retry activity can happen many times involving a high amount of additional write I/O.

As of SAP HANA SPS 10 you can check PREPARE_FLUSH_RETRY_COUNT in M_SAVEPOINTS in order to see if a long running savepoint has performed a high number of retries before it entered the blocking phase.

This mechanism is controlled by the following SAP HANA parameters:

Parameter Unit Default Details
global.ini -> [persistence] -> savepoint_max_pre_critical_flush_duration s

0 (SPS 08 and below)
900 (SPS 09 and above)

This parameter defines the maximum time (in seconds) that should be spent on optimizing the duration of the blocking phase. The value 0 means that there is no time limit, so a savepoint can potentially run for a very long time without being able to start the blocking phase.
global.ini -> [persistence] -> savepoint_pre_critical_flush_retry_threshold ms 3000

This parameter defines an upper limit for the expected blocking phase duration. As soon as SAP HANA assumes that the blocking phase will be below this limit, the blocking phase is entered.

Setting this parameter to a higher value (e.g. 5000 for 5 seconds or 10000 for 10 seconds) will reduce the number of retries before entering the blocking phase.

These parameters are a trade-off between savepoint and I/O write overhead on the one hand and the locking situation during the blocking phase on the other hand. In order to avoid high I/O overhead and very long savepoint times beyond 15 minutes, you can set savepoint_max_pre_critical_flush_duration to 900 without having to expect significantly worse locking behavior.
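
The retry mechanism can be illustrated with a small Python simulation. This is a toy model and not SAP HANA's actual algorithm: it assumes a constant write throughput, a constant rate at which pages are modified in parallel, and an expected blocking time estimated as remaining dirty volume divided by throughput. The function name and all arguments are hypothetical; only the two defaults (3000 ms, 900 s) come from the parameter table above.

```python
def simulate_prepare_flush(dirty_mb: float, dirty_rate_mb_s: float,
                           write_mb_s: float,
                           retry_threshold_ms: float = 3000,
                           max_flush_duration_s: float = 900,
                           max_rounds: int = 10_000):
    """Return (flush_rounds, elapsed_s, remaining_mb) at blocking phase entry."""
    elapsed_s = 0.0
    rounds = 0
    remaining_mb = dirty_mb
    for _ in range(max_rounds):  # guard against endless loops in the toy model
        expected_blocking_ms = remaining_mb / write_mb_s * 1000
        if expected_blocking_ms <= retry_threshold_ms:
            break  # blocking phase is expected to be short enough
        if max_flush_duration_s and elapsed_s >= max_flush_duration_s:
            break  # time limit reached (0 means no limit)
        flush_time_s = remaining_mb / write_mb_s
        elapsed_s += flush_time_s
        # Pages modified while flushing become the next round's work.
        remaining_mb = dirty_rate_mb_s * flush_time_s
        rounds += 1
    return rounds, elapsed_s, remaining_mb
```

In this model, a change rate close to the write throughput leads to many flush rounds. A time limit of 900 seconds cuts the preparation short at the price of a longer blocking phase, while a limit of 0 lets the rounds continue until the threshold is met.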

8. What kinds of snapshots exist?

The following types of snapshots exist:

Type SAP Note Creation Deletion
System replication snapshots 1999880 Regularly in order to provide consistent persistence state for system replication Automatically or - in exceptional cases - manually using hdbcons ("snapshot d "). See SAP Note 2222218 for more information related to hdbcons.
Backup snapshots 2039883 Support of snapshot based backups Automatically by backup procedure or - in exceptional cases - manually using "BACKUP DATA DROP SNAPSHOT" (see SAP Note 1703435 for a use case)
Restore snapshots 1642148 Created after restore of data backup Automatically at the end of the recovery
Secondary time travel snapshots 1999880

Automatic creation based on:

Automatic deletion once the retention time is exceeded:

Fallback snapshots 2768738

Manual creation for quickly restoring an earlier database state (SAP HANA >= 2.00.030, no system replication):

Subsequently it is possible to restore the database to this snapshot:

 

9. When and how are pages actually flushed to disk?

Pages are flushed by the FlushResourcesThread that may use some helpers to parallelize the workload. Starting with SAP HANA 1.00.122.16 and 2.00.024.00 helper threads can be activated / deactivated with the following parameter:

Parameter Default Details
global.ini -> [persistence] -> use_helper_threads_for_flush true

In general helper threads are of advantage as workload is distributed and parallelized.

SAP Note 2655238 describes a problem with helper threads with SAP HANA 1.00.122.16 - 1.00.122.17 and 2.00.024.00 - 2.00.024.03 that can result in small I/O write requests and the risk of significant disk fragmentation. As a workaround the helper threads can be deactivated.

The pages that are flushed are retrieved from the so-called flush queue.

The flush queue is populated in different contexts:

  • During savepoints and snapshots
  • As part of continuous page flushing
  • When the last reference to a page with temporary disposition is dropped
  • In context of container implementations like VirtualFile or VarSizeEntryContainer

Continuous page flushing is performed by the ContinuousPageFlusher thread that checks for dirty pages that haven't been modified for a certain time. If pages are found, they are put into the flush queue. The main purpose of continuous page flushing is to reduce the amount of I/O required during savepoints. The ContinuousPageFlusher activity can be controlled with the following SAP HANA parameters:

→ Dirty buffers that would have to be written during the savepoint are selected in advance and put into the queue

Parameter Default Unit Details
global.ini -> [persistence] -> continuous_flush_interval_s 60 s

Check interval, per default the ContinuousPageFlusher is activated once a minute and checks if there are dirty pages that should be flushed

A value of 0 disables the continuous page flush feature.

Attention: Due to an SAP HANA bug, continuous page flush can be responsible for corruptions with Revisions 1.00.120 to 1.00.122.03, so continuous_flush_interval_s should be set to 0 in order to disable the continuous page flush feature. See SAP Note 2370160 for more information.

global.ini -> [persistence] -> continuous_flush_threshold_s 120 s Threshold for dirty pages to be flushed, per default only pages are flushed that haven't been touched for more than two minutes

It is usually not required to adjust these settings.
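
The selection rule of the continuous page flusher can be sketched in Python as follows. This is an illustrative model only: the Page class, its fields and the function name are assumptions made for the example, while the 120 second default corresponds to continuous_flush_threshold_s as described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    page_id: int
    dirty: bool
    last_modified_s: float  # time of the last modification, in seconds


def select_pages_for_flush(pages: List[Page], now_s: float,
                           threshold_s: float = 120) -> List[int]:
    """Return IDs of dirty pages untouched for more than threshold_s seconds."""
    return [p.page_id for p in pages
            if p.dirty and now_s - p.last_modified_s > threshold_s]
```

In the real system this check would run once per continuous_flush_interval_s (60 seconds by default), and the selected pages would be put into the flush queue rather than returned to a caller.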

10. What are shadow pages?

If pages representing the consistent state at the last savepoint are changed, the "old" savepoint state is kept and additionally a page with the current data is created. These savepoint related pages are called shadow pages. If there are many shadow pages, the database size on disk increases. Typical reasons for a high number of shadow pages are:

  • Long-running savepoints
  • Low savepoint frequency
  • Mass changes
  • Table optimizations of large tables

Be aware that snapshot related page versions are not considered as shadow pages, only savepoints use this concept.

You can use SQL: "HANA_Disk_Pages" (SAP Note 1969700) in order to check for the current and historic shadow page situation.

Mini check M0383 ("Max. size of shadow pages (GB, last day)") available via SAP Note 1999993 monitors the recent shadow page situation.
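
The shadow page concept can be pictured with a toy copy-on-write model in Python. Everything here is a simplification for illustration: the class and method names are hypothetical, pages are plain dictionary entries, and the real persistence layer is far more involved.

```python
class ToyPersistence:
    """Keeps the last savepoint state next to the pages changed since then."""

    def __init__(self):
        self.savepoint_pages = {}  # page_id -> data as of the last savepoint
        self.current_pages = {}    # page_id -> data changed since then

    def write(self, page_id, data):
        # The savepoint version stays untouched; the change goes to a new page.
        self.current_pages[page_id] = data

    def shadow_page_count(self):
        # Old savepoint versions that are kept although a newer page exists.
        return sum(1 for pid in self.current_pages
                   if pid in self.savepoint_pages)

    def complete_savepoint(self):
        # The new state becomes the consistent state; old versions are freed.
        self.savepoint_pages.update(self.current_pages)
        self.current_pages.clear()
```

The model shows why long-running or infrequent savepoints and mass changes inflate the disk footprint: the longer the time between two calls of complete_savepoint and the more pages are rewritten in between, the more old versions have to be retained.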

11. What are typical initiations and purposes for savepoints?

The savepoint view M_SAVEPOINTS provides information in columns INITIATION and PURPOSE in order to clarify the context of the savepoint. Typical scenarios are:

Initiation Purpose Scenario
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] DROP_SNAPSHOT Explicit savepoint when a snapshot is dropped
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] NORMAL This constellation appears in the following scenarios:

  • Manual savepoint via "ALTER SYSTEM SAVEPOINT"
  • During SAP HANA startup (and shutdown)
  • During rollforward of a restored backup (SAP Note 1642148)
  • Repeatedly during execution of "ALTER SYSTEM RECLAIM DATAVOLUME" (SAP Note 2400005)
EXECUTED_EXPLICITLY SNAPSHOT Explicit fallback snapshot (SAP HANA >= 2.00.030)
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] SNAPSHOT_FOR_BACKUP Explicit savepoint for a backup snapshot (SAP Note 2039883)
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] SNAPSHOT_FOR_REPLICATION Explicit savepoint for a system replication snapshot (SAP Note 1999880)
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] SNAPSHOT_FOR_RESUMERE... Explicit savepoint after restore of a backup (SAP Note 1642148)
EXECUTED_EXPLICITLY / EXCECUTED_EXPLICITLY [sic!] SNAPSHOT_FOR_SECONDARY Explicit savepoints shipped from primary to secondary system replication site (once during initial data shipping, regularly during delta data shipping)
TRIGGERED_TIMEBASED NORMAL Regular standard savepoint triggered at regular intervals based on the following SAP HANA parameter (savepoint interval in seconds):

global.ini -> [persistence] -> savepoint_interval_s

12. What are reasons for snapshots being retained for a long time?

Old snapshots can significantly increase the persistence data size, because a lot of data is stored in two versions. For the following reasons snapshots can be held for a rather long time:

Scenario SAP Note Details
Stuck system replication

2345901
2425682

If system replication is stuck, the related snapshot on primary site can remain for a long time.
Increased snapshot retentions in system replication environments 1999880

An increased setting of the system replication snapshot retention time with the following parameter results in snapshots with a long life time:

Long secondary time travel retention 1999880

Snapshots taken for secondary time travel purposes are dropped once the configured retention time is exceeded:

If this parameter is set to a high retention time, snapshots remain in the system for a long time.

No deletion of fallback snapshots 2768738

Fallback snapshots need to be purged manually via:

If this cleanup isn't performed for a long time, an old fallback snapshot can remain in the system.

13. What are savepoint callbacks?

The savepoint allows other SAP HANA components to execute tasks in specific savepoint phases via callbacks. The following callbacks exist:

  • spAbortCallback
  • spBeginCallback
  • spCriticalPhaseAfterFlushCallback
  • spCriticalPhaseCallback
  • spDoneCallback
  • spEndOfCriticalPhaseCallback
  • spPostCriticalPhaseCallback
  • spPreCriticalPhaseCallback

In general you don't have to care about callbacks, but in some situations the runtime of a callback can significantly extend the savepoint runtime. In this case it can be helpful to have a closer look. Long callback runtimes of at least 10 seconds will result in the following entry in the database trace (SAP Note 2380176):

Callback ::() took ms.

The class indicates from where the callback was triggered, e.g. CheckPointMgrSavepointCallback or PersistenceManagerSPCallback.

These messages are reported by check ID T0869 ("Long runtime of savepoint callback") of SQL: "HANA_TraceFiles_MiniChecks" (SAP Note 1969700).

Keywords

 

savepoint_interval_s
ALTER SYSTEM SAVEPOINT
M_SAVEPOINT_STATISTICS
M_SAVEPOINTS
PeriodicSavepoint
runtimedump_for_blocked_savepoint_timeout
savepoint_max_pre_critical_flush_duration
savepoint_pre_critical_flush_retry_threshold