Symptom
You face issues or have questions related to savepoints in SAP HANA environments.
Environment
SAP HANA
Cause
1. What are savepoints in SAP HANA environments?
2. When is a savepoint triggered?
3. Where can I find more information related to savepoints?
4. Which indications exist for problems related to savepoints?
5. Are savepoints online operations?
6. How can typical savepoint issues be analyzed and resolved?
7. What is the prepare flush retry count and how can I optimize it?
8. What kinds of snapshots exist?
9. When and how are pages actually flushed to disk?
10. What are shadow pages?
11. What are typical initiations and purposes for savepoints?
12. What are reasons for snapshots being retained for a long time?
13. What are savepoint callbacks?
Resolution
1. What are savepoints in SAP HANA environments?
Savepoints are required to synchronize changes in memory with the persistency on disk level. All modified pages of row and column store are written to disk during a savepoint.
Each SAP HANA host and service has its own savepoints.
The data belonging to a savepoint represents a consistent state of the data on disk and remains untouched until the next savepoint operation has been completed.
The availability of a recent savepoint improves the restart time of SAP HANA, because fewer redo logs need to be applied to make the database consistent.
2. When is a savepoint triggered?
Savepoints are triggered in the following ways:
Scenario | Details |
Savepoint interval (automatic) | During normal operations savepoints are automatically triggered when a predefined time since the last savepoint has passed. The length of the time interval between two consecutive savepoints can be controlled with the following parameter: global.ini -> [persistence] -> savepoint_interval_s. Its default value is 300, so savepoints are taken in intervals of 300 seconds (5 minutes). |
System command (manual) | The following command can be used to execute a savepoint manually: ALTER SYSTEM SAVEPOINT |
Soft shutdown | A soft shutdown invokes a savepoint before the services are stopped. A hard shutdown doesn't trigger a savepoint, which can increase the subsequent restart time. |
Backup | A global savepoint is performed before a data backup is started. A savepoint is written after the backup of a specific service is finished. |
Startup | After a consistent database state is reached during startup, a savepoint is performed. |
Snapshots | Snapshots are savepoints that are preserved for longer use, so they are not overwritten by the next savepoint. |
Reclaim Datavolume | When RECLAIM DATAVOLUME is executed to defragment persistence (SAP Note 2400005), it regularly triggers savepoints. |
3. Where can I find more information related to savepoints?
The following SAP HANA views contain savepoint related information:
View | Details |
M_SAVEPOINT_STATISTICS | Global savepoint information per host and service |
M_SAVEPOINTS | Detailed information for individual savepoints |
M_SERVICE_THREADS, M_SERVICE_THREAD_SAMPLES, HOST_SERVICE_THREAD_SAMPLES | As of SAP HANA SPS 10 savepoint details are logged for THREAD_TYPE = 'PeriodicSavepoint' (see SAP Note 2114710). |
The following SQL statements of SAP Note 1969700 can be used to analyze savepoints:
SQL statement | Details |
SQL: "HANA_IO_Savepoints" | Detailed information for individual savepoints |
SQL: "HANA_IO_Snapshots" | Snapshot information |
4. Which indications exist for problems related to savepoints?
The following SAP HANA alerts indicate problems in the area of savepoints:
Alert | Name | Description |
28 | Most recent savepoint operation | Determines how long ago the last savepoint was defined, that is, how long ago a complete, consistent image of the database was persisted to disk. |
54 | Savepoint duration | Identifies long-running savepoint operations. |
66 | Storage snapshot is prepared | Determines whether or not the period, during which the database is prepared for a storage snapshot, exceeds a given threshold. |
107 | Inconsistent database fallback snapshot | Determines if an inconsistent fallback snapshot exists. |
108 | Database fallback snapshot age | Determines if a snapshot exists for an extended period of time. |
SQL: "HANA_Configuration_MiniChecks" (SAP Notes 1969700, 1999993) returns a potentially critical issue (C = 'X') for one of the following individual checks:
Check ID | Details |
M0346 | Long waitForLock savepoint phases (last day) |
M0348 | Long critical savepoint phases (last day) |
M0350 | Blocking savepoint phases > 10 s (last day) |
M0351 | Blocking savepoint phase avg. (s, last day) |
M0352 | Blocking savepoint phase max. (s, last day) |
M0355 | Time since last savepoint (s) |
M0356 | Savepoint crit. phase write throughput (MB/s) |
M0357 | Savepoint write throughput (MB/s) |
M0358 | Savepoints taking longer than 900 s (last day) |
M0380 | Age of oldest backup snapshot (days) |
M0381 | Age of oldest fallback snapshot (days) |
M0383 | Max. size of shadow pages (GB, last day) |
M0385 | Savepoint vol. per day vs. data (%, last week) |
M0386 | Max. savepoint prepare flush retries (current) |
M0387 | Avg. savepoint prepare flush retries (current) |
M1830 | Age of oldest replication snapshot (h) |
SQL: "HANA_TraceFiles_MiniChecks" (SAP Notes 1969700, 2380176) returns one of the following check IDs:
Check ID | Details |
T0869 | Long runtime of savepoint callback |
INSERT / UPDATE / DELETE threads may be blocked by a savepoint if they are in state SharedLockEnter waiting for a lock of type ConsistentChangeLock. See SAP Note 1999998 for more information.
5. Are savepoints online operations?
The majority of the savepoint is performed online without holding a lock, but the finalization of the savepoint requires a lock. This step is called the blocking phase of the savepoint. It consists of two major phases:
Phase | Sub phase | Mini check | Thread detail | Description |
Blocking | WaitForLock | M0346 ("Long waitForLock savepoint phases (last day)") | enterCriticalPhase(waitForLock) | Before the critical phase is entered, the savepoint needs to acquire a ConsistentChangeLock. If this lock is held by other threads / transactions, the duration of this phase increases. At the same time all other modifications on the underlying table like INSERT, UPDATE or DELETE are blocked by the savepoint with ConsistentChangeLock. → In short: a ConsistentChangeLock has to be acquired before the critical phase is entered, and concurrent modifications are blocked while waiting. |
Blocking | Critical | M0348 ("Long critical savepoint phases (last day)") | processCriticalPhase | Once the ConsistentChangeLock is acquired, the actual critical phase is entered and the remaining I/O writes are performed in order to guarantee a consistent set of data on disk level. During this time other transactions aren't allowed to perform changes on the underlying table and are blocked with ConsistentChangeLock. → In short: once the ConsistentChangeLock is held, the critical phase runs; I/O writes are performed for disk-level consistency, so other transactions cannot make changes. |
Usually the blocking phase shouldn't take longer than 1 to 2 seconds.
→ The final step of the savepoint requires a lock, but it is normally finished within about 1 to 2 seconds (the impact of this still needs to be verified?).
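The 1 to 2 second guideline for the blocking phase can be turned into a small check. The following is a minimal sketch, not SAP tooling: the record type, the field names and the 2 second limit are assumptions for illustration, and the durations would have to come from your own savepoint monitoring data.

```python
from dataclasses import dataclass

# Upper end of the "1 to 2 seconds" guideline; assumed limit for this sketch.
BLOCKING_LIMIT_S = 2.0


@dataclass
class SavepointPhases:
    """Durations of the two blocking sub phases in seconds (hypothetical record)."""
    wait_for_lock_s: float
    critical_s: float

    @property
    def blocking_s(self) -> float:
        # The blocking phase is the sum of the waitForLock and critical sub phases.
        return self.wait_for_lock_s + self.critical_s


def dominant_phase(sp: SavepointPhases, limit_s: float = BLOCKING_LIMIT_S):
    """Return None if the blocking phase is within the limit,
    otherwise the name of the sub phase that contributes most."""
    if sp.blocking_s <= limit_s:
        return None
    return "waitForLock" if sp.wait_for_lock_s >= sp.critical_s else "critical"
```

A result of "waitForLock" points to lock contention, while "critical" points to the I/O writes performed while the lock is held.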
6. How can typical savepoint issues be analyzed and resolved?
You can use the following approaches to analyze and resolve typical savepoint issues. Column "Check ID" refers to the related check ID of the SAP HANA Mini Checks (SAP Note 1999993).
Symptoms | Check ID | Thread detail | Details |
Long critical phase | M0348 | processCriticalPhase |
- Delays during the critical phase are often caused by problems in the disk I/O area. See SAP Note 1999930 for further information about analyzing the SAP HANA I/O performance. Pay particular attention to the trigger read and write ratios, because they can indicate increased amounts of synchronous I/O (SAP Note 1930979).
- You can check the savepoint write throughput via SQL: "HANA_Configuration_MiniChecks" ('Savepoint write throughput (MB/s)') of SAP Note 1999993. Values below the expectation can be caused by problems in the I/O area or by significant flush overhead outside of the I/O area like data volume encryption.
- Be aware that a long waitForLock phase (see below) can increase the runtime of the critical phase because all changes introduced during the waitForLock phase need to be written to disk. So if both the critical phase and the waitForLock phase take longer for the same savepoint, you should focus on analyzing the waitForLock phase first.
- Specific tasks of the critical savepoint phase (e.g. processing of private log buffers) require JobWorker threads, so a lack of JobWorker threads (i.e. a higher demand than configured via global.ini -> [execution] -> max_concurrency, SAP Note 2222250) can result in increased critical savepoint phases.
- Another reason for long critical phase runtimes are situations where the non-critical flush phase is left before the majority of changed blocks was flushed to disk. Make sure that the related SAP HANA thresholds (global.ini -> [persistence] -> savepoint_max_pre_critical_flush_duration, global.ini -> [persistence] -> savepoint_pre_critical_flush_retry_threshold) are set to reasonable values (typically default) and see "Long running non-critical savepoint phase" below for a description of other root causes.
- In rare cases the critical phase can be dominated by accesses to the secondary system replication site (SAP Note 1999880) that are required for synchronization purposes. In these cases a DisasterRecovery module will show up in the savepoint call stack (SAP Note 2313619), and you have to check for issues in the system replication environment.
- Another reason for long critical phases is an extraordinarily high amount of modification operations (e.g. BW full loads or individual issues like the significant /IWFND/SU_STATS INSERTs in context of SAP Note 2293307). Therefore you should check for frequent modification operations like INSERT, UPDATE or DELETE executed against the database. See SAP Note 2000002 for more information regarding SQL statement analysis. |
Long waitForLock phase | M0346 | enterCriticalPhase(waitForLock) |
- Long durations of the blocking phase (outside of the critical phase) are typically caused by SAP HANA internal lock contention.
- Starting with Rev. 1.00.102 you can configure the parameter runtimedump_for_blocked_savepoint_timeout in order to trigger a runtime dump (SAP Note 2400007) in case waiting for entering the critical phase takes longer than the configured number of seconds. Per default a maximum of one runtime dump is created within 24 hours. Starting with SAP HANA 1.00.122.07 this limit can be adjusted. |
Long running non-critical savepoint phase | M0358 | flushPagesinNonCriticalPhase |
- You can use SQL: "HANA_IO_Savepoints" of SAP Note 1969700 with a MIN_SAVEPOINT_DURATION_S setting to check for long running savepoints. A good value can be 900, which means that only savepoints with a duration of more than 15 minutes are displayed. Depending on the output of this command the following situations can be distinguished.
- Long running savepoints due to low write throughput: If the write throughput (MB_PER_S) is significantly lower than 100 MB / s, you should check the I/O write performance based on SAP Note 1999930 and analyze the non-I/O components of the page flushes via SQL: "HANA_IO_Flushes_Details" (SAP Note 1969700). If you face a high runtime for encryption, you can consider disabling data volume encryption as a workaround. Furthermore an improvement in encryption performance is available with SAP HANA >= 1.00.122.14 and >= 2.00.012.04.
- Long running savepoints due to high I/O write volume: If the amount of data written to disk (SUM_SIZE_MB) is much higher than expected, you should first check via SQL: "HANA_Tables_IOStatistics" (ORDER_BY = 'WRITE') of SAP Note 1969700 for specific tables with a high amount of I/O writes. Check for these tables from an application perspective if you can reduce the amount of changes. If writes are linked to delta merges (SAP Note 2057046) or optimize compression (SAP Note 2119087), you can check if optimizing the merge / compression configuration or defining more table partitions (SAP Note 2044468) for a better load distribution can help.
- A particularly high amount of data being written in the non-critical savepoint phase can be a consequence of LOB garbage collector activities at a time when a blocked garbage collection (SAP Note 2169283) is released. The behavior is improved with SAP HANA >= 1.00.122.10 and >= 2.00.012.
- Additionally it is possible that a high "prepare flush retry count" has a significant impact on the amount of data written during the savepoint. SAP Note 2538561 describes a SAP HANA bug with Rev. 2.00.000 - 2.00.012.04 and 2.00.020 - 2.00.023; as a workaround the savepoint_pre_critical_flush_retry_threshold parameter value can be adjusted. |
Significant time since last savepoint | M0355 | | Make sure that the parameter for controlling automatic savepoints (global.ini -> [persistence] -> savepoint_interval_s) is set to a reasonable value; optimally the default value is kept. Particularly long running savepoints that are not successfully finished for a long time can also be responsible for a significant time since the last successful savepoint. See "Long running non-critical savepoint phase" above for more details. |
Old database snapshots not related to backups | M1830 | | You can look into M_SNAPSHOTS or use SQL: "HANA_IO_Snapshots" (SAP Note 1969700) in order to check for currently existing snapshots. Old snapshots can result in increased disk space requirements. See SAP Note 2039883 for more information regarding database snapshots. |
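The distinction between low write throughput and high write volume for long running savepoints can be sketched as a small classification. This is only an illustration: the function and its parameters are hypothetical, the expected size has to be supplied by you, and the factor of 2 used for "much higher than expected" is an assumption, not an SAP threshold.

```python
def classify_long_savepoint(duration_s, size_mb, expected_size_mb,
                            min_duration_s=900, min_mb_per_s=100.0,
                            volume_factor=2.0):
    """Return a list of findings for one savepoint.

    duration_s       -- total savepoint duration in seconds
    size_mb          -- amount of data written by the savepoint in MB
    expected_size_mb -- what you consider a normal volume (assumption)
    """
    # Only savepoints at or above the duration filter are of interest.
    if duration_s < min_duration_s:
        return []

    findings = []
    mb_per_s = size_mb / duration_s
    if mb_per_s < min_mb_per_s:
        # Points to the I/O stack or to flush overhead such as encryption.
        findings.append("low write throughput")
    if size_mb > volume_factor * expected_size_mb:
        # Points to tables with many writes, merges or LOB garbage collection.
        findings.append("high write volume")
    return findings
```

For example, a savepoint that wrote 50000 MB in 1000 seconds runs at 50 MB/s and would be reported with "low write throughput".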
7. What is the prepare flush retry count and how can I optimize it?
As already explained, the blocking phase of the savepoint requires a consistent change lock on the underlying table. In order to make sure that the blocking phase is as short as possible, SAP HANA only enters the blocking phase if it expects that the duration will not exceed a critical limit. If many memory pages are modified in parallel to the savepoint preparation it can happen that the critical limit will not be met and so SAP HANA starts another savepoint preparation. This retry activity can happen many times involving a high amount of additional write I/O.
As of SAP HANA SPS 10 you can check PREPARE_FLUSH_RETRY_COUNT in M_SAVEPOINTS in order to see if a long running savepoint has performed a high number of retries before it entered the blocking phase.
This mechanism is controlled by the following SAP HANA parameters:
Parameter | Unit | Default | Details |
global.ini -> [persistence] -> savepoint_max_pre_critical_flush_duration | s | 0 (SPS 08 and below) | This parameter defines the maximum time (in seconds) that should be spent for optimizing the duration of the blocking phase. The value 0 means that there is no time limit and so a savepoint can potentially run for a very long time without being able to start the blocking phase. |
global.ini -> [persistence] -> savepoint_pre_critical_flush_retry_threshold | ms | 3000 | This parameter defines an upper limit for the expected blocking phase duration. As soon as SAP HANA assumes that the blocking phase will be below this limit, the blocking phase is entered. Setting this parameter to a higher value (e.g. 5000 for 5 seconds or 10000 for 10 seconds) will reduce the number of retries before entering the blocking phase. |
These parameters are a trade-off between savepoint and I/O write overhead on the one hand and the locking situation during the blocking phase on the other hand. In order to avoid high I/O overhead and very long savepoint times beyond 15 minutes, you can set savepoint_max_pre_critical_flush_duration to 900 without having to expect a significant negative impact on locking.
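The interplay of the two parameters can be illustrated with a toy simulation. This is a deliberately simplified model and not the actual SAP HANA algorithm: it assumes constant page modification and flush rates, and all names (simulate_prepare_flush, dirty_rate, flush_rate, max_rounds) are made up for this sketch.

```python
def simulate_prepare_flush(dirty_pages, dirty_rate, flush_rate,
                           retry_threshold_ms=3000,
                           max_flush_duration_s=900,
                           max_rounds=10_000):
    """Simulate flush rounds before the blocking phase is entered.

    dirty_pages -- modified pages at savepoint start
    dirty_rate  -- pages modified per second while flushing (assumed constant)
    flush_rate  -- pages written per second (assumed constant)
    Returns (retries, pages_flushed, elapsed_s).
    """
    elapsed_s = 0.0
    flushed = 0.0
    for retries in range(max_rounds):  # safety guard for the "no limit" case
        round_s = dirty_pages / flush_rate
        flushed += dirty_pages
        elapsed_s += round_s
        # Pages changed during this round still have to be written.
        dirty_pages = dirty_rate * round_s
        expected_blocking_ms = dirty_pages / flush_rate * 1000
        if expected_blocking_ms <= retry_threshold_ms:
            return retries, flushed, elapsed_s  # blocking phase is short enough
        if max_flush_duration_s > 0 and elapsed_s >= max_flush_duration_s:
            return retries, flushed, elapsed_s  # time limit reached, enter anyway
    return max_rounds, flushed, elapsed_s


# Higher retry threshold -> fewer retries and less additional write I/O.
low = simulate_prepare_flush(100_000, 8_000, 10_000, retry_threshold_ms=3000)
high = simulate_prepare_flush(100_000, 8_000, 10_000, retry_threshold_ms=5000)
```

In this model a maximum duration of 0 combined with a modification rate at or above the flush rate never converges, which mirrors the remark that a savepoint can run for a very long time without a time limit.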
8. What kinds of snapshots exist?
The following types of snapshots exist:
Type | SAP Note | Creation | Deletion |
System replication snapshots | 1999880 | Regularly in order to provide consistent persistence state for system replication | Automatically or - in exceptional cases - manually using hdbcons ("snapshot d "). See SAP Note 2222218 for more information related to hdbcons. |
Backup snapshots | 2039883 | Support of snapshot based backups | Automatically by backup procedure or - in exceptional cases - manually using "BACKUP DATA DROP SNAPSHOT" (see SAP Note 1703435 for a use case) |
Restore snapshots | 1642148 | Created after restore of data backup | Automatically at the end of the recovery |
Secondary time travel snapshots | 1999880 | Automatic creation based on: | Automatic deletion once the retention time is exceeded: |
Fallback snapshots | 2768738 | Manual creation for quickly restoring an earlier database state (SAP HANA >= 2.00.030, no system replication): Subsequently it is possible to restore the database to this snapshot: |
9. When and how are pages actually flushed to disk?
Pages are flushed by the FlushResourcesThread that may use some helpers to parallelize the workload. Starting with SAP HANA 1.00.122.16 and 2.00.024.00 helper threads can be activated / deactivated with the following parameter:
Parameter | Default | Details |
global.ini -> [persistence] -> use_helper_threads_for_flush | true | In general helper threads are of advantage as workload is distributed and parallelized. SAP Note 2655238 describes a problem with helper threads with SAP HANA 1.00.122.16 - 1.00.122.17 and 2.00.024.00 - 2.00.024.03 that can result in small I/O write requests and the risk of significant disk fragmentation. As a workaround the helper threads can be deactivated. |
The pages that are flushed are retrieved from the so-called flush queue.
The flush queue is populated in different contexts:
- During savepoints and snapshots
- As part of continuous page flushing
- When the last reference to a page with temporary disposition is dropped
- In context of container implementations like VirtualFile or VarSizeEntryContainer
Continuous page flushing is performed by the ContinuousPageFlusher thread that checks for dirty pages that haven't been modified for a certain time. If pages are found, they are put into the flush queue. The main purpose of continuous page flushing is to reduce the amount of I/O required during savepoints. The ContinuousPageFlusher activity can be controlled with the following SAP HANA parameters:
→ The dirty buffers that have to be written during the savepoint are selected in advance and put into the queue.
Parameter | Default | Unit | Details |
global.ini -> [persistence] -> continuous_flush_interval_s | 60 | s | Check interval: per default the ContinuousPageFlusher is activated once a minute and checks if there are dirty pages that should be flushed. A value of 0 disables the continuous page flush feature. Attention: Due to a SAP HANA bug continuous page flush can be responsible for corruptions with Revisions 1.00.120 to 1.00.122.03, so continuous_flush_interval_s should be set to 0 on these Revisions in order to disable the continuous page flush feature. See SAP Note 2370160 for more information. |
global.ini -> [persistence] -> continuous_flush_threshold_s | 120 | s | Threshold for dirty pages to be flushed, per default only pages are flushed that haven't been touched for more than two minutes |
It is usually not required to adjust these settings.
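The selection rule of continuous page flushing can be sketched as follows. This is a conceptual illustration only: the page representation and the function name are hypothetical, and only the default of 120 seconds is taken from continuous_flush_threshold_s.

```python
def pages_for_flush_queue(pages, now_s, threshold_s=120):
    """Select dirty pages that have not been modified for more than threshold_s.

    pages -- dict mapping page id -> (is_dirty, last_modified_s); hypothetical model
    now_s -- current time in seconds on the same clock as last_modified_s
    """
    return [
        page_id
        for page_id, (is_dirty, last_modified_s) in pages.items()
        if is_dirty and now_s - last_modified_s > threshold_s
    ]


# One check of the flusher at t = 1000 s.
pages = {
    "p1": (True, 700),    # dirty, untouched for 300 s -> flushed
    "p2": (True, 950),    # dirty, but modified 50 s ago -> left alone
    "p3": (False, 100),   # clean -> nothing to flush
}
queue = pages_for_flush_queue(pages, now_s=1000)
```

Pages that are still being modified stay in memory, so they are not written repeatedly between two savepoints.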
10. What are shadow pages?
If pages representing the consistent state at the last savepoint are changed, the "old" savepoint state is kept and additionally a page with the current data is created. These savepoint related pages are called shadow pages. If there are many shadow pages, the database size on disk increases. Typical reasons for a high number of shadow pages are:
- Long-running savepoints
- Low savepoint frequency
- Mass changes
- Table optimizations of large tables
Be aware that snapshot related page versions are not considered as shadow pages, only savepoints use this concept.
You can use SQL: "HANA_Disk_Pages" (SAP Note 1969700) in order to check for the current and historic shadow page situation.
Mini check M0383 ("Max. size of shadow pages (GB, last day)") available via SAP Note 1999993 monitors the recent shadow page situation.
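The shadow page concept can be illustrated with a small copy-on-write model. It is a conceptual sketch and not the SAP HANA page management: the class and its methods are hypothetical, and the retained old page versions are what this model counts as shadow pages.

```python
class ShadowPagedStore:
    """Toy model: the state of the last savepoint is never overwritten in place."""

    def __init__(self):
        self.savepoint_pages = {}  # page id -> content as of the last savepoint
        self.current_pages = {}    # page id -> content written since then

    def write(self, page_id, content):
        # A change creates a new page version; the savepoint version is kept.
        self.current_pages[page_id] = content

    def shadow_page_count(self):
        # Old versions that only exist because a newer version was written.
        return sum(1 for page_id in self.current_pages
                   if page_id in self.savepoint_pages)

    def savepoint(self):
        # The new versions become the consistent state, old versions are freed.
        self.savepoint_pages.update(self.current_pages)
        self.current_pages.clear()


store = ShadowPagedStore()
store.write("p1", "a")
store.savepoint()
store.write("p1", "b")   # p1 now exists in two versions
store.write("p2", "c")   # new page, no old version to keep
shadow_before = store.shadow_page_count()
store.savepoint()
shadow_after = store.shadow_page_count()
```

The model shows why mass changes and long intervals between savepoints increase the disk footprint: every changed page is stored twice until the next savepoint completes.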
11. What are typical initiations and purposes for savepoints?
The savepoint view M_SAVEPOINTS provides information in columns INITIATION and PURPOSE in order to clarify the context of the savepoint. Typical scenarios are:
Initiation | Purpose | Scenario |
EXECUTED_EXPLICITLY | DROP_SNAPSHOT | Explicit savepoint when a snapshot is dropped |
EXECUTED_EXPLICITLY, EXCECUTED_EXPLICITLY [sic!] | NORMAL | This constellation appears in the following scenarios: |
EXECUTED_EXPLICITLY | SNAPSHOT | Explicit fallback snapshot (SAP HANA >= 2.00.030) |
EXECUTED_EXPLICITLY, EXCECUTED_EXPLICITLY [sic!] | SNAPSHOT_FOR_BACKUP | Explicit savepoint for a backup snapshot (SAP Note 2039883) |
EXECUTED_EXPLICITLY, EXCECUTED_EXPLICITLY [sic!] | SNAPSHOT_FOR_REPLICATION | Explicit savepoint for a system replication snapshot (SAP Note 1999880) |
EXECUTED_EXPLICITLY, EXCECUTED_EXPLICITLY [sic!] | SNAPSHOT_FOR_RESUMERE... | Explicit savepoint after restore of a backup (SAP Note 1642148) |
EXECUTED_EXPLICITLY, EXCECUTED_EXPLICITLY [sic!] | SNAPSHOT_FOR_SECONDARY | Explicit savepoints shipped from primary to secondary system replication site (once during initial data shipping, regularly during delta data shipping) |
TRIGGERED_TIMEBASED | NORMAL | Standard savepoint triggered on a regular basis based on the following SAP HANA parameter (savepoint interval in seconds): global.ini -> [persistence] -> savepoint_interval_s |
12. What are reasons for snapshots being retained for a long time?
Old snapshots can significantly increase the persistence data size, because a lot of data is stored in two versions. For the following reasons snapshots can be held for a rather long time:
Scenario | SAP Note | Details |
Stuck system replication | | If system replication is stuck, the related snapshot on primary site can remain for a long time. |
Increased snapshot retentions in system replication environments | 1999880 | An increased setting of the system replication snapshot retention time with the following parameter results in snapshots with a long life time: |
Long secondary time travel retention | 1999880 | Snapshots taken for secondary time travel purposes are dropped once the configured retention time is exceeded: If this parameter is set to a high retention time, snapshots remain in the system for a long time. |
No deletion of fallback snapshots | 2768738 | Fallback snapshots need to be purged manually via: If this cleanup isn't performed for a long time, an old fallback snapshot can remain in the system. |
13. What are savepoint callbacks?
The savepoint allows other SAP HANA components to execute tasks in specific savepoint phases via callbacks. The following callbacks exist:
- spAbortCallback
- spBeginCallback
- spCriticalPhaseAfterFlushCallback
- spCriticalPhaseCallback
- spDoneCallback
- spEndOfCriticalPhaseCallback
- spPostCriticalPhaseCallback
- spPreCriticalPhaseCallback
In general you don't need to pay attention to callbacks, but in some situations the runtime of a callback can significantly extend the savepoint runtime. In this case it can be helpful to have a closer look. Long callback runtimes of at least 10 seconds will result in the following entry in the database trace (SAP Note 2380176):
Callback <class>::<callback>() took <duration> ms.
The class indicates from where the callback was triggered, e.g. CheckPointMgrSavepointCallback or PersistenceManagerSPCallback.
These messages are reported by check ID T0869 ("Long runtime of savepoint callback") of SQL: "HANA_TraceFiles_MiniChecks" (SAP Note 1969700).
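If you want to scan trace files yourself, the callback messages can be extracted with a regular expression. The following is a minimal sketch under assumptions: the exact message layout (class name, "::", callback name, "()", duration in ms) is inferred from the description above, the sample lines are made up, and the function name is hypothetical.

```python
import re

# Assumed layout: "Callback <class>::<callback>() took <duration> ms"
CALLBACK_RE = re.compile(
    r"Callback (?P<cls>\w+)::(?P<callback>\w+)\(\) took (?P<ms>\d+) ms"
)


def long_callbacks(lines, min_ms=10_000):
    """Return (class, callback, duration_ms) for callbacks of at least min_ms."""
    hits = []
    for line in lines:
        match = CALLBACK_RE.search(line)
        if match and int(match.group("ms")) >= min_ms:
            hits.append((match.group("cls"),
                         match.group("callback"),
                         int(match.group("ms"))))
    return hits


# Made-up sample lines for illustration only.
sample = [
    "Callback PersistenceManagerSPCallback::spCriticalPhaseCallback() took 15234 ms.",
    "Callback CheckPointMgrSavepointCallback::spDoneCallback() took 500 ms.",
    "some unrelated trace line",
]
found = long_callbacks(sample)
```

Grouping the hits by class and callback name shows which component extends the savepoint runtime most often.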
Keywords
savepoint_interval_s
ALTER SYSTEM SAVEPOINT
M_SAVEPOINT_STATISTICS
M_SAVEPOINTS
PeriodicSavepoint
runtimedump_for_blocked_savepoint_timeout
savepoint_max_pre_critical_flush_duration
savepoint_pre_critical_flush_retry_threshold