For Swiss Mobiliar, saving CPU cycles was the main driver to migrate early to Db2 for z/OS V12 (Db2 12). Now, a few months after the migration, I will share the results at the IDUG (International Db2 Users Group) North America Db2 Tech Conference in Charlotte, North Carolina. Which benefits were available immediately after the migration? How did we identify objects, which needed a change to become eligible for new functions? What had to be done to fully exploit the new performance options? I will focus not only on the well-known features (Index Fast Traversal Blocks (FTB), Contiguous Buffer Pools, Insert Algorithm 2) but also on some more hidden new functions (index selection, predicate optimization, IN-list direct table access). During my presentation in Charlotte, I will also touch a few queries, which had to be changed to benefit from new functions, and answer questions like “Will Index FTB make Index Lookaside become obsolete?”. Eventually, I will also talk about how we measured and presented the performance results to middle and upper level management.
So, if you are interested in
- the Db2 12 performance experiences from a customer’s point of view,
- learning how much CPU benefit Swiss Mobiliar took immediately after the migration,
- examining what needs to be done to fully exploit the Db2 12 performance features,
- getting the most out of Index Fast Traversal Blocks, Contiguous Buffer Pools and Insert Algorithm 2,
- translating CPU improvements into real cost savings,
this is the conference and the session to attend.
The 2019 IDUG Db2 North American Tech Conference is the premier training and networking event dedicated exclusively to IBM’s industry-leading Db2 product family. This thoughtfully designed event features Db2 leaders and developers from around the globe to provide the technical sessions and in-depth seminars you need! Join me as I speak about Db2 12 performance. Visit https://www.idug.org/page/na2019 for information on registration, education, and more!
If for any reason you can’t make it to Charlotte, here are some of the highlights of my presentation:
Contiguous Buffer Pools
Buffer pool tuning has always been and will certainly remain a cornerstone of Db2 system tuning. Before we will discuss the Db2 12 enhancements in this area, just remember that earlier releases of Db2 already improved buffer pool performance: For example, when new page stealing algorithms were introduced, such as FIFO or when, in Db2 V10, the new page stealing (PGSTEAL) option NONE was introduced. Tablespaces and indexes defined with a PGSTEAL(NONE) buffer pool were memory-resident, and according to Db2 Redbook, “Performance benefits in elapsed time and CPU time can be expected with PGSTEAL(NONE) buffer pools when objects completely fit into a buffer pool. In such cases, objects are loaded into the buffer pools on their first access and subsequent I/O’s can be avoided”. Starting with Db2 V10, Swiss Mobiliar was already heavily using PGSTEAL(NONE) buffer pools to benefit from avoiding sync I/O operations.
Db2 12 improved the performance of PGSTEAL(NONE) buffer pools by following an in-memory layout which allows to directly access the buffer pool pages without having to touch a hash table first. Therefore, the name of this type of buffer pool was changed from in-memory buffer pool to contiguous buffer pool. Additionally, the LRU chain management for this type of pool was eliminated.
Having more than every fourth getpage addressing an object residing in one of our PGSTEAL(NONE) buffer pools, we expected a significant CPU saving, starting at day 1 after the migration. To be more precise, we expected a 4% CPU reduction from this feature only.
PGSTEAL(NONE) defined buffer pools will be primed by sequential access upon first getpage of an object. No further I/O is necessary for any page of this object. As sequential prefetch will be turned off after priming the pool for each object, this will help to reduce the number of prefetch engines used and to avoid situations with “no prefetch engines available”: Not only for this pool, but for the whole data sharing member (or subsystem in a non-data-sharing environment).
If you have loaded more table and index pages than the size of the pool allows, additional pages are using an area called overflow area, and pages are stolen using a FIFO approach. Therefore, make sure to set VPSIZE large enough and – like for any other buffer pool – check with your system administrator that they are all backed by real memory.
But which tablespaces and indexes qualify for this «VIP» treatment? At Swiss Mobiliar, a machine learning algorithm has been trained to answer this question. In more detail, we use a Random Forest algorithm with Getpage Activity (No of getpages), I/O Activity (No of sync I/O, No of async I/O, No of writes), Size (No of Pages) and Growth (% of growth in number of rows or pages per time interval) as its features to classify any recently created object into the type of page stealing method (FIFO, LRU or NONE) which best corresponds to the feature values monitored over a time period of a few days. This is a paper of its own; for more information about that, just drop me a mail.
Back to operating contiguous buffer pools: Monitor to check whether the pool is defined large enough, or if pages were written to the overflow area. When the overflow area is used, consider making the buffer pool (VPSIZE) larger. Message DSNB604I is written to the Db2 DSNMSTR when pages are using the overflow area. This message contains the name of the dataset. Furthermore, the –DIS BPOOL DETAIL command counts the number of accesses to the overflow area of a PGSTEAL(NONE) buffer pool.
With these commands in mind, it is worth a test to make use of this improved kind of buffer pool. To get started, a simple query to select candidate objects will be presented and discussed during my speech in Charlotte. Any object, small enough to fit entirely in a buffer pool and with a high getpage rate, is a good candidate. At Swiss Mobiliar, more than every fourth getpage is targeted against a buffer pool defined with PGSTEAL(NONE).
IN-List Direct Table Access
This is one of the lesser prominent improvements in Db2 12, but it was one of the most important improvements for our workload.
SELECT … FROM T1 WHERE C1 IN (?, ?, ,,, ,?) AND C2=?
are a frequently used query pattern in our workload. Db2 12 addresses these queries by selecting a more efficient access path.
IN-list direct table access occurs when Db2 uses in-memory tables to process one or more IN-lists where an index key matches the IN-list predicates. Before Db2 12, IN-list direct table access was limited to the use of a nested loop join with list prefetch. Db2 12 removes this limitation and chooses between nested loop join and hybrid join based on estimated cost.
DB2 12 improved performance of IN-lists in two ways:
- Non-matching IN-lists with more than 128 entries now benefit from index screening.
- Matching IN-lists are loaded into an in-memory table and then joined to the base table using hybrid join (if only a few RID qualify for each probe), or nested loop join. Hybrid join can be more efficient because of the way the qualifying RIDs are first accumulated before making the list prefetch requests. When Db2 12 chooses to use hybrid join with IN-list direct table access, the qualifying RIDs are not sorted (SORTN_JOIN=N) to avoid possible CPU regressions. Nested loop join with list prefetch is still selected by the optimizer for IN-list direct table access if the inner table access is estimated to contain enough qualifying RIDs per probe such that the use of nested loop join is more cost effective than the use of hybrid join.
At the first day of Db2 12, we saw a very important part of these queries using an in-memory table combined with either hybrid join or nested loop join. Most of these queries did not use an in-memory table in V11, and walked sequentially through the IN-list, instead of using list prefetch as it is done in V12 now.
Day 1 Review
Considering contiguous buffer pool and IN-list direct table access improvements only, we came up with an estimate of 7% CPU reduction for our dynamic queries, and of 4% for our static queries, which were rebound immediately after the migration, but were still using the V11 access paths after being rebound with the APREUSE bind option.
A CPU benefit of 9.1% was measured after day 1! Don’t expect similar improvements, if only 0-2% of your getpages address a PGSTEAL(NONE) buffer: Remember, we have more than 25% getpages directed to a PGSTEAL(NONE) pool. But obviously, other features have also contributed to this exceptional result.
After day 1, our expectations to save CPU cycles were already exceeded.
Miscellaneous Improvements Matching Our Workload
The following improvements particularly match our workload characteristics, and are all available in function level V12R1M100 (aka compatibility mode) already:
- Increased number of Db2 prefetch engines from 600 to 900. At least, the number of disabled prefetches due to unavailable read engines went down to almost zero.
- Declared Global Temporary Tables (DGTT) improvements: Db2 12 provides an in-memory copy of the catalog information that is required to declare a global temporary table. Instead of looking up Db2 catalog or directory information whenever a DGTT is created, this in-memory catalog copy is used.
- RELEASE(DEALLOCATE) improvements: Before Db2 12, all claims that were held by an agent are released (declaimed) at COMMIT time, even for packages bound with the RELEASE (DEALLOCATE) option. In Db2 12, Db2 skips most declaim activity at COMMIT time for packages which are bound with RELEASE (DEALLOCATE). In this way, the CPU cost of declaiming at commit time is avoided. This process also avoids the CPU cost of reacquiring the claims again after the commit. We thought that this also contributed to the increasing performance of our batch processes.
- Shorter code path for singleton SELECTs with unique index access: Db2 12 provides a shorter code path to perform matching key index scan for this type of SELECT statement. By using this specialized code path, Db2 can avoid having to use the general runtime environment and the overhead that is associated with its ability to handle all variations of SELECT INTO statements.
Day 2: A Negative Surprise
Remember when I was talking about the following access pattern just a few lines above: In-memory tables followed by list prefetch to the base tables. And we were kind of proud about the performance benefits we gained. Early morning of the second day, we had thousands of queries running in parallel, all performing IN-list accesses, and all producing tablespace scans instead of list prefetch operations! With response times of minutes instead of milliseconds. And CPU usage including zIIP processors at 100%. No idea why this happened at the second day and not already at the first day. Not even to speak about ever having seen anything like this at the preprod system. And we did careful performance tests in pre-production before. So, what happened? Didn’t we order enough memory? Hard to analyze, because it was literally impossible to access any details with reasonable performance. We started to cancel queries, which used the in-memory access pattern, changed some catalog statistics by emergency in order to favor accesses without in-memory tables, but they were too many to address. So eventually we decided to recycle Db2.
After restarting Db2 and increasing our available CPU capacity within the limits supported by our contract, we started to investigate. Immediately after restart, everything worked fine, accesses used the in-memory pattern in an efficient way as they did the day before. Then, suddenly, as if a switch was pressed, we faced the same problems again! But this time we could match the queries which were running at the time the “switch” was pressed, with the queries running earlier that morning. And we found a single query as a candidate which could have produced this whole situation.
In a nutshell: A simple change to an existing query text had a very large impact! Without going too much into details, this query simply monopolized all the RID lists available in our subsystem and freed them only after cancelling the query. The same query was repeatedly running for a couple of days before the migration, but without any significant impact on overall performance.
SELECT * FROM T01 WHERE EXISTS ( SELECT * FROM T02 WHERE T01.C1=T02.C1 AND T2.C2 >= '01.01.2018' OR T02.C1 IN (1, 2, 3));
Optimizing it was easy and quickly done, it had just to be corrected by setting additional parentheses at the places where they were missing before (at the beginning and the end of the last line). But we were not sure, if there would be other queries with a similar impact, so we decided to fall back to V11, analyze all query workload to scan for similar queries while at the same time looking both for a way to circumvent the problem and for a fix by IBM to permanently avoid it.
Three weeks later, we were back with V12, this time enforcing parallel SQL for dynamic queries with a maximum parallel degree of 2 (Zparms CDSSRDEF=’ANY’ and PARAMDEG=2): Initially, this setting was planned for a future time in the migration process, but as it proved to be a useful and safe way to circumvent the problems produced by queries like the one above, we decided to re-migrate while enforcing the usage of SQL query parallelism for qualifying queries.
Index Fast Traversal Blocks
The typical matching index access is performed by an index probe (index b-tree traversal): Starting at the root page, getpage operations will be fired for each level of the index B-tree, until the leaf page will eventually be reached. If the access pattern is skip-sequential, which means that the probe access is repeated many times for the same leaf page or leaf pages physically located close to each other, a technique called index lookaside will be used: After an initial index traversal, Db2 checks for any subsequent accesses to the same index, if the searched row is found on the same leaf page, or on the immediately higher non-leaf page. A sweet spot for this technique is a nested loop join with the same index key sequence of the inner table index as of the outer table.
However, if the access pattern to the index is completely random, this technique doesn’t help. Therefore, Db2 12 introduced a new index access optimization technique called Index Fast Traversal Blocks: All non-leaf pages of a pageset which qualifies are stored in a separate in-memory area. This area is located outside of the buffer pools and requires additional real memory. Db2 automatically determines over time which partitions of which unique indexes would benefit. Best candidates are indexes, which are frequently accessed by key lookups and don’t suffer from frequent index page splits due to e.g. numerous insert, update and delete operations.
Db2 does internal monitoring by a new zIIP-eligible index memory optimization daemon. It allocates FTB storage to indexes considered to benefit from this. Therefore, traversal through non-leaf index pages becomes very fast. In order to manage index FTB, setting ZParm INDEX_MEMORY_CONTROL=AUTO is a good point to get started. But make sure that enough real memory is available. The FTB size might be adjusted by setting the ZPARM value to any value between 10 (Mbyte) and 200000 (200 Gbyte). According to the manual, AUTO translates to 20% of the sum of all buffer pools (at least 10MByte for very small environments). However, according to my own experiences, I observed another formula, more like AUTO := MIN(500 Mbyte, 20% of overall virtual buffer pool size). Anyway, after an initial phase with AUTO, we are working with a fix value of 2000, which corresponds to 2GByte.
Additionally, micro-management is possible by adding rows into the SYSIBM.SYSINDEXCONTROL catalog table. These rows will specify which indexes should become candidates for FTB exploitation, in which timeframe, or which indexes should be excluded from FTB selection.
The index memory optimization daemon, which selects candidate indexes, is perfectly working, micro-management was not necessary for our workload.
Monitoring of FTB usage can be done by explicitly issuing a -DIS STATS command which lists the current index partitions, the number of index levels and the size these indexes occupy in the FTB structure. An additional information is repeatedly written to DSNMSTR address space:
10.54.20(…)DSNI070I - DB2P FAST INDEX TRAVERSAL STATUS: 566 MB, USED BY: 46 OBJECTS, CANDIDATE OBJECTS: 35
Insert Algorithm 2
When a row is inserted, Db2 performs a space search algorithm to determine the optimal placement of the new row within the tablespace. The algorithm is different for each type of tablespace, but has always been optimized for fast retrieval of the row once it has been inserted. The key concept of index space search is the clustering index, which defines the order of the rows within the tablespace. However, it is not always necessary for the rows to be inserted in clustering order: A typical use case is a journal table, where rows are inserted at the end of the tablespace or partition, and where data clustering is not required. As the space search algorithm has been a limitation to very fast inserting many rows, earlier releases of Db2 have already addressed that in a certain way by means of the MEMBER CLUSTER and the APPEND YES attributes. Db2 12 now offers a new space search algorithm, which goes way beyond what was available before and offers a streamlined space search operation. Exactly for the use cases where “insert at the end” is required or at least tolerated. Like most of the other new features, the new insert search space algorithm also requires additional memory.
An average CPU decrease of 20% was measured for INSERT INTO TABLE operations using Insert Algorithm 2.
This feature requires V12R1M500 (“new function mode”) and a careful selection of tables, which should use this new index algorithm. We found an interesting correlation between tables targeting another buffer pool type (PGSTEAL=FIFO) and tables which meet the sweet spot for this new index algorithm. Once again, our machine-learning based buffer pool selection process described above proved to be of great help.
Putting It All Together
After all, the average query’s CPU consumption was reduced from 64.4 msec to 51.4 msec: 20.15% CPU improvement was achieved by exploiting Db2 12 within 4 months after migration.
Identifying objects, which match the sweet spots for contiguous buffer pools and for index algorithm 2, is the key success factor to fully exploit Db2 12 performance.
The following diagram shows the CPU savings after activating individual new features. Please notify the difference between savings out-of-the box by simply turning the new feature on and savings which require additional engineering:
After that, further query tuning has even more reduced our overall CPU consumption. Compared to the time before the Db2 12 migration, the percentage of CPU reduction is now at 23.3%.
So far for today, further details will be revealed during the presentation. As you can see, this session alone is worth your trip to Charlotte! So, don’t miss the conference and visit session A07 “Db2 12 Performance: Delivered as Promised?” on Tuesday, June 4, 1:10 PM in room Carolina A.
References and Further Information
- IBM Db2 12 for z/OS Performance Topics. IBM Redbook, published 09/19/2017
- IDUG Db2 Technical Conference Anaheim 2017, Session B01, “Buffer Pool Tuning in the Cognitive Era”, Thomas Baumann
- DB2 12 for z/OS Technical Overview and Highlights. IBM Redbook, published 06/30/2017, updated 07/12/2017
- DB2 12 for z Optimizer. IBM Redpaper, published 06/28/2017