- Add some latency by calling Thread.sleep() to bring CPU usage to a high but healthy level of about 50%
- Introduce some CPU consumption outside object creation and garbage collection to bring the memory allocation and GC rate to a level of about 500 MB/s which is high but can be found in real high-traffic installations of reasonable applications
The pause time problem with large heapsStop-the-world collectors may be robust and efficient but with a heap of that size GC pauses are in most cases prohibitive for any interactive interaction. As an example, see in the next figure how the default collector (which can also be started explicitly using
-XX:+UseParallelGC) operates with parameters
-Xms50g -Xmx50g -XX:NewSize=2000m -XX:MaxNewSize=2000m
It is obvious that from a throughput and efficiency standpoint this collector does a good job because it only uses about 5% of elapsed time to clean up roughly 450 MB of garbage per second which is quite a lot even in the league of high throughput web applications. New generation pause times average 170 milliseconds which is good and old generation pauses are infrequent enough.
There is only one nuisance here: the old generation pause duration of about 30 seconds. Note that the benchmark does not have a very complicated reference graph and does also not include nasty things like reference processing and similar complications. A more complicated and more “real-world” application could easily push the old generation pause duration beyond the 1-minute-barrier. That’s why such stop-the-world collectors are not an option for any interactive application with a heap of that size. Using a low-pause collector becomes mandatory for such cases.
The CMS collector’s performance on a large heapThe CMS collector started with the settings
-XX:NewSize=2000m -XX:MaxNewSize=2000m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=80shows the following results when it runs the same load as above:
The CMS collector delivers excellent results, indeed. Throughput is only slightly lower than with the parallel collector, new generation pauses are shorter (on average 110 compared to 170 ms) and there aren’t any pauses longer than 140 ms, neither for new generation nor for old generation collections.
Keep in mind, however, that the benchmark is favorable for the CMS collector because, by construction, it works with objects of a single size and thus avoids fragmentation and any overhead for free-list searching during new generation pauses as much as any other complications. In real applications, CMS pauses, in particular ‘remark’ pauses, can take a second even with much smaller heaps.
Garbage First (G1)Even if we know that the benchmark is favorable for the CMS collector, it remains interesting and instructive to learn how the G1 collector copes with a load that is as simple and well-defined as this one on a heap of that size. As a first test, I applied G1 to the same load with only the following parameters:
-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=80By default, this also implies a pause time target of 200 milliseconds:
-XX:MaxGCPauseMillis=200The following figure shows the results that G1 delivered:
This graphic and the statistics on the right show that the vast majority of new generation pauses meet the max pause target rather well and with little variation. The old generation pauses (which are in fact “mixed” pauses in the case of G1 because they are used to clean both new and old generation regions) can be spotted as vertical grey lines in regular intervals which miss that target and reach about 800 milliseconds after a rather extended warm-up phase. It can also be seen that the CPU usage with G1 is higher than with the other collectors: close to 60 instead of 50 percent.
We will see later that the setting
is not a good choice and rather increases the duration of mixed GC pauses. However, the tendency of mixed pauses to exceed the max pause target is a general problem as will be shown in the next section.
Let’s try to tune G1The results for G1 in figure 3 and for the CMS in figure 2 suggest that there should be a knob to tune G1’s performance. G1 tuning options and best practices from Oracle can be found in “Getting Started with the G1 Garbage Collector” and “Garbage First Garbage Collector Tuning”. I followed the best practice advice not to touch generation sizing parameters as for example
SurvivorRatioand let G1 adjust them to best fulfill the max pause target but measured the impact of other parameters on the achieved pause durations.
Pause duration target:
MaxGCPauseMillisis the prime tuning parameter for the G1 collector and is at the heart of its promise to control GC pause durations. Have a look at the following figure to see how well G1 is able to deliver the configured target when run with a minimal configuration of
-Xms50g -Xmx50g -XX:+UseG1GC -XX:MaxGCPauseMillis=x
The figure shows very nicely that pause durations follow the configured target to some limited extent:
- Young generation pauses (red lines) obey the target down to a limiting value (which is about 170 ms in this case) in such a way that the 0.9 quantile is more or less on target or lower.
- With a target value below 170 ms most young generation pauses still take 170 ms with very little variation (flat red curves are close together) except for very low target values below 40 ms where few pauses get longer again (“max young pause” line).
- “mixed” pauses (blue lines) follow the pause time target much less because considerable numbers of pauses (see “0.9 quantile mixed pause” line) do not fall below about 500 ms and the average only gets down to about 350 ms.
- There are other pauses (green lines) like “remark” pauses whose durations look completely independent of the configured target but are short enough anyway. Actually they are of similar nature as the CMS pauses and also in the same range as shown by the “CMS max pause” line which I took from the test shown in figure 2.
Note that the range where most GC pauses follow the pause time target depends on application, hardware and heap size. In our case, this range is 170-600 ms which by accident includes the default target value of 200 ms. With the same benchmark program but on different hardware (Oracle T3/SPARC, 16 cores x 8 hardwarethreads per core=128 processors at 1650 MHz) and with a smaller heap size (
-Xms10g -Xmx10g) I observed the same big picture as described above. However, the range of effective control was much lower: 40-150 ms.
Concurrent operation threshold:
Very much like the CMS collector G1 starts concurrent processing depending on how much of the heap is filled. The parameter which controls the threshold is
-XX:InitiatingHeapOccupancyPercentand the default value is 45, which means that concurrent processing starts when 45% of the available heap is filled. In contrast to that, the CMS collector uses a much higher default of 80 for the corresponding threshold parameter
-XX:InitiatingOccupancyFraction, which is often a good choice and sometimes needs to be lowered only a little bit to reduce the risk of concurrent mode failures. Altogether, for people with CMS experience G1’s default of 45 looks very low and it is tempting to push the threshold towards 80 in the hope that this could reduce G1’s CPU consumption and increase its throughput. We have seen in figure 4 that mixed pauses are the most critical ones and often exceed the max pause target. The following figure shows how mixed pauses behave as a function of the max pause target for different values of
It is clearly visible that values of
-XX:InitiatingHeapOccupancyPercent=45 (orange curves) and
60 (light blue curves) deliver almost the same results over the full range from
-XX:MaxGCPauseMillis=10 to 5000.
Values 80 and 90 each add about 100 millis to average pause duration and the 0.9 quantiles. It can also be seen at
-XX:MaxGCPauseMillis=5000 that a value of IHOP=90 leads to a concurrent mode failure and thus a Full GC stop which pushes average and 0.9 quantile very high. Because, for both CMS and G1, we always want to avoid the risk of concurrent mode failures and prefer shorter pauses we conclude that it makes sense to keep the value for parameter
-XX:InitiatingHeapOccupancyPercent below 80 and G1’s default is not bad.
Is the cost of G1’s pause control so high?How costly is it for G1 to control GC pause duration? How does the GC throughput depend on the pause time target? Is there a penalty if the pause time target is set to a lower value than G1 can deliver, i.e. below the 170 ms limit observed in figure 4? I hoped to answer these questions with the following figure:
This looks like a highly interesting result: Increasing the pause time target from 170 to 600 ms increases throughput by up to 35% while it is flat at lower and higher target values.Therefore, let’s first have a look at how this is measured and what the accuracy is. Look at some lines from the GC log from a measurement which corresponds to the blue line at a target value of 600 ms (
276.141: [GC pause (young) 44G->31G(50G), 0.4517380 secs] 288.056: [GC pause (young) 43G->32G(50G), 0.5088510 secs] 301.181: [GC pause (young) (initial-mark) 44G->32G(50G), 0.4453570 secs]Note that above a heap size of 10 GB G1’s accuracy for logging heap sizes is 1 GB (no decimal digits). This means in the log lines above we have a rounding error for the difference (=cleaned heap) of about 1/12=8%. That is a lot but still much less than the effect we see in figure 6. But let’s also look at some log lines from a test with MaxGCPauseMillis=150:
258.827: [GC pause (mixed) 35G->32G(50G), 0.2036500 secs] 262.001: [GC pause (young) 34G->32G(50G), 0.1774750 secs] 265.279: [GC pause (young) (initial-mark) 34G->32G(50G), 0.1744120 secs]We see that now each GC pause cleans much less, only about 2 GB of heap (because G1 has less time in each GC run and executes smaller chunks of work at a time) but log accuracy is still 1GB which gives us a statistical error of about 1/2 = 50%. But especially the lower value (after GC) does not vary a lot, which means that the 32G could be constantly at roughly the same value between 31.5 and 32.49, which results in a systematic error of up to 0.5 or 25% of what we measure. This means that most of the change in throughput which is shown in the figure above could be accounted for by a systematic measuring error when G1 reduces the amount of cleaned heap to meet the pause time target. The plot would be mostly useless due to the large systematic error, except as a hint to measure this again and for highlighting how insufficient G1’s logging accuracy is. This is a serious flaw in G1’s GC log and needs to be fixed. Until this has been done, such measurements of throughput for large heaps require an object creation counter in the benchmark’s code. In the meantime, I have repeated the measurements with such a counter to determine the actual throughput. As a result, I found that there is some divergence of the measured object creation rate from the error-prone GC rate shown in figure 6. Nevertheless, the main effects shown in figure 6 are correct: reducing pause duration costs (a lot of) throughput. Therefore, it makes sense to set the target as low as actually needed but not lower. The good news is that setting the target lower than G1 can deliver is free: you only pay for what you really get.
Tuning the number of threads:
G1 as much as the “classic” collectors share this parameter which sets the number of threads used in parallel for processing during GC pauses. The following figure shows how pause durations for G1 and CMS depend on that parameter:
The first thing to learn from this plot is that here the default (5/6 of the number of processors = 10 threads for G1 on the given hardware) is unnecessarily high because there is little to gain from using more than 6, the number of CPU cores. We have seen earlier in part 2 of this blog series that cores count for GC operations while hyperthreads (which double the number of processors to 12) add little benefit. The JVM, however, knows only the number of processors.
Second, the new generation pauses (dotted lines) for both collectors show almost identical dependence on the number of threads. Both benefit most from adding a second thread and on a quickly diminishing scale from adding more.The big difference, once again, is in the old generation/mixed collections. With a single GC thread G1’s mixed collections (solid light blue line) are excessively long and it takes at least 5 parallel GC threads to bring them down to the 500 ms for which 2 threads are enough in all other cases. As we have seen several times before, mixed collections are the troublesome task for the G1 collector which it tackles with plenty of CPU power (and at the expense of throughput). Starting from figure 7, we can speculate that G1 could use still more CPU cores (as are for example available on a SPARC T3 or T4 chip or in a multi-chip system) to further reduce GC pause duration while CMS would benefit less. It is not shown in figure 7 that GC throughput already reaches a maximum at 3 parallel GC threads.
Number of concurrent threads:
This parameter defines the number of GC threads working concurrently with mutator (= application) threads, and it depends on the number of parallel threads: the number of parallel GC threads is an upper bound for it and also determines its default value, which is approximately 1/4 of the number of parallel threads for G1.
To avoid concurrent mode failures (which lead to excessively long Full GC pauses) I found it necessary for G1 to use at least 2 concurrent GC threads whereas with CMS it was always safe to run the same benchmark with a single concurrent thread. Setting the number of concurrent threads to higher values or not setting it at all made very little difference. I conclude that setting this parameter is pushing things very far and in most cases is probably not worth the effort as the default behavior is very good as long as there are enough parallel GC threads.
G1 region size:
This is the parameter at the heart of G1’s heap design. G1 uses a number of regions which are managed individually. One might expect that the region size has a certain impact on G1’s results, but I could not find any effect when I varied the region size from its minimum to maximum value.
All the indicators monitored do not systematically depend on the region size parameter. Remember, however, that all the objects created in the benchmark are of equal size and very small compared to even the smallest possible region size. Once an application handles much larger objects, e.g. PDF documents in huge byte arrays in or close to the MB range, we should see a different picture.
G1 has few parameters to tune itThere are more parameters described in “Getting Started with the G1 Garbage Collector” and “Garbage First Garbage Collector Tuning”, but as I understand them they are meant to make G1’s operation more robust and to reduce a specific risk of failure rather than to tune G1’s performance. A case in point is
-XX:G1ReservePercentwhich in my experiments did not influence either throughput or GC pause duration. From my measurements I conclude that
-XX:MaxGCPauseMillisis by far the most important tuning parameter. The default value, 200 ms, is not bad. For most applications this is responsive enough and feasible at the same time. But it is important to keep in mind that there are cases (depending mainly on application, hardware and heap size) where a desired and reasonable pause time target simply cannot be reached and I have not found a JVM screw to fix that once the application is in production. For all the other parameters G1 uses rather good default values or even adjusts them dynamically to meet the pause time target, and it usually is not necessary and not beneficial to configure them explicitly.
Coexistence of several JVMs on a single OS instanceIn a common system architecture several JVMs, for example clustered application server instances, run on a common server without explicit allocation of CPU resources. In old-fashioned installations they often run on a common operating system instance. But even in virtualized environments with one JVM per OS instance mechanisms to allocate CPU resources are not always available or used. Therefore, I also examined how both garbage collectors, CMS and G1, operate under such circumstances where several JVMs freely compete for CPU power. For this I used the following test setup: A JVM with 25 GB of total heap size executes a 30 minutes test run using roughly 50% of CPU available power. 15 minutes after this first JVM a second JVM with exactly the same parameters and program is started. As both now compete for the available CPU power and other system resources, we expect a clear change after half the execution time. The following figure shows the CMS collector started with the parameters
-Xms25g -Xmx25g -XX:NewSize=2000m -XX:MaxNewSize=2000m -XX:CMSInitiatingOccupancyFraction=80doing GC in the first of the two JVMs.
The competition for CPU power affects the CMS collector in three ways:
- It increases the average duration of GC pauses by a factor of about 1.7 (from 83 to 145 ms)
- It moderately increases the standard deviation of GC pause duration (from 3 to 13 ms)
- It reduce the throughput (slope of the blue heap usage curve) by a factor of 1.9 such that the sum for both JVMs is only about 5% higher than for a single one (745 MB/s) because at close to 50% CPU usage all cores are almost in full use and hyperthreads have little benefit.
Altogether, the CMS collector gracefully responds to the sudden onset of CPU competition and in a way as I would have expected.The following figure shows how the G1 started with
-Xms25g -Xmx25gand its default max pause target of 200ms copes with exactly the same situation:
G1 shows a very different reaction to the sudden change:
- the average duration of new gen GC pauses is only slightly changed (from 180 to 208 ms)
- the GC pauses vary much more after the change and the standard deviation grows from 22 to 69 milliseconds, the max value from 264 to 700 milliseconds
- measured throughput is reduced by almost a factor of 2.7 (from 707 to 266 MB/s) such that the combined throughput of both instances is 25% lower than that of a single JVM but this detail result suffers from exactly the same kind of rounding error as described above and is therefore unreliable (unlike the corresponding result for the CMS collector that has no issue with rounding)
Summary and ConclusionThe results presented in this post show that G1’s control of GC pauses with the
-XX:MaxGCPauseMillisparameter is effective only in a rather narrow range of target values and that it has no effect (fortunately also no negative effect) when the target is set too low for the given setup. Control is also incomplete because a significant share of (mixed) GC pauses is hardly affected by the target value. Given the high cost of G1’s GC scheme, which I have already pointed out in the previous post, I find these results disappointing. When I started working with G1 I had a feeling I was missing a decisive parameter which would get things going. After searching and testing for a while I have come to the conclusion that there probably is none. Sometimes, parameters (like the number of parallel threads) can be changed to use less resources but I found no change that gave me significantly better results than what I got out-of-the-box using only a minimal configuration like:
-Xms50g -Xmx50g -XX:MaxGCPauseMillis=500My recommendation: start from here and set the heap size and pause time target to the values you really need for your application on your target platform. If that works out as expected you are done. If it doesn’t you probably can’t do much about it because G1’s default settings and auto-adjustments are good. High hopes that the
MaxGCPauseMillisparameter could be used to reduce GC pauses of real applications at will to any value in the 1000 to 10 ms range (be it at the expense of very high CPU consumption) look futile when this cannot be achieved with a rather simple benchmark as I have used for my tests. Looking at figure 2 which shows that the CMS collector achieves clearly better results on that same benchmark leads me to question G1’s design: Why does it use regions to manage all generations of objects? Why does it have a common
MaxGCPauseMillistarget for all kinds of GC pauses? Would it not be better to keep young objects in exactly the same kind of heap layout as the classic collectors do (1 Eden space + 2 survivor spaces) and to use many regions only for the old generation? The many new generation GC pauses should then in most cases stay clearly below the
MaxGCPauseMillistarget. And for the few old generation pauses a relaxed target of something like 1s would still be fine and hopefully cheaper to achieve.