Linux 6 Transparent Huge Pages and Hadoop Workloads
This past week I spent some time setting up and running various Hadoop workloads on my CDH cluster. After some Hadoop jobs had been running for several minutes, I noticed something quite alarming — the system CPU percentages where extremely high.
Platform Details
This cluster is comprised of 2s8c16t Xeon L5630 nodes with 96 GB of RAM running CentOS Linux 6.2 with java 1.6.0_30. The details of those are:
$ cat /etc/redhat-release CentOS release 6.2 (Final) $ uname -a Linux chaos 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux $ java -version java version "1.6.0_30" Java(TM) SE Runtime Environment (build 1.6.0_30-b12) Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)
Observations
Shortly after I kicked off some Hadoop jobs, I noticed the system CPU percentages were extremely high. This certainly isn’t normal for this type of workload and is pointing to something being wrong or a bug somewhere. Because the issue was related to kernel code (hence high system times), I fired up perf top and tried to see where in the kernel code all this time was being spent (thanks @kevinclosson). Here is single iteration from perf-top which was representative of what I was seeing:
PerfTop: 16096 irqs/sec kernel:92.6% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _____________________________________________________________________ __________________
223182.00 93.8% _spin_lock_irq [kernel.kallsyms]
3879.00 1.6% _spin_lock_irqsave [kernel.kallsyms]
3260.00 1.4% compaction_alloc [kernel.kallsyms]
1992.00 0.8% compact_zone [kernel.kallsyms]
1714.00 0.7% SpinPause libjvm.so
716.00 0.3% get_pageblock_flags_group [kernel.kallsyms]
596.00 0.3% ParallelTaskTerminator::offer_termination(TerminatorTerminator*) libjvm.so
169.00 0.1% _cond_resched [kernel.kallsyms]
114.00 0.0% _spin_lock [kernel.kallsyms]
101.00 0.0% hrtimer_interrupt [kernel.kallsyms]
At this point I decided to take a 60 second capture using perf record using the following command:
$ sudo perf record -a -g -F 1000 sleep 60
After I had the capture, I built a flame graph using Brendan Gregg’s tools (because I am a big fan of performance data visualizations).
Looking at the functions listed in the Flame Graph (below) it looked like the issue was related to virtual memory and the Linux source shows many of these functions are in linux/mm/compaction.c.

The issue seemed to be around virtual memory, however, this Hadoop job was using just 8 mappers per node and the java heap was set to 1GB, so there was plenty of “leftover” memory on the system, so why would this system be thrashing in the vm kernel code?
Experiment
While eating dinner and having a few beers something came to mind — Linux 6 had a new feature called Transparent Huge Pages, or THP for short. And like all new features that are deemed to add benefit, it is enabled by default. THP can be disabled by running the following command:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
And this change, and only this change, is exactly what I did when I returned from dinner. I then fired off my Hadoop job and watched anxiously. To my pleasant surprise, the elevated sys CPU times were now gone and things looked much more like I wanted them to.
I’ve flipped back and forth several times and have had nothing but high sys times with THP enabled, so it’s pretty reproducible on my system.
Thoughts
I’m not 100% sure why THP are choking up my system (maybe bug vs. feature) but I’m certainly interested if others have seen similar behavior on Linux 6 with data intensive workloads like Hadoop and THP enabled. Other thoughts, experiment results, etc. are also very welcome.
To put things into perspective on how bad it gets, here are two screen captures of Cloudera Manager which highlights the ugly sys CPU times (see the middle chart; green = sys, blue = usr) when THP are enabled.
Do note the time scales are not identical.
Transparent Huge Pages enabled:

Transparent Huge Pages disabled:

Update:
The issue seems to be related to transparent hugepage compaction and is actually documented on the Cloudera Website (but my google foo did not turn it up) here which recommends the following:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
I did stumble across the bug listed on the Red Hat Bugzilla site in the google cache but it is not publicly available for some reason (bummer).
https://bugzilla.redhat.com/show_bug.cgi?id=805593

Just confirming after I disabled THP defrag on my cluster the high sys CPU times are not present.
Update 2:
Just for documentation’s sake, here are some performance captures between THP enabled and disabled.
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled # collectl -scm -oT -i2 waiting for 2 second sample... # <----CPU[HYPER]-----><-----------Memory-----------> #Time cpu sys inter ctxsw Free Buff Cach Inac Slab Map 10:33:08 84 4 46083 6964 5G 324M 61G 60G 563M 25G 10:33:10 71 7 39933 25281 5G 324M 61G 60G 566M 24G 10:33:12 89 4 48545 15724 5G 324M 61G 60G 566M 25G 10:33:14 81 7 44566 8224 7G 324M 61G 60G 566M 23G 10:33:16 81 4 44604 8815 8G 324M 62G 60G 566M 22G 10:33:18 87 7 46906 20430 8G 324M 62G 60G 566M 22G 10:33:20 79 6 43565 11260 6G 324M 62G 61G 565M 24G 10:33:22 75 6 41113 13180 3G 325M 62G 61G 565M 26G 10:33:24 64 5 36610 9745 2G 325M 62G 61G 565M 27G 10:33:26 60 4 34439 7500 1G 325M 62G 61G 565M 28G 10:33:28 74 5 40507 9870 1G 324M 61G 60G 564M 30G 10:33:30 73 6 42778 7023 6G 324M 60G 59G 561M 25G 10:33:32 86 5 46904 11836 5G 324M 61G 59G 561M 26G 10:33:34 78 3 43803 9378 5G 324M 61G 59G 559M 25G 10:33:36 83 4 44566 11408 6G 324M 61G 60G 560M 24G 10:33:38 62 4 35228 7060 7G 324M 61G 60G 559M 23G 10:33:40 75 7 42878 16457 10G 324M 61G 60G 559M 21G 10:33:42 88 7 47898 13636 7G 324M 61G 60G 560M 23G 10:33:44 83 6 45221 17253 5G 324M 61G 60G 560M 25G 10:33:46 66 4 36586 6875 3G 324M 61G 60G 560M 26G 10:33:48 66 4 37690 9938 2G 324M 61G 60G 559M 28G 10:33:50 66 3 37199 6981 1G 324M 61G 60G 559M 28G # echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled # collectl -scm -oT -i2 waiting for 2 second sample... # <----CPU[HYPER]-----><-----------Memory-----------> #Time cpu sys inter ctxsw Free Buff Cach Inac Slab Map 10:51:31 99 81 51547 14961 24G 326M 53G 51G 536M 15G 10:51:33 92 81 49928 11377 24G 326M 52G 51G 536M 15G 10:51:35 59 58 39357 2440 24G 326M 52G 51G 536M 15G 10:51:37 54 53 36825 1639 24G 326M 52G 51G 536M 15G 10:51:39 88 87 49293 2284 24G 326M 52G 51G 536M 15G 10:51:41 95 94 50295 1638 24G 326M 52G 51G 536M 15G 10:51:43 99 98 51780 1838 24G 326M 52G 51G 536M 15G 10:51:45 97 95 50492 2412 24G 326M 52G 51G 536M 15G 10:51:47 100 96 50902 2732 24G 326M 52G 51G 536M 15G 10:51:49 100 89 51097 4748 24G 326M 52G 51G 536M 15G 10:51:51 100 71 51198 36708 24G 326M 52G 51G 536M 15G 10:51:53 99 56 51807 50767 24G 326M 52G 51G 536M 15G 10:51:55 100 51 51363 66095 24G 326M 52G 51G 536M 15G 10:51:57 100 48 51691 73226 24G 326M 52G 51G 536M 15G 10:51:59 99 36 52350 87560 24G 326M 52G 51G 536M 15G 10:52:01 99 51 51809 42327 24G 325M 52G 51G 536M 15G 10:52:03 100 50 51582 62493 24G 325M 52G 51G 536M 15G 10:52:05 99 44 52135 69813 24G 326M 52G 50G 536M 15G 10:52:07 99 39 51505 65393 24G 326M 52G 50G 536M 16G 10:52:09 98 39 52778 54844 24G 326M 52G 50G 536M 16G 10:52:11 98 62 51456 30880 24G 326M 52G 50G 536M 16G 10:52:13 100 83 51014 21095 24G 326M 52G 50G 536M 16G
Good post, Greg.
I had “echo never > /sys/kernel/mm/transparent_hugepage/enabled” in my rc.local for *ages* so that little treasure had totally slipped my mind when we were skyping about this problem. Why? Well, it only negatively impacts Greenplum concurrent query performance by 35% :-)
Your readers can see more context here: http://www.mjmwired.net/kernel/Documentation/vm/transhuge.txt
Did we try disabling defrag
Post fact, yes, and that yielded positive results (see update above).
THP is meant to *increase* performance and most often does exactly that; I’ve only noticed better behaviour with large heaps, not worse. To this smells more like a particular behaviour of either the very specific version of the patched-beyond-recognition RedHat kernel, or – IMHO more likely – something in HDFS (the native code bits maybe?) that excessively allocates and frees memory in fragmented portions.
The kenel is not patched at all, it’s right from the CentOS distribution and the full kernel version is listed. Given Kevin Closson’s comment above, “[THP] negatively impacts Greenplum concurrent query performance by 35%”, which has neither HDFS or java, I’d suspect the issue is more specific to THP in Linux. Clearly this looks like an area that needs more research and I hope this blog post helps promote that.
On my to-do list is to test in isolation each of the ancillary tunables related to THP. The problem I see is the fact that I disable a default setting and get 35% back from a cpu-saturated workload (no physical I/O either). After I re-enable it I have to walk through a large set of test permutations to “find the goodness.” If I find that the goodness is worth it good, if not, not good. The fact is, however, this mean about 20 hours of non-stop testing for each permutation on a tight schedule.
I haven’t blogged the default THP damage (imagine that) because it is a default Redhat setting so I presumed mine was an edge-case and there are other related tunables I had not investigated. Having said that I don’t think mine is an edge case (due to Greg’s experience) and for that reason it seems obvious to me that the defaults are not safe for the broadest of workloads. Things like that get better with time, I know. When I finally work out the lab time I’ll push through the various permutations and blog my findings.
To the “patched beyond recognition” statement… since you’re running CentOS… you are essentially running Redhat, which is to say, you are in fact running a kernel that’s patched beyond recognition. It’s not a bad thing, it’s just a variable that has to be factored in. I’ve got several boxes with 512GB of ram and we’ve found that enabling/disabling either all of the THP and the defrag yields varying results depending on the workload. Not all things work for all workload… imagine that. This is precisely why the kernel has lots of knobs to twist. It’s just knowing which knobs and in what order that’s been the challenge! Other kernel variables that are worth investigating are vm.swappiness, vm.dirty_ratio, vm.dirty_background_ratio, etc.
Is this RHEL/CentOS-specific stuff?
I have Ubuntu boxes with 2.6.38 kernel, but don’t have /sys/kernel/mm/transparent_hugepage
I have these:
$ ls -al /sys/kernel/mm/hugepages/hugepages-2048kB/
total 0
-r–r–r– 1 root root 4096 2012-06-19 04:08 free_hugepages
-rw-r–r– 1 root root 4096 2012-06-19 04:08 nr_hugepages
-rw-r–r– 1 root root 4096 2012-06-19 04:08 nr_hugepages_mempolicy
-rw-r–r– 1 root root 4096 2012-06-19 04:08 nr_overcommit_hugepages
-r–r–r– 1 root root 4096 2012-06-19 04:08 resv_hugepages
-r–r–r– 1 root root 4096 2012-06-19 04:08 surplus_hugepages
It is a Red Hat feature.
This is a generic feature (and not RH specific) but may not be built into your distro kernel. Test:
$>zgrep HUGE /proc/config.gz
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
Yes, Holger, you are right. Sorry. Does anyone know if any distro besides RHEL and Centos adopted the feature? I’m pretty certain the other, um, derivation (OEL) doesn’t adopt the feature.
Appears that Oracle Linux Release 6 Update 2 does not include THP (at least not with UEK):
THP is enabled in UEK2 and it appears for Oracle Linux Release 6 Update 3 UEK2 not UEK1 will be the default kernel.
#uname -a
Linux test 2.6.39-100.1.1.el6uek.x86_64 #1 SMP Wed Feb 8 01:46:29 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
https://oss.oracle.com/ol6/docs/RELEASE-NOTES-U3-en.html#Kernel_and_Driver_Updates
THP is in Linux since 2.6.38 (See http://lwn.net/Articles/423584/ ).
I don’t see THP as a database/server feature at all. Its best suited for apps/SPECINT/SPECFP where you cannot change the app being able to score a few large pages. I d imagine a database vendor would prefer reserving memory hugetlbfs style.
And speaking of collectl ;) (after all you did show collectl output in the base note), one thing that is very useful, especially on large clusters all running the same configuration like hadoop is to run collectl simultaneously on all boxes with colmux and then using arrows keys change the sort columns one by one and make sure the characteristics are at least similar. I’ve seen cases where a small number of boxes in hundreds and even thousands will jump out at you and something you’d never see trying manually running scripts. This is especially true when examining memory, but I just do it everywhere running -s+m (did you know you can add memory to the default like this? ;))
-mark
My bad. I should have included a link to the initial data capture I had created when I asked about this on twitter here:
Pls ignore if this is already taken care of . Just a quick thing, did the jvm processes have -XX:+UseLargePages setting?
[...] Greg Rahn has a great post about Linux 6 Transparent Huge Pages and Hadoop Workloads.Dominic Delmolino’ve been impressed with the solutions to his little problem of generating a symmetric Pascal matrix using SQL [...]
Just a thought to see if the explicit setting for your jvm version is still needed/has any effect even with the new transparent huge page feature in latest rhel 6* kernel.
These jvms did not use the -XX:+UseLargePages option, nor should they need to. That would only be required for statically allocated hugepages.
I did a sudo -i and executed the below command. But it doesn’t seem to take effect
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
The defrag still shows as below
always [never]
not sure if this is right.
The moment I try doing a get or put to hdfs, I see high cpu. first by java process and then by a winbindd process. Any idea how i can fix it.
Use
Your command isn’t working because you are redirecting the output which falls outside of the sudo permission boundaries.
Thanks Greg. I was able to change it to ‘never’.
I have encountered another problem. A process named winbindd is causing high cpu when i do anything with the HDFS. Simple hadoop commands take insanely long time to complete. Posted this question on IRC #cloudera. Looks like it is something to do with Samba running on VM. I am trying to debug further.
i’ve seen this recently on SLES 11 SP2. we’re running multiple SAP SIDs and oracle. the fix was to disable defrag.
this is a very good explanation of how to micro debug hadoop performance on linux. also its a good explanation of the bug … :) this just came up on the CDH mailing list.
Have you measured actual application performance? Since a spin lock is just something waiting for a concurrency control, I wonder if CPU time is simply being transferred from the idle process to that particular kernel task?
There was a measurable difference, but I don’t have the exact numbers any more. Fairly easy to reproduce in the lab though… It’s not just idle CPU to system CPU though, it’s about stealing away user CPU capacity as well.
Thanks Greg for that post!
I went through a 6 weeks nightmare trying to find out why our software is running into peaks every day without any significant pattern. This was the situation:
Our software is an in-memory search index written in C and encapsulated by a java based webapp running under tomcat. When our customer started complaining about frequent peaks every morning, we analysed the following:
- is there any suspect traffic: No
- is there any scheduler running: No
- Is our tomcat thread/worker/heapspace setting ok: Yes
- Is the Java GC doing stupid things: No, it just slows down sometimes without any reason
- Is there any particular part within our software that’s consuming the time: No, all profiling shows, that every part of the sw reacts slow during the peak time
- Now I learned about perf top (my new best friend!): interesting to see all these kernel-threads and the raw_spin_lock jumping to the top of the list during peak times. Hmm, probably we should reduce multi-core usage of our software. We did so: No effect
- we added new hardware and we went on tweaking our software/configuration: Helps slightly, but still we can see peaks (not as much/large as before, but still frightening)
- After six weeks, a second customer reports similar issues. No connection between the two environments, but one: SLES 11 SP2
- At 5 a.m. I cannot sleep and start searching the web. Find your post. Advised the sysadmin to “echo never > /sys/kernel/mm/transparent_hugepage/enabled”
- At 8:30 he did it
- Since then (2 days after): No single peak experienced on the patched server, all other server are still peaking. Proof!
- total debugging costs (incl. work from a development team, additional hardware, excl. lost revenue): ~50K
conclusion: a perfect example that optimizing a core component (e.g. memory handling of the OS) is a tricky thing and even being very careful will be no guarantee for nothing. There will always be unforeseen environments/configurations.
We had similar problems with locally-developed java web apps. At the time we were running in oracle java 6, tomcat6 or 7.
In our case, we weren’t just seeing increases in CPU time used, we had “stalls” on our web site; connect, wait, wait, wait…. The stalls were long enough for our monitoring software to time out; sometimes they would last for several minutes.
The problem only happened after the system was up for a while (at least a month), which makes sense – memory is more fragmented, so there is more copying around to be done.
Like yours, our problems were immediately solved when we disabled transparent hugepages.
The really sad part is that under normal circumstances our system has at least 50% available memory (most used by disk cache, of course).