Linux 6 Transparent Huge Pages and Hadoop Workloads

This past week I spent some time setting up and running various Hadoop workloads on my CDH cluster. After some Hadoop jobs had been running for several minutes, I noticed something quite alarming: the system CPU percentages were extremely high.

Platform Details

This cluster consists of 2s8c16t Xeon L5630 nodes with 96 GB of RAM, running CentOS Linux 6.2 with Java 1.6.0_30. The details:

$ cat /etc/redhat-release
CentOS release 6.2 (Final)

$ uname -a
Linux chaos 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux

$ java -version
java version "1.6.0_30"
Java(TM) SE Runtime Environment (build 1.6.0_30-b12)
Java HotSpot(TM) 64-Bit Server VM (build 20.5-b03, mixed mode)

Observations

Shortly after I kicked off some Hadoop jobs, I noticed the system CPU percentages were extremely high. That certainly isn’t normal for this type of workload and pointed to something being wrong, or a bug somewhere. Because the issue was related to kernel code (hence the high system times), I fired up perf top to see where in the kernel all this time was being spent (thanks @kevinclosson). Here is a single iteration from perf top that was representative of what I was seeing:

PerfTop:   16096 irqs/sec  kernel:92.6%  exact:  0.0% [1000Hz cycles],  (all, 16 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                                                              DSO
             _______ _____ _____________________________________________________________________ __________________

           223182.00 93.8% _spin_lock_irq                                                        [kernel.kallsyms] 
             3879.00  1.6% _spin_lock_irqsave                                                    [kernel.kallsyms] 
             3260.00  1.4% compaction_alloc                                                      [kernel.kallsyms] 
             1992.00  0.8% compact_zone                                                          [kernel.kallsyms] 
             1714.00  0.7% SpinPause                                                             libjvm.so
              716.00  0.3% get_pageblock_flags_group                                             [kernel.kallsyms] 
              596.00  0.3% ParallelTaskTerminator::offer_termination(TerminatorTerminator*)      libjvm.so
              169.00  0.1% _cond_resched                                                         [kernel.kallsyms] 
              114.00  0.0% _spin_lock                                                            [kernel.kallsyms] 
              101.00  0.0% hrtimer_interrupt                                                     [kernel.kallsyms]

At this point I decided to take a 60 second capture with perf record, using the following command:

$ sudo perf record -a -g -F 1000 sleep 60

After I had the capture, I built a flame graph using Brendan Gregg’s tools (because I am a big fan of performance data visualizations).
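
For reference, the post-processing with the FlameGraph scripts goes roughly like this: perf script dumps the recorded stacks, and stackcollapse-perf.pl plus flamegraph.pl turn them into an SVG (the ./FlameGraph path is only an assumption about where the repository was cloned):

$ sudo perf script > perf.stacks
$ ./FlameGraph/stackcollapse-perf.pl perf.stacks > perf.folded
$ ./FlameGraph/flamegraph.pl perf.folded > flamegraph.svg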

Looking at the functions listed in the flame graph (below), it looked like the issue was related to virtual memory management; the Linux source shows that many of these functions live in linux/mm/compaction.c.

Flame graph (cropped)

The issue seemed to be around virtual memory. However, this Hadoop job was using just 8 mappers per node and the Java heap was set to 1 GB, so there was plenty of “leftover” memory on the system. Why, then, would the system be thrashing in the VM kernel code?

Experiment

While eating dinner and having a few beers, something came to mind: Linux 6 has a new feature called Transparent Huge Pages, or THP for short. And like all new features that are deemed to add benefit, it is enabled by default. THP can be disabled by running the following command:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
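
To verify which policy is in effect, simply read the file back; the active setting is the value shown in square brackets:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled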

Disabling THP, and changing nothing else, is exactly what I did when I returned from dinner. I then fired off my Hadoop job and watched anxiously. To my pleasant surprise, the elevated sys CPU times were gone and things looked much more like I wanted them to.

I’ve flipped back and forth several times and have had nothing but high sys times with THP enabled, so it’s pretty reproducible on my system.

Thoughts

I’m not 100% sure why THP is choking up my system (bug or feature?), but I’m certainly interested to hear if others have seen similar behavior on Linux 6 with data-intensive workloads like Hadoop and THP enabled. Other thoughts, experiment results, etc. are also very welcome.

To put into perspective how bad it gets, here are two screen captures from Cloudera Manager that highlight the ugly sys CPU times (see the middle chart; green = sys, blue = usr) when THP is enabled.

Do note the time scales are not identical.

Transparent Huge Pages enabled:
Transparent Huge Pages enabled

Transparent Huge Pages disabled:
Transparent Huge Pages disabled

Update:

The issue seems to be related to transparent hugepage compaction and is actually documented on the Cloudera website here (though my google-fu did not turn it up); the documentation recommends the following:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
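
As with the enabled flag, the defrag file can be read back to confirm the change took; the active value appears in brackets (e.g. always [never] after the echo above):

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag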

I did stumble across the bug on the Red Hat Bugzilla site via the Google cache, but it is not publicly accessible for some reason (bummer):
https://bugzilla.redhat.com/show_bug.cgi?id=805593

Just confirming that after I disabled THP defrag on my cluster, the high sys CPU times are no longer present.
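
Keep in mind these sysfs settings do not survive a reboot. A simple way to make them stick on RHEL/CentOS 6 (just one option; passing transparent_hugepage=never on the kernel command line is another route) is to re-apply them from /etc/rc.local:

# append to /etc/rc.local -- re-apply THP settings at boot
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag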

Update 2:

Just for documentation’s sake, here are some performance captures comparing THP disabled and enabled.

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# collectl -scm -oT -i2
waiting for 2 second sample...
#         <----CPU[HYPER]----->
#Time     cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map 
10:33:08   84   4 46083   6964   5G 324M  61G  60G 563M  25G 
10:33:10   71   7 39933  25281   5G 324M  61G  60G 566M  24G 
10:33:12   89   4 48545  15724   5G 324M  61G  60G 566M  25G 
10:33:14   81   7 44566   8224   7G 324M  61G  60G 566M  23G 
10:33:16   81   4 44604   8815   8G 324M  62G  60G 566M  22G 
10:33:18   87   7 46906  20430   8G 324M  62G  60G 566M  22G 
10:33:20   79   6 43565  11260   6G 324M  62G  61G 565M  24G 
10:33:22   75   6 41113  13180   3G 325M  62G  61G 565M  26G 
10:33:24   64   5 36610   9745   2G 325M  62G  61G 565M  27G 
10:33:26   60   4 34439   7500   1G 325M  62G  61G 565M  28G 
10:33:28   74   5 40507   9870   1G 324M  61G  60G 564M  30G 
10:33:30   73   6 42778   7023   6G 324M  60G  59G 561M  25G 
10:33:32   86   5 46904  11836   5G 324M  61G  59G 561M  26G 
10:33:34   78   3 43803   9378   5G 324M  61G  59G 559M  25G 
10:33:36   83   4 44566  11408   6G 324M  61G  60G 560M  24G 
10:33:38   62   4 35228   7060   7G 324M  61G  60G 559M  23G 
10:33:40   75   7 42878  16457  10G 324M  61G  60G 559M  21G 
10:33:42   88   7 47898  13636   7G 324M  61G  60G 560M  23G 
10:33:44   83   6 45221  17253   5G 324M  61G  60G 560M  25G 
10:33:46   66   4 36586   6875   3G 324M  61G  60G 560M  26G 
10:33:48   66   4 37690   9938   2G 324M  61G  60G 559M  28G 
10:33:50   66   3 37199   6981   1G 324M  61G  60G 559M  28G 

# echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# collectl -scm -oT -i2
waiting for 2 second sample...
#         <----CPU[HYPER]----->
#Time     cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map 
10:51:31   99  81 51547  14961  24G 326M  53G  51G 536M  15G 
10:51:33   92  81 49928  11377  24G 326M  52G  51G 536M  15G 
10:51:35   59  58 39357   2440  24G 326M  52G  51G 536M  15G 
10:51:37   54  53 36825   1639  24G 326M  52G  51G 536M  15G 
10:51:39   88  87 49293   2284  24G 326M  52G  51G 536M  15G 
10:51:41   95  94 50295   1638  24G 326M  52G  51G 536M  15G 
10:51:43   99  98 51780   1838  24G 326M  52G  51G 536M  15G 
10:51:45   97  95 50492   2412  24G 326M  52G  51G 536M  15G 
10:51:47  100  96 50902   2732  24G 326M  52G  51G 536M  15G 
10:51:49  100  89 51097   4748  24G 326M  52G  51G 536M  15G 
10:51:51  100  71 51198  36708  24G 326M  52G  51G 536M  15G 
10:51:53   99  56 51807  50767  24G 326M  52G  51G 536M  15G 
10:51:55  100  51 51363  66095  24G 326M  52G  51G 536M  15G 
10:51:57  100  48 51691  73226  24G 326M  52G  51G 536M  15G 
10:51:59   99  36 52350  87560  24G 326M  52G  51G 536M  15G 
10:52:01   99  51 51809  42327  24G 325M  52G  51G 536M  15G 
10:52:03  100  50 51582  62493  24G 325M  52G  51G 536M  15G 
10:52:05   99  44 52135  69813  24G 326M  52G  50G 536M  15G 
10:52:07   99  39 51505  65393  24G 326M  52G  50G 536M  16G 
10:52:09   98  39 52778  54844  24G 326M  52G  50G 536M  16G 
10:52:11   98  62 51456  30880  24G 326M  52G  50G 536M  16G 
10:52:13  100  83 51014  21095  24G 326M  52G  50G 536M  16G

Update: 2013-09-17
Oracle comments on disabling THP in Oracle Linux: Performance Issues with Transparent Huge Pages (THP)

41 comments

  1. Holger Hoffstätte

    THP is meant to *increase* performance and most often does exactly that; I’ve only noticed better behaviour with large heaps, not worse. To me this smells more like a particular behaviour of either the very specific version of the patched-beyond-recognition RedHat kernel, or – IMHO more likely – something in HDFS (the native code bits maybe?) that excessively allocates and frees memory in fragmented portions.

  2. Greg Rahn

    The kernel is not patched at all, it’s right from the CentOS distribution and the full kernel version is listed. Given Kevin Closson’s comment above, “[THP] negatively impacts Greenplum concurrent query performance by 35%”, and that Greenplum involves neither HDFS nor Java, I’d suspect the issue is more specific to THP in Linux. Clearly this looks like an area that needs more research and I hope this blog post helps promote that.

  3. Otis Gospodnetic

    Is this RHEL/CentOS-specific stuff?

    I have Ubuntu boxes with a 2.6.38 kernel, but they don’t have /sys/kernel/mm/transparent_hugepage
    I have these:

    $ ls -al /sys/kernel/mm/hugepages/hugepages-2048kB/
    total 0
    -r--r--r-- 1 root root 4096 2012-06-19 04:08 free_hugepages
    -rw-r--r-- 1 root root 4096 2012-06-19 04:08 nr_hugepages
    -rw-r--r-- 1 root root 4096 2012-06-19 04:08 nr_hugepages_mempolicy
    -rw-r--r-- 1 root root 4096 2012-06-19 04:08 nr_overcommit_hugepages
    -r--r--r-- 1 root root 4096 2012-06-19 04:08 resv_hugepages
    -r--r--r-- 1 root root 4096 2012-06-19 04:08 surplus_hugepages

  4. Kevin Closson

    On my to-do list is to test in isolation each of the ancillary tunables related to THP. The problem I see is the fact that I disable a default setting and get 35% back from a cpu-saturated workload (no physical I/O either). After I re-enable it I have to walk through a large set of test permutations to “find the goodness.” If I find that the goodness is worth it, good; if not, not good. The fact is, however, this means about 20 hours of non-stop testing for each permutation on a tight schedule.

    I haven’t blogged the default THP damage (imagine that) because it is a default Redhat setting so I presumed mine was an edge-case and there are other related tunables I had not investigated. Having said that I don’t think mine is an edge case (due to Greg’s experience) and for that reason it seems obvious to me that the defaults are not safe for the broadest of workloads. Things like that get better with time, I know. When I finally work out the lab time I’ll push through the various permutations and blog my findings.

  5. Holger Hoffstätte

    This is a generic feature (and not RH specific) but may not be built into your distro kernel. Test:
    $>zgrep HUGE /proc/config.gz
    CONFIG_TRANSPARENT_HUGEPAGE=y
    CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

  6. Greg Rahn

    Appears that Oracle Linux Release 6 Update 2 does not include THP (at least not with UEK):

    # uname -a
    Linux havoc 2.6.32-300.25.1.el6uek.x86_64 #1 SMP Tue May 15 21:12:51 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
    # ls /sys/kernel/mm/
    hugepages  ksm
    
  7. Kevin Closson

    Yes, Holger, you are right. Sorry. Does anyone know if any distro besides RHEL and CentOS adopted the feature? I’m pretty certain the other, um, derivation (OEL) doesn’t adopt the feature.

  8. bank kus

    I don’t see THP as a database/server feature at all. It’s best suited for apps/SPECINT/SPECFP where you cannot change the app but would still like it to score a few large pages. I’d imagine a database vendor would prefer reserving memory hugetlbfs-style.

  9. Mark Seger

    And speaking of collectl ;) (after all, you did show collectl output in the base note), one thing that is very useful, especially on large clusters all running the same configuration like Hadoop, is to run collectl simultaneously on all boxes with colmux and then, using the arrow keys, change the sort columns one by one and make sure the characteristics are at least similar. I’ve seen cases where a small number of boxes out of hundreds or even thousands will jump out at you, something you’d never see by manually running scripts. This is especially true when examining memory, but I just do it everywhere running -s+m (did you know you can add memory to the default like this? ;))

    -mark

  10. Pingback: Log Buffer #275, A Carnival of the Vanities for DBAs | The Pythian Blog
  11. laks

    Please ignore if this is already taken care of. Just a quick thing: did the JVM processes have the -XX:+UseLargePages setting?

  12. laks

    Just a thought, to see if the explicit setting for your JVM version is still needed/has any effect even with the new transparent huge page feature in the latest RHEL 6 kernels.

  13. Greg Rahn

    These JVMs did not use the -XX:+UseLargePages option, nor should they need to. That would only be required for statically allocated hugepages.

  14. Tim

    To the “patched beyond recognition” statement… since you’re running CentOS… you are essentially running Redhat, which is to say, you are in fact running a kernel that’s patched beyond recognition. It’s not a bad thing, it’s just a variable that has to be factored in. I’ve got several boxes with 512 GB of RAM and we’ve found that enabling/disabling THP and the defrag setting yields varying results depending on the workload. Not all things work for all workloads… imagine that. This is precisely why the kernel has lots of knobs to twist. It’s just knowing which knobs and in what order that’s been the challenge! Other kernel variables that are worth investigating are vm.swappiness, vm.dirty_ratio, vm.dirty_background_ratio, etc.

  15. Greg Rahn

    Use

    $ sudo sh -c "echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled"

    Your command isn’t working because the output redirection is performed by your shell, which falls outside of the sudo permission boundary.

  16. Kannappan

    I did a sudo -i and executed the below command. But it doesn’t seem to take effect
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

    The defrag still shows as below
    always [never]

    Not sure if this is right.

    The moment I try doing a get or put to HDFS, I see high CPU, first by a java process and then by a winbindd process. Any idea how I can fix it?

  17. Kannappan

    Thanks Greg. I was able to change it to ‘never’.
    I have encountered another problem. A process named winbindd is causing high CPU when I do anything with HDFS. Simple hadoop commands take an insanely long time to complete. Posted this question on IRC #cloudera. Looks like it is something to do with Samba running on the VM. I am trying to debug further.

  18. g

    I’ve seen this recently on SLES 11 SP2. We’re running multiple SAP SIDs and Oracle. The fix was to disable defrag.

  19. Gerardw

    Have you measured actual application performance? Since a spin lock is just something waiting for a concurrency control, I wonder if CPU time is simply being transferred from the idle process to that particular kernel task?

  20. Greg Rahn

    There was a measurable difference, but I don’t have the exact numbers any more. Fairly easy to reproduce in the lab, though. It’s not just idle CPU becoming system CPU; it’s also stealing away user CPU capacity.

  21. Pingback: Log Buffer #275, A Carnival of the Vanities for DBAs
  22. Siggi

    Thanks Greg for that post!

    I went through a six-week nightmare trying to find out why our software was running into peaks every day without any significant pattern. This was the situation:
    Our software is an in-memory search index written in C and encapsulated by a Java-based webapp running under Tomcat. When our customer started complaining about frequent peaks every morning, we analysed the following:
    - is there any suspect traffic: No
    - is there any scheduler running: No
    - Is our tomcat thread/worker/heapspace setting ok: Yes
    - Is the Java GC doing stupid things: No, it just slows down sometimes without any reason
    - Is there any particular part within our software that’s consuming the time: No, all profiling shows that every part of the software reacts slowly during the peak time
    - Now I learned about perf top (my new best friend!): interesting to see all these kernel-threads and the raw_spin_lock jumping to the top of the list during peak times. Hmm, probably we should reduce multi-core usage of our software. We did so: No effect
    - We added new hardware and went on tweaking our software/configuration: helps slightly, but we can still see peaks (not as frequent/large as before, but still frightening)
    - After six weeks, a second customer reported similar issues. No connection between the two environments but one: SLES 11 SP2
    - At 5 a.m. I cannot sleep and start searching the web. Find your post. Advised the sysadmin to “echo never > /sys/kernel/mm/transparent_hugepage/enabled”
    - At 8:30 he did it
    - Since then (2 days later): not a single peak experienced on the patched server; all other servers are still peaking. Proof!
    - total debugging costs (incl. work from a development team, additional hardware, excl. lost revenue): ~50K

    Conclusion: a perfect example that optimizing a core component (e.g. memory handling in the OS) is a tricky thing, and even being very careful is no guarantee of anything. There will always be unforeseen environments/configurations.

  23. Dan Pritts

    We had similar problems with locally-developed java web apps. At the time we were running in oracle java 6, tomcat6 or 7.

    In our case, we weren’t just seeing increases in CPU time used, we had “stalls” on our web site; connect, wait, wait, wait…. The stalls were long enough for our monitoring software to time out; sometimes they would last for several minutes.

    The problem only happened after the system was up for a while (at least a month), which makes sense – memory is more fragmented, so there is more copying around to be done.

    Like yours, our problems were immediately solved when we disabled transparent hugepages.

    The really sad part is that under normal circumstances our system has at least 50% available memory (most used by disk cache, of course).

  24. dape

    I had the same problem. The Hadoop servers use CentOS 6.2 with 120G memory, and each Hadoop child process needs to allocate a new sort buffer (1024m, io.sort.mb). Sometimes it would take several minutes to finish initializing the job, which slowed down the whole job. Thanks.

  25. pasmac

    Hi Greg,

    Great post.

    I have some questions,

    You recommend doing echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled or
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    Why not echo madvise > /sys/kernel/mm/transparent_hugepage/defrag?

    What are the risks of performing these commands on a live production system?

    Can memory be corrupted, especially on a running database server? Or are there no risks, similar to running drop_caches?

    Thank you in advance,
    Best Regards,
    SM

  26. Pingback: rh6.2 kernel panic 级别的 bug | Jasey Wang
  27. Suya

    We’ve hit the same issue while running our Greenplum cluster 4.2.3.0 on CentOS 6.2; the system CPU percentage was sometimes over 90%, and after turning off THP it stays below 30%. We’re still monitoring the system performance.

  28. Alex Tomic

    Thanks for posting this. I came across a similar issue and am noticing a huge performance improvement; many of my big Hive queries run 2-3x faster now.

    I really wish the Redhat bug were publicly available, as I want to understand why defragmentation causes such a problem. My guess is that any workload involving frequent, parallel instantiation of short-lived but memory-hungry processes could be affected.

  29. Pingback: USB Transfert making the system unresponsive fix | Almost Working Code
  30. Dan Pritts

    Long-lived processes can have the issue, too, Alex. We’ve had the problem on java web apps.

    As an aside, (openjdk/oracle) java can use non-transparent hugepages with -XX:+UseLargePages. You have to statically allocate those huge pages. You can get a pretty big performance benefit.

  31. Frank

    Not similar, but our threads were stuck in a select syscall waiting on zero fd’s.
    Adding the option transparent_hugepage=never to the kernel command line and rebooting fixed the issue:

    title Red Hat Enterprise Linux (2.6.32-220.el6.x86_64)
    root (hd0,0)
    kernel /vmlinuz-2.6.32-220.el6.x86_64 ro root=UUID=96e27abc-9a03-4eec-bf08-4d344ddc6416 transparent_hugepage=never

  32. Shree

    Hi Greg/Others,
    Very nice post. How did you plot the CPU/MEM usage graphs above (THP enabled vs. disabled)? Please share any software/script you used for the same.
    Thanks
    Harsha
