Resolving 'Java OOM: Unable to Create New Native Thread' Errors on Heroku
For the most part, I’m a very happy Heroku user. The platform lets me deploy my apps, be they Java, Scala or Ruby-based, without having to think or worry about infrastructure, which is amazing. It also lets me do this mostly for free. I love it, and so do many others. That said, sometimes you do run into problems that cause you to lose lots of time debugging … in the wrong direction.
Yesterday, after adding non-blocking I/O to my Metascraper library, I load tested my deployed application and found it crashing fatally with java.lang.OutOfMemoryError: unable to create new native thread errors. Not good.
What I did wrong
Because of the error thrown, I immediately thought to myself: “there’s gotta be a memory leak. Good thing I’m monitoring the app with New Relic!”. If you Google the error, you find a lot of posts suggesting that your app has a memory leak somewhere, that you need to tweak your VM memory options, etc. All of this advice is valid.
After looking at my instance’s memory usage though, it didn’t seem like that was the problem; used heap, committed heap, etc. all looked fine. I probably should have stopped looking at memory usage there, but I didn’t, and I proceeded to spend a few hours going through the cycle of combing through code, tweaking memory options, and testing. To no avail.
Seeing the light
Then, finally, I stumbled across this page, talking about how to resolve said OOM error. In short, the JVM apparently throws this OutOfMemoryError whenever it can’t get a new thread from the OS, regardless of the cause. Their solution was to raise the maximum number of processes per user. Hmmmm.
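One way to tell thread exhaustion apart from heap exhaustion is to log the JVM's thread counts next to its heap usage while the load test runs. The following is only a minimal sketch using the standard java.lang.management beans; the class name ThreadSnapshot and its methods are made up for illustration. Note that these counters cover Java threads only, so the number the OS sees will be somewhat higher.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not part of Metascraper): reports thread counts and heap usage.
final class ThreadSnapshot {

    /** Java threads currently alive (daemon and non-daemon). */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Highest live thread count since the JVM started (or since the peak was reset). */
    static int peakThreads() {
        return ManagementFactory.getThreadMXBean().getPeakThreadCount();
    }

    /** Heap currently in use, in bytes. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** One log line combining the numbers above. */
    static String describe() {
        return "threads live=" + liveThreads()
                + " peak=" + peakThreads()
                + " heapUsedMB=" + (heapUsedBytes() / (1024 * 1024));
    }

    public static void main(String[] args) {
        // In a real app this line would be logged periodically during the load test.
        System.out.println(describe());
    }
}
```

If the heap figure stays flat while the thread counts climb right up until the crash, the memory-leak theory is probably the wrong one.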
Since I had added non-blocking I/O, which inherently must be doing some kind of threading somewhere, I felt I was on to something. Googling “Heroku thread limits” brought me to this page, which says:
1X Dynos are limited to a combined sum of 256 processes and threads. 2X Dynos are limited to 512. This limit applies whether they are executing, sleeping, or in any other state.
Bingo. But because there was nothing I could do about Heroku’s thread and process limits, I decided to read the code for Dispatch, the HTTP library I use to fetch pages from URLs, to see how it manages threads. This is where things get icky.
Diving in
Apparently, in previous versions of Dispatch, one could configure the number of threads easily (I believe in 0.9.x, you had access to a threads method). However, in the version that I’m using (0.11.0, the latest as of writing), you do not. Moreover, unless it is being called from sbt, the library now defaults to building clients using the default configuration for the underlying async-http-client (which does make sense). Unfortunately, it appears that the default configuration results in the use of Executors.newCachedThreadPool, which some say is good and bad.
Problem identified
The main point is this: because of its use of newCachedThreadPool, async-http-client, and thus Dispatch, is going to use as many threads as necessary to handle the workload that you give it and rely on the pool to clean up idle threads later (a cached pool only retires a thread after it has sat idle for 60 seconds). Usually, this might not be a problem, but when running on Heroku or any other environment where you might hit thread limit constraints, the cleanup might not happen quickly enough to keep your program from crashing.
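The growth behaviour is easy to reproduce with plain java.util.concurrent, without Dispatch or async-http-client in the picture. The sketch below is an illustration only: CachedPoolGrowth is a made-up class, and the blocked tasks merely stand in for slow HTTP requests.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical demo class: shows a cached pool creating one thread per in-flight task.
final class CachedPoolGrowth {

    /** Submits `tasks` jobs that all block at once and returns the largest pool size reached. */
    static int peakThreads(int tasks) {
        // Executors.newCachedThreadPool() is backed by a ThreadPoolExecutor in the JDK.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        CountDownLatch started = new CountDownLatch(tasks);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        release.await(); // stand-in for a slow request
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // every task is now running on its own thread
            return pool.getLargestPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("threads used for 50 concurrent tasks: " + peakThreads(50));
    }
}
```

Because no thread is ever idle while the tasks are blocked, the pool has nothing to reuse and creates a new thread for every submission. Scale the task count up past the platform's thread limit and you get the crash described above.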
Resolution
To fix the scary “OOM unable to create new thread” problem when an app using my library is running in such an environment, I did a bit of sleuthing to find out how I might limit the number of threads used by my HTTP library and came up with this.
It makes more sense when you look at the entire Actor source, but in short, I instantiate an HTTP client, passing in an ExecutorService that uses a fixed thread pool. I then allow library users to configure the number of threads for the client (and other options) when instantiating the actor. Of course, this means that an actor’s HTTP client will wait if all execution threads are busy, but since it’s a non-blocking call, the actor itself doesn’t care, and the only negative result is possibly slower operations under load. All in all, I think it’s a good tradeoff for not having your app die.
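The effect of swapping in a fixed pool can be sketched with the JDK alone. This is not the Metascraper code and it deliberately leaves out the Dispatch wiring; BoundedPool and its parameters are hypothetical names, and the short sleep stands in for an HTTP call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: shows a fixed pool capping thread usage under load.
final class BoundedPool {

    /** Runs `tasks` short jobs on a pool of at most `maxThreads` threads; returns the peak pool size. */
    static int peakThreads(int maxThreads, int tasks) {
        // Executors.newFixedThreadPool(n) is backed by a ThreadPoolExecutor in the JDK.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(maxThreads);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(2); // stand-in for an HTTP call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown(); // stop accepting work, let queued tasks drain
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
            return pool.getLargestPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the pool", e);
        } finally {
            pool.shutdownNow(); // no-op if the pool has already terminated
        }
    }

    public static void main(String[] args) {
        System.out.println("threads used for 100 tasks: " + peakThreads(4, 100));
    }
}
```

The extra tasks wait in the pool's queue instead of getting threads of their own. That queue is unbounded, so a fixed pool trades thread pressure for queued work in memory, which is worth keeping in mind when choosing the thread count.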
Lessons learned
- Don’t take an error message at face value. Know exactly when it gets thrown and if there are multiple possible causes, go for the most likely one first.
- Know your environment and its constraints.
Hope this post helped you!