Friday, January 16, 2015

Gunicorn dyno death spiral on Heroku -- Part II

After a lot of investigation, we've figured out there is an issue involving New Relic, Postgres, and Gunicorn. I summarized the issue here:

https://discussion.heroku.com/t/gunicorn-dyno-death-spiral/136/13

After discussing this with Graham Dumpleton over Twitter, it's clear the underlying issue is in libpq. Below is a summary of a rather long Twitter discussion. Anything in quotes is from Graham, though it may have been paraphrased or reworded slightly to make sense here. I didn't want anyone to think I was passing off Graham's words as my own...

The real issue is caused by "the use of the New Relic agent in a background thread which does SSL HTTP calls -- [which] surfaced issues with libpq SSL connections to the database." It can be replicated with a script someone wrote when the bug was reported to Postgres in October 2014 (see the link below). It's not the fault of the NR agent; it just uses a background thread and triggers the same behavior, which is why you can reproduce the issue without the NR agent at all.

So the "NR agent will create a background thread, but if you had other threads for other reasons which did SSL connections, [it] still occurs. If process is completely single threaded without multiple request handler threads, nor background threads doing SSL, [then it] is okay."

http://www.postgresql.org/message-id/CAHUL3dpWYFnUgdgo95OHYDQ4kugdnBKPTjq0mNbTuBhCMG4xvQ@mail.gmail.com
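
For anyone who wants the general shape of the failure mode without digging through that thread, here is a minimal sketch. This is NOT the actual script from the mailing list; it assumes psycopg2 and requests are installed and uses a placeholder SSL-enabled DATABASE_URL. A daemon thread makes HTTPS calls, standing in for the NR agent, while the main thread churns through short-lived SSL connections to Postgres the way Django's connection-per-request behavior does:

    # Hypothetical reproduction sketch -- not the script from the Postgres
    # mailing list. Assumes psycopg2, requests, and an SSL-enabled Postgres URL.
    import threading
    import time

    import psycopg2   # links against libpq
    import requests   # background HTTPS traffic, standing in for the NR agent

    DATABASE_URL = "postgres://user:pass@host/dbname?sslmode=require"  # placeholder

    def background_https_loop():
        # Mimic the New Relic agent: a daemon thread making periodic SSL HTTP calls.
        while True:
            try:
                requests.get("https://example.com/", timeout=10)
            except requests.RequestException:
                pass
            time.sleep(0.1)

    threading.Thread(target=background_https_loop, daemon=True).start()

    # Mimic Django's connection-per-request behavior: repeatedly open and close
    # SSL connections to Postgres. With an affected libpq this can eventually
    # deadlock in the SSL setup/teardown paths, hanging the thread for good.
    while True:
        conn = psycopg2.connect(DATABASE_URL)
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        cur.close()
        conn.close()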

So in a perfect storm, libpq deadlocks, which causes major problems for Gunicorn. The reason comes down to how Gunicorn is designed: its "main thread is used to handle requests, and if that deadlocks then signals aren't handled; or, if it uses a pipe of death, it will never return to the accept on the connection where it gets the message to shut down."

By default, "Django creates a new database connection per request, which exacerbates the problem." So the problem can mostly be alleviated by some form of database connection pooling -- either the persistent connections built into Django 1.6+ or something like django-postgresql-pool or PgBouncer. However, it can still cause issues for Gunicorn, since a connection pool only reduces the likelihood of the problem. Also, because of the way the main thread works, I personally saw the Gunicorn timeout directive have no effect on the problem: the worker was still waiting on Postgres, and therefore still alive, even though Heroku had already killed the request on the client side.
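
For reference, the Django 1.6+ option is just a settings change. Something like this (illustrative values, not our production config) keeps a connection alive across requests instead of opening and closing one per request:

    # settings.py -- sketch of Django 1.6+ persistent connections; all values
    # below are placeholders. CONN_MAX_AGE reuses a connection for up to N
    # seconds instead of reconnecting per request, which reduces -- but does
    # not eliminate -- the chance of hitting the libpq SSL deadlock.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",
            "NAME": "mydb",
            "HOST": "myhost.example.com",
            "USER": "myuser",
            "PASSWORD": "secret",
            "OPTIONS": {"sslmode": "require"},
            "CONN_MAX_AGE": 500,  # seconds; 0 restores per-request connections
        }
    }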

The only real workaround until libpq is fixed is to use something other than Gunicorn, like uWSGI, and deal with the harakiri'd requests, OR to avoid mixing DB calls with calls to HTTPS resources in the same request. In our case we are using Amazon S3, so some requests need to make a bucket request to get information, and those sometimes trigger this deadlock.
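
For anyone going the uWSGI route, harakiri is just a config option. A sketch along these lines (illustrative values and module path, not our exact setup) has uWSGI recycle any worker whose request runs past the timeout, enforced at the C level rather than in Python code that may itself be deadlocked:

    ; uwsgi.ini -- illustrative sketch, not our exact configuration.
    ; harakiri recycles any worker stuck on a request longer than the timeout.
    [uwsgi]
    http-socket = :$(PORT)
    ; placeholder module path below
    module = myproject.wsgi:application
    master = true
    processes = 4
    enable-threads = true
    harakiri = 30
    die-on-term = true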

I know many people in my local user group who have written off Gunicorn on Heroku, thinking the issue was Gunicorn and/or New Relic. However, the problem is in Postgres, and it is exacerbated when the NR agent is used.

While we (GreatBizTools) have a workaround in place, it would be great if Heroku (and maybe New Relic, if you can get them on board) could poke the Postgres folks to fix this deadlock, which was reported to them in October 2014. I can't imagine the number of folks that have been caught out in the rain over the past year or so by this. There are several blog posts that (we now know wrongly) point fingers at Heroku for not supporting Gunicorn correctly.

We can't be the only people running Python on Heroku with New Relic and Postgres; it must be a really popular combination.

---

Edit: Heroku is now aware of the issue and is working with the respective parties to rectify this low-level issue. -- January 16th, 2015

2 comments:

  1. The report on the Postgres list actually came from New Relic (Amjith). We have tried to work with a Postgres committer to come up with a resolution since then, but there was no simple, obvious change that could be made in libpq, as dynamic unloading of libraries from memory complicated things. We have not heard anything since, and since Postgres doesn't have a public issue tracker that we know of, there is nowhere to track what is going on.

    As to this just affecting Gunicorn on Heroku, that isn't the case. Any multithreaded WSGI server can be affected, and we identified it when using mod_wsgi in testing. I have also had sporadic reports of mod_wsgi lockups ever since mod_wsgi was released. Since we tracked this down, I have separately pinned two such lockups seen by mod_wsgi users on the libpq issue, so I suspect it may be the problem I have been seeing for many years, even when New Relic wasn't being used.

    Both mod_wsgi and uWSGI have solid mechanisms for dealing with locked-up processes and shutting them down, when configured, which aren't dependent on Python code being able to run, as they operate at the C level. Unfortunately, mod_wsgi can only be run on Heroku with Python 3.3 and 3.4; it is not possible on 2.7 because of limitations in the Heroku Python installation, so if you're on 2.7 it isn't really an option.

  2. Graham, thanks for the reply and for adding context for future finders of this post. You're right that this affects other WSGI servers, NOT just Gunicorn, but it fails WAY harder on Gunicorn than on others (IMO).

    And it's not specific to Heroku either, as you point out. Apologies for implying otherwise -- this blog post is mostly a lightly edited copy of a message to Heroku support. Admittedly, my blog is more of a notebook than anything well written.

    I thought it was better to post it than to leave little information out there, because the majority of complaints / forum posts / Stack Overflow pleas I've found about this are aimed squarely at Heroku. They also pretty much wrongly blame everything on Gunicorn and make app servers like uWSGI and others out to be magic bullets (which is wrong as well; they just handle this circumstance better when it comes up).
