Friday, January 16, 2015

Gunicorn dyno death spiral on Heroku -- Part II

After a lot of investigation, we've figured out that there is an issue involving New Relic, Postgres, and Gunicorn. I summarized the issue here:

https://discussion.heroku.com/t/gunicorn-dyno-death-spiral/136/13

After discussing this with Graham Dumpleton over Twitter, it appears the underlying issue is in libpq. Below is a summary of a rather long Twitter discussion. Anything in quotes is from Graham, though it may have been paraphrased or reworded slightly to make sense here. I didn't want anyone to think I was passing off Graham's words as my own...

The real issue is caused by "the use of the NewRelic agent in a background thread which does SSL HTTP calls -- surfaced issues with libpq SSL connections to database." It can be replicated with a script someone wrote when the bug was reported to Postgres in October 2014 (see link below). It's not the fault of the NR agent -- it just happens to use a background thread and so triggers the same behavior. That's why you can reproduce the issue without the NR agent at all.

So the "NR agent will create a background thread, but if you had other threads for other reasons which did SSL connections, [it] still occurs. If process is completely single threaded without multiple request handler threads, nor background threads doing SSL, [then it] is okay."

http://www.postgresql.org/message-id/CAHUL3dpWYFnUgdgo95OHYDQ4kugdnBKPTjq0mNbTuBhCMG4xvQ@mail.gmail.com
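To make the pattern concrete, here is my own rough sketch of the kind of script being described -- it is not the script from the report above, and the DSN, URL, and names are placeholders. The shape is: a background thread making SSL HTTP calls while the main thread opens SSL libpq connections.

    import threading
    import psycopg2   # database driver that links against libpq
    import requests   # makes HTTPS calls through OpenSSL, much like the agent's uploads

    DSN = "dbname=mydb sslmode=require"   # placeholder DSN; sslmode=require forces SSL in libpq
    HTTPS_URL = "https://example.com/"    # stand-in for the agent's HTTPS endpoint

    def background_https_traffic():
        # Mimic the agent: a daemon thread that periodically makes SSL HTTP calls.
        while True:
            try:
                requests.get(HTTPS_URL, timeout=10)
            except requests.RequestException:
                pass

    t = threading.Thread(target=background_https_traffic)
    t.daemon = True
    t.start()

    # Main thread: open and close SSL database connections in a loop, the way a
    # Django app that creates a new connection per request would.
    while True:
        conn = psycopg2.connect(DSN)
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        cur.close()
        conn.close()

According to the report, running something shaped like this for long enough can leave the main thread stuck inside libpq.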

So in a perfect storm, libpq deadlocks, which causes large problems for Gunicorn. The reason comes down to how Gunicorn is designed: its "main thread is used to handle requests and if that deadlocks then signals aren't handled or if it uses a pipe of death, it will never return to accept on connection where it gets message to shutdown."

By default, "Django creates a new database connection per request which exacerbates the problem." So the problem can mostly be alleviated by using some sort of database connection pooling -- either the persistent connections built into Django 1.6+ (sketched below) or something like django-postgresql-pool or PGBouncer. However, this can still cause issues for Gunicorn, because pooling only reduces the likelihood of the problem rather than eliminating it. Also, because of the way the main thread works, I personally saw that the Gunicorn timeout directive had no effect on the problem: the worker was still waiting on Postgres and therefore still alive, even though Heroku had already killed the request on the client side.
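For reference, the option built into Django 1.6+ is persistent database connections via CONN_MAX_AGE. A minimal sketch of the relevant settings, with placeholder names and values:

    # settings.py -- persistent connections (Django 1.6+); names and values are placeholders.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",
            "NAME": "mydb",                      # placeholder database name
            "CONN_MAX_AGE": 600,                 # keep connections open for up to 10 minutes
            "OPTIONS": {"sslmode": "require"},   # Heroku Postgres connections use SSL
        }
    }

This only cuts down on how often a new SSL connection has to be negotiated, which is why it reduces the likelihood of the deadlock rather than eliminating it.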

The only real workaround until libpq is fixed is to use something other than Gunicorn, such as Uwsgi, and deal with the harakiri'd requests, OR to avoid mixing DB calls with calls to HTTPS resources. In our case, we use Amazon S3, so some requests need to make a bucket request to get information, and those requests sometimes trigger this deadlock.

I know many other people in my local user group who have written off Gunicorn on Heroku, thinking the issue was Gunicorn and/or New Relic. However, the problem is in Postgres, and it is merely exacerbated when the NR agent is used.

While we (GreatBizTools) have a workaround in place, it would be great if Heroku (and maybe New Relic too) could poke the Postgres folks to fix this deadlock, which was reported to them in October 2014. I can't imagine how many people have been caught out in the rain by this over the past year or so. There are several blog posts that (we now know wrongly) point fingers at Heroku for not supporting Gunicorn correctly.

We can't be the only people running Python on Heroku with New Relic and Postgres; it must be a really popular combination.

---

Edit: Heroku is now aware of the issue and is working with the respective parties to rectify this low-level problem. -- January 16th, 2015

Monday, January 12, 2015

Gunicorn dyno death spiral on Heroku

FYI -- Gunicorn dyno death spiral on Heroku -- Part II is now available

-----

We recently released our app XXXX on Heroku using Gunicorn; however, we quickly found that under even the most modest production load (as little as 10 users), some dynos would stop responding and start throwing continuous H12 errors for hours.

We experienced three separate events (on January 5-6) where one or more dynos would stop serving requests with Gunicorn, throw H12 errors for every request, and see their load metrics spike from 0.2-0.5 to 1.5 or higher. The only remedy was to read the logs to identify the rogue dyno and manually kill it with heroku ps:restart web.X.

We experienced the same issue as the one outlined in this thread on the Heroku forums:

https://discussion.heroku.com/t/gunicorn-dyno-death-spiral/136

We were able to track it down to "bad clients" using the application -- they were always Verizon Wireless or Sprint Mobility aircards on laptop computers. We have a single customer using this application, so it was easy to confirm with them that the reverse IPs were indeed Verizon Wireless or Sprint.

Our guess is that a client would not close the connection or respond with ACK messages for the streamed response and would therefore exceed the 30-second limit. When Heroku raised an H12 on the request, the Gunicorn worker was left tied up in an unrecoverable state. This happened repeatedly (we were only running 3 workers per dyno) until all workers on a single dyno had stopped responding. At that point, the routing mesh would continue routing requests to the rogue dyno, but the dyno would just return H12s until it was manually restarted.

We have confirmed it is NOT our application code. The application runs just fine when Gunicorn is swapped out for Uwsgi (we also tested Waitress with success). We have been running Uwsgi on the XXXX application since the evening of January 6th and have not experienced any more events where dynos death spiral out of control. We still occasionally see a bad client and request; however, we are using the harakiri option in Uwsgi, so the rogue worker is killed and respawned after 25 seconds.
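For anyone wanting to try the same setup, the key piece is Uwsgi's harakiri option, which kills and respawns any worker stuck on a single request for longer than the given number of seconds. A sketch of the kind of Procfile entry we mean -- the module path and worker count are placeholders, not our exact configuration:

    web: uwsgi --http-socket :$PORT --master --processes 3 --module myproject.wsgi:application --harakiri 25

With harakiri set to 25 seconds, a worker wedged by a bad request is recycled instead of staying tied up indefinitely, which prevents the dyno-wide spiral described above.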

The question we have is why Heroku continues to recommend using Gunicorn when other people like ourselves have experienced terrible results with this particular application server.