After a lot of investigation, we've figured out there is an issue with NewRelic, Postgres and Gunicorn. I summarized the issue here:https://discussion.heroku.com/t/gunicorn-dyno-death-spiral/136/13
After discussing this with Graham Dumpleton over Twitter there is an issue with libpq. Below is a summary of a rather long Twitter discussion. Anything in quotes is from Graham but could have been paraphrased or reworded slightly to make sense here. Didn't want anyone to think that using Graham's words as my own...
The real issue is caused by "the use of the NewRelic agent in a background thread which does SSL HTTP calls -- surfaced issues with libpq SSL connections to database." This can be replicated by the use of a script someone wrote when the bug was reported to Postgres in October, 2014 (see link below). It's not the fault of the NR agent -- just that it uses a background thread and triggers the same behavior. That's why you can reproduce the issue without the NR agent.
So the "NR agent will create a background thread, but if you had other threads for other reasons which did SSL connections, [it] still occurs. If process is completely single threaded without multiple request handler threads, nor background threads doing SSL, [then it] is okay."http://www.postgresql.org/message-id/CAHUL3dpWYFnUgdgo95OHYDQ4kugdnBKPTjq0mNbTuBhCMG4xvQ@mail.gmail.com
So in a perfect storm, libpq deadlocks which causes large issues for Gunicorn. The reason why is how Gunicorn is designed and that a "main thread is used to handle requests and if that deadlocks then signals aren't handled or if it uses a pipe of death, it will never return to accept on connection where it gets message to shutdown."
By default, "Django creates a new database connection per request which exacerbates the problem." So the problem can mostly be alleviated by using some sort of database connection pooling -- either what is built-in into Django 1.6+ or something like django-postgresql-pool or PGBouncer however it still can cause issues for Gunicorn as the db pool only reduces the likelihood of the problem. Also, because of the way the main thread works -- that is why I personally saw that the Gunicorn timeout directive have no affect on the problem because the worker was still waiting for Postgres and therefore was still alive despite the fact that Heroku killed the request on the client side.
The only real work around until libpq is fixed is to use something other than Gunicorn like Uwsgi and deal with the Harakiri requests OR don't mix DB calls when calls to HTTPS resources. In our case, we are using Amazon S3 so some requests need to make a bucket request to get information and sometimes cause this deadlock issue.
I know many other people in my local user group that have written off Gunicorn on Heroku thinking the issue was Gunicorn and/or New Relic. However, the problem is in Postgres and it is exacerbated when the NR agent is used.
While we (GreatBizTools) has a workaround in place, it would be great if Heroku (and maybe you can get New Relic) to poke at the Postgres folks to fix this deadlock into that was reported to them in October, 2014. I can't image the number of folks that have been caught out in the rain in the past year or so by this. There are several blog posts that (now wrongly) point fingers at Heroku for not supporting Gunicorn correctly.
We can't be the only people to use Heroku on Python with New Relic and Postgres? It must be a really popular combination.
Edit: Heroku is now aware of the issue and is working with respective parties in order to rectify this low level issue. -- January 16th, 2015