Monday, January 12, 2015

Gunicorn dyno death spiral on Heroku

FYI -- Gunicorn dyno death spiral on Heroku -- Part II is now available

-----

We recently released our app XXXX on Heroku using Gunicorn. However, we quickly found that under even the most modest production load (as few as 10 users), some dynos would stop responding and throw continuous H12 errors for hours.

We experienced three separate events (January 5-6) in which one or more dynos stopped serving requests with Gunicorn and threw an H12 error for every request, while the load metric on that particular dyno spiked from 0.2-0.5 to 1.5 or higher. The only remedy was to read the logs, identify the affected dyno, and manually restart it with heroku ps:restart web.X.

We experienced the same issue as outlined on this thread on the Heroku forums:

https://discussion.heroku.com/t/gunicorn-dyno-death-spiral/136

We were able to track it down to "bad clients" using the application -- they were always Verizon Wireless or Sprint Mobility aircards on laptop computers. We have a single client using this application so it was easy to confirm with them that the reverse IP was indeed Verizon Wireless or Sprint.

Our guess is that a client would not close a connection or respond with ACK messages for the streamed response, so the request would exceed Heroku's 30-second limit. When Heroku returned an H12 for that request, the Gunicorn worker was left to continue working -- tied up in an unrecoverable state. This happened repeatedly (we were only running 3 workers per dyno) until all workers on a single dyno had stopped responding. At that point, the routing mesh would continue routing requests to this rogue dyno, but the dyno would just return H12s until it was manually restarted.
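To make that guess concrete, here is a toy model of a single dyno. It is only a sketch of the failure mode as we understand it, not Gunicorn or Heroku code: the `Dyno` class and its `handle` method are made up for illustration, and the only number taken from our setup is the 3 workers per dyno.

```python
WORKERS_PER_DYNO = 3  # what we were running; everything else here is hypothetical


class Dyno:
    """Toy model: each bad client permanently ties up one worker."""

    def __init__(self, workers=WORKERS_PER_DYNO):
        self.free_workers = workers

    def handle(self, client_is_bad):
        if self.free_workers == 0:
            # No worker left to pick up the request, so the router times out.
            return "H12"
        if client_is_bad:
            # The worker hangs on this client and is never returned to the pool.
            self.free_workers -= 1
            return "H12"
        return "200"


dyno = Dyno()
traffic = [False, True, False, True, True, False, False]  # True = bad client
print([dyno.handle(bad) for bad in traffic])
# ['200', 'H12', '200', 'H12', 'H12', 'H12', 'H12']
```

In this model, the third bad client takes the last worker, and every request after that fails even though the clients sending them are perfectly healthy. That matches what we saw: a dyno that returns nothing but H12s until it is restarted.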

We have confirmed it is NOT our application code. The application runs just fine when Gunicorn is swapped out for uWSGI (we also tested Waitress successfully). We have been running uWSGI on the XXXX application since the evening of January 6th, and since switching permanently we have not experienced any more events where dynos death spiral out of control. We still occasionally see a bad client and request -- however, we are using the harakiri option in uWSGI, so the rogue worker is killed and respawned after 25 seconds.
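The idea behind harakiri is a hard deadline on each unit of work: anything that runs past it gets killed instead of being left to hang. The sketch below shows that idea with a plain child process and the standard library. It is not how uWSGI implements it, and `run_with_deadline` is a made-up helper name. The 25 seconds is the value we use, chosen to sit under Heroku's 30-second limit.

```python
import subprocess
import sys
import time

HARAKIRI_SECONDS = 25  # our setting; kept below Heroku's 30-second limit


def run_with_deadline(argv, deadline_seconds=HARAKIRI_SECONDS):
    """Run a child process and kill it if it outlives the deadline.

    Returns (finished, elapsed_seconds).
    """
    started = time.monotonic()
    try:
        subprocess.run(argv, timeout=deadline_seconds, check=False)
        finished = True
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        finished = False
    return finished, time.monotonic() - started


# A "stuck worker" that would otherwise sleep for a minute, with a short
# deadline so the demo finishes quickly.
stuck = [sys.executable, "-c", "import time; time.sleep(60)"]
finished, elapsed = run_with_deadline(stuck, deadline_seconds=1)
print(finished)  # False: the child was killed at the deadline
```

The point is that the stuck process never gets the chance to hold its slot for hours. A replacement can take over as soon as the deadline passes.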

The question we have is why Heroku continues to recommend using Gunicorn when other people like ourselves have experienced terrible results with this particular application server.
