production resultsdb fails to run
Closed, ResolvedPublic

Description

https://taskotron.fedoraproject.org/resultsdb/results either times out or returns 500. Restarting httpd helped for a few minutes and then started failing again (at the moment, I restarted again, let's see how long that hold up). In /var/log/httpd/error_log I see the following error repeated several hundred times per minute:

[Wed Mar 15 09:49:56.314409 2017] [wsgi:error] [pid 16429] (11)Resource temporarily unavailable: [client 192.168.1.37:35806] mod_wsgi (pid=16429): Unable to connect to WSGI daemon process 'resultsdb' on '/etc/httpd/run/wsgi.16044.0.2.sock' after multiple attempts as listener backlog limit was exceeded.

It seems we're running over our capacity of requests to handle? There's some discussion here.

Also, a few times per minute I see an error like this:

[Wed Mar 15 09:49:57.449294 2017] [mpm_prefork:error] [pid 16044] (11)Resource temporarily unavailable: AH00159: fork: Unable to fork new process

Maybe we're even running out of PIDs or something?

When looking at /var/log/httpd/access_log, I see hundreds/thousands of 503 and 504 responses, like this:

192.168.1.12 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/dist.rpmgrill.spec-file-encod
ing/results?type=koji_build&item=pigz-2.3.4-2.fc26 HTTP/1.1" 503 299 "-" "brt-dashboard/0.1"
192.168.1.14 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/scratch.dockerautotest/result
s?type=koji_build&item=pigz-2.3.4-2.fc26 HTTP/1.1" 503 299 "-" "brt-dashboard/0.1"
192.168.1.25 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/compose.install_kickstart_fir
ewall_configured/results?type=koji_build&item=policycoreutils-2.6-3.fc26 HTTP/1.1" 503 299 "-" "brt-dashboard/0.1"
192.168.1.52 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/dist.rpmgrill.security-policy
/results?type=koji_build&item=python-chardet-2.3.0-3.fc26 HTTP/1.1" 503 299 "-" "brt-dashboard/0.1"
192.168.1.15 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/compose.server_cockpit_basic/
results?type=koji_build&item=policycoreutils-2.6-3.fc26 HTTP/1.1" 503 299 "-" "brt-dashboard/0.1"
192.168.1.12 - - [15/Mar/2017:09:52:53 +0000] "GET /resultsdb_api/api/v1.0//testcases/compose.install_kickstart_fir
ewall_disabled/results?type=koji_build&item=plymouth-0.9.3-0.7.20160620git0e65b86c.fc26 HTTP/1.1" 503 299 "-" "brt-
dashboard/0.1"

I think this could be related. Because bodhi queries $item for all available testcases, it creates a hundred calls for each page displayed. Most of them are not present in database, though (because we obviously don't test every item with every testcase), therefore the search takes a long time and times out eventually (I've never seen a query for non-existent result to return before it times out, so this is a 100% guaranteed timeout). That could be the reason why the requests queue fills up that quickly. I believe we need to do something about a) the way Bodhi queries us b) improve the experience for non-existent results (a default time limit, perhaps).

Another reason that we're overloaded could be that bots index us:

192.168.1.78 - - [15/Mar/2017:09:59:46 +0000] "GET /resultsdb_api//api/v1.0/testcases/compose.install_btrfs/results?type=koji_build&item=sane-backends-1.0.25-6.fc25&callback=jQuery1102014881141134537756_1489571775067&_=1487808000000 HTTP/1.1" 200 542 "https://bodhi.fedoraproject.org/updates/sane-backends-1.0.25-6.fc25" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

We should exclude bot indexing?

kparal created this task.Mar 15 2017, 10:07 AM
kparal moved this task from Restricted Project Column to Restricted Project Column on the Restricted Project board.

A short term patch could be to increase the number of processes allocated to the wsgi application.

I wonder what changed - I hit a potentially similar issue where resultsdb just stopped responding to anything.

Today resultsdb_api seems to run somewhat (testcases work, but some tasks fail to post results). The frontend seems completely non-functional (testcases return Service Unavailable).

There are some new errors to be seen:

[Thu Mar 16 09:07:30.902985 2017] [wsgi:alert] [pid 29593] (11)Resource temporarily unavailable: mod_wsgi (pid=29593): Couldn't create worker thread 42 in daemon process 'resultsdb'.

and some old friends:

[Thu Mar 16 09:07:39.841411 2017] [wsgi:error] [pid 22199] (11)Resource temporarily unavailable: [client 192.168.1.37:46160] mod_wsgi (pid=22199): Unable to connect to WSGI daemon process 'resultsdb_frontend' on '/etc/httpd/run/wsgi.22188.0.3.sock' after multiple attempts as listener backlog limit was exceeded.

Googlebot is still querying us constantly.

I have deployed new robots.txt: https://taskotron.fedoraproject.org/robots.txt
We should see whether it helps in a day or so (once Google refreshes that file). Btw, these are the current per-hour access numbers from Google:
2568
2341
3083
4336
3213
4225
3738
So about 3300 hits per hour on overage, 55 per minute. They are performed in bursts, though (ten or twenty simultaneous requests, then a pause, then again).

Googlebot stopped hammering us, new robots.txt work. They are present in ansible now. However, it didn't solve the problems, resultsdb (and execdb) is down again.

kparal lowered the priority of this task from "Unbreak Now!" to "High".Mar 21 2017, 11:07 AM

Production resultsdb seems to work better now, so lowering the priority. However, since we can't track how many tasks failed to post results (since we don't fail them if they do so), it's hard to guess how much uptime do we actually have. Maybe we should start failing jobs if resultsdb responds 5xx? Or retry a few times and fail eventually?

kparal closed this task as "Resolved".Aug 1 2017, 1:03 PM
kparal claimed this task.

Seems better now.