We had a problem with a script. Actually, two problems.
Firstly, it was restarting itself pretty regularly. Monit was configured to restart it when it came down (the script’s purpose was business-critical), but the Ops team were getting miffed with all the Monit warning emails. After a few exchanges, a lot of searching through the code, and several attempts to re-create the issue, we noticed that Monit was restarting the script mostly on the half-hour, every hour. “Can you have a look at the list of crons for that machine?” “Yup. There’s only one - it kills a script every hour, on the half-hour.”
When I finished laughing at my wasted days, I made a mental note: all crons on live machines should be documented.
Actually the days weren’t entirely wasted, because while searching through the code I spotted another potential problem, which was promptly fixed.
However, once the cron was commented out, a different sort of problem reared its ugly head - one much more difficult to track down.
The script was reading from an ActiveMQ topic and writing the messages to disk. Nothing else. The AMQ network was pretty well configured (at least seven brokers, three different locations, at least two brokers per location). The topic was not busy, but messages had to be received whenever they were sent.
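The consumer’s shape was roughly this - a Python sketch, since the actual PHP code isn’t shown, with a fake in-memory source standing in for the Stomp subscription:

```python
import json

class FakeTopic:
    """Stand-in for a Stomp topic subscription; hands back queued messages."""
    def __init__(self, messages):
        self._messages = list(messages)

    def receive(self):
        # Return the next message, or None when nothing is waiting.
        return self._messages.pop(0) if self._messages else None

def drain_to_disk(topic, path):
    """Read every available message from the topic and append it to a file."""
    written = 0
    with open(path, "a") as fh:
        while (msg := topic.receive()) is not None:
            fh.write(json.dumps(msg) + "\n")
            written += 1
    return written

topic = FakeTopic([{"id": 1}, {"id": 2}])
count = drain_to_disk(topic, "/tmp/amq-messages.log")
print(count)  # 2
```

A real version would block on the subscription rather than poll a list, but the write-everything-to-disk loop is the whole job.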
The cron had been put in place because of this second problem: after running for a period of time (one day, or seven), the script would still be running (hence no Monit kick), but no messages would get through to it, rendering it completely useless.
After a while spent trying to recreate this problem, I tried something everyone should: null-routing packets between the script and the ActiveMQ host. This is an easy thing to do with iptables:
iptables -A OUTPUT -d <broker IP> -j DROP
iptables -A INPUT -s <broker IP> -j DROP
(iptables is very useful. Less so than tcpdump, but for those times when you want to test something like a network problem, or a slow transfer rate, it’s pretty handy).
(Also, to delete those rules - if you had no rules in your tables before, just run this (-D OUTPUT 1 deletes the first rule in the OUTPUT chain):
iptables -D OUTPUT 1
iptables -D INPUT 1
Otherwise you’ll have to find them using:

iptables -L OUTPUT --line-numbers
iptables -L INPUT --line-numbers

Find the rules that match the ones entered above, and replace the ‘1’ in the -D commands with their rule number - the --line-numbers flag prints each rule’s position in the chain, starting from 1 at the top.)
There was code in the script designed to detect when the connection had been dropped, but this was a long way from the kernel network stack (a PHP script using the PECL Stomp module).
To my surprise / bemusement / relief, the script had absolutely no idea that the connection had been dropped. I could send messages and they’d not get picked up, but the script was not even trying to reconnect. This was a complete recreation of the problem in live. The connection was probably being assumed to be idle by something in the middle (router / firewall / pixie) and silently severed without notifying either end.
It was a bit puzzling, although it gave me deja vu of a previous job where long-running MySQL queries in PHP CLI tasks would appear to finish according to SHOW PROCESSLIST, but the PHP script (with max_execution_time turned off) would never get the resultset back - presumably because the connection had been silently dropped by something in between.
We then changed the script to keep track of how long it had kept its connection open, and to reconnect if it exceeded a maximum time. There have been no problems reported since.
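The fix amounts to recording when the connection was opened and proactively cycling it once it passes a maximum age. The original was PHP with PECL Stomp; here is a language-neutral sketch in Python, with stand-in connect/disconnect callables (and an injectable clock so the behaviour can be demonstrated without waiting):

```python
import time

# Assumption: pick a max age comfortably shorter than whatever silent-drop
# window the middlebox imposes. 15 minutes is an illustrative guess.
MAX_CONNECTION_AGE = 15 * 60  # seconds

class AgedConnection:
    """Wraps a connect/disconnect pair and reconnects once the link gets old.

    `connect` and `disconnect` are stand-ins for whatever the real client
    library provides; `clock` is injectable for testing.
    """

    def __init__(self, connect, disconnect,
                 max_age=MAX_CONNECTION_AGE, clock=time.monotonic):
        self._connect = connect
        self._disconnect = disconnect
        self._max_age = max_age
        self._clock = clock
        self._opened_at = None

    def ensure_fresh(self):
        """Open the connection, or cycle it if it has exceeded max_age."""
        now = self._clock()
        if self._opened_at is None:
            self._connect()
            self._opened_at = now
        elif now - self._opened_at > self._max_age:
            self._disconnect()
            self._connect()
            self._opened_at = now

# Demonstration with a fake clock to show the reconnect happening:
events = []
fake_now = [0.0]
conn = AgedConnection(
    connect=lambda: events.append("connect"),
    disconnect=lambda: events.append("disconnect"),
    max_age=900,
    clock=lambda: fake_now[0],
)
conn.ensure_fresh()   # first call opens the connection
fake_now[0] = 1000.0  # pretend 1000s have passed (> 900s max age)
conn.ensure_fresh()   # connection is torn down and reopened
print(events)         # ['connect', 'disconnect', 'connect']
```

Calling ensure_fresh() at the top of the receive loop means the script never trusts a connection older than the cutoff, regardless of whether the library ever notices the drop.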