
One thing you can do is manage your events with a queue. There is standard technology for dealing with queues; Microsoft's MSMQ is one example. You can even choose a queue in the cloud if you want to get fancy. Wait. Hold that thought. You do not want the support calls. So choose a mature and stable queuing technology.
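To make the queue idea concrete, here is a minimal sketch. It uses Python's standard-library queue.Queue as an in-process stand-in for a real message queue product; it is not MSMQ and makes no claim about how MSMQ's API looks. The function name run_event_queue and the "handler" that just records each event are assumptions for illustration.

```python
import queue
import threading


def run_event_queue(events):
    """Push events through a FIFO queue and return them in the order handled."""
    q = queue.Queue()
    handled = []

    def worker():
        # Consume events one at a time until the sentinel arrives.
        while True:
            event = q.get()
            if event is None:  # sentinel: the producer has nothing more to send
                break
            handled.append(event)  # hypothetical handler: just record the event

    consumer = threading.Thread(target=worker)
    consumer.start()

    # The producer side only enqueues; it never calls the handler directly.
    for event in events:
        q.put(event)
    q.put(None)

    consumer.join()
    return handled
```

The point of the sketch is the decoupling: the code that raises an event and the code that handles it only share the queue, so a slow handler does not block the producer.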
Many business processes are batch oriented. Each stage requires the output from the prior stage. Our back end is a big UNIX system. You would think you could just pipe the processes together. That works in theory. It is bad practice in reality, because when one stage dies mid-stream you have nothing on disk to inspect or restart from. You should design a scheduling system that spawns the processes separately. The results from each stage should go to a file. The scheduler can wake up every so often, see if a stage has completed, then start up the next stage. Our loading software follows this pattern. It is rock solid, even when things go awry.
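The scheduling pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not our actual loading software: the function name run_pipeline, the stage file names, and the ".partial" rename convention are all hypothetical. Each stage is spawned as its own process, reads the previous stage's file, and writes to its own file, while the scheduler polls for completion.

```python
import os
import subprocess
import sys
import time


def run_pipeline(stages, workdir, poll_seconds=0.1):
    """Run each command in turn; every stage reads the prior file and writes its own."""
    prev_output = None
    for i, command in enumerate(stages):
        output = os.path.join(workdir, f"stage{i}.out")  # hypothetical naming scheme
        partial = output + ".partial"
        with open(partial, "w") as out:
            stdin = open(prev_output) if prev_output else subprocess.DEVNULL
            try:
                proc = subprocess.Popen(command, stdin=stdin, stdout=out)
                # Wake up every so often and see whether the stage has completed.
                while proc.poll() is None:
                    time.sleep(poll_seconds)
            finally:
                if prev_output:
                    stdin.close()
        if proc.returncode != 0:
            # Earlier stage files stay on disk, so the run can be inspected or resumed.
            raise RuntimeError(f"stage {i} failed with exit code {proc.returncode}")
        os.replace(partial, output)  # publish the file only after the stage succeeds
        prev_output = output
    return prev_output
```

Writing to a temporary name and renaming on success is one way to keep a half-written file from being mistaken for a finished stage. A real scheduler would likely poll far less often and record its progress somewhere durable.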
A final tip is one that I have heard before. Every time you resolve a trouble ticket, you should turn it into an additional automated unit test. You are already investing the time to test your bug fixes. It takes only marginal extra effort to automate that testing. This helps you grow your unit test code base and coverage. Sometimes a little prevention spares you the pain you would get otherwise.
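As a sketch of what turning a ticket into a test might look like, here is a small example using Python's standard unittest module. The function parse_quantity and the ticket it stands for (blank input crashing the parser) are invented for illustration; substitute whatever your real ticket fixed.

```python
import unittest


def parse_quantity(text):
    """Hypothetical function fixed for a ticket: blank input used to raise ValueError."""
    text = text.strip()
    return int(text) if text else 0


class TicketRegressionTests(unittest.TestCase):
    """One test per resolved ticket, named so the origin is easy to trace."""

    def test_ticket_blank_quantity_is_zero(self):
        # The manual check done while fixing the bug, now automated.
        self.assertEqual(parse_quantity("   "), 0)

    def test_ticket_padded_quantity_still_parses(self):
        # Guard against the fix breaking the normal case.
        self.assertEqual(parse_quantity(" 42 "), 42)
```

Saved in a test module, this runs with python -m unittest, so the check you did by hand once now runs on every build.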