There are many principles you can employ to design systems that don't break often. Following them will help you sleep better at night, because the system won't crash and nobody will have to call you for help. I should know about this. I just got a support call this evening. We must not be following the right principles at work.
One thing you can do is manage your events with a queue. There is standard technology for dealing with queues. An example is Microsoft's MSMQ. You can even choose a queue in the cloud if you want to get fancy. Wait. Hold that thought. You do not want the support calls. So choose a mature and stable queuing technology.
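To make the queue idea concrete, here is a minimal sketch of the pattern. It uses Python's in-process queue.Queue purely as a stand-in for a real queuing product such as MSMQ, and the process_events function and its handler argument are names I made up for illustration. The point is only that producers drop events on the queue while a single worker drains them in order, and one bad event does not take the worker down.

```python
import queue
import threading


def process_events(events, handler):
    """Push events onto a queue and let one worker thread drain it in order.

    Hypothetical helper: a real system would use a durable queue product.
    """
    q = queue.Queue()
    results = []

    def worker():
        while True:
            event = q.get()
            if event is None:  # sentinel: no more events are coming
                q.task_done()
                break
            try:
                results.append(handler(event))
            except Exception as exc:  # one bad event must not kill the worker
                results.append(f"failed: {exc}")
            finally:
                q.task_done()

    t = threading.Thread(target=worker)
    t.start()
    for event in events:
        q.put(event)
    q.put(None)
    q.join()  # wait until every queued item has been handled
    t.join()
    return results
```

An in-memory queue like this loses its contents if the process dies, which is exactly why a mature, durable queuing technology is the better choice in production.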
Many business processes are batch oriented. Each stage requires the output from the prior stage. Our back end is a big UNIX system. You would think you could just pipe the processes together. That works in theory. It is a bad practice in reality. You should design a scheduling system that spawns the processes separately. The results from each stage should go to a file. The scheduler can wake up every so often, see if a stage has completed, then start up the next stage. We have our loading software follow this pattern. It is rock solid, even when things go awry.
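The scheduling pattern described above can be sketched roughly as follows. This is not our actual loading software, just a small illustration under some assumptions of mine: the run_stages function, the ".partial" and ".out" file names, and the polling interval are all invented for the example. Each stage runs as its own process, its output goes to a file, and the scheduler wakes up periodically to see whether the stage has finished before starting the next one.

```python
import os
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_stages(stages, workdir, poll_seconds=0.1):
    """Run each (name, argv) stage as its own process, one after another.

    Returns the names of the stages that actually ran in this call.
    """
    workdir = Path(workdir)
    ran = []
    previous_output = None
    for name, argv in stages:
        final = workdir / f"{name}.out"
        if not final.exists():  # a stage that already finished is skipped on a rerun
            partial = workdir / f"{name}.partial"
            source = previous_output if previous_output else os.devnull
            with open(source, "rb") as src, open(partial, "wb") as out:
                proc = subprocess.Popen(argv, stdin=src, stdout=out)
                while proc.poll() is None:  # wake up, check, go back to sleep
                    time.sleep(poll_seconds)
            if proc.returncode != 0:
                raise RuntimeError(f"stage {name} failed with exit code {proc.returncode}")
            partial.replace(final)  # publish the result only after success
            ran.append(name)
        previous_output = final
    return ran


if __name__ == "__main__":
    py = sys.executable
    demo = [
        ("extract", [py, "-c", "print('3\\n1\\n2')"]),
        ("sort", [py, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"]),
    ]
    with tempfile.TemporaryDirectory() as d:
        print(run_stages(demo, d))  # both stages run
        print(run_stages(demo, d))  # nothing left to do on a rerun
```

Because a stage's file only gets its final name after the process exits cleanly, a crash in the middle leaves the earlier results on disk, and a rerun picks up at the stage that failed instead of starting over.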
A final tip is one that I have heard before. Every time you resolve a trouble ticket, you should turn it into an additional automated unit test. You are already investing the time to test your bug fixes. It is only a marginal extra effort to automate that testing. This helps you grow your unit test code base and coverage. Sometimes prevention helps you avoid the pain that you would get otherwise.
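Here is roughly what turning a ticket into a test can look like. Everything in this sketch is hypothetical: parse_quantity stands in for whatever code you fixed, and the blank-field crash is an invented ticket. The idea is that the exact input from the ticket becomes a permanent test case.

```python
import unittest


def parse_quantity(text):
    """Hypothetical fixed function: blank input used to raise ValueError."""
    text = text.strip()
    return int(text) if text else 0


class TicketRegressionTests(unittest.TestCase):
    def test_blank_quantity_counts_as_zero(self):
        # Invented ticket: a blank quantity field crashed the nightly load.
        self.assertEqual(parse_quantity("   "), 0)

    def test_normal_quantity_still_parses(self):
        # Guard against the fix breaking the ordinary case.
        self.assertEqual(parse_quantity(" 42 "), 42)
```

Running the file with python -m unittest picks these up along with the rest of your suite, so the same bug cannot quietly come back.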