Question

I currently have the task of baby-sitting a business-critical batch ETL system: every morning I query its database for inconsistencies and rerun any jobs that have failed.

I am not the original designer of the system or of the database, but have made several changes to both, and am reasonably comfortable with the codebase and the data model.

The system has a reputation for being unreliable. Improvements have been made, but we are still a long way from hitting the targets we need to hit. This has led to trust issues for both the system and the team involved. Hence, the system now has a baby-sitter who checks every morning that it behaved itself overnight, while we work on hardening it up.

With this in mind, I have come up against a dilemma that my team is split on. Whenever the system shows inconsistencies or failed runs, how rapidly and how severely should we sound the alarm?

Here are the factors that have gone into the discussion thus far:

  • If we do not know what is causing a symptom, but the symptom looks severe, do we hold off on sounding the alarm, or sound it at a lower severity until we have confirmed the cause and know it really is that severe?
  • Say we know the cause of a severe symptom. We have a strategy to mitigate it, but the mitigation may fail one or more times before it succeeds, and will in any case take time to carry out. Do we hold off on sounding the severe alarm until we have completed the mitigation steps and still see the data going sideways?
  • Say we know the cause of a symptom, but the cause is out of our control, and we will not be able to mitigate the symptom until a third party fixes the cause. Does this deserve the same severity of alarm as it would if the data loss were the fault of our own code?
  • If we sound false alarms too often, we risk losing even more trust, because we then look as though we do not understand the failure modes of our own system. Is it worth delaying alarms, or tempering their severity, to avoid losing trust in our ability to know when and how our system is breaking? Is following up on a high-severity alarm to downgrade it enough mitigation, or has the damage already been done the moment the severe alarm goes out? (A rough sketch of the kind of tiered policy we keep arguing about follows this list.)
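
To make the discussion concrete, here is a rough sketch of the kind of tiered escalation policy half the team is arguing for. It illustrates only one position in the debate, not something we actually run; the names, severities, and rules are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    INFO = "info"          # symptom noted, cause still under investigation
    WARNING = "warning"    # cause confirmed, but mitigation is underway or the cause is external
    CRITICAL = "critical"  # cause confirmed, it is ours, and the data is still going sideways


@dataclass
class Symptom:
    description: str
    looks_severe: bool
    cause_known: bool = False
    cause_is_ours: bool = False
    mitigation_in_progress: bool = False


def classify(symptom: Symptom) -> Severity:
    """Map a symptom to an alarm severity under an 'escalate only once confirmed' policy."""
    if not symptom.cause_known:
        # Unknown cause: announce that we are investigating, even if it looks bad.
        return Severity.INFO
    if not symptom.cause_is_ours or symptom.mitigation_in_progress:
        # Third-party cause, or a fix already in flight: warn rather than page everyone.
        return Severity.WARNING
    return Severity.CRITICAL


# This morning's scare would have gone out as INFO under this policy.
print(classify(Symptom("row counts far outside the expected range", looks_severe=True)))
```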

The reason I ask is that I sounded what turned out to be a false alarm this morning, because some data initially looked way outside what I expected. Further research showed it was simply a run against unusual input data, so the inconsistency I saw was to be expected. On top of that, I had to go back and rerun a few jobs that had failed, and I had to wait for those to finish before my data consistency checks would mean anything at all.
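
To make that last part concrete: the consistency checks only mean something once every rerun has completed, so the morning check has to gate on job status before it reports anything. Below is a minimal sketch of that gate, assuming a hypothetical batch_jobs table with a status column; it is not our actual schema.

```python
import sqlite3


def all_jobs_finished(conn: sqlite3.Connection, run_date: str) -> bool:
    """True only when no job for the given run date is still pending or running."""
    (outstanding,) = conn.execute(
        "SELECT COUNT(*) FROM batch_jobs"
        " WHERE run_date = ? AND status IN ('pending', 'running')",
        (run_date,),
    ).fetchone()
    return outstanding == 0


def morning_check(conn: sqlite3.Connection, run_date: str) -> None:
    if not all_jobs_finished(conn, run_date):
        # A check against a partially loaded dataset is meaningless; don't alarm on it.
        print(f"{run_date}: jobs still outstanding, holding off on consistency checks.")
        return
    print(f"{run_date}: all jobs done, running consistency checks.")
    # ... the actual consistency queries would go here ...
```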

Another member of my team called me on this. His point was that I should not have sounded the alarm: I should either have said nothing until I was certain, or reported a less severe status until I was certain both that the data was severely compromised and that our code was responsible for it. By sounding the alarm the way I did, I am not helping us restore trust.

So, with this in mind, I would like to ask: when would you sound the alarm, and at what severity, for batch job issues like the ones I describe above, knowing the system has a history of problems? And are we focusing on the wrong thing in trying to rebuild trust by making sure we send out as few alarms as possible?
