How does Solaris SMF determine if something is to be in maintenance or to be restarted?

Question 1

As it turns out, I had two pkills in a row to make sure everything was terminated correctly. The second one, naturally, was exiting something other than 0. Changing this to include an exit 0 at the end of the script solved the problem.

Question 2

Manpage

Solaris uses a combination of startd/critical_failure_count and startd/critical_failure_period as described in the svc.startd manpage:

startd/critical_failure_count

startd/critical_failure_period

The critical_failure_count and critical_failure_period properties together specify the maximum number of service failures allowed in a given time interval before svc.startd transitions the service to maintenance. If the number of failures exceeds critical_failure_count in any period of critical_failure_period seconds, svc.startd will transition the service to maintenance.

Defaults in the source code

The defaults can be found in the source, the value depends on whether the service is "wait style":

if (instance_is_wait_style(inst))
    critical_failure_period = RINST_WT_SVC_FAILURE_RATE_NS;
else
    critical_failure_period = RINST_FAILURE_RATE_NS;

The defaults are either 5 failures/10 minutes or 5 failures/second:

#define RINST_START_TIMES   5       /* failures to consider */
#define RINST_FAILURE_RATE_NS   600000000000LL  /* 1 failure/10 minutes */
#define RINST_WT_SVC_FAILURE_RATE_NS    NANOSEC /* 1 failure/second */

These variables can be set in the SMF as properties:

<service_bundle type="manifest" name="npm2es">
  <service name="site/npm2es" type="service" version="1">
    ...
    <property_group name="startd" type="framework">
      <propval name='critical_failure_count' type='integer' value='10'/>
      <propval name='critical_failure_period' type='integer' value='30'/>
      <propval name="ignore_error" type="astring" value="core,signal" />
    </property_group>
    ...
  </service>
</service_bundle>

TL;DR

After checking against the startd values, If the service is "wait style", it will be throttled to a max restart of 1/sec, until it no longer exits with a non-cfg error. If the service is not "wait style" it will be put into maintenance mode.

Question 3

Presuming a normal service manifest, I would suspect that you're dropping into maintenance because SMF is restarting you "too quickly" (which is a bit arbitrarily defined). svcs -xv should tell you if that is the case. If it is, SMF is restarting you, and then you're exiting again rapidly and it's decided to give up until the problem is fixed (and you've manually svcadm clear'd it.

I'd wondered if exiting 0 (and indicating success) may cause further confusion, but it doesn't appear that it will.

I don't think Oracle Solaris allows you to tune what SMF considers "too quickly".

Question 4

You have to create a service manifest. This is more complicated than not. This has example manifests and documents the manifest structure.

http://www.oracle.com/technetwork/server-storage/solaris/solaris-smf-manifest-wp-167902.pdf