Priority Inversion Avoidance for Enterprise Real-Time Linux Systems
Priority inversion is a common and well documented situation known to any real-time system or application developer. The mechanisms in place to avoid it are similarly well known: priority inheritance and priority ceiling. For the uninitiated, I'll briefly review these topics.
Priority inversion occurs when a high priority task is prevented from executing because a low priority task holds a lock it requires. Unbounded priority inversion occurs when that low priority task is preempted by a medium priority task, preventing it from completing its critical section and allowing the high priority task to continue (since there is no way to know how long the medium priority task will run for). POSIX pthreads provide two mechanisms for bounding priority inversion (not preventing it): priority ceiling and priority inheritance mutexes.
Priority ceiling mutexes (PTHREAD_PRIO_PROTECT protocol) have an internal priority set which is used to boost the priority of every thread that takes the lock to this priority "ceiling". Every time one such lock is taken and released, the task taking the lock incurs two priority changes. As many locks can be taken uncontested, this can lead to unnecessary overhead, including context switching. Since low priority tasks always run at the priority ceiling while holding the lock, the potential exists to starve out medium priority tasks in certain scenarios. On the up side though, because all threads holding the lock run at the same priority, nested lock chains are prevented from developing, thereby reducing the worst-case priority inversion as seen by the highest priority threads. Many sources (including both articles referenced below) tout the priority ceiling protocol as avoiding deadlock by preventing these nested lock chains. From what I can tell, this only applies to single processor systems, and is only true if only priority ceiling locks are used in the same chain, and if they all have the same priority ceiling. This plus for priority inheritance can be considered moot for Enterprise Real-Time Systems in my opinion.
Priority inheritance mutexes (PTHREAD_PRIO_INHERIT protocol) will boost the priority of a lower priority task to match that of the highest priority task blocked on it. If a lock is uncontested, no priority boosting is necessary to take the lock (in fact the lock can be taken from userspace avoiding the overhead of system-calls). For this reason, the priority inheritance protocol has good average case performance. On the flip-side, because each new contender for the lock can cause an additional priority boost of the thread holding the lock, the worst-case response time is the sum of all the lower priority threads' critical sections and the additional context switches. Unlike the priority ceiling protocol however, the medium priority threads won't be starved out if the low and high priority threads line up just right.
With the emergence of Enterprise Real-Time Systems (those using SMP systems, full Linux environments, and real-time middle-ware and run-time environments such as IBM's WebSphere Real Time) the complexity of real-time applications has greatly increased. In turn, these complex systems require flexible mechanisms for things like priority inversion avoidance. The priority ceiling protocol caps the priority of threads accessing these locks by returning an error if a thread with a higher priority than the mutex's ceiling attempts to take the lock. Static analysis is considered required in order to ensure robustness of a system employing the priority ceiling protocol. Because the priority inheritance protocol bases the priority boosting on the priorities of the threads contending for the lock, it is a much more flexible protocol.
And finally, while best-practices suggest restricting the use of shared resources protected by these protocols to real-time threads (those of scheduling policy SCHED_FIFO or SCHED_RR), these complex systems will at times include a SCHED_OTHER thread in the list of lock contenders. The SUSv3 is not specific on the case of a SCHED_OTHER thread trying to take a mutex of either PTHREAD_PRIO_PROTECT or PTHREAD_PRIO_INHERIT policy, while the Solaris man page explicitly state the behavior is unspecified.
In current Linux implementations, the priority ceiling protocol is handled via glibc, which is reliant on the sched_setscheduler() system call to boost the priority of threads. Because it is illegal to attempt to set the priority of a SCHED_OTHER thread to anything other than 0, glibc would have to first change the policy of the tread to either SCHED_FIFO or SCHED_RR - and therein lies the grounds for "unspecified" behavior: glibc should not make an arbitrary decision for the user. Glibc returns EINVAL to the caller.
In the case of priority inheritance however, the priority boosting is handled within the kernel, which has the ability to boost a SCHED_OTHER thread's effective priority without changing its scheduling policy. The current Linux kernel (2.6.24.4) explicitly supports the use of priority inheritance by SCHED_OTHER threads.
Since static analysis of very complex applications is impractical and the flexibility to use SCHED_OTHER threads in combination with SCHED_FIFO and SCHED_RR threads is important for Enterprise Real-Time Systems, priority inheritance becomes the best choice for these systems.
References
- http://docs.sun.com/app/docs/doc/817-5435/6mkt2tqb3?l=en&a=view
- http://www.linuxdevices.com/articles/AT5698775833.html
- http://www.embedded.com/columns/technicalinsights/20600062?_requestid=478026
- http://www-306.ibm.com/software/webservers/realtime/
- http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutexattr_getprotocol.html
