From: Dennis Groenen Date: Sun, 10 Apr 2011 21:00:42 +0000 (+0200) Subject: bfs 363 -> 400, fix BFS documentation X-Git-Url: http://vcs.maemo.org/git/?p=kernel-bfs;a=commitdiff_plain;h=cf93716d483ed109f0689d791b107495d2e16b61 bfs 363 -> 400, fix BFS documentation --- diff --git a/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch b/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch new file mode 100644 index 0000000..e7d458b --- /dev/null +++ b/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch @@ -0,0 +1,1425 @@ +diff -urpN linux-2.6.28.orig/Documentation/scheduler/sched-BFS.txt linux-2.6.28/Documentation/scheduler/sched-BFS.txt +--- linux-2.6.28.orig/Documentation/scheduler/sched-BFS.txt 2011-04-10 21:50:27.152902332 +0200 ++++ linux-2.6.28/Documentation/scheduler/sched-BFS.txt 2011-04-10 21:32:08.616315645 +0200 +@@ -1,359 +1,347 @@ +-*************** +-*** 0 **** +---- 1,356 ---- +-+ BFS - The Brain Fuck Scheduler by Con Kolivas. +-+ +-+ Goals. +-+ +-+ The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to +-+ completely do away with the complex designs of the past for the cpu process +-+ scheduler and instead implement one that is very simple in basic design. +-+ The main focus of BFS is to achieve excellent desktop interactivity and +-+ responsiveness without heuristics and tuning knobs that are difficult to +-+ understand, impossible to model and predict the effect of, and when tuned to +-+ one workload cause massive detriment to another. +-+ +-+ +-+ Design summary. +-+ +-+ BFS is best described as a single runqueue, O(n) lookup, earliest effective +-+ virtual deadline first design, loosely based on EEVDF (earliest eligible virtual +-+ deadline first) and my previous Staircase Deadline scheduler. Each component +-+ shall be described in order to understand the significance of, and reasoning for +-+ it. The codebase when the first stable version was released was approximately +-+ 9000 lines less code than the existing mainline linux kernel scheduler (in +-+ 2.6.31). This does not even take into account the removal of documentation and +-+ the cgroups code that is not used. +-+ +-+ Design reasoning. +-+ +-+ The single runqueue refers to the queued but not running processes for the +-+ entire system, regardless of the number of CPUs. The reason for going back to +-+ a single runqueue design is that once multiple runqueues are introduced, +-+ per-CPU or otherwise, there will be complex interactions as each runqueue will +-+ be responsible for the scheduling latency and fairness of the tasks only on its +-+ own runqueue, and to achieve fairness and low latency across multiple CPUs, any +-+ advantage in throughput of having CPU local tasks causes other disadvantages. +-+ This is due to requiring a very complex balancing system to at best achieve some +-+ semblance of fairness across CPUs and can only maintain relatively low latency +-+ for tasks bound to the same CPUs, not across them. To increase said fairness +-+ and latency across CPUs, the advantage of local runqueue locking, which makes +-+ for better scalability, is lost due to having to grab multiple locks. +-+ +-+ A significant feature of BFS is that all accounting is done purely based on CPU +-+ used and nowhere is sleep time used in any way to determine entitlement or +-+ interactivity. Interactivity "estimators" that use some kind of sleep/run +-+ algorithm are doomed to fail to detect all interactive tasks, and to falsely tag +-+ tasks that aren't interactive as being so. 
The reason for this is that it is +-+ close to impossible to determine that when a task is sleeping, whether it is +-+ doing it voluntarily, as in a userspace application waiting for input in the +-+ form of a mouse click or otherwise, or involuntarily, because it is waiting for +-+ another thread, process, I/O, kernel activity or whatever. Thus, such an +-+ estimator will introduce corner cases, and more heuristics will be required to +-+ cope with those corner cases, introducing more corner cases and failed +-+ interactivity detection and so on. Interactivity in BFS is built into the design +-+ by virtue of the fact that tasks that are waking up have not used up their quota +-+ of CPU time, and have earlier effective deadlines, thereby making it very likely +-+ they will preempt any CPU bound task of equivalent nice level. See below for +-+ more information on the virtual deadline mechanism. Even if they do not preempt +-+ a running task, because the rr interval is guaranteed to have a bound upper +-+ limit on how long a task will wait for, it will be scheduled within a timeframe +-+ that will not cause visible interface jitter. +-+ +-+ +-+ Design details. +-+ +-+ Task insertion. +-+ +-+ BFS inserts tasks into each relevant queue as an O(1) insertion into a double +-+ linked list. On insertion, *every* running queue is checked to see if the newly +-+ queued task can run on any idle queue, or preempt the lowest running task on the +-+ system. This is how the cross-CPU scheduling of BFS achieves significantly lower +-+ latency per extra CPU the system has. In this case the lookup is, in the worst +-+ case scenario, O(n) where n is the number of CPUs on the system. +-+ +-+ Data protection. +-+ +-+ BFS has one single lock protecting the process local data of every task in the +-+ global queue. Thus every insertion, removal and modification of task data in the +-+ global runqueue needs to grab the global lock. However, once a task is taken by +-+ a CPU, the CPU has its own local data copy of the running process' accounting +-+ information which only that CPU accesses and modifies (such as during a +-+ timer tick) thus allowing the accounting data to be updated lockless. Once a +-+ CPU has taken a task to run, it removes it from the global queue. Thus the +-+ global queue only ever has, at most, +-+ +-+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1 +-+ +-+ tasks in the global queue. This value is relevant for the time taken to look up +-+ tasks during scheduling. This will increase if many tasks with CPU affinity set +-+ in their policy to limit which CPUs they're allowed to run on if they outnumber +-+ the number of CPUs. The +1 is because when rescheduling a task, the CPU's +-+ currently running task is put back on the queue. Lookup will be described after +-+ the virtual deadline mechanism is explained. +-+ +-+ Virtual deadline. +-+ +-+ The key to achieving low latency, scheduling fairness, and "nice level" +-+ distribution in BFS is entirely in the virtual deadline mechanism. The one +-+ tunable in BFS is the rr_interval, or "round robin interval". This is the +-+ maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) +-+ tasks of the same nice level will be running for, or looking at it the other +-+ way around, the longest duration two tasks of the same nice level will be +-+ delayed for. When a task requests cpu time, it is given a quota (time_slice) +-+ equal to the rr_interval and a virtual deadline. 
The virtual deadline is +-+ offset from the current time in jiffies by this equation: +-+ +-+ jiffies + (prio_ratio * rr_interval) +-+ +-+ The prio_ratio is determined as a ratio compared to the baseline of nice -20 +-+ and increases by 10% per nice level. The deadline is a virtual one only in that +-+ no guarantee is placed that a task will actually be scheduled by this time, but +-+ it is used to compare which task should go next. There are three components to +-+ how a task is next chosen. First is time_slice expiration. If a task runs out +-+ of its time_slice, it is descheduled, the time_slice is refilled, and the +-+ deadline reset to that formula above. Second is sleep, where a task no longer +-+ is requesting CPU for whatever reason. The time_slice and deadline are _not_ +-+ adjusted in this case and are just carried over for when the task is next +-+ scheduled. Third is preemption, and that is when a newly waking task is deemed +-+ higher priority than a currently running task on any cpu by virtue of the fact +-+ that it has an earlier virtual deadline than the currently running task. The +-+ earlier deadline is the key to which task is next chosen for the first and +-+ second cases. Once a task is descheduled, it is put back on the queue, and an +-+ O(n) lookup of all queued-but-not-running tasks is done to determine which has +-+ the earliest deadline and that task is chosen to receive CPU next. The one +-+ caveat to this is that if a deadline has already passed (jiffies is greater +-+ than the deadline), the tasks are chosen in FIFO (first in first out) order as +-+ the deadlines are old and their absolute value becomes decreasingly relevant +-+ apart from being a flag that they have been asleep and deserve CPU time ahead +-+ of all later deadlines. +-+ +-+ The CPU proportion of different nice tasks works out to be approximately the +-+ +-+ (prio_ratio difference)^2 +-+ +-+ The reason it is squared is that a task's deadline does not change while it is +-+ running unless it runs out of time_slice. Thus, even if the time actually +-+ passes the deadline of another task that is queued, it will not get CPU time +-+ unless the current running task deschedules, and the time "base" (jiffies) is +-+ constantly moving. +-+ +-+ Task lookup. +-+ +-+ BFS has 103 priority queues. 100 of these are dedicated to the static priority +-+ of realtime tasks, and the remaining 3 are, in order of best to worst priority, +-+ SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority +-+ scheduling). When a task of these priorities is queued, a bitmap of running +-+ priorities is set showing which of these priorities has tasks waiting for CPU +-+ time. When a CPU is made to reschedule, the lookup for the next task to get +-+ CPU time is performed in the following way: +-+ +-+ First the bitmap is checked to see what static priority tasks are queued. If +-+ any realtime priorities are found, the corresponding queue is checked and the +-+ first task listed there is taken (provided CPU affinity is suitable) and lookup +-+ is complete. If the priority corresponds to a SCHED_ISO task, they are also +-+ taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds +-+ to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this +-+ stage, every task in the runlist that corresponds to that priority is checked +-+ to see which has the earliest set deadline, and (provided it has suitable CPU +-+ affinity) it is taken off the runqueue and given the CPU. 
If a task has an +-+ expired deadline, it is taken and the rest of the lookup aborted (as they are +-+ chosen in FIFO order). +-+ +-+ Thus, the lookup is O(n) in the worst case only, where n is as described +-+ earlier, as tasks may be chosen before the whole task list is looked over. +-+ +-+ +-+ Scalability. +-+ +-+ The major limitations of BFS will be that of scalability, as the separate +-+ runqueue designs will have less lock contention as the number of CPUs rises. +-+ However they do not scale linearly even with separate runqueues as multiple +-+ runqueues will need to be locked concurrently on such designs to be able to +-+ achieve fair CPU balancing, to try and achieve some sort of nice-level fairness +-+ across CPUs, and to achieve low enough latency for tasks on a busy CPU when +-+ other CPUs would be more suited. BFS has the advantage that it requires no +-+ balancing algorithm whatsoever, as balancing occurs by proxy simply because +-+ all CPUs draw off the global runqueue, in priority and deadline order. Despite +-+ the fact that scalability is _not_ the prime concern of BFS, it both shows very +-+ good scalability to smaller numbers of CPUs and is likely a more scalable design +-+ at these numbers of CPUs. +-+ +-+ It also has some very low overhead scalability features built into the design +-+ when it has been deemed their overhead is so marginal that they're worth adding. +-+ The first is the local copy of the running process' data to the CPU it's running +-+ on to allow that data to be updated lockless where possible. Then there is +-+ deference paid to the last CPU a task was running on, by trying that CPU first +-+ when looking for an idle CPU to use the next time it's scheduled. Finally there +-+ is the notion of cache locality beyond the last running CPU. The sched_domains +-+ information is used to determine the relative virtual "cache distance" that +-+ other CPUs have from the last CPU a task was running on. CPUs with shared +-+ caches, such as SMT siblings, or multicore CPUs with shared caches, are treated +-+ as cache local. CPUs without shared caches are treated as not cache local, and +-+ CPUs on different NUMA nodes are treated as very distant. This "relative cache +-+ distance" is used by modifying the virtual deadline value when doing lookups. +-+ Effectively, the deadline is unaltered between "cache local" CPUs, doubled for +-+ "cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning +-+ behind the doubling of deadlines is as follows. The real cost of migrating a +-+ task from one CPU to another is entirely dependant on the cache footprint of +-+ the task, how cache intensive the task is, how long it's been running on that +-+ CPU to take up the bulk of its cache, how big the CPU cache is, how fast and +-+ how layered the CPU cache is, how fast a context switch is... and so on. In +-+ other words, it's close to random in the real world where we do more than just +-+ one sole workload. The only thing we can be sure of is that it's not free. So +-+ BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs +-+ is more important than cache locality, and cache locality only plays a part +-+ after that. Doubling the effective deadline is based on the premise that the +-+ "cache local" CPUs will tend to work on the same tasks up to double the number +-+ of cache local CPUs, and once the workload is beyond that amount, it is likely +-+ that none of the tasks are cache warm anywhere anyway. 
The quadrupling for NUMA +-+ is a value I pulled out of my arse. +-+ +-+ When choosing an idle CPU for a waking task, the cache locality is determined +-+ according to where the task last ran and then idle CPUs are ranked from best +-+ to worst to choose the most suitable idle CPU based on cache locality, NUMA +-+ node locality and hyperthread sibling business. They are chosen in the +-+ following preference (if idle): +-+ +-+ * Same core, idle or busy cache, idle threads +-+ * Other core, same cache, idle or busy cache, idle threads. +-+ * Same node, other CPU, idle cache, idle threads. +-+ * Same node, other CPU, busy cache, idle threads. +-+ * Same core, busy threads. +-+ * Other core, same cache, busy threads. +-+ * Same node, other CPU, busy threads. +-+ * Other node, other CPU, idle cache, idle threads. +-+ * Other node, other CPU, busy cache, idle threads. +-+ * Other node, other CPU, busy threads. +-+ +-+ This shows the SMT or "hyperthread" awareness in the design as well which will +-+ choose a real idle core first before a logical SMT sibling which already has +-+ tasks on the physical CPU. +-+ +-+ Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. +-+ However this benchmarking was performed on an earlier design that was far less +-+ scalable than the current one so it's hard to know how scalable it is in terms +-+ of both CPUs (due to the global runqueue) and heavily loaded machines (due to +-+ O(n) lookup) at this stage. Note that in terms of scalability, the number of +-+ _logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) +-+ quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark +-+ results are very promising indeed, without needing to tweak any knobs, features +-+ or options. Benchmark contributions are most welcome. +-+ +-+ +-+ Features +-+ +-+ As the initial prime target audience for BFS was the average desktop user, it +-+ was designed to not need tweaking, tuning or have features set to obtain benefit +-+ from it. Thus the number of knobs and features has been kept to an absolute +-+ minimum and should not require extra user input for the vast majority of cases. +-+ There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval +-+ and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition +-+ to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is +-+ support for CGROUPS. The average user should neither need to know what these +-+ are, nor should they need to be using them to have good desktop behaviour. +-+ +-+ rr_interval +-+ +-+ There is only one "scheduler" tunable, the round robin interval. This can be +-+ accessed in +-+ +-+ /proc/sys/kernel/rr_interval +-+ +-+ The value is in milliseconds, and the default value is set to 6 on a +-+ uniprocessor machine, and automatically set to a progressively higher value on +-+ multiprocessor machines. The reasoning behind increasing the value on more CPUs +-+ is that the effective latency is decreased by virtue of there being more CPUs on +-+ BFS (for reasons explained above), and increasing the value allows for less +-+ cache contention and more throughput. Valid values are from 1 to 5000 +-+ Decreasing the value will decrease latencies at the cost of decreasing +-+ throughput, while increasing it will improve throughput, but at the cost of +-+ worsening latencies. The accuracy of the rr interval is limited by HZ resolution +-+ of the kernel configuration. 
Thus, the worst case latencies are usually slightly +-+ higher than this actual value. The default value of 6 is not an arbitrary one. +-+ It is based on the fact that humans can detect jitter at approximately 7ms, so +-+ aiming for much lower latencies is pointless under most circumstances. It is +-+ worth noting this fact when comparing the latency performance of BFS to other +-+ schedulers. Worst case latencies being higher than 7ms are far worse than +-+ average latencies not being in the microsecond range. +-+ +-+ Isochronous scheduling. +-+ +-+ Isochronous scheduling is a unique scheduling policy designed to provide +-+ near-real-time performance to unprivileged (ie non-root) users without the +-+ ability to starve the machine indefinitely. Isochronous tasks (which means +-+ "same time") are set using, for example, the schedtool application like so: +-+ +-+ schedtool -I -e amarok +-+ +-+ This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works +-+ is that it has a priority level between true realtime tasks and SCHED_NORMAL +-+ which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, +-+ if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval +-+ rate). However if ISO tasks run for more than a tunable finite amount of time, +-+ they are then demoted back to SCHED_NORMAL scheduling. This finite amount of +-+ time is the percentage of _total CPU_ available across the machine, configurable +-+ as a percentage in the following "resource handling" tunable (as opposed to a +-+ scheduler tunable): +-+ +-+ /proc/sys/kernel/iso_cpu +-+ +-+ and is set to 70% by default. It is calculated over a rolling 5 second average +-+ Because it is the total CPU available, it means that on a multi CPU machine, it +-+ is possible to have an ISO task running as realtime scheduling indefinitely on +-+ just one CPU, as the other CPUs will be available. Setting this to 100 is the +-+ equivalent of giving all users SCHED_RR access and setting it to 0 removes the +-+ ability to run any pseudo-realtime tasks. +-+ +-+ A feature of BFS is that it detects when an application tries to obtain a +-+ realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the +-+ appropriate privileges to use those policies. When it detects this, it will +-+ give the task SCHED_ISO policy instead. Thus it is transparent to the user. +-+ Because some applications constantly set their policy as well as their nice +-+ level, there is potential for them to undo the override specified by the user +-+ on the command line of setting the policy to SCHED_ISO. To counter this, once +-+ a task has been set to SCHED_ISO policy, it needs superuser privileges to set +-+ it back to SCHED_NORMAL. This will ensure the task remains ISO and all child +-+ processes and threads will also inherit the ISO policy. +-+ +-+ Idleprio scheduling. +-+ +-+ Idleprio scheduling is a scheduling policy designed to give out CPU to a task +-+ _only_ when the CPU would be otherwise idle. The idea behind this is to allow +-+ ultra low priority tasks to be run in the background that have virtually no +-+ effect on the foreground tasks. This is ideally suited to distributed computing +-+ clients (like setiathome, folding, mprime etc) but can also be used to start +-+ a video encode or so on without any slowdown of other tasks. 
To avoid this +-+ policy from grabbing shared resources and holding them indefinitely, if it +-+ detects a state where the task is waiting on I/O, the machine is about to +-+ suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As +-+ per the Isochronous task management, once a task has been scheduled as IDLEPRIO, +-+ it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can +-+ be set to start as SCHED_IDLEPRIO with the schedtool command like so: +-+ +-+ schedtool -D -e ./mprime +-+ +-+ Subtick accounting. +-+ +-+ It is surprisingly difficult to get accurate CPU accounting, and in many cases, +-+ the accounting is done by simply determining what is happening at the precise +-+ moment a timer tick fires off. This becomes increasingly inaccurate as the +-+ timer tick frequency (HZ) is lowered. It is possible to create an application +-+ which uses almost 100% CPU, yet by being descheduled at the right time, records +-+ zero CPU usage. While the main problem with this is that there are possible +-+ security implications, it is also difficult to determine how much CPU a task +-+ really does use. BFS tries to use the sub-tick accounting from the TSC clock, +-+ where possible, to determine real CPU usage. This is not entirely reliable, but +-+ is far more likely to produce accurate CPU usage data than the existing designs +-+ and will not show tasks as consuming no CPU usage when they actually are. Thus, +-+ the amount of CPU reported as being used by BFS will more accurately represent +-+ how much CPU the task itself is using (as is shown for example by the 'time' +-+ application), so the reported values may be quite different to other schedulers. +-+ Values reported as the 'load' are more prone to problems with this design, but +-+ per process values are closer to real usage. When comparing throughput of BFS +-+ to other designs, it is important to compare the actual completed work in terms +-+ of total wall clock time taken and total work done, rather than the reported +-+ "cpu usage". +-+ +-+ +-+ Con Kolivas Thu Dec 3 2009 ++BFS - The Brain Fuck Scheduler by Con Kolivas. ++ ++Goals. ++ ++The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to ++completely do away with the complex designs of the past for the cpu process ++scheduler and instead implement one that is very simple in basic design. ++The main focus of BFS is to achieve excellent desktop interactivity and ++responsiveness without heuristics and tuning knobs that are difficult to ++understand, impossible to model and predict the effect of, and when tuned to ++one workload cause massive detriment to another. ++ ++ ++Design summary. ++ ++BFS is best described as a single runqueue, O(n) lookup, earliest effective ++virtual deadline first design, loosely based on EEVDF (earliest eligible virtual ++deadline first) and my previous Staircase Deadline scheduler. Each component ++shall be described in order to understand the significance of, and reasoning for ++it. The codebase when the first stable version was released was approximately ++9000 lines less code than the existing mainline linux kernel scheduler (in ++2.6.31). This does not even take into account the removal of documentation and ++the cgroups code that is not used. ++ ++Design reasoning. ++ ++The single runqueue refers to the queued but not running processes for the ++entire system, regardless of the number of CPUs. 
The reason for going back to ++a single runqueue design is that once multiple runqueues are introduced, ++per-CPU or otherwise, there will be complex interactions as each runqueue will ++be responsible for the scheduling latency and fairness of the tasks only on its ++own runqueue, and to achieve fairness and low latency across multiple CPUs, any ++advantage in throughput of having CPU local tasks causes other disadvantages. ++This is due to requiring a very complex balancing system to at best achieve some ++semblance of fairness across CPUs and can only maintain relatively low latency ++for tasks bound to the same CPUs, not across them. To increase said fairness ++and latency across CPUs, the advantage of local runqueue locking, which makes ++for better scalability, is lost due to having to grab multiple locks. ++ ++A significant feature of BFS is that all accounting is done purely based on CPU ++used and nowhere is sleep time used in any way to determine entitlement or ++interactivity. Interactivity "estimators" that use some kind of sleep/run ++algorithm are doomed to fail to detect all interactive tasks, and to falsely tag ++tasks that aren't interactive as being so. The reason for this is that it is ++close to impossible to determine that when a task is sleeping, whether it is ++doing it voluntarily, as in a userspace application waiting for input in the ++form of a mouse click or otherwise, or involuntarily, because it is waiting for ++another thread, process, I/O, kernel activity or whatever. Thus, such an ++estimator will introduce corner cases, and more heuristics will be required to ++cope with those corner cases, introducing more corner cases and failed ++interactivity detection and so on. Interactivity in BFS is built into the design ++by virtue of the fact that tasks that are waking up have not used up their quota ++of CPU time, and have earlier effective deadlines, thereby making it very likely ++they will preempt any CPU bound task of equivalent nice level. See below for ++more information on the virtual deadline mechanism. Even if they do not preempt ++a running task, because the rr interval is guaranteed to have a bound upper ++limit on how long a task will wait for, it will be scheduled within a timeframe ++that will not cause visible interface jitter. ++ ++ ++Design details. ++ ++Task insertion. ++ ++BFS inserts tasks into each relevant queue as an O(1) insertion into a double ++linked list. On insertion, *every* running queue is checked to see if the newly ++queued task can run on any idle queue, or preempt the lowest running task on the ++system. This is how the cross-CPU scheduling of BFS achieves significantly lower ++latency per extra CPU the system has. In this case the lookup is, in the worst ++case scenario, O(n) where n is the number of CPUs on the system. ++ ++Data protection. ++ ++BFS has one single lock protecting the process local data of every task in the ++global queue. Thus every insertion, removal and modification of task data in the ++global runqueue needs to grab the global lock. However, once a task is taken by ++a CPU, the CPU has its own local data copy of the running process' accounting ++information which only that CPU accesses and modifies (such as during a ++timer tick) thus allowing the accounting data to be updated lockless. Once a ++CPU has taken a task to run, it removes it from the global queue. 
Thus the ++global queue only ever has, at most, ++ ++ (number of tasks requesting cpu time) - (number of logical CPUs) + 1 ++ ++tasks in the global queue. This value is relevant for the time taken to look up ++tasks during scheduling. This will increase if many tasks with CPU affinity set ++in their policy to limit which CPUs they're allowed to run on if they outnumber ++the number of CPUs. The +1 is because when rescheduling a task, the CPU's ++currently running task is put back on the queue. Lookup will be described after ++the virtual deadline mechanism is explained. ++ ++Virtual deadline. ++ ++The key to achieving low latency, scheduling fairness, and "nice level" ++distribution in BFS is entirely in the virtual deadline mechanism. The one ++tunable in BFS is the rr_interval, or "round robin interval". This is the ++maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) ++tasks of the same nice level will be running for, or looking at it the other ++way around, the longest duration two tasks of the same nice level will be ++delayed for. When a task requests cpu time, it is given a quota (time_slice) ++equal to the rr_interval and a virtual deadline. The virtual deadline is ++offset from the current time in jiffies by this equation: ++ ++ jiffies + (prio_ratio * rr_interval) ++ ++The prio_ratio is determined as a ratio compared to the baseline of nice -20 ++and increases by 10% per nice level. The deadline is a virtual one only in that ++no guarantee is placed that a task will actually be scheduled by this time, but ++it is used to compare which task should go next. There are three components to ++how a task is next chosen. First is time_slice expiration. If a task runs out ++of its time_slice, it is descheduled, the time_slice is refilled, and the ++deadline reset to that formula above. Second is sleep, where a task no longer ++is requesting CPU for whatever reason. The time_slice and deadline are _not_ ++adjusted in this case and are just carried over for when the task is next ++scheduled. Third is preemption, and that is when a newly waking task is deemed ++higher priority than a currently running task on any cpu by virtue of the fact ++that it has an earlier virtual deadline than the currently running task. The ++earlier deadline is the key to which task is next chosen for the first and ++second cases. Once a task is descheduled, it is put back on the queue, and an ++O(n) lookup of all queued-but-not-running tasks is done to determine which has ++the earliest deadline and that task is chosen to receive CPU next. ++ ++The CPU proportion of different nice tasks works out to be approximately the ++ ++ (prio_ratio difference)^2 ++ ++The reason it is squared is that a task's deadline does not change while it is ++running unless it runs out of time_slice. Thus, even if the time actually ++passes the deadline of another task that is queued, it will not get CPU time ++unless the current running task deschedules, and the time "base" (jiffies) is ++constantly moving. ++ ++Task lookup. ++ ++BFS has 103 priority queues. 100 of these are dedicated to the static priority ++of realtime tasks, and the remaining 3 are, in order of best to worst priority, ++SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority ++scheduling). When a task of these priorities is queued, a bitmap of running ++priorities is set showing which of these priorities has tasks waiting for CPU ++time. 
When a CPU is made to reschedule, the lookup for the next task to get ++CPU time is performed in the following way: ++ ++First the bitmap is checked to see what static priority tasks are queued. If ++any realtime priorities are found, the corresponding queue is checked and the ++first task listed there is taken (provided CPU affinity is suitable) and lookup ++is complete. If the priority corresponds to a SCHED_ISO task, they are also ++taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds ++to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this ++stage, every task in the runlist that corresponds to that priority is checked ++to see which has the earliest set deadline, and (provided it has suitable CPU ++affinity) it is taken off the runqueue and given the CPU. If a task has an ++expired deadline, it is taken and the rest of the lookup aborted (as they are ++chosen in FIFO order). ++ ++Thus, the lookup is O(n) in the worst case only, where n is as described ++earlier, as tasks may be chosen before the whole task list is looked over. ++ ++ ++Scalability. ++ ++The major limitations of BFS will be that of scalability, as the separate ++runqueue designs will have less lock contention as the number of CPUs rises. ++However they do not scale linearly even with separate runqueues as multiple ++runqueues will need to be locked concurrently on such designs to be able to ++achieve fair CPU balancing, to try and achieve some sort of nice-level fairness ++across CPUs, and to achieve low enough latency for tasks on a busy CPU when ++other CPUs would be more suited. BFS has the advantage that it requires no ++balancing algorithm whatsoever, as balancing occurs by proxy simply because ++all CPUs draw off the global runqueue, in priority and deadline order. Despite ++the fact that scalability is _not_ the prime concern of BFS, it both shows very ++good scalability to smaller numbers of CPUs and is likely a more scalable design ++at these numbers of CPUs. ++ ++It also has some very low overhead scalability features built into the design ++when it has been deemed their overhead is so marginal that they're worth adding. ++The first is the local copy of the running process' data to the CPU it's running ++on to allow that data to be updated lockless where possible. Then there is ++deference paid to the last CPU a task was running on, by trying that CPU first ++when looking for an idle CPU to use the next time it's scheduled. Finally there ++is the notion of "sticky" tasks that are flagged when they are involuntarily ++descheduled, meaning they still want further CPU time. This sticky flag is ++used to bias heavily against those tasks being scheduled on a different CPU ++unless that CPU would be otherwise idle. When a cpu frequency governor is used ++that scales with CPU load, such as ondemand, sticky tasks are not scheduled ++on a different CPU at all, preferring instead to go idle. This means the CPU ++they were bound to is more likely to increase its speed while the other CPU ++will go idle, thus speeding up total task execution time and likely decreasing ++power usage. This is the only scenario where BFS will allow a CPU to go idle ++in preference to scheduling a task on the earliest available spare CPU. 
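The behaviour described above can be condensed into a short sketch of how the sticky flag and the per-CPU scaling state bias the deadline comparison during lookup. This is illustrative only, adapted from the earliest_deadline_task() and swap_sticky() hunks later in this patch; the surrounding loop, locking and realtime handling are omitted, and the wrapper name effective_deadline() is hypothetical:

    /* Illustrative sketch, not the kernel code: the deadline a CPU other
     * than the task's last one would use when considering a queued task. */
    static u64 effective_deadline(struct task_struct *p, struct rq *rq)
    {
            if (task_rq(p) != rq && task_sticky(p)) {
                    if (scaling_rq(rq))
                            /* The real lookup skips the task entirely here;
                             * ~0ULL stands in for "never pick it on this CPU". */
                            return ~0ULL;
                    /* Otherwise, heavy bias against migrating a sticky task. */
                    return p->deadline + longest_deadline_diff();
            }
            /* Local or non-sticky task: use its real virtual deadline. */
            return p->deadline;
    }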
++ ++The real cost of migrating a task from one CPU to another is entirely dependant ++on the cache footprint of the task, how cache intensive the task is, how long ++it's been running on that CPU to take up the bulk of its cache, how big the CPU ++cache is, how fast and how layered the CPU cache is, how fast a context switch ++is... and so on. In other words, it's close to random in the real world where we ++do more than just one sole workload. The only thing we can be sure of is that ++it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and ++utilising idle CPUs is more important than cache locality, and cache locality ++only plays a part after that. ++ ++When choosing an idle CPU for a waking task, the cache locality is determined ++according to where the task last ran and then idle CPUs are ranked from best ++to worst to choose the most suitable idle CPU based on cache locality, NUMA ++node locality and hyperthread sibling business. They are chosen in the ++following preference (if idle): ++ ++* Same core, idle or busy cache, idle threads ++* Other core, same cache, idle or busy cache, idle threads. ++* Same node, other CPU, idle cache, idle threads. ++* Same node, other CPU, busy cache, idle threads. ++* Same core, busy threads. ++* Other core, same cache, busy threads. ++* Same node, other CPU, busy threads. ++* Other node, other CPU, idle cache, idle threads. ++* Other node, other CPU, busy cache, idle threads. ++* Other node, other CPU, busy threads. ++ ++This shows the SMT or "hyperthread" awareness in the design as well which will ++choose a real idle core first before a logical SMT sibling which already has ++tasks on the physical CPU. ++ ++Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. ++However this benchmarking was performed on an earlier design that was far less ++scalable than the current one so it's hard to know how scalable it is in terms ++of both CPUs (due to the global runqueue) and heavily loaded machines (due to ++O(n) lookup) at this stage. Note that in terms of scalability, the number of ++_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) ++quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark ++results are very promising indeed, without needing to tweak any knobs, features ++or options. Benchmark contributions are most welcome. ++ ++ ++Features ++ ++As the initial prime target audience for BFS was the average desktop user, it ++was designed to not need tweaking, tuning or have features set to obtain benefit ++from it. Thus the number of knobs and features has been kept to an absolute ++minimum and should not require extra user input for the vast majority of cases. ++There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval ++and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition ++to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is ++support for CGROUPS. The average user should neither need to know what these ++are, nor should they need to be using them to have good desktop behaviour. ++ ++rr_interval ++ ++There is only one "scheduler" tunable, the round robin interval. This can be ++accessed in ++ ++ /proc/sys/kernel/rr_interval ++ ++The value is in milliseconds, and the default value is set to 6ms. Valid values ++are from 1 to 1000. 
Decreasing the value will decrease latencies at the cost of ++decreasing throughput, while increasing it will improve throughput, but at the ++cost of worsening latencies. The accuracy of the rr interval is limited by HZ ++resolution of the kernel configuration. Thus, the worst case latencies are ++usually slightly higher than this actual value. BFS uses "dithering" to try and ++minimise the effect the Hz limitation has. The default value of 6 is not an ++arbitrary one. It is based on the fact that humans can detect jitter at ++approximately 7ms, so aiming for much lower latencies is pointless under most ++circumstances. It is worth noting this fact when comparing the latency ++performance of BFS to other schedulers. Worst case latencies being higher than ++7ms are far worse than average latencies not being in the microsecond range. ++Experimentation has shown that rr intervals being increased up to 300 can ++improve throughput but beyond that, scheduling noise from elsewhere prevents ++further demonstrable throughput. ++ ++Isochronous scheduling. ++ ++Isochronous scheduling is a unique scheduling policy designed to provide ++near-real-time performance to unprivileged (ie non-root) users without the ++ability to starve the machine indefinitely. Isochronous tasks (which means ++"same time") are set using, for example, the schedtool application like so: ++ ++ schedtool -I -e amarok ++ ++This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works ++is that it has a priority level between true realtime tasks and SCHED_NORMAL ++which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, ++if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval ++rate). However if ISO tasks run for more than a tunable finite amount of time, ++they are then demoted back to SCHED_NORMAL scheduling. This finite amount of ++time is the percentage of _total CPU_ available across the machine, configurable ++as a percentage in the following "resource handling" tunable (as opposed to a ++scheduler tunable): ++ ++ /proc/sys/kernel/iso_cpu ++ ++and is set to 70% by default. It is calculated over a rolling 5 second average ++Because it is the total CPU available, it means that on a multi CPU machine, it ++is possible to have an ISO task running as realtime scheduling indefinitely on ++just one CPU, as the other CPUs will be available. Setting this to 100 is the ++equivalent of giving all users SCHED_RR access and setting it to 0 removes the ++ability to run any pseudo-realtime tasks. ++ ++A feature of BFS is that it detects when an application tries to obtain a ++realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the ++appropriate privileges to use those policies. When it detects this, it will ++give the task SCHED_ISO policy instead. Thus it is transparent to the user. ++Because some applications constantly set their policy as well as their nice ++level, there is potential for them to undo the override specified by the user ++on the command line of setting the policy to SCHED_ISO. To counter this, once ++a task has been set to SCHED_ISO policy, it needs superuser privileges to set ++it back to SCHED_NORMAL. This will ensure the task remains ISO and all child ++processes and threads will also inherit the ISO policy. ++ ++Idleprio scheduling. ++ ++Idleprio scheduling is a scheduling policy designed to give out CPU to a task ++_only_ when the CPU would be otherwise idle. 
The idea behind this is to allow ++ultra low priority tasks to be run in the background that have virtually no ++effect on the foreground tasks. This is ideally suited to distributed computing ++clients (like setiathome, folding, mprime etc) but can also be used to start ++a video encode or so on without any slowdown of other tasks. To avoid this ++policy from grabbing shared resources and holding them indefinitely, if it ++detects a state where the task is waiting on I/O, the machine is about to ++suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As ++per the Isochronous task management, once a task has been scheduled as IDLEPRIO, ++it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can ++be set to start as SCHED_IDLEPRIO with the schedtool command like so: ++ ++ schedtool -D -e ./mprime ++ ++Subtick accounting. ++ ++It is surprisingly difficult to get accurate CPU accounting, and in many cases, ++the accounting is done by simply determining what is happening at the precise ++moment a timer tick fires off. This becomes increasingly inaccurate as the ++timer tick frequency (HZ) is lowered. It is possible to create an application ++which uses almost 100% CPU, yet by being descheduled at the right time, records ++zero CPU usage. While the main problem with this is that there are possible ++security implications, it is also difficult to determine how much CPU a task ++really does use. BFS tries to use the sub-tick accounting from the TSC clock, ++where possible, to determine real CPU usage. This is not entirely reliable, but ++is far more likely to produce accurate CPU usage data than the existing designs ++and will not show tasks as consuming no CPU usage when they actually are. Thus, ++the amount of CPU reported as being used by BFS will more accurately represent ++how much CPU the task itself is using (as is shown for example by the 'time' ++application), so the reported values may be quite different to other schedulers. ++Values reported as the 'load' are more prone to problems with this design, but ++per process values are closer to real usage. When comparing throughput of BFS ++to other designs, it is important to compare the actual completed work in terms ++of total wall clock time taken and total work done, rather than the reported ++"cpu usage". ++ ++ ++Con Kolivas Tue, 5 Apr 2011 +diff -urpN linux-2.6.28.orig/drivers/cpufreq/cpufreq.c linux-2.6.28/drivers/cpufreq/cpufreq.c +--- linux-2.6.28.orig/drivers/cpufreq/cpufreq.c 2011-04-10 21:50:27.082902337 +0200 ++++ linux-2.6.28/drivers/cpufreq/cpufreq.c 2011-04-10 21:29:38.566326570 +0200 +@@ -28,6 +28,7 @@ + #include + #include + #include ++#include + + #define dprintk(msg...) 
cpufreq_debug_printk(CPUFREQ_DEBUG_CORE, \ + "cpufreq-core", msg) +@@ -1460,6 +1461,12 @@ int __cpufreq_driver_target(struct cpufr + target_freq, relation); + if (cpu_online(policy->cpu) && cpufreq_driver->target) + retval = cpufreq_driver->target(policy, target_freq, relation); ++ if (likely(retval != -EINVAL)) { ++ if (target_freq == policy->max) ++ cpu_nonscaling(policy->cpu); ++ else ++ cpu_scaling(policy->cpu); ++ } + + return retval; + } +diff -urpN linux-2.6.28.orig/include/linux/sched.h linux-2.6.28/include/linux/sched.h +--- linux-2.6.28.orig/include/linux/sched.h 2011-04-10 21:50:27.659568962 +0200 ++++ linux-2.6.28/include/linux/sched.h 2011-04-10 21:45:23.102924469 +0200 +@@ -1126,7 +1126,9 @@ struct task_struct { + struct list_head run_list; + u64 last_ran; + u64 sched_time; /* sched_clock time spent running */ +- ++#ifdef CONFIG_SMP ++ int sticky; /* Soft affined flag */ ++#endif + unsigned long rt_timeout; + #else /* CONFIG_SCHED_BFS */ + const struct sched_class *sched_class; +@@ -1409,6 +1411,8 @@ struct task_struct { + #ifdef CONFIG_SCHED_BFS + extern int grunqueue_is_locked(void); + extern void grq_unlock_wait(void); ++extern void cpu_scaling(int cpu); ++extern void cpu_nonscaling(int cpu); + #define tsk_seruntime(t) ((t)->sched_time) + #define tsk_rttimeout(t) ((t)->rt_timeout) + #define task_rq_unlock_wait(tsk) grq_unlock_wait() +@@ -1426,16 +1430,23 @@ static inline void tsk_cpus_current(stru + + static inline void print_scheduler_version(void) + { +- printk(KERN_INFO"BFS CPU scheduler v0.363 by Con Kolivas.\n"); ++ printk(KERN_INFO"BFS CPU scheduler v0.400 by Con Kolivas.\n"); + } + + static inline int iso_task(struct task_struct *p) + { + return (p->policy == SCHED_ISO); + } +-#else ++#else /* CFS */ + extern int runqueue_is_locked(void); + extern void task_rq_unlock_wait(struct task_struct *p); ++static inline void cpu_scaling(int cpu) ++{ ++} ++ ++static inline void cpu_nonscaling(int cpu) ++{ ++} + #define tsk_seruntime(t) ((t)->se.sum_exec_runtime) + #define tsk_rttimeout(t) ((t)->rt.timeout) + +diff -urpN linux-2.6.28.orig/kernel/sched_bfs.c linux-2.6.28/kernel/sched_bfs.c +--- linux-2.6.28.orig/kernel/sched_bfs.c 2011-04-10 21:50:27.659568962 +0200 ++++ linux-2.6.28/kernel/sched_bfs.c 2011-04-10 21:43:43.506265052 +0200 +@@ -81,7 +81,7 @@ + #define idleprio_task(p) unlikely((p)->policy == SCHED_IDLEPRIO) + #define iso_task(p) unlikely((p)->policy == SCHED_ISO) + #define iso_queue(rq) unlikely((rq)->rq_policy == SCHED_ISO) +-#define ISO_PERIOD ((5 * HZ * num_online_cpus()) + 1) ++#define ISO_PERIOD ((5 * HZ * grq.noc) + 1) + + /* + * Convert user-nice values [ -20 ... 0 ... 
19 ] +@@ -184,6 +184,7 @@ struct global_rq { + cpumask_t cpu_idle_map; + int idle_cpus; + #endif ++ int noc; /* num_online_cpus stored and updated when it changes */ + u64 niffies; /* Nanosecond jiffies */ + unsigned long last_jiffy; /* Last jiffy we updated niffies */ + +@@ -226,6 +227,8 @@ struct rq { + #ifdef CONFIG_SMP + int cpu; /* cpu of this runqueue */ + int online; ++ int scaling; /* This CPU is managed by a scaling CPU freq governor */ ++ struct task_struct *sticky_task; + + struct root_domain *rd; + struct sched_domain *sd; +@@ -242,7 +245,11 @@ struct rq { + #endif + u64 last_niffy; /* Last time this RQ updated grq.niffies */ + #endif ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING ++ u64 prev_irq_time; ++#endif + u64 clock, old_clock, last_tick; ++ u64 clock_task; + int dither; + + #ifdef CONFIG_SCHEDSTATS +@@ -407,9 +414,14 @@ static inline void update_clocks(struct + * when we're not updating niffies. + * Looking up task_rq must be done under grq.lock to be safe. + */ ++static void update_rq_clock_task(struct rq *rq, s64 delta); ++ + static inline void update_rq_clock(struct rq *rq) + { +- rq->clock = sched_clock_cpu(cpu_of(rq)); ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; ++ ++ rq->clock += delta; ++ update_rq_clock_task(rq, delta); + } + + static inline int task_running(struct task_struct *p) +@@ -741,10 +753,8 @@ static void resched_task(struct task_str + + /* + * The best idle CPU is chosen according to the CPUIDLE ranking above where the +- * lowest value would give the most suitable CPU to schedule p onto next. We +- * iterate from the last CPU upwards instead of using for_each_cpu_mask so as +- * to be able to break out immediately if the last CPU is idle. The order works +- * out to be the following: ++ * lowest value would give the most suitable CPU to schedule p onto next. The ++ * order works out to be the following: + * + * Same core, idle or busy cache, idle threads + * Other core, same cache, idle or busy cache, idle threads. +@@ -756,38 +766,19 @@ static void resched_task(struct task_str + * Other node, other CPU, idle cache, idle threads. + * Other node, other CPU, busy cache, idle threads. + * Other node, other CPU, busy threads. +- * +- * If p was the last task running on this rq, then regardless of where +- * it has been running since then, it is cache warm on this rq. + */ +-static void resched_best_idle(struct task_struct *p) ++static void ++resched_best_mask(unsigned long best_cpu, struct rq *rq, cpumask_t *tmpmask) + { +- unsigned long cpu_tmp, best_cpu, best_ranking; +- cpumask_t tmpmask; +- struct rq *rq; +- int iterate; ++ unsigned long cpu_tmp, best_ranking; + +- cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); +- iterate = cpus_weight(tmpmask); +- best_cpu = task_cpu(p); +- /* +- * Start below the last CPU and work up with next_cpu_nr as the last +- * CPU might not be idle or affinity might not allow it. 
+- */ +- cpu_tmp = best_cpu - 1; +- rq = cpu_rq(best_cpu); + best_ranking = ~0UL; + +- do { ++ for_each_cpu_mask(cpu_tmp, *tmpmask) { + unsigned long ranking; + struct rq *tmp_rq; + + ranking = 0; +- cpu_tmp = next_cpu_nr(cpu_tmp, tmpmask); +- if (cpu_tmp >= nr_cpu_ids) { +- cpu_tmp = -1; +- cpu_tmp = next_cpu_nr(cpu_tmp, tmpmask); +- } + tmp_rq = cpu_rq(cpu_tmp); + + #ifdef CONFIG_NUMA +@@ -815,37 +806,42 @@ static void resched_best_idle(struct tas + break; + best_ranking = ranking; + } +- } while (--iterate > 0); ++ } + + resched_task(cpu_rq(best_cpu)->curr); + } + ++static void resched_best_idle(struct task_struct *p) ++{ ++ cpumask_t tmpmask; ++ ++ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); ++ resched_best_mask(task_cpu(p), task_rq(p), &tmpmask); ++} ++ + static inline void resched_suitable_idle(struct task_struct *p) + { + if (suitable_idle_cpus(p)) + resched_best_idle(p); + } +- + /* +- * The cpu cache locality difference between CPUs is used to determine how far +- * to offset the virtual deadline. <2 difference in locality means that one +- * timeslice difference is allowed longer for the cpu local tasks. This is +- * enough in the common case when tasks are up to 2* number of CPUs to keep +- * tasks within their shared cache CPUs only. CPUs on different nodes or not +- * even in this domain (NUMA) have "4" difference, allowing 4 times longer +- * deadlines before being taken onto another cpu, allowing for 2* the double +- * seen by separate CPUs above. +- * Simple summary: Virtual deadlines are equal on shared cache CPUs, double +- * on separate CPUs and quadruple in separate NUMA nodes. ++ * Flags to tell us whether this CPU is running a CPU frequency governor that ++ * has slowed its speed or not. No locking required as the very rare wrongly ++ * read value would be harmless. + */ +-static inline int +-cache_distance(struct rq *task_rq, struct rq *rq, struct task_struct *p) ++void cpu_scaling(int cpu) + { +- int locality = rq->cpu_locality[cpu_of(task_rq)] - 2; ++ cpu_rq(cpu)->scaling = 1; ++} + +- if (locality > 0) +- return task_timeslice(p) << locality; +- return 0; ++void cpu_nonscaling(int cpu) ++{ ++ cpu_rq(cpu)->scaling = 0; ++} ++ ++static inline int scaling_rq(struct rq *rq) ++{ ++ return rq->scaling; + } + #else /* CONFIG_SMP */ + static inline void inc_qnr(void) +@@ -879,12 +875,25 @@ static inline void resched_suitable_idle + { + } + +-static inline int +-cache_distance(struct rq *task_rq, struct rq *rq, struct task_struct *p) ++void cpu_scaling(int __unused) ++{ ++} ++ ++void cpu_nonscaling(int __unused) ++{ ++} ++ ++/* ++ * Although CPUs can scale in UP, there is nowhere else for tasks to go so this ++ * always returns 0. ++ */ ++static inline int scaling_rq(struct rq *rq) + { + return 0; + } + #endif /* CONFIG_SMP */ ++EXPORT_SYMBOL_GPL(cpu_scaling); ++EXPORT_SYMBOL_GPL(cpu_nonscaling); + + /* + * activate_idle_task - move idle task to the _front_ of runqueue. +@@ -974,6 +983,82 @@ void set_task_cpu(struct task_struct *p, + smp_wmb(); + task_thread_info(p)->cpu = cpu; + } ++ ++static inline void clear_sticky(struct task_struct *p) ++{ ++ p->sticky = 0; ++} ++ ++static inline int task_sticky(struct task_struct *p) ++{ ++ return p->sticky; ++} ++ ++/* Reschedule the best idle CPU that is not this one. 
*/ ++static void ++resched_closest_idle(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++ cpumask_t tmpmask; ++ ++ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); ++ cpu_clear(cpu, tmpmask); ++ if (cpus_empty(tmpmask)) ++ return; ++ resched_best_mask(cpu, rq, &tmpmask); ++} ++ ++/* ++ * We set the sticky flag on a task that is descheduled involuntarily meaning ++ * it is awaiting further CPU time. If the last sticky task is still sticky ++ * but unlucky enough to not be the next task scheduled, we unstick it and try ++ * to find it an idle CPU. Realtime tasks do not stick to minimise their ++ * latency at all times. ++ */ ++static inline void ++swap_sticky(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++ if (rq->sticky_task) { ++ if (rq->sticky_task == p) { ++ p->sticky = 1; ++ return; ++ } ++ if (rq->sticky_task->sticky) { ++ rq->sticky_task->sticky = 0; ++ resched_closest_idle(rq, cpu, rq->sticky_task); ++ } ++ } ++ if (!rt_task(p)) { ++ p->sticky = 1; ++ rq->sticky_task = p; ++ } else { ++ resched_closest_idle(rq, cpu, p); ++ rq->sticky_task = NULL; ++ } ++} ++ ++static inline void unstick_task(struct rq *rq, struct task_struct *p) ++{ ++ rq->sticky_task = NULL; ++ clear_sticky(p); ++} ++#else ++static inline void clear_sticky(struct task_struct *p) ++{ ++} ++ ++static inline int task_sticky(struct task_struct *p) ++{ ++ return 0; ++} ++ ++static inline void ++swap_sticky(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++} ++ ++static inline void unstick_task(struct rq *rq, struct task_struct *p) ++{ ++} + #endif + + /* +@@ -984,6 +1069,7 @@ static inline void take_task(struct rq * + { + set_task_cpu(p, cpu_of(rq)); + dequeue_task(p); ++ clear_sticky(p); + dec_qnr(); + } + +@@ -1266,6 +1352,13 @@ static void try_preempt(struct task_stru + int highest_prio; + cpumask_t tmp; + ++ /* ++ * We clear the sticky flag here because for a task to have called ++ * try_preempt with the sticky flag enabled means some complicated ++ * re-scheduling has occurred and we should ignore the sticky flag. ++ */ ++ clear_sticky(p); ++ + if (suitable_idle_cpus(p)) { + resched_best_idle(p); + return; +@@ -1284,7 +1377,6 @@ static void try_preempt(struct task_stru + highest_prio = -1; + + for_each_cpu_mask_nr(cpu, tmp) { +- u64 offset_deadline; + struct rq *rq; + int rq_prio; + +@@ -1293,12 +1385,9 @@ static void try_preempt(struct task_stru + if (rq_prio < highest_prio) + continue; + +- offset_deadline = rq->rq_deadline - +- cache_distance(this_rq, rq, p); +- + if (rq_prio > highest_prio || (rq_prio == highest_prio && +- deadline_after(offset_deadline, latest_deadline))) { +- latest_deadline = offset_deadline; ++ deadline_after(rq->rq_deadline, latest_deadline))) { ++ latest_deadline = rq->rq_deadline; + highest_prio = rq_prio; + highest_prio_rq = rq; + } +@@ -1470,6 +1559,7 @@ void sched_fork(struct task_struct *p, i + #endif + + p->oncpu = 0; ++ clear_sticky(p); + + #ifdef CONFIG_PREEMPT + /* Want to start with kernel preemption disabled. */ +@@ -1809,6 +1899,152 @@ DEFINE_PER_CPU(struct kernel_stat, kstat + + EXPORT_PER_CPU_SYMBOL(kstat); + ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING ++ ++/* ++ * There are no locks covering percpu hardirq/softirq time. ++ * They are only modified in account_system_vtime, on corresponding CPU ++ * with interrupts disabled. So, writes are safe. ++ * They are read and saved off onto struct rq in update_rq_clock(). ++ * This may result in other CPU reading this CPU's irq time and can ++ * race with irq/account_system_vtime on this CPU. 
We would either get old ++ * or new value with a side effect of accounting a slice of irq time to wrong ++ * task when irq is in progress while we read rq->clock. That is a worthy ++ * compromise in place of having locks on each irq in account_system_time. ++ */ ++static DEFINE_PER_CPU(u64, cpu_hardirq_time); ++static DEFINE_PER_CPU(u64, cpu_softirq_time); ++ ++static DEFINE_PER_CPU(u64, irq_start_time); ++static int sched_clock_irqtime; ++ ++void enable_sched_clock_irqtime(void) ++{ ++ sched_clock_irqtime = 1; ++} ++ ++void disable_sched_clock_irqtime(void) ++{ ++ sched_clock_irqtime = 0; ++} ++ ++#ifndef CONFIG_64BIT ++static DEFINE_PER_CPU(seqcount_t, irq_time_seq); ++ ++static inline void irq_time_write_begin(void) ++{ ++ __this_cpu_inc(irq_time_seq.sequence); ++ smp_wmb(); ++} ++ ++static inline void irq_time_write_end(void) ++{ ++ smp_wmb(); ++ __this_cpu_inc(irq_time_seq.sequence); ++} ++ ++static inline u64 irq_time_read(int cpu) ++{ ++ u64 irq_time; ++ unsigned seq; ++ ++ do { ++ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); ++ irq_time = per_cpu(cpu_softirq_time, cpu) + ++ per_cpu(cpu_hardirq_time, cpu); ++ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); ++ ++ return irq_time; ++} ++#else /* CONFIG_64BIT */ ++static inline void irq_time_write_begin(void) ++{ ++} ++ ++static inline void irq_time_write_end(void) ++{ ++} ++ ++static inline u64 irq_time_read(int cpu) ++{ ++ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); ++} ++#endif /* CONFIG_64BIT */ ++ ++/* ++ * Called before incrementing preempt_count on {soft,}irq_enter ++ * and before decrementing preempt_count on {soft,}irq_exit. ++ */ ++void account_system_vtime(struct task_struct *curr) ++{ ++ unsigned long flags; ++ s64 delta; ++ int cpu; ++ ++ if (!sched_clock_irqtime) ++ return; ++ ++ local_irq_save(flags); ++ ++ cpu = smp_processor_id(); ++ delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); ++ __this_cpu_add(irq_start_time, delta); ++ ++ irq_time_write_begin(); ++ /* ++ * We do not account for softirq time from ksoftirqd here. ++ * We want to continue accounting softirq time to ksoftirqd thread ++ * in that case, so as not to confuse scheduler with a special task ++ * that do not consume any time, but still wants to run. ++ */ ++ if (hardirq_count()) ++ __this_cpu_add(cpu_hardirq_time, delta); ++ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) ++ __this_cpu_add(cpu_softirq_time, delta); ++ ++ irq_time_write_end(); ++ local_irq_restore(flags); ++} ++EXPORT_SYMBOL_GPL(account_system_vtime); ++ ++static void update_rq_clock_task(struct rq *rq, s64 delta) ++{ ++ s64 irq_delta; ++ ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; ++ ++ /* ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into ++ * this case when a previous update_rq_clock() happened inside a ++ * {soft,}irq region. ++ * ++ * When this happens, we stop ->clock_task and only update the ++ * prev_irq_time stamp to account for the part that fit, so that a next ++ * update will consume the rest. This ensures ->clock_task is ++ * monotonic. ++ * ++ * It does however cause some slight miss-attribution of {soft,}irq ++ * time, a more accurate solution would be to update the irq_time using ++ * the current rq->clock timestamp, except that would require using ++ * atomic ops. 
++ */ ++ if (irq_delta > delta) ++ irq_delta = delta; ++ ++ rq->prev_irq_time += irq_delta; ++ delta -= irq_delta; ++ rq->clock_task += delta; ++} ++ ++#else /* CONFIG_IRQ_TIME_ACCOUNTING */ ++ ++static void update_rq_clock_task(struct rq *rq, s64 delta) ++{ ++ rq->clock_task += delta; ++} ++ ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ ++ + /* + * On each tick, see what percentage of that tick was attributed to each + * component and add the percentage to the _pc values. Once a _pc value has +@@ -2372,9 +2608,14 @@ static inline u64 static_deadline_diff(i + return prio_deadline_diff(USER_PRIO(static_prio)); + } + ++static inline int longest_deadline_diff(void) ++{ ++ return prio_deadline_diff(39); ++} ++ + static inline int ms_longest_deadline_diff(void) + { +- return NS_TO_MS(prio_deadline_diff(39)); ++ return NS_TO_MS(longest_deadline_diff()); + } + + /* +@@ -2444,7 +2685,19 @@ retry: + goto out_take; + } + +- dl = p->deadline + cache_distance(task_rq(p), rq, p); ++ /* ++ * Soft affinity happens here by not scheduling a task with ++ * its sticky flag set that ran on a different CPU last when ++ * the CPU is scaling, or by greatly biasing against its ++ * deadline when not. ++ */ ++ if (task_rq(p) != rq && task_sticky(p)) { ++ if (scaling_rq(rq)) ++ continue; ++ else ++ dl = p->deadline + longest_deadline_diff(); ++ } else ++ dl = p->deadline; + + /* + * No rt tasks. Find the earliest deadline task. Now we're in +@@ -2601,14 +2854,8 @@ need_resched_nonpreemptible: + */ + grq_unlock_irq(); + goto rerun_prev_unlocked; +- } else { +- /* +- * If prev got kicked off by a task that has to +- * run on this CPU for affinity reasons then +- * there may be an idle CPU it can go to. +- */ +- resched_suitable_idle(prev); +- } ++ } else ++ swap_sticky(rq, cpu, prev); + } + return_task(prev, deactivate); + } +@@ -2623,12 +2870,21 @@ need_resched_nonpreemptible: + set_cpuidle_map(cpu); + } else { + next = earliest_deadline_task(rq, idle); +- prefetch(next); +- prefetch_stack(next); +- clear_cpuidle_map(cpu); ++ if (likely(next->prio != PRIO_LIMIT)) { ++ prefetch(next); ++ prefetch_stack(next); ++ clear_cpuidle_map(cpu); ++ } else ++ set_cpuidle_map(cpu); + } + + if (likely(prev != next)) { ++ /* ++ * Don't stick tasks when a real time task is going to run as ++ * they may literally get stuck. ++ */ ++ if (rt_task(next)) ++ unstick_task(rq, prev); + sched_info_switch(prev, next); + + set_rq_task(rq, next); +@@ -4009,7 +4265,6 @@ void init_idle(struct task_struct *idle, + rcu_read_unlock(); + rq->curr = rq->idle = idle; + idle->oncpu = 1; +- set_cpuidle_map(cpu); + grq_unlock_irqrestore(&flags); + + /* Set the preempt count _outside_ the spinlocks! 
*/ +@@ -4245,6 +4500,7 @@ static void break_sole_affinity(int src_ + task_pid_nr(p), p->comm, src_cpu); + } + } ++ clear_sticky(p); + } while_each_thread(t, p); + } + +@@ -4569,6 +4825,7 @@ migration_call(struct notifier_block *nf + + set_rq_online(rq); + } ++ grq.noc = num_online_cpus(); + grq_unlock_irqrestore(&flags); + break; + +@@ -4601,6 +4858,7 @@ migration_call(struct notifier_block *nf + BUG_ON(!cpu_isset(cpu, rq->rd->span)); + set_rq_offline(rq); + } ++ grq.noc = num_online_cpus(); + grq_unlock_irqrestore(&flags); + break; + #endif +@@ -6005,7 +6263,7 @@ static int cache_cpu_idle(unsigned long + void __init sched_init_smp(void) + { + struct sched_domain *sd; +- int cpu, cpus; ++ int cpu; + + cpumask_t non_isolated_cpus; + +@@ -6035,14 +6293,6 @@ void __init sched_init_smp(void) + if (set_cpus_allowed_ptr(current, &non_isolated_cpus) < 0) + BUG(); + +- /* +- * Assume that every added cpu gives us slightly less overall latency +- * allowing us to increase the base rr_interval, non-linearly and with +- * an upper bound. +- */ +- cpus = num_online_cpus(); +- rr_interval = rr_interval * (4 * cpus + 4) / (cpus + 6); +- + grq_lock_irq(); + /* + * Set up the relative cache distance of each online cpu from each +@@ -6129,6 +6379,7 @@ void __init sched_init(void) + grq.last_jiffy = jiffies; + spin_lock_init(&grq.iso_lock); + grq.iso_ticks = grq.iso_refractory = 0; ++ grq.noc = 1; + #ifdef CONFIG_SMP + init_defrootdomain(); + grq.qnr = grq.idle_cpus = 0; +@@ -6142,6 +6393,7 @@ void __init sched_init(void) + rq->iowait_pc = rq->idle_pc = 0; + rq->dither = 0; + #ifdef CONFIG_SMP ++ rq->sticky_task = NULL; + rq->last_niffy = 0; + rq->sd = NULL; + rq->rd = NULL;
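
Note on the new "sticky" soft affinity: in the earliest_deadline_task() hunk above, a task whose sticky flag is set and which last ran on a different CPU is skipped entirely while the selecting runqueue is flagged as scaling, and otherwise is only picked if its deadline still wins after being pushed back by longest_deadline_diff(). The standalone C program below is a minimal userspace sketch of that decision only, not code from the patch; struct task, effective_deadline() and the value of LONGEST_DEADLINE_DIFF are invented for illustration.

/* Illustrative userspace sketch of the sticky soft-affinity bias; all names
 * and values here are made up and do not come from the kernel patch. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LONGEST_DEADLINE_DIFF 100 /* stand-in for prio_deadline_diff(39) */

struct task {
	uint64_t deadline;  /* virtual deadline */
	int	 last_cpu;  /* CPU the task last ran on */
	bool	 sticky;    /* descheduled involuntarily, prefers last_cpu */
};

/* Effective deadline of @p as seen by @cpu; returns false if @p should be
 * skipped because the selecting CPU is scaling and @p would rather wait
 * for its own CPU. */
static bool effective_deadline(const struct task *p, int cpu,
			       bool cpu_scaling, uint64_t *dl)
{
	if (p->sticky && p->last_cpu != cpu) {
		if (cpu_scaling)
			return false;			  /* don't pull it */
		*dl = p->deadline + LONGEST_DEADLINE_DIFF; /* heavy bias */
		return true;
	}
	*dl = p->deadline;
	return true;
}

int main(void)
{
	struct task local  = { .deadline = 50, .last_cpu = 0, .sticky = true };
	struct task remote = { .deadline = 10, .last_cpu = 1, .sticky = true };
	uint64_t dl;

	/* CPU 0, not scaling: the remote sticky task is still considered,
	 * but its earlier raw deadline no longer beats the local task. */
	if (effective_deadline(&remote, 0, false, &dl))
		printf("remote effective deadline: %llu\n",
		       (unsigned long long)dl);
	if (effective_deadline(&local, 0, false, &dl))
		printf("local effective deadline:  %llu\n",
		       (unsigned long long)dl);
	return 0;
}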
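
Note on the IRQ time accounting hunks: update_rq_clock_task() above subtracts the irq time observed since the last update from the runqueue clock delta, clamping it so ->clock_task stays monotonic, and carries any excess forward through prev_irq_time for the next update to consume. The short program below is a userspace model of that arithmetic only; struct rq_model, update_clock_task() and the sample numbers are invented for illustration and are not the kernel implementation.

/* Illustrative userspace sketch of the clock_task/irq_delta arithmetic;
 * names and numbers are made up, only the clamping logic mirrors the patch. */
#include <stdint.h>
#include <stdio.h>

struct rq_model {
	uint64_t clock_task;	/* task-consumable time, irq time excluded */
	uint64_t prev_irq_time; /* irq time already accounted for */
};

static void update_clock_task(struct rq_model *rq, uint64_t irq_time_now,
			      int64_t delta)
{
	int64_t irq_delta = (int64_t)(irq_time_now - rq->prev_irq_time);

	if (irq_delta > delta)
		irq_delta = delta;	/* keep clock_task monotonic */

	rq->prev_irq_time += irq_delta;	/* leftover is consumed next time */
	rq->clock_task += delta - irq_delta;
}

int main(void)
{
	struct rq_model rq = { 0, 0 };

	/* 1000ns of clock delta, 300ns of which was irq time */
	update_clock_task(&rq, 300, 1000);
	/* next update: another 1000ns, but 1200ns of new irq time was
	 * recorded (an irq exited just before sampling); the excess 200ns
	 * simply carries over to the following update */
	update_clock_task(&rq, 1500, 1000);

	/* prints clock_task=700 prev_irq_time=1300 */
	printf("clock_task=%llu prev_irq_time=%llu\n",
	       (unsigned long long)rq.clock_task,
	       (unsigned long long)rq.prev_irq_time);
	return 0;
}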