From: Dennis Groenen Date: Sun, 10 Apr 2011 21:00:42 +0000 (+0200) Subject: bfs 363 -> 400, fix BFS documentation X-Git-Url: http://vcs.maemo.org/git/?p=kernel-bfs;a=commitdiff_plain;h=cf93716d483ed109f0689d791b107495d2e16b61 bfs 363 -> 400, fix BFS documentation --- diff --git a/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch b/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch new file mode 100644 index 0000000..e7d458b --- /dev/null +++ b/kernel-bfs-2.6.28/debian/patches/bfs-363-to-400.patch @@ -0,0 +1,1425 @@ +diff -urpN linux-2.6.28.orig/Documentation/scheduler/sched-BFS.txt linux-2.6.28/Documentation/scheduler/sched-BFS.txt +--- linux-2.6.28.orig/Documentation/scheduler/sched-BFS.txt 2011-04-10 21:50:27.152902332 +0200 ++++ linux-2.6.28/Documentation/scheduler/sched-BFS.txt 2011-04-10 21:32:08.616315645 +0200 +@@ -1,359 +1,347 @@ +-*************** +-*** 0 **** +---- 1,356 ---- +-+ BFS - The Brain Fuck Scheduler by Con Kolivas. +-+ +-+ Goals. +-+ +-+ The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to +-+ completely do away with the complex designs of the past for the cpu process +-+ scheduler and instead implement one that is very simple in basic design. +-+ The main focus of BFS is to achieve excellent desktop interactivity and +-+ responsiveness without heuristics and tuning knobs that are difficult to +-+ understand, impossible to model and predict the effect of, and when tuned to +-+ one workload cause massive detriment to another. +-+ +-+ +-+ Design summary. +-+ +-+ BFS is best described as a single runqueue, O(n) lookup, earliest effective +-+ virtual deadline first design, loosely based on EEVDF (earliest eligible virtual +-+ deadline first) and my previous Staircase Deadline scheduler. Each component +-+ shall be described in order to understand the significance of, and reasoning for +-+ it. The codebase when the first stable version was released was approximately +-+ 9000 lines less code than the existing mainline linux kernel scheduler (in +-+ 2.6.31). This does not even take into account the removal of documentation and +-+ the cgroups code that is not used. +-+ +-+ Design reasoning. +-+ +-+ The single runqueue refers to the queued but not running processes for the +-+ entire system, regardless of the number of CPUs. The reason for going back to +-+ a single runqueue design is that once multiple runqueues are introduced, +-+ per-CPU or otherwise, there will be complex interactions as each runqueue will +-+ be responsible for the scheduling latency and fairness of the tasks only on its +-+ own runqueue, and to achieve fairness and low latency across multiple CPUs, any +-+ advantage in throughput of having CPU local tasks causes other disadvantages. +-+ This is due to requiring a very complex balancing system to at best achieve some +-+ semblance of fairness across CPUs and can only maintain relatively low latency +-+ for tasks bound to the same CPUs, not across them. To increase said fairness +-+ and latency across CPUs, the advantage of local runqueue locking, which makes +-+ for better scalability, is lost due to having to grab multiple locks. +-+ +-+ A significant feature of BFS is that all accounting is done purely based on CPU +-+ used and nowhere is sleep time used in any way to determine entitlement or +-+ interactivity. Interactivity "estimators" that use some kind of sleep/run +-+ algorithm are doomed to fail to detect all interactive tasks, and to falsely tag +-+ tasks that aren't interactive as being so. 
The reason for this is that it is +-+ close to impossible to determine that when a task is sleeping, whether it is +-+ doing it voluntarily, as in a userspace application waiting for input in the +-+ form of a mouse click or otherwise, or involuntarily, because it is waiting for +-+ another thread, process, I/O, kernel activity or whatever. Thus, such an +-+ estimator will introduce corner cases, and more heuristics will be required to +-+ cope with those corner cases, introducing more corner cases and failed +-+ interactivity detection and so on. Interactivity in BFS is built into the design +-+ by virtue of the fact that tasks that are waking up have not used up their quota +-+ of CPU time, and have earlier effective deadlines, thereby making it very likely +-+ they will preempt any CPU bound task of equivalent nice level. See below for +-+ more information on the virtual deadline mechanism. Even if they do not preempt +-+ a running task, because the rr interval is guaranteed to have a bound upper +-+ limit on how long a task will wait for, it will be scheduled within a timeframe +-+ that will not cause visible interface jitter. +-+ +-+ +-+ Design details. +-+ +-+ Task insertion. +-+ +-+ BFS inserts tasks into each relevant queue as an O(1) insertion into a double +-+ linked list. On insertion, *every* running queue is checked to see if the newly +-+ queued task can run on any idle queue, or preempt the lowest running task on the +-+ system. This is how the cross-CPU scheduling of BFS achieves significantly lower +-+ latency per extra CPU the system has. In this case the lookup is, in the worst +-+ case scenario, O(n) where n is the number of CPUs on the system. +-+ +-+ Data protection. +-+ +-+ BFS has one single lock protecting the process local data of every task in the +-+ global queue. Thus every insertion, removal and modification of task data in the +-+ global runqueue needs to grab the global lock. However, once a task is taken by +-+ a CPU, the CPU has its own local data copy of the running process' accounting +-+ information which only that CPU accesses and modifies (such as during a +-+ timer tick) thus allowing the accounting data to be updated lockless. Once a +-+ CPU has taken a task to run, it removes it from the global queue. Thus the +-+ global queue only ever has, at most, +-+ +-+ (number of tasks requesting cpu time) - (number of logical CPUs) + 1 +-+ +-+ tasks in the global queue. This value is relevant for the time taken to look up +-+ tasks during scheduling. This will increase if many tasks with CPU affinity set +-+ in their policy to limit which CPUs they're allowed to run on if they outnumber +-+ the number of CPUs. The +1 is because when rescheduling a task, the CPU's +-+ currently running task is put back on the queue. Lookup will be described after +-+ the virtual deadline mechanism is explained. +-+ +-+ Virtual deadline. +-+ +-+ The key to achieving low latency, scheduling fairness, and "nice level" +-+ distribution in BFS is entirely in the virtual deadline mechanism. The one +-+ tunable in BFS is the rr_interval, or "round robin interval". This is the +-+ maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) +-+ tasks of the same nice level will be running for, or looking at it the other +-+ way around, the longest duration two tasks of the same nice level will be +-+ delayed for. When a task requests cpu time, it is given a quota (time_slice) +-+ equal to the rr_interval and a virtual deadline. 
The virtual deadline is +-+ offset from the current time in jiffies by this equation: +-+ +-+ jiffies + (prio_ratio * rr_interval) +-+ +-+ The prio_ratio is determined as a ratio compared to the baseline of nice -20 +-+ and increases by 10% per nice level. The deadline is a virtual one only in that +-+ no guarantee is placed that a task will actually be scheduled by this time, but +-+ it is used to compare which task should go next. There are three components to +-+ how a task is next chosen. First is time_slice expiration. If a task runs out +-+ of its time_slice, it is descheduled, the time_slice is refilled, and the +-+ deadline reset to that formula above. Second is sleep, where a task no longer +-+ is requesting CPU for whatever reason. The time_slice and deadline are _not_ +-+ adjusted in this case and are just carried over for when the task is next +-+ scheduled. Third is preemption, and that is when a newly waking task is deemed +-+ higher priority than a currently running task on any cpu by virtue of the fact +-+ that it has an earlier virtual deadline than the currently running task. The +-+ earlier deadline is the key to which task is next chosen for the first and +-+ second cases. Once a task is descheduled, it is put back on the queue, and an +-+ O(n) lookup of all queued-but-not-running tasks is done to determine which has +-+ the earliest deadline and that task is chosen to receive CPU next. The one +-+ caveat to this is that if a deadline has already passed (jiffies is greater +-+ than the deadline), the tasks are chosen in FIFO (first in first out) order as +-+ the deadlines are old and their absolute value becomes decreasingly relevant +-+ apart from being a flag that they have been asleep and deserve CPU time ahead +-+ of all later deadlines. +-+ +-+ The CPU proportion of different nice tasks works out to be approximately the +-+ +-+ (prio_ratio difference)^2 +-+ +-+ The reason it is squared is that a task's deadline does not change while it is +-+ running unless it runs out of time_slice. Thus, even if the time actually +-+ passes the deadline of another task that is queued, it will not get CPU time +-+ unless the current running task deschedules, and the time "base" (jiffies) is +-+ constantly moving. +-+ +-+ Task lookup. +-+ +-+ BFS has 103 priority queues. 100 of these are dedicated to the static priority +-+ of realtime tasks, and the remaining 3 are, in order of best to worst priority, +-+ SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority +-+ scheduling). When a task of these priorities is queued, a bitmap of running +-+ priorities is set showing which of these priorities has tasks waiting for CPU +-+ time. When a CPU is made to reschedule, the lookup for the next task to get +-+ CPU time is performed in the following way: +-+ +-+ First the bitmap is checked to see what static priority tasks are queued. If +-+ any realtime priorities are found, the corresponding queue is checked and the +-+ first task listed there is taken (provided CPU affinity is suitable) and lookup +-+ is complete. If the priority corresponds to a SCHED_ISO task, they are also +-+ taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds +-+ to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this +-+ stage, every task in the runlist that corresponds to that priority is checked +-+ to see which has the earliest set deadline, and (provided it has suitable CPU +-+ affinity) it is taken off the runqueue and given the CPU. 
If a task has an +-+ expired deadline, it is taken and the rest of the lookup aborted (as they are +-+ chosen in FIFO order). +-+ +-+ Thus, the lookup is O(n) in the worst case only, where n is as described +-+ earlier, as tasks may be chosen before the whole task list is looked over. +-+ +-+ +-+ Scalability. +-+ +-+ The major limitations of BFS will be that of scalability, as the separate +-+ runqueue designs will have less lock contention as the number of CPUs rises. +-+ However they do not scale linearly even with separate runqueues as multiple +-+ runqueues will need to be locked concurrently on such designs to be able to +-+ achieve fair CPU balancing, to try and achieve some sort of nice-level fairness +-+ across CPUs, and to achieve low enough latency for tasks on a busy CPU when +-+ other CPUs would be more suited. BFS has the advantage that it requires no +-+ balancing algorithm whatsoever, as balancing occurs by proxy simply because +-+ all CPUs draw off the global runqueue, in priority and deadline order. Despite +-+ the fact that scalability is _not_ the prime concern of BFS, it both shows very +-+ good scalability to smaller numbers of CPUs and is likely a more scalable design +-+ at these numbers of CPUs. +-+ +-+ It also has some very low overhead scalability features built into the design +-+ when it has been deemed their overhead is so marginal that they're worth adding. +-+ The first is the local copy of the running process' data to the CPU it's running +-+ on to allow that data to be updated lockless where possible. Then there is +-+ deference paid to the last CPU a task was running on, by trying that CPU first +-+ when looking for an idle CPU to use the next time it's scheduled. Finally there +-+ is the notion of cache locality beyond the last running CPU. The sched_domains +-+ information is used to determine the relative virtual "cache distance" that +-+ other CPUs have from the last CPU a task was running on. CPUs with shared +-+ caches, such as SMT siblings, or multicore CPUs with shared caches, are treated +-+ as cache local. CPUs without shared caches are treated as not cache local, and +-+ CPUs on different NUMA nodes are treated as very distant. This "relative cache +-+ distance" is used by modifying the virtual deadline value when doing lookups. +-+ Effectively, the deadline is unaltered between "cache local" CPUs, doubled for +-+ "cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning +-+ behind the doubling of deadlines is as follows. The real cost of migrating a +-+ task from one CPU to another is entirely dependant on the cache footprint of +-+ the task, how cache intensive the task is, how long it's been running on that +-+ CPU to take up the bulk of its cache, how big the CPU cache is, how fast and +-+ how layered the CPU cache is, how fast a context switch is... and so on. In +-+ other words, it's close to random in the real world where we do more than just +-+ one sole workload. The only thing we can be sure of is that it's not free. So +-+ BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs +-+ is more important than cache locality, and cache locality only plays a part +-+ after that. Doubling the effective deadline is based on the premise that the +-+ "cache local" CPUs will tend to work on the same tasks up to double the number +-+ of cache local CPUs, and once the workload is beyond that amount, it is likely +-+ that none of the tasks are cache warm anywhere anyway. 
The quadrupling for NUMA +-+ is a value I pulled out of my arse. +-+ +-+ When choosing an idle CPU for a waking task, the cache locality is determined +-+ according to where the task last ran and then idle CPUs are ranked from best +-+ to worst to choose the most suitable idle CPU based on cache locality, NUMA +-+ node locality and hyperthread sibling business. They are chosen in the +-+ following preference (if idle): +-+ +-+ * Same core, idle or busy cache, idle threads +-+ * Other core, same cache, idle or busy cache, idle threads. +-+ * Same node, other CPU, idle cache, idle threads. +-+ * Same node, other CPU, busy cache, idle threads. +-+ * Same core, busy threads. +-+ * Other core, same cache, busy threads. +-+ * Same node, other CPU, busy threads. +-+ * Other node, other CPU, idle cache, idle threads. +-+ * Other node, other CPU, busy cache, idle threads. +-+ * Other node, other CPU, busy threads. +-+ +-+ This shows the SMT or "hyperthread" awareness in the design as well which will +-+ choose a real idle core first before a logical SMT sibling which already has +-+ tasks on the physical CPU. +-+ +-+ Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. +-+ However this benchmarking was performed on an earlier design that was far less +-+ scalable than the current one so it's hard to know how scalable it is in terms +-+ of both CPUs (due to the global runqueue) and heavily loaded machines (due to +-+ O(n) lookup) at this stage. Note that in terms of scalability, the number of +-+ _logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) +-+ quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark +-+ results are very promising indeed, without needing to tweak any knobs, features +-+ or options. Benchmark contributions are most welcome. +-+ +-+ +-+ Features +-+ +-+ As the initial prime target audience for BFS was the average desktop user, it +-+ was designed to not need tweaking, tuning or have features set to obtain benefit +-+ from it. Thus the number of knobs and features has been kept to an absolute +-+ minimum and should not require extra user input for the vast majority of cases. +-+ There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval +-+ and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition +-+ to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is +-+ support for CGROUPS. The average user should neither need to know what these +-+ are, nor should they need to be using them to have good desktop behaviour. +-+ +-+ rr_interval +-+ +-+ There is only one "scheduler" tunable, the round robin interval. This can be +-+ accessed in +-+ +-+ /proc/sys/kernel/rr_interval +-+ +-+ The value is in milliseconds, and the default value is set to 6 on a +-+ uniprocessor machine, and automatically set to a progressively higher value on +-+ multiprocessor machines. The reasoning behind increasing the value on more CPUs +-+ is that the effective latency is decreased by virtue of there being more CPUs on +-+ BFS (for reasons explained above), and increasing the value allows for less +-+ cache contention and more throughput. Valid values are from 1 to 5000 +-+ Decreasing the value will decrease latencies at the cost of decreasing +-+ throughput, while increasing it will improve throughput, but at the cost of +-+ worsening latencies. The accuracy of the rr interval is limited by HZ resolution +-+ of the kernel configuration. 
Thus, the worst case latencies are usually slightly +-+ higher than this actual value. The default value of 6 is not an arbitrary one. +-+ It is based on the fact that humans can detect jitter at approximately 7ms, so +-+ aiming for much lower latencies is pointless under most circumstances. It is +-+ worth noting this fact when comparing the latency performance of BFS to other +-+ schedulers. Worst case latencies being higher than 7ms are far worse than +-+ average latencies not being in the microsecond range. +-+ +-+ Isochronous scheduling. +-+ +-+ Isochronous scheduling is a unique scheduling policy designed to provide +-+ near-real-time performance to unprivileged (ie non-root) users without the +-+ ability to starve the machine indefinitely. Isochronous tasks (which means +-+ "same time") are set using, for example, the schedtool application like so: +-+ +-+ schedtool -I -e amarok +-+ +-+ This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works +-+ is that it has a priority level between true realtime tasks and SCHED_NORMAL +-+ which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, +-+ if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval +-+ rate). However if ISO tasks run for more than a tunable finite amount of time, +-+ they are then demoted back to SCHED_NORMAL scheduling. This finite amount of +-+ time is the percentage of _total CPU_ available across the machine, configurable +-+ as a percentage in the following "resource handling" tunable (as opposed to a +-+ scheduler tunable): +-+ +-+ /proc/sys/kernel/iso_cpu +-+ +-+ and is set to 70% by default. It is calculated over a rolling 5 second average +-+ Because it is the total CPU available, it means that on a multi CPU machine, it +-+ is possible to have an ISO task running as realtime scheduling indefinitely on +-+ just one CPU, as the other CPUs will be available. Setting this to 100 is the +-+ equivalent of giving all users SCHED_RR access and setting it to 0 removes the +-+ ability to run any pseudo-realtime tasks. +-+ +-+ A feature of BFS is that it detects when an application tries to obtain a +-+ realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the +-+ appropriate privileges to use those policies. When it detects this, it will +-+ give the task SCHED_ISO policy instead. Thus it is transparent to the user. +-+ Because some applications constantly set their policy as well as their nice +-+ level, there is potential for them to undo the override specified by the user +-+ on the command line of setting the policy to SCHED_ISO. To counter this, once +-+ a task has been set to SCHED_ISO policy, it needs superuser privileges to set +-+ it back to SCHED_NORMAL. This will ensure the task remains ISO and all child +-+ processes and threads will also inherit the ISO policy. +-+ +-+ Idleprio scheduling. +-+ +-+ Idleprio scheduling is a scheduling policy designed to give out CPU to a task +-+ _only_ when the CPU would be otherwise idle. The idea behind this is to allow +-+ ultra low priority tasks to be run in the background that have virtually no +-+ effect on the foreground tasks. This is ideally suited to distributed computing +-+ clients (like setiathome, folding, mprime etc) but can also be used to start +-+ a video encode or so on without any slowdown of other tasks. 
To avoid this +-+ policy from grabbing shared resources and holding them indefinitely, if it +-+ detects a state where the task is waiting on I/O, the machine is about to +-+ suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As +-+ per the Isochronous task management, once a task has been scheduled as IDLEPRIO, +-+ it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can +-+ be set to start as SCHED_IDLEPRIO with the schedtool command like so: +-+ +-+ schedtool -D -e ./mprime +-+ +-+ Subtick accounting. +-+ +-+ It is surprisingly difficult to get accurate CPU accounting, and in many cases, +-+ the accounting is done by simply determining what is happening at the precise +-+ moment a timer tick fires off. This becomes increasingly inaccurate as the +-+ timer tick frequency (HZ) is lowered. It is possible to create an application +-+ which uses almost 100% CPU, yet by being descheduled at the right time, records +-+ zero CPU usage. While the main problem with this is that there are possible +-+ security implications, it is also difficult to determine how much CPU a task +-+ really does use. BFS tries to use the sub-tick accounting from the TSC clock, +-+ where possible, to determine real CPU usage. This is not entirely reliable, but +-+ is far more likely to produce accurate CPU usage data than the existing designs +-+ and will not show tasks as consuming no CPU usage when they actually are. Thus, +-+ the amount of CPU reported as being used by BFS will more accurately represent +-+ how much CPU the task itself is using (as is shown for example by the 'time' +-+ application), so the reported values may be quite different to other schedulers. +-+ Values reported as the 'load' are more prone to problems with this design, but +-+ per process values are closer to real usage. When comparing throughput of BFS +-+ to other designs, it is important to compare the actual completed work in terms +-+ of total wall clock time taken and total work done, rather than the reported +-+ "cpu usage". +-+ +-+ +-+ Con Kolivas Thu Dec 3 2009 ++BFS - The Brain Fuck Scheduler by Con Kolivas. ++ ++Goals. ++ ++The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to ++completely do away with the complex designs of the past for the cpu process ++scheduler and instead implement one that is very simple in basic design. ++The main focus of BFS is to achieve excellent desktop interactivity and ++responsiveness without heuristics and tuning knobs that are difficult to ++understand, impossible to model and predict the effect of, and when tuned to ++one workload cause massive detriment to another. ++ ++ ++Design summary. ++ ++BFS is best described as a single runqueue, O(n) lookup, earliest effective ++virtual deadline first design, loosely based on EEVDF (earliest eligible virtual ++deadline first) and my previous Staircase Deadline scheduler. Each component ++shall be described in order to understand the significance of, and reasoning for ++it. The codebase when the first stable version was released was approximately ++9000 lines less code than the existing mainline linux kernel scheduler (in ++2.6.31). This does not even take into account the removal of documentation and ++the cgroups code that is not used. ++ ++Design reasoning. ++ ++The single runqueue refers to the queued but not running processes for the ++entire system, regardless of the number of CPUs. 
The reason for going back to ++a single runqueue design is that once multiple runqueues are introduced, ++per-CPU or otherwise, there will be complex interactions as each runqueue will ++be responsible for the scheduling latency and fairness of the tasks only on its ++own runqueue, and to achieve fairness and low latency across multiple CPUs, any ++advantage in throughput of having CPU local tasks causes other disadvantages. ++This is due to requiring a very complex balancing system to at best achieve some ++semblance of fairness across CPUs and can only maintain relatively low latency ++for tasks bound to the same CPUs, not across them. To increase said fairness ++and latency across CPUs, the advantage of local runqueue locking, which makes ++for better scalability, is lost due to having to grab multiple locks. ++ ++A significant feature of BFS is that all accounting is done purely based on CPU ++used and nowhere is sleep time used in any way to determine entitlement or ++interactivity. Interactivity "estimators" that use some kind of sleep/run ++algorithm are doomed to fail to detect all interactive tasks, and to falsely tag ++tasks that aren't interactive as being so. The reason for this is that it is ++close to impossible to determine that when a task is sleeping, whether it is ++doing it voluntarily, as in a userspace application waiting for input in the ++form of a mouse click or otherwise, or involuntarily, because it is waiting for ++another thread, process, I/O, kernel activity or whatever. Thus, such an ++estimator will introduce corner cases, and more heuristics will be required to ++cope with those corner cases, introducing more corner cases and failed ++interactivity detection and so on. Interactivity in BFS is built into the design ++by virtue of the fact that tasks that are waking up have not used up their quota ++of CPU time, and have earlier effective deadlines, thereby making it very likely ++they will preempt any CPU bound task of equivalent nice level. See below for ++more information on the virtual deadline mechanism. Even if they do not preempt ++a running task, because the rr interval is guaranteed to have a bound upper ++limit on how long a task will wait for, it will be scheduled within a timeframe ++that will not cause visible interface jitter. ++ ++ ++Design details. ++ ++Task insertion. ++ ++BFS inserts tasks into each relevant queue as an O(1) insertion into a double ++linked list. On insertion, *every* running queue is checked to see if the newly ++queued task can run on any idle queue, or preempt the lowest running task on the ++system. This is how the cross-CPU scheduling of BFS achieves significantly lower ++latency per extra CPU the system has. In this case the lookup is, in the worst ++case scenario, O(n) where n is the number of CPUs on the system. ++ ++Data protection. ++ ++BFS has one single lock protecting the process local data of every task in the ++global queue. Thus every insertion, removal and modification of task data in the ++global runqueue needs to grab the global lock. However, once a task is taken by ++a CPU, the CPU has its own local data copy of the running process' accounting ++information which only that CPU accesses and modifies (such as during a ++timer tick) thus allowing the accounting data to be updated lockless. Once a ++CPU has taken a task to run, it removes it from the global queue. 
Thus the ++global queue only ever has, at most, ++ ++ (number of tasks requesting cpu time) - (number of logical CPUs) + 1 ++ ++tasks in the global queue. This value is relevant for the time taken to look up ++tasks during scheduling. This will increase if many tasks with CPU affinity set ++in their policy to limit which CPUs they're allowed to run on if they outnumber ++the number of CPUs. The +1 is because when rescheduling a task, the CPU's ++currently running task is put back on the queue. Lookup will be described after ++the virtual deadline mechanism is explained. ++ ++Virtual deadline. ++ ++The key to achieving low latency, scheduling fairness, and "nice level" ++distribution in BFS is entirely in the virtual deadline mechanism. The one ++tunable in BFS is the rr_interval, or "round robin interval". This is the ++maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) ++tasks of the same nice level will be running for, or looking at it the other ++way around, the longest duration two tasks of the same nice level will be ++delayed for. When a task requests cpu time, it is given a quota (time_slice) ++equal to the rr_interval and a virtual deadline. The virtual deadline is ++offset from the current time in jiffies by this equation: ++ ++ jiffies + (prio_ratio * rr_interval) ++ ++The prio_ratio is determined as a ratio compared to the baseline of nice -20 ++and increases by 10% per nice level. The deadline is a virtual one only in that ++no guarantee is placed that a task will actually be scheduled by this time, but ++it is used to compare which task should go next. There are three components to ++how a task is next chosen. First is time_slice expiration. If a task runs out ++of its time_slice, it is descheduled, the time_slice is refilled, and the ++deadline reset to that formula above. Second is sleep, where a task no longer ++is requesting CPU for whatever reason. The time_slice and deadline are _not_ ++adjusted in this case and are just carried over for when the task is next ++scheduled. Third is preemption, and that is when a newly waking task is deemed ++higher priority than a currently running task on any cpu by virtue of the fact ++that it has an earlier virtual deadline than the currently running task. The ++earlier deadline is the key to which task is next chosen for the first and ++second cases. Once a task is descheduled, it is put back on the queue, and an ++O(n) lookup of all queued-but-not-running tasks is done to determine which has ++the earliest deadline and that task is chosen to receive CPU next. ++ ++The CPU proportion of different nice tasks works out to be approximately the ++ ++ (prio_ratio difference)^2 ++ ++The reason it is squared is that a task's deadline does not change while it is ++running unless it runs out of time_slice. Thus, even if the time actually ++passes the deadline of another task that is queued, it will not get CPU time ++unless the current running task deschedules, and the time "base" (jiffies) is ++constantly moving. ++ ++Task lookup. ++ ++BFS has 103 priority queues. 100 of these are dedicated to the static priority ++of realtime tasks, and the remaining 3 are, in order of best to worst priority, ++SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority ++scheduling). When a task of these priorities is queued, a bitmap of running ++priorities is set showing which of these priorities has tasks waiting for CPU ++time. 
When a CPU is made to reschedule, the lookup for the next task to get ++CPU time is performed in the following way: ++ ++First the bitmap is checked to see what static priority tasks are queued. If ++any realtime priorities are found, the corresponding queue is checked and the ++first task listed there is taken (provided CPU affinity is suitable) and lookup ++is complete. If the priority corresponds to a SCHED_ISO task, they are also ++taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds ++to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this ++stage, every task in the runlist that corresponds to that priority is checked ++to see which has the earliest set deadline, and (provided it has suitable CPU ++affinity) it is taken off the runqueue and given the CPU. If a task has an ++expired deadline, it is taken and the rest of the lookup aborted (as they are ++chosen in FIFO order). ++ ++Thus, the lookup is O(n) in the worst case only, where n is as described ++earlier, as tasks may be chosen before the whole task list is looked over. ++ ++ ++Scalability. ++ ++The major limitations of BFS will be that of scalability, as the separate ++runqueue designs will have less lock contention as the number of CPUs rises. ++However they do not scale linearly even with separate runqueues as multiple ++runqueues will need to be locked concurrently on such designs to be able to ++achieve fair CPU balancing, to try and achieve some sort of nice-level fairness ++across CPUs, and to achieve low enough latency for tasks on a busy CPU when ++other CPUs would be more suited. BFS has the advantage that it requires no ++balancing algorithm whatsoever, as balancing occurs by proxy simply because ++all CPUs draw off the global runqueue, in priority and deadline order. Despite ++the fact that scalability is _not_ the prime concern of BFS, it both shows very ++good scalability to smaller numbers of CPUs and is likely a more scalable design ++at these numbers of CPUs. ++ ++It also has some very low overhead scalability features built into the design ++when it has been deemed their overhead is so marginal that they're worth adding. ++The first is the local copy of the running process' data to the CPU it's running ++on to allow that data to be updated lockless where possible. Then there is ++deference paid to the last CPU a task was running on, by trying that CPU first ++when looking for an idle CPU to use the next time it's scheduled. Finally there ++is the notion of "sticky" tasks that are flagged when they are involuntarily ++descheduled, meaning they still want further CPU time. This sticky flag is ++used to bias heavily against those tasks being scheduled on a different CPU ++unless that CPU would be otherwise idle. When a cpu frequency governor is used ++that scales with CPU load, such as ondemand, sticky tasks are not scheduled ++on a different CPU at all, preferring instead to go idle. This means the CPU ++they were bound to is more likely to increase its speed while the other CPU ++will go idle, thus speeding up total task execution time and likely decreasing ++power usage. This is the only scenario where BFS will allow a CPU to go idle ++in preference to scheduling a task on the earliest available spare CPU. 
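The behaviour described above can be condensed into a short sketch of how the sticky flag and the per-CPU scaling state bias the deadline comparison during lookup. This is illustrative only, adapted from the earliest_deadline_task() and swap_sticky() hunks later in this patch; the surrounding loop, locking and realtime handling are omitted, and the wrapper name effective_deadline() is hypothetical:

    /* Illustrative sketch, not the kernel code: the deadline a CPU other
     * than the task's last one would use when considering a queued task. */
    static u64 effective_deadline(struct task_struct *p, struct rq *rq)
    {
            if (task_rq(p) != rq && task_sticky(p)) {
                    if (scaling_rq(rq))
                            /* The real lookup skips the task entirely here;
                             * ~0ULL stands in for "never pick it on this CPU". */
                            return ~0ULL;
                    /* Otherwise, heavy bias against migrating a sticky task. */
                    return p->deadline + longest_deadline_diff();
            }
            /* Local or non-sticky task: use its real virtual deadline. */
            return p->deadline;
    }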
++ ++The real cost of migrating a task from one CPU to another is entirely dependant ++on the cache footprint of the task, how cache intensive the task is, how long ++it's been running on that CPU to take up the bulk of its cache, how big the CPU ++cache is, how fast and how layered the CPU cache is, how fast a context switch ++is... and so on. In other words, it's close to random in the real world where we ++do more than just one sole workload. The only thing we can be sure of is that ++it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and ++utilising idle CPUs is more important than cache locality, and cache locality ++only plays a part after that. ++ ++When choosing an idle CPU for a waking task, the cache locality is determined ++according to where the task last ran and then idle CPUs are ranked from best ++to worst to choose the most suitable idle CPU based on cache locality, NUMA ++node locality and hyperthread sibling business. They are chosen in the ++following preference (if idle): ++ ++* Same core, idle or busy cache, idle threads ++* Other core, same cache, idle or busy cache, idle threads. ++* Same node, other CPU, idle cache, idle threads. ++* Same node, other CPU, busy cache, idle threads. ++* Same core, busy threads. ++* Other core, same cache, busy threads. ++* Same node, other CPU, busy threads. ++* Other node, other CPU, idle cache, idle threads. ++* Other node, other CPU, busy cache, idle threads. ++* Other node, other CPU, busy threads. ++ ++This shows the SMT or "hyperthread" awareness in the design as well which will ++choose a real idle core first before a logical SMT sibling which already has ++tasks on the physical CPU. ++ ++Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. ++However this benchmarking was performed on an earlier design that was far less ++scalable than the current one so it's hard to know how scalable it is in terms ++of both CPUs (due to the global runqueue) and heavily loaded machines (due to ++O(n) lookup) at this stage. Note that in terms of scalability, the number of ++_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) ++quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark ++results are very promising indeed, without needing to tweak any knobs, features ++or options. Benchmark contributions are most welcome. ++ ++ ++Features ++ ++As the initial prime target audience for BFS was the average desktop user, it ++was designed to not need tweaking, tuning or have features set to obtain benefit ++from it. Thus the number of knobs and features has been kept to an absolute ++minimum and should not require extra user input for the vast majority of cases. ++There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval ++and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition ++to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is ++support for CGROUPS. The average user should neither need to know what these ++are, nor should they need to be using them to have good desktop behaviour. ++ ++rr_interval ++ ++There is only one "scheduler" tunable, the round robin interval. This can be ++accessed in ++ ++ /proc/sys/kernel/rr_interval ++ ++The value is in milliseconds, and the default value is set to 6ms. Valid values ++are from 1 to 1000. 
Decreasing the value will decrease latencies at the cost of ++decreasing throughput, while increasing it will improve throughput, but at the ++cost of worsening latencies. The accuracy of the rr interval is limited by HZ ++resolution of the kernel configuration. Thus, the worst case latencies are ++usually slightly higher than this actual value. BFS uses "dithering" to try and ++minimise the effect the Hz limitation has. The default value of 6 is not an ++arbitrary one. It is based on the fact that humans can detect jitter at ++approximately 7ms, so aiming for much lower latencies is pointless under most ++circumstances. It is worth noting this fact when comparing the latency ++performance of BFS to other schedulers. Worst case latencies being higher than ++7ms are far worse than average latencies not being in the microsecond range. ++Experimentation has shown that rr intervals being increased up to 300 can ++improve throughput but beyond that, scheduling noise from elsewhere prevents ++further demonstrable throughput. ++ ++Isochronous scheduling. ++ ++Isochronous scheduling is a unique scheduling policy designed to provide ++near-real-time performance to unprivileged (ie non-root) users without the ++ability to starve the machine indefinitely. Isochronous tasks (which means ++"same time") are set using, for example, the schedtool application like so: ++ ++ schedtool -I -e amarok ++ ++This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works ++is that it has a priority level between true realtime tasks and SCHED_NORMAL ++which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, ++if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval ++rate). However if ISO tasks run for more than a tunable finite amount of time, ++they are then demoted back to SCHED_NORMAL scheduling. This finite amount of ++time is the percentage of _total CPU_ available across the machine, configurable ++as a percentage in the following "resource handling" tunable (as opposed to a ++scheduler tunable): ++ ++ /proc/sys/kernel/iso_cpu ++ ++and is set to 70% by default. It is calculated over a rolling 5 second average ++Because it is the total CPU available, it means that on a multi CPU machine, it ++is possible to have an ISO task running as realtime scheduling indefinitely on ++just one CPU, as the other CPUs will be available. Setting this to 100 is the ++equivalent of giving all users SCHED_RR access and setting it to 0 removes the ++ability to run any pseudo-realtime tasks. ++ ++A feature of BFS is that it detects when an application tries to obtain a ++realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the ++appropriate privileges to use those policies. When it detects this, it will ++give the task SCHED_ISO policy instead. Thus it is transparent to the user. ++Because some applications constantly set their policy as well as their nice ++level, there is potential for them to undo the override specified by the user ++on the command line of setting the policy to SCHED_ISO. To counter this, once ++a task has been set to SCHED_ISO policy, it needs superuser privileges to set ++it back to SCHED_NORMAL. This will ensure the task remains ISO and all child ++processes and threads will also inherit the ISO policy. ++ ++Idleprio scheduling. ++ ++Idleprio scheduling is a scheduling policy designed to give out CPU to a task ++_only_ when the CPU would be otherwise idle. 
The idea behind this is to allow ++ultra low priority tasks to be run in the background that have virtually no ++effect on the foreground tasks. This is ideally suited to distributed computing ++clients (like setiathome, folding, mprime etc) but can also be used to start ++a video encode or so on without any slowdown of other tasks. To avoid this ++policy from grabbing shared resources and holding them indefinitely, if it ++detects a state where the task is waiting on I/O, the machine is about to ++suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As ++per the Isochronous task management, once a task has been scheduled as IDLEPRIO, ++it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can ++be set to start as SCHED_IDLEPRIO with the schedtool command like so: ++ ++ schedtool -D -e ./mprime ++ ++Subtick accounting. ++ ++It is surprisingly difficult to get accurate CPU accounting, and in many cases, ++the accounting is done by simply determining what is happening at the precise ++moment a timer tick fires off. This becomes increasingly inaccurate as the ++timer tick frequency (HZ) is lowered. It is possible to create an application ++which uses almost 100% CPU, yet by being descheduled at the right time, records ++zero CPU usage. While the main problem with this is that there are possible ++security implications, it is also difficult to determine how much CPU a task ++really does use. BFS tries to use the sub-tick accounting from the TSC clock, ++where possible, to determine real CPU usage. This is not entirely reliable, but ++is far more likely to produce accurate CPU usage data than the existing designs ++and will not show tasks as consuming no CPU usage when they actually are. Thus, ++the amount of CPU reported as being used by BFS will more accurately represent ++how much CPU the task itself is using (as is shown for example by the 'time' ++application), so the reported values may be quite different to other schedulers. ++Values reported as the 'load' are more prone to problems with this design, but ++per process values are closer to real usage. When comparing throughput of BFS ++to other designs, it is important to compare the actual completed work in terms ++of total wall clock time taken and total work done, rather than the reported ++"cpu usage". ++ ++ ++Con Kolivas Tue, 5 Apr 2011 +diff -urpN linux-2.6.28.orig/drivers/cpufreq/cpufreq.c linux-2.6.28/drivers/cpufreq/cpufreq.c +--- linux-2.6.28.orig/drivers/cpufreq/cpufreq.c 2011-04-10 21:50:27.082902337 +0200 ++++ linux-2.6.28/drivers/cpufreq/cpufreq.c 2011-04-10 21:29:38.566326570 +0200 +@@ -28,6 +28,7 @@ + #include + #include + #include ++#include + + #define dprintk(msg...) 
cpufreq_debug_printk(CPUFREQ_DEBUG_CORE, \ + "cpufreq-core", msg) +@@ -1460,6 +1461,12 @@ int __cpufreq_driver_target(struct cpufr + target_freq, relation); + if (cpu_online(policy->cpu) && cpufreq_driver->target) + retval = cpufreq_driver->target(policy, target_freq, relation); ++ if (likely(retval != -EINVAL)) { ++ if (target_freq == policy->max) ++ cpu_nonscaling(policy->cpu); ++ else ++ cpu_scaling(policy->cpu); ++ } + + return retval; + } +diff -urpN linux-2.6.28.orig/include/linux/sched.h linux-2.6.28/include/linux/sched.h +--- linux-2.6.28.orig/include/linux/sched.h 2011-04-10 21:50:27.659568962 +0200 ++++ linux-2.6.28/include/linux/sched.h 2011-04-10 21:45:23.102924469 +0200 +@@ -1126,7 +1126,9 @@ struct task_struct { + struct list_head run_list; + u64 last_ran; + u64 sched_time; /* sched_clock time spent running */ +- ++#ifdef CONFIG_SMP ++ int sticky; /* Soft affined flag */ ++#endif + unsigned long rt_timeout; + #else /* CONFIG_SCHED_BFS */ + const struct sched_class *sched_class; +@@ -1409,6 +1411,8 @@ struct task_struct { + #ifdef CONFIG_SCHED_BFS + extern int grunqueue_is_locked(void); + extern void grq_unlock_wait(void); ++extern void cpu_scaling(int cpu); ++extern void cpu_nonscaling(int cpu); + #define tsk_seruntime(t) ((t)->sched_time) + #define tsk_rttimeout(t) ((t)->rt_timeout) + #define task_rq_unlock_wait(tsk) grq_unlock_wait() +@@ -1426,16 +1430,23 @@ static inline void tsk_cpus_current(stru + + static inline void print_scheduler_version(void) + { +- printk(KERN_INFO"BFS CPU scheduler v0.363 by Con Kolivas.\n"); ++ printk(KERN_INFO"BFS CPU scheduler v0.400 by Con Kolivas.\n"); + } + + static inline int iso_task(struct task_struct *p) + { + return (p->policy == SCHED_ISO); + } +-#else ++#else /* CFS */ + extern int runqueue_is_locked(void); + extern void task_rq_unlock_wait(struct task_struct *p); ++static inline void cpu_scaling(int cpu) ++{ ++} ++ ++static inline void cpu_nonscaling(int cpu) ++{ ++} + #define tsk_seruntime(t) ((t)->se.sum_exec_runtime) + #define tsk_rttimeout(t) ((t)->rt.timeout) + +diff -urpN linux-2.6.28.orig/kernel/sched_bfs.c linux-2.6.28/kernel/sched_bfs.c +--- linux-2.6.28.orig/kernel/sched_bfs.c 2011-04-10 21:50:27.659568962 +0200 ++++ linux-2.6.28/kernel/sched_bfs.c 2011-04-10 21:43:43.506265052 +0200 +@@ -81,7 +81,7 @@ + #define idleprio_task(p) unlikely((p)->policy == SCHED_IDLEPRIO) + #define iso_task(p) unlikely((p)->policy == SCHED_ISO) + #define iso_queue(rq) unlikely((rq)->rq_policy == SCHED_ISO) +-#define ISO_PERIOD ((5 * HZ * num_online_cpus()) + 1) ++#define ISO_PERIOD ((5 * HZ * grq.noc) + 1) + + /* + * Convert user-nice values [ -20 ... 0 ... 
19 ] +@@ -184,6 +184,7 @@ struct global_rq { + cpumask_t cpu_idle_map; + int idle_cpus; + #endif ++ int noc; /* num_online_cpus stored and updated when it changes */ + u64 niffies; /* Nanosecond jiffies */ + unsigned long last_jiffy; /* Last jiffy we updated niffies */ + +@@ -226,6 +227,8 @@ struct rq { + #ifdef CONFIG_SMP + int cpu; /* cpu of this runqueue */ + int online; ++ int scaling; /* This CPU is managed by a scaling CPU freq governor */ ++ struct task_struct *sticky_task; + + struct root_domain *rd; + struct sched_domain *sd; +@@ -242,7 +245,11 @@ struct rq { + #endif + u64 last_niffy; /* Last time this RQ updated grq.niffies */ + #endif ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING ++ u64 prev_irq_time; ++#endif + u64 clock, old_clock, last_tick; ++ u64 clock_task; + int dither; + + #ifdef CONFIG_SCHEDSTATS +@@ -407,9 +414,14 @@ static inline void update_clocks(struct + * when we're not updating niffies. + * Looking up task_rq must be done under grq.lock to be safe. + */ ++static void update_rq_clock_task(struct rq *rq, s64 delta); ++ + static inline void update_rq_clock(struct rq *rq) + { +- rq->clock = sched_clock_cpu(cpu_of(rq)); ++ s64 delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; ++ ++ rq->clock += delta; ++ update_rq_clock_task(rq, delta); + } + + static inline int task_running(struct task_struct *p) +@@ -741,10 +753,8 @@ static void resched_task(struct task_str + + /* + * The best idle CPU is chosen according to the CPUIDLE ranking above where the +- * lowest value would give the most suitable CPU to schedule p onto next. We +- * iterate from the last CPU upwards instead of using for_each_cpu_mask so as +- * to be able to break out immediately if the last CPU is idle. The order works +- * out to be the following: ++ * lowest value would give the most suitable CPU to schedule p onto next. The ++ * order works out to be the following: + * + * Same core, idle or busy cache, idle threads + * Other core, same cache, idle or busy cache, idle threads. +@@ -756,38 +766,19 @@ static void resched_task(struct task_str + * Other node, other CPU, idle cache, idle threads. + * Other node, other CPU, busy cache, idle threads. + * Other node, other CPU, busy threads. +- * +- * If p was the last task running on this rq, then regardless of where +- * it has been running since then, it is cache warm on this rq. + */ +-static void resched_best_idle(struct task_struct *p) ++static void ++resched_best_mask(unsigned long best_cpu, struct rq *rq, cpumask_t *tmpmask) + { +- unsigned long cpu_tmp, best_cpu, best_ranking; +- cpumask_t tmpmask; +- struct rq *rq; +- int iterate; ++ unsigned long cpu_tmp, best_ranking; + +- cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); +- iterate = cpus_weight(tmpmask); +- best_cpu = task_cpu(p); +- /* +- * Start below the last CPU and work up with next_cpu_nr as the last +- * CPU might not be idle or affinity might not allow it. 
+- */ +- cpu_tmp = best_cpu - 1; +- rq = cpu_rq(best_cpu); + best_ranking = ~0UL; + +- do { ++ for_each_cpu_mask(cpu_tmp, *tmpmask) { + unsigned long ranking; + struct rq *tmp_rq; + + ranking = 0; +- cpu_tmp = next_cpu_nr(cpu_tmp, tmpmask); +- if (cpu_tmp >= nr_cpu_ids) { +- cpu_tmp = -1; +- cpu_tmp = next_cpu_nr(cpu_tmp, tmpmask); +- } + tmp_rq = cpu_rq(cpu_tmp); + + #ifdef CONFIG_NUMA +@@ -815,37 +806,42 @@ static void resched_best_idle(struct tas + break; + best_ranking = ranking; + } +- } while (--iterate > 0); ++ } + + resched_task(cpu_rq(best_cpu)->curr); + } + ++static void resched_best_idle(struct task_struct *p) ++{ ++ cpumask_t tmpmask; ++ ++ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); ++ resched_best_mask(task_cpu(p), task_rq(p), &tmpmask); ++} ++ + static inline void resched_suitable_idle(struct task_struct *p) + { + if (suitable_idle_cpus(p)) + resched_best_idle(p); + } +- + /* +- * The cpu cache locality difference between CPUs is used to determine how far +- * to offset the virtual deadline. <2 difference in locality means that one +- * timeslice difference is allowed longer for the cpu local tasks. This is +- * enough in the common case when tasks are up to 2* number of CPUs to keep +- * tasks within their shared cache CPUs only. CPUs on different nodes or not +- * even in this domain (NUMA) have "4" difference, allowing 4 times longer +- * deadlines before being taken onto another cpu, allowing for 2* the double +- * seen by separate CPUs above. +- * Simple summary: Virtual deadlines are equal on shared cache CPUs, double +- * on separate CPUs and quadruple in separate NUMA nodes. ++ * Flags to tell us whether this CPU is running a CPU frequency governor that ++ * has slowed its speed or not. No locking required as the very rare wrongly ++ * read value would be harmless. + */ +-static inline int +-cache_distance(struct rq *task_rq, struct rq *rq, struct task_struct *p) ++void cpu_scaling(int cpu) + { +- int locality = rq->cpu_locality[cpu_of(task_rq)] - 2; ++ cpu_rq(cpu)->scaling = 1; ++} + +- if (locality > 0) +- return task_timeslice(p) << locality; +- return 0; ++void cpu_nonscaling(int cpu) ++{ ++ cpu_rq(cpu)->scaling = 0; ++} ++ ++static inline int scaling_rq(struct rq *rq) ++{ ++ return rq->scaling; + } + #else /* CONFIG_SMP */ + static inline void inc_qnr(void) +@@ -879,12 +875,25 @@ static inline void resched_suitable_idle + { + } + +-static inline int +-cache_distance(struct rq *task_rq, struct rq *rq, struct task_struct *p) ++void cpu_scaling(int __unused) ++{ ++} ++ ++void cpu_nonscaling(int __unused) ++{ ++} ++ ++/* ++ * Although CPUs can scale in UP, there is nowhere else for tasks to go so this ++ * always returns 0. ++ */ ++static inline int scaling_rq(struct rq *rq) + { + return 0; + } + #endif /* CONFIG_SMP */ ++EXPORT_SYMBOL_GPL(cpu_scaling); ++EXPORT_SYMBOL_GPL(cpu_nonscaling); + + /* + * activate_idle_task - move idle task to the _front_ of runqueue. +@@ -974,6 +983,82 @@ void set_task_cpu(struct task_struct *p, + smp_wmb(); + task_thread_info(p)->cpu = cpu; + } ++ ++static inline void clear_sticky(struct task_struct *p) ++{ ++ p->sticky = 0; ++} ++ ++static inline int task_sticky(struct task_struct *p) ++{ ++ return p->sticky; ++} ++ ++/* Reschedule the best idle CPU that is not this one. 
*/ ++static void ++resched_closest_idle(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++ cpumask_t tmpmask; ++ ++ cpus_and(tmpmask, p->cpus_allowed, grq.cpu_idle_map); ++ cpu_clear(cpu, tmpmask); ++ if (cpus_empty(tmpmask)) ++ return; ++ resched_best_mask(cpu, rq, &tmpmask); ++} ++ ++/* ++ * We set the sticky flag on a task that is descheduled involuntarily meaning ++ * it is awaiting further CPU time. If the last sticky task is still sticky ++ * but unlucky enough to not be the next task scheduled, we unstick it and try ++ * to find it an idle CPU. Realtime tasks do not stick to minimise their ++ * latency at all times. ++ */ ++static inline void ++swap_sticky(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++ if (rq->sticky_task) { ++ if (rq->sticky_task == p) { ++ p->sticky = 1; ++ return; ++ } ++ if (rq->sticky_task->sticky) { ++ rq->sticky_task->sticky = 0; ++ resched_closest_idle(rq, cpu, rq->sticky_task); ++ } ++ } ++ if (!rt_task(p)) { ++ p->sticky = 1; ++ rq->sticky_task = p; ++ } else { ++ resched_closest_idle(rq, cpu, p); ++ rq->sticky_task = NULL; ++ } ++} ++ ++static inline void unstick_task(struct rq *rq, struct task_struct *p) ++{ ++ rq->sticky_task = NULL; ++ clear_sticky(p); ++} ++#else ++static inline void clear_sticky(struct task_struct *p) ++{ ++} ++ ++static inline int task_sticky(struct task_struct *p) ++{ ++ return 0; ++} ++ ++static inline void ++swap_sticky(struct rq *rq, unsigned long cpu, struct task_struct *p) ++{ ++} ++ ++static inline void unstick_task(struct rq *rq, struct task_struct *p) ++{ ++} + #endif + + /* +@@ -984,6 +1069,7 @@ static inline void take_task(struct rq * + { + set_task_cpu(p, cpu_of(rq)); + dequeue_task(p); ++ clear_sticky(p); + dec_qnr(); + } + +@@ -1266,6 +1352,13 @@ static void try_preempt(struct task_stru + int highest_prio; + cpumask_t tmp; + ++ /* ++ * We clear the sticky flag here because for a task to have called ++ * try_preempt with the sticky flag enabled means some complicated ++ * re-scheduling has occurred and we should ignore the sticky flag. ++ */ ++ clear_sticky(p); ++ + if (suitable_idle_cpus(p)) { + resched_best_idle(p); + return; +@@ -1284,7 +1377,6 @@ static void try_preempt(struct task_stru + highest_prio = -1; + + for_each_cpu_mask_nr(cpu, tmp) { +- u64 offset_deadline; + struct rq *rq; + int rq_prio; + +@@ -1293,12 +1385,9 @@ static void try_preempt(struct task_stru + if (rq_prio < highest_prio) + continue; + +- offset_deadline = rq->rq_deadline - +- cache_distance(this_rq, rq, p); +- + if (rq_prio > highest_prio || (rq_prio == highest_prio && +- deadline_after(offset_deadline, latest_deadline))) { +- latest_deadline = offset_deadline; ++ deadline_after(rq->rq_deadline, latest_deadline))) { ++ latest_deadline = rq->rq_deadline; + highest_prio = rq_prio; + highest_prio_rq = rq; + } +@@ -1470,6 +1559,7 @@ void sched_fork(struct task_struct *p, i + #endif + + p->oncpu = 0; ++ clear_sticky(p); + + #ifdef CONFIG_PREEMPT + /* Want to start with kernel preemption disabled. */ +@@ -1809,6 +1899,152 @@ DEFINE_PER_CPU(struct kernel_stat, kstat + + EXPORT_PER_CPU_SYMBOL(kstat); + ++#ifdef CONFIG_IRQ_TIME_ACCOUNTING ++ ++/* ++ * There are no locks covering percpu hardirq/softirq time. ++ * They are only modified in account_system_vtime, on corresponding CPU ++ * with interrupts disabled. So, writes are safe. ++ * They are read and saved off onto struct rq in update_rq_clock(). ++ * This may result in other CPU reading this CPU's irq time and can ++ * race with irq/account_system_vtime on this CPU. 
We would either get old ++ * or new value with a side effect of accounting a slice of irq time to wrong ++ * task when irq is in progress while we read rq->clock. That is a worthy ++ * compromise in place of having locks on each irq in account_system_time. ++ */ ++static DEFINE_PER_CPU(u64, cpu_hardirq_time); ++static DEFINE_PER_CPU(u64, cpu_softirq_time); ++ ++static DEFINE_PER_CPU(u64, irq_start_time); ++static int sched_clock_irqtime; ++ ++void enable_sched_clock_irqtime(void) ++{ ++ sched_clock_irqtime = 1; ++} ++ ++void disable_sched_clock_irqtime(void) ++{ ++ sched_clock_irqtime = 0; ++} ++ ++#ifndef CONFIG_64BIT ++static DEFINE_PER_CPU(seqcount_t, irq_time_seq); ++ ++static inline void irq_time_write_begin(void) ++{ ++ __this_cpu_inc(irq_time_seq.sequence); ++ smp_wmb(); ++} ++ ++static inline void irq_time_write_end(void) ++{ ++ smp_wmb(); ++ __this_cpu_inc(irq_time_seq.sequence); ++} ++ ++static inline u64 irq_time_read(int cpu) ++{ ++ u64 irq_time; ++ unsigned seq; ++ ++ do { ++ seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); ++ irq_time = per_cpu(cpu_softirq_time, cpu) + ++ per_cpu(cpu_hardirq_time, cpu); ++ } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); ++ ++ return irq_time; ++} ++#else /* CONFIG_64BIT */ ++static inline void irq_time_write_begin(void) ++{ ++} ++ ++static inline void irq_time_write_end(void) ++{ ++} ++ ++static inline u64 irq_time_read(int cpu) ++{ ++ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); ++} ++#endif /* CONFIG_64BIT */ ++ ++/* ++ * Called before incrementing preempt_count on {soft,}irq_enter ++ * and before decrementing preempt_count on {soft,}irq_exit. ++ */ ++void account_system_vtime(struct task_struct *curr) ++{ ++ unsigned long flags; ++ s64 delta; ++ int cpu; ++ ++ if (!sched_clock_irqtime) ++ return; ++ ++ local_irq_save(flags); ++ ++ cpu = smp_processor_id(); ++ delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); ++ __this_cpu_add(irq_start_time, delta); ++ ++ irq_time_write_begin(); ++ /* ++ * We do not account for softirq time from ksoftirqd here. ++ * We want to continue accounting softirq time to ksoftirqd thread ++ * in that case, so as not to confuse scheduler with a special task ++ * that do not consume any time, but still wants to run. ++ */ ++ if (hardirq_count()) ++ __this_cpu_add(cpu_hardirq_time, delta); ++ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) ++ __this_cpu_add(cpu_softirq_time, delta); ++ ++ irq_time_write_end(); ++ local_irq_restore(flags); ++} ++EXPORT_SYMBOL_GPL(account_system_vtime); ++ ++static void update_rq_clock_task(struct rq *rq, s64 delta) ++{ ++ s64 irq_delta; ++ ++ irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; ++ ++ /* ++ * Since irq_time is only updated on {soft,}irq_exit, we might run into ++ * this case when a previous update_rq_clock() happened inside a ++ * {soft,}irq region. ++ * ++ * When this happens, we stop ->clock_task and only update the ++ * prev_irq_time stamp to account for the part that fit, so that a next ++ * update will consume the rest. This ensures ->clock_task is ++ * monotonic. ++ * ++ * It does however cause some slight miss-attribution of {soft,}irq ++ * time, a more accurate solution would be to update the irq_time using ++ * the current rq->clock timestamp, except that would require using ++ * atomic ops. 
++ */ ++ if (irq_delta > delta) ++ irq_delta = delta; ++ ++ rq->prev_irq_time += irq_delta; ++ delta -= irq_delta; ++ rq->clock_task += delta; ++} ++ ++#else /* CONFIG_IRQ_TIME_ACCOUNTING */ ++ ++static void update_rq_clock_task(struct rq *rq, s64 delta) ++{ ++ rq->clock_task += delta; ++} ++ ++#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ ++ + /* + * On each tick, see what percentage of that tick was attributed to each + * component and add the percentage to the _pc values. Once a _pc value has +@@ -2372,9 +2608,14 @@ static inline u64 static_deadline_diff(i + return prio_deadline_diff(USER_PRIO(static_prio)); + } + ++static inline int longest_deadline_diff(void) ++{ ++ return prio_deadline_diff(39); ++} ++ + static inline int ms_longest_deadline_diff(void) + { +- return NS_TO_MS(prio_deadline_diff(39)); ++ return NS_TO_MS(longest_deadline_diff()); + } + + /* +@@ -2444,7 +2685,19 @@ retry: + goto out_take; + } + +- dl = p->deadline + cache_distance(task_rq(p), rq, p); ++ /* ++ * Soft affinity happens here by not scheduling a task with ++ * its sticky flag set that ran on a different CPU last when ++ * the CPU is scaling, or by greatly biasing against its ++ * deadline when not. ++ */ ++ if (task_rq(p) != rq && task_sticky(p)) { ++ if (scaling_rq(rq)) ++ continue; ++ else ++ dl = p->deadline + longest_deadline_diff(); ++ } else ++ dl = p->deadline; + + /* + * No rt tasks. Find the earliest deadline task. Now we're in +@@ -2601,14 +2854,8 @@ need_resched_nonpreemptible: + */ + grq_unlock_irq(); + goto rerun_prev_unlocked; +- } else { +- /* +- * If prev got kicked off by a task that has to +- * run on this CPU for affinity reasons then +- * there may be an idle CPU it can go to. +- */ +- resched_suitable_idle(prev); +- } ++ } else ++ swap_sticky(rq, cpu, prev); + } + return_task(prev, deactivate); + } +@@ -2623,12 +2870,21 @@ need_resched_nonpreemptible: + set_cpuidle_map(cpu); + } else { + next = earliest_deadline_task(rq, idle); +- prefetch(next); +- prefetch_stack(next); +- clear_cpuidle_map(cpu); ++ if (likely(next->prio != PRIO_LIMIT)) { ++ prefetch(next); ++ prefetch_stack(next); ++ clear_cpuidle_map(cpu); ++ } else ++ set_cpuidle_map(cpu); + } + + if (likely(prev != next)) { ++ /* ++ * Don't stick tasks when a real time task is going to run as ++ * they may literally get stuck. ++ */ ++ if (rt_task(next)) ++ unstick_task(rq, prev); + sched_info_switch(prev, next); + + set_rq_task(rq, next); +@@ -4009,7 +4265,6 @@ void init_idle(struct task_struct *idle, + rcu_read_unlock(); + rq->curr = rq->idle = idle; + idle->oncpu = 1; +- set_cpuidle_map(cpu); + grq_unlock_irqrestore(&flags); + + /* Set the preempt count _outside_ the spinlocks! 
*/ +@@ -4245,6 +4500,7 @@ static void break_sole_affinity(int src_ + task_pid_nr(p), p->comm, src_cpu); + } + } ++ clear_sticky(p); + } while_each_thread(t, p); + } + +@@ -4569,6 +4825,7 @@ migration_call(struct notifier_block *nf + + set_rq_online(rq); + } ++ grq.noc = num_online_cpus(); + grq_unlock_irqrestore(&flags); + break; + +@@ -4601,6 +4858,7 @@ migration_call(struct notifier_block *nf + BUG_ON(!cpu_isset(cpu, rq->rd->span)); + set_rq_offline(rq); + } ++ grq.noc = num_online_cpus(); + grq_unlock_irqrestore(&flags); + break; + #endif +@@ -6005,7 +6263,7 @@ static int cache_cpu_idle(unsigned long + void __init sched_init_smp(void) + { + struct sched_domain *sd; +- int cpu, cpus; ++ int cpu; + + cpumask_t non_isolated_cpus; + +@@ -6035,14 +6293,6 @@ void __init sched_init_smp(void) + if (set_cpus_allowed_ptr(current, &non_isolated_cpus) < 0) + BUG(); + +- /* +- * Assume that every added cpu gives us slightly less overall latency +- * allowing us to increase the base rr_interval, non-linearly and with +- * an upper bound. +- */ +- cpus = num_online_cpus(); +- rr_interval = rr_interval * (4 * cpus + 4) / (cpus + 6); +- + grq_lock_irq(); + /* + * Set up the relative cache distance of each online cpu from each +@@ -6129,6 +6379,7 @@ void __init sched_init(void) + grq.last_jiffy = jiffies; + spin_lock_init(&grq.iso_lock); + grq.iso_ticks = grq.iso_refractory = 0; ++ grq.noc = 1; + #ifdef CONFIG_SMP + init_defrootdomain(); + grq.qnr = grq.idle_cpus = 0; +@@ -6142,6 +6393,7 @@ void __init sched_init(void) + rq->iowait_pc = rq->idle_pc = 0; + rq->dither = 0; + #ifdef CONFIG_SMP ++ rq->sticky_task = NULL; + rq->last_niffy = 0; + rq->sd = NULL; + rq->rd = NULL;
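
Note on the new "sticky" soft affinity: in the earliest_deadline_task() hunk above, a task whose sticky flag is set and which last ran on a different CPU is skipped entirely while the selecting runqueue is flagged as scaling, and otherwise is only picked if its deadline still wins after being pushed back by longest_deadline_diff(). The standalone C program below is a minimal userspace sketch of that decision only, not code from the patch; struct task, effective_deadline() and the value of LONGEST_DEADLINE_DIFF are invented for illustration.

/* Illustrative userspace sketch of the sticky soft-affinity bias; all names
 * and values here are made up and do not come from the kernel patch. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LONGEST_DEADLINE_DIFF 100 /* stand-in for prio_deadline_diff(39) */

struct task {
	uint64_t deadline;  /* virtual deadline */
	int	 last_cpu;  /* CPU the task last ran on */
	bool	 sticky;    /* descheduled involuntarily, prefers last_cpu */
};

/* Effective deadline of @p as seen by @cpu; returns false if @p should be
 * skipped because the selecting CPU is scaling and @p would rather wait
 * for its own CPU. */
static bool effective_deadline(const struct task *p, int cpu,
			       bool cpu_scaling, uint64_t *dl)
{
	if (p->sticky && p->last_cpu != cpu) {
		if (cpu_scaling)
			return false;			  /* don't pull it */
		*dl = p->deadline + LONGEST_DEADLINE_DIFF; /* heavy bias */
		return true;
	}
	*dl = p->deadline;
	return true;
}

int main(void)
{
	struct task local  = { .deadline = 50, .last_cpu = 0, .sticky = true };
	struct task remote = { .deadline = 10, .last_cpu = 1, .sticky = true };
	uint64_t dl;

	/* CPU 0, not scaling: the remote sticky task is still considered,
	 * but its earlier raw deadline no longer beats the local task. */
	if (effective_deadline(&remote, 0, false, &dl))
		printf("remote effective deadline: %llu\n",
		       (unsigned long long)dl);
	if (effective_deadline(&local, 0, false, &dl))
		printf("local effective deadline:  %llu\n",
		       (unsigned long long)dl);
	return 0;
}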
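
Note on the IRQ time accounting hunks: update_rq_clock_task() above subtracts the irq time observed since the last update from the runqueue clock delta, clamping it so ->clock_task stays monotonic, and carries any excess forward through prev_irq_time for the next update to consume. The short program below is a userspace model of that arithmetic only; struct rq_model, update_clock_task() and the sample numbers are invented for illustration and are not the kernel implementation.

/* Illustrative userspace sketch of the clock_task/irq_delta arithmetic;
 * names and numbers are made up, only the clamping logic mirrors the patch. */
#include <stdint.h>
#include <stdio.h>

struct rq_model {
	uint64_t clock_task;	/* task-consumable time, irq time excluded */
	uint64_t prev_irq_time; /* irq time already accounted for */
};

static void update_clock_task(struct rq_model *rq, uint64_t irq_time_now,
			      int64_t delta)
{
	int64_t irq_delta = (int64_t)(irq_time_now - rq->prev_irq_time);

	if (irq_delta > delta)
		irq_delta = delta;	/* keep clock_task monotonic */

	rq->prev_irq_time += irq_delta;	/* leftover is consumed next time */
	rq->clock_task += delta - irq_delta;
}

int main(void)
{
	struct rq_model rq = { 0, 0 };

	/* 1000ns of clock delta, 300ns of which was irq time */
	update_clock_task(&rq, 300, 1000);
	/* next update: another 1000ns, but 1200ns of new irq time was
	 * recorded (an irq exited just before sampling); the excess 200ns
	 * simply carries over to the following update */
	update_clock_task(&rq, 1500, 1000);

	/* prints clock_task=700 prev_irq_time=1300 */
	printf("clock_task=%llu prev_irq_time=%llu\n",
	       (unsigned long long)rq.clock_task,
	       (unsigned long long)rq.prev_irq_time);
	return 0;
}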