

Budget Fair Queueing I/O Scheduler
==================================
This patchset introduces BFQ-v8r8 into Linux 4.10.0.
For further information: http://algogroup.unimore.it/people/paolo/disk_sched/
The overall diffstat is the following:
 Documentation/block/00-INDEX        |    2 +
 Documentation/block/bfq-iosched.txt |  530 ++++++
 Makefile                            |    2 +-
 block/Kconfig.iosched               |   30 +
 block/Makefile                      |    1 +
 block/bfq-cgroup.c                  | 1191 +++++++++++++
 block/bfq-ioc.c                     |   36 +
 block/bfq-iosched.c                 | 5306 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/bfq-sched.c                   | 1989 ++++++++++++++++++++++
 block/bfq.h                         |  935 +++++++++++
 include/linux/blkdev.h              |    2 +-
 11 files changed, 10022 insertions(+), 2 deletions(-)
CHANGELOG
BFQ v8r8
. BUGFIX: Removed a wrong compilation warning, due to the compiler
not taking into account short circuit in a condition.
. BUGFIX: Added several forgotten static qualifiers in function
definitions (completely harmless issue).
. BUGFIX: Put async queues on scheduler exit also without cgroups.
The putting of async queues on scheduler exit was missing when
cgroups support was not active. This fix adds the missing operation.
. BUGFIX: In the peak-rate estimator, there was a serious error in the
check that the percentage of sequential I/O-request dispatches was high
enough to trigger an update of the peak-rate estimate. This commit fixes
that check.
. IMPROVEMENT Luca Miccio has run a few responsiveness tests on recent
Android systems with average-speed storage devices. These tests have
shown that the following BFQ parameter was too low for these
systems: reference duration for slow storage devices of weight
raising for interactive applications. This commit raises that
duration to a value that is yielding optimal results in our
tests. Contributed by Luca Miccio.
. IMPROVEMENT This commit moves the complete check of budget
exhaustion, for the in-service bfq_queue, earlier: to the point where
the next bfq_queue to serve is selected (during a dispatch
operation). This enables a
new bfq_queue to be immediately selected for service in case the
in-service bfq_queue has actually exhausted its budget. As a
consequence, a second dispatch invocation is not needed any more, to
have a new request dispatched. To implement this improvement, this
commit implements a further improvement too: the field next_rq of a
bfq_queue now always contains the actual next request to dispatch
(or NULL if the bfq_queue is empty).
. BUGFIX Make bfq_bic_update_cgroup() return nothing if
CONFIG_BFQ_GROUP_IOSCHED is disabled, as it happens if this option
is enabled. Contributed by Oleksandr Natalenko.
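
The budget-exhaustion check described in the v8r8 IMPROVEMENT entry
above can be pictured with the following C sketch. It is only an
illustration, not the BFQ code: the struct layouts and the names
toy_request, toy_queue and toy_budget_exhausted are hypothetical, and
the rule it applies (a budget is exhausted when the next request costs
more than the budget still left) is an assumption. It relies on next_rq
being NULL for an empty queue, as stated in that entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a request and a bfq_queue. */
struct toy_request {
	unsigned long sectors; /* cost of serving this request */
};

struct toy_queue {
	unsigned long budget; /* service the queue may receive */
	unsigned long served; /* service received so far */
	struct toy_request *next_rq; /* next request, NULL if empty */
};

/*
 * Assumed rule: the budget is exhausted if the next request costs
 * more than the budget still left. An empty queue has nothing to
 * charge, so it never counts as exhausted here.
 */
static bool toy_budget_exhausted(const struct toy_queue *q)
{
	unsigned long left;

	if (q->next_rq == NULL)
		return false;
	left = q->budget > q->served ? q->budget - q->served : 0;
	return q->next_rq->sectors > left;
}
```

Because the check needs only next_rq and two counters, it can run at
selection time, before any request is handed to the device.
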
BFQ v8r7
. BUGFIX: make BFQ compile also without hierarchical support
BFQ v8r6
. BUGFIX Removed the check that, when the new queue to set in service
must be selected, the cached next_in_service entities coincide with
the entities chosen by __bfq_lookup_next_entity. This check, issuing
a warning on failure, was wrong, because the cached and the newly
chosen entity could differ in case of a CLASS_IDLE timeout.
. EFFICIENCY IMPROVEMENT (this improvement is related to the above
BUGFIX) The cached next_in_service entities are now really used to
select the next queue to serve when the in-service queue
expires. Before this change, the cached values were used only for
extra (and in general wrong) consistency checks. This caused
additional overhead instead of reducing it.
. EFFICIENCY IMPROVEMENT The next entity to serve, for each level of
the hierarchy, is now updated on every event that may change it,
i.e., on every activation or deactivation of any entity. This finer
granularity is not strictly needed for correctness, because it is
only on queue expirations that BFQ needs to know what are the next
entities to serve. Yet this change makes it possible to implement
optimizations in which it is necessary to know the next queue to
serve before the in-service queue expires.
. SERVICE-ACCURACY IMPROVEMENT The per-device CLASS_IDLE service
timeout has been turned into a much more accurate per-group timeout.
. CODE-QUALITY IMPROVEMENT The non-trivial parts touched by the above
improvements have been partially rewritten, and enriched with
comments, so as to improve their transparency and understandability.
. IMPROVEMENT Ported and improved CFQ commit 41647e7a. Before this
improvement, BFQ used the same logic for detecting seeky queues for
rotational disks and SSDs. This logic is appropriate for the former,
as it takes into account only inter-request distance, and the latter
is the dominant latency factor on a rotational device. Yet things
change with flash-based devices, where serving a large request still
yields a high throughput, even if the request is far from the previous
request served. This commit extends seeky detection to take this fact
into account as well with flash-based devices. In particular, this
commit is an improved port of the original commit 41647e7a for CFQ.
. CODE IMPROVEMENT Remove useless parameter from bfq_del_bfqq_busy
. OPTIMIZATION Optimize the update of next_in_service entity. If the
update of the next_in_service candidate entity is triggered by the
activation of an entity, then it is not necessary to perform full
lookups in the active trees to update next_in_service. In fact, it
is enough to check whether the just-activated entity has a higher
priority than next_in_service, or, even if it has the same priority
as next_in_service, is eligible and has a lower virtual finish time
than next_in_service. If this compound condition holds, then the new
entity can be set as the new next_in_service. Otherwise no change is
needed. This commit implements this optimization.
. BUGFIX Fix bug causing occasional loss of weight raising. When a
bfq_queue, say bfqq, is split after a merging with another
bfq_queue, BFQ checks whether it has to restore for bfqq the
weight-raising state that bfqq had before being merged. In
particular, the weight-raising is restored only if, according to the
weight-raising duration decided for bfqq when it started to be
weight-raised (before being merged), bfqq would not have already
finished its weight-raising period. Yet, by mistake, such a
duration was not saved when bfqq is merged. So, if bfqq was freed
and reallocated when it was split, then this duration was wrongly
set to zero on the split. As a consequence, the weight-raising state
of bfqq was wrongly not restored, which caused BFQ to fail in
guaranteeing a low latency to bfqq. This commit fixes this bug by
saving weight-raising duration when bfqq is merged, and correctly
restoring it when bfqq is split.
. BUGFIX Fix wrong reset of in-service entities. In-service entities
were reset with an indirect logic, which happened to be even buggy
for some cases. This commit fixes this bug in two important
steps. First, by replacing this indirect logic with a direct logic,
in which all involved entities are immediately reset, with a
bubble-up loop, when the in-service queue is reset. Second, by
restructuring the code related to this change, so as to become not
only correct with respect to this change, but also cleaner and
hopefully clearer.
. CODE IMPROVEMENT Add code to be able to redirect trace log to
console.
. BUGFIX Fixed bug in optimized update of next_in_service entity.
There was a case where bfq_update_next_in_service did not update
next_in_service, even if it might need to be changed: in case of
requeueing or repositioning of the entity that happened to be
pointed to exactly by next_in_service. This could result in violation
of service guarantees, because, after a change of timestamps for
such an entity, it might be the case that next_in_service had to
point to a different entity. This commit fixes this bug.
. OPTIMIZATION Stop bubble-up of next_in_service update if possible.
. BUGFIX Fixed a false-positive warning for uninitialized var
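
The compound condition in the v8r6 OPTIMIZATION entry on
next_in_service (higher priority, or same priority plus eligible plus
lower virtual finish time) can be sketched in C as follows. This is a
minimal illustration under stated assumptions, not the BFQ
implementation: the toy_entity struct, the convention that a larger
prio value means a higher priority, the eligibility rule (virtual start
time not after the current virtual time) and the NULL-candidate case
are all hypothetical. Plain comparisons are used, so timestamp
wraparound is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified scheduling entity. */
struct toy_entity {
	int prio; /* assumed: larger value = higher priority */
	unsigned long long start; /* virtual start time */
	unsigned long long finish; /* virtual finish time */
};

/* Assumed eligibility rule: the virtual start time is not in the future. */
static bool toy_eligible(const struct toy_entity *e,
			 unsigned long long vtime)
{
	return e->start <= vtime;
}

/*
 * Decide whether a just-activated entity should replace the cached
 * next_in_service candidate, without any lookup in the active trees.
 */
static bool toy_replaces_next(const struct toy_entity *new_entity,
			      const struct toy_entity *next,
			      unsigned long long vtime)
{
	if (next == NULL) /* assumption: no candidate cached yet */
		return true;
	if (new_entity->prio > next->prio)
		return true;
	return new_entity->prio == next->prio &&
	       toy_eligible(new_entity, vtime) &&
	       new_entity->finish < next->finish;
}
```

The function performs a constant number of comparisons, which is the
point of the optimization: an activation no longer requires a full
lookup to keep the cached candidate up to date.
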
BFQ-v8r5
. DOCUMENTATION IMPROVEMENT Added documentation of BFQ benefits, inner
workings, interface and tunables.
. BUGFIX: Replaced max wrongly used for modulo numbers.
. DOCUMENTATION IMPROVEMENT Improved help message in Kconfig.iosched.
. BUGFIX: Removed wrong conversion in use of bfq_fifo_expire.
. CODE IMPROVEMENT Added parentheses to complex macros.
v8r4
. BUGFIX The function bfq_find_set_group may return a NULL pointer,
which happened not to be properly handled in the function
__bfq_bic_change_cgroup. This fix handles this case. Contributed by
Lee Tibbert.
. BUGFIX Fix recovery of lost service for soft real-time
applications. This recovery is important for soft real-time
applications to continue enjoying proper weight raising even if their
service happens to be delayed for a while. Contributed by Luca
Miccio.
. BUGFIX Fix handling of wait_request state. The semantics of
hrtimers makes the following assumption false after invoking
hrtimer_try_to_cancel: that the timer is no longer
active. Unfortunately this assumption was used in the previous
version of the code. This change lets code comply with the new
semantics.
. IMPROVEMENT Improve the peak-rate estimator. This change is a
complete rewrite of the peak-rate estimation algorithm. It is both
an improvement and a simplification: in particular it replaces the
previous algorithm for estimating the peak rate, which was less
effective, less stable and less clear. The previous algorithm
approximated the service rate using the individual dispatch rates
observed during the service slots of queues. As such, it took into
account only individual queue workloads, and only over rather short
time intervals. The new algorithm considers the global workload served
by the device, and computes the peak rate over much larger time
intervals. This makes the new algorithm far more effective
with queueing devices and, in general, with devices with a
fluctuating bandwidth, either physical or virtual.
. IMPROVEMENT Force the device to serve one request at a time if
strict_guarantees is true. Forcing this service scheme is currently
the ONLY way to guarantee that the request service order enforced by
the scheduler is respected by a queueing device. Otherwise the
device is free even to make some unlucky request wait for as long as
the device wishes.
Of course, serving one request at a time may cause loss of throughput.
. IMPROVEMENT Let weight raising start for a soft real-time
application even while the application is still enjoying
weight-raising for interactive tasks. This allows soft real-time
applications to start enjoying the benefits of their special weight
raising as soon as possible.
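
The idea behind the rewritten peak-rate estimator in the v8r4 entry
above (sample the whole device rather than single queues, and only
over long enough intervals) can be illustrated with the C sketch
below. It is not the BFQ algorithm: the names, the 100 ms minimum
window and the choice of keeping the maximum observed rate are all
assumptions made for this example. For simplicity it also ignores
overflow of the intermediate product for very large sector counts.

```c
#include <stdint.h>

/* Hypothetical minimum observation window: 100 ms, in nanoseconds. */
#define TOY_MIN_WINDOW_NS 100000000ULL

/* Device-wide sampler: it does not distinguish between queues. */
struct toy_rate_sampler {
	uint64_t window_start_ns; /* when the current window began */
	uint64_t sectors; /* sectors dispatched in the window */
	uint64_t peak_rate; /* highest rate seen, in sectors/s */
};

/*
 * Record one dispatch. The rate is recomputed only once the window
 * is long enough, so that short bursts do not dominate the estimate.
 * Keeping the maximum is an assumption made for this sketch.
 */
static void toy_sample_dispatch(struct toy_rate_sampler *s,
				uint64_t now_ns, uint64_t nr_sectors)
{
	uint64_t elapsed = now_ns - s->window_start_ns;
	uint64_t rate;

	s->sectors += nr_sectors;
	if (elapsed < TOY_MIN_WINDOW_NS)
		return;

	rate = s->sectors * 1000000000ULL / elapsed;
	if (rate > s->peak_rate)
		s->peak_rate = rate;

	/* Start a new window. */
	s->window_start_ns = now_ns;
	s->sectors = 0;
}
```
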
v8r3
. BUGFIX Update weight-raising coefficient when switching from
interactive to soft real-time.
v8r2
. BUGFIX Removed variables that are not used if tracing is
disabled. Reported by Lee Tibbert <lee.tibbert@gmail.com>
. IMPROVEMENT Ported commit ae11889636: turned blkg_lookup_create into
blkg_lookup. As a side benefit, this finally enables BFQ to be used
as a module even with full hierarchical support.
v8r1
. BUGFIX Fixed incorrect invariant check
. IMPROVEMENT Prioritized soft real-time applications over
interactive ones, to guarantee a lower and more stable latency to
the former
v8
. BUGFIX: Fixed incorrect rcu locking in bfq_bic_update_cgroup
. BUGFIX Fixed a few cgroups-related bugs, causing sporadic crashes
. BUGFIX Fixed wrong computation of queue weights as a function of ioprios
. BUGFIX Fixed wrong Kconfig.iosched dependency for BFQ_GROUP_IOSCHED
. IMPROVEMENT Preemption-based, idle-less service guarantees. If
several processes are competing for the device at the same time, but
all processes and groups have the same weight, then the mechanism
introduced by this improvement enables BFQ to guarantee the expected
throughput distribution without ever idling the device. Throughput
is then much higher in this common scenario.
. IMPROVEMENT Made burst handling more robust
. IMPROVEMENT Reduced false positives in EQM
. IMPROVEMENT Let queues preserve weight-raising also when shared
. IMPROVEMENT Improved peak-rate estimation and autotuning of the
parameters related to the device rate
. IMPROVEMENT Improved the weight-raising mechanism so as to further
reduce latency and to increase robustness
. IMPROVEMENT Added a strict-guarantees tunable. If this tunable is
set, then device-idling is forced whenever needed to provide
accurate service guarantees. CAVEAT: idling unconditionally may even
increase latencies, in case of processes that did stop doing I/O.
. IMPROVEMENT Improved handling of async (write) I/O requests
. IMPROVEMENT Ported several good CFQ commits
. CHANGE Changed default group weight to 100
. CODE IMPROVEMENT Refactored I/O-request-insertion code
v7r11:
. BUGFIX Remove the group_list data structure, which ended up in an
inconsistent state if BFQ happened to be activated for some device
when some blkio groups already existed (these groups were not added
to the list). The blkg list for the request queue is now used where
the removed group_list was used.
. BUGFIX Init and reset also dead_stats.
. BUGFIX Added, in __bfq_deactivate_entity, the correct handling of the
case where the entity to deactivate has not yet been activated at all.
. BUGFIX Added missing free of the root group for the case where full
hierarchical support is not activated.
. IMPROVEMENT Removed the now useless bfq_disconnect_groups
function. The same functionality is achieved through multiple
invocations of bfq_pd_offline (which are in their turn guaranteed to
be executed, when needed, by the blk-cgroups code).
v7r10 <VERSION RETIRED, BECAUSE OF THE BUGS FIXED IN v7r11!!!>:
. BUGFIX: Fixed wrong check on whether cooperating processes belong
to the same cgroup.
v7r9:
. IMPROVEMENT: Changed BFQ to use the blkio controller instead of its
own controller. BFQ now registers itself as a policy to the blkio
controller and implements its hierarchical scheduling support using
data structures that already exist in blk-cgroup. The bfqio
controller's code is completely removed.
. CODE IMPROVEMENTS: Applied all suggestions from Tejun Heo, received
on the last submission to lkml: https://lkml.org/lkml/2014/5/27/314.
v7r8:
. BUGFIX: Let weight-related fields of a bfq_entity be correctly initialized
(also) when the I/O priority of the entity is changed before the first
request is inserted into the bfq_queue associated to the entity.
. BUGFIX: When merging requests belonging to different bfq_queues, avoid
repositioning the surviving request. In fact, in this case the repositioning
may result in the surviving request being moved across bfq_queues, which
would ultimately cause bfq_queues' data structures to become inconsistent.
. BUGFIX: When merging requests belonging to the same bfq_queue, reposition
the surviving request so that it gets in the correct position, namely the
position of the dropped request, instead of always being moved to the head
of the FIFO of the bfq_queue (which means to let the request be considered
the eldest one).
. BUGFIX: Reduce the idling slice for seeky queues only if the scenario is
symmetric. This guarantees that also processes associated to seeky queues
do receive their reserved share of the throughput.
Contributed by Riccardo Pizzetti and Samuele Zecchini.
. IMPROVEMENT: Always perform device idling if the scenario is asymmetric in
terms of throughput distribution among processes.
This extends throughput-distribution guarantees to any process, regardless
of the properties of its request pattern and of the request patterns of the
other processes, and regardless of whether the device is NCQ-capable.
. IMPROVEMENT: Remove the current limitation on the maximum number of in-flight
requests allowed for a sync queue (limitation set in place for fairness
issues in CFQ, inherited by the first version of BFQ, but made unnecessary
by the latest accurate fairness strategies added to BFQ). Removing this
limitation enables devices with long internal queues to fill their queues
as much as they deem appropriate, also with sync requests. This avoids
throughput losses on these devices, because, to achieve a high throughput,
they often need to have a high number of requests queued internally.
. CODE IMPROVEMENT: Simplify I/O priority change logic by turning it into a
single-step procedure instead of a two-step one; improve readability by
rethinking the names of the functions involved in changing the I/O priority
of a bfq_queue.
v7r7:
. BUGFIX: Prevent the OOM queue from being involved in the queue
cooperation mechanism. In fact, since the requests temporarily
redirected to the OOM queue could be redirected again to dedicated
queues at any time, the state needed to correctly handle merging
with the OOM queue would be quite complex and expensive to
maintain. Besides, in a condition as critical as running out of
memory, the benefits of queue merging may be of little relevance, or
even negligible.
. IMPROVEMENT: Let the OOM queue be initialized only once. Previously,
the OOM queue was reinitialized, at each request enqueue, with the
parameters related to the process that issued that request.
Depending on the parameters of the processes doing I/O, this could
easily cause the OOM queue to be moved continuously across service
trees, or even across groups. It also caused the parameters of the
OOM queue to be continuously reset in any case.
. CODE IMPROVEMENT. Performed some minor code cleanups, and added some
BUG_ON()s that, if the weight of an entity becomes inconsistent,
should better help understand why.
v7r6:
. IMPROVEMENT: Introduced a new mechanism that helps get the job done
more quickly with services and applications that create or reactivate
many parallel I/O-bound processes. This is the case, for example, with
systemd at boot, or with commands like git grep.
. CODE IMPROVEMENTS: Small code cleanups and improvements.
v7r5:
. IMPROVEMENT: Improve throughput boosting by idling the device
only for processes that, in addition to performing sequential I/O,
are I/O-bound (apart from weight-raised queues, for which idling
is always performed to guarantee them a low latency).
. IMPROVEMENT: Improve throughput boosting by depriving processes
that cooperate often of weight-raising.
. CODE IMPROVEMENT: Improvement pass on the readability of both
comments and actual code.
v7r4:
. BUGFIX. Modified the code so as to be robust against late detection of
NCQ support for a rotational device.
. BUGFIX. Removed a bug that hindered the correct throughput distribution
on flash-based devices when not every process had to receive the same
fraction of the throughput. This fix entailed also a little efficiency
improvement, because it implied the removal of a short function executed
in a hot path.
. CODESTYLE IMPROVEMENT: removed quoted strings split across lines.
v7r3:
. IMPROVEMENT: Improved throughput boosting with NCQ-capable HDDs and
random workloads. The mechanism that further boosts throughput with
these devices and workloads is activated only in the cases where it
does not cause any violation of throughput-distribution and latency
guarantees.
. IMPROVEMENT: Generalized the computation of the parameters of the
low-latency heuristic for interactive applications, so as to fit also
slower storage devices. The purpose of this improvement is to preserve
low-latency guarantees for interactive applications also on slower
devices, such as portable hard disks, multimedia and SD cards.
. BUGFIX: Re-added MODULE_LICENSE macro.
. CODE IMPROVEMENTS: Small code cleanups; introduced a coherent naming
scheme for all identifiers related to weight raising; refactored and
optimized a few hot paths.
v7r2:
. BUGFIX/IMPROVEMENT. One of the requirements for an application to be
deemed as soft real-time is that it issues its requests in batches, and
stops doing I/O for a well-defined amount of time before issuing a new
batch. Imposing this minimum idle time allows BFQ to filter out I/O-bound
applications that may otherwise be incorrectly deemed as soft real-time
(under the circumstances described in detail in the comments to the
function bfq_bfqq_softrt_next_start()). Unfortunately, BFQ could
start counting this idle time from two different events: either from the
expiration of the queue, if all requests of the queue had also been already
completed when the queue expired, or, if the previous condition did not
hold, from the first completion of one of the still outstanding requests.
In the second case, an application had more chances to be deemed as soft
real-time.
Actually, there was no reason for this differentiated treatment. We
addressed this issue by defining more precisely the above requirement for
an application to be deemed as soft real-time, and changing the code
consequently: a well-defined amount of time must elapse between the
completion of *all the requests* of the current pending batch and the
issuing of the first request of the next batch (this is, in the end, what
happens with a true soft real-time application). This change further
reduced false positives, and, as such, improved responsiveness and reduced
latency for actual soft real-time applications.
. CODE IMPROVEMENT. We cleaned up the code a little bit and addressed
some issues pointed out by the checkpatch.pl script.
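
The refined soft real-time requirement in the v7r2 entry above (idle
time is counted from the completion of all the requests of the pending
batch) can be sketched in C as follows. This is only an illustration
under assumptions: the toy_batch_tracker struct and its functions are
hypothetical and do not mirror the BFQ data structures, and time is a
plain tick counter compared through unsigned subtraction.

```c
#include <stdbool.h>

/* Hypothetical per-application bookkeeping. */
struct toy_batch_tracker {
	unsigned int outstanding; /* requests issued but not completed */
	unsigned long batch_done_at; /* when the last batch fully completed */
};

static void toy_request_issued(struct toy_batch_tracker *t)
{
	t->outstanding++;
}

/* The idle period starts only when the LAST outstanding request completes. */
static void toy_request_completed(struct toy_batch_tracker *t,
				  unsigned long now)
{
	if (t->outstanding > 0 && --t->outstanding == 0)
		t->batch_done_at = now;
}

/*
 * To be called when the first request of a new batch arrives: true if
 * at least min_idle ticks elapsed since the whole previous batch
 * completed.
 */
static bool toy_idle_long_enough(const struct toy_batch_tracker *t,
				 unsigned long now, unsigned long min_idle)
{
	if (t->outstanding > 0)
		return false;
	return now - t->batch_done_at >= min_idle;
}
```

With a single starting event, an application can no longer gain extra
chances of being deemed soft real-time just because some of its
requests were still in flight when its queue expired.
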
v7r1:
. BUGFIX. Replace the old value used to approximate 'infinity', with
the correct one to use in case times are compared through the macro
time_is_before_jiffies(). In fact, this macro, designed to take
wraparound issues into account, easily returns anomalous results if
its argument is equal to the value that we used as an approximation
of 'infinity', namely ((unsigned long) (-1)). The consequence was
that the logical expression used to determine whether a queue
belongs to a soft real-time application often yielded an incorrect
result. In the end, some application happened to be incorrectly
deemed as soft real-time and hence weight-raised. This affected both
throughput and latency guarantees.
. BUGFIX. Fixed a scrivener's error made in an attempt to use the
above macro in a logical expression.
. IMPROVEMENT/BUGFIX. On the expiration of a queue, use a more general
condition to allow a weight-raising period to start if the queue is
soft real-time. The previous condition could prevent an empty,
soft-real time queue from being correctly deemed as soft real-time.
. IMPROVEMENT/MINOR BUGFIX. Use jiffies-comparison macros also in the
following cases:
  . to establish whether an application initially deemed as interactive
    is now meeting the requirements for being classified as soft
    real-time;
  . to determine if a weight-raising period must be ended.
. CODE IMPROVEMENT. Change the type of the time quantities used in the
weight-raising heuristics to unsigned long, as the type of the time
(jiffies) is unsigned long.
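
The anomaly described in the first v7r1 BUGFIX entry above can be
reproduced in user space with the C sketch below. The comparison
follows the signed-difference idiom used by the kernel's time_after()
macro, and toy_time_is_before() plays the role of
time_is_before_jiffies() with the current time passed explicitly. The
function names are hypothetical, and toy_far_future() is just one
possible wraparound-safe stand-in for 'infinity', assumed here for
illustration rather than taken from this changelog.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Wraparound-aware comparison in the style of the kernel's
 * time_after(a, b): true if time a is after time b.
 */
static bool toy_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Analogue of time_is_before_jiffies(t), with "now" passed explicitly. */
static bool toy_time_is_before(unsigned long t, unsigned long now)
{
	return toy_time_after(now, t);
}

/*
 * Hypothetical replacement for 'infinity': half of the counter range
 * ahead of now, which is the farthest point the signed difference
 * still reports as being in the future.
 */
static unsigned long toy_far_future(unsigned long now)
{
	return now + ULONG_MAX / 2;
}
```

With now equal to 5, toy_time_is_before((unsigned long)-1, 5) is true:
the signed difference is negative, so the supposed 'infinity' is
treated as a time already in the past. The value returned by
toy_far_future(5) is instead reported as not yet reached.
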
v7:
- IMPROVEMENT: In the presence of weight-raised queues and if the
device is NCQ-enabled, device idling is now disabled for non-raised
readers, i.e., for their associated sync queues. Hence a sync queue
is expired immediately if it becomes empty, and a new queue is
served. As explained in detail in the papers about BFQ, not idling
the device for sync queues when the latter become empty causes BFQ to
assign higher timestamps to these queues when they get backlogged
again, and hence to serve these queues less frequently. This fact,
plus the fact that, because of the immediate expiration itself,
these queues get less service while they are granted access to the
disk, reduces the relative rate at which the processes associated to
these queues ask for requests from the I/O request pool. If the pool
is saturated, as it happens in the presence of write hogs, reducing
the above relative rate increases the probability that a request is
available (soon) in the pool when a weight-raised process needs it.
This change does seem to mitigate the typical starvation problems
that occur in the presence of write hogs and NCQ, and hence to
guarantee a higher application and system responsiveness in these
hostile scenarios.
- IMPROVEMENT/BUGFIX: Introduced a new classification rule to the soft
real-time heuristic, which takes into account also the isochronous
nature of such applications. The computation of next_start has been
fixed as well. Now it is correctly done from the time of the last
transition from idle to backlogged; the next_start is therefore
computed from the service received by the queue from its last
transition from idle to backlogged. Finally, the code which
preserved weight-raising for a soft real-time queue even with no
idle->backlogged transition has been removed.
- IMPROVEMENT: Add a few jiffies to the reference time interval used to
establish whether an application is greedy or not. This reference
interval was, by default, HZ/125 seconds, which could generate false
positives in the following two cases (especially if both cases occur):
1) If HZ is so low that the duration of a jiffy is comparable to or
higher than the above reference time interval. This happens, e.g.,
on slow devices with HZ=100.
2) If jiffies, instead of increasing at a constant rate, may stop
increasing for some time, then suddenly 'jump' by several units to
recover the lost increments. This seems to happen, e.g., in virtual
machines.
The added number of jiffies has been found experimentally. In particular,
according to our experiments, adding this number of jiffies seems to make
the filter quite precise also in embedded systems and KVM/QEMU virtual
machines. Also contributed by
Alexander Spyridakis <a.spyridakis@virtualopensystems.com>.
- IMPROVEMENT/BUGFIX: Keep disk idling also for NCQ-capable
rotational devices, which boosts the throughput on these
devices.
- BUGFIX: The budget-timeout condition in the bfq_rq_enqueued() function
was checked only if the request was large enough to provoke an unplug. As
a consequence, for a process always issuing small I/O requests the
budget timeout was never checked. The queue associated to the process
therefore expired only when its budget was exhausted, even if the
queue had already incurred a budget timeout for a while.
This fix lets a queue be checked for budget timeout at each request
enqueue, and, if needed, expires the queue accordingly even if the
request is small.
- BUGFIX: Make sure that weight-raising is resumed for a split queue,
if it was merged when already weight-raised.
- MINOR BUGFIX: Let bfq_end_raising_async() correctly end weight-raising
also for the queues belonging to the root group.
- IMPROVEMENT: Get rid of the some_coop_idle flag, which in its turn
was used to decide whether to disable idling for an in-service
shared queue whose seek mean decreased. In fact, disabling idling
for such a queue turned out to be useless.
- CODE IMPROVEMENT: The bfq_bfqq_must_idle() function and the
bfq_select_queue() function may not change the current in-service
queue in various cases. We have cleaned up the involved conditions,
by factoring out the common parts and getting rid of the useless
ones.
- MINOR CODE IMPROVEMENT: The idle_for_long_time condition in the
bfq_add_rq_rb() function should be evaluated only on an
idle->backlogged transition. Now the condition is set to false
by default, evaluating it only if the queue was not busy on a
request insertion.
- MINOR CODE IMPROVEMENT: Added a comment describing the rationale
behind the condition evaluated in the function
bfq_bfqq_must_not_expire().
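
To see why case 1) in the v7 entry on the greedy-application filter
produces false positives, the C sketch below computes the reference
interval with integer arithmetic. It assumes that HZ/125 is evaluated
as a number of jiffies with truncating division, as C does for
integers. The function name and the extra_jiffies parameter are
hypothetical, and the margin actually added is not stated in this
changelog, so it is left as a parameter.

```c
/*
 * Reference interval, in jiffies, for a kernel configured with hz
 * ticks per second, plus a hypothetical safety margin. Integer
 * division truncates.
 */
static unsigned long toy_greedy_ref_interval(unsigned long hz,
					     unsigned long extra_jiffies)
{
	return hz / 125 + extra_jiffies;
}
```

With hz equal to 1000 and no margin the interval is 8 jiffies, but with
hz equal to 100 it truncates to 0, so any measurable gap between
requests would exceed it. A margin of a few jiffies restores a nonzero
threshold at low HZ values.
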
v6r2:
- Fairness fix: the case of queue expiration for budget timeout is
now correctly handled also for sync queues, thus allowing also
the processes corresponding to these queues to be guaranteed their
reserved share of the disk throughput.
- Fixed a bug that prevented group weights from being correctly
set via the sysfs interface.
- Fixed a bug that cleared a previously-set group weight if the
same value was re-inserted via the sysfs interface.
- Fixed an EQM bug that allowed a newly-started process to skip
its initial weight-raising period if its queue was merged before
its first request was inserted.
- Fixed a bug that preserved already-started weight-raising periods
even if the low_latency tunable was disabled.
- The raising_max_time tunable now shows, in a more user-friendly way, the
maximum raising time in milliseconds.
v6r1:
- Fix use-after-free of queues in __bfq_bfqq_expire(). It may happen that
a call to bfq_del_bfqq_busy() puts the last reference taken on a queue
and frees it. Subsequent accesses to that same queue would result in a
use-after-free. Make sure that a queue that has just been deleted from
busy is no longer touched.
- Use the uninitialized_var() macro when needed. It may happen that a
variable is initialized in a function that is called by the function
that defined it. Use the uninitialized_var() macro in these cases.
v6:
- Replacement of the cooperating-queue merging mechanism borrowed from
CFQ with Early Queue Merge (EQM), a unified mechanism to get a
sequential read pattern, and hence a high throughput, with any set of
processes performing interleaved I/O. EQM also preserves low latency.
(see http://algogroup.unimore.it/people/paolo/disk_sched/description.php
for more details). Contributed by Mauro Andreolini and Arianna Avanzini.
The code for detecting whether two queues have to be merged is a
slightly modified version of the CFQ code for detecting whether two
queues belong to cooperating processes and whether the service of a
queue should be preempted to boost the throughput.
- Fix a bug that caused the peak rate of a disk to be computed as zero
in case of multiple I/O errors. Subsequent estimations of the weight
raising duration caused a division-by-zero error.
v5r1:
- BUG FIX: Fixed stall occurring when the active queue is moved to
a different group while idling (this caused the idling timer to be
cancelled and hence no new queue to be selected, and no new
request to be dispatched).
- BUG FIX: Fixed wrong assignment of too high budgets to queues during
the first few seconds after initialization.
- BUG FIX: Added proper locking to the function handling the "weights"
tunable.
v5:
- Added a heuristic that, if the tunable raising_max_time is set to
0, automatically computes the duration of the weight raising
according to the estimated peak rate of the device. This enables
flash-based devices to reach maximum throughput as soon as possible,
without sacrificing latency.
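
The v5 heuristic above, together with the division-by-zero fix listed
under v6, can be pictured with the following C sketch. It is not the
BFQ code: the function name, the reference rate and reference duration
parameters, and the inverse-proportional formula are assumptions made
for this example. Only the two behaviours taken from the changelog are
fixed: a raising_max_time of 0 selects the automatic computation, and a
peak-rate estimate of zero must never be used as a divisor.

```c
#include <stdint.h>

/*
 * Pick the weight-raising duration.
 * - A nonzero tunable is used as is.
 * - Otherwise the duration shrinks as the device gets faster:
 *   ref_duration * ref_rate / peak_rate (assumed formula).
 * - A zero peak-rate estimate falls back to ref_duration instead of
 *   dividing by zero.
 */
static uint64_t toy_wr_duration(uint64_t raising_max_time,
				uint64_t peak_rate,
				uint64_t ref_rate,
				uint64_t ref_duration)
{
	if (raising_max_time != 0)
		return raising_max_time;
	if (peak_rate == 0)
		return ref_duration;
	return ref_duration * ref_rate / peak_rate;
}
```

Under these assumptions, a device twice as fast as the reference one
gets half the reference duration, which matches the stated goal of
letting fast devices return to maximum throughput sooner.
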
v4:
- Throughput-boosting for flash-based devices: improved version of commits
a68bbdd and f7d7b7a, which boosts the throughput while still preserving
latency guarantees for interactive and soft real-time applications.
- Better identification of NCQ-capable disks: port of commit e459dd0.
v3-r4:
- Bugfixes
  * Removed an important memory leak: under some circumstances the process
    references to a queue were not decremented correctly, which prevented
    unused shared bfq_queues from being correctly deallocated.
  * Fixed various errors related to hierarchical scheduling:
    * Removed an error causing tasks to be attached to the bfqio cgroup
      controller even when BFQ was not the active scheduler
    * Corrected wrong update of the budgets from the leaf to the root upon
      forced selection of a service tree or a bfq_queue
    * Fixed the way active leaf entities are moved to the root group before
      the group entity is deactivated when a cgroup is destroyed
- Throughput-boosting improvement for cooperating queues: close detection is now based
on a fixed threshold instead of the queue's average seek. This is a port of one of
the changes in the CFQ commit 3dde36d by Corrado Zoccolo.
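
The fixed-threshold close detection mentioned in the v3-r4 entry above
amounts to a distance test between two positions on the device. A
minimal C sketch follows. The function name is hypothetical and the
threshold is passed as a parameter, because its actual value is not
given in this changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if two sector positions are within a fixed distance of each other. */
static bool toy_positions_close(uint64_t pos_a, uint64_t pos_b,
				uint64_t threshold)
{
	uint64_t dist = pos_a > pos_b ? pos_a - pos_b : pos_b - pos_a;

	return dist <= threshold;
}
```

Unlike a test based on a queue's average seek distance, the outcome
here does not depend on the past behaviour of either queue.
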
v3-r3:
- Bugfix: removed an important error causing occasional kernel panics when
moving a process to a new cgroup. The panic occurred if:
1) the queue associated to the process was idle when the process was moved
and
2) a new disk request was inserted into the queue just after the move.
- Further latency improvement through a better treatment of low-bandwidth
async queues.
v3-r2:
- Bugfix: added a forgotten condition that prevents weights of low-bw async
queues from being raised when low_latency is off.
- Latency improvement: low-bw async queues are now better identified.
v3-r1:
- Fixed an important request-dispatch bug causing occasional IO hangs.
- Added a new mechanism to reduce the latency of low-bw async queues.
This also reduces the latency of the sync queues synchronized with
the above async queues.
- Fixed a minor bug in iocontext locking (port of commits 9b50902 and 3181faa
from CFQ).
v3:
- Improved low-latency mechanisms, including a more accurate criterion to
distinguish between greedy-but-seeky and soft real-time applications.
Interactive applications now enjoy noticeably lower latencies.
- Switch to the simpler one-request-dispatch-at-a-time scheme as in CFQ.
- Ported cooperating-queues merging from CFQ (6d048f5, 1afba04,
d9e7620, a36e71f, 04dc6e7, 26a2ac0, 3ac6c9f, f2d1f0a, 83096eb,
2e46e8b, df5fe3e, b3b6d04, e6c5bc7, c0324a0, f04a642, 8682e1f,
b9d8f4c, 2f7a2d8, ae54abe, e9ce335, 39c01b2, d02a2c0, c10b61f).
Contributed by Arianna Avanzini. Queues of processes performing IO
on interleaved, yet contiguous disk zones are merged to boost the
throughput. A few small optimizations to get a more stable throughput
have been added to the original CFQ version.
- Added a static fallback queue for extreme OOM conditions (port of
CFQ commits d5036d7, 6118b70, b706f64, 32f2e80). Port contributed by
Francesco Allertsen.
- Ported CFQ commits b0b78f8, 40bb54d, 30996f4, dddb745, ad5ebd2, cf7c25c;
mainly code cleanup and fix of minor bugs. Port contributed by
Francesco Allertsen.
v2:
- An issue that may cause a small throughput loss on fast disks has been solved.
BFQ-v1 and CFQ may suffer from this problem.
- The disk-idling timeout has been better tuned to further reduce latency
(especially for the idle- or lightly-loaded-disk scenarios).
- One of the parameters of the low-latency heuristics has been tuned a little
bit more, so as to reduce the probability that a disk-bound process may
hamper the reduction of the latency of interactive and soft real-time
applications.
- Same low-latency guarantees with and without NCQ.
- Latency for interactive applications about halved with respect to BFQ-v1.
- When the low_latency tunable is set, soft real-time applications
now also enjoy reduced latency.
- A very small minimum bandwidth is now guaranteed to the
Idle IO-scheduling class even when the other classes are
backlogged, just to prevent its processes from starving.
v1:
This is a new version of BFQ with respect to the versions you can
find on Fabio's site: http://feanor.sssup.it/~fabio/linux/bfq.
Here is what we changed with respect to the previous versions:
1) re-tuned the budget feedback mechanism: it is now slightly more
biased toward assigning high budgets, to boost the aggregated
throughput more, and more quickly as new processes are started
2) introduced more tolerance toward seeky queues (I verified that the
phenomena described below used to occur systematically):
2a: if a queue is expired after having received very little
service, then it is not punished as a seeky queue, even if it
happened to consume that little service too slowly; the
rationale is that, if the new active queue has been served for
too short a time interval, then its possible sequential
accesses may not yet outweigh the initial latencies for
moving the disk head to the first sector requested
2b: the waiting time (disk idling) of a queue detected as seeky as
a function of the position of the requests it issued is reduced
to a very low value only after the queue has consumed a minimum
fraction of the assigned budget; this prevents processes
generating (partly) seeky workloads from being too ill-treated
2c: if a queue has consumed 'enough' budget upon a budget timeout, then,
even if it did not consume all of its budget, that queue is not punished
as a seeky queue; the rationale is that, depending on the disk zones,
a queue may be served at a lower rate than the estimated peak rate.
Changes 2a and 2b have been critical in lowering latencies, whereas
change 2c, in addition to change 1, helped considerably to increase the
disk throughput.
3) slightly changed the peak rate estimator: a low-pass filter is now
used instead of just keeping the highest rate sampled; the rationale
is that the peak rate of a disk should be quite stable, so the filter
should converge more or less smoothly to the right value; it seemed to
correctly catch the peak rate with all disks we used
4) added the low latency mechanism described in detail in
http://algogroup.unimore.it/people/paolo/disk_sched/description.php.
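
The change to the peak rate estimator in item 3 can be illustrated with a
first-order low-pass filter. This is a minimal sketch in Python, not the
kernel code: the function names, the integer arithmetic and the filter
weight of 8 are assumptions made for the example.

```python
def update_peak_rate_max(current_estimate, sample):
    """Previous approach per the changelog: keep the highest rate sampled."""
    return max(current_estimate, sample)


def update_peak_rate_filtered(current_estimate, sample, weight=8):
    """First-order low-pass filter: the new estimate moves 1/weight of the
    way from the current estimate toward the new sample."""
    return ((weight - 1) * current_estimate + sample) // weight


def run(update, samples, start):
    """Feed a sequence of rate samples through an update rule."""
    estimate = start
    for sample in samples:
        estimate = update(estimate, sample)
    return estimate
```

With a single outlier among otherwise stable samples, the maximum-based rule
keeps the outlier forever, while the filtered estimate drifts back toward the
stable value. That is the smooth convergence the rationale in item 3 relies
on.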