Re: Pluggable Disk Scheduler Project



From: Karsten Behrmann <BearPerson@xxxxxxx>
Date: Tue, Oct 16, 2007 04:10:37PM +0200

Hi,
is anybody working on the `Pluggable Disk Scheduler Project' from
the ideas page?
I've been kicking the idea around in my head, but I'm probably newer to
everything involved than you are, so feel free to pick it up. If you want,
we can toss some ideas and code to each other, though I don't really
have anything on the latter.

Thank you for your answer, I'd really like to work/discuss with you
and anyone else interested in this project.


o Where is the right place (in GEOM) for a disk scheduler?
I have spent some time at eurobsdcon talking to Kirk and phk about
this, and the result was that I now know strong proponents both for
putting it into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on
this later, but basically, geom seems a better fit for "high-level"
code than a device driver, and if done properly performance penalties
should be negligible.


I'm pretty interested even in the arguments for the opposite solution;
I've started from GEOM because it seemed to be a) what was proposed/
requested on the ideas page, and b) cleaner at least for a prototype.
I wanted to start with some code also to evaluate the performance
penalties of that approach.

I am a little bit scared from the perspective of changing the
queueing mechanisms that drivers use, as this kind of modifications
can be difficult to write, test and maintain, but I'd really like
to know what people with experience in those kernel areas think
about the possibility of doing more complex io scheduling, with
some sort of unified interface, at this level.

As a side note, by now I've not posted any performance number because
at the moment I've only access to old ata drives that would not
give significative results.


o How can anticipation be introduced into the GEOM framework?
I wouldn't focus on just anticipation, but also other types of
schedulers (I/O scheduling influenced by nice value?)


That would be interesting, especially for the background fsck case.
I think that some kind of fair sharing approach should be used; as
you say below a priority driven scheduler can have relations with
the VFS that are difficult to track. (This problem was pointed out
also in one of the follow-ups to [1].)


o How to deal with metadata requests and other VFS issues?
Like any other disk request, though for priority-respecting
schedulers this may get rather interesting.

[...]
The main idea is to allow the scheduler to enqueue the requests
having only one (other small fixed numbers can be better on some
hardware) outstanding request and to pass new requests to its
provider only after the service of the previous one ended.
You'll want to queue at least two requests at once. The reason for
this is performance:
Currently, drivers queue their own I/O. This means that as soon
as a request completes (on devices that don't have in-device
queues), they can fairly quickly grab a new request from their
internal queue and push it back to the device from the interrupt
handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast
dispatch every two requests,) that the driver queue is always of
length two or more? In this way you ask, to refill the driver
queue, the upper scheduler to dispatch a new request every time a
request is taken from the driver queue and sent to the disk, or
in any other moment before the request under service is completed.
In this way you cannot have an anticipation mechanism, because
the next request you'll want to dispatch from the upper scheduler
has not yet been issued (it will be only after the one being served
is completed and after the userspace application restarts.)


Having the device idle while the response percolates up the geom
stack and a new request down will likely be rather wasteful.

I completely agree on that. I've only done in this way because it
was the less intrusive option I could find. What can be other more
efficient alternatives? (Obviously without subverting any of the
existing interfaces, and allowing the anticipation of requests.)


For disks with queuing, I'd recommend trying to keep the queue
reasonably full (unless the queuing strategy says otherwise),
for disks without queuing I'd say we want to push at least one
more request down. Personally, I think the sanest design would
be to have device drivers return a "temporary I/O error" along
the lines of EAGAIN, meaning their queue is full.


This can be easily done for asynchronous requests. The troubles
arise when dealing with synchronous requests... i.e., if you dispatch
more than one synchronous request you are serving more than one
process, and you have long seek times between the requests you have
dispatched. I think we should do (I'll do as soon as I have access
to some realistic test system) some measurement on real hardware,
looking at what happens when we allow multiple outstanding requests.


The example scheduler in the draft takes the following approach:

o a scheduling GEOM class is introduced. It can be stacked on
top of disk geoms, and schedules all the requests coming
from its consumers. I'm not absolutely sure that a new class
is really needed but I think that it can simplify testing and
experimenting with various solutions on the scheduler placement.
Probably, though we'll want to make sure that they stack on top of
(or are inside of?) the geoms talking to the disks, because it rarely
makes sense to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability. You would
be able to swap out a scheduler at runtime, select different sched-
ulers for different disks, and potentially even test new schedulers
without rebooting (though you wouldn't want to do that for benchmarks)


Ok.


o Requests coming from consumers are passed down immediately
if there is no other request under service, otherwise they
are queued in a bioq.
This is specific to the anticipatory scheduler. I would say in more
general terms:
- A queuing geom is to push all requests that it wants serviced down
towards the disk, until the disk reports a queue full. A queuing geom
is allowed to hold back requests even when the driver queue is not
full yet, if it does not want the disk to attempt such I/O yet (such
as the anticipatory scheduler waiting for another disk request near
the last one, or the process-priority scheduler holding back a low-
priority request that would potentially cause a long seek, until io
has been idle)
This dispels phk's anti-geom argument of "it will be inefficient
because it will take longer for a new request to get to the driver" -
if the queuing strategy had wanted the request to be sent to the
drive, it would already have sent it. (The exception is that the disk
will have its internal queue a little more empty than it could be -
not a big issue with queue sizes of 8 or 16)


Well, it will take longer for new requests to get to the driver,
because we are inserting a whole new geom (or a new layer into an
existing one) in the path of a request. Of course the benefits
should be in the time it takes to complete the request (or in other
properties of the service, such as fairness or something like bounded
service times.)

The mechanism you describe of an high level scheduler that fills
an underlying queue is somehow similar to what Hybrid did, with its
two-stage pipelined structure, but to the best of my knowledge it
did not work with synchronous requests, for the same problems
described above.


[...]
So, as I've said, I'd like to know what you think about the subject,
if I'm missing something, if there is some kind of interest on this
and if/how can this work proceed.
I would think that this would be quite useful, but I don't think my
voice counts for much ;-)
It would help
- servers where anticipatory performs better than elevator
- realtime environments that need a scheduler fitting their needs
- the background fsck, if someone implements a "priority" scheduler

Anyway, that's a quick dump of my thoughts on the subject so far,
I've myself wanted to get started on this but didn't get around to
it yet (I'm fairly new to FreeBSD).
If you want to hash some ideas out with me, I'll be watching my inbox,
this ML, and you can reach me on IRC as BearPerson on freenode,
quakenet, undernet, or whatever else you ask me to connect to, or via
whatever other method convenient to you ;)


Thank you very much, your help is very much appreciated. I'll try to
set up some kind of testing environment and share some result.


[1] http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html

_______________________________________________
freebsd-hackers@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@xxxxxxxxxxx"



Relevant Pages

  • [PATCH 01/28] io-controller: Documentation
    ... Block io priorities (in CFQ scheduler) ... +with cgroup stuff, it needs to be made hierarchical to achive a good ... They maintain only one queue where all ... +multiple io queues with-in group based on task context and maps the request ...
    (Linux-Kernel)
  • Re: Pluggable Disk Scheduler Project
    ... is anybody working on the `Pluggable Disk Scheduler Project' from ... I would put it into geom. ... hardware) outstanding request and to pass new requests to its ... experimenting with various solutions on the scheduler placement. ...
    (freebsd-hackers)
  • Re: [RFC] IO scheduler based IO controller V9
    ... Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7. ... That request generates a steady stream of IOs. ... I think because now related IO is in a single queue and it gets to run ... CFQ scheduler: ...
    (Linux-Kernel)
  • [PATCH 01/24] io-controller: Documentation
    ... Block io priorities (in CFQ scheduler) ... +with cgroup stuff, it needs to be made hierarchical to achive a good ... They maintain only one queue where all ... +is used to determine the cgroup of the request. ...
    (Linux-Kernel)
  • [PATCH 01/24] io-controller: Documentation
    ... Block io priorities (in CFQ scheduler) ... +with cgroup stuff, it needs to be made hierarchical to achive a good ... They maintain only one queue where all ... +is used to determine the cgroup of the request. ...
    (Linux-Kernel)