Reference-count devices

Matt Dillon has produced an initial patch that starts moving the device subsystem to a reference-counted model, along with other changes detailed in his lengthy post, which I’ve copied in below.

“This patch is of ‘alpha’ quality, I expect to commit it (with further
work) in about a week. I would appreciate testing so I can be confident
that I’m not going to blow up the system when I commit this stage,
but nobody should install this alpha patch on a production machine.


This patch starts to move our device subsystem to a reference-counted
model and accomplishes certain prerequisites for later work.

This patch:

* Simplifies device instantiation and resolution. makedev() has been
removed in favor of make_dev(), and udev2dev() will only create
new minor numbers, not new major numbers. The port dispatch is
now installed directly in the dev_t and the mess that figured out
what port to use has been removed. Finally, a d_clone function has
been added (replacing d_autoq which was never used). This function
is called whenever a new device structure is created, allowing
the device to populate device-specific fields when the device
structure is created rather than populating these fields at
device-open. This function will also (soon) allow the creation of
a device node to be vetoed.

Device fields are now sufficiently initialized when the device
structure is allocated such that we can remove NULL checks and other
special cases from the rest of the device path.

* Devices are created and searched for based on (devsw,major,minor)
instead of (major,minor). This allows us to overload device major
numbers. All such devices are accessible but only the devices
registered for userland access via cdevsw_add() are directly accessible.

All devices in the system that failed to call cdevsw_add() now call
it. If you fail to call cdevsw_add(), your device will not be
‘visible’ (hooked in) to userland via /dev.

* The disk subsystem, which provides partition table management and
translation, now overloads the device major supplied to it by the
raw disk device and creates its own device which is distinct from
the raw block device. The raw block device’s devsw is hidden from
userland (not registered with cdevsw_add()).

This means that the disk subsystem no longer needs to overload
fields in dev_t’s owned by the underlying raw block device which
in turn means that, theoretically, we can stack the partition manager
on top of any device that supports block operations.

This is a huge simplification over the ‘override’ mechanism that the
disk subsystem used before, which was not stackable.

* I’ve started ref-counting the dev_t and cdevsw structures. This
needs a lot more work, but it is a good start.

* struct buf’s b_dev field is now allowed to track the device through
translations. That is, when the disk layer translates an I/O for
execution by the underlying raw device it now sets b_dev to the
underlying raw device. This requires that b_dev always be initialized
prior to the initial dev_dstrategy() call. To enforce this behavior,
biodone() now sets b_dev to NODEV.
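
To make the (devsw,major,minor) lookup and the reference counting concrete, here is a minimal userland sketch. It is not code from the patch: every sk_ name, the label field, and the list-based lookup are hypothetical stand-ins, and the real dev_t and cdevsw structures carry far more state. The sketch only assumes what the notes above say: devices are keyed on the devsw as well as the numbers, a d_clone hook runs once when the structure is created, and both structures carry reference counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct sk_dev;

struct sk_devsw {
    const char *name;
    int refs;                             /* devices currently using this devsw */
    void (*d_clone)(struct sk_dev *dev);  /* called once, when a device is created */
};

struct sk_dev {
    struct sk_devsw *devsw;
    int major, minor;
    int refs;                             /* outstanding references to this device */
    const char *label;                    /* device-specific field set by d_clone */
    struct sk_dev *next;
};

struct sk_dev *sk_devlist;                /* all live devices */

/* Example d_clone hook: fill in device-specific state at creation time. */
void sk_clone_label(struct sk_dev *dev)
{
    dev->label = dev->devsw->name;
}

/* Two switches that deliberately share (overload) the same major number. */
struct sk_devsw sk_raw_devsw  = { "raw",  0, sk_clone_label };
struct sk_devsw sk_disk_devsw = { "disk", 0, sk_clone_label };

/* Find or create the device for (devsw, major, minor); returns a new reference. */
struct sk_dev *sk_make_dev(struct sk_devsw *devsw, int major, int minor)
{
    struct sk_dev *dev;

    for (dev = sk_devlist; dev != NULL; dev = dev->next) {
        if (dev->devsw == devsw && dev->major == major && dev->minor == minor) {
            dev->refs++;
            return dev;
        }
    }
    dev = calloc(1, sizeof(*dev));
    if (dev == NULL)
        return NULL;
    dev->devsw = devsw;
    dev->major = major;
    dev->minor = minor;
    dev->refs = 1;
    devsw->refs++;
    if (devsw->d_clone != NULL)
        devsw->d_clone(dev);              /* populate fields now, not at open time */
    dev->next = sk_devlist;
    sk_devlist = dev;
    return dev;
}

/* Drop one reference; unlink and free the device when the last one goes away. */
void sk_release_dev(struct sk_dev *dev)
{
    struct sk_dev **pp;

    if (--dev->refs > 0)
        return;
    for (pp = &sk_devlist; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == dev) {
            *pp = dev->next;
            break;
        }
    }
    dev->devsw->refs--;
    free(dev);
}
```

In this sketch, asking for major 4, minor 0 through sk_raw_devsw and through sk_disk_devsw yields two distinct devices, which is the property that lets a disk layer sit on top of a raw device without borrowing its fields.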

------- Future stages -------

The following work is intended for future stages and is not present in this
patch:

* Separate b_blkno and b_pblkno from the struct buf and place it in
an attached chain of structures which associate a device layer
with a cached block number. i.e. a chain of (dev_t, blkno) pairs.

This will allow us to arbitrarily layer devices and remove filesystem
block number special casing within the struct buf structure.
For example, we would be able to emplace a crypto layer in between
the disk layer and the raw device, and we would theoretically be
able to use a normal file (without the VN device) as backing store
for a ‘disk’ and even be able to cache the block translations for
the backing file itself along with everything else.

The caching of such translations through multiple layers will now be
possible, but not mandatory. A layer will be allowed to ‘replace’
an existing cache translation (by overwriting it with a new dev_t and
blkno) rather than add a new structure, and in fact will if chaining
structures are not readily available.

* Implement a userland block device facility. This would work much like
pty’s in that a userland process would be able to attach to one side
of the device while the kernel attaches to the other as a block device,
and requests will be passed back and forth to the userland process.

This facility would also be capable of ‘glueing’ a generic file
descriptor, such as a socket, regular file, or anything else capable
of stream or block I/O, to the backend side of the device.

Once we have a userland interface we will be able to write a crypto
layer, serializing log layer, snapshot layer, network backing store,
etc etc etc.”
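
The chain of (dev_t, blkno) pairs described under the future stages can be pictured with a small userland sketch. None of this exists yet, so everything here is an assumption made for illustration: the sk_ names, the int standing in for dev_t, and the tiny fixed pool that models chaining structures "not being readily available". It shows the one behavior the post spells out: a layer adds a link when it can, and otherwise overwrites the most recent translation.

```c
#include <assert.h>
#include <stddef.h>

#define SK_XLATE_POOL 2          /* deliberately tiny so the fallback is easy to hit */

/* Hypothetical cached translation: which device layer, and which block there. */
struct sk_xlate {
    int dev;                     /* stand-in for dev_t */
    long blkno;
    struct sk_xlate *next;       /* next older translation */
};

/* Hypothetical buffer: only the translation chain is modelled. */
struct sk_buf {
    struct sk_xlate *xlate;      /* most recent translation first */
};

struct sk_xlate sk_pool[SK_XLATE_POOL];
int sk_pool_used;

/*
 * Record that a layer translated this I/O to (dev, blkno).
 * Returns 1 if a new link was chained, 0 if the newest link was
 * overwritten instead, and -1 if nothing could be cached at all
 * (caching is optional, so callers may ignore that case).
 */
int sk_buf_translate(struct sk_buf *bp, int dev, long blkno)
{
    struct sk_xlate *x;

    if (sk_pool_used < SK_XLATE_POOL) {
        x = &sk_pool[sk_pool_used++];
        x->dev = dev;
        x->blkno = blkno;
        x->next = bp->xlate;
        bp->xlate = x;
        return 1;
    }
    if (bp->xlate == NULL)
        return -1;
    bp->xlate->dev = dev;        /* replace the existing cached translation */
    bp->xlate->blkno = blkno;
    return 0;
}

/* Return the cached block number for a device layer, or -1 if none is cached. */
long sk_buf_lookup(const struct sk_buf *bp, int dev)
{
    const struct sk_xlate *x;

    for (x = bp->xlate; x != NULL; x = x->next) {
        if (x->dev == dev)
            return x->blkno;
    }
    return -1;
}
```

With a pool of two, a third layer's translation replaces the second one while the first stays cached, so a lookup for the replaced layer simply misses and that layer would have to translate again.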