Re: A multi-threaded NFS server for Linux

Olaf Kirch (okir@monad.swb.de)
Tue, 26 Nov 1996 23:09:08 +0100

  * Messages sorted by: [1][ date ][2][ thread ][3][ subject ][4][ author ]
  * Next message: [5]Olaf Kirch: "Re: rpc.lockd/rpc.statd"
  * Previous message: [6]Paul Christenson: "smail SPAM filter?"
  * Next in thread: [7]Linus Torvalds: "Re: A multi-threaded NFS
    server for Linux"
  * Reply: [8]Linus Torvalds: "Re: A multi-threaded NFS server for
    Linux"
_________________________________________________________________

Hi all,

here are some ramblings about implementing nfsd, the differences
between kernel- and user-space, and life in general. It's become quite
long, so if you're not interested in any of these topics,
just skip it...

On Sun, 24 Nov 1996 12:01:01 PST, "H.J. Lu" wrote:
> With the upcoming the Linux C library 6.0, it is possible to
> implement a multi-threaded NFS server in the user space using
> the kernel-based pthread and MT-safe API included in libc 6.0.

In my opinion, servicing NFS from user space is an idea that should
die. The current unfsd (and I'm pretty sure this will hold for any
other implementation) has a host of problems:

1. Speed

This is only partly related to nfsd being single-threaded. A while
ago I ran some benchmarks comparing my kernel-based nfsd to the
user-space nfsd.

In the unfsd case, I was running 4 daemons in parallel (which is
possible even now as long as you restrict yourself to read-only
access), and found the upper limit for peak throughput was around
800 KBps; the rate for sustained reads was even lower. In comparison,
the kernel-based nfsd achieved around 1.1 MBps peak throughput, which
is almost the theoretical cheapernet limit; its sustained rate was
around 1 MBps. Testers of my recent knfsd implementation reported a
sustained rate of 3.8 MBps over 100 Mbps Ethernet.

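To put these figures in perspective, here is a small back-of-the-envelope
calculation. It assumes that cheapernet means 10 Mbit/s thin Ethernet,
that KBps and MBps are decimal units of bytes per second, that the first
three figures were all measured on the same cheapernet segment, and it
ignores all protocol overhead. The function name is made up for
illustration.

```python
def wire_fraction(throughput_bytes_per_s: float, link_bits_per_s: float) -> float:
    """Fraction of the raw link rate used by a given payload throughput."""
    return throughput_bytes_per_s * 8 / link_bits_per_s


CHEAPERNET = 10e6       # assumed: 10 Mbit/s, i.e. 1.25 MB/s raw
FAST_ETHERNET = 100e6   # 100 Mbit/s, i.e. 12.5 MB/s raw

for label, rate, link in [
    ("unfsd peak (4 daemons)", 800e3, CHEAPERNET),
    ("kernel nfsd peak", 1.1e6, CHEAPERNET),
    ("kernel nfsd sustained", 1.0e6, CHEAPERNET),
    ("knfsd sustained", 3.8e6, FAST_ETHERNET),
]:
    print(f"{label:24s} {wire_fraction(rate, link):.0%} of raw link rate")
```

Under these assumptions the kernel nfsd peak corresponds to 88% of the
raw 10 Mbit/s rate, and the unfsd peak to 64%.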
Even though some tweaking of the unfsd source (especially getting rid
of the Sun RPC code) may improve performance some more, I don't
believe the user-space server can be pushed much further. [Speaking of
the RPC library, a rewrite would be required anyway to safely support
NFS over TCP. You can easily hang a vanilla RPC server by sending an
incomplete request over TCP and keeping the connection open.]

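The TCP problem can be illustrated with a short sketch. RPC over TCP
frames each request with record marking: a 4-byte big-endian header
whose top bit flags the last fragment and whose low 31 bits give the
fragment length. A server that blocks until the announced bytes arrive
can be stalled by a client that never sends them. The reader below puts
a deadline on the whole record. It is only one possible guard, not how
any particular RPC library does it; the names read_record and
IncompleteRecord and the default limits are invented here.

```python
import socket
import struct
import time

LAST_FRAGMENT = 0x80000000  # top bit of the 4-byte record-marking header


class IncompleteRecord(Exception):
    """The peer stalled, closed early, or sent an oversized record."""


def _read_exact(sock, n, deadline):
    """Read exactly n bytes, or raise once the deadline has passed."""
    buf = bytearray()
    while len(buf) < n:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise IncompleteRecord("deadline expired mid-record")
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(n - len(buf))
        except socket.timeout:
            raise IncompleteRecord("deadline expired mid-record") from None
        if not chunk:
            raise IncompleteRecord("connection closed mid-record")
        buf += chunk
    return bytes(buf)


def read_record(sock, timeout=5.0, max_size=1 << 20):
    """Reassemble one record from its fragments within `timeout` seconds."""
    deadline = time.monotonic() + timeout
    record = bytearray()
    while True:
        (header,) = struct.unpack(">I", _read_exact(sock, 4, deadline))
        length = header & 0x7FFFFFFF
        if len(record) + length > max_size:
            raise IncompleteRecord("record exceeds max_size")
        record += _read_exact(sock, length, deadline)
        if header & LAST_FRAGMENT:
            return bytes(record)
```

A real server would probably start the clock only once the first byte
of a record has arrived, so that idle connections are not penalized,
and would drop the connection when IncompleteRecord is raised.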
Now add to that the synchronization overhead required to keep the file
handle cache in sync between the various threads...

This leads me straight to the next topic:

2. File Handle Layout

Traditional nfsds usually stuff a file's device and inode number into
the file handle, along with some information on the exported inode.
Since a user-space program has no way of opening a file given just its
inode number, unfsd takes a different approach. It basically creates a
hashed version of the file's path. Each path component is stat'ed, and
an 8-bit hash of the component's device and inode number is used.

The first problem is that this kind of file handle is not invariant
against renames from one directory to another. Agreed, this doesn't
happen too often, but it does break Unix semantics. Try this on an
nfs-mounted file system (with appropriate foo and bar):

    (mv bar foo/bar; cat) < bar

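A minimal sketch of such a path-hash handle shows why the rename hurts.
The 8-bit fold in component_hash is invented for illustration (the
actual hash function is not given here), and the sketch hashes only the
components below an export root rather than from "/".

```python
import os
import tempfile


def component_hash(st: os.stat_result) -> int:
    """Fold device and inode number into 8 bits (made-up fold)."""
    x = st.st_dev ^ st.st_ino
    h = 0
    while x:
        h ^= x & 0xFF
        x >>= 8
    return h


def path_handle(export_root: str, relpath: str) -> bytes:
    """One hash byte per path component below export_root."""
    handle = bytearray()
    current = export_root
    for name in relpath.split("/"):
        current = os.path.join(current, name)
        handle.append(component_hash(os.lstat(current)))
    return bytes(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "foo"))
        open(os.path.join(root, "bar"), "w").close()
        before = path_handle(root, "bar")
        os.rename(os.path.join(root, "bar"), os.path.join(root, "foo", "bar"))
        after = path_handle(root, "foo/bar")
        # Same inode, but the handle gained a byte for the new parent.
        print(before.hex(), "->", after.hex())
```

The file keeps its inode across the rename, so a handle built from
device and inode number alone would still be valid; the path-based
handle is not, because the hash of foo is now part of it.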
The second problem is a lot worse. When unfsd is presented with a file
handle it does not have in its cache, it must map it to a valid path
name. This is basically done in the following way:

    path = "/";
    depth = 0;
    while (depth < length(fhandle)) {
    deeper:
        dirp = opendir(path);
        while ((entry = readdir(dirp)) != NULL) {
            /* dev, ino: device and inode number of this entry */
            if (hash(dev, ino) matches fhandle component at depth) {
                remember dirp
                append entry to path
                depth++;
                goto deeper;
            }
        }
        closedir(dirp);
        backtrack;    /* resume scanning the remembered parent dirp */
    }

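For readers who want to experiment, the search above can be written as
a runnable sketch. Several things here are assumptions: the 8-bit fold
hash8 is made up, the search starts at an export root instead of "/",
and the handle carries the full (dev, ino) pair of the file itself
where the real handle stores only a 4-byte hash of it.

```python
import os


def hash8(st):
    """Made-up 8-bit fold of (dev, ino); the real hash may differ."""
    x, h = st.st_dev ^ st.st_ino, 0
    while x:
        h ^= x & 0xFF
        x >>= 8
    return h


def make_handle(root, relpath):
    """Return (leaf identity, one hash byte per component below root)."""
    hashes, current = [], root
    for name in relpath.split("/"):
        current = os.path.join(current, name)
        hashes.append(hash8(os.lstat(current)))
    leaf = os.lstat(current)
    return (leaf.st_dev, leaf.st_ino), bytes(hashes)


def resolve(root, handle):
    """Depth-first search for a path whose component hashes match.

    Returns a path relative to root, or None if nothing matches.
    """
    leaf_id, hashes = handle

    def search(directory, depth):
        try:
            entries = list(os.scandir(directory))
        except OSError:              # unreadable or not a directory
            return None
        for entry in entries:
            try:
                st = entry.stat(follow_symlinks=False)
            except OSError:
                continue
            if hash8(st) != hashes[depth]:
                continue
            if depth + 1 == len(hashes):
                if (st.st_dev, st.st_ino) == leaf_id:
                    return entry.name
            else:
                rest = search(entry.path, depth + 1)
                if rest is not None:
                    return entry.name + "/" + rest
            # Hash collision or dead end: backtrack, keep scanning.
        return None

    return search(root, 0) if hashes else None
```

Every candidate entry costs a stat call, and with only 256 possible
hash values per component a large directory yields many false matches
that must be descended into and abandoned.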
Needless to say, this is not very fast. The file handle cache helps
a lot here, but this kind of mapping operation occurs far more often
than one might expect (consider a development tree where files get
created and deleted continuously). In addition, the current
implementation discards conflicting handles when there's a hash
collision.

This file handle layout also leaves little room for any additional
baggage. Unfsd currently uses 4 bytes for an inode hash of the file
itself and 28 bytes for the hashed path, but as soon as you add other
information like the inode generation number, you will sooner or
later run out of room.

Last but not least, the file handle cache must be strictly
synchronized between different nfsd processes/threads. Suppose you
rename foo to bar, which is performed by thread1, then try to read the
file, which is performed by thread2. If the latter doesn't know the
cached path is stale, it will fail. You could of course retry every
operation that fails with ENOENT, but this will add even more clutter
and overhead to the code.

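The retry-on-ENOENT approach might look roughly like the sketch below.
It is a simplified model for threads within one process; the class and
method names are invented, and the resolve callback stands for the slow
handle-to-path search.

```python
import threading


class HandleCache:
    """Handle -> path cache shared by all server threads (illustrative)."""

    def __init__(self, resolve):
        self._resolve = resolve      # slow path: handle -> path lookup
        self._paths = {}
        self._lock = threading.Lock()

    def path_for(self, handle):
        with self._lock:
            path = self._paths.get(handle)
        if path is None:
            path = self._resolve(handle)   # slow, so done outside the lock
            with self._lock:
                self._paths[handle] = path
        return path

    def invalidate(self, handle):
        with self._lock:
            self._paths.pop(handle, None)

    def call(self, handle, operation):
        """Run operation(path); on ENOENT drop the entry and retry once."""
        try:
            return operation(self.path_for(handle))
        except FileNotFoundError:
            self.invalidate(handle)
            return operation(self.path_for(handle))
```

Every file operation has to go through call, and a stale entry costs a
failed system call plus a full lookup before the operation succeeds.
Separate server processes would additionally need some shared store in
place of the in-process dictionary.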
3. Adherence to the NFSv2 specification

The Linux nfsd currently does not fulfill the NFSv2 spec in its
entirety. Especially when it comes to safe writes, it is really a
fake. It neither makes an attempt to sync file data before replying to
the client (which could be implemented, along with the `async' export
option for turning off this kind of behavior), nor does it sync
meta-data after inode operations (which is impossible from user
space). To most people this is no big loss, but this behavior is
definitely not acceptable if you want industry-strength NFS.

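The part that could be implemented, syncing file data before replying,
is simple to sketch. The function name and the sync parameter are
invented; sync=False plays the role of the `async' export option.

```python
import os


def handle_write(path: str, offset: int, data: bytes, sync: bool = True) -> int:
    """Write data at offset and return the number of bytes written.

    With sync=True the data is flushed to disk before the function
    returns, i.e. before a reply would be sent to the client.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, offset)
        if sync:
            os.fsync(fd)     # block until this file's data is on disk
        return written
    finally:
        os.close(fd)
```

This covers only the data of the file being written; it does nothing
for the meta-data updates caused by other inode operations.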
But even if you did implement at least synchronous file writes in
unfsd, be it as an option or as the default, there seems to be no way
to implement some of the more advanced techniques like gathered
writes. When implementing gathered writes, the server tries to detect
whether other nfsd threads are writing to the file at the same time
(which frequently happens when the client's biods flush out the data
on file close), and if they are, it delays syncing file data for a few
milliseconds so the others can finish first, and then flushes all data
in one go. You can do this in kernel-land by watching
inode->i_writecount, but you're totally at a loss in user-space.

4. Supporting NFSv3

A user-space NFS server is not particularly well suited for
implementing NFSv3. For instance, NFSv3 tries to help cache
consistency on the client by providing pre-operation attributes for
some operations, such as the WRITE call. When a client finds that the
pre-operation attributes returned by the server agree with those it
has cached, it can safely assume that any data it has cached was still
valid when the server replied to its call, so there's no need to
discard the cached file data and meta-data.

However, pre-op attributes can only be provided safely when the server
retains exclusive access to the inode throughout the operation. This
is impossible from user space.

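A small sketch of both sides of this exchange. WccAttr mirrors the
three fields of the NFSv3 pre-operation attributes (size, mtime,
ctime); the function names are invented, and user_space_write is a
deliberately naive server side that shows where the race sits.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WccAttr:
    """Pre-operation attributes: size, mtime, ctime."""
    size: int
    mtime: float
    ctime: float


def cache_survives(cached: WccAttr, pre_op: Optional[WccAttr]) -> bool:
    """Client side: keep cached data only if pre-op attrs match."""
    return pre_op is not None and pre_op == cached


def user_space_write(path: str, offset: int, data: bytes) -> WccAttr:
    """Server side, naive: stat first, then write.

    Nothing stops a local process from modifying the file between the
    stat and the write, so the returned attributes are a best guess.
    """
    st = os.stat(path)
    pre = WccAttr(st.st_size, st.st_mtime, st.st_ctime)
    # Race window: the inode is not held exclusively here.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return pre
```

If another writer slips into the window, the client is told that its
cache was valid when in fact it was not, which is worse than returning
no pre-op attributes at all.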
A similar example is the exclusive create operation where a verifier
is stored in the inode's atime/mtime fields by the server to guarantee
exactly-once behavior even in the face of request retransmissions.
These values cannot be checked atomically by a user-space server.

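What a user-space attempt at exclusive create could look like, as a
hedged sketch: the 8-byte verifier is split into two 32-bit halves that
are stored as atime and mtime in seconds. The function name is
invented, and halves of 2^31 or more may not be storable on every
filesystem.

```python
import os
import struct


def exclusive_create(path: str, verifier: bytes) -> bool:
    """Create path once; True if it exists and carries our verifier."""
    atime, mtime = struct.unpack(">II", verifier)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        # Retransmission, or somebody else's file? Compare verifiers.
        st = os.stat(path)
        return (int(st.st_atime), int(st.st_mtime)) == (atime, mtime)
    os.close(fd)
    # Not atomic with the create: a crash, or a retransmitted request
    # handled right here, sees the file without its verifier.
    os.utime(path, (atime, mtime))
    return True
```

The create and the verifier update are two separate system calls, and
the check on the retransmission path is a third, so there is no point
at which the server can test and set the verifier in one step.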
What this boils down to is that a user-space server cannot, without
violating the protocol spec, implement many of the advanced features
of NFSv3.

5. File locking over NFS

Supporting lockd in user-space is close to impossible. I've tried it,
and have run into a large number of problems. Some of the highlights:

  * lockd can provide only a limited number of locks at the same
    time because it has only a limited number of file descriptors.

  * When lockd blocks a client's lock request because of a lock held
    by a local process on the server, it must continuously poll
    /proc/locks to see whether the request could be granted. What's
    more, if there's heavy contention for the file, it may take
    a long time before it succeeds because it cannot add itself
    to the inode's lock wait list in the kernel. That is, unless
    you want it to create a new thread just for blocking on this
    lock.

  * Lockd must synchronize its file handle cache with that of
    the NFS servers. Unfortunately, lockd is also needed when
    running as an NFS client only, so you run into problems with
    who owns the file handle cache, and how to share it between
    these two services.

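The polling described in the second item can be sketched as follows.
The parser assumes the /proc/locks layout of current kernels (ordinal,
optional "->" for a blocked waiter, lock type, ADVISORY or MANDATORY,
READ or WRITE, pid, hex major:minor and decimal inode, start, end); the
layout in 1996 may well have differed. All function names are invented,
and byte ranges and read/write compatibility are ignored.

```python
import time


def parse_locks(text):
    """Yield (blocked, kind, mode, pid, (major, minor, inode)) per line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        fields = fields[1:]              # drop the "N:" ordinal
        blocked = fields[0] == "->"
        if blocked:
            fields = fields[1:]
        kind, _advisory, mode, pid, ident = fields[:5]
        major, minor, inode = ident.split(":")
        yield blocked, kind, mode, int(pid), (int(major, 16), int(minor, 16), int(inode))


def holders(text, dev_ino):
    """Pids that currently hold (not merely wait for) a lock on dev_ino."""
    return [pid for blocked, _kind, _mode, pid, ident in parse_locks(text)
            if ident == dev_ino and not blocked]


def read_proc_locks():
    with open("/proc/locks") as f:
        return f.read()


def wait_until_free(dev_ino, read_locks=read_proc_locks, interval=0.1, timeout=5.0):
    """Poll until nobody holds a lock on dev_ino; False on timeout."""
    deadline = time.monotonic() + timeout
    while holders(read_locks(), dev_ino):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

Between two polls another local process can grab the lock again, which
is the starvation problem mentioned above: the poller never queues up
behind the current holder the way a waiter inside the kernel does.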
6. Conclusion

Alright, this has become rather long. Some of the problems I've
described above may be solvable with more or less effort, but I
believe that, taken as a whole, they make a pretty strong argument
against sticking with a user-space nfsd.

In kernel-space, most of these issues are addressed more easily and
more efficiently. My current kernel nfsd is fairly small. Together
with the RPC core, which is used by both client and server, it takes
up something like 20 pages--don't quote me on the exact number. As
mentioned above, it is also pretty fast, and I hope I'll be able to
also provide fully functional file locking soon.

If you want to take a look at the current snapshot, it's available at
ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse/linux-nfs-X.Y.tar.gz.
This version still has a bug in the nfsd readdir implementation, but
I'll release an updated (and fixed) version as soon as I have the
necessary lockd rewrite sorted out.

I would particularly welcome comments from Keepers of the Source on
whether my NFS rewrite has any chance of being incorporated into the
kernel at some time... that would definitely motivate me to sink more
time into it than I currently do.

Happy hacking
Olaf
--
Olaf Kirch         |  --- o --- Nous sommes du soleil we love when we play
okir@monad.swb.de  |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
             For my PGP public key, finger okir@brewhq.swb.de.
_________________________________________________________________

References

 1. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/date.html#18
 2. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/index.html#18
 3. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/subject.html#18
 4. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/author.html#18
 5. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0019.html
 6. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0017.html
 7. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html
 8. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html
 9. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0019.html
10. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0017.html
11. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html
12. http://www.ussg.iu.edu/hypermail/linux/net/9611.3/0020.html