Re: A multi-threaded NFS server for Linux

Olaf Kirch (okir@monad.swb.de)
Tue, 26 Nov 1996 23:09:08 +0100
Here are some ramblings about implementing nfsd, the differences
between kernel- and user-space, and life in general. It's become quite
long, so if you're not interested in any of these topics, feel free to
skip this message.
On Sun, 24 Nov 1996 12:01:01 PST, "H.J. Lu" wrote:
> With the upcoming Linux C library 6.0, it is possible to
> implement a multi-threaded NFS server in the user space using
> the kernel-based pthread and MT-safe API included in libc 6.0.
In my opinion, servicing NFS from user space is an idea that should
be abandoned.

The current unfsd (and I'm pretty sure this will hold for any other
implementation) has a host of problems:

1. Performance
This is only partly related to nfsd being single-threaded. A while
ago, I ran some benchmarks comparing my kernel-based nfsd to the
user-space nfsd over cheapernet.

In the unfsd case, I was running 4 daemons in parallel (which is
possible even now as long as you restrict yourself to read-only
access), and found the upper limit for peak throughput was around
800 KBps; the rate for sustained reads was even lower. In comparison,
the kernel-based nfsd achieved around 1.1 MBps peak throughput, which
is almost the theoretical cheapernet limit; its sustained rate was
around 1 MBps.

Testers of my recent knfsd implementation reported a sustained rate
of 3.8 MBps over 100 Mbps Ethernet.
Even though some tweaking of the unfsd source (especially by getting
rid of the Sun RPC code) may improve performance some more, I don't
believe the user-space implementation can be pushed much further.
[Speaking of the RPC code: a rewrite would be required anyway to
safely support NFS over TCP. You can easily hang a vanilla RPC server
by sending an incomplete request over TCP and keeping the connection
open.]
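Just to illustrate the point, here's a sketch of what such a hostile
client could look like (entirely made up, not taken from any real
exploit):

    /* Sketch: wedge a single-threaded SunRPC/TCP server by
     * sending a truncated record and then going silent.    */
    #include <unistd.h>

    void hang_rpc_server(int sock)    /* a connected TCP socket */
    {
        char partial[10] = { 0 };     /* less than one RPC record */

        write(sock, partial, sizeof(partial));
        pause();    /* keep the connection open forever; the server
                     * sits in read() waiting for the rest of the
                     * request and serves nobody else              */
    }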
Now add to that the synchronization overhead required to keep the file
handle cache in sync between the various threads...
This leads me straight to the next topic:

2. File handles
Traditional nfsds usually stuff a file's device and inode number into
the file handle, along with some information on the exported inode.
Since a user-space program has no way of opening a file just given its
inode number, unfsd takes a different approach. It basically creates a
hashed version of the file's path. Each path component is stat'ed, and
an 8-bit hash of the component's device and inode number is used.
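In rough C, the construction looks something like this (the hash
function and all names are my own illustration, not the actual unfsd
code):

    #include <string.h>
    #include <sys/stat.h>

    /* Illustrative 8-bit hash -- not unfsd's actual function. */
    #define hash(dev, ino) ((unsigned char)((dev) ^ (ino) ^ ((ino) >> 8)))

    /* Build a handle from a path: one hash byte per component.
     * The caller passes the path split into its components.    */
    int path_to_fh(const char *comp[], int ncomp, unsigned char *fh)
    {
        struct stat st;
        char path[1024] = "";
        int i;

        for (i = 0; i < ncomp; i++) {
            strcat(path, "/");
            strcat(path, comp[i]);
            if (stat(path, &st) < 0)
                return -1;
            fh[i] = hash(st.st_dev, st.st_ino);
        }
        return 0;
    }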
The first problem is that this kind of file handle is not invariant
against renames from one directory to another. Agreed, this doesn't
happen too often, but it does break Unix semantics. Try this on an
NFS-mounted file system (with appropriate foo and bar):

    (mv bar foo/bar; cat) < bar
The second problem is a lot worse. When unfsd is presented with a file
handle it does not have in its cache, it must map it to a valid path
name. This is basically done in the following way:
    path = export_root; depth = 0;
    while (depth < length(fhandle)) {
        dirp = opendir(path);
        while ((entry = readdir(dirp)) != NULL) {
            stat the entry to get its dev/ino;
            if (hash(dev,ino) matches fhandle component)
                break;          /* descend into this entry */
        }
        append entry's name to path; depth++;
    }
Needless to say, this is not very fast. The file handle cache helps
a lot here, but this kind of mapping operation occurs far more often
than one might expect (consider a development tree where files get
created and deleted continuously). In addition, the current
implementation discards conflicting handles when there's a hash
collision.
This file handle layout also leaves little room for any additional
baggage. Unfsd currently uses 4 bytes for an inode hash of the file
itself and 28 bytes for the hashed path, but as soon as you add other
information like the inode generation number, you will sooner or
later run out of room.
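As a sketch, the 32-byte NFSv2 handle is carved up roughly like this
(the field names are mine):

    struct unfsd_fh {                 /* 32 bytes: NFSv2 handle size */
        unsigned char ino_hash[4];    /* hash of the file's dev/ino  */
        unsigned char path[28];       /* one hash byte per component,
                                       * so there's no room left for,
                                       * say, a generation number    */
    };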
Last but not least, the file handle cache must be strictly
synchronized between different nfsd processes/threads. Suppose you
rename foo to bar, which is performed by thread 1, then try to read
the file, which is performed by thread 2. If the latter doesn't know
the cached path is stale, it will fail. You could of course retry
every operation that fails with ENOENT, but this will add even more
clutter and overhead to the code.
3. Adherence to the NFSv2 specification

The Linux nfsd currently does not fulfill the NFSv2 spec in its
entirety. Especially when it comes to safe writes, it is really a
fake. It neither makes an attempt to sync file data before replying
to the client (which could be implemented, along with the `async'
export option for turning off this kind of behavior), nor does it
sync meta-data after inode operations (which is impossible from user
space). To most people this is no big loss, but this behavior is
definitely not acceptable if you want industry-strength NFS.
But even if you did implement at least synchronous file writes in
unfsd, be it as an option or as the default, there seems to be no way
to implement some of the more advanced techniques like gathered
writes. When implementing gathered writes, the server tries to detect
whether other nfsd threads are writing to the file at the same time
(which frequently happens when the client's biods flush out the data
on file close), and if they do, it delays syncing file data for a few
milliseconds so the others can finish first, and then flushes all
data in one go. You can do this in kernel-land by watching
inode->i_writecount, but you're totally at a loss in user-space.
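In kernel-land, the logic can be sketched roughly as follows (the
helper names are made up, and this is not the actual knfsd code):

    /* Sketch of gathered writes, keyed off inode->i_writecount. */
    void nfsd_commit_write(struct inode *inode)
    {
        if (inode->i_writecount > 1) {
            /* other nfsd threads are still writing; give them
             * a few milliseconds to finish...                  */
            wait_a_few_ms();          /* hypothetical helper */
        }
        sync_inode_data(inode);       /* hypothetical helper:
                                       * flush everything in
                                       * one go                */
    }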
4. NFSv3

A user-space NFS server is not particularly well suited for
implementing NFSv3. For instance, NFSv3 tries to help cache
consistency on the client by providing pre-operation attributes for
some operations, for example the WRITE call. When a client finds that
the pre-operation attributes returned by the server agree with those
it has cached, it can safely assume that any data it has cached was
still valid when the server replied to its call, so there's no need
to discard the cached file data.
However, pre-op attributes can only be provided safely when the server
retains exclusive access to the inode throughout the operation. This
is impossible from user space.
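The best a user-space server can do is something like the sketch
below, and nothing stops a local process from modifying the file
between the three steps:

    #include <sys/stat.h>
    #include <unistd.h>

    /* Non-atomic pre-op/post-op attributes in user space. */
    void serve_write(int fd, off_t off, const char *buf, size_t len)
    {
        struct stat pre, post;

        fstat(fd, &pre);              /* 1. pre-op attributes     */
        lseek(fd, off, SEEK_SET);     /* 2. the write itself --   */
        write(fd, buf, len);          /*    anyone can sneak in   */
                                      /*    between these calls   */
        fstat(fd, &post);             /* 3. post-op attributes    */
        /* ... reply with pre/post attributes ... */
    }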
A similar example is the exclusive create operation, where a verifier
is stored in the inode's atime/mtime fields by the server to guarantee
exactly-once behavior even in the face of request retransmissions.
These values cannot be checked atomically by a user-space server.
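A user-space server would have to emulate the check with separate
syscalls, roughly like the sketch below (error handling trimmed); two
retransmitted CREATEs can interleave anywhere between the stat() and
the utime():

    #include <sys/stat.h>
    #include <fcntl.h>
    #include <utime.h>
    #include <unistd.h>

    int excl_create(const char *path, long verf[2])
    {
        struct stat st;
        struct utimbuf ut;
        int fd;

        if (stat(path, &st) == 0) {
            /* same verifier => a retransmission, report success */
            if (st.st_atime == verf[0] && st.st_mtime == verf[1])
                return 0;
            return -1;                /* file exists             */
        }
        if ((fd = open(path, O_CREAT|O_EXCL|O_WRONLY, 0644)) < 0)
            return -1;
        close(fd);
        ut.actime  = verf[0];         /* store the verifier --   */
        ut.modtime = verf[1];         /* another CREATE can race */
        return utime(path, &ut);      /* us up to this point     */
    }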
What this boils down to is that a user-space server cannot, without
violating the protocol spec, implement many of the advanced features
of NFSv3.
5. File locking over NFS

Supporting lockd in user-space is close to impossible. I've tried it,
and have run into a large number of problems. Some of the highlights:
* lockd can provide only a limited number of locks at the same
  time because it has only a limited number of file descriptors.

* When lockd blocks a client's lock request because of a lock held
  by a local process on the server, it must continuously poll
  /proc/locks to see whether the request could be granted. What's
  more, if there's heavy contention for the file, it may take
  a long time before it succeeds, because it cannot add itself
  to the inode's lock wait list in the kernel. That is, unless
  you want it to create a new thread just for blocking on this
  single lock. (See the sketch after this list.)

* Lockd must synchronize its file handle cache with that of
  the NFS servers. Unfortunately, lockd is also needed when
  running as an NFS client only, so you run into problems with
  who owns the file handle cache, and how to share it between
  lockd and nfsd.
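Here's roughly what that polling loop looks like; I'm using a
non-blocking fcntl() retry as a stand-in for the actual /proc/locks
scan:

    #include <fcntl.h>
    #include <unistd.h>

    /* Blocking lock by polling, since a user-space lockd cannot
     * put itself on the inode's lock wait list in the kernel.  */
    int blocking_lock(int fd, struct flock *fl)
    {
        while (fcntl(fd, F_SETLK, fl) < 0) {
            /* under heavy contention we may lose the race to a
             * local process on every single iteration          */
            sleep(1);
        }
        return 0;
    }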
Alright, this has become rather long. Some of the problems I've
listed above may be solvable with more or less effort, but I believe
that, as a whole, they make a pretty strong argument against sticking
with a user-space nfsd.

In kernel-space, most of these issues are addressed most easily, and
most efficiently. My current kernel nfsd is fairly small. Together
with the RPC core, which is used by both client and server, it takes
up something like 20 pages--don't quote me on the exact number. As
mentioned above, it is also pretty fast, and I hope I'll be able to
also provide fully functional file locking soon.
If you want to take a look at the current snapshot, it's available at
ftp.mathematik.th-darmstadt.de/pub/linux/okir/dontuse/linux-nfs-X.Y.tar.gz

This version still has a bug in the nfsd readdir implementation, but
I'll release an updated (and fixed) version as soon as I have the
lockd rewrite sorted out.
I would particularly welcome comments from Keepers of the Source on
whether my NFS rewrite has any chance of being incorporated into the
kernel at some time... that would definitely motivate me to sink more
time into it than I currently do.
Olaf Kirch         |  --- o ---  Nous sommes du soleil we love when we play
okir@monad.swb.de  |    / | \    sol.dhoop.naytheet.ah kin.ir.samse.qurax
For my PGP public key, finger okir@brewhq.swb.de.