NFSv4 Working Group                                       David L. Black
Internet Draft                                          Stephen Fridella
Expires: April 2006                                      EMC Corporation
                                                        October 21, 2005

                       pNFS Block/Volume Layout
                     draft-black-pnfs-block-02.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire in April 2006.

Abstract

   Parallel NFS (pNFS) extends NFSv4 to allow clients to directly
   access file data on the storage used by the NFSv4 server.  This
   ability to bypass the server for data access can increase both
   performance and parallelism, but requires additional client
   functionality for data access, some of which is dependent on the
   class of storage used.  The main pNFS operations draft specifies
   storage-class-independent extensions to NFS; this draft specifies
   the additional extensions (primarily data structures) for use of
   pNFS with block- and volume-based storage.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

Table of Contents

   1. Introduction
   2. Background and Architecture
      2.1. Data Structures: Extents and Extent Lists
           2.1.1. Layout Requests and Extent Lists
           2.1.2. Client Copy-on-Write Processing
           2.1.3. Extents are Permissions
      2.2. Volume Identification
   3. Operations Issues
      3.1. Layout Operation Ordering Considerations
           3.1.1. Client Side Considerations
           3.1.2. Server Side Considerations
      3.2.
           Recall Callback Completion and Robustness Concerns
      3.3. Crash Recovery Issues
      3.4. Additional Features - Not Needed or Recommended
   4. Security Considerations
   5. Conclusions
   6. Revision History
   7. Acknowledgments
   8. References
      8.1. Normative References
      8.2. Informative References
   Author's Addresses
   Intellectual Property Statement
   Disclaimer of Validity
   Copyright Statement
   Acknowledgment

   NOTE: This is an early-stage draft.  It's still rough in places,
   with significant work to be done.

1. Introduction

   Figure 1 shows the overall architecture of a pNFS system:

      +-----------+
      |+-----------+                              +-----------+
      ||+-----------+                             |           |
      |||           |        NFSv4 + pNFS         |           |
      +||  Clients  |<--------------------------->|  Server   |
       +|           |                             |           |
        +-----------+                             |           |
             |||                                  +-----------+
             |||                                       |
             |||                                       |
             |||         +-----------+                 |
             |||         |+-----------+                |
             ||+---------||+-----------+               |
             |+----------|||           |               |
             +-----------+||  Storage  |---------------+
                          +|  Systems  |
                           +-----------+

                     Figure 1 pNFS Architecture

   The overall approach is that pNFS-enhanced clients obtain
   sufficient information from the server to enable them to access the
   underlying storage (on the Storage Systems) directly.  See [PNFS]
   for more details.
   This draft is concerned with access from pNFS clients to Storage
   Systems over storage protocols based on blocks and volumes, such as
   the SCSI protocol family (e.g., parallel SCSI, FCP for Fibre
   Channel, iSCSI, SAS).  This class of storage is referred to as
   block/volume storage.  While the Server to Storage System protocol
   is not of concern for interoperability here, it will typically also
   be a block/volume protocol when clients use block/volume protocols.

2. Background and Architecture

   The fundamental storage abstraction supported by block/volume
   storage is a storage volume consisting of a sequential series of
   fixed-size blocks.  This can be thought of as a logical disk; it
   may be realized by the Storage System as a physical disk, a portion
   of a physical disk, or something more complex (e.g., concatenation,
   striping, RAID, and combinations thereof) involving multiple
   physical disks or portions thereof.

   A pNFS layout for this block/volume class of storage is responsible
   for mapping from an NFS file (or portion of a file) to the blocks
   of storage volumes that contain the file.  The blocks are expressed
   as extents with 64-bit offsets and lengths using the existing NFSv4
   offset4 and length4 types.  Clients must be able to perform I/O to
   the block extents without affecting additional areas of storage
   (especially important for writes); therefore, extents MUST be
   aligned to 512-byte boundaries, and SHOULD be aligned to the block
   size used by the NFSv4 server in managing the actual filesystem (4
   kilobytes and 8 kilobytes are common block sizes).  This block size
   is available as an NFSv4 attribute - see Section 7.4 of [PNFS].

   This draft relies on the pNFS client indicating whether a requested
   layout is for read use or read-write use.
   A read-only layout may contain holes that are read as zero, whereas
   a read-write layout will contain allocated, but uninitialized,
   storage in those holes (read as zero, can be written by the
   client).  This draft also supports client participation in
   copy-on-write by providing both read-only and uninitialized storage
   for the same range in a layout.  Reads are initially performed on
   the read-only storage, with writes going to the uninitialized
   storage.  After the first write that initializes the uninitialized
   storage, all reads are performed to that now-initialized writeable
   storage, and the corresponding read-only storage is no longer used.

   This draft draws extensively on the authors' familiarity with the
   mapping functionality and protocol in EMC's HighRoad system.  The
   protocol used by HighRoad is called FMP (File Mapping Protocol); it
   is an add-on protocol that runs in parallel with filesystem
   protocols such as NFSv3 to provide pNFS-like functionality for
   block/volume storage.  While drawing on HighRoad FMP, the data
   structures and functional considerations in this draft differ in
   significant ways, based on lessons learned and the opportunity to
   take advantage of NFSv4 features such as COMPOUND operations.  The
   support for client participation in copy-on-write is based on
   contributions from those with experience in that area, as HighRoad
   does not currently support client participation in copy-on-write.

2.1. Data Structures: Extents and Extent Lists

   A pNFS layout is a list of extents with associated properties.
   Each extent MUST be at least 512-byte aligned.
   struct extent {
       offset4        file_offset;    /* the logical location in the
                                         file */
       length4        extent_length;  /* the size of this extent in
                                         the file and on storage */
       pnfs_deviceid4 volume_ID;      /* the logical volume/physical
                                         device that this extent is
                                         on */
       offset4        storage_offset; /* the logical location of this
                                         extent in the volume */
       extentState4   es;             /* the state of this extent */
   };

   enum extentState4 {
       READ_WRITE_DATA = 0, /* the data located by this extent is
                               valid for reading and writing. */
       READ_DATA       = 1, /* the data located by this extent is
                               valid for reading only; it may not be
                               written. */
       INVALID_DATA    = 2, /* the location is valid; the data is
                               invalid.  It is a newly (pre-)
                               allocated extent.  There is physical
                               space. */
       NONE_DATA       = 3, /* the location is invalid.  It is a hole
                               in the file.  There is no physical
                               space. */
   };

   The file_offset, extent_length, and es fields for an extent
   returned from the server are always valid.  The interpretation of
   the storage_offset field depends on the value of es as follows:

   o READ_WRITE_DATA means that storage_offset is valid, and points to
     valid/initialized data that can be read and written.

   o READ_DATA means that storage_offset is valid and points to
     valid/initialized data which can only be read.  Write operations
     are prohibited; the client may need to request a read-write
     layout.

   o INVALID_DATA means that storage_offset is valid, but points to
     invalid, uninitialized data.  This data must not be physically
     read from the disk until it has been initialized.  A read request
     for an INVALID_DATA extent must fill the user buffer with zeros.
     Write requests must write whole blocks to the disk; bytes not
     initialized by the user must be set to zero.
     Any write to storage in an INVALID_DATA extent changes the
     written portion of the extent to READ_WRITE_DATA; the pNFS client
     is responsible for reporting this change via LAYOUTCOMMIT.

   o NONE_DATA means that storage_offset is not valid, and this extent
     may not be used to satisfy write requests.  Read requests may be
     satisfied by zero-filling as for INVALID_DATA.  NONE_DATA extents
     are returned by requests for readable extents; they are never
     returned if the request was for a writeable extent.

   The volume_ID field for an extent returned by the server is used to
   identify the logical volume on which this extent resides; see
   Section 2.2.

   The extent list lists all relevant extents in increasing order of
   the file_offset of each extent; any ties are broken by increasing
   order of the extent state (es).

   typedef extent extentList;  /* MAX_EXTENTS = 256; */

   TODO: Define the actual layout and layoutupdate data structures as
   extent lists.

   TODO: Striping support.  Layout-independent striping will not be
   added to [PNFS], but it can help compact layout representations
   when the filesystem is striped across block/volume storage.

2.1.1. Layout Requests and Extent Lists

   Each request for a layout specifies at least three parameters:
   offset, desired size, and minimum size (the desired size is missing
   from the operations draft - see Section 3).  If the status of a
   request indicates success, the extent list returned must meet the
   following criteria:

   o A request for a readable (but not writeable) layout returns only
     READ_WRITE_DATA, READ_DATA or NONE_DATA extents (but not
     INVALID_DATA extents).  A READ_WRITE_DATA extent MAY be returned
     by a pNFS server in a readable layout in order to avoid a
     subsequent client request for writing (ISSUE: Is that a good
     idea?  It involves the server second-guessing the client, and the
     downside is the possible need for a recall callback).
   o A request for a writeable layout returns READ_WRITE_DATA or
     INVALID_DATA extents (but not NONE_DATA extents).  It may also
     return READ_DATA extents, but only when the offset ranges in
     those extents are also covered by INVALID_DATA extents to permit
     writes.

   o The first extent in the list MUST contain the starting offset.

   o The total size of extents in the extent list MUST cover at least
     the minimum size and no more than the desired size.  One
     exception is allowed: the total size MAY be smaller if only
     readable extents were requested and EOF is encountered.

   o Extents in the extent list MUST be logically contiguous for a
     read-only layout.  For a read-write layout, the set of writable
     extents (i.e., excluding READ_DATA extents) MUST be logically
     contiguous.  Every READ_DATA extent in a read-write layout MUST
     be covered by an INVALID_DATA extent.  This overlap of READ_DATA
     and INVALID_DATA extents is the only permitted extent overlap.

   o Extents MUST be ordered in the list by starting offset, with
     READ_DATA extents preceding INVALID_DATA extents in the case of
     equal file_offsets.

2.1.2. Client Copy-on-Write Processing

   Distinguishing the READ_WRITE_DATA and READ_DATA extent types,
   combined with the allowed overlap of READ_DATA extents with
   INVALID_DATA extents, allows copy-on-write processing to be done by
   pNFS clients.  In classic NFS, this operation would be done by the
   server.  Since pNFS enables clients to do direct block access, it
   requires clients to participate in copy-on-write operations.

   When a client wishes to write data covered by a READ_DATA extent,
   it MUST have requested a writable layout from the server; that
   layout will contain INVALID_DATA extents to cover all the data
   ranges of that layout's READ_DATA extents.
   More precisely, for any file_offset range covered by one or more
   READ_DATA extents in a writable layout, the server MUST include one
   or more INVALID_DATA extents in the layout that cover the same
   file_offset range.  The client MUST logically copy the data from
   the READ_DATA extent for any partial blocks of file_offset and
   range, merge in the changes to be written, and write the result to
   the INVALID_DATA extent for the blocks for that file_offset and
   range.  That is, if entire blocks of data are to be overwritten by
   an operation, the corresponding READ_DATA blocks need not be
   fetched, but any partial-block writes must be merged with data
   fetched via READ_DATA extents before storing the result via
   INVALID_DATA extents.  Storing of data in an INVALID_DATA extent
   converts the written portion of the INVALID_DATA extent to a
   READ_WRITE_DATA extent; all subsequent reads MUST be performed from
   this extent, and the corresponding portion of the READ_DATA extent
   MUST NOT be used after storing data in an INVALID_DATA extent.

   In the LAYOUTCOMMIT operation that normally sends updated layout
   information back to the server, for writable data, some
   INVALID_DATA extents may be committed as READ_WRITE_DATA extents,
   signifying that the storage at the corresponding storage_offset
   values has been stored into and is now to be considered as valid
   data to be read.  READ_DATA extents need not be sent to the server.
   For extents that the client receives via LAYOUTGET as INVALID_DATA
   and returns via LAYOUTCOMMIT as READ_WRITE_DATA, the server will
   understand that the READ_DATA mapping for that extent is no longer
   valid or necessary for that file.

   ISSUE: This assumes that all block/volume pNFS clients will support
   copy-on-write.  Negotiating this would require additional server
   code to cope with clients that don't support it, which doesn't seem
   like a good idea.
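   The partial-block merge described above can be sketched as follows.
   This is a Python illustration, not protocol material: BLOCK_SIZE,
   the buffer arguments, and the function name are assumptions made
   for the example, and real clients operate on storage devices rather
   than in-memory buffers.

   ```python
   BLOCK_SIZE = 4096  # assumed server filesystem block size

   def cow_write(read_data, invalid_data, offset, new_bytes):
       """Sketch of client copy-on-write for one overlapping
       READ_DATA / INVALID_DATA extent pair.

       read_data    -- bytes backing the READ_DATA extent
       invalid_data -- mutable bytearray backing the INVALID_DATA extent
       offset       -- byte offset of the write within the extent
       new_bytes    -- data the application is writing
       """
       start_block = (offset // BLOCK_SIZE) * BLOCK_SIZE
       end = offset + len(new_bytes)
       end_block = -(-end // BLOCK_SIZE) * BLOCK_SIZE  # round up

       # Whole blocks being overwritten need no READ_DATA fetch;
       # partial blocks are merged with data read via READ_DATA.
       block = bytearray(read_data[start_block:end_block])
       block[offset - start_block:end - start_block] = new_bytes

       # Store merged whole blocks into the INVALID_DATA extent; the
       # written portion is now READ_WRITE_DATA and must be reported
       # to the server via LAYOUTCOMMIT.
       invalid_data[start_block:end_block] = block
       return (start_block, end_block)  # range now READ_WRITE_DATA
   ```

   Note that the function writes only whole blocks, as required for
   INVALID_DATA extents, and returns the range the client would report
   as READ_WRITE_DATA in LAYOUTCOMMIT.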
2.1.3. Extents are Permissions

   Layout extents returned to pNFS clients grant permission to read or
   write; READ_DATA and NONE_DATA are read-only (NONE_DATA reads as
   zeroes), while READ_WRITE_DATA and INVALID_DATA are read/write
   (INVALID_DATA reads as zeros; any write converts it to
   READ_WRITE_DATA).  This is the only client means of obtaining
   permission to perform direct I/O to storage devices; a pNFS client
   MUST NOT perform direct I/O operations that are not permitted by an
   extent held by the client.  Client adherence to this rule places
   the pNFS server in control of potentially conflicting storage
   device operations, enabling the server to determine what does
   conflict and how to avoid conflicts by granting and recalling
   extents to/from clients.

   Block/volume class storage devices are not required to perform read
   and write operations atomically.  Overlapping concurrent read and
   write operations to the same data may cause the read to return a
   mixture of before-write and after-write data.  Overlapping write
   operations can be worse, as the result could be a mixture of data
   from the two write operations; this can be particularly nasty if
   the underlying storage is striped and the operations complete in
   different orders on different stripes.  A pNFS server can avoid
   these conflicts by implementing a single-writer XOR
   multiple-readers concurrency control policy when there are multiple
   clients that wish to access the same data.  This policy SHOULD be
   implemented when storage devices do not provide atomicity for
   concurrent read/write and write/write operations to the same data.

   A client that makes a layout request that conflicts with an
   existing layout delegation will be rejected with the error
   NFS4ERR_LAYOUTTRYLATER.  This client is then expected to retry the
   request after a short interval.
   During this interval the server needs to recall the conflicting
   portion of the layout delegation from the client that currently
   holds it.  This reject-and-retry approach does not prevent client
   starvation when there is contention for the layout of a particular
   file.  For this reason a pNFS server SHOULD implement a mechanism
   to prevent starvation.  One possibility is for the server to
   maintain a queue of rejected layout requests.  Each new layout
   request can be checked to see if it conflicts with a previously
   rejected request, and if so, the newer request can be rejected.
   Once the original requesting client retries its request, its entry
   in the rejected request queue can be cleared, or the entry can be
   removed when it reaches a certain age.

   NFSv4 supports mandatory locks and share reservations.  These are
   mechanisms that clients can use to restrict the set of I/O
   operations that are permissible to other clients.  Since all I/O
   operations ultimately arrive at the NFSv4 server for processing,
   the server is in a position to enforce these restrictions.
   However, with pNFS layout delegations, I/Os will be issued from the
   clients that hold the delegations directly to the storage devices
   that host the data.  These devices have no knowledge of files,
   mandatory locks, or share reservations, and are not in a position
   to enforce such restrictions.  For this reason the NFSv4 server
   must not grant layout delegations that conflict with mandatory
   locks or share reservations.  Further, if a conflicting mandatory
   lock request or a conflicting open request arrives at the server,
   the server must recall the part of the layout delegation in
   conflict with the request before processing the request.
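   The single-writer XOR multiple-readers policy described above can
   be sketched as follows.  This Python sketch is illustrative only:
   the class and method names are assumptions, and a real server would
   answer NFS4ERR_LAYOUTTRYLATER and recall the conflicting portions
   rather than merely report them.

   ```python
   def ranges_overlap(a_off, a_len, b_off, b_len):
       """True if the two byte ranges intersect."""
       return a_off < b_off + b_len and b_off < a_off + a_len

   class LayoutState:
       """Per-file record of granted layout ranges (illustrative)."""

       def __init__(self):
           self.granted = []  # (client, offset, length, writable)

       def request(self, client, offset, length, writable):
           """Grant a layout, or report the conflicting grants.

           Returns None on success; otherwise the list of grants the
           server would have to recall before this request can
           succeed (the requester retries after a short interval).
           """
           conflicts = [g for g in self.granted
                        if g[0] != client
                        and ranges_overlap(g[1], g[2], offset, length)
                        # read/read never conflicts; any writer does
                        and (writable or g[3])]
           if conflicts:
               return conflicts
           self.granted.append((client, offset, length, writable))
           return None
   ```

   With this policy, any number of readers may share a range, but a
   writer excludes all other clients from the overlapping range.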
2.2. Volume Identification

   Storage Systems such as storage arrays can have multiple physical
   network ports that need not be connected to a common network,
   resulting in a pNFS client having simultaneous multipath access to
   the same storage volumes via different ports on different networks.
   The networks may not even be the same technology - for example,
   access to the same volume via both iSCSI and Fibre Channel is
   possible - hence network addresses are difficult to use for volume
   identification.  For this reason, this pNFS block layout identifies
   storage volumes by content, for example providing the means to
   match (unique portions of) labels used by volume managers.  Any
   block pNFS system using this layout MUST support a means of
   content-based unique volume identification that can be employed via
   the data structure given here.

   A volume is content-identified by a disk signature, made up of
   extents within blocks and the contents that must match at those
   locations.

   block_device_addr_list - A list of the disk signatures for the
   physical volumes on which the file system resides.  This is a list
   of a variable number of diskSigInfo structures.  This is the
   device_addr_list<> as returned by GETDEVICELIST in [PNFS].

   typedef diskSigInfo block_device_addr_list;
                                      /* disk signature info */

   where diskSigInfo is:

   struct diskSigInfo {               /* used in DISK_SIGNATURE */
       diskSig        ds;             /* disk signature */
       pnfs_deviceid4 volume_ID;      /* volume ID the server will
                                         use in extents. */
   };

   where diskSig is defined as:

   typedef sigComp diskSig;

   struct sigComp {           /* disk signature component */
       offset4 sig_offset;    /* byte offset of component */
       length4 sig_length;    /* byte length of component */
       sigCompContents contents;
                              /* contents of this component of the
                                 signature (this is opaque) */
   };

   sigCompContents MUST NOT be interpreted as a zero-terminated
   string, as it may contain embedded zero-valued octets.
   It contains sig_length octets.  There are no restrictions on
   alignment (e.g., neither sig_offset nor sig_length is required to
   be a multiple of 4).

3. Operations Issues

   NOTE: This section and its subsections are preserved for
   historical/review purposes only, as the [PNFS] draft has addressed
   all of these issues.  The section and all subsections will be
   deleted in the next version of this draft.

   This section collects issues in the operations draft encountered in
   writing this block/volume layout draft.  Most of these issues are
   expected to be resolved in draft-welch-pnfs-ops-03.txt.

   1. RESOLVED: LAYOUTGET provides minimum and desired (max) lengths
      to the server.

   2. RESOLVED: Layouts are managed by offset and range; they are no
      longer treated as indivisible objects.

   3. RESOLVED: There is a callback for the server to convey a new EOF
      to the client.

   4. RESOLVED: HighRoad supports three types of layout recalls beyond
      range recalls: "everything in a file", "everything in a list of
      files", and "everything in a filesystem".  The first and third
      are supported in [PNFS] (set offset to zero and length to all
      1's for everything in a file - [PNFS] implies this, but isn't
      explicit about it), and the second one can probably be done as a
      COMPOUND with reasonable effectiveness.  LAYOUTRETURN supports
      return of everything in a file in a similar fashion (offset of
      zero, length of all 1's).

   5. RESOLVED: Access and Modify time behavior.  The LAYOUTCOMMIT
      operation sets both Access and Modify times.  LAYOUTRETURN
      cannot set either time - use a SETATTR in a COMPOUND to do this
      (Q: Can this inadvertently make time run backwards?).

   6. RESOLVED: The disk signature approach to volume identification
      appears to be supportable via the opaque pnfs_devaddr4 union
      element.

   7. RESOLVED: The LAYOUTCOMMIT operation has no LAYOUTRETURN side
      effects in -03.
      If it ever did, they were not intended.

3.1. Layout Operation Ordering Considerations

   This deserves its own subsection because there is some serious
   subtlety here.

   In contrast to NFSv4 callbacks that expect immediate responses,
   HighRoad layout callback responses are delayed to allow the client
   to perform any required commits, etc., prior to responding to the
   callback.  This allows the reply to the callback to serve as an
   implicit return of the recalled range or ranges and to tell the
   server that all callback-related processing has been completed by
   the client.  For consistency, pNFS should use the NFSv4 callback
   approach in which immediate responses are expected.  As a result,
   all returns of layout ranges MUST be explicit.

3.1.1. Client Side Considerations

   Consider a pNFS client that has issued a LAYOUTGET and then
   receives an overlapping recall callback for the same file.  There
   are two possibilities, which the client cannot distinguish when the
   callback arrives:

   1. The server processed the LAYOUTGET before issuing the recall,
      so the LAYOUTGET response is in flight, and must be waited for
      because it may be carrying layout info that will need to be
      returned to deal with the recall callback.

   2. The server issued the callback before receiving the LAYOUTGET.
      The server will not respond to the LAYOUTGET until the recall
      callback is processed.

   This can cause deadlock, as the client must wait for the LAYOUTGET
   response before processing the recall in the first case, but that
   response will not arrive until after the recall is processed in the
   second case.  The deadlock is avoided via a simple rule:

   RULE: A LAYOUTGET MUST be rejected with an error if there's an
      overlapping outstanding recall callback to the same client.  The
      client MUST process the outstanding recall callback before
      retrying the LAYOUTGET.
   Now the client can wait for the LAYOUTGET response because it will
   come in both cases.  This RULE also applies to the callback to send
   an updated EOF to the client.

   The resulting situation is still less than desired, because
   issuance of a recall callback indicates a conflict and potential
   contention at the server, so recall callbacks should be processed
   as fast as possible by clients.  In the second case, if the client
   knows that the LAYOUTGET will be rejected, it is beneficial for the
   client to process the recall immediately without waiting for the
   LAYOUTGET rejection.  To do so without added client complexity, the
   server needs to reject the LAYOUTGET even if it arrives at the
   server after the client operations that process the recall
   callback; if the client still wants that layout, it can reissue the
   LAYOUTGET.

   HighRoad uses the equivalent of a per-file layout stateid to enable
   this optimization.  The layout stateid increments on each layout
   operation completion and callback issuance, and the current value
   of the layout stateid is sent in every operation response and every
   callback.  If the initial layout stateid value is N, then in the
   first case above, the recall callback carries stateid N+2,
   indicating that the LAYOUTGET response is carrying N+1 and hence
   has to be waited for.  In the second case above, the recall
   callback carries layout stateid N+1, indicating that the LAYOUTGET
   will be rejected with a stale layout stateid (N, where N+1 or
   greater is current) whenever it arrives, and hence the callback can
   be processed immediately.  This per-file layout stateid approach
   entails prohibiting concurrent callbacks for the same file to the
   same client, as server issuance of a new callback could cause stale
   layout stateid errors for operations that the client is performing
   to deal with an earlier recall callback.
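   The layout stateid reasoning above can be sketched as follows.
   This Python fragment is illustrative only; the function and its
   names are assumptions for the example, not part of HighRoad FMP or
   [PNFS].

   ```python
   def on_recall(last_seen_sid, cb_sid, layoutget_outstanding):
       """Client decision on receiving a recall callback.

       last_seen_sid         -- layout stateid N from the client's
                                last completed layout operation
       cb_sid                -- layout stateid carried by the callback
       layoutget_outstanding -- True if a LAYOUTGET is in flight

       If the callback carries N+2, the server processed the LAYOUTGET
       first: its response carries N+1 and may hold layout ranges that
       must be returned, so the client must wait for it.  If the
       callback carries N+1, the LAYOUTGET will be rejected with a
       stale layout stateid, so the recall can be processed now.
       """
       if layoutget_outstanding and cb_sid == last_seen_sid + 2:
           return "wait-for-layoutget"
       return "process-recall-now"
   ```

   The same comparison explains why concurrent callbacks for the same
   file would be ambiguous: a second callback would advance the
   stateid again and invalidate the client's arithmetic.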
   ISSUE: Does restricting all pNFS client operations on the same file
   to a single session help?

3.1.2. Server Side Considerations

   Consider a related situation from the pNFS server's point of view.
   The server has issued a recall callback and receives an overlapping
   LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond
   to the recall callback.  Again, there are two cases:

   1. The client issued the LAYOUTGET before processing the recall
      callback.  The LAYOUTGET MUST be rejected according to the RULE
      in the previous subsection.

   2. The client issued the LAYOUTGET after processing the recall
      callback, but it arrived before the LAYOUTRETURN that completed
      that processing.

   The simplest approach is to apply the RULE and always reject the
   overlapping LAYOUTGET.  The client has two ways to avoid this
   result - it can issue the LAYOUTGET as a subsequent element of a
   COMPOUND containing the LAYOUTRETURN that completes the recall
   callback, or it can wait for the response to that LAYOUTRETURN.

   This leads to a more general problem: in the absence of a callback,
   if a client issues concurrent overlapping LAYOUTGET and
   LAYOUTRETURN operations, it is possible for the server to process
   them in either order.  HighRoad forbids a client from doing this,
   as the per-file layout stateid will cause one of the two operations
   to be rejected with a stale layout stateid.  This approach is
   simpler and produces better results by comparison to allowing
   concurrent operations, at least for this sort of conflict case,
   because server execution of operations in an order not anticipated
   by the client may produce results that are not useful to the client
   (e.g., if a LAYOUTRETURN is followed by a concurrent overlapping
   LAYOUTGET, but executed in the other order, the client will not
   retain layout extents for the overlapping range).
Recall Callback Completion and Robustness Concerns 608 The discussion of layout operation ordering implicitly assumed that 609 any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that 610 match the range in the callback. This envisions that the pNFS client 611 state for a file matches the pNFS server state for that file and 612 client regarding layout ranges and permissions. That may not be the 613 best design assumption because: 615 1. It may be useful for clients to be able to discard layout 616 information without calling LAYOUTRETURN. If conflicts that 617 require callbacks are rare, and a server can use a multi-file 618 callback to recover per-client resources (e.g., via a multi-file 619 recall operation based on some sort of LRU), the result may be 620 significantly less client-server pNFS traffic. 622 2. It may be similarly useful for servers to track layout ranges for 623 a client that are broader than what the client actually holds. 624 In the extreme, a server could manage conflicts 625 on a per-file basis, only issuing whole-file callbacks even though 626 clients may request and be granted sub-file ranges. 628 3. The synchronized state assumption is not robust to minor errors. 629 A more robust design would allow for divergence between client and 630 server and the ability to recover. To avoid errors, it is vital 631 that a client not assign itself layout permissions beyond what the 632 server has granted, and that the server not forget layout 633 permissions that it has granted. On the other hand, if a server 634 believes that a client holds an extent that the client doesn't 635 know about, it's useful for the client to be able to issue the 636 LAYOUTRETURN that the server is expecting in response to a recall. 638 At a minimum, in light of the above, it is useful for a server to be 639 able to issue callbacks for layout ranges it has not granted to a 640 client, and for a client to return ranges it does not hold.
This 641 leads to a couple of requirements: 643 A pNFS client's final operation in processing a recall callback 644 SHOULD be a LAYOUTRETURN whose range matches that in the callback. 645 If the pNFS client holds no layout permissions in the range that 646 has been recalled, it MUST respond with a LAYOUTRETURN whose range 647 matches that in the callback. 649 This avoids any need for callback cookies (server to client) that 650 would have to be returned to indicate recall callback completion. 652 For a callback to set EOF, the client MUST logically apply the new 653 EOF before issuing the response to the callback, and MUST NOT issue 654 any other pNFS operations before responding to the callback. 656 ISSUE: HighRoad FMP also requires that LAYOUTCOMMIT operations be 657 stalled at the server while an EOF callback is outstanding. 659 3.3. Crash Recovery Issues 661 Client recovery for layout delegations works in much the same way as 662 NFSv4 client recovery for other lock/delegation state. When an NFSv4 663 client reboots, it will lose all information about the layout 664 delegations that it previously owned. There are two methods by which 665 the server can reclaim these resources and begin providing them to 666 other clients. The first is through the expiry of the client's 667 lock/delegation lease. If the client recovery time is longer than 668 the lease period, the client's lock/delegation lease will expire and 669 the server will know to reclaim any state held by the client. On the 670 other hand, the client may recover in less time than it takes for the 671 lease period to expire. In such a case, the client will be required 672 to contact the server through the standard SETCLIENTID protocol. The 673 server will find that the client's id matches the id of the previous 674 client invocation, but that the verifier is different. The server 675 uses this as a signal to reclaim all the state associated with the 676 client's previous invocation. 
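The two server-side reclaim paths above can be sketched as follows. This is an illustrative sketch under stated assumptions: the record structure, function names, and lease period are hypothetical, and a real server would use NFSv4's full SETCLIENTID/SETCLIENTID_CONFIRM machinery rather than a bare verifier comparison.

```python
# Hypothetical sketch of the two reclaim paths described above:
# (1) lease expiry, and (2) a SETCLIENTID whose client id matches a
# previous invocation but whose verifier differs, signaling a reboot.

LEASE_PERIOD = 90.0  # seconds; value is illustrative, not normative

class ClientRecord:
    """Minimal per-client state the server tracks for this decision."""
    def __init__(self, verifier: bytes, last_renewal: float):
        self.verifier = verifier          # boot-instance verifier
        self.last_renewal = last_renewal  # time of last lease renewal

def lease_expired(rec: ClientRecord, now: float) -> bool:
    # First path: the client did not recover within the lease period,
    # so the server knows to reclaim all state held by the client.
    return now - rec.last_renewal > LEASE_PERIOD

def setclientid(rec: ClientRecord, verifier: bytes) -> str:
    # Second path: same client id, different verifier means the client
    # rebooted; reclaim the previous invocation's state.
    if verifier != rec.verifier:
        return "reclaim-previous-state"
    return "same-invocation"
```

The verifier comparison is what lets a quickly-rebooting client be distinguished from one that merely went quiet: in both cases the client id is unchanged, but only a reboot produces a new verifier.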
678 The server recovery case is slightly more complex. In general, the 679 recovery process will again follow the standard NFSv4 recovery model: 680 the client will discover that the server has rebooted when it 681 receives an unexpected STALE_STATEID or STALE_CLIENTID reply from the 682 server; it will then proceed to try to reclaim its previous 683 delegations during the server's recovery grace period. However, 684 there is an important safety concern associated with layout 685 delegations that does not come into play in the standard NFSv4 case. 686 If a standard NFSv4 client makes use of a stale delegation, the 687 consequence could be to deliver stale data to an application. 688 However, a pNFS layout delegation enables the client to directly 689 access the file system storage; if this access is not properly 690 managed by the NFSv4 server, the client can potentially corrupt the 691 file system data or metadata. 693 Thus it is vitally important that the client discover that the server 694 has rebooted as soon as possible, and that the client stop using 695 stale layout delegations before the server gives the delegations away 696 to other clients. To ensure this, the client must be implemented so 697 that layout delegations are never used to access the storage after 698 the client's lease timer has expired. This prohibition applies to 699 all accesses, especially the flushing of dirty data to storage. If 700 the client's lease timer expires because the client could not contact 701 the server for any reason, the client MUST immediately stop using the 702 layout delegation until the server can be contacted and the 703 delegation can be officially recovered or reclaimed. 705 3.4. Additional Features - Not Needed or Recommended 707 This subsection is a place to record things that existing SAN or 708 clustered filesystems do that aren't needed or recommended for pNFS: 710 o Callback for write-to-read downgrade.
Writers tend to want to 711 remain writers, so this feature may not be very useful. 713 o HighRoad FMP implements several frequently used operation 714 combinations as single RPCs for efficiency; these can be 715 effectively handled by NFSv4 COMPOUNDs. One subtle difference is 716 that a single RPC is treated as a single operation, whereas NFSv4 717 COMPOUNDs are not atomic in any sense. This can result in 718 operation ordering subtleties, e.g., having to set the new EOF 719 *before* returning the layout extent that contains the new EOF, 720 even within a single COMPOUND. 722 o Queued request support. The HighRoad FMP protocol specification 723 allows the server to return an "operation blocked" result code 724 with a cookie that is later passed to the client in an "it's done 725 now" callback. This has not proven to be of great use compared to 726 having the client retry with some sort of back-off. Recommendations 727 on how to back off should be added to the ops draft. 729 o Additional client and server crash detection mechanisms. As a 730 separate protocol, HighRoad FMP had to handle this on its own. As 731 an NFSv4 extension, NFSv4's SETCLIENTID, STALE_CLIENTID, and 732 STALE_STATEID mechanisms combined with implicit lease renewal and 733 (per-file) layout stateids should be sufficient for pNFS. 735 4. Security Considerations 737 Certain security responsibilities are delegated to pNFS clients. 738 Block/volume storage systems generally control access at a volume 739 granularity, and hence pNFS clients have to be trusted to only 740 perform accesses allowed by the layout extents they currently hold 741 (e.g., not to access storage for files on which no layout extent is 742 held). This also has implications for some NFSv4 functionality 743 outside pNFS. For instance, if a file is covered by a mandatory 744 read-only lock, the server can ensure that only read layout 745 delegations for the file are granted to pNFS clients.
However, it is 746 up to each pNFS client to ensure that the read layout delegation is 747 used only to service read requests, and not to allow writes to the 748 existing parts of the file. Since block/volume storage systems are 749 generally not capable of enforcing such file-based security, in 750 environments where pNFS clients cannot be trusted to enforce such 751 policies, block/volume-based pNFS SHOULD NOT be used. 753 761 5. Conclusions 763 765 6. IANA Considerations 767 There are no IANA considerations in this document. All pNFS IANA 768 Considerations are covered in [PNFS]. 770 7. Revision History 772 -00: Initial Version 774 -01: Rework discussion of extents as locks to talk about extents 775 granting access permissions. Rewrite operation ordering section to 776 discuss deadlocks and races that can cause problems. Add new section 777 on recall completion. Add client copy-on-write based on text from 778 Craig Everhart. 780 -02: Fix glitches in extent state descriptions. Describe most issues 781 as RESOLVED. Most of Section 3 has been incorporated into the [PNFS] 782 draft; add NOTE to that effect and say that it will be deleted in the 783 next version of this draft (which should be a draft-ietf-nfsv4 784 draft). Cleanup of a number of things has been left to that draft 785 revision, including the interlocks with the types in [PNFS], layout 786 striping support, and finishing the Security Considerations section. 788 8. Acknowledgments 790 This draft draws extensively on the authors' familiarity with the 791 mapping functionality and protocol in EMC's HighRoad system. The 792 protocol used by HighRoad is called FMP (File Mapping Protocol); it 793 is an add-on protocol that runs in parallel with filesystem protocols 794 such as NFSv3 to provide pNFS-like functionality for block/volume 795 storage.
While this draft draws on HighRoad FMP, its data structures and 796 functional considerations differ in significant ways, 797 based on lessons learned and the opportunity to take advantage of 798 NFSv4 features such as COMPOUND operations. The design to support 799 pNFS client participation in copy-on-write is based on text and ideas 800 contributed by Craig Everhart of IBM. 802 9. References 804 9.1. Normative References 806 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 807 Requirement Levels", BCP 14, RFC 2119, March 1997. 809 [PNFS] Goodson, G., et al., "NFSv4 pNFS Extensions", draft-ietf- 810 nfsv4-pnfs-00.txt, Work in Progress, October 2005. 812 TODO: Need to reference RFC 3530. 814 9.2. Informative References 816 OPEN ISSUE: HighRoad and/or SAN.FS references? 818 Author's Addresses 820 David L. Black 821 EMC Corporation 822 176 South Street 823 Hopkinton, MA 01748 825 Phone: +1 (508) 293-7953 826 Email: black_david@emc.com 828 Stephen Fridella 829 EMC Corporation 830 32 Coslin Drive 831 Southboro, MA 01772 833 Phone: +1 (508) 305-8512 834 Email: fridella_stephen@emc.com 836 Intellectual Property Statement 838 The IETF takes no position regarding the validity or scope of any 839 Intellectual Property Rights or other rights that might be claimed to 840 pertain to the implementation or use of the technology described in 841 this document or the extent to which any license under such rights 842 might or might not be available; nor does it represent that it has 843 made any independent effort to identify any such rights. Information 844 on the procedures with respect to rights in RFC documents can be 845 found in BCP 78 and BCP 79.
847 Copies of IPR disclosures made to the IETF Secretariat and any 848 assurances of licenses to be made available, or the result of an 849 attempt made to obtain a general license or permission for the use of 850 such proprietary rights by implementers or users of this 851 specification can be obtained from the IETF on-line IPR repository at 852 http://www.ietf.org/ipr. 854 The IETF invites any interested party to bring to its attention any 855 copyrights, patents or patent applications, or other proprietary 856 rights that may cover technology that may be required to implement 857 this standard. Please address the information to the IETF at ietf- 858 ipr@ietf.org. 860 Disclaimer of Validity 862 This document and the information contained herein are provided on an 863 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 864 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 865 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 866 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 867 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 868 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 870 Copyright Statement 872 Copyright (C) The Internet Society (2005). 874 This document is subject to the rights, licenses and restrictions 875 contained in BCP 78, and except as set forth therein, the authors 876 retain all their rights. 878 Acknowledgment 880 Funding for the RFC Editor function is currently provided by the 881 Internet Society.