From 71f2016aa26a5ae6c800a6a022e4bc6c6068ce9a Mon Sep 17 00:00:00 2001 From: Vishnu Kannan Date: Sun, 29 Jan 2017 20:50:55 -0800 Subject: [PATCH 01/19] Adding a proposal for managing local storage Signed-off-by: Vishnu Kannan --- .../local-storage-overview.md | 356 ++++++++++++++++++ 1 file changed, 356 insertions(+) create mode 100644 contributors/design-proposals/local-storage-overview.md diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md new file mode 100644 index 00000000000..ce8b08d1ddf --- /dev/null +++ b/contributors/design-proposals/local-storage-overview.md @@ -0,0 +1,356 @@ +# Local Storage Management +Authors: vishh@, msau42@ + +This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high level design overview for managing most user workflows. More detailed design and implementation will be added once the community agrees with the high level design presented here. + +# Goals +* Enable ephemeral & durable access to local storage +* Support storage requirements for all workloads supported by Kubernetes +* Provide flexibility for users/vendors to utilize various types of storage devices +* Define a standard partitioning scheme for storage drives for all Kubernetes nodes +* Provide storage usage isolation for shared partitions +* Support random access storage devices only + +# Non Goals +* Provide isolation for all partitions. Isolation will not be of concern for most partitions since they are not expected to be shared. +* Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms. + +# Use Cases + +## Ephemeral Local Storage +Today, ephemeral local storage is exposed to pods via the container’s writable layer, logs directory, and EmptyDir volumes. Pods use ephemeral local storage for scratch space, caching and logs. There are many issues related to the lack of local storage accounting and isolation, including: + +* Pods do not know how much local storage is available to them. +* Pods cannot request “guaranteed” local storage. +* Local storage is a “best-effort” resource +* Pods can get evicted due to other pods filling up the local storage during which time no new pods will be admitted, until sufficient storage has been reclaimed + +## Persistent Local Storage +Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors: + +* Performance: On cloud providers, local SSDs give better performance than remote disks. +* Cost: On baremetal, in addition to performance, local storage is typically cheaper and using it is a necessity to provision distributed filesystems. + +Distributed systems often use replication to provide fault tolerance, and can therefore tolerate node failures. However, data gravity is preferred for reducing replication traffic and cold startup latencies. + +# Design Overview + +A node’s local storage can be broken into primary and secondary partitions. Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are: + +### Root + This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. 
Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition. + +### Runtime +This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. + +All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. + +# User Workflows + +### Alice manages a deployment and requires “Guaranteed” ephemeral storage + +1. Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary “root” partition. The runtime partition is an implementation detail and is not exposed outside the node. +```yaml +apiVersion: v1 +kind: Node +metadata: + name: foo +status: + Capacity: + Storage: 100Gi + Allocatable: + Storage: 90Gi +``` +2. Alice adds a “Storage” requirement to her pod as follows +```yaml +apiVersion: v1 +kind: pod +metadata: + name: foo +spec: + containers: + name: fooc + resources: + limits: + storage-logs: 500Mi + storage-overlay: 1Gi + volumes: + name: myEmptyDir + emptyDir: + capacity: 20Gi +``` +3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. +4. Alice’s pod is not provided any IO guarantees +5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi +6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation. +7. With hard limits, containers will receive a ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet. +8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. +9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. + +### Bob runs batch workloads and is unsure of “storage” requirements + +1. Bob can create pods without any “storage” resource requirements. +```yaml +apiVersion: v1 +kind: pod +metadata: + name: foo + namespace: myns +spec: + containers: + name: fooc + volumes: + name: myEmptyDir + emptyDir: +``` +2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage. +```yaml +apiVersion: v1 +kind: LimitRange +metadata: + name: mylimits +spec: + - default: + storage-logs: 200Mi + Storage-overlay: 200Mi + type: Container + - default: + storage: 1Gi + type: Pod +``` +3. 
The limit range will update the pod specification as follows: +```yaml +apiVersion: v1 +kind: pod +metadata: + name: foo +spec: + containers: + name: fooc + resources: + limits: + storage-logs: 200Mi + storage-overlay: 200Mi + volumes: + name: myEmptyDir + emptyDir: + capacity: 1Gi +``` +4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. +5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes. +```yaml +apiVersion: v1 +kind: pod +metadata: + name: foo +spec: + containers: + name: fooc + resources: + requests: + storage-logs: 500Mi + storage-overlay: 500Mi + volumes: + name: myEmptyDir + emptyDir: + capacity: 2Gi +``` +6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. + +### Alice manages a Database which needs access to “durable” and fast scratch space + +1. Cluster administrator provisions machines with local SSDs and brings up the cluster +2. When a new node instance starts up, a addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium. +```yaml +kind: PersistentVolume +apiVersion: v1 +metadata: + name: foo + annotations: + storage.kubernetes.io/node: k8s-node + labels: + storage.kubernetes.io/medium: ssd +spec: + volume-type: local + storage-type: filesystem + capacity: + storage: 100Gi + hostpath: + path: /var/lib/kubelet/storage-partitions/foo + accessModes: + - ReadWriteOnce + persistentVolumeReclaimPolicy: Delete +``` +```yaml +$ kubectl get pv +NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE +local-pv-1 375Gi RWO Delete Available gke-mycluster-1 +local-pv-2 375Gi RWO Delete Available gke-mycluster-1 +local-pv-1 375Gi RWO Delete Available gke-mycluster-2 +local-pv-2 375Gi RWO Delete Available gke-mycluster-2 +local-pv-1 375Gi RWO Delete Available gke-mycluster-3 +local-pv-2 375Gi RWO Delete Available gke-mycluster-3 +``` +3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy. +4. Alice creates a StatefulSet that uses local PVCs +```yaml +apiVersion: apps/v1beta1 +kind: StatefulSet +metadata: + name: web +spec: + serviceName: "nginx" + replicas: 3 + template: + metadata: + labels: + app: nginx + spec: + terminationGracePeriodSeconds: 10 + containers: + - name: nginx + image: gcr.io/google_containers/nginx-slim:0.8 + ports: + - containerPort: 80 + name: web + volumeMounts: + - name: www + mountPath: /usr/share/nginx/html + - name: log + mountPath: /var/log/nginx + volumeClaimTemplates: + - metadata: + name: www + labels: + storage.kubernetes.io/medium: local-ssd + storage.kubernetes.io/volume-type: local + spec: + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi + - metadata: + name: log + labels: + storage.kubernetes.io/medium: hdd + storage.kubernetes.io/volume-type: local + spec: + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi +``` +5. 
The scheduler identifies nodes for each pod that can satisfy its cpu, memory and storage requirements and that also contain free local PVs to satisfy the pod’s PVCs. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node.
```
$ kubectl get pvc
NAME            STATUS  VOLUME      CAPACITY  ACCESSMODES … NODE
www-local-pvc-1 Bound   local-pv-1  375Gi     RWO           gke-mycluster-1
www-local-pvc-2 Bound   local-pv-1  375Gi     RWO           gke-mycluster-2
www-local-pvc-3 Bound   local-pv-1  375Gi     RWO           gke-mycluster-3
log-local-pvc-1 Bound   local-pv-2  375Gi     RWO           gke-mycluster-1
log-local-pvc-2 Bound   local-pv-2  375Gi     RWO           gke-mycluster-2
log-local-pvc-3 Bound   local-pv-2  375Gi     RWO           gke-mycluster-3
```
```
$ kubectl get pv
NAME        CAPACITY … STATUS  CLAIM            NODE
local-pv-1  375Gi      Bound   www-local-pvc-1  gke-mycluster-1
local-pv-2  375Gi      Bound   log-local-pvc-1  gke-mycluster-1
local-pv-1  375Gi      Bound   www-local-pvc-2  gke-mycluster-2
local-pv-2  375Gi      Bound   log-local-pvc-2  gke-mycluster-2
local-pv-1  375Gi      Bound   www-local-pvc-3  gke-mycluster-3
local-pv-2  375Gi      Bound   log-local-pvc-3  gke-mycluster-3
```
6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful pods are expected to have a high enough priority that they will preempt lower-priority pods if necessary to run on a specific node.
7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods.
8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs, once released, will not be added back to the cluster until the corresponding local storage device is healthy.
9. Once Alice decides to delete the database, the PVCs are expected to get deleted by the StatefulSet. The PVs will then get recycled and deleted, and the addon adds them back to the cluster.

### Bob manages a distributed filesystem which needs access to all available storage on each node

1. The cluster that Bob is using is provisioned with nodes that contain one or more secondary partitions.
2. The cluster administrator runs a DaemonSet addon that discovers secondary partitions across all nodes and creates corresponding PVs for them.
3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices become unhealthy.
4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it.
5. The operator will identify all the nodes that it can schedule pods onto and discover the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for databases, for example).
6. The operator will then create PVCs and manually bind them to individual local PVs across all its nodes, as sketched after this list.
7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority, and have them use all the PVCs created by the Operator on those nodes.
8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used.
9. If a PV gets tainted as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures.
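
One way the operator could express the manual binding in step 6 is by setting `volumeName` on the claims it creates, which is the existing PVC mechanism for requesting a specific PV. A minimal sketch, with an illustrative claim name and assuming the operator has discovered a PV named `local-pv-1` on one of its nodes:

```yaml
# Illustrative PVC created by the operator; the claim name is hypothetical.
# spec.volumeName points the claim at the specific local PV the operator
# discovered, instead of letting the PV controller pick an arbitrary match.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myfs-data-1
spec:
  volumeName: local-pv-1   # the discovered local PV on the target node
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 375Gi       # matches the capacity of the discovered PV
```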

### Bob manages a specialized application that needs access to Block level storage

1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory, and creates corresponding PVs for them with a ‘StorageType’ of ‘Block’.
```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: foo
  annotations:
    storage.kubernetes.io/node: k8s-node
  labels:
    storage.kubernetes.io/medium: hdd
spec:
  volume-type: local
  storage-type: block
  capacity:
    storage: 100Gi
  hostpath:
    path: /var/lib/kubelet/storage-raw-devices/foo
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
```
3. Bob creates a pod with a PVC that requests block level access, and, similar to the StatefulSet scenario, the scheduler will identify nodes that can satisfy the pod’s request.
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
  labels:
    storage.kubernetes.io/medium: ssd
spec:
  volume-type: local
  storage-type: block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```
*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*

# Open Questions & Discussion points
* Single vs split “limit” for storage across writable layer and logs
* Local Persistent Volume bindings happening in the scheduler vs in the PV controller
  * Should the PV controller fold into the scheduler?
* Supporting dedicated partitions for logs and volumes in Kubelet, in addition to the runtime overlay filesystem
  * This complicates the kubelet. Not sure what value it adds to end users.
* Providing IO isolation for ephemeral storage
  * IO isolation is difficult. Use local PVs for performance.
* Support for encrypted partitions
* Do applications need access to performant local storage for ephemeral use cases? For example, a pod requesting local SSDs for use as ephemeral scratch space.
  * Typically referred to as “inline PVs” in kube land
* Should block level storage devices be auto formatted to be used as file level storage?
  * Flexibility vs complexity
  * Do we really need this?
* Repair/replace scenarios
  * What are the implications of removing a disk and replacing it with a new one?
  * We may not do anything in the system, but may need a special workflow

# Recommended Storage best practices

* Have the primary partition on a reliable storage device
* Consider using RAID and SSDs (for performance)
* Partition the rest of the storage devices based on the application needs
  * SSDs can be statically partitioned and they might still meet IO requirements of apps.
  * TODO: Identify common durable storage requirements for most databases
* Avoid having multiple logical partitions on hard drives to avoid IO isolation issues
* Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted
* The runtime partition for overlayfs is optional. You do not **need** one.
* Alert on primary partition failures and act on them immediately. Primary partition failures will render your node unusable.
* Use EmptyDir for all scratch space requirements of your apps.
* Make the container’s writable layer `readonly` if possible.
* Another option is to keep the writable layer on tmpfs.
Such a setup will allow you to eventually migrate from using local storage for anything but super fast caching purposes or distributed databases leading to higher reliability & uptime for nodes. + From c034e02c7c628a15ddb25f503cea0b05fec0a5fa Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Mon, 30 Jan 2017 11:34:46 -0800 Subject: [PATCH 02/19] Update pv workflow example --- .../local-storage-overview.md | 468 +++++++++--------- 1 file changed, 246 insertions(+), 222 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index ce8b08d1ddf..caf802db0f5 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -35,7 +35,10 @@ Distributed systems often use replication to provide fault tolerance, and can th # Design Overview -A node’s local storage can be broken into primary and secondary partitions. Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are: +A node’s local storage can be broken into primary and secondary partitions. + +## Primary Partitions +Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are: ### Root This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition. @@ -43,6 +46,7 @@ A node’s local storage can be broken into primary and secondary partitions. P ### Runtime This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. +## Secondary Partitions All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. # User Workflows @@ -50,35 +54,39 @@ All other partitions are exposed as persistent volumes. The PV interface allows ### Alice manages a deployment and requires “Guaranteed” ephemeral storage 1. Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary “root” partition. The runtime partition is an implementation detail and is not exposed outside the node. -```yaml -apiVersion: v1 -kind: Node -metadata: - name: foo -status: - Capacity: - Storage: 100Gi - Allocatable: - Storage: 90Gi -``` + + ```yaml + apiVersion: v1 + kind: Node + metadata: + name: foo + status: + Capacity: + Storage: 100Gi + Allocatable: + Storage: 90Gi + ``` + 2. 
Alice adds a “Storage” requirement to her pod as follows -```yaml -apiVersion: v1 -kind: pod -metadata: - name: foo -spec: - containers: - name: fooc - resources: - limits: - storage-logs: 500Mi - storage-overlay: 1Gi - volumes: - name: myEmptyDir - emptyDir: - capacity: 20Gi -``` + + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + name: fooc + resources: + limits: + storage-logs: 500Mi + storage-overlay: 1Gi + volumes: + name: myEmptyDir + emptyDir: + capacity: 20Gi + ``` + 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. 4. Alice’s pod is not provided any IO guarantees 5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi @@ -90,177 +98,189 @@ spec: ### Bob runs batch workloads and is unsure of “storage” requirements 1. Bob can create pods without any “storage” resource requirements. -```yaml -apiVersion: v1 -kind: pod -metadata: - name: foo - namespace: myns -spec: - containers: - name: fooc - volumes: - name: myEmptyDir - emptyDir: -``` + + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + namespace: myns + spec: + containers: + name: fooc + volumes: + name: myEmptyDir + emptyDir: + ``` + 2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage. -```yaml -apiVersion: v1 -kind: LimitRange -metadata: - name: mylimits -spec: - - default: - storage-logs: 200Mi - Storage-overlay: 200Mi - type: Container - - default: - storage: 1Gi - type: Pod -``` + + ```yaml + apiVersion: v1 + kind: LimitRange + metadata: + name: mylimits + spec: + - default: + storage-logs: 200Mi + Storage-overlay: 200Mi + type: Container + - default: + storage: 1Gi + type: Pod + ``` + 3. The limit range will update the pod specification as follows: -```yaml -apiVersion: v1 -kind: pod -metadata: - name: foo -spec: - containers: - name: fooc - resources: - limits: - storage-logs: 200Mi - storage-overlay: 200Mi - volumes: - name: myEmptyDir - emptyDir: - capacity: 1Gi -``` + + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + name: fooc + resources: + limits: + storage-logs: 200Mi + storage-overlay: 200Mi + volumes: + name: myEmptyDir + emptyDir: + capacity: 1Gi + ``` + 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. 5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes. -```yaml -apiVersion: v1 -kind: pod -metadata: - name: foo -spec: - containers: - name: fooc - resources: - requests: - storage-logs: 500Mi - storage-overlay: 500Mi - volumes: - name: myEmptyDir - emptyDir: - capacity: 2Gi -``` + + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + name: fooc + resources: + requests: + storage-logs: 500Mi + storage-overlay: 500Mi + volumes: + name: myEmptyDir + emptyDir: + capacity: 2Gi + ``` + 6. 
Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. ### Alice manages a Database which needs access to “durable” and fast scratch space 1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, a addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium. -```yaml -kind: PersistentVolume -apiVersion: v1 -metadata: - name: foo - annotations: - storage.kubernetes.io/node: k8s-node - labels: - storage.kubernetes.io/medium: ssd -spec: - volume-type: local - storage-type: filesystem - capacity: - storage: 100Gi - hostpath: - path: /var/lib/kubelet/storage-partitions/foo - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete -``` -```yaml -$ kubectl get pv -NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE -local-pv-1 375Gi RWO Delete Available gke-mycluster-1 -local-pv-2 375Gi RWO Delete Available gke-mycluster-1 -local-pv-1 375Gi RWO Delete Available gke-mycluster-2 -local-pv-2 375Gi RWO Delete Available gke-mycluster-2 -local-pv-1 375Gi RWO Delete Available gke-mycluster-3 -local-pv-2 375Gi RWO Delete Available gke-mycluster-3 -``` -3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy. -4. Alice creates a StatefulSet that uses local PVCs -```yaml -apiVersion: apps/v1beta1 -kind: StatefulSet -metadata: - name: web -spec: - serviceName: "nginx" - replicas: 3 - template: +2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium. 
+ + ```yaml + kind: PersistentVolume + apiVersion: v1 metadata: + name: local-pv-1 + annotations: + storage.kubernetes.io/node: node-1 labels: - app: nginx - spec: - terminationGracePeriodSeconds: 10 - containers: - - name: nginx - image: gcr.io/google_containers/nginx-slim:0.8 - ports: - - containerPort: 80 - name: web - volumeMounts: - - name: www - mountPath: /usr/share/nginx/html - - name: log - mountPath: /var/log/nginx - volumeClaimTemplates: - - metadata: - name: www - labels: - storage.kubernetes.io/medium: local-ssd - storage.kubernetes.io/volume-type: local + storage.kubernetes.io/medium: ssd spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi - - metadata: - name: log - labels: - storage.kubernetes.io/medium: hdd - storage.kubernetes.io/volume-type: local + volume-type: local + storage-type: filesystem + capacity: + storage: 100Gi + hostpath: + path: /var/lib/kubelet/storage-partitions/local-pv-1 + accessModes: + - ReadWriteOnce + persistentVolumeReclaimPolicy: Delete + ``` + ``` + $ kubectl get pv + NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE + local-pv-1 100Gi RWO Delete Available node-1 + local-pv-2 10Gi RWO Delete Available node-1 + local-pv-1 100Gi RWO Delete Available node-2 + local-pv-2 10Gi RWO Delete Available node-2 + local-pv-1 100Gi RWO Delete Available node-3 + local-pv-2 10Gi RWO Delete Available node-3 + ``` +3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy. +4. Alice creates a StatefulSet that uses local PVCs + + ```yaml + apiVersion: apps/v1beta1 + kind: StatefulSet + metadata: + name: web spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` + serviceName: "nginx" + replicas: 3 + template: + metadata: + labels: + app: nginx + spec: + terminationGracePeriodSeconds: 10 + containers: + - name: nginx + image: gcr.io/google_containers/nginx-slim:0.8 + ports: + - containerPort: 80 + name: web + volumeMounts: + - name: www + mountPath: /usr/share/nginx/html + - name: log + mountPath: /var/log/nginx + volumeClaimTemplates: + - metadata: + name: www + labels: + storage.kubernetes.io/medium: ssd + spec: + volume-type: local + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 100Gi + - metadata: + name: log + labels: + storage.kubernetes.io/medium: hdd + spec: + volume-type: local + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi + ``` + 5. The scheduler identifies nodes for each pods that can satisfy cpu, memory, storage requirements and also contains free local PVs to satisfy the pods PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. 
-```yaml -$ kubectl get pvc -NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE -www-local-pvc-1 Bound local-pv-1 375Gi RWO gke-mycluster-1 -www-local-pvc-2 Bound local-pv-1 375Gi RWO gke-mycluster-2 -www-local-pvc-3 Bound local-pv-1 375Gi RWO gke-mycluster-3 -log-local-pvc-1 Bound local-pv-1 375Gi RWO gke-mycluster-1 -log-local-pvc-2 Bound local-pv-1 375Gi RWO gke-mycluster-2 -log-local-pvc-3 Bound local-pv-1 375Gi RWO gke-mycluster-3 -``` -```yaml -$ kubectl get pv -NAME CAPACITY … STATUS CLAIM NODE -local-pv-1 375Gi Bound www-local-pvc-1 gke-mycluster-1 -local-pv-2 375Gi Bound log-local-pvc-1 gke-mycluster-1 -local-pv-1 375Gi Bound www-local-pvc-2 gke-mycluster-2 -local-pv-2 375Gi Bound log-local-pvc-2 gke-mycluster-2 -local-pv-1 375Gi Bound www-local-pvc-3 gke-mycluster-3 -local-pv-2 375Gi Bound log-local-pvc-3 gke-mycluster-3 -``` + ``` + $ kubectl get pvc + NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE + www-local-pvc-1 Bound local-pv-1 100Gi RWO node-1 + www-local-pvc-2 Bound local-pv-1 100Gi RWO node-2 + www-local-pvc-3 Bound local-pv-1 100Gi RWO node-3 + log-local-pvc-1 Bound local-pv-2 10Gi RWO node-1 + log-local-pvc-2 Bound local-pv-2 10Gi RWO node-2 + log-local-pvc-3 Bound local-pv-2 10Gi RWO node-3 + ``` + ``` + $ kubectl get pv + NAME CAPACITY … STATUS CLAIM NODE + local-pv-1 100Gi Bound www-local-pvc-1 node-1 + local-pv-2 10Gi Bound log-local-pvc-1 node-1 + local-pv-1 100Gi Bound www-local-pvc-2 node-2 + local-pv-2 10Gi Bound log-local-pvc-2 node-2 + local-pv-1 100Gi Bound www-local-pvc-3 node-3 + local-pv-2 10Gi Bound log-local-pvc-3 node-3 + ``` + 6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. 7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods. 8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy. @@ -282,43 +302,47 @@ local-pv-2 375Gi Bound log-local-pvc-3 gke-mycluster-3 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. 2. 
The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a ‘StorageType’ of ‘Block’ -```yaml -kind: PersistentVolume -apiVersion: v1 -metadata: - name: foo - annotations: - storage.kubernetes.io/node: k8s-node - labels: - storage.kubernetes.io/medium: hdd -spec: - volume-type: local - storage-type: block - capacity: - storage: 100Gi - hostpath: - path: /var/lib/kubelet/storage-raw-devices/foo - accessModes: - - ReadWriteOnce - persistentVolumeReclaimPolicy: Delete -``` + + ```yaml + kind: PersistentVolume + apiVersion: v1 + metadata: + name: foo + annotations: + storage.kubernetes.io/node: k8s-node + labels: + storage.kubernetes.io/medium: ssd + spec: + volume-type: local + storage-type: block + capacity: + storage: 100Gi + hostpath: + path: /var/lib/kubelet/storage-raw-devices/foo + accessModes: + - ReadWriteOnce + persistentVolumeReclaimPolicy: Delete + ``` + 3. Bob creates a pod with a PVC that requests for block level access and similar to a Stateful Set scenario the scheduler will identify nodes that can satisfy the pods request. -```yaml -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: myclaim - labels: - storage.kubernetes.io/medium: ssd -spec: - volume-type: local - storage-type: block - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 80Gi -``` + + ```yaml + kind: PersistentVolumeClaim + apiVersion: v1 + metadata: + name: myclaim + labels: + storage.kubernetes.io/medium: ssd + spec: + volume-type: local + storage-type: block + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 80Gi + ``` + *The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.* # Open Questions & Discussion points From 2f01ad576f0f40fd77c851b02c4fb7e28cbf4b67 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Wed, 1 Feb 2017 17:57:05 -0800 Subject: [PATCH 03/19] Update local storage doc with first round comments --- .../local-storage-overview.md | 99 ++++++++++--------- 1 file changed, 50 insertions(+), 49 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index caf802db0f5..57a9b1faf41 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -1,7 +1,7 @@ # Local Storage Management Authors: vishh@, msau42@ -This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high level design overview for managing most user workflows. More detailed design and implementation will be added once the community agrees with the high level design presented here. +This document presents a strawman for managing local storage in Kubernetes. We expect to provide a UX and high level design overview for managing most user workflows. More detailed design and implementation will be added once the community agrees with the high level design presented here. # Goals * Enable ephemeral & durable access to local storage @@ -9,10 +9,10 @@ This document presents a strawman for managing local storage in Kubernetes. 
We e * Provide flexibility for users/vendors to utilize various types of storage devices * Define a standard partitioning scheme for storage drives for all Kubernetes nodes * Provide storage usage isolation for shared partitions -* Support random access storage devices only +* Support random access storage devices only (e.g., hard disks and SSDs) # Non Goals -* Provide isolation for all partitions. Isolation will not be of concern for most partitions since they are not expected to be shared. +* Provide storage usage isolation for non-shared partitions. * Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms. # Use Cases @@ -22,8 +22,8 @@ Today, ephemeral local storage is exposed to pods via the container’s writable * Pods do not know how much local storage is available to them. * Pods cannot request “guaranteed” local storage. -* Local storage is a “best-effort” resource -* Pods can get evicted due to other pods filling up the local storage during which time no new pods will be admitted, until sufficient storage has been reclaimed +* Local storage is a “best-effort” resource. +* Pods can get evicted due to other pods filling up the local storage, after which no new pods will be admitted until sufficient storage has been reclaimed. ## Persistent Local Storage Distributed filesystems and databases are the primary use cases for persistent local storage due to the following factors: @@ -35,19 +35,19 @@ Distributed systems often use replication to provide fault tolerance, and can th # Design Overview -A node’s local storage can be broken into primary and secondary partitions. +A node’s local storage can be broken into primary and secondary partitions. ## Primary Partitions Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are: ### Root - This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IO for example) from this partition. + This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPs for example) from this partition. -### Runtime +### Runtime This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. ## Secondary Partitions -All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. +All other partitions are exposed as persistent volumes. 
The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. Each PV uses an entire partition. The PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions, and can create new PVs as disks are added to the node. # User Workflows @@ -61,10 +61,10 @@ All other partitions are exposed as persistent volumes. The PV interface allows metadata: name: foo status: - Capacity: - Storage: 100Gi - Allocatable: - Storage: 90Gi + capacity: + storage: 100Gi + allocatable: + storage: 90Gi ``` 2. Alice adds a “Storage” requirement to her pod as follows @@ -76,15 +76,15 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: foo spec: containers: - name: fooc - resources: + - name: fooc + resources: limits: - storage-logs: 500Mi - storage-overlay: 1Gi + storageLogs: 500Mi + storageOverlay: 1Gi volumes: - name: myEmptyDir - emptyDir: - capacity: 20Gi + - name: myEmptyDir + emptyDir: + capacity: 20Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. @@ -107,13 +107,13 @@ All other partitions are exposed as persistent volumes. The PV interface allows namespace: myns spec: containers: - name: fooc + - name: fooc volumes: - name: myEmptyDir - emptyDir: + - name: myEmptyDir + emptyDir: ``` -2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a LimitRange to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify bust ranges and a host of other features supported by LimitRange for local storage. +2. His cluster administrator being aware of the issues with disk reclamation latencies has intelligently decided not to allow overcommitting primary partitions. The cluster administrator has installed a [LimitRange](https://kubernetes.io/docs/user-guide/compute-resources/) to “myns” namespace that will set a default storage size. Note: A cluster administrator can also specify burst ranges and a host of other features supported by LimitRange for local storage. ```yaml apiVersion: v1 @@ -122,12 +122,12 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: mylimits spec: - default: - storage-logs: 200Mi - Storage-overlay: 200Mi + storageLogs: 200Mi + storageOverlay: 200Mi type: Container - default: storage: 1Gi - type: Pod + type: EmptyDir ``` 3. The limit range will update the pod specification as follows: @@ -142,8 +142,8 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: fooc resources: limits: - storage-logs: 200Mi - storage-overlay: 200Mi + storageLogs: 200Mi + storageOverlay: 200Mi volumes: name: myEmptyDir emptyDir: @@ -163,15 +163,15 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: fooc resources: requests: - storage-logs: 500Mi - storage-overlay: 500Mi + storageLogs: 500Mi + storageOverlay: 500Mi volumes: name: myEmptyDir emptyDir: capacity: 2Gi ``` -6. 
Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intent to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. +6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intend to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. ### Alice manages a Database which needs access to “durable” and fast scratch space @@ -188,8 +188,8 @@ All other partitions are exposed as persistent volumes. The PV interface allows labels: storage.kubernetes.io/medium: ssd spec: - volume-type: local - storage-type: filesystem + volumeType: local + storageType: filesystem capacity: storage: 100Gi hostpath: @@ -200,7 +200,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows ``` ``` $ kubectl get pv - NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE + NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS CLAIM … NODE local-pv-1 100Gi RWO Delete Available node-1 local-pv-2 10Gi RWO Delete Available node-1 local-pv-1 100Gi RWO Delete Available node-2 @@ -208,7 +208,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows local-pv-1 100Gi RWO Delete Available node-3 local-pv-2 10Gi RWO Delete Available node-3 ``` -3. The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy. +3. The addon will monitor the health of secondary partitions and mark PVs as unhealthy whenever the backing local storage devices have failed. 4. Alice creates a StatefulSet that uses local PVCs ```yaml @@ -242,7 +242,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows labels: storage.kubernetes.io/medium: ssd spec: - volume-type: local + volumeType: local accessModes: [ "ReadWriteOnce" ] resources: requests: @@ -252,17 +252,17 @@ All other partitions are exposed as persistent volumes. The PV interface allows labels: storage.kubernetes.io/medium: hdd spec: - volume-type: local + volumeType: local accessModes: [ "ReadWriteOnce" ] resources: requests: storage: 1Gi ``` -5. The scheduler identifies nodes for each pods that can satisfy cpu, memory, storage requirements and also contains free local PVs to satisfy the pods PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. +5. The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. ``` $ kubectl get pvc - NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE + NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE www-local-pvc-1 Bound local-pv-1 100Gi RWO node-1 www-local-pvc-2 Bound local-pv-1 100Gi RWO node-2 www-local-pvc-3 Bound local-pv-1 100Gi RWO node-3 @@ -272,7 +272,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows ``` ``` $ kubectl get pv - NAME CAPACITY … STATUS CLAIM NODE + NAME CAPACITY … STATUS CLAIM NODE local-pv-1 100Gi Bound www-local-pvc-1 node-1 local-pv-2 10Gi Bound log-local-pvc-1 node-1 local-pv-1 100Gi Bound www-local-pvc-2 node-2 @@ -282,9 +282,10 @@ All other partitions are exposed as persistent volumes. The PV interface allows ``` 6. 
If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -7. If a new pod fails to get scheduled while attempting to reuse an old PVC, the StatefulSet controller is expected to give up on the old PVC (delete & recycle) and instead create a new PVC based on some policy. This is to guarantee scheduling of stateful pods. -8. If a PV gets tainted as unhealthy, the StatefulSet is expected to delete pods if they cannot tolerate PV failures. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy. -9. Once Alice decides to delete the database, the PVCs are expected to get deleted by the StatefulSet. PVs will then get recycled and deleted, and the addon adds it back to the cluster. +7. If a pod fails to get scheduled while attempting to reuse an old PVC, a controller will unbind the PVC from the PV, clean up the PV according to the reclaim policy, and reschedule the pod. The PVC will get rebound to a new PV. +8. If the node gets tainted as NotReady or Unknown, the pod is evicted according to the taint's forgiveness setting. The pod will then fail scheduling due to the taint, and follow step 7. +9. If a PV becomes unhealthy, the pod is evicted by a controller, and follows step 7. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy. +10. Once Alice decides to delete the database, the StatefulSet is destroyed, followed by the PVCs. The PVs will then get recycled and deleted according to the reclaim policy, and the addon adds it back to the cluster. ### Bob manages a distributed filesystem which needs access to all available storage on each node @@ -309,12 +310,12 @@ All other partitions are exposed as persistent volumes. The PV interface allows metadata: name: foo annotations: - storage.kubernetes.io/node: k8s-node + storage.kubernetes.io/node: node-1 labels: storage.kubernetes.io/medium: ssd spec: - volume-type: local - storage-type: block + volumeType: local + storageLevel: block capacity: storage: 100Gi hostpath: @@ -334,8 +335,8 @@ All other partitions are exposed as persistent volumes. The PV interface allows labels: storage.kubernetes.io/medium: ssd spec: - volume-type: local - storage-type: block + volumeType: local + storageLevel: block accessModes: - ReadWriteOnce resources: From 92fc45c3f70b2bf23a4c6aed7fa83910524483e4 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Thu, 23 Feb 2017 16:37:51 -0800 Subject: [PATCH 04/19] Update local storage doc with review comments, forgiveness examples, generalized node->PV binding, storage class example. --- .../local-storage-overview.md | 156 ++++++++++-------- 1 file changed, 86 insertions(+), 70 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 57a9b1faf41..c6274cd17f0 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -41,19 +41,25 @@ A node’s local storage can be broken into primary and secondary partitions. Primary partitions are shared partitions that can provide ephemeral local storage. 
The two supported primary partitions are: ### Root - This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPs for example) from this partition. + This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPS for example) from this partition. ### Runtime This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. ## Secondary Partitions -All other partitions are exposed as persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. Each PV uses an entire partition. The PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions, and can create new PVs as disks are added to the node. +All other partitions are exposed as local persistent volumes. Each local volume uses an entire partition. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. + +The local PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions at well-known directories, and can create new PVs as partitions are added to the node. A default addon can be provided to handle common configurations. + +Local PVs can only provide semi-persistence, and are only suitable for specific use cases that need performance, data gravity and can tolerate data loss. If the node or PV fails, then either the pod cannot run, or the pod has to give up on the local PV and find a new one. Failure scenarios can be handled by unbinding the PVC from the local PV, and forcing the pod to reschedule and find a new PV. + +Since local PVs are only accessible from specific nodes, a new PV-node association will be used by the scheduler to place pods. The association can be generalized to support any type of PV, not just local PVs. This allows for any volume plugin to take advantage of this behavior. # User Workflows ### Alice manages a deployment and requires “Guaranteed” ephemeral storage -1. Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary “root” partition. The runtime partition is an implementation detail and is not exposed outside the node. +1. 
Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary “root” partition. This allows primary storage to be considered as a first class resource when scheduling. The runtime partition is an implementation detail and is not exposed outside the node. ```yaml apiVersion: v1 @@ -67,7 +73,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows storage: 90Gi ``` -2. Alice adds a “Storage” requirement to her pod as follows +2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. ```yaml apiVersion: v1 @@ -90,10 +96,10 @@ All other partitions are exposed as persistent volumes. The PV interface allows 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. 4. Alice’s pod is not provided any IO guarantees 5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi -6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation. +6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation. Otherwise, kubelet can attempt to enforce soft limits. 7. With hard limits, containers will receive a ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet. 8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. -9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. +9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. ### Bob runs batch workloads and is unsure of “storage” requirements @@ -139,15 +145,15 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: foo spec: containers: - name: fooc - resources: - limits: - storageLogs: 200Mi - storageOverlay: 200Mi + - name: fooc + resources: + limits: + storageLogs: 200Mi + storageOverlay: 200Mi volumes: - name: myEmptyDir - emptyDir: - capacity: 1Gi + - name: myEmptyDir + emptyDir: + capacity: 1Gi ``` 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. @@ -160,15 +166,15 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: foo spec: containers: - name: fooc - resources: - requests: - storageLogs: 500Mi - storageOverlay: 500Mi + - name: fooc + resources: + requests: + storageLogs: 500Mi + storageOverlay: 500Mi volumes: - name: myEmptyDir - emptyDir: - capacity: 2Gi + - name: myEmptyDir + emptyDir: + capacity: 2Gi ``` 6. 
Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intend to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. @@ -176,7 +182,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows ### Alice manages a Database which needs access to “durable” and fast scratch space 1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates HostPath PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points and include additional labels and annotations that help tie to a specific node and identify the storage medium. +2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. Storage classes and labels may also be specified. The volume consumes the entire partition. ```yaml kind: PersistentVolume @@ -184,15 +190,12 @@ All other partitions are exposed as persistent volumes. The PV interface allows metadata: name: local-pv-1 annotations: - storage.kubernetes.io/node: node-1 - labels: - storage.kubernetes.io/medium: ssd + volume.kubernetes.io/node: node-1 + volume.beta.kubernetes.io/storage-class: local-fast spec: - volumeType: local - storageType: filesystem capacity: storage: 100Gi - hostpath: + local: path: /var/lib/kubelet/storage-partitions/local-pv-1 accessModes: - ReadWriteOnce @@ -208,8 +211,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows local-pv-1 100Gi RWO Delete Available node-3 local-pv-2 10Gi RWO Delete Available node-3 ``` -3. The addon will monitor the health of secondary partitions and mark PVs as unhealthy whenever the backing local storage devices have failed. -4. Alice creates a StatefulSet that uses local PVCs +3. Alice creates a StatefulSet that uses local PVCs. The annotation `volume.kubernetes.io/node = ""` is specified to indicate that the requested volume should be local to a node. The PVC will only be bound to PVs that also have the node annotation set and vice versa. ```yaml apiVersion: apps/v1beta1 @@ -239,27 +241,27 @@ All other partitions are exposed as persistent volumes. The PV interface allows volumeClaimTemplates: - metadata: name: www - labels: - storage.kubernetes.io/medium: ssd + annotations: + volume.kubernetes.io/node: "" + volume.beta.kubernetes.io/storage-class: local-fast spec: - volumeType: local accessModes: [ "ReadWriteOnce" ] resources: requests: storage: 100Gi - metadata: name: log - labels: - storage.kubernetes.io/medium: hdd + annotations: + volume.kubernetes.io/node: "" + volume.beta.kubernetes.io/storage-class: local-slow spec: - volumeType: local accessModes: [ "ReadWriteOnce" ] resources: requests: storage: 1Gi ``` -5. The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. +4. 
The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. The annotation `volume.kubernetes.io/node` will be filled in with the chosen node name. ``` $ kubectl get pvc NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE @@ -281,28 +283,40 @@ All other partitions are exposed as persistent volumes. The PV interface allows local-pv-2 10Gi Bound log-local-pvc-3 node-3 ``` -6. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -7. If a pod fails to get scheduled while attempting to reuse an old PVC, a controller will unbind the PVC from the PV, clean up the PV according to the reclaim policy, and reschedule the pod. The PVC will get rebound to a new PV. -8. If the node gets tainted as NotReady or Unknown, the pod is evicted according to the taint's forgiveness setting. The pod will then fail scheduling due to the taint, and follow step 7. -9. If a PV becomes unhealthy, the pod is evicted by a controller, and follows step 7. Unhealthy PVs once released will not be added back to the cluster until the corresponding local storage device is healthy. -10. Once Alice decides to delete the database, the StatefulSet is destroyed, followed by the PVCs. The PVs will then get recycled and deleted according to the reclaim policy, and the addon adds it back to the cluster. +5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. +6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. + + A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. + ``` + tolerations: + - key: node.alpha.kubernetes.io/notReady + operator: TolerationOpExists + tolerationSeconds: 600 + - key: node.alpha.kubernetes.io/unreachable + operator: TolerationOpExists + tolerationSeconds: 1200 + - key: storage.kubernetes.io/pvUnhealthy + operator: TolerationOpExists + ``` + +7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. ### Bob manages a distributed filesystem which needs access to all available storage on each node 1. The cluster that Bob is using is provisioned with nodes that contain one or more secondary partitions 2. The cluster administrator runs a DaemonSet addon that discovers secondary partitions across all nodes and creates corresponding PVs for them. -3. 
The addon will monitor the health of secondary partitions and taint PVs whenever the backing local storage devices becomes unhealthy. +3. The addon will monitor the health of secondary partitions and mark PVs as unhealthy whenever the backing local storage devices have failed. 4. Bob creates a specialized controller (Operator) for his distributed filesystem and deploys it. 5. The operator will identify all the nodes that it can schedule pods onto and discovers the PVs available on each of those nodes. The operator has a label selector that identifies the specific PVs that it can use (this helps preserve fast PVs for Databases for example). -6. The operator will then create PVCs and manually bind to individual local PVs across all its nodes. -7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs create by the Operator on those nodes. +6. The operator will then create PVCs and manually bind to individual local PVs across all its nodes. +7. It will then create pods, manually place them on specific nodes (similar to a DaemonSet) with high enough priority and have them use all the PVCs created by the Operator on those nodes. 8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used. -9. If a PV gets tainted as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures. +9. If a PV gets marked as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures ### Bob manages a specialized application that needs access to Block level storage 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. -2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a ‘StorageType’ of ‘Block’ +2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a new `volumeType = block` field. ```yaml kind: PersistentVolume @@ -311,32 +325,30 @@ All other partitions are exposed as persistent volumes. The PV interface allows name: foo annotations: storage.kubernetes.io/node: node-1 - labels: - storage.kubernetes.io/medium: ssd + volume.beta.kubernetes.io/storage-class: local-fast spec: - volumeType: local - storageLevel: block + volumeType: block capacity: storage: 100Gi - hostpath: + local: path: /var/lib/kubelet/storage-raw-devices/foo accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete ``` -3. Bob creates a pod with a PVC that requests for block level access and similar to a Stateful Set scenario the scheduler will identify nodes that can satisfy the pods request. +3. Bob creates a pod with a PVC that requests for block level access and similar to a Stateful Set scenario the scheduler will identify nodes that can satisfy the pods request. The block devices will not be formatted to allow the application to handle the device using their own methods. 
```yaml kind: PersistentVolumeClaim apiVersion: v1 metadata: name: myclaim - labels: - storage.kubernetes.io/medium: ssd + annotations: + volume.beta.kubernetes.io/node: "" + volume.beta.kubernetes.io/storage-class: local-fast spec: - volumeType: local - storageLevel: block + volumeType: block accessModes: - ReadWriteOnce resources: @@ -348,24 +360,28 @@ All other partitions are exposed as persistent volumes. The PV interface allows # Open Questions & Discussion points * Single vs split “limit” for storage across writable layer and logs + * Split allows for enforcement of hard quotas + * Single is a simpler UI * Local Persistent Volume bindings happening in the scheduler vs in PV controller * Should the PV controller fold into the scheduler -* Supporting dedicated partitions for logs and volumes in Kubelet in addition to runtime overlay filesystem - * This complicates kubelet.Not sure what value it adds to end users. -* Providing IO isolation for ephemeral storage - * IO isolation is difficult. Use local PVs for performance -* Support for encrypted partitions -* Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space. - * Typically referred to as “inline PVs” in kube land -* Should block level storage devices be auto formatted to be used as file level storage? - * Flexibility vs complexity - * Do we really need this? +* Should block level storage devices be auto formatted to be used as file level storage instead of having the filesystems precreated by the admin? + * It would match behavior with GCE PD and EBS where the volume plugin will create the filesystem first. + * It can allow for more comprehensive (but slower) volume cleanup options. The filesystem can be destroyed and then the partition can be zeroed. + * It limits the filesystem choices to those that k8 supports. * Repair/replace scenarios. * What are the implications of removing a disk and replacing it with a new one? * We may not do anything in the system, but may need a special workflow +* How to handle capacity of overlay systems. It can be specified in the pod spec, but it is not accounted for in the node capacity. +* Volume-level replication use cases where there is no pod associated with a volume. How could forgiveness/data gravity be handled there? -# Recommended Storage best practices +# Related Features +* Protecting system daemons from abusive IO to primary partition +* Raw device/block volume support. This will benefit both remote and local devices. +* Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space. + * Typically referred to as “inline PVs” in kube land +* Support for encrypted secondary partitions in order to make wiping more secure and reduce latency +# Recommended Storage best practices * Have the primary partition on a reliable storage device * Consider using RAID and SSDs (for performance) * Partition the rest of the storage devices based on the application needs @@ -375,7 +391,7 @@ All other partitions are exposed as persistent volumes. The PV interface allows * Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted * The runtime partition for overlayfs is optional. You do not **need** one. * Alert on primary partition failures and act on it immediately. Primary partition failures will render your node unusable. 
-* Use EmptyDir for all scratch space requirements of your apps. +* Use EmptyDir for all scratch space requirements of your apps when IOPS isolation is not needed. * Make the container’s writable layer `readonly` if possible. * Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate from using local storage for anything but super fast caching purposes or distributed databases leading to higher reliability & uptime for nodes. From 9d4f582732628cc210619d9e3d4ffff50854300c Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Wed, 1 Mar 2017 16:45:40 -0800 Subject: [PATCH 05/19] Add rationale for avoiding supporting I/O based isolation Signed-off-by: Vishnu kannan --- contributors/design-proposals/local-storage-overview.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index c6274cd17f0..8281a5e145f 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -14,7 +14,12 @@ This document presents a strawman for managing local storage in Kubernetes. We e # Non Goals * Provide storage usage isolation for non-shared partitions. * Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms. - +* Support for I/O isolation + * Available IOPS on rotational media is very limited compared to other resources like CPU and Memory. This leads to severe resource stranding if IOPS is exposed as a schedulable resource. + * Blkio cgroup based I/O isolation doesn't provide deterministic behavior compared to memory and cpu cgroups. Years of experience at Google with Borg has taught that relying on blkio or I/O scheduler isn't suitable for multi-tenancy. + * Blkio cgroup based I/O isolation isn't suitable for SSDs. Turning on CFQ on SSDs will hamper performance. Its better to statically partition SSDs and share them instead of using blkio. + * I/O isolation can be achieved by using a combination of static partitioning and remote storage. This proposal recommends this approach. + # Use Cases ## Ephemeral Local Storage From ff03f56708245544a4b2bf081214202e61a4344f Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Wed, 1 Mar 2017 16:57:26 -0800 Subject: [PATCH 06/19] extending related features Signed-off-by: Vishnu kannan --- contributors/design-proposals/local-storage-overview.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 8281a5e145f..8cacfba1517 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -381,10 +381,11 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati # Related Features * Protecting system daemons from abusive IO to primary partition -* Raw device/block volume support. This will benefit both remote and local devices. +* Raw device/block volume support. This will benefit both remote and local devices. * Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space. 
* Typically referred to as “inline PVs” in kube land * Support for encrypted secondary partitions in order to make wiping more secure and reduce latency +* Co-locating PVs and pods across zones. Binding PVCs in the scheduler will help with this feature. # Recommended Storage best practices * Have the primary partition on a reliable storage device From 1a68dab16b372897f0bd76c19bb652728b6ebc3e Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Sat, 4 Mar 2017 20:38:21 -0800 Subject: [PATCH 07/19] disk io isolation notes Signed-off-by: Vishnu kannan --- .../local-storage-overview.md | 228 +++++++++++++++--- 1 file changed, 189 insertions(+), 39 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 8cacfba1517..c1b11a1dd10 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -14,11 +14,12 @@ This document presents a strawman for managing local storage in Kubernetes. We e # Non Goals * Provide storage usage isolation for non-shared partitions. * Support all storage devices natively in upstream Kubernetes. Non standard storage devices are expected to be managed using extension mechanisms. -* Support for I/O isolation - * Available IOPS on rotational media is very limited compared to other resources like CPU and Memory. This leads to severe resource stranding if IOPS is exposed as a schedulable resource. - * Blkio cgroup based I/O isolation doesn't provide deterministic behavior compared to memory and cpu cgroups. Years of experience at Google with Borg has taught that relying on blkio or I/O scheduler isn't suitable for multi-tenancy. - * Blkio cgroup based I/O isolation isn't suitable for SSDs. Turning on CFQ on SSDs will hamper performance. Its better to statically partition SSDs and share them instead of using blkio. - * I/O isolation can be achieved by using a combination of static partitioning and remote storage. This proposal recommends this approach. +* Support for I/O isolation using CFS & blkio cgroups. + * IOPS isn't safe to be a schedulable resource. IOPS on rotational media is very limited compared to other resources like CPU and Memory. This leads to severe resource stranding. + * Blkio cgroup + CFS based I/O isolation doesn't provide deterministic behavior compared to memory and cpu cgroups. Years of experience at Google with Borg has taught that relying on blkio or I/O scheduler isn't suitable for multi-tenancy. + * Blkio cgroup based I/O isolation isn't suitable for SSDs. Turning on CFQ on SSDs will hamper performance. Its better to statically partition SSDs and share them instead of using CFS. + * I/O isolation can be achieved by using a combination of static partitioning and remote storage. This proposal recommends this approach with illustrations below. + * Pod level resource isolation extensions will be made available in the Kubelet which will let vendors add support for CFQ if necessary for their deployments. # Use Cases @@ -64,7 +65,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ### Alice manages a deployment and requires “Guaranteed” ephemeral storage -1. Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary “root” partition. This allows primary storage to be considered as a first class resource when scheduling. The runtime partition is an implementation detail and is not exposed outside the node. +1. 
Kubelet running across all nodes will identify primary partition and expose capacity and allocatable for the primary partitions. This allows primary partitions' storage capacity to be considered as a first class resource when scheduling. ```yaml apiVersion: v1 @@ -73,10 +74,12 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: foo status: capacity: - storage: 100Gi - allocatable: - storage: 90Gi - ``` + storage.kubernetes.io/runtime: 100Gi + storage.kubernetes.io/root: 100Gi + allocatable: + storage.kubernetes.io/runtime: 100Gi + storage.kubernetes.io/root: 90Gi +``` 2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. @@ -90,21 +93,22 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storageLogs: 500Mi - storageOverlay: 1Gi + storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/root: 1Gi volumes: - name: myEmptyDir emptyDir: - capacity: 20Gi + resources: + limits + storage.kubernetes.io: 20Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. -4. Alice’s pod is not provided any IO guarantees +4. `storage.kubernetes.io/logs` resource can only be satisfied by `storage.kubernetes.io/root` Allocatable on nodes. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/runtime` if exposed by nodes or by `storage.kubernetes.io/root` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. 5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi -6. Kubelet will attempt to hard limit local storage consumed by pod “foo” if Linux project quota feature is available and runtime supports storage isolation. Otherwise, kubelet can attempt to enforce soft limits. -7. With hard limits, containers will receive a ENOSPACE error if they consume all reserved storage. Without hard limits, the pod will be evicted by kubelet. -8. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. -9. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. +6. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. +7. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. +8. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. 
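For illustration, a pod that should survive a degraded primary partition for a bounded period could declare a toleration for the node taint described above. This is only a sketch: the taint key `storage.kubernetes.io/primaryPartitionUnhealthy` is a placeholder, since this proposal does not fix the exact key that the Node Problem Detector (or another monitoring agent) would apply. The toleration format follows the other toleration examples in this document.

```yaml
# Sketch only. The taint key is a placeholder; the actual key applied by the
# Node Problem Detector is not defined by this proposal.
tolerations:
- key: storage.kubernetes.io/primaryPartitionUnhealthy
  operator: TolerationOpExists
  # Evict the pod if the primary partition stays unhealthy longer than 10 minutes.
  tolerationSeconds: 600
```
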
### Bob runs batch workloads and is unsure of “storage” requirements @@ -133,11 +137,11 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: mylimits spec: - default: - storageLogs: 200Mi - storageOverlay: 200Mi + storage.kubernetes.io/logs: 200Mi + storage.kubernetes.io/overlay: 200Mi type: Container - default: - storage: 1Gi + storage.kubernetes.io: 1Gi type: EmptyDir ``` @@ -153,12 +157,14 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storageLogs: 200Mi - storageOverlay: 200Mi + storage.kubernetes.io/logs: 200Mi + storage.kubernetes.io/overlay: 200Mi volumes: - name: myEmptyDir emptyDir: - capacity: 1Gi + resources: + limits: + storage.kubernetes.io: 1Gi ``` 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. @@ -173,16 +179,18 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati containers: - name: fooc resources: - requests: - storageLogs: 500Mi - storageOverlay: 500Mi + requests: + storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/overlay: 500Mi volumes: - name: myEmptyDir emptyDir: - capacity: 2Gi + resources: + limits: + storage.kubernetes.io/logs: 2Gi ``` -6. Note that it is not possible to specify a minimum storage requirement for EmptyDir volumes because we intend to limit overcommitment of local storage due to high latency in reclaiming it from terminated pods. +6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Durable Volumes as much as possible and avoid primary partitions. ### Alice manages a Database which needs access to “durable” and fast scratch space @@ -289,10 +297,23 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ``` 5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. +6. To workaround situations when a pod cannot get access to its existing local PV due to resource unvavilability on the local PV's node, pods can choose to opt for switching to use a new local PV after a `timeout`. If the scheduler cannot bind the pod to the node where the local PV exists before `timeout` elapses since the pod's creation then the corresponding PVC will be unbound by the scheduler and the pod will then be bound to a different node where the pod would fit and local PV requirements are met. + +```yaml +apiVersion: v1 +type: pod +spec: +volumes: + - name: myDurableVolume + persistentVolumeClaim: + claimName: foo + accessTimeoutSeconds: 30 + ``` + +7. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. 
No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. - ``` + ```yaml tolerations: - key: node.alpha.kubernetes.io/notReady operator: TolerationOpExists @@ -304,7 +325,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati operator: TolerationOpExists ``` -7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. +8. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. ### Bob manages a distributed filesystem which needs access to all available storage on each node @@ -318,6 +339,123 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati 8. If a pod dies, it will get replaced with a new pod that uses the same set of PVCs that the old pod had used. 9. If a PV gets marked as unhealthy, the Operator is expected to delete pods if they cannot tolerate device failures +### Phippy manages a cluster and intends to mitigate storage I/O abuse + +1. Phippy creates a dedicated partition with a separate device for her system daemons. She achieves this by making `/var/log/containers`, `/var/lib/kubelet`, `/var/lib/docker` (with the docker runtime) all reside on a separate partition. +2. Phippy is aware that pods can cause abuse to each other. +3. Whenever a pod experiences I/O issues with it's EmptyDir volume, Phippy reconfigures those pods to use Persistent Volumes whose lifetime is tied to the pod. + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + - name: fooc + resources: + limits: + storageLogs: 500Mi + storageOverlay: 1Gi + volumes: + - name: myEphemeralPeristentVolume + inline: + metadata: + labels: + storage.kubernetes.io/medium: local-ssd + storage.kubernetes.io/volume-type: local + spec: + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi + ``` + +4. Phippy notices some of her pods are experiencing spurious downtimes. With the help of monitoring (`iostat`), she notices that the nodes pods are running on are overloaded with I/O operations. She then updates her pods to use Logging Volumes which are backed by persistent storage. If a logging volumeMount is associated with a container, Kubelet will place log data from stdout & stderr of the container under the volume mount path within the container. Kubelet will continue to expose stdout/stderr log data to external logging agents using symlinks as it does already. 
+ + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + - name: fooc + volumeMounts: + name: myLoggingVolume + path: /var/log/ + policy: + logDir: + subDir: foo + glob: *.log + - name: barc + volumeMounts: + name: myInMemoryLoggVolume + path: /var/log/ + policy: + logDir: + subDir: bar + glob: *.log + volumes: + - name: myLoggingVolume + inline: + metadata: + labels: + storage.kubernetes.io/medium: local-ssd + storage.kubernetes.io/volume-type: local + spec: + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi + - name: myInMemoryLogVolume + emptyDir: + medium: memory + resources: + limits: + storage: 100Mi + ``` + +5. Phippy notices some of her pods are suffering hangs by while writing to their writable layer. Phippy again notices that I/O contention is the root cause and then updates her Pod Spec to use memory backed or persistent volumes for her pods writable layer. Kubelet will instruct the runtimes to overlay the volume with `overlay` policy over the writable layer of the container. + + ```yaml + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + containers: + - name: fooc + volumeMounts: + name: myWritableLayer + policy: + overlay: + subDir: foo + - name: barc + volumeMounts: + name: myDurableWritableLayer + policy: + overlay: + subDir: bar + volumes: + - name: myWritableLayer + emptyDir: + medium: memory + resources: + limits: + storage: 100Mi + - name: myDurableWritableLayer + inline: + metadata: + labels: + storage.kubernetes.io/medium: local-ssd + storage.kubernetes.io/volume-type: local + spec: + accessModes: [ "ReadWriteOnce" ] + resources: + requests: + storage: 1Gi +``` + ### Bob manages a specialized application that needs access to Block level storage 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. @@ -369,26 +507,23 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati * Single is a simpler UI * Local Persistent Volume bindings happening in the scheduler vs in PV controller * Should the PV controller fold into the scheduler + * This will help spread PVs and pods across matching zones. * Should block level storage devices be auto formatted to be used as file level storage instead of having the filesystems precreated by the admin? * It would match behavior with GCE PD and EBS where the volume plugin will create the filesystem first. * It can allow for more comprehensive (but slower) volume cleanup options. The filesystem can be destroyed and then the partition can be zeroed. * It limits the filesystem choices to those that k8 supports. -* Repair/replace scenarios. +* Repair/replace scenarios. * What are the implications of removing a disk and replacing it with a new one? * We may not do anything in the system, but may need a special workflow -* How to handle capacity of overlay systems. It can be specified in the pod spec, but it is not accounted for in the node capacity. * Volume-level replication use cases where there is no pod associated with a volume. How could forgiveness/data gravity be handled there? # Related Features -* Protecting system daemons from abusive IO to primary partition -* Raw device/block volume support. This will benefit both remote and local devices. -* Do applications need access to performant local storage for ephemeral use cases? Ex. A pod requesting local SSDs for use as ephemeral scratch space. 
- * Typically referred to as “inline PVs” in kube land * Support for encrypted secondary partitions in order to make wiping more secure and reduce latency * Co-locating PVs and pods across zones. Binding PVCs in the scheduler will help with this feature. # Recommended Storage best practices * Have the primary partition on a reliable storage device +* Have a dedicated storage device for system daemons. * Consider using RAID and SSDs (for performance) * Partition the rest of the storage devices based on the application needs * SSDs can be statically partitioned and they might still meet IO requirements of apps. @@ -397,7 +532,22 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati * Run a reliable cluster level logging service to drain logs from the nodes before they get rotated or deleted * The runtime partition for overlayfs is optional. You do not **need** one. * Alert on primary partition failures and act on it immediately. Primary partition failures will render your node unusable. -* Use EmptyDir for all scratch space requirements of your apps when IOPS isolation is not needed. +* Use EmptyDir for all scratch space requirements of your apps when IOPS isolation is not of concern. * Make the container’s writable layer `readonly` if possible. * Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate from using local storage for anything but super fast caching purposes or distributed databases leading to higher reliability & uptime for nodes. +# Features & Milestones + +The following two features are intended to prioritized over others to begin with. + +1. Support for durable Local PVs +2. Support for capacity isolation + +Alpha support for these two features are targeted for v1.7. Beta and GA timelines are TBD. +Currently, msau42@, jinxu@ and vishh@ will be developing these features. + +The following pending features need owners. Their delivery timelines will depend on the future owners. +1. Support for persistent volumes tied to the lifetime of a pod (`inline PV`) +2. Support for Logging Volumes +3. Support for changing the writable layer type of containers +4. Support for Block Level Storage From 8c08c3cf2e04a53780b9413067181f22983af526 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Wed, 8 Mar 2017 17:45:15 -0800 Subject: [PATCH 08/19] Fix examples, remove node annotation on PVC, update block device handling. --- .../local-storage-overview.md | 265 +++++++++--------- 1 file changed, 137 insertions(+), 128 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index c1b11a1dd10..680e3450127 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -73,12 +73,12 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati metadata: name: foo status: - capacity: + capacity: storage.kubernetes.io/runtime: 100Gi - storage.kubernetes.io/root: 100Gi - allocatable: + storage.kubernetes.io/root: 100Gi + allocatable: storage.kubernetes.io/runtime: 100Gi - storage.kubernetes.io/root: 90Gi + storage.kubernetes.io/root: 90Gi ``` 2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. 
@@ -92,15 +92,18 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati containers: - name: fooc resources: - limits: - storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/root: 1Gi + limits: + storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/overlay: 1Gi + volumeMounts: + - name: myEmptyDir + mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: - resources: - limits - storage.kubernetes.io: 20Gi + resources: + limits: + size: 1Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. @@ -123,6 +126,9 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati spec: containers: - name: fooc + volumeMounts: + - name: myEmptyDir + mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: @@ -141,7 +147,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati storage.kubernetes.io/overlay: 200Mi type: Container - default: - storage.kubernetes.io: 1Gi + size: 1Gi type: EmptyDir ``` @@ -159,12 +165,15 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati limits: storage.kubernetes.io/logs: 200Mi storage.kubernetes.io/overlay: 200Mi + volumeMounts: + - name: myEmptyDir + mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: - resources: + resources: limits: - storage.kubernetes.io: 1Gi + size: 1Gi ``` 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. @@ -179,15 +188,18 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati containers: - name: fooc resources: - requests: - storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 500Mi + requests: + storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/overlay: 500Mi + volumeMounts: + - name: myEmptyDir + mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: - resources: - limits: - storage.kubernetes.io/logs: 2Gi + resources: + limits: + size: 2Gi ``` 6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Durable Volumes as much as possible and avoid primary partitions. @@ -195,7 +207,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ### Alice manages a Database which needs access to “durable” and fast scratch space 1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. Storage classes and labels may also be specified. The volume consumes the entire partition. +2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. 
A StorageClass name that is prefixed with "local-" is required for the system to be able to differentiate between local and remote storage. Labels may also be specified. The volume consumes the entire partition. ```yaml kind: PersistentVolume @@ -204,7 +216,6 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: local-pv-1 annotations: volume.kubernetes.io/node: node-1 - volume.beta.kubernetes.io/storage-class: local-fast spec: capacity: storage: 100Gi @@ -213,6 +224,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete + storageClassName: local-fast ``` ``` $ kubectl get pv @@ -224,7 +236,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati local-pv-1 100Gi RWO Delete Available node-3 local-pv-2 10Gi RWO Delete Available node-3 ``` -3. Alice creates a StatefulSet that uses local PVCs. The annotation `volume.kubernetes.io/node = ""` is specified to indicate that the requested volume should be local to a node. The PVC will only be bound to PVs that also have the node annotation set and vice versa. +3. Alice creates a StatefulSet that uses local PVCs. The StorageClass prefix of "local-" indicates that the user wants local storage. The PVC will only be bound to PVs that match the StorageClass name. ```yaml apiVersion: apps/v1beta1 @@ -254,27 +266,23 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati volumeClaimTemplates: - metadata: name: www - annotations: - volume.kubernetes.io/node: "" - volume.beta.kubernetes.io/storage-class: local-fast spec: accessModes: [ "ReadWriteOnce" ] + storageClassName: local-fast resources: requests: storage: 100Gi - metadata: name: log - annotations: - volume.kubernetes.io/node: "" - volume.beta.kubernetes.io/storage-class: local-slow spec: accessModes: [ "ReadWriteOnce" ] + storageClassName: local-slow resources: requests: storage: 1Gi ``` -4. The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. The annotation `volume.kubernetes.io/node` will be filled in with the chosen node name. +4. The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. ``` $ kubectl get pvc NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE @@ -297,35 +305,27 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ``` 5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -6. To workaround situations when a pod cannot get access to its existing local PV due to resource unvavilability on the local PV's node, pods can choose to opt for switching to use a new local PV after a `timeout`. 
If the scheduler cannot bind the pod to the node where the local PV exists before `timeout` elapses since the pod's creation then the corresponding PVC will be unbound by the scheduler and the pod will then be bound to a different node where the pod would fit and local PV requirements are met. - -```yaml -apiVersion: v1 -type: pod -spec: -volumes: - - name: myDurableVolume - persistentVolumeClaim: - claimName: foo - accessTimeoutSeconds: 30 - ``` - -7. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. - - A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. +6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. + + Node taints already exist today. New PV and scheduling taints can be added to handle additional failure use cases when using local storage. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. A scheduling taint could signal a scheduling failure for the pod due to being unable to fit on the node. ```yaml - tolerations: + nodeTolerations: - key: node.alpha.kubernetes.io/notReady operator: TolerationOpExists tolerationSeconds: 600 - key: node.alpha.kubernetes.io/unreachable operator: TolerationOpExists tolerationSeconds: 1200 + pvTolerations: - key: storage.kubernetes.io/pvUnhealthy operator: TolerationOpExists + schedulingTolerations: + - key: scheduler.kubernetes.io/podCannotFit + operator: TolerationOpExists + tolerationSeconds: 600 ``` -8. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. +7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. ### Bob manages a distributed filesystem which needs access to all available storage on each node @@ -354,20 +354,20 @@ volumes: - name: fooc resources: limits: - storageLogs: 500Mi - storageOverlay: 1Gi + storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/overlay: 1Gi + volumeMounts: + - name: myEphemeralPersistentVolume + mountPath: /mnt/tmpdata volumes: - name: myEphemeralPeristentVolume - inline: - metadata: - labels: - storage.kubernetes.io/medium: local-ssd - storage.kubernetes.io/volume-type: local - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi + inline: + spec: + accessModes: [ "ReadWriteOnce" ] + storageClassName: local-fast + resources: + limits: + size: 1Gi ``` 4. 
Phippy notices some of her pods are experiencing spurious downtimes. With the help of monitoring (`iostat`), she notices that the nodes pods are running on are overloaded with I/O operations. She then updates her pods to use Logging Volumes which are backed by persistent storage. If a logging volumeMount is associated with a container, Kubelet will place log data from stdout & stderr of the container under the volume mount path within the container. Kubelet will continue to expose stdout/stderr log data to external logging agents using symlinks as it does already. @@ -380,39 +380,36 @@ volumes: spec: containers: - name: fooc - volumeMounts: - name: myLoggingVolume - path: /var/log/ + volumeMounts: + - name: myLoggingVolume + mountPath: /var/log/ policy: - logDir: - subDir: foo - glob: *.log + logDir: + subDir: foo + glob: *.log - name: barc - volumeMounts: - name: myInMemoryLoggVolume - path: /var/log/ - policy: - logDir: - subDir: bar - glob: *.log + volumeMounts: + - name: myInMemoryLoggVolume + mountPath: /var/log/ + policy: + logDir: + subDir: bar + glob: *.log volumes: - - name: myLoggingVolume - inline: - metadata: - labels: - storage.kubernetes.io/medium: local-ssd - storage.kubernetes.io/volume-type: local - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi - - name: myInMemoryLogVolume - emptyDir: - medium: memory - resources: - limits: - storage: 100Mi + - name: myLoggingVolume + inline: + spec: + accessModes: [ "ReadWriteOnce" ] + storageClassName: local-slow + resources: + requests: + storage: 1Gi + - name: myInMemoryLogVolume + emptyDir: + medium: memory + resources: + limits: + size: 100Mi ``` 5. Phippy notices some of her pods are suffering hangs by while writing to their writable layer. Phippy again notices that I/O contention is the root cause and then updates her Pod Spec to use memory backed or persistent volumes for her pods writable layer. Kubelet will instruct the runtimes to overlay the volume with `overlay` policy over the writable layer of the container. @@ -425,41 +422,38 @@ volumes: spec: containers: - name: fooc - volumeMounts: - name: myWritableLayer - policy: - overlay: - subDir: foo + volumeMounts: + - name: myWritableLayer + policy: + overlay: + subDir: foo - name: barc - volumeMounts: - name: myDurableWritableLayer - policy: - overlay: - subDir: bar - volumes: - - name: myWritableLayer - emptyDir: - medium: memory - resources: - limits: - storage: 100Mi - - name: myDurableWritableLayer - inline: - metadata: - labels: - storage.kubernetes.io/medium: local-ssd - storage.kubernetes.io/volume-type: local - spec: - accessModes: [ "ReadWriteOnce" ] - resources: - requests: - storage: 1Gi -``` + volumeMounts: + - name: myDurableWritableLayer + policy: + overlay: + subDir: bar + volumes: + - name: myWritableLayer + emptyDir: + medium: memory + resources: + limits: + storage: 100Mi + - name: myDurableWritableLayer + inline: + spec: + accessModes: [ "ReadWriteOnce" ] + storageClassName: local-fast + resources: + requests: + storage: 1Gi + ``` ### Bob manages a specialized application that needs access to Block level storage 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. -2. The cluster admin creates a DaemonSet addon that discovers all the raw block devices on a node that are available within a specific directory and creates corresponding PVs for them with a new `volumeType = block` field. +2. 
The same addon DaemonSet can discover block devices in the same directory as the filesystem mount points and creates corresponding PVs for them with a new `volumeType = block` field. This field indicates the original volume type upon PV creation. ```yaml kind: PersistentVolume @@ -468,9 +462,8 @@ volumes: name: foo annotations: storage.kubernetes.io/node: node-1 - volume.beta.kubernetes.io/storage-class: local-fast spec: - volumeType: block + volumeType: block capacity: storage: 100Gi local: @@ -478,6 +471,7 @@ volumes: accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete + storageClassName: local-fast ``` 3. Bob creates a pod with a PVC that requests for block level access and similar to a Stateful Set scenario the scheduler will identify nodes that can satisfy the pods request. The block devices will not be formatted to allow the application to handle the device using their own methods. @@ -487,17 +481,36 @@ volumes: apiVersion: v1 metadata: name: myclaim - annotations: - volume.beta.kubernetes.io/node: "" - volume.beta.kubernetes.io/storage-class: local-fast spec: volumeType: block + storageClassName: local-fast accessModes: - ReadWriteOnce resources: requests: storage: 80Gi ``` +4. It is also possible for a PVC that requests `volumeType: file` to also use a PV with `volumeType: block`, if no file-based PVs are available. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. And when the PV gets destroyed, then the filesystem also gets destroyed to return back to the original block state. + + ```yaml + kind: PersistentVolume + apiVersion: v1 + metadata: + name: foo + annotations: + storage.kubernetes.io/node: node-1 + spec: + volumeType: block + capacity: + storage: 100Gi + local: + path: /var/lib/kubelet/storage-raw-devices/foo + fsType: ext4 + accessModes: + - ReadWriteOnce + persistentVolumeReclaimPolicy: Delete + storageClassName: local-fast + ``` *The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.* @@ -508,10 +521,6 @@ volumes: * Local Persistent Volume bindings happening in the scheduler vs in PV controller * Should the PV controller fold into the scheduler * This will help spread PVs and pods across matching zones. -* Should block level storage devices be auto formatted to be used as file level storage instead of having the filesystems precreated by the admin? - * It would match behavior with GCE PD and EBS where the volume plugin will create the filesystem first. - * It can allow for more comprehensive (but slower) volume cleanup options. The filesystem can be destroyed and then the partition can be zeroed. - * It limits the filesystem choices to those that k8 supports. * Repair/replace scenarios. * What are the implications of removing a disk and replacing it with a new one? 
* We may not do anything in the system, but may need a special workflow From 4dbe1aa8e0bebd73bab2e38944fd901a83eb7929 Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Thu, 9 Mar 2017 13:43:21 -0800 Subject: [PATCH 09/19] rename local storage resource names Signed-off-by: Vishnu kannan --- .../local-storage-overview.md | 40 +++++++++---------- 1 file changed, 18 insertions(+), 22 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 680e3450127..1dee9f4cbd3 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -74,11 +74,11 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: foo status: capacity: - storage.kubernetes.io/runtime: 100Gi - storage.kubernetes.io/root: 100Gi + storage.kubernetes.io/overlay: 100Gi + storage.kubernetes.io/scratch: 100Gi allocatable: - storage.kubernetes.io/runtime: 100Gi - storage.kubernetes.io/root: 90Gi + storage.kubernetes.io/overlay: 100Gi + storage.kubernetes.io/scratch: 90Gi ``` 2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. @@ -93,7 +93,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/scratch: 500Mi storage.kubernetes.io/overlay: 1Gi volumeMounts: - name: myEmptyDir @@ -101,14 +101,13 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati volumes: - name: myEmptyDir emptyDir: - resources: - limits: - size: 1Gi + size: 1Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. -4. `storage.kubernetes.io/logs` resource can only be satisfied by `storage.kubernetes.io/root` Allocatable on nodes. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/runtime` if exposed by nodes or by `storage.kubernetes.io/root` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. -5. Kubelet will rotate logs to keep logs usage of “fooc” under 500Mi +4. `storage.kubernetes.io/scratch` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for writable layer. +5. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. +5. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi 6. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. 7. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. 8. 
If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. @@ -143,7 +142,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: mylimits spec: - default: - storage.kubernetes.io/logs: 200Mi + storage.kubernetes.io/scratch: 200Mi storage.kubernetes.io/overlay: 200Mi type: Container - default: @@ -163,7 +162,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/logs: 200Mi + storage.kubernetes.io/scratch: 200Mi storage.kubernetes.io/overlay: 200Mi volumeMounts: - name: myEmptyDir @@ -171,9 +170,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati volumes: - name: myEmptyDir emptyDir: - resources: - limits: - size: 1Gi + size: 1Gi ``` 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. @@ -189,7 +186,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: requests: - storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/scratch: 500Mi storage.kubernetes.io/overlay: 500Mi volumeMounts: - name: myEmptyDir @@ -197,9 +194,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati volumes: - name: myEmptyDir emptyDir: - resources: - limits: - size: 2Gi + size: 2Gi ``` 6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Durable Volumes as much as possible and avoid primary partitions. @@ -354,7 +349,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/logs: 500Mi + storage.kubernetes.io/scratch: 500Mi storage.kubernetes.io/overlay: 1Gi volumeMounts: - name: myEphemeralPersistentVolume @@ -547,7 +542,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati # Features & Milestones -The following two features are intended to prioritized over others to begin with. +#### Features with owners 1. Support for durable Local PVs 2. Support for capacity isolation @@ -555,7 +550,8 @@ The following two features are intended to prioritized over others to begin with Alpha support for these two features are targeted for v1.7. Beta and GA timelines are TBD. Currently, msau42@, jinxu@ and vishh@ will be developing these features. -The following pending features need owners. Their delivery timelines will depend on the future owners. +#### Features needing owners + 1. Support for persistent volumes tied to the lifetime of a pod (`inline PV`) 2. Support for Logging Volumes 3. 
Support for changing the writable layer type of containers From 90a855dd3f3e7a5d71b59748335e0b91defbb47b Mon Sep 17 00:00:00 2001 From: Vish Kannan Date: Mon, 13 Mar 2017 19:52:12 -0700 Subject: [PATCH 10/19] Fix storage requirement typo --- contributors/design-proposals/local-storage-overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 1dee9f4cbd3..05d40680e19 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -101,7 +101,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati volumes: - name: myEmptyDir emptyDir: - size: 1Gi + size: 20Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. From f50fca41c9aa3722e3b4d8d986fd994b927d96e4 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Fri, 17 Mar 2017 15:51:56 -0700 Subject: [PATCH 11/19] Update resource names, storageclass usage for local PVs, scheduling failure timeouts --- .../local-storage-overview.md | 69 +++++++++++-------- 1 file changed, 41 insertions(+), 28 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 05d40680e19..cc8cd37d515 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -93,7 +93,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/scratch: 500Mi + storage.kubernetes.io/logs: 500Mi storage.kubernetes.io/overlay: 1Gi volumeMounts: - name: myEmptyDir @@ -105,12 +105,14 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. -4. `storage.kubernetes.io/scratch` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for writable layer. -5. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. -5. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi -6. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. -7. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. -8. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. 
Cluster administrators are expected to fix unhealthy primary partitions on nodes. +4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for writable layer. +5. `storage.kubernetes.io/logs` is satisfied by `storage.kubernetes.io/scratch`. +6. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. +7. EmptyDir.size is both a request and limit that is satisfied by `storage.kubernetes.io/scratch`. +8. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi +9. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. +10. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. +11. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. ### Bob runs batch workloads and is unsure of “storage” requirements @@ -142,7 +144,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati name: mylimits spec: - default: - storage.kubernetes.io/scratch: 200Mi + storage.kubernetes.io/logs: 200Mi storage.kubernetes.io/overlay: 200Mi type: Container - default: @@ -162,7 +164,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/scratch: 200Mi + storage.kubernetes.io/logs: 200Mi storage.kubernetes.io/overlay: 200Mi volumeMounts: - name: myEmptyDir @@ -186,7 +188,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: requests: - storage.kubernetes.io/scratch: 500Mi + storage.kubernetes.io/logs: 500Mi storage.kubernetes.io/overlay: 500Mi volumeMounts: - name: myEmptyDir @@ -202,8 +204,16 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ### Alice manages a Database which needs access to “durable” and fast scratch space 1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. A StorageClass name that is prefixed with "local-" is required for the system to be able to differentiate between local and remote storage. Labels may also be specified. The volume consumes the entire partition. +2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. 
The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. A StorageClass is required and will have a new optional field `locality` for the system to be able to differentiate between local and remote storage. Labels may also be specified. The volume consumes the entire partition. + ```yaml + kind: StorageClass + apiVersion: storage.k8s.io/v1 + metadata: + name: local-fast + provisioner: "" + locality: Node + ``` ```yaml kind: PersistentVolume apiVersion: v1 @@ -231,7 +241,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati local-pv-1 100Gi RWO Delete Available node-3 local-pv-2 10Gi RWO Delete Available node-3 ``` -3. Alice creates a StatefulSet that uses local PVCs. The StorageClass prefix of "local-" indicates that the user wants local storage. The PVC will only be bound to PVs that match the StorageClass name. +3. Alice creates a StatefulSet that requests local storage from StorageClass "local-fast". The PVC will only be bound to PVs that match the StorageClass name. ```yaml apiVersion: apps/v1beta1 @@ -302,22 +312,25 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati 5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. 6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. - Node taints already exist today. New PV and scheduling taints can be added to handle additional failure use cases when using local storage. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. A scheduling taint could signal a scheduling failure for the pod due to being unable to fit on the node. + Node taints already exist today. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. Pod scheduling failures are specified separately as a timeout. ```yaml - nodeTolerations: - - key: node.alpha.kubernetes.io/notReady - operator: TolerationOpExists - tolerationSeconds: 600 - - key: node.alpha.kubernetes.io/unreachable - operator: TolerationOpExists - tolerationSeconds: 1200 - pvTolerations: - - key: storage.kubernetes.io/pvUnhealthy - operator: TolerationOpExists - schedulingTolerations: - - key: scheduler.kubernetes.io/podCannotFit - operator: TolerationOpExists - tolerationSeconds: 600 + apiVersion: v1 + kind: pod + metadata: + name: foo + spec: + + nodeTolerations: + - key: node.alpha.kubernetes.io/notReady + operator: TolerationOpExists + tolerationSeconds: 600 + - key: node.alpha.kubernetes.io/unreachable + operator: TolerationOpExists + tolerationSeconds: 1200 + pvTolerations: + - key: storage.kubernetes.io/pvUnhealthy + operator: TolerationOpExists + schedulingFailureTimeoutSeconds: 600 ``` 7. 
Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. @@ -349,7 +362,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - name: fooc resources: limits: - storage.kubernetes.io/scratch: 500Mi + storage.kubernetes.io/logs: 500Mi storage.kubernetes.io/overlay: 1Gi volumeMounts: - name: myEphemeralPersistentVolume From 8e006f051c944296a7531e3264c7599972635bb0 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Mon, 20 Mar 2017 11:45:38 -0700 Subject: [PATCH 12/19] Modify PV toleration --- .../design-proposals/local-storage-overview.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index cc8cd37d515..41abfe76d97 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -312,10 +312,10 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati 5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. 6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. - Node taints already exist today. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. Pod scheduling failures are specified separately as a timeout. + Node taints already exist today. Pod scheduling failures are specified separately as a timeout. ```yaml apiVersion: v1 - kind: pod + kind: Pod metadata: name: foo spec: @@ -327,12 +327,21 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati - key: node.alpha.kubernetes.io/unreachable operator: TolerationOpExists tolerationSeconds: 1200 + schedulingFailureTimeoutSeconds: 600 + ``` + + A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. + ```yaml + apiVersion: v1 + kind: PersistentVolumeClaim + metadata: + name: foo + spec: + pvTolerations: - key: storage.kubernetes.io/pvUnhealthy operator: TolerationOpExists - schedulingFailureTimeoutSeconds: 600 ``` - 7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. 
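To make the PV taint idea above concrete, the following is a minimal sketch of what a local PV might look like once the addon (or another external monitoring entity) has marked it unhealthy. A `taints` field on PersistentVolume does not exist today; its shape here simply mirrors node taints, and the `NoSchedule` effect is purely illustrative — only the `storage.kubernetes.io/pvUnhealthy` key comes from the example above.
```yaml
# Sketch only: the PV `taints` stanza is an assumption of this strawman, not an existing API field.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1
  annotations:
    volume.kubernetes.io/node: node-3     # node association as described earlier in this proposal
spec:
  capacity:
    storage: 100Gi
  local:
    path: /var/lib/kubelet/storage-partitions/local-pv-1
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  taints:                                 # assumed field, set by the addon or another monitoring entity
  - key: storage.kubernetes.io/pvUnhealthy
    effect: NoSchedule                    # illustrative; this proposal does not define effects for PV taints
```
A PVC that carries the `storage.kubernetes.io/pvUnhealthy` toleration shown earlier would keep its binding to such a volume; without that toleration the PVC is unbound and the pod is rescheduled to obtain a new PV.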
### Bob manages a distributed filesystem which needs access to all available storage on each node From 4893f8f94272db2d970638a6133f7ebe5327ab3f Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Tue, 21 Mar 2017 19:36:55 -0700 Subject: [PATCH 13/19] Fix code block formatting --- contributors/design-proposals/local-storage-overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 41abfe76d97..b8fd489c745 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -79,7 +79,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati allocatable: storage.kubernetes.io/overlay: 100Gi storage.kubernetes.io/scratch: 90Gi -``` + ``` 2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes. From cee4bca172d77a8d203d56887efaf3d757f23eba Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Thu, 30 Mar 2017 13:04:34 -0700 Subject: [PATCH 14/19] Clarify that block access can be considered as a separate feature. --- contributors/design-proposals/local-storage-overview.md | 1 + 1 file changed, 1 insertion(+) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index b8fd489c745..431c36780dd 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -468,6 +468,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati ``` ### Bob manages a specialized application that needs access to Block level storage +Note: Block access will be considered as a separate feature because it can work for both remote and local storage. The examples here are a suggestion on how such a feature can be applied to this local storage model, but is subject to change based on the final design for block access. 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. 2. The same addon DaemonSet can discover block devices in the same directory as the filesystem mount points and creates corresponding PVs for them with a new `volumeType = block` field. This field indicates the original volume type upon PV creation. From 03016686fdfa451c3b77716d7e8b10a41d82511e Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Mon, 3 Apr 2017 11:13:34 -0700 Subject: [PATCH 15/19] Change PV example to use topologyKey instead of locality field. --- .../design-proposals/local-storage-overview.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 431c36780dd..7cfff748599 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -59,7 +59,7 @@ The local PVs can be precreated by an addon DaemonSet that discovers all the sec Local PVs can only provide semi-persistence, and are only suitable for specific use cases that need performance, data gravity and can tolerate data loss. If the node or PV fails, then either the pod cannot run, or the pod has to give up on the local PV and find a new one. 
Failure scenarios can be handled by unbinding the PVC from the local PV, and forcing the pod to reschedule and find a new PV.
 
-Since local PVs are only accessible from specific nodes, a new PV-node association will be used by the scheduler to place pods. The association can be generalized to support any type of PV, not just local PVs. This allows for any volume plugin to take advantage of this behavior.
+Since local PVs are only accessible from specific nodes, the scheduler needs to take into account a PV's node constraint when placing pods. This can be generalized to a storage topology constraint, which can also work with zones, and in the future: racks, clusters, etc.
 
 # User Workflows
 
@@ -204,7 +204,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
 ### Alice manages a Database which needs access to “durable” and fast scratch space
 
 1. Cluster administrator provisions machines with local SSDs and brings up the cluster
-2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. A StorageClass is required and will have a new optional field `locality` for the system to be able to differentiate between local and remote storage. Labels may also be specified. The volume consumes the entire partition.
+2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a hostname label ties the volume to a specific node. A StorageClass is required and will have a new optional field `toplogyKey` for the system to apply node constraints to local storage when scheduling pods that request this StorageClass. Other labels may also be specified. 
```yaml kind: StorageClass @@ -212,15 +212,15 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati metadata: name: local-fast provisioner: "" - locality: Node + toplogyKey: kubernetes.io/hostname ``` ```yaml kind: PersistentVolume apiVersion: v1 metadata: name: local-pv-1 - annotations: - volume.kubernetes.io/node: node-1 + labels: + kubernetes.io/hostname: node-1 spec: capacity: storage: 100Gi @@ -478,8 +478,8 @@ Note: Block access will be considered as a separate feature because it can work apiVersion: v1 metadata: name: foo - annotations: - storage.kubernetes.io/node: node-1 + labels: + kubernetes.io/hostname: node-1 spec: volumeType: block capacity: @@ -515,8 +515,8 @@ Note: Block access will be considered as a separate feature because it can work apiVersion: v1 metadata: name: foo - annotations: - storage.kubernetes.io/node: node-1 + labels: + kubernetes.io/hostname: node-1 spec: volumeType: block capacity: From 86f560ac79b89483d903913130aedaf3009422d5 Mon Sep 17 00:00:00 2001 From: Vishnu kannan Date: Mon, 3 Apr 2017 16:45:11 -0700 Subject: [PATCH 16/19] adding an FAQ Signed-off-by: Vishnu kannan --- .../local-storage-overview.md | 23 +++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 7cfff748599..07329a61907 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -50,7 +50,7 @@ Primary partitions are shared partitions that can provide ephemeral local storag This partition holds the kubelet’s root directory (`/var/lib/kubelet` by default) and `/var/log` directory. This partition may be shared between user pods, OS and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPS for example) from this partition. ### Runtime -This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. +This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. Container image layers and writable later is stored here. If the runtime partition exists, `root` parition will not hold any image layer or writable layers. ## Secondary Partitions All other partitions are exposed as local persistent volumes. Each local volume uses an entire partition. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details to the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage. @@ -101,7 +101,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to volumes: - name: myEmptyDir emptyDir: - size: 20Gi + sizeLimit: 20Gi ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. 
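As a quick sanity check on the figure above, the “21.5Gi” guarantee is simply the sum of the storage limits declared in the example pod spec:
```
   500Mi   storage.kubernetes.io/logs      (container "fooc")
+    1Gi   storage.kubernetes.io/overlay   (container "fooc")
+   20Gi   emptyDir "myEmptyDir" sizeLimit
---------
  21.5Gi   total local storage guaranteed to pod "foo"
```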
@@ -563,6 +563,25 @@ Note: Block access will be considered as a separate feature because it can work * Make the container’s writable layer `readonly` if possible. * Another option is to keep the writable layer on tmpfs. Such a setup will allow you to eventually migrate from using local storage for anything but super fast caching purposes or distributed databases leading to higher reliability & uptime for nodes. +# FAQ + +### Why is the kubelet managing logs? + +Kubelet is managing access to shared storage on the node. Container logs outputted via it's stdout and stderr ends up on the shared storage that kubelet is managing. So, kubelet needs direct control over the log data to keep the containers running (by rotating logs), store them long enough for break glass situations and apply different storage policies in a multi-tenent cluster. All of these features are not easily expressible through external logging agents like journald for example. + + +### Master are upgraded prior to nodes. How should storage as a new compute resource be rolled out on to existing clusters? + +Capacity isolation of shared partitions (ephemeral storage) will be controlled using a feature gate. Do not enable this feature gate until all the nodes in a cluster are running a kubelet version that supports capacity isolation. +Since older kubelets will not surface capacity of shared partitions, the scheduler will ignore those nodes when attempting to schedule pods that request storage capacity explicitly. + + +### What happens if storage usage is unavailable for writable layer? + +Kubelet will attempt to enforce capacity limits on a best effort basis. If the underlying container runtime cannot surface usage metrics for the writable layer, then kubelet will not provide capacity isolation for the writable layer. + + + # Features & Milestones #### Features with owners From 70301e5aa7f9310f5450d2bc8caaa6e2a4af7a55 Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Mon, 10 Apr 2017 14:43:32 -0700 Subject: [PATCH 17/19] Add FAQ about LocalStorage PV partitions --- contributors/design-proposals/local-storage-overview.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 07329a61907..920a86f3a25 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -581,6 +581,9 @@ Since older kubelets will not surface capacity of shared partitions, the schedul Kubelet will attempt to enforce capacity limits on a best effort basis. If the underlying container runtime cannot surface usage metrics for the writable layer, then kubelet will not provide capacity isolation for the writable layer. +### Are LocalStorage PVs required to be a whole partition? + +No, but it is the recommended way to ensure capacity and performance isolation. For HDDs, a whole disk is recommended for performance isolation. In some environments, multiple storage partitions are not available, so the only option is to share the same filesystem. In that case, directories in the same filesystem can be specified, and the adminstrator could configure group quota to provide capacity isolation. # Features & Milestones From e4c61005b033621aba74758a4d816e6510cb848b Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Fri, 14 Apr 2017 12:34:16 -0700 Subject: [PATCH 18/19] Update API to explicitly distinguish between fs and block-based volumes. 
--- .../design-proposals/local-storage-overview.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index 920a86f3a25..b7702a29612 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -225,7 +225,8 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to capacity: storage: 100Gi local: - path: /var/lib/kubelet/storage-partitions/local-pv-1 + fs: + path: /var/lib/kubelet/storage-partitions/local-pv-1 accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete @@ -471,7 +472,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to Note: Block access will be considered as a separate feature because it can work for both remote and local storage. The examples here are a suggestion on how such a feature can be applied to this local storage model, but is subject to change based on the final design for block access. 1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet. -2. The same addon DaemonSet can discover block devices in the same directory as the filesystem mount points and creates corresponding PVs for them with a new `volumeType = block` field. This field indicates the original volume type upon PV creation. +2. The same addon DaemonSet can also discover block devices and creates corresponding PVs for them with the `block` field. ```yaml kind: PersistentVolume @@ -481,11 +482,11 @@ Note: Block access will be considered as a separate feature because it can work labels: kubernetes.io/hostname: node-1 spec: - volumeType: block capacity: storage: 100Gi local: - path: /var/lib/kubelet/storage-raw-devices/foo + block: + device: /var/lib/kubelet/storage-raw-devices/foo accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete @@ -508,7 +509,7 @@ Note: Block access will be considered as a separate feature because it can work requests: storage: 80Gi ``` -4. It is also possible for a PVC that requests `volumeType: file` to also use a PV with `volumeType: block`, if no file-based PVs are available. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. And when the PV gets destroyed, then the filesystem also gets destroyed to return back to the original block state. +4. It is also possible for a PVC that requests `volumeType: file` to also use a block-based PV. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. And when the PV gets destroyed, then the filesystem also gets destroyed to return back to the original block state. ```yaml kind: PersistentVolume @@ -518,12 +519,12 @@ Note: Block access will be considered as a separate feature because it can work labels: kubernetes.io/hostname: node-1 spec: - volumeType: block capacity: storage: 100Gi local: - path: /var/lib/kubelet/storage-raw-devices/foo - fsType: ext4 + block: + path: /var/lib/kubelet/storage-raw-devices/foo + fsType: ext4 accessModes: - ReadWriteOnce persistentVolumeReclaimPolicy: Delete From ba62a3f6cb9a301e95c4b64b9052455bdac9a3fe Mon Sep 17 00:00:00 2001 From: Michelle Au Date: Mon, 17 Apr 2017 17:43:02 -0700 Subject: [PATCH 19/19] Addressed more comments. Expanded scheduler workflow for local PV. 
--- .../local-storage-overview.md | 45 ++++++++++--------- 1 file changed, 24 insertions(+), 21 deletions(-) diff --git a/contributors/design-proposals/local-storage-overview.md b/contributors/design-proposals/local-storage-overview.md index b7702a29612..a21b8a4c58c 100644 --- a/contributors/design-proposals/local-storage-overview.md +++ b/contributors/design-proposals/local-storage-overview.md @@ -94,7 +94,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to resources: limits: storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 1Gi + storage.kubernetes.io/writable: 1Gi volumeMounts: - name: myEmptyDir mountPath: /mnt/data @@ -105,13 +105,13 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to ``` 3. Alice’s pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for writable layer and 500Mi for logs, and “myEmptyDir” volume cannot consume more than 20Gi. -4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for writable layer. +4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/writable` is meant for writable layer. 5. `storage.kubernetes.io/logs` is satisfied by `storage.kubernetes.io/scratch`. -6. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. +6. `storage.kubernetes.io/writable` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod. 7. EmptyDir.size is both a request and limit that is satisfied by `storage.kubernetes.io/scratch`. 8. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi 9. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if it's total usage exceeds it's storage limits. If usage on `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage. -10. Health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. +10. Primary partition health is monitored by an external entity like the “Node Problem Detector” which is expected to place appropriate taints. 11. If a primary partition becomes unhealthy, the node is tainted and all pods running in it will be evicted by default, unless they tolerate that taint. Kubelet’s behavior on a node with unhealthy primary partition is undefined. Cluster administrators are expected to fix unhealthy primary partitions on nodes. 
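As an illustration of points 10 and 11 above, a pod that should keep running even when the node’s primary partition is marked unhealthy could tolerate the corresponding node taint. The taint key below is hypothetical — this proposal does not name the taint that the monitoring entity would apply — and the partial pod spec follows the `nodeTolerations` convention used elsewhere in this document.
```yaml
# Illustrative sketch: the taint key is a placeholder, not defined by this proposal.
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  nodeTolerations:
  - key: storage.kubernetes.io/primaryPartitionUnhealthy   # hypothetical taint placed by the monitoring entity
    operator: TolerationOpExists
    tolerationSeconds: 600    # tolerate the condition for 10 minutes before being evicted
```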
### Bob runs batch workloads and is unsure of “storage” requirements @@ -145,10 +145,10 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to spec: - default: storage.kubernetes.io/logs: 200Mi - storage.kubernetes.io/overlay: 200Mi + storage.kubernetes.io/writable: 200Mi type: Container - default: - size: 1Gi + sizeLimit: 1Gi type: EmptyDir ``` @@ -165,18 +165,18 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to resources: limits: storage.kubernetes.io/logs: 200Mi - storage.kubernetes.io/overlay: 200Mi + storage.kubernetes.io/writable: 200Mi volumeMounts: - name: myEmptyDir mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: - size: 1Gi + sizeLimit: 1Gi ``` 4. Bob’s “foo” pod can use upto “200Mi” for its containers logs and writable layer each, and “1Gi” for its “myEmptyDir” volume. -5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher “capacity” for his EmptyDir volumes. +5. If Bob’s pod “foo” exceeds the “default” storage limits and gets evicted, then Bob can set a minimum storage requirement for his containers and a higher `sizeLimit` for his EmptyDir volumes. ```yaml apiVersion: v1 @@ -189,22 +189,22 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to resources: requests: storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 500Mi + storage.kubernetes.io/writable: 500Mi volumeMounts: - name: myEmptyDir mountPath: /mnt/data volumes: - name: myEmptyDir emptyDir: - size: 2Gi + sizeLimit: 2Gi ``` -6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Durable Volumes as much as possible and avoid primary partitions. +6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Volumes as much as possible and avoid primary partitions. ### Alice manages a Database which needs access to “durable” and fast scratch space 1. Cluster administrator provisions machines with local SSDs and brings up the cluster -2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a hostname label ties the volume to a specific node. A StorageClass is required and will have a new optional field `toplogyKey` for the system to apply node constraints to local storage when scheduling pods that request this StorageClass. Other labels may also be specified. +2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well known location and creates Local PVs for them if one doesn’t exist already. The PVs will include a path to the secondary device mount points, and a hostname label ties the volume to a specific node. A StorageClass is required and will have a new optional field `toplogyKey`. This field tells the scheduler to filter PVs with the same `topologyKey` value on the node. The `topologyKey` can be any label key applied to a node. 
For the local storage case, the `topologyKey` is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well. ```yaml kind: StorageClass @@ -224,7 +224,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to spec: capacity: storage: 100Gi - local: + localStorage: fs: path: /var/lib/kubelet/storage-partitions/local-pv-1 accessModes: @@ -288,7 +288,10 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to storage: 1Gi ``` -4. The scheduler identifies nodes for each pod that can satisfy cpu, memory, storage requirements and also contains available local PVs to satisfy the pod's PVC claims. It then binds the pod’s PVCs to specific PVs on the node and then binds the pod to the node. +4. The scheduler identifies nodes for each pod that can satisfy all the existing predicates. +5. The nodes list is further filtered by looking at the PVC's StorageClass `topologyKey`, and checking if there are enough available PVs that have the same `topologyKey` value as the node. In the case of local PVs, it checks that there are enough PVs with the same `kubernetes.io/hostname` value as the node. +6. The scheduler chooses a node for the pod based on a ranking algorithm. +7. Once the pod is assigned to a node, then the pod’s local PVCs get bound to specific local PVs on the node. ``` $ kubectl get pvc NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE @@ -310,8 +313,8 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to local-pv-2 10Gi Bound log-local-pvc-3 node-3 ``` -5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. -6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. +8. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority which will result in such pods preempting other low priority pods if necessary to run on a specific node. +9. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound. Node taints already exist today. Pod scheduling failures are specified separately as a timeout. ```yaml @@ -343,7 +346,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to - key: storage.kubernetes.io/pvUnhealthy operator: TolerationOpExists ``` -7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. 
The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. +10. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds it back to the cluster. ### Bob manages a distributed filesystem which needs access to all available storage on each node @@ -361,7 +364,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to 1. Phippy creates a dedicated partition with a separate device for her system daemons. She achieves this by making `/var/log/containers`, `/var/lib/kubelet`, `/var/lib/docker` (with the docker runtime) all reside on a separate partition. 2. Phippy is aware that pods can cause abuse to each other. -3. Whenever a pod experiences I/O issues with it's EmptyDir volume, Phippy reconfigures those pods to use Persistent Volumes whose lifetime is tied to the pod. +3. Whenever a pod experiences I/O issues with it's EmptyDir volume, Phippy reconfigures those pods to use an inline Persistent Volume, whose lifetime is tied to the pod. ```yaml apiVersion: v1 kind: pod @@ -373,7 +376,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to resources: limits: storage.kubernetes.io/logs: 500Mi - storage.kubernetes.io/overlay: 1Gi + storage.kubernetes.io/writable: 1Gi volumeMounts: - name: myEphemeralPersistentVolume mountPath: /mnt/tmpdata @@ -484,7 +487,7 @@ Note: Block access will be considered as a separate feature because it can work spec: capacity: storage: 100Gi - local: + localStorage: block: device: /var/lib/kubelet/storage-raw-devices/foo accessModes: