Commit c8ead0e

More Justin comments
Author: Tom Finley
1 parent 5b8b6ec commit c8ead0e

3 files changed: 62 additions and 63 deletions

docs/code/IDataViewTypeSystem.md

Lines changed: 38 additions & 38 deletions
Original file line number | Diff line number | Diff line change
@@ -147,10 +147,10 @@ This document uses convenient shorthand for standard types:
147147
* `R4`, `R8`: single and double precision floating-point
148148

149149
* `I1`, `I2`, `I4`, `I8`: signed integer types with the indicated number of
150-
bytes
150+
bytes
151151

152152
* `U1`, `U2`, `U4`, `U8`: unsigned integer types with the indicated number of
153-
bytes
153+
bytes
154154

155155
* `UG`: unsigned type with 16-bytes, typically used as a unique ID
156156

@@ -161,10 +161,10 @@ bytes
161161
* `DZ`: datetime zone, a date and time with a timezone
162162

163163
* `U4[100-199]`: A key type based on `U4` representing legal values from 100
164-
to 199, inclusive
164+
to 199, inclusive
165165

166166
* `V<R4,3,2>`: A vector type with item type `R4` and dimensionality
167-
information [3,2]
167+
information [3,2]
168168

169169
See the sections on the specific types for more detail.
170170

@@ -233,18 +233,18 @@ type, which is a compatible column type.
233233

234234
For example:
235235

236-
* A column may have a `BL` valued piece of metadata associated with the string
237-
`IsNormalized` indicating whether the column can be interpreted as a label.
236+
* A column may indicate that it is normalized, by providing a `BL` valued
237+
piece of metadata named `IsNormalized`.
238238

239239
* A column whose type is `V<R4,17>`, meaning a vector of length 17 whose items
240-
are single-precision floating-point values, might have `SlotNames` metadata of
241-
type `V<TX,17>`, meaning a vector of length 17 whose items are text.
240+
are single-precision floating-point values, might have `SlotNames` metadata
241+
of type `V<TX,17>`, meaning a vector of length 17 whose items are text.
242242

243243
* A column produced by a scorer may have several pieces of associated
244-
metadata, indicating the "scoring column group id" that it belongs to, what
245-
kind of scorer produced the column (e.g., binary classification), and the
246-
precise semantics of the column (e.g., predicted label, raw score,
247-
probability).
244+
metadata, indicating the "scoring column group id" that it belongs to, what
245+
kind of scorer produced the column (e.g., binary classification), and the
246+
precise semantics of the column (e.g., predicted label, raw score,
247+
probability).
248248

249249
The `ISchema` interface, including the metadata API, is fully specified in
250250
another document.
@@ -401,7 +401,7 @@ Notes:
401401
representation values are from one up to and including `Count`. The `Count`
402402
is required to be representable in the underlying type, so, for example, the
403403
`Count` value of a key type based on `System.Byte` must not exceed `255`. As
404-
an example of the usefulness of the `Count` property, consider the
404+
an example of the usefulness of the `Count` property, consider the
405405
`KeyToVector` transform implemented as part of ML.NET. It maps from a key
406406
type value to an indicator vector. The length of the vector is the `Count`
407407
of the key type, which is required to be positive. For a key value of `k`,
@@ -416,7 +416,7 @@ Notes:
416416

417417
* The `Min` property returns the minimum semantic value of the key type. This
418418
is used exclusively for transforming from a representation value, where the
419-
valid values start at one, to user facing values, which might start at any
419+
valid values start at one, to user facing values, which might start at any
420420
non-negative value. The most common values for `Min` are zero and one.
421421

422422
* The boolean `Contiguous` property indicates whether values of the key type
@@ -428,13 +428,13 @@ Notes:
428428

429429
* A key type can be non-`Contiguous` only if `Count` is zero. The converse
430430
however is not true. A key type that is contiguous but has `Count` equal to
431-
zero is one where there is a reasonably small maximum, but that maximum is
431+
zero is one where there is a reasonably small maximum, but that maximum is
432432
unknown. In this case, an array might be a good choice for a map from the
433433
key type.
434434

435435
* The shorthand for a key type with representation type `U1`, and semantic
436436
values from `1000` to `1099`, inclusive, is `U1[1000-1099]`. Note that the
437-
`Min` value of this key type is outside the range of the underlying type,
437+
`Min` value of this key type is outside the range of the underlying type,
438438
`System.Byte`, but the `Count` value is only `100`, which is representable
439439
in a `System.Byte`. Recall that the representation values always start at 1
440440
and extend up to `Count`, in this case `100`.
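The relationship between representation values and semantic values described in this hunk can be illustrated with a small Python sketch. It is not ML.NET code: the function name `key_to_semantic` and its parameters are hypothetical, and it assumes only what the text states, namely that representation values run from 1 up to `Count` and that `Min` is the smallest semantic value.

```python
def key_to_semantic(rep: int, min_value: int, count: int) -> int:
    """Map a representation value (1..count) to its semantic value.

    Hypothetical helper for illustration; not part of any ML.NET API.
    """
    if not 1 <= rep <= count:
        raise ValueError(f"representation value {rep} is outside 1..{count}")
    # Representation value 1 corresponds to the minimum semantic value.
    return min_value + (rep - 1)


# U1[1000-1099]: Min is 1000 and Count is 100, so every representation
# value fits in a byte even though the semantic values do not.
first = key_to_semantic(1, 1000, 100)    # 1000
last = key_to_semantic(100, 1000, 100)   # 1099
```

The sketch shows why `Min` may lie outside the range of the underlying type: only the representation values are ever stored.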
@@ -454,7 +454,7 @@ There are standard conversions from one key type to another, provided:
454454

455455
* Either the number of bytes in the destination's underlying type is greater
456456
than the number of bytes in the source's underlying type, or the `Count`
457-
value is positive. In the latter case, the `Count` is necessarily less than
457+
value is positive. In the latter case, the `Count` is necessarily less than
458458
2^k, where k is the number of bits in the destination type's underlying type.
459459
For example, `U1[1-*]` can be converted to `U2[1-*]`, but `U2[1-*]` cannot
460460
be converted to `U1[1-*]`. Also, `U1[1-100]` and `U2[1-100]` can be
@@ -502,17 +502,17 @@ partitioned into an unknown number of runs of consecutive slots each of length
502502
`64`.
503503

504504
As another example, consider an image data set. The data starts with a `TX`
505-
column containing URLs for images. Applying a BitmapLoader transform generates
506-
a column of a custom (non-standard) type, `Picture<*,*,4>`, where the
507-
asterisks indicate that the picture dimensions are unknown. The last dimension
508-
of `4` indicates that there are four channels in each pixel: the three color
509-
components, plus the alpha channel. Applying a `BitmapScaler` transform scales
510-
and crops the images to a specified size, for example, `100x100`, producing a
511-
type of `Picture<100,100,4>`. Finally, applying a `PixelExtractor` transform
512-
(and specifying that the alpha channel should be dropped), produces the vector
513-
type `V<R4,3,100,100>`. In this example, the `PixelExtractor` re-organized the
514-
color information into separate planes, and divided each pixel value by 256 to
515-
get pixel values between zero and one.
505+
column containing URLs for images. Applying an `ImageLoader` transform
506+
generates a column of a custom (non-standard) type, `Picture<*,*,4>`, where
507+
the asterisks indicate that the picture dimensions are unknown. The last
508+
dimension of `4` indicates that there are four channels in each pixel: the
509+
three color components, plus the alpha channel. Applying an `ImageResizer`
510+
transform scales and crops the images to a specified size, for example,
511+
`100x100`, producing a type of `Picture<100,100,4>`. Finally, applying a
512+
`ImagePixelExtractor` transform (and specifying that the alpha channel should
513+
be dropped), produces the vector type `V<R4,3,100,100>`. In this example, the
514+
`ImagePixelExtractor` re-organized the color information into separate planes,
515+
and divided each pixel value by 256 to get pixel values between zero and one.
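The last step of the image example, re-organizing interleaved pixels into separate color planes and scaling by 1/256, might look roughly like the following Python sketch. It makes several assumptions that the text does not state: the input is a flat, row-major sequence of RGBA bytes, the channel order is red, green, blue, alpha, and the function name `extract_pixels` is hypothetical. It is not the transform's actual implementation.

```python
def extract_pixels(rgba, height, width, keep_alpha=False):
    """Turn interleaved RGBA bytes into channel planes scaled by 1/256.

    Returns a flat list laid out as [channel][row][column], which matches
    the shape of V<R4,3,height,width> when alpha is dropped.
    """
    if len(rgba) != height * width * 4:
        raise ValueError("expected 4 bytes per pixel")
    channels = 4 if keep_alpha else 3
    planes = []
    for c in range(channels):
        # Every 4th byte starting at offset c belongs to channel c.
        planes.extend(value / 256 for value in rgba[c::4])
    return planes


# A 1x2 image: two RGBA pixels.
tiny = [0, 128, 255, 255, 64, 0, 32, 255]
planes = extract_pixels(tiny, height=1, width=2)
```

For a `100x100` image this produces 3 * 100 * 100 values, all in the range from zero up to (but not including) one.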
516516

517517
### Equivalence
518518

@@ -556,14 +556,14 @@ Notes:
556556

557557
* The `Indices` array is only relevant when the vector is sparse. In the
558558
sparse case, `Indices` is parallel to `Values`, only the first `Count` items
559-
are meaningful, the indices must be non-negative and less than `Length`,
560-
and the indices must be strictly increasing. Note that when `Count` is zero,
559+
are meaningful, the indices must be non-negative and less than `Length`, and
560+
the indices must be strictly increasing. Note that when `Count` is zero,
561561
`Indices` may be null. In the dense case, `Indices` is not meaningful and
562562
may or may not be null.
563563

564564
* It is very common for the arrays in a `VBuffer<T>` to be larger than needed
565565
for their current value. A special case of this is when a dense `VBuffer<T>`
566-
has a non-null `Indices` array. The extra items in the arrays are not
566+
has a non-null `Indices` array. The extra items in the arrays are not
567567
meaningful and should be ignored. Allowing these buffers to be larger than
568568
currently needed reduces the need to reallocate buffers for different
569569
values. For example, when cursoring through a vector valued column with
@@ -574,7 +574,7 @@ Notes:
574574

575575
* Generally, vectors should use a sparse representation only when the number
576576
of non-default items is at most half the value of Length. However, this
577-
guideline is not a mandate.
577+
guideline is not a mandate.
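The constraints on `Indices` and the sparsity guideline in these notes can be summarized in a short Python sketch. The function names are hypothetical and the code models only the rules quoted here (the first `Count` indices are meaningful, must lie in `[0, Length)`, and must be strictly increasing); it is not the `VBuffer<T>` implementation.

```python
def is_valid_sparse(indices, count, length):
    """Check the sparse-index invariants on the first `count` entries."""
    if count == 0:
        return True  # indices may be None when there are no explicit items
    if indices is None or len(indices) < count:
        return False
    active = indices[:count]  # anything past `count` is spare capacity
    in_range = all(0 <= i < length for i in active)
    increasing = all(a < b for a, b in zip(active, active[1:]))
    return in_range and increasing


def prefer_sparse(non_default_items, length):
    """Rule of thumb from the text: sparse when at most half the slots are set."""
    return 2 * non_default_items <= length
```

Note that the array passed as `indices` may be longer than `count`; the extra entries are ignored, mirroring the note about oversized buffers.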
578578

579579
See the full `IDataView` technical specification for additional details on
580580
`VBuffer<T>`, including complete discussion of programming idioms, and
@@ -668,7 +668,7 @@ There are standard conversions from one key type to another, provided:
668668

669669
* Either the number of bytes in the destination's underlying type is greater
670670
than the number of bytes in the source's underlying type, or the `Count`
671-
value is positive. In the latter case, the `Count` is necessarily less than
671+
value is positive. In the latter case, the `Count` is necessarily less than
672672
`2^^k`, where `k` is the number of bits in the destination type's underlying
673673
type. For example, `U1[1-*]` can be converted to `U2[1-*]`, but `U2[1-*]`
674674
cannot be converted to `U1[1-*]`. Also, `U1[1-100]` and `U2[1-100]` can be
@@ -709,7 +709,7 @@ In the following notes, the symbol `type` is a variable of type `ColumnType`.
709709

710710
* Certain .Net types have a corresponding `DataKind` `enum` value. The value
711711
of the `type.RawKind` property is consistent with `type.RawType`. For .Net
712-
types that do not have a corresponding `DataKind` value, the `type.RawKind`
712+
types that do not have a corresponding `DataKind` value, the `type.RawKind`
713713
property returns zero. The `type.RawKind` property is particularly useful
714714
when switching over raw type possibilities, but only after testing for the
715715
broader kind of the type (key type, numeric type, etc.).
@@ -730,22 +730,22 @@ In the following notes, the symbol `type` is a variable of type `ColumnType`.
730730

731731
* If `type` is a key type, then `type.KeyCount` is the same as
732732
`((KeyType)type).Count`. If `type` is not a key type, then `type.KeyCount`
733-
is zero. Note that a key type can have a `Count` value of zero, indicating
733+
is zero. Note that a key type can have a `Count` value of zero, indicating
734734
that the count is unknown, so `type.KeyCount` being zero does not imply that
735735
`type` is not a key type. In summary, `type.KeyCount` is equivalent to:
736736
`type is KeyType ? ((KeyType)type).Count : 0`.
737737

738738
* The `type.ItemType` property is the item type of the vector type, if `type`
739739
is a vector type, and is the same as `type` otherwise. For example, to test
740-
for a type that is either `TX` or a vector of `TX`, one can use
740+
for a type that is either `TX` or a vector of `TX`, one can use
741741
`type.ItemType.IsText`.
742742

743743
* The `type.IsKnownSizeVector` property is equivalent to `type.VectorSize >
744744
0`.
745745

746746
* The `type.VectorSize` property is zero if either `type` is not a vector type
747747
or if `type` is a vector type of unknown/variable length. Otherwise, it is
748-
the length of vectors belonging to the type.
748+
the length of vectors belonging to the type.
749749

750750
* The `type.ValueCount` property is one if `type` is not a vector type and the
751751
same as `type.VectorSize` if `type` is a vector type.
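The convenience properties listed in this hunk can be modeled with a toy Python class. `ToyType` and its fields are hypothetical stand-ins chosen for illustration; the sketch encodes only the rules stated in the bullets and does not reproduce the real `ColumnType` class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyType:
    """Hypothetical stand-in for a column type; not the ML.NET class."""
    name: str
    is_key: bool = False
    count: int = 0                     # key count; 0 means unknown
    item: Optional["ToyType"] = None   # set only for vector types
    size: int = 0                      # vector length; 0 means unknown/variable

    @property
    def is_vector(self) -> bool:
        return self.item is not None

    @property
    def key_count(self) -> int:
        # Zero both for non-key types and for key types of unknown count.
        return self.count if self.is_key else 0

    @property
    def item_type(self) -> "ToyType":
        return self.item if self.is_vector else self

    @property
    def vector_size(self) -> int:
        return self.size if self.is_vector else 0

    @property
    def is_known_size_vector(self) -> bool:
        return self.vector_size > 0

    @property
    def value_count(self) -> int:
        return self.vector_size if self.is_vector else 1


tx = ToyType("TX")
tx_vec = ToyType("V<TX,17>", item=tx, size=17)
open_key = ToyType("U4[0-*]", is_key=True)  # key type with unknown count
```

Here `tx.item_type` and `tx_vec.item_type` are both the text type, which is what makes a single `ItemType` test cover the scalar and vector cases, while `open_key.key_count` is zero even though `open_key` is a key type.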
@@ -756,7 +756,7 @@ In the following notes, the symbol `type` is a variable of type `ColumnType`.
756756

757757
* The `SameSizeAndItemType` method is the same as `Equals` for non-vector
758758
types. For vector types, it returns true iff the two types have the same
759-
item type and have the same `VectorSize` values. For example, for the two
759+
item type and have the same `VectorSize` values. For example, for the two
760760
vector types `V<R4,3,2>` and `V<R4,6>`, `Equals` returns false but
761761
`SameSizeAndItemType` returns true.
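The `V<R4,3,2>` versus `V<R4,6>` example can be reproduced with a minimal Python sketch. The representation of a vector type as an `(item_type, dims)` tuple and both function names are assumptions made for illustration; ordinary tuple equality stands in for `Equals`.

```python
from math import prod


def vector_size(dims):
    """Total slot count for a dimension tuple; 0 stands for unknown size."""
    return prod(dims) if dims else 0


def same_size_and_item_type(a, b):
    """Compare two vector types given as (item_type, dims) pairs."""
    (item_a, dims_a), (item_b, dims_b) = a, b
    return item_a == item_b and vector_size(dims_a) == vector_size(dims_b)


v32 = ("R4", (3, 2))   # V<R4,3,2>
v6 = ("R4", (6,))      # V<R4,6>
```

The two types differ as tuples because their dimensionality information differs, yet both hold six `R4` slots, so the looser comparison accepts them.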
762762

docs/code/IdvFileFormat.md

Lines changed: 16 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -28,22 +28,22 @@ being:
2828
* All numbers are stored as little-endian, using their natural fix-length
2929
binary encoding.
3030

31-
* Strings are stored using an unsigned LEB128 number describing the number of
32-
bytes, followed by that many bytes containing the UTF-8 encoded string.
33-
34-
A note about this: [LEB128](https://en.wikipedia.org/wiki/LEB128) is a simple
35-
encoding to encode arbitrarily large integers. Each byte of 8-bits follows
36-
this convention. The most significant bit is 0 if and only if this is the end
37-
of the LEB128 encoding. The remaining 7 bits are a part of the number being
38-
encoded. The bytes are stored little-endian, that is, the first byte holds the
39-
7 least significant bits, the second byte (if applicable) holds the next 7
40-
least significant bits, etc., and the last byte holds the 7 most significant
41-
bits. LEB128 is used one or two places in this format. (I might tend to prefer
42-
use of LEB128 in places where we are writing values that, on balance, we
43-
expect to be relatively small, and only in cases where there is no potential
44-
for benefit for random access to the associated stream, since LEB128 is
45-
incompatible with random access. However, this is not formulated into anything
46-
approaching a definite policy.)
31+
* Strings are stored using an unsigned
32+
[LEB128](https://en.wikipedia.org/wiki/LEB128) number describing the number
33+
of bytes, followed by that many bytes containing the UTF-8 encoded string.
34+
35+
A note about this: LEB128 is a simple encoding to encode arbitrarily large
36+
integers. Each byte of 8-bits follows this convention. The most significant
37+
bit is 0 if and only if this is the end of the LEB128 encoding. The remaining
38+
7 bits are a part of the number being encoded. The bytes are stored
39+
little-endian, that is, the first byte holds the 7 least significant bits, the
40+
second byte (if applicable) holds the next 7 least significant bits, etc., and
41+
the last byte holds the 7 most significant bits. LEB128 is used one or two
42+
places in this format. (I might tend to prefer use of LEB128 in places where
43+
we are writing values that, on balance, we expect to be relatively small, and
44+
only in cases where there is no potential for benefit for random access to the
45+
associated stream, since LEB128 is incompatible with random access. However,
46+
this is not formulated into anything approaching a definite policy.)
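As a concrete illustration of the encoding described above, here is a minimal Python sketch of unsigned LEB128 and of a length-prefixed UTF-8 string. The function names are hypothetical and the code follows only the description in this note; it is not taken from the IDV reader or writer.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # the 7 least significant bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_uleb128(data: bytes, pos: int = 0):
    """Decode one unsigned LEB128 number; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def write_string(text: str) -> bytes:
    """Byte count as unsigned LEB128, followed by the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_uleb128(len(data)) + data
```

Because the length of each encoded number depends on its value, a reader has to decode sequentially, which is the incompatibility with random access mentioned in the note.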
4747

4848
## Header
4949

docs/code/VBufferCareFeeding.md

Lines changed: 8 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -8,10 +8,10 @@ nearly all trainers accept feature vectors as `VBuffer<float>`.
88

99
A `VBuffer<T>` is a generic type that supports both dense and sparse vectors
1010
over items of type `T`. This is the representation type for all
11-
[`VectorType`](../public/IDataViewTypeSystem.md#vector-representations)
12-
instances in the `IDataView` ecosystem. When an instance of this is passed to
13-
a row cursor getter, the callee is free to take ownership of and re-use the
14-
arrays (`Values` and `Indices`).
11+
[`VectorType`](IDataViewTypeSystem.md#vector-representations) instances in the
12+
`IDataView` ecosystem. When an instance of this is passed to a row cursor
13+
getter, the callee is free to take ownership of and re-use the arrays
14+
(`Values` and `Indices`).
1515

1616
A `VBuffer<T>` is a struct, and has the following `readonly` fields:
1717

@@ -43,11 +43,10 @@ inclusive and `Length` exclusive.
4343

4444
Regarding the generic type parameter `T`, the only real assumption made about
4545
this type is that assignment (that is, using `=`) is sufficient to create an
46-
*independent* copy of that item. All representation types of the
47-
[primitive types](../public/IDataViewTypeSystem.md#standard-column-types) have
48-
this property (e.g., `DvText`, `DvInt4`, `Single`, `Double`, etc.), but for
49-
example, `VBuffer<>` itself does not have this property. So, no `VBuffer` of
50-
`VBuffer`s for you.
46+
*independent* copy of that item. All representation types of the [primitive
47+
types](IDataViewTypeSystem.md#standard-column-types) have this property (e.g.,
48+
`DvText`, `DvInt4`, `Single`, `Double`, etc.), but for example, `VBuffer<>`
49+
itself does not have this property. So, no `VBuffer` of `VBuffer`s for you.
5150

5251
## Sparse Values as `default(T)`
5352
