
Commit 2d24842

Merge branch 'master' into spark-15735
2 parents: ea89fc7 + 0b8d694

270 files changed: +3304 additions, -1822 deletions


LICENSE

Lines changed: 1 addition & 0 deletions
@@ -296,3 +296,4 @@ The text of each license is also included at licenses/LICENSE-[project].txt.
 (MIT License) blockUI (http://jquery.malsup.com/block/)
 (MIT License) RowsGroup (http://datatables.net/license/mit)
 (MIT License) jsonFormatter (http://www.jqueryscript.net/other/jQuery-Plugin-For-Pretty-JSON-Formatting-jsonFormatter.html)
+(MIT License) modernizr (https://github.com/Modernizr/Modernizr/blob/master/LICENSE)

NOTICE

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 Apache Spark
-Copyright 2014 The Apache Software Foundation.
+Copyright 2014 and onwards The Apache Software Foundation.
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).

R/DOCUMENTATION.md

Lines changed: 6 additions & 6 deletions
@@ -1,12 +1,12 @@
 # SparkR Documentation
 
-SparkR documentation is generated using in-source comments annotated using using
-`roxygen2`. After making changes to the documentation, to generate man pages,
+SparkR documentation is generated by using in-source comments and annotated by using
+[`roxygen2`](https://cran.r-project.org/web/packages/roxygen2/index.html). After making changes to the documentation and generating man pages,
 you can run the following from an R console in the SparkR home directory
-
-library(devtools)
-devtools::document(pkg="./pkg", roclets=c("rd"))
-
+```R
+library(devtools)
+devtools::document(pkg="./pkg", roclets=c("rd"))
+```
 You can verify if your changes are good by running
 
 R CMD check pkg/

R/README.md

Lines changed: 14 additions & 16 deletions
@@ -7,8 +7,7 @@ SparkR is an R package that provides a light-weight frontend to use Spark from R
 Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be done by running the script `$SPARK_HOME/R/install-dev.sh`.
 By default the above script uses the system wide installation of R. However, this can be changed to any user installed location of R by setting the environment variable `R_HOME` the full path of the base directory where R is installed, before running install-dev.sh script.
 Example:
-
-```
+```bash
 # where /home/username/R is where R is installed and /home/username/R/bin contains the files R and RScript
 export R_HOME=/home/username/R
 ./install-dev.sh
@@ -20,8 +19,8 @@ export R_HOME=/home/username/R
 
 Build Spark with [Maven](http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn) and include the `-Psparkr` profile to build the R package. For example to use the default Hadoop versions you can run
 
-```
-build/mvn -DskipTests -Psparkr package
+```bash
+build/mvn -DskipTests -Psparkr package
 ```
 
 #### Running sparkR
@@ -40,9 +39,8 @@ To set other options like driver memory, executor memory etc. you can pass in th
 
 #### Using SparkR from RStudio
 
-If you wish to use SparkR from RStudio or other R frontends you will need to set some environment variables which point SparkR to your Spark installation. For example
-
-```
+If you wish to use SparkR from RStudio or other R frontends you will need to set some environment variables which point SparkR to your Spark installation. For example
+```R
 # Set this to where Spark is installed
 Sys.setenv(SPARK_HOME="/Users/username/spark")
 # This line loads SparkR from the installed directory
@@ -59,25 +57,25 @@ Once you have made your changes, please include unit tests for them and run exis
 
 #### Generating documentation
 
-The SparkR documentation (Rd files and HTML files) are not a part of the source repository. To generate them you can run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs and these packages need to be installed on the machine before using the script.
+The SparkR documentation (Rd files and HTML files) are not a part of the source repository. To generate them you can run the script `R/create-docs.sh`. This script uses `devtools` and `knitr` to generate the docs and these packages need to be installed on the machine before using the script. Also, you may need to install these [prerequisites](https://github.com/apache/spark/tree/master/docs#prerequisites). See also, `R/DOCUMENTATION.md`
 
 ### Examples, Unit tests
 
 SparkR comes with several sample programs in the `examples/src/main/r` directory.
 To run one of them, use `./bin/spark-submit <filename> <args>`. For example:
-
-./bin/spark-submit examples/src/main/r/dataframe.R
-
+```bash
+./bin/spark-submit examples/src/main/r/dataframe.R
+```
 You can also run the unit tests for SparkR by running. You need to install the [testthat](http://cran.r-project.org/web/packages/testthat/index.html) package first:
-
-R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
-./R/run-tests.sh
+```bash
+R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
+./R/run-tests.sh
+```
 
 ### Running on YARN
 
 The `./bin/spark-submit` can also be used to submit jobs to YARN clusters. You will need to set YARN conf dir before doing so. For example on CDH you can run
-
-```
+```bash
 export YARN_CONF_DIR=/etc/hadoop/conf
 ./bin/spark-submit --master yarn examples/src/main/r/dataframe.R
 ```

R/pkg/R/utils.R

Lines changed: 1 addition & 1 deletion
@@ -489,7 +489,7 @@ processClosure <- function(node, oldEnv, defVars, checkedFuncs, newEnv) {
 # checkedFunc An environment of function objects examined during cleanClosure. It can be
 # considered as a "name"-to-"list of functions" mapping.
 # return value
-# a new version of func that has an correct environment (closure).
+# a new version of func that has a correct environment (closure).
 cleanClosure <- function(func, checkedFuncs = new.env()) {
   if (is.function(func)) {
     newEnv <- new.env(parent = .GlobalEnv)

build/spark-build-info

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This script generates the build info for spark and places it into the spark-version-info.properties file.
+# Arguments:
+#   build_tgt_directory - The target directory where properties file would be created. [./core/target/extra-resources]
+#   spark_version - The current version of spark
+
+RESOURCE_DIR="$1"
+mkdir -p "$RESOURCE_DIR"
+SPARK_BUILD_INFO="${RESOURCE_DIR}"/spark-version-info.properties
+
+echo_build_properties() {
+  echo version=$1
+  echo user=$USER
+  echo revision=$(git rev-parse HEAD)
+  echo branch=$(git rev-parse --abbrev-ref HEAD)
+  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
+  echo url=$(git config --get remote.origin.url)
+}
+
+echo_build_properties $2 > "$SPARK_BUILD_INFO"
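For reference, the keys the script emits (version, user, revision, branch, date, url) can be read back from the generated properties file. The sketch below is illustrative only: it assumes the file ends up on the runtime classpath (which the core/pom.xml change further down arranges by registering target/extra-resources as a resource directory); the code Spark itself uses to consume the file is not part of this commit, and the class name here is made up.

```java
import java.io.InputStream;
import java.util.Properties;

public class ReadBuildInfo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The file name matches what build/spark-build-info writes.
        try (InputStream in = ReadBuildInfo.class
                .getResourceAsStream("/spark-version-info.properties")) {
            if (in == null) {
                System.err.println("spark-version-info.properties not on classpath");
                return;
            }
            props.load(in);
        }
        // Keys emitted by echo_build_properties in the script above.
        for (String key : new String[] {"version", "user", "revision", "branch", "date", "url"}) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```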

common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java

Lines changed: 1 addition & 1 deletion
@@ -51,6 +51,6 @@ public long size() {
    * Creates a memory block pointing to the memory used by the long array.
    */
   public static MemoryBlock fromLongArray(final long[] array) {
-    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, array.length * 8);
+    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, array.length * 8L);
   }
 }
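The only change here is the `8L` suffix, which forces the size computation into `long` arithmetic. A small standalone sketch (the array length is a made-up value, chosen only to push the product past the `int` range) shows why the suffix matters:

```java
public class SizeOverflowDemo {
    public static void main(String[] args) {
        int length = 300_000_000; // hypothetical long[] length (~2.4 GB of data)

        // int * int overflows before the result is widened to long.
        long wrong = length * 8;   // wraps around: -1894967296
        // Multiplying by a long literal promotes the expression to long first.
        long right = length * 8L;  // 2400000000

        System.out.println("wrong = " + wrong + ", right = " + right);
    }
}
```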

core/pom.xml

Lines changed: 31 additions & 0 deletions
@@ -337,7 +337,38 @@
   <build>
     <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
     <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
+    <resources>
+      <resource>
+        <directory>${project.basedir}/src/main/resources</directory>
+      </resource>
+      <resource>
+        <!-- Include the properties file to provide the build information. -->
+        <directory>${project.build.directory}/extra-resources</directory>
+        <filtering>true</filtering>
+      </resource>
+    </resources>
     <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-antrun-plugin</artifactId>
+        <executions>
+          <execution>
+            <phase>generate-resources</phase>
+            <configuration>
+              <!-- Execute the shell script to generate the spark build information. -->
+              <tasks>
+                <exec executable="${project.basedir}/../build/spark-build-info">
+                  <arg value="${project.build.directory}/extra-resources"/>
+                  <arg value="${pom.version}"/>
+                </exec>
+              </tasks>
+            </configuration>
+            <goals>
+              <goal>run</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-dependency-plugin</artifactId>

core/src/main/java/org/apache/spark/shuffle/sort/ShuffleInMemorySorter.java

Lines changed: 28 additions & 10 deletions
@@ -22,12 +22,12 @@
 import org.apache.spark.memory.MemoryConsumer;
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.unsafe.array.LongArray;
+import org.apache.spark.unsafe.memory.MemoryBlock;
 import org.apache.spark.util.collection.Sorter;
 import org.apache.spark.util.collection.unsafe.sort.RadixSort;
 
 final class ShuffleInMemorySorter {
 
-  private final Sorter<PackedRecordPointer, LongArray> sorter;
   private static final class SortComparator implements Comparator<PackedRecordPointer> {
     @Override
     public int compare(PackedRecordPointer left, PackedRecordPointer right) {
@@ -44,6 +44,9 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
    * An array of record pointers and partition ids that have been encoded by
    * {@link PackedRecordPointer}. The sort operates on this array instead of directly manipulating
    * records.
+   *
+   * Only part of the array will be used to store the pointers, the rest part is preserved as
+   * temporary buffer for sorting.
    */
   private LongArray array;
 
@@ -54,14 +57,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
   private final boolean useRadixSort;
 
   /**
-   * Set to 2x for radix sort to reserve extra memory for sorting, otherwise 1x.
+   * The position in the pointer array where new records can be inserted.
    */
-  private final int memoryAllocationFactor;
+  private int pos = 0;
 
   /**
-   * The position in the pointer array where new records can be inserted.
+   * How many records could be inserted, because part of the array should be left for sorting.
    */
-  private int pos = 0;
+  private int usableCapacity = 0;
 
   private int initialSize;
 
@@ -70,9 +73,14 @@ public int compare(PackedRecordPointer left, PackedRecordPointer right) {
     assert (initialSize > 0);
     this.initialSize = initialSize;
     this.useRadixSort = useRadixSort;
-    this.memoryAllocationFactor = useRadixSort ? 2 : 1;
     this.array = consumer.allocateArray(initialSize);
-    this.sorter = new Sorter<>(ShuffleSortDataFormat.INSTANCE);
+    this.usableCapacity = getUsableCapacity();
+  }
+
+  private int getUsableCapacity() {
+    // Radix sort requires same amount of used memory as buffer, Tim sort requires
+    // half of the used memory as buffer.
+    return (int) (array.size() / (useRadixSort ? 2 : 1.5));
   }
 
   public void free() {
@@ -89,7 +97,8 @@ public int numRecords() {
   public void reset() {
     if (consumer != null) {
       consumer.freeArray(array);
-      this.array = consumer.allocateArray(initialSize);
+      array = consumer.allocateArray(initialSize);
+      usableCapacity = getUsableCapacity();
     }
     pos = 0;
   }
@@ -101,14 +110,15 @@ public void expandPointerArray(LongArray newArray) {
       array.getBaseOffset(),
       newArray.getBaseObject(),
       newArray.getBaseOffset(),
-      array.size() * (8 / memoryAllocationFactor)
+      pos * 8L
     );
     consumer.freeArray(array);
     array = newArray;
+    usableCapacity = getUsableCapacity();
   }
 
   public boolean hasSpaceForAnotherRecord() {
-    return pos < array.size() / memoryAllocationFactor;
+    return pos < usableCapacity;
   }
 
   public long getMemoryUsage() {
@@ -170,6 +180,14 @@ public ShuffleSorterIterator getSortedIterator() {
         PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
         PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
     } else {
+      MemoryBlock unused = new MemoryBlock(
+        array.getBaseObject(),
+        array.getBaseOffset() + pos * 8L,
+        (array.size() - pos) * 8L);
+      LongArray buffer = new LongArray(unused);
+      Sorter<PackedRecordPointer, LongArray> sorter =
+        new Sorter<>(new ShuffleSortDataFormat(buffer));
+
       sorter.sort(array, 0, pos, SORT_COMPARATOR);
     }
     return new ShuffleSorterIterator(pos, array, offset);
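To make the reserved-buffer arithmetic concrete, here is a minimal sketch of the capacity rule introduced in `getUsableCapacity()` above; the 1200-entry array size is an arbitrary example, not a value from the commit:

```java
public class UsableCapacityDemo {
    // Mirrors getUsableCapacity(): radix sort needs a scratch region as large as
    // the used part of the array, Tim sort needs half of the used memory as buffer.
    static int usableCapacity(long arraySize, boolean useRadixSort) {
        return (int) (arraySize / (useRadixSort ? 2 : 1.5));
    }

    public static void main(String[] args) {
        long arraySize = 1200; // hypothetical LongArray size, in 8-byte entries
        System.out.println(usableCapacity(arraySize, true));  // 600 usable, 600 reserved
        System.out.println(usableCapacity(arraySize, false)); // 800 usable, 400 reserved (half of 800)
    }
}
```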

core/src/main/java/org/apache/spark/shuffle/sort/ShuffleSortDataFormat.java

Lines changed: 7 additions & 6 deletions
@@ -19,14 +19,15 @@
 
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.unsafe.array.LongArray;
-import org.apache.spark.unsafe.memory.MemoryBlock;
 import org.apache.spark.util.collection.SortDataFormat;
 
 final class ShuffleSortDataFormat extends SortDataFormat<PackedRecordPointer, LongArray> {
 
-  public static final ShuffleSortDataFormat INSTANCE = new ShuffleSortDataFormat();
+  private final LongArray buffer;
 
-  private ShuffleSortDataFormat() { }
+  ShuffleSortDataFormat(LongArray buffer) {
+    this.buffer = buffer;
+  }
 
   @Override
   public PackedRecordPointer getKey(LongArray data, int pos) {
@@ -70,8 +71,8 @@ public void copyRange(LongArray src, int srcPos, LongArray dst, int dstPos, int
 
   @Override
   public LongArray allocate(int length) {
-    // This buffer is used temporary (usually small), so it's fine to allocated from JVM heap.
-    return new LongArray(MemoryBlock.fromLongArray(new long[length]));
+    assert (length <= buffer.size()) :
+      "the buffer is smaller than required: " + buffer.size() + " < " + length;
+    return buffer;
   }
-
 }
