
Commit dbb61fb

Merge branch 'master' of https://github.com/apache/spark into gpu-sched-executor-clean

2 parents: 4165c60 + bcd3b61
204 files changed: 4142 additions, 1153 deletions

LICENSE

Lines changed: 2 additions & 2 deletions
@@ -222,7 +222,7 @@ Python Software Foundation License
 ----------------------------------
 
 pyspark/heapq3.py
-
+python/docs/_static/copybutton.js
 
 BSD 3-Clause
 ------------
@@ -258,4 +258,4 @@ data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg
 data/mllib/images/kittens/54893.jpg
 data/mllib/images/kittens/DP153539.jpg
 data/mllib/images/kittens/DP802813.jpg
-data/mllib/images/multi-channel/chr30.4.184.jpg
+data/mllib/images/multi-channel/chr30.4.184.jpg

LICENSE-binary

Lines changed: 0 additions & 1 deletion
@@ -489,7 +489,6 @@ Eclipse Distribution License (EDL) 1.0
 org.glassfish.jaxb:jaxb-runtime
 jakarta.xml.bind:jakarta.xml.bind-api
 com.sun.istack:istack-commons-runtime
-jakarta.activation:jakarta.activation-api
 
 
 Mozilla Public License (MPL) 1.1

README.md

Lines changed: 10 additions & 10 deletions
@@ -1,18 +1,18 @@
 # Apache Spark
 
-[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7)
-[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
-[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
-
-Spark is a fast and general cluster computing system for Big Data. It provides
+Spark is a unified analytics engine for large-scale data processing. It provides
 high-level APIs in Scala, Java, Python, and R, and an optimized engine that
 supports general computation graphs for data analysis. It also supports a
 rich set of higher-level tools including Spark SQL for SQL and DataFrames,
 MLlib for machine learning, GraphX for graph processing,
-and Spark Streaming for stream processing.
+and Structured Streaming for stream processing.
 
 <http://spark.apache.org/>
 
+[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7)
+[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
+[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
+
 
 ## Online Documentation
 
@@ -41,19 +41,19 @@ The easiest way to start using Spark is through the Scala shell:
 
     ./bin/spark-shell
 
-Try the following command, which should return 1000:
+Try the following command, which should return 1,000,000,000:
 
-    scala> sc.parallelize(1 to 1000).count()
+    scala> spark.range(1000 * 1000 * 1000).count()
 
 ## Interactive Python Shell
 
 Alternatively, if you prefer Python, you can use the Python shell:
 
    ./bin/pyspark
 
-And run the following command, which should also return 1000:
+And run the following command, which should also return 1,000,000,000:
 
-    >>> sc.parallelize(range(1000)).count()
+    >>> spark.range(1000 * 1000 * 1000).count()
 
 ## Example Programs

bin/docker-image-tool.sh

Lines changed: 1 addition & 1 deletion
@@ -282,7 +282,7 @@ do
     if ! minikube status 1>/dev/null; then
       error "Cannot contact minikube. Make sure it's running."
     fi
-    eval $(minikube docker-env)
+    eval $(minikube docker-env --shell bash)
     ;;
   u) SPARK_UID=${OPTARG};;
   esac

common/network-yarn/pom.xml

Lines changed: 45 additions & 1 deletion
@@ -35,7 +35,7 @@
     <!-- Make sure all Hadoop dependencies are provided to avoid repackaging. -->
     <hadoop.deps.scope>provided</hadoop.deps.scope>
     <shuffle.jar>${project.build.directory}/scala-${scala.binary.version}/spark-${project.version}-yarn-shuffle.jar</shuffle.jar>
-    <shade>org/spark_project/</shade>
+    <shade>org/sparkproject/</shade>
   </properties>
 
   <dependencies>
@@ -128,6 +128,50 @@
         </execution>
       </executions>
     </plugin>
+    <!-- shade the native netty libs as well -->
+    <plugin>
+      <groupId>org.codehaus.mojo</groupId>
+      <artifactId>build-helper-maven-plugin</artifactId>
+      <executions>
+        <execution>
+          <id>regex-property</id>
+          <goals>
+            <goal>regex-property</goal>
+          </goals>
+          <configuration>
+            <name>spark.shade.native.packageName</name>
+            <value>${spark.shade.packageName}</value>
+            <regex>\.</regex>
+            <replacement>_</replacement>
+            <failIfNoMatch>true</failIfNoMatch>
+          </configuration>
+        </execution>
+      </executions>
+    </plugin>
+    <plugin>
+      <groupId>org.apache.maven.plugins</groupId>
+      <artifactId>maven-antrun-plugin</artifactId>
+      <executions>
+        <execution>
+          <id>unpack</id>
+          <phase>package</phase>
+          <configuration>
+            <target>
+              <echo message="Shade netty native libraries to ${spark.shade.native.packageName}" />
+              <unzip src="${shuffle.jar}" dest="${project.build.directory}/exploded/" />
+              <move file="${project.build.directory}/exploded/META-INF/native/libnetty_transport_native_epoll_x86_64.so"
+                    tofile="${project.build.directory}/exploded/META-INF/native/lib${spark.shade.native.packageName}_netty_transport_native_epoll_x86_64.so" />
+              <move file="${project.build.directory}/exploded/META-INF/native/libnetty_transport_native_kqueue_x86_64.jnilib"
+                    tofile="${project.build.directory}/exploded/META-INF/native/lib${spark.shade.native.packageName}_netty_transport_native_kqueue_x86_64.jnilib" />
+              <jar destfile="${shuffle.jar}" basedir="${project.build.directory}/exploded" />
+            </target>
+          </configuration>
+          <goals>
+            <goal>run</goal>
+          </goals>
+        </execution>
+      </executions>
+    </plugin>
 
     <!-- probes to validate that those dependencies which must be shaded are -->
     <plugin>
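
For reference, the plugin chain added above first derives spark.shade.native.packageName from spark.shade.packageName by replacing dots with underscores, then renames the unpacked netty native libraries with that prefix before re-jarring. Below is a minimal Scala sketch of that transformation, assuming spark.shade.packageName resolves to org.sparkproject (the shaded package used elsewhere in this commit); it is illustrative only, not part of the build.

// Sketch only: mirrors the regex-property and antrun <move> steps above.
// Assumption: ${spark.shade.packageName} resolves to "org.sparkproject".
object NettyShadeRenameSketch extends App {
  val shadePackageName = "org.sparkproject"
  // regex-property: <regex>\.</regex> replaced with <replacement>_</replacement>
  val nativePackageName = shadePackageName.replaceAll("\\.", "_") // "org_sparkproject"

  // antrun <move>: prefix the native library file names with the converted package name
  val original = "libnetty_transport_native_epoll_x86_64.so"
  val renamed = s"lib${nativePackageName}_netty_transport_native_epoll_x86_64.so"
  println(renamed) // prints liborg_sparkproject_netty_transport_native_epoll_x86_64.so
}

Netty looks up shaded native transports by this naming convention, which is why the library prefix has to match the shaded Java package with dots mapped to underscores.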

common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

Lines changed: 2 additions & 0 deletions
@@ -319,6 +319,8 @@ public String toString() {
       appendUnit(sb, rest / MICROS_PER_MILLI, "millisecond");
       rest %= MICROS_PER_MILLI;
       appendUnit(sb, rest, "microsecond");
+    } else if (months == 0) {
+      sb.append(" 0 microseconds");
     }
 
     return sb.toString();

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CalendarIntervalSuite.java

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,9 @@ public void equalsTest() {
   public void toStringTest() {
     CalendarInterval i;
 
+    i = new CalendarInterval(0, 0);
+    assertEquals("interval 0 microseconds", i.toString());
+
     i = new CalendarInterval(34, 0);
     assertEquals("interval 2 years 10 months", i.toString());

conf/log4j.properties.template

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}:
 log4j.logger.org.apache.spark.repl.Main=WARN
 
 # Settings to quiet third party logs that are too verbose
-log4j.logger.org.spark_project.jetty=WARN
-log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
+log4j.logger.org.sparkproject.jetty=WARN
+log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
 log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
 log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
 log4j.logger.org.apache.parquet=ERROR

core/src/main/resources/org/apache/spark/log4j-defaults.properties

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}:
 log4j.logger.org.apache.spark.repl.Main=WARN
 
 # Settings to quiet third party logs that are too verbose
-log4j.logger.org.spark_project.jetty=WARN
-log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
+log4j.logger.org.sparkproject.jetty=WARN
+log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
 log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
 log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

Lines changed: 57 additions & 2 deletions
@@ -86,7 +86,7 @@ private[spark] case class PythonFunction(
 private[spark] case class ChainedPythonFunctions(funcs: Seq[PythonFunction])
 
 /** Thrown for exceptions in user Python code. */
-private[spark] class PythonException(msg: String, cause: Exception)
+private[spark] class PythonException(msg: String, cause: Throwable)
   extends RuntimeException(msg, cause)
 
 /**
@@ -163,8 +163,63 @@ private[spark] object PythonRDD extends Logging {
     serveIterator(rdd.collect().iterator, s"serve RDD ${rdd.id}")
   }
 
+  /**
+   * A helper function to create a local RDD iterator and serve it via socket. Partitions are
+   * collected as separate jobs, by order of index. Partition data is first requested by a
+   * non-zero integer to start a collection job. The response is prefaced by an integer with 1
+   * meaning partition data will be served, 0 meaning the local iterator has been consumed,
+   * and -1 meaning an error occurred during collection. This function is used by
+   * pyspark.rdd._local_iterator_from_socket().
+   *
+   * @return 2-tuple (as a Java array) with the port number of a local socket which serves the
+   *         data collected from these jobs, and the secret for authentication.
+   */
   def toLocalIteratorAndServe[T](rdd: RDD[T]): Array[Any] = {
-    serveIterator(rdd.toLocalIterator, s"serve toLocalIterator")
+    val (port, secret) = SocketAuthServer.setupOneConnectionServer(
+      authHelper, "serve toLocalIterator") { s =>
+      val out = new DataOutputStream(s.getOutputStream)
+      val in = new DataInputStream(s.getInputStream)
+      Utils.tryWithSafeFinally {
+
+        // Collects a partition on each iteration
+        val collectPartitionIter = rdd.partitions.indices.iterator.map { i =>
+          rdd.sparkContext.runJob(rdd, (iter: Iterator[Any]) => iter.toArray, Seq(i)).head
+        }
+
+        // Read request for data and send next partition if nonzero
+        var complete = false
+        while (!complete && in.readInt() != 0) {
+          if (collectPartitionIter.hasNext) {
+            try {
+              // Attempt to collect the next partition
+              val partitionArray = collectPartitionIter.next()
+
+              // Send response that there is a partition to read
+              out.writeInt(1)
+
+              // Write the next object and signal end of data for this iteration
+              writeIteratorToStream(partitionArray.toIterator, out)
+              out.writeInt(SpecialLengths.END_OF_DATA_SECTION)
+              out.flush()
+            } catch {
+              case e: SparkException =>
+                // Send response that an error occurred, followed by the error message
+                out.writeInt(-1)
+                writeUTF(e.getMessage, out)
+                complete = true
+            }
+          } else {
+            // Send response that there are no more partitions to read, and close
+            out.writeInt(0)
+            complete = true
+          }
+        }
+      } {
+        out.close()
+        in.close()
+      }
+    }
+    Array(port, secret)
   }
 
   def readRDDFromFile(
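
The scaladoc added above describes a small request/response protocol between this JVM-side server and the Python client. Below is a minimal Scala sketch of a client that follows that protocol. It is not the real consumer (that is pyspark.rdd._local_iterator_from_socket()), and it assumes that END_OF_DATA_SECTION is -1 and is the only sentinel terminating a partition's payload, that each element arrives as a length-prefixed byte blob (the common case for pickled RDD data), and that authentication with the returned secret has already been handled.

import java.io.{DataInputStream, DataOutputStream}
import java.net.Socket
import java.nio.charset.StandardCharsets

// Hypothetical client-side sketch of the toLocalIteratorAndServe protocol above.
object LocalIteratorClientSketch {
  val EndOfDataSection = -1  // assumed value of SpecialLengths.END_OF_DATA_SECTION

  def consume(port: Int): Unit = {
    val socket = new Socket("localhost", port)
    val out = new DataOutputStream(socket.getOutputStream)
    val in = new DataInputStream(socket.getInputStream)
    try {
      var done = false
      while (!done) {
        out.writeInt(1)  // any non-zero value asks the server to collect the next partition
        out.flush()
        in.readInt() match {
          case 1 =>
            // Partition data follows as length-prefixed blobs, terminated by END_OF_DATA_SECTION
            var len = in.readInt()
            while (len != EndOfDataSection) {
              val buf = new Array[Byte](len)
              in.readFully(buf)  // a real client would deserialize these bytes
              len = in.readInt()
            }
          case 0 =>
            // The server has exhausted the local iterator
            done = true
          case -1 =>
            // A collection job failed; an error message (length-prefixed UTF-8 bytes) follows
            val msgLen = in.readInt()
            val msgBytes = new Array[Byte](msgLen)
            in.readFully(msgBytes)
            sys.error(new String(msgBytes, StandardCharsets.UTF_8))
          case other =>
            sys.error(s"Unexpected response code: $other")
        }
      }
    } finally {
      out.close()
      in.close()
      socket.close()
    }
  }
}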
