Commit d39f2e9
[SPARK-4477] [PySpark] remove numpy from RDDSampler
RDDSampler tries to use numpy to speed up poisson(), but the pure-Python implementation of poisson() calls random() only about (1 + fraction) * N times, so numpy offers little performance gain.
numpy is not a dependency of PySpark, so relying on it can cause problems, for example when numpy is installed on the master but not on the slaves, as reported in SPARK-927.
It also complicates the code considerably, so we should remove numpy from RDDSampler.
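The (1 + fraction) * N estimate can be checked with a small pure-Python sketch. It assumes poisson() uses the multiply-uniforms method (Knuth's algorithm), which needs k + 1 calls to random() to return the value k; the `CountingRandom` wrapper and `sample_with_replacement` helper are hypothetical names for illustration, not PySpark APIs.

```python
import math
import random


class CountingRandom:
    """Hypothetical wrapper that counts calls to random()."""

    def __init__(self, seed):
        self._rng = random.Random(seed)
        self.calls = 0

    def random(self):
        self.calls += 1
        return self._rng.random()


def poisson(mean, rng):
    """Draw one Poisson(mean) sample by multiplying uniforms (Knuth's method).

    Returning the value k costs exactly k + 1 calls to rng.random().
    """
    limit = math.exp(-mean)
    product = rng.random()
    k = 0
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_with_replacement(items, fraction, seed=0):
    """Repeat each item a Poisson(fraction) number of times.

    Returns the sampled list and the total number of random() calls.
    """
    rng = CountingRandom(seed)
    sampled = []
    for item in items:
        sampled.extend([item] * poisson(fraction, rng))
    return sampled, rng.calls
```

Each element costs one call plus one per copy emitted, so the total is N plus the sample size, which averages (1 + fraction) * N.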
I also ran a benchmark to verify this:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count() # cache it
>>> rdd.sample(True, 0.9).count() # measure this line
```
The results:
| withReplacement | random | numpy.random |
| --- | --- | --- |
| True | 1.5 s | 1.4 s |
| False | 0.6 s | 0.8 s |
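For the withReplacement=False row, a pure-Python sampler needs only one random() call per element. The sketch below assumes plain Bernoulli sampling (keep an element when a uniform draw falls below the fraction); `sample_without_replacement` is a hypothetical helper, not the RDDSampler code itself.

```python
import random


def sample_without_replacement(items, fraction, seed=0):
    """Keep each item independently with probability `fraction`.

    Makes exactly one random() call per input item.
    """
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]
```

With so little work per element, there is not much left for a vectorized numpy call to speed up.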
closes apache#2313
Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK once the mirror catches up.
Author: Davies Liu <[email protected]>
Author: Xiangrui Meng <[email protected]>
Closes apache#3351 from davies/numpy and squashes the following commits:
5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
1 parent ad5f1f3 commit d39f2e9
2 files changed, with 40 additions and 69 deletions.