Commit e20cd9f

burakkose authored and mengxr committed
[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes #12843 from mengxr/SPARK-14050.
1 parent 5c8fad7 commit e20cd9f
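
The stop word files added in this commit hold one word per line. As a rough illustration of how such a list can be applied, here is a minimal pure-Python sketch. It is not the Spark implementation touched by this commit: the names `parse_stop_words`, `remove_stop_words`, and `SAMPLE_ENGLISH` are hypothetical, and the sample contains only a handful of entries copied from the English list in the diff.

```python
from typing import Iterable, List

# A few entries copied from the English list in this commit.
# The full file has 153 lines; this sample is for illustration only.
SAMPLE_ENGLISH = "i\nme\nmy\nthe\nand\nis\nof\n"


def parse_stop_words(text: str) -> List[str]:
    """Parse a one-word-per-line stop word file, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def remove_stop_words(
    tokens: Iterable[str],
    stop_words: Iterable[str],
    case_sensitive: bool = False,
) -> List[str]:
    """Return the tokens that are not in the stop word list, in order."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive matching: compare lowercased forms on both sides.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


if __name__ == "__main__":
    words = parse_stop_words(SAMPLE_ENGLISH)
    print(remove_stop_words(["The", "name", "of", "my", "cat", "is", "Tom"], words))
```

Skipping blank lines in the parser is a design assumption of this sketch, so that a stray empty line in a list file does not become an empty-string stop word.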

20 files changed, +2614 -87 lines changed

licenses/LICENSE-postgresql.txt

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+PostgreSQL Database Management System
+(formerly known as Postgres, then as Postgres95)
+
+Portions Copyright (c) 1996-2010, PostgreSQL Global Development Group
+
+Portions Copyright (c) 1994, The Regents of the University of California
+
+Permission to use, copy, modify, and distribute this software and its
+documentation for any purpose, without fee, and without a written agreement
+is hereby granted, provided that the above copyright notice and this
+paragraph and the following two paragraphs appear in all copies.
+
+IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
+DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES,
+INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS
+ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO
+PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+Stopwords Corpus
+
+This corpus contains lists of stop words for several languages. These
+are high-frequency grammatical words which are usually ignored in text
+retrieval applications.
+
+They were obtained from:
+http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
+
+The English list has been augmented
+https://github.com/nltk/nltk_data/issues/22
+
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+og
+i
+jeg
+det
+at
+en
+den
+til
+er
+som
+
+de
+med
+han
+af
+for
+ikke
+der
+var
+mig
+sig
+men
+et
+har
+om
+vi
+min
+havde
+ham
+hun
+nu
+over
+da
+fra
+du
+ud
+sin
+dem
+os
+op
+man
+hans
+hvor
+eller
+hvad
+skal
+selv
+her
+alle
+vil
+blev
+kunne
+ind
+når
+være
+dog
+noget
+ville
+jo
+deres
+efter
+ned
+skulle
+denne
+end
+dette
+mit
+også
+under
+have
+dig
+anden
+hende
+mine
+alt
+meget
+sit
+sine
+vor
+mod
+disse
+hvis
+din
+nogle
+hos
+blive
+mange
+ad
+bliver
+hendes
+været
+thi
+jer
+sådan
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
+de
+en
+van
+ik
+te
+dat
+die
+in
+een
+hij
+het
+niet
+zijn
+is
+was
+op
+aan
+met
+als
+voor
+had
+er
+maar
+om
+hem
+dan
+zou
+of
+wat
+mijn
+men
+dit
+zo
+door
+over
+ze
+zich
+bij
+ook
+tot
+je
+mij
+uit
+der
+daar
+haar
+naar
+heb
+hoe
+heeft
+hebben
+deze
+u
+want
+nog
+zal
+me
+zij
+nu
+ge
+geen
+omdat
+iets
+worden
+toch
+al
+waren
+veel
+meer
+doen
+toen
+moet
+ben
+zonder
+kan
+hun
+dus
+alles
+onder
+ja
+eens
+hier
+wie
+werd
+altijd
+doch
+wordt
+wezen
+kunnen
+ons
+zelf
+tegen
+na
+reeds
+wil
+kon
+niets
+uw
+iemand
+geweest
+andere
Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
+i
+me
+my
+myself
+we
+our
+ours
+ourselves
+you
+your
+yours
+yourself
+yourselves
+he
+him
+his
+himself
+she
+her
+hers
+herself
+it
+its
+itself
+they
+them
+their
+theirs
+themselves
+what
+which
+who
+whom
+this
+that
+these
+those
+am
+is
+are
+was
+were
+be
+been
+being
+have
+has
+had
+having
+do
+does
+did
+doing
+a
+an
+the
+and
+but
+if
+or
+because
+as
+until
+while
+of
+at
+by
+for
+with
+about
+against
+between
+into
+through
+during
+before
+after
+above
+below
+to
+from
+up
+down
+in
+out
+on
+off
+over
+under
+again
+further
+then
+once
+here
+there
+when
+where
+why
+how
+all
+any
+both
+each
+few
+more
+most
+other
+some
+such
+no
+nor
+not
+only
+own
+same
+so
+than
+too
+very
+s
+t
+can
+will
+just
+don
+should
+now
+d
+ll
+m
+o
+re
+ve
+y
+ain
+aren
+couldn
+didn
+doesn
+hadn
+hasn
+haven
+isn
+ma
+mightn
+mustn
+needn
+shan
+shouldn
+wasn
+weren
+won
+wouldn
