
Conversation

@timholy (Member) commented Mar 5, 2016

This is a largely janitorial change that allows us to defer the definition of StridedArray until much later in the bootstrap process. This was motivated by my (still unfinished) work on ReshapedArrays, but this appears to have some value on its own merits: now that we have nice iterators and indexing, "strides" simply aren't a very important concept anymore for pure julia code. So the usage of StridedArray should generally be limited to interactions with outside libraries that require strided data layout.

This also fixes a bug (currently on master):

immutable MyArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

Base.size(A::MyArray) = size(A.data)
Base.getindex(A::MyArray, indexes...) = A.data[indexes...]

A = MyArray(rand(4,5))
b = rand(5)

julia> A*b
ERROR: LoadError: MethodError: no method matching A_mul_B!(::Array{Float64,1}, ::MyArray{Float64,2}, ::Array{Float64,1})
Closest candidates are:
...

The most important reason I decided to submit this separately is that janitorial changes have a nasty habit of generating unexpected performance regressions, so I wanted to run runbenchmarks("array", vs = "JuliaLang/julia:master").


@test_throws ErrorException transpose(sub(sprandn(10, 10, 0.3), 1:4, 1:4))
@test_throws ErrorException ctranspose(sub(sprandn(10, 10, 0.3), 1:4, 1:4))

@timholy (Member Author):

While it's always scary to delete tests, I wasn't sure why this was something that needed testing. On this branch,

julia> A = sub(sprandn(10, 10, 0.3), 1:4, 1:4)
4x4 SubArray{Float64,2,SparseMatrixCSC{Float64,Int64},Tuple{UnitRange{Int64},UnitRange{Int64}},false}:
 0.0  0.0      -1.41166    0.0    
 0.0  0.0       0.0        0.0    
 0.0  0.0       0.628624   1.40626
 0.0  1.08286   0.0       -0.97074

julia> transpose(A)
4x4 sparse matrix with 5 Float64 entries:
        [3, 1]  =  -1.41166
        [3, 3]  =  0.628624
        [4, 3]  =  1.40626
        [2, 4]  =  1.08286
        [4, 4]  =  -0.97074

julia> full(transpose(A))
4x4 Array{Float64,2}:
  0.0      0.0  0.0        0.0    
  0.0      0.0  0.0        1.08286
 -1.41166  0.0  0.628624   0.0    
  0.0      0.0  1.40626   -0.97074

which seems like a rather satisfactory result. (It's even sparse!)

CC @kshyatt

Contributor:

How does it perform relative to the non-SubArray sparse transpose?

Contributor:

And could this be changed to an affirmative test now that it works?

@timholy (Member Author):

Haven't tested, but I'm sure it's dreadful. We need a good iterator for sparse matrices and their views.
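
For context, here's a rough sketch (not part of this PR) of the kind of stored-entry iteration such an iterator would expose for a plain SparseMatrixCSC; foreach_stored is a hypothetical name, and views would need an analogous (and harder) version:

function foreach_stored(f, S::SparseMatrixCSC)
    for j in 1:size(S, 2)
        # stored entries of column j occupy colptr[j]:(colptr[j+1]-1)
        for k in S.colptr[j]:(S.colptr[j+1]-1)
            f(S.rowval[k], j, S.nzval[k])
        end
    end
end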

@timholy (Member Author) commented Mar 5, 2016

@jrevels, any idea about the Nanosoldier error?

@nalimilan (Member):

Really nice!

@tkelman (Contributor) commented Mar 6, 2016

The fixed bug with a wrapper AbstractArray type should be tested for, no?
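
For illustration, such a test might look something like the following sketch, mirroring the MyArray example from the PR description (WrappedArray and the surrounding scaffolding are hypothetical, not actual test-suite code):

using Base.Test

immutable WrappedArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(A::WrappedArray) = size(A.data)
Base.getindex(A::WrappedArray, i...) = A.data[i...]

# With this PR, generic matvec should dispatch instead of throwing a MethodError.
let A = WrappedArray(rand(4,5)), b = rand(5)
    @test isapprox(A*b, A.data*b)
end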

@timholy (Member Author) commented Mar 6, 2016

Let's try again and see if it magically works this time:
runbenchmarks("array", vs = "JuliaLang/julia:master")

timholy added 3 commits on March 7, 2016, among them:
"This allows us to defer the definition of StridedArray until much later in the bootstrap process."
@jrevels (Member) commented Mar 7, 2016

@jrevels, any idea about the Nanosoldier error?

Sorry about the delay; I've been battling the flu. Let's see if the patch I just pushed fixed things:

runbenchmarks("array", vs = "JuliaLang/julia:master")

@timholy force-pushed the teh/less_striding branch from 87cfcb4 to 1982118 on March 7, 2016 at 15:58
@timholy (Member Author) commented Mar 7, 2016

Sorry to hear about the flu! Hope you're winning the battle!

I think I discovered the problem: sparse matrix multiplication was dispatching differently. I added another commit to fix that. Can we cancel the run you started and try fresh?

@timholy (Member Author) commented Mar 7, 2016

When we launch it, I guess we probably want "array" || "sparse" || "linalg".

@jrevels (Member) commented Mar 7, 2016

runbenchmarks("array" || "sparse" || "linalg", vs = "JuliaLang/julia:master")

@jrevels (Member) commented Mar 7, 2016

Hmm. Looks like the tag predicate validation is wonky now. I'll look into it.

@timholy (Member Author) commented Mar 7, 2016

I should add that I've run the sparse tests locally, and things look fine. I'm launching a run with the full "array" || "linalg" || "sparse" locally, and will fix any issues. You can go back to bed & rest now!

@jrevels (Member) commented Mar 7, 2016

Appreciated. Though it's difficult to rest when there are bugs afoot! Let's give it one last shot:

runbenchmarks("array" || "sparse" || "linalg", vs = "JuliaLang/julia:master")

@Jutho (Contributor) commented Mar 7, 2016

Am I correct that the goal here is to delay the definition of StridedArray, even when this widens function definitions to generic AbstractArray instances in cases where it is probably not wise or even possible?

I spotted that permutedims for AbstractArray calls permutedims! which is still only defined for StridedArray. Likewise, the definitions of (c)transpose! using a recursive blocking strategy were written with a strided memory layout in mind.

@timholy (Member Author) commented Mar 7, 2016

I spotted that permutedims for AbstractArray calls permutedims! which is still only defined for StridedArray

I noticed that too. My view is that one MethodError is as good as another: now permutedims will succeed, but the nested call to permutedims! will not.

Likewise, the definitions of (c)transpose! using a recursive blocking strategy were written with a strided memory layout in mind.

I've given some thought to defining permutedims! in cases where we don't rely on striding, but I haven't coded anything. In general terms, I think it will look vaguely like this:

for jblk in 1:skipj:size(A,d1)  # d1 is the dimension that is fastest in the permuted array B
    for iblk in 1:skipi:size(A,1)
        for j in jblk:min(jblk+skipj-1, size(A,d1))
            for i in iblk:min(iblk+skipi-1, size(A,1))
                B[j,IB1,i,IB2] = A[i,IA1,j,IA2]
            end
        end
    end
end

where all the I iterators are CartesianIndexes.

Might be a good "up for grabs" issue.
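
For the simplest case, a plain 2-d transpose, a runnable version of this blocking idea might look like the sketch below; blocked_transpose! and the default block size are illustrative choices, not part of this PR:

# Visit A in blk-by-blk tiles so that each tile of A, and the corresponding
# tile of B, stays cache-resident while it is being copied.
function blocked_transpose!(B::AbstractMatrix, A::AbstractMatrix, blk::Int=64)
    size(B,1) == size(A,2) && size(B,2) == size(A,1) || throw(DimensionMismatch())
    for jblk in 1:blk:size(A,2), iblk in 1:blk:size(A,1)
        for j in jblk:min(jblk+blk-1, size(A,2)), i in iblk:min(iblk+blk-1, size(A,1))
            B[j,i] = A[i,j]
        end
    end
    return B
end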

@timholy (Member Author) commented Mar 8, 2016

Oh, and I forgot to respond to

Am I correct that the goal here is to delay the definition of StridedArray, even when this widens function definitions to generic AbstractArray instances in cases where it is probably not wise or even possible?

Delaying is part of the goal, but not the only one. Our multiplication routine, for example, is general, yet we get the error at the top. In general, I suspect we've been using StridedMatrix to mean "not sparse," but the problem is that it cuts out more than just sparse.

@andreasnoack (Member):

I suspect we've been using StridedMatrix to mean "not sparse," but the problem is that it cuts out more than just sparse.

Any kind of distributed array is also doomed once you start elementwise indexing. It is already enough of a problem that, during the development cycle, I don't make any kind of distributed array a subtype of AbstractArray, because otherwise it would silently hit the "generic" fallbacks that use elementwise indexing.

I'm sure you have mentioned it somewhere before, but why is it that your arrays cannot be considered Dense?

@timholy (Member Author) commented Mar 8, 2016

Sure, I completely agree that some types won't be efficient, and that this is a genuine problem. But that's true for many algorithms not among the ones I modified: try calling sum on a DistributedArray, and you'll be quite unhappy unless you've written a specialized version of sum. Yet we've not been restricting sum to StridedArrays.

AFAICT, with the exception of cummin, cummax, and permutedims, there's nothing in the implementations of the functions I modified that assumes strides. I replaced cummin and cummax with more generic versions, so only permutedims would need to be added (and I left the type constraint on the workhorse function permutedims!, so we're basically in the same place we were before).

Since essentially none of the algorithms required strides, dispatching on StridedArray doesn't seem to make sense. I suppose we could dispatch on DenseArray, but what about this type:

type ProductMatrix{T} <: AbstractMatrix{T}
    xrange::StepRange{T,T}
    yrange::StepRange{T,T}
end

Base.size(A::ProductMatrix) = (length(A.xrange), length(A.yrange))  # required by the AbstractArray interface
Base.getindex(A::ProductMatrix, i, j) = A.xrange[i]*A.yrange[j]

Definitely not a DenseArray, yet matrix multiplication and all those arithmetic operations work just fine as written. Therefore, they should be supported.
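
For instance, a quick (illustrative) check that the generic fallbacks apply:

A = ProductMatrix(1:1:4, 2:2:10)  # 4x5 rank-one matrix with A[i,j] = xrange[i]*yrange[j]
b = rand(5)
A*b                               # hits the generic AbstractMatrix matvec under this PR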

I'm simply of the growing belief that things should work first and foremost, and be efficient secondarily. If one needs a more efficient version of an algorithm, one should write it, but we shouldn't make people re-implement all of array arithmetic simply because we were worried that the generic algorithm might be slow for a subset of types.

@nanosoldier (Collaborator):

Your benchmark job has completed, but something went wrong when trying to upload the result data. cc @jrevels

EDIT: I forgot to turn off a debug flag I was using when messing with the report uploading code...oops. Here are the results -- Jarrett

| Group ID | Benchmark ID | time | GC time | memory allocated | number of allocations |
|---|---|---|---|---|---|
| "array index sum" | ("sumelt", "BaseBenchmarks.ArrayBenchmarks.ArrayLF{Float32,2}", (3,5)) | 0.78 | 1.00 | 1.00 | 1.00 |
| "array index sum" | ("sumelt", "BaseBenchmarks.ArrayBenchmarks.ArrayLSLS{Int32,2}", (300,500)) | 0.65 | 1.00 | 1.00 | 1.00 |
| "array index sum" | ("sumelt", "BaseBenchmarks.ArrayBenchmarks.ArrayLS{Int32,2}", (300,500)) | 0.65 | 1.00 | 1.00 | 1.00 |
| "array index sum" | ("sumrange", "BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}", (3,5)) | 1.62 | 0.97 | 1.00 | 1.00 |
| "simd" | ("axpy!", "Float32", 10) | 0.71 | 1.00 | 1.00 | 1.00 |
| "simd" | ("conditional_loop!", "Float32", 9) | 0.79 | 1.00 | 1.00 | 1.00 |
| "simd" | ("conditional_loop!", "Float32", 10) | 0.79 | 1.00 | 1.00 | 1.00 |
| "simd" | ("conditional_loop!", "Float64", 9) | 0.75 | 1.00 | 1.00 | 1.00 |
| "simd" | ("conditional_loop!", "Float64", 10) | 0.79 | 1.00 | 1.00 | 1.00 |
| "simd" | ("inner", "Float32", 10) | 0.74 | 1.00 | 1.00 | 1.00 |
| "simd" | ("inner", "Float32", 255) | 0.76 | 1.00 | 1.00 | 1.00 |
| "simd" | ("inner", "Float32", 256) | 0.77 | 1.00 | 1.00 | 1.00 |
| "simd" | ("loop_fields!", "Float32", "BaseBenchmarks.SIMDBenchmarks.ImmutableFields{V<:AbstractArray{T,1}}", 256) | 0.48 | 0.98 | 1.00 | 1.00 |
| "simd" | ("loop_fields!", "Float32", "BaseBenchmarks.SIMDBenchmarks.MutableFields{V<:AbstractArray{T,1}}", 256) | 0.48 | 0.99 | 1.00 | 1.00 |
| "simd" | ("manual_example!", "Float32", 10) | 0.71 | 1.00 | 1.00 | 1.00 |
| "simd" | ("manual_example!", "Int64", 9) | 0.76 | 1.00 | 1.00 | 1.00 |
| "simd" | ("manual_example!", "Int64", 10) | 0.76 | 1.00 | 1.00 | 1.00 |
| "simd" | ("two_reductions", "Float32", 10) | 0.66 | 1.00 | 1.00 | 1.00 |
| "simd" | ("two_reductions", "Float32", 256) | 0.79 | 1.00 | 1.00 | 1.00 |

@timholy (Member Author) commented Mar 8, 2016

While the nanosoldier benchmark didn't work out, it's not worth trying again: I've run the benchmarks on my local machine, and there are no reproducible regressions. For the record, when nanosoldier errors out it's definitely worth looking into: without 1982118, this PR would have had massive performance regressions for sparse matrix multiplication because of the change in dispatch priority. ("...janitorial changes have a nasty habit of generating unexpected performance regressions" is a lesson learned the hard way.)

I'll leave this open longer to ensure that any concerns have been addressed.

@nalimilan (Member):

I agree with @timholy. Generic definitions should accept any AbstractArray as long as that interface is enough for them to return correct results. More efficient methods can then be provided to improve performance, but there's no reason to refuse to do something just because it's likely to be slow. In some cases, the slow version can be good enough; by contrast, a MethodError is never useful.

@timholy (Member Author) commented Mar 8, 2016

Thanks, @jrevels!

@andreasnoack (Member):

a MethodError is never useful.

Above, I argued for the opposite. It is actually pretty useful when you try to exploit the structure of a type.

I'm okay with this change since it is consistent and will probably give a lot of users fewer error messages, but I wanted to point out that methods that extract each element of AbstractArrays will, in general, result in a lot of slow methods. Not just for sparse matrices but for all kinds of structured arrays. This includes the ProductMatrix, for which you could easily write an O(n) matvec but which will fall back to the generic O(n^2) version with this change.
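
For concreteness, that O(n) matvec could look roughly like this sketch (not code from the thread): since A[i,j] = xrange[i]*yrange[j], A*v collapses to one inner product followed by one scaling.

import Base: *

function *(A::ProductMatrix, v::AbstractVector)
    length(v) == length(A.yrange) || throw(DimensionMismatch())
    s = zero(promote_type(eltype(A), eltype(v)))
    for j in 1:length(v)              # O(n) inner product with the row factor
        s += A.yrange[j] * v[j]
    end
    return [x * s for x in A.xrange]  # O(n) scaling of the column factor
end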

@nalimilan (Member):

Above, I argued for the opposite. It is actually pretty useful when you try to exploit the structure of a type.

Sorry, but I don't understand what a MethodError allows you that a generic fallback prevents.

I'm okay with this change since it is consistent and will probably give a lot of users fewer error messages, but I wanted to point out that methods that extract each element of AbstractArrays will, in general, result in a lot of slow methods. Not just for sparse matrices but for all kinds of structured arrays. This includes the ProductMatrix, for which you could easily write an O(n) matvec but which will fall back to the generic O(n^2) version with this change.

You could easily write it even if the fallback exists, so what's the problem?

@andreasnoack (Member):

Sorry, but I don't understand what a MethodError allows you that a generic fallback prevents.

Getting the MethodError makes it easy to figure out which methods you'll have to define for a type. I'm simply speaking from my own experience implementing different kinds of distributed arrays: I end up not making them subtypes of AbstractArray, to make sure they don't fall back on any generic method that uses indexing. The complete call graph of a method is not always easy to guess, so in practice you have to do a lot of profiling to find the places where generic fallbacks are slowing down your code. It is easy to miss a method definition, and therefore we'll end up with slow methods. In theory, nothing prevents people from writing the fast methods, but in practice people will forget some when they don't get an error.

@nalimilan (Member):

I see. With protocols/traits, it would be easy to list all the methods that a type needs to support and to show, for each one, whether it uses the AbstractArray fallback or a specialized method.
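
As a toy sketch of that idea (all names hypothetical, using the same 0.4-era syntax as the examples above), a trait could record which implementation a type uses:

abstract MatVecStyle
immutable SpecializedMatVec <: MatVecStyle end
immutable FallbackMatVec    <: MatVecStyle end

# Default: assume a type relies on the generic AbstractArray fallback.
matvecstyle{T<:AbstractMatrix}(::Type{T}) = FallbackMatVec()

# A type with a fast method would opt in, e.g.:
# matvecstyle{T}(::Type{ProductMatrix{T}}) = SpecializedMatVec()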

@timholy (Member Author) commented Mar 8, 2016

That's a quite intriguing development strategy; I'd never thought of that before.

I expect that the debugger will change the calculus a lot, since it will be much easier to step in and see what gets called. In other words, even though it's useful from the standpoint of someone writing the library, I doubt we want such considerations to affect how we architect the standard library from a user perspective.

@tkelman (Contributor) commented Mar 8, 2016

There are certainly some types where you'd want fallback methods to be noticeable via warnings or errors in some debug/development/noisy configurations, but I think the default behavior should be to do the correct thing rather than error even if it's slow. Especially if the slowness is fixable - but to fix it someone has to be aware of it first.

timholy added a commit that referenced this pull request Mar 8, 2016
StridedArray->AbstractArray in many places
@timholy merged commit dc0a038 into master on Mar 8, 2016
@timholy deleted the teh/less_striding branch on March 8, 2016 at 22:49