@oscardssmith
Member

Doesn't fix double rounding issues, but makes them occur 2^29 times less frequently. Should have minimal performance effects.
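As a rough illustration of the double rounding mentioned here: rounding a value first to a wider intermediate type and then to Float16 can give a different answer than rounding once. The 2^29 figure presumably reflects the 29 extra significand bits of Float64 (52) over Float32 (23) as the intermediate type; that reading is an assumption, not something the PR text states. The helper `nearest_float16` below is hypothetical and exists only to provide a reference result for this sketch.

```julia
# Hypothetical helper (not part of Base): pick the Float16 nearest to x by
# comparing exact distances in BigFloat. Ties are not handled specially.
function nearest_float16(x::BigFloat)
    best = Float16(Float32(x))   # starting guess; may be one ulp off
    for h in (prevfloat(best), nextfloat(best))
        if abs(BigFloat(h) - x) < abs(BigFloat(best) - x)
            best = h
        end
    end
    return best
end

# A value just above the midpoint between Float16(1) and the next Float16.
# The sum is exact in Float64, so the BigFloat holds it exactly too.
x = big(1 + ldexp(1.0, -11) + ldexp(1.0, -30))

# Float32 cannot hold the 2^-30 term, so x first rounds to the exact midpoint
# 1 + 2^-11; the second rounding then breaks the tie toward the even value.
via_float32 = Float16(Float32(x))
reference   = nearest_float16(x)

println(via_float32 == reference)   # false
```

With a Float64 intermediate the 2^-30 term survives the first rounding, so this particular input no longer triggers the problem; an input would need its deciding bit below Float64 precision to do so.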

@oscardssmith added the bug, maths, bignums, and float16 labels Mar 28, 2021
@oscardssmith requested a review from simonbyrne March 28, 2021 07:37
@oscardssmith
Member Author

Can we get a review and merge on this? It's very low impact and a strict improvement.

@oscardssmith
Member Author

Bump on this.

@quinnj merged commit 04e2028 into JuliaLang:master Mar 30, 2021
@oscardssmith
Member Author

Thanks!

@oscardssmith deleted the oscardssmith-bigfloat-to-float16 branch March 30, 2021 16:12
@kimikage
Contributor

kimikage commented Apr 2, 2021

@oscardssmith
Member Author

I think the best solution to this is probably to write our own version of Float16(::Float64) so that the OS can't get it wrong.

@oscardssmith
Member Author

I think the following is a correct implementation. It is 3x slower for Float32 (not fully sure why), but is about 13% faster for Float64.

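function gentables()
	BASE  = zeros(UInt16, 512)
	SHIFT = zeros(UInt8,  512)
	# Slots 1:256 serve positive inputs, slots 257:512 negative inputs.
	# The negative slot is i+256; i|0x100 would alias slot 256 when i == 256.
	for i in 1:256
		e = i-128  # unbiased exponent of a Float32 whose biased exponent is i-1
		if e < -24      # too small even for a subnormal Float16: signed zero
			BASE[i]      = 0x0000
			BASE[i+256]  = 0x8000
			SHIFT[i]     = 24
			SHIFT[i+256] = 24
		elseif e < -14  # subnormal Float16: the implicit leading bit moves into BASE
			BASE[i]      = (0x0400>>(-e-14))
			BASE[i+256]  = (0x0400>>(-e-14)) | 0x8000
			SHIFT[i]     = -e-1
			SHIFT[i+256] = -e-1
		elseif e <= 15  # normal Float16: rebias the exponent, keep 10 mantissa bits
			BASE[i]      = ((e+15)<<10)
			BASE[i+256]  = ((e+15)<<10) | 0x8000
			SHIFT[i]     = 13
			SHIFT[i+256] = 13
		elseif e < 128  # finite but too large: signed Inf, mantissa shifted out
			BASE[i]      = 0x7C00
			BASE[i+256]  = 0xFC00
			SHIFT[i]     = 24
			SHIFT[i+256] = 24
		else            # Inf and NaN: keep the top mantissa bits
			BASE[i]      = 0x7C00
			BASE[i+256]  = 0xFC00
			SHIFT[i]     = 13
			SHIFT[i+256] = 13
		end
	end
	return Tuple(BASE), Tuple(SHIFT)
end
const BASE, SHIFT = gentables()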

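function Float16(x::Float32)
	xu = reinterpret(UInt32, x)
	# Sign bit plus 8-bit biased exponent, shifted to a 1-based table index (1:512).
	e = ((xu>>23)&0x1ff)+1
	ans  = @inbounds BASE[e]
	# Add the surviving high mantissa bits; the shifted-out low bits are discarded.
	ans += @inbounds ((xu&0x7fffff)>>SHIFT[e]) % UInt16
	return reinterpret(Float16, ans)
end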

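function Float16(x::Float64)
	xu = reinterpret(UInt64, x)
	s = (xu>>63) % Int            # sign bit
	e = ((xu>>52)&0x7ff) % Int    # biased exponent as a signed Int, so e-895 cannot wrap
	# Rebias to the 1-based table index. Finite values clamp to 1:255 so that
	# overflow gives Inf; only Inf/NaN (e == 2047) use slot 256, which keeps
	# the mantissa. Negative inputs use the upper half of the tables.
	e = ifelse(e == 2047, 256, clamp(e-895, 1, 255)) + 256s
	ans  = @inbounds BASE[e]
	# Float64 has 29 more mantissa bits than Float32, hence the extra shift.
	ans += @inbounds ((xu&0xfffffffffffff)>>(SHIFT[e]+29)) % UInt16
	return reinterpret(Float16, ans)
end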
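
One way to sanity-check a candidate conversion like the one above is to confirm that every non-NaN Float16 value survives a round trip through Float32. The sketch below is only a necessary condition, since it never exercises inputs that actually need rounding. `roundtrip_failures` is a hypothetical helper, shown here applied to Base's own `Float16` constructor.

```julia
# Hypothetical checker: `conv` is any Float32 -> Float16 conversion function.
# Returns the bit patterns of all non-NaN Float16 values that do not round-trip.
function roundtrip_failures(conv)
    failures = UInt16[]
    for bits in 0x0000:0xffff
        h = reinterpret(Float16, bits)
        isnan(h) && continue        # NaN payloads need not be preserved
        # Float32(h) is exact, so converting back must give identical bits.
        conv(Float32(h)) === h || push!(failures, bits)
    end
    return failures
end

println(length(roundtrip_failures(Float16)))
```

Passing a different function as `conv` runs the same sweep against it; a fuller test would also compare inexact inputs against a correctly rounded reference.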

@kimikage
Contributor

kimikage commented Apr 3, 2021

The Ivy Bridge and newer x86 processors have Float32 <--> Float16 conversion instructions. That's earlier than the advent of AVX2. Although I am not familiar with Arm, we can practically assume that most processors support them.

See also #37510
