Make `Float16(::BigFloat)` go through `Float64` #40245

oscardssmith · 2021-03-28T07:35:13Z

This doesn't fix the double-rounding issue, but it makes it occur 2^29 times less frequently. It should have minimal performance impact.
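For readers unfamiliar with the problem, here is a minimal sketch of a double-rounding case. The value `x` is chosen purely for illustration and is not taken from the PR. The factor 2^29 presumably reflects the 29 extra significand bits Float64 has over Float32 (52 vs. 23): an intermediate Float64 only rounds `x` onto a Float16 tie when `x` sits in a much narrower window around it.

```julia
# Illustrative value (an assumption, not from the PR): slightly above the
# midpoint between Float16(1) and the next Float16.
x = big(1.0) + big(2.0)^-11 + big(2.0)^-30

lo, hi = Float16(1), nextfloat(Float16(1))
# Correctly rounded result: whichever neighbour is closer to x.
nearest = abs(big(hi) - x) < abs(x - big(lo)) ? hi : lo

# Float32 cannot hold the 2^-30 term, so the intermediate value is an exact
# tie, which the second rounding then resolves downward (ties-to-even).
via32 = Float16(Float32(x))

# Float64 still represents x exactly, so the tie-breaking bit survives.
exact64 = big(Float64(x)) == x

println((nearest = nearest, via32 = via32, exact64 = exact64))
```

Whether the final `Float16(::Float64)` step then rounds correctly is a separate question, which is what the rest of this thread is about.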

oscardssmith · 2021-03-29T20:08:21Z

Can we get a review and merge on this? It's very low impact and a strict improvement.

oscardssmith · 2021-03-30T14:23:17Z

Bump on this.

oscardssmith · 2021-03-30T16:11:33Z

Thanks!

kimikage · 2021-04-02T07:51:21Z

xref: https://discourse.julialang.org/t/os-dependency-of-float16-bigfloat-on-nightly/58414

oscardssmith · 2021-04-02T13:27:37Z

I think the best solution to this is probably to write our own version of `Float16(::Float64)` so that the OS can't get it wrong.

oscardssmith · 2021-04-03T04:16:15Z

I think the following is a correct implementation. It is 3x slower for Float32 (not fully sure why), but is about 13% faster for Float64.

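# Lookup tables indexed by the 9-bit (sign, biased Float32 exponent) field, plus 1.
function gentables()
    BASE  = zeros(UInt16, 512)
    SHIFT = zeros(UInt8,  512)
    for i in 1:256
        e = i - 128        # unbiased exponent for biased exponent i-1
        j = i + 256        # slot for the same exponent with the sign bit set
                           # (i|0x100 would map i=256 back onto itself)
        if e < -24         # too small: flush to ±0
            BASE[i] = 0x0000
            BASE[j] = 0x8000
            SHIFT[i] = SHIFT[j] = 24
        elseif e < -14     # Float16 subnormal
            BASE[i] = 0x0400 >> (-e - 14)
            BASE[j] = (0x0400 >> (-e - 14)) | 0x8000
            SHIFT[i] = SHIFT[j] = -e - 1
        elseif e <= 15     # Float16 normal
            BASE[i] = (e + 15) << 10
            BASE[j] = ((e + 15) << 10) | 0x8000
            SHIFT[i] = SHIFT[j] = 13
        elseif e < 128     # too large: overflow to ±Inf
            BASE[i] = 0x7C00
            BASE[j] = 0xFC00
            SHIFT[i] = SHIFT[j] = 24
        else               # Inf and NaN: keep the top mantissa bits
            BASE[i] = 0x7C00
            BASE[j] = 0xFC00
            SHIFT[i] = SHIFT[j] = 13
        end
    end
    return Tuple(BASE), Tuple(SHIFT)
end
const BASE, SHIFT = gentables()

function Float16(x::Float32)
    xu = reinterpret(UInt32, x)
    e = ((xu >> 23) & 0x1ff) + 1    # 1-based table index from sign and exponent bits
    ans  = @inbounds BASE[e]
    # Low mantissa bits are shifted out, i.e. truncated rather than rounded.
    ans += @inbounds(((xu & 0x7fffff) >> SHIFT[e]) % UInt16)
    return reinterpret(Float16, ans)
end

function Float16(x::Float64)
    xu = reinterpret(UInt64, x)
    s = (xu >> 63) % Int              # sign bit
    E = ((xu >> 52) & 0x7ff) % Int    # biased exponent, signed so E - 895 cannot wrap around
    # Re-bias to the Float32-based table index; only Inf/NaN may use slot 256,
    # so large finite values overflow to Inf instead of picking up mantissa bits.
    e = ifelse(E == 0x7ff, 256, clamp(E - 895, 1, 255)) + 256 * s
    ans  = @inbounds BASE[e]
    ans += @inbounds(((xu & 0xfffffffffffff) >> (SHIFT[e] + 29)) % UInt16)
    return reinterpret(Float16, ans)
end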
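One cheap sanity check for any candidate `Float16(::Float64)` is that every Float16 value, widened exactly to Float64, converts back to the same bit pattern. The sketch below is only that: `roundtrips` is a hypothetical helper name, and passing it does not say anything about rounding of values that are not exactly representable.

```julia
# Hypothetical helper: returns true if `conv` maps every non-NaN Float16,
# widened exactly to Float64, back to the identical bit pattern.
function roundtrips(conv)
    for u in 0:0xffff
        bits = UInt16(u)
        h = reinterpret(Float16, bits)
        isnan(h) && continue              # NaN payloads are not compared here
        reinterpret(UInt16, conv(Float64(h))) == bits || return false
    end
    return true
end

# Checks whichever Float16(::Float64) method is currently defined.
println(roundtrips(Float16))
```

A fuller test would also compare inexact inputs against a reference computed in `BigFloat`, since that is where rounding mode and double rounding actually show up.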

kimikage · 2021-04-03T04:50:09Z

The Ivy Bridge and newer x86 processors have Float32 <--> Float16 conversion instructions. That's earlier than the advent of AVX2. Although I am not familiar with Arm, we can practically assume that most processors support them.
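To see whether a given machine and Julia build actually use such an instruction, one can inspect the generated code. This is a rough sketch: `vcvtps2ph` is the x86 F16C single-to-half conversion instruction, and whether it appears depends on the CPU, the Julia version, and the LLVM target settings, so no particular output is implied.

```julia
using InteractiveUtils  # provides code_native

# Dump the native code compiled for Float16(::Float32) on this machine.
asm = sprint(code_native, Float16, Tuple{Float32})

# Look for the x86 F16C conversion instruction in the listing.
println("uses vcvtps2ph: ", occursin("vcvtps2ph", asm))
```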
