Floats
This page documents Loom’s binary IEEE-754 floating-point types. Use these for real-number arithmetic, scientific formulas, graphics, and simulations where fractional values are required.
Quick reference
Type | Bits | IEEE-754 format | Significand precision (bits ≈ decimal digits) | Smallest normal … largest finite (magnitude) | Default? |
---|---|---|---|---|---|
f16 | 16 | binary16 | 11 bits ≈ 3–4 dec digits | ~±6.10e−5 … ±6.55e+4 | No |
f32 | 32 | binary32 | 24 bits ≈ 6–7 dec digits | ~±1.18e−38 … ±3.40e+38 | No |
f64 | 64 | binary64 | 53 bits ≈ 15–16 dec digits | ~±2.23e−308 … ±1.79e+308 | Yes |
- Representation: IEEE-754 with sign, exponent, fraction; round-to-nearest, ties-to-even by default.
- Specials: +0.0, -0.0, +∞, -∞, NaN (quiet NaN used by default).
- Subnormals: Supported (gradual underflow).
f16 is space-efficient but imprecise; prefer f32 for UI/graphics and f64 for scientific or approximate financial calculations (for exact currency amounts, see the FAQs). Bare float literals without a suffix are inferred as f64 unless the context requires another float type.
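To make the precision column concrete, the sketch below rounds 0.1 to each width and widens it back. It is written in Python rather than Loom: Python's float is IEEE-754 binary64, and its struct module can round through binary16 and binary32. The helper name round_to is an illustrative assumption, not a Loom API.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a binary64 value to another IEEE-754 width, then widen it back.

    fmt is a struct format code: 'e' = binary16, 'f' = binary32, 'd' = binary64.
    """
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]

for name, fmt in (("f16", "e"), ("f32", "f"), ("f64", "d")):
    print(name, repr(round_to(fmt, 0.1)))
# f16 0.0999755859375
# f32 0.10000000149011612
# f64 0.1
```

The narrower the type, the earlier the stored value departs from the decimal literal.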
Literals
Decimal literals with optional exponent and underscores:
let a: f64 = 1.0
let b: f32 = 3.1415927f32
let c = 6.022_140_76e23 # inferred f64
let d: f16 = 1.0e-3f16
let n: f64 = 0.0 # +0.0 and -0.0 both exist
Type suffixes: f16, f32, f64.
Special constants:
let inf = f64::INFINITY
let ninf = f64::NEG_INFINITY
let nan = f64::NAN
Operators & semantics
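- Arithmetic: + - * /
- Remainder: a % b is the truncated remainder, equivalent to a - trunc(a/b) * b, so the result takes the sign of a. This differs from the IEEE-754 remainder operation, which rounds the quotient to the nearest integer.
- Comparisons: == != < <= > >= (see NaN rules below)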
let x: f32 = 5.5
let y: f32 = 2.0
let q = x / y # 2.75
let r = x % y # 1.5
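The difference between a truncated remainder and the IEEE-754 remainder operation shows up with the same operands as above. This is a Python sketch, not Loom code: math.fmod uses the truncated definition and math.remainder the IEEE one.

```python
import math

x, y = 5.5, 2.0

truncated = math.fmod(x, y)              # x - trunc(x / y) * y, sign follows x
by_formula = x - math.trunc(x / y) * y   # the same definition written out
ieee = math.remainder(x, y)              # quotient rounded to nearest: 5.5 - 3 * 2.0

print(truncated, by_formula, ieee)       # 1.5 1.5 -0.5
print(math.fmod(-x, y))                  # -1.5
```

Only the truncated form reproduces the 1.5 shown in the Loom example.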
Exceptional results
- Divide by zero: (+ value) / 0.0 → +∞, (- value) / 0.0 → -∞ for nonzero finite values; dividing by -0.0 flips the sign of the result.
- Invalid operations: 0.0 / 0.0, ∞ - ∞, sqrt(-1.0) → NaN
- Overflow → ±∞; underflow → subnormal or ±0.0 with loss of precision
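Overflow, an invalid operation, and gradual underflow can be observed with binary64 values in Python, as sketched below. Division by zero and sqrt(-1.0) are left out because Python raises exceptions for those instead of returning ∞ or NaN, so they would not illustrate the behaviour described here.

```python
import math
import sys

largest = sys.float_info.max           # about 1.79e308
smallest_normal = sys.float_info.min   # about 2.23e-308

overflowed = largest * 2.0             # too large for binary64: +inf
invalid = math.inf - math.inf          # invalid operation: NaN
subnormal = smallest_normal / 4.0      # below the normal range, still nonzero
flushed = 5e-324 / 2.0                 # half the smallest subnormal: rounds to 0.0

print(overflowed, invalid, subnormal > 0.0, flushed)
```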
NaN, ±0.0, and comparisons
- Any comparison with NaN is false except !=, which is true.
- +0.0 == -0.0 is true, but they have different signs.
- For a total ordering (sorting, maps), use total_cmp(a, b).
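The rules above can be checked in any IEEE-754 environment. The Python sketch below also builds a sort key from the raw bit pattern; total_order_key is a hypothetical helper that follows the IEEE-754 totalOrder idea, and whether Loom's total_cmp orders values in exactly this way is an assumption.

```python
import math
import struct

nan = math.nan

print(nan == nan, nan != nan, nan < 1.0)   # False True False
print(0.0 == -0.0)                         # True
print(math.copysign(1.0, -0.0))            # -1.0: the sign bit still differs

def total_order_key(x: float) -> int:
    """Map a binary64 value to an int whose natural order is a total order."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))   # signed 64-bit view
    # Negative values: flip the magnitude bits so larger magnitudes sort first.
    return bits ^ 0x7FFF_FFFF_FFFF_FFFF if bits < 0 else bits

values = [1.0, nan, -0.0, math.inf, 0.0, -math.inf, -1.0]
ordered = sorted(values, key=total_order_key)
print(ordered)   # [-inf, -1.0, -0.0, 0.0, 1.0, inf, nan]
```

With this key, -0.0 sorts before +0.0 and a positive NaN sorts after +∞, which ordinary comparisons cannot express.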
Helpers:
x.is_nan()
x.is_finite()
x.is_infinite()
x.is_subnormal()
x.signum() # +1.0, -1.0, or NaN
copysign(mag, sign_source)
Rounding & next-after
- Default rounding: nearest, ties-to-even.
- Step to adjacent representable values:
x.next_up()
x.next_down()
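As a rough illustration of adjacent values and of ties-to-even, the Python sketch below uses math.nextafter (available from Python 3.9) in place of Loom's next_up and next_down. The two tie cases add exactly half a step, so the result depends only on which neighbour has an even significand.

```python
import math
import sys

up = math.nextafter(1.0, math.inf)      # smallest binary64 value above 1.0
down = math.nextafter(1.0, -math.inf)   # largest binary64 value below 1.0
ulp = up - 1.0                          # spacing just above 1.0: 2**-52

# Exact half-way cases round to the neighbour whose significand is even.
tie_low = 1.0 + 0.5 * ulp    # between 1.0 (even) and up (odd): stays 1.0
tie_high = 1.0 + 1.5 * ulp   # between up (odd) and 1.0 + 2*ulp (even): rounds up

print(ulp == sys.float_info.epsilon, tie_low == 1.0, tie_high == 1.0 + 2 * ulp)
# True True True
```

Note that the step below 1.0 is half the step above it, because 1.0 sits on a power-of-two boundary.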
Conversions
Float ↔ Float
- Widening (e.g., f32 → f64) is exact.
- Narrowing (e.g., f64 → f32) rounds; use:
  - as cast (rounds to nearest)
  - to_f32_checked() → (f32, overflowed: bool) (flags ∞, NaN, or out-of-range)
  - to_f32_saturating() → clamps to finite max/min
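The three narrowing behaviours can be sketched in Python, using struct to round a binary64 value to binary32. The function names echo the Loom methods above, but these are illustrative stand-ins; in particular, letting NaN pass through the saturating variant is an assumed policy that this page does not specify.

```python
import math
import struct

F32_MAX = 3.4028234663852886e38   # largest finite binary32 value, held as binary64

def to_f32(x: float) -> float:
    """Round to binary32 (nearest, ties-to-even); out-of-range values become ±inf."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0]
    except OverflowError:   # struct rejects finite values that would round to ±inf
        return math.copysign(math.inf, x)

def to_f32_checked(x: float) -> tuple[float, bool]:
    """Return (rounded value, flag); the flag is set for ∞, NaN, or out-of-range."""
    y = to_f32(x)
    return y, not math.isfinite(y)

def to_f32_saturating(x: float) -> float:
    """Clamp to the finite binary32 range before rounding (NaN passes through)."""
    if math.isnan(x):
        return x
    return to_f32(max(-F32_MAX, min(F32_MAX, x)))

print(to_f32(0.1))              # 0.10000000149011612
print(to_f32_checked(1e39))     # (inf, True)
print(to_f32_saturating(-1e39) == -F32_MAX)   # True
```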
Integer ↔ Float
- Int → Float: exact if within mantissa precision; else rounded.
- Float → Int: truncates toward zero.
  - Checked variants: to_i32_checked(), to_u32_checked(), …
  - Saturating variants: to_i32_saturating(), …
let i: i32 = (3.9f32) as i32 # 3
let f: f64 = (1_000_000i32) as f64
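A Python sketch of the integer conversions follows. The helper names echo the Loom methods, but their exact contracts are assumptions: the checked variant returns the same (value, overflowed) shape this page gives for to_f32_checked, and NaN maps to 0 in the saturating variant.

```python
import math

I32_MIN, I32_MAX = -2**31, 2**31 - 1

def to_i32_saturating(x: float) -> int:
    """Truncate toward zero and clamp to the i32 range; NaN maps to 0 (assumed)."""
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return I32_MAX if x > 0 else I32_MIN
    return max(I32_MIN, min(I32_MAX, int(x)))   # int() truncates toward zero

def to_i32_checked(x: float) -> tuple[int, bool]:
    """Return (value, overflowed); overflowed is set for NaN, ∞, or out-of-range."""
    y = to_i32_saturating(x)
    overflowed = not math.isfinite(x) or int(x) != y
    return y, overflowed

print(int(3.9), int(-3.9))                # 3 -3
print(float(2**53 + 1) == float(2**53))   # True: 2**53 + 1 needs 54 significant bits
print(to_i32_checked(1e12))               # (2147483647, True)
```

The second print shows the int → float direction: once an integer needs more bits than the significand holds, the conversion rounds.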
Promotions & mixed-type rules
- f16, f32, f64 in an expression promote to the widest present (f64 > f32 > f16).
- Mixing ints and floats promotes the int to the float’s type.
- No implicit conversion between floats and strings; use parse/format APIs.
Math library (selected)
- Roots & magnitudes: sqrt, cbrt, hypot
- Rounding: floor, ceil, round, trunc, fract
- Exponentials & logs: exp, exp2, ln, log10, log2, powf(y), powi(k)
- Trig: sin, cos, tan, asin, acos, atan, atan2(y, x)
- Hyperbolic: sinh, cosh, tanh, …
- FMA: fma(a, b, c) (computes a*b + c with a single rounding)
- Decompose/compose: frexp() → (mantissa, exp), ldexp(mantissa, exp)
- Split: modf() → (int_part, frac_part)
- Classification: classify() → enum of {Zero, Subnormal, Normal, Inf, NaN}
let r: f64 = f64::hypot(3.0, 4.0) # 5.0
let z: f32 = 1.0f32.fma(1e10f32, -1e10f32) # 1.0*1e10 + (-1e10) with a single rounding → 0.0
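The effect of a single rounding is easier to see when the product is not exactly representable. The Python sketch below emulates a fused multiply-add with exact rational arithmetic (fma_exact is a hypothetical helper for finite inputs, not a Loom or Python library function) and also exercises the decompose and split helpers. Note that Python's math.modf returns the fractional part first, the reverse of the order listed above.

```python
import math
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Emulate fused multiply-add: evaluate a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
unfused = a * b - 1.0           # the product rounds to 1.0 first, so this is 0.0
fused = fma_exact(a, b, -1.0)   # keeps the exact product 1 - 2**-60

print(unfused, fused == -(2.0**-60))    # 0.0 True

mantissa, exp = math.frexp(8.0)         # (0.5, 4): 8.0 == 0.5 * 2**4
frac_part, int_part = math.modf(3.75)   # (0.75, 3.0)
print(math.ldexp(mantissa, exp), int_part, frac_part)   # 8.0 3.0 0.75
```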
Formatting & parsing
let v: f64 = 1234.56789
print(v) # 1234.56789 (default)
printf("fixed=%.2f sci=%.3e gen=%g\n", v, v, v)
# fixed=1234.57 sci=1.235e+03 gen=1234.57
let a: f32 = f32.parse("3.14")
let b: f64 = f64.parse("6.022e23")
let o_opt = f64.parse_opt("NaN") # → Option<f64>
Parsing accepts inf, +inf, -inf, and nan (case-insensitive). nan(payload) is permitted but payload bits are not guaranteed to round-trip across platforms.
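For comparison, the same format specifiers and special-value spellings behave as follows in Python. This is only an analogy: parse_opt here is a hypothetical Python helper that mimics an optional result with None, and Loom's own parser may accept a different set of spellings.

```python
import math

v = 1234.56789
line = "fixed=%.2f sci=%.3e gen=%g" % (v, v, v)
print(line)   # fixed=1234.57 sci=1.235e+03 gen=1234.57

# Special values parse case-insensitively, with an optional sign.
parsed = [float(s) for s in ("inf", "+INF", "-Inf", "NaN")]

def parse_opt(text: str):
    """Return the parsed float, or None when the text is not a valid literal."""
    try:
        return float(text)
    except ValueError:
        return None

print(parse_opt("6.022e23"), parse_opt("abc"))   # 6.022e+23 None
```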
Bit-level access & endianness
let bits: u32 = (3.5f32).to_bits() # raw IEEE-754
let x: f32 = f32.from_bits(bits)
let be = x.to_be_bytes() # explicit byte order for I/O
let y = f32.from_be_bytes(be)
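The raw pattern for 3.5 can be decoded by hand to see the sign, exponent, and fraction fields. The Python sketch below uses struct for the bit reinterpretation; the helper names f32_to_bits and f32_from_bits are illustrative stand-ins for the Loom calls above.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Reinterpret a value rounded to binary32 as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def f32_from_bits(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

bits = f32_to_bits(3.5)
sign = bits >> 31                        # 0
exponent = ((bits >> 23) & 0xFF) - 127   # 1 after removing the bias
fraction = bits & 0x7FFFFF               # 0x600000, i.e. 0.75 of the fraction range

print(hex(bits))                         # 0x40600000
print((1 + fraction / 2**23) * 2.0**exponent)   # 3.5

be = struct.pack(">f", 3.5)   # big-endian bytes, most significant first
le = struct.pack("<f", 3.5)   # little-endian bytes
print(be.hex(), le.hex())     # 40600000 00006040
```

Choosing an explicit byte order when writing to files or sockets keeps the data portable between machines with different native endianness.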
Performance, precision & determinism
- Prefer f64 for numerically sensitive work; f32 for memory/throughput.
- Floating arithmetic is not associative: (a+b)+c may differ from a+(b+c).
- For reproducible results across platforms:
  - Avoid “fast-math” compilation for critical code.
  - Use fma, stable algorithms (e.g., Kahan summation), and fixed evaluation order.
- Binary floats cannot exactly represent many decimals (e.g., 0.1). Compare with tolerances.
pub func approx_eq(a: f64, b: f64, eps: f64 = 1e-9): bool {
ret (a - b).abs() <= eps * (1.0 + a.abs().max(b.abs()))
}
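A short Python sketch shows both points at once: regrouping a sum changes the last bit, and a tolerance comparison absorbs the difference. The approx_eq below is a direct port of the Loom function above, for illustration only.

```python
def approx_eq(a: float, b: float, eps: float = 1e-9) -> bool:
    """Mixed absolute/relative tolerance, ported from the Loom version."""
    return abs(a - b) <= eps * (1.0 + max(abs(a), abs(b)))

left = (0.1 + 0.2) + 0.3    # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6

print(left == right)            # False: the grouping changed the rounding
print(approx_eq(left, right))   # True
print(approx_eq(1.0, 1.001))    # False: a real difference is still detected
```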
Examples
Kahan (compensated) summation
pub func sum_kahan(xs: []f64): f64 {
var s = 0.0
var c = 0.0
for x in xs {
let y = x - c
let t = s + y
c = (t - s) - y
s = t
}
ret s
}
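To see what the compensation term buys, the Python port below sums one large value followed by terms too small to register individually. The naive loop is written out by hand because Python's built-in sum may itself compensate for floats in recent versions; math.fsum serves as a correctly rounded reference.

```python
import math

def sum_kahan(xs):
    """Kahan (compensated) summation, ported from the Loom version above."""
    s = 0.0   # running sum
    c = 0.0   # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y   # what was lost when y was added to s
        s = t
    return s

xs = [1.0] + [1e-16] * 10

naive = 0.0
for x in xs:
    naive += x            # each 1e-16 is below half a step at 1.0 and vanishes

print(naive)              # 1.0
print(sum_kahan(xs) > naive)
print(abs(sum_kahan(xs) - math.fsum(xs)))
```

The naive loop never leaves 1.0, while the compensated sum accumulates the small terms and lands close to the reference.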
Fast, precise linear interpolation
# lerp(a, b, t) = a + t*(b - a), but fma avoids extra rounding
pub func lerp(a: f32, b: f32, t: f32): f32 {
ret (b - a).fma(t, a)
}
Safe normalization with edge cases
pub func normalize(x: f64, y: f64): (f64, f64) {
let d = f64::hypot(x, y) # robust sqrt(x*x + y*y)
if d == 0.0 { ret (0.0, 0.0) } # zero-length input: define your policy
ret (x / d, y / d)
}
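The reason for using hypot instead of sqrt(x*x + y*y) shows up with large components, where the squares overflow even though the length itself is representable. This Python port is a sketch of the same idea and keeps the same zero-length policy.

```python
import math

def normalize(x: float, y: float) -> tuple[float, float]:
    """Scale (x, y) to unit length; a zero-length input maps to (0.0, 0.0)."""
    d = math.hypot(x, y)
    if d == 0.0:
        return (0.0, 0.0)
    return (x / d, y / d)

x, y = 3e200, 4e200
naive_len = math.sqrt(x * x + y * y)   # the squares overflow, so this is inf
robust_len = math.hypot(x, y)          # about 5e200

ux, uy = normalize(x, y)
print(naive_len, math.isfinite(robust_len))
print(math.isclose(ux, 0.6), math.isclose(uy, 0.8))   # True True
```

Infinite or NaN inputs are not handled here; they would need their own policy, as in clamp01 below.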
Constraining to a finite range
pub func clamp01(x: f32): f32 {
if x.is_nan() { ret 0.0f32 } # define your policy
ret x.max(0.0f32).min(1.0f32)
}
FAQs
Q: Why did my 0.1 + 0.2 become 0.30000000000000004?
A: 0.1 and 0.2 are not exactly representable in binary floating point. Compare using a tolerance (see approx_eq).
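The value actually stored for 0.1 can be inspected exactly. The Python sketch below is only an illustration in another IEEE-754 binary64 environment; it prints the stored value as a fraction with a power-of-two denominator and as a full decimal expansion.

```python
from decimal import Decimal
from fractions import Fraction

print(0.1 + 0.2)       # 0.30000000000000004
print(Fraction(0.1))   # 3602879701896397/36028797018963968  (denominator is 2**55)
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625
```

The stored value is slightly above one tenth, and the small errors in 0.1 and 0.2 add up to a sum that rounds to the double just above 0.3.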
Q: Should I store currency in floats?
A: Prefer scaled integers (e.g., cents in i64) or decimal types. Use floats only for approximate calculations.
Q: When should I use f16?
A: For dense arrays where memory/bandwidth matter and precision requirements are low (e.g., ML activations, approximate textures). Convert to f32/f64 for computation if needed.
Q: How do I sort values with NaNs?
A: Use total_cmp(a, b); it defines a total order over all values, including NaN (e.g., NaN ordered after finite numbers).
See also
- Integers: i8/i16/i32/i64, u8/u16/u32/u64
- Complex numbers (if enabled by your profile)
- Numerics & math utilities in the standard library