
Floats

This page documents Loom’s binary IEEE-754 floating-point types. Use these for real-number arithmetic, scientific formulas, graphics, and simulations where fractional values are required.


Quick reference

Type | Bits | IEEE-754 format | Significand precision (bits / ~digits) | Min normal … max finite | Default?
f16  | 16   | binary16        | 11 bits ≈ 3–4 dec digits               | ~±6.10e−5 … ±6.55e+4    | No
f32  | 32   | binary32        | 24 bits ≈ 6–7 dec digits               | ~±1.18e−38 … ±3.40e+38  | No
f64  | 64   | binary64        | 53 bits ≈ 15–16 dec digits             | ~±2.23e−308 … ±1.79e+308 | Yes

  • Representation: IEEE-754 with sign, exponent, fraction; round-to-nearest, ties-to-even by default.
  • Specials: +0.0, -0.0, +∞, -∞, NaN (quiet NaN used by default).
  • Subnormals: Supported (gradual underflow).

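f16 is space-efficient but imprecise; prefer f32 for UI/graphics and f64 for scientific work. For exact monetary amounts, avoid binary floats altogether (see the FAQs). A bare float literal without a suffix is inferred as f64 unless the surrounding context requires another float type.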
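To make the precision column concrete, here is a minimal sketch in Python rather than Loom. It assumes only that Python's float is IEEE-754 binary64 and uses the standard struct module to round a value through each width; the helper name round_to is hypothetical.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round a binary64 value to another IEEE-754 width and convert it back.

    fmt is a struct format code: 'e' = binary16, 'f' = binary32, 'd' = binary64.
    """
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]

x = 0.1
as_f16 = round_to("e", x)
as_f32 = round_to("f", x)
as_f64 = round_to("d", x)

# Print each rounded value and its distance from the binary64 value.
for name, v in (("f16", as_f16), ("f32", as_f32), ("f64", as_f64)):
    print(f"{name}: {v!r}  abs error vs f64 = {abs(v - x):.3e}")
```

The error grows by several orders of magnitude at each narrower width, which is why f16 is usually kept for storage rather than accumulation.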


Literals

Decimal literals with optional exponent and underscores:

let a: f64 = 1.0
let b: f32 = 3.1415927f32
let c      = 6.022_140_76e23          # inferred f64
let d: f16 = 1.0e-3f16
let n: f64 = 0.0                       # +0.0 and -0.0 both exist

Type suffixes: f16, f32, f64.

Special constants:

let inf  = f64::INFINITY
let ninf = f64::NEG_INFINITY
let nan  = f64::NAN

Operators & semantics

  • Arithmetic: + - * /
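  • Remainder: a % b is the truncated remainder, a - trunc(a/b) * b (like C's fmod); the result takes the sign of a. This is not the IEEE-754 remainder operation, which rounds a/b to the nearest integer instead of truncating.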
  • Comparisons: == != < <= > >= (see NaN rules below)
let x: f32 = 5.5
let y: f32 = 2.0
let q = x / y          # 2.75
let r = x % y          # 1.5
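The result 1.5 above matches the truncated definition, not the IEEE-754 remainder operation. The difference can be sketched in Python (not Loom), whose standard library exposes both: math.fmod truncates the quotient and math.remainder rounds it to nearest.

```python
import math

x, y = 5.5, 2.0

# Truncated remainder: x - trunc(x / y) * y, sign follows the dividend.
truncated = math.fmod(x, y)

# IEEE-754 remainder: x - n * y where n is x / y rounded to nearest (ties to even).
ieee = math.remainder(x, y)

# With a negative dividend the truncated remainder is negative as well.
negative_dividend = math.fmod(-5.5, 2.0)

print(truncated, ieee, negative_dividend)
```

Here x / y is 2.75, so truncation uses n = 2 and rounding uses n = 3, which is why the two results differ in both magnitude and sign.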

Exceptional results

  • Divide by zero: (+ value) / 0.0 → +∞, (- value) / 0.0 → -∞ for nonzero finite values; dividing by -0.0 flips the sign of the result
  • 0.0 / 0.0, ∞ - ∞, sqrt(-1.0) → NaN
  • Overflow → ±∞; underflow → subnormal or ±0.0 with loss of precision
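The overflow and underflow rules can be observed on any binary64 implementation. The following sketch uses Python rather than Loom. Note one Python-specific difference, flagged in the code: Python raises ZeroDivisionError for float division by zero instead of returning ±∞.

```python
import math
import sys

inf = math.inf

nan_from_sub = inf - inf                  # invalid operation -> NaN
overflowed = sys.float_info.max * 2.0     # overflow -> +inf
smallest_normal = sys.float_info.min      # 2**-1022
subnormal = smallest_normal / 4.0         # below the normal range, still nonzero
smallest_subnormal = 5e-324               # 2**-1074
underflowed = smallest_subnormal / 2.0    # exact tie, rounds to even -> 0.0

print(math.isnan(nan_from_sub), overflowed, subnormal, underflowed)

# Python-specific (not IEEE default behaviour): division by zero raises.
try:
    1.0 / 0.0
    raised = False
except ZeroDivisionError:
    raised = True
```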

NaN, ±0.0, and comparisons

  • Any comparison with NaN is false except !=, which is true.
  • +0.0 == -0.0 is true, but they have different signs.
  • For a total ordering (sorting, maps), use total_cmp(a, b).
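One common way to obtain a total order is to compare the raw bit patterns after a sign-dependent flip, as described for IEEE-754 totalOrder. The sketch below is Python, not Loom, and total_key is a hypothetical helper; whether Loom's total_cmp is implemented this way is an assumption.

```python
import math
import struct

def total_key(x: float) -> int:
    """Map a binary64 value to an integer whose ordering is a total order."""
    # Reinterpret the 64 bits as a signed integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # For negative floats, flip the lower 63 bits so larger magnitudes sort first.
    return bits ^ ((bits >> 63) & 0x7FFF_FFFF_FFFF_FFFF)

nan = math.copysign(math.nan, 1.0)        # NaN with the sign bit cleared

# Ordinary comparisons: NaN is unordered, and the two zeros compare equal.
nan_equals_itself = (nan == nan)          # False
nan_differs = (nan != nan)                # True
zeros_equal = (0.0 == -0.0)               # True

values = [nan, 1.0, 0.0, -0.0, math.inf, -math.inf, -1.0]
ordered = sorted(values, key=total_key)
print(ordered)
```

Under this key, -0.0 sorts before +0.0 and a NaN with a clear sign bit sorts after +∞, so sorting is deterministic even when NaNs are present.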

Helpers:

x.is_nan()
x.is_finite()
x.is_infinite()
x.is_subnormal()
x.signum()        # +1.0, -1.0, or NaN
copysign(mag, sign_source)

Rounding & next-after

  • Default rounding: nearest, ties-to-even.
  • Step to adjacent representable values:
x.next_up()
x.next_down()
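Stepping to adjacent values shows how the spacing between floats depends on magnitude. This sketch uses Python's math.nextafter (Python 3.9 or later) as a stand-in; the wrappers next_up and next_down are hypothetical helpers named after the methods above.

```python
import math

def next_up(x: float) -> float:
    """Smallest representable value greater than x."""
    return math.nextafter(x, math.inf)

def next_down(x: float) -> float:
    """Largest representable value less than x."""
    return math.nextafter(x, -math.inf)

gap_above_one = next_up(1.0) - 1.0      # 2**-52, one ulp at 1.0
gap_below_one = 1.0 - next_down(1.0)    # 2**-53, spacing halves below a power of two
first_subnormal = next_up(0.0)          # smallest positive binary64 value

print(gap_above_one, gap_below_one, first_subnormal)
```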

Conversions

Float ↔ Float

  • Widening (e.g., f32 → f64) is exact.

  • Narrowing (e.g., f64 → f32) rounds; use:

    • as cast (rounds to nearest)
    • to_f32_checked() → (f32, overflowed: bool) (flags ±∞, NaN, or out-of-range)
    • to_f32_saturating() → clamps to finite max/min
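The checked and saturating narrowing behaviours can be sketched in Python (not Loom) using struct to round to binary32. The functions below reuse the names from this page for readability but are illustrative only. The value returned alongside a set flag and the NaN handling in the saturating variant are assumptions, since the text above does not specify them.

```python
import math
import struct

F32_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127   # largest finite binary32 value

def round_to_f32(x: float) -> float:
    """Round to binary32 (nearest, ties to even); raises OverflowError if too large."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_f32_checked(x: float):
    """Return (value, flagged). Flags infinities, NaN, and out-of-range inputs."""
    if math.isnan(x) or math.isinf(x):
        return x, True
    try:
        return round_to_f32(x), False
    except OverflowError:
        # Assumption: an out-of-range input yields a signed infinity plus the flag.
        return math.copysign(math.inf, x), True

def to_f32_saturating(x: float) -> float:
    """Clamp to the finite binary32 range before rounding."""
    if math.isnan(x):
        return x                             # assumption: NaN passes through
    return round_to_f32(max(-F32_MAX, min(F32_MAX, x)))

print(to_f32_checked(0.1), to_f32_checked(1e40), to_f32_saturating(1e40))
```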

Integer ↔ Float

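  • Int → Float: exact when the integer fits in the significand (magnitude up to 2^11 for f16, 2^24 for f32, 2^53 for f64); otherwise rounded to the nearest representable value.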

  • Float → Int: truncates toward zero.

    • Checked variants: to_i32_checked(), to_u32_checked(), …
    • Saturating variants: to_i32_saturating(), …
let i: i32 = (3.9f32) as i32   # 3
let f: f64 = (1_000_000i32) as f64
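The integer conversion rules can be illustrated in Python (not Loom). The sketch below shows rounding beyond 2^53 and models checked and saturating float-to-i32 conversions. The helpers are illustrative, and mapping NaN to 0 in the saturating variant is an assumption rather than documented Loom behaviour.

```python
import math
from typing import Optional

I32_MIN, I32_MAX = -2 ** 31, 2 ** 31 - 1

# Int -> float: 2**53 + 1 needs 54 significant bits, so it is rounded (ties to even).
exact = float(2 ** 53)
rounded = float(2 ** 53 + 1)

def to_i32_checked(x: float) -> Optional[int]:
    """Truncate toward zero; return None for NaN, infinities, or out-of-range."""
    if math.isnan(x) or math.isinf(x):
        return None
    n = math.trunc(x)
    return n if I32_MIN <= n <= I32_MAX else None

def to_i32_saturating(x: float) -> int:
    """Truncate toward zero and clamp to the i32 range."""
    if math.isnan(x):
        return 0                              # assumption: NaN maps to 0
    if math.isinf(x):
        return I32_MAX if x > 0 else I32_MIN
    return max(I32_MIN, min(I32_MAX, math.trunc(x)))

print(rounded == exact, to_i32_checked(3.9), to_i32_checked(-3.9), to_i32_saturating(3e9))
```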

Promotions & mixed-type rules

  • f16, f32, f64 in an expression promote to the widest present (f64 > f32 > f16).
  • Mixing ints and floats promotes the int to the float’s type.
  • No implicit conversion between floats and strings; use parse/format APIs.

Math library (selected)

  • Roots & magnitudes: sqrt, cbrt, hypot
  • Rounds: floor, ceil, round, trunc, fract
  • Exponentials & logs: exp, exp2, ln, log10, log2, powf(y), powi(k)
  • Trig: sin, cos, tan, asin, acos, atan, atan2(y, x)
  • Hyperbolic: sinh, cosh, tanh, …
  • FMA: fma(a, b, c) (computes a*b + c with a single rounding)
  • Decompose/compose: frexp() → (mantissa, exp), ldexp(mantissa, exp)
  • Split: modf() → (int_part, frac_part)
  • Classification: classify() → enum of {Zero, Subnormal, Normal, Inf, NaN}
let r: f64 = f64::hypot(3.0, 4.0)      # 5.0
let z: f32 = 1.0f32.fma(1e10f32, -1e10f32)  # 1.0 * 1e10 + (-1e10) = 0.0, rounded once
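The benefit of a single rounding shows up when the exact product carries bits that a separately rounded multiply would drop. The sketch below is Python, not Loom. Because math.fma is only available in recent Python versions, it emulates a fused multiply-add with exact rational arithmetic, and fma_exact is a hypothetical helper for finite inputs only.

```python
from fractions import Fraction

def fma_exact(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly as a rational, then round once to binary64."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -30

# Exact a*a is 1 + 2**-29 + 2**-60; the last term is lost when a*a is rounded.
two_roundings = a * a - 1.0
one_rounding = fma_exact(a, a, -1.0)

print(two_roundings, one_rounding)
```

The separately rounded version returns 2^-29, while the fused version keeps the 2^-60 term because the subtraction happens before any rounding.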

Formatting & parsing

let v: f64 = 1234.56789

print(v)                                  # 1234.56789 (default)
printf("fixed=%.2f sci=%.3e gen=%g\n", v, v, v)
# fixed=1234.57 sci=1.235e+03 gen=1234.57

let a: f32 = f32.parse("3.14")
let b: f64 = f64.parse("6.022e23")
let o_opt  = f64.parse_opt("NaN")         # → Option<f64>

Parsing accepts inf, +inf, -inf, and nan (case-insensitive). nan(payload) is permitted but payload bits are not guaranteed to round-trip across platforms.
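For comparison, Python's float() accepts the same case-insensitive special spellings, and its % formatting follows the same printf conventions. The sketch below is Python, not Loom. The function parse_opt is a hypothetical stand-in that returns None on failure in place of an Option.

```python
import math
from typing import Optional

def parse_opt(text: str) -> Optional[float]:
    """Parse a float, returning None instead of raising on malformed input."""
    try:
        return float(text)
    except ValueError:
        return None

v = 1234.56789
line = "fixed=%.2f sci=%.3e gen=%g" % (v, v, v)
print(line)

specials = [parse_opt(s) for s in ("inf", "+inf", "-INF", "NaN")]
bad = parse_opt("abc")
print(specials, bad)
```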


Bit-level access & endianness

let bits: u32 = (3.5f32).to_bits()        # raw IEEE-754
let x: f32   = f32.from_bits(bits)

let be = x.to_be_bytes()                  # explicit byte order for I/O
let y  = f32.from_be_bytes(be)
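The bit layout behind to_bits can be inspected by hand. This Python sketch (not Loom) uses struct to reinterpret 3.5 as a binary32 pattern and split it into sign, biased exponent, and fraction; the helper names are hypothetical.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Raw binary32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def f32_from_bits(bits: int) -> float:
    """Inverse of f32_to_bits."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

bits = f32_to_bits(3.5)            # 3.5 = 1.75 * 2**1
sign = bits >> 31                  # 1 bit
exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
fraction = bits & 0x7FFFFF         # 23 bits, implicit leading 1 not stored

big_endian = struct.pack(">f", 3.5)
little_endian = struct.pack("<f", 3.5)

print(hex(bits), sign, exponent, hex(fraction), big_endian.hex(), little_endian.hex())
```

Writing an explicit byte order, as in to_be_bytes above, keeps serialized data portable between hosts with different native endianness.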

Performance, precision & determinism

  • Prefer f64 for numerically sensitive work; f32 for memory/throughput.

  • Floating arithmetic is not associative: (a+b)+c may differ from a+(b+c).

  • For reproducible results across platforms:

    • Avoid “fast-math” compilation for critical code.
    • Use fma, stable algorithms (e.g., Kahan summation), and fixed evaluation order.
  • Binary floats cannot exactly represent many decimals (e.g., 0.1). Compare with tolerances.

pub func approx_eq(a: f64, b: f64, eps: f64 = 1e-9): bool {
    ret (a - b).abs() <= eps * (1.0 + a.abs().max(b.abs()))
}
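Both the non-associativity and the tolerance comparison are easy to reproduce with binary64 values. The following sketch transcribes approx_eq into Python (not Loom) and checks it against two groupings of the same sum.

```python
def approx_eq(a: float, b: float, eps: float = 1e-9) -> bool:
    """Mixed absolute/relative tolerance, as in the Loom version above."""
    return abs(a - b) <= eps * (1.0 + max(abs(a), abs(b)))

left = (0.1 + 0.2) + 0.3     # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)    # 0.6

print(left == right, approx_eq(left, right))
print(0.1 + 0.2 == 0.3, approx_eq(0.1 + 0.2, 0.3))
```

The 1.0 term in the tolerance keeps the comparison meaningful near zero, where a purely relative bound would collapse.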

Examples

Kahan (compensated) summation

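pub func sum_kahan(xs: []f64): f64 {
    var s = 0.0                 # running sum
    var c = 0.0                 # compensation: low-order bits lost so far
    for x in xs {
        let y = x - c           # apply the pending correction to the next term
        let t = s + y           # low-order bits of y may be lost here
        c = (t - s) - y         # recover what was lost (zero in exact arithmetic)
        s = t
    }
    ret s
}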
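To see what the compensation buys, the sketch below ports the routine to Python (not Loom) and compares it with a plain loop on an input where every small term is individually below half an ulp of the running sum. math.fsum serves as the correctly rounded reference. An explicit loop is used for the naive sum because newer Python versions apply compensation inside the built-in sum.

```python
import math

def naive_sum(xs):
    """Plain left-to-right accumulation."""
    s = 0.0
    for x in xs:
        s += x
    return s

def sum_kahan(xs):
    """Kahan (compensated) summation, mirroring the Loom version above."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

xs = [1.0] + [1e-16] * 10_000
reference = math.fsum(xs)
naive = naive_sum(xs)
kahan = sum_kahan(xs)

print(naive, kahan, reference)
```

The naive loop never moves off 1.0 because each addend rounds away, while the compensated sum tracks the reference to within about an ulp.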

Fast, precise linear interpolation

# lerp(a, b, t) = a + t*(b - a), but fma avoids extra rounding
pub func lerp(a: f32, b: f32, t: f32): f32 {
    ret (b - a).fma(t, a)
}

Safe normalization with edge cases

pub func normalize(x: f64, y: f64): (f64, f64) {
    let d = f64::hypot(x, y)     # robust sqrt(x*x + y*y)
    if d == 0.0 { ret (0.0, 0.0) }
    ret (x / d, y / d)           # non-finite inputs are not handled and can yield NaN
}
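The reason to prefer hypot over a hand-written square root is that the intermediate squares can overflow or underflow even when the result is representable. The following Python sketch (not Loom) compares the two; naive_norm is a hypothetical helper.

```python
import math

def naive_norm(x: float, y: float) -> float:
    """sqrt(x*x + y*y); the squares can overflow or underflow."""
    return math.sqrt(x * x + y * y)

def normalize(x: float, y: float):
    """Python port of the Loom normalize above."""
    d = math.hypot(x, y)
    if d == 0.0:
        return (0.0, 0.0)
    return (x / d, y / d)

naive_big = naive_norm(3e200, 4e200)       # squares overflow to inf
robust_big = math.hypot(3e200, 4e200)      # finite

naive_tiny = naive_norm(3e-200, 4e-200)    # squares underflow to 0.0
robust_tiny = math.hypot(3e-200, 4e-200)   # nonzero

unit = normalize(3e200, 4e200)
print(naive_big, robust_big, naive_tiny, robust_tiny, unit)
```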

Constraining to a finite range

pub func clamp01(x: f32): f32 {
    if x.is_nan() { ret 0.0f32 }                  # define your policy
    ret x.max(0.0f32).min(1.0f32)
}

FAQs

Q: Why did my 0.1 + 0.2 become 0.30000000000000004? A: 0.1 and 0.2 are not exactly representable in binary floating point. Compare using a tolerance (see approx_eq).

Q: Should I store currency in floats? A: Prefer scaled integers (e.g., cents in i64) or decimal types. Use floats only for approximate calculations.

Q: When should I use f16? A: For dense arrays where memory/bandwidth matter and precision requirements are low (e.g., ML activations, approximate textures). Convert to f32/f64 for computation if needed.

Q: How do I sort values with NaNs? A: Use total_cmp(a, b); it defines a total order (e.g., NaN ordered after finite numbers).


See also

  • Integers: i8/i16/i32/i64, u8/u16/u32/u64
  • Complex numbers (if enabled by your profile)
  • Numerics & math utilities in the standard library