Ignore this post, dear reader.
Everything in it is correct, as far as it goes.
However, it has become abundantly clear that modern CPUs
allow mundane solutions to outperform those presented here.
This post will be revised accordingly, but will likely seem rather pointless.
I’m unsatisfied with Java’s typical bits-to-hex methods (*.toHexString and printf). That’s why—not for the first time—I was examining an issue generally considered settled or unimportant. I hope other malcontents find value in this work.
I created this nibble-to-hex conversion early in January 2025 (and, perhaps, one or more times in preceding years, without thinking twice about it). There is no telling how many programmers invented this before me, so I claim only its independent invention. That said, if these algorithms are new to you, I’d appreciate credit (a line comment will suffice). Even in January of 2026 (the time of publication), Gemini Pro was unaware of precedents for these algorithms, and it confirmed they are optimal. (No alternate approach uses fewer cycles, and no variant can simultaneously convert a sequence of nibbles.)
Returning to the subject…
Store a nibble in an “int” named “nibble” (its high 28 bits must be clear), then use one of the following expressions to obtain the corresponding ASCII/ISO-8859-1/Unicode code-point.
Nibble to Uppercase Hex Notation (3 Cycles)
(char) ((nibble | 0x30) + (9 - nibble >>> 29))
Nibble to Lowercase Hex Notation (4 Cycles)
(char) ((nibble | 0x30) + (9 - nibble >> 31 & 39))
That’s all there is to it. Java methods with detailed documentation, and C ports, are below the fold.
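Both expressions can be checked exhaustively, since there are only 16 valid inputs. The following is a minimal sketch of such a check; the class name "NibbleHexCheck" and the helper names "upper" and "lower" are hypothetical, chosen only for this illustration.

```java
final class NibbleHexCheck {
    // The uppercase expression from the text, wrapped in a hypothetical helper.
    static char upper(final int nibble) {
        return (char) ((nibble | 0x30) + (9 - nibble >>> 29));
    }

    // The lowercase expression from the text, wrapped in a hypothetical helper.
    static char lower(final int nibble) {
        return (char) ((nibble | 0x30) + (9 - nibble >> 31 & 39));
    }

    public static void main(final String[] args) {
        final String uc = "0123456789ABCDEF";
        final String lc = "0123456789abcdef";
        // Compare each expression against a plain lookup string for every valid nibble.
        for (int nibble = 0; nibble <= 0xF; nibble++) {
            if (upper(nibble) != uc.charAt(nibble) || lower(nibble) != lc.charAt(nibble)) {
                throw new AssertionError("mismatch at nibble " + nibble);
            }
        }
        System.out.println("all 16 nibbles match");
    }
}
```

The lookup strings serve only as a reference here; they are the kind of mundane alternative the expressions are meant to be compared against.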
Those cycle counts are correct for the x86 and ARM64 architectures. Both depend on immediate operands, but only x86 benefits from instruction-level parallelism. Unlike x86, the ARM64 SUB instruction does not support an immediate minuend, so loading operand 9 prevents replication of the x86 parallelism. Instead, ARM64 saves a cycle at the end by using a “shifted-add” instruction to perform the logical right shift and addition in one cycle. That, I gather, is a trick x86 cannot perform. So, both architectures show limitations here, but they cancel out.
(I spent many years programming in assembly on a daily basis, but that time has passed. I still look at microarchitecture documentation, but that does not produce an understanding of instruction sets and architectures comparable to someone who actually writes assembly code for the architectures. So, now, I double-check my conclusions with Gemini Pro.)
Java Method Implementations
private static char nibbleToHexDigitUc (final int nibble)
{
    // Repetition of the following expression may appear tedious, and invites mistakes,
    // hence this method. The expression is short enough that the JIT compiler can be
    // expected to inline it.

    /*
    Clause 1, "nibble | 0x30" (equivalent to "nibble + '0'" in US-ASCII), produces "int"
    value "c1", ranging from 0x30 ('0') for "nibble" 0x0, to 0x3F ('?') for "nibble" 0xF.

    Clause 2, "9 - nibble", produces "int" value "c2", which differentiates nibbles
    represented by US-ASCII characters '0'-'9' from those represented by 'A'-'F':
        (A) non-negative when "nibble" ≤ 9, or
        (B) negative when "nibble" ≥ 0xA.

    Clause 3, "c2 >>> 29", uses the high 3 bits of "c2" to produce "int" value "c3",
    the "c1" addend either converting US-ASCII characters ':'-'?' (produced from
    nibbles ≥ 0xA) to characters 'A'-'F', or leaving characters '0'-'9' unchanged:
        (A) 0 from "c2A", or
        (B) 7 from "c2B".
    Because "c3" is either 0, or a sequence of consecutive 1 bits, this clause is not
    adaptable to lowercase hex production, which would require a "c3B" value of 39
    (0b100111).

    Clause 4, "c1 + c3", produces "int" value "c4": the US-ASCII code-point of the
    hexadecimal digit representing "nibble".

    Note: Clauses 1 and 2 are independent of each other, so the expression allows
    them to execute in parallel (instruction-level parallelism). That will not provide
    a great performance gain, because every operator is single-cycle, and eliminating
    one cycle is good, but not great. Nonetheless, some parallelism is better than
    none, and executing 50% of its operators in parallel is better than most
    expressions.
    */

    //noinspection MagicNumber
    return (char) ((nibble | 0x30) + (9 - nibble >>> 29));
}

private static char nibbleToHexDigitLc (final int nibble)
{
    // Repetition of the following expression may appear tedious, and invites mistakes,
    // hence this method. The expression is short enough that the JIT compiler can be
    // expected to inline it.

    /*
    Clause 1, "nibble | 0x30" (equivalent to "nibble + '0'" in US-ASCII), produces "int"
    value "c1", ranging from 0x30 ('0') for "nibble" 0x0, to 0x3F ('?') for "nibble" 0xF.

    Clause 2, "9 - nibble", produces "int" value "c2", which differentiates nibbles
    represented by US-ASCII characters '0'-'9' from those represented by 'a'-'f':
        (A) non-negative when "nibble" ≤ 9, or
        (B) negative when "nibble" ≥ 0xA.

    Clause 3, "c2 >> 31", produces "int" value "c3":
        (A) 0 from "c2A", or
        (B) -1 from "c2B".
    Thus, "c3" is an all-or-nothing mask determining whether clause 4 produces an
    addend altering "c1", or 0 which leaves "c1" unchanged.

    Clause 4, "c3 & 39", produces "int" value "c4", the "c1" addend converting
    US-ASCII characters ':'-'?' (produced from nibbles ≥ 0xA) to characters 'a'-'f',
    while leaving characters '0'-'9' unchanged:
        (A) 0 when "nibble" ≤ 9, or
        (B) 39 when "nibble" ≥ 0xA.
    Value "c4B" makes this clause responsible for the letter case of hexadecimal
    digits > 9. As is, hexadecimal digits > 9 are lowercase. Uppercase digits would
    result from replacing 39 with 7 ... *except* there is no point to doing so, because
    a simpler, faster expression produces uppercase hex.

    Clause 5, "c1 + c4", produces "int" value "c5": the US-ASCII code-point of the
    hexadecimal digit representing "nibble".

    Note: Clauses 1 and 2 are independent of each other, so the expression allows
    them to execute in parallel (instruction-level parallelism). That will not provide
    a significant performance gain (every operator is single-cycle, so only one cycle
    is saved), but 40% parallelism is better than none.
    */

    //noinspection MagicNumber
    return (char) ((nibble | 0x30) + (9 - nibble >> 31 & 39));
}
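As a usage illustration, here is a minimal sketch of how the uppercase method might be applied to a whole "int". The class name "HexFormatSketch" and the helper "toHex8" are hypothetical, and the conversion method is repeated (non-private) only so the example is self-contained. The point of interest is the masking step, which guarantees the "high 28 bits clear" precondition.

```java
final class HexFormatSketch {
    // Same expression as the uppercase method in the text.
    static char nibbleToHexDigitUc(final int nibble) {
        return (char) ((nibble | 0x30) + (9 - nibble >>> 29));
    }

    // Hypothetical helper: renders all 32 bits of "value" as 8 uppercase hex digits.
    static String toHex8(final int value) {
        final char[] digits = new char[8];
        for (int i = 0; i < 8; i++) {
            // Move the wanted nibble into the low 4 bits (most significant first),
            // then mask so the high 28 bits are clear, as the expression requires.
            final int nibble = value >>> (28 - 4 * i) & 0xF;
            digits[i] = nibbleToHexDigitUc(nibble);
        }
        return new String(digits);
    }

    public static void main(final String[] args) {
        System.out.println(toHex8(0xCAFEBABE)); // prints CAFEBABE
        System.out.println(toHex8(255));        // prints 000000FF
    }
}
```

Unlike "Integer.toHexString", this sketch always emits leading zeros, which is a design choice of the example rather than a property of the nibble expression.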
Those are the implementations for Java, my language of choice since the late 1990s. If porting, keep in mind: (1) both expressions implicitly operate on 32-bit, two’s-complement integers throughout (byte order is irrelevant, because the shifts operate on values, not on memory layout), (2) operator “>>>” is an unsigned right shift (JLS terminology), otherwise known as a “logical right shift” (typical assembly language terminology), and (3) operator “>>” is a signed (“arithmetic”) right shift.
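To make the porting notes concrete, the sketch below isolates the two shift sub-expressions and shows what each produces on either side of the 9/10 boundary. The class and method names are hypothetical. It also relies on Java's operator precedence: subtraction binds tighter than the shift operators, so "9 - nibble >>> 29" parses as "(9 - nibble) >>> 29", and a port to a language with different precedence would need explicit parentheses.

```java
final class ShiftDemo {
    // Logical (unsigned) shift: keeps only the top 3 bits of (9 - nibble).
    static int logicalTop3(final int nibble) {
        return 9 - nibble >>> 29;
    }

    // Arithmetic (signed) shift: replicates the sign bit of (9 - nibble) across all 32 bits.
    static int arithmeticMask(final int nibble) {
        return 9 - nibble >> 31;
    }

    public static void main(final String[] args) {
        System.out.println(logicalTop3(9));     // prints 0
        System.out.println(logicalTop3(10));    // prints 7
        System.out.println(arithmeticMask(9));  // prints 0
        System.out.println(arithmeticMask(10)); // prints -1
    }
}
```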
C Ports (C99 and C23)
C, and, I think, most C-inspired languages, lack an equivalent of Java’s “>>>” operator. (Originally, C didn’t specify the sign-extension behavior of its right shift operator, >>, so one might encounter either behavior depending on the CPU and the compiler, although arithmetic [sign-extending] right shifts seem to have been the most common.) Although I spent 15 years programming in C (now designated “K&R C” or “C78”), I never encountered any of C’s standardized versions, and my only C documentation remains the 1978 edition of K&R. However, I gather C99 specifies a logical right shift for unsigned integers, so the ports below are deterministic, unless one builds for a CPU that doesn’t use two’s-complement arithmetic on signed integers (apologies to Unisys developers).
Nibble to Uppercase Hex Notation, C99, 3 Cycles
Where “nibble” is of type “int32_t”, its high 28 bits are clear, and two’s-complement arithmetic remains the norm:
(nibble | 0x30) + ((uint32_t) (9 - nibble) >> 29)
Nibble to Lowercase Hex Notation, C99, 5 Cycles
Where “nibble” is of type “int32_t”, its high 28 bits are clear, and two’s-complement arithmetic remains the norm:
(nibble | 0x30) + ((-(int32_t) ((uint32_t) 9 - nibble >> 31)) & 39)
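The C99 lowercase port gets by without an arithmetic shift: a logical shift by 31 yields 0 or 1, and negating that yields the 0 or -1 mask. As a rough cross-check, the same idea can be written in Java and compared against the original lowercase expression. This is only a sketch; the class and method names are hypothetical, and Java's “>>>” stands in for C's shift of a “uint32_t”.

```java
final class NegatedShiftSketch {
    // Java rendering of the C99 approach: logical shift to 0 or 1, negate to 0 or -1, then mask.
    static char lowerViaNegation(final int nibble) {
        return (char) ((nibble | 0x30) + (-(9 - nibble >>> 31) & 39));
    }

    // The original Java lowercase expression, for comparison.
    static char lowerViaArithmeticShift(final int nibble) {
        return (char) ((nibble | 0x30) + (9 - nibble >> 31 & 39));
    }

    public static void main(final String[] args) {
        for (int nibble = 0; nibble <= 0xF; nibble++) {
            if (lowerViaNegation(nibble) != lowerViaArithmeticShift(nibble)) {
                throw new AssertionError("mismatch at nibble " + nibble);
            }
        }
        System.out.println("both forms agree for all 16 nibbles");
    }
}
```

The negation is the extra operator that accounts for the additional cycle in the C99 count.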
Nibble to Lowercase Hex Notation, C23, 4 Cycles
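Where “nibble” is of type “int32_t”, its high 28 bits are clear, and the compiler implements “>>” on negative signed values as an arithmetic shift (C23 mandates two’s-complement representation, but the result of right-shifting a negative value remains implementation-defined):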
(nibble | 0x30) + (9 - nibble >> 31 & 39)
