
Endianness, a constant source of conflict for decades

rkangel 2021-08-18 09:57:36 +0000 UTC [ - ]

> Most network protocols and formats until recently have been big endian (Network Byte Order).

This misses the original reason for it: big endian is more convenient if you're reading the value into a shift register. You don't need to know in advance how big the value is, because you just shift the contents left each time you get a new byte, and you end up with the appropriate zero-padded value.

Basically for a period of history, big endian was easier to implement for comms and little endian for processors. The big endian comms reasons have generally died out while the processor ones remain.
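
In C, that shift-register style of reading looks something like this (a minimal sketch; the function name is invented):

  #include <stdint.h>
  #include <stddef.h>

  /* Accumulate a big-endian value as it arrives on the wire: shift
   * what we have so far left by one byte, OR in the new one. Works
   * for any width up to 8 bytes with no advance knowledge. */
  uint64_t shift_in_be(const uint8_t *bytes, size_t len)
  {
      uint64_t acc = 0;
      for (size_t i = 0; i < len; i++)
          acc = (acc << 8) | bytes[i];
      return acc;
  }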

Network byte order (big endian in communication) is a very strong convention though. If you define a network protocol and use little endian then I will be very sad.

AnIdiotOnTheNet 2021-08-18 13:58:11 +0000 UTC [ - ]

Why? If I'm designing my own protocol I could use middle-endian EBCDIC encoded octal and it shouldn't make any difference to anything else.

rkangel 2021-08-18 14:13:50 +0000 UTC [ - ]

Are you building every implementation of the protocol? Or are you publishing a communications standard for people to implement?

If I'm building a network stack and at every layer I'm pulling values out of the header in network byte order, except in one special case where I have to pull them the other way (for no good reason) then that's the protocol author's fault.

If there is a good reason for using middle-endian EBCDIC (compatibility or technical) then fine. If not, then please use the thing that's the strong convention and therefore simplest to work with.

dahfizz 2021-08-18 17:55:58 +0000 UTC [ - ]

As long as you don't want anyone to adopt the protocol, then go ahead.

ghusbands 2021-08-18 10:50:39 +0000 UTC [ - ]

Big-endian is how we write each byte in a hex editor, and so big-endian should be more intuitive. If we have the number 0xDEADBEEF and view it in a memory dump, a big-endian system gives:

  DEADBEEF
And a little-endian system gives:

  FEBEADDE
Only one of those reads naturally, as long as your increasing memory order goes from left to right, as is the norm.

zepearl 2021-08-18 20:09:52 +0000 UTC [ - ]

This might be the most stupid question of 2021; anyway, here it is:

but in little-endian systems the order of the individual bits in a byte is still (kind of) big-endian, with the most significant bit on the left, right?

Personally I always think first about how the single bits are stored in a byte, which is, as far as I know, with the most significant bit on the left, independent of whether the system is big- or little-endian, for example:

===

0: 0000 0000

1: 0000 0001

2: 0000 0010

...

255: 1111 1111

===

Therefore, if the number/value gets bigger than a byte, for me it's just more natural to keep adding bits to the left of the sequence (big-endian style), for example:

===

256: 1 0000 0000

257: 1 0000 0001

...

===

ghusbands 2021-08-18 21:21:41 +0000 UTC [ - ]

That amounts to the same question - the way we write binary numbers is the same way we write hexadecimal numbers, decimal numbers or numbers in any base; we start with the most significant. This makes the natural form for LTR languages big-endian.

It has no bearing on the layout of bits in the RAM chips, caches, register files or data/address lines in a computer. Endianness pretty much only affects the operation of transferring values from/to memory.

zepearl 2021-08-18 21:55:29 +0000 UTC [ - ]

Well, I can change the way I think/do stuff (I can write "1234" or "4321", no problem), but what I meant was basically "for me to make sense, byte-endianness should be aligned with bit-endianness"...

To me it would make sense to use little-endian if the same were applied at the level of "bits" within a byte as well... Using little-endian on a byte level but big-endian on a bit level (within a byte) is quite confusing to me. Just my subjective opinion.

yakubin 2021-08-18 23:39:21 +0000 UTC [ - ]

> but in little-endian systems the order of the individual bits in a byte is still (kind of) big-endian, with the most significant bit on the left, right?

Single bits are not addressable, so from the point of view of a programmer "order of bits" isn't well-defined. There isn't a way to distinguish endianness of bits for a programmer.

AnIdiotOnTheNet 2021-08-18 13:54:37 +0000 UTC [ - ]

> FEBEADDE

0xDEADBEFE?

api 2021-08-18 16:26:49 +0000 UTC [ - ]

Little-endian is superior at the system level largely because casts between integer sizes are free. Casting a 32-bit LE value to an 8-bit value is just reinterpreting what the pointer means. This is a fairly common operation.
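
A small C illustration of that free narrowing (an added sketch, assuming a little-endian host; on a big-endian machine the same reread would return the high byte instead):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint32_t wide = 0x11223344;
      /* On a little-endian host the first byte of the object is the
       * least significant, so narrowing is just rereading the same
       * address. Reading through uint8_t* keeps aliasing rules happy. */
      uint8_t narrow = *(const uint8_t *)&wide;
      printf("0x%02x\n", narrow);  /* 0x44 on LE, 0x11 on BE */
      return 0;
  }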

Big-endian is pretty much dead. I am not aware of any BE systems on sale right now outside small MIPS routers and embedded. If the Internet were designed today network byte order would probably be little-endian. New cryptographic algorithms are being designed little-endian-first.

classichasclass 2021-08-18 23:35:17 +0000 UTC [ - ]

OpenPOWER systems (POWER8 and up) swing both ways. That said, my POWER9 Raptor Talos II runs little endian, though it is completely capable of running big (and at the low OPAL level in fact does), and some operating systems on it support that.

swiley 2021-08-18 16:47:25 +0000 UTC [ - ]

It also hides errors that come from casting between different pointer types.

api 2021-08-18 17:15:50 +0000 UTC [ - ]

That's the job of the language/compiler. Rust does a fine job of this.

recursive 2021-08-18 16:21:16 +0000 UTC [ - ]

Any significance to `BE`? It's the only byte that remains untouched.

PennRobotics 2021-08-18 09:00:46 +0000 UTC [ - ]

For those unfamiliar with German, they read numbers aloud in a unique way. 65536 is "sixty five thousand five hundred six and thirty" (Edit: this is inaccurate). I present German-endian as a common enemy!

-----

At least for me, reading 16- and/or 32-bit multi-channel sensor data and then transceiving via 8-bit radio (atmega32u4) is at least a little easier when all the endians align. No byte swapping. No jumping a pointer ahead for each byte in a packet. Most importantly, no calling swapbytes (and its relevant data casts) in Matlab reading the serial data directly off the microcontroller.

While this is all relatively straightforward to code, each extra line is another tiny chance of error, and the failure modes in streaming data are not always obvious e.g. FIFO size is a power of two but channel data is an odd number of bytes/words. Do you drop data or pause until FIFO empties? Do circular buffers increment or decrement?

Luckily, modern sensors (e.g. IMUs) I've used usually have registers to byte swap, drop LSBs when not needed, choose right or left zero bit padding, change channel order, alter FIFO behavior, interrupt at a FIFO threshold, and so on.

JoachimS 2021-08-18 09:17:03 +0000 UTC [ - ]

Getting a bit off topic I guess, but in Danish you do the same AND also have base 20 for numbers over 40 and less than 100. So, for example (roughly translated), 65 is "five and half of thirty". And even harder, for odd tens you get them by taking half of the base 20 number. So 55 is "five and half of thirty".

https://www.babbel.com/en/magazine/counting-in-danish

kzrdude 2021-08-18 10:04:58 +0000 UTC [ - ]

Maybe 65 should be explained as being said like "five and threesh", where "threesh" is implicitly/shortened for "three times 20". So we say 5 + 3 x 20, but in a terse encoding.

Yes, 55 is "half threesh" in the same way; that is, we say 5 + half 3 x 20, but half 3 == 2.5, of course!

JoachimS 2021-08-18 11:48:16 +0000 UTC [ - ]

Much better explanation, thanks!

dandellion 2021-08-18 09:56:08 +0000 UTC [ - ]

> 65 is "five and half of thirty". And even harder (...) 55 is "five and half of thirty".

So you call any number "five and half of thirty"? That sounds pretty easy to be honest.

JoachimS 2021-08-18 11:49:17 +0000 UTC [ - ]

Oops. I meant "five and thirty" for 65, and "five and half thirty" for 55.

agent327 2021-08-18 10:21:20 +0000 UTC [ - ]

Isn't it "five and sixty thousand five hundred six and thirty"? At least that's how it is said in Dutch, and I believe German works the same way...

PennRobotics 2021-08-18 10:25:24 +0000 UTC [ - ]

It is, and what a wonderful example of how endianness can cause mistakes.

TacticalMalice 2021-08-18 09:08:56 +0000 UTC [ - ]

"five and sixty thousand five hundred six and thirty", right?

Dutch is similar and this is a source of mistakes when writing down (phone) numbers. I've resorted to calling out the digits in LTR order.

PaulIH 2021-08-18 10:47:35 +0000 UTC [ - ]

Norway used that as the standard order until 1951, when an official reform changed the language to LTR. This was due to the older way of stating numbers causing confusion when reading phone numbers and similar. It's still not universal, but younger generations now generally state numbers left to right.

BoxOfRain 2021-08-18 09:16:33 +0000 UTC [ - ]

There are examples of it in English too, although they're very old-fashioned. An example would be the rhyme with “four and twenty blackbirds baked in a pie”.

junon 2021-08-18 09:14:41 +0000 UTC [ - ]

Correct. The grouping matters, and double-digits in groupings are also reversed - "fünf und sechzig tausend fünf hundert sechs und dreißig".

Another thing, years in German aren't spoken as "twenty twenty-one" as we commonly do in English, but instead the number is spoken out fully - "two-thousand one and twenty" ("zwei tausend ein und zwanzig").

twic 2021-08-18 09:53:28 +0000 UTC [ - ]

In England, the little voice in Google Maps told me to take the "B one thousand, one hundred and thirteen", when every human I know would call that road the "B one one one three".

darrenf 2021-08-18 16:58:10 +0000 UTC [ - ]

I would honestly most likely say "triple-one three"!

IME road number† pronunciation in England goes out of its way to avoid "thousand" and "hundred" – with exceptions, of course. Off the top of my head, I reckon I say them like this:

* one or two digits = spoken as the number rather than the digits: A three, M twenty five, etc.

* three digits = sometimes spoken as digits: A two-one-seven - but sometimes broken into two numbers: B one-eleven

* four digits = sometimes the number A thirty-one-hundred (never three-thousand-one-hundred!), sometimes digits B triple-one three, sometimes year-style B thirteen eighteen

There are probably more variations that I can't think of right now too. It's a mess :D

† and bus route numbers, for that matter

skydhash 2021-08-18 11:45:45 +0000 UTC [ - ]

In French, it's faster to say « Cent treize » ("one hundred thirteen") than « Un, Un, Trois » ("one, one, three"). Phone numbers in my country are grouped by two.

PennRobotics 2021-08-18 10:33:31 +0000 UTC [ - ]

This is now totally off-topic, but I'd like to know if there is any Googler at all working on adding location tags to their text-to-speech model.

Hearing Google Assistant/Maps mispronounce German street or city names in an American accent is very grating to the ears. The pronunciation of a location name should ignore the language spoken, right? (Ignore for a moment the edge cases, like München vs Munich... although the voice says, "Munchin'," which is wrong in both languages!) And it can't be too complicated to borrow phonemes from another language where they don't exist... right? When your American text-to-speech algorithm encounters an umlaut, it should generate the correct waveforms from a language that has umlauts.

(I'm sure someone reading this is jumping up and down, yelling about the "photo of a bird" xkcd.)

Akronymus 2021-08-18 10:53:30 +0000 UTC [ - ]

Not really comparable to photo of a bird, because using the geographic bounds for what language is spoken there should work in 99.99% of cases.

(I have my phone set to English, because I prefer it like that, despite living in Austria, Europe. Street names are one of the reasons I rarely ever use Google Maps for navigation)

simtel20 2021-08-18 10:56:22 +0000 UTC [ - ]

It's a hard problem in a way. If my language is localized to English, am I more likely to understand the native pronunciation of a street, or the English mispronunciation?

PennRobotics 2021-08-18 16:19:09 +0000 UTC [ - ]

That's true. To extend this idea, should you pronounce someone's name as they pronounce it? Even if you've only known it one way?

(The German Michael is kinda... Michh-aye-ehl'.)

maxerickson 2021-08-18 11:04:56 +0000 UTC [ - ]

For map information, the map app might better tag the language of the word being sent to the TTS engine.

PennRobotics 2021-08-18 10:24:37 +0000 UTC [ - ]

Ah right. My mistake. It would be five and sixty thousand ... Yuck!

I guess this is exactly what we're talking about: mistakes because you are not natively familiar with a particular system, and then you miss the non-base case. For me, I got the tens digit right but not the ten thousands digit.

In the memory case, it's knowing to adjust a pointer, because a 32-bit value will start or end at a different address than a 64-bit value would.

agent327 2021-08-18 10:22:24 +0000 UTC [ - ]

I totally hate people who repeat a phone number back to you, but with different digit grouping. How the hell am I supposed to know if that's the same number!? Just repeat it as I said it already...

__del__ 2021-08-18 16:12:57 +0000 UTC [ - ]

numbers grouped in twos are great for mnemonic memorization. you'll easily come up with an association for many two digit numbers.

ex. 415-222-9670 becomes: sub universal (one less than the answer to life, the universe and everything) deck (52 cards) swift (she's feeling 22) resolution (old dpi on windows) top speed (California speed limit)

now isn't "sub universal deck swift resolution top speed" easier than googling twitter hq? ;] granted, the associations have to make sense to you. for me, 96 was a toss up between nashville (code name of windows 96) and the resolution i had to train myself to remember after moving from the mac's 72.

skerit 2021-08-18 16:10:38 +0000 UTC [ - ]

I just can't write down phone numbers when people pronounce them that way. "Nul vierhonderdvijfenzeventig tweeëntachtig zesendertig eenennegentig" ("zero four-hundred-five-and-seventy two-and-eighty six-and-thirty one-and-ninety")? You lost me at nul.

thaumasiotes 2021-08-18 16:04:34 +0000 UTC [ - ]

> For those unfamiliar with German, they read numbers aloud in a unique way. 65536 is "sixty five thousand five hundred six and thirty" (Edit: this is inaccurate). I present German-endian as a common enemy!

What's unique about that?

Sing a song of sixpence, a pocket full of rye

Four and twenty blackbirds baked in a pie

When the pie was opened, the birds began to sing

Wasn't that a dainty dish to set before the king?

8ytecoder 2021-08-18 16:32:10 +0000 UTC [ - ]

Based on what I have read in English novels of the 19th century (like Sherlock Holmes), I'd assume "six and thirty" was common in English as well.

ithkuil 2021-08-18 11:39:10 +0000 UTC [ - ]

> Binary dumps look more in line with how humans with left-to-right scripts expect to read numbers.

I remember the VMS EXAMINE command, but also printed DEC manuals showing hex dumps where the columns were numbered RTL. The ASCII view on the right half of the hex dump followed LTR order, so the two views were basically mirrored. (see example at http://www0.mi.infn.it/~calcolo/OpenVMS/ssb71/4556/4556p004....)

With such a rendering, little endian does indeed look natural.

We're doing it all the time when rendering bit positions:

bit pos: 3210

bit val: 1100

Extending this layout to byte indices is quite natural indeed.

Miiko 2021-08-18 07:31:40 +0000 UTC [ - ]

Probably because English is not my native language, but it has always looked to me like the names are backward:

* "big-endian" should have "big" (most significant) part on the end

* and "little-endian" should have "little" (least significant) bits at the end

Are there different mnemonics to remember which is which?

yetihehe 2021-08-18 07:35:14 +0000 UTC [ - ]

Try the original source [0]. Big-endian, because some Lilliputians eat eggs starting from the big end; little-endian, because some start from the little end. Works for me.

[0] https://en.wikipedia.org/wiki/Endianness#Etymology

Miiko 2021-08-18 07:45:14 +0000 UTC [ - ]

Thanks! Now that makes sense. Indeed, "endian" refers not to what's on the end, but to which end we start writing the number from.

kevin_thibedeau 2021-08-18 08:28:05 +0000 UTC [ - ]

The little end is at the lower address for LE.

yetihehe 2021-08-18 09:55:16 +0000 UTC [ - ]

Unless you show your memory layout in a picture where address 0 is on top.

kevin_thibedeau 2021-08-18 15:20:55 +0000 UTC [ - ]

Still numerically lower.

anyfoo 2021-08-18 16:43:53 +0000 UTC [ - ]

It's not: a 16-bit LE value at address 0 will have the low byte at address 0 and the high byte at address 1. Unless I misunderstood what you meant by "little end"? (But then there is still room for confusion in that term, apparently.)

kevin_thibedeau 2021-08-18 17:26:27 +0000 UTC [ - ]

The little end is always at a lower address. It doesn't matter which way you draw the number line.

anyfoo 2021-08-18 17:37:44 +0000 UTC [ - ]

Ah, I see what you mean now. We are in agreement after all.

kazinator 2021-08-18 08:03:01 +0000 UTC [ - ]

Under little endian, digits of the same power in differently sized operands sit at the same offsets. For instance, whether we have a 16 bit operand or a 128 bit operand, the least significant 8 bits of either one are at the lowest address, and so on. This is important if we want to, say, add them together.
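
A small C sketch of that offset invariance (my own illustration, assuming a little-endian host; on big endian the assert fires):

  #include <stdint.h>
  #include <string.h>
  #include <assert.h>

  int main(void)
  {
      uint16_t small = 0xBEEF;
      uint64_t large = 0x123456789ABCBEEFULL;

      /* On little endian the least significant byte of both operands
       * sits at offset 0, regardless of operand width. */
      uint8_t s0, l0;
      memcpy(&s0, &small, 1);
      memcpy(&l0, &large, 1);
      assert(s0 == 0xEF && l0 == 0xEF);
      return 0;
  }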

However, this effect can hide bugs under little endian, which will instantly reproduce on big endian.

Suppose that, say, a function expects a 32 bit parameter, but the caller thinks it is passing a byte, whose value is XX. Suppose that by fluke the memory is all zeros. Under little endian, the caller puts XX at the right memory location in the stack, resulting in XX 00 00 00. And, by golly, the callee gets the correct 32 bit value XX.

Under big endian, even if by fluke the memory is all zeros, the caller will put the XX byte resulting in the same XX 00 00 00. But this now looks like a huge 32 bit value to the callee, hopefully caught in testing.

Under little endian, the apparently correct value will not be caught in testing.

Little endian would need nonzero values in the extra bytes instead of the fluky zeros in order to see a bad value.
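
One way to see the effect without invoking real undefined behavior (an added sketch, not kazinator's code): store a single byte into zeroed memory and decode the four bytes both ways.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      uint8_t mem[4] = {0, 0, 0, 0};  /* the "fluke" all-zero memory */
      mem[0] = 0x42;                  /* the caller stores only one byte: XX */

      /* Decode the same four bytes under each convention. */
      uint32_t le = (uint32_t)mem[3] << 24 | (uint32_t)mem[2] << 16
                  | (uint32_t)mem[1] << 8  | mem[0];
      uint32_t be = (uint32_t)mem[0] << 24 | (uint32_t)mem[1] << 16
                  | (uint32_t)mem[2] << 8  | mem[3];

      printf("LE callee sees 0x%08x (looks correct)\n", le);   /* 0x00000042 */
      printf("BE callee sees 0x%08x (obviously huge)\n", be);  /* 0x42000000 */
      return 0;
  }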

rwmj 2021-08-18 11:57:40 +0000 UTC [ - ]

When POWER 7 moved to LE, little endian essentially won [edit: for CPUs, not for network protocols]. The only CPU architecture we support that is still big endian is s/390.

classichasclass 2021-08-18 23:37:15 +0000 UTC [ - ]

The little endian mode in POWER7 was not frequently used and I believe has limitations. It's officially a thing for POWER8 and up.

That said, my POWER9 runs little, though it's perfectly capable of running big.

scratcheee 2021-08-18 09:25:03 +0000 UTC [ - ]

>[big endian] Binary dumps look more in line with how humans with left-to-right scripts expect to read numbers.

Maybe I'm just dumb, but surely the actual language order is entirely irrelevant? If hypothetically we wrote in English right-to-left instead, then we'd write our numbers right-to-left, and our memory dumps right-to-left, so then we'd find that little-endian caused the data to start with the smallest byte first (on the right).

Mirroring the language doesn't undo a mirroring within the language, and that's what little-endian is, so a more accurate statement would be:

>Binary dumps look more in line with how humans with single-direction scripts expect to read numbers.

cies 2021-08-18 09:45:46 +0000 UTC [ - ]

> If hypothetically we wrote in English right-to-left instead, then we'd write our numbers right-to-left

If you read the article the author shows you that in RTL langs (where our current number system originated from) the numbers were also RTL. We just stuck to the convention.

Interesting how this little bit of RTL snuck into Europe's otherwise LTR languages, to the extent that when I type a number into a spreadsheet it changes the alignment in line with this... Interesting/insightful article!

thaumasiotes 2021-08-18 16:20:16 +0000 UTC [ - ]

> If you read the article the author shows you that in RTL langs (where our current number system originated from) the numbers were also RTL. We just stuck to the convention.

Well, no, the author says this:

> Our modern numbering system has its roots in the Hindu numbering system, which was invented somewhere between the 1st and 4th century. Like the dominant writing system of the time, numbers were written right-to-left

This is not obvious - it appears that there was a right-to-left Indic script centered around Pakistan ( https://en.wikipedia.org/wiki/Kharosthi ) and a left-to-right one ( https://en.wikipedia.org/wiki/Brahmi_script ) farther south / east.

Before that, Sanskrit was written from left to right. It seems far more likely to me in any event that the order in which numbers are written, when the system is innovated, will reflect the order in which they are spoken in whatever language, not directly the order in which the language is written down.

Over time they will always develop a big-endian order, because that allows sorting them.

renox 2021-08-18 13:46:38 +0000 UTC [ - ]

The thing to remember is that things start with 'oral/spoken' numbers, then written numbers.

When we speak we don't really use big endian or little endian; we say 'six thousand one hundred', not 'six one zero zero'. And obviously we prefer to start with the 'big' part because it's the most important for the listener: I don't really care if your price is '6 thousand and one' or '6 thousand and two'; the listener hears '6 thousand' and then switches off.

And then written language followed oral language, of course...

swiley 2021-08-18 08:24:35 +0000 UTC [ - ]

Yes, we should have stuck with Big-endian.

dataflow 2021-08-18 08:31:11 +0000 UTC [ - ]

Hexadecimal data looks very confusing with big endian. 01234567 somehow becomes equal to 4567 0123 and 67 45 23 01... pretty darn out-of-order and unnatural.

Lvl999Noob 2021-08-18 08:46:44 +0000 UTC [ - ]

Isn't that the little endian representation? I thought big endian starts with the most significant bit.

Here, I think 0 would be the most significant nibble and would be written at the left most point.

dataflow 2021-08-18 09:00:58 +0000 UTC [ - ]

Sorry, I should clarify. I meant what you would see in a hex editor would turn out like that. i.e. if you had the byte sequence 67 45 23 01, and decided to display it in 2-byte words, LE could just display 6745 2301; you'd know the first byte is still the least significant regardless of the grouping, and you can just regard the spacing as a visual grouping aid (pretty natural). If you tried e.g. "Go To -> Offset 3", you'd still land on the 01, just as if you went right by 3 positions visually... pretty intuitive. Compare that with BE, where it'd have to show you 4567 0123, and suddenly if you Go To -> Offset 3 ('01'), you'd land to the left of offset 2 ('23'), which to me seems super confusing.

vardump 2021-08-18 09:09:07 +0000 UTC [ - ]

You're getting your endians confused.

Bytes 67 45 23 01 interpreted as 16-bit words in:

Little endian, LE, or what x86 uses: 4567 0123

Big endian, BE, or what 68k uses: 6745 2301

dataflow 2021-08-18 09:55:52 +0000 UTC [ - ]

I'm not saying the same thing you think I'm saying, but I do realize I'm definitely explaining what I mean poorly... likely because hex editors themselves have multiple behaviors in this regard. Some merely use spaces to separate bytes visually, others actually group the bytes and parse them as integers, swapping them as needed to read like human writing.

Let me try illustrating a different situation, hopefully without that confusion.

Let's say your data starts with the byte sequence 67 45 23 01...

If you assume these represent some LE numbers, and want to multiply by 256 (decimal), you end up with 00 67 45 23 01... it really doesn't matter (and you don't need to know) what the word sizes were. That's the only sane result, and the byte at offset 2 would end up being 45h... end of story. Even if your number was only supposed to be N bytes and now it's N+1 bytes, you can just chop it back to N bytes and your result will still be correct (modulo 256^N) and as intuitive as it could be.

But if you start working in BE, suddenly things get confusing fast. Imagine what this operation would be for 2-byte BE words. The first word in BE is 6745 and now becomes 674500, and you overflowed by 1 byte. So which part do you keep and which part do you overflow to the next word? If you keep the 6745, then the 00 ends up affecting the second word rather than the first one, which is just completely nonsensical. The other option is to keep 4500 and and shove the 67 into the next word, turning it from 2301 into 230167. Now you have to repeat the same procedure with the 23, etc. until you reach the end of the data.

Now look at what just happened in the BE case. You have the bizarre situation where your words are internally BE (i.e. "go to offset 0" would now land on the 00 byte, which are not the first 2 characters in the editor!). And across words, they're still treated like LE—the inter-word overflows are still moving bytes to higher-order words on the right, not the left! There's just no sane way to do math with N-byte words and avoid LE entirely; even if you treat each word as BE, you're absolutely forced to treat the word sequence as LE. The only way to actually avoid all LE is to interpret the whole thing as 1 gigantic bignum, where "multiply by 256" ends up being translated into "append 00 to the end of the stream". That's great if your data really was 1 gigantic bignum, but not so much if your data was just typical ints or longs.

If this is still confusing (I realize it might be) then I'm not sure how else to put my thoughts into words unfortunately (no pun intended). Hopefully you can kind of see what I'm getting at though, even if I'm explaining some portions of it poorly (sorry).
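
If a sketch helps, here is the LE multiply-by-256 described above in C (an added illustration; the function name is invented): every byte moves one slot toward the significant end, and the word size never comes into it.

  #include <stdint.h>
  #include <string.h>

  /* Multiply a little-endian byte string in place by 256: shift all
   * bytes up one slot and put 0x00 in at offset 0. The old top byte
   * falls off, i.e. the result is reduced modulo 256^len. */
  void mul256_le(uint8_t *buf, size_t len)
  {
      if (len == 0)
          return;
      memmove(buf + 1, buf, len - 1);
      buf[0] = 0x00;
  }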

nybble41 2021-08-18 19:29:13 +0000 UTC [ - ]

> Some merely use spaces to separate bytes visually, others actually group the bytes and parse them as integers, swapping them as needed to read like human writing.

In other words, your problem is that your hex editor is incorrectly assuming that the numbers are little-endian. If it interpreted them as big-endian then nothing would be swapped since the order of the bytes matches the standard conventions for numbers in most European writing systems. This is not a big-endian problem, it's a little-endian problem. A decent hex editor will allow you to set the byte order to match your file. (And as the article points out, the issue would be reversed if the bytes were displayed right-to-left following the same conventions that we use for numbers… but then all your strings would be reversed.)

Both representations have strengths and weaknesses depending on what you want to do. Most arbitrary-precision math works better with LE. On the other hand, hexadecimal string formatting works better with the BE encoding, where LE would require either the input or output to be reversed.

dataflow 2021-08-18 20:12:12 +0000 UTC [ - ]

No, this has nothing to do with the editor or bignums. I'm just using those to illustrate the underlying inconsistency.

The most natural order/sequence out there is that of natural numbers with zero (aka whole numbers), i.e. N0 = 0, 1, 2, 3, ...

So I'm basically arguing that the most natural layout is the one where the digit places follow that same order, i.e. the coefficients are ordered as 256^N: 256^0, 256^1, 256^2, 256^3, ...

nybble41 2021-08-18 23:53:46 +0000 UTC [ - ]

I was addressing your first point about how the data is displayed in hex editors. The rest IMHO is just fixating on one particular operation ("multiplying by 256" by prepending bytes without regard for the data format) which does not seem particularly useful or prudent.

Sure, in LE you can prepend a 00 byte to multiply by 256. In BE you do the same by appending a 00 byte at the end, and a 00 at the beginning would just be extra padding, exactly like adding a 0 at the start or end of a decimal number (which are also typically written in big-endian notation). In either case you need to allow for the change in the overall length or risk losing data and probably corrupting the rest of the data file in the process. If your data is actually groups of 16-bit or 32-bit elements rather than bytes then you need to add a whole number of elements for this to make any sense at all, regardless of byte order. You can't have an arbitrary-precision integer consisting of 5½ elements.

Removing a byte from the end to preserve the length corresponds to reducing modulo 256^N in LE notation, or dividing by 256 in BE; which is "more natural" would depend on the context.

For most arbitrary-precision math operations LE is easier to work with. The only thing making BE more "natural" in some situations (in particular, reading numbers out of a byte-oriented hex listing) is the historical accident that we borrowed our number system from right-to-left languages, where they were written little-endian with the one's place on the right, without reversing the order to match the surrounding text. Which leads to the alignment issues highlighted in the article.

dandanua 2021-08-18 08:41:53 +0000 UTC [ - ]

Nice explanation and examples. I also wrote about this problem in a broader context of mathematics and quantum computing [1] (it's not displayed nicely in Firefox for some reason).

[1] https://github.com/dandanua/little-endian-vs-big-endian-in-q...

baybal2 2021-08-18 09:19:37 +0000 UTC [ - ]

Big endian, little endian: all pale in comparison to the horror of mixed endian that people have to deal with in embedded.

bitwize 2021-08-18 07:24:59 +0000 UTC [ - ]

Little endian won. You almost don't have to worry about a new piece of software running on a big-endian machine.

iainmerrick 2021-08-18 08:38:15 +0000 UTC [ - ]

Apart from most file formats and internet standards being big-endian, you mean?

Although to borrow from minusf’s point, it’s good for software robustness that file formats and hardware use different endianness, as it forces you to read things byte-by-byte rather than lazily assuming you can just read 4 bytes and cast them directly to an int32.
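
That byte-by-byte style, sketched in C (an added example; the helper name is mine): it decodes a big-endian field correctly on any host, with no casts and no alignment assumptions.

  #include <stdint.h>

  /* Portable decode of a 32-bit big-endian field from a packet or
   * file buffer. Works the same on LE and BE hosts, aligned or not. */
  static uint32_t read_u32_be(const uint8_t *p)
  {
      return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
           | (uint32_t)p[2] << 8  | (uint32_t)p[3];
  }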

walki 2021-08-18 08:48:48 +0000 UTC [ - ]

> it’s good for software robustness that file formats and hardware use different endianness, as it forces you to read things byte-by-byte rather than lazily assuming you can just read 4 bytes and cast them directly to an int32.

Except that it is very bad for performance. As far as CPUs are concerned little-endian has definitely won, most CPU architectures that have been big endian in the past (e.g. PowerPC) are now little endian by default.

If all new CPU architectures are little endian this means that within a decade or two there won't be any operating systems that support big endian anymore.

classichasclass 2021-08-18 23:39:48 +0000 UTC [ - ]

Power ISA is actually still big by default, and even on little-endian capable systems starts big endian until changed by the OS. Low-level OPAL calls on OpenPOWER systems are made big endian, even if the OS is little (the OS has to switch the processor mode).

iainmerrick 2021-08-18 09:11:43 +0000 UTC [ - ]

For performance, can’t you “just” have a swap-endianness instruction in your CPU, and have the compiler use it when it detects byte-shuffling code?

(That may even happen already on some architectures for all I know)

walki 2021-08-18 09:26:58 +0000 UTC [ - ]

> For performance, can’t you “just” have a swap-endianness instruction in your CPU

Yes, most CPUs have special instructions for swapping between little and big endian byte arrangement. The GCC compiler provides __builtin_bswap64(x) for accessing this instruction. However this is an additional instruction that needs to be executed for each read of a 64-bit word that needs to be converted; in some workloads this can double the number of executed instructions and hence add significant overhead.
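
A minimal example of that builtin in use (an added sketch, assuming a little-endian host; GCC and Clang both provide __builtin_bswap64, and the memcpy compiles away):

  #include <stdint.h>
  #include <string.h>

  /* Load a 64-bit big-endian value on a little-endian machine:
   * one plain load plus one byte-swap instruction. */
  uint64_t load_u64_be(const void *p)
  {
      uint64_t v;
      memcpy(&v, p, sizeof v);      /* compiles to a single load */
      return __builtin_bswap64(v);  /* single bswap/rev on x86/ARM */
  }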

Supporting big endian CPUs in systems programming sucks beyond imagination. There are virtually no big endian users anymore, yet making sure your software works fine on big endian requires testing it on a big endian CPU, and it is not possible to buy one anymore since no consumer big endian CPUs exist. For this reason I still have a Mac PowerPC from 2003 at home running an ancient version of Mac OS X. But over the last 2 years I have stopped testing my software on big endian; I just don't care about big endian anymore...

foxfluff 2021-08-18 12:34:44 +0000 UTC [ - ]

> it forces you to read things byte-by-byte rather than lazily assuming you can just read 4 bytes and cast them directly to an int32

Why is this good? How does the extra work make software more robust?

flohofwoe 2021-08-18 08:51:41 +0000 UTC [ - ]

> as it forces you to read things byte-by-byte

Indeed, reading file headers byte by byte also avoids alignment issues on some CPUs. At least older ARM CPUs trapped misaligned reads (not sure if this is still the case though).

walki 2021-08-18 09:09:59 +0000 UTC [ - ]

> not sure if this is still the case though

No this is not the case anymore. Nowadays support for unaligned memory accesses is very good on ARM and most other CPU architectures. On x86 aligned memory used to be very important for SIMD but now there are even special SIMD instructions for unaligned data and the performance overhead of unaligned memory accesses is generally very small in my experience.

minusf 2021-08-18 07:34:48 +0000 UTC [ - ]

which is a loss; according to OpenBSD developers, developing for both caught a lot of bugs not otherwise found easily.

iainmerrick 2021-08-18 08:34:37 +0000 UTC [ - ]

Couldn’t you keep that benefit by testing on emulated big-endian hardware, though?

(edit to fix interesting autocorrect glitch... “bug-endian” indeed!)

edflsafoiewq 2021-08-18 07:42:57 +0000 UTC [ - ]

Give us an example.

saurik 2021-08-18 08:06:24 +0000 UTC [ - ]

(I mean it just seems obvious you would catch more bugs given the format of big-endian, right?... it comes directly from the dual of one of the not-really-a-benefits of little-endian: how you can cast between pointers of different integer types without moving the pointer.)

agent327 2021-08-18 20:10:19 +0000 UTC [ - ]

I'm totally onboard with casting integers, and I can understand casting between pointers-to-char and pointers-to-other, but why on earth would you want to cast between pointers to (different) integers? That seems like asking for trouble, for no benefit I can discern...

KingOfCoders 2021-08-18 08:29:31 +0000 UTC [ - ]

"there is a distinct advantage to writing numbers in little endian order."

Only if you write left to right, and not right to left.

nybble41 2021-08-18 19:37:33 +0000 UTC [ - ]

If you write right-to-left and use little-endian writing order then you still write the least significant digit first, so the one's places line up along the right edge. The advantage is the same.

nly 2021-08-18 08:54:59 +0000 UTC [ - ]

Does it matter? Most good static languages have a type system strong enough to express endianness as part of the data type, making conversion transparent
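
Even plain C can gesture at this with an opaque wrapper type (a sketch of mine, with invented names; real statically typed languages can enforce much more): the wire-order bytes become their own type, so they can't be used as a native integer without an explicit accessor.

  #include <stdint.h>

  /* A 32-bit big-endian field as its own type: arithmetic on it
   * without converting first is a compile error, not a silent bug. */
  typedef struct { uint8_t b[4]; } be32_t;

  static uint32_t be32_get(be32_t v)
  {
      return (uint32_t)v.b[0] << 24 | (uint32_t)v.b[1] << 16
           | (uint32_t)v.b[2] << 8  | (uint32_t)v.b[3];
  }

  static be32_t be32_put(uint32_t x)
  {
      return (be32_t){ { (uint8_t)(x >> 24), (uint8_t)(x >> 16),
                         (uint8_t)(x >> 8),  (uint8_t)x } };
  }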

a_t48 2021-08-18 09:01:21 +0000 UTC [ - ]

Putting aside “good” vs “popular”, static typing doesn't save you from accidentally tagging the wrong endianness for your external data sources. And it matters if you're doing math with your CPU's native type.

BiteCode_dev 2021-08-18 08:57:55 +0000 UTC [ - ]

It matters for protocols and data formats.