deathanatos 1 days ago [-]
But what was the checksum? Like the actual, specific value?
The Factorio devs found[1] that some devices do fail to compute checksums, in that they compute the checksum just fine, but they're doing something stupid with some values and so checksums of 0x0000 or 0xFFFF (the two values from the FFF) cause packet loss.
In any protocol that, when the packet repeats, repeats it with even the slightest permutation (different request ID, timestamp, sequence number, etc.), that will be enough to jiggle the checksum to a new value (probably), and then the protocol will keep going with only a minor blip that probably goes unnoticed.
But if the packet is deterministic, only then you hit the problem.
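To make the point above concrete, here is a minimal sketch of why a varying field tends to move the checksum while a byte-identical retry cannot. The packet layout (`join_packet`, a 2-byte sequence number followed by a fixed body) is hypothetical, and the checksum is computed over the payload only; a real UDP checksum also covers the pseudo-header and UDP header.

```python
import struct


def checksum16(data: bytes) -> int:
    """Internet-style checksum: complement of the ones' complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def join_packet(seq: int) -> bytes:
    # Hypothetical layout: 2-byte big-endian sequence number, then a fixed body
    return struct.pack("!H", seq) + b"JOIN"


# A deterministic packet: every retry is byte-identical, so the checksum never moves
fixed = [checksum16(join_packet(0)) for _ in range(3)]

# A packet with a changing sequence number: each retry gets a different checksum
varied = [checksum16(join_packet(seq)) for seq in range(3)]
```

If the fixed packet happens to land on a checksum that a buggy device mishandles, every retry lands on it again; the varied packet would most likely escape on the next attempt.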
> calculating the UDP checksum is not exactly rocket science.
I've seen things that trivial get messed up. "Just read the standard" is a high bar, sometimes. (Though the above is probably "I dual purposed a u16 without realizing it didn't have any available niches for that…")
> Unlike the TCP checksum, the UDP checksum is optional; the value zero is transmitted in the checksum field of a UDP header to indicate the absence of a checksum. If the transmitter really calculates a UDP checksum of zero, it must transmit the checksum as all 1's (65535). No special action is required at the receiver, since zero and 65535 are equivalent in 1's complement arithmetic.
Treating 0x0000 and 0xFFFF as special values is fine in 1's complement arithmetic. The error appears, and only for these 2 specific values, when 2's complement logic is used for the calculation.
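The sender-side rule in the quoted passage can be sketched roughly as follows. This is an illustration, not any driver's actual code: `covered` is an assumed name for the bytes the checksum covers (pseudo-header, UDP header with the checksum field zeroed, and payload), and building the pseudo-header is left out.

```python
def ones_complement_sum(data: bytes) -> int:
    """Sum big-endian 16-bit words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum_field(covered: bytes) -> int:
    """Value a sender would place in the UDP checksum field."""
    checksum = ~ones_complement_sum(covered) & 0xFFFF
    # 0x0000 on the wire means "no checksum", so a computed zero is sent as 0xFFFF
    return 0xFFFF if checksum == 0 else checksum
```

For example, `udp_checksum_field(b"\x12\x34\xed\xcb")` sums to 0xFFFF, complements to zero, and is therefore transmitted as 0xFFFF. Code that uses plain 2's complement addition, or forgets the substitution, is only wrong in this corner.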
ongy 1 days ago [-]
In my university times I wrote a library (to help with some homework we gave students) that calculated the CRC32 for ethernet.
Which worked well unless compiled with `strict-aliasing` gcc optimizations enabled...
Just writing UDP RFC compliant code doesn't protect you from running into annoying behavior with your programming language of choice...
adzm 21 hours ago [-]
> Which worked well unless compiled with `strict-aliasing` gcc optimizations enabled
I can't imagine enabling this by default instead of opting in with __restrict or equivalent. Just so many things that could go wrong if every little piece of code was not written with aliasing in mind.
deathanatos 17 hours ago [-]
The GCC flag is `-fno-strict-aliasing`, unless there is one I'm unaware of, which tells the compiler to essentially assume code might make aliasing mistakes.
> Just so many things that could go wrong if every little piece of code was not written with aliasing in mind.
You should always be writing "with aliasing in mind". It is a rule of the language, which specifies the "strict aliasing" that flag refers to, and it's UB to alias in ways which are not allowed. (Some aliasing is permitted by C. It's mostly type-punning that isn't.)
For computing the packet checksum, I'm not sure how you'd manage to run afoul of the strict-aliasing rule (you're just iterating over an array of octets … right?), but C is one of those "assume nothing" languages…
yuye 1 days ago [-]
I love Wube's FFFs. I wish more devs would do it; not just a devlog, but really going into the nitty-gritty of how some systems work.
userbinator 1 days ago [-]
> Without disassembling and tracing the Intel Windows drivers (something I don’t feel like doing)
As someone who generally doesn't use AI in software development or RE, this is one thing that I'd recommend trying one on to see what it can do: the problem is clearly defined, a solution is easily validated, and it's a problem you're not interested in digging deeper into yourself. The other comment here about 0000 and FFFF checksums seems like a good place to start.
A little more digging found this discussion from TODAY regarding what looks like a very similar bug in one of Intel's Linux NIC drivers: https://lkml.org/lkml/2026/5/4/1886
codemog 1 days ago [-]
Exactly. Why would people willingly do this kind of tedious grunt work by hand instead of having a machine do it? I guess some people enjoy it, but it was always one of my least favorite parts.
stroebs 1 days ago [-]
I came across this very same issue with fika, a community-made mod for Escape from Tarkov. One player would consistently fail to join games and it took ages to figure out the different components that were failing. The code intentionally sent the join message 4 times in quick succession, which triggered the DoS protection on the internet firewall. Ok, disabled that. The next issue was that the packets were being interfered with by the ALG on the internet firewall, so disabled that too. Then the final hurdle was the Rx offloading on the Intel NIC, which was the exact same issue with the checksum being set to all 0's or all F's.
What made it confusing at the time is the join packet would sometimes be accepted and passed through to the game, so it prompted further digging into why.
bombcar 1 days ago [-]
It'd be interesting to see what the wrong checksum it calculates is ...
ErroneousBosh 1 days ago [-]
Someone else mentioned further up that it's all zeroes or all ones. A checksum of all zeroes means "this packet has no checksum and that's okay". Because of the way it's calculated 0xffff works out the same as 0x0000, so if the checksum happens to sum to 0x0000 it's replaced with 0xffff.
Both values are totally valid checksums but some people don't believe that :-)
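A rough sketch of the receiver side, assuming UDP over IPv4, shows why both values are fine. `covered` is an assumed name for the pseudo-header, UDP header (checksum field zeroed) and payload; `receiver_accepts` is a hypothetical helper, not a real API.

```python
def ones_complement_sum(data: bytes) -> int:
    """Sum big-endian 16-bit words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero octet
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def receiver_accepts(covered: bytes, checksum_field: int) -> bool:
    """Decide whether a datagram passes the checksum test."""
    if checksum_field == 0x0000:
        return True  # sender did not compute a checksum, nothing to verify
    total = ones_complement_sum(covered) + checksum_field
    total = (total & 0xFFFF) + (total >> 16)  # fold the final carry
    # A correct checksum makes the whole sum come out as all ones
    return total == 0xFFFF
```

With `covered = b"\x12\x34\xed\xcb"` (which sums to 0xFFFF, so the sender transmits 0xFFFF), `receiver_accepts(covered, 0xFFFF)` is true with no special-casing, and `receiver_accepts(covered, 0x0000)` is true because the check is skipped. A receiver that rejects either value is the one with the bug.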
ranger_danger 1 days ago [-]
Usually all 0s or all Fs. I had the same problem with an old Dell PowerEdge with Broadcom NICs... packet failures left and right until I disabled the offloading options.
nubinetwork 1 days ago [-]
Interesting... I've heard enabling tx/rx offloading is actually beneficial, turns out that's not always the case...
RiverCrochet 21 hours ago [-]
Many NICs have embedded ARM or other CPU cores (sometimes multiple) to do offloading, and the OS NIC driver contains code to be run on them.
dijit 21 hours ago [-]
yeah, but sometimes the calculations they do are wrong.
Very annoying when it happens. It used to be common on the chipsets in the TB16 Thunderbolt docks from Dell... if you knew to turn off the offloading, the ethernet worked; otherwise it was slower than wifi.
[1]: https://www.factorio.com/blog/post/fff-176
Realtek RTL8153 iirc.