Thanks for the quick and thoughtful replies, @felmue and @hacxx.
First, just to clarify a specific point: can we say authoritatively that the M5 and the PPS module are designed to communicate natively using 3.3V I2C? I've seen other forum/reddit posts that are inconsistent on the 3.3V vs. 5V question, and I have yet to find a clear, authoritative answer to it in M5's official docs.
To answer some of your questions and theorize a bit further:
Yes, I've got the STM32 communicating with another I2C device; specifically, a DS2484 module, which bridges I2C to the 1-Wire protocol. That can reliably round-trip all the way down to my 1-Wire EEPROM, so I see no reason to suspect fundamental issues with I2C on this board. More thoughts on that below.
Pull-ups: I have not added my own pull-ups here because I'm using pins on the STM32 that are designated for I2C usage (PF0 + PF1, AKA I2C_B), and my reading of the ST docs is that those pins already have appropriate pull-ups.
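As a sanity check on the pull-up question, here is a rough back-of-envelope sketch using the standard formulas from the I2C-bus specification (NXP UM10204). The function names are my own, and the example values (100 pF of bus capacitance, a 40 kΩ internal pull-up) are assumptions for illustration only; the real numbers would have to come from the STM32 datasheet and the actual wiring.

```cpp
#include <cassert>

// Maximum rise times allowed by the I2C-bus specification (NXP UM10204).
constexpr double kRiseTimeStandardS = 1000e-9;  // standard mode, 100 kHz
constexpr double kRiseTimeFastS     = 300e-9;   // fast mode, 400 kHz

// Rise time (30% to 70% of VDD) of an RC-loaded bus line:
// t_r = 0.8473 * Rp * Cb
double riseTimeS(double pullup_ohms, double bus_capacitance_f) {
    return 0.8473 * pullup_ohms * bus_capacitance_f;
}

// Largest pull-up that still meets a given rise-time limit:
// Rp(max) = t_r / (0.8473 * Cb)
double maxPullupOhms(double rise_time_s, double bus_capacitance_f) {
    assert(bus_capacitance_f > 0.0);
    return rise_time_s / (0.8473 * bus_capacitance_f);
}

// Smallest pull-up a device can pull low, using the spec's
// VOL(max) = 0.4 V at IOL = 3 mA: Rp(min) = (VDD - 0.4) / 0.003
double minPullupOhms(double vdd_v) {
    return (vdd_v - 0.4) / 0.003;
}
```

With the assumed 100 pF, `maxPullupOhms(kRiseTimeFastS, 100e-12)` comes out around 3.5 kΩ, while `riseTimeS(40e3, 100e-12)` is roughly 3.4 µs, which would exceed even the 100kHz limit. So if the only pull-ups on the bus turn out to be weak internal ones, that alone could explain flaky behavior, and external resistors in the low-kΩ range would be worth trying.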
I2C speed: I have tried both the 400kHz speed that the M5 example uses and a reduced speed of 100kHz. I haven't looked at the speed question beyond that. AFAIK, the PPS module supports both (and regardless, I'm pretty sure I tested both while deploying to the M5 Core3 SE).
I have tried some different delays in between commands, with no particular impact that I can see.
Regarding the other I2C device: I should be clear that I've been using different I2C pins for it (i.e. a different logical I2C bus on the STM32), but only on a "why not?" basis: we only ever expect this whole assembly to use 2 I2C devices, so I figured I might as well isolate them. I mention this simply because it means I have not verified I2C with another downstream device over exactly the same STM32 pins.
Regarding other potential factors: I did eventually observe that removing some/all of my Serial.println() statements from the loop() body seemed to make things incrementally happier. That is one of the clues pointing toward the general class of issues @hacxx refers to, e.g. perhaps some timing/bus-contention subtleties in the STM32 / Arduino stack implementation? My deeper embedded debug skills for that sort of thing are a bit rusty, but I'm curious whether y'all have any further thoughts on the value of poking deeper and/or any suggested methodologies/tools for doing so?
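For a rough sense of scale on the Serial.println() observation, here is a small sketch that estimates how long one println() occupies the UART. The function name is hypothetical, and it assumes 8N1 framing (10 bits on the wire per byte) plus the CR+LF that println() appends. It also ignores TX buffering: the Arduino core normally queues bytes and returns early, so the call would only block for this long once the buffer fills up.

```cpp
#include <cassert>

// Estimated time, in milliseconds, for a UART to shift out one println().
// Assumes 8N1 framing (start + 8 data + stop = 10 bits per byte) and
// 2 extra bytes for the CR+LF line ending.
double printlnWireTimeMs(int message_chars, long baud) {
    assert(baud > 0);
    const int bytes_on_wire = message_chars + 2;
    return bytes_on_wire * 10.0 * 1000.0 / static_cast<double>(baud);
}
```

Under those assumptions, a 40-character message takes about 3.6 ms at 115200 baud and about 44 ms at 9600 baud. A handful of prints per loop() pass could therefore shift the spacing between I2C transactions by many milliseconds, which would be consistent with a timing-sensitive issue but doesn't prove one.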
Finally, to touch on @hacxx's point regarding differences between STM32's HAL and Arduino: I've been building my prototypes on the Arduino stack thus far mostly out of sheer convenience (and, in the case of the M5 demo code, so I could do an otherwise apples-to-apples comparison between the MCU targets). I've been considering porting over to STM32Cube or maybe some other middleware stack if it has significant advantages. I'm curious whether y'all have any further thoughts on that?