Chris Dzombak

Debugging an Intermittent Arduino/ESP8266 ISR Crash

Part of the Getting Started with ESP8266 + Arduino series.

I’ve been playing with Wemos D1 Mini ESP8266 boards lately, in preparation for some upcoming project ideas, and just because learning new things is fun. I have some ideas for a series of blog posts about these boards; debugging this issue seems like as good a place as any to start. It’s not quite beginner-level, but hopefully it’s helpful.

For context, I’m using the Arduino core for ESP8266, and using PlatformIO with CLion, rather than the Arduino IDE, as my development environment.

I’m working on this program (that link is to the specific commit I’m debugging here). It does a few things, listed here in no particular order:

These tasks are coordinated using this Arduino cooperative multitasking library, which looks quite capable — my project is only scratching the surface of its API, which I wish was better documented.

Aside:Cooperative multitasking” means the operating system, or in this case the scheduler, does not interrupt a running task to give time to others; it’s up to each task to return quickly or take breaks to give other tasks a turn. (In the future I’d like to play with FreeRTOS on the ESP8266 platform, but I had to cut scope somewhere on these demos, and between my admittedly-cursory familiarity with Arduino and the large community it provides, I figured I should stick with Arduino for now.)

The problem

The HTTPS demo project crashes after running for several minutes. To start debugging this, I enabled crash decoding in the platformio device monitor tool. (I describe that setup in a separate blog post.)

This allowed me, eventually, to capture a useful crash log.

An initial attempt

Searching the Internet for esp8266 "rst cause:2, boot mode:(3,6)", which is printed at the bottom of the exception, I found several results saying that the ESP8266 will reset like this if you don’t call yield() often enough in long-running functions or inside loops. This is because the ESP8266’s WiFi stack needs processor time to run, and if it doesn’t get enough processor time, a watchdog timer will reset the chip.

“Easy enough,” I figured. This crash looked to be happening during HTTPS requests, which the documentation warns are hard work for this processor. So I assumed — wrongly — that maybe I could sprinkle some yields throughout this function. That did not fix the problem.

Diving deeper into the crash log

A few things jump out right away:

Illegal instruction
  epc1=0x4022698c in __divsf3 at /workdir/repo/gcc-gnu/libgcc/config/xtensa/ieee754-sf.S:1036

First, the “illegal instruction” message where the program apparently crashed is odd. This definitely should exist and be usable.

Then, let’s take a look at the stack trace. At the top:

0x4010064c in timer1_isr_handler at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_timer.cpp:34
0x401002c2 in ledTimerISR() at src/main.cpp:257
0x40100698 in timer1_isr_handler at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_timer.cpp:44

So, this crash happened while in my LED blinking interrupt service routine (ISR). That’s not particularly surprising — as this guide to interrupts suggests, ISRs are somewhat touchy. To quote my dad, “It’s best to keep ISRs as simple and short as possible.” Thing is, I thought this ISR was pretty simple — it just toggles the LED and sets the timer to keep the LED on or off for the correct amount of time. We’ll come back to that.

It’s confusing to read the rest of this trace; I don’t know exactly how to interpret a stack trace produced during an ISR. The few functions on top have no relation to those at the bottom, and it’s ambiguous to me where the the normal program flow ends and the interrupt-related stack entries begin. I’m sure someone out there with experience on this platform knows exactly how to read this … but I don’t.

At the bottom of the stack trace:

0x40235f98 in br_ssl_engine_closed at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/inner.h:2211
 (inlined by) jump_handshake at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/ssl/ssl_engine.c:1081
0x4023658d in br_ssl_engine_recvrec_ack at /home/earle/src/esp-quick-toolchain/arduino/tools/sdk/ssl/bearssl/src/ssl/ssl_engine.c:1206
0x4020b878 in thunk_br_ssl_engine_recvrec_ack at ??:?

I think this means that before the LED timer interrupt fired, the processor was doing some HTTPS stuff. This makes sense, since this is where the processor is spending the bulk of its time.

In the middle of this trace (snipped for brevity):

0x40106bec in spi_flash_read at ??:?
0x402182a2 in flash_hal_read(unsigned int, unsigned int, unsigned char*) at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/flash_hal.cpp:36
…
0x402134f0 in littlefs_impl::LittleFSFileImpl::read(unsigned char*, unsigned int) at /Users/cdzombak/.platformio/packages/framework-arduinoespressif8266/libraries/LittleFS/src/LittleFS.h:392

The important thing here is that it’s trying to read something from flash memory. Again, I don’t know how to accurately interpret this stack trace, but this is interesting and relevant because:

It will be useful to fully understand that last comment, so let’s discuss…

ESP8266’s memory architecture

Here I refer to the ESP8266 datasheet and the ESP8266 technical reference. They don’t go into as much detail as I would like, but we can get a high-level understanding of what’s happening here.

The datasheet (in sections 3.1.2 & 3.1.3) notes:

The technical reference, in section 5.1, says “… SPI module is used to read the running program from external Flash to CPU CACHE …”

Unfortunately, no reference material I can find details exactly how that process works — for example, exactly when does the ESP8266 read user code from flash? How much does it read at a time?

But our code is stored in flash, connected to the processor via SPI. And, during an ISR, something else may be using flash memory, so our ISR — and everything it calls — needs to be in RAM already, so the processor can execute it. This is what the IRAM_ATTR attribute does.

Armed with this knowledge, I think the stack trace sure looks like something in my ISR is touching the flash memory, but BearSSL is already using it. Hence the crash.

But what’s using the flash memory?

What’s touching the flash?

My ISR is already marked with IRAM_ATTR, and the stack trace hints that it’s at least started executing, so I can assume that’s not the issue.

However, in that ISR I do call this function, setInterval, in the timer library I was using. It’s not marked with IRAM_ATTR, and neither is the setFrequency function it calls.

It’s easy to verify that these calls are causing the problem: commenting out the ITimer.setInterval calls in the ISR fixes the crash. The LED doesn’t blink, of course, but this means I’m on the right track.

This problem now seems easy enough to fix:

Really, these hacks indicate that I shouldn’t be using this library — at least not the way I’m trying to. (Spoiler: I get rid of the library soon.)

Rebuild, re-upload, and … it still crashes in the same way: TLS stuff at the bottom of the stack trace, then SPI flash stuff, then the LED blinking ISR.

Taking another look at the library’s code, something jumped out at me: it’s performing floating-point division. And at the very top of the decoded exception, I remember something about division, in a file named ieee754-sf.S:

Illegal instruction
  epc1=0x4022698c in __divsf3 at /workdir/repo/gcc-gnu/libgcc/config/xtensa/ieee754-sf.S:1036

IEEE754 is a standard for floating-point arithmetic, and that file looks like it’s a software implementation of at least some floating-point math routines. A quick search tells me that ESP8266 doesn’t have a floating-point unit (plus, “floating-point,” “IEEE754,” and related terms don’t even appear in the datasheet or technical reference manual).

So, at this point I feel safe assuming that:

Again, I do wish there were more documentation about exactly how the ESP8266 manages the user code cache in RAM, but this seems like a solid enough theory.

Actually fixing it

After reviewing the ESP8266TimerInterrupt library’s code a little more, I concluded:

With all that in mind, I rewrote the LED blink management code to use the timer-management functions from the ESP8266 Arduino core, directly; and also to avoid doing any floating-point math or calling into any external libraries from the interrupt service routine.

And, just for fun, I also:

Verifying the fix

I’ll cheat, making the ISR fire more frequently to increase the chance of a crash:

diff --git a/src/main.cpp b/src/main.cpp
index b511378..943986d 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -47,11 +47,12 @@ BearSSL::Session tlsSession;
 WiFiClientSecure wifiClient;
 
 // LED (all times in milliseconds)
-#define CONNECTED_LED_TIME_ON   (TASK_SECOND/5)
-#define CONNECTED_LED_TIME_OFF  (TASK_SECOND*1.5)
-#define CONNECTING_LED_TIME_ON  (TASK_SECOND/5)
+// DO NOT COMMIT
+#define CONNECTED_LED_TIME_ON   (TASK_SECOND/20)
+#define CONNECTED_LED_TIME_OFF  (TASK_SECOND/10)
+#define CONNECTING_LED_TIME_ON  (TASK_SECOND/20)
 #define CONNECTING_LED_TIME_OFF CONNECTING_LED_TIME_ON
-#define CERT_ERR_LED_TIME_ON    (TASK_SECOND/1.5)
+#define CERT_ERR_LED_TIME_ON    (TASK_SECOND/10)
 #define CERT_ERR_LED_TIME_OFF   CERT_ERR_LED_TIME_ON
 
 // Tasks

Perfect. Applying this change before applying the actual bugfix, the program crashes reliably within 30 seconds.

Then, after applying the bugfix, it doesn’t crash! (It just blinks at an incredibly distracting rate.)

Lessons learned

I’ll try to be brief:

“Simple” means really, really simple. I thought my first interrupt service routine was simple, but it involved someone else’s code, some non-IRAM_ATTR function calls, and floating-point math that turned out to be implemented in software. Ew.

Just call nothing from an ISR. Like, if you can possibly push any work out of the ISR, do it. Even if it feels like that work should take about three clock cycles.

No floating point either. I vaguely recall having to migrate something to fixed-point during EECS 373 because of (different) floating point issues on a microcontroller; I should’ve remembered that!

Not all Internet troubleshooting suggestions are accurate. In retrospect, it was quite optimistic to hope that spraying yield()s all over my code would fix this, but that’s what a lot of Arduino forums said to do. They were wrong.

Be careful what libraries you let into your codebase. In day-to-day software development for desktops & servers, and even Raspberry Pis, I’m used to using external libraries. But this is a different environment, with much stricter limitations and less room for error or sloppiness. In just a week or so of playing with Arduino on the ESP8266, I’ve run into a few different Arduino libraries with various issues:

Yes.

Appendix: noInterrupts()

Earlier, before tracking down the real cause, I tried the hacky approach of calling noInterrupts() at the beginning of my ISR and interrupts() at the end. This sometimes helps “fix” interrupt conflicts, but:

Anyway, this approach couldn’t have worked here, since the processor needs interrupts to read from the flash memory over API. In retrospect, the best this could have accomplished would be to deadlock the whole system until a watchdog timer reset it.

In fact, this whole approach was doomed to make no difference: the underlying Arduino core already disables interrupts to the extent possible on ESP8266 when entering a timer ISR! (And, of course, it reenables them after the ISR exits.)