How I gained real control of an Echo
For more than 15 years I've built my own "Home Automation" components. Nothing fancy, though maybe back then it was. It started with some colored LED spots under my couch, connected to my Ethernet via a microcontroller. Later I added WiFi and 433MHz radio to control various remote-switchable outlets. Some of them were even built into a light switch to control it. I even had Linux running in a light switch for some time. Then came my MagicMirror and with it the need for some presence detection based on WiFi. And so on, just to give you an idea of where this is coming from.
At one point I wanted to control those not only via smartphone, but also with my voice! 5 years ago my setup for that consisted of some ARM-based SBC and a USB microphone. I used Jasper, but of course I used "Jarvis" as a wake-word well before Mark Zuckerberg had the same idea :). But… it was horrible: you had to scream through the room to have a 25% chance of being understood. Clearly the microphone was to blame, but I hadn’t heard of microphone arrays yet. That was until I had a closer look at Amazon Echo's technology. However, my "Home Automation" was all local and in-house, and my voice should stay in-house as well.
Luckily I spotted a great blog post from F-Secure, "Alexa, are you listening", which was in turn based on the Cook/Clinton paper. They used the debug interface of the device to gain access to it, and streamed the microphone audio signal to another device. So hey, that’s what I wanted: the Echo had a microphone array, and could recognize everything much better with it, and was "jailbreakable"! Around that time I also learned about other microphone array projects like the ReSpeaker, but my focus stayed on the Echo. So about 3 years ago I bought an Echo (second hand) and asked a friend to 3D-print me a docking station for it.
I inserted some pogo pins and gained control. A microSD adapter is perfect for the wiring, and you can get a ready-to-use image from the echohacking wiki.
The bad idea
I was shocked. The system was outdated, even back then: a 2.6.x kernel and a rootfs based on an ancient glibc. Nothing I could easily turn into some open source powered voice assistant, I thought. So I quickly made a plan to port a recent kernel to it and build an up-to-date rootfs with buildroot. The latter was quickly achieved, but the kernel needed more time to get going. As I was also busy with other things, I only worked on it once in a while, but eventually I got to a point where I was able to upstream the first bits, namely a device tree which landed in 5.6. The docking station wasn't suitable for longer periods of work, so finally I had to solder everything to the bottom board.
I noticed early on that there were no drivers for the microphones. It took me ages to figure out that a driver for the codec connected to the speaker already existed, and even then it wasn’t easy to get it running. I first had some success with the microphones by porting the driver of another Echo device for the 4.14 kernel, and later even upstreamed the speaker part, which was released with 5.12. (More fixes followed in 5.13.)
So with the microphones working on an open system, I tried ODAS to filter out my voice. Bad news: the Echo was far too slow for that. The original firmware uses the built-in DSP for this task, but guess what, the drivers are only available for 2.6.x kernels, or at best in a broken state up to 3.16. And even if I had the drivers, the DSP code would still be proprietary. So what to do?
The not-so-bad idea
This winter I had time for another sprint and simply decided I’d make it this time. I dropped my old plan quite early and decided to make a hybrid of the Echo firmware and other software, mostly by turning off many Alexa services. That plan went so smoothly that after some hours/days (who counts?) I was already in need of wake-word detection. I had previously stumbled upon Snips, but 3 years have passed and Snips isn’t what it was anymore. But hey, 3 years have passed and other things have been created: Porcupine, which recently added "Jarvis" and other words to their GitHub repository together with an SDK under Apache 2.0. I know, the models and libraries are proprietary, but it’s a fair choice for wake-word detection and it works brilliantly. Just not directly for me. The Echo firmware is still not hardfloat enabled, so I needed a second rootfs with hardfloat to chroot into. Easy, thanks to buildroot. With some hints from "Alexa, are you listening" and the Porcupine examples I was quickly able to stream the audio over the network to a more powerful machine as soon as the wake-word was detected. Nice.
So what to do with that voice snippet these days? As I said, 3 years have passed, and in that time voice2json was created. Amazing! It wasn’t hard to create intents for my tasks and let them trigger the necessary actions. Next there was the LED ring. I wanted to use the original service for it, with all its conveniences. But how to communicate with it? Well, it uses an IPC library from Lab126 called… lipc. I gave Google a chance to bring up more about it and... wow! I’m not the only guy doing such crazy things! There’s openLIPC from some Kindle hackers, that was unexpected! I’m not using it yet, but I plan to incorporate it soon. Then only days later it was all finished and sitting in the living room, ready to serve by voice command. FINALLY! Time for a demo :)
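To illustrate the idea of streaming only after the wake-word (this is not the code running on my Echo, just a minimal sketch), here is how the gating could look in Python. The `detect` callable is a hypothetical stand-in for a wake-word engine such as Porcupine, assumed to return a keyword index (0 or higher) on a hit and -1 otherwise. The function names, the host/port parameters and the number of frames to capture are likewise assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, Iterator


def frames_after_wake(
    frames: Iterable[bytes],
    detect: Callable[[bytes], int],
    capture_frames: int,
) -> Iterator[bytes]:
    """Yield the `capture_frames` audio frames that follow each wake-word hit.

    `detect` is a hypothetical detector: it returns an index >= 0 when the
    frame completes a wake-word and -1 otherwise.
    """
    remaining = 0
    for frame in frames:
        if remaining > 0:
            # We are inside a capture window: pass the frame on.
            yield frame
            remaining -= 1
        elif detect(frame) >= 0:
            # Wake-word heard: open a capture window for the next frames.
            remaining = capture_frames


def stream_to_host(
    frames: Iterable[bytes],
    detect: Callable[[bytes], int],
    host: str,
    port: int,
    capture_frames: int = 94,  # roughly 3 s, assuming 512-sample frames at 16 kHz
) -> None:
    """Send the audio following each wake-word to a more powerful machine via TCP."""
    with socket.create_connection((host, port)) as conn:
        for frame in frames_after_wake(frames, detect, capture_frames):
            conn.sendall(frame)
```

The gating lives in a plain generator so it can be tried without a microphone or a network: feed it any iterable of byte frames and a fake detector. Frames arriving while nothing is being captured are only shown to the detector and never leave the device.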
- As soon as I have time, I'll start adding some components of my endeavor to this repository. The rootfs creation is a good first candidate, I think. ☑
- The system can now control every relevant light, can tell jokes, and more. A logical next step might be to connect it to an assistant like Mycroft. On the other hand I'm happy as it is now; it's just nice to see there are upgrades available.
- As multiple wake-words are allowed (see the demo video), I plan to add one for my other hobby project. It's a modified RC car with a camera, controlled by a Raspberry Pi Zero W, and supposed to find its way through my home by itself. So being able to say something like "come to the living room" and have the car do that might be a good new feature. Maybe more on that at a later point.
I replaced Alexa with my own solution on an Amazon Echo from 2016 and will open source relevant parts of it here soon.
Thanks to the Jasper project for inspiration, the F-Secure blog for making me curious, Clinton and Cook for releasing the pinout, Florian Müller for designing and printing the docking station, echohacking wiki for the initial image, buildroot for making rootfs creation a matter of minutes, ifixit.com for early insights, ODAS for having an open source way to handle a microphone array, Picovoice for providing the Porcupine SDK and free wake-word models for it, voice2json for making it so easy to react to speech input, openLIPC for being as crazy as this project :), and Zebediah Figura for spell/style fixes.