Hi. I'm Sean, a backend nerd. You know my tribe: questionable social skills, not the folks the Sales and Marketing guys want talking directly to customers, and an entrepreneurial instinct near zero. Entrepreneurs spot a business opportunity and try again and again until something sticks. I mostly understand the language they use, but I’m not always sure that I can see the picture they describe.
But I do have something in common with my entrepreneurial colleagues. When diving into a technical challenge, I too will try again and again until I find something that sticks. The task won't let me rest until I do. And I, too, can likely only describe what I did in language that people outside my tribe don't understand.
Just as every software company needs entrepreneurial leaders, every software company needs at least a few of us. Let me try to explain why.
Here at Pypestream, we build the backend to be as simple and scalable as possible. But to be honest, with shifting requirements, a continuous stream of new feature requests, hidden edge cases and assumptions, and several teammates updating the same code, things quickly get to the stage where as-simple-as-possible turns out to be not very simple at all. Add the drive to deliver, and sometimes issues make it to production.
Hey, that's the risk of delivering code quickly! Striving to eliminate all issues would only freeze all deliveries. It's not like we're Mr. Bean wiring his plug.
So yeah, sometimes bugs make it to production…
It started on a Tuesday. The ticket landed on my board. A customer has reported some inconsistent statistics. The cause seems to be that their end-chat-handler isn't being called on all sessions. It's 5 p.m., an hour before I have to bring my daughter to football training. I'd better give them something fast.
I quickly read the details to get familiar with the story. The easiest explanation presents itself: User Error. But surely if that was the case it wouldn't have made its way to my to-do list. The customer and service delivery have both examined it. I have to look deeper, but where to start?
I examine the logs, fine-tuning the filter to find evidence of the bug. The bot_start and bot_end events have matching counts. No sign of a problem there. Time is waning. I’d better get more information: When was it first noticed? Do we know that it wasn't happening before that? Did their solution change recently? Any example sessions? I'm off with the kids.
Three hours later, I’m back. It started last Wednesday, but was only noticed yesterday. The solution has regular updates, but nothing that should affect this. Hrmm… no leads there. I spend a half hour looking at the logs for one example session. After that, I’m only aware of logging out and going to bed. But I know from experience that there's loads going on in my subconscious.
I get to my desk next morning and can't wait for the laptop to boot up. I have an angle of attack. In the example session, the user clicked “X” while the end-chat-handler was executing. I load some logs in the GUI with the console debugger on, save the HAR file and extract the token for OpenSearch. That GUI looks great, but it's not the tool I need today. I use the pype_id and that token in a script to get all the chat_start events for the last week from OpenSearch. Then, for each session, I get a list of all calls to the bot-framework. Ten thousand sessions, maybe half a million log events in total, extracted, saved locally and processed into a story of the user experience for each session.
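For the curious, here is a minimal sketch of what "processing log events into a story per session" could look like once the events are saved locally. It is not the actual script: the field names (`session_id`, `timestamp`, `event`), the event names and the sample data are all assumptions made up for illustration, and the fetch from OpenSearch is left out entirely.

```python
from collections import defaultdict
from datetime import datetime


def build_session_stories(events):
    """Group raw log events by session and sort each group by time.

    Assumes each event is a dict with hypothetical keys
    'session_id', 'timestamp' (ISO 8601) and 'event'.
    """
    stories = defaultdict(list)
    for event in events:
        stories[event["session_id"]].append(event)
    for session_events in stories.values():
        session_events.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return dict(stories)


def sessions_missing_event(stories, event_name):
    """Return the ids of sessions whose story never contains event_name."""
    return sorted(
        session_id
        for session_id, session_events in stories.items()
        if all(e["event"] != event_name for e in session_events)
    )


# Made-up sample data, deliberately out of order.
sample_events = [
    {"session_id": "s1", "timestamp": "2024-01-03T10:00:05", "event": "end_chat_handler"},
    {"session_id": "s1", "timestamp": "2024-01-03T10:00:00", "event": "chat_start"},
    {"session_id": "s2", "timestamp": "2024-01-03T11:00:00", "event": "chat_start"},
    {"session_id": "s2", "timestamp": "2024-01-03T11:00:09", "event": "user_clicked_x"},
]

stories = build_session_stories(sample_events)
suspects = sessions_missing_event(stories, "end_chat_handler")
```

With every session reduced to an ordered list of events, checking a theory becomes a small function like `sessions_missing_event` that is run across all the stories at once.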
I look at the pattern of clicks and options taken by one anonymous user two days ago. The experience of a person 5,000 kilometers away analysed down to the second. A careful reconstruction of the conversation in terms of path through the solution. I’m focussed on that clicked “X” that ended the session while the end-chat-handler was in progress.
Another hour of extracting logs to figure out what the pattern across multiple log events would look like if this was the case, then checking for that pattern. But no! That pattern exists in sessions that were fine. And that pattern doesn't exist in other sessions that had the issue. So it's not that. Hrmm... back to the drawing board.
It started on the date that the solution was updated. Part of me wants to blame the Solution Designer. I have lots of scheduled work to get on with. But an avatar of the Solution Designer pipes up in my head: Why isn't the platform calling the end-chat-handler? That's a fair question. Hrmm... have to dig deeper.
Maybe it's... nah! It can't be that either.
Another hour of extracting logs to check another theory. By now the code has been stripped down to the relevant parts and unit tested in ten different scenarios, with delays inserted to explore timing issues. All dead ends.
The Git history has been analysed, release history checked: nothing changed on our end last Wednesday. Hrmm... another dead end!
Sigh! This is a tough one! It's 5 p.m. again. The wife is working tonight. I have to go feed the kids. I’m mentally exhausted from the day-long wrestle with this bug. My son needs help with his homework. We all watch a few episodes of my daughter's favourite, Map Men. Off to bed with the kids. I go back to stare at the data for another hour, logging out at 10 p.m. Time to load the dishwasher and go to bed.
I have the bed to myself tonight. I like to jump in when I get the chance. Nothing acrobatic, but with a clear 0.5-to-1 second gap between my feet leaving the floor and me landing on the bed. But tonight, in that 0.5-to-1 second gap, it's revealed! A flash of insight! Some explain it in terms of brain chemicals, but that misses something important. It's that energy that drove Archimedes to run down the street naked more than 2,000 years ago screaming, “Eureka!” That 'aha!' moment when the bug reveals itself. The moment that makes the 10+ hours of slogging so worth it!
Like a child playing hide and seek who has just been found, the answer meets you with a grin,
“Haha! What took you so long? I could see you searching. You were right there! I can't believe you didn't find me earlier!”
As I hit the bed, I can see it all clearly. The code changed a month earlier to support the end-chat-handler requesting a transcript. When that's exercised by this new version, the handler takes longer to execute than the time allowed for session shutdown, so the end-chat-handler response is ignored. All the log patterns that were pointing away from my previous theories are pointing directly to this thing that I couldn't see all day.
Inspiration comes just when the problem has left your conscious thoughts: in the shower, loading the dishwasher. And this time, when jumping into bed, where I lie with eyes wide open, seeing the big picture clearly after having examined the microscopic details all day. A deep peace settles in.
Thursday morning, I’ve recreated the issue in a unit test ten minutes after logging in. The patch is ready 15 minutes after logging in. Hrmm replaced with Aahhh!
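A reproduction of this kind of bug can be surprisingly small. What follows is a minimal sketch, not the actual Pypestream test: the function names are hypothetical, `asyncio.wait_for` stands in for whatever mechanism the platform really uses to bound session shutdown, and the durations are made-up stand-ins scaled down to fractions of a second.

```python
import asyncio


async def slow_end_chat_handler(duration):
    """Hypothetical handler; the sleep stands in for slow work such as
    requesting a transcript."""
    await asyncio.sleep(duration)
    return "handler finished"


async def shutdown_session(handler_duration, shutdown_timeout):
    """Run the handler, but give up once the shutdown window has passed.

    Returns the handler's response, or None if it arrived too late
    and was therefore ignored.
    """
    try:
        return await asyncio.wait_for(
            slow_end_chat_handler(handler_duration),
            timeout=shutdown_timeout,
        )
    except asyncio.TimeoutError:
        return None


def run_scenario(handler_duration, shutdown_timeout):
    """Convenience wrapper so each scenario is a single call."""
    return asyncio.run(shutdown_session(handler_duration, shutdown_timeout))


# Handler fits inside the shutdown window: its response is used.
fast_result = run_scenario(handler_duration=0.01, shutdown_timeout=0.2)

# Handler outlasts the shutdown window: its response is dropped.
slow_result = run_scenario(handler_duration=0.2, shutdown_timeout=0.01)
```

The second scenario is the bug in miniature: nothing crashes and nothing looks wrong in the handler itself, yet its result never reaches anyone.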
Now, where was I on that scheduled work?
“Work” is the wrong word for what I do when wrestling a bug. But turning that process into a blog post? That’s work. I need a break, so I reach for Alex Bellos’s Can You Solve My Problems? Random page, question 58: using only a 7-minute and an 11-minute sand timer, time a quarter of an hour exactly. 7 + 11 = 18, 7 + 7 = 14. Hrmm… maybe... nope… what if… Yeah! Now back to the hard work of writing a blog post that needs a conclusion.
The word puzzle had a nicely contained question, the minimum amount of evidence needed, and maybe a superfluous bit to see if you realise you don't need it. And since it's a puzzle, there is one neat answer. You just have to find it.
In the real world, I have to search and find the evidence myself, decide what's relevant and what's not, and there's not always a neat, logical answer. But this is my career. Can you believe that? Computer programming is the closest thing to being a professional puzzler without actually being one.
So my conclusion is: Here at Pypestream, the backend is built by people, my important, talented tribe, who love what they do.
Aahhh! Now the hard work is done. Where did I put that sudoku book???





