Bayesian HMM for Single-Molecule Protein Sequencing — Erisyon

PyMC, JAX, NumPyro

The Challenge

Erisyon is building technology to identify proteins one molecule at a time. Their instrument shines light on individual protein molecules and records the fluorescence pattern, but going from that noisy optical signal to "this is protein X" is a formidable inference problem. The space of possible proteins is vast. The fluorescence signals are stochastic, and different amino acids can produce overlapping signatures. Because these identifications feed into downstream biological conclusions, they need to carry honest confidence estimates, not just best guesses. On top of all that, the pipeline has to be fast: Erisyon's experiments produce enormous amounts of data.

Our Approach

We built a probabilistic sequence model that treats protein identification as a decoding problem: given a series of noisy fluorescence observations, what protein sequence most likely produced them, and how confident should we be? The model captures the sequential nature of the signal — each observation depends on where you are in the protein and what happened before — while propagating uncertainty through the entire inference chain.
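To make the "how confident should we be?" part concrete, here is a minimal sketch in JAX of turning per-candidate log-likelihoods into a posterior over candidate proteins with Bayes' rule. The function name `candidate_posterior`, the three log-likelihood values, and the uniform prior are illustrative assumptions, not details of the Erisyon model.

```python
import jax.numpy as jnp
from jax.nn import log_softmax


def candidate_posterior(log_likelihoods, log_prior=None):
    """Posterior probability of each candidate protein given one trace."""
    log_likelihoods = jnp.asarray(log_likelihoods)
    if log_prior is None:
        # Assumption: every candidate is equally likely a priori.
        log_prior = jnp.zeros_like(log_likelihoods)
    # Bayes' rule in log space; log_softmax normalizes in a numerically stable way.
    return jnp.exp(log_softmax(log_likelihoods + log_prior))


# Hypothetical log-likelihoods of one trace under three candidate proteins.
log_lik = jnp.array([-120.0, -123.0, -131.0])
posterior = candidate_posterior(log_lik)
best = int(jnp.argmax(posterior))  # index of the most probable candidate
```

The point of the sketch is that the output is a full probability vector rather than a single call, so a close runner-up shows up as lower confidence in the winner.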

The main engineering challenge was speed. A naive implementation would have been far too slow for the scale of Erisyon's data. We wrote the core inference computations in a high-performance numerical framework and integrated them into the probabilistic model as a custom component, which delivered the speedup needed to process experimental data at high throughput. Before we touched any experimental data, extensive simulation-based validation confirmed that the model's assumptions matched real instrument behavior.
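As an illustration of the kind of kernel this involves, the sketch below implements the standard HMM forward algorithm in log space with `jax.lax.scan`, then uses `jax.jit` and `jax.vmap` to score a batch of traces in one compiled call. It assumes JAX as the framework (it is listed among the project's tools, but this write-up does not say where it was used) and a toy two-state model with discrete observation symbols. The real state space and emission model are not described here, so every parameter below is hypothetical.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def forward_loglik(obs, log_init, log_trans, log_emit):
    """Log-likelihood of one discrete observation sequence under an HMM.

    obs:       (T,) integer observation symbols
    log_init:  (S,) log initial-state probabilities
    log_trans: (S, S) log transition matrix, rows index the previous state
    log_emit:  (S, K) log emission probabilities per state and symbol
    """
    alpha0 = log_init + log_emit[:, obs[0]]

    def step(alpha, o):
        # Marginalize over the previous state, then weight by the emission of o.
        predicted = logsumexp(alpha[:, None] + log_trans, axis=0)
        return predicted + log_emit[:, o], None

    alpha_last, _ = jax.lax.scan(step, alpha0, obs[1:])
    return logsumexp(alpha_last)


# Compile once and map over the batch axis of the traces only.
batched_loglik = jax.jit(jax.vmap(forward_loglik, in_axes=(0, None, None, None)))

# Toy parameters (assumed for illustration): 2 hidden states, 2 symbols.
log_init = jnp.log(jnp.array([0.6, 0.4]))
log_trans = jnp.log(jnp.array([[0.7, 0.3], [0.4, 0.6]]))
log_emit = jnp.log(jnp.array([[0.9, 0.1], [0.2, 0.8]]))

traces = jnp.array([[0, 0, 1], [1, 1, 0]])  # two traces of length 3
logliks = batched_loglik(traces, log_init, log_trans, log_emit)
```

Working in log space avoids underflow on long traces, and `vmap` lets the same single-trace function handle a whole batch without a Python loop.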

Results

The pipeline works. Erisyon can now run protein identification on high-throughput experimental data and get back not just sequence calls but calibrated probabilities — "we're 94% confident this is protein X" rather than just "protein X." That calibration matters because downstream analyses inherit whatever uncertainty exists in the identification step.
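One way to check what "calibrated" means in practice is to compare stated confidence with observed accuracy on simulated data where the true protein is known. The sketch below bins calls by confidence and computes an expected calibration error. The function names, bin count, and toy inputs are assumptions for illustration, not the validation procedure used in this project.

```python
import jax.numpy as jnp


def reliability_table(confidence, correct, n_bins=5):
    """Per-bin call counts, mean confidence, and empirical accuracy."""
    confidence = jnp.asarray(confidence, dtype=jnp.float32)
    correct = jnp.asarray(correct, dtype=jnp.float32)
    # Equal-width bins on [0, 1]; clip so confidence == 1.0 falls in the last bin.
    bins = jnp.clip((confidence * n_bins).astype(jnp.int32), 0, n_bins - 1)
    counts = jnp.zeros(n_bins).at[bins].add(1.0)
    conf_sum = jnp.zeros(n_bins).at[bins].add(confidence)
    hit_sum = jnp.zeros(n_bins).at[bins].add(correct)
    safe = jnp.maximum(counts, 1.0)  # avoid dividing by zero in empty bins
    return counts, conf_sum / safe, hit_sum / safe


def expected_calibration_error(confidence, correct, n_bins=5):
    """Count-weighted average gap between confidence and accuracy."""
    counts, mean_conf, accuracy = reliability_table(confidence, correct, n_bins)
    return jnp.sum(counts / counts.sum() * jnp.abs(mean_conf - accuracy))


# Ten simulated calls, each reported at 90% confidence.
conf = jnp.full(10, 0.9)
ece_good = expected_calibration_error(conf, jnp.array([1] * 9 + [0]))      # 9 of 10 right
ece_over = expected_calibration_error(conf, jnp.array([1] * 5 + [0] * 5))  # 5 of 10 right
```

A well-calibrated model keeps this gap near zero, while an overconfident one (90% stated, 50% correct) shows a large gap.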

The custom computational approach delivered a dramatic speedup over what a standard implementation would have achieved. As one member of the Erisyon team put it: "You did in weeks what would have taken us a year."

PyMC Labs Team

Adrian
Maxim
Thomas Wiecki
