The goal
We wanted a radio station with no human in the chair that still sounds like there is one. Not a playlist on shuffle with a 2-second crossfade, but a DJ: something that picks a harmonically compatible next track, matches its tempo to the beat, lines the kick drums up sample-for-sample, and hands the low end over so cleanly that you have to concentrate to notice a transition happened at all.
A listener should not be able to tell where one track ends and the next begins. That single sentence drove every engineering decision below.
The result is live at ravist.in, streaming a continuous mix from a library of 850+ tracks (and growing toward several thousand).
Architecture
Do the hard work once, keep the live path dumb
A naive auto-DJ analyzes audio while it plays. That is a trap, because analysis is slow and the live path cannot afford to stall. So we split the system in two: heavy offline analysis feeds a deliberately cheap, fast live path.
This is why the whole live stack runs comfortably on a handful of CPU cores while the broadcast never skips. The live engine just reads JSON and does arithmetic.
Stage 1 - Ingest
Turning audio into a musical fingerprint
When a track lands in our S3 bucket, an ingest worker downloads it, analyzes it, and writes a metadata JSON. Here is what we extract:
| Property | Tool | Why it matters |
|---|---|---|
| Tempo (BPM) + beat grid | Essentia RhythmExtractor2013 | The foundation of beatmatching |
| Downbeats (bar starts) | madmom DBN tracker | Phrase-accurate mix points: the "1" of each bar |
| Musical key | Essentia KeyExtractor → Camelot | Harmonic mixing: no key clashes |
| Loudness (LUFS) | pyloudnorm (ITU-R BS.1770) | Consistent perceived volume across tracks |
| Energy curve | RMS per second, normalized | Finding the intro, breakdown, and outro |
| Tags | mutagen | Artist / title / genre for the now-playing readout |
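To make the "energy curve" row concrete, here is a minimal sketch of a per-second RMS curve. It assumes mono float samples and peak normalization (the loudest second becomes 1.0). The function name `energy_curve` and the normalization choice are illustrative assumptions, not taken from the ingest worker.

```python
import math

def energy_curve(samples, sample_rate):
    """Per-second RMS of a mono signal, scaled so the loudest second is 1.0.

    Hypothetical helper: peak normalization is an assumption.
    """
    rms = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        rms.append(math.sqrt(sum(x * x for x in window) / len(window)))
    peak = max(rms, default=0.0)
    # Leave an all-silent (or empty) curve untouched to avoid dividing by zero.
    return [r / peak for r in rms] if peak > 0 else rms
```

A curve like this is small enough to store in the metadata JSON, at one float per second of audio.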
Why two beat trackers?
Essentia gives us every beat. But a DJ does not mix on any beat. They mix on a downbeat, the "1" that starts a bar, and ideally on a phrase boundary (every 8 bars in dance music). Guessing downbeats as "every 4th beat" is off by one or two beats often enough to ruin a transition. So we run madmom's DBN downbeat tracker on top:
```python
from madmom.features.downbeats import (
    DBNDownBeatTrackingProcessor, RNNDownBeatProcessor,
)

# `path` is the audio file being analyzed
act = RNNDownBeatProcessor()(path)  # neural activation
proc = DBNDownBeatTrackingProcessor(beats_per_bar=[4], fps=100)
tracked = proc(act)  # [time, beat_position]
downbeats = [t for t, pos in tracked if round(pos) == 1]  # keep only the "1"s
```
We store downbeats, not every beat, because alignment and phrasing happen at the bar level. Phrase starts are then simply every 8th downbeat.
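Deriving phrase starts from the stored downbeats can be sketched in one slice. This assumes the first tracked downbeat opens a phrase, which the text does not state; `phrase_starts` and its `first_bar` offset are hypothetical names for illustration.

```python
def phrase_starts(downbeats, bars_per_phrase=8, first_bar=0):
    """Every `bars_per_phrase`-th downbeat, starting at index `first_bar`.

    Assumption: the phrase grid is anchored on the downbeat at `first_bar`.
    """
    return downbeats[first_bar::bars_per_phrase]
```

With bars of roughly two seconds each, this yields one candidate mix point about every 16 seconds.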
The deterministic ID trick
Each track's ID is trk_<sha1(s3_uri)[:12]>. Because it is derived from the S3 URI, re-analyzing a track overwrites its metadata instead of creating a duplicate. That matters, because a duplicate in the pool means the DJ could mix a track into itself.
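A minimal sketch of that ID scheme, assuming the URI is hashed as UTF-8 and the first 12 hex characters of the digest are kept. The function name `track_id` and the example URI are illustrative.

```python
import hashlib

def track_id(s3_uri: str) -> str:
    """Deterministic ID: the same S3 URI always maps to the same track ID."""
    # Assumption: UTF-8 encoding and a hex digest, truncated to 12 characters.
    digest = hashlib.sha1(s3_uri.encode("utf-8")).hexdigest()
    return f"trk_{digest[:12]}"

# Re-ingesting the same object yields the same key, so metadata is overwritten.
same = track_id("s3://example-bucket/house/track.mp3")
again = track_id("s3://example-bucket/house/track.mp3")
```

Twelve hex characters give 48 bits, so accidental collisions are very unlikely for a library of a few thousand tracks.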
Stage 2 - Selection
The harmonic + tempo brain
Given the track that is playing, which track should come next? We score every candidate on three axes and pick a weighted-random winner from the top few. No ML, just music theory encoded as arithmetic.
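The "weighted-random winner from the top few" step might look like the sketch below. It assumes each candidate already has a positive combined score and uses the score itself as the sampling weight. `pick_next` is a hypothetical name; the default of 3 mirrors the SELECT_TOP_N knob.

```python
import random

def pick_next(scored, top_n=3, rng=random):
    """Pick one track ID from the `top_n` best (track_id, score) pairs.

    Assumption: scores are positive and double as sampling weights.
    """
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
    if not top:
        return None  # nothing eligible
    ids = [track for track, _ in top]
    weights = [score for _, score in top]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Sampling among the top few rather than always taking the best keeps the station from replaying the same chain of "perfect" neighbours.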
Harmonic mixing and the Camelot Wheel
DJs use the Camelot Wheel, a relabeling of the circle of fifths that makes "which keys mix together" trivial. Each key becomes a clock position 1 to 12 plus a letter: A = minor, B = major. Compatible moves share most of their notes: same key, a step around the wheel, or the relative major/minor. We reject anything below 0.8 compatibility outright.
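The wheel arithmetic can be sketched as follows. Only the three compatible moves and the 0.8 rejection threshold come from the text above. The individual score values (1.0, 0.9, 0.0) and the function names are placeholders chosen for illustration.

```python
def parse_camelot(code):
    """Split a Camelot code such as '8A' into (8, 'A')."""
    return int(code[:-1]), code[-1].upper()

def camelot_score(a, b):
    """Compatibility in [0, 1]; the score values are illustrative placeholders."""
    num_a, let_a = parse_camelot(a)
    num_b, let_b = parse_camelot(b)
    step = (num_a - num_b) % 12
    step = min(step, 12 - step)  # distance around the 12-position clock
    if step == 0 and let_a == let_b:
        return 1.0  # same key
    if step == 0:
        return 0.9  # relative major/minor (same number, other letter)
    if step == 1 and let_a == let_b:
        return 0.9  # one step around the wheel
    return 0.0      # everything else is treated as a clash here

def is_compatible(a, b, threshold=0.8):
    return camelot_score(a, b) >= threshold
```

The modulo step is what lets 12A and 1A count as neighbours, the same way 12 and 1 are adjacent on a clock face.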
Tempo proximity (with half / double-time)
Two tracks beatmatch only if their tempos are close; we accept within ±6%. The subtlety: beat trackers often report half- or double-time, so we test the candidate at half, normal, and double tempo.
```python
def bpm_within(deck_bpm, cand_bpm, tol=0.06):
    """Return the relative tempo gap if the candidate fits, else None."""
    for mult in (0.5, 1.0, 2.0):  # half, normal, double time
        delta = abs(cand_bpm * mult - deck_bpm) / deck_bpm
        if delta <= tol:
            return delta  # compatible (may be 0.0, so compare against None)
    return None  # too far apart
```
A 126 BPM deck happily accepts a track Essentia read as 63 BPM.
The scoring formula
Dayparting & the 24-hour cooldown
Real stations change mood through the day. Ours biases genre and tempo by the hour (house in the morning, peak techno after midnight), and the deck tempo ramps gently between targets so energy rises and falls instead of lurching. A time-stamped play history excludes anything heard in the last 24 hours; the history map prunes its expired entries on each pick, and if the library is ever too small the cooldown relaxes automatically so the stream never stalls.
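One way the cooldown could work is sketched below. The text does not say how the relaxation is done, so halving the window until enough tracks qualify is an assumption. The names `eligible_tracks` and `min_pool` are likewise hypothetical.

```python
def eligible_tracks(track_ids, last_played, now, cooldown_hours=24.0, min_pool=1):
    """Tracks outside the cooldown window; `last_played` maps id -> timestamp.

    Assumption: the window is halved until at least `min_pool` tracks qualify.
    """
    window = cooldown_hours * 3600.0
    # Self-prune: forget plays that have aged out of the full window.
    for track in [t for t, ts in last_played.items() if now - ts >= window]:
        del last_played[track]
    while True:
        pool = [t for t in track_ids
                if t not in last_played or now - last_played[t] >= window]
        if len(pool) >= min_pool or window < 1.0:
            return pool
        window /= 2.0  # relax the cooldown rather than stall the stream
```

Pruning inside the pick keeps the history map bounded by the number of tracks played in the last day, without a separate cleanup job.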
Stage 3 - Rendering
Where the magic (and the math) lives
Given track A playing and track B chosen, we render a single beat-locked audio segment that blends one into the other. Four problems to solve, in order.
3.1 - Beatmatching without changing pitch
We stretch B to the deck tempo using Rubber Band, a high-quality phase-vocoder that changes tempo without changing pitch. The stretch ratio is simply deck_bpm / track_bpm, and we fold gross octave errors toward the target first so a 62 BPM misread does not get stretched into oblivion.
Our test harness silently lacked the Rubber Band binary and fell back to a hand-rolled phase vocoder. Every incoming track sounded subtly smeared, and we nearly chased it as a musical bug. The fix was bundling the real binary and adding a 0.4% "dead zone": a tempo difference inside it (0.12%, say) is inaudible, so we do not run a stretcher for it at all. Never time-stretch for a difference you cannot hear.
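The ratio logic described in this section can be sketched in a few lines. The text only says octave errors are folded "toward the target", so the fold boundary used here (a factor of √2 either side of the deck tempo) is an assumption. `stretch_ratio` is a hypothetical name.

```python
import math

def stretch_ratio(deck_bpm, track_bpm, dead_zone=0.004):
    """Tempo multiplier for the incoming track, or 1.0 inside the dead zone."""
    if deck_bpm <= 0 or track_bpm <= 0:
        raise ValueError("tempos must be positive")
    bpm = track_bpm
    # Fold gross octave errors toward the deck tempo (assumed sqrt(2) boundary).
    while bpm * math.sqrt(2) < deck_bpm:
        bpm *= 2.0
    while bpm > deck_bpm * math.sqrt(2):
        bpm /= 2.0
    ratio = deck_bpm / bpm
    # Skip the stretcher entirely for differences too small to hear.
    return 1.0 if abs(ratio - 1.0) <= dead_zone else ratio
```

A return value of exactly 1.0 is the signal to bypass the stretcher, so the audio is passed through untouched.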
3.2 - Kick-lock: sample-accurate alignment
Metadata downbeats get us close, but "close" is not a DJ. We fine-tune by cross-correlating the low-frequency energy envelopes of the two tracks around the mix point and shifting B by the lag that maximizes overlap (searched over ±1 beat). We decimate both envelopes to ~1 kHz first, so it is cheap, and crucially it does not trust the sometimes-imperfect beat grid. The kicks line up by their actual acoustic energy.
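The lag search can be illustrated with a plain-Python sketch. It assumes the two low-band envelopes have already been extracted and decimated to a common rate, a step omitted here. The names `best_lag` and `lag_to_samples` are hypothetical, and a production version would more likely use an array library.

```python
def best_lag(env_a, env_b, max_lag):
    """Delay (in envelope samples) to apply to B so it best overlaps A.

    A positive result means B should start later; negative means earlier.
    """
    def overlap(lag):
        total = 0.0
        for i, a in enumerate(env_a):
            j = i - lag  # index into B after delaying it by `lag`
            if 0 <= j < len(env_b):
                total += a * env_b[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=overlap)

def lag_to_samples(lag, audio_rate, envelope_rate=1000):
    """Convert an envelope-rate lag back to audio samples."""
    return round(lag * audio_rate / envelope_rate)
```

At a 1 kHz envelope rate and around 126 BPM, a search of ±1 beat is only about 950 lags, which is why the decimation makes this cheap.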
3.3 - The blend: an equal-power EQ handover
This is where most auto-DJs fail. Summing two kick drums gives you mud. So like a real DJ, we never play two basses at full volume at once. We split both tracks into low and high bands at 200 Hz and treat them differently.
Why cos/sin and not a straight linear fade? Because for two uncorrelated signals, power (not amplitude) adds. A linear fade puts both gains at 0.5 at the midpoint, so total power is 0.5² + 0.5² = 0.5: a dip to 50% power (about 3 dB), which is an audible hole. The equal-power curve keeps total power constant, because cos² + sin² = 1.
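A minimal sketch of the equal-power curves follows. It assumes the low band hands over inside a sub-window of the blend, using the 0.35 to 0.65 "bass crossfade window" from the knob table. The function names are illustrative, and the high-band treatment is not shown.

```python
import math

def equal_power_gains(t):
    """(outgoing, incoming) gains at progress t; their squares always sum to 1."""
    t = min(max(t, 0.0), 1.0)
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)

def bass_gains(t, start=0.35, end=0.65):
    """Low-band gains: A keeps the bass until `start`, B owns it after `end`."""
    return equal_power_gains((t - start) / (end - start))

# Midpoint comparison: linear fade power versus equal-power fade power.
linear_power = 0.5 ** 2 + 0.5 ** 2
g_out, g_in = equal_power_gains(0.5)
equal_power = g_out ** 2 + g_in ** 2
```

Here `linear_power` comes out at 0.5 while `equal_power` stays at 1 (up to floating-point rounding), which is the "audible hole" argument in numbers.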
Our first handover swapped the bass with a hard cut, and listeners heard "two slaps." We over-corrected to a single crossover point and created a ~2-second bass hole. Plotting the low-band RMS through the blend caught it; the equal-power crossfade fixed both: constant bass energy, one kick at a time.
3.4 - Loudness compensation: killing the "tell"
The single biggest perceptual upgrade. We mix track A's quiet outro into track B's quiet intro, so the middle of every blend sagged in volume, and that dip is exactly the cue your ear uses to notice a mix is happening. We measure the short-term RMS of both sources and the blend, then apply a smoothed gain so the output tracks the energy-fair interpolation of the two inputs:
target(t) = sqrt[ (1-t)·RMS_A(t)² + t·RMS_B(t)² ]
gain(t) = clip( target(t) / RMS_blend(t), 0.75, 1.45 )
The gain is smoothed over ~250 ms so it lifts the dip without pumping. After this, testers went from "I can easily tell mixing is happening" to "Great!!!" on the same transition.
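The two formulas translate almost directly into code. In the sketch below the clip limits come from the text, while the one-pole smoother, with its time constant set to the ~250 ms figure, is an assumption about how the smoothing is done. The function names are hypothetical.

```python
import math

def loudness_gain(t, rms_a, rms_b, rms_blend, lo=0.75, hi=1.45):
    """Clipped gain that pulls the blend toward the energy-fair target."""
    target = math.sqrt((1.0 - t) * rms_a ** 2 + t * rms_b ** 2)
    if rms_blend <= 0.0:
        return 1.0  # silence: nothing to correct
    return min(max(target / rms_blend, lo), hi)

def smooth_gains(gains, hop_seconds, tau_seconds=0.25):
    """One-pole smoothing of a gain sequence (assumed smoother design)."""
    if not gains:
        return []
    alpha = 1.0 - math.exp(-hop_seconds / tau_seconds)
    state = gains[0]
    out = []
    for g in gains:
        state += alpha * (g - state)
        out.append(state)
    return out
```

For example, if both sources measure 1.0 RMS but the blend sags to 0.8, the raw gain is 1.25, comfortably inside the clip range.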
Stage 4 - Streaming
Keeping it dead simple
The rendered segments are sample-continuous (each one's blend consumes the previous one's tail), so the player needs zero DSP. Liquidsoap plays the WAV segments back to back and pushes a 128 kbps MP3 to AzuraCast, which handles the public mount and CDN. We learned the hard way that real-time DSP crossfading cannot hit sample-accurate downbeat lock; pre-rendering on the server gives true beat-phase lock. The player's only job is to not get in the way.
The biggest single win
The cue-point problem
Where in track A do we start mixing out? Our first heuristic picked the earliest calm phrase in the back half, which, on a 6 to 8 minute track, is almost always the mid-track breakdown at the 50% mark. The DJ would abandon the track right before its biggest moment. The dancefloor dies.
The fix: pick the latest phrase whose energy at the cue is calm (the true outro), using the stored per-second energy curve.
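The "latest calm phrase" rule can be sketched as a backwards scan over the phrase starts. The calm threshold of 0.4, and the requirement that a full blend still fits before the track ends, are illustrative assumptions. `pick_mix_out` is a hypothetical name.

```python
def pick_mix_out(phrase_starts, energy, blend_seconds=15.0, calm_threshold=0.4):
    """Latest phrase start (seconds) whose energy is calm, or None.

    `energy` holds one normalized value per second of the track.
    """
    duration = len(energy)
    for start in sorted(phrase_starts, reverse=True):
        if start + blend_seconds > duration:
            continue  # not enough track left to fit the blend
        if energy[int(start)] <= calm_threshold:
            return start
    return None

# Toy track: a breakdown at 50-65 s and a true outro from 90 s onward.
energy = [0.2 if (50 <= s < 65 or s >= 90) else 0.9 for s in range(120)]
phrases = [float(s) for s in range(0, 120, 15)]
cue = pick_mix_out(phrases, energy)
```

On this toy track the scan lands on the outro phrase at 105 s, where an earliest-calm rule would have picked the breakdown at 60 s.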
Tooling & tuning
The tools that made it possible
We could not tune a live stream by waiting for transitions on air, so we built a render-to-clip harness: feed it two tracks, it renders just the transition (20s of A + the blend + 20s of B) to a WAV in seconds. We paired it with recue.py, which recomputes only the mix points from a track's already-stored beat grid and energy curve, with no neural networks re-run. When we finalized the outro fix, we pushed it to all 736 existing tracks in minutes instead of re-ingesting for hours.
Every knob is an environment variable, tuned by ear:
| Knob | Value | Controls |
|---|---|---|
| BASE_BPM | 120 | Deck tempo at set open |
| MAX_BPM | 170 | Tracks faster than this are dropped |
| BLEND_BARS | 8 (~15s) | Transition length |
| CROSSOVER_HZ | 200 | Low/high split for the bass handover |
| SELECT_TOP_N | 3 | Variety: weighted random over top N |
| COOLDOWN_HOURS | 24 | No track repeats within this window |
| Bass crossfade window | 0.35 to 0.65 | Where the low end hands over |
| Loudness gain clip | 0.75 to 1.45 | Limits on the loudness lift |
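Reading knobs like these from the environment could look like the sketch below. The `knob` helper is hypothetical, and the 4-beats-per-bar assumption in `blend_seconds` matches the downbeat tracker settings shown earlier. The second function is simply the arithmetic behind the "8 (~15s)" entry.

```python
import os

def knob(name, default, cast=float, env=None):
    """Read a tuning knob from the environment, falling back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    return default if raw is None else cast(raw)

def blend_seconds(blend_bars, bpm, beats_per_bar=4):
    """Length of a blend in seconds at a given deck tempo."""
    return blend_bars * beats_per_bar * 60.0 / bpm

blend_bars = knob("BLEND_BARS", 8, int)
crossover_hz = knob("CROSSOVER_HZ", 200.0)
```

Eight bars at 128 BPM is exactly 15 seconds, and 16 seconds at the 120 BPM set-open tempo, so the blend length in seconds drifts with the deck tempo while staying phrase-aligned.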
What's next
From guessing structure to knowing it
The system still guesses song structure from loudness. The next phase teaches it to know:
- Structure detection: label every track's intro / build / drop / breakdown / outro by name, so cues snap to musical roles instead of energy levels.
- Vocal-clash avoidance: detect vocal sections and never overlap two of them.
- A transition-type engine: choose between a long blend, a quick cut, or a filter sweep based on what the two tracks actually are.
A great AI DJ is not a bigger model. It is a pile of small, correct decisions about beats, keys, energy, and loudness, each one made where a human DJ would make it. We measured every one of them.