How We Built an AI DJ That Knows When to Drop the Next Track

The goal

We wanted a radio station with no human in the chair that still sounds like there is one. Not a playlist on shuffle with a 2-second crossfade — a DJ. Something that picks a harmonically compatible next track, matches its tempo to the beat, lines the kick drums up sample-for-sample, and hands the low end over so cleanly that you have to concentrate to notice a transition happened at all.

🎯

The bar we set

A listener should not be able to tell where one track ends and the next begins. That single sentence drove every engineering decision below.

The result is live at ravist.in, streaming a continuous mix from a library of 850+ tracks (and growing toward several thousand).

Architecture

Do the hard work once, keep the live path dumb

A naive auto-DJ analyzes audio while it plays. That is a trap — analysis is slow and the live path cannot afford to stall. So we split the system in two: heavy offline analysis feeds a deliberately cheap, fast live path.

Four-stage pipeline: Ingest, Select, Render, Stream — Every track is analyzed exactly once at ingest → ~5 KB of JSON. The live engine never touches a neural network or an FFT.

This is why the whole live stack runs comfortably on a handful of CPU cores while the broadcast never skips. The live engine just reads JSON and does arithmetic.

Stage 1 — Ingest

Turning audio into a musical fingerprint

When a track lands in our S3 bucket, an ingest worker downloads it, analyzes it, and writes a metadata JSON. Here is what we extract:

Property	Tool	Why it matters
Tempo (BPM) + beat grid	Essentia RhythmExtractor2013	The foundation of beatmatching
Downbeats (bar starts)	madmom DBN tracker	Phrase-accurate mix points — the “1” of each bar
Musical key	Essentia KeyExtractor → Camelot	Harmonic mixing — no key clashes
Loudness (LUFS)	pyloudnorm (ITU-R BS.1770)	Consistent perceived volume across tracks
Energy curve	RMS per second, normalized	Finding the intro, breakdown, and outro
Tags	mutagen	Artist / title / genre for the now-playing readout
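The energy curve in the table is described only as "RMS per second, normalized". A minimal pure-Python sketch of that idea follows. It assumes mono samples as a list of floats and normalization by the loudest second (the source does not say how the curve is normalized). The name `energy_curve` is hypothetical.

```python
import math

def energy_curve(samples, sample_rate):
    """Return one normalized RMS value per whole second of mono audio."""
    rms = []
    # Step through the signal in non-overlapping one-second windows.
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        window = samples[start:start + sample_rate]
        rms.append(math.sqrt(sum(x * x for x in window) / sample_rate))
    peak = max(rms, default=0.0)
    # Assumption: scale so the loudest second is 1.0, making tracks comparable.
    return [r / peak if peak > 0 else 0.0 for r in rms]
```

A trailing partial second is dropped, which keeps the curve index equal to the time in whole seconds.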

Why two beat trackers?

Essentia gives us every beat. But a DJ does not mix on any beat — they mix on a downbeat, the “1” that starts a bar, and ideally on a phrase boundary (every 8 bars in dance music). Guessing downbeats as “every 4th beat” is off by one or two beats often enough to ruin a transition. So we run madmom's DBN downbeat tracker on top:

from madmom.features.downbeats import (
    DBNDownBeatTrackingProcessor, RNNDownBeatProcessor,
)
act = RNNDownBeatProcessor()(path)                       # neural activation
proc = DBNDownBeatTrackingProcessor(beats_per_bar=[4], fps=100)
tracked = proc(act)                                      # [time, beat_position]
downbeats = [t for t, pos in tracked if round(pos) == 1] # keep only the "1"s

We store downbeats, not every beat — alignment and phrasing happen at the bar level. Phrase starts are then simply every 8th downbeat.
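As a rough illustration of "every 8th downbeat", the slice below assumes the first detected downbeat is also the start of a phrase, which the source does not state. `phrase_starts` is a hypothetical helper name.

```python
def phrase_starts(downbeats, bars_per_phrase=8):
    """Return every Nth downbeat time, treating the first downbeat as bar 1."""
    # Assumption: phrases are counted from the first detected downbeat.
    return list(downbeats[::bars_per_phrase])

# Example: downbeats two seconds apart (a 4/4 bar at 120 BPM).
example = phrase_starts([i * 2.0 for i in range(20)])
```

A track with a pickup bar or an odd-length intro would need an offset before slicing.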

The deterministic ID trick

Each track's ID is trk_<sha1(s3_uri)[:12]>. Because it is derived from the S3 URI, re-analyzing a track overwrites its metadata instead of creating a duplicate — which matters, because a duplicate in the pool means the DJ could mix a track into itself.
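The ID scheme above can be sketched in a few lines of standard-library Python. The function name and the example URI are illustrative, and UTF-8 encoding of the URI is an assumption.

```python
import hashlib

def track_id(s3_uri: str) -> str:
    """Derive a stable ID from the S3 URI: trk_ plus 12 hex chars of SHA-1."""
    # Assumption: the URI is hashed as UTF-8 bytes.
    digest = hashlib.sha1(s3_uri.encode("utf-8")).hexdigest()
    return "trk_" + digest[:12]

# Hypothetical URI, for illustration only.
first = track_id("s3://example-bucket/tracks/a.flac")
again = track_id("s3://example-bucket/tracks/a.flac")
```

Because the same URI always hashes to the same ID, writing metadata keyed by that ID is naturally an overwrite.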

Stage 2 — Selection

The harmonic + tempo brain

Given the track that is playing, which track should come next? We score every candidate on three axes and pick a weighted-random winner from the top few. No ML — just music theory encoded as arithmetic.

Harmonic mixing and the Camelot Wheel

DJs use the Camelot Wheel, a relabeling of the circle of fifths that makes “which keys mix together” trivial. Each key becomes a clock position 1–12 plus a letter: A = minor, B = major. Compatible moves share most of their notes — same key, a step around the wheel, or the relative major/minor. We reject anything below 0.8 compatibility outright.

The Camelot Wheel with 8A and its compatible neighbours highlighted — An 8A track is only ever followed by 7A, 8A, 9A, or 8B — never a jarring 3B.
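The neighbour rule in that caption (same key, one step either way around the wheel, or the relative major/minor) can be sketched as a small set-building function. `camelot_neighbours` is a hypothetical name. The source's numeric compatibility scores are not given, so this sketch returns only the allowed set.

```python
def camelot_neighbours(key: str) -> set:
    """Return the Camelot keys considered compatible with `key`, e.g. '8A'."""
    number, letter = int(key[:-1]), key[-1].upper()
    up = number % 12 + 1            # one step clockwise, 12 wraps to 1
    down = (number - 2) % 12 + 1    # one step anticlockwise, 1 wraps to 12
    other = "B" if letter == "A" else "A"   # relative major/minor
    return {
        f"{number}{letter}",
        f"{up}{letter}",
        f"{down}{letter}",
        f"{number}{other}",
    }
```

A candidate whose key is outside this set would be rejected before tempo or energy are even considered.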

Tempo proximity (with half / double-time)

Two tracks beatmatch only if their tempos are close — we accept within ±6%. The subtlety: beat trackers often report half- or double-time, so we test all three octaves.

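def bpm_within(deck_bpm, cand_bpm, tol=0.06):
    """Return the relative tempo gap if the candidate fits the deck, else None.

    An exact match returns 0.0, so callers should test `is not None`.
    """
    for mult in (0.5, 1.0, 2.0):           # half, normal, double time
        delta = abs(cand_bpm * mult - deck_bpm) / deck_bpm
        if delta <= tol:
            return delta                   # compatible
    return None                            # too far apart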

A 126 BPM deck happily accepts a track Essentia read as 63 BPM.

The scoring formula

Weighted scoring: score = 0.5 × harmonic + 0.4 × bpm + 0.1 × energy. Candidates below 0.8 harmonic compatibility are rejected, then the winner is a score-weighted random draw from the top 3, which keeps the station varied.
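A sketch of the selection step, using the weights, the 0.8 cut-off, and the top-3 draw quoted above. How each sub-score is computed is not described in the source, so this version assumes every candidate arrives with three sub-scores already in the range 0 to 1. All names here are hypothetical.

```python
import random

WEIGHTS = {"harmonic": 0.5, "bpm": 0.4, "energy": 0.1}
MIN_HARMONIC = 0.8
TOP_N = 3

def total_score(sub):
    """Weighted sum of the three sub-scores."""
    return sum(WEIGHTS[name] * sub[name] for name in WEIGHTS)

def pick_next(candidates, rng=random):
    """candidates maps track ID to {'harmonic': h, 'bpm': b, 'energy': e}."""
    # Hard reject on key clash first, then score whatever survives.
    eligible = {
        tid: total_score(sub)
        for tid, sub in candidates.items()
        if sub["harmonic"] >= MIN_HARMONIC
    }
    if not eligible:
        return None
    top = sorted(eligible.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]
    ids = [tid for tid, _ in top]
    weights = [score for _, score in top]
    # Score-weighted random draw among the best few.
    return rng.choices(ids, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` as `rng` makes a pick reproducible, which helps when replaying a transition in tests.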

Dayparting & the 24-hour cooldown

Real stations change mood through the day. Ours biases genre and tempo by the hour (house in the morning, peak techno after midnight), and the deck tempo ramps gently between targets so energy rises and falls instead of lurching. A time-stamped play history excludes anything heard in the last 24 hours. The history map prunes its expired entries on every pick, and if the library is ever too small to fill the window, the cooldown relaxes automatically so the stream never stalls.
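One way the cooldown could work is sketched below. The source does not say how the cooldown relaxes, so halving the window until something qualifies is an assumption, as are the function name and the one-minute floor.

```python
COOLDOWN_SECONDS = 24 * 3600

def eligible_tracks(library, last_played, now, cooldown=COOLDOWN_SECONDS):
    """Return track IDs not heard within the cooldown, relaxing it if needed.

    last_played maps track ID to the Unix time it last aired (mutated in place).
    """
    # Self-prune: forget plays that have aged out of the full window.
    for tid in [t for t, ts in last_played.items() if now - ts >= cooldown]:
        del last_played[tid]

    window = cooldown
    while window >= 60:
        pool = [
            t for t in library
            if now - last_played.get(t, float("-inf")) >= window
        ]
        if pool:
            return pool
        window /= 2  # assumption: relax by halving when the library is too small
    # Last resort so the stream never stalls: allow everything.
    return list(library)
```

Only the full 24-hour window is used for pruning, so a temporary relaxation never erases history that is still relevant on the next pick.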

Stage 3 — Rendering

Where the magic (and the math) lives

Given track A playing and track B chosen, we render a single beat-locked audio segment that blends one into the other. Four problems to solve, in order.

3.1 — Beatmatching without changing pitch

We stretch B to the deck tempo using Rubber Band, a high-quality phase vocoder that changes tempo without changing pitch. The stretch ratio is simply deck_bpm / track_bpm (a tempo multiplier, so values above 1 speed the track up), and we first fold gross octave errors toward the target, so a 62 BPM misread is doubled to 124 before the ratio is computed rather than stretched into oblivion.

🐛

A lesson that cost us a day

Our test harness silently lacked the Rubber Band binary and fell back to a hand-rolled phase vocoder. Every incoming track sounded subtly smeared, and we nearly chased it as a musical bug. The fix was bundling the real binary and adding a 0.4% “dead-zone”: a tempo difference that small (0.12%, say) is inaudible, so we do not run a stretcher for it at all. Never time-stretch for a difference you cannot hear.
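The ratio, the octave fold, and the dead-zone fit together roughly as follows. This is a sketch under assumptions: the fold picks whichever of half, normal, or double time is closest to the deck tempo on a log scale, and returning `None` stands for "skip the stretcher". Both function names are hypothetical.

```python
import math

def fold_octave(track_bpm, deck_bpm):
    """Pick half, normal, or double time, whichever lands nearest the deck."""
    candidates = (track_bpm * 0.5, track_bpm, track_bpm * 2.0)
    return min(candidates, key=lambda bpm: abs(math.log2(bpm / deck_bpm)))

def stretch_ratio(track_bpm, deck_bpm, dead_zone=0.004):
    """Tempo multiplier for the incoming track, or None inside the dead-zone."""
    ratio = deck_bpm / fold_octave(track_bpm, deck_bpm)
    if abs(ratio - 1.0) <= dead_zone:
        return None  # difference is inaudible: leave the audio untouched
    return ratio
```

With a 126 BPM deck, a track read as 62 BPM is treated as 124 and sped up by under 2%, while a 125.85 BPM track passes through unprocessed.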

3.2 — Kick-lock: sample-accurate alignment

Metadata downbeats get us close, but “close” is not a DJ. We fine-tune by cross-correlating the low-frequency energy envelopes of the two tracks around the mix point and shifting B by the lag that maximizes overlap (searched over ±1 beat). We decimate both envelopes to ~1 kHz first, so it is cheap — and crucially it does not trust the sometimes-imperfect beat grid. The kicks line up by their actual acoustic energy.
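The lag search at the heart of kick-lock can be sketched in plain Python. It assumes the two low-frequency envelopes have already been extracted and decimated (that filtering step is omitted here) and are plain lists at the same rate. `best_lag` is a hypothetical name.

```python
def best_lag(env_a, env_b, max_lag):
    """Return the lag, in envelope samples, that maximizes the overlap.

    A positive result means B's kicks land later than A's, so B should be
    shifted earlier by that many samples (and later for a negative result).
    """
    def overlap(lag):
        # Sum of env_a[i] * env_b[i + lag] over the indices valid for both.
        lo = max(0, -lag)
        hi = min(len(env_a), len(env_b) - lag)
        return sum(env_a[i] * env_b[i + lag] for i in range(lo, hi))

    return max(range(-max_lag, max_lag + 1), key=overlap)
```

At an envelope rate of about 1 kHz and 120 BPM, one beat is 500 samples, so `max_lag=500` would cover the ±1 beat search described above. The lag then needs scaling back to the full audio sample rate before shifting B.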

3.3 — The blend: an equal-power EQ handover

This is where most auto-DJs fail. Summing two kick drums gives you mud. So like a real DJ, we never play two basses at full volume at once. We split both tracks into low and high bands at 200 Hz and treat them differently.

200 Hz band split: full crossfade on highs, quick centered crossfade on lows — Highs cross over the whole blend; the low band hands over fast and centered. Because the kicks are phase-locked, the brief overlap reads as one reinforced kick.

Why cos/sin and not a straight linear fade? Because for two uncorrelated signals, power (not amplitude) adds. A linear fade dips to 50% power (about -3 dB) at the midpoint, which is an audible hole. The equal-power curve keeps total power constant.

Equal-power versus linear crossfade: with θ = t·π/2, cos²(θ) + sin²(θ) = 1, so total power is constant at every blend position t. A linear fade audibly dips to 0.5 at the midpoint.
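A small sketch of the gain curves, assuming blend progress t runs from 0 to 1 and that the 0.35 to 0.65 bass window listed in the tuning table is a fraction of the blend. The function names are hypothetical.

```python
import math

def equal_power_gains(t):
    """Return (outgoing, incoming) gains for progress t in 0..1."""
    theta = t * math.pi / 2
    return math.cos(theta), math.sin(theta)

def low_band_progress(t, start=0.35, end=0.65):
    """Remap progress so the bass hands over only inside [start, end]."""
    return min(1.0, max(0.0, (t - start) / (end - start)))

def blend_gains(t):
    """Gains per band: highs fade across the whole blend, lows in the window."""
    return {
        "high": equal_power_gains(t),
        "low": equal_power_gains(low_band_progress(t)),
    }

# Power of a linear fade at the midpoint, for comparison.
linear_mid_power = 0.5 ** 2 + 0.5 ** 2
```

Before the window opens the low band is entirely track A, after it closes entirely track B, and in between the squared gains still sum to 1.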

🔊

Two bugs we heard before we measured them

Our first handover swapped the bass with a hard cut — listeners heard “two slaps.” We over-corrected to a single crossover point and created a ~2-second bass hole. Plotting the low-band RMS through the blend caught it; the equal-power crossfade fixed both: constant bass energy, one kick at a time.

3.4 — Loudness compensation: killing the “tell”

The single biggest perceptual upgrade. Because we mix track A's quiet outro into track B's quiet intro, the middle of every blend sagged in volume, and that dip is exactly the cue your ear uses to notice a mix is happening. We measure the short-term RMS of both sources and of the blend, then apply a smoothed gain so the output tracks the energy-fair interpolation of the two inputs, where t runs from 0 to 1 across the blend:

target(t) = sqrt[ (1-t)·RMS_A(t)²  +  t·RMS_B(t)² ]
gain(t)   = clip( target(t) / RMS_blend(t),  0.75,  1.45 )

The gain is smoothed over ~250 ms so it lifts the dip without pumping. After this, testers went from “I can easily tell mixing is happening” to “Great!!!” on the same transition.
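The two formulas translate almost directly into code. The sketch below evaluates them for one analysis frame and then smooths a sequence of gains. The one-pole smoother and the `alpha` parameter are assumptions, since the source says only that the gain is smoothed over about 250 ms. All names are hypothetical.

```python
import math

def compensation_gain(t, rms_a, rms_b, rms_blend, lo=0.75, hi=1.45):
    """Gain that pulls the blend toward the energy-fair target at progress t."""
    target = math.sqrt((1 - t) * rms_a ** 2 + t * rms_b ** 2)
    if rms_blend <= 0:
        return 1.0  # silent frame: nothing to scale
    return min(hi, max(lo, target / rms_blend))

def smooth(gains, alpha):
    """One-pole smoothing (assumed); smaller alpha means slower movement."""
    out = []
    state = gains[0] if gains else 1.0
    for g in gains:
        state += alpha * (g - state)
        out.append(state)
    return out
```

For a frame hop of h seconds, an alpha of roughly 1 - exp(-h / 0.25) would give a time constant near the 250 ms mentioned above.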

Stage 4 — Streaming

Keeping it dead simple

The rendered segments are sample-continuous — each one's blend consumes the previous one's tail — so the player needs zero DSP. Liquidsoap plays the WAV segments back to back and pushes a 128 kbps MP3 to AzuraCast, which handles the public mount and CDN. We learned the hard way that real-time DSP crossfading cannot hit sample-accurate downbeat lock; pre-rendering on the server gives true beat-phase lock. The player's only job is to not get in the way.

The biggest single win

The cue-point problem

Where in track A do we start mixing out? Our first heuristic picked the earliest calm phrase in the back half — which, on a 6–8 minute track, is almost always the mid-track breakdown at the 50% mark. The DJ would abandon the track right before its biggest moment. The dancefloor dies.

The fix: pick the latest phrase whose energy at the cue is calm — the true outro — using the stored per-second energy curve.

Energy curve showing the wrong mid-track mix-out versus the correct late outro mix-out — Across the library this moved 709 of 736 tracks from a ~50% exit to ~85–98%. Blind A/B quality jumped from 5/10 to ~8/10 from this change alone.
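The "latest calm phrase" rule can be sketched as a reverse scan over the phrase starts. The calm threshold of 0.4 and the requirement that a full blend still fits before the track ends are assumptions for illustration, not values from the source. `pick_mix_out` is a hypothetical name.

```python
def pick_mix_out(phrase_starts, energy, calm=0.4, blend_seconds=15.0):
    """Return the latest phrase start (seconds) that is calm, or None.

    energy holds one normalized value per second of the track.
    """
    duration = len(energy)
    for start in reversed(phrase_starts):
        if duration - start < blend_seconds:
            continue  # assumption: skip phrases too close to the end to blend
        if energy[int(start)] <= calm:
            return start
    return None
```

On a track with a mid-track breakdown and a quiet outro, the scan reaches the outro first and never considers the breakdown.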

Tooling & tuning

The tools that made it possible

We could not tune a live stream by waiting for transitions on air, so we built a render-to-clip harness: feed it two tracks, it renders just the transition (20s of A + the blend + 20s of B) to a WAV in seconds. We paired it with recue.py, which recomputes only the mix points from a track's already-stored beat grid and energy curve — no neural networks re-run. When we finalized the outro fix, we pushed it to all 736 existing tracks in minutes instead of re-ingesting for hours.

Every knob is an environment variable, tuned by ear:

Knob	Value	Controls
BASE_BPM	120	Deck tempo at set open
MAX_BPM	170	Tracks faster than this are dropped
BLEND_BARS	8 (~15s)	Transition length
CROSSOVER_HZ	200	Low/high split for the bass handover
SELECT_TOP_N	3	Variety — weighted random over top N
COOLDOWN_HOURS	24	No track repeats within this window
Bass crossfade window	0.35–0.65	Where the low end hands over
Loudness gain clip	0.75–1.45	Limits on the loudness lift
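Reading the named knobs from the environment might look like the sketch below. Only the variable names and default values shown in the table are used. The last two rows have no variable name in the source, so they are left out. `load_knobs` and `blend_seconds` are hypothetical helpers.

```python
import os

def load_knobs(env=os.environ):
    """Read tuning knobs from the environment, falling back to the defaults."""
    return {
        "BASE_BPM": float(env.get("BASE_BPM", "120")),
        "MAX_BPM": float(env.get("MAX_BPM", "170")),
        "BLEND_BARS": int(env.get("BLEND_BARS", "8")),
        "CROSSOVER_HZ": float(env.get("CROSSOVER_HZ", "200")),
        "SELECT_TOP_N": int(env.get("SELECT_TOP_N", "3")),
        "COOLDOWN_HOURS": float(env.get("COOLDOWN_HOURS", "24")),
    }

def blend_seconds(knobs, deck_bpm, beats_per_bar=4):
    """Transition length in seconds at the given deck tempo (4/4 assumed)."""
    return knobs["BLEND_BARS"] * beats_per_bar * 60.0 / deck_bpm
```

Eight bars of 4/4 last 15 seconds at 128 BPM and 16 seconds at 120 BPM, which matches the "~15s" in the table.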

What's next

From guessing structure to knowing it

The system still guesses song structure from loudness. The next phase teaches it to know:

Structure detection — label every track's intro / build / drop / breakdown / outro by name, so cues snap to musical roles instead of energy levels.
Vocal-clash avoidance — detect vocal sections and never overlap two of them.
A transition-type engine — choose between a long blend, a quick cut, or a filter sweep based on what the two tracks actually are.

🎚️

The throughline

A great AI DJ is not a bigger model — it is a pile of small, correct decisions about beats, keys, energy, and loudness, each one made where a human DJ would make it. We measured every one of them.

Hear it live at ravist.in →
