Engineering · 🎧 AI DJ · 🎚️ Beat-locked radio

How We Built a Robot That Actually Knows When to Drop the Next Track

Engineering notes from the team behind Ravist Radio, a 24/7 AI DJ that mixes EDM, house, and techno into one continuous, beat-locked broadcast.

✍️ Shahzoor Ali · 🕒 12 min read · 📻 Listen live

The goal

We wanted a radio station with no human in the chair that still sounds like there is one. Not a playlist on shuffle with a 2-second crossfade, but a DJ: something that picks a harmonically compatible next track, matches its tempo to the beat, lines the kick drums up sample-for-sample, and hands the low end over so cleanly that you have to concentrate to notice a transition happened at all.

🎯
The bar we set

A listener should not be able to tell where one track ends and the next begins. That single sentence drove every engineering decision below.

The result is live at ravist.in, streaming a continuous mix from a library of 850+ tracks (and growing toward several thousand).

Architecture

Do the hard work once, keep the live path dumb

A naive auto-DJ analyzes audio while it plays. That is a trap: analysis is slow and the live path cannot afford to stall. So we split the system in two, with heavy offline analysis feeding a deliberately cheap, fast live path.

[Figure: Four-stage pipeline: Ingest, Select, Render, Stream]
Every track is analyzed exactly once at ingest, producing ~5 KB of JSON. The live engine never touches a neural network or an FFT.

This is why the whole live stack runs comfortably on a handful of CPU cores while the broadcast never skips. The live engine just reads JSON and does arithmetic.

Stage 1: Ingest

Turning audio into a musical fingerprint

When a track lands in our S3 bucket, an ingest worker downloads it, analyzes it, and writes a metadata JSON. Here is what we extract:

| Property | Tool | Why it matters |
|---|---|---|
| Tempo (BPM) + beat grid | Essentia RhythmExtractor2013 | The foundation of beatmatching |
| Downbeats (bar starts) | madmom DBN tracker | Phrase-accurate mix points: the “1” of each bar |
| Musical key | Essentia KeyExtractor → Camelot | Harmonic mixing, no key clashes |
| Loudness (LUFS) | pyloudnorm (ITU-R BS.1770) | Consistent perceived volume across tracks |
| Energy curve | RMS per second, normalized | Finding the intro, breakdown, and outro |
| Tags | mutagen | Artist / title / genre for the now-playing readout |
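To make the "RMS per second, normalized" row concrete, here is a minimal pure-Python sketch of an energy curve. It assumes mono float samples already decoded into a list, and it normalizes to the loudest second; the table does not say how the real pipeline normalizes, so treat that choice and the function name `energy_curve` as illustrative.

```python
import math


def energy_curve(samples, sample_rate):
    """Return one RMS value per second of audio, scaled so the peak is 1.0.

    `samples` is a list of mono floats. Peak normalization is an assumption
    made for this sketch, not a detail taken from the article.
    """
    curve = []
    for start in range(0, len(samples), sample_rate):
        window = samples[start:start + sample_rate]
        mean_square = sum(x * x for x in window) / len(window)
        curve.append(math.sqrt(mean_square))
    peak = max(curve, default=0.0)
    if peak <= 0.0:
        return curve  # empty or silent input: nothing to scale
    return [value / peak for value in curve]


# Two "seconds" of toy audio at a 4-sample rate: quiet, then loud.
curve = energy_curve([0.5] * 4 + [1.0] * 4, sample_rate=4)
```

A production version would operate on arrays rather than Python lists, but the arithmetic is the same.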

Why two beat trackers?

Essentia gives us every beat. But a DJ does not mix on just any beat. They mix on a downbeat, the “1” that starts a bar, and ideally on a phrase boundary (every 8 bars in dance music). Guessing downbeats as “every 4th beat” is off by one or two beats often enough to ruin a transition. So we run madmom's DBN downbeat tracker on top:

from madmom.features.downbeats import (
    DBNDownBeatTrackingProcessor, RNNDownBeatProcessor,
)
act = RNNDownBeatProcessor()(path)                       # neural activation
proc = DBNDownBeatTrackingProcessor(beats_per_bar=[4], fps=100)
tracked = proc(act)                                      # [time, beat_position]
downbeats = [t for t, pos in tracked if round(pos) == 1] # keep only the "1"s

We store downbeats, not every beat, because alignment and phrasing happen at the bar level. Phrase starts are then simply every 8th downbeat.
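Deriving phrase starts from the stored downbeats is a one-line slice. The sketch below assumes the first detected downbeat also begins a phrase; the `first_phrase_bar` offset is a hypothetical parameter for tracks where that is not true.

```python
def phrase_starts(downbeats, bars_per_phrase=8, first_phrase_bar=0):
    """Return every `bars_per_phrase`-th downbeat time (seconds).

    `first_phrase_bar` is a hypothetical offset for tracks whose first
    phrase does not begin on the first detected downbeat.
    """
    return downbeats[first_phrase_bar::bars_per_phrase]


# Toy grid: 20 bars, one downbeat every 1.9 s (roughly 126 BPM in 4/4).
downbeats = [round(i * 1.9, 3) for i in range(20)]
starts = phrase_starts(downbeats)
```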

The deterministic ID trick

Each track's ID is trk_<sha1(s3_uri)[:12]>. Because it is derived from the S3 URI, re-analyzing a track overwrites its metadata instead of creating a duplicate. That matters, because a duplicate in the pool means the DJ could mix a track into itself.
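In Python the ID scheme could look like the sketch below. It assumes the URI is UTF-8 encoded and that the 12 characters are taken from the hex digest; the bucket path is a made-up example.

```python
import hashlib


def track_id(s3_uri: str) -> str:
    """Build a stable ID: 'trk_' plus the first 12 hex chars of SHA-1(uri).

    UTF-8 encoding and the hex digest are assumptions of this sketch.
    """
    digest = hashlib.sha1(s3_uri.encode("utf-8")).hexdigest()
    return f"trk_{digest[:12]}"


# Hypothetical URI, for illustration only.
uri = "s3://example-bucket/tracks/some-track.mp3"
first = track_id(uri)
second = track_id(uri)  # re-ingesting the same object yields the same ID
```

Because the ID depends only on the URI, writing metadata keyed by `track_id(uri)` is naturally idempotent.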

Stage 2: Selection

The harmonic + tempo brain

Given the track that is playing, which track should come next? We score every candidate on three axes and pick a weighted-random winner from the top few. No ML, just music theory encoded as arithmetic.

Harmonic mixing and the Camelot Wheel

DJs use the Camelot Wheel, a relabeling of the circle of fifths that makes “which keys mix together” trivial. Each key becomes a clock position 1–12 plus a letter: A = minor, B = major. Compatible moves share most of their notes: same key, one step around the wheel, or the relative major/minor. We reject anything below 0.8 compatibility outright.

[Figure: The Camelot Wheel with 8A and its compatible neighbours highlighted]
An 8A track is only ever followed by 7A, 8A, 9A, or 8B, never a jarring 3B.
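The wheel rules reduce to a few integer comparisons. The sketch below is one way to encode them; the specific scores (1.0 for the same key, 0.9 for a neighbour or the relative major/minor, 0.0 otherwise) are assumptions chosen only so that the 0.8 cutoff admits exactly the moves described above.

```python
def parse_camelot(code: str):
    """Split a Camelot code such as '8A' into (8, 'A')."""
    return int(code[:-1]), code[-1].upper()


def camelot_compat(current: str, candidate: str) -> float:
    """Score harmonic compatibility in [0, 1]. Score values are assumed."""
    num_a, letter_a = parse_camelot(current)
    num_b, letter_b = parse_camelot(candidate)
    if (num_a, letter_a) == (num_b, letter_b):
        return 1.0  # same key
    if letter_a == letter_b and (num_a - num_b) % 12 in (1, 11):
        return 0.9  # one step around the wheel (12 wraps to 1)
    if num_a == num_b:
        return 0.9  # relative major/minor: same number, other letter
    return 0.0      # everything else is rejected by the 0.8 cutoff


wheel = [f"{n}{letter}" for n in range(1, 13) for letter in "AB"]
allowed_after_8a = sorted(k for k in wheel if camelot_compat("8A", k) >= 0.8)
```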

Tempo proximity (with half / double-time)

Two tracks beatmatch only if their tempos are close; we accept within ±6%. The subtlety: beat trackers often report half- or double-time, so we test the candidate at half, normal, and double tempo.

def bpm_within(deck_bpm, cand_bpm, tol=0.06):
    for mult in (0.5, 1.0, 2.0):           # half, normal, double time
        delta = abs(cand_bpm * mult - deck_bpm) / deck_bpm
        if delta <= tol:
            return delta                   # compatible
    return None                            # too far apart

A 126 BPM deck happily accepts a track Essentia read as 63 BPM.

The scoring formula

[Figure: Weighted scoring: 0.5 harmonic, 0.4 BPM, 0.1 energy]
Reject below 0.8 harmonic compatibility, then draw a score-weighted random pick from the top 3 for guaranteed variety.
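A minimal sketch of the scoring and pick, assuming each candidate already carries three sub-scores in [0, 1]. How the BPM delta and energy difference are turned into those sub-scores is not described in the article, so this sketch takes them as inputs; the tuple layout and function names are hypothetical.

```python
import random


def score(harmonic, bpm, energy):
    """Weighted sum using the article's weights: 0.5 / 0.4 / 0.1."""
    return 0.5 * harmonic + 0.4 * bpm + 0.1 * energy


def pick_next(candidates, top_n=3, rng=random):
    """Pick the next track ID, or None if nothing is harmonically compatible.

    `candidates` is a list of (track_id, harmonic, bpm, energy) tuples,
    a layout assumed for this sketch.
    """
    eligible = [c for c in candidates if c[1] >= 0.8]  # hard harmonic gate
    if not eligible:
        return None
    ranked = sorted(eligible, key=lambda c: score(*c[1:]), reverse=True)
    top = ranked[:top_n]
    weights = [score(*c[1:]) for c in top]  # always > 0 after the gate
    return rng.choices(top, weights=weights, k=1)[0][0]


candidates = [
    ("a", 1.0, 0.9, 0.5),
    ("b", 0.9, 1.0, 0.5),
    ("c", 0.8, 0.5, 0.5),
    ("d", 0.9, 0.1, 0.1),
    ("e", 0.5, 1.0, 1.0),  # fails the harmonic gate despite a perfect tempo
]
seeded = random.Random(0)
picks = {pick_next(candidates, rng=seeded) for _ in range(200)}
```

Passing a seeded `random.Random` makes a pick reproducible, which is handy when replaying a transition in a test harness.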

Dayparting & the 24-hour cooldown

Real stations change mood through the day. Ours biases genre and tempo by the hour (house in the morning, peak techno after midnight), and the deck tempo ramps gently between targets so energy rises and falls instead of lurching. A time-stamped play history excludes anything heard in the last 24 hours. The history map prunes its expired entries on each pick, and if the library is ever too small, the cooldown relaxes automatically so the stream never stalls.
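The cooldown logic can be sketched as a dictionary of last-played timestamps. The relaxation strategy here (halve the window until something qualifies, then give up and allow everything) is an assumption; the article only says the cooldown relaxes automatically.

```python
def eligible_tracks(library, last_played, now, cooldown_hours=24.0):
    """Return track IDs outside the cooldown window.

    `last_played` maps track ID -> Unix timestamp and is pruned in place.
    Halving the window on an empty pool is an assumed relaxation rule.
    """
    window = cooldown_hours * 3600.0
    # Self-prune: forget plays that are already older than the full window.
    for tid in [t for t, ts in last_played.items() if now - ts >= window]:
        del last_played[tid]
    while window >= 60.0:
        pool = [
            t for t in library
            if t not in last_played or now - last_played[t] >= window
        ]
        if pool:
            return pool
        window /= 2.0  # library too small: relax the cooldown
    return list(library)  # last resort so the stream never stalls


now = 1_000_000.0
hour = 3600.0
history = {"a": now - 1 * hour, "b": now - 30 * hour, "c": now - 2 * hour}
pool = eligible_tracks(["a", "b", "c"], history, now)
```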

Stage 3: Rendering

Where the magic (and the math) lives

Given track A playing and track B chosen, we render a single beat-locked audio segment that blends one into the other. Four problems to solve, in order.

3.1 Beatmatching without changing pitch

We stretch B to the deck tempo using Rubber Band, a high-quality phase-vocoder that changes tempo without changing pitch. The stretch ratio is simply deck_bpm / track_bpm, and we fold gross octave errors toward the target first so a 62 BPM misread does not get stretched into oblivion.

🐛
A lesson that cost us a day

Our test harness silently lacked the Rubber Band binary and fell back to a hand-rolled phase vocoder. Every incoming track sounded subtly smeared, and we nearly chased it as a musical bug. The fix was bundling the real binary and adding a 0.4% “dead-zone”: a 0.12% tempo difference is inaudible, so we do not run a stretcher for it at all. Never time-stretch for a difference you cannot hear.
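The ratio, the octave fold, and the dead-zone fit in a few lines. The sketch below assumes positive BPM values and folds by doubling or halving until the track tempo is within a factor of √2 of the deck tempo; that folding threshold is an assumption, and the function names are hypothetical.

```python
def fold_to_target(track_bpm, deck_bpm):
    """Double or halve track_bpm until it is within sqrt(2)x of deck_bpm.

    Assumes both tempos are positive. The sqrt(2) bound is an assumed rule.
    """
    bpm = track_bpm
    while bpm < deck_bpm / 2 ** 0.5:
        bpm *= 2.0
    while bpm > deck_bpm * 2 ** 0.5:
        bpm /= 2.0
    return bpm


def stretch_ratio(track_bpm, deck_bpm, dead_zone=0.004):
    """Return deck_bpm / folded track_bpm, or None inside the 0.4% dead-zone."""
    ratio = deck_bpm / fold_to_target(track_bpm, deck_bpm)
    if abs(ratio - 1.0) <= dead_zone:
        return None  # inaudible difference: skip the stretcher entirely
    return ratio


misread = stretch_ratio(62.0, 124.0)     # half-time misread, folds to 124
tiny = stretch_ratio(126.15, 126.0)      # about 0.12% off
real = stretch_ratio(120.0, 126.0)       # a genuine 5% speed-up
```

Returning `None` rather than `1.0` lets the caller bypass the stretcher instead of running it at a no-op ratio.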

3.2 Kick-lock: sample-accurate alignment

Metadata downbeats get us close, but “close” is not a DJ. We fine-tune by cross-correlating the low-frequency energy envelopes of the two tracks around the mix point and shifting B by the lag that maximizes overlap (searched over ±1 beat). We decimate both envelopes to ~1 kHz first, so it is cheap. Crucially, it does not trust the sometimes-imperfect beat grid: the kicks line up by their actual acoustic energy.
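The lag search itself is a small brute-force loop. The sketch below works on plain Python lists and toy pulse trains; `kick_envelope` is a hypothetical stand-in for a real low-band envelope, and a production version would more likely use array operations. At an envelope rate of ~1 kHz, ±1 beat at 126 BPM is roughly ±476 samples, so the search stays cheap.

```python
def best_lag(env_a, env_b, max_lag):
    """Return the shift of env_b (in samples) that best overlaps env_a.

    A positive result means B should be delayed by that many samples.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, a in enumerate(env_a):
            j = i - lag  # index into B after shifting it by `lag`
            if 0 <= j < len(env_b):
                total += a * env_b[j]
        if total > best_score:
            best, best_score = lag, total
    return best


def kick_envelope(offset, length=200, period=50):
    """Toy envelope: a unit pulse every `period` samples (hypothetical)."""
    return [1.0 if (i - offset) % period == 0 else 0.0 for i in range(length)]


# B's kicks land 3 samples early, so B must be delayed by 3.
lag = best_lag(kick_envelope(10), kick_envelope(7), max_lag=20)
```

Keeping `max_lag` below half the kick period, as here, avoids locking onto the neighbouring kick.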

3.3 The blend: an equal-power EQ handover

This is where most auto-DJs fail. Summing two kick drums gives you mud. So like a real DJ, we never play two basses at full volume at once. We split both tracks into low and high bands at 200 Hz and treat them differently.

[Figure: 200 Hz band split: full crossfade on highs, quick centered crossfade on lows]
Highs cross over the whole blend; the low band hands over fast and centered. Because the kicks are phase-locked, the brief overlap reads as one reinforced kick.

Why cos/sin and not a straight linear fade? Because for two uncorrelated signals, power (not amplitude) adds. A linear fade dips to 50% power at the midpoint, an audible hole. The equal-power curve keeps total power constant.

[Figure: Equal-power crossfade keeps cos squared plus sin squared at 1; linear fade dips to 0.5]
cos²(θ) + sin²(θ) = 1, so power stays constant for every t. A linear fade audibly dips to 0.5.
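As a sketch, the two gain curves can be written as below, assuming the usual mapping θ = t·π/2 for blend position t in [0, 1]. The low band reuses the same curve squeezed into the centered 0.35 to 0.65 window listed among the tuning knobs; applying it that way is an assumption about how the window is used.

```python
import math


def equal_power_gains(t):
    """Return (outgoing_gain, incoming_gain) at blend position t in [0, 1]."""
    t = min(1.0, max(0.0, t))
    theta = t * math.pi / 2  # assumed mapping from position to angle
    return math.cos(theta), math.sin(theta)


def low_band_gains(t, start=0.35, end=0.65):
    """Same curve, compressed into the centered bass handover window."""
    return equal_power_gains((t - start) / (end - start))


# Power at the midpoint: equal-power holds at 1.0, linear drops to 0.5.
out_gain, in_gain = equal_power_gains(0.5)
equal_power_mid = out_gain ** 2 + in_gain ** 2
linear_mid = 0.5 ** 2 + 0.5 ** 2
```

Outside the window the low band belongs entirely to one track, which is what keeps two basslines from ever playing at full level together.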
🔊
Two bugs we heard before we measured them

Our first handover swapped the bass with a hard cut, and listeners heard “two slaps.” We over-corrected to a single crossover point and created a ~2-second bass hole. Plotting the low-band RMS through the blend caught it; the equal-power crossfade fixed both: constant bass energy, one kick at a time.

3.4 Loudness compensation: killing the “tell”

The single biggest perceptual upgrade. We mix track A's quiet outro into track B's quiet intro, so the middle of every blend sagged in volume, and that dip is exactly the cue your ear uses to notice a mix is happening. We measure the short-term RMS of both sources and the blend, then apply a smoothed gain so the output tracks the energy-fair interpolation of the two inputs:

target(t) = sqrt[ (1-t)·RMS_A(t)²  +  t·RMS_B(t)² ]
gain(t)   = clip( target(t) / RMS_blend(t),  0.75,  1.45 )

The gain is smoothed over ~250 ms so it lifts the dip without pumping. After this, testers went from “I can easily tell mixing is happening” to “Great!!!” on the same transition.
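The two formulas translate directly into code. In the sketch below the clip limits come from the text, while the one-pole smoother, the 0.25 s time constant as its parameter, and the neutral gain on silence are assumptions about details the article leaves open.

```python
import math


def compensation_gain(t, rms_a, rms_b, rms_blend, lo=0.75, hi=1.45):
    """Gain that pulls the blend toward the energy-fair target at position t."""
    target = math.sqrt((1.0 - t) * rms_a ** 2 + t * rms_b ** 2)
    if rms_blend <= 0.0:
        return 1.0  # silence: leave the gain neutral (assumption)
    return min(hi, max(lo, target / rms_blend))


def smoothing_alpha(frame_seconds, time_constant=0.25):
    """One-pole coefficient for a ~250 ms time constant (assumed smoother)."""
    return 1.0 - math.exp(-frame_seconds / time_constant)


def smooth(gains, alpha):
    """Low-pass a gain sequence so the lift does not pump."""
    if not gains:
        return []
    out, state = [], gains[0]
    for g in gains:
        state += alpha * (g - state)
        out.append(state)
    return out


# Mid-blend sag: both sources at 0.2 RMS but the blend measures only 0.1.
lift = compensation_gain(0.5, 0.2, 0.2, 0.1)
ramp = smooth([1.0, 1.45, 1.45, 1.45], alpha=0.5)
```

The clip keeps a badly measured frame from producing a runaway boost, and the smoother spreads the correction over several frames.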

Stage 4: Streaming

Keeping it dead simple

The rendered segments are sample-continuous (each one's blend consumes the previous one's tail), so the player needs zero DSP. Liquidsoap plays the WAV segments back to back and pushes a 128 kbps MP3 to AzuraCast, which handles the public mount and CDN. We learned the hard way that real-time DSP crossfading cannot hit sample-accurate downbeat lock; pre-rendering on the server gives true beat-phase lock. The player's only job is to not get in the way.

The biggest single win

The cue-point problem

Where in track A do we start mixing out? Our first heuristic picked the earliest calm phrase in the back half, which on a 6–8 minute track is almost always the mid-track breakdown at the 50% mark. The DJ would abandon the track right before its biggest moment. The dancefloor dies.

The fix: pick the latest phrase whose energy at the cue is calm (the true outro), using the stored per-second energy curve.

[Figure: Energy curve showing the wrong mid-track mix-out versus the correct late outro mix-out]
Across the library this moved 709 of 736 tracks from a ~50% exit to ~85–98%. Blind A/B quality jumped from 5/10 to ~8/10 from this change alone.
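A minimal sketch of the "latest calm phrase" rule follows. It assumes phrase starts are sorted ascending, the energy curve has one normalized value per second, and the cue must leave room for the blend; the calm threshold of 0.4 and the 15-second tail are illustrative values, not the tuned ones.

```python
def pick_mix_out(phrase_starts, energy, calm_threshold=0.4,
                 earliest_fraction=0.5, blend_seconds=15.0):
    """Return the latest calm phrase start (seconds) in the back half, or None.

    Threshold and tail length are assumed values for this sketch.
    """
    duration = len(energy)  # one energy value per second
    for t in reversed(phrase_starts):
        if t < duration * earliest_fraction:
            break  # never mix out in the first half of the track
        if t + blend_seconds > duration or int(t) >= duration:
            continue  # not enough track left to fit the blend
        if energy[int(t)] <= calm_threshold:
            return t
    return None


# Toy 100 s track: loud, with a breakdown at 60-70 s and an outro from 75 s.
energy = [0.9] * 100
for second in range(60, 70):
    energy[second] = 0.2
for second in range(75, 100):
    energy[second] = 0.2
cue = pick_mix_out([0, 16, 32, 48, 64, 80, 96], energy)
```

On this toy track an "earliest calm phrase" rule would exit at 64 s, inside the breakdown; scanning from the end lands on 80 s, in the outro.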

Tooling & tuning

The tools that made it possible

We could not tune a live stream by waiting for transitions on air, so we built a render-to-clip harness: feed it two tracks, and it renders just the transition (20s of A + the blend + 20s of B) to a WAV in seconds. We paired it with recue.py, which recomputes only the mix points from a track's already-stored beat grid and energy curve, with no neural networks re-run. When we finalized the outro fix, we pushed it to all 736 existing tracks in minutes instead of re-ingesting for hours.

Every knob is an environment variable, tuned by ear:

| Knob | Value | Controls |
|---|---|---|
| BASE_BPM | 120 | Deck tempo at set open |
| MAX_BPM | 170 | Tracks faster than this are dropped |
| BLEND_BARS | 8 (~15s) | Transition length |
| CROSSOVER_HZ | 200 | Low/high split for the bass handover |
| SELECT_TOP_N | 3 | Variety: weighted random over top N |
| COOLDOWN_HOURS | 24 | No track repeats within this window |
| Bass crossfade window | 0.35–0.65 | Where the low end hands over |
| Loudness gain clip | 0.75–1.45 | Limits on the loudness lift |
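Reading such knobs from the environment might look like the sketch below. Only the six variable names given in the table are used, with the table's values as defaults; the last two rows have no variable names in the article, so they are left out rather than guessed. The `load_knobs` helper is hypothetical.

```python
import os


def load_knobs(env=None):
    """Read tuning knobs from an environment mapping, with table defaults."""
    env = os.environ if env is None else env
    return {
        "BASE_BPM": float(env.get("BASE_BPM", "120")),
        "MAX_BPM": float(env.get("MAX_BPM", "170")),
        "BLEND_BARS": int(env.get("BLEND_BARS", "8")),
        "CROSSOVER_HZ": float(env.get("CROSSOVER_HZ", "200")),
        "SELECT_TOP_N": int(env.get("SELECT_TOP_N", "3")),
        "COOLDOWN_HOURS": float(env.get("COOLDOWN_HOURS", "24")),
    }


defaults = load_knobs({})                       # nothing set: table values
longer_blend = load_knobs({"BLEND_BARS": "16"})  # override one knob
```

Accepting a plain mapping instead of always reading `os.environ` keeps the loader easy to exercise from a render-to-clip harness.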

What's next

From guessing structure to knowing it

The system still guesses song structure from loudness. The next phase teaches it to know:

  • Structure detection: label every track's intro / build / drop / breakdown / outro by name, so cues snap to musical roles instead of energy levels.
  • Vocal-clash avoidance: detect vocal sections and never overlap two of them.
  • A transition-type engine: choose between a long blend, a quick cut, or a filter sweep based on what the two tracks actually are.
🎚️
The throughline

A great AI DJ is not a bigger model. It is a pile of small, correct decisions about beats, keys, energy, and loudness, each one made where a human DJ would make it. We measured every one of them.

Hear it live at ravist.in
