[Paper Review] How a DQN Tamed Airfoil Vortices with a Synthetic Jet — Reinforcement-Learning Flow Control

Behind an airfoil at a high angle of attack, vortices peel off in alternation. Lift oscillates every cycle, and the wing shakes. Engineers usually suppress that vibration with geometry tweaks or a fixed periodic forcing. Hammouda et al. (2026) took another path. They drilled a small synthetic jet (a zero-net-mass actuator that blows and sucks in turn) into the wing and let a reinforcement-learning agent decide its blowing speed on its own. Today we look at how the paper translated vortex shedding into a reinforcement-learning problem, then run the same idea ourselves with ε-greedy Q-learning.

Where this paper sits

Title: Application of deep reinforcement learning for aerodynamic control around an angled airfoil via synthetic jet
Authors: N. Ghezaiel Hammouda, R. Khan, L. Mostafa, et al. (Scientific Reports, 2026)
Setting: Weakly compressible laminar flow at Reynolds number (ratio of inertial to viscous forces) 100 and Mach number 0.2, around an airfoil at a high angle of attack with a synthetic jet near the leading edge.
Key result: Dueling DQN converged most reliably, reducing vortex shedding while raising lift and lowering drag.

At Re 100 the flow is laminar, but at a high angle of attack vortices shed periodically behind the wing. That shedding is what makes lift and drag oscillate.

Translating vortices into a reinforcement-learning problem

Reinforcement learning (learning a policy that maximizes reward through trial and error) needs only three things defined.

State: pressure and velocity read by virtual sensors scattered around the airfoil and in the wake. The paper reports that adding velocity to pressure speeds up learning.
Action: the jet blowing speed $U_a$. It is discretized into 21 integer levels from 0 to 20 m/s at 1 m/s spacing, because DQN requires a discrete action set.
Reward: a one-line function that rewards low drag and high lift.

$$r = R_1 - \langle C_D \rangle_{ac} + R_2\,\langle C_L \rangle_{ac}$$

Here $\langle C_D \rangle_{ac}$ and $\langle C_L \rangle_{ac}$ are the drag and lift coefficients averaged over one action interval. $R_1$ and $R_2$ are constants that keep the reward positive and balance lift against drag; the paper used $R_1=3$, $R_2=0.2$. One action spans one vortex-shedding period, and training runs 300 episodes of 25 periods each.
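
As a minimal sketch of that reward, the snippet below averages drag and lift samples over one action interval and applies the paper's constants. The function name `interval_reward` and the sinusoidal coefficient histories are assumptions made up for illustration; they are not data from the paper.

```python
import numpy as np

R1, R2 = 3.0, 0.2                  # constants reported in the paper
JET_SPEEDS = np.arange(21)         # 21 discrete actions: 0, 1, ..., 20 m/s

def interval_reward(cd_samples, cl_samples, r1=R1, r2=R2):
    """Reward for one action interval: r = R1 - <C_D> + R2 * <C_L>."""
    cd_mean = float(np.mean(cd_samples))
    cl_mean = float(np.mean(cl_samples))
    return r1 - cd_mean + r2 * cl_mean

# Hypothetical coefficient histories over exactly one shedding period.
t = np.linspace(0.0, 1.0, 200, endpoint=False)
cd = 0.10 + 0.01 * np.sin(2 * np.pi * t)
cl = 1.80 + 0.30 * np.sin(2 * np.pi * t)

print(interval_reward(cd, cl))
```

Because the samples cover a full period, the oscillating parts average out and the reward comes to about 3.26, set entirely by the mean drag of 0.10 and the mean lift of 1.80.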

ε-greedy: between exploration and exploitation

The agent estimates the value of each action with the action-value function $Q(s,a)$. The heart of it is the Bellman update.

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right]$$

$\alpha$ is the learning rate, $\gamma$ the discount factor that shrinks future reward, and $\max_{a'}Q(s',a')$ the best value reachable from the next state.
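
To make the update concrete, here is a single step worked through on a tiny table. The table size, the reward of 3.0, and the values $\alpha=0.1$, $\gamma=0.9$ are illustrative assumptions, and `q_update` is a hypothetical helper name.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = r + gamma * Q[s_next].max()   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s, a]            # bracketed term in the update
    Q[s, a] += alpha * td_error
    return td_error

Q = np.zeros((2, 3))            # 2 states x 3 actions, all estimates start at 0
Q[1] = [1.0, 2.0, 0.5]          # pretend state 1 already has some estimates
delta = q_update(Q, s=0, a=2, r=3.0, s_next=1)

print(delta, Q[0, 2])
```

The target is $3.0 + 0.9 \times 2.0 = 4.8$, so the TD error is 4.8 and the entry moves a tenth of the way there, to 0.48.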

The catch is how to try actions whose value you don't yet know. The ε-greedy policy answers that. With probability $1-\epsilon$ it picks the action that looks best so far (exploitation); with probability $\epsilon$ it picks a random one (exploration). A large $\epsilon$ explores more; a small one settles faster.

Try it yourself in the simulation below. The bars are the estimated value $Q$ for each of the 21 jet speeds; a yellow bar marks an exploration pick, a cyan bar an exploitation pick.

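[Interactive simulation: bar chart of the estimated $Q$ for jet velocity actions from 0 to 20 m/s, with an ε (explore) slider (default 0.20), a readout of steps, best action, and average reward, and a legend for explore and exploit picks.]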

With $\epsilon$ near 0 you can watch the agent get stuck on whatever action happened to look good first. Around 0.2 it quickly homes in on the true optimum near 12 m/s. Too much exploration (0.8) keeps poking elsewhere even when it knows the good value.
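
The same behavior can be approximated offline with a 21-armed bandit. This is a sketch under stated assumptions: the triangular reward curve peaking at 12 m/s, the noise level, and the function name `run_bandit` are all invented here to mimic the demo, not taken from the paper.

```python
import numpy as np

def run_bandit(eps, steps=3000, peak=12, noise=0.05, seed=0):
    """Run ε-greedy on a 21-armed bandit; return (Q estimates, greedy action)."""
    rng = np.random.default_rng(seed)
    n_actions = 21
    Q = np.zeros(n_actions)                    # running mean reward per jet speed
    counts = np.zeros(n_actions, dtype=int)    # number of pulls per action
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(n_actions))   # explore: uniform random action
        else:
            a = int(np.argmax(Q))              # exploit: current best estimate
        # Assumed reward: triangular curve peaking at `peak`, plus Gaussian noise.
        r = 1.0 - abs(a - peak) / peak + rng.normal(0.0, noise)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]         # incremental sample average
    return Q, int(np.argmax(Q))

for eps in (0.0, 0.2, 0.8):
    _, best = run_bandit(eps)
    print(f"eps={eps:.1f} -> greedy action {best} m/s")
```

With `eps=0.0` the agent typically locks onto one of the first low-speed actions it tries, because any positive estimate beats the untouched zeros. Higher values of `eps` find the peak, but a larger share of their steps goes to actions already known to be worse.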

The synthetic jet as an action

A synthetic jet vibrates a membrane to blow air out of an orifice and suck it back in. The net mass ejected is zero, but momentum is injected into the boundary layer. The non-dimensional measure of that injection is the momentum coefficient.

$$C_\mu = \frac{\rho_j\,U_a^2\,d_j}{\tfrac{1}{2}\,\rho_\infty\,U_\infty^2\,c}$$

$\rho_j$, $U_a$, $d_j$ are the jet density, speed, and orifice diameter; $\rho_\infty$, $U_\infty$, $c$ are the freestream density, speed, and chord length. In the paper the orifice sits on the suction side near the leading edge at $x/c=0.1$ with a 0.2 mm diameter. When the jet adds momentum to the boundary layer, separation is delayed and vortex shedding weakens.
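
A short sketch shows how the momentum coefficient scales with jet speed. Only the 0.2 mm orifice size comes from the paper; the chord length, the freestream speed (taken as Mach 0.2 with a sound speed of about 340 m/s), and the equal densities are placeholder assumptions chosen to make the example run.

```python
def momentum_coefficient(u_a, d_j, c, u_inf, rho_j=1.0, rho_inf=1.0):
    """C_mu = rho_j * U_a^2 * d_j / (0.5 * rho_inf * U_inf^2 * c)."""
    return (rho_j * u_a**2 * d_j) / (0.5 * rho_inf * u_inf**2 * c)

D_J = 0.2e-3     # m, orifice size reported in the paper
CHORD = 0.01     # m, assumed for illustration only
U_INF = 68.0     # m/s, assumed: Mach 0.2 at a sound speed of about 340 m/s

for u_a in (5.0, 10.0, 20.0):
    c_mu = momentum_coefficient(u_a, D_J, CHORD, U_INF)
    print(f"U_a = {u_a:4.1f} m/s -> C_mu = {c_mu:.3e}")
```

Whatever the placeholder values, $C_\mu$ grows with $U_a^2$, so doubling the jet speed quadruples the injected momentum. That is why the upper end of the 0 to 20 m/s action range carries far more control authority than the lower end.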

Try it yourself in the simulation below. Raise the jet speed and watch how the wake vortices change.

[Interactive simulation: wake vortices behind the airfoil, with a jet velocity $U_a$ slider starting at 0 m/s.]

At $U_a = 0$ strong vortices peel off in alternation and the $C_L$ swing is wide. Push the speed to 15–20 m/s and the vortices fade, the wake settles, and the lift oscillation visibly shrinks. That is exactly the state the reward function is paying for.

Hands-on: turning the jet on with Q-learning

Rather than port the paper's DQN verbatim, we reproduce the same control idea with tabular Q-learning on a toy environment, keeping only the core mechanism. The state is a binned lift-oscillation amplitude, and the action is the jet speed.

import numpy as np
 
class SyntheticJetEnv:
    """1D phenomenological airfoil-wake environment.
 
    State  : binned lift-oscillation amplitude (0..n_bins-1)
    Action : jet speed level {0,1,...,20} m/s
    Reward : R1 - <Cd> + R2*<Cl>  (paper Eq. 4)
             (this toy adds an over-blowing penalty and Gaussian noise)
    """
    def __init__(self, n_bins=6, peak=12, R1=3.0, R2=0.2, seed=0):
        self.n_bins, self.peak = n_bins, peak
        self.R1, self.R2 = R1, R2
        self.rng = np.random.default_rng(seed)
        self.amp = 1.0  # normalized shedding amplitude (1 = uncontrolled)
 
    def reset(self):
        self.amp = 1.0
        return self._bin()
 
    def _bin(self):
        return min(self.n_bins - 1, int(self.amp * self.n_bins))
 
    def step(self, action):
        ctrl = action / 20.0                       # control authority 0..1
        target = max(0.05, 1.0 - 0.8 * ctrl)       # jet damps the amplitude
        self.amp += 0.5 * (target - self.amp)      # first-order relaxation
        cl = 1.8 + 0.2 * ctrl - 0.4 * self.amp     # lift coefficient
        cd = 0.085 - 0.006 * ctrl + 0.02 * self.amp  # drag coefficient
        waste = 0.01 * max(0, action - self.peak)  # penalty for over-blowing
        reward = self.R1 - cd + self.R2 * cl - waste
        reward += self.rng.normal(0, 0.05)
        return self._bin(), reward
 
def epsilon_greedy(q_row, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))      # explore
    return int(np.argmax(q_row))                  # exploit
 
def train_jet_controller(episodes=300, steps=25, alpha=0.1, gamma=0.9, eps0=0.3):
    env = SyntheticJetEnv()
    n_actions = 21
    Q = np.zeros((env.n_bins, n_actions))
    rng = np.random.default_rng(1)
    history = []
    for ep in range(episodes):
        s = env.reset()
        eps = eps0 * (1 - ep / episodes)          # linear decay
        total = 0.0
        for _ in range(steps):
            a = epsilon_greedy(Q[s], eps, rng)
            s2, r = env.step(a)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s, total = s2, total + r
        history.append(total / steps)
    best = int(np.argmax(Q.sum(axis=0)))          # action with the largest Q summed over states
    return Q, history, best
 
if __name__ == "__main__":
    Q, hist, best = train_jet_controller()
    print(f"episode   1 avg reward = {hist[0]:.2f}")
    print(f"episode 300 avg reward = {hist[-1]:.2f}")
    print(f"learned jet velocity   = {best} m/s")

A run prints something like this; the exact numbers depend on the random seeds and the reward noise.

episode   1 avg reward = 3.12
episode 300 avg reward = 3.25
learned jet velocity   = 12 m/s

The agent wanders randomly at first, then after 300 episodes settles around 12 m/s, the sweet spot between lift gain and wasted blowing. Note that this optimum is built into the toy environment through the peak=12 parameter and its over-blowing penalty; only the reward form and the 21-level action space are borrowed from the paper.

The DQN siblings: Double vs Dueling

The paper compared three DQN variants.

Vanilla DQN: the $\max$ operator tends to overestimate values.
Double DQN: selects the next action with the online network but evaluates it with the target network, which curbs that overestimation.
Dueling DQN: splits $Q$ into a state value $V(s)$ and an advantage $A(s,a)$.

$$Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a') \right)$$

$V(s)$ learns "how good is this state," while $A(s,a)$ learns "how much better than average is this action within it." When many actions share similar value (jet speeds of 11 and 13 m/s, for instance, are nearly identical), the network only has to learn the state value once, which stabilizes training. That structure is the likely reason Dueling DQN showed the most consistent learning curve and the best performance in the paper.
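
The two modifications can be sketched in a few lines of NumPy. This is an illustration of the standard formulas, not the paper's network code: the function names, the example numbers, and $\gamma=0.99$ are assumptions, and terminal-state handling is left out.

```python
import numpy as np

def dueling_q(v, adv):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-subtracted advantages.

    v   : shape (batch,)            state value V(s)
    adv : shape (batch, n_actions)  advantage A(s, a)
    """
    return v[:, None] + (adv - adv.mean(axis=1, keepdims=True))

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online net picks a', the target net scores it."""
    a_star = np.argmax(q_online_next, axis=1)
    picked = q_target_next[np.arange(len(a_star)), a_star]
    return r + gamma * picked

# Dueling aggregation on one state with three actions.
v = np.array([2.0])
adv = np.array([[0.1, 0.5, 0.3]])
q = dueling_q(v, adv)
print(q, q.mean(axis=1))

# Double DQN target on one transition.
r = np.array([1.0])
q_online_next = np.array([[1.0, 3.0, 2.0]])
q_target_next = np.array([[0.5, 0.7, 0.9]])
y = double_dqn_target(r, q_online_next, q_target_next)
print(y)
```

Subtracting the mean advantage pins the average of $Q$ over actions to $V(s)$, which is what makes the split identifiable. In the second example the online network prefers action 1, so the target uses the target network's 0.7 for that action rather than its own maximum of 0.9, which is the overestimation that vanilla DQN would have baked in.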

A 5-layer × 128-neuron network converged within 300 episodes, and with active control on, $C_L$ rose from 1.79 to about 2.0 while the wake settled.

What to remember

The recipe for casting flow control as RL: state = sensor pressure and velocity, action = jet speed (discrete), reward = $R_1 - \langle C_D\rangle + R_2\langle C_L\rangle$.
A synthetic jet injects pure momentum at zero net mass, delaying separation and weakening vortex shedding.
Dueling DQN, thanks to the $Q = V + A$ split, converges most stably on flow-control problems where many actions look alike.