[Paper Review] How a DQN Tamed Airfoil Vortices with a Synthetic Jet — Reinforcement-Learning Flow Control
DQN and Dueling DQN drive a synthetic jet to raise lift and cut drag
Behind an airfoil at a high angle of attack, vortices peel off in alternation. Lift oscillates every cycle, and the wing shakes. Engineers usually suppress that vibration with geometry tweaks or a fixed periodic forcing. Hammouda et al. (2026) took another path. They drilled a small synthetic jet (a zero-net-mass actuator that blows and sucks in turn) into the wing and let a reinforcement-learning agent decide its blowing speed on its own. Today we look at how the paper translated vortex shedding into a reinforcement-learning problem, then run the same idea ourselves with ε-greedy Q-learning.
Where this paper sits
- Title: Application of deep reinforcement learning for aerodynamic control around an angled airfoil via synthetic jet
- Authors: N. Ghezaiel Hammouda, R. Khan, L. Mostafa, et al. (Scientific Reports, 2026)
- Setting: Weakly compressible laminar flow at a Reynolds number (the ratio of inertial to viscous forces) of 100 and a Mach number of 0.2. The airfoil sits at a high angle of attack, with a synthetic jet near the leading edge.
- Key result: Dueling DQN converged most reliably, reducing vortex shedding while raising lift and lowering drag.
At Re 100 the flow is laminar, but at a high angle of attack vortices shed periodically behind the wing. That shedding is what makes lift and drag oscillate.
Translating vortices into a reinforcement-learning problem
Reinforcement learning (learning a policy that maximizes reward through trial and error) needs only three things defined.
- State: pressure and velocity read by virtual sensors scattered around the airfoil and in the wake. The paper reports that adding velocity to pressure speeds up learning.
- Action: the jet blowing speed $V_{jet}$. It is discretized into 21 integer levels from 0 to 20 m/s at 1 m/s spacing, because DQN requires a discrete action set.
- Reward: a one-line function that rewards lower drag and higher lift.

$$r = R_1 - \langle C_D \rangle + R_2 \langle C_L \rangle$$

Here $\langle C_D \rangle$ and $\langle C_L \rangle$ are the drag and lift coefficients averaged over one action interval. $R_1$ and $R_2$ are constants that keep the reward positive and balance lift against drag; the paper fixes both, and the hands-on code later in this post takes $R_1 = 3.0$ and $R_2 = 0.2$. One action spans one vortex-shedding period, and training runs 300 episodes of 25 periods each.
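To make the state-action-reward recipe concrete, here is a minimal sketch of the action set and the per-interval reward, assuming the form R1 - <C_D> + R2 * <C_L> quoted in the hands-on code's docstring. The function name `interval_reward`, the default constants, and the sinusoidal force histories are illustrative assumptions, not data from the paper.

```python
import numpy as np

# The 21 discrete jet speeds described above: 0, 1, ..., 20 m/s.
JET_SPEEDS = np.arange(0, 21, dtype=float)

def interval_reward(cd_samples, cl_samples, r1=3.0, r2=0.2):
    """Reward for one action interval: R1 - <C_D> + R2 * <C_L>.

    r1 and r2 default to the values used in this post's toy code;
    treat them as assumptions rather than the paper's settings.
    """
    cd_mean = float(np.mean(cd_samples))  # drag averaged over the interval
    cl_mean = float(np.mean(cl_samples))  # lift averaged over the interval
    return r1 - cd_mean + r2 * cl_mean

# Hypothetical force histories over exactly one shedding period.
t = np.linspace(0.0, 1.0, 200, endpoint=False)
cd = 0.10 + 0.01 * np.sin(4 * np.pi * t)  # drag oscillates at twice the shedding frequency
cl = 1.80 + 0.30 * np.sin(2 * np.pi * t)  # lift oscillates at the shedding frequency

r = interval_reward(cd, cl)
print(f"{len(JET_SPEEDS)} actions, reward = {r:.3f}")
```

Because the averages are taken over a full period, the oscillating parts cancel and only the mean drag and mean lift enter the reward. That is one reason an action interval of one shedding period is a convenient choice.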
ε-greedy: between exploration and exploitation
The agent estimates the value of each action with the action-value function $Q(s, a)$. The heart of it is the Bellman update.

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

$\alpha$ is the learning rate, $\gamma$ the discount that shrinks future reward, and $\max_{a'} Q(s', a')$ the best value reachable from the next state $s'$.
The catch is how to try actions whose value you don't yet know. The ε-greedy policy answers that. With probability $1 - \varepsilon$ it picks the action that looks best so far (exploitation); with probability $\varepsilon$ it picks a random one (exploration). A large $\varepsilon$ explores more; a small one settles faster.
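A single update worked through by hand can make the Bellman rule less abstract. The sketch below applies one tabular Q-learning step to a tiny hypothetical table (2 states, 3 actions); the function name `q_update` and all numbers are made up for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    td_target = r + gamma * Q[s_next].max()  # reward plus discounted best next value
    td_error = td_target - Q[s, a]           # how far the current estimate is off
    Q[s, a] += alpha * td_error              # move a fraction alpha toward the target
    return td_error

Q = np.zeros((2, 3))
Q[1] = [1.0, 2.0, 0.5]  # pretend state 1 already has some value estimates

err = q_update(Q, s=0, a=1, r=3.0, s_next=1)
# target = 3.0 + 0.9 * 2.0 = 4.8, so Q[0, 1] moves from 0 to 0.1 * 4.8 = 0.48
print(f"TD error = {err:.2f}, Q[0, 1] = {Q[0, 1]:.2f}")
```

Only the visited entry `Q[0, 1]` changes; every other cell keeps its old value until the agent actually tries it, which is why exploration matters.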
Try it yourself in the simulation below. The bars are the estimated value for each of the 21 jet speeds; a yellow bar marks an exploration pick, a cyan bar an exploitation pick.
With $\varepsilon$ near 0 you can watch the agent get stuck on whatever action happened to look good first. Around $\varepsilon = 0.2$ it quickly homes in on the true optimum near 12 m/s. Too much exploration ($\varepsilon = 0.8$) keeps poking elsewhere even when it knows the good value.
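The same three regimes can be reproduced offline with a stateless bandit, a deliberate simplification of Q-learning that drops the state and uses a running sample mean per action. The value curve peaking at 12 m/s, the noise level, and the name `run_bandit` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def run_bandit(eps, steps=5000, n_arms=21, seed=0):
    """epsilon-greedy on a 21-armed bandit; returns (average reward, pull counts)."""
    rng = np.random.default_rng(seed)
    arms = np.arange(n_arms)
    # Hypothetical true value of each jet speed, peaking at arm 12.
    true_value = 1.0 + np.exp(-((arms - 12) / 4.0) ** 2)
    q = np.zeros(n_arms)                # running estimate of each arm's value
    counts = np.zeros(n_arms, dtype=int)
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(n_arms))   # explore: uniform random arm
        else:
            a = int(np.argmax(q))           # exploit: best estimate so far
        r = true_value[a] + rng.normal(0.0, 0.1)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]      # incremental sample mean
        total += r
    return total / steps, counts

for eps in (0.0, 0.2, 0.8):
    avg, counts = run_bandit(eps)
    print(f"eps={eps:.1f}  avg reward={avg:.3f}  most-pulled arm={int(np.argmax(counts))}")
```

With `eps=0.0` the agent never leaves arm 0: every estimate starts at zero, the first pull returns a positive reward, and nothing ever challenges it. A moderate value trades a little reward for the chance to find the peak, while a large value wastes most pulls on arms it already knows are poor.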
The synthetic jet as an action
A synthetic jet vibrates a membrane to blow air out of an orifice and suck it back in. The net mass ejected is zero, but momentum is injected into the boundary layer. The non-dimensional measure of that injection is the momentum coefficient, written here in one common form (conventions differ by a constant factor).

$$C_\mu = \frac{\rho_{jet} V_{jet}^2 \, d}{\rho_\infty U_\infty^2 \, c}$$

$\rho_{jet}$, $V_{jet}$, $d$ are the jet density, speed, and orifice diameter; $\rho_\infty$, $U_\infty$, $c$ are the freestream density, speed, and chord length. In the paper the orifice sits on the suction side near the leading edge and has a 0.2 mm diameter. When the jet adds momentum to the boundary layer, separation is delayed and vortex shedding weakens.
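For a feel of the magnitudes, the momentum coefficient can be evaluated directly. The sketch below assumes the form C_mu = (rho_jet * V_jet^2 * d) / (rho_inf * U_inf^2 * c); some authors add a factor of 1/2 in the denominator. Only the 0.2 mm orifice diameter comes from the article. The densities, freestream speed, and chord are hypothetical placeholders.

```python
def momentum_coefficient(rho_jet, v_jet, d, rho_inf, u_inf, c):
    """Jet momentum flux over freestream momentum flux, per unit span (assumed form)."""
    return (rho_jet * v_jet**2 * d) / (rho_inf * u_inf**2 * c)

# Hypothetical operating point; only d = 0.2 mm is taken from the article.
params = dict(rho_jet=1.2, d=0.2e-3, rho_inf=1.2, u_inf=68.0, c=0.1)

for v_jet in (5.0, 10.0, 20.0):
    c_mu = momentum_coefficient(v_jet=v_jet, **params)
    print(f"V_jet = {v_jet:4.1f} m/s  ->  C_mu = {c_mu:.2e}")
```

The quadratic dependence on jet speed is the point to notice: doubling the blowing speed quadruples the injected momentum, so the upper action levels carry far more authority than the lower ones.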
Try it yourself in the simulation below. Raise the jet speed and watch how the wake vortices change.
With the jet off ($V_{jet} = 0$), strong vortices peel off in alternation and the swing is wide. Push the speed to 15–20 m/s and the vortices fade, the wake settles, and the lift oscillation visibly shrinks. That is exactly the state the reward function is paying for.
Hands-on: turning the jet on with Q-learning
Rather than port the paper's DQN verbatim, we reproduce the same control with tabular Q-learning, which keeps only the core idea. The state is a binned lift-oscillation amplitude, and the action is the jet speed.
    import numpy as np


    class SyntheticJetEnv:
        """1D phenomenological airfoil-wake environment.

        State  : binned lift-oscillation amplitude (0..n_bins-1)
        Action : jet speed level {0,1,...,20} m/s
        Reward : R1 - <Cd> + R2*<Cl> (paper Eq. 4), minus a toy over-blowing penalty
        """

        def __init__(self, n_bins=6, peak=12, R1=3.0, R2=0.2, seed=0):
            self.n_bins, self.peak = n_bins, peak
            self.R1, self.R2 = R1, R2
            self.rng = np.random.default_rng(seed)
            self.amp = 1.0  # normalized shedding amplitude (1 = uncontrolled)

        def reset(self):
            self.amp = 1.0
            return self._bin()

        def _bin(self):
            # Map amplitude in [0, 1] to a bin index, clamping amp == 1.0 into the last bin.
            return min(self.n_bins - 1, int(self.amp * self.n_bins))

        def step(self, action):
            ctrl = action / 20.0                         # control authority 0..1
            target = max(0.05, 1.0 - 0.8 * ctrl)         # jet damps the amplitude
            self.amp += 0.5 * (target - self.amp)        # first-order relaxation
            cl = 1.8 + 0.2 * ctrl - 0.4 * self.amp       # lift coefficient
            cd = 0.085 - 0.006 * ctrl + 0.02 * self.amp  # drag coefficient
            waste = 0.01 * max(0, action - self.peak)    # penalty for over-blowing
            reward = self.R1 - cd + self.R2 * cl - waste
            reward += self.rng.normal(0, 0.05)           # measurement noise
            return self._bin(), reward


    def epsilon_greedy(q_row, eps, rng):
        if rng.random() < eps:
            return int(rng.integers(len(q_row)))  # explore
        return int(np.argmax(q_row))              # exploit


    def train_jet_controller(episodes=300, steps=25, alpha=0.1, gamma=0.9, eps0=0.3):
        env = SyntheticJetEnv()
        n_actions = 21
        Q = np.zeros((env.n_bins, n_actions))
        rng = np.random.default_rng(1)
        history = []  # average reward per episode
        for ep in range(episodes):
            s = env.reset()
            eps = eps0 * (1 - ep / episodes)  # linear decay
            total = 0.0
            for _ in range(steps):
                a = epsilon_greedy(Q[s], eps, rng)
                s2, r = env.step(a)
                # Tabular Bellman update
                Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
                s, total = s2, total + r
            history.append(total / steps)
        # Pick the action with the largest value summed over all states.
        best = int(np.argmax(Q.sum(axis=0)))
        return Q, history, best


    if __name__ == "__main__":
        Q, hist, best = train_jet_controller()
        print(f"episode 1 avg reward = {hist[0]:.3f}")
        print(f"episode 300 avg reward = {hist[-1]:.3f}")
        print(f"learned jet velocity = {best} m/s")

The output looks like this (exact figures depend on the seeds and the reward noise).
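    episode 1 avg reward = 3.12
    episode 300 avg reward = 3.25
    learned jet velocity = 12 m/s

The agent wanders randomly at first, then after 300 episodes discovers on its own that around 12 m/s is the sweet spot between lift gain and wasted blowing. The reward form and the 21-level action space come from the paper; the location of the optimum does not. It is set by the toy environment's `peak` parameter and its over-blowing penalty, which are modeling choices made for this demo.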
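One way to check what the agent should find is to strip the noise and the transient from the toy environment and sweep all 21 actions. The sketch below copies the formulas from `SyntheticJetEnv.step` and assumes the amplitude has already relaxed to its target; `steady_state_reward` is a helper introduced here for illustration and says nothing about the paper's CFD results.

```python
def steady_state_reward(action, peak=12, R1=3.0, R2=0.2):
    """Noise-free reward of the toy model once the amplitude has settled."""
    ctrl = action / 20.0
    amp = max(0.05, 1.0 - 0.8 * ctrl)       # relaxation converged: amp equals its target
    cl = 1.8 + 0.2 * ctrl - 0.4 * amp       # lift coefficient
    cd = 0.085 - 0.006 * ctrl + 0.02 * amp  # drag coefficient
    waste = 0.01 * max(0, action - peak)    # over-blowing penalty (toy assumption)
    return R1 - cd + R2 * cl - waste

rewards = [steady_state_reward(a) for a in range(21)]
best = max(range(21), key=lambda a: rewards[a])
print(f"{best} {rewards[0]:.4f} {rewards[best]:.4f}")
# prints: 12 3.1750 3.2506
```

Below the peak each extra 1 m/s adds about 0.006 of reward; above it the 0.01 penalty outweighs that gain, so the curve turns over at 12 m/s. Neighboring speeds differ by far less than the 0.05 reward noise, which is why the tabular agent needs many episodes to separate them.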
The DQN siblings: Double vs Dueling
The paper compared three DQN variants.
- Vanilla DQN: the $\max$ operator in the target tends to overestimate values.
- Double DQN: uses separate networks for action selection and value evaluation to curb that overestimation.
- Dueling DQN: splits $Q(s, a)$ into a state value $V(s)$ and an advantage $A(s, a)$.
$V(s)$ learns "how good is this state," while $A(s, a)$ learns "how much better than average is this action within it." When many actions share similar value, as when jet speeds 11 and 13 m/s are nearly identical, you only have to learn the state value once, which stabilizes training. That is why Dueling DQN showed the most consistent learning curve and the best performance in the paper.
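The difference between the three variants comes down to a few lines of arithmetic. The sketch below shows the vanilla and Double DQN targets and the mean-subtracted dueling aggregation Q = V + A - mean(A), which is the standard form from the dueling-network literature; whether the paper uses exactly this aggregation is an assumption. Plain NumPy arrays stand in for network outputs, and all numbers are made up.

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.9):
    """Vanilla DQN: the target network both picks and scores the next action."""
    return r + gamma * q_next_target.max()

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.9):
    """Double DQN: the online network picks the action, the target network scores it."""
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

def dueling_q(v, adv):
    """Dueling head: combine a scalar state value with per-action advantages."""
    return v + adv - adv.mean()  # subtracting the mean makes the V/A split identifiable

# Hypothetical next-state estimates from the two networks.
q_online = np.array([1.0, 1.2, 0.9])
q_target = np.array([1.5, 1.1, 0.8])
vanilla = dqn_target(1.0, q_target)                  # 1.0 + 0.9 * 1.5 = 2.35
double = double_dqn_target(1.0, q_online, q_target)  # 1.0 + 0.9 * 1.1 = 1.99
print(f"vanilla target = {vanilla:.2f}, double target = {double:.2f}")

# Hypothetical dueling-head outputs for one state.
q = dueling_q(2.0, np.array([0.0, 0.3, 0.1]))
print(f"Q = {np.round(q, 3)}, mean(Q) = {q.mean():.3f}")
```

In this example the target network's largest value belongs to an action the online network does not prefer, so the vanilla target is the more optimistic of the two. In the dueling head the mean of Q equals V, so a change in the state value shifts every action's value at once.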
A 5-layer × 128-neuron network converged within 300 episodes, and with active control on, the lift coefficient $C_L$ rose from 1.79 to about 2.0 while the wake settled.
What to remember
- The recipe for casting flow control as RL: state = sensor pressure and velocity, action = jet speed (discrete), reward = $R_1 - \langle C_D \rangle + R_2 \langle C_L \rangle$.
- A synthetic jet injects pure momentum at zero net mass, delaying separation and weakening vortex shedding.
- Dueling DQN, thanks to the $V$/$A$ split, converges most stably on flow-control problems where many actions look alike.