Preface #
This blog post originates from ORF522: Linear and Nonlinear Optimization, a course taught by Prof. Stellato that I found both interesting and challenging. In the previous blog post, we discussed the well-known KKT conditions of mathematical optimization, but from a more engineering-oriented perspective. Today, we revisit these conditions and formulate them more rigorously through Farkas’ Lemma.
Relevant Concepts #
Feasible Direction #
Consider the optimization problem:
\begin{align} \mathrm{minimize}\quad& f(x)\\ \mathrm{s.t.}\quad& x\in\mathcal{C}. \end{align}
Given $x\in\mathcal{C}$, a vector $d$ is called a feasible direction at $x$ if there exists $\bar{t}>0$ such that
$$x+td\in\mathcal{C},\qquad\forall t\in [0,\bar{t}]. \tag{1} \label{eq1}$$
Let $F(x)$ denote the set of all feasible directions at $x$. We then illustrate this concept with the following three examples:

- $\mathcal{C} = \{x\,|\,Ax=b\}\ \Rightarrow\ F(x) = \{d\,|\,Ad=0\}$ (the nullspace of $A$)
- $\mathcal{C} = \{x\,|\,Ax\leq b\}\ \Rightarrow\ F(x) = \{d\,|\,a_i^\mathrm{T}d\leq 0\ \text{ if }\ a_i^\mathrm{T}x=b_i\}$
- $\mathcal{C} = \{x\,|\,g_i(x)\leq 0\ \text{(nonlinear)}\}\ \Rightarrow\ F(x) = \{d\,|\,\nabla g_i(x)^\mathrm{T}d<0\ \text{ if }\ g_i(x)=0\}$
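As a quick sanity check of the second example, here is a minimal Python sketch (the matrix, point, and directions are made up for illustration; `numpy` is assumed) that detects the active constraints of $\{x\,|\,Ax\leq b\}$ at $x$ and tests whether a candidate $d$ satisfies $a_i^\mathrm{T}d\leq 0$ on them.

```python
import numpy as np

def is_feasible_direction(A, b, x, d, tol=1e-9):
    """Check whether d is a feasible direction at x for the set {x | Ax <= b}."""
    residual = A @ x - b                      # <= 0 for a feasible x
    active = residual > -tol                  # indices with a_i^T x = b_i
    # d is feasible iff a_i^T d <= 0 for every active constraint
    return bool(np.all(A[active] @ d <= tol))

# Toy example: the nonnegative quadrant {x | -x <= 0} in R^2, at a boundary point.
A = -np.eye(2)
b = np.zeros(2)
x = np.array([0.0, 1.0])                      # only the first constraint is active
print(is_feasible_direction(A, b, x, np.array([1.0, -0.5])))   # True
print(is_feasible_direction(A, b, x, np.array([-1.0, 0.0])))   # False
```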
Descent Direction #
Suppose $f$ is continuously differentiable. A vector $d$ is called a descent direction at $x$ if there exists $\bar{t}>0$ such that
$$f(x+td)<f(x),\quad\forall t\in [0,\bar{t}].$$
It is well known that any direction $d$ satisfying
$$\nabla f(x)^\mathrm{T}d<0 \tag{2} \label{eq2}$$
is a descent direction. Accordingly, we let $D(x)=\{d\,|\,\nabla f(x)^\mathrm{T}d<0\}$ denote this set of (strict) descent directions at $x$.
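A tiny numeric illustration (the quadratic $f$ and the point are chosen arbitrarily): a direction satisfying Equation $\eqref{eq2}$ indeed decreases $f$ for small enough step sizes.

```python
import numpy as np

# f(x) = ||x||^2, so grad f(x) = 2x.
f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x

x = np.array([1.0, 2.0])
d = -grad_f(x)                                  # steepest-descent direction
print(grad_f(x) @ d)                            # negative => d satisfies (2)
for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(x + t * d) < f(x))               # True for small t: descent
```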
Necessary Optimality Condition Idea #
At any feasible point, one can consider all directions that keep the iterate within the feasible set; these are the feasible directions. Among them, some may decrease the objective function and are therefore descent directions. In general, not every feasible direction is a descent direction. At a local optimum $x^\ast$, however, the situation becomes extreme: no feasible direction is a descent direction. Otherwise, one could move slightly along such a direction and obtain a lower objective value while remaining feasible, contradicting local optimality. Formally, this idea is captured by the condition
$$F(x^\ast)\cap D(x^\ast)=\emptyset, \tag{3} \label{eq3}$$
which states that there is no feasible descent direction.
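For a linearly constrained problem, this condition can be probed with a small linear program. The sketch below (toy data; `scipy` assumed available) minimizes $\nabla f(x)^\mathrm{T}d$ over $Ad=0$ with a box bound on $d$: a strictly negative optimum exhibits a feasible descent direction, so such an $x$ cannot be a local minimum.

```python
import numpy as np
from scipy.optimize import linprog

def has_feasible_descent_direction(grad, A, tol=1e-9):
    """Return True if some d with Ad = 0, -1 <= d <= 1 satisfies grad^T d < 0."""
    res = linprog(c=grad, A_eq=A, b_eq=np.zeros(A.shape[0]),
                  bounds=[(-1.0, 1.0)] * A.shape[1], method="highs")
    return res.fun < -tol

A = np.array([[1.0, 1.0]])            # feasible set: x1 + x2 = 1
grad = np.array([1.0, 3.0])           # gradient of some f at a feasible point
print(has_feasible_descent_direction(grad, A))   # True: d = (1, -1) works
```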
Farkas’ Lemma #
Theorem 1
Given $A$ and $b$, exactly one of the following statements is true:
- There exists an $x$ with $Ax=b,\ x\geq 0$
- There exists a $y$ with $A^\mathrm{T}y\geq 0,\ b^\mathrm{T}y<0$
This result admits a clean geometric interpretation. Writing $A=[A_1,\,\cdots,A_n]$, the first alternative states that
$$b = \sum_{i=1}^nx_iA_i,\quad x_i\geq 0,$$
thus $b$ lies in the convex cone
generated by the columns of $A$. In contrast, the second alternative asserts the existence of a vector $y$ such that $y^\mathrm{T}A_i\geq 0$ for all columns $A_i$, while $y^\mathrm{T}b<0$. The hyperplane
$\{z\,|\,y^\mathrm{T}z=0\}$ therefore separates $b$ from the cone spanned by $\{A_1,\,\cdots,A_n\}$: all columns of $A$ lie on the nonnegative side of the hyperplane, while $b$ lies strictly on the negative side.
The two alternatives cannot both be true. Indeed, if $x\geq 0$, $Ax=b$, and $A^\mathrm{T}y\geq 0$, then
$$b^\mathrm{T}y = y^\mathrm{T}Ax\geq 0,$$
which contradicts $b^\mathrm{T}y<0$ in alternative 2.
They also cannot both be false, which can be shown via duality. Consider the following primal-dual pair:
\begin{align} &(P)\quad\mathrm{minimize}\quad 0 &\quad &\mathrm{s.t.}\quad Ax=b,\ x\geq 0,\\ &(D)\quad\mathrm{maximize}\quad -b^\mathrm{T}y &\quad &\mathrm{s.t.}\quad A^\mathrm{T}y\geq 0. \end{align}
and let $p^\ast$ and $d^\ast$ denote the optimal objective values of (P) and (D), respectively.
The dual is always feasible (for example, $y=0$), and strong duality
holds. If the primal is feasible, then the optimal values satisfy $p^\ast=d^\ast=0$, which implies $b^\mathrm{T}y\geq 0$ for all $y$ with $A^\mathrm{T}y\geq 0$; this rules out the second alternative. If the primal is infeasible, then the dual is unbounded above ($p^\ast=d^\ast=+\infty$), yielding a vector $y$ with $A^\mathrm{T}y\geq 0$ and $b^\mathrm{T}y<0$. Such a $y$ is an infeasibility certificate for $Ax=b,\ x\geq 0$.
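Both alternatives can be checked numerically by posing each as an LP feasibility problem. Below is a minimal sketch (illustrative data; `scipy` assumed), where the second alternative is normalized to $b^\mathrm{T}y\leq -1$, which is harmless since it is invariant under positive scaling of $y$.

```python
import numpy as np
from scipy.optimize import linprog

def farkas_alternatives(A, b):
    """Return (alt1, alt2): exactly one should be True by Farkas' Lemma."""
    m, n = A.shape
    # Alternative 1: exists x >= 0 with A x = b.
    res1 = linprog(c=np.zeros(n), A_eq=A, b_eq=b,
                   bounds=[(0, None)] * n, method="highs")
    # Alternative 2: exists y with A^T y >= 0 and b^T y < 0
    # (normalized to b^T y <= -1; y is free).
    A_ub = np.vstack([-A.T, b.reshape(1, -1)])
    b_ub = np.concatenate([np.zeros(n), [-1.0]])
    res2 = linprog(c=np.zeros(m), A_ub=A_ub, b_ub=b_ub,
                   bounds=[(None, None)] * m, method="highs")
    return res1.status == 0, res2.status == 0

A = np.array([[1.0, 0.0], [0.0, 1.0]])
print(farkas_alternatives(A, np.array([1.0, 2.0])))    # (True, False): b in the cone
print(farkas_alternatives(A, np.array([-1.0, 2.0])))   # (False, True): y certifies
```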
Nonlinear Optimization with Equality Constraints #
\begin{align} \mathrm{minimize}\quad& f(x)\\ \mathrm{s.t.}\quad& Ax=b \tag{4} \label{eq4} \end{align}
Theorem 2
If $x^\ast$ is a local optimum, then there exists $y$ such that
$$\nabla f(x^\ast)+A^\mathrm{T}y=0. \tag{5} \label{eq5}$$
Proof of theorem
To prove Theorem 2, it is useful to formulate the argument in terms of feasible and descent directions. At a local minimizer, these two sets cannot intersect (Equation $\eqref{eq3}$). This observation leads to a classical “theorem of alternatives”. Exactly one of the following two statements must hold: (i) there exists a feasible descent direction satisfying $Ad=0$ and $\nabla f(x^\ast)^\mathrm{T}d<0$; or (ii) there exists a vector $y$ such that $\nabla f(x^\ast)+A^\mathrm{T}y=0$.
They cannot both be true. If (ii) holds then for any $d$ with $Ad=0$ we have $\nabla f(x^\ast)^\mathrm{T}d=-y^\mathrm{T}Ad=0$, which contradicts the existence of a strict descent direction in alternative (i).
They also cannot both be false. Consider the following primal-dual pair:
\begin{align} &(P)\quad\mathrm{minimize}\quad\nabla f(x^\ast)^\mathrm{T}d &\quad &\mathrm{s.t.}\quad Ad=0,\\ &(D)\quad\mathrm{maximize}\quad 0 &\quad &\mathrm{s.t.}\quad\nabla f(x^\ast)+A^\mathrm{T}y=0, \end{align}
If (i) holds, then the primal is unbounded below (scale $d$), i.e., $p^\ast=-\infty$. By weak duality, $d^\ast=-\infty$, i.e., the dual is infeasible. If (ii) holds, then the dual is feasible and hence $d^\ast=0$. Strong duality then yields $p^\ast=0$, which contradicts the existence of a descent direction in (i). Therefore, exactly one of the two alternatives holds. $\blacksquare$
Geometric Interpretation
We now provide a geometric interpretation of the theorem. For the equality-constrained problem, the feasible directions at a candidate optimum $x^\ast$ are exactly those $d$ that keep the point on the constraint set to first order, i.e., $Ad=0$ ($d\in\mathrm{null}(A)$). At a local minimum, taking an infinitesimal step in any feasible direction cannot decrease the objective value. Therefore, the directional derivative must satisfy
$$\nabla f(x^\ast)^\mathrm{T}d\geq 0,\quad \forall d\in\mathrm{null}(A).$$
Since the condition $Ad=0$ is preserved under sign change, if $d$ is feasible then so is $-d$. Applying the same inequality to $-d$ gives
$$\nabla f(x^\ast)^\mathrm{T}(-d)\geq 0\quad\Rightarrow\quad\nabla f(x^\ast)^\mathrm{T}d\leq 0.$$
Therefore, we have
$$\nabla f(x^\ast)^\mathrm{T}d=0,\quad \forall d\in\mathrm{null}(A),$$
which shows $\nabla f(x^\ast)\perp\mathrm{null}(A)$. By the fundamental subspace relation $\mathrm{null}(A)^\perp = \mathrm{range}(A^\mathrm{T})$, this is equivalent to
$$\nabla f(x^\ast)\in\mathrm{range}(A^\mathrm{T}).$$
Hence, there exists a multiplier $y$ such that Equation $\eqref{eq5}$ holds. Geometrically, the gradient at the optimum is perpendicular to the feasible set $\{x\,|\,Ax=b\}$.
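The following numpy sketch (problem data invented for illustration) checks Theorem 2 on a small equality-constrained quadratic: it computes the minimizer in closed form, recovers $y$ by least squares, and verifies $\nabla f(x^\ast)+A^\mathrm{T}y\approx 0$.

```python
import numpy as np

# minimize f(x) = ||x - c||^2  subject to  A x = b   (toy data)
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
c = np.array([3.0, -1.0, 2.0])

# Closed-form minimizer: projection of c onto the affine set {x | Ax = b}.
x_star = c - A.T @ np.linalg.solve(A @ A.T, A @ c - b)
grad = 2 * (x_star - c)

# Recover the multiplier from grad f(x*) + A^T y = 0 via least squares.
y, *_ = np.linalg.lstsq(A.T, -grad, rcond=None)

print(np.allclose(A @ x_star, b))               # primal feasibility
print(np.allclose(grad + A.T @ y, 0))           # stationarity (Equation (5))
```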
Nonlinear Optimization with Inequality Constraints #
\begin{align} \mathrm{minimize}\quad& f(x)\\ \mathrm{s.t.}\quad& g_i(x)\leq 0,\quad i=1,\,\cdots,m, \tag{6} \label{eq6} \end{align}
At any point $x$, the active set $\mathcal{A}(x)=\{i\,|\,g_i(x)=0\}$ collects the constraints that are tight at $x$. A standard regularity assumption is the linear independence constraint qualification (LICQ), which requires that the gradients of the active constraints, $\{\nabla g_i(x),\ i\in\mathcal{A}(x)\}$, be linearly independent. Under LICQ, we obtain the following theorem.
Theorem 3
If $x^\ast$ is a local minimum and LICQ holds, then there exists $y\geq 0$ such that
\begin{align} &\nabla f(x^\ast)+\sum_{i=1}^m y_i\nabla g_i(x^\ast)=0\\ &y_ig_i(x^\ast)=0,\quad i=1,\,\cdots,m. \tag{7} \label{eq7} \end{align}
Proof of theorem
We make use of a variation of Farkas’ Lemma. Specifically, given $A$, exactly one of the following statements is true:
- There exists a $d$ with $Ad<0$
- There exists a $u$ with $A^\mathrm{T}u=0,\ \mathbf{1}^\mathrm{T}u=1$, and $u\geq 0$
Now we show that the above variation is equivalent to the original Farkas’ Lemma. The two statements cannot both be true: if $Ad<0$, $u\geq 0$, and $\mathbf{1}^\mathrm{T}u=1$, then $u\neq 0$ and hence $u^\mathrm{T}Ad<0$, whereas the second alternative gives $u^\mathrm{T}Ad=(A^\mathrm{T}u)^\mathrm{T}d=0$, a contradiction. They also cannot both be false. To see this, rewrite the first alternative in a form compatible with the standard Farkas’ Lemma: $Ad<0$ holds if and only if there exists $\varepsilon>0$ such that
$$Ad\leq -\varepsilon\mathbf{1}.$$
Define the augmented matrix and vectors
$$\tilde{A}=[A,\mathbf{1}],\quad c=(0,\,\cdots,0,1),\quad \tilde{d}=(-d,-\varepsilon).$$
Then $Ad<0$ is equivalent to
$$\tilde{A}\tilde{d}\geq 0,\quad c^\mathrm{T}\tilde{d}<0.$$
Applying the original Farkas’ Lemma to the pair $(\tilde{A}^\mathrm{T},c)$, we obtain the alternative: either such a $\tilde{d}$ exists, or there exists $u\geq 0$ such that
$$\tilde{A}^\mathrm{T}u=c. \tag{8} \label{eq8}$$
Expanding Equation $\eqref{eq8}$ gives
$$A^\mathrm{T}u=0,\quad \mathbf{1}^\mathrm{T}u=1,\quad u\geq 0,$$
which is exactly the second alternative in the variation. Hence the two alternatives cannot both be false. Therefore, the variation is equivalent to the original Farkas’ Lemma via a simple reparameterization, and exactly one of the two statements must hold.
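The variation can be verified numerically in the same spirit as the earlier LP check of the original lemma. A minimal sketch with made-up data (`scipy` assumed): the first alternative is tested by maximizing a slack $s$ subject to $Ad+s\mathbf{1}\leq 0$ and $s\leq 1$, the second by a direct feasibility LP.

```python
import numpy as np
from scipy.optimize import linprog

def farkas_variation(A, tol=1e-9):
    """Return (alt1, alt2): exactly one should be True for the variation."""
    k, n = A.shape
    # Alternative 1: exists d with A d < 0.  Maximize s subject to
    # A d + s*1 <= 0 and s <= 1; the optimum is positive iff A d < 0 is solvable.
    c = np.zeros(n + 1)
    c[-1] = -1.0                                          # minimize -s
    A_ub = np.vstack([np.hstack([A, np.ones((k, 1))]),
                      np.hstack([np.zeros((1, n)), np.ones((1, 1))])])
    b_ub = np.concatenate([np.zeros(k), [1.0]])
    res1 = linprog(c, A_ub=A_ub, b_ub=b_ub,
                   bounds=[(None, None)] * (n + 1), method="highs")
    # Alternative 2: exists u >= 0 with A^T u = 0 and 1^T u = 1.
    A_eq = np.vstack([A.T, np.ones((1, k))])
    b_eq = np.concatenate([np.zeros(n), [1.0]])
    res2 = linprog(np.zeros(k), A_eq=A_eq, b_eq=b_eq,
                   bounds=[(0, None)] * k, method="highs")
    return res1.fun < -tol, res2.status == 0

print(farkas_variation(np.array([[1.0, 0.0], [0.0, 1.0]])))     # (True, False)
print(farkas_variation(np.array([[1.0, 0.0], [-1.0, 0.0]])))    # (False, True)
```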
Based on the third example in the Feasible Direction section, the set of feasible directions at a local optimum $x^\ast$ of Problem $\eqref{eq6}$ is given by
$$F(x^\ast) = \{d\,|\,\nabla g_i(x^\ast)^\mathrm{T}d<0,\ i\in\mathcal{A}(x^\ast)\}.$$
The set of descent directions is
$$D(x^\ast) = \{d\,|\,\nabla f(x^\ast)^\mathrm{T}d<0\}.$$
Let
$$A = \left[\nabla f(x^\ast),\nabla g_{i_1}(x^\ast),\,\cdots,\nabla g_{i_n}(x^\ast)\right]^\mathrm{T},$$
where $\{i_1,\,\cdots,i_n\}=\mathcal{A}(x^\ast)$. Then Equation $\eqref{eq3}$ states exactly that the system $Ad<0$ has no solution. By the Farkas’ Lemma variation, this implies the existence of a vector $u\geq 0$ such that
$$A^\mathrm{T}u=0,\quad\mathbf{1}^\mathrm{T}u=1.$$
Writing $u=(u_0,u_{i_1},\,\cdots,u_{i_n})$, this condition translates to
$$u_0\nabla f(x^\ast)+\sum_{i\in\mathcal{A}(x^\ast)}u_i\nabla g_i(x^\ast) = 0,\quad u\geq 0,\quad\mathbf{1}^\mathrm{T}u=1. \tag{9} \label{eq9}$$
If $u_0=0$, then $\sum_{i\in\mathcal{A}(x^\ast)}u_i\nabla g_i(x^\ast)=0$ with $u\geq 0$ and $\mathbf{1}^\mathrm{T}u=1$, i.e., a nontrivial vanishing linear combination of the active constraint gradients, which violates LICQ. Hence, $u_0>0$. Defining $y_i = u_i/u_0$ yields
$$\nabla f(x^\ast)+\sum_{i\in\mathcal{A}(x^\ast)}y_i\nabla g_i(x^\ast) = 0,\quad y_i\geq 0. \tag{10} \label{eq10}$$
Finally, recall that there are $m$ constraints in total, and set $y_i=0$ for all inactive constraints $i\notin\mathcal{A}(x^\ast)$. Then Equation $\eqref{eq10}$ can be rewritten as
$$\nabla f(x^\ast)+\sum_{i=1}^m y_i\nabla g_i(x^\ast) = 0,\quad y_ig_i(x^\ast)=0,\ i=1,\,\cdots,m.\quad \blacksquare$$
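A small numeric check of this construction (toy problem with invented data): minimize $x_1^2+x_2^2$ subject to $1-x_1-x_2\leq 0$. The minimizer is $x^\ast=(0.5,0.5)$ with the single constraint active, and solving the stationarity equation for the active-constraint multiplier recovers $y_1=1\geq 0$.

```python
import numpy as np

# minimize f(x) = x1^2 + x2^2  subject to  g(x) = 1 - x1 - x2 <= 0
grad_f = lambda x: 2 * x
grad_g = lambda x: np.array([-1.0, -1.0])
g = lambda x: 1 - x[0] - x[1]

x_star = np.array([0.5, 0.5])           # known minimizer; g is active here
assert abs(g(x_star)) < 1e-12

# Solve grad_f(x*) + y * grad_g(x*) = 0 for the active-constraint multiplier.
y, *_ = np.linalg.lstsq(grad_g(x_star).reshape(-1, 1),
                        -grad_f(x_star), rcond=None)
print(y)                                                        # [1.]  (nonnegative)
print(np.allclose(grad_f(x_star) + y[0] * grad_g(x_star), 0))   # stationarity
```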
KKT Necessary Conditions for Nonlinear Optimization #
We now consider the case with both equality and inequality constraints taken into account:
\begin{align} \mathrm{minimize}\quad& f(x)\\ \mathrm{s.t.}\quad& g_i(x)\leq 0,\quad i=1,\,\cdots,m\\ & h_i(x)=0,\quad i=1,\,\cdots,p \tag{11} \label{eq11} \end{align}
Combining Equations $\eqref{eq5}$ and $\eqref{eq7}$, we readily obtain the following theorem, which gives the KKT conditions.
Theorem 4
If $x^\ast$ is a local minimizer and LICQ holds, then there exist $y^\ast$ and $v^\ast$ such that
\begin{align} &\nabla f(x^\ast)+\sum_{i=1}^my_i^\ast\nabla g_i(x^\ast)+\sum_{i=1}^pv_i^\ast\nabla h_i(x^\ast) = 0&\quad &\text{(stationarity)}\\ &y^\ast\geq 0&\quad &\text{(dual feasibility)}\\ &g_i(x^\ast)\leq 0,\quad i=1,\,\cdots,m&\quad &\text{(primal feasibility)}\\ &h_i(x^\ast)=0,\quad i=1,\,\cdots,p&\quad &\text{(primal feasibility)}\\ &y_i^\ast g_i(x^\ast)=0,\quad i=1,\,\cdots,m&\quad &\text{(complementary slackness)} \end{align}
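As a sanity check, the sketch below (toy problem with invented data; `numpy` assumed) verifies all five conditions for minimize $x_1^2+x_2^2$ subject to $x_1+x_2=1$ and $0.25-x_1\leq 0$. The inequality is inactive at the minimizer $x^\ast=(0.5,0.5)$, so $y^\ast=0$ and $v^\ast=-1$.

```python
import numpy as np

# minimize x1^2 + x2^2   s.t.   g(x) = 0.25 - x1 <= 0,   h(x) = x1 + x2 - 1 = 0
grad_f = lambda x: 2 * x
g, grad_g = lambda x: 0.25 - x[0], lambda x: np.array([-1.0, 0.0])
h, grad_h = lambda x: x[0] + x[1] - 1, lambda x: np.array([1.0, 1.0])

x_star = np.array([0.5, 0.5])
y_star, v_star = 0.0, -1.0          # candidate multipliers

checks = {
    "stationarity": np.allclose(
        grad_f(x_star) + y_star * grad_g(x_star) + v_star * grad_h(x_star), 0),
    "dual feasibility": y_star >= 0,
    "primal feasibility (g)": g(x_star) <= 1e-12,
    "primal feasibility (h)": abs(h(x_star)) <= 1e-12,
    "complementary slackness": abs(y_star * g(x_star)) <= 1e-12,
}
print(checks)                       # all True
```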
Duality #
We can also investigate the KKT conditions from the perspective of duality. The Lagrangian of Problem $\eqref{eq11}$ is written as
$$L(x,y,v) = f(x)+\sum_{i=1}^my_ig_i(x)+\sum_{i=1}^pv_ih_i(x),$$
where $y\geq 0$ are multipliers for inequality constraints and $v$ are multipliers for equality constraints. The Lagrange dual function
$$q(y,v) = \inf_x L(x,y,v)$$
provides a lower bound on the primal optimal value $p^\ast$, that is, for any $y\geq 0$ and $v$, we have $q(y,v)\leq p^\ast$.
Proof. Let $\tilde{x}$ be a feasible point. Then,
$$q(y,v) = \inf_x L(x,y,v)\leq f(\tilde{x})+\sum_{i=1}^my_ig_i(\tilde{x})+\sum_{i=1}^pv_ih_i(\tilde{x})\leq f(\tilde{x}),$$
where the second inequality is due to the fact that $y_i\geq 0,\ g_i(\tilde{x})\leq 0,\ h_i(\tilde{x})=0$. $\blacksquare$
The corresponding dual problem maximizes this bound:
$$d^\ast = \max_{y,v}\ q(y,v)\quad\mathrm{s.t.}\ y\geq 0.$$
Note that this is always a convex optimization problem, even if the primal is not. The lower-bound property immediately implies weak duality: $d^\ast\leq p^\ast$. Strong duality, $p^\ast=d^\ast$, does not hold in general; for convex problems, however, it does under mild constraint qualifications.
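A one-dimensional illustration (made-up problem): for minimize $x^2$ subject to $1-x\leq 0$, the primal optimum is $p^\ast=1$ and the dual function works out to $q(y)=y-y^2/4$; the sketch checks $q(y)\leq p^\ast$ on a grid of $y\geq 0$ and that the maximum $d^\ast=1$ is attained at $y^\ast=2$.

```python
import numpy as np

# Primal: minimize x^2  s.t.  1 - x <= 0.   Optimal x* = 1, p* = 1.
p_star = 1.0

# Dual function: q(y) = inf_x [x^2 + y(1 - x)] = y - y^2/4  (minimizer x = y/2).
q = lambda y: y - y**2 / 4

ys = np.linspace(0, 5, 501)
print(np.all(q(ys) <= p_star + 1e-12))     # weak duality: q(y) <= p* for y >= 0
print(ys[np.argmax(q(ys))], q(ys).max())   # dual optimum y* = 2, d* = 1 = p*
```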
Theorem 5
If the problem is convex and there exists a strictly feasible point $x$, i.e.,
\begin{align} &g_i(x)\leq 0\quad \text{for all affine } g_i,\\ &g_i(x)<0\quad \text{for all non-affine } g_i,\\ &h_i(x)=0,\quad i=1,\,\cdots,p, \end{align}
then strong duality holds: $p^\ast=d^\ast$. This condition is known as Slater’s condition, a sufficient condition for strong duality in convex optimization. Moreover, it implies that the dual is not unbounded ($d^\ast\neq +\infty$).
Under strong duality, the KKT conditions become the correct first-order characterization of optimality. We have the following theorem.
Theorem 6
If $x^\ast$ is a local minimizer and strong duality holds, then there exist $y^\ast$ and $v^\ast$ such that
\begin{align} &\nabla_x L(x^\ast,y^\ast,v^\ast) = \nabla f(x^\ast)+\sum_{i=1}^my_i^\ast\nabla g_i(x^\ast)+\sum_{i=1}^pv_i^\ast\nabla h_i(x^\ast) = 0&\quad &\text{(stationarity)}\\ &y^\ast\geq 0&\quad &\text{(dual feasibility)}\\ &g_i(x^\ast)\leq 0,\quad i=1,\,\cdots,m&\quad &\text{(primal feasibility)}\\ &h_i(x^\ast)=0,\quad i=1,\,\cdots,p&\quad &\text{(primal feasibility)}\\ &y_i^\ast g_i(x^\ast)=0,\quad i=1,\,\cdots,m&\quad &\text{(complementary slackness)} \end{align}
For Convex Problems #
For convex optimization problems, the KKT conditions are not only necessary but also sufficient. That is, if $x^\ast,\ y^\ast,\ v^\ast$ satisfy the KKT conditions of a convex problem, then they are optimal.
Proof. By complementary slackness and primal feasibility,
$$L(x^\ast,y^\ast,v^\ast) = f(x^\ast)+\sum_{i=1}^my_i^\ast g_i(x^\ast)+\sum_{i=1}^pv_i^\ast h_i(x^\ast)=f(x^\ast).$$
Since $L(x,y^\ast,v^\ast)$ is convex in $x$ and $\nabla_xL(x^\ast,y^\ast,v^\ast) = 0$, we have
$$q(y^\ast,v^\ast)=\inf_xL(x,y^\ast,v^\ast) = L(x^\ast,y^\ast,v^\ast)\ \Rightarrow\ p^\ast=f(x^\ast)=q(y^\ast,v^\ast)=d^\ast,$$
which means that $x^\ast$ is globally optimal. Thus, in the convex setting, optimality, duality, and the KKT conditions fit together into a single, unified framework.
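The sketch below (same toy convex problem as in the KKT example above) illustrates sufficiency empirically: a point satisfying the KKT conditions attains an objective value no worse than that of randomly sampled feasible points.

```python
import numpy as np

# Convex problem: minimize x1^2 + x2^2  s.t.  x1 + x2 = 1,  0.25 - x1 <= 0.
f = lambda x: float(x @ x)
x_star = np.array([0.5, 0.5])       # KKT point with y* = 0, v* = -1 (see above)

rng = np.random.default_rng(0)
x1 = rng.uniform(0.25, 2.0, size=10_000)       # sample feasible points:
points = np.stack([x1, 1.0 - x1], axis=1)      # x1 >= 0.25 and x1 + x2 = 1
values = (points ** 2).sum(axis=1)
print(f(x_star) <= values.min() + 1e-12)       # True: the KKT point is optimal
```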
Some Remarks on KKT
- For unconstrained problems, the KKT conditions reduce to the first-order necessary condition $\nabla f(x)=0$.
- In general, the LICQ assumption can be replaced with strong duality (compare Theorem 4 and Theorem 6).
- For convex problems, the KKT conditions are always sufficient; if Slater’s condition also holds, they are necessary and sufficient.