## A caffeinated adventure into optimization algorithms and numerical solver libraries in Python

Motivated by an optimization problem in quantitative finance as well as simple curiosity, I started looking into some word-of-mouth ML-related algorithms and various useful libraries for solving large-scale constrained optimization problems. Perhaps because my understanding of optimization has deepened over the years, or because the newly bought green matcha tea has worked wonders on my head, the open-source documentation for the various numerical libraries seemed exceedingly clear. Here I will simply share a lightly annotated laundry list of all the useful tidbits picked up in 2 hours of distracted self-study:

1. Bayes point algorithm: I first heard about this through a colleague’s work, where the main selling point seems to be the ability to control the model directly through example weights rather than feature weights. The definitive reference seems to be this 34-page paper, which is clear on a sober afternoon, but can be quite daunting if one simply wants to scan through it to pick up the main idea. I haven’t finished it, but the first 5 pages look quite sensible and promise a good return on time spent. My current speculative understanding is that this is a mixture of ideas from support vector machines, which pose a frequentist-style optimization problem, and Bayesian inference, which is more expensive but has the nice “anytime” property, meaning even a non-converged model is useful. One thing I found funny was that the authors talked about Cox’s theorem on objective probability; I’m not sure it is really necessary in a technical paper like this, but authors are certainly allowed to digress a little philosophically.
2. Bayesian linear regression: I learned this mainly through the namesake wikipedia article. Don’t look at the multivariate version, since it distracts you from the main point. The idea is to put a prior distribution on the weights (i.e., the regression coefficients), which can be conveniently chosen to be Gaussian (a conjugate prior). Then the posterior is also Gaussian, with mean and variance that depend on the data as well as the prior. The formula presented towards the end indeed shows that if the prior is highly concentrated near its mean (high confidence), then the posterior distribution will lean towards the prior mean.
3. Gauss–Newton method: I thought I had known Newton’s method since grade school. Turns out even a veteran like me can get abysmally confused about the distinction between solving an equation and optimizing an objective function. So I spent a few minutes wondering why Newton’s method is considered a second-order method; to be fair, in the root-finding setting the “second order” label has nothing to do with the use of 2nd derivatives, but with the quadratic convergence rate (in the optimization setting one does need the Hessian, since the method is applied to the gradient). To equally uninitiated readers: a (vector-valued) equation of the form $F(x) = y$ will typically have isolated solutions only when $x$ and $y$ have the same dimension (both assumed to live in some Euclidean space). Otherwise you get a sub-variety as your solution set, which numerical analysts and engineers typically don’t care for. Similarly, optimization only makes sense when the objective function is real-valued, rather than vector-valued. By taking the gradient, one converts the latter type of problem into the former. At any rate, I started looking up Gauss–Newton, which isn’t exactly Newton–Raphson, after seeing this line in the implementation note of the trf library, which by the way is extremely well written and made me understand the trust region reflective algorithm in one sitting. In it, the author mentions that one can approximate the Hessian matrix of a nonlinear objective function by $J^T J$. This looked vaguely similar to the exact solution of linear least squares, and in fact is related. As long as the objective function is a sum of squares of individual components, the GN algorithm works, approximating the Hessian using only first-order derivatives. This obviously can speed things up a lot.
4. fmin_slsqp function in scipy: I found out about this mainly through this blog post. Upon looking at the implementation on github, I got a bit dismayed since a heavy portion of the code is written with plain Python while and for loops. But I keep telling myself not to prematurely optimize, so maybe this will be my first library of choice. The underlying Sequential Least SQuares Programming approach solves a quadratic programming subproblem at each iteration.
5. least_squares implementation in scipy: This one looks more promising in terms of performance; in particular, it uses the trf method mentioned below, which promises to be appropriate for large-scale problems.
6. trust region reflective algorithm in scipy: the implementation note for trf above is quite good. The essential idea seems to be to treat the constrained optimization problem locally as a quadratic programming problem, baking the inequality constraints into the objective, and then to rescale the region of optimization by the inverse of a diagonal matrix whose entries are the distances to the boundary in each coordinate direction (presumably the region is always convex so that this makes sense).
7. Cauchy point: the choice of gradient descent step length that minimizes the local quadratic model under the constraint that the independent variable doesn’t move beyond a certain radius. This seems related to the Wolfe conditions, but the latter guarantee convergence of the gradient norm to 0, whereas the Cauchy point is just a way to cheaply optimize a single gradient descent step under a constraint.
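
To make the conjugate-prior update in item 2 concrete, here is a minimal NumPy sketch; the synthetic data, the noise variance, and the prior parameters are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + Gaussian noise with known variance.
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25  # known noise variance
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian prior on the weights: w ~ N(m0, S0).
m0 = np.zeros(d)
S0 = 10.0 * np.eye(d)  # diffuse prior

# Conjugacy: the posterior is Gaussian with
#   S_n^{-1} = S0^{-1} + X^T X / sigma2
#   m_n      = S_n (S0^{-1} m0 + X^T y / sigma2)
S0_inv = np.linalg.inv(S0)
Sn = np.linalg.inv(S0_inv + X.T @ X / sigma2)
mn = Sn @ (S0_inv @ m0 + X.T @ y / sigma2)

print(mn)  # posterior mean, close to w_true with this much data
```

Shrinking `S0` towards zero pulls `mn` towards `m0` regardless of the data, which is exactly the “posterior leans towards a confident prior” behavior described above.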
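
The $J^T J$ approximation in item 3 fits in a few lines. Below is a sketch of plain Gauss–Newton on the Rosenbrock function written in least-squares form (a standard textbook toy problem, not something from the trf note):

```python
import numpy as np

# Gauss-Newton on a sum-of-squares objective f(x) = 0.5 * ||r(x)||^2,
# here the Rosenbrock function in least-squares form.

def residuals(x):
    return np.array([10.0 * (x[1] - x[0] ** 2), 1.0 - x[0]])

def jacobian(x):
    return np.array([[-20.0 * x[0], 10.0],
                     [-1.0, 0.0]])

x = np.array([-1.2, 1.0])  # classic starting point
for _ in range(10):
    r = residuals(x)
    J = jacobian(x)
    # The true Hessian of f would need second derivatives of r;
    # Gauss-Newton replaces it by J^T J, which needs only first derivatives.
    x = x + np.linalg.solve(J.T @ J, -J.T @ r)

print(x)  # converges to [1., 1.]
```

Because this is a zero-residual problem, the dropped second-derivative terms vanish at the solution and Gauss–Newton converges very quickly here.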
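
For item 4, a minimal usage sketch of `fmin_slsqp` on a made-up toy problem with one equality constraint:

```python
import numpy as np
from scipy.optimize import fmin_slsqp

# Toy problem (made up for illustration): minimize a smooth quadratic
# subject to a single linear equality constraint.

def objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# Each constraint function must return 0 at feasible points.
def eq_constraint(x):
    return x[0] + x[1] - 1.0

x0 = np.zeros(2)
x_opt = fmin_slsqp(objective, x0, eqcons=[eq_constraint], iprint=0)
print(x_opt)  # ≈ [0., 1.] by a quick Lagrange-multiplier check
```

Inequality constraints go in `ieqcons` the same way (functions that must be non-negative at feasible points).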
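
For item 5, a minimal sketch of `least_squares` with `method='trf'` and bound constraints, again on a made-up exponential-fit problem:

```python
import numpy as np
from scipy.optimize import least_squares

# Noiseless toy data (made up for illustration): y = 2 * exp(-1.5 * t).
t = np.linspace(0, 3, 30)
y = 2.0 * np.exp(-1.5 * t)

def residuals(p):
    return p[0] * np.exp(-p[1] * t) - y

# Bounds are per-variable (lower, upper) arrays; the trf method keeps
# iterates strictly inside them.
res = least_squares(residuals, x0=[1.0, 1.0], method="trf",
                    bounds=([0.0, 0.0], [10.0, 10.0]))
print(res.x)  # ≈ [2.0, 1.5]
```

Note that `least_squares` takes the residual vector, not the scalar sum of squares, which is what lets it form the $J^T J$ Gauss–Newton approximation internally.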
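
For item 7, the Cauchy point of the local quadratic model $m(p) = f + g^T p + \tfrac{1}{2} p^T B p$ over $\|p\| \le \Delta$ has a closed form; this is the standard textbook formula, with made-up numbers:

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g within radius delta."""
    gnorm = np.linalg.norm(g)
    gBg = g @ B @ g
    if gBg <= 0:
        tau = 1.0  # model decreases along -g all the way to the boundary
    else:
        # Unconstrained minimizer along -g, clipped to the boundary.
        tau = min(gnorm ** 3 / (delta * gBg), 1.0)
    return -tau * (delta / gnorm) * g

g = np.array([1.0, 2.0])
B = np.array([[2.0, 0.0], [0.0, 1.0]])
p_c = cauchy_point(g, B, delta=0.5)
print(p_c)
```

Since the whole computation is one norm and one quadratic form, this is the “cheap” single constrained descent step described above; trust-region methods typically use it as a fallback that guarantees sufficient decrease of the model.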