Optimization problems arising in intelligent systems are similar to those studied in other fields (such as operations research, control, and computational physics). But they also have a few prominent features that are not addressed particularly well by classic optimization methods.

One big issue is that classic optimization methods for high-dimensional problems are not well equipped to deal with imprecisions and uncertainty in the computation itself. This is an issue because Big Data problems often have the property that computational precision can be traded off against computational cost. One of the most widely occuring problem structure is that one has to find a (local) optimum of a function $L$ that is the sum of many similar terms, each arising from an individual data point $y_i$

$$L(x) = \frac{1}{N}\sum_{i = 1} ^N \ell(y_i,x) $$

Examples of this problem include the training of neural networks, of logistic regressors, and many other linear and nonlinear regression/classification algorithms. If the dataset is very large or even infinite, it is impossible, or at least inefficient, to evaluate the entire sum. Instead, one draws $M\ll N$ (hopefully representative) *samples $y_j$* from some distribution and computes the approximation

$$\hat{L}(x) = \frac{1}{M} \sum_{j=1} ^M \ell(y_j,x) \approx L(x)$$

If the representers $y_j$ are drawn independently and identically from some distribution, then this approximation deviates, relative to the true $L(x)$, by an approximately Gaussian disturbance. Unfortunately, efficient classic numerical methods (like e.g. quasi-Newton methods) often react to these disturbances in an unstable way. It isn't even straightforward to choose good step-sizes for such methods.

Our work in this area includes the characterization of classic optimization methods as autoregressive methods, and the development of robust routines for optimization. One compact but important result of our work is a *probabilistic line search* -- a method that efficiently selects step lengths for algorithms like stochastic gradient descent and its variants. This algorithm is increasingly used in industrial optimization problems by some of the most prominent industrial players in machine learning.

7 results

**Early Stopping Without a Validation Set**
*arXiv preprint arXiv:1703.09580*, 2017 (article)

**Coupling Adaptive Batch Sizes with Learning Rates**
In *Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI)*, pages: 410-419, (Editors: Gal Elidan and Kristian Kersting), 2017 (inproceedings)

**Probabilistic Line Searches for Stochastic Optimization**
In *Advances in Neural Information Processing Systems 28*, pages: 181-189, (Editors: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama and R. Garnett), Curran Associates, Inc., 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015 (inproceedings)

**Probabilistic Interpretation of Linear Solvers**
*SIAM Journal on Optimization*, 25(1):234-260, 2015 (article)

**Quasi-Newton Methods: A New Direction**
*Journal of Machine Learning Research*, 14(1):843-865, March 2013 (article)

**Fast Probabilistic Optimization from Noisy Gradients**
In *Proceedings of The 30th International Conference on Machine Learning, JMLR W&CP 28(1)*, pages: 62–70, (Editors: S Dasgupta and D McAllester), ICML, 2013 (inproceedings)

**Quasi-Newton Methods: A New Direction**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 25-32, ICML ’12, (Editors: John Langford and Joelle Pineau), Omnipress, New York, NY, USA, ICML, July 2012 (inproceedings)