CLC number: TP18
On-line Access: 2022-09-21
Received: 2022-07-10
Revision Accepted: 2022-09-21
Crosschecked: 2022-07-24
Yi MA, Doris TSAO, Heung-Yeung SHUM. On the principles of Parsimony and Self-consistency for the emergence of intelligence[J]. Frontiers of Information Technology & Electronic Engineering, 2022, 23(9): 1298-1323.
@article{Ma2022Parsimony,
title="On the principles of Parsimony and Self-consistency for the emergence of intelligence",
author="Yi MA and Doris TSAO and Heung-Yeung SHUM",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="23",
number="9",
pages="1298-1323",
year="2022",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200297"
}
%0 Journal Article
%T On the principles of Parsimony and Self-consistency for the emergence of intelligence
%A Yi MA
%A Doris TSAO
%A Heung-Yeung SHUM
%J Frontiers of Information Technology & Electronic Engineering
%V 23
%N 9
%P 1298-1323
%@ 2095-9184
%D 2022
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2200297
TY - JOUR
T1 - On the principles of Parsimony and Self-consistency for the emergence of intelligence
A1 - Yi MA
A1 - Doris TSAO
A1 - Heung-Yeung SHUM
JO - Frontiers of Information Technology & Electronic Engineering
VL - 23
IS - 9
SP - 1298
EP - 1323
SN - 2095-9184
Y1 - 2022
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2200297
ER -
Abstract: Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that helps place deep networks within the bigger picture of intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, which address two fundamental questions regarding intelligence: what to learn and how to learn, respectively. We believe the two principles serve as the cornerstone for the emergence of intelligence, artificial or natural. While they have rich classical roots, we argue that they can be restated in entirely measurable and computable terms. More specifically, the two principles lead to an effective and efficient computational framework, compressive closed-loop transcription, which unifies and explains the evolution of modern deep networks and most practices of artificial intelligence. While we use mainly visual data modeling as an example, we believe the two principles will unify the understanding of broad families of autonomous intelligent systems and provide a framework for understanding the brain.
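How the principles become "measurable and computable" is made concrete in the paper through the coding rate reduction objective of Yu et al. (2020) [134] and Chan et al. (2022) [20]. The NumPy sketch below is illustrative only, not the paper's implementation: the function names, the choice of eps, and the toy data are our assumptions. It evaluates the rate reduction ΔR(Z, Π) = R(Z) − R^c(Z, Π) for a feature matrix Z with one column per sample, given class labels that define the partition Π.

import numpy as np

def coding_rate(Z, eps=0.5):
    # R(Z): rate (in nats) needed to encode the columns of Z up to distortion eps
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)[1]

def coding_rate_per_class(Z, labels, eps=0.5):
    # R^c(Z, Pi): rate when each class is encoded separately, weighted by class size
    d, n = Z.shape
    total = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        nc = Zc.shape[1]
        total += (nc / (2.0 * n)) * np.linalg.slogdet(
            np.eye(d) + (d / (nc * eps**2)) * Zc @ Zc.T)[1]
    return total

def rate_reduction(Z, labels, eps=0.5):
    # Delta R = R(Z) - R^c(Z, Pi): the parsimony objective to be maximized
    return coding_rate(Z, eps) - coding_rate_per_class(Z, labels, eps)

# Toy usage: 3 classes of 8-dimensional features, 100 samples each,
# with columns normalized to the unit sphere.
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 300))
Z /= np.linalg.norm(Z, axis=0)
labels = np.repeat([0, 1, 2], 100)
print(rate_reduction(Z, labels))

In the closed-loop transcription of Dai et al. (2022) [27], a rate-reduction quantity of this kind is not simply maximized by an encoder; it is contested in a minimax game between an encoder and a decoder over the closed loop, which is what the abstract refers to as compressive closed-loop transcription.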
[1]Agarwal A, Kakade S, Krishnamurthy A, et al., 2020. FLAMBE: structural complexity and representation learning of low rank MDPs. Proc 34th Int Conf on Neural Information Processing Systems, p.20095-20107.
[2]Azulay A, Weiss Y, 2019. Why do deep convolutional networks generalize so poorly to small image transformations? https://arxiv.org/abs/1805.12177
[3]Baek C, Wu ZY, Chan KHR, et al., 2022. Efficient maximal coding rate reduction by variational forms. https://arxiv.org/abs/2204.00077
[4]Bai SJ, Kolter JZ, Koltun V, 2019. Deep equilibrium models. Proc 33rd Int Conf on Neural Information Processing Systems, p.690-701.
[5]Baker B, Gupta O, Naik N, et al., 2017. Designing neural network architectures using reinforcement learning. https://arxiv.org/abs/1611.02167
[6]Bao PL, She L, McGill M, et al., 2020. A map of object space in primate inferotemporal cortex. Nature, 583(7814):103-108.
[7]Barlow HB, 1961. Possible principles underlying the transformations of sensory messages. In: Rosenblith WA (Ed.), Sensory Communication. MIT Press, Cambridge, MA, USA, p.217-234.
[8]Bear DM, Fan CF, Mrowca D, et al., 2020. Learning physical graph representations from visual scenes. Proc 34th Int Conf on Neural Information Processing Systems, p.6027-6039.
[9]Belkin M, Hsu D, Ma SY, et al., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc Natl Acad Sci USA, 116(32):15849-15854.
[10]Benna MK, Fusi S, 2021. Place cells may simply be memory cells: memory compression leads to spatial tuning and history dependence. Proc Natl Acad Sci USA, 118(51):e2018422118.
[11]Bennett J, Carbery A, Christ M, et al., 2008. The Brascamp–Lieb inequalities: finiteness, structure and extremals. Geom Funct Anal, 17(5):1343-1415.
[12]Berner C, Brockman G, Chan B, et al., 2019. Dota 2 with large scale deep reinforcement learning. https://arxiv.org/abs/1912.06680
[13]Bertsekas DP, 2012. Dynamic Programming and Optimal Control, Volume I and II. Athena Scientific, Belmont, Massachusetts, USA.
[14]Bronstein MM, Bruna J, Cohen T, et al., 2021. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. https://arxiv.org/abs/2104.13478
[15]Bruna J, Mallat S, 2013. Invariant scattering convolution networks. IEEE Trans Patt Anal Mach Intell, 35(8):1872-1886.
[16]Buchanan S, Gilboa D, Wright J, 2021. Deep networks and the multiple manifold problem. https://arxiv.org/abs/2008.11245
[17]Candès EJ, Li XD, Ma Y, et al., 2011. Robust principal component analysis? J ACM, 58(3):11.
[18]Chai JX, Tong X, Chan SC, et al., 2000. Plenoptic sampling. Proc 27th Annual Conf on Computer Graphics and Interactive Techniques, p.307-318.
[19]Chan ER, Monteiro M, Kellnhofer P, et al., 2021. pi-GAN: periodic implicit generative adversarial networks for 3D-aware image synthesis. https://arxiv.org/abs/2012.00926
[20]Chan KHR, Yu YD, You C, et al., 2022. ReduNet: a white-box deep network from the principle of maximizing rate reduction. J Mach Learn Res, 23(114):1-103.
[21]Chan TH, Jia K, Gao SH, et al., 2015. PCANet: a simple deep learning baseline for image classification? IEEE Trans Image Process, 24(12):5017-5032.
[22]Chang L, Tsao DY, 2017. The code for facial identity in the primate brain. Cell, 169(6):1013-1028.
[23]Cohen H, Kumar A, Miller SD, et al., 2017. The sphere packing problem in dimension 24. Ann Math, 185(3):1017-1033.
[24]Cohen TS, Welling M, 2016. Group equivariant convolutional networks. https://arxiv.org/abs/1602.07576
[25]Cohen TS, Geiger M, Weiler M, 2019. A general theory of equivariant CNNs on homogeneous spaces. Proc 33rd Int Conf on Neural Information Processing Systems, p.9145-9156.
[26]Cover TM, Thomas JA, 2006. Elements of Information Theory (2nd Ed.). John Wiley & Sons, Inc., Hoboken, New Jersey, USA.
[27]Dai XL, Tong SB, Li MY, et al., 2022. Closed-loop data transcription to an LDR via minimaxing rate reduction. https://arxiv.org/abs/2111.06636
[28]Dosovitskiy A, Beyer L, Kolesnikov A, et al., 2021. An image is worth 16×16 words: transformers for image recognition at scale. https://arxiv.org/abs/2010.11929
[29]El Ghaoui L, Gu FD, Travacca B, et al., 2021. Implicit deep learning. SIAM J Math Data Sci, 3(3):930-958.
[30]Engstrom L, Tran B, Tsipras D, et al., 2019. A rotation and a translation suffice: fooling CNNs with simple transformations. https://arxiv.org/abs/1712.02779v3
[31]Fefferman C, Mitter S, Narayanan H, 2013. Testing the manifold hypothesis. https://arxiv.org/abs/1310.0425
[32]Fiez T, Chasnov B, Ratliff LJ, 2019. Convergence of learning dynamics in Stackelberg games. https://arxiv.org/abs/1906.01217
[33]Friston K, 2009. The free-energy principle: a rough guide to the brain? Trends Cogn Sci, 13(7):293-301.
[34]Fukushima K, 1980. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern, 36(4):193-202.
[35]Goodfellow IJ, Pouget-Abadie J, Mirza M, et al., 2014. Generative adversarial nets. Proc 27th Int Conf on Neural Information Processing Systems, p.2672-2680.
[36]Gortler SJ, Grzeszczuk R, Szeliski R, et al., 1996. The lumigraph. Proc 23rd Annual Conf on Computer Graphics and Interactive Techniques, p.43-54.
[37]Gregor K, LeCun Y, 2010. Learning fast approximations of sparse coding. Proc 27th Int Conf on Machine Learning, p.399-406.
[38]Hadsell R, Chopra S, LeCun Y, 2006. Dimensionality reduction by learning an invariant mapping. IEEE Computer Society Conf on Computer Vision and Pattern Recognition, p.1735-1742.
[39]He KM, Zhang XY, Ren SQ, et al., 2016. Deep residual learning for image recognition. IEEE Conf on Computer Vision and Pattern Recognition, p.770-778.
[40]Hinton GE, Zemel RS, 1993. Autoencoders, minimum description length and Helmholtz free energy. Proc 6th Int Conf on Neural Information Processing Systems, p.3-10.
[41]Hinton GE, Dayan P, Frey BJ, et al., 1995. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158-1161.
[42]Ho J, Jain A, Abbeel P, 2020. Denoising diffusion probabilistic models. https://arxiv.org/abs/2006.11239
[43]Hochreiter S, Schmidhuber J, 1997. Long short-term memory. Neur Comput, 9(8):1735-1780.
[44]Huang G, Liu Z, van der Maaten L, et al., 2017. Densely connected convolutional networks. IEEE Conf on Computer Vision and Pattern Recognition, p.2261-2269.
[45]Hughes JF, van Dam A, McGuire M, et al., 2014. Computer Graphics: Principles and Practice (3rd Ed.). Addison-Wesley, Upper Saddle River, NJ, USA.
[46]Hutter F, Kotthoff L, Vanschoren J, 2019. Automated Machine Learning: Methods, Systems, Challenges. Springer, Cham, Switzerland.
[47]Hyvärinen A, 1997. A family of fixed-point algorithms for independent component analysis. IEEE Int Conf on Acoustics, Speech, and Signal Processing, p.3917-3920.
[48]Hyvärinen A, Oja E, 1997. A fast fixed-point algorithm for independent component analysis. Neur Comput, 9(7):1483-1492.
[49]Jin C, Netrapalli P, Jordan MI, 2020. What is local optimality in nonconvex-nonconcave minimax optimization? https://arxiv.org/abs/1902.00618
[50]Jolliffe IT, 1986. Principal Component Analysis. Springer-Verlag, New York, NY, USA.
[51]Josselyn SA, Tonegawa S, 2020. Memory engrams: recalling the past and imagining the future. Science, 367(6473):eaaw4325.
[52]Kakade SM, 2001. A natural policy gradient. Proc 14th Int Conf on Neural Information Processing Systems: Natural and Synthetic, p.1531-1538.
[53]Kanwisher N, 2010. Functional specificity in the human brain: a window into the functional architecture of the mind. Proc Natl Acad Sci USA, 107(25):11163-11170.
[54]Kanwisher N, McDermott J, Chun MM, 1997. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J Neurosci, 17(11):4302-4311.
[55]Keller GB, Mrsic-Flogel TD, 2018. Predictive processing: a canonical cortical computation. Neuron, 100(2):424-435.
[56]Kelley HJ, 1960. Gradient theory of optimal flight paths. ARS J, 30(10):947-954.
[57]Kingma DP, Welling M, 2013. Auto-encoding variational Bayes. https://arxiv.org/abs/1312.6114
[58]Kobyzev I, Prince SJD, Brubaker MA, 2021. Normalizing flows: an introduction and review of current methods. IEEE Trans Patt Anal Mach Intell, 43(11):3964-3979.
[59]Koopman BO, 1931. Hamiltonian systems and transformation in Hilbert space. Proc Natl Acad Sci USA, 17(5):315-318.
[60]Kramer MA, 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE J, 37(2):233-243.
[61]Kriegeskorte N, Mur M, Ruff DA, et al., 2008. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126-1141.
[62]Krizhevsky A, Sutskever I, Hinton GE, 2012. ImageNet classification with deep convolutional neural networks. Proc 25th Int Conf on Neural Information Processing Systems, p.1097-1105.
[63]Kulkarni TD, Whitney WF, Kohli P, et al., 2015. Deep convolutional inverse graphics network. Proc 28th Int Conf on Neural Information Processing Systems, p.2539-2547.
[64]LeCun Y, 2022. A Path Towards Autonomous Machine Intelligence. https://openreview.net/pdf?id=BZ5a1r-kVsf
[65]LeCun Y, Browning J, 2022. What AI can tell us about intelligence. Noema Magazine. https://www.noemamag.com/what-ai-can-tell-us-about-intelligence/
[66]LeCun Y, Bottou L, Bengio Y, et al., 1998. Gradient-based learning applied to document recognition. Proc IEEE, 86(11):2278-2324.
[67]LeCun Y, Bengio Y, Hinton G, 2015. Deep learning. Nature, 521(7553):436-444.
[68]Lei N, Su KH, Cui L, et al., 2017. A geometric view of optimal transportation and generative model. https://arxiv.org/abs/1710.05488
[69]Levoy M, Hanrahan P, 1996. Light field rendering. Proc 23rd Annual Conf on Computer Graphics and Interactive Techniques, p.31-42.
[70]Li G, Wei YT, Chi YJ, et al., 2020. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Proc 34th Int Conf on Neural Information Processing Systems, p.12861-12872.
[71]Ma Y, Soatto S, Košecká J, et al., 2004. An Invitation to 3-D Vision: from Images to Geometric Models. Springer-Verlag, New York, USA.
[72]Ma Y, Derksen H, Hong W, et al., 2007. Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Trans Patt Anal Mach Intell, 29(9):1546-1562.
[73]MacDonald J, Wäldchen S, Hauch S, et al., 2019. A rate-distortion framework for explaining neural network decisions. https://arxiv.org/abs/1905.11092
[74]Marcus G, 2020. The next decade in AI: four steps towards robust artificial intelligence. https://arxiv.org/abs/2002.06177
[75]Marr D, 1982. Vision. MIT Press, Cambridge, MA, USA.
[76]Mayr O, 1970. The Origins of Feedback Control. MIT Press, Cambridge, MA, USA.
[77]McCloskey M, Cohen NJ, 1989. Catastrophic interference in connectionist networks: the sequential learning problem. Psychol Learn Motiv, 24:109-165.
[78]Mildenhall B, Srinivasan PP, Tancik M, et al., 2020. NeRF: representing scenes as neural radiance fields for view synthesis. https://arxiv.org/abs/2003.08934
[79]Nash J, 1951. Non-cooperative games. Ann Math, 54(2):286-295.
[80]Newell A, Simon HA, 1972. Human Problem Solving. Prentice Hall, Englewood Cliffs, New Jersey, USA.
[81]Ng AY, Russell SJ, 2000. Algorithms for inverse reinforcement learning. Proc 17th Int Conf on Machine Learning, p.663-670.
[82]Olshausen BA, Field DJ, 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609.
[83]Osband I, van Roy B, 2014. Model-based reinforcement learning and the eluder dimension. Proc 27th Int Conf on Neural Information Processing Systems, p.1466-1474.
[84]Pai D, Psenka M, Chiu CY, et al., 2022. Pursuit of a discriminative representation for multiple subspaces via sequential games. https://arxiv.org/abs/2206.09120
[85]Papyan V, Romano Y, Sulam J, et al., 2018. Theoretical foundations of deep learning via sparse representations: a multilayer sparse model and its connection to convolutional neural networks. IEEE Signal Process Mag, 35(4):72-89.
[86]Papyan V, Han XY, Donoho DL, 2020. Prevalence of neural collapse during the terminal phase of deep learning training. https://arxiv.org/abs/2008.08186
[87]Patterson D, Gonzalez J, Hölzle U, et al., 2022. The carbon footprint of machine learning training will plateau, then shrink. https://arxiv.org/abs/2204.05149
[88]Quinlan JR, 1986. Induction of decision trees. Mach Learn, 1(1):81-106.
[89]Rao RPN, Ballard DH, 1999. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci, 2(1):79-87.
[90]Rifai S, Vincent P, Muller X, et al., 2011. Contractive auto-encoders: explicit invariance during feature extraction. Proc 28th Int Conf on Machine Learning, p.833-840.
[91]Rissanen J, 1989. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Inc., Singapore.
[92]Roberts DA, Yaida S, 2022. The Principles of Deep Learning Theory. Cambridge University Press, Cambridge, UK.
[93]Rosenblatt F, 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev, 65(6):386-408.
[94]Rumelhart DE, Hinton GE, Williams RJ, 1986. Learning representations by back-propagating errors. Nature, 323(6088):533-536.
[95]Russell S, Norvig P, 2020. Artificial Intelligence: a Modern Approach (4th Ed.). Pearson Education, Inc., Hoboken, NJ, USA.
[96]Sastry S, 1999. Nonlinear Systems: Analysis, Stability, and Control. Springer, New York, USA.
[97]Saxe AM, Bansal Y, Dapello J, et al., 2019. On the information bottleneck theory of deep learning. J Stat Mech, 2019:124020.
[98]Shamir A, Melamed O, BenShmuel O, 2022. The dimpled manifold model of adversarial examples in machine learning. https://arxiv.org/abs/2106.10151
[99]Shannon CE, 1948. A mathematical theory of communication. Bell Syst Techn J, 27(3):379-423.
[100]Shazeer N, Mirhoseini A, Maziarz K, et al., 2017. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. https://arxiv.org/abs/1701.06538
[101]Shum HY, Chan SC, Kang SB, 2007. Image-Based Rendering. Springer, New York, USA.
[102]Shwartz-Ziv R, Tishby N, 2017. Opening the black box of deep neural networks via information. https://arxiv.org/abs/1703.00810
[103]Silver D, Huang A, Maddison CJ, et al., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489.
[104]Silver D, Schrittwieser J, Simonyan K, et al., 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359.
[105]Simon HA, 1969. The Sciences of the Artificial. MIT Press, Cambridge, MA, USA.
[106]Srivastava A, Valkov L, Russell C, et al., 2017. VEEGAN: reducing mode collapse in GANs using implicit variational learning. Proc 31st Int Conf on Neural Information Processing Systems, p.3310-3320.
[107]Srivastava RK, Greff K, Schmidhuber J, 2015. Highway networks. https://arxiv.org/abs/1505.00387
[108]Sutton RS, Barto AG, 2018. Reinforcement Learning: an Introduction (2nd Ed.). MIT Press, Cambridge, MA, USA.
[109]Szegedy C, Zaremba W, Sutskever I, et al., 2014. Intriguing properties of neural networks. https://arxiv.org/abs/1312.6199
[110]Szeliski R, 2022. Computer Vision: Algorithms and Applications (2nd Ed.). Springer-Verlag, Switzerland.
[111]Tenenbaum JB, de Silva V, Langford JC, 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323.
[112]Tishby N, Zaslavsky N, 2015. Deep learning and the information bottleneck principle. IEEE Information Theory Workshop, p.1-5.
[113]Tong SB, Dai XL, Wu ZY, et al., 2022. Incremental learning of structured memory via closed-loop transcription. https://arxiv.org/abs/2202.05411
[114]Uehara M, Zhang XZ, Sun W, 2022. Representation learning for online and offline RL in low-rank MDPs. https://arxiv.org/abs/2110.04652v1
[115]van den Oord A, Li YZ, Vinyals O, 2019. Representation learning with contrastive predictive coding. https://arxiv.org/abs/1807.03748v1
[116]Vaswani A, Shazeer N, Parmar N, et al., 2017. Attention is all you need. https://arxiv.org/abs/1706.03762
[117]Viazovska MS, 2017. The sphere packing problem in dimension 8. Ann Math, 185(3):991-1015.
[118]Vidal R, 2022. Attention: Self-Expression Is All You Need. https://openreview.net/forum?id=MmujBClawFo
[119]Vidal R, Ma Y, Sastry SS, 2016. Generalized Principal Component Analysis. Springer-Verlag, New York, NY, USA.
[120]Vinyals O, Babuschkin I, Czarnecki WM, et al., 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354.
[121]von Neumann J, Morgenstern O, 1944. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, USA.
[122]Wang TR, Buchanan S, Gilboa D, et al., 2021. Deep networks provably classify data on curves. https://arxiv.org/abs/2107.14324
[123]Wiatowski T, Bölcskei H, 2018. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Trans Inform Theory, 64(3):1845-1866.
[124]Wiener N, 1948. Cybernetics. MIT Press, Cambridge, MA, USA.
[125]Wiener N, 1961. Cybernetics (2nd Ed.). MIT Press, Cambridge, MA, USA.
[126]Wisdom S, Powers T, Pitton J, et al., 2017. Building recurrent networks by unfolding iterative thresholding for sequential sparse recovery. IEEE Int Conf on Acoustics, Speech and Signal Processing, p.4346-4350.
[127]Wood E, Baltrušaitis T, Hewitt C, et al., 2021. Fake it till you make it: face analysis in the wild using synthetic data alone. IEEE/CVF Int Conf on Computer Vision, p.3661-3671.
[128]Wright J, Ma Y, 2022. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge University Press, Cambridge, UK.
[129]Wright J, Tao Y, Lin ZY, et al., 2007. Classification via minimum incremental coding length (MICL). Proc 20th Int Conf on Neural Information Processing Systems, p.1633-1640.
[130]Xie SN, Girshick R, Dollár P, et al., 2017. Aggregated residual transformations for deep neural networks. IEEE Conf on Computer Vision and Pattern Recognition, p.5987-5995.
[131]Yang ZT, Yu YD, You C, et al., 2020. Rethinking bias-variance trade-off for generalization of neural networks. Proc 37th Int Conf on Machine Learning, p.10767-10777.
[132]Yildirim I, Belledonne M, Freiwald W, et al., 2020. Efficient inverse graphics in biological face processing. Sci Adv, 6(10):eaax5979.
[133]Yu A, Fridovich-Keil S, Tancik M, et al., 2021. Plenoxels: radiance fields without neural networks. https://arxiv.org/abs/2112.05131
[134]Yu YD, Chan KHR, You C, et al., 2020. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Proc 34th Int Conf on Neural Information Processing Systems, p.9422-9434.
[135]Zeiler MD, Fergus R, 2014. Visualizing and understanding convolutional networks. Proc 13th European Conf on Computer Vision, p.818-833.
[136]Zhai YX, Yang ZT, Liao ZY, et al., 2020. Complete dictionary learning via l4-norm maximization over the orthogonal group. J Mach Learn Res, 21(1):6622-6689.
[137]Zhu JY, Park T, Isola P, et al., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE Int Conf on Computer Vision, p.2242-2251.
[138]Zoph B, Le QV, 2017. Neural architecture search with reinforcement learning. https://arxiv.org/abs/1611.01578