Is a convolutional neural network (CNN) a special case of a multilayer perceptron (MLP)? And why not use an MLP for everything?

If convolution can be expressed as matrix multiplication (example), can we say that a convolutional neural network (CNN) is a special case of a multilayer perceptron (MLP)?



If yes, why don't people use a big enough MLP for everything and let the computer learn to use convolution by itself?










      machine-learning neural-networks conv-neural-network














      edited Aug 26 at 6:45

























      asked Aug 26 at 6:37









      hxd1011





















          1 Answer













A convolution can be expressed as a matrix multiplication, but the same matrix is multiplied with a patch around every position in the image separately. You go to position (1, 1), extract a patch, and multiply it with the weights of a single MLP layer. Then you do the same thing at position (1, 2), and so forth. Because the same weights are reused at every position, there are fewer degrees of freedom than when applying an MLP to the whole image directly. Most people regard an MLP as a special case of a convolution where the spatial dimensions are 1x1.
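The patch-wise view described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a single-channel image, one filter, stride 1 and no padding. The helper names `extract_patch`, `conv2d_direct` and `conv2d_as_matmul` are made up for this sketch and do not come from any library. As in most deep learning frameworks, the "convolution" here is really a cross-correlation (the kernel is not flipped).

```python
def extract_patch(image, i, j, k):
    """Return the k x k patch with top-left corner (i, j), flattened row by row."""
    return [image[i + di][j + dj] for di in range(k) for dj in range(k)]


def conv2d_direct(image, kernel):
    """Sliding-window cross-correlation (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv2d_as_matmul(image, kernel):
    """Same result, written as (matrix of patches) times (flattened kernel)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    # The k*k weights are shared by every output position.
    weights = [w for row in kernel for w in row]
    # One row per output position, each row is one flattened patch.
    patches = [extract_patch(image, i, j, k)
               for i in range(out_h) for j in range(out_w)]
    flat = [sum(p * w for p, w in zip(patch, weights)) for patch in patches]
    # Reshape the flat result back to out_h x out_w.
    return [flat[r * out_w:(r + 1) * out_w] for r in range(out_h)]


# Toy data (assumed values, chosen only for illustration).
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

assert conv2d_direct(image, kernel) == conv2d_as_matmul(image, kernel)
```

Only the `k * k` kernel weights are learned here, no matter how large the image is, which is where the reduction in degrees of freedom comes from.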



          Edit Start



Regarding the MLP as a special case of a CNN: some commenters do not share this opinion. Yann LeCun, who can be counted as one of the inventors of CNNs, made a similar point before on Facebook: https://www.facebook.com/yann.lecun/posts/10152820758292143



He said that in CNNs there is no such thing as a "fully connected" layer; there is only a layer with 1x1 spatial extent and a kernel with 1x1 spatial extent. If one can "convert" FC layers, which are the individual layers of an MLP, into convolutional layers, then one can also convert an entire MLP into a CNN by interpreting the input as a vector with only channel dimensions.



An example: If I have an image of size $H\times W\times C$ ($C$ channels) and I apply a single layer of an MLP to it, then I will transform the input into a vector $x$ of size $V=HWC$. I will then apply a matrix $W\in\mathbb{R}^{U\times V}$ to it, thereby creating $U$ hidden activations. I could interpret the input vector $x$ as an image with only one pixel but $V$ "channels": $x\in\mathbb{R}^{1\times 1\times V}$ and the weight matrix as a kernel with only one pixel area but $U$ filters taking in $V$ channels each: $W\in\mathbb{R}^{U\times 1\times 1\times V}$. I can then call some Conv2D function that carries out the operation and computes exactly the same as the MLP.
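The equivalence in this example can be checked numerically. The sketch below is plain Python with a hand-written `conv2d` (a hypothetical helper, not a library function) that takes an input of shape H x W x C and kernels of shape U x kh x kw x C, with stride 1, no padding, and no bias or nonlinearity. The sizes and weight values are arbitrary toy choices.

```python
def dense(x, weight):
    """Fully connected layer without bias: weight is U x V, x has length V."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def conv2d(image, kernels):
    """Multi-channel cross-correlation, stride 1, no padding.

    image:   H x W x C nested lists
    kernels: U x kh x kw x C nested lists (U filters)
    returns: out_h x out_w x U nested lists
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            [
                sum(image[i + di][j + dj][c] * kern[di][dj][c]
                    for di in range(kh)
                    for dj in range(kw)
                    for c in range(len(kern[di][dj])))
                for kern in kernels
            ]
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy image with H=2, W=2, C=3, so V = H*W*C = 12 (assumed sizes).
height, width, channels, units = 2, 2, 3, 4
image = [[[h * 6 + w * 3 + c for c in range(channels)]
          for w in range(width)]
         for h in range(height)]

# MLP view: flatten the image into a vector x of length V, apply a U x V matrix.
x = [value for row in image for pixel in row for value in pixel]
V = len(x)
weight = [[(u + 1) * (v % 5) - u for v in range(V)] for u in range(units)]
mlp_out = dense(x, weight)

# CNN view: a single "pixel" with V channels and U filters of spatial size 1 x 1.
x_as_image = [[x]]                                # shape 1 x 1 x V
weight_as_kernels = [[[row]] for row in weight]   # shape U x 1 x 1 x V
conv_out = conv2d(x_as_image, weight_as_kernels)  # shape 1 x 1 x U

assert conv_out[0][0] == mlp_out
```

The same numbers come out of both views; the only difference is how the input vector and the weight matrix are reshaped before the call.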



          Edit End




If yes, why don't people use a big enough MLP for everything and let the computer learn to use convolution by itself?




That is a nice idea (and probably worth researching), but it's simply not practical:

1. The MLP has too many degrees of freedom, so it is likely to overfit.

2. In addition to learning the weights, you would have to learn their dependency structure, that is, which weights should be zero and which should be shared.

As most deep learning research is closely related to NLP, speech processing, or computer vision, people are eager to solve their problems and perhaps less eager to investigate how a function space more general than a CNN could constrain itself to that particular function space. Though IMHO it's certainly interesting to think about.
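To make the two practical obstacles listed above concrete, the sketch below writes a small convolution as the single dense weight matrix that a fully connected layer would need in order to reproduce it. The setup is an assumption made for illustration: a single-channel 6x6 input, one 3x3 filter, stride 1, no padding. `conv_as_dense_matrix` is a hypothetical helper, not a library function.

```python
def conv_as_dense_matrix(kernel, in_h, in_w):
    """Build the (out_h*out_w) x (in_h*in_w) matrix M such that
    M applied to the flattened image equals the flattened convolution output."""
    k = len(kernel)
    out_h, out_w = in_h - k + 1, in_w - k + 1
    matrix = [[0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j                      # index of this output position
            for di in range(k):
                for dj in range(k):
                    col = (i + di) * in_w + (j + dj)  # index of the input pixel it sees
                    matrix[row][col] = kernel[di][dj]
    return matrix


# Nine distinct values so that weight sharing is easy to spot in the matrix.
kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
in_h = in_w = 6
matrix = conv_as_dense_matrix(kernel, in_h, in_w)

# Weights a dense layer of this shape would have to learn.
total_entries = len(matrix) * len(matrix[0])
# Entries that are actually used: local connections only.
nonzero_entries = sum(1 for row in matrix for w in row if w != 0)
# Distinct values among them: the shared kernel weights.
free_parameters = len({w for row in matrix for w in row if w != 0})

print(total_entries, nonzero_entries, free_parameters)  # prints: 576 144 9
```

A dense layer of the same shape would have to discover from data both that 432 of its 576 weights should be zero and that the remaining 144 are copies of only 9 values. That is the dependency structure mentioned in point 2, and the gap between 576 and 9 is the extra freedom mentioned in point 1.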




























          • I am not sure if that is the case ("people regard an MLP as a special case of a convolution"). Another equally valid way of looking at it is that a CNN is a special case of an MLP where only local connections have a weight different from zero, and where the weights of local connections are shared. That is definitely how I was introduced to the concept of CNNs after learning about fully connected networks.
            – Neil Slater
            Aug 26 at 11:50










          • @NeilSlater I've edited the post and tried to clarify my view on that matter; I hope it's now easier to understand. If you could take the time to share your point of view, that would be very much appreciated, as I think the question is quite interesting. Thank you!
            – jmaxx
            Aug 26 at 15:56










          • Yann LeCun's comment also leads to this Q&A on Stack Exchange: datascience.stackexchange.com/questions/12830/…
            – Neil Slater
            Aug 26 at 16:05










          • @NeilSlater Thanks for the link, I hadn't seen that there is an answer on another Stack Exchange site. As far as I can tell, that answer gives a similar perspective as mine in that it shows how any MLP can be computed as a $1\times 1$ convolution, i.e. those two are equivalent.
            – jmaxx
            Aug 26 at 16:09


















          edited Aug 26 at 15:54

























          answered Aug 26 at 8:19









          jmaxx
