Is a convolutional neural network (CNN) a special case of a multilayer perceptron (MLP)? And why not use an MLP for everything?
If convolution can be expressed as matrix multiplication (example), can we say that a convolutional neural network (CNN) is a special case of a multilayer perceptron (MLP)?
If so, why don't people just use a sufficiently large MLP for everything and let the computer learn the convolution by itself?
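For concreteness, here is a minimal NumPy sketch of that premise (my own illustration, not the linked example; it uses CNN-style convolution, i.e. cross-correlation without kernel flipping): the sliding-window operation over a 1D signal is a single matrix–vector product with a banded matrix built from the kernel.

```python
import numpy as np

x = np.arange(8, dtype=float)        # input signal
k = np.array([1.0, -2.0, 0.5])       # kernel of length K
N, K = len(x), len(k)

# Banded matrix: row i holds the kernel shifted to position i
T = np.zeros((N - K + 1, N))
for i in range(N - K + 1):
    T[i, i:i + K] = k

# The matrix-vector product equals the sliding-window "convolution"
print(np.allclose(T @ x, np.correlate(x, k, mode="valid")))   # True
```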
Tags: machine-learning, neural-networks, conv-neural-network
asked Aug 26 at 6:37 by hxd1011 (edited Aug 26 at 6:45)
1 Answer
A convolution can be expressed as a matrix multiplication, but the matrix is multiplied with a patch around every position in the image separately. So you go to position (1, 1), extract a patch, and apply a small MLP (the same one at every position) to it. Then you do the same thing at position (1, 2), and so forth. So obviously there are fewer degrees of freedom than when applying an MLP directly. Most people regard an MLP as a special case of a convolution where the spatial dimensions are 1x1.
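To make the per-patch view concrete, here is a small NumPy sketch (an illustration I am adding for clarity, again using CNN-style convolution, i.e. cross-correlation): the same small weight vector is applied to the flattened patch at every position, which is the usual im2col formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))            # toy single-channel "image"
K = 3
w = rng.standard_normal(K * K)               # the same 9 weights used everywhere

out_h, out_w = img.shape[0] - K + 1, img.shape[1] - K + 1

# im2col: one row per spatial position, each row a flattened K x K patch
patches = np.array([img[i:i + K, j:j + K].ravel()
                    for i in range(out_h) for j in range(out_w)])

out = (patches @ w).reshape(out_h, out_w)    # one matrix product does the whole convolution

# Reference: explicit sliding-window loop
ref = np.array([[np.sum(img[i:i + K, j:j + K] * w.reshape(K, K))
                 for j in range(out_w)] for i in range(out_h)])

print(np.allclose(out, ref))   # True: the same small weight vector is reused at every position
```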
Edit Start
Regarding MLPs as a special case of CNNs: some comments do not share this opinion. Yann LeCun, who can be counted as one of the inventors of CNNs, made a similar comment on Facebook: https://www.facebook.com/yann.lecun/posts/10152820758292143
He said that in CNNs there is no such thing as a "fully connected" layer; there are only layers with 1x1 spatial extent and kernels with 1x1 spatial extent. If one can "convert" FC layers, which are the single layers of an MLP, into convolutional layers, then one can obviously also convert an entire MLP into a CNN by interpreting the input as a vector with only a channel dimension.
An example: If I have an image of size $H \times W \times C$ ($C$ channels) and I apply a single layer of an MLP to it, then I will transform the input into a vector $x$ of size $V = HWC$. I will then apply a matrix $W \in \mathbb{R}^{U \times V}$ to it, thereby creating $U$ hidden activations. I could interpret the input vector $x$ as an image with only one pixel but $V$ "channels", $x \in \mathbb{R}^{1 \times 1 \times V}$, and the weight matrix as a kernel with only one pixel of area but $U$ filters taking in $V$ channels each, $W \in \mathbb{R}^{U \times 1 \times 1 \times V}$. I can then call some Conv2D function that carries out the operation and computes exactly the same result as the MLP.
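A minimal sketch of this equivalence, assuming PyTorch (the answer only says "some Conv2D function", so the framework is my assumption): a fully connected layer and a 1x1 convolution with the same weights produce identical outputs when the input vector is viewed as a 1x1 image with $V$ channels.

```python
import torch
import torch.nn as nn

H, W, C, U = 4, 5, 3, 7
V = H * W * C

x = torch.randn(1, V)                         # one flattened input of size V = H*W*C

fc = nn.Linear(V, U, bias=False)              # the MLP layer: W in R^{U x V}
conv = nn.Conv2d(V, U, kernel_size=1, bias=False)
conv.weight.data.copy_(fc.weight.data.view(U, V, 1, 1))   # reuse the same weights

y_fc = fc(x)                                  # shape (1, U)
y_conv = conv(x.view(1, V, 1, 1))             # shape (1, U, 1, 1)

print(torch.allclose(y_fc, y_conv.view(1, U), atol=1e-6))  # True: same computation
```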
Edit End
If so, why don't people just use a sufficiently large MLP for everything and let the computer learn the convolution by itself?
That is a nice idea (and probably worth doing research on), but it's simply not practical:
- The MLP has too many degrees of freedom; it's likely to overfit (see the rough parameter count below).
- In addition to learning the weights, you would have to learn their dependency structure.
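A rough back-of-the-envelope illustration of the first point (the layer sizes here are hypothetical, not taken from the answer):

```python
# One dense layer vs. one conv layer on a 224x224 RGB image (sizes are assumptions)
H, W, C = 224, 224, 3          # input image
U = 4096                       # hidden units in the MLP layer (assumed)
K, F = 3, 64                   # 3x3 kernels, 64 filters in the conv layer (assumed)

mlp_params = H * W * C * U     # ~6.2e8 weights for a single fully connected layer
conv_params = K * K * C * F    # 1,728 weights, shared across all positions

print(f"dense layer: {mlp_params:,} weights")
print(f"conv layer : {conv_params:,} weights")
```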
Since most deep-learning research is closely tied to NLP, speech processing, and computer vision, people are eager to solve their problems and perhaps less eager to investigate how a function space more general than that of a CNN could constrain itself to that particular function space. Though, in my opinion, it's certainly interesting to think about.

answered Aug 26 at 8:19 by jmaxx (edited Aug 26 at 15:54)
I am not sure that is the case ("people regard an MLP as a special case of a convolution"). Another equally valid way of looking at it is that a CNN is a special case of an MLP where only local connections have a weight different from zero, and the weights of local connections are shared. That is certainly how I was introduced to the concept of CNNs after learning about fully-connected networks.
– Neil Slater, Aug 26 at 11:50
@NeilSlater I've edited the post and tried to clarify my view on that matter; I hope it's now easier to understand. If you could take the time to share your point of view, that would be very much appreciated, as I think the question is quite interesting. Thank you!
– jmaxx, Aug 26 at 15:56
Yann LeCun's comment also leads to this Q&A on Stack Exchange: datascience.stackexchange.com/questions/12830/…
– Neil Slater, Aug 26 at 16:05
@NeilSlater Thanks for the link; I hadn't seen that there is an answer on another Stack Exchange site. As far as I can tell, that answer gives a similar perspective to mine, in that it shows how any MLP can be computed as a $1 \times 1$ convolution, i.e. the two are equivalent.
– jmaxx, Aug 26 at 16:09