What code prevents mount namespace loops? In a more complex case involving mount propagation

Background information



The following commands return an error:



# touch /tmp/a
# mount --bind /proc/self/ns/mnt /tmp/a
mount: /tmp/a: wrong fs type, bad option, bad superblock on /proc/self/ns/mnt, missing codepage or helper program, or other error.


This is because the kernel code (see extracts below) prevents a simple mount namespace loop. The code comments explain why this is not allowed. The lifetime of a mount namespace is tracked by a simple reference count. If you have a loop where mount namespaces A and B each hold a reference to the other, then both A and B will always have at least one reference, and neither would ever be freed. The allocated memory would be lost until the system was rebooted.
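To make the leak concrete, here is a minimal user-space sketch of plain reference counting. The names toy_ns, toy_get, toy_pin and toy_put are hypothetical and only model the idea; they are not the kernel's data structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted namespace object. */
struct toy_ns {
    int refcount;           /* number of current holders */
    struct toy_ns *pinned;  /* another namespace this one keeps alive, or NULL */
    bool freed;             /* set once the last reference is dropped */
};

static void toy_get(struct toy_ns *ns)
{
    ns->refcount++;
}

/* Model "holder contains a bind mount of target's namespace file". */
static void toy_pin(struct toy_ns *holder, struct toy_ns *target)
{
    toy_get(target);
    holder->pinned = target;
}

/* Drop one reference; on the last one, free and release what we pinned. */
static void toy_put(struct toy_ns *ns)
{
    if (--ns->refcount > 0)
        return;
    ns->freed = true;
    if (ns->pinned)
        toy_put(ns->pinned);
}
```

With a one-way pin (A pins B), dropping the initial references frees both objects. With a two-way pin (A pins B and B pins A), each count stays at 1 after the initial references are dropped, so neither object is ever freed.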



For comparison, the kernel allows the following, which is not a loop:



# unshare -m
# echo $$
8456
# kill -STOP $$
[1]+ Stopped unshare -m

# touch /tmp/a
# mount --bind /proc/8456/ns/mnt /tmp/a
#
# umount /tmp/a # cleanup
#


Question



Where does the kernel code distinguish between the following two cases?



If I try to create a loop using mount propagation, it fails:



# mount --make-shared /tmp
# unshare -m --propagation shared
# echo $$
8456
# kill -STOP $$
[1]+ Stopped unshare -m

# mount --bind /proc/8456/ns/mnt /tmp/a
mount: /tmp/a: wrong fs type, bad option, bad superblock on /proc/8456/ns/mnt, missing codepage or helper program, or other error.


But if I remove the mount propagation, no loop is created, and it succeeds:



# unshare -m --propagation private
# echo $$
8456
# kill -STOP $$
[1]+ Stopped unshare -m

# mount --bind /proc/8456/ns/mnt /tmp/a
#
# umount /tmp/a # cleanup


Kernel code which handles the simpler case



https://elixir.bootlin.com/linux/v4.18/source/fs/namespace.c



static bool mnt_ns_loop(struct dentry *dentry)
{
    /* Could bind mounting the mount namespace inode cause a
     * mount namespace loop?
     */
    struct mnt_namespace *mnt_ns;
    if (!is_mnt_ns_file(dentry))
        return false;

    mnt_ns = to_mnt_ns(get_proc_ns(dentry->d_inode));
    return current->nsproxy->mnt_ns->seq >= mnt_ns->seq;
}



...



    err = -EINVAL;
    if (mnt_ns_loop(old_path.dentry))
        goto out;


...



/*
 * Assign a sequence number so we can detect when we attempt to bind
 * mount a reference to an older mount namespace into the current
 * mount namespace, preventing reference counting loops. A 64bit
 * number incrementing at 10Ghz will take 12,427 years to wrap which
 * is effectively never, so we can ignore the possibility.
 */
static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1);

static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)
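The ordering rule in the extracts above can be modelled in user space. This is only a sketch: toy_mnt_ns, toy_alloc_ns and toy_would_loop are hypothetical names, and a plain counter stands in for the kernel's atomic64_t. It shows that a namespace may only hold a reference to a strictly newer one, so references can never form a cycle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a mount namespace: only the sequence number. */
struct toy_mnt_ns {
    uint64_t seq;
};

/* Stand-in for the global counter; starts at 1 like mnt_ns_seq. */
static uint64_t toy_seq_counter = 1;

/* Each newly created namespace gets the next, larger sequence number. */
static struct toy_mnt_ns toy_alloc_ns(void)
{
    struct toy_mnt_ns ns = { .seq = ++toy_seq_counter };
    return ns;
}

/*
 * Same comparison as the last line of mnt_ns_loop(): binding a reference
 * to `target` into `current_ns` is refused unless target is strictly newer.
 */
static bool toy_would_loop(const struct toy_mnt_ns *current_ns,
                           const struct toy_mnt_ns *target)
{
    return current_ns->seq >= target->seq;
}
```

In this model, a namespace binding its own file is refused (equal seq), a parent binding a child's file is allowed, and a child binding the parent's file is refused. Note that this comparison gives the same answer whether or not mount propagation is involved, so it cannot be what separates the two cases in the question.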









  • It distinguishes them in the last line of mnt_ns_loop(): return current->nsproxy->mnt_ns->seq >= mnt_ns->seq; Newer mnt_ns objects have a seq greater than that of older objects. Or is it something else you're unclear about?
    – mosvy
    yesterday










  • @mosvy I don't understand. Why would that make a difference between my two cases with and without mount propagation? The relative order of the sequence numbers should be the same in both cases.
    – sourcejedi
    yesterday














mount linux-kernel namespace






asked 2 days ago, edited yesterday, by sourcejedi

1 Answer
It is because propagate_one() calls copy_tree() without CL_COPY_MNT_NS_FILE. In this case, if the root of the tree being copied is a mount of an NS file, copy_tree() fails with the error EINVAL. Here "an NS file" means a mount namespace file, i.e. one of the files /proc/*/ns/mnt.



Reading further, I notice that if the tree root is not an NS file, but one of the child mounts is, that child mount is excluded from propagation (in the same way as an unbindable mount is).



Kernel code



https://elixir.bootlin.com/linux/v4.18/source/fs/pnode.c#L226



static int propagate_one(struct mount *m)
{
    ...
    /* Notice when we are propagating across user namespaces */
    if (m->mnt_ns->user_ns != user_ns)
        type |= CL_UNPRIVILEGED;
    child = copy_tree(last_source, last_source->mnt.mnt_root, type);
    if (IS_ERR(child))
        return PTR_ERR(child);


https://elixir.bootlin.com/linux/v4.18/source/fs/namespace.c#L1790



struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
                        int flag)
{
    struct mount *res, *p, *q, *r, *parent;

    if (!(flag & CL_COPY_UNBINDABLE) && IS_MNT_UNBINDABLE(mnt))
        return ERR_PTR(-EINVAL);

    if (!(flag & CL_COPY_MNT_NS_FILE) && is_mnt_ns_file(dentry))
        return ERR_PTR(-EINVAL);
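The two behaviours described above (the root is refused, a child is silently skipped) can be sketched as a small user-space model. Everything prefixed toy_ or TOY_ is hypothetical, and the flag values are illustrative only; the real definitions are in the kernel source.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values (assumption); see the kernel headers for the real ones. */
#define TOY_CL_COPY_UNBINDABLE  0x04
#define TOY_CL_COPY_MNT_NS_FILE 0x80

/* Hypothetical model of a mount: only the two properties the checks look at. */
struct toy_mount {
    bool unbindable;  /* models IS_MNT_UNBINDABLE() */
    bool ns_file;     /* models is_mnt_ns_file() on the mount's root */
};

/* Root check modelled on the start of copy_tree(): 0 if allowed, -EINVAL if refused. */
static int toy_copy_tree_root_check(const struct toy_mount *root, int flag)
{
    if (!(flag & TOY_CL_COPY_UNBINDABLE) && root->unbindable)
        return -EINVAL;
    if (!(flag & TOY_CL_COPY_MNT_NS_FILE) && root->ns_file)
        return -EINVAL;
    return 0;
}

/* Count the child mounts that would be copied; excluded children are skipped, not errors. */
static size_t toy_count_copied_children(const struct toy_mount *children,
                                        size_t n, int flag)
{
    size_t copied = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(flag & TOY_CL_COPY_UNBINDABLE) && children[i].unbindable)
            continue;
        if (!(flag & TOY_CL_COPY_MNT_NS_FILE) && children[i].ns_file)
            continue;
        copied++;
    }
    return copied;
}
```

Calling the root check with flag 0, as the propagation path effectively does for this bit, refuses a tree whose root is an NS file. A caller that passes TOY_CL_COPY_MNT_NS_FILE gets past the check. For children, the same condition only reduces the number of mounts copied.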


Example of an NS file being silently skipped during propagation



# mount --make-shared /tmp
# cd /tmp
# mkdir private_mnt
# mount --bind private_mnt private_mnt
# mount --make-private private_mnt
# touch private_mnt/child_ns
# unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt
lrwxrwxrwx. 1 root root 0 Oct 7 18:25 /proc/self/ns/mnt -> 'mnt:[4026532807]'
# findmnt | grep /tmp
├─/tmp tmpfs tmpfs ...
│ ├─/tmp/private_mnt tmpfs[/private_mnt] tmpfs ...
│ │ └─/tmp/private_mnt/child_ns nsfs[mnt:[4026532807]] nsfs ...


Let's create a normal mount for comparison:



# mkdir private_mnt/child_mnt
# mount --bind private_mnt/child_mnt private_mnt/child_mnt


Now try to propagate everything, by creating a recursive bind mount of private_mnt inside /tmp (/tmp is a shared mount):



# mkdir shared_mnt
# mount --rbind private_mnt shared_mnt
# findmnt | grep /tmp/shared_mnt
│ └─/tmp/shared_mnt tmpfs[/private_mnt] tmpfs ...
│ ├─/tmp/shared_mnt/child_ns nsfs[mnt:[4026532809]] nsfs ...
│ └─/tmp/shared_mnt/child_mnt tmpfs[/private_mnt/child_mnt] tmpfs ...
# nsenter --mount=/tmp/private_mnt/child_ns findmnt|grep /tmp/shared_mnt
│ └─/tmp/shared_mnt tmpfs[/private_mnt] tmpfs ...
│ └─/tmp/shared_mnt/child_mnt tmpfs[/private_mnt/child_mnt] tmpfs ...





answered yesterday, edited yesterday, by sourcejedi
