libseccomp: incorrect compilation of arithmetic comparisons

When libseccomp compiles filters for 64-bit systems, it needs to split 64-bit comparisons into 32-bit comparisons because classic BPF can't operate on 64-bit values directly. libseccomp offers both bitwise comparisons (NE, EQ, MASKED_EQ) and arithmetic comparisons (LT, LE, GE, GT). Bitwise comparisons can always be implemented with no more than two comparisons; but that doesn't work for arithmetic comparisons.

Consider the case where a filter attempts to check whether args[0]<0x123456789abc. The cases are:

args[0].high < 0x1234: matches
args[0].high > 0x1234: no match
args[0].high == 0x1234 && args[0].low < 0x56789abc: matches
args[0].high == 0x1234 && args[0].low >= 0x56789abc: no match

So in pseudocode, you'd want something like the following:

if args[0].high < 0x1234 return ACCEPT
if args[0].high > 0x1234 return REJECT
if args[0].low < 0x56789abc return ACCEPT
return REJECT

But actually, when libseccomp is invoked as follows:

  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
  if (ctx == NULL)
    err(1, "seccomp_init");
  if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADSLT), SCMP_SYS(mincore), 1,
                       SCMP_A0(SCMP_CMP_LT, 0x123456789abcUL)))
    err(1, "seccomp_rule_add");
  if (seccomp_load(ctx))
    err(1, "seccomp_load");

it generates the following seccomp filter:

# ./seccomp_dump 96148 simple
===== filter 0 (13 instructions) =====
0001 if arch != X86_64: [true +10, false +0] -> ret KILL
0003 if nr < 0x40000000: [true +1, false +0]
0005 if nr != 0x0000001b: [true +5, false +0] -> ret ALLOW (syscalls: )
0007 if args[0].high < 0x00001234: [true +2, false +0] -> ret ERRNO
0009 if args[0].low >= 0x56789abc: [true +1, false +0] -> ret ALLOW (syscalls: mincore)
000a ret ERRNO
[...]

As you can see, the case of `args[0].high > 0x1234 && args[0].low < 0x56789abc` is handled incorrectly.
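
To make the difference concrete, here is a minimal C sketch that models both the pseudocode above and the filter from the dump as plain functions. The names hi32, lo32, lt64_expected, lt64_as_generated and check_demo_addresses are hypothetical helpers for illustration only; none of this is libseccomp code, and the "as generated" function is just my reading of the dumped instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Classic BPF only sees 32-bit halves of each 64-bit syscall argument. */
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t lo32(uint64_t v) { return (uint32_t)v; }

/* Intended semantics of "arg < limit", using only 32-bit comparisons. */
bool lt64_expected(uint64_t arg, uint64_t limit)
{
    if (hi32(arg) < hi32(limit))
        return true;
    if (hi32(arg) > hi32(limit))
        return false;
    return lo32(arg) < lo32(limit);
}

/* Model of the dumped filter: one high-word check, then only the low word. */
bool lt64_as_generated(uint64_t arg, uint64_t limit)
{
    if (hi32(arg) < hi32(limit))
        return true;                    /* 0007: ret ERRNO (rule matches) */
    if (lo32(arg) >= lo32(limit))
        return false;                   /* 0009: ret ALLOW (no match) */
    return true;                        /* 000a: ret ERRNO (rule matches) */
}

/* Two sample addresses: one where both agree, one where they diverge. */
void check_demo_addresses(void)
{
    const uint64_t limit = 0x123456789abcULL;

    assert(lt64_expected(0x123450000000ULL, limit));
    assert(lt64_as_generated(0x123450000000ULL, limit));

    /* high word above the limit, low word below it: the models disagree */
    assert(!lt64_expected(0x123500000000ULL, limit));
    assert(lt64_as_generated(0x123500000000ULL, limit));
}
```

The two functions differ only in the missing "high > limit.high" early exit, which is exactly the case called out above.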

Here's a demo, tested with libseccomp from git master:

===========================================
jannh@jannh2:~/tests/libseccomp-stuff$ cat compare.c
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <err.h>
#include <sys/mman.h>
#include <seccomp.h>

// any mincore() starting below this address should be denied with -EBADSLT
#define ADDR_LIMIT 0x123456789abcUL

static void sctest(unsigned long addr) {
  unsigned char vec;
  printf("mincore(0x%012lx, 0) = ", addr);
  int res = mincore((void*)addr, 0, &vec);
  if (res == 0) {
    printf(" 0\n");
  } else {
    printf("-%d (%m)\n", errno);
  }
}

int main(int argc, char **argv) {
  setbuf(stdout, NULL);
  printf("my pid is %d\n", (int)getpid());

  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
  if (ctx == NULL)
    err(1, "seccomp_init");
  if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADSLT), SCMP_SYS(mincore), 1,
                       SCMP_A0(SCMP_CMP_LT, ADDR_LIMIT)))
    err(1, "seccomp_rule_add");
  if (seccomp_load(ctx))
    err(1, "seccomp_load");

  sctest(0);
  sctest(0x123000000000);
  sctest(0x1230f0000000);
  sctest(0x123400000000);
  sctest(0x123450000000);
  sctest(0x123460000000);
  sctest(0x1234f0000000);
  sctest(0x123500000000);
  sctest(0x1235f0000000);
  sctest(0x123600000000);

  while (1) pause();
}
jannh@jannh2:~/tests/libseccomp-stuff$ gcc -o compare compare.c -Wall -I/h/git/foreign/libseccomp/include/ -L/h/git/foreign/libseccomp/src/.libs -lseccomp -Wl,-rpath /h/git/foreign/libseccomp/src/.libs/
jannh@jannh2:~/tests/libseccomp-stuff$ ./compare
my pid is 104373
mincore(0x000000000000, 0) = -57 (Invalid slot)
mincore(0x123000000000, 0) = -57 (Invalid slot)
mincore(0x1230f0000000, 0) = -57 (Invalid slot)
mincore(0x123400000000, 0) = -57 (Invalid slot)
mincore(0x123450000000, 0) = -57 (Invalid slot)
mincore(0x123460000000, 0) = 0
mincore(0x1234f0000000, 0) = 0
mincore(0x123500000000, 0) = -57 (Invalid slot)
mincore(0x1235f0000000, 0) = 0
mincore(0x123600000000, 0) = -57 (Invalid slot)
===========================================

This probably isn't terribly interesting for most users of libseccomp, but the Tor daemon
(https://gitweb.torproject.org/tor.git/tree/src/lib/sandbox/sandbox.c) does use arithmetic comparisons to prevent writes to a certain memory region:

===========================================
  /*
   * Allow mprotect with PROT_READ|PROT_WRITE because openssl uses it, but
   * never over the memory region used by the protected strings.
   *
   * PROT_READ|PROT_WRITE was originally fully allowed in sb_mprotect(), but
   * had to be removed due to limitation of libseccomp regarding intervals.
   *
   * There is a restriction on how much you can mprotect with R|W up to the
   * size of the canary.
   */
  ret = seccomp_rule_add_3(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect),
      SCMP_CMP(0, SCMP_CMP_LT, (intptr_t) pr_mem_base),
      SCMP_CMP(1, SCMP_CMP_LE, MALLOC_MP_LIM),
      SCMP_CMP(2, SCMP_CMP_EQ, PROT_READ|PROT_WRITE));
[...]
  ret = seccomp_rule_add_3(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect),
      SCMP_CMP(0, SCMP_CMP_GT, (intptr_t) pr_mem_base + pr_mem_size +
          MALLOC_MP_LIM),
      SCMP_CMP(1, SCMP_CMP_LE, MALLOC_MP_LIM),
      SCMP_CMP(2, SCMP_CMP_EQ, PROT_READ|PROT_WRITE));
[...]
===========================================

systemd also has some code that uses arithmetic comparisons in https://github.com/systemd/systemd/blob/master/src/shared/seccomp-util.c , specifically for two purposes:

 - If you whitelist a range of address families for socket() using RestrictAddressFamilies, anything outside that range gets blocked with SCMP_CMP_LT/SCMP_CMP_GT.
 - If you restrict the use of scheduling classes, anything above the permitted class is blocked via SCMP_CMP_GT.

(Both of these, by the way, are for syscalls that silently discard the upper 32 bits of their arguments.)

The start of the second seccomp filter generated for a systemd unit with "RestrictAddressFamilies=AF_INET AF_INET6" is:

===== filter 1 (57 instructions) =====
0001 if arch != X86_64: [true +54, false +0] -> ret ALLOW (syscalls: )
0003 if nr < 0x40000000: [true +1, false +0]
0005 if nr != 0x00000029: [true +50, false +0] -> ret ALLOW (syscalls: )
0007 if args[0].high != 0x00000000: [true +42, false +0]
0033 if args[0].high < 0x00000000: [true +3, false +0] -> ret ERRNO
0035 if args[0].low > 0x0000000a: [true +1, false +0] -> ret ERRNO
0036 if args[0].low >= 0x00000002: [true +1, false +0] -> ret ALLOW (syscalls: socket)
0037 ret ERRNO

So this filter will e.g. permit socket() calls whose family argument is in the range from 0x100000002 to 0x10000000a (and the kernel will ignore the upper 32 bits, meaning that in effect, this filter grants access to families like AF_AX25); but as far as I can tell, the other filter installed by systemd prevents this.

In the open-source users of libseccomp that I have been able to find on codesearch.debian.net, this issue doesn't seem to have significant impact; but someone might rely on this behavior, so I've decided to treat this as a security bug.
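
For the socket() case described above, the following C sketch models the branch of filter 1 that is taken when args[0].high is nonzero, together with the truncation to the low 32 bits. The function names filter1_nonzero_high_path and family_seen_by_kernel are hypothetical, the model is only my reading of instructions 0033 to 0037 in the dump, and the truncation helper is a simplification of how the kernel treats the int-typed family argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Model of instructions 0033..0037 from the dump, i.e. the path taken when
 * args[0].high != 0. Returns true for ALLOW and false for ERRNO.
 * The caller is assumed to pass an argument with a nonzero high word.
 */
bool filter1_nonzero_high_path(uint64_t arg)
{
    uint32_t low = (uint32_t)arg;

    /* 0033: "high < 0" can never be true for an unsigned 32-bit word */
    if (low > 0x0a)
        return false;   /* 0035: ret ERRNO */
    if (low >= 0x02)
        return true;    /* 0036: ret ALLOW */
    return false;       /* 0037: ret ERRNO */
}

/* Simplified model: socket() takes an int family, so only the low half counts. */
uint32_t family_seen_by_kernel(uint64_t arg)
{
    return (uint32_t)arg;
}
```

With these two helpers, an argument of 0x100000003 passes the modeled filter path and is then seen as family 3, which is AF_AX25 on Linux.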