[RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

From:  Rajesh Venkatasubramanian <vrajesh@umich.edu>
To:  akpm@osdl.org
Subject:  [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix
Date:  Sun, 21 Mar 2004 17:10:45 -0500 (EST)
Cc:  torvalds@osdl.org, hugh@veritas.com, mbligh@aracnet.com, andrea@suse.de, riel@redhat.com, mingo@elte.hu, linux-kernel@vger.kernel.org, linux-mm@kvack.org, vrajesh@umich.edu


This patch is an attempt at fixing the much-discussed search complexity
issues in the objrmap design. The key idea is to replace the i_mmap{_shared}
lists with a new tree data structure.

The radix priority search tree (prio_tree), first proposed by Edward M.
McCreight, is used as the new data structure. A prio_tree is indexed by two
indices, which we call radix_index and heap_index. The tree is a simple binary
radix tree on the radix_index, with the additional heap-like property that a
parent node's heap_index is always greater than or equal to the heap_indices
of its children.
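
To make the two indices concrete: for a file-mapped vma the patch uses
radix_index = vm_pgoff and heap_index = vm_pgoff + (pages - 1); see the
RADIX_INDEX/HEAP_INDEX macros in the include/linux/mm.h hunk below. Here is a
minimal sketch of the heap half of the invariant, with the indices stored
inline for illustration (the kernel version derives them from the enclosing
vm_area_struct, and its empty child links point back at the node itself
rather than being NULL):

	struct pst_node {
		unsigned long radix_index;	/* interval start (file page) */
		unsigned long heap_index;	/* interval end (file page) */
		struct pst_node *left, *right;	/* NULL when absent */
	};

	/* Heap property: a parent's heap_index >= each child's heap_index. */
	static int pst_heap_ok(const struct pst_node *n)
	{
		if (!n)
			return 1;
		if (n->left && n->left->heap_index > n->heap_index)
			return 0;
		if (n->right && n->right->heap_index > n->heap_index)
			return 0;
		return pst_heap_ok(n->left) && pst_heap_ok(n->right);
	}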

An interesting property of the prio_tree is that it is well suited to storing
and querying intervals; for example, a file-mapped vm_area_struct can be
considered an interval of file pages. If we store all vmas that map a file in
a prio_tree, then we can execute a stabbing query, i.e., select the set of
vmas that map a single file page or a set of contiguous file pages, in
O(log n + m) time, where "log n" is the height of the tree (at most 64 on a
32-bit machine) and "m" is the number of vmas that map the page(s).
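
A sketch of what such a stabbing query looks like with the helpers this patch
adds, modeled on invalidate_mmap_range_list() in the mm/memory.c hunk below
(the function name here is made up for illustration; the caller must hold
mapping->i_shared_sem):

	static void for_each_mapping_vma(struct address_space *mapping,
					 unsigned long first, unsigned long last)
	{
		struct prio_tree_iter iter;
		struct vm_area_struct *vma;

		/* First vma mapping a page in [first, last], or NULL. */
		vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter,
					    first, last);
		while (vma) {
			/* ... operate on one vma that maps pages in range ... */
			vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared,
						   &iter, first, last);
		}
	}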

The test results below show that the prio_tree effectively solves the objrmap
i_mmap{_shared} search complexity problems. The tests were done on a PII
200 MHz machine with 256MB RAM using UP kernels.

This patch is against 2.6.5-rc2. It boots and works on both SMP and UP.
Further testing would help. If you prefer broken-out patches, please check:

http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/

Some Results:

Kernel compile - 2.6.2 defconfig - make
2.6.5-rc2	 3313.97 user 261.08 system 1:00:11 elapsed 98% CPU
rc2+prio+objrmap 3315.30 user 258.59 system 1:00:14 elapsed 98% CPU
rc2+objrmap	 3316.41 user 257.77 system 1:00:10 elapsed 98% CPU

rmap-test 1 - ./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
2.6.5-rc2	 67.57 user 277.14 system 0:13:12 elapsed 43% CPU
rc2+prio+objrmap 71.99 user 203.90 system 0:13:30 elapsed 34% CPU
rc2+objrmap    70.45 user 19834.38 system 7:28:04 elapsed 74% CPU
	-I killed the process after 7 hours. The system was responsive after
	killing the process. Based on previous results, the program
	should not lock up or take this long. Maybe it is due to this
	problem identified by Andrea:
	http://marc.theaimsgroup.com/?l=linux-kernel&m=107966438414248

	Andrea says the system may hang; in this case, however, the
	system did not hang.

rmap-test 2 - ./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
2.6.5-rc2	 0.58 user 212.50 system 0: 7:32 elapsed 47% CPU
rc2+prio+objrmap 0.63 user 101.77 system 0: 4:44 elapsed 36% CPU
rc2+objrmap	 0.60 user 605.97 system 0:14:35 elapsed 69% CPU


rmap-test 3 - ./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo
2.6.5-rc2	 1.07 user 31.08 system 0:16:06 elapsed 3% CPU
rc2+prio+objrmap 1.03 user 31.41 system 0:16:38 elapsed 3% CPU
rc2+objrmap    0.53 user 1588.40 system 2:25:27 elapsed 18% CPU
			-I killed the process after around 2 1/2 hours.
			 The system was responsive after killing the process.

test-mmap2                         (elapsed in H:M:Sec.)
2.6.5-rc2	 0.00 user 0.34 system 0:0:01.55 elapsed 22% CPU
rc2+prio+objrmap 0.00 user 0.35 system 0:0:01.49 elapsed 23% CPU
rc2+objrmap	 - didn't try - already known to lock the system


test-mmap3                         (elapsed in H:M:Sec.)
2.6.5-rc2	 0.06 user 3.38 system 0:0:14.62 elapsed 23% CPU
rc2+prio+objrmap 0.09 user 3.65 system 0:0:13.99 elapsed 26% CPU
rc2+objrmap	 - didn't try - known to lock the system


Lowlights of the patch:

  * Adds a new tree data structure (around 500 lines of code + bugs?)
  	- code seems reasonably stable. More testing will help a lot.

  * Breaks compilation of hugetlbfs, xfs, and a few archs.
  	- easily fixable.

  * Adds 2 extra pointers to vm_area_struct
  	- both of these pointers can be removed later.

	- Plan (see the sketch after this list):

	   * Shove vm_set_head into vm_private_data.

	   * I need a single bit, protected by i_shared_sem, to
	     mark prio_tree nodes. Once I am convinced that I can use
	     the least significant bit of vm_set_head for marking
	     nodes (vm_area_struct alignment?), I plan to remove the parent
	     field and use a per-CPU array in prio_tree_insert and
	     prio_tree_remove. In invalidate_mmap_range_list, we can try to
	     allocate an array from the slab allocator (this helps avoid high
	     latency in truncate); if that fails, we can fall back to the
	     per-CPU array.
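
For reference, the least-significant-bit trick mentioned in the plan could
look roughly like the sketch below, assuming vm_area_struct is at least
2-byte aligned so that bit 0 of a pointer to it is always clear. The helper
names are hypothetical and not part of this patch:

	#define VMA_TREE_MARK	1UL

	/* Tag a vma pointer to record "this vma is a prio_tree node". */
	static inline void *vma_mark_tree_node(struct vm_area_struct *vma)
	{
		return (void *)((unsigned long)vma | VMA_TREE_MARK);
	}

	static inline int vma_is_tree_node(void *ptr)
	{
		return (unsigned long)ptr & VMA_TREE_MARK;
	}

	/* Strip the tag before dereferencing the pointer. */
	static inline struct vm_area_struct *vma_untag(void *ptr)
	{
		return (struct vm_area_struct *)((unsigned long)ptr & ~VMA_TREE_MARK);
	}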

Useful Links:

[1] Andrew Morton's rmap-test.c
	http://marc.theaimsgroup.com/?l=linux-kernel&m=104954444204356

[2] Ingo's test-mmap2.c
	http://marc.theaimsgroup.com/?l=linux-kernel&m=107883030601436

[3] Ingo's test-mmap3.c
	http://marc.theaimsgroup.com/?l=linux-kernel&m=107886160312419



 fs/exec.c                 |    2
 fs/inode.c                |    4
 fs/locks.c                |    6
 fs/proc/task_mmu.c        |    2
 include/linux/fs.h        |    5
 include/linux/mm.h        |  168 +++++++++++++
 include/linux/prio_tree.h |   78 ++++++
 init/main.c               |    2
 kernel/fork.c             |    4
 mm/Makefile               |    3
 mm/filemap.c              |    2
 mm/memory.c               |   15 -
 mm/mmap.c                 |   66 +++--
 mm/mremap.c               |   21 +
 mm/prio_tree.c            |  574 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/shmem.c                |    2
 mm/swap_state.c           |    4
 mm/vmscan.c               |    4
 18 files changed, 910 insertions(+), 52 deletions(-)

diff -puN fs/exec.c~prio_tree_core fs/exec.c
--- mmlinux-2.6/fs/exec.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/exec.c	2004-03-21 16:25:34.000000000 -0500
@@ -430,7 +430,7 @@ int setup_arg_pages(struct linux_binprm
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
-		INIT_LIST_HEAD(&mpnt->shared);
+		INIT_VMA_SHARED(mpnt);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
 		mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
diff -puN fs/inode.c~prio_tree_core fs/inode.c
--- mmlinux-2.6/fs/inode.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/inode.c	2004-03-21 16:25:01.000000000 -0500
@@ -189,8 +189,8 @@ void inode_init_once(struct inode *inode
 	atomic_set(&inode->i_data.truncate_count, 0);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
-	INIT_LIST_HEAD(&inode->i_data.i_mmap);
-	INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
+	INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
+	INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
 	spin_lock_init(&inode->i_lock);
 	i_size_ordered_init(inode);
 }
diff -puN fs/locks.c~prio_tree_core fs/locks.c
--- mmlinux-2.6/fs/locks.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/locks.c	2004-03-21 16:25:01.000000000 -0500
@@ -1455,8 +1455,7 @@ int fcntl_setlk(struct file *filp, unsig
 	if (IS_MANDLOCK(inode) &&
 	    (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
 		struct address_space *mapping = filp->f_mapping;
-
-		if (!list_empty(&mapping->i_mmap_shared)) {
+		if (!prio_tree_empty(&mapping->i_mmap_shared)) {
 			error = -EAGAIN;
 			goto out;
 		}
@@ -1593,8 +1592,7 @@ int fcntl_setlk64(struct file *filp, uns
 	if (IS_MANDLOCK(inode) &&
 	    (inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
 		struct address_space *mapping = filp->f_mapping;
-
-		if (!list_empty(&mapping->i_mmap_shared)) {
+		if (!prio_tree_empty(&mapping->i_mmap_shared)) {
 			error = -EAGAIN;
 			goto out;
 		}
diff -puN fs/proc/task_mmu.c~prio_tree_core fs/proc/task_mmu.c
--- mmlinux-2.6/fs/proc/task_mmu.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/proc/task_mmu.c	2004-03-21 16:25:01.000000000 -0500
@@ -65,7 +65,7 @@ int task_statm(struct mm_struct *mm, int
 				*shared += pages;
 			continue;
 		}
-		if (vma->vm_flags & VM_SHARED || !list_empty(&vma->shared))
+		if (vma->vm_flags & VM_SHARED || !vma_shared_empty(vma))
 			*shared += pages;
 		if (vma->vm_flags & VM_EXECUTABLE)
 			*text += pages;
diff -puN include/linux/fs.h~prio_tree_core include/linux/fs.h
--- mmlinux-2.6/include/linux/fs.h~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/fs.h	2004-03-21 16:25:01.000000000 -0500
@@ -18,6 +18,7 @@
 #include <linux/stat.h>
 #include <linux/cache.h>
 #include <linux/radix-tree.h>
+#include <linux/prio_tree.h>
 #include <linux/kobject.h>
 #include <asm/atomic.h>

@@ -329,8 +330,8 @@ struct address_space {
 	struct list_head	io_pages;	/* being prepared for I/O */
 	unsigned long		nrpages;	/* number of total pages */
 	struct address_space_operations *a_ops;	/* methods */
-	struct list_head	i_mmap;		/* list of private mappings */
-	struct list_head	i_mmap_shared;	/* list of shared mappings */
+	struct prio_tree_root	i_mmap;		/* tree of private mappings */
+	struct prio_tree_root	i_mmap_shared;	/* tree of shared mappings */
 	struct semaphore	i_shared_sem;	/* protect both above lists */
 	atomic_t		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		dirtied_when;	/* jiffies of first page dirtying */
diff -puN include/linux/mm.h~prio_tree_core include/linux/mm.h
--- mmlinux-2.6/include/linux/mm.h~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/mm.h	2004-03-21 16:25:34.000000000 -0500
@@ -11,6 +11,7 @@
 #include <linux/list.h>
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
+#include <linux/prio_tree.h>
 #include <linux/fs.h>

 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
@@ -67,8 +68,29 @@ struct vm_area_struct {
 	 * one of the address_space->i_mmap{,shared} lists,
 	 * for shm areas, the list of attaches, otherwise unused.
 	 */
-	struct list_head shared;
+	union {
+		struct {
+			struct list_head list;
+			void *parent;
+		} vm_set;
+
+		struct prio_tree_node  prio_tree_node;
+
+		struct {
+			void *first;
+			void *second;
+			void *parent;
+		} both;
+	} shared;

+	/*
+	 * shared.vm_set : list of vmas that map exactly the same set of pages
+	 * vm_set_head   : head of the vm_set list
+	 *
+	 * TODO: try to shove the following field into vm_private_data ??
+	 */
+	struct vm_area_struct *vm_set_head;
+
 	/* Function pointers to deal with this struct. */
 	struct vm_operations_struct * vm_ops;

@@ -129,6 +151,150 @@ struct vm_area_struct {
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)

 /*
+ * The following macros are used for implementing prio_tree for i_mmap{_shared}
+ */
+
+#define	RADIX_INDEX(vma)  ((vma)->vm_pgoff)
+#define	VMA_SIZE(vma)	  (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT)
+/* avoid overflow */
+#define	HEAP_INDEX(vma)	  ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))
+
+#define GET_INDEX_VMA(vma, radix, heap)		\
+do {						\
+	radix = RADIX_INDEX(vma);		\
+	heap = HEAP_INDEX(vma);			\
+} while (0)
+
+#define GET_INDEX(node, radix, heap)		\
+do { 						\
+	struct vm_area_struct *__tmp = 		\
+	  prio_tree_entry(node, struct vm_area_struct, shared.prio_tree_node);\
+	GET_INDEX_VMA(__tmp, radix, heap); 	\
+} while (0)
+
+#define	INIT_VMA_SHARED_LIST(vma)			\
+do {							\
+	INIT_LIST_HEAD(&(vma)->shared.vm_set.list);	\
+	(vma)->shared.vm_set.parent = NULL;		\
+	(vma)->vm_set_head = NULL;			\
+} while (0)
+
+#define INIT_VMA_SHARED(vma)			\
+do {						\
+	(vma)->shared.both.first = NULL;	\
+	(vma)->shared.both.second = NULL;	\
+	(vma)->shared.both.parent = NULL;	\
+	(vma)->vm_set_head = NULL;		\
+} while (0)
+
+extern void __vma_prio_tree_insert(struct prio_tree_root *,
+	struct vm_area_struct *);
+
+extern void __vma_prio_tree_remove(struct prio_tree_root *,
+	struct vm_area_struct *);
+
+static inline int vma_shared_empty(struct vm_area_struct *vma)
+{
+	return vma->shared.both.first == NULL;
+}
+
+/*
+ * Helps to add a new vma that maps the same (identical) set of pages as the
+ * old vma to an i_mmap tree.
+ */
+static inline void __vma_prio_tree_add(struct vm_area_struct *vma,
+	struct vm_area_struct *old)
+{
+	INIT_VMA_SHARED_LIST(vma);
+
+	/* Leave these BUG_ONs till prio_tree patch stabilizes */
+	BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
+	BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
+
+	if (old->shared.both.parent) {
+		if (old->vm_set_head) {
+			list_add_tail(&vma->shared.vm_set.list,
+					&old->vm_set_head->shared.vm_set.list);
+			return;
+		}
+		else {
+			old->vm_set_head = vma;
+			vma->vm_set_head = old;
+		}
+	}
+	else
+		list_add(&vma->shared.vm_set.list, &old->shared.vm_set.list);
+}
+
+/*
+ * We cannot modify vm_start, vm_end, vm_pgoff fields of a vma that has been
+ * already present in an i_mmap{_shared} tree without modifying the tree. The
+ * following helper function should be used when such modifications are
+ * necessary. We should hold the mapping's i_shared_sem.
+ *
+ * This function can be (micro)optimized for some special cases (maybe later).
+ */
+static inline void __vma_modify(struct prio_tree_root *root,
+	struct vm_area_struct *vma, unsigned long start, unsigned long end,
+	unsigned long pgoff)
+{
+	if (root)
+		__vma_prio_tree_remove(root, vma);
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_pgoff = pgoff;
+	if (root)
+		__vma_prio_tree_insert(root, vma);
+}
+
+/*
+ * Helper functions to enumerate vmas that map a given file page or a set of
+ * contiguous file pages. The functions return vmas that at least map a single
+ * page in the given range of contiguous file pages.
+ */
+static inline struct vm_area_struct *__vma_prio_tree_first(
+	struct prio_tree_root *root, struct prio_tree_iter *iter,
+	unsigned long begin, unsigned long end)
+{
+	struct prio_tree_node *ptr;
+
+	ptr = prio_tree_first(root, iter, begin, end);
+
+	if (ptr)
+		return prio_tree_entry(ptr, struct vm_area_struct,
+				shared.prio_tree_node);
+	else
+		return NULL;
+}
+
+static inline struct vm_area_struct *__vma_prio_tree_next(
+	struct vm_area_struct *vma, struct prio_tree_root *root,
+	struct prio_tree_iter *iter, unsigned long begin, unsigned long end)
+{
+	struct prio_tree_node *ptr;
+	struct vm_area_struct *next;
+
+	if (vma->shared.both.parent) {
+		if (vma->vm_set_head)
+			return vma->vm_set_head;
+	}
+	else {
+		next = list_entry(vma->shared.vm_set.list.next,
+				struct vm_area_struct, shared.vm_set.list);
+		if (!(next->vm_set_head))
+			return next;
+	}
+
+	ptr = prio_tree_next(root, iter, begin, end);
+
+	if (ptr)
+		return prio_tree_entry(ptr, struct vm_area_struct,
+				shared.prio_tree_node);
+	else
+		return NULL;
+}
+
+/*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
  */
diff -puN /dev/null include/linux/prio_tree.h
--- /dev/null	2003-01-30 05:24:37.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/prio_tree.h	2004-03-21 16:25:01.000000000 -0500
@@ -0,0 +1,78 @@
+#ifndef _LINUX_PRIO_TREE_H
+#define _LINUX_PRIO_TREE_H
+
+struct prio_tree_node {
+	struct prio_tree_node *left;
+	struct prio_tree_node *right;
+	struct prio_tree_node *parent;
+};
+
+struct prio_tree_root {
+	struct prio_tree_node	*prio_tree_node;
+	unsigned int 		index_bits;
+};
+
+struct prio_tree_iter {
+	struct prio_tree_node	*cur;
+	unsigned long		mask;
+	unsigned long		value;
+	int			size_level;
+};
+
+#define PRIO_TREE_ROOT	(struct prio_tree_root) {NULL, 1}
+
+#define PRIO_TREE_ROOT_INIT	{NULL, 1}
+
+#define INIT_PRIO_TREE_ROOT(ptr)	\
+do {					\
+	(ptr)->prio_tree_node = NULL;	\
+	(ptr)->index_bits = 1;		\
+} while (0)
+
+#define PRIO_TREE_NODE_INIT(name)	{&(name), &(name), &(name)}
+
+#define PRIO_TREE_NODE(name) \
+	struct prio_tree_node name = PRIO_TREE_NODE_INIT(name)
+
+#define INIT_PRIO_TREE_NODE(ptr)				\
+do {								\
+	(ptr)->left = (ptr)->right = (ptr)->parent = (ptr);	\
+} while (0)
+
+#define	prio_tree_entry(ptr, type, member) \
+       ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
+
+#define	PRIO_TREE_ITER	(struct prio_tree_iter) {NULL, 0UL, 0UL, 0}
+
+static inline int prio_tree_empty(const struct prio_tree_root *root)
+{
+	return root->prio_tree_node == NULL;
+}
+
+static inline int prio_tree_root(const struct prio_tree_node *node)
+{
+	return node->parent == node;
+}
+
+static inline int prio_tree_left_empty(const struct prio_tree_node *node)
+{
+	return node->left == node;
+}
+
+static inline int prio_tree_right_empty(const struct prio_tree_node *node)
+{
+	return node->right == node;
+}
+
+extern struct prio_tree_node *prio_tree_insert(struct prio_tree_root *,
+	struct prio_tree_node *);
+
+extern void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);
+
+extern struct prio_tree_node *prio_tree_first(struct prio_tree_root *,
+	struct prio_tree_iter *, unsigned long, unsigned long);
+
+extern struct prio_tree_node *prio_tree_next(struct prio_tree_root *,
+	struct prio_tree_iter *, unsigned long, unsigned long);
+
+#endif
diff -puN init/main.c~prio_tree_core init/main.c
--- mmlinux-2.6/init/main.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/init/main.c	2004-03-21 16:25:01.000000000 -0500
@@ -86,6 +86,7 @@ extern void pidhash_init(void);
 extern void pidmap_init(void);
 extern void pte_chain_init(void);
 extern void radix_tree_init(void);
+extern void prio_tree_init(void);
 extern void free_initmem(void);
 extern void populate_rootfs(void);
 extern void driver_init(void);
@@ -460,6 +461,7 @@ asmlinkage void __init start_kernel(void
 	calibrate_delay();
 	pidmap_init();
 	pgtable_cache_init();
+	prio_tree_init();
 	pte_chain_init();
 #ifdef CONFIG_X86
 	if (efi_enabled)
diff -puN kernel/fork.c~prio_tree_core kernel/fork.c
--- mmlinux-2.6/kernel/fork.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/kernel/fork.c	2004-03-21 16:25:01.000000000 -0500
@@ -313,7 +313,7 @@ static inline int dup_mmap(struct mm_str
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
 		file = tmp->vm_file;
-		INIT_LIST_HEAD(&tmp->shared);
+		INIT_VMA_SHARED(tmp);
 		if (file) {
 			struct inode *inode = file->f_dentry->d_inode;
 			get_file(file);
@@ -322,7 +322,7 @@ static inline int dup_mmap(struct mm_str

 			/* insert tmp into the share list, just after mpnt */
 			down(&file->f_mapping->i_shared_sem);
-			list_add_tail(&tmp->shared, &mpnt->shared);
+			__vma_prio_tree_add(tmp, mpnt);
 			up(&file->f_mapping->i_shared_sem);
 		}

diff -puN mm/filemap.c~prio_tree_core mm/filemap.c
--- mmlinux-2.6/mm/filemap.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/filemap.c	2004-03-21 16:25:01.000000000 -0500
@@ -630,7 +630,7 @@ page_ok:
 		 * virtual addresses, take care about potential aliasing
 		 * before reading the page on the kernel side.
 		 */
-		if (!list_empty(&mapping->i_mmap_shared))
+		if (!prio_tree_empty(&mapping->i_mmap_shared))
 			flush_dcache_page(page);

 		/*
diff -puN mm/Makefile~prio_tree_core mm/Makefile
--- mmlinux-2.6/mm/Makefile~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/Makefile	2004-03-21 16:25:01.000000000 -0500
@@ -9,6 +9,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o

 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o readahead.o \
-			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
+			   slab.o swap.o truncate.o vmscan.o prio_tree.o \
+			   $(mmu-y)

 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
diff -puN mm/memory.c~prio_tree_core mm/memory.c
--- mmlinux-2.6/mm/memory.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/memory.c	2004-03-21 16:25:34.000000000 -0500
@@ -1097,11 +1097,11 @@ no_pte_chain:
  * An hlen of zero blows away the entire portion file after hba.
  */
 static void
-invalidate_mmap_range_list(struct list_head *head,
+invalidate_mmap_range_list(struct prio_tree_root *root,
 			   unsigned long const hba,
 			   unsigned long const hlen)
 {
-	struct list_head *curr;
+	struct prio_tree_iter iter;
 	unsigned long hea;	/* last page of hole. */
 	unsigned long vba;
 	unsigned long vea;	/* last page of corresponding uva hole. */
@@ -1112,17 +1112,16 @@ invalidate_mmap_range_list(struct list_h
 	hea = hba + hlen - 1;	/* avoid overflow. */
 	if (hea < hba)
 		hea = ULONG_MAX;
-	list_for_each(curr, head) {
-		vp = list_entry(curr, struct vm_area_struct, shared);
+	vp = __vma_prio_tree_first(root, &iter, hba, hea);
+	while(vp) {
 		vba = vp->vm_pgoff;
 		vea = vba + ((vp->vm_end - vp->vm_start) >> PAGE_SHIFT) - 1;
-		if (hea < vba || vea < hba)
-		    	continue;	/* Mapping disjoint from hole. */
 		zba = (hba <= vba) ? vba : hba;
 		zea = (vea <= hea) ? vea : hea;
 		zap_page_range(vp,
 			       ((zba - vba) << PAGE_SHIFT) + vp->vm_start,
 			       (zea - zba + 1) << PAGE_SHIFT);
+		vp = __vma_prio_tree_next(vp, root, &iter, hba, hea);
 	}
 }

@@ -1157,9 +1156,9 @@ void invalidate_mmap_range(struct addres
 	down(&mapping->i_shared_sem);
 	/* Protect against page fault */
 	atomic_inc(&mapping->truncate_count);
-	if (unlikely(!list_empty(&mapping->i_mmap)))
+	if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
 		invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen);
-	if (unlikely(!list_empty(&mapping->i_mmap_shared)))
+	if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared)))
 		invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen);
 	up(&mapping->i_shared_sem);
 }
diff -puN mm/mmap.c~prio_tree_core mm/mmap.c
--- mmlinux-2.6/mm/mmap.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mmap.c	2004-03-21 16:25:01.000000000 -0500
@@ -64,12 +64,16 @@ EXPORT_SYMBOL(vm_committed_space);
  * Requires inode->i_mapping->i_shared_sem
  */
 static inline void
-__remove_shared_vm_struct(struct vm_area_struct *vma, struct inode *inode)
+__remove_shared_vm_struct(struct vm_area_struct *vma, struct inode *inode,
+	struct address_space *mapping)
 {
 	if (inode) {
 		if (vma->vm_flags & VM_DENYWRITE)
 			atomic_inc(&inode->i_writecount);
-		list_del_init(&vma->shared);
+		if (vma->vm_flags & VM_SHARED)
+			__vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
+		else
+			__vma_prio_tree_remove(&mapping->i_mmap, vma);
 	}
 }

@@ -83,7 +87,8 @@ static void remove_shared_vm_struct(stru
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
 		down(&mapping->i_shared_sem);
-		__remove_shared_vm_struct(vma, file->f_dentry->d_inode);
+		__remove_shared_vm_struct(vma, file->f_dentry->d_inode,
+				mapping);
 		up(&mapping->i_shared_sem);
 	}
 }
@@ -257,10 +262,10 @@ static inline void __vma_link_file(struc
 		if (vma->vm_flags & VM_DENYWRITE)
 			atomic_dec(&file->f_dentry->d_inode->i_writecount);

-		if (vma->vm_flags & VM_SHARED)
-			list_add_tail(&vma->shared, &mapping->i_mmap_shared);
-		else
-			list_add_tail(&vma->shared, &mapping->i_mmap);
+		if (vma->vm_flags & VM_SHARED)
+			__vma_prio_tree_insert(&mapping->i_mmap_shared, vma);
+		else
+			__vma_prio_tree_insert(&mapping->i_mmap, vma);
 	}
 }

@@ -390,7 +395,9 @@ static int vma_merge(struct mm_struct *m
 {
 	spinlock_t *lock = &mm->page_table_lock;
 	struct inode *inode = file ? file->f_dentry->d_inode : NULL;
+	struct address_space *mapping = file ? file->f_mapping : NULL;
 	struct semaphore *i_shared_sem;
+	struct prio_tree_root *root = NULL;

 	/*
 	 * We later require that vma->vm_flags == vm_flags, so this tests
@@ -401,6 +408,13 @@ static int vma_merge(struct mm_struct *m

 	i_shared_sem = file ? &file->f_mapping->i_shared_sem : NULL;

+	if (mapping) {
+		if (vm_flags & VM_SHARED)
+			root = &mapping->i_mmap_shared;
+		else
+			root = &mapping->i_mmap;
+	}
+
 	if (!prev) {
 		prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
 		goto merge_next;
@@ -421,18 +435,18 @@ static int vma_merge(struct mm_struct *m
 			need_up = 1;
 		}
 		spin_lock(lock);
-		prev->vm_end = end;

 		/*
 		 * OK, it did.  Can we now merge in the successor as well?
 		 */
 		next = prev->vm_next;
-		if (next && prev->vm_end == next->vm_start &&
+		if (next && end == next->vm_start &&
 				can_vma_merge_before(next, vm_flags, file,
 					pgoff, (end - addr) >> PAGE_SHIFT)) {
-			prev->vm_end = next->vm_end;
+			__vma_modify(root, prev, prev->vm_start,
+					next->vm_end, prev->vm_pgoff);
 			__vma_unlink(mm, next, prev);
-			__remove_shared_vm_struct(next, inode);
+			__remove_shared_vm_struct(next, inode, mapping);
 			spin_unlock(lock);
 			if (need_up)
 				up(i_shared_sem);
@@ -443,6 +457,7 @@ static int vma_merge(struct mm_struct *m
 			kmem_cache_free(vm_area_cachep, next);
 			return 1;
 		}
+		__vma_modify(root, prev, prev->vm_start, end, prev->vm_pgoff);
 		spin_unlock(lock);
 		if (need_up)
 			up(i_shared_sem);
@@ -462,8 +477,8 @@ static int vma_merge(struct mm_struct *m
 			if (file)
 				down(i_shared_sem);
 			spin_lock(lock);
-			prev->vm_start = addr;
-			prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
+			__vma_modify(root, prev, addr, prev->vm_end,
+				prev->vm_pgoff - ((end - addr) >> PAGE_SHIFT));
 			spin_unlock(lock);
 			if (file)
 				up(i_shared_sem);
@@ -649,7 +664,7 @@ munmap_back:
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
 	vma->vm_next = NULL;
-	INIT_LIST_HEAD(&vma->shared);
+	INIT_VMA_SHARED(vma);

 	if (file) {
 		error = -EINVAL;
@@ -1196,6 +1211,7 @@ int split_vma(struct mm_struct * mm, str
 {
 	struct vm_area_struct *new;
 	struct address_space *mapping = NULL;
+	struct prio_tree_root *root = NULL;

 	if (mm->map_count >= MAX_MAP_COUNT)
 		return -ENOMEM;
@@ -1207,7 +1223,7 @@ int split_vma(struct mm_struct * mm, str
 	/* most fields are the same, copy all, and then fixup */
 	*new = *vma;

-	INIT_LIST_HEAD(&new->shared);
+	INIT_VMA_SHARED(new);

 	if (new_below)
 		new->vm_end = addr;
@@ -1222,18 +1238,24 @@ int split_vma(struct mm_struct * mm, str
 	if (new->vm_ops && new->vm_ops->open)
 		new->vm_ops->open(new);

-	if (vma->vm_file)
+	if (vma->vm_file) {
 		 mapping = vma->vm_file->f_mapping;

+		 if (vma->vm_flags & VM_SHARED)
+			 root = &mapping->i_mmap_shared;
+		 else
+			 root = &mapping->i_mmap;
+	}
+
 	if (mapping)
 		down(&mapping->i_shared_sem);
 	spin_lock(&mm->page_table_lock);

-	if (new_below) {
-		vma->vm_start = addr;
-		vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
-	} else
-		vma->vm_end = addr;
+	if (new_below)
+		__vma_modify(root, vma, addr, vma->vm_end,
+			vma->vm_pgoff + ((addr - new->vm_start) >> PAGE_SHIFT));
+	else
+		__vma_modify(root, vma, vma->vm_start, addr, vma->vm_pgoff);

 	__insert_vm_struct(mm, new);

@@ -1406,7 +1428,7 @@ unsigned long do_brk(unsigned long addr,
 	vma->vm_pgoff = 0;
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
-	INIT_LIST_HEAD(&vma->shared);
+	INIT_VMA_SHARED(vma);

 	vma_link(mm, vma, prev, rb_link, rb_parent);

diff -puN mm/mremap.c~prio_tree_core mm/mremap.c
--- mmlinux-2.6/mm/mremap.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mremap.c	2004-03-21 16:25:01.000000000 -0500
@@ -251,7 +251,7 @@ static unsigned long move_vma(struct vm_

 		if (allocated_vma) {
 			*new_vma = *vma;
-			INIT_LIST_HEAD(&new_vma->shared);
+			INIT_VMA_SHARED(new_vma);
 			new_vma->vm_start = new_addr;
 			new_vma->vm_end = new_addr+new_len;
 			new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;
@@ -309,6 +309,8 @@ unsigned long do_mremap(unsigned long ad
 	unsigned long flags, unsigned long new_addr)
 {
 	struct vm_area_struct *vma;
+	struct address_space *mapping = NULL;
+	struct prio_tree_root *root = NULL;
 	unsigned long ret = -EINVAL;
 	unsigned long charged = 0;

@@ -416,9 +418,24 @@ unsigned long do_mremap(unsigned long ad
 		/* can we just expand the current mapping? */
 		if (max_addr - addr >= new_len) {
 			int pages = (new_len - old_len) >> PAGE_SHIFT;
+
+			if (vma->vm_file) {
+				mapping = vma->vm_file->f_mapping;
+				if (vma->vm_flags & VM_SHARED)
+					root = &mapping->i_mmap_shared;
+				else
+					root = &mapping->i_mmap;
+				down(&mapping->i_shared_sem);
+			}
+
 			spin_lock(&vma->vm_mm->page_table_lock);
-			vma->vm_end = addr + new_len;
+			__vma_modify(root, vma, vma->vm_start,
+					addr + new_len, vma->vm_pgoff);
 			spin_unlock(&vma->vm_mm->page_table_lock);
+
+			if(mapping)
+				up(&mapping->i_shared_sem);
+
 			current->mm->total_vm += pages;
 			if (vma->vm_flags & VM_LOCKED) {
 				current->mm->locked_vm += pages;
diff -puN /dev/null mm/prio_tree.c
--- /dev/null	2003-01-30 05:24:37.000000000 -0500
+++ mmlinux-2.6-jaya/mm/prio_tree.c	2004-03-21 16:25:01.000000000 -0500
@@ -0,0 +1,574 @@
+/*
+ * mm/prio_tree.c - priority search tree for mapping->i_mmap{,_shared}
+ *
+ * Copyright (C) 2004, Rajesh Venkatasubramanian <vrajesh@umich.edu>
+ *
+ * Based on the radix priority search tree proposed by Edward M. McCreight
+ * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985
+ *
+ * 02Feb2004	Initial version
+ */
+
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/prio_tree.h>
+
+/*
+ * A clever mix of heap and radix trees forms a radix priority search tree (PST)
+ * which is useful for storing intervals, e.g., we can consider a vma as a closed
+ * interval of file pages [offset_begin, offset_end], and store all vmas that
+ * map a file in a PST. Then, using the PST, we can answer a stabbing query,
+ * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a
+ * given input interval X (a set of consecutive file pages), in "O(log n + m)"
+ * time where 'log n' is the height of the PST, and 'm' is the number of stored
+ * intervals (vmas) that overlap (map) with the input interval X (the set of
+ * consecutive file pages).
+ *
+ * In our implementation, we store closed intervals of the form [radix_index,
+ * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST
+ * is designed for storing intervals with unique radix indices, i.e., each
+ * interval have different radix_index. However, this limitation can be easily
+ * overcome by using the size, i.e., heap_index - radix_index, as part of the
+ * index, so we index the tree using [(radix_index,size), heap_index].
+ *
+ * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit
+ * machine, the maximum height of a PST can be 64. We can use a balanced version
+ * of the priority search tree to optimize the tree height, but the balanced
+ * tree proposed by McCreight is too complex and memory-hungry for our purpose.
+ */
+
+static unsigned long index_bits_to_maxindex[BITS_PER_LONG];
+
+/*
+ * Maximum heap_index that can be stored in a PST with index_bits bits
+ */
+static inline unsigned long prio_tree_maxindex(unsigned int bits)
+{
+	return index_bits_to_maxindex[bits - 1];
+}
+
+/*
+ * Extend a priority search tree so that it can store a node with heap_index
+ * max_heap_index. In the worst case, this algorithm takes O((log n)^2).
+ * However, this function is used rarely and the common case performance is
+ * not bad.
+ */
+static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root,
+	struct prio_tree_node *node, unsigned long max_heap_index)
+{
+	struct prio_tree_node *first = NULL, *prev, *last = NULL;
+
+	if (max_heap_index > prio_tree_maxindex(root->index_bits))
+		root->index_bits++;
+
+	while (max_heap_index > prio_tree_maxindex(root->index_bits)) {
+		root->index_bits++;
+
+		if (prio_tree_empty(root))
+			continue;
+
+		if (first == NULL) {
+			first = root->prio_tree_node;
+			prio_tree_remove(root, root->prio_tree_node);
+			INIT_PRIO_TREE_NODE(first);
+			last = first;
+		}
+		else {
+			prev = last;
+			last = root->prio_tree_node;
+			prio_tree_remove(root, root->prio_tree_node);
+			INIT_PRIO_TREE_NODE(last);
+			prev->left = last;
+			last->parent = prev;
+		}
+	}
+
+	INIT_PRIO_TREE_NODE(node);
+
+	if (first) {
+		node->left = first;
+		first->parent = node;
+	}
+	else
+		last = node;
+
+	if (!prio_tree_empty(root)) {
+		last->left = root->prio_tree_node;
+		last->left->parent = last;
+	}
+
+	root->prio_tree_node = node;
+	return node;
+}
+
+/*
+ * Replace a prio_tree_node with a new node and return the old node
+ */
+static inline struct prio_tree_node *prio_tree_replace(
+	struct prio_tree_root *root, struct prio_tree_node *old,
+	struct prio_tree_node *node)
+{
+	INIT_PRIO_TREE_NODE(node);
+
+	if (prio_tree_root(old)) {
+		BUG_ON(root->prio_tree_node != old);
+		/*
+		 * We can reduce root->index_bits here. However, it is complex
+		 * and does not help much to improve performance (IMO).
+		 */
+		node->parent = node;
+		root->prio_tree_node = node;
+	}
+	else {
+		node->parent = old->parent;
+		if (old->parent->left == old)
+			old->parent->left = node;
+		else {
+			BUG_ON(old->parent->right != old);
+			old->parent->right = node;
+		}
+	}
+
+	if (!prio_tree_left_empty(old)) {
+		node->left = old->left;
+		old->left->parent = node;
+	}
+
+	if (!prio_tree_right_empty(old)) {
+		node->right = old->right;
+		old->right->parent = node;
+	}
+
+	return old;
+}
+
+#undef	swap
+#define	swap(x,y,z)	do {z = x; x = y; y = z; } while (0)
+
+/*
+ * Insert a prio_tree_node @node into a radix priority search tree @root. The
+ * algorithm typically takes O(log n) time where 'log n' is the number of bits
+ * required to represent the maximum heap_index. In the worst case, the algo
+ * can take O((log n)^2) - check prio_tree_expand.
+ *
+ * If a prior node with same radix_index and heap_index is already found in
+ * the tree, then returns the address of the prior node. Otherwise, inserts
+ * @node into the tree and returns @node.
+ */
+
+struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+			struct prio_tree_node *node)
+{
+	struct prio_tree_node *cur, *res = node;
+	unsigned long radix_index, heap_index;
+	unsigned long r_index, h_index, index, mask;
+	int size_flag = 0;
+
+	GET_INDEX(node, radix_index, heap_index);
+
+	if (prio_tree_empty(root) ||
+			heap_index > prio_tree_maxindex(root->index_bits))
+		return prio_tree_expand(root, node, heap_index);
+
+	cur = root->prio_tree_node;
+	mask = 1UL << (root->index_bits - 1);
+
+	while (mask) {
+		GET_INDEX(cur, r_index, h_index);
+
+		if (r_index == radix_index && h_index == heap_index)
+			return cur;
+
+                if (h_index < heap_index || (h_index == heap_index &&
+						r_index > radix_index))
+		{
+			struct prio_tree_node *tmp = node;
+			node = prio_tree_replace(root, cur, node);
+			cur = tmp;
+			swap(r_index, radix_index, index);
+			swap(h_index, heap_index, index);
+		}
+
+		if (size_flag)
+			index = heap_index - radix_index;
+		else
+			index = radix_index;
+
+		if (index & mask) {
+			if (prio_tree_right_empty(cur)) {
+				INIT_PRIO_TREE_NODE(node);
+				cur->right = node;
+				node->parent = cur;
+				return res;
+			}
+			else
+				cur = cur->right;
+		}
+		else {
+			if (prio_tree_left_empty(cur)) {
+				INIT_PRIO_TREE_NODE(node);
+				cur->left = node;
+				node->parent = cur;
+				return res;
+			}
+			else
+				cur = cur->left;
+		}
+
+		mask >>= 1;
+
+		if (!mask) {
+			mask = 1UL << (root->index_bits - 1);
+			size_flag = 1;
+		}
+	}
+	/* Should not reach here */
+	BUG();
+	return NULL;
+}
+
+/*
+ * Remove a prio_tree_node @node from a radix priority search tree @root. The
+ * algorithm takes O(log n) time where 'log n' is the number of bits required
+ * to represent the maximum heap_index.
+ */
+
+void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node)
+{
+	struct prio_tree_node *cur;
+	unsigned long r_index, h_index_right, h_index_left;
+
+	cur = node;
+
+	while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
+		if (!prio_tree_left_empty(cur))
+			GET_INDEX(cur->left, r_index, h_index_left);
+		else {
+			cur = cur->right;
+			continue;
+		}
+
+		if (!prio_tree_right_empty(cur))
+			GET_INDEX(cur->right, r_index, h_index_right);
+		else {
+			cur = cur->left;
+			continue;
+		}
+
+		/* both h_index_left and h_index_right cannot be 0 */
+		if (h_index_left >= h_index_right)
+			cur = cur->left;
+		else
+			cur = cur->right;
+	}
+
+	if (prio_tree_root(cur)) {
+		BUG_ON(root->prio_tree_node != cur);
+		*root = PRIO_TREE_ROOT;
+		return;
+	}
+
+	if (cur->parent->right == cur)
+		cur->parent->right = cur->parent;
+	else {
+		BUG_ON(cur->parent->left != cur);
+		cur->parent->left = cur->parent;
+	}
+
+	while (cur != node)
+		cur = prio_tree_replace(root, cur->parent, cur);
+
+	return;
+}
+
+/*
+ * Following functions help to enumerate all prio_tree_nodes in the tree that
+ * overlap with the input interval X [radix_index, heap_index]. The enumeration
+ * takes O(log n + m) time where 'log n' is the height of the tree (which is
+ * proportional to # of bits required to represent the maximum heap_index) and
+ * 'm' is the number of prio_tree_nodes that overlap the interval X.
+ */
+
+static inline struct prio_tree_node *__prio_tree_left(
+	struct prio_tree_root *root, struct prio_tree_iter *iter,
+	unsigned long radix_index, unsigned long heap_index,
+	unsigned long *r_index, unsigned long *h_index)
+{
+	if (prio_tree_left_empty(iter->cur))
+		return NULL;
+
+	GET_INDEX(iter->cur->left, *r_index, *h_index);
+
+	if (radix_index <= *h_index) {
+		iter->cur = iter->cur->left;
+		iter->mask >>= 1;
+		if (iter->mask) {
+			if (iter->size_level)
+				iter->size_level++;
+		}
+		else {
+			iter->size_level = 1;
+			iter->mask = 1UL << (root->index_bits - 1);
+		}
+		return iter->cur;
+	}
+
+	return NULL;
+}
+
+
+static inline struct prio_tree_node *__prio_tree_right(
+	struct prio_tree_root *root, struct prio_tree_iter *iter,
+	unsigned long radix_index, unsigned long heap_index,
+	unsigned long *r_index, unsigned long *h_index)
+{
+	unsigned long value;
+
+	if (prio_tree_right_empty(iter->cur))
+		return NULL;
+
+	if (iter->size_level)
+		value = iter->value;
+	else
+		value = iter->value | iter->mask;
+
+	if (heap_index < value)
+		return NULL;
+
+	GET_INDEX(iter->cur->right, *r_index, *h_index);
+
+	if (radix_index <= *h_index) {
+		iter->cur = iter->cur->right;
+		iter->mask >>= 1;
+		iter->value = value;
+		if (iter->mask) {
+			if (iter->size_level)
+				iter->size_level++;
+		}
+		else {
+			iter->size_level = 1;
+			iter->mask = 1UL << (root->index_bits - 1);
+		}
+		return iter->cur;
+	}
+
+	return NULL;
+}
+
+static inline struct prio_tree_node *__prio_tree_parent(
+	struct prio_tree_iter *iter)
+{
+	iter->cur = iter->cur->parent;
+	iter->mask <<= 1;
+	if (iter->size_level) {
+		if (iter->size_level == 1)
+			iter->mask = 1UL;
+		iter->size_level--;
+	}
+	else if (iter->value & iter->mask)
+		iter->value ^= iter->mask;
+	return iter->cur;
+}
+
+static inline int overlap(unsigned long radix_index, unsigned long heap_index,
+	unsigned long r_index, unsigned long h_index)
+{
+	if (heap_index < r_index || radix_index > h_index)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * prio_tree_first:
+ *
+ * Get the first prio_tree_node that overlaps with the interval [radix_index,
+ * heap_index]. Note that always radix_index <= heap_index. We do a pre-order
+ * traversal of the tree.
+ */
+struct prio_tree_node *prio_tree_first(struct prio_tree_root *root,
+	struct prio_tree_iter *iter, unsigned long radix_index,
+	unsigned long heap_index)
+{
+	unsigned long r_index, h_index;
+
+	*iter = PRIO_TREE_ITER;
+
+	if (prio_tree_empty(root))
+		return NULL;
+
+	GET_INDEX(root->prio_tree_node, r_index, h_index);
+
+	if (radix_index > h_index)
+		return NULL;
+
+	iter->mask = 1UL << (root->index_bits - 1);
+	iter->cur = root->prio_tree_node;
+
+	while (1) {
+		if (overlap(radix_index, heap_index, r_index, h_index))
+			return iter->cur;
+
+		if (__prio_tree_left(root, iter, radix_index, heap_index,
+					&r_index, &h_index))
+			continue;
+
+		if (__prio_tree_right(root, iter, radix_index, heap_index,
+					&r_index, &h_index))
+			continue;
+
+		break;
+	}
+	return NULL;
+}
+
+/* Get the next prio_tree_node that overlaps with the input interval in iter */
+struct prio_tree_node *prio_tree_next(struct prio_tree_root *root,
+	struct prio_tree_iter *iter, unsigned long radix_index,
+	unsigned long heap_index)
+{
+	unsigned long r_index, h_index;
+
+repeat:
+	while (__prio_tree_left(root, iter, radix_index, heap_index,
+				&r_index, &h_index))
+		if (overlap(radix_index, heap_index, r_index, h_index))
+			return iter->cur;
+
+	while (!__prio_tree_right(root, iter, radix_index, heap_index,
+				&r_index, &h_index)) {
+	    	while (!prio_tree_root(iter->cur) &&
+				iter->cur->parent->right == iter->cur)
+			__prio_tree_parent(iter);
+
+		if (prio_tree_root(iter->cur))
+			return NULL;
+
+		__prio_tree_parent(iter);
+	}
+
+	if (overlap(radix_index, heap_index, r_index, h_index))
+			return iter->cur;
+
+	goto repeat;
+}
+
+/*
+ * Radix priority search tree for address_space->i_mmap{_shared}
+ *
+ * For each vma that maps a unique set of file pages, i.e., a unique [radix_index,
+ * heap_index] value, we have a corresponding priority search tree node. If
+ * multiple vmas have identical [radix_index, heap_index] value, then one of
+ * them is used as a tree node and others are stored in a vm_set list. The tree
+ * node points to the first vma (head) of the list using vm_set_head.
+ *
+ * prio_tree_root
+ *      |
+ *      A       vm_set_head
+ *     / \      /
+ *    L   R -> H-I-J-K-M-N-O-P-Q-S
+ *    ^   ^    <-- vm_set.list -->
+ *  tree nodes
+ *
+ * We need some way to identify whether a vma is a tree node, head of a vm_set
+ * list, or just a member of a vm_set list. We cannot use vm_flags to store
+ * such information. The reason is, in the above figure, it is possible that
+ * vm_flags' of R and H are covered by the different mmap_sems. When R is
+ * removed under R->mmap_sem, H replaces R as a tree node. Since we do not hold
+ * H->mmap_sem, we cannot use H->vm_flags for marking that H is a tree node now.
+ * That's why some trick involving shared.both.parent is used for identifying
+ * tree nodes and list head nodes. We can possibly use the least significant
+ * bit of the vm_set_head field to mark tree and list head nodes. I was worried
+ * about the alignment of vm_area_struct in various architectures.
+ *
+ * vma radix priority search tree node rules:
+ *
+ * vma->shared.both.parent != NULL	==>	a tree node
+ *
+ * vma->shared.both.parent == NULL
+ * 	vma->vm_set_head != NULL  ==>  list head of vmas that map same pages
+ * 	vma->vm_set_head == NULL  ==>  a list node
+ */
+
+void __vma_prio_tree_insert(struct prio_tree_root *root,
+	struct vm_area_struct *vma)
+{
+	struct prio_tree_node *ptr;
+	struct vm_area_struct *old;
+
+	ptr = prio_tree_insert(root, &vma->shared.prio_tree_node);
+
+	if (ptr == &vma->shared.prio_tree_node) {
+		vma->vm_set_head = NULL;
+		return;
+	}
+
+	old = prio_tree_entry(ptr, struct vm_area_struct,
+			shared.prio_tree_node);
+
+	__vma_prio_tree_add(vma, old);
+}
+
+void __vma_prio_tree_remove(struct prio_tree_root *root,
+	struct vm_area_struct *vma)
+{
+	struct vm_area_struct *node, *head, *new_head;
+
+	if (vma->shared.both.parent == NULL && vma->vm_set_head == NULL) {
+		list_del_init(&vma->shared.vm_set.list);
+		INIT_VMA_SHARED(vma);
+		return;
+	}
+
+	if (vma->vm_set_head) {
+		/* Leave this BUG_ON till prio_tree patch stabilizes */
+		BUG_ON(vma->vm_set_head->vm_set_head != vma);
+		if (vma->shared.both.parent) {
+			head = vma->vm_set_head;
+			if (!list_empty(&head->shared.vm_set.list)) {
+				new_head = list_entry(
+					head->shared.vm_set.list.next,
+					struct vm_area_struct,
+					shared.vm_set.list);
+				list_del_init(&head->shared.vm_set.list);
+			}
+			else
+				new_head = NULL;
+
+			prio_tree_replace(root, &vma->shared.prio_tree_node,
+					&head->shared.prio_tree_node);
+			head->vm_set_head = new_head;
+			if (new_head)
+				new_head->vm_set_head = head;
+
+		}
+		else {
+			node = vma->vm_set_head;
+			if (!list_empty(&vma->shared.vm_set.list)) {
+				new_head = list_entry(
+					vma->shared.vm_set.list.next,
+					struct vm_area_struct,
+					shared.vm_set.list);
+				list_del_init(&vma->shared.vm_set.list);
+				node->vm_set_head = new_head;
+				new_head->vm_set_head = node;
+			}
+			else
+				node->vm_set_head = NULL;
+		}
+		INIT_VMA_SHARED(vma);
+		return;
+	}
+
+	prio_tree_remove(root, &vma->shared.prio_tree_node);
+	INIT_VMA_SHARED(vma);
+}
+
+void __init prio_tree_init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++)
+		index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
+	index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL;
+}
diff -puN mm/shmem.c~prio_tree_core mm/shmem.c
--- mmlinux-2.6/mm/shmem.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/shmem.c	2004-03-21 16:25:01.000000000 -0500
@@ -1328,7 +1328,7 @@ static void do_shmem_file_read(struct fi
 			 * virtual addresses, take care about potential aliasing
 			 * before reading the page on the kernel side.
 			 */
-			if (!list_empty(&mapping->i_mmap_shared))
+			if (!prio_tree_empty(&mapping->i_mmap_shared))
 				flush_dcache_page(page);
 			/*
 			 * Mark the page accessed if we read the beginning.
diff -puN mm/swap_state.c~prio_tree_core mm/swap_state.c
--- mmlinux-2.6/mm/swap_state.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/swap_state.c	2004-03-21 16:25:01.000000000 -0500
@@ -32,8 +32,8 @@ struct address_space swapper_space = {
 	.locked_pages	= LIST_HEAD_INIT(swapper_space.locked_pages),
 	.a_ops		= &swap_aops,
 	.backing_dev_info = &swap_backing_dev_info,
-	.i_mmap		= LIST_HEAD_INIT(swapper_space.i_mmap),
-	.i_mmap_shared	= LIST_HEAD_INIT(swapper_space.i_mmap_shared),
+	.i_mmap		= PRIO_TREE_ROOT_INIT,
+	.i_mmap_shared	= PRIO_TREE_ROOT_INIT,
 	.i_shared_sem	= __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
 	.truncate_count  = ATOMIC_INIT(0),
 	.private_lock	= SPIN_LOCK_UNLOCKED,
diff -puN mm/vmscan.c~prio_tree_core mm/vmscan.c
--- mmlinux-2.6/mm/vmscan.c~prio_tree_core	2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/vmscan.c	2004-03-21 16:25:01.000000000 -0500
@@ -191,9 +191,9 @@ static inline int page_mapping_inuse(str
 		return 1;

 	/* File is mmap'd by somebody. */
-	if (!list_empty(&mapping->i_mmap))
+	if (!prio_tree_empty(&mapping->i_mmap))
 		return 1;
-	if (!list_empty(&mapping->i_mmap_shared))
+	if (!prio_tree_empty(&mapping->i_mmap_shared))
 		return 1;

 	return 0;

_


