Discussion:
[DSJ] Created: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem
Simon Brown (JIRA)
2010-01-20 14:55:59 UTC
Permalink
Batch import times increase drastically as repository size increases; patch to mitigate the problem
---------------------------------------------------------------------------------------------------

Key: DS-470
URL: http://jira.dspace.org/jira/browse/DS-470
Project: DSpace 1.x
Issue Type: Improvement
Components: DSpace API
Affects Versions: 1.6.0
Reporter: Simon Brown
Priority: Minor
Attachments: batch_importer_speedup.patch

As mentioned by my colleague Tom De Mulder on dspace-tech and at http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/

As the repository grows, the time taken for batch imports to run also increases. Having profiled the importer in our 1.6.0-RC1 install we determined that most (80%-90%) of the time was spent in calls to IndexBrowse.pruneIndexes().

The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so every time an item is indexed, the indexes are pruned. For any batch of size n, where n > 1, this is (n - 1) times more than is necessary.

Increasing the visibility of pruneIndexes(), removing the call from IndexBrowse.indexItem(), and making a single call at the end of the BrowseConsumer.end() method reduces this to once per event queue run.

However, the batch importer calls Context.commit() after each item is imported. Context.commit() runs the event queue, thus causing one event queue run per imported item.

This patch addresses both of these issues in a way which has a minimal effect on the rest of the code base; I don't necessarily consider it to be the "best" way, but I wanted to keep the patch small so it could be put out. What it does is:

1. Create an IndexBrowse.indexItemNoPrune() method, which is called from the BrowseConsumer class instead of indexItem(). Other calls to indexItem() are not affected.
2. Call pruneIndexes() from BrowseConsumer.end()
3. Change the call in the batch importer from Context.commit() to Context.getDBConnection().commit(). The only effective difference between the two is that the event queue is not run; I think that a better solution might be to move the code that runs the event queue from the Context.commit() method to the Context.complete() method, but I don't know what effect that will have on the rest of the code.

As noted in Tom's blog post linked above, on a repository with in excess of 120,000 items these changes improved import performance from 4.7 seconds per item (roughly 0.2 items/second) to 4.9 items per second.
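For readers who don't want to open the attachment, here is a minimal sketch of the approach the three steps describe. It is an illustration only, not the attached batch_importer_speedup.patch verbatim; surrounding class members are elided, and `ib` / `c` stand for the consumer's IndexBrowse instance and the importer's Context.

// In org.dspace.browse.IndexBrowse -- step 1: index without the per-item prune.
public void indexItemNoPrune(Item item) throws BrowseException
{
    if (item.isArchived() || item.isWithdrawn())
    {
        indexItem(new ItemMetadataProxy(item));
        // note: no pruneIndexes() call here, unlike indexItem(Item)
    }
}

// In org.dspace.browse.BrowseConsumer -- step 2: prune once per event queue run.
public void end(Context ctx) throws Exception
{
    // ... existing code that calls ib.indexItemNoPrune(item) for each
    // accumulated item instead of ib.indexItem(item) ...
    ib.pruneIndexes();   // single prune for the whole batch of events
}

// In org.dspace.app.itemimport.ItemImport -- step 3: commit the database
// transaction after each item without firing the event queue.
c.getDBConnection().commit();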
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
Richard Rodgers (JIRA)
2010-01-20 15:44:59 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11088#action_11088 ]

Richard Rodgers commented on DS-470:
------------------------------------

Hi Simon, Tom etc. Thanks for the careful analysis & work. Without looking closely at the patch, I'm wondering whether there might be a simpler solution.
You can use a single API call (setDispatcher in the Context class) in ItemImport to use the 'noindex' dispatcher, which does not call any of the usual event consumers, including search (the dispatcher is already defined in dspace.cfg). Then, after the import, just run 'index_all'. The event system was designed to facilitate just this sort of context-specific use.

I'll be glad to furnish further details if this isn't clear.
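A short sketch of what Richard is suggesting, assuming the 'noindex' dispatcher is configured in dspace.cfg as he says; setDispatcher() is the Context method he names, and the import loop itself is elided:

Context c = new Context();        // throws SQLException
c.setDispatcher("noindex");       // suppress the usual consumers (browse, search, ...) for this context

// ... run the batch import against this context as normal ...

c.complete();

// Afterwards, run the indexing script Richard mentions once, so the
// browse/search indexes are rebuilt in a single pass rather than per item.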
Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
2010-01-20 18:05:49 UTC
Permalink
Hi Richard,
This caught my eye this morning because we have a large repository (currently 122,091 Items). We too have issues with our imports really slowing down as our repository grows in size and have looked for a solution to the problem. I just wanted to mention that the solution where you turn off the event consumers and then build the indexes (I assume you meant index-init when you said index-all...?) would not work well for us since it takes up to a week for our index-init to complete. Perhaps it would work for us to just run index-update afterward as I don't think this takes nearly as long to run, but I'm not absolutely sure.
Sue

Richard Rodgers
2010-01-20 23:22:03 UTC
Permalink
Hi Sue:

Apologies for the confusion - 'index_all' was the old name for the script: I did mean index-update. One wouldn't run index-init except in cases of new systems, corrupt indices or the like.
Index-update operates incrementally, and is *much* faster.

Richard
Tom De Mulder
2010-01-21 14:23:21 UTC
Permalink
Post by Richard Rodgers
Apologies for the confusion - 'index_all' was the old name for the script: I did mean index-update. One wouldn't run index-init except in cases of new systems, corrupt indices or the like.
Index-update operates incrementally, and is *much* faster.
Sadly, though, your solution touches the entire repository, and doesn't
scale as well. Once the repository size gets large enough, even "faster"
can take a long time.


Best regards,

--
Tom De Mulder <***@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 21/01/2010 : The Moon is Waxing Crescent (36% of Full)
Graham Triggs
2010-01-27 11:51:19 UTC
Permalink
Post by Tom De Mulder
Post by Richard Rodgers
Apologies for the confusion - 'index_all' was the old name for the
script: I did mean index-update. One wouldn't run index-init except in cases
of new systems, corrupt indices or the like.
Index-update operates incrementally, and is *much* faster.
Sadly, though, your solution touches the entire repository, and doesn't
scale as well. Once the repository size gets large enough, even "faster"
can take a long time.
I'm not going to advocate a specific solution here, but a philosophy. Speed
and scalability are different things, and it's dangerous to conflate the
two.

A batch import is just that - a batch process. A non-interactive job that
churns through a bunch of data until it has exhausted all its input. The
speed of a batch process should only matter when:

1) You have a specific point in time that a process must be completed by.

2) You are sitting there watching it.

3) You have so much data to process each and every day that you can't
possibly ever complete.

Even if it takes an hour to process 4,000 documents, that still means
you can import around 100,000 in a day. How many people are close to needing to do
that?

As for watching it, well, surely you have better things to do! Or, as Robert
Llewellyn says about recharging electric cars - it takes 9 secs. That's how
long it takes to initiate the process - and your involvement ceases there.

Yes, having something completed by a specific point in time can be a
concern... "I need to have these articles loaded by the date I need to
submit a report about them". But that ought to only be a concern in
determining when the process needs to start.

If the reason for speeding up the batch import is because it impacts on the
usability of the repository for that duration - that is not a scalable
system. It does not matter how much faster you make it, there is always a
finite limit to how many items you can process with a single importer,
running against a single repository, that exists on a single machine.

A scalable system can run imports all day long without affecting the
functionality or performance of the repository for users accessing it
concurrently. A scalable system can run 10, 20,... 100 importers
simultaneously without detrimental effects. A scalable system lets you
import millions of items an hour, by allowing you to utilise the resources
needed to do it, not by trying to squeeze a single process into a finite
resource.

Or to paraphrase your statement, once the repository gets large enough, even
"faster" can never be fast enough. I would say stop worrying about how fast
you can make a batch import, and think about how you can reduce the impact
on the system. The numbers being posted are nowhere near being a concern for
a non-interactive process. Your batch import still only takes 9 secs ;)

G
Simon Brown
2010-01-27 18:16:32 UTC
Permalink
Post by Graham Triggs
I'm not going to advocate a specific solution here, but a
philosophy. Speed and scalability are different things, and it's
dangerous to conflate the two.
Happily, we weren't doing that. We were bringing to light a
scalability issue by highlighting the incredible reduction in import
speed as the repository grows. This is not a "the importer is too
slow" problem. This is a "the importer slows down dramatically as the
repository grows and it really shouldn't" problem.
Post by Graham Triggs
A batch import is just that - a batch process. A non-interactive job
that churns through a bunch of data until it has exhausted all its
input. The speed of a batch process should only matter when:
1) You have a specific point in time that a process must be
completed by.
2) You are sitting there watching it.
3) You have so much data to process each and every day that you
can't possibly ever complete.
Even if it takes an hour to process 4,000 documents, that
still means you can import around 100,000 in a day. How many people are
close to needing to do that?
If it consistently took the same amount of time to ingest the same
number of items, then this would be a relevant point. It does not. It
takes an increasing amount of time to ingest the same number of items
based on the number of items in the repository. That is not a scalable
solution, because eventually the number of items in the repository
reaches a point where the number of items you can process in a
specific period drops below the number of items you need to process.
When that point is reached depends on the nature of the hardware
running your repository, but it will eventually be reached. On the
other hand, if the time taken by the importer scales according to the
size of the batch rather than the size of the repository, the issue
goes away.
Post by Graham Triggs
If the reason for speeding up the batch import is because it impacts
on the usability of the repository for that duration - that is not a
scalable system. It does not matter how much faster you make it,
there is always a finite limit to how many items you can process
with a single importer, running against a single repository, that
exists on a single machine.
The batch importer code as it is now, on large repositories, hammers
the database, because it is calling pruneIndexes() after every item
imported, which runs a number of database queries, which take an
increasing amount of time to run as the repository grows. Not only
does it make the batch importer *much* slower than it should be, but
yes, it impacts on everything else which is accessing that database.
As you say, that is not a scalable system.
Post by Graham Triggs
A scalable system can run imports all day long without affecting the
functionality or performance of the repository for users accessing
it concurrently. A scalable system can run 10, 20,... 100 importers
simultaneously without detrimental effects. A scalable system lets
you import millions of items an hour, by allowing you to utilise the
resources needed to do it, not by trying to squeeze a single process
into a finite resource.
If there exists a DSpace instance, running on the standard DSpace code
base, which is capable of supporting 100 simultaneous batch imports
importing millions of items an hour while not having any detrimental
effects, I would very much like to know how they did it and what their
hardware configuration is.
Post by Graham Triggs
Or to paraphrase your statement, once the repository gets large
enough, even "faster" can never be fast enough.
While this is true, it's largely meaningless; the issue is how large
the repository can get before you hit that wall, and how the wall can
be moved further away. And, once again, if the time taken by a task
scales with the size of the task, rather than the size of the
repository, the problem ceases to exist entirely.
Post by Graham Triggs
I would say stop worrying about how fast you can make a batch
import, and think about how you can reduce the impact on the system.
The numbers being posted are nowhere near being a concern for a non-
interactive process. Your batch import still only takes 9 secs ;)
I would say that if, for a 4,000-item import, we change the importer
so that it runs pruneIndexes() 3,999 fewer times than previously, that
is a significant reduction of the impact on the system. The batch
importer does not exist in a vacuum and the way it hammers the
database does also have an effect on everything else.

I confess that the reason why we are having this discussion at all
eludes me. It seems like a fairly obvious bug for the importer to
prune the indexes so many times (the comment for pruneIndexes() even
says "called from the public interfaces or at the end of a batch
indexing process"), it has a demonstrably detrimental effect on the
performance of the software, and the fix for it is not particularly
complicated. Is there something I've missed?

--
Simon Brown <***@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
Mark Diggory
2010-01-27 23:32:48 UTC
Permalink
Post by Simon Brown
I confess that the reason why we are having this discussion at all
eludes me. It seems like a fairly obvious bug for the importer to
prune the indexes so many times (the comment for pruneIndexes() even
says "called from the public interfaces or at the end of a batch
indexing process"), it has a demonstrably detrimental effect on the
performance of the software, and the fix for it is not particularly
complicated. Is there something I've missed?
--
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
We discuss it because we seek to maintain an appropriate separation of
concerns in our architecture. And because Graham usually challenges us
to look at aspects of that architecture that are important. What is
under discussion is not that performance can't be improved by your
patch, you've identified a very important issue in batch processing.
We are discussing, architecturally, whether we want to alter the
Context/EventManager framework and expose calls to pruneIndexes(). We
want to be careful to avoid exposing too much of the internals of the
Browse system outside in the application architecture.

Excellent work on finding a means to improve DSpace performance.

Cheers,
Mark
--
Mark R. Diggory
Head of U.S. Operations - @mire

http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get ***@ther
Simon Brown
2010-01-28 14:04:43 UTC
Permalink
Post by Mark Diggory
Post by Simon Brown
I confess that the reason why we are having this discussion at all
eludes me. It seems like a fairly obvious bug for the importer to
prune the indexes so many times (the comment for pruneIndexes() even
says "called from the public interfaces or at the end of a batch
indexing process"), it has a demonstrably detrimental effect on the
performance of the software, and the fix for it is not particularly
complicated. Is there something I've missed?
--
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
We discuss it because we seek to maintain an appropriate separation of
concerns in our architecture. And because Graham usually challenges us
to look at aspects of that architecture that are important. What is
under discussion is not that performance can't be improved by your
patch, you've identified a very important issue in batch processing.
We are discussing architecturally if we want to alter the
Context/EventManager framework and expose calls to pruneIndex. We
want to be careful to avoid exposing too much of the internals of the
Browse system outside in the application architecture.
That's fine; as I said in my initial submission, I know that my patch
isn't the best way of writing that code. I should have made
pruneIndexes() package private rather than public, for one thing. It
was done to get the patch out and under discussion quickly and in a
way which would hopefully minimise the effect on the rest of the code
base.

Having dug through the code a little more in the meantime, it seems
that the effect of pruneIndexes() is to remove from the browse indexes
information about items which are expunged and/or withdrawn; in that
light it might not be necessary to call it when items are added or
changed at all, thus reducing the patch to a single-line change. If
that's the case I'll happily withdraw mine in favour of the new one. :)
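A hypothetical sketch of what that single-line reduction could look like; whether the withdraw/expunge code paths already prune everything they need to is exactly the open question, so treat this as an illustration of Simon's idea, not an agreed change:

// In org.dspace.browse.IndexBrowse (hypothetical):
public void indexItem(Item item) throws BrowseException
{
    if (item.isArchived() || item.isWithdrawn())
    {
        indexItem(new ItemMetadataProxy(item));
        // pruneIndexes() call removed: pruning only discards rows left behind
        // by expunged/withdrawn items, so plain adds and edits would no longer
        // trigger it; the withdraw/expunge code paths would still prune.
    }
}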
Post by Mark Diggory
Excellent work on finding a means to improve DSpace performance.
Thank you.

--
Simon Brown <***@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
Graham Triggs
2010-01-28 17:46:17 UTC
Permalink
Post by Simon Brown
Having dug through the code a little more in the meantime, it seems
that the effect of pruneIndexes() is to remove from the browse indexes
information about items which are expunged and/or withdrawn; in that
light it might not be necessary to call it when items are added or
changed at all, thus reducing the patch to a single-line change. If
that's the case I'll happily withdraw mine in favour of the new one. :)
Simon,

Can you provide me with a dump of your bi_* tables? I would like to look into the performance of those queries.

Regards,
G
Graham Triggs
2010-01-28 20:58:05 UTC
Permalink
Post by Simon Brown
Having dug through the code a little more in the meantime, it seems
that the effect of pruneIndexes() is to remove from the browse indexes
information about items which are expunged and/or withdrawn; in that
light it might not be necessary to call it when items are added or
changed at all,
pruneIndexes() only removes data from the browse indexes, but the conditions under which it needs to occur are more subtle than that:

1) bi_item and bi_withdrawn

a) the bi_item table needs to be pruned if you withdraw an item.
b) the bi_withdrawn table needs to be pruned if you reinstate an item.
c) either table needs to be pruned when you expunge an item, depending on the state the item was in at the time

2) metadata tables - bi_1_dis, bi_1_dmap, bi_2_dis, bi_2_dmap, etc..

a) the _dis and _dmap tables for a given index number need to be pruned any time that the metadata (author, subject, etc.) that is being indexed by them is changed.
b) all the _dis and _dmap tables need to be pruned whenever an item is withdrawn or expunged.


I've done some more research on the problem. First, the following posts:

http://archives.postgresql.org/pgsql-performance/2009-01/msg00276.php
http://archives.postgresql.org/pgsql-performance/2009-01/msg00280.php

highlight the difference between doing an EXCEPT between two SELECTs (as is currently in the browse code) and a NOT IN (which would be the alternative).


Further, if you look at the Postgres 8.4 release docs:

http://developer.postgresql.org/pgdocs/postgres/release-8-4.html

you'll see that EXCEPT can now use hash aggregates, which is faster than the existing implementation using sorts.


The story continues though. The post here:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00046.php

indicates that hash aggregates are only used when they can fit in work_mem.


I did some testing using fabricated tables consisting of 150,000 entries.

set work_mem ='64kB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

"Hash Semi Join (cost=50938.90..55518.35 rows=200 width=6) (actual time=888.268..888.268 rows=0 loops=1)"
" Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.014..0.014 rows=1 loops=1)"
" -> Hash (cost=48550.90..48550.90 rows=150000 width=4) (actual time=888.242..888.242 rows=0 loops=1)"
" -> Subquery Scan "ANY_subquery" (cost=45550.90..48550.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
" -> SetOp Except (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
" -> Sort (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)"
" Sort Key: "*SELECT* 1".id"
" Sort Method: external merge Disk: 5272kB"
" -> Append (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.007..222.252 rows=300000 loops=1)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.007..94.056 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.007..43.727 rows=150000 loops=1)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.009..83.799 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dmap (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.008..44.104 rows=150000 loops=1)"
"Total runtime: 954.148 ms"


set work_mem ='64MB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);


"Hash Semi Join (cost=11611.00..14488.52 rows=200 width=6) (actual time=396.518..396.518 rows=0 loops=1)"
" Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.017..0.017 rows=1 loops=1)"
" -> Hash (cost=9736.00..9736.00 rows=150000 width=4) (actual time=396.460..396.460 rows=0 loops=1)"
" -> Subquery Scan "ANY_subquery" (cost=0.00..9736.00 rows=150000 width=4) (actual time=396.459..396.459 rows=0 loops=1)"
" -> HashSetOp Except (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)"
" -> Append (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.008..98.401 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.008..51.253 rows=150000 loops=1)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.010..86.050 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dmap (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..45.474 rows=150000 loops=1)"
"Total runtime: 399.273 ms"


Setting the work_mem to a value that is larger than required (I've not cut it back to see where the cut-off point is) results in an execution time that is 40% of the original query's. I believe you also mentioned disk activity on the Postgres server, and as you can see in the initial plan the sort is using a 5272kB disk file. The second execution does not appear to use the disk.


So, that's a 60% improvement without altering a single line of code, immediately cutting the 5 hour import to 2 hours, but more importantly being pervasive throughout the entire repository. Every single operation to submit new records, edit or remove items from the DSpace instance will see a 60% improvement, and no disk thrashing of the Postgres server, so you will likely see better throughput for non-changing operations whilst any changes are being processed.
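As a hedged illustration of applying that observation per session (rather than editing postgresql.conf globally), one might raise work_mem on the connection before the prune query runs. The 64MB figure is simply the value used in the EXPLAIN ANALYZE above, the connection details are placeholders, and the PostgreSQL JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WorkMemDemo
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder connection details, for illustration only.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dspace", "dspace", "dspace");
        try
        {
            Statement st = conn.createStatement();
            // Raise work_mem for this session so the EXCEPT can use a hash
            // aggregate (PostgreSQL 8.4+) instead of an on-disk merge sort.
            st.execute("SET work_mem = '64MB'");
            st.executeUpdate("DELETE FROM bi_2_dis WHERE id IN "
                    + "(SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap)");
            st.close();
        }
        finally
        {
            conn.close();
        }
    }
}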


OK, 2 hours is a fair bit longer than 16 mins, but now that we've actually improved the scalability of the instance to about the level that Postgres will allow, we can look at improving the performance of the individual operation (and even your patched version will have seen a modest improvement with the optimized Postgres configuration).

Well, patching the batch import process to delay the pruneIndexes() call to the end is an option, and we've looked at a cleaner way of implementing the same result.

Although there could be a residual issue with such a change, as you are having to hold a reference to every item that you import until the end of the process. That's going to cause an issue with the number of items that you can import in one run.

Now, let's look back at Richard Rodgers's suggestion. We've already taken 60% off of the pruning part of index-update. But then, in your import - and in Richard's suggestion? - the SearchConsumer was still active, so you are incrementally updating the Lucene index. If you follow the approach of using index-update at the end of the batch import, that updates the search index as well as regenerating the browse entries. So we can actually remove both the SearchConsumer and BrowseConsumer from the batch import, saving more time than before.

Now, index-update itself only adds the changes to the Lucene index, but recreates the whole contents of the browse tables. That could be avoided by adding an update method that only finds and indexes item ids that are not already in the bi_item or bi_withdrawn table.

(Admittedly, that's not a perfect version of update - to do that, you would need to index modified items. It's easy enough to achieve if you add a timestamp column to bi_item and bi_withdrawn that records the last_modified value of the item at the time of indexing)
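A hypothetical sketch of the query such an update method might start from; the item, bi_item and bi_withdrawn table/column names follow the DSpace 1.x schema discussed in this thread, but this is an illustration of the idea rather than the actual index-update code:

// Select only items that have never been indexed for browse. Each returned
// item_id would then be indexed, with a single pruneIndexes() at the end.
// Handling *modified* items as well would need the last_modified timestamp
// column mentioned above.
String newItemsSql =
      "SELECT item_id FROM item"
    + " WHERE (in_archive = true OR withdrawn = true)"
    + "   AND item_id NOT IN (SELECT item_id FROM bi_item)"
    + "   AND item_id NOT IN (SELECT item_id FROM bi_withdrawn)";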

But either way... tuning the Postgres installation will significantly reduce overhead and improve overall scalability of the repository. The simple procedural change to the way the import is run is probably 'good enough' for now. Enhancing the index-update process to only deal with new and changed items will likely be equivalent to the patched importer.

And without the negative scalability aspects of the increased memory usage of holding all imported items in memory.

Regards,
G

Graham Triggs
Technical Architect
Open Repository
Graham Triggs
2010-01-28 00:20:42 UTC
Permalink
Post by Simon Brown
Happily, we weren't doing that. We were bringing to light a scalability
issue by highlighting the incredible reduction in import speed as the
repository grows. This is not a "the importer is too slow" problem. This is
a "the importer slows down dramatically as the repository grows and it
really shouldn't" problem.
Well, things are going to get slower as you add more data! That is
unavoidable, although you can do better/worse jobs about how fast an
individual process runs, how many concurrent processes you can have, how
much overall throughput you have. Ultimately, 'a' repository will have to
turn into multiple shards - it simply can't go on as a monolithic entity
forever.

So what comes first? Does the repository become unusable overall before batch
importing becomes 'impossible'? Or does the batch import slow to the point
that it can never complete whilst the repository itself can still serve
users happily?

Does moving the indexing from being incremental to one big process at the
end create a single big transaction that creates a big log jam for anyone
trying to use the repository at that time?

How many disk reads is the DB server performing? Is the machine swapping?
Could it simply need more memory assigned for caching browse tables?

All we have is one metric - time elapsed - and on that basis I can't judge
that this is a scalability fix. I can only say that it is a performance
improvement in your situation.
If it consistently took the same amount of time to ingest the same number
of items, then this would be a relevant point. It does not. It takes an
increasing amount of time to ingest the same number of items based on the
number of items in the repository. That is not a scalable solution, because
eventually the number of items in the repository reaches a point where the
number of items you can process in a specific period drops below the number
of items you need to process. When that point is reached depends on the
nature of the hardware running your repository, but it will eventually be
reached. On the other hand, if the time taken by the importer scales
according to the size of the batch rather than the size of the repository,
the issue goes away.
That's a reasonable point. Although, I will say that you have an increasing
amount of data being stored in the database tables which will (other things
being equal) slow down table reads - which have to occur during the index
process to avoid expensive queries when the user is accessing the system.
Columns in those tables are indexed, which will get larger as you add more
records, and so require more memory to cache (or else get much slower). And
the larger the column indexes get, the slower they will be to add to and
update, or the slower lookups during user access will become.

It's inevitable (assuming no other changes to the system) that adding 4000
records when you have lots of data in the repository will take longer than
adding 4000 records when you have none. Certainly, it is if you expect to
keep things as optimal for the general user accessing the repository as
possible. Whether that performance difference is excessive, can be
reasonably reduced without causing problems for users accessing the site
normally - whether you can or should optimise the way the system is
configured - that's the question.

I can easily pose this problem in other ways... what happens if you have 400
users trying to deposit 10 items each? What happens if you have publishers
depositing 4000 records via SWORD?
The batch importer code as it is now, on large repositories, hammers the
database, because it is calling pruneIndexes() after every item imported,
which runs a number of database queries, which take an increasing amount of
time to run as the repository grows. Not only does it make the batch
importer *much* slower than it should be, but yes, it impacts on everything
else which is accessing that database. As you say, that is not a scalable
system.
I'll admit that I haven't traced the Postgres code; I worked on the Oracle
queries - which are basically the same, although the behaviour may be
different. For each metadata index table it runs 2 deletes - plus 1 each for
item and withdrawn - for a small set of ids determined by the difference
between two select statements. All the columns used in the selection
criteria are indexed.

In testing, it was by far the most efficient (not necessarily fastest, but
least database load) required for resolving the correct behaviour. Maybe
retesting with a much larger dataset, and/or with Postgres might show up
something different.

It certainly looks like an issue with the indexes on the browse table
columns - whether that could be the need for more memory or vacuuming,
maybe, maybe not.
Post by Graham Triggs
A scalable system can run imports all day long without affecting the
functionality or performance of the repository for users accessing it
concurrently. A scalable system can run 10, 20,... 100 importers
simultaneously without detrimental effects. A scalable system lets you
import millions of items an hour, by allowing you to utilise the resources
needed to do it, not by trying to squeeze a single process into a finite
resource.
If there exists a DSpace instance, running on the standard DSpace code
base, which is capable of supporting 100 simultaneous batch imports
importing millions of items an hour while not having any detrimental
effects, I would very much like to know how they did it and what their
hardware configuration is.
It doesn't. I've yet to see any institutional repository that is designed to
be truly scalable - they are all fundamentally limited to what a single
machine installation can handle. Whether that is a problem or a priority is
an open question. If you want scalability, how much scalability do you
really want?


While this is true, it's largely meaningless; the issue is how large the
repository can get before you hit that wall, and how the wall can be moved
further away. And, once again, if the time taken by a task scales with the
size of the task, rather than the size of the repository, the problem ceases
to exist entirely.
Except we are talking about one thing. A batch import. What other walls
might we hit before we hit the wall of a batch import? Will we hit any of
those walls sooner by pushing out the wall of a batch import?
I would say that if, for a 4,000-item import, we change the importer so
that it runs pruneIndexes() 3,999 fewer times than previously, that is a
significant reduction of the impact on the system. The batch importer does
not exist in a vacuum and the way it hammers the database does also have an
effect on everything else.
I confess that the reason why we are having this discussion at all eludes
me. It seems like a fairly obvious bug for the importer to prune the indexes
so many times (the comment for pruneIndexes() even says "called from the
public interfaces or at the end of a batch indexing process"), it has a
demonstrably detrimental effect on the performance of the software, and the
fix for it is not particularly complicated. Is there something I've missed?
So, I've looked at the patch. My concerns:

1) You've exposed the internal workings of the IndexBrowse class. For
clarity, I'll duplicate the method that is being called by the browse
consumer here:

public void indexItem(Item item) throws BrowseException
{
    if (item.isArchived() || item.isWithdrawn())
    {
        indexItem(new ItemMetadataProxy(item));
        pruneIndexes();
    }
}

That's three functional lines. So, instead of exposing the internal
workings, and adding an ugly indexNoPrune method, why not:

public void indexItems(Collection<Item> items) throws BrowseException
{
    try
    {
        for (Item item : items)
        {
            if (item.isArchived() || item.isWithdrawn())
            {
                indexItem(new ItemMetadataProxy(item));
            }
        }
    }
    finally
    {
        pruneIndexes();
    }
}

and simply pass the Set in from the BrowseConsumer?
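For instance, the consumer's end() method might then look something like the sketch below; the 'toUpdate' set and the exact bookkeeping inside BrowseConsumer are assumptions made for illustration:

// Hypothetical BrowseConsumer.end(), passing the accumulated items in one go.
public void end(Context ctx) throws Exception
{
    if (toUpdate != null && !toUpdate.isEmpty())
    {
        IndexBrowse ib = new IndexBrowse(ctx);
        ib.indexItems(toUpdate);   // indexes each item, prunes once in the finally block
    }
    toUpdate = null;
}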

2) You commit the database connection directly, bypassing the Context
commit. This prevents ALL event consumers from having the opportunity to
process the addition of a new item at the time it occurs. Whilst this might
be appropriate for your and maybe many more importing scenarios, that's not
suitable for general purpose. Especially as you could simply replace the
BrowseConsumer (and any other consumer) with an implementation that moves
the indexing code from finish() to end(), which should mean that it is only
called when the context is destroyed, not for each commit. (Alternatively, a
static method could allow you to process the current batch of updates when
you choose).
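A purely hypothetical sketch of that "static method" alternative: a consumer that only records which items changed, plus an explicit call the importer makes once at the end. All class and method names here are invented for illustration, and this is not existing DSpace code:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.dspace.browse.IndexBrowse;
import org.dspace.content.Item;
import org.dspace.core.Constants;
import org.dspace.core.Context;
import org.dspace.event.Consumer;
import org.dspace.event.Event;

public class DeferredBrowseConsumer implements Consumer
{
    // Item IDs seen across every commit of the import run.
    private static final Set<Integer> pending =
            Collections.synchronizedSet(new HashSet<Integer>());

    public void initialize() { }

    public void consume(Context ctx, Event event)
    {
        if (event.getSubjectType() == Constants.ITEM)
        {
            pending.add(Integer.valueOf(event.getSubjectID()));   // just remember it
        }
    }

    public void end(Context ctx) { }      // deliberately do nothing per commit

    public void finish(Context ctx) { }

    // Called explicitly by the importer once the whole batch is done.
    public static void processBatch(Context ctx) throws Exception
    {
        IndexBrowse ib = new IndexBrowse(ctx);
        for (Integer id : pending)
        {
            ib.indexItem(Item.find(ctx, id.intValue()));
        }
        pending.clear();
    }
}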


But, as I said before, this is being debated because there are a lot more
questions about the circumstances under which this is occurring, and a lot
more analysis that needs to be done to say whether it is truly a scalability
improvement, or just a performance boost for a specific situation. It
doesn't do anything to address the ingest time of an item via the web
interfaces - which could be a concern given what you are saying - or how the
change impacts on general repository usage, and system configuration may
still play a part.

Regards,
G
Simon Brown (JIRA)
2010-01-20 18:10:58 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11090#action_11090 ]

Simon Brown edited comment on DS-470 at 1/20/10 6:09 PM:
---------------------------------------------------------

I have tried to keep the changes within the context of the event system as much as possible; any run of the event queue in which multiple items are indexed for browsing should only need to prune the browse indexes once, which is what one of the changes I made does.

There are a couple of problems with running index-update separately:

The first is that it's an additional process which needs to be launched manually.

The second is that, on a repository of the size of our test repository (at the time of this post, around 125,000 items, or around 60% of the size of our live repository) index-update takes just under sixteen and a half minutes to run. That's roughly the same amount of time as the last batch of 4000 items took to import with our patched batch importer. On the assumption that using the noindex dispatcher would shave off some of the time spent by our patched importer, that still makes the noindex importer + index-update take nearly twice as long as the batch importer with the patch. And the time taken to run index-update will scale more or less linearly with the number of items in the repository.

Graham Triggs (JIRA)
2010-01-27 00:01:00 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11105#action_11105 ]

Graham Triggs commented on DS-470:
----------------------------------

I'm looking at the possibility of having the indexer determine whether it needs to call pruneIndexes on a per-item basis (in theory, it shouldn't need to if it is only adding new data). Or alternatively, treat the indexing as a transactional process - i.e. explicit start() / index() / index() / commit() operations.

pruneIndexes() is very much an internal implementation detail of how the indexer updates the browse tables, and as such it really should not be exposed as part of the public API.
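To make the shape of that concrete, a hypothetical usage sketch is below; start(), index() and commit() are the operations Graham outlines, not methods that exist on IndexBrowse today:

IndexBrowse ib = new IndexBrowse(context);
ib.start();                  // hypothetical: open a batch indexing "transaction"
for (Item item : importedItems)
{
    ib.index(item);          // hypothetical: queue/update browse rows for this item
}
ib.commit();                 // hypothetical: decide what pruning is needed and do it once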
Mark Diggory (JIRA)
2010-01-27 02:57:00 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11106#action_11106 ]

Mark Diggory commented on DS-470:
---------------------------------

I agree with Graham's assessment here.

On a related tangent, we may at some point want to provide some transactionality within the EventManager. However, I would also note that we are working to leave the EventManager behind in favor of the EventService and Listeners as the mechanism for propagating events. We should look into implementing EventListener shims that replicate the event processing the EventManager does today.
Simon Brown (JIRA)
2010-01-27 11:35:00 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11107#action_11107 ]

Simon Brown commented on DS-470:
--------------------------------

Making pruneIndexes() package-private would presumably solve the visibility problem; it would still be callable from BrowseConsumer but not from anywhere outside of the browse package, and this patch could be changed to do that without affecting anything else. Making addItem() package-private as well would expose all the places in the code that call it directly rather than using the event system, but if the plan is to move away from that event system that may not be especially useful.

I assume that pruneIndexes() is called from addItem() simply as part of an overall need to make sure the indexes are pruned regularly. If there were some other way to approach that, it might be possible to drop the call to pruneIndexes() from any part of the process of adding a new item, which would also be a solution I'd be completely happy with.
Graham Triggs (JIRA)
2010-01-27 12:07:59 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11108#action_11108 ]

Graham Triggs commented on DS-470:
----------------------------------

It's been a little while since my head was completely in the browse process, so I'm winging it a little bit. But my understanding is that it calls pruneIndexes() because it could be re-indexing an existing item, which may have had terms removed in its edit.

For a completely new item, it should theoretically be possible to not call pruneIndexes() at all. It should also be easy to test this when entering indexItem() by seeing if the item id occurs in either the bi_item or bi_withdrawn table. If it doesn't, then it's a fresh item as far as the browse system is concerned.
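To make that check concrete, a minimal sketch of the lookup might look like the following (item_id as the key column of the browse tables is an assumption about the schema here, and this is not taken from the attached patch):

-- If both of these return no rows, the item has never been browse-indexed,
-- so the prune could safely be skipped for it:
SELECT 1 FROM bi_item WHERE item_id = ?;
SELECT 1 FROM bi_withdrawn WHERE item_id = ?;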
Tim Donohue (JIRA)
2010-01-27 21:35:59 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Donohue updated DS-470:
---------------------------

Fix Version/s: 1.6.1

We discussed this issue and the proposed patch at today's DSpace Developers meeting. See the IRC logs for the full discussion (search for DS-470):
http://www.duraspace.org/irclogs/index.php?date=2010-01-27

The general consensus was that this may need a bit more discussion about how best to resolve these issues. It's good there's a patch available for folks encountering this problem. But there are some concerns as to the patch implementation, and whether there would be ways to make this pruning "smarter" in general.

There's also a balance between "reducing load" versus "reducing time".
Tim Donohue (JIRA)
2010-01-27 22:28:00 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11115#action_11115 ]

Tim Donohue commented on DS-470:
--------------------------------

Hi Simon,

Just wanted to double check a few things. If my math is right, hypothetically speaking, if we were loading 4,000 items into your repository of 120,000 you are saying you would see the following:

* Before patch (4.7 secs/item): about 313 minutes of process time (5 hrs, 13mins)
* After patch (4.9 items/sec, or about 0.2 secs/item): about 13 minutes of process time

So, you've found the patch would decrease your processing time by 5 hours in this hypothetical situation?

Also, did you see a much larger load on the server after the patch (in terms of memory / cpu usage, etc) than before the patch? Trying to get a sense of whether decreasing processing time this drastically causes a large increase in server load.

We're trying to look at this from all angles to see what we can come up with in terms of recommendations, etc. We appreciate any other help you can provide, and I'm sure others will add more comments here, as this has been a topic of much discussion amongst the developers today (esp in #dspace IRC channel).
Tom De Mulder
2010-01-28 11:45:21 UTC
Permalink
Post by Tim Donohue (JIRA)
Just wanted to double check a few things. If my math is right,
hypothetically speaking, if we were loading 4,000 items into your
I shall respond to this, because I ran the tests; my colleague produced a
patch.
Post by Tim Donohue (JIRA)
* Before patch (4.7 secs/item): about 313 minutes of process time (5 hrs, 13mins)
* After patch (4.9 items/sec or about 0.2secs/ item): about 13 minutes of process time
So, you've found the patch would decrease your processing time by 5 hours in this hypothetical situation?
Yes.
Post by Tim Donohue (JIRA)
Also, did you see a much larger load on the server after the patch (in
terms of memory / cpu usage, etc) than before the patch? Trying to get a
sense of whether decreasing processing time this drastically causes a
large increase in server load.
Quite the contrary.

With the default configuration, the load on the database is enormous
during the entire import. This is because it's constantly hammering the
database to prune indexes.

With our patch, the load on the server drops dramatically, except after
the last item (the only point where the indexes are pruned), where it is
the same as with the default code.

I should point out that we run the DB and app on separate servers; the
load on the app server is never very high during batch imports; it is
almost entirely focused on the database. We have tested and demonstrated
(timing using two instances of siege, from separate servers) that the
higher the load on the database, the slower the load time for the
web pages.


Best regards,

--
Tom De Mulder <***@cam.ac.uk> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 28/01/2010 : The Moon is Waxing Gibbous (82% of Full)
k***@library.gatech.edu
2010-01-28 12:23:09 UTC
Permalink
Mark and all:

Even if the proposed patch doesn't fit in with the current architecture of the system, I think it would be useful to make a binary easily available with the fast import code.

Graham made some excellent points yesterday evening. I'm paraphrasing and may have muddled this a bit, but:
- Just because a system has been made faster in one area doesn't mean it's now scalable
- A gigantic system may break or become unusable in other areas and need other adjustments - for example, search indexes may need to be sharded.

Making the fast import tool available, at least as an option, would give organizations one means of quickly loading large amounts of their data into test systems so that they can start to poke at prototypes of gigantic systems and see where they might break.

I know that there are people with data collection, testing, and research skills at organizations that have access to large amounts of data, and experience with the DSpace system, who could justify spending staff resources on identifying the scalability issues if they could show a gigantic system now. This fast import tool would help them produce the giant test system.

Can the fast importer be made readily available somewhere as an aid to identifying and testing scalability issues in the current and future versions of DSpace?

thanks,
keith


----- Original Message -----
From: "Mark Diggory" <***@atmire.com>
To: "Simon Brown" <***@cam.ac.uk>, dspace-***@lists.sourceforge.net
Sent: Wednesday, January 27, 2010 6:32:48 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem


We discuss it because we seek to maintain an appropriate separation of
concerns in our architecture. And because Graham usually challenges us
to look at aspects of that architecture that are important. What is
under discussion is not that performance can't be improved by your
patch - you've identified a very important issue in batch processing.
We are discussing, architecturally, whether we want to alter the
Context/EventManager framework and expose calls to pruneIndexes(). We
want to be careful to avoid exposing too much of the internals of the
Browse system to the wider application architecture.

Excellent work on finding a means to improve DSpace performance.

Cheers,
Mark
--
Mark R. Diggory
Head of U.S. Operations - @mire
Richard, Joel M
2010-01-28 22:22:04 UTC
Permalink
Hi All,

I'm still new to DSpace and all its intricacies, so if this is a repeat of existing knowledge, forgive me.

Continuing Graham's findings, I thought I would throw this out there based on my experience having managed PostgreSQL over the past several years.

If you are using anything less than PgSQL 8.3, consider upgrading. If you are using 7.x, REALLY upgrade. The performance improvements will be significant overall. (My experience with 8.4 is nil as of yet.)

Secondly, and probably more importantly, PgSQL (even 8.3) ships with default settings that prefer compatibility over performance, so you really must tune it to your own system. It assumes a server with a very small amount of memory. I imagine this is to some degree what you are encountering and demonstrating in your investigations, Graham.

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I don't expect that many DSpace users have experience managing PgSQL, but it does require a bit of knowledge, even as far as modifying the kernel's shared memory settings. Our repository only has around 9,000 items so far, but I know for certain that it's un-tuned and not using nearly as many resources as it has available.
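For example, to see what an instance is currently using, and to try a larger sort/hash memory allowance for a single session before committing anything to postgresql.conf (the value below is purely illustrative, not a recommendation for any particular server):

SHOW shared_buffers;
SHOW work_mem;
-- Session-only experiment; a permanent change belongs in postgresql.conf:
SET work_mem = '32MB';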

If tuning PgSQL does end up solving this problem (at least for now?), then this info needs to be communicated somewhere (the wiki perhaps?) but tuning PgSQL is something of a black art. One person's settings could be disastrous for someone else. :)

--Joel



Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | ***@si.edu



________________________________
From: Graham Triggs <***@gmail.com>
Date: Thu, 28 Jan 2010 15:58:05 -0500
To: Simon Brown <***@cam.ac.uk>
Cc: Mark Diggory <***@atmire.com>, <dspace-***@lists.sourceforge.net>
Subject: Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem
Post by Simon Brown
Having dug through the code a little more in the meantime, it seems
that the effect of pruneIndexes() is to remove from the browse indexes
information about items which are expunged and/or withdrawn; in that
light it might not be necessary to call it when items are added or
changed at all,
pruneIndexes() only removes data from the browse indexes, but the conditions under which that removal needs to happen are more subtle than that:

1) bi_item and bi_withdrawn

a) the bi_item table needs to be pruned if you withdraw an item.
b) the bi_withdrawn table needs to be pruned if you reinstate an item.
c) either table needs to be pruned when you expunge an item, depending on the state the item was in at the time

2) metadata tables - bi_1_dis, bi_1_dmap, bi_2_dis, bi_2_dmap, etc..

a) the _dis and _dmap tables for a given index number need to be pruned any time that the metadata (author, subject, etc.) that is being indexed by them is changed.
b) all the _dis and _dmap tables need to be pruned whenever an item is withdrawn or expunged.


I've done some more research on the problem. First, the following posts:

http://archives.postgresql.org/pgsql-performance/2009-01/msg00276.php
http://archives.postgresql.org/pgsql-performance/2009-01/msg00280.php

highlight the difference of doing an EXCEPT between two SELECTs (as is currently in the browse code), versus a NOT IN (which would be the alternative).
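For reference, the two shapes being compared look roughly like this; the first is the existing prune query, the second the hypothetical NOT IN alternative (not what DSpace currently runs):

-- Existing shape:
DELETE FROM bi_2_dis WHERE id IN
  (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);
-- Alternative shape (note that NOT IN behaves badly if distinct_id can be NULL):
DELETE FROM bi_2_dis WHERE id NOT IN (SELECT distinct_id FROM bi_2_dmap);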


Further, if you look at the Postgres 8.4 release docs:

http://developer.postgresql.org/pgdocs/postgres/release-8-4.html

you'll see that EXCEPT can now use hash aggregates, which is faster than the existing implementation using sorts.


The story continues though. The post here:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00046.php

indicates that hash aggregates are only used when they can fit in work_mem.


I did some testing using fabricated tables consisting of 150,000 entries.

set work_mem ='64kB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

"Hash Semi Join (cost=50938.90..55518.35 rows=200 width=6) (actual time=888.268..888.268 rows=0 loops=1)"
" Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.014..0.014 rows=1 loops=1)"
" -> Hash (cost=48550.90..48550.90 rows=150000 width=4) (actual time=888.242..888.242 rows=0 loops=1)"
" -> Subquery Scan "ANY_subquery" (cost=45550.90..48550.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
" -> SetOp Except (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
" -> Sort (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)"
" Sort Key: "*SELECT* 1".id"
" Sort Method: external merge Disk: 5272kB"
" -> Append (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.007..222.252 rows=300000 loops=1)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.007..94.056 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.007..43.727 rows=150000 loops=1)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.009..83.799 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dmap (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.008..44.104 rows=150000 loops=1)"
"Total runtime: 954.148 ms"


set work_mem ='64MB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);


"Hash Semi Join (cost=11611.00..14488.52 rows=200 width=6) (actual time=396.518..396.518 rows=0 loops=1)"
" Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.017..0.017 rows=1 loops=1)"
" -> Hash (cost=9736.00..9736.00 rows=150000 width=4) (actual time=396.460..396.460 rows=0 loops=1)"
" -> Subquery Scan "ANY_subquery" (cost=0.00..9736.00 rows=150000 width=4) (actual time=396.459..396.459 rows=0 loops=1)"
" -> HashSetOp Except (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)"
" -> Append (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)"
" -> Subquery Scan "*SELECT* 1" (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.008..98.401 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dis (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.008..51.253 rows=150000 loops=1)"
" -> Subquery Scan "*SELECT* 2" (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.010..86.050 rows=150000 loops=1)"
" -> Seq Scan on bi_2_dmap (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..45.474 rows=150000 loops=1)"
"Total runtime: 399.273 ms"


Setting work_mem to a value that is larger than required (I've not cut it back to find the cut-off point) results in an execution time that is 40% of the original query's. I believe you also mentioned disk activity on the Postgres server, and as you can see in the initial plan the sort is using a 5272kB disk file. The second execution does not appear to use the disk.


So, that's a 60% improvement without altering a single line of code, immediately cutting the 5-hour import to 2 hours, but more importantly it is pervasive throughout the entire repository. Every single operation that submits new records or edits or removes items from the DSpace instance will see a 60% improvement, with no disk thrashing of the Postgres server, so you will likely see better throughput for non-changing operations whilst any changes are being processed.


OK, 2 hours is a fair bit longer than 16 minutes, but now that we've actually improved the scalability of the instance to about the level that Postgres will allow, we can look at improving the performance of the individual operations (and even your patched version will have seen a modest improvement with the optimized Postgres configuration).

Well, patching the batch import process to delay the pruneIndexes() call to the end is an option, and we've looked at a cleaner way of implementing the same result.

Although there could be a residual issue with such a change, as you are having to hold a reference to every item that you import until the end of the process. That's going to limit the number of items you can import in a single run.

Now, let's look back at Richard Rodgers' suggestion. We've already taken 60% off of the pruning part of index-update. But then, in your import - and in Richard's suggestion? - the SearchConsumer was still active, so you are incrementally updating the Lucene index. If you follow the approach of using index-update at the end of the batch import, that updates the search index as well as regenerating the browse entries. So we can actually remove both the SearchConsumer and BrowseConsumer from the batch import, saving more time than before.

Now, index-update itself only adds the changes to the Lucene index, but recreates the whole contents of the browse tables. That could be avoided by adding an update method that only finds and indexes item ids that are not already in the bi_item or bi_withdrawn table.

(Admittedly, that's not a perfect version of update - to do that, you would need to index modified items as well. It's easy enough to achieve if you add a timestamp column to bi_item and bi_withdrawn that records the last_modified value of the item at the time of indexing.)
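As a rough sketch of what that candidate selection could look like - item_id and last_modified on the item table are assumed, last_indexed is the hypothetical new timestamp column, and none of this is code from either attached patch:

SELECT i.item_id
  FROM item i
  LEFT JOIN bi_item b      ON b.item_id = i.item_id
  LEFT JOIN bi_withdrawn w ON w.item_id = i.item_id
 WHERE (b.item_id IS NULL AND w.item_id IS NULL)                   -- never indexed
    OR i.last_modified > COALESCE(b.last_indexed, w.last_indexed); -- changed since last indexing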

But either way... tuning the Postgres installation will significantly reduce overhead and improve overall scalability of the repository. The simple procedural change to the way the import is run is probably 'good enough' for now. Enhancing the index-update process to only deal with new and changed items will likely be equivalent to the patched importer.

And without the negative scalability aspects of the increased memory usage of holding all imported items in memory.

Regards,
G

Graham Triggs
Technical Architect
Open Repository
Graham Triggs (JIRA)
2010-02-10 15:04:00 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Graham Triggs updated DS-470:
-----------------------------

Attachment: prune.patch

prune.patch alters the pruning queries to reduce database load in cases where the database isn't tuned to cope with the existing queries, or the dataset is EXTREMELY large.

Using the amended queries does not remove the need for database tuning - these queries are still more efficient running on Postgres 8.4 than they are on pre-8.4 systems, and they do need at least a respectable amount of work_mem (the execution time will double in cases where work_mem is limited).

But they degrade in performance more gracefully than the existing queries, and are slightly less work in the optimal cases.
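The amended queries themselves are in the attachment; purely as an illustration of the general idea (and not necessarily the form prune.patch uses), an anti-join shape such as the following tends to degrade more gracefully than IN (... EXCEPT ...) when work_mem is tight:

DELETE FROM bi_2_dis d
 WHERE NOT EXISTS (SELECT 1 FROM bi_2_dmap m WHERE m.distinct_id = d.id);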
Tim Donohue (JIRA)
2010-02-17 22:02:59 UTC
Permalink
[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11182#action_11182 ]

Tim Donohue commented on DS-470:
--------------------------------
Tom De Mulder
2010-02-17 22:09:40 UTC
Permalink
[15:40] <kshepherd> DS-470 +1 to the general idea, but graham had some reasonable objections to viewing "speeding up batch jobs" as a priority over "reducing system load"
I'd like to point out that this has never been substantiated, and that we
have so far made clear that system load goes UP at the same time as these
batches SLOW DOWN. I don't know where you get this idea from that speeding
up batch times would negatively affect overall system performance.

The best way to reduce system impact here is to reduce the number of times
the indexes get pruned from N to 1, rather than to still do them (N-1)
times too many but slightly faster.

--
Tom De Mulder <***@cam.ac.uk> - Cambridge University Computing Service
New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 17/02/2010 : The Moon is Waxing Crescent (20% of Full)
Graham Triggs
2010-02-18 00:03:12 UTC
Permalink
Hi Tom,
Post by Tom De Mulder
I'd like to point out that this has never been substantiated, and that we
have so far made clear that system load goes UP at the same time as these
batches SLOW DOWN. I don't know where you get this idea from that speeding
up batch times would negatively affect overall system performance.
Can I clarify that I never stated that this particular case increased system
load.

I had (days ago) made the general point that making a query run faster can
make it take more resources, causing it to get much worse as the dataset
increases (which is almost what happened here - although this particular
query had originally been worked out on Oracle to reduce the resources it
takes, and then blindly converted to Postgres - which appeared OK initially,
but suffers with more data and an older or unoptimized Postgres instance).

But that's a general point, not specifically relating to this issue, in
order to make the case that if you want to demonstrate that this is a
SCALABILITY improvement, then you have to provide more than just the
execution time. Time elapsed is performance, and performance is NOT
scalability. It may often be the case that you simultaneously improve
performance and scalability, but it's not the case that they will always go
hand in hand.

So, I'm not saying that this patch does increase system load. However I do
have scalability concerns with how this patch is implemented - specifically,
how many items can be batch imported in one execution? Theoretically, the
existing importer could load an infinite number of items. This modification
WILL run out of memory after a finite number of items. How many will depend
on the size of the metadata.

If you want to deal with an arbitrary size of batch import (as well as
importing into a large repository), then you are better served following
Richard's suggestion to simply disable the indexing during batch, and
rebuild at the end. Which will be more overall load than your modification,
but has more general suitability (it shouldn't limit the number of items
that you can process in a single run).

Post by Tom De Mulder
The best way to reduce system impact here is to reduce the number of times
the indexes get pruned from N to 1, rather than to still do them (N-1)
times too many but slightly faster.
I quite agree that it would be good to reduce the number of times the indexes
are pruned in a batch import, which is why I voted +1 for resolving this
post-1.6, and given the potential issue of memory usage from holding all those
items in memory, I want to modify the browse code so that we can do an
incremental re-index - which would mean that you can import all your items
without indexing, and then at the end simply index just the new (or changed)
items, with a single prune at the end.

But what I was demonstrating was not to make those queries slightly faster,
but to make them more efficient - hash operations instead of sorts, a few
sequence scans instead of many loops of index scans. It's not about how fast
they are, but understanding how they are executed and what impact that has
on the system. And in doing so, understanding how to install and configure
the database so that the most efficient execution plans are used.

Because by doing that, we aren't just improving the batch importer. We're
improving the ingestion of new items via sword. The creation of new items
via the UI. And we're probably improving general user operations - like
browsing of items for an author (and/or restricted to a particular
collection), which will involve joins and will be more efficient if they are
using hash operations and not sorts.

G
Kim Shepherd
2010-02-18 00:30:32 UTC
Permalink
Apologies, it was my fault that the "load" vs "speed" issue came up; I think I've misquoted Graham there, though it did seem like that's what a lot of the discussion was initially about. The issue came up in a JIRA review at today's developer meeting, which is why we were commenting on it... I guess I should have kept my mouth shut :P


