Happily, we weren't doing that. We were bringing to light a scalability
issue by highlighting the incredible reduction in import speed as the
repository grows. This is not a "the importer is too slow" problem. This is
a "the importer slows down dramatically as the repository grows and it
really shouldn't" problem.
Well, things are going to get slower as you add more data! That is
unavoidable, although you can do a better or worse job of how fast an
individual process runs, how many concurrent processes you can have, and how
much overall throughput you have. Ultimately, 'a' repository will have to
turn into multiple shards; it simply can't go on as a monolithic entity.
So what comes first? Does the repository become unusable overall before batch
importing becomes 'impossible'? Or does the batch import slow to the point
that it can never complete, while the repository itself can still serve users?
Does moving the indexing from being incremental to one big process at the
end create a single big transaction that creates a big log jam for anyone
trying to use the repository at that time?
How many disk reads is the DB server performing? Is the machine swapping?
Could it simply need more memory assigned for caching browse tables?
All we have is one metric - time elapsed - and on that basis I can't judge
that this is a scalability fix. I can only say that it is a performance
improvement in your situation.
If it consistently took the same amount of time to ingest the same number
of items, then this would be a relevant point. It does not. It takes an
increasing amount of time to ingest the same number of items based on the
number of items in the repository. That is not a scalable solution, because
eventually the number of items in the repository reaches a point where the
number of items you can process in a specific period drops below the number
of items you need to process. When that point is reached depends on the
nature of the hardware running your repository, but it will eventually be
reached. On the other hand, if the time taken by the importer scales
according to the size of the batch rather than the size of the repository,
the issue goes away.
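To make the shape of that argument concrete, here is a purely illustrative cost model (nothing here is DSpace code, and the constants are invented; only the growth behaviour matters). If the per-item cleanup cost grows with repository size, total import time is roughly batch size times repository size; if cleanup runs once per batch, it is roughly batch size plus repository size:

```java
class ScalingModel {
    // Hypothetical unit costs: pruning touches browse tables whose cost
    // grows with repository size; indexing one item is roughly fixed cost.
    // The divisor 1000 is an arbitrary scale factor for illustration.
    static long perItemPrune(long batch, long repoSize) {
        // Prune after every item: cost multiplies with repository size.
        return batch * (1 + repoSize / 1000);
    }

    static long pruneOnce(long batch, long repoSize) {
        // Prune once per batch: repository size is paid for only once.
        return batch + (1 + repoSize / 1000);
    }
}
```

Under this model, importing 4,000 items into an empty repository and into a million-item repository costs about the same with a single prune, but differs by three orders of magnitude with per-item pruning.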
That's a reasonable point. Although, I will say that you have an increasing
amount of data being stored in the database tables, which will (other things
being equal) slow down table reads - reads which have to occur during the
index process to avoid expensive queries when the user is accessing the
system. Columns in those tables are indexed; those indexes get larger as you
add more records, and so require more memory to cache (or else get much
slower). And the larger the column indexes are, the slower they will be to
add to and update, or the slower they will be for lookups during user access.
It's inevitable (assuming no other changes to the system) that adding 4,000
records when you have lots of data in the repository will take longer than
adding 4,000 records when you have none. Certainly it is if you expect to
keep things as optimal as possible for the general user accessing the
repository. Whether that performance difference is excessive, whether it can
be reasonably reduced without causing problems for users accessing the site
normally - whether you can or should optimise the way the system is
configured - that's the question.
I can easily pose this problem in other ways... what happens if you have 400
users trying to deposit 10 items each? What happens if you have publishers
depositing 4000 records via SWORD?
The batch importer code as it is now, on large repositories, hammers the
database, because it calls pruneIndexes() after every item imported, which
runs a number of database queries that take an increasing amount of time to
run as the repository grows. Not only does it make the batch importer *much*
slower than it should be, but yes, it impacts everything else that is
accessing that database. As you say, that is not a scalable solution.
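The difference between the two behaviours can be sketched in a minimal, self-contained way (plain Java stand-ins, not DSpace's actual Item or IndexBrowse classes): count how often the expensive prune runs for a batch of a given size under each strategy.

```java
import java.util.List;

class PruneDemo {
    // Stand-in for IndexBrowse: just counts how often pruneIndexes() runs.
    static class FakeIndexBrowse {
        int pruneCalls = 0;
        void indexItem(String item) { /* write browse rows for the item */ }
        void pruneIndexes() { pruneCalls++; } // expensive on large repositories
    }

    // Current importer behaviour: prune after every single item.
    static int importPruningPerItem(List<String> items) {
        FakeIndexBrowse ib = new FakeIndexBrowse();
        for (String item : items) {
            ib.indexItem(item);
            ib.pruneIndexes(); // runs once per item: N prunes for N items
        }
        return ib.pruneCalls;
    }

    // Proposed behaviour: index the whole batch, prune once at the end.
    static int importPruningOnce(List<String> items) {
        FakeIndexBrowse ib = new FakeIndexBrowse();
        for (String item : items) {
            ib.indexItem(item);
        }
        ib.pruneIndexes(); // runs once per batch, regardless of batch size
        return ib.pruneCalls;
    }
}
```

For a 4,000-item batch, the first strategy issues the prune queries 4,000 times, the second exactly once.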
I'll admit that I haven't traced the Postgres code; I worked on the Oracle
queries - which are basically the same, although the behaviour may be
different. For each metadata index table it runs two deletes - plus one each
for item and withdrawn - for a small set of ids determined by the difference
between two select statements. All the columns used in the selection
criteria are indexed.
In testing, it was by far the most efficient approach (not necessarily the
fastest, but the least database load) for producing the correct behaviour.
Maybe retesting with a much larger dataset, and/or with Postgres, might show
up something different.
It certainly looks like an issue with the indexes on the browse table
columns - whether that comes down to needing more memory, or vacuuming -
maybe, maybe not.
A scalable system can run imports all day long without affecting the
functionality or performance of the repository for users accessing it
concurrently. A scalable system can run 10, 20,... 100 importers
simultaneously without detrimental effects. A scalable system lets you
import millions of items an hour, by allowing you to utilise the resources
needed to do it, not by trying to squeeze a single process into a finite
set of resources.
If there exists a DSpace instance, running on the standard DSpace code
base, which is capable of supporting 100 simultaneous batch imports
importing millions of items an hour while not having any detrimental
effects, I would very much like to know how they did it and what their
hardware configuration is.
It doesn't. I've yet to see any institutional repository that is designed to
be truly scalable - they are all fundamentally limited by what a single-machine
installation can handle. Whether that is a problem or a priority is
an open question. If you want scalability, how much scalability do you
actually need?
While this is true, it's largely meaningless; the issue is how large the
repository can get before you hit that wall, and how the wall can be moved
further away. And, once again, if the time taken by a task scales with the
size of the task, rather than the size of the repository, the problem ceases
to exist entirely.
Except we are talking about one thing. A batch import. What other walls
might we hit before we hit the wall of a batch import? Will we hit any of
those walls sooner by pushing out the wall of a batch import?
I would say that if, for a 4,000-item import, we change the importer so
that it runs pruneIndexes() 3,999 fewer times than previously, that is a
significant reduction of the impact on the system. The batch importer does
not exist in a vacuum, and the way it hammers the database also has an
effect on everything else.
I confess that the reason why we are having this discussion at all eludes
me. It seems like a fairly obvious bug for the importer to prune the indexes
so many times (the comment for pruneIndexes() even says "called from the
public interfaces or at the end of a batch indexing process"); it has a
demonstrably detrimental effect on the performance of the software, and the
fix for it is not particularly complicated. Is there something I've missed?
So, I've looked at the patch. My concerns:
1) You've exposed the internal workings of the IndexBrowse class. For
clarity, I'll duplicate the method that is being called by the browse
consumer:

    public void indexItem(Item item) throws BrowseException
    {
        if (item.isArchived() || item.isWithdrawn())
        {
            // ... index the item ...
        }
    }

That's three functional lines. So, instead of exposing the internal
workings, and adding an ugly indexNoPrune method, why not:

    public void indexItems(Collection<Item> items) throws BrowseException
    {
        for (Item item : items)
        {
            if (item.isArchived() || item.isWithdrawn())
            {
                // ... index the item ...
            }
        }
    }

and simply pass the Set in from the BrowseConsumer?
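A self-contained version of that suggested shape can be sketched with stand-in classes (these are illustrative substitutes, not DSpace's real Item or IndexBrowse types; the indexing call is represented by recording the item):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class IndexItemsSketch {
    // Minimal stand-in for DSpace's Item, exposing only the two flags
    // the filtering logic needs.
    static class Item {
        private final boolean archived, withdrawn;
        Item(boolean archived, boolean withdrawn) {
            this.archived = archived;
            this.withdrawn = withdrawn;
        }
        boolean isArchived()  { return archived; }
        boolean isWithdrawn() { return withdrawn; }
    }

    final List<Item> indexed = new ArrayList<>();

    // Index each eligible item in the batch; the caller (or this method)
    // can then prune once, rather than once per item.
    public void indexItems(Collection<Item> items) {
        for (Item item : items) {
            // An item that is neither archived nor withdrawn has never been
            // in the browse tables, so it can be skipped.
            if (item.isArchived() || item.isWithdrawn()) {
                indexed.add(item); // stands in for indexItem(item)
            }
        }
    }
}
```

Given one archived item, one withdrawn item, and one that is neither, only the first two would be indexed.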
2) You commit the database connection directly, bypassing the Context
commit. This prevents all event consumers from having the opportunity to
process the addition of a new item at the time it occurs. Whilst this might
be appropriate for your importing scenario, and maybe many more, it's not
suitable for general purpose use. Especially as you could simply replace the
BrowseConsumer (and any other consumer) with an implementation that moves
the indexing code from finish() to end(), which should mean that it is only
called when the context is destroyed, not for each commit. (Alternatively, a
static method could allow you to process the current batch of updates when
required.)
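The consumer-replacement idea can be sketched as follows, using invented method names and stand-in types rather than DSpace's actual consumer interface: events are accumulated as they arrive, finish() (run on every commit) deliberately does nothing, and all deferred index work, including a single prune, happens in end() when the context is destroyed.

```java
import java.util.HashSet;
import java.util.Set;

class DeferredConsumer {
    final Set<Integer> pendingItemIds = new HashSet<>();
    int pruneRuns = 0;

    // Record the event; the actual index work is deferred.
    void consume(int itemId) {
        pendingItemIds.add(itemId);
    }

    // Called on every commit: deliberately a no-op in this design.
    void finish() { }

    // Called once, when the context is destroyed: do all deferred work.
    void end() {
        for (int id : pendingItemIds) {
            // ... index item id in the browse tables ...
        }
        pruneRuns++; // prune the browse tables a single time
        pendingItemIds.clear();
    }
}
```

Even if a hundred items are consumed and committed individually, the prune still runs only once, at context teardown.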
But, as I said before, this is being debated because there are a lot more
questions about the circumstances under which this is occurring, and a lot
more analysis that needs to be done to say whether it is truly a scalability
improvement, or just a performance boost for a specific situation. It
doesn't do anything to address the ingest time of an item via the web
interfaces - which could be a concern given what you are saying - or how the
change impacts general repository usage; and maybe system configuration
could still play a part.