Abstract
Index maintenance strategy plays a crucial role in a full-text retrieval system that must handle dynamic text collections and satisfy users’ real-time query requirements. Existing index maintenance methods are mainly designed around the characteristics of the hard disk drive (HDD), and their performance is limited by the relatively low I/O performance of HDD. The emerging solid state disk (SSD) has many desirable merits compared with HDD, the most prominent being its high random-access performance. If SSD is properly used instead of HDD to store the inverted index, the system’s overall performance can be greatly improved. However, SSD has some characteristics that are entirely different from those of HDD, and existing index maintenance methods do not take them into account. Therefore, directly adopting existing methods to maintain an index on SSD will not only fail to make full use of SSD’s advantages but also harm the SSD.
First, the existing index maintenance methods are analyzed on SSD through experiments and found to be no longer suitable for SSD: the pure in-place method produces excessive random writes, while the merge-based method generates a massive volume of writes that appear unnecessary on SSD, imposing heavy traffic on the SSD and harming it. Based on the experimental results, we propose some principles for designing an SSD-based index maintenance strategy.
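The contrast between the two write patterns can be illustrated with a toy model. The sketch below is not the experimental setup of this work; it assumes a simplified cost model in which the in-place method issues one write per updated term and writes only the new postings, while the merge-based method is taken to be an immediate merge that rewrites the whole on-disk index at every flush. The function name `simulate_writes` and the batch format are hypothetical and chosen for illustration only.

```python
from collections import Counter


def simulate_writes(batches):
    """Count writes under a toy model of two index update methods.

    batches: list of dicts mapping term -> number of new postings that
    are flushed from memory to disk in that round.

    Assumptions (illustrative only):
      * in-place: one write per updated term, writing only new postings;
        relocation of full posting lists is ignored.
      * merge-based: immediate merge, i.e. the whole on-disk index is
        rewritten in one sequential write at every flush.
    """
    on_disk = Counter()          # term -> postings currently on disk
    inplace_ops = 0              # number of (random) write operations
    inplace_postings = 0         # postings written by in-place updates
    merge_ops = 0                # number of (sequential) write operations
    merge_postings = 0           # postings written by merges

    for batch in batches:
        # In-place: each updated term is appended at its own location.
        inplace_ops += len(batch)
        inplace_postings += sum(batch.values())

        # Both methods end the round with the same logical index content.
        on_disk.update(batch)

        # Immediate merge: old index plus new postings are written out again.
        merge_ops += 1
        merge_postings += sum(on_disk.values())

    return {
        "inplace_ops": inplace_ops,
        "inplace_postings": inplace_postings,
        "merge_ops": merge_ops,
        "merge_postings": merge_postings,
    }


if __name__ == "__main__":
    demo = [{"a": 2, "b": 1}, {"a": 1, "c": 3}, {"b": 2}]
    print(simulate_writes(demo))
```

Under these assumptions the in-place method performs more write operations but writes fewer postings in total, whereas the merge-based method performs fewer operations but rewrites existing postings repeatedly, which matches the qualitative observation above.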
Second, a new hybrid index update strategy is proposed. The strategy classifies all terms as short or long according to the length of their posting lists, and maintains their indexes separately using the no-merge and in-place methods, respectively, based on SSD’s fast random reads and relatively efficient semi-random writes. In this way, inefficient small random writes are avoided and the extra write operations caused by merging are also prevented. Experimental results demonstrate that, compared with existing methods, our design improves comprehensive index maintenance and query performance; meanwhile, it is friendlier to SSD, espec