ABSTRACT
The inverted index is the core data structure of Information Retrieval (IR) systems. With the rapid growth of digitalization, massive inverted indexes have to be maintained, and they are usually stored on hard disks. Although hard disks offer high capacity and low price, they are much slower than the CPU and memory, and given their mechanical nature this gap is unlikely to close in the near future, which makes I/O a bottleneck of IR systems. On the other hand, flash-based Solid State Drives (SSDs) have become a hot research topic in the field of data storage. Compared with conventional hard disks, the most outstanding advantage of SSDs is their much higher I/O performance. Therefore, if inverted indexes are stored on SSDs instead of hard disks, the overall performance of IR systems will improve. However, existing index management strategies are all designed for hard disks, and given the unique characteristics of SSDs, these approaches not only fail to make full use of the SSD but can also be harmful to it.
First, this thesis analyzes existing index construction and maintenance approaches on SSDs. Because these conventional strategies were all designed for hard disks, the experimental results show that the in-place method is inefficient and produces a large number of random writes, while the merge-based method generates heavy write traffic on the SSD, which could reduce its lifetime. Based on this analysis, the proposed index construction and maintenance strategies still follow the basic idea of merge-based methods, but aim to eliminate the extra write traffic triggered by merge events.
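To illustrate why merge events can inflate write traffic, the following is a minimal toy model and not the method evaluated in the thesis. It assumes a simplified "immediate merge" policy in which every flush of the in-memory buffer rewrites the entire on-disk index, and compares it with simply appending each flush as a separate partition. The function names and the posting counts are hypothetical and chosen only for illustration.

```python
def immediate_merge_writes(num_flushes: int, postings_per_flush: int) -> int:
    """Total postings written if every flush rewrites the whole on-disk index.

    Assumption (toy model): a merge event reads the old index, merges in the
    new postings, and writes the complete merged index back to storage.
    """
    on_disk = 0   # postings currently held in the on-disk index
    written = 0   # cumulative postings written to storage
    for _ in range(num_flushes):
        on_disk += postings_per_flush  # merged index = old postings + new postings
        written += on_disk             # the whole merged index is written again
    return written


def no_merge_writes(num_flushes: int, postings_per_flush: int) -> int:
    """Total postings written if each flush is stored as its own partition."""
    return num_flushes * postings_per_flush


if __name__ == "__main__":
    flushes, batch = 4, 100
    print("immediate merge:", immediate_merge_writes(flushes, batch))
    print("no merge:       ", no_merge_writes(flushes, batch))
```

Under these assumptions the merged variant writes a number of postings that grows quadratically with the number of flushes, while the append-only variant grows linearly. This is the kind of extra write traffic that matters on an SSD, where the number of writes affects device lifetime.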
Second, a new index construction approach and a new index maintenance approach are presented. The proposed index construction method stores the temporary index files which are produced