Java On-Memory Efficient Key-Value Store

Problem Description

I have to store 111 million key-value pairs (one key can have multiple values - at most 2 or 3) whose keys are 50-bit integers and whose values are 32-bit (maximum) integers. Now, my requirements are:

  1. Fast insertion of a (Key, Value) pair [allowing duplicates]
  2. Fast retrieval of the value or values based on a key.

A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with no or little performance penalty. From web articles I gathered that a B+ tree, R+ tree, B-tree, compact multimap, etc. can be a nice solution for that. Can anybody help me:

Is there any Java library which properly satisfies all those needs (the above-mentioned or other data structures are also acceptable)? Actually, I want an efficient Java library data structure to store/retrieve key-value pairs that takes a small memory footprint and must operate entirely in memory.

NB: I have tried HashMultiMap (Guava with some modification using trove) as mentioned by Louis Wasserman, Kyoto/Tokyo Cabinet, etc. My experience with disk-backed solutions is not good, so please avoid those :). Another important point for choosing a library/data structure: keys are 50-bit (so if we assign 64 bits, 14 bits will be wasted), and values are 32-bit ints at most - mostly they are 10-14 bits. So we can save space there as well, as the sketch below illustrates.
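(A hedged aside on that last point: a 50-bit key and a value of at most 14 bits together occupy exactly 64 bits, so the common case could in principle be packed into a single long. The helpers below are purely illustrative and not from any existing library:)

    // Illustrative only: pack a 50-bit key and a value of at most 14 bits
    // into one 64-bit long (50 + 14 = 64). Values wider than 14 bits would
    // need a separate fallback path.
    static long pack(long key, int value) {
        return (key << 14) | (value & 0x3FFFL);
    }
    static long unpackKey(long packed) {
        return packed >>> 14;
    }
    static int unpackValue(long packed) {
        return (int) (packed & 0x3FFF);
    }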

Solution

I don't think there's anything in the JDK which will do this.

However, implementing such a thing is a simple matter of programming. Here is an open-addressed hashtable with linear probing, with keys and values stored in parallel arrays:

public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    /** Inserts a (key, value) pair; duplicate keys are allowed. */
    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    /** Returns every value stored under the given key. Relies on at least
        one empty slot existing, so never fill the table to capacity. */
    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

}

Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries takes up 1.32 GB for me. The bigger you make it, in excess of what you need to store the data, the faster insertions and lookups will be. I found that for 110 million entries with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference: with a load factor of 0.33 (4.00 GB, three times the space needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.
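A minimal usage sketch under the figures above (the entry count and load factor come from the measurements; the key and values are made up for illustration):

    // capacity = entries / loadFactor; 220 million slots of (long + int) = 2.64 GB
    int capacity = (int) (110_000_000L / 0.5);
    LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(capacity);

    map.put(823749817L, 42);          // duplicate keys are allowed
    map.put(823749817L, 99);
    int[] hits = map.get(823749817L); // {42, 99}, in probe order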

Note also that zero is not allowed as a key. If this is a problem, change the NULL sentinel to something else, and pre-fill the keys array with that value on creation, as in the sketch below.
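A hedged sketch of that change, assuming Long.MIN_VALUE never occurs as a real key:

    private static final long NULL = Long.MIN_VALUE; // sentinel: must never be a real key

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
        // mark every slot as empty; arrays default to 0, which is now a legal key
        java.util.Arrays.fill(keys, NULL);
    }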
