bloom filter

大约 9 分钟

bloom filter

当系统设计中出现多级缓存结构时，为了防止大量不存在的key值击穿高速缓存（比如主存），去直接访问低速缓存（如本地磁盘），我们一般需要将这部分key值，直接拦截在高速缓存阶段。这里，当然可以使用普通的hash table，也可以使用bitmap，但是这两种方式都比较耗费内存，当面对海量key值时，问题会变得更加严重。这时，就该介绍我们的主角bloom filter出场了。

一般的，bloom filter用于判断一个key值是否在一个set中，拥有比hash table/bitmap更好的空间经济性。如果bloom filter指示一个key值“不在”一个set中，那么这个判断是100%准确的。这样的特性，非常适合于上述的缓存场景。

bloom filter原理

首先估计要判断的set中的元素个数N，然后选定k个独立的哈希函数。根据N和k，选定一个长度为M的bit array。
遍历set中的N个元素
- 对每个元素，使用k个哈希函数，得到k个哈希值（一般为一个大整数）
- 将上述bit array中，k个哈希值所对应的bit置1
对于需要判断的key值
- 使用k个哈希函数，得到k个哈希值
- 如果k个哈希值所对应的bit array中的值均为1，则判断此值在set中“可能”存在；否则，判定“一定”不存在

根据上面的原理我们其实可以看到，bloom filter有以下特点：

比较节省空间
bloom的识别准确率和数据大小，k个哈希函数有关
如果bloom filter判断key不存在，那么就一定不存在，100%不存在。
如果bloom filter判断key存在，那么可能存在，也可能不存在

bloom filter优缺点

优点：

插入、查找都是常数时间
多个hash函数之间互相独立，可以并行计算
不需要存储元素本身，从而带来空间效率优势，以及一些保密上的优势
bloom filter的bitmap可以进行交、并、差运算

缺点：

判断元素是否在集合中的结果其实是不准确的
bloom filter中的元素是不能删除的

bloom filter的实际使用

guava bloom filter

guava中提供了bloom filter的一种实现:com.google.common.hash.BloomFilter，可以方便我们在单机情况下使用bloom filter。

/**
 * A Bloom filter for instances of {@code T}. A Bloom filter offers an approximate containment test
 * with one-sided error: if it claims that an element is contained in it, this might be in error,
 * but if it claims that an element is <i>not</i> contained in it, then this is definitely true.
 *
 * <p>If you are unfamiliar with Bloom filters, this nice
 * <a href="http://llimllib.github.com/bloomfilter-tutorial/">tutorial</a> may help you understand
 * how they work.
 *
 * <p>The false positive probability ({@code FPP}) of a bloom filter is defined as the probability
 * that {@linkplain #mightContain(Object)} will erroneously return {@code true} for an object that
 * has not actually been put in the {@code BloomFilter}.
 *
 * <p>Bloom filters are serializable. They also support a more compact serial representation via the
 * {@link #writeTo} and {@link #readFrom} methods. Both serialized forms will continue to be
 * supported by future versions of this library. However, serial forms generated by newer versions
 * of the code may not be readable by older versions of the code (e.g., a serialized bloom filter
 * generated today may <i>not</i> be readable by a binary that was compiled 6 months ago).
 *
 * @param <T> the type of instances that the {@code BloomFilter} accepts
 * @author Dimitris Andreou
 * @author Kevin Bourrillion
 * @since 11.0
 */
@Beta
public final class BloomFilter<T> implements Predicate<T>, Serializable {

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final BitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /**
   * The strategy we employ to map an element T to {@code numHashFunctions} bit indexes.
   */
  private final Strategy strategy;

  /**
   * Creates a BloomFilter.
   */
  private BloomFilter(
      BitArray bits, int numHashFunctions, Funnel<? super T> funnel, Strategy strategy) {
    checkArgument(numHashFunctions > 0, "numHashFunctions (%s) must be > 0", numHashFunctions);
    checkArgument(
        numHashFunctions <= 255, "numHashFunctions (%s) must be <= 255", numHashFunctions);
    this.bits = checkNotNull(bits);
    this.numHashFunctions = numHashFunctions;
    this.funnel = checkNotNull(funnel);
    this.strategy = checkNotNull(strategy);
  }
  
  
  /**
   * Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and
   * expected false positive probability.
   *
   * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
   * will result in its saturation, and a sharp deterioration of its false positive probability.
   *
   * <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
   * {@code Funnel<T>} is.
   *
   * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
   * ensuring proper serialization and deserialization, which is important since {@link #equals}
   * also relies on object identity of funnels.
   *
   * @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
   * @param expectedInsertions the number of expected insertions to the constructed
   *     {@code BloomFilter<T>}; must be positive
   * @param fpp the desired false positive probability (must be positive and less than 1.0)
   * @return a {@code BloomFilter}
   */
  public static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, int expectedInsertions, double fpp) {
    return create(funnel, (long) expectedInsertions, fpp);
  }

  /**
   * Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and
   * expected false positive probability.
   *
   * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
   * will result in its saturation, and a sharp deterioration of its false positive probability.
   *
   * <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
   * {@code Funnel<T>} is.
   *
   * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
   * ensuring proper serialization and deserialization, which is important since {@link #equals}
   * also relies on object identity of funnels.
   *
   * @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
   * @param expectedInsertions the number of expected insertions to the constructed
   *     {@code BloomFilter<T>}; must be positive
   * @param fpp the desired false positive probability (must be positive and less than 1.0)
   * @return a {@code BloomFilter}
   * @since 19.0
   */
  public static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp) {
    return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
  }

  @VisibleForTesting
  static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

  /**
   * Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and a
   * default expected false positive probability of 3%.
   *
   * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
   * will result in its saturation, and a sharp deterioration of its false positive probability.
   *
   * <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
   * {@code Funnel<T>} is.
   *
   * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
   * ensuring proper serialization and deserialization, which is important since {@link #equals}
   * also relies on object identity of funnels.
   *
   * @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
   * @param expectedInsertions the number of expected insertions to the constructed
   *     {@code BloomFilter<T>}; must be positive
   * @return a {@code BloomFilter}
   */
  public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions) {
    return create(funnel, (long) expectedInsertions);
  }

  /**
   * Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and a
   * default expected false positive probability of 3%.
   *
   * <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
   * will result in its saturation, and a sharp deterioration of its false positive probability.
   *
   * <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
   * {@code Funnel<T>} is.
   *
   * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
   * ensuring proper serialization and deserialization, which is important since {@link #equals}
   * also relies on object identity of funnels.
   *
   * @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
   * @param expectedInsertions the number of expected insertions to the constructed
   *     {@code BloomFilter<T>}; must be positive
   * @return a {@code BloomFilter}
   * @since 19.0
   */
  public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
    return create(funnel, expectedInsertions, 0.03); // FYI, for 3%, we always get 5 hash functions
  }

// other codes ...
}

基本使用：

BloomFilter<Integer> filter = BloomFilter.create(Funnels.integerFunnel(),500,0.01);

filter.put(1);
filter.put(2);
filter.put(3);

assertThat(filter.mightContain(1)).isTrue();
assertThat(filter.mightContain(2)).isTrue();
assertThat(filter.mightContain(3)).isTrue();
assertThat(filter.mightContain(100)).isFalse();

正确估计预期插入数量是很关键的一个参数。当插入的数量接近或高于预期值的时候，布隆过滤器将会填满，这样的话，它会产生很多无用的误报点。

不过也有文章指出，guava的bloom filter在数据量变大以后，准确性大大降低，这个虽然本身就是bloom filter的特性，但是在这篇文章中也给出了一些参考值，大家可以看看: google guava bloom filter包的坑open in new window :

在0.0001的错误率下，插入量不到1.5亿的时候，numBits已经到达了BitArray的最大容量了，这时如果再增加插入量，哈希函数个数就开始退化。到5亿的时候，哈希函数个数退化到了只有3个，也就是说，对每一个key，只有3位来标识，这时准确率就会大大下降。第一种当然就是减少预期插入量，1亿以内，还是可以保证理论上的准确率的。第二种，如果你的系统很大，就是会有上亿的key，这时可以考虑拆分，将一个大的bloom filter拆分成几十个小的（比如32或64个），每个最多可以容纳1亿，这时整体就能容纳32或64亿的key了。查询的时候，先对key计算一次哈希，然后取模，查找特定的bloom filter即可。

基于redis的bloom filter

guava中的bloom filter是单机的，如果想使用分布式的的话，可以考虑基于redis 的bloom filter。

主要参考这个文章：ReBloom – Bloom Filter Datatype for Redisopen in new window

redis 在 4.0 的版本中加入了 module 功能，布隆过滤器可以通过module的形式添加到 redis 中，所以使用 redis 4.0 以上的版本可以通过加载 module 来使用 redis 中的布隆过滤器。但是这不是最简单的方式，使用 docker 可以直接在 redis 中体验布隆过滤器。

> docker run -d -p 6379:6379 --name bloomfilter redislabs/rebloom
> docker exec -it bloomfilter redis-cli

redis 布隆过滤器主要就两个命令：

bf.add 添加元素到布隆过滤器中：bf.add urls https://jaychen.cc
bf.exists 判断某个元素是否在过滤器中：bf.exists urls https://jaychen.cc

上面说过布隆过滤器存在误判的情况，在 redis 中有两个值决定布隆过滤器的准确率：

error_rate：允许布隆过滤器的错误率，这个值越低过滤器的位数组的大小越大，占用空间也就越大。
initial_size：布隆过滤器可以储存的元素个数，当实际存储的元素个数超过这个值之后，过滤器的准确率会下降。

redis 中有一个命令可以来设置这两个值：

bf.reserve urls 0.01 100

上面三个参数的含义为：

第一个值是过滤器的名字。
第二个值为 error_rate 的值。
第三个值为 initial_size 的值。

使用这个命令要注意一点：执行这个命令之前过滤器的名字应该不存在，如果执行之前就存在会报错：(error) ERR item exists

参考文章：

版权申明

本站点所有内容，版权均归https://wenchao.renopen in new window所有，除非明确授权，否则禁止一切形式的转载协议

打赏

支付宝

bloom filter

# bloom filter

# bloom filter原理

# bloom filter优缺点

# 优点：

# 缺点：

# bloom filter的实际使用

# guava bloom filter

# 基于redis的bloom filter

# 参考文章：

bloom filter

bloom filter原理

bloom filter优缺点

优点：

缺点：

bloom filter的实际使用

guava bloom filter

基于redis的bloom filter

参考文章：