bloom filter
bloom filter
当系统设计中出现多级缓存结构时,为了防止大量不存在的key值击穿高速缓存(比如主存),去直接访问低速缓存(如本地磁盘),我们一般需要将这部分key值,直接拦截在高速缓存阶段。这里,当然可以使用普通的hash table
,也可以使用bitmap
,但是这两种方式都比较耗费内存,当面对海量key值时,问题会变得更加严重。这时,就该介绍我们的主角bloom filter
出场了。
一般的,bloom filter用于判断一个key值是否在一个set中,拥有比hash table/bitmap更好的空间经济性。如果bloom filter指示一个key值“不在”一个set中,那么这个判断是100%准确的。这样的特性,非常适合于上述的缓存场景。
bloom filter原理
首先估计要判断的set中的元素个数N,然后选定k个独立的哈希函数。根据N和k,选定一个长度为M的bit array。
遍历set中的N个元素
- 对每个元素,使用k个哈希函数,得到k个哈希值(一般为一个大整数)
- 将上述bit array中,k个哈希值所对应的bit置1
对于需要判断的key值
- 使用k个哈希函数,得到k个哈希值
- 如果k个哈希值所对应的bit array中的值均为1,则判断此值在set中“可能”存在;否则,判定“一定”不存在
根据上面的原理我们其实可以看到,bloom filter有以下特点:
- 比较节省空间
- bloom的识别准确率和数据大小,k个哈希函数有关
- 如果bloom filter判断key不存在,那么就一定不存在,100%不存在。
- 如果bloom filter判断key存在,那么可能存在,也可能不存在
bloom filter优缺点
优点:
插入、查找都是常数时间
多个hash函数之间互相独立,可以并行计算
不需要存储元素本身,从而带来空间效率优势,以及一些保密上的优势
bloom filter的bitmap可以进行交、并、差运算
缺点:
- 判断
元素是否在集合中
的结果其实是不准确的 - bloom filter中的元素是不能删除的
bloom filter的实际使用
guava bloom filter
guava中提供了bloom filter的一种实现:com.google.common.hash.BloomFilter
,可以方便我们在单机情况下使用bloom filter。
/**
* A Bloom filter for instances of {@code T}. A Bloom filter offers an approximate containment test
* with one-sided error: if it claims that an element is contained in it, this might be in error,
* but if it claims that an element is <i>not</i> contained in it, then this is definitely true.
*
* <p>If you are unfamiliar with Bloom filters, this nice
* <a href="http://llimllib.github.com/bloomfilter-tutorial/">tutorial</a> may help you understand
* how they work.
*
* <p>The false positive probability ({@code FPP}) of a bloom filter is defined as the probability
* that {@linkplain #mightContain(Object)} will erroneously return {@code true} for an object that
* has not actually been put in the {@code BloomFilter}.
*
* <p>Bloom filters are serializable. They also support a more compact serial representation via the
* {@link #writeTo} and {@link #readFrom} methods. Both serialized forms will continue to be
* supported by future versions of this library. However, serial forms generated by newer versions
* of the code may not be readable by older versions of the code (e.g., a serialized bloom filter
* generated today may <i>not</i> be readable by a binary that was compiled 6 months ago).
*
* @param <T> the type of instances that the {@code BloomFilter} accepts
* @author Dimitris Andreou
* @author Kevin Bourrillion
* @since 11.0
*/
@Beta
public final class BloomFilter<T> implements Predicate<T>, Serializable {
/** The bit set of the BloomFilter (not necessarily power of 2!) */
private final BitArray bits;
/** Number of hashes per element */
private final int numHashFunctions;
/** The funnel to translate Ts to bytes */
private final Funnel<? super T> funnel;
/**
* The strategy we employ to map an element T to {@code numHashFunctions} bit indexes.
*/
private final Strategy strategy;
/**
* Creates a BloomFilter.
*/
private BloomFilter(
BitArray bits, int numHashFunctions, Funnel<? super T> funnel, Strategy strategy) {
checkArgument(numHashFunctions > 0, "numHashFunctions (%s) must be > 0", numHashFunctions);
checkArgument(
numHashFunctions <= 255, "numHashFunctions (%s) must be <= 255", numHashFunctions);
this.bits = checkNotNull(bits);
this.numHashFunctions = numHashFunctions;
this.funnel = checkNotNull(funnel);
this.strategy = checkNotNull(strategy);
}
/**
* Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and
* expected false positive probability.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
* {@code Funnel<T>} is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
* @param expectedInsertions the number of expected insertions to the constructed
* {@code BloomFilter<T>}; must be positive
* @param fpp the desired false positive probability (must be positive and less than 1.0)
* @return a {@code BloomFilter}
*/
public static <T> BloomFilter<T> create(
Funnel<? super T> funnel, int expectedInsertions, double fpp) {
return create(funnel, (long) expectedInsertions, fpp);
}
/**
* Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and
* expected false positive probability.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
* {@code Funnel<T>} is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
* @param expectedInsertions the number of expected insertions to the constructed
* {@code BloomFilter<T>}; must be positive
* @param fpp the desired false positive probability (must be positive and less than 1.0)
* @return a {@code BloomFilter}
* @since 19.0
*/
public static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp) {
return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);
}
@VisibleForTesting
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
checkNotNull(funnel);
checkArgument(
expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
checkNotNull(strategy);
if (expectedInsertions == 0) {
expectedInsertions = 1;
}
/*
* TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
* is proportional to -log(p), but there is not much of a point after all, e.g.
* optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
*/
long numBits = optimalNumOfBits(expectedInsertions, fpp);
int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
try {
return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
}
}
/**
* Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and a
* default expected false positive probability of 3%.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
* {@code Funnel<T>} is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
* @param expectedInsertions the number of expected insertions to the constructed
* {@code BloomFilter<T>}; must be positive
* @return a {@code BloomFilter}
*/
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, int expectedInsertions) {
return create(funnel, (long) expectedInsertions);
}
/**
* Creates a {@link BloomFilter BloomFilter<T>} with the expected number of insertions and a
* default expected false positive probability of 3%.
*
* <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,
* will result in its saturation, and a sharp deterioration of its false positive probability.
*
* <p>The constructed {@code BloomFilter<T>} will be serializable if the provided
* {@code Funnel<T>} is.
*
* <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of
* ensuring proper serialization and deserialization, which is important since {@link #equals}
* also relies on object identity of funnels.
*
* @param funnel the funnel of T's that the constructed {@code BloomFilter<T>} will use
* @param expectedInsertions the number of expected insertions to the constructed
* {@code BloomFilter<T>}; must be positive
* @return a {@code BloomFilter}
* @since 19.0
*/
public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions) {
return create(funnel, expectedInsertions, 0.03); // FYI, for 3%, we always get 5 hash functions
}
// other codes ...
}
基本使用:
BloomFilter<Integer> filter = BloomFilter.create(Funnels.integerFunnel(),500,0.01);
filter.put(1);
filter.put(2);
filter.put(3);
assertThat(filter.mightContain(1)).isTrue();
assertThat(filter.mightContain(2)).isTrue();
assertThat(filter.mightContain(3)).isTrue();
assertThat(filter.mightContain(100)).isFalse();
正确估计预期插入数量是很关键的一个参数。当插入的数量接近或高于预期值的时候,布隆过滤器将会填满,这样的话,它会产生很多无用的误报点。
不过也有文章指出,guava的bloom filter在数据量变大以后,准确性大大降低,这个虽然本身就是bloom filter的特性,但是在这篇文章中也给出了一些参考值,大家可以看看: google guava bloom filter包的坑 :
在0.0001的错误率下,插入量不到1.5亿的时候,numBits已经到达了BitArray的最大容量了,这时如果再增加插入量,哈希函数个数就开始退化。到5亿的时候,哈希函数个数退化到了只有3个,也就是说,对每一个key,只有3位来标识,这时准确率就会大大下降。 第一种当然就是减少预期插入量,1亿以内,还是可以保证理论上的准确率的。 第二种,如果你的系统很大,就是会有上亿的key,这时可以考虑拆分,将一个大的bloom filter拆分成几十个小的(比如32或64个),每个最多可以容纳1亿,这时整体就能容纳32或64亿的key了。查询的时候,先对key计算一次哈希,然后取模,查找特定的bloom filter即可。
基于redis的bloom filter
guava中的bloom filter是单机的,如果想使用分布式的的话,可以考虑基于redis 的bloom filter。
主要参考这个文章:ReBloom – Bloom Filter Datatype for Redis
redis 在 4.0 的版本中加入了 module 功能,布隆过滤器可以通过module的形式添加到 redis 中,所以使用 redis 4.0 以上的版本可以通过加载 module 来使用 redis 中的布隆过滤器。但是这不是最简单的方式,使用 docker 可以直接在 redis 中体验布隆过滤器。
> docker run -d -p 6379:6379 --name bloomfilter redislabs/rebloom
> docker exec -it bloomfilter redis-cli
redis 布隆过滤器主要就两个命令:
bf.add
添加元素到布隆过滤器中:bf.add urls https://jaychen.cc
bf.exists
判断某个元素是否在过滤器中:bf.exists urls https://jaychen.cc
上面说过布隆过滤器存在误判的情况,在 redis 中有两个值决定布隆过滤器的准确率:
error_rate
:允许布隆过滤器的错误率,这个值越低过滤器的位数组的大小越大,占用空间也就越大。initial_size
:布隆过滤器可以储存的元素个数,当实际存储的元素个数超过这个值之后,过滤器的准确率会下降。
redis 中有一个命令可以来设置这两个值:
bf.reserve urls 0.01 100
上面三个参数的含义为:
- 第一个值是过滤器的名字。
- 第二个值为
error_rate
的值。 - 第三个值为
initial_size
的值。
使用这个命令要注意一点:执行这个命令之前过滤器的名字应该不存在,如果执行之前就存在会报错:(error) ERR item exists