Linear Counting

Introduction to the streaming problem: Distinct Count

The question we want to answer is: how many distinct elements appear in a stream? The naive approach is to use a set or hash map, storing the frequency of each element keyed by its value. However, this solution consumes \(O(n \log u)\) space, where \(n\) is the number of distinct elements and \(u\) is the size of the universe they come from. Some even more naive, space-consuming methods were discussed in the previous lecture notes.
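For reference, the exact baseline might look like the following minimal Python sketch (the function name and the use of a built-in set are illustrative, not from the notes):

```python
def exact_distinct_count(stream):
    # Exact but space-hungry baseline: keep every distinct element.
    # With n distinct items drawn from a universe of size u, this needs
    # roughly O(n log u) bits just to store the elements themselves.
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)
```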

To optimize for space, we accept a small tradeoff in the exactness of the answer: when an exact count is not needed, we try to solve the problem in much less space.

Linear Counting

Linear Counting embraces hash collisions and doesn’t store the exact original items.

  • It allocates a hash table / bit array \(B\) of size \(m\) bits, where \(m\) is on the same scale as \(n\), the number of unique items. All \(m\) bits are initialized to 0.

  • A hash function \(h: [n] \rightarrow [m]\) maps each item to a bit position.

  • Now, when we encounter an element \(x\) in the stream, we set \(B[h(x)] = 1\), i.e., we set the bit at position \(h(x)\) to 1.

Then, at query time, we look at \(Z_m\), the number of entries that are still 0, and

return \( \hat{n} = -m \ln\left(\frac{Z_m}{m}\right) \)
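As a concrete illustration, here is a minimal Python sketch of the procedure above. The class name, the md5-based hash, and the plain Python list standing in for the bit array are implementation assumptions, not part of the algorithm itself:

```python
import hashlib
import math


class LinearCounter:
    """Minimal Linear Counting sketch: a bit array plus n_hat = -m * ln(Z_m / m)."""

    def __init__(self, m):
        self.m = m              # number of bits; should be on the same scale as n
        self.bits = [0] * m     # bit array B, initialized to all zeros

    def _h(self, x):
        # Hash an item to a position in [0, m). md5 is used here purely for
        # illustration; any sufficiently uniform hash function works.
        return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % self.m

    def add(self, x):
        # On seeing element x in the stream, set B[h(x)] = 1.
        self.bits[self._h(x)] = 1

    def estimate(self):
        # Z_m = number of zero entries; return n_hat = -m * ln(Z_m / m).
        # If Z_m = 0 (every bit is set), the estimate is undefined, which in
        # practice means m was chosen too small relative to n.
        z_m = self.bits.count(0)
        return -self.m * math.log(z_m / self.m)


# Example usage on a toy stream with 3 distinct elements:
lc = LinearCounter(m=64)
for item in ["a", "b", "a", "c", "b"]:
    lc.add(item)
print(round(lc.estimate()))  # expected to be close to 3
```

With \(m\) on the same scale as \(n\), the sketch uses roughly one bit per distinct element instead of storing the elements themselves.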

Linear Counting Analysis

  • The probability that any particular bit position remains 0 is the probability that all \(n\) insertions hashed to the remaining \(m - 1\) positions:

    \( \Pr[\text{position remains } 0] = \left(1 - \frac{1}{m}\right)^n \approx e^{-\frac{n}{m}} \)

  • Expected number of positions that are still 0: \( E[Z_m] = m \left(1 - \frac{1}{m}\right)^n \approx m e^{-\frac{n}{m}} \).
    This follows by writing \(Z_m\) as a sum of indicator random variables, one per position: each indicator is 1 with probability \(\approx e^{-\frac{n}{m}}\), so by linearity of expectation \( E[Z_m] \approx m e^{-\frac{n}{m}} \).

  • Suppose the observed count \(Z_m\) is close to its expectation. Then:

    • \( Z_m \approx m e^{-\frac{n}{m}} \).

    • Thus, \( \ln\left(\frac{Z_m}{m}\right) \approx -\frac{n}{m} \).

    • Hence the estimator \( \hat{n} = -m \ln\left(\frac{Z_m}{m}\right) \approx n \).

    • This concentration of \(Z_m\) around its expectation can be shown using standard tail inequalities. For the estimate to be useful in practice, we need \(m\) to be a constant factor of \(n\), i.e., \(O(n)\) bits. A small hypothetical worked example of the estimator is given below.
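As a quick worked example with hypothetical numbers: suppose we use \(m = 1000\) bits and observe \(Z_m = 400\) zero entries at query time. Then

\( \hat{n} = -1000 \ln\left(\frac{400}{1000}\right) \approx 916, \)

i.e., we would estimate roughly 916 distinct elements while using only 1000 bits of memory.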

Author(s): Eshan Gujarathi, Hitarth Gandhi, Vishal Soni