Building LSH Table¶

Locality Sensitive Hashing¶

-> Originally defined in terms of a similarity function

-> We have a universe of elements, say, \(U\) and we define a similarity function, \(s: U \times U \to [0,1]\)

-> Next, we are concerned with the existence of a probability distribution over a hash family H such that

\(Pr_{h \in H}[h(x) = h(y)] = s(x,y)\)

Here, \(s(x,y) = 1 \to x = y\) and \(s(x,y) = s(y,x)\)

Jaccard Similarity: minhashing¶

Consider two sets A and B; the Jaccard similarity is defined as \(JS(A,B) = \frac{|A \cap B|}{|A \cup B|}\)

As an example, consider a universe of elements U from which we pick a uniformly random permutation (the hash family consists of all permutations of U!)

For any set S, we define \(h(S) = \min_{x \in S}h(x)\)

This defines a mapping from each set to its first element under the permutation!

We will now view a collection of sets as a \(\{0,1\}\) matrix

\[\begin{bmatrix} & S_{1} & S_{2} & S_{3} & S_{4} \\ A & 1 & 0 & 1 & 0 \\ B & 1 & 0 & 0 & 1 \\ C & 0 & 1 & 0 & 1 \\ D & 0 & 1 & 0 & 1 \\ E & 0 & 1 & 0 & 1 \\ F & 1 & 0 & 1 & 0 \\ G & 1 & 0 & 1 & 0 \\ \end{bmatrix}\]

Here, {A..G} are the elements of the universe U and the \(S_{i}\) are the sets

Now, we consider the following random permutation of the elements

\[\begin{bmatrix} & S_{1} & S_{2} & S_{3} & S_{4} \\ A & 1 & 0 & 1 & 0 \\ C & 0 & 1 & 0 & 1 \\ G & 1 & 0 & 1 & 0 \\ F & 1 & 0 & 1 & 0 \\ B & 1 & 0 & 0 & 1 \\ E & 0 & 1 & 0 & 1 \\ D & 0 & 1 & 0 & 1 \\ \end{bmatrix}\]

From our initial definition of \(h(S)\), we have \(h(S_{1}) = A,\ h(S_{2}) = C,\ h(S_{3}) = A,\ h(S_{4}) = C\)
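As a quick illustration, here is a minimal Python sketch (the names `characteristic`, `permutation` and `minhash` are just for this example) that recomputes these minhashes from the permuted matrix above:

```python
# Characteristic matrix from the example: element -> membership in S1..S4
characteristic = {
    "A": [1, 0, 1, 0],
    "B": [1, 0, 0, 1],
    "C": [0, 1, 0, 1],
    "D": [0, 1, 0, 1],
    "E": [0, 1, 0, 1],
    "F": [1, 0, 1, 0],
    "G": [1, 0, 1, 0],
}

# The random permutation of the elements used above
permutation = ["A", "C", "G", "F", "B", "E", "D"]

def minhash(set_index):
    # h(S) = first element of S in the permuted order
    for element in permutation:
        if characteristic[element][set_index] == 1:
            return element
    return None

print([minhash(i) for i in range(4)])  # ['A', 'C', 'A', 'C']
```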

Why is this LSH?¶

For sets S and T,

The first row (in the permuted order) where at least one of them has a 1 belongs to \(S \cup T\)

We have \(h(S) = h(T)\) iff both columns contain a 1 in that row

This means that this row belongs to \(S \cap T\)

So the event \(h(S) = h(T)\) is the same as the event that a row in \(S \cap T\) appears first among all rows in \(S \cup T\)

\(Pr[h(S) = h(T)]=\frac{|S \cap T|}{|S \cup T|}\)

How to choose random permutations¶

-> Picking a uniformly random permutation is expensive!

-> We look at a family of min-wise independent permutations

-> In practice, we use standard hash functions, hash all the values and then sort
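A minimal sketch of the practical approach, using a seeded standard hash function as a stand-in for a true min-wise independent family (the sets A, B and the helper `minhash` are illustrative):

```python
import hashlib

def minhash(s, seed):
    """Minhash of a set s using a seeded standard hash function
    instead of an explicit random permutation."""
    def h(x):
        return int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
    return min(s, key=h)

A = {"apple", "banana", "cherry"}
B = {"banana", "cherry", "dates"}

# Estimate Jaccard similarity as the fraction of seeds on which the minhashes agree
n_hashes = 200
matches = sum(minhash(A, i) == minhash(B, i) for i in range(n_hashes))
print(matches / n_hashes)  # roughly |A intersect B| / |A union B| = 2/4 = 0.5
```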

Which similarities admit LSH?¶

Theorem: if S is LSHable, then \(1 - S\) is a metric

Consider a hash family H, fix a hash function \(h \in H\), and define

\(\Delta_{h}(A,B) = [h(A) \neq h(B)]\)

\(1 - S(A,B) = Pr_{h}[\Delta_{h}(A,B)]\)

Also, \(\Delta_{h}(A,B) + \Delta_{h}(B,C) \geq \Delta_{h}(A,C)\)
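Taking the probability over \(h \in H\) on both sides of this pointwise inequality gives the triangle inequality for \(1 - S\):

\[1 - S(A,C) = Pr_{h}[\Delta_{h}(A,C)] \leq Pr_{h}[\Delta_{h}(A,B)] + Pr_{h}[\Delta_{h}(B,C)] = (1 - S(A,B)) + (1 - S(B,C))\]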

Examples of non-LSHable similarities¶

\(d(A,B) = 1 - s(A,B)\)

\(\text{Sorensen-Dice} \ :\ s(A,B) = \frac{2|A \cap B|}{|A| + |B|}\)

\(\text{Overlap} \ :\ s(A,B) = \frac{|A \cap B|}{\min(|A|,|B|)}\)

For both of these, \(1 - s\) fails to be a metric, so by the theorem above they are not LSHable.

Gap Definition of LSH¶

LSH can also be defined in terms of a distance measure

Gap Definition

A family is (r,R,p,P) LSH if

\(Pr_{h \in H}[h(x) = h(y)] \geq p \ if\ d(x,y) \leq r\)

\(Pr_{h \in H}[h(x) = h(y)] \leq P \ if\ d(x,y) \geq R\)

If points are closer than the smaller radius, the collision probability is high, and if the points are farther than the larger radius, the collision probability is low!

Jaccard similarity follows Gap LSH!
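For example, with minhash we have \(Pr[h(x) = h(y)] = 1 - d_{J}(x,y)\), where \(d_{J}\) is the Jaccard distance, so for any \(r < R\) the family is \((r, R, 1-r, 1-R)\) LSH:

\[d_{J}(x,y) \leq r \implies Pr[h(x)=h(y)] \geq 1 - r, \qquad d_{J}(x,y) \geq R \implies Pr[h(x)=h(y)] \leq 1 - R\]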

L2 Norm¶

\(d(x,y) = \sqrt{\sum_{i}(x_{i} - y_{i})^{2}}\)

\(u\) = random unit-norm vector, \(w \in \mathbb{R}\) a parameter, \(b \sim \mathrm{Unif}[0,w]\)

\(h(x) = \lfloor \frac{u \cdot x + b}{w} \rfloor\)

If \(|x - y|_{2} < \frac{w}{2}\) , the probability that they will fall in separate partitions is low (x and y are close to each other)!

If \(|x - y|_{2} > 4w\) , the probability that they will fall in the same partition is low (x and y are far apart)!
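A minimal numpy sketch of this hash family (the parameter values and the helper `make_l2_hash` are illustrative assumptions):

```python
import numpy as np

def make_l2_hash(dim, w, rng):
    """One hash function h(x) = floor((u.x + b) / w) with u a random
    unit-norm direction and b ~ Unif[0, w]."""
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)           # random unit-norm vector
    b = rng.uniform(0, w)            # random offset
    return lambda x: int(np.floor((u @ x + b) / w))

rng = np.random.default_rng(0)
h = make_l2_hash(dim=2, w=1.0, rng=rng)

x = np.array([0.0, 0.0])
y = np.array([0.1, 0.1])   # close to x: likely the same bucket
z = np.array([5.0, 5.0])   # far from x: likely a different bucket
print(h(x), h(y), h(z))
```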

Solving the near neighbour¶

(r,c) - near neighbour problem

-> Given query point q, return all points p such that \(d(p,q) < r\) and none such that \(d(p,q) > cr\)

-> Solving this gives a subroutine to solve the “nearest neighbour”, by building a data-structure for each r, in powers of \((1 + \epsilon)\)

How to actually use¶

We need to amplify the probability of collisions for close points since the graph of collision probability versus distance goes down smoothly for Jaccard distance.

(Figure: desired amplification of collision probability for close vs. far points)

So, we want high value of collision probability if distance is less than \(r\) and low value if distance is greater than \(cr\) where \(c\) is some constant!

Band Construction¶

AND-ing of LSH¶

-> Define a composite function \(H(x) = (h_{1}(x),...,h_{k}(x))\)

-> \(Pr[H(x) = H(y)] = \Pi_{i}Pr[h_{i}(x) = h_{i}(y)] = Pr[h_{1}(x) = h_{1}(y)]^{k}\)

OR-ing¶

-> Create L independent hash-tables for \(H_{1},H_{2},...H_{L}\)

-> Given query q, search in \(\cup_{j} H_{j}(q)\)
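A minimal sketch of the AND/OR construction; as the base family it uses bit-sampling for Hamming distance (sampling one coordinate of a bit vector), and all names and parameter values are illustrative:

```python
import random
from collections import defaultdict

# Base LSH family for Hamming distance: sample one coordinate of a bit vector
def make_bit_hash(dim, rng):
    i = rng.randrange(dim)
    return lambda x: x[i]

def build_tables(points, dim, k, L, rng):
    """AND: key each point by a k-tuple of base hashes; OR: build L such tables."""
    tables = []
    for _ in range(L):
        hs = [make_bit_hash(dim, rng) for _ in range(k)]
        table = defaultdict(list)
        for name, x in points.items():
            table[tuple(h(x) for h in hs)].append(name)
        tables.append((hs, table))
    return tables

def query(q, tables):
    """Search the union of buckets H_1(q), ..., H_L(q)."""
    candidates = set()
    for hs, table in tables:
        candidates.update(table.get(tuple(h(q) for h in hs), []))
    return candidates

rng = random.Random(0)
points = {"a": (1, 1, 1, 0, 0, 0), "b": (1, 1, 0, 0, 0, 0), "c": (0, 0, 0, 1, 1, 1)}
tables = build_tables(points, dim=6, k=2, L=5, rng=rng)
print(query((1, 1, 1, 0, 0, 0), tables))   # likely {'a', 'b'}, rarely 'c'
```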

Why is it better¶

Consider q, y with \(Pr[h(q) = h(y)] = 1 - d(q,y)\)

Probability of not finding y as one of the candidates in \(\cup_{j} H_{j}(q)\) is \((1 - (1-d)^{k})^{L}\)

Creating an LSH¶

-> If we have an \((r,cr,p,q)\) LSH

-> For any y with \(|q - y| < r\)

Probability of y being a candidate in \(\cup_{j}H_{j}(q)\) is \(\geq 1 - (1-p^{k})^{L} \geq 1 - \frac{1}{e}\) (for \(L \geq p^{-k}\))

-> For any z with \(|q - z| > cr\)

Probability of z being a candidate in any fixed \(H_{j}(q)\) is \(\leq q^{k}\)

Expected number of such z over all L tables \(\leq L n q^{k} = L = n^{\rho}\) (choosing k so that \(q^{k} = 1/n\))

Space used = \(n^{1 + \rho}\)

Query time = \(n^{\rho}\)

We can show that for Hamming, angle etc, \(\rho \approx \frac{1}{c}\)

Therefore, we can get 2-approximate near neighbours in \(O(\sqrt{n})\) query time
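Putting the standard parameter choices together (a sketch of the usual analysis, consistent with the bounds above):

\[k = \log_{1/q} n \ (\text{so } q^{k} = 1/n), \qquad L = n^{\rho}, \qquad \rho = \frac{\log(1/p)}{\log(1/q)}\]

For \(c = 2\), \(\rho \approx \frac{1}{2}\), giving \(L = \sqrt{n}\) tables, hence \(O(\sqrt{n})\) query time and \(O(n^{1.5})\) space.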

LSH : theory vs practice¶

To design an LSH in practice, the theoretical parameter values are only a guide; we need to search over the parameter space to find a good operating point!

Looking back¶

LSH is a tool for near neighbour problems

Trades off space with query time

Practical for medium to large datasets with a fairly large number of dimensions; the exceptions are sparse, extremely high-dimensional datasets

LSH (Locality Sensitive Hashing) Application¶



Finding Similar Items from a large Dataset with Precision using LSH.


Intro

We want to search for similar items in a large dataset. The brute-force approach takes O(N^2) time, but LSH cuts this down to roughly O(N).


Different Scenarios

When searching over documents, we can face two scenarios:

1. We have a reasonable number of documents, but every document is very long. 2. We have a large number of documents of reasonable length.

Let's explore how LSH is used on a large document database.

First we have to vectorize each document. To do that we build k-shingles: we collect the set of all k-shingles across the corpus and assign every shingle a unique index.
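A minimal sketch of k-shingling and index assignment (the sample documents and the helpers `shingles` and `one_hot` are illustrative):

```python
def shingles(text, k=5):
    """Set of all character k-shingles of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = ["the quick brown fox", "the quick brown dog", "lorem ipsum dolor"]
doc_shingles = [shingles(d) for d in docs]

# Build the vocabulary: every distinct shingle gets a unique index
vocab = {sh: i for i, sh in enumerate(sorted(set().union(*doc_shingles)))}

# One-hot (characteristic) vector of each document over the D shingles
def one_hot(sh_set):
    vec = [0] * len(vocab)
    for sh in sh_set:
        vec[vocab[sh]] = 1
    return vec

matrix = [one_hot(s) for s in doc_shingles]   # the N x D matrix
print(len(docs), "x", len(vocab), "matrix")
```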

K-Shingles

Now we have a large N x D matrix, with N documents as rows and D columns of 0/1 values, one per distinct shingle. But we can't efficiently work with that dense matrix.

Before moving further, let's recall the Jaccard similarity:

\[J(A,B) = \frac{|A \cap B|}{|A \cup B|}\]

Now, we could take all pairs of the N documents (each D-dimensional) and keep the pairs whose Jaccard similarity is above some threshold.

To overcome the issue of working with these large D-dimensional vectors, we use Min-Hashing.


For example, suppose we have A = {1,0,0,1} and B = {0,0,0,1}, and suppose the rows are already shuffled. The first index gives A[0]=1 and B[0]=0; since A[0] != B[0], the minhashes differ and we record a 0. Now do a second shuffle, which gives A={0,0,1,1}, B={0,0,1,0}; the first nonzero row is index 2, where A[2]=B[2]=1, so we record a +1. If you shuffle enough times, the fraction of matches approaches the true Jaccard similarity of 0.5. MinHash is an approximation method whose expected value is the Jaccard similarity, and the more permutations we use, the more accurate the estimate.
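Here is a small sketch reproducing the example above; repeatedly shuffling and comparing the minhashes of A and B converges to the true Jaccard similarity of 0.5 (the helper `minhash_after_shuffle` is just for this illustration):

```python
import random

A = [1, 0, 0, 1]
B = [0, 0, 0, 1]

def minhash_after_shuffle(a, b, rng):
    order = list(range(len(a)))
    rng.shuffle(order)                      # one random permutation of the rows
    min_a = next(i for i in order if a[i])  # first 1 of A in permuted order
    min_b = next(i for i in order if b[i])  # first 1 of B in permuted order
    return min_a == min_b                   # do the minhashes collide?

rng = random.Random(0)
trials = 10000
estimate = sum(minhash_after_shuffle(A, B, rng) for _ in range(trials)) / trials
print(estimate)   # approaches |A intersect B| / |A union B| = 1/2
```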

Generating Unique Hashes

Take two random numbers a and b and pass the row index through the formula hash(x) = (a*x+b) mod D, where x is the index we are permuting. We generate new hash functions by randomly choosing a and b.
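A minimal sketch of generating K such hash functions (the values of D and K are illustrative; in practice a prime modulus slightly larger than D is often used, but the formula from the text is kept here):

```python
import random

D = 10_000          # number of distinct shingles (illustrative)
K = 100             # number of hash functions / signature length

rng = random.Random(42)
hash_params = [(rng.randrange(1, D), rng.randrange(0, D)) for _ in range(K)]

def hashed_index(x, a, b):
    """hash(x) = (a*x + b) mod D, a pseudo-permutation of the row indices."""
    return (a * x + b) % D

print(hashed_index(7, *hash_params[0]))
```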

The Signature Matrix


Now we'll compute an (N,K) signature matrix using K random hash functions, so we can compare (N,K) signature vectors instead of the large (N,D) matrix. Each column of the signature matrix corresponds to one random hash function. For each document (row), the entry in a column is the index of the document's first nonzero element under the permutation induced by that column's hash function (this is the MinHash).




To write an efficient algorithm, we don't actually need to permute the rows. We just scan every element of the (N,D) matrix once: whenever an entry is 1, we apply each of the K hash functions to its column (shingle) index and check whether that gives a smaller value than the current MinHash for that document, updating it if so.
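A minimal sketch of this single-pass signature computation (the dense example matrix and the helper `signature_matrix` are illustrative):

```python
import random

def signature_matrix(dense, K, rng):
    """dense: N x D list of 0/1 rows. Returns the N x K MinHash signature matrix."""
    N, D = len(dense), len(dense[0])
    params = [(rng.randrange(1, D), rng.randrange(0, D)) for _ in range(K)]
    INF = float("inf")
    sig = [[INF] * K for _ in range(N)]
    for n in range(N):                       # one pass over the dense matrix
        for d in range(D):
            if dense[n][d] == 1:             # only nonzero shingles matter
                for k, (a, b) in enumerate(params):
                    h = (a * d + b) % D      # hashed (permuted) column index
                    if h < sig[n][k]:
                        sig[n][k] = h        # keep the minimum: the MinHash
    return sig

rng = random.Random(0)
dense = [[1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
print(signature_matrix(dense, K=3, rng=rng))
```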

Now, instead of iterating over the (N,D) matrix, we can iterate over the (N,K) signatures, but comparing all pairs is still O(N^2).

Now, to avoid this quadratic comparison while still finding similar pairs precisely, we'll use LSH (Locality Sensitive Hashing).


LSH: The band structure


The intuition: if we can map each row (signature) of the matrix to buckets, then rows that land in the same bucket are likely to be similar or close. Here "close" means precisely the desired similarity threshold we introduced earlier.

Now, coming to the definition of LSH: it is a family of hash functions tuned so that points that are close to each other land in the same bucket with probability approaching 1, while points that are not close land in different buckets.

Now we choose a number of bands b. Earlier we were trying to match each row against every other row; now we give each row b chances to match: we hash each of its b bands into a bucket and see which other rows it shares a bucket with.

-> r below is the number of rows per band

For better understanding: increasing b gives rows more chances to match, so it lets in candidate pairs with lower similarity scores. Increasing r makes the match criterion stricter, restricting candidates to higher similarity scores. So r and b pull in opposite directions; there is a tradeoff between them.

Now, we can precisely tune the parameters b and r so that the likelihood of a pair becoming a candidate turns into a step function around our desired threshold (see the sketch below).

  • Note: Two rows are considered a candidate pair, meaning they may have a similarity score above the desired threshold, if the two signatures agree in all r rows of at least one band.
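A minimal sketch of the banding step and of the step-function behaviour; for a pair with signature similarity s, the probability of becoming a candidate is \(1 - (1 - s^{r})^{b}\), with the threshold sitting roughly at \((1/b)^{1/r}\). The helpers and parameter values here are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """Hash each of the b bands of every signature into its own bucket table;
    any two signatures sharing a bucket in some band become a candidate pair."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])   # the r rows of this band
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates

def candidate_probability(s, b, r):
    """Probability that a pair with signature similarity s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

# The curve is flat near 0, rises steeply around (1/b)**(1/r) ~ 0.55, then flattens near 1
for s in (0.2, 0.5, 0.8):
    print(s, round(candidate_probability(s, b=20, r=5), 3))
```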



Author(s): Lovepreet Singh, Taha Mohammad Syed, Vaibhav Dilip Khandare