Embeddings in data science

In studying a mathematical model, it is helpful to put the model in the context of a simpler, well-studied mathematical structure. For example, the Earth is a spherical object, and yet we can learn much about its surface by studying its Mercator projection on a two-dimensional plane.

The study of geometric methods of context transfer starts with embeddings, which are exact copies of mathematical models in another mathematical structure. More specifically, an embedding of a mathematical object $X$ of type $\mathcal{A}$ (or, formally, of category $\mathcal{A}$) in a mathematical object $Y$ of type $\mathcal{B}$ is a one-to-one mapping $f \colon X \to Y$ such that $f(X)$ is a copy of $X$ inside $Y$, viz., an object of type $\mathcal{A}$ that shares the same structural properties as $X$.

Differential-geometric tools such as the Whitney embedding theorem (1944) and the Nash embedding theorems (1954–1966) allow us to view abstract geometric objects as concrete surfaces in Euclidean space. Moreover, discrete analogues such as Kuratowski's theorem (1930) and the Hopcroft–Tarjan planarity testing algorithm (1974), as well as results from metric embedding theory like the Johnson–Lindenstrauss lemma (1984), suggest that embedding methods can be applied in quite general contexts.

Indeed, datasets can be represented as vectors or as weighted graphs: the former by encoding each feature as a coordinate value, and the latter by encoding relationships between data points as weights on the edges between nodes. The graph representation is already a geometric object; the vector representation can be thought of as points sampled from a manifold, a principal object of study in geometry, that describes the true shape of the dataset. As discussed above, both representations can therefore be viewed as geometric objects embedded in a Euclidean space, allowing us to visualize the structure of a dataset.
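To make the two representations concrete, here is a minimal sketch assuming a toy numeric dataset: the rows of a matrix serve as the vector representation, and a matrix of pairwise similarities serves as the weighted graph. The data values and the Gaussian-kernel weighting are illustrative choices, not prescribed by any particular method.

```python
import numpy as np

# A toy dataset: each row is a data point, each column a numeric feature.
# (The values are made up purely for illustration.)
X = np.array([
    [1.0, 0.2],
    [0.9, 0.1],
    [0.1, 0.8],
    [0.0, 1.0],
])

# Vector representation: the rows of X themselves, as points in R^2.

# Graph representation: a weighted graph whose nodes are the data points
# and whose edge weights encode pairwise similarity; here a Gaussian
# kernel applied to Euclidean distances.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
weights = np.exp(-dists ** 2)    # similarities in (0, 1]
np.fill_diagonal(weights, 0.0)   # no self-loops

print(weights.round(2))          # adjacency matrix of the weighted graph
```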

Nevertheless, embedding in the strict sense is not terribly useful in data science, because such representations often live in an extremely high-dimensional space. For example, if we wish to encode color information, drawn from the color set {red, orange, yellow, green, cyan, blue, violet}, as numeric vectors, we might apply one-hot encoding to represent

| id | color  |
|----|--------|
| 1  | red    |
| 2  | orange |
| 3  | yellow |
| 4  | green  |

as

| id | red? | orange? | yellow? | green? | cyan? | blue? | violet? |
|----|------|---------|---------|--------|-------|-------|---------|
| 1  | 1    | 0       | 0       | 0      | 0     | 0     | 0       |
| 2  | 0    | 1       | 0       | 0      | 0     | 0     | 0       |
| 3  | 0    | 0       | 1       | 0      | 0     | 0     | 0       |
| 4  | 0    | 0       | 0       | 1      | 0     | 0     | 0       |

which is a collection of 7-dimensional vectors. As we consider more features, the requisite dimension grows quickly. This results in the so-called curse of dimensionality, a heuristic principle referring to the various computational difficulties that worsen rapidly as the dimension increases.
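For concreteness, here is a minimal sketch of the one-hot encoding above for the seven-color vocabulary; the helper `one_hot` is a name introduced only for this illustration.

```python
import numpy as np

# The fixed color vocabulary from the example above.
colors = ["red", "orange", "yellow", "green", "cyan", "blue", "violet"]
index = {c: i for i, c in enumerate(colors)}

def one_hot(color: str) -> np.ndarray:
    """Encode a color as a 7-dimensional 0/1 vector."""
    v = np.zeros(len(colors))
    v[index[color]] = 1.0
    return v

data = ["red", "orange", "yellow", "green"]
encoded = np.stack([one_hot(c) for c in data])
print(encoded)        # 4 x 7 matrix matching the table above
print(encoded.shape)  # (4, 7): one dimension per possible color
```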

It is thus beneficial to consider low-dimensional representations of datasets via approximate embeddings, which do not constitute exact copies but still retain enough information to be useful. For example, we might try to encode certain features of a dataset as geometric properties, such as the distance between two data points.
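One classical realization of this idea is classical multidimensional scaling, which looks for low-dimensional points whose pairwise Euclidean distances approximate a given distance matrix. The numpy sketch below is offered as an illustration of that technique under its standard formulation, not as the specific method discussed in this section; `classical_mds` is a name chosen for the example.

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Find k-dimensional points whose pairwise Euclidean distances
    approximate the given distance matrix D (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]    # k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale         # coordinates in R^k

# Example: recover a 2-dimensional layout from pairwise distances alone.
pts = np.random.rand(10, 5)                  # 10 points with 5 features
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
low = classical_mds(D, k=2)                  # 10 points in the plane
print(low.shape)
```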

The method of data analysis via low-dimensional representations has a long history, significantly predating the formal development of embedding theory in mathematics. Hints of early theoretical results can be found in the literature as early as 1880 (see Section 1.3 of de Leeuw–Heiser, 1982), and computational methods have been around for decades as well (see, for example, Sammon, 1969).

Nevertheless, it is the explosion of neural-network methods, backed by ever more powerful modern computers, that gave embedding methods the significance they have now. A landmark example of the modern embedding method is Word2Vec (Mikolov et al., 2013), a computationally efficient technique for modeling similarities between words, built on the neural probabilistic language model (Bengio et al., 2003).
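As an illustration of what such a method produces, here is a minimal sketch using the gensim library's Word2Vec implementation; the toy corpus and the hyperparameter values are placeholders, not recommendations.

```python
# Requires the gensim library (pip install gensim).
from gensim.models import Word2Vec

# A tiny toy corpus; real applications train on millions of sentences.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["dog", "chases", "the", "cat"],
]

# Train a skip-gram Word2Vec model with 50-dimensional vectors
# (illustrative settings only).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["king"]                 # the embedding vector of "king"
print(model.wv.most_similar("king"))   # nearest words in the embedding space
```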

Word2Vec and its variants, such as Node2Vec (Grover–Leskovec, 2016), make use of vector representations of data, as their names suggest. Vector-space embeddings are versatile, but their dimensionality-reduction efficiency leaves much to be desired. For example, linear embeddings of certain kinds of graph-structured data require an extremely large number of dimensions (Nickel–Jiang–Tresp, 2014, Theorem 1). This is because the inherent shape of a dataset is not necessarily flat. The intrinsic dimension of a curved object in a vector space may well be significantly lower than the dimension of the vector space itself: indeed, a sphere is two-dimensional, but the usual coordinate representation requires a three-dimensional vector space.

The manifold hypothesis posits that each high-dimensional dataset has a corresponding lower-dimensional structure that approximates the data points well. Such a structure is assumed to be a manifold: a curve, a surface, or one of their higher-dimensional generalizations. In light of this, we reason that there may be a curved manifold that represents the data better than a flat vector space does.
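A standard toy illustration of the manifold hypothesis is the "swiss roll": points in three-dimensional space that actually lie on a rolled-up two-dimensional sheet. The sketch below, assuming scikit-learn is available, uses Isomap (one of many manifold-learning algorithms) to recover a flat two-dimensional parameterization.

```python
# Requires scikit-learn (pip install scikit-learn).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Points in R^3 sampled from a rolled-up 2-dimensional sheet.
X, t = make_swiss_roll(n_samples=1000, noise=0.05)

# Isomap estimates geodesic distances along the manifold and embeds the
# points into the plane, effectively unrolling the sheet.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (1000, 2): a flat parameterization of the roll
```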

Now, many complex networks exhibit a hierarchical, tree-like organization (Ravasz–Barabási, 2003). Since a tree is a discrete analogue of a hyperbolic manifold (Gromov, 1987), it follows that a wide variety of complex datasets can be studied profitably by assuming an underlying hyperbolic structure (Krioukov–Papadopoulos–Kitsak–Vahdat–Boguñá, 2010). Standard optimization techniques for data science such as gradient descent generalize to hyperbolic manifolds (Bonnabel, 2013), and so it is natural to refine embedding methods via hyperbolic embeddings.
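To make "hyperbolic" concrete, here is a minimal sketch of the distance function of the Poincaré ball model, the model most commonly used for hyperbolic embeddings; the function name is chosen for this illustration.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Hyperbolic distance between two points of the Poincare ball
    (vectors with Euclidean norm strictly less than 1)."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Points near the boundary of the ball become exponentially far apart,
# which is what lets trees and hierarchies spread out with little distortion.
a = np.array([0.0, 0.0])
b = np.array([0.9, 0.0])
c = np.array([0.0, 0.9])
print(poincare_distance(a, b), poincare_distance(b, c))
```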

Indeed, datasets with hierarchical organization can be embedded into a hyperbolic space of a much lower dimension than the usual vector-space embeddings require, without losing representational accuracy. Amusingly, two papers exploiting this methodology, (Nickel–Kiela, 2017) and (Chamberlain–Clough–Deisenroth, 2017), appeared on arXiv within a week of each other, developing the same theory independently. They nevertheless tackle different flavors of problems, and the reader is encouraged to check out both papers for their experimental results.
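For readers who want to experiment, here is a minimal sketch assuming gensim's PoincareModel implementation of the Nickel–Kiela approach; the toy hierarchy and the hyperparameters are placeholders chosen only for illustration.

```python
# Requires the gensim library (pip install gensim).
from gensim.models.poincare import PoincareModel

# A tiny made-up hierarchy, given as (child, parent) relations.
relations = [
    ("mammal", "animal"), ("bird", "animal"),
    ("dog", "mammal"), ("cat", "mammal"), ("sparrow", "bird"),
]

# Embed the hierarchy into a 2-dimensional Poincare ball.
model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=50)

print(model.kv["dog"])                  # a 2-dimensional hyperbolic embedding
print(model.kv.distance("dog", "cat"))  # Poincare distance between two nodes
```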