Although digital information can be viewed regarding size – a daily email is typically 75-100 kilobytes (Kb) in size – it’s important to note that data also has dimensions. These dimensions are based on the numbers of variables in a piece of data. So, a typical email can be viewed as a high-dimensional vector where there is one coordinate for each word in the dictionary. The value in that coordinate is the number of times that word is used in the email. That daily 75 Kb email with 1000 words? It would result in a vector in the millions.
This is the geometric view on data. Although it’s quite useful in applications like learning spam classifiers, the problem lies in some dimensions: the more dimensions, the longer it takes for an algorithm to run, and the more memory that algorithm uses.
To help speed up the algorithmic processing of data, researchers used a theorem proved in the 1980s by mathematicians William B. Johnson and Joram Lindenstrauss – JL lemma. Using this theorem, computer scientists reduced the dimensionality of data and helped speed up all types of algorithms across many different fields.
But as data has grown larger and larger and more complex, many have asked the JL lemma is the best approach?
Now, computer scientists Jelani Nelson and co-author Kasper Green Larsen have proven that the 30-year-old theorem is still the best way to reduce the dimensionality of data and speed up the algorithms.
“We have proven that there are ‘hard’ data sets for which dimensionality reduction beyond what’s provided by the JL lemma is impossible,” said the authors.