Enterprise knowledge bases have come a long way from printed documents to digital, searchable databases. There is no doubt that we are at another tipping point: one in which we switch from reading static interfaces such as FAQs and manuals to conversational interfaces. The next-generation interface will thus use AI language models that can talk to the user (in both speech and text) to perform query-based search and data retrieval as well as question answering and summarization. However, current AI models, typically called large language models (LLMs), are extremely large and computationally complex, requiring elaborate GPU-based supercomputer systems to run. Since such models can only be made available to the public via cloud infrastructure, the "AI world" migrated to the cloud and produced solutions like ChatGPT. There are two problems with this approach: 1) users cannot protect their data privacy, and 2) compute costs are extremely high, putting LLMs out of reach for many organizations. The solution is to develop small-footprint models that allow on-prem deployment.

At ShallowAI, we are dedicated to the democratization of AI solutions and of AI research in general. To this end, we have tackled the problem of building small-footprint models across hundreds of vision and NLP research tasks, and developed what we call "The ShallowAI Framework". This framework encapsulates the "bag of tricks" we accumulated while working on small-footprint models across modalities, and lays the foundation for our flagship product, HIVE™, which boosts enterprise productivity with total data privacy thanks to the small-footprint on-prem LLMs it employs.

In this article, we first present a short history of efficient AI solutions, then break down our framework in the hope of better conveying the value of HIVE™ and motivating further research on next-generation conversational AI.

The Basics: Data, Network Architecture and Training

Neural networks are essentially parametric computational graphs that repeat certain basic operations (e.g., conv, matmul, ...) on the input data to convert it first into latent "learned" representations (commonly referred to as "features" or "embeddings"), and finally into a desired output representation. In the case of LLMs, both the input and the output are natural language, which allows training on digital text (e.g., the whole internet!) and gives users a conversational interface at inference time. Such generative models (the G in GPT) typically use sequence-to-sequence networks like the well-known Transformer (the T in GPT). Thanks to the basic operations it relies on (different flavors of batched matrix multiplication and attention), the Transformer computational graph is extremely well suited to running efficiently on modern massively parallel processors such as GPUs and TPUs. However, these models come at a staggeringly high cost: they require terabytes of training data, use hundreds of billions or even trillions of parameters, and take up to months to train on server farms. The total cost of training a single model can run upwards of $50M.
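To illustrate how little machinery is actually involved, here is a minimal PyTorch sketch of scaled dot-product attention, the core Transformer operation; the tensor shapes are purely illustrative and not those of any production model.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v):
        """Core Transformer operation: two batched matmuls and a softmax.

        q, k, v: tensors of shape (batch, seq_len, d_model).
        """
        d_model = q.size(-1)
        # Batched matmul #1: similarity of every query with every key.
        scores = torch.matmul(q, k.transpose(-2, -1)) / d_model ** 0.5
        weights = F.softmax(scores, dim=-1)       # attention weights
        # Batched matmul #2: weighted sum of the values.
        return torch.matmul(weights, v)

    # Illustrative shapes: a batch of 2 sequences, 16 tokens, 64-dim embeddings.
    q = k = v = torch.randn(2, 16, 64)
    out = scaled_dot_product_attention(q, k, v)   # shape (2, 16, 64)

Everything reduces to batched matrix multiplications and a softmax, which is exactly why the graph maps so well onto GPUs and TPUs.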

Obviously, this development cycle needs to change before LLMs can serve enterprises of all sizes: every business has unique use cases, and very few of them can justify a multi-million-dollar spend. The solution requires innovation on all fronts of the development process; the ideal AI development cycle for business cases needs to work with few training samples, small-footprint network architectures, and fast, efficient training algorithms.

The Framework

While the recipe is clear, training small-footprint networks with few samples to obtain high-accuracy models is an active field of research with no general solution. Through the research we have conducted over the years with partners across many AI application fields, we have discovered unique techniques to realize this recipe in selected areas. Our "secret sauce" consists of quantized network architectures that employ special computational operators and activation functions.

Reducing a neural network's size and its number of operations is essential for bringing down deployment cost. One common approach is to use fewer bits to represent the parameters and activations in the network. Such networks, which use 8-, 4-, 2- or even 1-bit fixed-point integers instead of 16/32/64-bit floating-point numbers, are called quantized neural networks, or QNNs.

Deploying QNNs on target hardware that supports efficient low-precision inference can bring considerable performance gains without significant accuracy loss. For instance, replacing a 32-bit network with a properly trained binary (1-bit) QNN yields 32x memory savings and around a 20x computational speed-up. However, quantizing floating-point networks without significant accuracy loss is notoriously hard, and research efforts in this area have produced specialized training algorithms and architectural heuristics for QNNs.
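The memory side of this claim is simple arithmetic, sketched below for a hypothetical 1M-parameter network (parameter storage only, ignoring packing overhead and per-tensor scale factors; the ~20x compute figure depends on hardware support and is not captured here).

    # Parameter storage of a hypothetical 1M-parameter network at different precisions.
    n_params = 1_000_000
    fp32_kib = n_params * 32 / 8 / 1024
    for bits in (32, 16, 8, 4, 2, 1):
        kib = n_params * bits / 8 / 1024
        print(f"{bits:>2}-bit: {kib:7.1f} KiB  ({fp32_kib / kib:.0f}x smaller than FP32)")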

The simplest QNN uses 8-bit parameters and activations (INT8) and is supported by almost all compute platforms, from common CPUs and GPUs to specialized low-power fixed-datapath accelerators. For some basic applications, accurate INT8 versions of floating-point networks can be obtained via "naive" quantization, i.e., quantizing the trained parameters with per-layer scale factors that map the observed minimum and maximum floating-point values of each layer to the INT8 range. In most practical scenarios, however, this technique causes severe accuracy loss, and explicit parameter recalibration is required to find a better quantized representation of the network. This process is called quantization-aware training (QAT).
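As a generic sketch of this "naive" recipe (not our production quantizer), the snippet below maps a toy trained weight tensor to INT8 from its observed minimum and maximum, then dequantizes it to expose the rounding error; all names and sizes are illustrative.

    import numpy as np

    def naive_minmax_int8(w):
        """Asymmetric per-tensor INT8 quantization from observed min/max values."""
        w_min, w_max = float(w.min()), float(w.max())
        scale = (w_max - w_min) / 255.0                      # 256 INT8 levels
        zero_point = -128 - np.round(w_min / scale)          # w_min maps to -128
        q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
        w_hat = (q.astype(np.float32) - zero_point) * scale  # dequantized approximation
        return q, w_hat

    # Toy "trained" layer weights; a real calibration also covers activations.
    w = np.random.randn(256, 256).astype(np.float32) * 0.05
    q, w_hat = naive_minmax_int8(w)
    print("max abs quantization error:", np.abs(w - w_hat).max())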

QAT is a must-have for most INT8 networks, and for any QNN with precision lower than INT8 (naive quantization fails completely at such low bit precisions). At ShallowAI, we have developed QAT algorithms that achieve unprecedented accuracy both for ultra-compact QNNs targeting embedded platforms and for more capable networks targeting cloud CPU/GPU deployment.
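Our QAT algorithms are proprietary, but the textbook version of the idea (simulate quantization in the forward pass and let gradients flow through the rounding via a straight-through estimator, STE) can be sketched in a few lines of PyTorch. Everything below, including the layer sizes and training data, is illustrative only.

    import torch
    import torch.nn as nn

    class FakeQuant(torch.autograd.Function):
        """Simulated k-bit quantization with a straight-through estimator (STE)."""

        @staticmethod
        def forward(ctx, w, bits=8):
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max() / qmax + 1e-12              # symmetric per-tensor scale
            return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output, None                          # STE: treat rounding as identity

    class QATLinear(nn.Linear):
        """Linear layer that trains against its own quantized weights."""

        def forward(self, x):
            return nn.functional.linear(x, FakeQuant.apply(self.weight, 8), self.bias)

    # Toy QAT loop on random data; a real setup uses the actual task and dataset.
    model = nn.Sequential(QATLinear(16, 32), nn.ReLU(), QATLinear(32, 4))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
    for _ in range(100):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

Because the forward pass already sees quantized weights, the optimizer learns parameters that remain accurate after the final conversion to fixed-point.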

We support our QAT methods with special operators and activation functions. Modern networks predominantly use activation functions based on rectified units (ReLU, GELU, Swish, ...). While these functions facilitate "smoother" training and retain universal approximation properties, they are not always the most efficient choice for real AI tasks. The simplest way to picture their inefficiency is to consider approximating trigonometric functions like sin() and cos() with these units. While an exact mapping is out of the question, even a reasonably close approximation requires a dense piecewise-linear map covering the whole input range. Given that the latent representations of real AI problems are far trickier than these simple trigonometric functions, special activation functions are necessary for more accurate and efficient modeling. The ShallowAI Framework provides this capability for building efficient models across all domains.
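Our operators and activation functions are proprietary, but the underlying point can be illustrated with the sin() example above: fit sin(x) over a wide range with a small ReLU MLP, then with a same-size network whose hidden activation is itself periodic (in the spirit of sine-activation networks such as SIREN). The architecture and hyperparameters below are illustrative only.

    import torch
    import torch.nn as nn

    class Sine(nn.Module):
        """A periodic activation; one example of a nonlinearity matched to the task."""
        def forward(self, x):
            return torch.sin(x)

    def fit(act, steps=3000):
        # Identical tiny budget for both models: 1 -> 32 -> 1.
        net = nn.Sequential(nn.Linear(1, 32), act, nn.Linear(32, 1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        x = torch.linspace(-4 * torch.pi, 4 * torch.pi, 2048).unsqueeze(1)
        y = torch.sin(x)
        for _ in range(steps):
            loss = nn.functional.mse_loss(net(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return loss.item()

    print("ReLU MLP loss:", fit(nn.ReLU()))  # piecewise-linear fit over 4 full periods
    print("Sine MLP loss:", fit(Sine()))     # a matched activation typically fits far more closely

With the same parameter budget, the ReLU network must tile the whole input range with linear pieces, while an activation that matches the structure of the target reaches a much lower error.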

Showcases

Instead of showcasing our framework on LLMs, which would still require at least a few GB of RAM, we chose models that fit on extremely small platforms (even embedded systems with less than 100 KB of memory!). The same framework demonstrated here is used to build the small-footprint language models in our HIVE™ product.

We provide the full source code for reproducing the results on our GitHub page.

Our algorithms achieve the following results on a highly constrained embedded platform, the Maxim Integrated MAX7800x (only 400 KB of on-chip memory):

  • 2x accuracy compared to the previous state-of-the-art (SoTA) solution for extremely compact binary neural networks, XNOR-Net, on image classification – demonstrated on the CIFAR-100 task with models under 100 KB:

  • Person detection with a 30x smaller footprint than the best known solution by Google, EfficientDet, at the same accuracy. See the results for NanoTracker below:

The video clip below demonstrates our detector's performance. A common problem in quantized models is low detection confidence scores, which cause jittery bounding boxes in videos (left). We solve this with a compact tracker algorithm (right) that post-processes the detector output to provide consistent, precise predictions over time.
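The tracker itself is part of our framework, but the general idea of temporally smoothing jittery detections can be pictured with a simple exponential moving average over per-frame boxes; the class below is a hypothetical stand-in, not the actual NanoTracker logic.

    import numpy as np

    class EMABoxSmoother:
        """Toy temporal smoother: exponential moving average over per-frame boxes.

        Illustration only; a real tracker also handles detection matching,
        missed frames and confidence gating.
        """

        def __init__(self, alpha=0.3):
            self.alpha = alpha   # higher alpha = trust new detections more
            self.state = None    # smoothed box [x1, y1, x2, y2]

        def update(self, box):
            box = np.asarray(box, dtype=np.float32)
            self.state = box if self.state is None else self.alpha * box + (1 - self.alpha) * self.state
            return self.state

    # Jittery per-frame detections of a (mostly) static person.
    smoother = EMABoxSmoother(alpha=0.3)
    for frame_box in [[100, 50, 180, 220], [104, 47, 185, 224], [97, 53, 176, 218]]:
        print(smoother.update(frame_box))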

  • Our sequence-to-sequence translation model, NanoTranslator, fits on an ultra-low-power convolutional neural network accelerator, the MAX78000 by Maxim Integrated, with 400 KB of on-chip memory. It performs accurate Spanish-to-English translation of general-domain news articles at 34 BLEU while requiring an approximately 400x smaller footprint than the best known solutions.

A demonstration of our compact translation model on an arbitrary news article from the web is shown below: