Groq – Reimagining High Performance Computing

One of the most fun aspects of my job is talking to entrepreneurs who are developing disruptive technology. Jonathan Ross is the CEO and founder of Groq, a maker of next-generation chips that specialize in sequential processing, accelerating systems for real-time artificial intelligence (AI) and high-performance computing (HPC).

Ross previously designed the tensor processing unit (TPU) that powered Google’s machine learning (ML). He founded Groq in 2017, sensing an opportunity as the growth of AI presented computational challenges to traditional computing architecture.

AI training and inference strain legacy computing

AI training creates a neural network or ML algorithm using a training dataset. AI inference refers to the process of using that trained neural network model to make a prediction.
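The training/inference split above can be sketched in a few lines. This is a deliberately tiny, illustrative model (a one-parameter linear fit in plain Python), not anything resembling Groq's hardware or software; the function names are invented for the example.

```python
# Minimal sketch of training vs. inference (illustrative only; not a Groq API).

def train(dataset, lr=0.1, epochs=100):
    """Training: repeatedly adjust the model's weight to fit the dataset."""
    w = 0.0
    for _ in range(epochs):
        for x, y in dataset:
            pred = w * x
            grad = 2 * (pred - y) * x   # gradient of squared error
            w -= lr * grad
    return w

def infer(w, x):
    """Inference: apply the frozen, trained weight to a new input."""
    return w * x

# Learn y = 2x from a few examples, then predict on unseen input.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(dataset)
print(round(infer(w, 10.0), 2))   # close to 20.0
```

Training is the expensive, iterative loop; inference is the cheap, repeated application of the finished model, which is why the two phases stress hardware so differently.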

A critical observation throughout this workflow is that no single processor, whether a CPU, GPU, field-programmable gate array (FPGA), or tensor processing unit (TPU), is the best solution for every workload. One size does not fit all.

That hasn’t stopped folks from trying, with 4,000-core CPUs, reconfigurable FPGAs, or GPUs with more capable or more independent cores. These approaches deliver marginal improvements over existing designs but will not quench ML’s thirst for compute.

Groq decided to do something completely different, innovating against the conventional wisdom of the semiconductor industry. Ross expanded on Groq’s mission by considering what is available on the market today. “We decided our mission was to drive the cost of computing to zero. Everyone hates that. But if you look at computing history, that’s what has happened. When we say, ‘driving the cost of computing to zero,’ we’re still selling our solutions at a competitive price point in the industry, but we’re delivering orders-of-magnitude improvements: 200, 600, 1,000 times the performance per dollar. So the cost is approaching zero.”

As a result, AI has reached a bottleneck

Historically, connecting multiple CPUs using existing architectures solved performance challenges. AI is much harder to accommodate because it is real-time and latency-sensitive, and it demands both high performance and high efficiency.

Over time, CPUs became larger and more complex, with more cores, more threads, on-chip networks, and control circuits. Developers trying to speed up their code must contend with complex programming models and lose visibility and control as the compiler works through layers of processing abstraction. In summary, standard computing architectures carry hardware features and components that offer no consequential performance advantage for ML.

GPU architectures are built around external DRAM bandwidth and fixed single-instruction, multiple-data designs. GPUs excel at massively parallel processing tasks, but memory-access latency is significant, and ML workloads are already pushing the limits of external memory bandwidth.

A less complex chip design is the answer

Groq designed a chip that delivers predictable and repeatable performance with low latency and high throughput through an architecture called the Tensor Streaming Processor (TSP).

The new, simpler processor architecture is specifically designed to meet the requirements of machine learning applications and other computationally intensive tasks.

Explaining his approach to chip design, Ross commented, “It started with software. Every chip that is made today is made by hardware engineers. That would be like having cars designed only by mechanical engineers: it’s all about engine optimization. In the ideal case, you would have a driver helping to design the mechanics. I’m a software engineer, and a lot of our team members are software engineers, so we approached building the GroqChip from the ground up as users of the chip, as opposed to someone who’s optimizing for the chip structure. The result is a very different architecture that’s very easy to use.”

This software-defined hardware approach is what Groq does completely differently. The compiler was the team’s sole focus for the first six months; only after that did work begin on the chip architecture.

The result is that execution planning is done in software, with the Groq™ Compiler orchestrating the hardware and freeing up valuable silicon space for more processing. The control provided by this architecture leads to predictable performance and better, faster model deployment.

“Customers usually have some code and they’ve tried running it on a GPU, FPGA, or CPU, and it doesn’t work. They come to Groq, and we can get them up and running quickly. That’s the power of the Groq™ Compiler,” Ross shared.

Groq also offers the GroqWare™ Suite, a host of tools to simplify the development experience. A developer typically develops, compiles, and deploys a program to hardware, runs it multiple times, profiles it, and then returns to optimize and recompile, often multiple times. With Groq, there is no need for this iterative tuning. Instead, developers can use the GroqView™ Profiler, which visualizes compute and memory usage at compile time, tightening the development loop. There is also GroqFlow™, a simple interface for getting models running on GroqNode™ servers by adding a single line of code.

GroqChip™ performance is deterministic

Because the Groq Compiler orchestrates everything, data arrives at the right place at the right time, so calculations occur immediately, without delay.

The compiler knows the latency of each instruction and tells the hardware exactly what to do and when. There is no abstraction layer between the compiler and the GroqChip. In a traditional architecture, it takes power and time to move data from DRAM to a processor, and the processor’s timing varies with its workload from run to run.

The Groq Compiler controls the flow of instructions to the hardware making the process fast and predictable, so developers can run the same model multiple times on the GroqChip and receive exactly the same result every time.
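The idea of compiler-controlled, deterministic execution can be illustrated with a toy model. This is a conceptual sketch, not Groq's actual ISA or compiler: the instruction names and latencies below are made up. The point is that when every instruction has a fixed, known latency and the compiler emits a fixed schedule, the total runtime is known exactly before the program ever executes, and it never varies.

```python
# Toy illustration of compiler-controlled static scheduling.
# Instruction set and latencies are invented for this sketch.

LATENCY = {"load": 4, "mul": 2, "add": 1, "store": 4}  # cycles (made up)

def compile_schedule(program):
    """'Compiler': assign each instruction a fixed start cycle, back to back."""
    schedule, cycle = [], 0
    for op in program:
        schedule.append((cycle, op))
        cycle += LATENCY[op]
    return schedule, cycle          # total runtime is known at compile time

def execute(schedule):
    """'Hardware': replay the schedule; no dynamic stalls or arbitration."""
    return max(start + LATENCY[op] for start, op in schedule)

program = ["load", "load", "mul", "add", "store"]
schedule, predicted = compile_schedule(program)
# Every run takes exactly the predicted number of cycles.
assert all(execute(schedule) == predicted for _ in range(1000))
print(predicted)   # 15 cycles
```

Contrast this with a cache- and arbitration-heavy design, where the same program's runtime depends on what else the chip is doing; that variability is precisely what the static schedule eliminates.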

Ross commented, “When developers build on the GroqChip, we share exact performance figures, with no variation from run to run.” This kind of deterministic operation is essential for a growing class of real-time applications.

Solving the “batch size 1” challenge

Real-time performance in applications such as natural language processing requires inference on a single input at a time, a batch size of 1, rather than amortizing work across large batches.

Batch size 1 introduces performance and utilization challenges for machine learning platforms, particularly GPU-based inference clusters.

The Groq architecture suffers no latency penalty at batch size 1. The TSP’s simple, single-core design delivers maximum performance at any batch size. Groq claims that the TSP is about 2.5 times faster than GPU-based platforms at large batch sizes and 17.6 times faster at a batch size of 1.
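A back-of-envelope model shows why batch size 1 is so punishing for throughput-oriented designs. The numbers below are illustrative, not measured Groq or GPU figures: assume a fixed per-launch overhead (kernel launch, memory staging) that gets amortized across however many samples are in the batch.

```python
# Illustrative latency/throughput model; overhead and per-sample
# times are invented, not measured figures for any real device.

def gpu_latency_ms(batch, overhead_ms=5.0, per_sample_ms=0.5):
    """Fixed per-launch overhead amortized across the batch."""
    return overhead_ms + batch * per_sample_ms

def throughput(batch, latency_ms):
    """Samples processed per second at a given batch size."""
    return batch / latency_ms * 1000.0

# Large batches amortize the fixed overhead...
print(round(throughput(64, gpu_latency_ms(64))))   # ~1730 samples/s
# ...but at batch size 1, most of the time is overhead.
print(round(throughput(1, gpu_latency_ms(1))))     # ~182 samples/s
```

Under these assumed numbers, shrinking the batch from 64 to 1 costs almost 10x in throughput even though each sample's compute is identical; an architecture with no such fixed overhead keeps its full performance at batch size 1.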


Today, running models in the cloud is billed by time. Since most customers do not know how long their models will take to run, they purchase blocks of time based on a best guess.

A deterministic implementation of supercomputing-as-a-service becomes possible because the Groq Compiler can determine, down to a small slice of time, exactly how long a model will take to run.

This could disrupt the market by letting customers purchase exactly the time they need rather than over-provisioning, reducing total cost of ownership (TCO) and resource waste across the board.

With this grand vision, Ross shared, “I’ve always said: comfort is complacency. Fear is a sign that you can do it.”


As a former “chip guy” myself, I find the GroqChip and its architecture a “thing of beauty”.

Some of us realized a while ago that delivering the benefits of AI, infrastructure innovation, and predictive intelligence would require a much simpler and more scalable processor architecture than legacy solutions offer.

As a general rule, CPUs are great for serial tasks, but the overhead of orchestrating hundreds or even thousands of them for parallel ML processing consumes most of the gains. You might think the GPU shines here, but even the GPU’s extraneous hardware eats into the gains.

Groq realized that a less complex chip design was the answer and cracked the code, delivering answers faster than a CPU, with higher throughput and parallel performance than a GPU, and with the ability to do one PetaOp, or one quadrillion operations per second.

That’s an amazing fifteen zeros after the one!

Moor Insights & Strategy, like all research and technology industry analyst firms, provides or has provided paid services to technology companies. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and speaking sponsorships. The firm has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Ericsson, Extreme Networks, Five9, Flex, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, IBM, Infinidat, Infosys, Inseego, IonQ, IonVR, Infiot, Intel, Interdigital, Jabil Circuit, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Merck KGaA, Mesosphere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, National Instruments, Neat, NetApp, Nightwatch, NOKIA (Alcatel-Lucent), Nortek, Novumind, NVIDIA, Nutanix, Nuvia (now Qualcomm), Onsemi, ONUG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Residio, Samsung Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, Sony Optical Storage, Splunk, Springpath (now Cisco), Spirent, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign, TE Connectivity, Tenstorrent, Tobii Technologies, Teradata, T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Ventana Micro Systems, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom, and Zscaler. Moor Insights & Strategy founder, CEO, and Chief Analyst Patrick Moorhead is an investor in dMY Technology Group Inc. VI, Dreamium Labs, Groq, Luminar Technologies, MemryX, and Movandi.

